woai-art committed on
Commit e20e2d9 · 1 Parent(s): 4a52a2a

Deploy PDF MCP Server to HF Spaces

  1. .gitignore +155 -0
  2. README.md +146 -5
  3. app.py +480 -0
  4. requirements.txt +4 -0
.gitignore ADDED
@@ -0,0 +1,155 @@
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ share/python-wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # PyInstaller
+ *.manifest
+ *.spec
+
+ # Installer logs
+ pip-log.txt
+ pip-delete-this-directory.txt
+
+ # Unit test / coverage reports
+ htmlcov/
+ .tox/
+ .nox/
+ .coverage
+ .coverage.*
+ .cache
+ nosetests.xml
+ coverage.xml
+ *.cover
+ *.py,cover
+ .hypothesis/
+ .pytest_cache/
+ cover/
+
+ # Translations
+ *.mo
+ *.pot
+
+ # Django stuff:
+ *.log
+ local_settings.py
+ db.sqlite3
+ db.sqlite3-journal
+
+ # Flask stuff:
+ instance/
+ .webassets-cache
+
+ # Scrapy stuff:
+ .scrapy
+
+ # Sphinx documentation
+ docs/_build/
+
+ # PyBuilder
+ .pybuilder/
+ target/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # IPython
+ profile_default/
+ ipython_config.py
+
+ # pyenv
+ .python-version
+
+ # pipenv
+ Pipfile.lock
+
+ # poetry
+ poetry.lock
+
+ # pdm
+ .pdm.toml
+
+ # PEP 582
+ __pypackages__/
+
+ # Celery stuff
+ celerybeat-schedule
+ celerybeat.pid
+
+ # SageMath parsed files
+ *.sage.py
+
+ # Environments
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+
+ # Spyder project settings
+ .spyderproject
+ .spyproject
+
+ # Rope project settings
+ .ropeproject
+
+ # mkdocs documentation
+ /site
+
+ # mypy
+ .mypy_cache/
+ .dmypy.json
+ dmypy.json
+
+ # Pyre type checker
+ .pyre/
+
+ # pytype static type analyzer
+ .pytype/
+
+ # Cython debug symbols
+ cython_debug/
+
+ # Temporary files
+ *.tmp
+ *.temp
+ temp/
+ tmp/
+
+ # OS generated files
+ .DS_Store
+ .DS_Store?
+ ._*
+ .Spotlight-V100
+ .Trashes
+ ehthumbs.db
+ Thumbs.db
+
+ # PDF test files
+ *.pdf
+ test_pdfs/
+
+ # Gradio cache
+ gradio_cached_examples/
+ flagged/
README.md CHANGED
@@ -1,13 +1,154 @@
  ---
- title: Pdf Mcp Server
- emoji: 📚
- colorFrom: yellow
- colorTo: blue
  sdk: gradio
  sdk_version: 5.35.0
  app_file: app.py
  pinned: false
  license: mit
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: PDF MCP Server
+ emoji: 📄
+ colorFrom: blue
+ colorTo: green
  sdk: gradio
  sdk_version: 5.35.0
  app_file: app.py
  pinned: false
  license: mit
+ short_description: Comprehensive PDF processing tools accessible via MCP protocol
  ---
 
+ # 📄 PDF MCP Server
+
+ 🚀 **Comprehensive PDF processing tools accessible via MCP protocol**
+
+ This Hugging Face Space provides a powerful PDF processing server that can be used as an MCP (Model Context Protocol) server for AI assistants like Cursor IDE.
+
+ ## 🌟 Features
+
+ - ✅ **Extract text** from PDF files (single page or all pages)
+ - ✅ **Get comprehensive PDF metadata** (title, author, pages, etc.)
+ - ✅ **Extract and encode images** from PDFs as base64
+ - ✅ **Render PDF pages** as high-quality images
+ - ✅ **Advanced text search** with case sensitivity options
+ - ✅ **Split PDF files** by page ranges
+ - ✅ **JSON-formatted responses** for easy integration
+ - ✅ **MCP protocol compatibility** for AI assistants
+
+ ## 🎯 Usage in Cursor IDE
+
+ Add this configuration to your Cursor IDE MCP settings:
+
+ ```json
+ {
+   "mcpServers": {
+     "pdf-server": {
+       "command": "npx",
+       "args": [
+         "mcp-remote",
+         "https://YOUR-USERNAME-pdf-mcp-server.hf.space/gradio_api/mcp/sse"
+       ]
+     }
+   }
+ }
+ ```
+
+ Replace `YOUR-USERNAME` with your actual HF username.
+
+ ## 🛠️ Available MCP Tools
+
+ ### `extract_text_from_pdf(pdf_path, page_number=None)`
+ Extract text content from PDF files. If `page_number` is specified, extracts only that page; otherwise extracts all pages.
+
+ **Parameters:**
+ - `pdf_path` (str): Path to the PDF file
+ - `page_number` (int, optional): Specific page number (1-indexed)
+
+ **Returns:** JSON with extracted text and metadata
+
+ ### `get_pdf_metadata(pdf_path)`
+ Get comprehensive metadata information from PDF files.
+
+ **Parameters:**
+ - `pdf_path` (str): Path to the PDF file
+
+ **Returns:** JSON with title, author, creation date, page count, etc.
+
+ ### `extract_images_from_pdf(pdf_path, page_number=None)`
+ Extract images from PDF files and return them as base64-encoded strings.
+
+ **Parameters:**
+ - `pdf_path` (str): Path to the PDF file
+ - `page_number` (int, optional): Specific page number (1-indexed)
+
+ **Returns:** JSON with base64-encoded images and metadata
+
+ ### `render_pdf_page(pdf_path, page_number=1, zoom=2.0)`
+ Render a specific page of a PDF as a high-quality image.
+
+ **Parameters:**
+ - `pdf_path` (str): Path to the PDF file
+ - `page_number` (int): Page number to render (1-indexed)
+ - `zoom` (float): Zoom factor for rendering quality
+
+ **Returns:** JSON with the base64-encoded page image
+
+ ### `search_text_in_pdf(pdf_path, search_term, case_sensitive=False)`
+ Search for text within PDF files with optional case sensitivity.
+
+ **Parameters:**
+ - `pdf_path` (str): Path to the PDF file
+ - `search_term` (str): Text to search for
+ - `case_sensitive` (bool): Whether the search should be case sensitive
+
+ **Returns:** JSON with search results including page numbers and coordinates
+
+ ### `split_pdf_pages(pdf_path, start_page, end_page, output_path)`
+ Extract a page range from a PDF file and save it as a new PDF.
+
+ **Parameters:**
+ - `pdf_path` (str): Path to the source PDF file
+ - `start_page` (int): Starting page number (1-indexed)
+ - `end_page` (int): Ending page number (1-indexed, inclusive)
+ - `output_path` (str): Path for the output PDF file
+
+ **Returns:** JSON with operation result and file information
+
+ ## 📊 Web Interface
+
+ This Space also provides a user-friendly web interface where you can:
+
+ 1. **Upload PDF files** directly in your browser
+ 2. **Test all available operations** with real-time results
+ 3. **View JSON responses** in a formatted way
+ 4. **Experiment with different parameters** before using in your MCP client
+
+ ## 🔗 MCP Protocol
+
+ The server implements the Model Context Protocol (MCP) which allows AI assistants to call these tools directly. The MCP endpoint is available at:
+
+ ```
+ https://YOUR-USERNAME-pdf-mcp-server.hf.space/gradio_api/mcp/sse
+ ```
+
+ ## 🛡️ Technical Details
+
+ - **Framework:** Gradio with MCP support
+ - **PDF Processing:** PyMuPDF (fitz) for high-performance PDF operations
+ - **Image Processing:** PIL/Pillow for image handling
+ - **Protocol:** Server-Sent Events (SSE) for MCP communication
+ - **Format:** JSON responses for all operations
+
+ ## 📋 Example Usage
+
+ ```python
+ # Example: Extract text from first page
+ result = extract_text_from_pdf("/path/to/document.pdf", 1)
+
+ # Example: Search for text
+ result = search_text_in_pdf("/path/to/document.pdf", "important", True)
+
+ # Example: Get metadata
+ result = get_pdf_metadata("/path/to/document.pdf")
+ ```
+
+ ## 🤝 Contributing
+
+ This project is open source. Feel free to contribute improvements or report issues.
+
+ ## 📄 License
+
+ MIT License - feel free to use this in your own projects!
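
Note on the README's client configuration (not part of this commit): the `mcpServers` block can also be generated rather than edited by hand. The sketch below is a hypothetical helper; the function names `mcp_sse_url` and `cursor_config` are illustrative, and it assumes the Space host follows the `<username>-pdf-mcp-server.hf.space` pattern shown in the README. Usernames containing characters that are not valid in a hostname may be mapped differently, so the URL is best checked against the deployed Space.

```python
import json


def mcp_sse_url(username: str, space_name: str = "pdf-mcp-server") -> str:
    """Build the SSE endpoint URL in the form shown in the README."""
    return f"https://{username}-{space_name}.hf.space/gradio_api/mcp/sse"


def cursor_config(username: str, server_key: str = "pdf-server") -> str:
    """Return the README's mcpServers JSON block for one username."""
    config = {
        "mcpServers": {
            server_key: {
                "command": "npx",
                "args": ["mcp-remote", mcp_sse_url(username)],
            }
        }
    }
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Prints the same structure as the README example.
    print(cursor_config("YOUR-USERNAME"))
```

Generating the block from one function keeps the endpoint path (`/gradio_api/mcp/sse`) in a single place if the Space is renamed.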
app.py ADDED
@@ -0,0 +1,480 @@
+ import json
+ import base64
+ import fitz  # PyMuPDF
+ import gradio as gr
+ from PIL import Image
+ import io
+ import os
+ from typing import Optional, List, Dict, Union
+
+ def extract_text_from_pdf(pdf_path: str, page_number: Optional[int] = None) -> str:
+     """
+     Extract text from a PDF file.
+
+     Args:
+         pdf_path (str): Path to the PDF file
+         page_number (int, optional): Specific page number to extract (1-indexed). If None, extracts all pages.
+
+     Returns:
+         str: JSON string containing extracted text and metadata
+     """
+     try:
+         if not os.path.exists(pdf_path):
+             return json.dumps({
+                 "error": f"File not found: {pdf_path}",
+                 "text": "",
+                 "pages": 0
+             })
+
+         doc = fitz.open(pdf_path)
+         result = {
+             "text": "",
+             "pages": doc.page_count,
+             "page_data": {}
+         }
+
+         if page_number is not None:
+             # Extract specific page (convert to 0-indexed)
+             page_idx = page_number - 1
+             if 0 <= page_idx < doc.page_count:
+                 page = doc[page_idx]
+                 text = page.get_text()
+                 result["text"] = text
+                 result["page_data"] = {
+                     str(page_number): {
+                         "text": text,
+                         "char_count": len(text),
+                         "blocks": len(page.get_text("blocks"))
+                     }
+                 }
+             else:
+                 result["error"] = f"Page {page_number} not found. PDF has {doc.page_count} pages."
+         else:
+             # Extract all pages
+             all_text = []
+             for page_num in range(doc.page_count):
+                 page = doc[page_num]
+                 text = page.get_text()
+                 all_text.append(text)
+                 result["page_data"][str(page_num + 1)] = {
+                     "text": text,
+                     "char_count": len(text),
+                     "blocks": len(page.get_text("blocks"))
+                 }
+             result["text"] = "\n\n--- PAGE BREAK ---\n\n".join(all_text)
+
+         doc.close()
+         return json.dumps(result, ensure_ascii=False, indent=2)
+
+     except Exception as e:
+         return json.dumps({
+             "error": f"Error processing PDF: {str(e)}",
+             "text": "",
+             "pages": 0
+         })
+
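
Note on page numbering (not part of this commit): every tool in `app.py` accepts 1-indexed page numbers and converts them to 0-indexed positions before touching the document. The conversion and bounds check could be factored into one place; the helper names `to_page_index` and `page_not_found` below are hypothetical, and the sketch only mirrors the arithmetic and error message already used in the functions.

```python
from typing import Optional


def to_page_index(page_number: int, page_count: int) -> Optional[int]:
    """Convert a 1-indexed page number to a 0-indexed position.

    Returns None when the page number falls outside 1..page_count.
    """
    page_idx = page_number - 1
    if 0 <= page_idx < page_count:
        return page_idx
    return None


def page_not_found(page_number: int, page_count: int) -> str:
    """Reproduce the error message format used in app.py."""
    return f"Page {page_number} not found. PDF has {page_count} pages."


if __name__ == "__main__":
    # A three-page document: page 3 is valid, page 4 is not.
    print(to_page_index(3, 3))
    print(to_page_index(4, 3), page_not_found(4, 3))
```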
+ def get_pdf_metadata(pdf_path: str) -> str:
+     """
+     Get metadata information from a PDF file.
+
+     Args:
+         pdf_path (str): Path to the PDF file
+
+     Returns:
+         str: JSON string containing PDF metadata
+     """
+     try:
+         if not os.path.exists(pdf_path):
+             return json.dumps({
+                 "error": f"File not found: {pdf_path}"
+             })
+
+         doc = fitz.open(pdf_path)
+         metadata = doc.metadata
+
+         result = {
+             "title": metadata.get("title", ""),
+             "author": metadata.get("author", ""),
+             "subject": metadata.get("subject", ""),
+             "creator": metadata.get("creator", ""),
+             "producer": metadata.get("producer", ""),
+             "creation_date": metadata.get("creationDate", ""),
+             "modification_date": metadata.get("modDate", ""),
+             "pages": doc.page_count,
+             "encrypted": doc.is_encrypted,
+             "file_size": os.path.getsize(pdf_path)
+         }
+
+         doc.close()
+         return json.dumps(result, ensure_ascii=False, indent=2)
+
+     except Exception as e:
+         return json.dumps({
+             "error": f"Error getting PDF metadata: {str(e)}"
+         })
+
+ def extract_images_from_pdf(pdf_path: str, page_number: Optional[int] = None) -> str:
+     """
+     Extract images from a PDF file and return them as base64 encoded strings.
+
+     Args:
+         pdf_path (str): Path to the PDF file
+         page_number (int, optional): Specific page number to extract images from (1-indexed). If None, extracts from all pages.
+
+     Returns:
+         str: JSON string containing base64 encoded images and metadata
+     """
+     try:
+         if not os.path.exists(pdf_path):
+             return json.dumps({
+                 "error": f"File not found: {pdf_path}",
+                 "images": []
+             })
+
+         doc = fitz.open(pdf_path)
+         result = {
+             "images": [],
+             "total_images": 0,
+             "pages_processed": []
+         }
+
+         pages_to_process = []
+         if page_number is not None:
+             page_idx = page_number - 1
+             if 0 <= page_idx < doc.page_count:
+                 pages_to_process = [page_idx]
+             else:
+                 result["error"] = f"Page {page_number} not found. PDF has {doc.page_count} pages."
+                 doc.close()
+                 return json.dumps(result)
+         else:
+             pages_to_process = list(range(doc.page_count))
+
+         for page_idx in pages_to_process:
+             page = doc[page_idx]
+             page_num = page_idx + 1
+             result["pages_processed"].append(page_num)
+
+             image_list = page.get_images()
+             for img_index, img in enumerate(image_list):
+                 try:
+                     xref = img[0]
+                     pix = fitz.Pixmap(doc, xref)
+                     if pix.n - pix.alpha >= 4:
+                         # CMYK: convert to RGB first so the image can be
+                         # encoded as PNG instead of being skipped silently
+                         pix = fitz.Pixmap(fitz.csRGB, pix)
+                     img_data = pix.tobytes("png")
+                     img_b64 = base64.b64encode(img_data).decode()
+
+                     result["images"].append({
+                         "page": page_num,
+                         "index": img_index,
+                         "width": pix.width,
+                         "height": pix.height,
+                         "format": "png",
+                         "base64": img_b64
+                     })
+                     pix = None  # release the pixmap
+                 except Exception as img_error:
+                     result["images"].append({
+                         "page": page_num,
+                         "index": img_index,
+                         "error": f"Could not extract image: {str(img_error)}"
+                     })
+
+         result["total_images"] = len(result["images"])
+         doc.close()
+         return json.dumps(result, ensure_ascii=False, indent=2)
+
+     except Exception as e:
+         return json.dumps({
+             "error": f"Error extracting images from PDF: {str(e)}",
+             "images": []
+         })
+
+ def render_pdf_page(pdf_path: str, page_number: int = 1, zoom: float = 2.0) -> str:
+     """
+     Render a specific page of a PDF as an image.
+
+     Args:
+         pdf_path (str): Path to the PDF file
+         page_number (int): Page number to render (1-indexed)
+         zoom (float): Zoom factor for rendering quality
+
+     Returns:
+         str: JSON string containing base64 encoded image
+     """
+     try:
+         if not os.path.exists(pdf_path):
+             return json.dumps({
+                 "error": f"File not found: {pdf_path}"
+             })
+
+         doc = fitz.open(pdf_path)
+
+         page_idx = page_number - 1
+         if page_idx < 0 or page_idx >= doc.page_count:
+             # Read the page count before closing; a closed document cannot report it
+             page_count = doc.page_count
+             doc.close()
+             return json.dumps({
+                 "error": f"Page {page_number} not found. PDF has {page_count} pages."
+             })
+
+         page = doc[page_idx]
+         mat = fitz.Matrix(zoom, zoom)
+         pix = page.get_pixmap(matrix=mat)
+         img_data = pix.tobytes("png")
+         img_b64 = base64.b64encode(img_data).decode()
+
+         result = {
+             "page": page_number,
+             "width": pix.width,
+             "height": pix.height,
+             "zoom": zoom,
+             "format": "png",
+             "base64": img_b64
+         }
+
+         doc.close()
+         return json.dumps(result, ensure_ascii=False, indent=2)
+
+     except Exception as e:
+         return json.dumps({
+             "error": f"Error rendering PDF page: {str(e)}"
+         })
+
+ def search_text_in_pdf(pdf_path: str, search_term: str, case_sensitive: bool = False) -> str:
+     """
+     Search for text in a PDF file.
+
+     Args:
+         pdf_path (str): Path to the PDF file
+         search_term (str): Text to search for
+         case_sensitive (bool): Whether search should be case sensitive
+
+     Returns:
+         str: JSON string containing search results
+     """
+     try:
+         if not os.path.exists(pdf_path):
+             return json.dumps({
+                 "error": f"File not found: {pdf_path}",
+                 "matches": []
+             })
+
+         doc = fitz.open(pdf_path)
+         result = {
+             "search_term": search_term,
+             "case_sensitive": case_sensitive,
+             "matches": [],
+             "total_matches": 0,
+             "pages_searched": doc.page_count
+         }
+
+         # Page.search_for() matches case-insensitively regardless of flags,
+         # so case-sensitive searches are filtered on the matched text below.
+         # Hits whose text spans a line break may be dropped by this filter.
+         for page_num in range(doc.page_count):
+             page = doc[page_num]
+             text_instances = page.search_for(search_term)
+
+             for instance in text_instances:
+                 context = page.get_textbox(instance)
+                 if case_sensitive and search_term not in context:
+                     continue
+                 result["matches"].append({
+                     "page": page_num + 1,
+                     "coordinates": {
+                         "x0": instance.x0,
+                         "y0": instance.y0,
+                         "x1": instance.x1,
+                         "y1": instance.y1
+                     },
+                     "context": context
+                 })
+
+         result["total_matches"] = len(result["matches"])
+         doc.close()
+         return json.dumps(result, ensure_ascii=False, indent=2)
+
+     except Exception as e:
+         return json.dumps({
+             "error": f"Error searching PDF: {str(e)}",
+             "matches": []
+         })
+
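
Note on consuming search results (not part of this commit): `search_text_in_pdf` returns a JSON string whose `matches` list carries a `page`, a `coordinates` box, and a `context` string per hit, or an `error` field on failure. A client might group hits by page as sketched below. The helper name `matches_by_page` and the sample payload are hypothetical; the payload is only shaped like the dictionary built in the function above.

```python
import json
from collections import defaultdict
from typing import Dict, List


def matches_by_page(response_json: str) -> Dict[int, List[str]]:
    """Group the "context" strings of a search response by page number.

    Raises ValueError when the response carries an "error" field.
    """
    payload = json.loads(response_json)
    if "error" in payload:
        raise ValueError(payload["error"])
    grouped: Dict[int, List[str]] = defaultdict(list)
    for match in payload["matches"]:
        grouped[match["page"]].append(match["context"])
    return dict(grouped)


# Hypothetical response, shaped like the result built in search_text_in_pdf.
sample = json.dumps({
    "search_term": "important",
    "case_sensitive": False,
    "matches": [
        {"page": 1,
         "coordinates": {"x0": 72.0, "y0": 90.0, "x1": 130.0, "y1": 104.0},
         "context": "important"},
        {"page": 3,
         "coordinates": {"x0": 72.0, "y0": 200.0, "x1": 130.0, "y1": 214.0},
         "context": "Important"},
    ],
    "total_matches": 2,
    "pages_searched": 3,
})

grouped_sample = matches_by_page(sample)
```

Checking for `error` before reading `matches` matters because the error responses in `app.py` still include an empty `matches` list, which would otherwise look like a search with no hits.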
+ def split_pdf_pages(pdf_path: str, start_page: int, end_page: int, output_path: str) -> str:
+     """
+     Extract specific pages from a PDF and save as a new PDF file.
+
+     Args:
+         pdf_path (str): Path to the source PDF file
+         start_page (int): Starting page number (1-indexed)
+         end_page (int): Ending page number (1-indexed, inclusive)
+         output_path (str): Path for the output PDF file
+
+     Returns:
+         str: JSON string containing operation result
+     """
+     try:
+         if not os.path.exists(pdf_path):
+             return json.dumps({
+                 "error": f"File not found: {pdf_path}",
+                 "success": False
+             })
+
+         doc = fitz.open(pdf_path)
+
+         # Convert to 0-indexed and validate
+         start_idx = start_page - 1
+         end_idx = end_page - 1
+
+         if start_idx < 0 or end_idx >= doc.page_count or start_idx > end_idx:
+             # Read the page count before closing; a closed document cannot report it
+             page_count = doc.page_count
+             doc.close()
+             return json.dumps({
+                 "error": f"Invalid page range. PDF has {page_count} pages.",
+                 "success": False
+             })
+
+         # Create new document with selected pages
+         new_doc = fitz.open()
+         new_doc.insert_pdf(doc, from_page=start_idx, to_page=end_idx)
+
+         # Ensure output directory exists
+         os.makedirs(os.path.dirname(output_path) if os.path.dirname(output_path) else ".", exist_ok=True)
+
+         new_doc.save(output_path)
+         new_doc.close()
+         doc.close()
+
+         result = {
+             "success": True,
+             "output_file": output_path,
+             "pages_extracted": end_page - start_page + 1,
+             "source_file": pdf_path,
+             "page_range": f"{start_page}-{end_page}"
+         }
+
+         return json.dumps(result, ensure_ascii=False, indent=2)
+
+     except Exception as e:
+         return json.dumps({
+             "error": f"Error splitting PDF: {str(e)}",
+             "success": False
+         })
+
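
Note on the output directory step (not part of this commit): `split_pdf_pages` creates the parent directory of `output_path` before saving and falls back to `"."` because a bare filename has an empty `os.path.dirname`. The same logic can be isolated as below; `ensure_parent_dir` is a hypothetical name, and the temporary directory is used only so the sketch leaves nothing behind.

```python
import os
import tempfile


def ensure_parent_dir(output_path: str) -> str:
    """Create the parent directory of output_path if needed and return it.

    A bare filename has an empty dirname, so fall back to ".".
    """
    parent = os.path.dirname(output_path) or "."
    os.makedirs(parent, exist_ok=True)
    return parent


with tempfile.TemporaryDirectory() as tmp:
    # A nested target whose "splits" directory does not exist yet.
    target = os.path.join(tmp, "splits", "pages_1-3.pdf")
    parent = ensure_parent_dir(target)
    created = os.path.isdir(parent)
```

Using `or "."` is equivalent to the conditional expression in the commit, since `os.path.dirname` returns an empty string rather than `None`.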
+ def upload_and_process_pdf(file, operation="extract_text", page_number=None, search_term="", case_sensitive=False, zoom=2.0):
+     """Handle file upload and process according to selected operation"""
+     if file is None:
+         return "Please upload a PDF file first."
+
+     try:
+         # Gradio has already stored the upload in a temporary file; use its path
+         temp_path = file.name
+
+         if operation == "extract_text":
+             result = extract_text_from_pdf(temp_path, page_number)
+         elif operation == "metadata":
+             result = get_pdf_metadata(temp_path)
+         elif operation == "extract_images":
+             result = extract_images_from_pdf(temp_path, page_number)
+         elif operation == "render_page":
+             page_num = page_number if page_number else 1
+             result = render_pdf_page(temp_path, page_num, zoom)
+         elif operation == "search_text":
+             if not search_term:
+                 return "Please enter a search term."
+             result = search_text_in_pdf(temp_path, search_term, case_sensitive)
+         else:
+             result = json.dumps({"error": "Invalid operation"})
+
+         return result
+
+     except Exception as e:
+         return json.dumps({"error": f"Error processing file: {str(e)}"})
+
+ # Create Gradio interface optimized for HF Spaces
+ def create_hf_interface():
+     with gr.Blocks(title="PDF MCP Server - HF Space", theme=gr.themes.Soft()) as demo:
+         gr.Markdown("# 📄 PDF MCP Server")
+         gr.Markdown("🚀 **Comprehensive PDF processing tools accessible via MCP protocol**")
+         gr.Markdown("🌐 **Now running on Hugging Face Spaces!**")
+
+         with gr.Row():
+             gr.Markdown("🔗 **MCP Endpoint**: Use this space's URL + `/gradio_api/mcp/sse` in your MCP client")
+
+         with gr.Tab("📤 Upload & Process"):
+             with gr.Row():
+                 with gr.Column():
+                     file_input = gr.File(label="📄 Upload PDF", file_types=[".pdf"])
+                     operation = gr.Dropdown(
+                         choices=["extract_text", "metadata", "extract_images", "render_page", "search_text"],
+                         value="extract_text",
+                         label="⚡ Operation"
+                     )
+
+                     with gr.Row():
+                         page_number = gr.Number(label="📄 Page Number (optional)", value=None, precision=0)
+                         zoom = gr.Number(label="🔍 Zoom Factor", value=2.0, minimum=0.5, maximum=5.0)
+
+                     with gr.Row():
+                         search_term = gr.Textbox(label="🔍 Search Term", placeholder="Enter text to search")
+                         case_sensitive = gr.Checkbox(label="Aa Case Sensitive", value=False)
+
+                     process_btn = gr.Button("🚀 Process PDF", variant="primary", size="lg")
+
+                 with gr.Column():
+                     output = gr.Textbox(label="📊 Result (JSON)", lines=20, max_lines=30)
+
+             process_btn.click(
+                 upload_and_process_pdf,
+                 inputs=[file_input, operation, page_number, search_term, case_sensitive, zoom],
+                 outputs=output
+             )
+
+         with gr.Tab("🔧 MCP Tools"):
+             gr.Markdown("### 🛠️ Available MCP Tools:")
+             gr.Markdown("""
+ - **extract_text_from_pdf**(pdf_path, page_number=None) - Extract text content
+ - **get_pdf_metadata**(pdf_path) - Get PDF metadata and info
+ - **extract_images_from_pdf**(pdf_path, page_number=None) - Extract images as base64
+ - **render_pdf_page**(pdf_path, page_number=1, zoom=2.0) - Render page as image
+ - **search_text_in_pdf**(pdf_path, search_term, case_sensitive=False) - Search text
+ - **split_pdf_pages**(pdf_path, start_page, end_page, output_path) - Split PDF pages
+             """)
+
+             gr.Markdown("### 🎯 Usage in Cursor IDE:")
+             gr.Code('''
+ {
+   "mcpServers": {
+     "pdf-server": {
+       "command": "npx",
+       "args": [
+         "mcp-remote",
+         "https://YOUR-USERNAME-pdf-mcp-server.hf.space/gradio_api/mcp/sse"
+       ]
+     }
+   }
+ }
+             ''', language="json")
+
+             gr.Markdown("### 📱 Features:")
+             gr.Markdown("""
+ - ✅ Extract text from PDF files (single page or all pages)
+ - ✅ Get comprehensive PDF metadata
+ - ✅ Extract and encode images from PDFs
+ - ✅ Render PDF pages as high-quality images
+ - ✅ Advanced text search with case sensitivity options
+ - ✅ Split PDF files by page ranges
+ - ✅ JSON-formatted responses for easy integration
+ - ✅ MCP protocol compatibility for AI assistants
+             """)
+
+     return demo
+
+ # Create and launch the interface
+ demo = create_hf_interface()
+
+ if __name__ == "__main__":
+     # For HF Spaces - optimized configuration
+     demo.launch(
+         mcp_server=True,
+         server_name="0.0.0.0",
+         server_port=7860,
+         share=False,
+         show_error=True,
+         show_api=True
+     )
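
Note on image payloads (not part of this commit): `render_pdf_page` and `extract_images_from_pdf` return PNG data as a base64 string inside the JSON response. A client would decode it roughly as sketched below. The helper names `decode_image` and `looks_like_png` are hypothetical, and the sample response contains only the 8-byte PNG signature as a stand-in for real image data.

```python
import base64
import json

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_image(response_json: str) -> bytes:
    """Return the raw image bytes from a render_pdf_page-style response."""
    payload = json.loads(response_json)
    if "error" in payload:
        raise ValueError(payload["error"])
    return base64.b64decode(payload["base64"])


def looks_like_png(data: bytes) -> bool:
    """Check for the 8-byte PNG file signature."""
    return data.startswith(PNG_SIGNATURE)


# Hypothetical response: only the PNG signature stands in for real image data.
sample = json.dumps({
    "page": 1,
    "format": "png",
    "base64": base64.b64encode(PNG_SIGNATURE).decode(),
})
image_bytes = decode_image(sample)
```

With a real response, `image_bytes` could be written to a `.png` file or opened with Pillow via `io.BytesIO`.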
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ gradio[mcp]>=5.0.0
+ PyMuPDF>=1.24.0
+ pillow>=10.0.0
+ typing-extensions