dzianisBY/pdf-mcp-server

woai-art committed on Jul 3, 2025

Commit e20e2d9

1 Parent(s): 4a52a2a

Deploy PDF MCP Server to HF Spaces

Files changed (4)

.gitignore +155 -0
README.md +146 -5
app.py +480 -0
requirements.txt +4 -0

.gitignore ADDED

	@@ -0,0 +1,155 @@

+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+# PyInstaller
+*.manifest
+*.spec
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+# Translations
+*.mo
+*.pot
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+# Flask stuff:
+instance/
+.webassets-cache
+# Scrapy stuff:
+.scrapy
+# Sphinx documentation
+docs/_build/
+# PyBuilder
+.pybuilder/
+target/
+# Jupyter Notebook
+.ipynb_checkpoints
+# IPython
+profile_default/
+ipython_config.py
+# pyenv
+.python-version
+# pipenv
+Pipfile.lock
+# poetry
+poetry.lock
+# pdm
+.pdm.toml
+# PEP 582
+__pypackages__/
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+# SageMath parsed files
+*.sage.py
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+# Spyder project settings
+.spyderproject
+.spyproject
+# Rope project settings
+.ropeproject
+# mkdocs documentation
+/site
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+# Pyre type checker
+.pyre/
+# pytype static type analyzer
+.pytype/
+# Cython debug symbols
+cython_debug/
+# Temporary files
+*.tmp
+*.temp
+temp/
+tmp/
+# OS generated files
+.DS_Store
+.DS_Store?
+._*
+.Spotlight-V100
+.Trashes
+ehthumbs.db
+Thumbs.db
+# PDF test files
+*.pdf
+test_pdfs/
+# Gradio cache
+gradio_cached_examples/
+flagged/

README.md CHANGED

@@ -1,13 +1,154 @@
 ---
-title: Pdf Mcp Server
-emoji: 📚
-colorFrom: yellow
-colorTo: blue
 sdk: gradio
 sdk_version: 5.35.0
 app_file: app.py
 pinned: false
 license: mit
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: PDF MCP Server
+emoji: 📄
+colorFrom: blue
+colorTo: green
 sdk: gradio
 sdk_version: 5.35.0
 app_file: app.py
 pinned: false
 license: mit
+short_description: Comprehensive PDF processing tools accessible via MCP protocol
 ---
+# 📄 PDF MCP Server
+🚀 **Comprehensive PDF processing tools accessible via MCP protocol**
+This Hugging Face Space provides a powerful PDF processing server that can be used as an MCP (Model Context Protocol) server for AI assistants like Cursor IDE.
+## 🌟 Features
+- ✅ **Extract text** from PDF files (single page or all pages)
+- ✅ **Get comprehensive PDF metadata** (title, author, pages, etc.)
+- ✅ **Extract and encode images** from PDFs as base64
+- ✅ **Render PDF pages** as high-quality images
+- ✅ **Advanced text search** with case sensitivity options
+- ✅ **Split PDF files** by page ranges
+- ✅ **JSON-formatted responses** for easy integration
+- ✅ **MCP protocol compatibility** for AI assistants
+## 🎯 Usage in Cursor IDE
+Add this configuration to your Cursor IDE MCP settings:
+```json
+{
+  "mcpServers": {
+    "pdf-server": {
+      "command": "npx",
+      "args": [
+        "mcp-remote",
+        "https://YOUR-USERNAME-pdf-mcp-server.hf.space/gradio_api/mcp/sse"
+      ]
+    }
+  }
+}
+```
+Replace `YOUR-USERNAME` with your actual HF username.
+## 🛠️ Available MCP Tools
+### `extract_text_from_pdf(pdf_path, page_number=None)`
+Extract text content from PDF files. If `page_number` is specified, extracts only that page; otherwise extracts all pages.
+**Parameters:**
+- `pdf_path` (str): Path to the PDF file
+- `page_number` (int, optional): Specific page number (1-indexed)
+**Returns:** JSON with extracted text and metadata
+### `get_pdf_metadata(pdf_path)`
+Get comprehensive metadata information from PDF files.
+**Parameters:**
+- `pdf_path` (str): Path to the PDF file
+**Returns:** JSON with title, author, creation date, page count, etc.
+### `extract_images_from_pdf(pdf_path, page_number=None)`
+Extract images from PDF files and return them as base64 encoded strings.
+**Parameters:**
+- `pdf_path` (str): Path to the PDF file
+- `page_number` (int, optional): Specific page number (1-indexed)
+**Returns:** JSON with base64 encoded images and metadata
+### `render_pdf_page(pdf_path, page_number=1, zoom=2.0)`
+Render a specific page of PDF as a high-quality image.
+**Parameters:**
+- `pdf_path` (str): Path to the PDF file
+- `page_number` (int): Page number to render (1-indexed)
+- `zoom` (float): Zoom factor for rendering quality
+**Returns:** JSON with base64 encoded page image
+### `search_text_in_pdf(pdf_path, search_term, case_sensitive=False)`
+Search for text within PDF files with optional case sensitivity.
+**Parameters:**
+- `pdf_path` (str): Path to the PDF file
+- `search_term` (str): Text to search for
+- `case_sensitive` (bool): Whether search should be case sensitive
+**Returns:** JSON with search results including page numbers and coordinates
+### `split_pdf_pages(pdf_path, start_page, end_page, output_path)`
+Extract specific page ranges from PDF files and save as new PDF.
+**Parameters:**
+- `pdf_path` (str): Path to the source PDF file
+- `start_page` (int): Starting page number (1-indexed)
+- `end_page` (int): Ending page number (1-indexed, inclusive)
+- `output_path` (str): Path for the output PDF file
+**Returns:** JSON with operation result and file information
+## 📊 Web Interface
+This Space also provides a user-friendly web interface where you can:
+1. **Upload PDF files** directly in your browser
+2. **Test all available operations** with real-time results
+3. **View JSON responses** in a formatted way
+4. **Experiment with different parameters** before using in your MCP client
+## 🔗 MCP Protocol
+The server implements the Model Context Protocol (MCP) which allows AI assistants to call these tools directly. The MCP endpoint is available at:
+```
+https://YOUR-USERNAME-pdf-mcp-server.hf.space/gradio_api/mcp/sse
+```
+## 🛡️ Technical Details
+- **Framework:** Gradio with MCP support
+- **PDF Processing:** PyMuPDF (fitz) for high-performance PDF operations
+- **Image Processing:** PIL/Pillow for image handling
+- **Protocol:** Server-Sent Events (SSE) for MCP communication
+- **Format:** JSON responses for all operations
+## 📋 Example Usage
+```python
+# Example: Extract text from first page
+result = extract_text_from_pdf("/path/to/document.pdf", 1)
+# Example: Search for text
+result = search_text_in_pdf("/path/to/document.pdf", "important", True)
+# Example: Get metadata
+result = get_pdf_metadata("/path/to/document.pdf")
+```
+## 🤝 Contributing
+This project is open source. Feel free to contribute improvements or report issues.
+## 📄 License
+MIT License - feel free to use this in your own projects!
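
Every tool described in the README returns a JSON string, and the image tools carry their payload in a `base64` field. A minimal client-side sketch of unpacking such a response might look like the following. The helper `decode_page_image` and the sample payload are hypothetical: they only mirror the field names used in `app.py` and are not an actual server reply.

```python
import base64
import json

# First eight bytes of every PNG file
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_page_image(response_json: str) -> bytes:
    """Return the raw image bytes from a render_pdf_page-style JSON response."""
    payload = json.loads(response_json)
    # The tools report failures through an "error" key instead of raising
    if "error" in payload:
        raise ValueError(payload["error"])
    return base64.b64decode(payload["base64"])


# Hypothetical response assembled by hand; the bytes are a stand-in, not a real page
fake_png = PNG_SIGNATURE + b"stand-in image data"
sample_response = json.dumps({
    "page": 1,
    "width": 2,
    "height": 2,
    "zoom": 2.0,
    "format": "png",
    "base64": base64.b64encode(fake_png).decode(),
})

image_bytes = decode_page_image(sample_response)
print(image_bytes.startswith(PNG_SIGNATURE))
```

Checking for the `error` key first matters because the functions in this commit return an error object rather than raising, so a missing `base64` field is an expected outcome, not an exceptional one.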

app.py ADDED

	@@ -0,0 +1,480 @@

+import json
+import base64
+import fitz  # PyMuPDF
+import gradio as gr
+from PIL import Image
+import io
+import os
+from typing import Optional, List, Dict, Union
+def extract_text_from_pdf(pdf_path: str, page_number: Optional[int] = None) -> str:
+    """
+    Extract text from a PDF file.
+    Args:
+        pdf_path (str): Path to the PDF file
+        page_number (int, optional): Specific page number to extract (1-indexed). If None, extracts all pages.
+    Returns:
+        str: JSON string containing extracted text and metadata
+    """
+    try:
+        if not os.path.exists(pdf_path):
+            return json.dumps({
+                "error": f"File not found: {pdf_path}",
+                "text": "",
+                "pages": 0
+            })
+        doc = fitz.open(pdf_path)
+        result = {
+            "text": "",
+            "pages": doc.page_count,
+            "page_data": {}
+        }
+        if page_number is not None:
+            # Extract specific page (convert to 0-indexed)
+            page_idx = page_number - 1
+            if 0 <= page_idx < doc.page_count:
+                page = doc[page_idx]
+                text = page.get_text()
+                result["text"] = text
+                result["page_data"] = {
+                    str(page_number): {
+                        "text": text,
+                        "char_count": len(text),
+                        "blocks": len(page.get_text("blocks"))
+                    }
+                }
+            else:
+                result["error"] = f"Page {page_number} not found. PDF has {doc.page_count} pages."
+        else:
+            # Extract all pages
+            all_text = []
+            for page_num in range(doc.page_count):
+                page = doc[page_num]
+                text = page.get_text()
+                all_text.append(text)
+                result["page_data"][str(page_num + 1)] = {
+                    "text": text,
+                    "char_count": len(text),
+                    "blocks": len(page.get_text("blocks"))
+                }
+            result["text"] = "\n\n--- PAGE BREAK ---\n\n".join(all_text)
+        doc.close()
+        return json.dumps(result, ensure_ascii=False, indent=2)
+    except Exception as e:
+        return json.dumps({
+            "error": f"Error processing PDF: {str(e)}",
+            "text": "",
+            "pages": 0
+        })
+def get_pdf_metadata(pdf_path: str) -> str:
+    """
+    Get metadata information from a PDF file.
+    Args:
+        pdf_path (str): Path to the PDF file
+    Returns:
+        str: JSON string containing PDF metadata
+    """
+    try:
+        if not os.path.exists(pdf_path):
+            return json.dumps({
+                "error": f"File not found: {pdf_path}"
+            })
+        doc = fitz.open(pdf_path)
+        metadata = doc.metadata
+        result = {
+            "title": metadata.get("title", ""),
+            "author": metadata.get("author", ""),
+            "subject": metadata.get("subject", ""),
+            "creator": metadata.get("creator", ""),
+            "producer": metadata.get("producer", ""),
+            "creation_date": metadata.get("creationDate", ""),
+            "modification_date": metadata.get("modDate", ""),
+            "pages": doc.page_count,
+            "encrypted": doc.is_encrypted,
+            "file_size": os.path.getsize(pdf_path)
+        }
+        doc.close()
+        return json.dumps(result, ensure_ascii=False, indent=2)
+    except Exception as e:
+        return json.dumps({
+            "error": f"Error getting PDF metadata: {str(e)}"
+        })
+def extract_images_from_pdf(pdf_path: str, page_number: Optional[int] = None) -> str:
+    """
+    Extract images from a PDF file and return them as base64 encoded strings.
+    Args:
+        pdf_path (str): Path to the PDF file
+        page_number (int, optional): Specific page number to extract images from (1-indexed). If None, extracts from all pages.
+    Returns:
+        str: JSON string containing base64 encoded images and metadata
+    """
+    try:
+        if not os.path.exists(pdf_path):
+            return json.dumps({
+                "error": f"File not found: {pdf_path}",
+                "images": []
+            })
+        doc = fitz.open(pdf_path)
+        result = {
+            "images": [],
+            "total_images": 0,
+            "pages_processed": []
+        }
+        pages_to_process = []
+        if page_number is not None:
+            page_idx = page_number - 1
+            if 0 <= page_idx < doc.page_count:
+                pages_to_process = [page_idx]
+            else:
+                result["error"] = f"Page {page_number} not found. PDF has {doc.page_count} pages."
+                doc.close()
+                return json.dumps(result)
+        else:
+            pages_to_process = list(range(doc.page_count))
+        for page_idx in pages_to_process:
+            page = doc[page_idx]
+            page_num = page_idx + 1
+            result["pages_processed"].append(page_num)
+            image_list = page.get_images()
+            for img_index, img in enumerate(image_list):
+                try:
+                    xref = img[0]
+                    pix = fitz.Pixmap(doc, xref)
+                    if pix.n - pix.alpha >= 4:  # CMYK: convert to RGB before PNG export
+                        pix = fitz.Pixmap(fitz.csRGB, pix)
+                    img_data = pix.tobytes("png")
+                    img_b64 = base64.b64encode(img_data).decode()
+                    result["images"].append({
+                        "page": page_num,
+                        "index": img_index,
+                        "width": pix.width,
+                        "height": pix.height,
+                        "format": "png",
+                        "base64": img_b64
+                    })
+                    pix = None
+                except Exception as img_error:
+                    result["images"].append({
+                        "page": page_num,
+                        "index": img_index,
+                        "error": f"Could not extract image: {str(img_error)}"
+                    })
+        result["total_images"] = len(result["images"])
+        doc.close()
+        return json.dumps(result, ensure_ascii=False, indent=2)
+    except Exception as e:
+        return json.dumps({
+            "error": f"Error extracting images from PDF: {str(e)}",
+            "images": []
+        })
+def render_pdf_page(pdf_path: str, page_number: int = 1, zoom: float = 2.0) -> str:
+    """
+    Render a specific page of PDF as an image.
+    Args:
+        pdf_path (str): Path to the PDF file
+        page_number (int): Page number to render (1-indexed)
+        zoom (float): Zoom factor for rendering quality
+    Returns:
+        str: JSON string containing base64 encoded image
+    """
+    try:
+        if not os.path.exists(pdf_path):
+            return json.dumps({
+                "error": f"File not found: {pdf_path}"
+            })
+        doc = fitz.open(pdf_path)
+        page_idx = page_number - 1
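+        if page_idx < 0 or page_idx >= doc.page_count:
+            # Read the page count before closing; a closed document raises on access
+            page_count = doc.page_count
+            doc.close()
+            return json.dumps({
+                "error": f"Page {page_number} not found. PDF has {page_count} pages."
+            })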
+        page = doc[page_idx]
+        mat = fitz.Matrix(zoom, zoom)
+        pix = page.get_pixmap(matrix=mat)
+        img_data = pix.tobytes("png")
+        img_b64 = base64.b64encode(img_data).decode()
+        result = {
+            "page": page_number,
+            "width": pix.width,
+            "height": pix.height,
+            "zoom": zoom,
+            "format": "png",
+            "base64": img_b64
+        }
+        doc.close()
+        return json.dumps(result, ensure_ascii=False, indent=2)
+    except Exception as e:
+        return json.dumps({
+            "error": f"Error rendering PDF page: {str(e)}"
+        })
+def search_text_in_pdf(pdf_path: str, search_term: str, case_sensitive: bool = False) -> str:
+    """
+    Search for text in a PDF file.
+    Args:
+        pdf_path (str): Path to the PDF file
+        search_term (str): Text to search for
+        case_sensitive (bool): Whether search should be case sensitive
+    Returns:
+        str: JSON string containing search results
+    """
+    try:
+        if not os.path.exists(pdf_path):
+            return json.dumps({
+                "error": f"File not found: {pdf_path}",
+                "matches": []
+            })
+        doc = fitz.open(pdf_path)
+        result = {
+            "search_term": search_term,
+            "case_sensitive": case_sensitive,
+            "matches": [],
+            "total_matches": 0,
+            "pages_searched": doc.page_count
+        }
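+        # page.search_for() always matches case-insensitively (its flags only affect
+        # text extraction), so case-sensitive searches filter the hits afterwards
+        for page_num in range(doc.page_count):
+            page = doc[page_num]
+            text_instances = page.search_for(search_term)
+            for instance in text_instances:
+                if case_sensitive and search_term not in page.get_textbox(instance):
+                    continue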
+                result["matches"].append({
+                    "page": page_num + 1,
+                    "coordinates": {
+                        "x0": instance.x0,
+                        "y0": instance.y0,
+                        "x1": instance.x1,
+                        "y1": instance.y1
+                    },
+                    "context": page.get_textbox(instance)
+                })
+        result["total_matches"] = len(result["matches"])
+        doc.close()
+        return json.dumps(result, ensure_ascii=False, indent=2)
+    except Exception as e:
+        return json.dumps({
+            "error": f"Error searching PDF: {str(e)}",
+            "matches": []
+        })
+def split_pdf_pages(pdf_path: str, start_page: int, end_page: int, output_path: str) -> str:
+    """
+    Extract specific pages from a PDF and save as a new PDF file.
+    Args:
+        pdf_path (str): Path to the source PDF file
+        start_page (int): Starting page number (1-indexed)
+        end_page (int): Ending page number (1-indexed, inclusive)
+        output_path (str): Path for the output PDF file
+    Returns:
+        str: JSON string containing operation result
+    """
+    try:
+        if not os.path.exists(pdf_path):
+            return json.dumps({
+                "error": f"File not found: {pdf_path}",
+                "success": False
+            })
+        doc = fitz.open(pdf_path)
+        # Convert to 0-indexed and validate
+        start_idx = start_page - 1
+        end_idx = end_page - 1
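+        if start_idx < 0 or end_idx >= doc.page_count or start_idx > end_idx:
+            # Read the page count before closing; a closed document raises on access
+            page_count = doc.page_count
+            doc.close()
+            return json.dumps({
+                "error": f"Invalid page range. PDF has {page_count} pages.",
+                "success": False
+            })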
+        # Create new document with selected pages
+        new_doc = fitz.open()
+        new_doc.insert_pdf(doc, from_page=start_idx, to_page=end_idx)
+        # Ensure output directory exists
+        os.makedirs(os.path.dirname(output_path) if os.path.dirname(output_path) else ".", exist_ok=True)
+        new_doc.save(output_path)
+        new_doc.close()
+        doc.close()
+        result = {
+            "success": True,
+            "output_file": output_path,
+            "pages_extracted": end_page - start_page + 1,
+            "source_file": pdf_path,
+            "page_range": f"{start_page}-{end_page}"
+        }
+        return json.dumps(result, ensure_ascii=False, indent=2)
+    except Exception as e:
+        return json.dumps({
+            "error": f"Error splitting PDF: {str(e)}",
+            "success": False
+        })
+def upload_and_process_pdf(file, operation="extract_text", page_number=None, search_term="", case_sensitive=False, zoom=2.0):
+    """Handle file upload and process according to selected operation"""
+    if file is None:
+        return "Please upload a PDF file first."
+    try:
+        # Gradio has already written the upload to a temporary file; use its path
+        temp_path = file.name
+        if operation == "extract_text":
+            result = extract_text_from_pdf(temp_path, page_number)
+        elif operation == "metadata":
+            result = get_pdf_metadata(temp_path)
+        elif operation == "extract_images":
+            result = extract_images_from_pdf(temp_path, page_number)
+        elif operation == "render_page":
+            page_num = page_number if page_number else 1
+            result = render_pdf_page(temp_path, page_num, zoom)
+        elif operation == "search_text":
+            if not search_term:
+                return "Please enter a search term."
+            result = search_text_in_pdf(temp_path, search_term, case_sensitive)
+        else:
+            result = json.dumps({"error": "Invalid operation"})
+        return result
+    except Exception as e:
+        return json.dumps({"error": f"Error processing file: {str(e)}"})
+# Create Gradio interface optimized for HF Spaces
+def create_hf_interface():
+    with gr.Blocks(title="PDF MCP Server - HF Space", theme=gr.themes.Soft()) as demo:
+        gr.Markdown("# 📄 PDF MCP Server")
+        gr.Markdown("🚀 **Comprehensive PDF processing tools accessible via MCP protocol**")
+        gr.Markdown("🌐 **Now running on Hugging Face Spaces!**")
+        with gr.Row():
+            gr.Markdown("🔗 **MCP Endpoint**: Use this space's URL + `/gradio_api/mcp/sse` in your MCP client")
+        with gr.Tab("📤 Upload & Process"):
+            with gr.Row():
+                with gr.Column():
+                    file_input = gr.File(label="📄 Upload PDF", file_types=[".pdf"])
+                    operation = gr.Dropdown(
+                        choices=["extract_text", "metadata", "extract_images", "render_page", "search_text"],
+                        value="extract_text",
+                        label="⚡ Operation"
+                    )
+                    with gr.Row():
+                        page_number = gr.Number(label="📄 Page Number (optional)", value=None, precision=0)
+                        zoom = gr.Number(label="🔍 Zoom Factor", value=2.0, minimum=0.5, maximum=5.0)
+                    with gr.Row():
+                        search_term = gr.Textbox(label="🔍 Search Term", placeholder="Enter text to search")
+                        case_sensitive = gr.Checkbox(label="Aa Case Sensitive", value=False)
+                    process_btn = gr.Button("🚀 Process PDF", variant="primary", size="lg")
+                with gr.Column():
+                    output = gr.Textbox(label="📊 Result (JSON)", lines=20, max_lines=30)
+            process_btn.click(
+                upload_and_process_pdf,
+                inputs=[file_input, operation, page_number, search_term, case_sensitive, zoom],
+                outputs=output
+            )
+        with gr.Tab("🔧 MCP Tools"):
+            gr.Markdown("### 🛠️ Available MCP Tools:")
+            gr.Markdown("""
+            - **extract_text_from_pdf**(pdf_path, page_number=None) - Extract text content
+            - **get_pdf_metadata**(pdf_path) - Get PDF metadata and info
+            - **extract_images_from_pdf**(pdf_path, page_number=None) - Extract images as base64
+            - **render_pdf_page**(pdf_path, page_number=1, zoom=2.0) - Render page as image
+            - **search_text_in_pdf**(pdf_path, search_term, case_sensitive=False) - Search text
+            - **split_pdf_pages**(pdf_path, start_page, end_page, output_path) - Split PDF pages
+            """)
+            gr.Markdown("### 🎯 Usage in Cursor IDE:")
+            gr.Code('''
+{
+  "mcpServers": {
+    "pdf-server": {
+      "command": "npx",
+      "args": [
+        "mcp-remote",
+        "https://YOUR-USERNAME-pdf-mcp-server.hf.space/gradio_api/mcp/sse"
+      ]
+    }
+  }
+}
+            ''', language="json")
+            gr.Markdown("### 📱 Features:")
+            gr.Markdown("""
+            - ✅ Extract text from PDF files (single page or all pages)
+            - ✅ Get comprehensive PDF metadata
+            - ✅ Extract and encode images from PDFs
+            - ✅ Render PDF pages as high-quality images
+            - ✅ Advanced text search with case sensitivity options
+            - ✅ Split PDF files by page ranges
+            - ✅ JSON-formatted responses for easy integration
+            - ✅ MCP protocol compatibility for AI assistants
+            """)
+    return demo
+# Create and launch the interface
+demo = create_hf_interface()
+if __name__ == "__main__":
+    # For HF Spaces - optimized configuration
+    demo.launch(
+        mcp_server=True,
+        server_name="0.0.0.0",
+        server_port=7860,
+        share=False,
+        show_error=True,
+        show_api=True
+    )
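
The tools above all accept 1-indexed page numbers and convert them to 0-indexed positions before touching the document. The range check used by `split_pdf_pages` can be sketched on its own, without PyMuPDF, as below. The helper name `to_zero_based_range` is hypothetical and does not appear in `app.py`, and raising `ValueError` is a choice made for this sketch, since the original returns a JSON error instead.

```python
from typing import Tuple


def to_zero_based_range(start_page: int, end_page: int, page_count: int) -> Tuple[int, int]:
    """Convert a 1-indexed inclusive page range to 0-indexed inclusive indices."""
    start_idx = start_page - 1
    end_idx = end_page - 1
    # Same three conditions as the check in split_pdf_pages
    if start_idx < 0 or end_idx >= page_count or start_idx > end_idx:
        raise ValueError(
            f"Invalid page range {start_page}-{end_page}. PDF has {page_count} pages."
        )
    return start_idx, end_idx


# Pages 2 through 4 of a hypothetical 10-page document
start_idx, end_idx = to_zero_based_range(2, 4, page_count=10)
pages_extracted = end_idx - start_idx + 1
print(start_idx, end_idx, pages_extracted)  # prints "1 3 3"
```

Because both ends are inclusive, the number of extracted pages is `end - start + 1`, which matches the `pages_extracted` field computed in the original function.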

requirements.txt ADDED

	@@ -0,0 +1,4 @@

+gradio[mcp]>=5.0.0
+PyMuPDF>=1.24.0
+pillow>=10.0.0
+typing-extensions
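
Once the Space is deployed with these requirements, the client configuration shown in the README can be generated rather than edited by hand. The sketch below assumes the URL template from the README (`https://YOUR-USERNAME-pdf-mcp-server.hf.space/gradio_api/mcp/sse`). The helper `build_mcp_config` is hypothetical, and whether a real username needs any normalization before it becomes part of the hostname is not covered by this commit.

```python
import json


def build_mcp_config(username: str, space_name: str = "pdf-mcp-server") -> dict:
    """Build the mcp-remote client entry for a Space, following the README template."""
    endpoint = f"https://{username}-{space_name}.hf.space/gradio_api/mcp/sse"
    return {
        "mcpServers": {
            "pdf-server": {
                "command": "npx",
                "args": ["mcp-remote", endpoint],
            }
        }
    }


# The placeholder username reproduces the README snippet unchanged
config = build_mcp_config("YOUR-USERNAME")
print(json.dumps(config, indent=2))
```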