Fast PDF Text Extraction for Embeddings - Switching from Unstructured to PyMuPDF

I needed to build a semantic search system for audio equipment manuals. The goal was to extract text from approximately 200 PDF manuals, generate embeddings using a sentence transformer model, and store them in a vector database for retrieval. The manuals ranged from 20 to 300 pages each, totaling around 15,000 pages of technical documentation.

Initial Approach with Unstructured

I started with the unstructured library, which promised comprehensive document parsing with layout preservation and element detection. The implementation was straightforward:

from unstructured.partition.pdf import partition_pdf

def extract_with_unstructured(pdf_path: str) -> str:
    elements = partition_pdf(pdf_path)
    text = "\n\n".join([str(el) for el in elements])
    return text

The library worked correctly and extracted text with good accuracy. However, processing time became a problem. For a single 150-page manual, extraction took approximately 13 minutes. With 200+ manuals in the pipeline, the total processing time would exceed 45 hours.

The Performance Bottleneck

The unstructured library performs extensive document analysis, including:

  - Layout detection and reading-order analysis
  - Element classification (titles, narrative text, list items, tables)
  - Table structure inference
  - OCR fallback for scanned pages

While these features are valuable for document understanding tasks, they introduced overhead I didn't need. My use case required clean text extraction without layout analysis or element classification.

Switching to PyMuPDF

PyMuPDF is a lightweight PDF parser built on the MuPDF library. It focuses on speed and direct text extraction without the additional processing layers.

Here’s the implementation I switched to:

import pymupdf
from typing import Any, Dict, List
from pathlib import Path

def extract_text_from_pdf(pdf_path: str) -> Dict[str, Any]:
    """
    Extract text from PDF with metadata using PyMuPDF.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        Dictionary containing extracted text and metadata
    """
    doc = pymupdf.open(pdf_path)

    # Extract metadata
    metadata = {
        "title": doc.metadata.get("title", ""),
        "author": doc.metadata.get("author", ""),
        "page_count": len(doc),
        "filename": Path(pdf_path).name
    }

    # Extract text from all pages
    text_blocks = []
    for page_num in range(len(doc)):
        page = doc[page_num]
        text = page.get_text()

        # Clean and filter
        if text.strip():
            text_blocks.append({
                "page": page_num + 1,
                "content": text.strip()
            })

    doc.close()

    # Combine all text
    full_text = "\n\n".join([block["content"] for block in text_blocks])

    return {
        "text": full_text,
        "metadata": metadata,
        "blocks": text_blocks
    }


def process_manual_for_embeddings(pdf_path: str, chunk_size: int = 512) -> List[Dict]:
    """
    Process PDF manual and prepare text chunks for embedding generation.

    Args:
        pdf_path: Path to the PDF file
        chunk_size: Target size for each text chunk in characters

    Returns:
        List of text chunks with metadata
    """
    result = extract_text_from_pdf(pdf_path)

    # Simple chunking strategy - split by paragraphs and combine to target size
    chunks = []
    current_chunk = ""
    current_page = 1

    for block in result["blocks"]:
        paragraphs = block["content"].split("\n\n")

        for para in paragraphs:
            para = para.strip()
            if not para:
                continue

            if len(current_chunk) + len(para) > chunk_size and current_chunk:
                chunks.append({
                    "text": current_chunk.strip(),
                    "page": current_page,
                    "metadata": result["metadata"]
                })
                current_chunk = para
                current_page = block["page"]
            else:
                current_chunk += "\n\n" + para if current_chunk else para
                current_page = block["page"]

    # Add remaining chunk
    if current_chunk:
        chunks.append({
            "text": current_chunk.strip(),
            "page": current_page,
            "metadata": result["metadata"]
        })

    return chunks
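
The post doesn't show the embedding step at this point, but the chunks feed directly into a sentence transformer. The following is a sketch only: the model name (all-MiniLM-L6-v2), the batch size, and the file path are placeholders rather than details from the original pipeline.

from sentence_transformers import SentenceTransformer

# Sketch only: model name and batch size are assumptions, not from the original pipeline
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = process_manual_for_embeddings("manuals/example_manual.pdf", chunk_size=512)
texts = [chunk["text"] for chunk in chunks]

# Encode chunk texts in batches; returns an array of shape (num_chunks, embedding_dim)
embeddings = model.encode(texts, batch_size=64, show_progress_bar=True)

for chunk, vector in zip(chunks, embeddings):
    chunk["embedding"] = vector.tolist()  # ready to insert into the vector database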

Performance Comparison

I measured extraction performance on a representative sample of 10 manuals with varying page counts:

Manual      Pages   Unstructured   PyMuPDF    Speedup
Manual A    50      4.2 min        1.8 sec    140x
Manual B    120     10.1 min       3.2 sec    189x
Manual C    200     16.8 min       4.9 sec    205x
Manual D    85      7.1 min        2.4 sec    177x
Manual E    150     12.6 min       3.8 sec    199x
Manual F    45      3.8 min        1.5 sec    152x
Manual G    180     15.2 min       4.3 sec    212x
Manual H    95      8.0 min        2.7 sec    178x
Manual I    160     13.4 min       4.0 sec    201x
Manual J    110     9.2 min        3.0 sec    184x
Average     119.5   10.04 min      3.16 sec   ~190x
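
The post doesn't include the measurement code; a minimal timing harness along the following lines, reusing the two extraction functions defined above, would produce comparable per-manual numbers. The sample paths are placeholders.

import time

def time_extraction(fn, pdf_path: str) -> float:
    """Return wall-clock seconds for a single extraction call."""
    start = time.perf_counter()
    fn(pdf_path)
    return time.perf_counter() - start

# Placeholder paths; substitute real manuals from the dataset
for path in ["manuals/manual_a.pdf", "manuals/manual_b.pdf"]:
    slow = time_extraction(extract_with_unstructured, path)
    fast = time_extraction(lambda p: extract_text_from_pdf(p)["text"], path)
    print(f"{path}: unstructured {slow:.1f}s, pymupdf {fast:.1f}s, speedup {slow / fast:.0f}x")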

Extrapolated to the complete dataset of 215 manuals, extraction with unstructured would have taken well over a day of compute time; with PyMuPDF it completed in minutes.

The speedup enabled rapid iteration during development. I could reprocess the entire dataset with chunking adjustments or preprocessing changes in minutes rather than waiting hours.

Text Quality Comparison

Text extraction quality was comparable between the two libraries for standard text content, and PyMuPDF handled the manuals' body text, headings, and lists cleanly.

For my use case, which involved technical manuals with primarily text content and simple layouts, PyMuPDF produced equivalent results to unstructured without the processing overhead.
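
The post doesn't describe how quality was compared. One rough spot-check is to normalize both outputs and measure their word overlap; this is an illustrative sketch, not the comparison the author ran, and the path is a placeholder.

import re

def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so layout differences don't dominate."""
    return re.sub(r"\s+", " ", text).strip().lower()

def word_overlap(pdf_path: str) -> float:
    """Jaccard overlap between the word sets produced by the two extractors."""
    a = set(normalize(extract_with_unstructured(pdf_path)).split())
    b = set(normalize(extract_text_from_pdf(pdf_path)["text"]).split())
    return len(a & b) / len(a | b)

# Values near 1.0 suggest the two extractors recover essentially the same text
print(f"word overlap: {word_overlap('manuals/manual_a.pdf'):.3f}")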

Implementation Notes

A few considerations when using PyMuPDF:

  1. Memory efficiency: PyMuPDF loads documents into memory. For very large PDFs (500+ pages), process in batches or use page.get_text("blocks") to extract incrementally; see the page-by-page sketch after this list.

  2. Text ordering: PyMuPDF returns text in the order it is stored in the page, which matches reading order for most documents; passing sort=True to get_text() orders blocks by position on the page. For complex layouts, verify the text sequence manually.

  3. Missing features: PyMuPDF doesn’t provide OCR or advanced table structure extraction. If you need these features, consider using PyMuPDF for text extraction and targeted tools for specific requirements.

  4. Dependencies: PyMuPDF has minimal dependencies compared to unstructured, which requires multiple heavyweight libraries.
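
For note 1, a page-by-page generator keeps memory bounded on very large manuals. A minimal sketch, with a placeholder file path:

import pymupdf

def iter_page_text(pdf_path: str):
    """Yield (page_number, text) one page at a time to keep memory bounded."""
    with pymupdf.open(pdf_path) as doc:
        for page_num, page in enumerate(doc, start=1):
            yield page_num, page.get_text()

# Chunk or index each page as it is read instead of building one large string
for page_num, text in iter_page_text("manuals/large_manual.pdf"):
    if text.strip():
        print(f"page {page_num}: {len(text)} characters")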

Results

After switching to PyMuPDF, the complete pipeline, from text extraction through chunking and metadata collection, processed all 215 manuals in under 15 minutes.

The processed corpus contained approximately 42,000 text chunks, which enabled semantic search across the entire manual collection with sub-second query response times.
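
The post doesn't name the vector database, so purely as an illustration, here is a query sketch over an in-memory matrix of the chunk embeddings produced earlier; the model is the same assumption as in the embedding sketch above.

import numpy as np

def search(query: str, model, embeddings: np.ndarray, chunks: list, top_k: int = 5):
    """Return the top_k chunks ranked by cosine similarity to the query."""
    q = model.encode([query])[0]
    q = q / np.linalg.norm(q)
    matrix = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = matrix @ q
    best = np.argsort(scores)[::-1][:top_k]
    return [(float(scores[i]), chunks[i]["page"], chunks[i]["text"][:80]) for i in best]

# Example query against the corpus built above
# results = search("how do I reset the crossover frequency?", model, embeddings, chunks)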

For PDF text extraction where speed matters and complex layout analysis isn’t required, PyMuPDF delivers significant performance advantages over more comprehensive document parsing libraries.

