Fast PDF Text Extraction for Embeddings - Switching from Unstructured to PyMuPDF

I needed to build a semantic search system for audio equipment manuals. The goal was to extract text from approximately 200 PDF manuals, generate embeddings using a sentence transformer model, and store them in a vector database for retrieval. The manuals ranged from 20 to 300 pages each, totaling around 15,000 pages of technical documentation.

Initial Approach with Unstructured

I started with the unstructured library, which promised comprehensive document parsing with layout preservation and element detection. The implementation was straightforward:

from unstructured.partition.pdf import partition_pdf

def extract_with_unstructured(pdf_path: str) -> str:
    elements = partition_pdf(pdf_path)
    text = "\n\n".join([str(el) for el in elements])
    return text

The library worked correctly and extracted text with good accuracy. However, processing time became a problem. For a single 150-page manual, extraction took approximately 13 minutes. With 200+ manuals in the pipeline, the total processing time would exceed 45 hours.

The Performance Bottleneck

The unstructured library performs extensive document analysis, including:

  - Layout detection and reading-order analysis
  - Element classification (titles, narrative text, list items, tables)
  - Table structure inference
  - OCR fallback for scanned pages

While these features are valuable for document understanding tasks, they introduced overhead I didn't need. My use case required clean text extraction without layout analysis or element classification.

Switching to PyMuPDF

PyMuPDF is a lightweight PDF parser built on the MuPDF library. It focuses on speed and direct text extraction without the additional processing layers.

Here’s the implementation I switched to:

import pymupdf
from typing import Any, Dict, List
from pathlib import Path

def extract_text_from_pdf(pdf_path: str) -> Dict[str, Any]:
    """
    Extract text from PDF with metadata using PyMuPDF.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        Dictionary containing extracted text and metadata
    """
    doc = pymupdf.open(pdf_path)

    # Extract metadata
    metadata = {
        "title": doc.metadata.get("title", ""),
        "author": doc.metadata.get("author", ""),
        "page_count": len(doc),
        "filename": Path(pdf_path).name
    }

    # Extract text from all pages
    text_blocks = []
    for page_num in range(len(doc)):
        page = doc[page_num]
        text = page.get_text()

        # Clean and filter
        if text.strip():
            text_blocks.append({
                "page": page_num + 1,
                "content": text.strip()
            })

    doc.close()

    # Combine all text
    full_text = "\n\n".join([block["content"] for block in text_blocks])

    return {
        "text": full_text,
        "metadata": metadata,
        "blocks": text_blocks
    }


def process_manual_for_embeddings(pdf_path: str, chunk_size: int = 512) -> List[Dict]:
    """
    Process PDF manual and prepare text chunks for embedding generation.

    Args:
        pdf_path: Path to the PDF file
        chunk_size: Target size for each text chunk in characters

    Returns:
        List of text chunks with metadata
    """
    result = extract_text_from_pdf(pdf_path)

    # Simple chunking strategy - split by paragraphs and combine to target size
    chunks = []
    current_chunk = ""
    current_page = 1

    for block in result["blocks"]:
        paragraphs = block["content"].split("\n\n")

        for para in paragraphs:
            para = para.strip()
            if not para:
                continue

            if len(current_chunk) + len(para) > chunk_size and current_chunk:
                chunks.append({
                    "text": current_chunk.strip(),
                    "page": current_page,
                    "metadata": result["metadata"]
                })
                current_chunk = para
                current_page = block["page"]
            else:
                current_chunk += "\n\n" + para if current_chunk else para
                current_page = block["page"]

    # Add remaining chunk
    if current_chunk:
        chunks.append({
            "text": current_chunk.strip(),
            "page": current_page,
            "metadata": result["metadata"]
        })

    return chunks
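
The post doesn't show the embedding step at this point, but the chunks feed directly into a sentence transformer. The following is a sketch only: the model name (all-MiniLM-L6-v2), the batch size, and the file path are placeholders rather than details from the original pipeline.

from sentence_transformers import SentenceTransformer

# Sketch only: model name and batch size are assumptions, not from the original pipeline
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = process_manual_for_embeddings("manuals/example_manual.pdf", chunk_size=512)
texts = [chunk["text"] for chunk in chunks]

# Encode chunk texts in batches; returns an array of shape (num_chunks, embedding_dim)
embeddings = model.encode(texts, batch_size=64, show_progress_bar=True)

for chunk, vector in zip(chunks, embeddings):
    chunk["embedding"] = vector.tolist()  # ready to insert into the vector database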

Performance Comparison

I measured extraction performance on a representative sample of 10 manuals with varying page counts:

Manual      Pages   Unstructured   PyMuPDF    Speedup
Manual A    50      4.2 min        1.8 sec    140x
Manual B    120     10.1 min       3.2 sec    189x
Manual C    200     16.8 min       4.9 sec    205x
Manual D    85      7.1 min        2.4 sec    177x
Manual E    150     12.6 min       3.8 sec    199x
Manual F    45      3.8 min        1.5 sec    152x
Manual G    180     15.2 min       4.3 sec    212x
Manual H    95      8.0 min        2.7 sec    178x
Manual I    160     13.4 min       4.0 sec    201x
Manual J    110     9.2 min        3.0 sec    184x
Average     119.5   10.04 min      3.16 sec   ~190x
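
The post doesn't include the measurement code; a minimal timing harness along the following lines, reusing the two extraction functions defined above, would produce comparable per-manual numbers. The sample paths are placeholders.

import time

def time_extraction(fn, pdf_path: str) -> float:
    """Return wall-clock seconds for a single extraction call."""
    start = time.perf_counter()
    fn(pdf_path)
    return time.perf_counter() - start

# Placeholder paths; substitute real manuals from the dataset
for path in ["manuals/manual_a.pdf", "manuals/manual_b.pdf"]:
    slow = time_extraction(extract_with_unstructured, path)
    fast = time_extraction(lambda p: extract_text_from_pdf(p)["text"], path)
    print(f"{path}: unstructured {slow:.1f}s, pymupdf {fast:.1f}s, speedup {slow / fast:.0f}x")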

Extrapolated to the complete dataset of 215 manuals, extraction with unstructured would have taken well over a day of compute time; with PyMuPDF it completed in minutes.

The speedup enabled rapid iteration during development. I could reprocess the entire dataset with chunking adjustments or preprocessing changes in minutes rather than waiting hours.

Text Quality Comparison

Text extraction quality was comparable between the two libraries for standard text content, and PyMuPDF handled the manuals' body text, headings, and lists cleanly.

For my use case, which involved technical manuals with primarily text content and simple layouts, PyMuPDF produced equivalent results to unstructured without the processing overhead.
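
The post doesn't describe how quality was compared. One rough spot-check is to normalize both outputs and measure their word overlap; this is an illustrative sketch, not the comparison the author ran, and the path is a placeholder.

import re

def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so layout differences don't dominate."""
    return re.sub(r"\s+", " ", text).strip().lower()

def word_overlap(pdf_path: str) -> float:
    """Jaccard overlap between the word sets produced by the two extractors."""
    a = set(normalize(extract_with_unstructured(pdf_path)).split())
    b = set(normalize(extract_text_from_pdf(pdf_path)["text"]).split())
    return len(a & b) / len(a | b)

# Values near 1.0 suggest the two extractors recover essentially the same text
print(f"word overlap: {word_overlap('manuals/manual_a.pdf'):.3f}")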

Implementation Notes

A few considerations when using PyMuPDF:

  1. Memory efficiency: PyMuPDF loads documents into memory. For very large PDFs (500+ pages), process in batches or use page.get_text("blocks") to extract incrementally; see the page-by-page sketch after this list.

  2. Text ordering: PyMuPDF returns text in the order it is stored in the page, which matches reading order for most documents; passing sort=True to get_text() orders blocks by position on the page. For complex layouts, verify the text sequence manually.

  3. Missing features: PyMuPDF doesn’t provide OCR or advanced table structure extraction. If you need these features, consider using PyMuPDF for text extraction and targeted tools for specific requirements.

  4. Dependencies: PyMuPDF has minimal dependencies compared to unstructured, which requires multiple heavyweight libraries.
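
For note 1, a page-by-page generator keeps memory bounded on very large manuals. A minimal sketch, with a placeholder file path:

import pymupdf

def iter_page_text(pdf_path: str):
    """Yield (page_number, text) one page at a time to keep memory bounded."""
    with pymupdf.open(pdf_path) as doc:
        for page_num, page in enumerate(doc, start=1):
            yield page_num, page.get_text()

# Chunk or index each page as it is read instead of building one large string
for page_num, text in iter_page_text("manuals/large_manual.pdf"):
    if text.strip():
        print(f"page {page_num}: {len(text)} characters")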

Results

After switching to PyMuPDF, the complete pipeline, from text extraction through chunking and metadata collection, processed all 215 manuals in under 15 minutes.

The processed corpus contained approximately 42,000 text chunks, which enabled semantic search across the entire manual collection with sub-second query response times.
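
The post doesn't name the vector database, so purely as an illustration, here is a query sketch over an in-memory matrix of the chunk embeddings produced earlier; the model is the same assumption as in the embedding sketch above.

import numpy as np

def search(query: str, model, embeddings: np.ndarray, chunks: list, top_k: int = 5):
    """Return the top_k chunks ranked by cosine similarity to the query."""
    q = model.encode([query])[0]
    q = q / np.linalg.norm(q)
    matrix = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = matrix @ q
    best = np.argsort(scores)[::-1][:top_k]
    return [(float(scores[i]), chunks[i]["page"], chunks[i]["text"][:80]) for i in best]

# Example query against the corpus built above
# results = search("how do I reset the crossover frequency?", model, embeddings, chunks)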

For PDF text extraction where speed matters and complex layout analysis isn’t required, PyMuPDF delivers significant performance advantages over more comprehensive document parsing libraries.

