PyCon DE & PyData 2026

Simplifying RAG Document Pipelines with Multimodal Embeddings
Ferrum [2nd Floor]

In RAG-based systems, the main challenge is often not tuning the LLM itself, but making documents available in a form that can be retrieved reliably. In enterprise settings, the dominant input format is still PDF, ranging from text-heavy reports to slide decks, scanned documents, and visually dense presentations.

Traditional document processing pipelines rely on OCR and layout analysis to extract text, followed by chunking and embedding. While this works well for text-heavy documents, much of the original structure is often lost, especially for presentations, multi-column layouts, and visually driven content. Images, charts, and diagrams typically require separate processing, which makes the pipeline more complex and more fragile.
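As a rough illustration of the chunking step in such a pipeline, the sketch below splits already-extracted text into overlapping character windows. The function name `chunk_text` and the default sizes are assumptions made for this example and are not taken from the talk. Real pipelines often split on sentences, headings, or tokens instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: the sizes are illustrative defaults, not values
    recommended by the talk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk is a pure repeat of the overlap.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk would then be passed to a text embedding model. Any layout information that OCR did not capture as text is already gone at this point.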

Recent multimodal embedding models enable a different approach: embedding entire PDF pages directly as images. This preserves layout, visual hierarchy, and embedded graphics in a single representation and simplifies document ingestion, because text, charts, and diagrams no longer need separate extraction steps.
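A minimal sketch of what a page-level index could look like is shown below, assuming one vector per rendered page. The names `PageIndex`, `PageEntry`, `embed_image`, and `embed_query` are hypothetical: the two embedding callables stand in for whichever multimodal model is used, and rendering PDF pages to image bytes is assumed to happen elsewhere. The in-memory list and brute-force cosine search are for illustration only.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence

Vector = Sequence[float]


@dataclass
class PageEntry:
    doc_id: str
    page_number: int
    vector: list[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class PageIndex:
    """In-memory index holding one embedding per PDF page image."""

    def __init__(
        self,
        embed_image: Callable[[bytes], Vector],  # assumed: model call for page images
        embed_query: Callable[[str], Vector],    # assumed: model call for text queries
    ) -> None:
        self._embed_image = embed_image
        self._embed_query = embed_query
        self._entries: list[PageEntry] = []

    def add_page(self, doc_id: str, page_number: int, image_bytes: bytes) -> None:
        # The whole page is embedded as an image; no OCR or chunking here.
        vector = list(self._embed_image(image_bytes))
        self._entries.append(PageEntry(doc_id, page_number, vector))

    def search(self, query: str, top_k: int = 3) -> list[tuple[float, PageEntry]]:
        # Brute-force scan over all pages, highest similarity first.
        query_vector = self._embed_query(query)
        scored = [(cosine(query_vector, e.vector), e) for e in self._entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]
```

The unit of retrieval here is the page rather than a text chunk, so a hit returns a document ID and page number that can be shown to the user or passed to a vision-capable LLM.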

This talk compares classical OCR-based document processing pipelines with multimodal page embeddings, drawing on benchmarks conducted on real-world enterprise documents across different models. It highlights where this approach performs well, where its limitations lie, and how to design practical, cost-aware retrieval systems in Python.


This talk provides an overview of how document processing for RAG systems can be simplified using multimodal embeddings, grounded in benchmarks on real-world enterprise documents.

What the talk covers

  1. Motivation: Why RAG Is Still Hard
    Why PDFs remain challenging in enterprise RAG systems, and where current document processing approaches break down, especially for presentations and visually structured documents.

  2. The Classical Approach: PDF → Text → Chunks
    An overview of traditional OCR- and layout-based pipelines, including their strengths, typical failure modes, and why they tend to grow into complex and fragile systems over time.

  3. A New Paradigm: Multimodal Page Embeddings
    How embedding entire PDF pages as images changes the ingestion model, what information is preserved compared to text-only approaches, and what this means for retrieval quality and system simplicity.

  4. Benchmark Setup
    How the benchmark comparing classical pipelines and multimodal page embeddings was designed, using anonymized, real-world enterprise documents across multiple document types. Different models and vendors are referenced only as examples, not as the focus.

  5. Results and Key Findings
    Where multimodal page embeddings outperform text-based pipelines, where they do not, and how hybrid approaches can emerge as a practical solution.

  6. Production Best Practices
    Practical guidance for deploying these approaches in real systems, including index design, quality monitoring, cost control, and how to integrate multimodal retrieval cleanly into Python-based RAG architectures.
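The outline above mentions hybrid approaches without naming a specific method. One common way to combine a text-based ranking with a page-image ranking is reciprocal rank fusion (RRF), sketched below as an assumption for illustration and not as the method used in the talk. The function name and the page identifiers in the usage example are hypothetical. The constant `k = 60` is the value conventionally used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of IDs into a single ranking.

    Each item scores 1 / (k + rank) per list it appears in (rank starts at 1);
    the scores are summed across lists. Ties are broken by ID so the result
    is deterministic.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=lambda item: (-scores[item], item))


# Usage with made-up page IDs from two retrievers.
text_hits = ["p3", "p1", "p2"]   # ranking from the OCR/text pipeline
image_hits = ["p1", "p4", "p3"]  # ranking from page-image embeddings
fused = reciprocal_rank_fusion([text_hits, image_hits])
```

RRF uses only rank positions, so it does not require the similarity scores of the two retrievers to be on a comparable scale.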

Attendees will leave with a clear understanding of when multimodal embeddings are a strong replacement for classical PDF pipelines, and how to reason about the trade-offs involved.


Expected audience expertise in the talk's domain: Intermediate
Expected audience expertise in Python: Novice

Worked on multimodal retrieval-augmented generation (RAG) and agentic LLM systems. Designed ingestion and retrieval pipelines across text, video, and structured data to integrate common knowledge platforms such as Microsoft SharePoint. Focused on scalable Azure-based infrastructure, multilingual and multimodal document processing, and continuous evaluation for reliability. Gained experience building browser-driven agents using modern orchestration frameworks and MCP integration.