Arne Grobrügge
Worked on multi-modal retrieval-augmented generation (RAG) and agentic LLM systems. Designed ingestion and retrieval pipelines across text, video, and structured data that integrate common knowledge platforms such as Microsoft SharePoint. Focused on scalable Azure-based infrastructure, multilingual and multimodal document processing, and continuous evaluation for reliability. Experienced in building browser-driven agents with modern orchestration frameworks and MCP integration.
Session
In RAG-based systems, the main challenge is often not tuning the LLM itself, but making documents available in a form that can be retrieved reliably. In enterprise settings, the dominant input format is still PDF, ranging from text-heavy reports to slide decks, scanned documents, and visually dense presentations.
Traditional document processing pipelines rely on OCR and layout analysis to extract text, followed by chunking and embedding. While this works well for text-heavy documents, much of the original structure is often lost, especially for presentations, multi-column layouts, and visually driven content. Images, charts, and diagrams typically require separate processing, which increases pipeline complexity and fragility.
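The classical pipeline described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the OCR step and the embedding model are replaced by stubs (a plain string input and a toy character-frequency vector), where a real system would use an OCR engine and a trained text-embedding model.

```python
# Classical ingestion sketch: per-page text -> overlapping chunks -> embeddings.
# The embed() function is a toy stand-in for a real embedding model.

def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlap to avoid cutting context."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(chunk: str) -> list[float]:
    """Placeholder embedding: a 26-dim character-frequency vector."""
    vec = [0.0] * 26
    for ch in chunk.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def ingest(pages: list[str]) -> list[tuple[str, list[float]]]:
    """Chunk and embed all pages; returns (chunk, vector) pairs for indexing."""
    index = []
    for page_text in pages:
        for chunk in chunk_text(page_text):
            index.append((chunk, embed(chunk)))
    return index
```

Note that every stage here (OCR quality, chunk boundaries, embedding choice) is a separate failure point, which is exactly the fragility the abstract refers to.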
Recent multi-modal embedding models enable a different approach: embedding entire PDF pages directly as images. This preserves layout, visual hierarchy, and embedded graphics in a single representation and significantly simplifies document ingestion.
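The page-as-image approach collapses that pipeline to one embedding per page. The sketch below assumes a hypothetical multi-modal embedder and a pre-rendered page (here just a flat list of pixel intensities, with a toy intensity histogram standing in for a real vision-language model); retrieval is plain cosine similarity over page vectors.

```python
# Page-level multi-modal retrieval sketch: each page image becomes one vector,
# so layout and graphics are captured without separate OCR or image extraction.
import math

def embed_page_image(pixels: list[int]) -> list[float]:
    """Toy stand-in for a multi-modal embedder: L2-normalized 4-bin intensity histogram."""
    hist = [0.0] * 4
    for p in pixels:
        hist[min(p // 64, 3)] += 1.0
    norm = math.sqrt(sum(h * h for h in hist)) or 1.0
    return [h / norm for h in hist]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already L2-normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

def retrieve(query_vec: list[float],
             page_index: list[tuple[int, list[float]]],
             k: int = 3) -> list[int]:
    """Return the k page numbers whose embeddings are closest to the query."""
    ranked = sorted(page_index, key=lambda pv: cosine(query_vec, pv[1]), reverse=True)
    return [page_no for page_no, _ in ranked[:k]]
```

The simplification is structural: ingestion becomes render-and-embed, and a single index serves text-heavy and visually dense pages alike.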
This talk compares classical OCR-based document processing pipelines with multi-modal page embeddings, drawing on benchmarks conducted on real-world enterprise documents across different models. It highlights where this approach performs well, where its limitations lie, and how to design practical, cost-aware retrieval systems in Python.