Apache Tika 4.x: Engineered for RAG and Agentic Search
2026-05-06, Main Stage

Apache Tika 4.x is a generational leap — nearly two decades of battle-tested document processing, engineered for the modern AI stack. Purpose-built for RAG and hybrid search, it delivers scalable, fault-tolerant pipelines, VLM hooks for OCR and image understanding, and structure-aware chunking and embedding integrations for RAG-ready output.


Content extraction is the unglamorous foundation of every search and RAG pipeline — and Apache Tika 4.x makes it faster, more reliable, and more intelligent. As RAG has become a dominant architecture for search, a long-standing challenge has become a first-class problem: pipelines need logical, coherent segments — not just raw extracted text soup. This talk covers the pipes-based architecture at the heart of Tika 4.0.0: how it scales horizontally across enterprise document collections, handles failures gracefully, recovers without data loss, and sustains throughput across heterogeneous content — from simple text files to deeply nested, multi-format document bundles. We'll make this concrete with a demo that fans out across an S3 bucket and emits structured, enriched output to Solr, OpenSearch, or Elasticsearch.

Beyond pipeline mechanics, Tika 4.0.0 moves the intelligence layer closer to the source, and the depth of those features is only possible because of nearly two decades of hard-won document processing experience. Tika has long known how to unpack a PDF attached to an email buried in a ZIP file, and the new VLM integrations inherit that same depth: images and scanned documents (including embedded and attached files, not just top-level documents) can be routed inline to VLM APIs for OCR and image labeling, with results folded back into the extracted content before it ever hits the index. Structure-aware chunking lets Tika segment content during extraction rather than leaving that problem to downstream tools. And hooks into popular multimodal embedding services round out the pipeline, making Tika 4.0.0 a complete, production-ready on-ramp to RAG and hybrid search.
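To make the idea of structure-aware chunking concrete, here is a minimal Python sketch. It is not Tika's API: the `Chunk` class, the `chunk_by_structure` function, the `(kind, text)` block format, and the `max_chars` budget are all hypothetical names chosen for illustration. It assumes extraction has already produced an ordered list of heading and paragraph blocks, and it shows one simple policy: never let a chunk cross a heading boundary, and start a new chunk when a size budget would be exceeded.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Chunk:
    heading: str  # nearest preceding heading, "" if none was seen
    text: str     # paragraphs in this chunk, joined by blank lines


def chunk_by_structure(blocks: Iterable[Tuple[str, str]],
                       max_chars: int = 200) -> List[Chunk]:
    """Group paragraphs into chunks that respect heading boundaries.

    `blocks` is a hypothetical intermediate format: (kind, text) pairs
    where kind is "heading" or anything else (treated as a paragraph).
    """
    chunks: List[Chunk] = []
    heading = ""
    buf: List[str] = []
    size = 0

    def flush() -> None:
        # Emit the buffered paragraphs under the heading current at call time.
        nonlocal buf, size
        if buf:
            chunks.append(Chunk(heading, "\n\n".join(buf)))
        buf, size = [], 0

    for kind, text in blocks:
        if kind == "heading":
            flush()            # close the chunk that belongs to the old heading
            heading = text
        else:
            # Start a new chunk if adding this paragraph would exceed the budget.
            # A single oversized paragraph is kept whole rather than split.
            if buf and size + len(text) > max_chars:
                flush()
            buf.append(text)
            size += len(text)
    flush()
    return chunks


if __name__ == "__main__":
    demo = [
        ("heading", "Intro"),
        ("paragraph", "a" * 120),
        ("paragraph", "b" * 120),
        ("heading", "Setup"),
        ("paragraph", "c" * 50),
    ]
    for c in chunk_by_structure(demo):
        print(c.heading, len(c.text))
```

Keeping the heading alongside each chunk is one way to give a downstream embedder or retriever some context about where the text came from; a real pipeline would likely also track tables, lists, and the path through embedded documents.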


Level: Intermediate

Tim has been working in content/metadata extraction (and evaluation), advanced search and relevance tuning for more than 20 years. Tim currently works at Elastic as a search relevance engineer. He is a member of the Apache Software Foundation (ASF), the chair/VP of Apache Tika, and a committer on Apache OpenNLP, Nutch, StormCrawler, Lucene/Solr, PDFBox and Apache POI. Tim holds a Ph.D. in Classical Studies, and in a former life, he was a professor of Latin and Greek.