<?xml version='1.0' encoding='utf-8' ?>
<!-- Made with love by pretalx v2026.1.1. -->
<schedule>
    <generator name="pretalx" version="2026.1.1" />
    <version>0.9.6</version>
    <conference>
        <title>Haystack US 2026</title>
        <acronym>haystack-us-2026</acronym>
        <start>2026-05-06</start>
        <end>2026-05-07</end>
        <days>2</days>
        <timeslot_duration>00:05</timeslot_duration>
        <base_url>https://pretalx.com</base_url>
        <logo>https://pretalx.com/media/haystack-us-2026/img/haystack-logo-transparent_CgMhIsT_yead7FX.webp</logo>
        <time_zone_name>America/New_York</time_zone_name>
        
        
    </conference>
    <day index='1' date='2026-05-06' start='2026-05-06T04:00:00-04:00' end='2026-05-07T03:59:00-04:00'>
        <room name='Main Stage' guid='ee35b8f2-d00a-59d4-aff3-5712a6e2d79c'>
            <event guid='af618074-0f61-5413-a63d-f02284191c24' id='93598' code='QAH3KH'>
                <room>Main Stage</room>
                <title>Welcome</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2026-05-06T09:00:00-04:00</date>
                <start>09:00</start>
                <duration>00:30</duration>
                <abstract>Welcome</abstract>
                <slug>haystack-us-2026-93598-welcome</slug>
                <track></track>
                
                <persons>
                    
                </persons>
                <language>en</language>
                <description>Welcome</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://pretalx.com/haystack-us-2026/talk/QAH3KH/</url>
                <feedback_url>https://pretalx.com/haystack-us-2026/talk/QAH3KH/feedback/</feedback_url>
            </event>
            <event guid='8f8c8694-ca29-5448-80a8-ef651813200d' id='93359' code='RRYPX7'>
                <room>Main Stage</room>
                <title>AI Governance: Crafting Your Own AI Experiences</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2026-05-06T09:30:00-04:00</date>
                <start>09:30</start>
                <duration>00:45</duration>
                <abstract>In a world where AI shapes what we see, think, and do, true engineering lies not in using tools&#8212;but in designing them.</abstract>
                <slug>haystack-us-2026-93359-ai-governance-crafting-your-own-ai-experiences</slug>
                <track></track>
                
                <persons>
                    <person id='94879'>&#193;ngel Maldonado</person>
                </persons>
                <language>en</language>
                <description>This talk explores how individuals, organisations and communities can move beyond passive consumption of AI, reclaiming agency through AI governance and stewardship. We&apos;ll examine how to build AI experiences that reflect our values and protect our autonomy.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://pretalx.com/haystack-us-2026/talk/RRYPX7/</url>
                <feedback_url>https://pretalx.com/haystack-us-2026/talk/RRYPX7/feedback/</feedback_url>
            </event>
            <event guid='dd74b3bf-3fdb-5145-a7dc-7c7c4c23a1d3' id='92265' code='JSZT33'>
                <room>Main Stage</room>
                <title>Learning to Understand: A Missing Stage of Modern Retrieval</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2026-05-06T10:20:00-04:00</date>
                <start>10:20</start>
                <duration>00:45</duration>
                <abstract>We introduce &#8220;Learning to Understand&#8221; as a corollary to the well-known &#8220;Learning to Rank&#8221; process. By using evals to learn domain-specific query interpretation and rewriting rules and combining them with semantic statistics from your index, it&#8217;s possible to significantly improve search quality beyond typical BM25, vector, and hybrid search techniques.</abstract>
                <slug>haystack-us-2026-92265-learning-to-understand-a-missing-stage-of-modern-retrieval</slug>
                <track></track>
                
                <persons>
                    <person id='92475'>Trey Grainger</person>
                </persons>
                <language>en</language>
                <description>In this talk, we introduce &#8220;Learning to Understand&#8221; as a corollary to the well-known &#8220;Learning to Rank&#8221; process for building ranking classifiers. BM25, Dense Vector Search, and even Hybrid Search approaches all share one thing in common: they focus primarily on algorithmic ranking of search results, as opposed to query understanding for first identifying and matching the right set of potential query interpretations.

Ranking search results is important, but in many cases implementing a query understanding and rewriting layer prior to executing a query provides substantially more value. By identifying the right query interpretation upfront, you can better avoid false positives, better identify multiple interpretations of ambiguous queries, and better target your actual domain-specific terminology and dataset.

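As a minimal sketch of that segmentation, the understanding phase can be isolated behind a small interface (the rule table and the injected execute callable below are hypothetical stand-ins, not a specific product API):

```python
# A query-understanding layer that runs before matching and ranking.
from dataclasses import dataclass

@dataclass
class Interpretation:
    rewritten_query: str
    confidence: float

# Domain-specific interpretation rules; in practice these would be
# learned from evals, here they are a hand-written stand-in.
RULES = {
    "java": [Interpretation("+category:programming java", 0.8),
             Interpretation("+category:travel java island", 0.2)],
}

def understand(query: str) -> list[Interpretation]:
    """Return candidate interpretations, most confident first."""
    candidates = RULES.get(query.lower(), [Interpretation(query, 1.0)])
    return sorted(candidates, key=lambda c: c.confidence, reverse=True)

def search(query: str, execute):
    # Rewrite first; only the winning interpretation reaches the
    # matching-and-ranking phase.
    return execute(understand(query)[0].rewritten_query)
```
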
We&#8217;ll walk through how to properly segment the &#8220;query understanding&#8221; phase from the &#8220;matching and ranking&#8221; phase. We&#8217;ll also show how to use evals to learn domain-specific query patterns, and how to integrate tools like semantic knowledge graphs to generate high-quality, in-domain query understanding before rewriting and executing a much more intelligent query for matching and ranking. By using evals to learn domain-specific query interpretation rules and combining them with semantic statistics from your index, it&#8217;s possible to significantly improve search quality beyond typical BM25, vector, and hybrid search techniques.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://pretalx.com/haystack-us-2026/talk/JSZT33/</url>
                <feedback_url>https://pretalx.com/haystack-us-2026/talk/JSZT33/feedback/</feedback_url>
            </event>
            <event guid='d2fc2c10-324a-524b-9bdf-d19a95f41fe1' id='88911' code='KEYMMP'>
                <room>Main Stage</room>
                <title>From 0 - Production with BBQ at GitHub</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2026-05-06T11:15:00-04:00</date>
                <start>11:15</start>
                <duration>00:45</duration>
                <abstract>Rolling out semantic search is easy, right? Just turn on some vectors and bim bam boom you have vector search... Right? Turns out when you&apos;re GitHub-sized it&apos;s not quite that easy. We&apos;ll walk through the process we took, the lessons we&apos;ve learned, and how you can build a plan to deploy vector search more easily.</abstract>
                <slug>haystack-us-2026-88911-from-0-production-with-bbq-at-github</slug>
                <track></track>
                
                <persons>
                    <person id='89453'>David Tippett</person>
                </persons>
                <language>en</language>
                <description>Let&apos;s turn back the clock and walk through the steps we took to get vector search with BBQ quantization rolled out at GitHub.

##### Timeline: 
- **Let&apos;s do semantic search** - by the way, what does semantic search do?
- **Our MVP** - taught us nothing
- **What do you mean we need capacity?** - the challenges of calculating compute for search at scale
- **Indexing in prod** - now we index a large subset of our data with BBQ... or we would&apos;ve but we hit ALL the bugs with BBQ and Elasticsearch&apos;s reindex
- **We built a new MVP!** - here is where our learnings about vector search started to become material (including: linear retrievers, scoring, oversampling, num_candidates, and more! See the sketch below.)
- **Our fifth? attempt to ingest data** - where our sharding strategy for FTS collided with our index settings and the way Lucene merges work for vectors. 

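One of those learnings, oversampling with rescoring, reduces to a simple pattern. A rough sketch in plain NumPy (illustrative only; not the Elasticsearch API and not the actual GitHub pipeline):

```python
# Oversample-then-rescore: fetch k * oversample candidates using cheap
# quantized scores, then rescore that small pool with full-precision
# vectors and keep the top k.
import numpy as np

def search_oversampled(query, quantized, full, k=10, oversample=3):
    approx = quantized @ query                    # lossy quantized scores
    pool = np.argsort(-approx)[: k * oversample]  # candidate pool
    exact = full[pool] @ query                    # full-precision rescoring
    return pool[np.argsort(-exact)[:k]]
```
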
Through all of these stages, I&apos;ll walk through the tools and techniques we used to determine what was happening and how we got through it. Finally, I&apos;ll show you the roadmap that we now have in place so other (*internal*) users at GitHub can begin to create semantic search experiences across the entire company.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://pretalx.com/haystack-us-2026/talk/KEYMMP/</url>
                <feedback_url>https://pretalx.com/haystack-us-2026/talk/KEYMMP/feedback/</feedback_url>
            </event>
            <event guid='44f76aa4-cfe3-5db4-a3a6-21ac70a1fa61' id='92173' code='HXXJYX'>
                <room>Main Stage</room>
                <title>Apache Tika 4.x: Engineered for RAG and Agentic Search</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2026-05-06T13:30:00-04:00</date>
                <start>13:30</start>
                <duration>00:45</duration>
                <abstract>Apache Tika 4.x is a generational leap &#8212; nearly two decades of battle-tested document processing, engineered for the modern AI stack. Purpose-built for RAG and hybrid search, it delivers scalable, fault-tolerant pipelines, VLM hooks for OCR and image understanding, and structure-aware chunking and embedding integrations for RAG-ready output.</abstract>
                <slug>haystack-us-2026-92173-apache-tika-4-x-engineered-for-rag-and-agentic-search</slug>
                <track></track>
                
                <persons>
                    <person id='92427'>Tim Allison</person>
                </persons>
                <language>en</language>
                <description>Content extraction is the unglamorous foundation of every search and RAG pipeline &#8212; and Apache Tika 4.x makes it faster, more reliable, and more intelligent. As RAG has become a dominant architecture for search, a long-standing challenge has become a first-class problem: pipelines need logical, coherent segments &#8212; not just raw extracted text soup. This talk covers the pipes-based architecture at the heart of Tika 4.0.0: how it scales horizontally across enterprise document collections, handles failures gracefully, recovers without data loss, and sustains throughput across heterogeneous content &#8212; from simple text files to deeply nested, multi-format document bundles. We&apos;ll make this concrete with a demo that fans out across an S3 bucket and emits structured, enriched output to Solr, OpenSearch, or Elasticsearch.

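To make &#8220;logical, coherent segments&#8221; concrete, here is a toy structure-aware chunker that splits on headings instead of fixed windows (a sketch of the concept only, not Tika&apos;s API):

```python
# Split text at headings so chunks follow document structure, then cap
# chunk size, preferring paragraph boundaries over hard cuts.
import re

def chunk(text: str, max_chars: int = 2000) -> list[str]:
    sections = re.split(r"(?m)^(?=#{1,6} )", text)  # new chunk per heading
    chunks = []
    for section in sections:
        section = section.strip()
        while len(section) > max_chars:
            cut = section.rfind("\n\n", 0, max_chars)  # paragraph boundary
            cut = cut if cut > 0 else max_chars        # else hard cut
            chunks.append(section[:cut].strip())
            section = section[cut:].strip()
        if section:
            chunks.append(section)
    return chunks
```
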
Beyond pipeline mechanics, Tika 4.0.0 moves the intelligence layer closer to the source &#8212; and the depth of those features is only possible because of nearly two decades of hard-won document processing experience. Tika has long known how to unpack a PDF attached to an email buried in a ZIP file, and the new VLM integrations inherit that same depth: images and scanned documents &#8212; including embedded and attached files, not just top-level documents &#8212; can be routed inline to VLM APIs for OCR and image labeling, with results folded back into extracted content before it ever hits the index. Structure-aware chunking lets Tika segment content during extraction rather than leaving that problem to downstream tools. And hooks into popular multimodal embedding services round out the pipeline, making Tika 4.0.0 a complete production-ready on-ramp to RAG and hybrid search.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://pretalx.com/haystack-us-2026/talk/HXXJYX/</url>
                <feedback_url>https://pretalx.com/haystack-us-2026/talk/HXXJYX/feedback/</feedback_url>
            </event>
            <event guid='c007af15-6873-559c-b0cc-605d155c085c' id='88383' code='RVCFQ8'>
                <room>Main Stage</room>
                <title>Why your B2B search engine doesn&#8217;t understand your users</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2026-05-06T14:20:00-04:00</date>
                <start>14:20</start>
                <duration>00:45</duration>
                <abstract>This talk uses a real-world B2B search case to show how a decision-based tree helps quickly diagnose why search fails and how to improve relevance without rebuilding the system.</abstract>
                <slug>haystack-us-2026-88383-why-your-b2b-search-engine-doesn-t-understand-your-users</slug>
                <track></track>
                
                <persons>
                    <person id='88994'>Ma&#235;lly Dubois</person>
                </persons>
                <language>en</language>
                <description>E-commerce search engines are often optimized for simple, product-centric queries.
In B2B contexts, however, users search differently: they describe their needs, use cases, and constraints through long, highly domain-specific queries.
The result is predictable: zero results, irrelevant products, degraded relevance&#8212;even though the right products do exist in the catalog.
In this talk, we start from a real-world, large-scale B2B e-commerce search case to challenge a common misconception: the problem is not the ranking. Instead, it lies in a combination of overly strict retrieval, poor query understanding, a single search strategy applied to multiple intents, and product data that is misaligned with real-world usage.
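
To make &#8220;overly strict retrieval&#8221; concrete, compare a strict AND query with a relaxed variant (standard multi_match DSL; the field names are invented for illustration):

```python
# Two query bodies for the same user text: strict AND matching vs. a
# relaxed minimum_should_match that only requires most terms.
def strict_query(text: str) -> dict:
    # Every term must match; long B2B queries often end in zero results.
    return {"query": {"multi_match": {
        "query": text,
        "fields": ["name^3", "description"],
        "operator": "and",
    }}}

def relaxed_query(text: str) -> dict:
    # Require 75% of the terms instead of all of them.
    return {"query": {"multi_match": {
        "query": text,
        "fields": ["name^3", "description"],
        "minimum_should_match": "75%",
    }}}
```
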
Using a decision-based diagnostic tree, we will show how to precisely identify where search fails (strict AND logic, stopwords, field weighting, intent-based routing, data normalization), and how to design targeted experiments to improve relevance without &#8220;rebuilding everything from scratch.&#8221;</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://pretalx.com/haystack-us-2026/talk/RVCFQ8/</url>
                <feedback_url>https://pretalx.com/haystack-us-2026/talk/RVCFQ8/feedback/</feedback_url>
            </event>
            <event guid='1a70a230-7a6a-5a77-bff2-84a66b729d00' id='92146' code='WQQXNP'>
                <room>Main Stage</room>
                <title>AutoReSEARCH &#8211; Ranking coded by agents</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2026-05-06T15:20:00-04:00</date>
                <start>15:20</start>
                <duration>00:45</duration>
                <abstract>Could AI code generation replace Learning to Rank? AI coding tools can generate rankers, but only up to a point. What techniques matter when building an agent coded ranker? And where do traditional search techniques still work?</abstract>
                <slug>haystack-us-2026-92146-autoresearch-ranking-coded-by-agents</slug>
                <track></track>
                
                <persons>
                    <person id='92401'>Doug Turnbull</person>
                </persons>
                <language>en</language>
                <description>Why can&#8217;t I just go to Claude Code and say:

&gt; build a function that returns the most relevant results possible.

In this talk we&#8217;ll give an AI coding agent a few basic primitives - BM25 retrieval, vector retrieval, query-to-category similarity. Then we let the agent code search functions building on these tools. We measure whether relevance improves on test and holdout sets, and continue to iterate.

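For a sense of scale, here is the shape of a search function such an agent might write, with the three primitives assumed as callables and weights that the measure-and-iterate loop keeps adjusting (a hypothetical sketch):

```python
# An agent-written ranker: a hand-rolled blend of the given primitives.
# The weights are exactly the kind of code the eval loop mutates.
def agent_ranker(query, docs, bm25, vector_sim, category_sim):
    def score(doc):
        return (0.55 * bm25(query, doc)
                + 0.35 * vector_sim(query, doc)
                + 0.10 * category_sim(query, doc))
    return sorted(docs, key=score, reverse=True)
```
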
Informed by data, we&apos;ll discuss the best techniques found on several open datasets. We&apos;ll see the promise and limitations of agentic rerankers. Where does traditional search experience still matter? Where does it fall apart? Can an approach like this replace learning to rank?

We&apos;ll see where code generation stops being vibe coding and evolves to become actual *model training* - with code as the model.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://pretalx.com/haystack-us-2026/talk/WQQXNP/</url>
                <feedback_url>https://pretalx.com/haystack-us-2026/talk/WQQXNP/feedback/</feedback_url>
            </event>
            <event guid='cac47a54-93df-5214-9ec9-ddd0f6640572' id='93608' code='MGKQYY'>
                <room>Main Stage</room>
                <title>Lightning Talks</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2026-05-06T16:10:00-04:00</date>
                <start>16:10</start>
                <duration>01:15</duration>
                <abstract>Lightning Talks</abstract>
                <slug>haystack-us-2026-93608-lightning-talks</slug>
                <track></track>
                
                <persons>
                    
                </persons>
                <language>en</language>
                <description>Lightning Talks</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://pretalx.com/haystack-us-2026/talk/MGKQYY/</url>
                <feedback_url>https://pretalx.com/haystack-us-2026/talk/MGKQYY/feedback/</feedback_url>
            </event>
            
        </room>
        
    </day>
    <day index='2' date='2026-05-07' start='2026-05-07T04:00:00-04:00' end='2026-05-08T03:59:00-04:00'>
        <room name='Main Stage' guid='ee35b8f2-d00a-59d4-aff3-5712a6e2d79c'>
            <event guid='5c9547eb-cc38-5752-850a-a19552fded32' id='93609' code='MLMQAJ'>
                <room>Main Stage</room>
                <title>Welcome Back</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2026-05-07T09:00:00-04:00</date>
                <start>09:00</start>
                <duration>00:15</duration>
                <abstract>We welcome you back for the second conference day of Haystack US 2026.</abstract>
                <slug>haystack-us-2026-93609-welcome-back</slug>
                <track></track>
                
                <persons>
                    
                </persons>
                <language>en</language>
                <description>You can expect a look back at day one and a preview of what&apos;s still to come.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://pretalx.com/haystack-us-2026/talk/MLMQAJ/</url>
                <feedback_url>https://pretalx.com/haystack-us-2026/talk/MLMQAJ/feedback/</feedback_url>
            </event>
            <event guid='abd9f90f-45fd-56df-9483-753eda2226b9' id='88903' code='K7N7XT'>
                <room>Main Stage</room>
                <title>Agentic Tuning: Search Relevance on Autopilot</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2026-05-07T09:15:00-04:00</date>
                <start>09:15</start>
                <duration>00:45</duration>
                <abstract>Search relevance tuning is notoriously difficult, often requiring a deep understanding of Lucene scoring, complex query DSLs, and iterative manual testing. 
This session introduces Agentic Relevance Tuning&#8212;a framework that leverages LLM-based agents to automate the full search lifecycle, making search tuning faster, more accurate, and more accessible.</abstract>
                <slug>haystack-us-2026-88903-agentic-tuning-search-relevance-on-autopilot</slug>
                <track></track>
                
                <persons>
                    <person id='88893'>Daniel Wrigley</person>
                </persons>
                <language>en</language>
                <description>Search relevance tuning is notoriously difficult, often requiring a deep understanding of Lucene scoring, complex query DSLs, and iterative manual testing. While tools like OpenSearch User Behavior Insights (UBI) and the Search Relevance Workbench provide the data and the environment for improvement, the leap from &quot;analyzing data&quot; to &quot;deploying a fix&quot; remains a significant hurdle for many.
This session introduces Agentic Relevance Tuning (ART)&#8212;a framework that leverages LLM-based agents to automate the full search lifecycle. We demonstrate how to move beyond manual experimentation by building an infrastructure of specialized agents that identify issues, hypothesize improvements, and orchestrate offline and online evaluation.

Attendees will learn how to:
- Identify Opportunities: Use OpenSearch UBI to automatically detect relevance gaps through user signals.
- Automate Evaluation: Leverage the Search Relevance Workbench for offline &quot;judge&quot; agents to run automated benchmarks and identify winning hypotheses.
- Close the Loop: Transition from issue to resolution using a conversational interface that lowers the technical barrier for non-experts.
- Validate in Production: Deploy agent-orchestrated interleaved A/B testing to ensure real-world improvements.

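A skeleton of that loop, with every agent and tool passed in as a hypothetical placeholder (none of these callables is a real OpenSearch or Search Relevance Workbench API):

```python
# identify -> hypothesize -> evaluate, keeping the best configuration.
def tuning_loop(baseline, offline_eval, identify_gap, propose_fix, rounds=5):
    best, best_score = baseline, offline_eval(baseline)
    for _ in range(rounds):
        issue = identify_gap()                 # e.g. mined from UBI signals
        candidate = propose_fix(best, issue)   # LLM agent drafts a change
        score = offline_eval(candidate)        # judge agent benchmarks it
        if score > best_score:
            best, best_score = candidate, score
    return best  # winners then go to interleaved A/B testing online
```
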
By combining the analytical power of modern search engines with the reasoning capabilities of agents, we can make search tuning faster, more accurate, and accessible to the entire business&#8212;not just the search experts.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://pretalx.com/haystack-us-2026/talk/K7N7XT/</url>
                <feedback_url>https://pretalx.com/haystack-us-2026/talk/K7N7XT/feedback/</feedback_url>
            </event>
            <event guid='9107ec9b-1c8f-5787-8e5e-18d3194294c7' id='90926' code='JGYD3Y'>
                <room>Main Stage</room>
                <title>Evolution of Relevance Engineering to Context Engineering</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2026-05-07T10:05:00-04:00</date>
                <start>10:05</start>
                <duration>00:45</duration>
                <abstract>As search powers RAG and agentic systems, relevance goals shift from ranking documents to assembling effective context. This talk explores how traditional lexical, semantic, and hybrid relevance changes when feeding LLMs, with lessons on chunking and snippet extraction, diversification, evaluation, and more.</abstract>
                <slug>haystack-us-2026-90926-evolution-of-relevance-engineering-to-context-engineering</slug>
                <track></track>
                
                <persons>
                    <person id='91215'>Kathleen DeRusso</person>
                </persons>
                <language>en</language>
                <description>As RAG and agentic search mature, retrieval extends from being primarily a relevance concern to something that also drives latency, cost, and system reliability at scale. We&#8217;re very good at traditional search relevance. We know how to tune BM25, semantic, and hybrid search, and rerank aggressively. But when search results are used as context for LLMs, many of our familiar assumptions start to break down.

Once retrieval feeds a reasoning system instead of a human, the definition of &#8220;relevant&#8221; quietly changes. The goal is no longer to rank the best documents, but to assemble the right context. Chunking, snippet extraction, and diversification become critical relevance skills. On top of that, relevance misses have real cost, cascading into additional tool calls, increased token utilization, added latency, and hallucinations.

In this talk, we&#8217;ll explore how relevance shifts when the objective moves from &#8220;return the best documents&#8221; to &#8220;construct effective context for reasoning&#8221;. We&#8217;ll walk through how traditional lexical, semantic, and hybrid relevance techniques behave when used in RAG and agentic workflows, highlighting both where they still work and where they fail in subtle and surprising ways.

Along the way, we&#8217;ll cover chunking and snippet extraction, result diversification, and how evaluation needs to evolve when the ranked list is no longer the end product. The talk closes with lessons learned from real-world systems, common roadblocks teams encounter when making this transition, and concrete tips for adapting existing search pipelines to serve LLM-driven applications more effectively.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://pretalx.com/haystack-us-2026/talk/JGYD3Y/</url>
                <feedback_url>https://pretalx.com/haystack-us-2026/talk/JGYD3Y/feedback/</feedback_url>
            </event>
            <event guid='181827e6-3119-5acb-b641-97eb23300f9c' id='92199' code='HRG9BY'>
                <room>Main Stage</room>
                <title>LLMs as Rerankers: A Case Study on Hybrid Email Search</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2026-05-07T11:00:00-04:00</date>
                <start>11:00</start>
                <duration>00:45</duration>
                <abstract>Purpose-built rerankers are faster and cheaper, but are they better? We argue LLM rerankers win on what matters most in production: instruction-following and iteration speed, with more-than-acceptable tradeoffs on cost and latency. Our discussion is backed by a case study from Superhuman&apos;s production hybrid email search system.</abstract>
                <slug>haystack-us-2026-92199-llms-as-rerankers-a-case-study-on-hybrid-email-search</slug>
                <track></track>
                
                <persons>
                    <person id='92448'>William Barber</person><person id='92484'>Agust&#237;n Bernardo</person>
                </persons>
                <language>en</language>
                <description>The conventional wisdom, backed by vendor benchmarks, is that purpose-built rerankers are more accurate than LLMs at ranking. **We challenge this.** In our experience building and maintaining production search systems, LLM rerankers deliver better search results and faster improvements to user experience, primarily because they are flexible tools that excel at following complex instructions.

This talk makes the case for LLMs as rerankers through three lenses: iterability, capability, and cost. Each lens is supported by results from the Superhuman case study.

### Iterability

Day-to-day work on a production search system means triaging a list of failure cases. With a traditional reranker, fixing these means preparing data, fine-tuning, and deploying a custom model. With an LLM reranker, you edit a prompt.

This difference sounds incremental but compounds quickly. When improving results is as easy as refining an instruction, teams naturally spend more time examining their data and shipping fixes. Results improve week over week instead of quarter over quarter. In a landscape where user expectations of AI products shift constantly, this iteration speed is a decisive advantage that reranking benchmarks do not capture.

Superhuman improved search results by running fast ablation cycles. The hypothesis &#8594; config change &#8594; rerun &#8594; measure loop is much more practical when relevance logic lives in prompts rather than a model training pipeline.

### Capability

A traditional reranker takes a query and a document and returns a scalar score. An LLM reranker can do that and much more. The same model pass that ranks your documents can also consolidate facts across them, flag contradictions, discard distractor segments, or annotate specific passages. Your &quot;reranker&quot; becomes a reasoning layer, not just a sorting function.

This flexibility extends to instruction-following. Negation instructions are a useful illustration: &quot;ignore documents where the only relevant segment is a table of contents&quot; is straightforward to express in a prompt but notoriously difficult for smaller instruction-following rerankers to handle reliably. The gap between LLMs and specialized rerankers on complex, nuanced instructions reflects fundamental differences in model scale, training data breadth, and the ability to leverage test-time compute.

Superhuman&apos;s tests reveal that LLM rerankers enable *safe over-retrieval* with instruction-aware filtering downstream.

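A sketch of such an instruction-following reranker (call_llm is a hypothetical stand-in for any chat-completion client that returns the model text):

```python
# One prompt ranks the whole candidate pool and carries nuanced rules,
# like the negation instruction above.
import json

INSTRUCTIONS = (
    "Rank the documents by relevance to the query. Ignore documents "
    "where the only relevant segment is a table of contents. "
    "Respond with a JSON array of document ids, best first."
)

def rerank(query: str, docs: dict[str, str], call_llm) -> list[str]:
    listing = "\n".join(f"[{doc_id}] {text}" for doc_id, text in docs.items())
    prompt = f"{INSTRUCTIONS}\n\nQuery: {query}\n\nDocuments:\n{listing}"
    ranked = json.loads(call_llm(prompt))
    return [doc_id for doc_id in ranked if doc_id in docs]  # drop bad ids
```
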
### Cost

Given the scale difference &#8212; 100B+ parameter LLMs vs. sub-4B parameter rerankers &#8212; you might expect dramatically higher costs. In practice, batch inference, sparse mixture-of-experts architectures, prompt caching, and competitive pricing dynamics have narrowed the gap considerably. The primary remaining tradeoff is latency; we&#8217;ll discuss when that tradeoff matters and when it doesn&#8217;t.

Superhuman&#8217;s results make the economics feel less abstract: the biggest quality gains came from increasing retrieval depth and rebalancing hybrid weighting. The &#8220;expensive&#8221; part was simply letting the system consider more candidates and then using an LLM to make the final call. This is often a good trade in production because compute spent on reranking scales with *retrieval depth*, and you can tune that knob directly based on latency budgets and observed recall/precision needs.

### Case study: Improving recall in hybrid email search

We will share findings from Superhuman&apos;s email search system, where systematic ablation experiments across retrieval depth, vector-keyword weighting, recency bias, and filtering strategies revealed that the largest recall gains came from loosening upstream retrieval constraints and trusting the LLM reranker to handle relevance downstream. We&apos;ll walk through the experimental setup, the failure modes uncovered, and how the results informed changes to their production pipeline.

### What attendees will take away

This talk is structured to leave the audience with three concrete things:

1. **A decision framework** for when LLM rerankers are the right choice over dedicated ones, centered on how ambiguous and fast-evolving your relevance criteria are.
2. **Engineering patterns** for making LLM reranking production-viable, including prompt design, latency management, and output structuring.
3. **Experimental evidence** from a real production system that made the switch, including the methodology for running your own comparison.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://pretalx.com/haystack-us-2026/talk/HRG9BY/</url>
                <feedback_url>https://pretalx.com/haystack-us-2026/talk/HRG9BY/feedback/</feedback_url>
            </event>
            <event guid='3c928b15-8270-5091-9438-d899ecd2c9d3' id='93360' code='GA9ZAR'>
                <room>Main Stage</room>
                <title>Do we still need search engines?</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2026-05-07T13:15:00-04:00</date>
                <start>13:15</start>
                <duration>00:45</duration>
                <abstract>Search has a new User Interface! All search will be Agentic/RAG and delivered through a chat interface! The days of the monolithic search engine are over!</abstract>
                <slug>haystack-us-2026-93360-do-we-still-need-search-engines</slug>
                <track></track>
                
                <persons>
                    <person id='92475'>Trey Grainger</person><person id='95391'>Ren&#233; Kriegler</person><person id='92440'>Jon Handler</person>
                </persons>
                <language>en</language>
                <description>Well, there&#8217;s certainly a lot of hype that would say that. How much is true? As we go forward into a world dominated by AI-powered search, what are the key elements of search that will remain? Do we still need the inverted index? Will agents replace user behavior capture and signal boosting? Do we still need BM25 and other ranking algorithms, or will multi-agent swarms and search workflows obviate the need for statistical methods of ranking? Where does the search function live in the stack of tomorrow and how can we build for the future today? Come to this panel discussion to hear our take on these fundamental questions.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://pretalx.com/haystack-us-2026/talk/GA9ZAR/</url>
                <feedback_url>https://pretalx.com/haystack-us-2026/talk/GA9ZAR/feedback/</feedback_url>
            </event>
            <event guid='8cf66aed-879f-596d-bb0b-b4f5746b6b37' id='92127' code='MFVLQU'>
                <room>Main Stage</room>
                <title>Managing Search Teams: Field Stories &amp; Practical Takeaways</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2026-05-07T14:05:00-04:00</date>
                <start>14:05</start>
                <duration>00:45</duration>
                <abstract>Even though search teams are structured differently across the industry, they share common challenges like balancing learning with delivery and nurturing a culture built for continuous iteration. This talk distills a decade of organizational lessons from building Yelp&#8217;s AI-powered search into repeatable patterns for any team facing similar hurdles.</abstract>
                <slug>haystack-us-2026-92127-managing-search-teams-field-stories-practical-takeaways</slug>
                <track></track>
                
                <persons>
                    <person id='92382'>Cem Aksoy</person>
                </persons>
                <language>en</language>
                <description>Search quality is never &#8220;done.&#8221; Especially in the AI-powered search world, search teams run on a cycle of research, infrastructure changes, and model refinements whose results dictate the next move. That built-in uncertainty makes managing a search team very different from managing a deterministic product-feature team. Drawing on a decade of building Yelp&#8217;s AI-powered search, this talk offers real-world lessons on four fronts:

- Team structure: Trade-offs between different org shapes (joint pods vs. split infra/relevance groups) and their impact on day-to-day operations
- Project execution: Ways to balance open-ended research projects with hard product delivery dates
- Stakeholder management: Tactics for setting expectations on long-running machine learning explorations
- Culture: Practices that develop product-minded relevance engineers

The takeaways are anchored in concrete stories and we&#8217;ll close with fresh observations on how LLM workstreams have revised some of these lessons. 

Search teams may not share a single canonical shape, but the underlying patterns in this talk should translate to any organization steering the moving target of modern, AI-driven search.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://pretalx.com/haystack-us-2026/talk/MFVLQU/</url>
                <feedback_url>https://pretalx.com/haystack-us-2026/talk/MFVLQU/feedback/</feedback_url>
            </event>
            <event guid='618c5aa0-7eb2-5405-b97e-6636fc63aa55' id='92106' code='KMXBMM'>
                <room>Main Stage</room>
                <title>Adaptive Relevance with Agentic Search</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2026-05-07T15:05:00-04:00</date>
                <start>15:05</start>
                <duration>00:45</duration>
                <abstract>Traditional search pipelines rely heavily on static query parsing and after-the-fact relevance analysis. In this session, we present a new paradigm: using LangGraph with OpenSearch to create an agent-based system that can tune hybrid search in real time.</abstract>
                <slug>haystack-us-2026-92106-adaptive-relevance-with-agentic-search</slug>
                <track></track>
                
                <persons>
                    <person id='92367'>Kevin M. Butler</person>
                </persons>
                <language>en</language>
                <description>We&#8217;ll review a multi-stage pipeline where live retrieval diagnostics&#8212;such as confidence gaps and score variance&#8212;dynamically adjust the blend of lexical and semantic search, trigger query rewrites, and rerank results. Then we&#8217;ll show how it measures up by benchmarking the adaptive approach against a static lexical baseline and a standard hybrid model. Attendees will leave with actionable frameworks for real-time query refinement of search systems.

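A toy version of that diagnostic logic, deciding how much weight the lexical side gets in the blend (thresholds invented for illustration):

```python
# If the lexical score distribution looks decisive, weight it up;
# if it looks flat, lean on semantic scores instead.
import statistics

def lexical_weight(lexical_scores: list[float]) -> float:
    if len(lexical_scores) &lt; 2:
        return 0.5
    top = sorted(lexical_scores, reverse=True)
    confidence_gap = top[0] - top[1]
    if confidence_gap > 2.0 and statistics.pvariance(top) > 1.0:
        return 0.8   # lexical results look decisive
    if confidence_gap &lt; 0.2:
        return 0.2   # flat scores: trust the semantic side more
    return 0.5
```
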
Speaker Bio: Kevin Butler is a Search &amp; Data Infrastructure Consultant at KMW Technology. He specializes in OpenSearch pipelines, adaptive hybrid search, and agentic LangGraph-driven workflows. His work focuses on bridging retrieval models with dynamic real-time decision-making.
Technical Level: Intermediate to Advanced (Familiarity with OpenSearch/Elasticsearch and basic agentic AI concepts assumed)</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://pretalx.com/haystack-us-2026/talk/KMXBMM/</url>
                <feedback_url>https://pretalx.com/haystack-us-2026/talk/KMXBMM/feedback/</feedback_url>
            </event>
            <event guid='1735e89c-7881-5247-89d6-417823c8ff54' id='92159' code='WW3HWZ'>
                <room>Main Stage</room>
                <title>When BM25 Scores Disagree: A Corpus-Independent Alternative</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2026-05-07T16:00:00-04:00</date>
                <start>16:00</start>
                <duration>00:45</duration>
                <abstract>In distributed search, BM25 returns different results across nodes because IDF and average document length vary with each node&apos;s corpus state. StableTfl replaces these with a term-length rarity heuristic, eliminating all corpus dependency. On 22 BEIR datasets, it retains ~90% of BM25&apos;s NDCG@10 while guaranteeing identical rankings across nodes.</abstract>
                <slug>haystack-us-2026-92159-when-bm25-scores-disagree-a-corpus-independent-alternative</slug>
                <track></track>
                
                <persons>
                    <person id='92412'>Tianxiao Wei</person>
                </persons>
                <language>en</language>
                <description>BM25 relies on two corpus-level statistics: inverse document frequency and average document length. In a distributed search system where nodes index independently and converge only eventually, these statistics differ across nodes &#8212; and the same query produces different rankings depending on which node serves it. For retrieval pipelines that expect deterministic results, particularly RAG systems and hybrid search architectures that fuse lexical and vector scores, this inconsistency is a production-grade problem.

StableTfl is a drop-in BM25 replacement built as a Lucene Similarity that eliminates all corpus-level dependencies. It replaces IDF with a synthetic term-rarity function based on term character length &#8212; longer terms are rarer in natural language, so character count serves as a proxy for inverse document frequency. Document length normalization is folded into the same function rather than relying on the corpus average. The result: scoring depends only on the query term, its frequency in the document, and the document&apos;s length. Two nodes with completely different corpora will always produce identical rankings.

Benchmarked against BM25Okapi on 22 BEIR datasets with identical tokenization, StableTfl retains roughly 90% of BM25&apos;s average NDCG@10 (0.299 vs 0.331). BM25 wins on 19 datasets, but StableTfl matches or beats BM25 on argument retrieval, COVID-19 literature search, and open-domain QA &#8212; domains where term-level rarity appears to matter more than collection-specific frequency patterns. There is no additional runtime overhead compared to BM25, since term rarity values can be precomputed into a 256-entry lookup table.

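In that spirit, a concept sketch of corpus-independent scoring (the constants and rarity function are illustrative, not the actual StableTfl formula):

```python
# BM25-style tf saturation, but corpus statistics are replaced by fixed
# constants and a precomputed 256-entry term-length rarity table.
import math

K1, B, REF_LEN = 1.2, 0.75, 100.0             # fixed stand-ins for corpus stats
RARITY = [math.log1p(n) for n in range(256)]  # rarity by term length

def score(term: str, tf: int, doc_len: int) -> float:
    rarity = RARITY[min(len(term), 255)]           # length proxy for IDF
    norm = K1 * (1.0 - B + B * doc_len / REF_LEN)  # fixed reference length
    return rarity * tf * (K1 + 1.0) / (tf + norm)
```
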
The talk will cover: (1) why corpus statistics break consistency in distributed search, (2) how StableTfl works as a Lucene Similarity, (3) where the quality trade-off hurts most and where it&apos;s minimal, and (4) how to evaluate whether corpus-independent scoring fits your retrieval stack &#8212; especially if you&apos;re building hybrid or RAG pipelines where result consistency is as important as raw relevance.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://pretalx.com/haystack-us-2026/talk/WW3HWZ/</url>
                <feedback_url>https://pretalx.com/haystack-us-2026/talk/WW3HWZ/feedback/</feedback_url>
            </event>
            <event guid='fc137dc3-2277-5413-802b-6a8a4dacf3da' id='93610' code='MJHXAM'>
                <room>Main Stage</room>
                <title>Closing</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2026-05-07T16:45:00-04:00</date>
                <start>16:45</start>
                <duration>00:15</duration>
                <abstract>Closing</abstract>
                <slug>haystack-us-2026-93610-closing</slug>
                <track></track>
                
                <persons>
                    
                </persons>
                <language>en</language>
                <description>Closing</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://pretalx.com/haystack-us-2026/talk/MJHXAM/</url>
                <feedback_url>https://pretalx.com/haystack-us-2026/talk/MJHXAM/feedback/</feedback_url>
            </event>
            
        </room>
        
    </day>
    
</schedule>
