PyCon DE & PyData 2025

Information Retrieval Without Feeling Lucky: The Art and Science of Search
2025-04-23, Helium3

Search is everywhere, yet effective Information Retrieval remains one of the most underestimated challenges in modern technology. While Retrieval-Augmented Generation has captured significant attention, the foundational element - Information Retrieval - often remains underexplored.

In this talk, we put Information Retrieval center stage by asking:
How do we know that user queries and data 'speak' the same language?
How do we evaluate the relevance and completeness of search results? And how do we prioritize what gets displayed? Or do we even want to hide specific content?

We try to answer these questions by introducing the audience to the art and science of Information Retrieval, exploring metrics such as precision, recall, and desirability. We’ll examine key challenges, including ambiguity, query relaxation, and the interplay between sparse and dense search techniques. Through a live demo using public content from Sendung mit der Maus, we show how hybrid search improves upon vector-based and keyword-based search used in isolation.
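To make two of these metrics concrete, here is a minimal sketch of precision and recall for a single query. It is not taken from the talk: the helper name `precision_recall` and the document IDs are made up for illustration, and desirability is left out because the abstract does not define it.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document IDs the search engine returned
    relevant:  document IDs a human judged relevant for the query
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set

    # Precision: which share of the returned documents is relevant?
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    # Recall: which share of the relevant documents was returned?
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical example data, for illustration only.
retrieved = ["doc1", "doc2", "doc3", "doc4"]
relevant = ["doc2", "doc4", "doc7"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.50 recall=0.67
```

Two of the four returned documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 0.67). Tuning a search engine often trades one against the other.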


Information Retrieval goes beyond keyword matching - it’s about intent, context, and delivering relevant and accurate results. As RAG applications gain traction, understanding the retrieval process becomes more crucial for developers, data scientists, and search engineers.

We start with the Why. People have different needs for search: lookup, research, and inspiration. How well a search engine serves each of these needs depends on the key IR metrics: precision, recall, and desirability. Having introduced these fundamentals, we turn to common retrieval challenges, such as ambiguity, mismatched vocabularies, and the impact of context.
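A small sketch of the vocabulary mismatch problem: an exact keyword match misses a document that uses a synonym of the query term, while a simple query expansion recovers it. The synonym table, the example strings, and the helper names are all hypothetical; a real system might rely on synsets or embeddings instead.

```python
# Hypothetical synonym table, for illustration only.
SYNONYMS = {"car": {"automobile"}, "automobile": {"car"}}


def tokens(text):
    """Lowercase a text and split it into a set of whitespace-separated terms."""
    return set(text.lower().split())


def overlap(query_terms, document):
    """Count how many query terms appear verbatim in the document."""
    return len(query_terms & tokens(document))


def expand(query_terms):
    """Add known synonyms to the query terms (a simple form of query expansion)."""
    expanded = set(query_terms)
    for term in query_terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded


query = "car repair"
document = "Automobile repair manual"

plain_score = overlap(tokens(query), document)
expanded_score = overlap(expand(tokens(query)), document)
print(plain_score, expanded_score)
# prints: 1 2
```

Without expansion only "repair" matches; with expansion "automobile" matches as well, so query and content "speak" the same language again.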

To address these challenges, we then turn to advanced search techniques, comparing sparse (keyword-based) and dense (vector-based) retrieval and highlighting their strengths and limitations. We’ll explore hybrid search as a powerful approach that blends these techniques. In a live demo using crawled data from Sendung mit der Maus, we’ll showcase a hybrid search setup built with tools like Mistral, Elasticsearch, and Streamlit. Although the dataset is in German, the core concepts and search dynamics should be easy to follow for non-native speakers as well.
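One common way to blend a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The abstract does not say which fusion method the demo uses, so the following is a sketch under that assumption, written in plain Python without any Elasticsearch or Mistral calls. The function name, the document IDs, and the two input rankings are hypothetical; `k=60` is the constant conventionally used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) in every list it appears in;
    the scores are summed across lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists, best hit first.
keyword_hits = ["doc_a", "doc_b", "doc_c"]  # sparse (keyword-based) retrieval
vector_hits = ["doc_b", "doc_d", "doc_a"]   # dense (vector-based) retrieval

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
# prints: ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that both retrievers rank highly (`doc_b`, `doc_a`) end up on top, while documents found by only one retriever are kept further down. RRF works on ranks alone, so the incompatible score scales of keyword and vector search never need to be normalized.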

The talk concludes with key takeaways on building effective search systems and a look ahead at future developments in contextualized search.

Tentative Outline:
1. Introduction to Information Retrieval (~ 5 min)
    * Why do we search? Lookup, research, inspiration
    * Core metrics: precision, recall, desirability

2. Challenges in Search and Retrieval (~ 5 min)
    * Ambiguity
    * Discrepancy between query and content
    * The impact of context

3. Search Techniques (~ 10 min)
    * Sparse vs dense retrieval: comparing keyword and vector search (semantic search, embeddings, synsets, decompounders)
    * Hybrid search: combining sparse and dense approaches

4. Hybrid Search in Action (< 10 min)
    * Setting up a hybrid search with Mistral, Elasticsearch, and Streamlit
    * Live demo: exploring search in Lach- & Sachgeschichten from Sendung mit der Maus

5. Takeaways & Outlook (< 5 min)
    * Hybrid search systems combine semantics, precision, and explainability
    * Contextualized search

The talk is directed at anyone interested in building or improving search systems. Attendees will gain a deeper understanding of the tools, methodologies, and metrics essential for building robust and explainable search systems.


Expected audience expertise: Domain: None

Expected audience expertise: Python: Novice

I received my PhD in Machine Learning (ML) and Natural Language Processing (NLP) from the University of Bonn and Fraunhofer IAIS, where I was a member of the Text Mining group. Now I work on AI and data-driven products, mostly focused on applications in the medical and healthcare domain.
My main passion is in NLP, especially for the German language, and Information Retrieval (IR). Sometimes I build Recommender Systems.