Towards a deeper understanding of retrieval and vector databases PyData Paris 2024

2024-09-26 13:50–14:20, Louis Armand 2 - Ouest

Retrieval is the process of searching a large database for items (images, texts, …) that are similar to one or more query items. A classical approach is to transform the database items and the query item into vectors (also called embeddings) with a trained model so that they can be compared via a distance metric. Retrieval has many applications in various fields, e.g. building a visual recommendation system like Google Lens or a RAG (Retrieval Augmented Generation) system, a technique used to inject specific knowledge into LLMs depending on the query.
Vector databases ease the management, serving and retrieval of vectors in production and implement efficient indexes to rapidly search through millions of vectors. They have gained a lot of attention over the past year due to the rise of LLMs and RAG.

In this talk, we will detail two real-life projects (deduplication of real estate adverts using the image embedding model DinoV2, and RAG for a medical company using the text embedding model Ada-2) and dive deep into retrieval and vector databases to demystify the key aspects and highlight the limitations: the HNSW index, a comparison of the providers, metadata filtering (the related drop in performance when filtering out too many nodes, and how indexing partially mitigates it), partitioning, reciprocal rank fusion, the performance and limitations of the representations created by SOTA image and text embedding models, …
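As a rough illustration of the reciprocal rank fusion mentioned above, here is a minimal sketch. It assumes the commonly used formulation in which each document scores the sum of 1 / (k + rank) over the rankings it appears in, with k = 60 as a customary default. The two input rankings (one from vector search, one from keyword search) and the document ids are hypothetical and are not taken from the talk.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1), and documents are sorted by their total score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id to keep the output stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings returned by two different retrievers.
vector_ranking = ["doc_2", "doc_1", "doc_3"]
keyword_ranking = ["doc_1", "doc_4", "doc_2"]

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

Because only ranks are used, the two retrievers do not need comparable score scales, which is one reason this kind of fusion is convenient for combining vector search with another search method.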

Retrieval is the process of searching a large database for items (images, texts, …) that are similar to one or more query items. A classical approach is to transform the database items and the query item into vectors (also called embeddings) with a trained model so that they can be compared via a distance metric. Retrieval has many applications in various fields, e.g. building a visual recommendation system like Google Lens or a RAG (Retrieval Augmented Generation) system, a technique consisting of retrieving the documents most similar to a given question, then answering the question with an LLM given the retrieved documents.
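The retrieval principle described above can be sketched in a few lines. This is a brute-force illustration only: the 3-dimensional vectors and the item ids are made up, whereas in practice the embeddings would come from a trained model, and cosine similarity is just one possible choice of metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vector, database, top_k=2):
    """Return the ids of the top_k database items most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), item_id)
        for item_id, vector in database.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:top_k]]


# Hypothetical embeddings; a real system would compute them with a model.
database = {
    "advert_a": [0.9, 0.1, 0.0],
    "advert_b": [0.8, 0.2, 0.1],
    "advert_c": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]

results = retrieve(query, database, top_k=2)
print(results)
```

This exhaustive comparison scales linearly with the size of the database, which is why dedicated indexes become useful once the database holds millions of vectors.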

Vector databases ease the management, serving and retrieval of vectors in production and implement efficient indexes to rapidly search through millions of vectors. They have gained a lot of attention over the past year due to the rise of LLMs and RAG.

Although people working with LLMs are increasingly familiar with the basic principles of vector databases, the finer details and nuances often remain obscure. This lack of clarity hinders the ability to make optimal use of these systems. For instance, which differences between providers actually matter? How mature is metadata filtering, when should it be used and what are its limitations? What is partitioning and what is the best way to do it? Which functionalities beyond simple vector search can be helpful? …

In this talk, we will clarify these grey areas to help better leverage these systems. More specifically, we will:

1/ Introduce the main concepts:
- How retrieval works
- What vector databases are
- What the use cases are. We will introduce two real-life projects we carried out using vector databases, which will provide practical examples throughout the talk:
  - Deduplication of real estate adverts using the image embedding model DinoV2
  - RAG for a medical company using the text embedding model Ada-2
2/ Highlight the pros and cons of the different providers, with a focus on the differences between dedicated vector databases and more general-purpose databases.
3/ Deep dive into the technicality of vector search:
- How the HNSW index works and how to configure it
- The necessity of metadata filtering and the related limitations: precision plunges mainly when the filter excludes more than 80-90% of the nodes. Indexing then becomes necessary, but it remains imperfect in complex cases.
- The performance and limitations of the representations created by state-of-the-art image and text embedding models
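To give some intuition for why selective metadata filters are hard to combine with top-k vector search, here is a deliberately simplified sketch. It uses brute-force search as a stand-in for an index, so it does not reproduce the HNSW behaviour described in the list above; it only contrasts filtering after the search with filtering before it. The `city` field, the random 2-dimensional vectors and the 5% selectivity are all hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query, items, k):
    """Brute-force top-k by squared Euclidean distance (stand-in for an index)."""
    return sorted(items, key=lambda item: squared_distance(query, item["vector"]))[:k]


def post_filter_search(query, items, predicate, k):
    # Search first, then drop the results that fail the metadata filter.
    return [item for item in nearest(query, items, k) if predicate(item)]


def pre_filter_search(query, items, predicate, k):
    # Filter first, then search only among the matching items.
    return nearest(query, [item for item in items if predicate(item)], k)


rng = random.Random(0)  # fixed seed so the run is reproducible
items = [
    {
        "id": i,
        "city": "Paris" if i % 20 == 0 else "Other",  # filter keeps 5% of items
        "vector": [rng.random(), rng.random()],
    }
    for i in range(1000)
]
query = [0.5, 0.5]


def in_paris(item):
    return item["city"] == "Paris"


post = post_filter_search(query, items, in_paris, k=10)
pre = pre_filter_search(query, items, in_paris, k=10)
print(len(post), len(pre))
```

With a filter that excludes 95% of the items, searching first and filtering afterwards can only return as many results as happen to match among the 10 nearest neighbours, while filtering first always returns 10 matching items here, at the cost of scanning every matching item. Graph-based indexes face a related trade-off, since they cannot simply scan all matching items.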

Noé Achache

I am an Engineering Manager (for Data Science projects) at Sicara, where I have worked on a wide range of projects, mostly related to vector databases, computer vision, prediction with structured data and, more recently, LLMs.
I am currently leading the GenAI development in the company.
You can find all my talks and articles here: https://www.sicara.fr/en/noe-achache, e.g.
- https://www.sicara.fr/blog-technique/how-to-choose-your-vector-database-in-2023
- https://www.youtube.com/watch?v=aX_hdQEintc
