Architecting Scalable Multi-Modal Video Search
2025-10-01, Gaston Berger

The exponential growth of video data presents significant challenges for effective content discovery. Traditional keyword search falls short when queries hinge on visual nuance. This talk addresses the design and implementation of a robust system for large-scale, multi-modal video retrieval, enabling search across petabytes of data using diverse inputs such as text descriptions (e.g., appearance, actions) and query images (e.g., faces). We will explore an architecture that combines efficient batch preprocessing for feature extraction (including person detection and face/CLIP-style embeddings) with optimized vector-database indexing. Attendees will learn strategies for managing massive datasets, optimizing ML inference pipelines for speed and cost efficiency (touching on lightweight models and specialized runtimes), and building interactive systems that bridge pre-computed indexes with real-time analysis for deeper, on-demand insights.


Searching through vast archives of video content (potentially petabytes) requires more than just metadata indexing; it demands deep understanding of the visual and temporal information within. This presentation details our journey in building a system capable of handling such scale and complexity, focusing on enabling sophisticated, multi-modal search queries crucial for various analytical tasks.
We will outline the core architectural components:
1. Scalable Preprocessing Pipeline:
- Strategies for efficiently chunking massive video volumes.
- Extracting rich features: detecting and cropping relevant entities (e.g., people), and generating multi-purpose embeddings (e.g., CLIP-style embeddings for joint text/image retrieval, dedicated face embeddings for recognition); a minimal extraction sketch follows this section.
- Attaching relevant metadata at different granularities (video, chunk, entity).
- Learnings: Balancing processing depth vs. computational cost; handling diverse video formats and quality.
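To ground the preprocessing discussion, here is a minimal sketch of chunk-level feature extraction, assuming OpenCV and Hugging Face Transformers are available; the chunk length, frame sampling rate, and CLIP checkpoint are illustrative choices rather than the system's exact configuration, and the person-detection and face-embedding stages are omitted for brevity.

```python
# Minimal sketch: split a video into fixed-length chunks and produce one
# CLIP-style embedding per chunk by mean-pooling sampled frame embeddings.
# Chunk length, sampling rate, and checkpoint are illustrative placeholders.
import cv2
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

CHUNK_SECONDS = 10       # illustrative chunk granularity
FRAMES_PER_CHUNK = 4     # frames sampled per chunk for embedding

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def chunk_embeddings(video_path: str):
    """Yield (chunk_index, start_second, embedding) for each full chunk."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    frames_per_chunk = max(int(fps * CHUNK_SECONDS), 1)
    step = max(frames_per_chunk // FRAMES_PER_CHUNK, 1)

    chunk_idx, frame_idx, images = 0, 0, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % step == 0:
            images.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        frame_idx += 1
        if frame_idx % frames_per_chunk == 0 and images:
            inputs = processor(images=images, return_tensors="pt")
            with torch.no_grad():
                feats = model.get_image_features(**inputs)
            yield chunk_idx, chunk_idx * CHUNK_SECONDS, feats.mean(dim=0)
            chunk_idx, images = chunk_idx + 1, []
    cap.release()
```

In a full pipeline, these per-chunk vectors, together with per-entity crops and their face embeddings, would be written to the vector store alongside video-, chunk-, and entity-level metadata.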
2. Optimized Indexing and Retrieval:
- Leveraging vector databases for storing and querying high-dimensional embeddings at scale.
- Designing indexing strategies for hybrid search that combine vector similarity with metadata filtering (see the query sketch after this section).
- Learnings: Schema design for complex video metadata; performance tuning for low-latency retrieval.
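As an illustration of hybrid retrieval, the sketch below assumes the chunk embeddings above are stored in Qdrant; the collection name and payload fields ("video_chunks", "camera_id", "start_second") are hypothetical, and any vector database with payload filtering could play the same role.

```python
# Sketch: hybrid search combining CLIP text-to-image similarity with
# structured metadata filters. Collection and payload field names are
# illustrative placeholders.
import torch
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue, Range
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
client = QdrantClient(url="http://localhost:6333")


def search_chunks(text_query: str, camera_id: str, max_second: int, k: int = 20):
    """Embed the text query and retrieve the top-k matching video chunks."""
    inputs = processor(text=[text_query], return_tensors="pt", padding=True)
    with torch.no_grad():
        query_vector = model.get_text_features(**inputs)[0].tolist()
    return client.search(
        collection_name="video_chunks",
        query_vector=query_vector,
        query_filter=Filter(must=[
            FieldCondition(key="camera_id", match=MatchValue(value=camera_id)),
            FieldCondition(key="start_second", range=Range(lte=max_second)),
        ]),
        limit=k,
        with_payload=True,   # return chunk metadata alongside scores
    )


hits = search_chunks("a person in a red jacket carrying a backpack", "cam-12", 3600)
for hit in hits:
    print(hit.score, hit.payload)
```

Pushing the metadata filter into the vector search itself, rather than post-filtering the top-k results, keeps recall stable even when the filter is highly selective.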
3. Efficiency and Optimization:
- The critical role of model selection: balancing accuracy and inference speed (e.g., exploring efficient object detectors and embedding models).
- Techniques for optimizing inference performance on available hardware, including the potential benefits of frameworks like TensorRT or inference engines like vLLM for transformer models (an export-and-benchmark sketch follows this section).
- Fine-tuning strategies for adapting foundation models to specific domain requirements without excessive computational overhead.
- Learnings: Practical approaches to model optimization; cost-benefit analysis of different optimization techniques.
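As one hedged example of this optimization step, the sketch below exports a stand-in backbone to ONNX and benchmarks it with ONNX Runtime; ResNet-18 is only a placeholder for the chosen detector or embedding model, and in practice one would also evaluate GPU execution providers or TensorRT engines.

```python
# Sketch: export a (placeholder) encoder to ONNX and compare latency against
# eager PyTorch. ResNet-18 stands in for the real detector/embedding model.
import time

import numpy as np
import onnxruntime as ort
import torch
import torchvision

model = torchvision.models.resnet18(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy, "encoder.onnx",
    input_names=["images"], output_names=["features"],
    dynamic_axes={"images": {0: "batch"}},  # allow variable batch size
)

session = ort.InferenceSession("encoder.onnx", providers=["CPUExecutionProvider"])
batch = np.random.randn(8, 3, 224, 224).astype(np.float32)

start = time.perf_counter()
with torch.no_grad():
    model(torch.from_numpy(batch))
torch_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
session.run(None, {"images": batch})
onnx_ms = (time.perf_counter() - start) * 1000

print(f"eager PyTorch: {torch_ms:.1f} ms   ONNX Runtime: {onnx_ms:.1f} ms")
```

The same export-then-benchmark loop generalizes to other runtimes, which is where the cost-benefit analysis discussed in the talk comes in.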
4. Bridging Retrieval and Real-time Analysis:
- Designing a user interface to present retrieved video segments effectively.
- Integrating cutting-edge multi-modal models (such as Video LLMs) for on-demand, deeper analysis of retrieved candidates, helping analysts with contextual understanding or summarization without pre-computing everything (a sketch of this pattern follows this section).
- Learnings: Architecting systems that combine batch and real-time processing; managing user interaction flows.
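To illustrate the batch/real-time split, the sketch below sends frames from a retrieved segment to a multimodal model served behind an OpenAI-compatible endpoint (as vLLM can provide); the endpoint URL, model name, and frame-selection logic are placeholders.

```python
# Sketch: on-demand analysis of a retrieved segment via an OpenAI-compatible
# chat endpoint (e.g. a vLLM-served multimodal model). URL and model name
# are placeholders; frames are JPEG bytes taken from the retrieved chunk.
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


def analyse_segment(frame_jpegs: list[bytes], question: str) -> str:
    """Ask the multimodal model a question about a handful of segment frames."""
    content = [{"type": "text", "text": question}]
    for jpeg in frame_jpegs:
        b64 = base64.b64encode(jpeg).decode()
        content.append(
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
        )
    response = client.chat.completions.create(
        model="video-llm",  # placeholder model identifier
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content
```

Because this analysis runs only on the handful of candidates returned by the index, the expensive model never has to process the full archive.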

What Attendees Will Learn:
- Architectural patterns for building scalable video analysis and retrieval systems.
- Practical techniques for multi-modal feature extraction and embedding generation from video.
- Strategies for utilizing vector databases effectively for complex video queries.
- Approaches to optimizing ML pipelines (model choice, inference acceleration) for large-scale deployments.
- Considerations for integrating pre-computed indexes with real-time AI analysis.
This talk is aimed at data scientists, ML engineers, and system architects who work with large-scale unstructured data, particularly video, and who are interested in practical solutions for search, retrieval, and analysis. We will focus on generalizable techniques and challenges, providing insights applicable across domains that require deep video understanding.

Irene Donato is a Data Scientist at Agile Lab with a PhD in Mathematics and a background in Physics. She specializes in AI strategy. With experience across academia and industry, Irene focuses on applying data science to solve complex business problems.