06-13, 11:50–12:30 (Europe/Berlin), Palais Atelier
The National Audiovisual Institute (INA) is a repository of all French audiovisual archives and has been archiving over 180 radio and television services, 24/7, since 1995. The metadata generated to describe this content currently amounts to over 50 million documents (images, audio and video fragments, text excerpts, and so on). Because the content is so heterogeneous, the data model is directly inspired by the conceptual models of cultural heritage and is represented as a large graph with complex relations between generic entities.
The challenge in building a global search engine for this use case is twofold: on the one hand, indexing the entire set of documents and keeping it up to date in a reasonable amount of time; on the other hand, implementing complex, high-performance full-text search capabilities.
Our talk describes the key choices in the graph representation that facilitate indexing the documents, as well as the technical framework set up around Elasticsearch, which implements the dedicated search APIs required by different functional areas.
We also briefly mention the implementation optimisations that make it possible to process all 50 million documents in less than 48 hours, for the equivalent of an 800 GB Elasticsearch index.
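To give a rough sense of scale, the figures quoted above can be turned into an average throughput and an average document size. The Python sketch below is only a back-of-envelope calculation: it assumes exactly 50 million documents, the full 48 hours, and decimal gigabytes, so the rate is a lower bound on what the pipeline actually sustains. The function names are illustrative and not part of the system described in the talk.

```python
# Back-of-envelope figures derived from the numbers quoted in the abstract.
DOCS = 50_000_000           # "50 million documents"
HOURS = 48                  # "less than 48 hours", taken as an upper bound
INDEX_BYTES = 800 * 10**9   # "800GB", assuming decimal gigabytes


def sustained_docs_per_second(docs: int, hours: float) -> float:
    """Average rate needed to process `docs` documents within `hours`."""
    return docs / (hours * 3600)


def avg_bytes_per_doc(index_bytes: int, docs: int) -> float:
    """Average index footprint per document."""
    return index_bytes / docs


rate = sustained_docs_per_second(DOCS, HOURS)
size = avg_bytes_per_doc(INDEX_BYTES, DOCS)
print(f"{rate:.0f} docs/s, {size / 1000:.0f} kB per document")
# prints "289 docs/s, 16 kB per document"
```

In other words, the full process needs to sustain at least a few hundred documents per second on average, with each document accounting for roughly 16 kB of index.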
Radu provides consulting services as a Solutions Architect at Adelean. He handles projects around Elasticsearch and Adelean’s A2 search technology, and oversees the integration and evolution of search engines within large e-commerce platforms and marketplaces. Prior to joining Adelean, Radu acquired solid experience in Web archiving, operating large-scale crawling systems in the context of several European research projects. He holds a PhD in Computer Science and an MSc in Distributed Systems.