Edoardo Tosca

Edoardo is an experienced software craftsman. He is obsessed with business problems and desperate to find the best technology to solve them. He is an open source enthusiast with a particular interest in search engines and machine learning.

Currently he is Head of Technology at Signal AI.


Entity Linking at scale with Lucene
Edoardo Tosca

Signal AI offers a sophisticated platform to support businesses in their decision making. Customers define searches across billions of documents using an extensive DSL that includes concepts such as entities and topics.
This metadata is extracted from over 5 million documents each day, using a mix of machine learning and text retrieval techniques, and is made available to end users within 30 seconds of ingestion.

Entity Linking is one of the core capabilities in the Signal AI data processing platform. It is a complex system that uses various strategies to achieve the highest quality while retaining excellent throughput characteristics.

Back in 2019, one of the existing components of the Entity Linking system was rapidly reaching its limits and could no longer scale.
To overcome this limitation, the team took an innovative approach and used Apache Lucene, with its inverted index and term vector capabilities, to identify rule-based entities.
Choosing a percolator model meant the team had to revisit the previous architecture, breaking it down into smaller components that follow the Single Responsibility Principle for microservices.
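
For readers unfamiliar with the percolator model, the idea can be sketched in a few lines: instead of indexing documents and running queries against them, the rules are stored and each incoming document is indexed on its own and matched against every rule. The sketch below is only an illustration under stated assumptions. It uses plain Python rather than Lucene, and the names `RULES`, `build_index`, `phrase_matches` and `percolate` are hypothetical, not part of Signal AI's system or of any Lucene API.

```python
from collections import defaultdict

# Hypothetical rule set: entity name -> phrases that identify it.
# Phrases are assumed to be non-empty.
RULES = {
    "Apache Lucene": ["apache lucene", "lucene"],
    "Signal AI": ["signal ai"],
}


def build_index(text):
    """Index a single document: term -> list of token positions."""
    index = defaultdict(list)
    for pos, token in enumerate(text.lower().split()):
        index[token.strip(".,;:!?")].append(pos)
    return index


def phrase_matches(index, phrase):
    """Return True if the phrase's terms occur at consecutive positions."""
    terms = phrase.lower().split()
    return any(
        all(start + i in index.get(term, []) for i, term in enumerate(terms))
        for start in index.get(terms[0], [])
    )


def percolate(rules, text):
    """Run every stored rule against one freshly indexed document."""
    index = build_index(text)
    return sorted(
        entity
        for entity, phrases in rules.items()
        if any(phrase_matches(index, phrase) for phrase in phrases)
    )


print(percolate(RULES, "Signal AI uses Apache Lucene."))
```

The positional lookup here stands in, very loosely, for what an inverted index with positions provides. A production system would also need to avoid evaluating every rule against every document, which this sketch does not attempt.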

This talk will take the audience through the evolution of this service, from its inception until today. It will provide details on the technical decisions and trade-offs that make this component one of the most resilient, fast and cost-effective solutions, capable of handling 20 times more rules at a fraction of the cost. It will also discuss how the same technology is used to reprocess the entire dataset every night in approximately 15 minutes.

Frannz Salon