Ilaria Petreti

Ilaria is an Information Retrieval/Machine Learning engineer at Sease. A strong believer in the power of Big Data and Digital Transformation, she holds a master's degree in Data Science.
She loves the application of data mining and machine learning methods to information retrieval problems. Currently, she is involved in Learning to Rank projects.


Word2Vec model to generate synonyms on the fly in Apache Lucene
Daniele Antuzi, Ilaria Petreti

If you want to expand your queries or documents with synonyms in Apache Lucene, you need a predefined file listing the terms that share the same meaning.
It's not always easy to find a list of basic synonyms for a language and, even if you find one, it doesn't necessarily match your domain.
The term "daemon" in the domain of operating system articles is not a synonym of "devil" but it's closer to the term "process".

Word2Vec is a two-layer neural network that takes a text corpus as input and outputs a vector representation for each word in its vocabulary.
Words with similar meanings are mapped to vectors that lie close to each other.
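
To make "close to each other" concrete, here is a minimal sketch of picking synonym candidates by cosine similarity. It is not the code of the Lucene contribution: the three-dimensional vectors are invented for illustration (real Word2Vec vectors are learned from a corpus and usually have a hundred or more dimensions), and the names `EMBEDDINGS`, `nearest_terms`, `min_similarity` and `max_results` are hypothetical.

```python
import math

# Hypothetical toy embeddings, hand-written for this example.
# A real model would learn these from domain text (e.g. operating system articles).
EMBEDDINGS = {
    "daemon":  [0.9, 0.1, 0.3],
    "process": [0.8, 0.2, 0.4],
    "service": [0.7, 0.3, 0.2],
    "devil":   [0.1, 0.9, 0.1],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_terms(term, embeddings, min_similarity=0.9, max_results=5):
    """Return (term, similarity) pairs above the threshold, most similar first."""
    query = embeddings[term]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != term
    ]
    # Keep only terms similar enough to be treated as synonym candidates.
    scored = [(other, score) for other, score in scored if score >= min_similarity]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:max_results]


synonyms = nearest_terms("daemon", EMBEDDINGS)
for candidate, score in synonyms:
    print(f"{candidate}: {score:.2f}")
```

With these toy vectors, "process" and "service" pass the threshold for "daemon" while "devil" does not, which mirrors the domain-specific behaviour described above. The threshold and the result limit are tuning knobs assumed for this sketch.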

This talk explores our contribution to Apache Lucene that integrates this technique with the text analysis pipeline.
We will show, with practical examples, how to generate synonyms on the fly from an Apache Lucene index and how to use this new feature with Apache Solr.

