Umesh Dangat is a Principal Engineer and Group Tech Lead for the market engineering platform at Yelp. Umesh joined Yelp in 2015 and has since architected and led the evolution of Yelp’s ranking infrastructure into its third generation. This group is responsible for providing search and ranking infrastructure for all of Yelp’s search and ranking needs in a cost-efficient, scalable, and extensible way.
Prior to Yelp, Umesh worked at various companies for over a decade, mostly solving search, streaming, and data ingestion problems for large datasets and building backend systems.
Umesh is also an open source contributor to popular search projects such as Elasticsearch, learning to rank, and most recently Nrtsearch.
Search and ranking are part of many important features on the Yelp platform, from looking for a plumber to showing relevant photos of the dish you search for. These varied use cases led to the creation of Yelp’s Elasticsearch-based ranking platform, which we presented at Berlin Buzzwords 2019. That platform enabled real-time indexing and learning-to-rank, lowered maintenance overhead, and gave more teams at Yelp access to search functionality. We recently built Nrtsearch, a Lucene-based search engine, to replace Elasticsearch, and have open sourced it under the Apache 2.0 license.
This talk will detail:
Challenges associated with scaling Elasticsearch costs and performance.
Mainly issues related to the document-based replication approach.
Difficulties with real-time autoscaling of Elasticsearch.
Inefficient usage of resources due to hot and cold node issues.
Architecture of Nrtsearch
Uses Lucene’s near-real-time (NRT) segment replication
Primary-replica architecture: the primary does all writing, including segment merges, while replicas simply copy over segments using Lucene's NRT APIs and serve search queries.
Cluster orchestration, availability, and management of nodes are left to systems like Kubernetes that excel at resource management and scheduling.
Truly stateless architecture: deployed as a standard microservice using Kubernetes. State is committed to S3; when a primary or replica restarts, it pulls down the most recent state from S3.
Benefits of this architecture
Performance increased by up to 50%
Cluster costs lowered by up to 50%
Use of standard tools (k8s) to manage operational aspects of the cluster, freeing ranking infrastructure teams to focus on search-related problems.
Challenges involved in rolling this out to production
Lucene’s segment replication approach, and the code itself, are not widely used in the industry, so they had some rough edges. Exciting performance bugs!
Future work
Enhance feature support via extensible plugins, such as vector embeddings.
Continue to simplify and open source deployment tooling to help others deploy Nrtsearch in their own cloud environments.
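To make the primary-replica and stateless ideas from the architecture outline more concrete, here is a minimal toy sketch in Python. It is not Nrtsearch or Lucene code: the class names (`RemoteStore`, `Primary`, `Replica`), the in-memory "segments", and the whitespace-based search are all assumptions made for illustration, and `RemoteStore` merely stands in for durable storage such as S3. The sketch shows the primary doing all indexing and merging, replicas only copying finished segments, and any restarted node rebuilding itself from the last committed state.

```python
from dataclasses import dataclass, field


@dataclass
class RemoteStore:
    """Hypothetical stand-in for durable object storage such as S3."""
    segments: dict = field(default_factory=dict)

    def commit(self, segments):
        # Persist a snapshot of the current immutable segments.
        self.segments = dict(segments)

    def restore(self):
        # Return a copy of the most recently committed snapshot.
        return dict(self.segments)


class Primary:
    """Does all writing: buffers documents, flushes segments, merges them."""

    def __init__(self, store):
        self.store = store
        self.segments = store.restore()  # stateless start: pull committed state
        self.buffer = []
        self.next_id = max(self.segments, default=-1) + 1

    def index(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        # Flush buffered documents into a new immutable segment.
        if self.buffer:
            self.segments[self.next_id] = tuple(self.buffer)
            self.next_id += 1
            self.buffer = []

    def merge(self):
        # Merge every segment into one; only the primary pays this cost.
        merged = tuple(d for sid in sorted(self.segments) for d in self.segments[sid])
        self.segments = {self.next_id: merged}
        self.next_id += 1

    def commit(self):
        self.store.commit(self.segments)


class Replica:
    """Never indexes: copies finished segments and serves searches."""

    def __init__(self, store):
        self.segments = store.restore()  # stateless start, same as the primary
        self.copied = 0

    def sync(self, primary):
        # Copy segments we lack, then drop segments the primary merged away.
        for sid, seg in primary.segments.items():
            if sid not in self.segments:
                self.segments[sid] = seg
                self.copied += 1
        for sid in list(self.segments):
            if sid not in primary.segments:
                del self.segments[sid]

    def search(self, term):
        # Toy matching: a document matches if it contains the exact word.
        docs = (d for seg in self.segments.values() for d in seg)
        return sorted(d for d in docs if term in d.split())


store = RemoteStore()
primary, replica = Primary(store), Replica(store)
primary.index("plumber near me")
primary.index("best ramen photos")
primary.refresh()
replica.sync(primary)   # the replica copies one segment and indexes nothing
primary.commit()        # state is now durable in the remote store
restarted = Replica(store)  # a fresh node recovers purely from committed state
```

In this toy model, adding a replica adds copying work but no indexing or merging work, which is the intuition behind segment-based replication as opposed to document-based replication, where every copy indexes every document. Because both node types start from `store.restore()`, a scheduler such as Kubernetes can replace a node without any local state having to survive.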