Architecting Solr indexing pipelines in Google Cloud Platform
06-13, 17:20–18:00 (Europe/Berlin), Maschinenhaus

The ubiquity of public cloud platforms has made it easy to offload the operational overhead of maintaining on-premise systems and to scale those systems on demand in a matter of minutes. But architecting secure, scalable systems in the public cloud comes with its own challenges, and these are compounded when you are migrating from an on-premise system. Such migrations often require infrastructure to operate in a hybrid state, where some parts of the system have been migrated to the cloud while the remaining components continue to run on-premise. We must also ensure that the migration is invisible to the user and that overall system availability is not affected during the transition. Box Search recently underwent such a migration for our Solr indexing pipeline and document store, which involved moving hundreds of terabytes of customer data from on-premise to GCP. In this talk we present the overall system architecture, the migration process, and some of the challenges we encountered when running this system in a hybrid state.

Shubhro Roy is a Staff Engineer and Tech Lead on the Search Team at Box. His team is responsible for powering search and discovery capabilities for Box, which involves running and maintaining a petabyte-scale search index on Solr. Prior to Box, he built query engines for the Database group at Oracle. He has been working on distributed systems and information retrieval for 10+ years since graduating from Carnegie Mellon with a Master's in Information Systems and Machine Learning.