Berlin Buzzwords 2022 :: pretalx

To see our schedule with full functionality, like timezone conversion and personal scheduling, please enable JavaScript and go here.

Sunday, June 12, 2022

Monday, June 13, 2022

Tuesday, June 14, 2022

14:30

14:30

30min

Barcamp registration

Palais Atelier

15:00

Barcamps are informal sessions, a kind of "un-conference", with a schedule decided on the day. It is all driven by the interests and expertise of those who attend so each one is different, but ours are always great!

Although the barcamp doesn't have a strict schedule, it won't be completely devoid of structure! #bbuzz barcamps are dynamic events, focused on the overall Berlin Buzzwords topics, tackling the same challenges but in a different format. At the barcamp each session runs for 30 minutes giving enough time to get into the meat of a topic, but without a chance of anyone getting bored. These are participatory sessions and more inclusive than regular conference talks, with everyone taking part. You can help by leading the session, by giving some insights, by asking some great questions, or maybe just with your enthusiasm.

The barcamp will be coordinated and moderated by Nick Burch.

Registration starts from 2:30pm

09:00

09:00

45min

Registration starts

Kesselhaus

09:00

45min

Registration starts

Palais Atelier

09:00

45min

Registration starts

Maschinenhaus

09:00

45min

Registration starts

Frannz Salon

09:45

09:45

10min

Welcome

Kesselhaus

10:00

Meet the people fighting surveillance capitalism

What does it mean for democracy when we live in a world where hyper personalised misinformation and bot armies manipulate public opinion? This propaganda is fueled by social media companies, who's business depends on growing their user base, increasing engagement and improving targeting. Just getting visibility on what users are being shown is challenging, even with current EU regulations. As is often the case, users in the global south are most vulnerable, without robust regulation and with fewer moderators per user for many languages.

As technologists we are well positioned to understand this threat. How might we leverage this to create positive change? By exploring examples of people who blew whistles, enabled regulation, or taught others how to stay safe online, we can take back hope and get inspired to fight back against surveillance capitalism.

Get your ticket now!

Register for Berlin Buzzwords in our ticket shop! We also have online tickets and reduced tickets for students available and you can find more information about our Diversity Ticket Initiative here!

11:00

Apache Kafka simply explained

You’re curious about what Apache Kafka does and how it works, but between the terminology and explanations that seem to start at a complex level, it's been difficult to embark. This session is different. We'll talk about what Kafka is, what it does and how it works in simple terms with easy to understand and funny examples that you can share later at a dinner table with your family.

This session is for curious minds, who might have never worked with distributed streaming systems before, or are beginners to event streaming applications.

But let simplicity not deceive you - by the end of the session you’ll be equipped to create your own Apache Kafka event stream!

Luxuries, necessities, and the challenges that remain: some experiences with accelerated data science

Sophie Watson, William Benton

The promise of accelerated computing presents an interesting paradox: while no one complains when new compute infrastructure is dramatically faster than its predecessor, few people realize how much they’d benefit from acceleration until they have it. It is perhaps unsurprising that a data scientist’s daily work consists of tasks that they can accomplish with their available computing resources, but simply running our existing work faster makes acceleration into a mere luxury. For accelerated computing to fulfill its promise, we need it to transform our work by enabling us to do new things that wouldn’t have been feasible without it. In this talk, we’ll discuss our experiences accelerating data science with specialized hardware and by scaling out on clusters. We’ll present examples of previously-impossible techniques becoming feasible, of the pleasant luxury of improved performance, and of the data science tasks that aren’t likely to justify additional hardware or implementation effort. You’ll leave this talk with a better understanding of how accelerated and scale-out computing can fit into your data science practice, a catalog of techniques that are still well served by standard hardware, and some actionable advice for how to take advantage of parallel and distributed computing across your workflow.

The future of Lucene's MMapDirectory: Why use it and what's coming with Java 19 and later?

Since version 3 of Apache Lucene and Solr and from the early beginning of Elasticsearch, the general recommendation was to use MMapDirectory as the implementation for index access on disk. But why is this so important?

This talk will first introduce the user about the technical details of memory mapping and why using other techniques slows down index access by a significant amount. Of course we no longer need to talk about 32/64bit Java VMs - everybody uses now 64 bits with Elasticsearch and Solr, but with current Java versions, Lucene still has some 32bit-like limitations on accessing the on-disk index with memory mapping. We will discuss those limitations especially with growing index size up to terabytes, and afterwards, Uwe will give an introduction to the new Java Foreign Memory Access API (JEP 370, JEP 383, JEP 393, JEP 412, JEP 419), that first appeared with Java 14, but still incubating.

This talk will give an overview of the the foreign memory API to be finalized and released to general availability in Java 19 and will present the current state of implementation in Lucene 10. Uwe will show how future versions of Lucene will be backed by next generation memory mapping and what needs to be done to make this usable in Solr and Elasticsearch - bringing you memory mapping for indexes with tens or maybe hundreds of Terabytes in the future!

The Search track is presented by OpenSource Connections

11:50

Cross-Platform Data Lineage with OpenLineage

There are more data tools available than ever before, and it's easier to build a pipeline than it's ever been. This has resulted in an explosion of innovation, but it also means that data within today's organizations has become increasingly distributed. It can't be contained within a single brain, a single team, or a single platform.

Data lineage can help by tracing the relationships between datasets and providing a map of your entire data universe. OpenLineage provides a standard for lineage collection that spans multiple platforms, including Apache Airflow, Apache Spark, Flink, and dbt. This empowers teams to diagnose and address widespread data quality and efficiency issues in real time.

In this session, Julien Le Dem from Datakin will show how to trace data lineage across Apache Spark and Apache Airflow. He will walk through the OpenLineage architecture and provide a live demo of a running pipeline with real-time data lineage.

Live build: How to harness streaming data in real time to track, transform and build on heart rate data

Tomáš Neubauer, Javier Blanco Cordero

This case study offers an entertaining way to learn about the possibilities of stream processing, which can be applied to projects in fields that require easy access to current information, such as finance, mobility and energy. We’ll use the Quix platform to set up a series of open source data sets and code samples that collect, transform and deliver data under a machine learning model that learns to handle real-time heart rate data. We’ll show how to include complex transformations to the data, such as how to calculate calories burned with Python.

Scaling an online search engine to thousands of physical stores

An online e-commerce search engine is easy to put in place. Scaling it to serve millions of users, adding a marketplace to provide thousands of products, supporting multiple offers, prices and stocks on the same product are additional challenges more difficult to address. And what if, in addition, you mix your online search engine with the activity of thousands of physical stores?

In this talk we explain how we addressed all these challenges in the context of the largest retail group and online grocery store in France. The constraint of multiple physical stores backed by the online search engine introduces additional challenges that we emphasize and address in detail. Our point of view, as we explain the challenges and solutions, is both technical and functional.

The Search track is presented by OpenSource Connections

Searching through large graphs using Elasticsearch

The National Audiovisual Institute (INA) is a repository of all French audiovisual archives, being responsible for archiving over 180 radio and television services, 24/7, since 1995. The generated metadata describing this content currently represents the equivalent of over 50 million documents (e.g.: images, audio and video fragments, text excerpts, etc.). Due to the heterogeneity of the content, the data model is directly inspired from the conceptual models of cultural heritage, represented by a large graph with complex relations between generic entities.

The challenge for building a global search engine for this particular use case is twofold: on one hand, the capacity to index and maintain the entire set of documents updated in a reasonable amount of time, and on the other hand the implementation of complex full text search capabilities with high performance.

Our talk describes the key choices for the graph representation, facilitating the indexing process of the documents, as well as the technical framework set up around Elasticsearch, implementing dedicated search APIs required by different functional areas.

We also briefly mention the implementation optimisations that lead to a full process of 50 million documents in less than 48 hours, for an equivalent of 800GB Elasticsearch index.

The Search track is presented by OpenSource Connections

12:40

Offline Ranking Validation - Predicting A/B Test Results

Andrea Schuett, Yunus Lutz

Implementing a machine learning model for ranking in an ecommerce search requires a well-designed approach to how the target metric is defined. In our team we validate our target metrics with online tests on live traffic. This requires both long preparation times and long enough runtimes to yield valid results. Having to choose only a few candidates for the next A/B test is hard and slows us down significantly. So what if we had a way to evaluate the candidates beforehand to make a more informed decision?

We came up with an approach to predict how a certain ranking will perform in an onsite test. We leverage historic user interaction data from search events and try to correlate them with ranking metrics like NDCG. This gives us insights on how well the ranking meets the user intent. This is not meant to be a replacement for a real A/B test, but allows us to narrow down the field of candidates to a manageable number. In this talk we will share our approach to offline ranking validation and how it performed in practice.

The Search track is presented by OpenSource Connections

Why a Search Engine Makes a Great Log Analytics Solution

Search Engine technologies, like OpenSearch, have continued to grow in popularity for a number of different use cases. Features like full-text search, fast ingestion, scalability, faceting, and extensible plugin frameworks were often enhanced with the aim to improve the search use case. However, the side effect of these improvements provided much of the foundation that led people to adopting these technologies for other uses like click stream analytics, log analytics, security analytics, and more.

In this talk we will explore how features that started as search enhancements opened the door for new use cases and why we continue to see affinity between search engines and broader analytics workloads.

The Search track is presented by OpenSource Connections

13:00

13:00

60min

Lunch Break

Kesselhaus

13:00

60min

Lunch Break

Palais Atelier

13:00

60min

Lunch Break

Maschinenhaus

13:00

60min

Lunch Break

Frannz Salon

14:00

AI-powered Semantic Search; A story of broken promises?

Jo Kristian Bergum

Semantic search using AI-powered vector embeddings of text, where relevancy is measured using a vector similarity function, has been a hot topic for the last few years. As a result, platforms and solutions for vector search have been springing up like mushrooms. Even traditional search engines like Elasticsearch and Apache Solr ride the semantic vector search wave and now support fast but approximative vector search, a building block for supporting AI-powered semantic search at scale.

Undoublty, sizeable pre-trained language models like BERT have revolutionized the state-of-the-art on data-rich text search relevancy datasets. However, the question search practitioners are asking themself is, do these models deliver on their promise of an improved search experience when applied to their domain? Furthermore, is semantic search the silver bullet which outcompetes traditional keyword-based search across many search use cases? This talk delves into these questions and demonstrates how these semantic models can dramatically fail to deliver their promise when used on unseen data in new domains.

The Search track is presented by OpenSource Connections

Benefits of MQTT for IoT Messaging and Beyond

IoT applications run on IoT devices and can be created to be specific to almost every industry and vertical, from small devices to large ones, including healthcare, industrial automation, smart homes and buildings, automotive, and wearable technology. The possibilities are limitless. Increasingly, IoT applications are using AI and machine learning to add intelligence to devices. Among all of the variables in the IoT ecosystem, one common theme is the need to be able to handle the constrained operating environment, such as unreliable network connectivity, limited bandwidth, low battery power, and so on. We will take a look into the MQTT protocol, how it has evolved from its early days which was intended for the connection of oil pipelines via satellite, to now the ever-increasing demand in IoT and M2M applications, to how this protocol will evolve to meet the modern needs especially in the current cloud computing era. We will study a few outstanding MQTT libraries that are available in the market, such as the Java-based HiveMQ, and open source libraries such as Eclipse Mosquitto.

Kafka Monitoring: What Matters!

Due to Apache Kafka's widespread integration into enterprise-level infrastructures, monitoring Kafka performance at scale has become an increasingly important task. It can be challenging to understand what is happening in Kafka - both at the application level and lag performance, to successfully root cause/troubleshoot problems. To perform effective diagnosis, meaningful insights and visibility throughout all levels of the cluster are a must.

This talk will take a dive into what metrics or indicators matter most while running Kafka at Scale focusing on Lag performance. How to interpret and correlate these indicators, build dashboards and configure meaningful alerts to identify a probable issue to take place. This talk concludes with the idea of doing trend analysis to detect anomalies for long-running Kafka pipelines.

What we learned from reading 100+ Kubernetes Post-Mortems

When building our Kubernetes-native product, we wanted to find the most common sources of failures, anti-patterns and root causes for Kubernetes outages, so we got to work. We rolled up our sleeves and read 100+ Kubernetes post-mortems. This is what we discovered.

A smart person learns from their own mistakes, but a truly wise person learns from the mistakes of others.

When launching our product, we wanted to learn as much as possible about typical pains in our ecosystem, and did so by reviewing many post-mortems (100+!) to discover the recurring patterns, anti-patterns, and root causes of typical outages in Kubernetes-based systems.

In this talk we have aggregated for you the insights we gathered, and in particular will review the most obvious DON’Ts and some less obvious ones, that may help you prevent your next production outage by learning from others’ real world (horror) stories.

14:50

Help! I Need To UnSQLize My Application

More and more people are moving from old-school relational databases to a variant of NoSQL. If starting a green-field project with a document database is easy, it can be a different story when migrating from one to the other. Simply porting SQL tables to a collection might cause you more harm than good. In this talk, the attendees will learn about the basic concepts of document databases, such as documents and collections. They will then learn about some of the standard data schemas available. Finally, the speaker will show real-life examples of data migration and how they can be applied to adopt a new NoSQL database.

Neural Search - Let's talk about quality

Maximilian Werk, Florian Hoenicke

Context

In the past year the interest in Neural Search and vector search engines increased heavily. They promise to solve multi modal, cross modal and semantic search problems with ease. However, when quickly trying Neural search with off-the-shelf pre-trained models the results are quite disillusioning. They lacking knowledge about the data at hand. In order to explicitly solve model finetuning for search problems we implemented an open-source finetuner. It is directly usable with several vector databases due to the underlying data structure.

Presentation

In our talk we present our methodology and performance on an example dataset. Afterwards, we show how well the approach transfers to other datasets, such as deepfashion, geolocation geoguessr and more. It will give hands-on guidance on how you can finetune a model in order to make your data better searchable.

The Search track is presented by OpenSource Connections

Reproducible and shareable notebooks across a data science team

Mike Tapi Nzali, Pascal Godbillot

At CybelAngel we scan the internet looking for sensitive data leaks belonging to our clients.
As the volume of alerts could count billions of samples, we use machine learning to throw away as much noise as possible to reduce the analysts' workload.

We are a growing team of data scientists and a machine learning engineer, planning to double in size. Each of us contributes to projects and we use Notebooks before code industrialisation. As for many other data science teams, a lot of effort and valuable work is encapsulated in a format that is tricky to share, hardly reproducible and simply not built for production purposes. During the talk, we will present what we did to overcome some of these issues and our feedback about notebook versioning and implementation in Google Cloud Platform using open JupyterHub and Jupytext.

This talk is addressed to a technical audience but all roles gravitating around a data team are welcome to grasp the challenges of the interaction of data science within the organisation.

Using Solr unconventionally to serve 26bn+ documents

Richard Goodman

Learn how the Data Infrastructure team at Brandwatch rearchitected a group of their current Solr clusters and took a new approach in an unconventional manner. By splitting up the reads and writes, experimenting with Solr plugins, using S3, an application written in Rust and adopting the Solr Operator to spin up a cluster on Kubernetes, we were able to achieve our goal of having a cloud-based cluster which comfortably serves 26bn+ documents.

You'll understand the whys of our approach, things we discovered, what we have planned, and why rearchitecting things can be a difficult and strenuous task.

The Search track is presented by OpenSource Connections

15:30

15:30

30min

Coffee Break

Kesselhaus

15:30

30min

Coffee Break

Palais Atelier

15:30

30min

Coffee Break

Maschinenhaus

15:30

30min

Coffee Break

Frannz Salon

16:00

Changelog Stream Processing with Apache Flink

We all know that the world is constantly changing. Data is continuously produced and thus should be consumed in a similar fashion by enterprise systems. Message queues and logs such as Apache Kafka can be found in almost every architecture, while databases and other batch systems still provide the foundation. Change Data Capture (CDC) has become popular to capture committed changes from a database and propagate those changes to downstream consumers.

In this talk, we will introduce Apache Flink as a general data processor for various kind of use cases on both finite and infinite streams. We demonstrate Flink's SQL engine as a changelog processor that is shipped with an ecosystem tailored to process CDC data and maintain materialized views. We will use Kafka as an upsert log, Debezium for connecting to databases, and enrich streams of various sources using different kinds of joins.

Finally, we illustrate how to combine Flink's Table API with DataStream API for event-driven applications beyond SQL.

Neural Search Comes to Apache Solr: Approximate Nearest Neighbor, BERT and More (Buzzwords)!

Alessandro Benedetti

The first integrations of machine learning techniques with search allowed to improve the ranking of your search results (Learning To Rank) - but one limitation has always been that documents had to contain the keywords that the user typed in the search box in order to be retrieved.
For example, the query “tiger” won’t retrieve documents containing only the terms “panthera tigris”.
This is called the vocabulary mismatch problem and over the years it has been mitigated through query and document expansion approaches.
Neural search is an Artificial Intelligence technique that allows a search engine to reach those documents that are semantically similar to the user’s query without necessarily containing those terms; it avoids the need for long lists of synonyms by automatically learning the similarity of terms and sentences in your collection through the utilisation of deep neural networks and numerical vector representation.
This talk explores the first Apache Solr official contribution about this topic, available from Apache Solr 9.0.
During the talk we will give an overview of neural search (Don’t worry - we will keep it simple!): we will describe vector representations for queries and documents, and how Approximate K-Nearest Neighbor (KNN) vector search works.
We will show how neural search can be used along with deep learning techniques (e.g, BERT) or directly on vector data, and how we implemented this feature in Apache Solr, giving usage examples!
Join us as we explore this new exciting Apache Solr feature and learn how you can leverage it to improve your search experience!

The Search track is presented by OpenSource Connections

The life of a search engine administrator

Vincent Bréhin, Lucian Precup

Defining the KPIs, keeping an eye on the customer satisfaction and sales, defining the backlog, configuring the search engine, debugging relevance issues, preventing regressions … These are a few tasks on the list of a search engine administrator. A search engine is a living thing. Seasonality, levels of stocks, lifecycle of the products, marketing events, news, etc. are a few of the many factors that force the search engine to constantly evolve. In this context, the life of a search engine manager is tough. In this talk we describe the processes and tools that we put in place and help manage a search engine. We also address the limits between what can be automated and what still needs human supervision.

The Search track is presented by OpenSource Connections

The perils of building a democratic data platform

Andre Jasiskis, Joaquim Torres

It is clearly beneficial for an organization to make data-driven decisions,
decentralize access to data processing and empower every team to generate valuable information.

There are many ways to achieve these goals, but in an environment of rapid growth, building an accessible Data Platform is just the first step. What happens next determines its long-term success or its dramatic demise.

In this presentation, we discuss the main perils of building a platform that
processes over 80000 unique datasets built by 1000 people across different
teams, how to avoid them, and where to go from there.

16:50

A smooth ride: Online car buying and selling at mobile.de

Mobile.de is Germany's largest online vehicle marketplace. Under the hood, there are more data products and machine learning solutions than one could imagine when thinking about an online classifieds platform. In this talk, we will present the main decision-making checkpoints in the car buying and selling scenarios, and which are mobile.de data products support users in their journey. Our talk will present an overview of all data topics and provide a deeper look on a few of them.

This talk is sponsored by mobile.de

Change data capture with Debezium…and without

Petros Angelatos

"Change Data Capture (CDC) has become a mundane commodity, much in part due to the ever-rising success of Debezium. But what happens when you want to keep track of changes in your upstream database without having a message broker in your stack? In this talk, we’ll walk through how we built a direct Postgres CDC connector at Materialize to provide an alternative to our CDC support through Kafka+Debezium."

Logging Apache Spark - How we made it easy

Are you familiar with the following Scenario?

You're running your Apache Spark app on EMR, and the log file gets pretty heavy. You try and open it through the AWS UI, or download it straight to your computer. You end up connecting to the server running your driver or any of your executors, relentlessly searching your logs while simultaneously looking at Ganglia and the Spark UI for additional logs and metrics.

If you are, this talk is exactly for you.

Let me tell you how made it all easy with just some bootstrap actions, some bash scripts, Beats and Elastic. Customizable per app logging, with less searching of big log files and more looking into useful Kibana dashboards. This architecture is not nice to have, it's essential.

The Search track is presented by OpenSource Connections

The Race to the Bottom - Low Latency in the age of the Transformer

So you want to deploy a large language model, and keep your latency SLA? NLP adds enormous value to customers, but getting it to work efficiently is fraught with uncertainty and high cost. As transformers and other big neural network architectures make their way into your platform, you may be finding it difficult to get the speed and throughput you need within your budget, or even understand why it is so expensive.

This talk will give an overview of the latency and throughput challenges, and how to solve them. We will give an overview in the product and cost implications as well as the technical improvements that can be used to get things running fast. We will compare solutions and help make sense of difficult to understand technology.

The audience will walk away with the information they need to decide on the best direction for inference in their production platform.

Keywords: MLOps, Inference, Latency

17:20

Architecting Solr indexing pipelines in Google Cloud Platform

Shubhro Jyoti Roy

The ubiquity of public cloud platforms has made it easy to offload operational overhead of maintaining on-premise systems and leverage the ability to scale these systems on-demand in a matter of minutes. But architecting a secure scalable systems in the public cloud comes with its own challenges. This problem is further complicated when you are migrating from an on-premise system. Such migrations often require infrastructure to operate in a hybrid state where some parts of the system have been migrated to the cloud while remaining components continue to run on-premise. We must also ensure that the migration is invisible to the user and there is no impact to overall availability of the system during this transition. Recently Box Search underwent such a migration for our Solr indexing pipeline and document store which involved migrating hundreds of terabytes of customer data from on-premise to GCP. In this talk we present the overall system architecture, the migration process and some of the challenges we encountered when running this system in a hybrid state.

The Search track is presented by OpenSource Connections

Compress giant language models to effective and resource-saving models using knowledge distillation

Language models have drawn a lot of attention in NLP in recent years. Despite their short history of development, they have been employed and delivered astonishing performances in all sorts of NLP tasks, such as translation, question answering, information extraction and intelligent search.

However, we should not forget that giant language models are not only data hungry, but also energy hungry. State-of-the-art language models such as BERT, RoBERTa and XLNet process millions of parameters, which is only possible with the help of dozens of sophisticated and expensive chips. The CO2 generated in the process is also massive. Being responsible for such high energy consumption is not easy in times of climate change.

In order for companies to benefit from the performance of state-of-the-art language models without putting too much strain on their computing costs, the models used must be reduced to a minimum. Of course, performance should not suffer as a result. One possible means to achieve this is the so-called knowledge distillation, which is one common technique among model compression methods. In this presentation, we will show you how you can use knowledge distillation to generate models that achieve comparable performances as state-of-the-art language models effectively, and in a resource-saving manner.

The Search track is presented by OpenSource Connections

Entity Linking at scale with Lucene

Signal AI offers a sophisticated platform to support businesses in their decision making. Customers define searches across billions of documents by using an extensive DSL that includes concepts like entities and topics amongst them.
This metadata is being extracted from over 5 million documents each day and is made available to the end users within 30 seconds from its ingestion via a mix of machine learning and text retrieval techniques.

Entity Linking is one of the core capabilities in the Signal AI data processing platform. It is a complex system that uses various strategies to achieve the highest quality while retaining excellent throughput characteristics.

Back in 2019, one of the existing components of the Entity Linking system was rapidly reaching its limits and could not scale anymore.
To overcome the limitation, the team took an innovative approach and used Apache Lucene with its inverted index and term vectors capabilities to enable the identification of rule-based entities.
By choosing a percolator model the team had to revisit the previous architecture, breaking it down into smaller components that follow the Single Responsibility Principle for microservices.

This talk will take the audience through the evolution of this service, from its inception until today. It will provide details around the technical decisions and trade-offs that make this component one of the most resilient, fast and cost effective solutions, capable of handling 20 times more the number of rules at a fraction of the cost. It will also discuss how the same technology is used to reprocess the entire dataset every night in approximately 15 minutes.

Hybrid search > sum of its parts?

Lester Solbakken

Over the decades, information retrieval has been dominated by classical methods such as BM25. These lexical models are simple and effective yet vulnerable to vocabulary mismatch. With the introduction of pre-trained language models such as BERT and its relatives, deep retrieval models have achieved superior performance with their strong ability to capture semantic relationships. The downside is that training these deep models is computationally expensive, and suitable datasets are not always available for fine-tuning toward the target domain.

While deep retrieval models work best on domains close to what they have been trained on, lexical models are comparatively robust across datasets and domains. This suggests that lexical and deep models can complement each other, retrieving different sets of relevant results. But how can these results effectively be combined? And can we learn something from language models to learn new indexing methods? This talk will delve into both these approaches and exemplify when they work well and not so well. We will take a closer look at different strategies to combine them to get the best of both, even in zero-shot cases where we don't have enough data to fine-tune the deep model.

The Search track is presented by OpenSource Connections

18:00

18:00

60min

Get together @ Palais

Kesselhaus

18:00

60min

Get together @ Palais

Palais Atelier

18:00

60min

Get together @ Palais

Maschinenhaus

18:00

60min

Get together @ Palais

Frannz Salon

09:00

09:00

45min

Registration starts

Kesselhaus

09:00

45min

Registration starts

Palais Atelier

09:00

45min

Registration starts

Maschinenhaus

09:00

45min

Registration starts

Frannz Salon

10:10

Matscholar: The search engine for materials science researchers

Matscholar (Matscholar.com) is a scientific knowledge search engine for materials science researchers. We have indexed information about materials, their properties, and the applications they are used in for millions of materials by text mining the abstracts of more than 5 million materials science research papers. Using a combination of traditional and AI-based search technologies, our system extracts the key pieces of information and makes it possible for researchers to do queries that were previously impossible. Matscholar, which utilizes Vespa.ai and our our own bespoke language models, greatly accelerates the speed at which energy and climate tech researchers can make breakthroughs and can even help them discover insights about materials and their properties that have gone unnoticed.

The Search track is presented by OpenSource Connections

Min and Max Aggregations with Updates in Real Time.

As part of our analytics platform we handle real time ingestion and aggregations by performing aggregations such as count and sum based on roll ups such as day, hour and minute using Kafka streams app. We recently added support for Min and Max measures along with existing sum and count while performing aggregations on incoming Kafka records. The interesting part is that we also support all these aggregations on Updated records. This talk aims at exploring the interesting details that went into adding the Min and Max functionality to our Kafka streams app while performing real-time aggregations with Updates.

Understanding Vespa with a Lucene mindset

Vespa is no more a 'new kid on the block' in the domain of search and big data. Everyone is wooed over reading about its capabilities in search, recommendation, and machine-learned aspects augmenting search especially for large data-sets. With so many great features to offer and so less documentation to how to get started on Vespa , we want to take an opportunity to introduce it to the lucene based search users.
We will cover about Vespa architecture , getting started , leveraging advance features , important aspects all in the analogies easier for someone with a fresh or lucene based search engines mindset.

The Search track is presented by OpenSource Connections

Working in the Open...Search

Charlotte Henkle, Sean Neumann

In July of 2021, AWS launched the OpenSearch Project, an Apache 2.0 licensed fork derived from Elasticsearch 7.10.2 & Kibana 7.10.2. The OpenSearch Project is a community-driven, open source search and analytics suite. It consists of a search engine daemon, OpenSearch, and a visualization and user interface, OpenSearch Dashboards. OpenSearch enables people to ingest, secure, search, aggregate, view, and analyze data. Our goal is to build great software together with a strong and vibrant community. In this talk we’ll cover what we’ve launched so far, what’s coming in the future, and the challenges of stewarding an open-source project while also being associated with a large corporation.

This talk is sponsored by OpenSearch

10:40

Effective CI/CD for Large Systems

CI/CD brings tremendous value to development teams. The rapid availability of feedback helps developers make informed decisions about their design choices and lets teams deploy with confidence. But when systems become large and test times go from seconds to hours, how do we get our groove back? In this talk, we’ll explore strategies for validating large, complex systems, such as:

Setting well-defined component boundaries
Flexibly modeling dependencies between these components
Ranking tests by cost versus value
Testing in production with canary launches and feature flags

These and similar techniques let us minimize test times, maximize confidence, and free our teams up to focus on delivering value to customers.

Goodbye Tracking, Hello Privacy: The Technology & Architecture behind Ethical Search & Discovery

Nina Müller, Lara Menéndez García

Search is a vital part of the online experience and for many brands a key way to interact with their customers. Yet search results are too often derived from data collected by trackers and analytics, tools that disrespect human rights and GDPR or CCPA regulations. In this talk, we'll outline the negative impact of tracking while exploring alternative solutions that actively protect privacy without detracting from the search experience.

Key takeaways:

Learn the key principles of a privacy-first platform architecture
Explore high demand performance stability in a data protected environment
Liberty of liability: look, but don’t touch personal data

This talk is sponsored by Empathy.co

Should we stop using distance in our location-based data recommendation models?

Location is an important decision-making factor for many end users. Hotel aggregators, job search portals, property listing companies all filter out results that are too far away. If the results page shows locations that are hard to reach, conversion rates will plummet.

If you’re quality scoring results based on straight-line distance, you’re not personalising your results page as well as you could be. That’s because we never truly travel in a straight line, instead we’re at the mercy of the transport networks around us. Distance never considers the context of accessibility, which is unique to every location around the world.

Using distance is impacting search result ranking because:
1. It doesn’t acknowledge that long distances in quiet rural areas are easier to travel vs. congested urban areas
2. It ignores that some locations are situated on fast transport routes – they could appear far away but they may be really easy to access depending on the local infrastructure
3. Local geography can massively impact accessibility – mountains, rivers and beaches all provide accessibility challenges

The solution:
Using real world examples I’ll discuss how to integrate travel times into your recommendation model and what the effects are for businesses and end users. I’ll also discuss how the presence of transport data on search result listings helps reduce cognitive load when users are making a decision.

I’ll end with a quick demo showing how to build it into your recommendation engine.

The Search track is presented by OpenSource Connections

11:30

Build Real-time Analytic Applications: The Easy Way.

Sergio Ferragut

Apache Druid is the open source analytics database that enables development of modern data-intensive applications of any size. It provides sub-second response times on streaming and historical data and can scale to deliver real-time analytics with data ingestion at any data flow rate – with lightning fast queries at any concurrency.

Sounds great, right? But any large distributed system can be difficult and time-consuming to deploy and monitor. Deployment requirements change significantly from use case to use case, from dev/test clusters on the laptop to hundreds of nodes in the cloud. Kubernetes has become the de-facto standard for making these complicated systems be much easier to deploy and operate.

In this talk you will learn about Druid's microservice architecture and the benefits of deploying it on Kubernetes. We will walk you through the open source project's Helm Chart design and how it can be used to deploy and manage clusters of any size with ease.

Don't Panic: Getting Your Infrastructure Drift Under Control

In your ever-changing Infrastructure, some changes are intentional while others are not.

Infrastructure Drift can happen for many reasons, sometimes it happens when adding or removing resources, other times when changing resource definitions upon resource termination or failure, and even when changes have been made manually or via other automation tools.

When something is changed intentionally, it will appear in the source code, and should not raise any alarm. However, if any part of the infrastructure has been changed manually, there are tools that can identify this, and alert to the change. In other words, if your IaC drifted from its expected state, then you can in fact, detect it.

Applying simple solutions can empower DevOps and developer velocity, with the reassurance and context for unexpected changes in your IaC, in near real-time. This talk will showcase real-world examples, and practical ways to apply this in your production environments, while doing so safely and at the pace of your engineering cycles.

Drift is what happens whenever the real-world state of your infrastructure differs from the state defined in your configuration.

Scaling your Kafka pipeline can be a pain - but it doesn’t have to be!!

Opher Dubrovsky, Ido Nadler

Kafka data pipeline maintenance can be painful.
It usually comes with complicated and lengthy recovery processes, scaling difficulties, traffic ‘moodiness’, and latency issues after downtimes and outages.

It doesn’t have to be that way!

We’ll examine one of our multi-petabyte scale Kafka pipelines, and go over some of the pitfalls we’ve encountered. We’ll offer solutions that alleviate those problems, and go over comparisons between the before and after . We’ll then explain why some common sense solutions do not work well and offer an improved, scalable and resilient way of processing your stream.

We’ll cover:
- Costs of processing in stream compared to in batch
- Scaling out for bursts and reprocessing
- Making the tradeoff between wait times and costs
- Recovering from outages
- And much more…

Solving the knapsack problem with recursive queries and PostgreSQL

Francesco Tisiot

Optimization problems are everywhere, from deciding which clothes to pack in our luggage (aka the knapsack problem), to selecting the tasks that will be worked during a sprint. Trying to solve these type of problems by hand is a tedious task often resulting in sub-optimal decisions.

In this talk, we'll understand how PostgreSQL recursive queries can help. Starting from the proper problem definition, we'll then explore how to build queries that call themselves recursively, what are the risks associated with this approach and safeguards we can set to optimise performances. Finally we'll demonstrate how two new features released in PostgreSQL 14 enable an easier handling of the recursive statements.

If you're into PostgreSQL and eager to understand how recursion works, this session is for you!

12:20

Open Science: Building Models Like We Build Open-Source Software

Steven Kolawole

Elevator pitch
The use of transfer learning has begun a golden era in applications of ML but the development of these models “democratically” is still in the dark ages compared to best practices in SWE. I describe how methods of open-source SWE can allow models to be built by a distributed community of researchers.

Over the past few years, it has become increasingly common to use transfer learning when tackling machine learning problems (e.g. the BERT model on HuggingFace Hub has been downloaded tens of millions of times). However, pre-training often involves training a large model on a large amount of data. This incurs substantial computational (and therefore financial) costs; for example, Lambda estimates that training the GPT-3 language model would cost around $4.6 million. As a result, the most popular pre-trained models are being created by small teams within large, resource-rich corporations. This means that the majority of the research community is excluded from participating in the design and creation of these valuable resources.

Here, I elaborate on why we should develop tools that will allow us to build pre-trained models in the same way that we build open-source software. Specifically, models should be developed by a large community of stakeholders who continually update and improve them. Realizing this goal will require porting many ideas from open-source software development to building and training models, which motivates many threads of interesting research and opens up machine learning research for much larger participation.

Optimizing Containers for Security and Scaling

This talk is about creating minimal containers. The author has started to dive into Kubernetes and Container Security some years ago. Minimizing the size and the attack vectors are just two sides of the same coin. As a reward, you get much faster deployment pipelines, enabling more automated testing and higher scalability. A speed up by a factor of 10 or 20 is not unusual, sometimes the size of a cointainer shrinks by a factor of 100.

12factor IX: disposability
bad examples
optimizing the size of a container
building minimal containers from scratch
a small step in a Dockerfile, a big leap for container size
debugging minimal containers
speed up
security measured by Trivy

Relevance is not a Thing but a Perception

Ana Maria García Sánchez

When talking about relevance regarding search, it often sounds like it is a thing, something that can be touched and seen. Nevertheless, that is not the case. What do I mean by that? In this talk, I will provide some examples of how relevance is often merely seen as a score when it can be, in fact, an engaging relationship where the user and the search UI connect in aesthetic and enjoyable ways. I will present numerous examples of innovative search experiences that challenge prevailing schemas and structures and lead instead to elements of motion and correlated visual action that allows us to perceive the beauty of relevancy on a different level. Because relevance is a matter of perception

The Search track is presented by OpenSource Connections

13:00

13:00

60min

Lunch Break

Kesselhaus

13:00

60min

Lunch Break

Palais Atelier

13:00

60min

Lunch Break

Maschinenhaus

13:00

60min

Lunch Break

Frannz Salon

14:00

Autoscaling Elasticsearch for Logs on Kubernetes

Radu Gheorghe, Ciprian Hacman

Elasticsearch (or OpenSearch) clusters likely need to scale to adapt to changes in load. But autoscaling Elasticsearch isn't trivial: indices and shards need to be well sized and well balanced across nodes. Otherwise the cluster will have hotspots and scaling it further will be less and less efficient.

This talk focuses on two aspects:
- best practices around scaling Elasticsearch for logs and other time-series data
- how to apply them when deploying Elasticsearch on Kubernetes. In the process, a new (open-source) operator will be introduced (yes, there will be a demo!). This operator will autoscale Elasticsearch while keeping a good balance of load. It does so by changing the number of shards in the index template and rotating indices when the number of nodes changes.

The Search track is presented by OpenSource Connections

Do It Yourself: Programmable Metrics using OpenTelemetry

Ricardo Ferreira

Using metrics to measure how good or bad things are going is a proven way to ensure a software-based system is going in the right direction. Most metrics are created and monitored automatically by agent technologies installed in our infrastructure, making us hostages of the set of metrics that these agents are programmed to address. But what if you need to handle your own set of metrics?

This is a question that often drives developers mad because they fear spending development cycles building something that will end up being locked into a particular monitoring/observability vendor. But OpenTelemetry — a CNCF observability framework that provides a vendor-neutral approach to tackle metrics, logging, and tracing needs, can change everything.

This talk will explain how the OpenTelemetry framework allows the creation of custom metrics in a standard, scalable, and reusable way. It will provide an example in Java of a set of metrics that are continuously updated based on the execution of the code and how to hook that data with a compatible observability backend.

Learning about AI/ML for Text, with Wordle!

What can the hit game Wordle teach us about Information Retrieval, Search and AI/ML? As it turns out, quite a bit!

We'll use the Wordle game as our example "text problem" we want to solve, and run through many of the key concepts you need to get started with AI and ML for text. We'll see (with code!) how some common text-related statistics work, and how they can be used to solve (cheat...) Wordle. Then, we'll build ourselves an AI to do the same. Finally, we'll see how that compares to brute-forcing it with regular expressions!

We won't solve all your text-related problems, but hopefully you'll learn the key concepts you need for more advanced talks. And if nothing else, you'll understand the python code for an AI to help you win Wordle!

The Search track is presented by OpenSource Connections

Running Apache Spark on K8s: From AWS EMR to K8s

Ramiro Alvarez Fernandez, Álvaro Panizo, Daniel Hernández Alfageme

Spark is a trend technology that it is being used for a lot of companies for large-scale data analytics. During the first try, companies usually try to use the cloud provider solution to speed up their time to market, but once Spark is broadly embrace by more teams in the company and the solution should be able to be multi cloud provider, then the Kubernetes adoption appear and the journey to make it happen its worth to share to inspire others in the same situation. In this talk the audience will learn some benefits to migrate from AWS EMR to Spark on Kubernetes, from operability point of view (reliability, portability, scalability), through observability and finally reviewing efficiency and costs. This talk is a real use case three teams at Empathy.co were working during 6 months to make their solution more agnostic and with minimum cloud dependencies.

14:40

Word2Vec model to generate synonyms on the fly in Apache Lucene

Daniele Antuzi, Ilaria Petreti

If you want to expand your query/documents with synonyms in Apache Lucene, you need to have a predefined file containing the list of terms that share the same semantic.
It's not always easy to find a list of basic synonyms for a language and, even if you find it, this doesn’t necessarily match with your contextual domain.
The term "daemon" in the domain of operating system articles is not a synonym of "devil" but it's closer to the term "process".

Word2Vec is a two-layer neural network that takes as input a text and outputs a vector representation for each word in the dictionary.
Two words with similar meanings are identified with two vectors close to each other.

This talk explores our contribution to Apache Lucene that integrates this technique with the text analysis pipeline.
We will show how you can automatically generate synonyms on the fly from an Apache Lucene index and how you can use this new feature along with Apache Solr with practical examples!

The Search track is presented by OpenSource Connections

14:50

Dense Concept Retrieval

Konstantinos Perifanos, Lily Davies

At codec.ai we are processing at a daily basis a large volume of input streams in different modalities: text, image, videos. Understanding and making sense of what this content is from a cultural point of view is a challenging task. Here, we will be presenting our multimodal search engine which makes possible to search text, image and video content.

We will be discussing traditional information retrieval approaches augmented with dense retrieval representations produced by neural networks (embeddings), dot product queries with Elasticsearch and approximate nearest neighbor techniques such as Locality-Sensitive Hashing (LSH) an Product Quantization (PQ).

The Search track is presented by OpenSource Connections

Unpaired Sentiment-to-Sentiment Translation: A Cycled Reinforcement Learning Approach

Sakshi Deo Shukla

Sentiment-to-Sentiment translation is a special case for Style Transfer. Style Transfer is emphasised on generating the opposite polar style in terms of emotions or sentiment. This results in the transfer of style successfully but loses the semantic context of the sentence. This is caused due to inefficient amount of data having these relevant paired sentences with polar styles. This talk focuses on generating unpaired dataset which preserves the semantic context during a style change using cycled Reinforcement Learning approach on parallel data having emotionalization and neutralization modules.

The talk can be viewed from https://bit.ly/bbuzz2022

What's new in Apache Solr 9.0

Apache Solr 9.0 might be among the most anticipated release for the project in the last decade for Solr.

For folks who don't follow the project very closely, the list of changes is a lot to comprehend and digest. This talk would make that process easy for the developers by highlighting some key aspects of the 9.0 release.

During this talk, I'd cover the migration of the Solr build system to Gradle and what it means for developers who work with Solr. I will also talk about updates to modules like the movement of HDFS into a non-core plugin and the removal of auto-scaling framework, CDCR, and DIH.

In addition, this talk would also showcase some of the key security, scalability, and stability improvements that Solr 9.0 brings to the users.

At the end of this talk, the attendees would have a better understanding of the Solr 9.0 release and a high level road map for the project allowing them to plan better.

The Search track is presented by OpenSource Connections

15:20

URL Frontier, an open source API and implementation for crawl frontiers

This talk will present URLFrontier, an API and service implementation of a crawl frontier. After an introduction to how it fits in a distributed crawl architecture, we will go in more details on what the project provides, how it has been used so far and future works.

15:30

15:30

30min

Coffee Break

Kesselhaus

15:30

30min

Coffee Break

Palais Atelier

15:30

30min

Coffee Break

Maschinenhaus

15:30

30min

Coffee Break

Frannz Salon

16:00

Building an Open-source Framework for Generating Embedding Vectors

The combination of big data and deep learning has fundamentally changed the way we approach search systems, allowing us to index audio, images, video, and other human-generated data based on an embedding vector instead of an auxiliary description. These advancements are backed by new and often times increasingly complex machine learning (ML) models, leading to an even wider research-to-industry gap despite the introduction of MLOps platforms and a variety of model hubs. We summarize some of the challenges facing practical machine learning in 2022 and beyond as follows: 1) many ML applications require a combination of multiple models, leading to a lot of overly complex and difficult-to-maintain auxiliary code, 2) many engineers are unfamiliar with ML and/or data science, making it difficult for them to train, test, and integrate ML models into existing infrastructure, and 3) constant architectural updates to SOTA deep learning models creates significant overhead when deploying said models in production environments.

In this talk, we discuss lessons learned from building an open-source (https://github.com/towhee-io/towhee) and scalable framework for generating embedding vectors purpose-built to tackle the above challenges. Early on, we communicated with dozens of industry partners to understand their application(s) and architected our platform around their requirements. This open source project is currently being used by 3 major corporations ($10B+ market value) and a number of small- and mid-size startups in proof-of-concept and production systems.

The Search track is presented by OpenSource Connections

Muves: Multimodal and multilingual vector search with Hardware Acceleration

Aarne Talman, Dmitry Kan

Bringing multimodal experience into search journey became of high interest lately: searching images with text, or looking inside an audio file, combining that with the rgb frames of a video stream. Today, vector search algorithms (like FAISS, HNSW, BuddyPQ) and databases (Vespa, Weaviate, Milvus and others) make these experiences a reality. But what if you as a user would like to stay with the familiar Elasticsearch / OpenSearch AND leverage the vector search at scale? In this talk we will take a hardware acceleration route to build a vector search experience over products and will show how you can blend the worlds of neural search with symbolic filters.

We will discuss use cases where adding multimodal and multilingual vector search will improve recall and compare results from Elasticsearch/OpenSearch with and without the vector search component using tools like Quepid. We will also investigate different fine-tuning approaches and compare their impact on different quality metrics.

We will demonstrate our findings using our end-to-end search solution Muves which combines traditional symbolic search with multimodal and multilingual vector search and includes an integrated fine-tuner for easy domain adaptation of pre-trained vector models.

The Search track is presented by OpenSource Connections

Patterns and anti-patterns for production ready Kafka Streams apps

Christoph Schubert

Kafka Streams is a library for developing streaming application with Apache Kafka.
We will discuss best practices for developing a production-ready Kafka Streams application and for running it smoothly in production.
After reviewing the fundamentals of stateless and especially stateful programming with Kafka Streams, we will address the following questions:

How to prepare your application for seamless failover?
How to deal with the ever-growing table anti-pattern and properly implement TTL?
How to prevent resource-leaks when dealing with RocksDB-based state stores?
Which metrics to monitor?
How to size your runtime environment?
What should we keep in mind when deploying Kafka Streams on Kubernetes?
How to best deal with evolving data models?

Scaling the Open Source Climate Community

The scarcity of standardized and accessible data at the convergence of human climate impacts and the financial sector prevents economic stakeholders from effectively aligning world-wide investment and capital flows with Environmental, Social, and Governance (ESG) objectives. The majority of financial companies cannot afford costly bespoke ingestion and curation projects, and so climate-aware investing remains limited without the benefit of shared data or open protocols.

At the Open Source Climate (OS-Climate) community, we are building an open data science platform that supports data ingestion, processing and quality management for data from both corporate climate reports and investment related data. In order for this global community project to succeed, OS-Climate must implement traditional scalability of compute and data, but that alone is insufficient. The community must also scale the operation of its cluster and software deployments. Furthermore, it must effectively scale its ability to onboard new data workflows from actively contributing members. Last but not least, it must be able to scale its own governance at each of these levels, as they mature.

In this talk, Erik will introduce OS-Climate and tell the story of how this open community has managed its own evolution to continue scaling data, computation, operations, member contributions and governance. The audience will learn about tools from software, data science, platforms, and community architecture that can help their own communities grow.

16:50

Cloud-native ETL with Java Quarkus, Kubernetes, and Jib Container Builder

DataCater unlocks more value from organizations' data, faster. This talk walks you through our stack, architecture, and processes. We develop tools to deploy and run data-driven applications in a cloud-native environment.

We will give a whirlwind tour on developing a java Quarkus application, a CICD stack powered by Github Actions / ArgoCD, building and deploying containerized Kafka Streams applications at runtime with Jib container builder.

Having introduced the above common understanding, we will give a high-level overview of how we utilize modern Kubernetes and Cloud tooling to manage multiple clusters in different organizations together with our customers.

Next generation OLAP stack using Apache Pinot

Real-time analytics has transformed the way companies do business. It has unlocked the ability to make real-time decisions such as customer incentives, business metrics, fraud detection and provide a personalized user experience that accelerates growth and user retention. This is a complex problem and naturally, there are several OLAP (OnLine Analytics Processing) solutions out there, each focusing on a different aspect.

In order to support all such use cases, we need an ideal OLAP platform that has the ability to support extremely high query throughput with low latency and at the same time provide high query accuracy – in the presence of data duplication and real-time updates. In addition, the same system must be able to ingest data from all kinds of data sources, handle unstructured data and real-time upserts. While there are different ways of solving each such problem scenario, ideally we want one unified platform that can be easily customized. In this talk, we will go over the rich capabilities of Apache Pinot that make it an ideal OLAP platform.

NrtSearch: Yelp’s fast, scalable, and cost-effective open source search engine

Search and ranking are part of many important features on the Yelp platform - from looking for a plumber to showing relevant photos of the dish you search for. These varied use cases led to the creation of Yelp’s Elasticsearch-based ranking platform which we presented at Berlin Buzzwords 2019, allowing real-time indexing, learning-to-rank, and lesser maintenance overhead, as well as enabling access to search functionality to more teams at Yelp. We recently built Nrtsearch, a Lucene-based search engine, to replace Elasticsearch. We have open sourced this search engine under the Apache 2.0 license.

This talk will detail

Challenges associated with scaling Elasticsearch costs and performance.
Mainly issues related to the document-based replication approach.
Difficulties with real time auto scaling of Elasticsearch.
Inefficient usage of resources due to hot and cold node issues.

Architecture of Nrtsearch
Uses Lucene’s near-real-time (NRT) segment replication
Primary-Replica architecture: Primary does all writing including segment merges while replicas simply copy over segments using Lucene's NRT APIs and serve search queries.
Cluster orchestration, availability and management of nodes is left to systems like Kubernetes that excel at resource management and scheduling.
Truly stateless architecture: Deployed as a standard microservice using Kubernetes. State is committed to s3, upon a restart of a primary or replica, the most recent state from s3 is pulled down.

Benefits of this architecture
Performance increased by up to 50%
Cluster costs lowered by up to 50%
Use of standard tools (k8s) to manage operational aspects of the cluster, relieving ranking infrastructure teams to focus on search-related problems.

Challenges involved in rolling this out to production
Lucene’s segment replication approach and the code itself is not widely used in the industry so had some rough edges. Exciting performance bugs!

Future work
Enhance feature support via extensible plugins like vector-embeddings
Continue to simplify and open source deployment tooling to help others deploy NrtSearch in their own cloud environments.

The Search track is presented by OpenSource Connections

17:40

17:40

15min

Closing Session

Kesselhaus