PyConDE & PyData Berlin 2024

DDataflow: An open-source end-to-end testing framework for ML pipelines
04-23, 11:40–12:10 (Europe/Berlin), A1

In the realm of machine learning, the complexity of data pipelines often hinders rapid experimentation and iteration. This talk will introduce DDataflow, an innovative open-source tool designed to facilitate end-to-end testing of ML pipelines through decentralized data sampling. Attendees will gain insights into the challenges of unit testing large-scale data pipelines, the design philosophy behind DDataflow, and practical implementation strategies for making their ML pipelines more reliable and efficient.


Machine learning pipelines, especially those dealing with large datasets, are intricate and multifaceted. The ability to quickly iterate and experiment is crucial, yet the complexity and scale of these pipelines often lead to prolonged development loops and latent errors. Traditional unit-testing approaches have proven cumbersome and inefficient here because they require extensive boilerplate code and offer limited coverage.

This talk will delve into the journey of developing DDataflow, a tool that addresses these challenges by enabling efficient end-to-end testing of ML pipelines. DDataflow employs decentralized data sampling to speed up tests, allowing for rapid and reliable iteration.


Expected audience expertise: Domain

Advanced

Expected audience expertise: Python

Intermediate

Abstract as a tweet (X) or toot (Mastodon)

Explore our talk on DDataflow, a tool transforming ML pipeline testing. It streamlines the process with decentralized data sampling, tackling prolonged development loops and latent errors in large, complex data pipelines.

Public link to supporting material, e.g. videos, Github, etc.

https://github.com/getyourguide/DDataFlow, https://www.getyourguide.careers/posts/ddataflow-a-tool-for-data-end-to-end-tests-for-machine-learning-pipelines

See also: slides (6.3 MB)

Theodore Meynard is a data science manager at GetYourGuide. He leads the evolution of their ranking algorithm, helping customers find the best activities to book and locations to explore. Beyond work, he is one of the co-organizers of the PyData Berlin meetup and the conference.
When he is not programming, he loves riding his bike looking for the best bakery-patisserie in town.

Jean Carlo Machado is a Brazilian data science manager at GetYourGuide for the Growth Data Products team and the Machine Learning Platform team. In this role he collaborates with amazing people in turning business opportunities into data science products, from inception to large-scale production deployments of multiple data products. Jean values community building and bringing communities together; he is currently one of the organizers of MLOps.community Berlin. Jean spends a significant part of his ever-shrinking free time building open-source tools; his focus right now is building social good tech.