PackagingCon

Micro-packaging reusable data science pipelines in Python
2021-11-10 , Room 3

We believe that sharing and reusing data science code is the future for scaling machine learning across the world because it allows us to work more efficiently. To achieve this grand vision, we had to look at how micro-packaging could be done in Python, the language of choice for most data scientists. Micro-packaging is a widely debated topic in the npm world, but it hasn't yet taken off in the Python packaging ecosystem.

This talk will present the journey that brought us to this point, the challenges we faced in implementing this functionality, and the solution we created in Kedro, an open-source Python framework for data science. Whether you're a data practitioner or a software engineer curious about reusing code between projects, you can draw some inspiration from this talk.


We have used Kedro to build reusable code stores, similar to how React is used to create design systems. Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code. While Kedro did lift many barriers, our users found that they still needed a way to easily share code snippets and parts of their data science pipelines between projects. Furthermore, they wanted to consume business logic as source code, and possibly extend it, rather than consume it as a library.

This prompted my team to think about how to make sharing data science code more seamless without confusing our beginner users or forcing data scientists to take a bunch of software engineering classes to use the feature.

Thus micro-packaging was born: packaging (and consuming) submodules using simple CLI commands and a manifest file (pyproject.toml).
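To make the idea concrete, here is a minimal sketch of what packaging and consuming a submodule could look like, using only the Python standard library. It is not Kedro's implementation or CLI: the function names (package_submodule, package_from_manifest, pull_submodule), the archive format, and the plain list standing in for the pyproject.toml manifest are all assumptions made for illustration.

```python
import tarfile
import tempfile
from pathlib import Path


def package_submodule(src_root: Path, module: str, dist_dir: Path) -> Path:
    """Archive one submodule (dotted path relative to src_root) as a .tar.gz.

    Hypothetical helper for illustration; not part of any real library.
    """
    module_dir = src_root.joinpath(*module.split("."))
    if not module_dir.is_dir():
        raise FileNotFoundError(f"No such submodule: {module_dir}")
    dist_dir.mkdir(parents=True, exist_ok=True)
    archive = dist_dir / f"{module_dir.name}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store files under the submodule's own name so it can be dropped
        # into any other package.
        tar.add(module_dir, arcname=module_dir.name)
    return archive


def package_from_manifest(src_root: Path, manifest: list, dist_dir: Path) -> list:
    """Package every submodule listed in a manifest.

    The manifest is a plain list of dotted paths here; it stands in for
    entries that would be declared in a manifest file.
    """
    return [package_submodule(src_root, module, dist_dir) for module in manifest]


def pull_submodule(archive: Path, target_package: Path) -> Path:
    """Unpack an archive as editable source code inside another project."""
    target_package.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target_package)
    name = archive.name[: -len(".tar.gz")]
    return target_package / name


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Build a tiny "project A" with one pipeline submodule.
        src_a = root / "project_a" / "src" / "project_a"
        cleaning = src_a / "pipelines" / "cleaning"
        cleaning.mkdir(parents=True)
        (cleaning / "__init__.py").write_text("")
        (cleaning / "nodes.py").write_text("def clean(x):\n    return x\n")

        archives = package_from_manifest(
            src_a, ["pipelines.cleaning"], root / "dist"
        )
        # "Project B" consumes the submodule as source code it can extend.
        pulled = pull_submodule(
            archives[0], root / "project_b" / "src" / "project_b" / "pipelines"
        )
        print(sorted(p.name for p in pulled.iterdir()))
```

The point of the sketch is the design choice described above: the consumer ends up with plain source files inside its own package, which it can read and modify, rather than an installed dependency.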

In this talk, we cover:

  • Setting the stage - introducing the main pain points we needed to address
  • How the Kedro solution works
  • What's under the hood
  • Reflections & future thinking

I've been a software engineer and Pythonista since 2017, currently working on QuantumBlack and McKinsey's first open-source project. In my spare time you can find me on a volleyball court, in an art gallery, or (in non-pandemic times) on a plane for a city break.