PyCon LT 2023

Analyze your data at the speed of light with Polars and Kedro
2023-05-18 , Saphire C - Web Dev

Writing maintainable data science code is a big topic, and different people have different opinions on the best ways to do it. Wouldn't it be nice if there were an opinionated framework to set some structure and help data scientists be more effective and ship their analyses and models to production faster?

In this workshop we present Kedro, an opinionated Python framework for creating reproducible, maintainable and modular data science code. We will also show how you can combine it with Polars, a new dataframe library backed by Arrow and Rust, for lightning-fast data manipulation and exploratory data analysis.



Kedro is an open-source (Apache 2.0) Python framework for maintainable data science that provides a series of project templates, a declarative data catalog, functionality to create function-based data pipelines, and a powerful visualization tool. It has a rich ecosystem of plugins and extensions and a thriving community.
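To make the idea of a data catalog plus function-based pipelines concrete, here is a minimal sketch in plain Python. It deliberately does not use Kedro's API: the `Node` class, the `run` helper, and the dataset names are all hypothetical, chosen only to illustrate the concept of wiring pure functions together by dataset name.

```python
from typing import Callable, NamedTuple


class Node(NamedTuple):
    """Hypothetical stand-in for a pipeline step: a function plus named inputs and output."""
    func: Callable
    inputs: list
    output: str


def drop_missing(readings):
    # Pure function: remove missing values from a list of readings.
    return [value for value in readings if value is not None]


def average(readings):
    # Pure function: arithmetic mean of the cleaned readings.
    return sum(readings) / len(readings)


# A toy "catalog": dataset names mapped to in-memory data (illustrative only).
catalog = {"raw_readings": [1.0, None, 3.0, 5.0]}

# The pipeline is just an ordered list of nodes connected by dataset names.
pipeline = [
    Node(drop_missing, ["raw_readings"], "clean_readings"),
    Node(average, ["clean_readings"], "mean_reading"),
]


def run(nodes, datasets):
    # Execute nodes in order, looking up inputs and storing outputs by name.
    data = dict(datasets)
    for node in nodes:
        args = [data[name] for name in node.inputs]
        data[node.output] = node.func(*args)
    return data


results = run(pipeline, catalog)
print(results["mean_reading"])
```

Because each step is an ordinary function that only sees named inputs and outputs, the steps can be tested in isolation and rearranged without touching any I/O code. A real framework adds much more on top of this idea, such as persistent datasets and dependency resolution.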

Traditionally, Kedro has encouraged the use of pandas for data I/O and manipulation. Recently, Polars has become increasingly popular thanks to its expressive API, its lazy evaluation system, its out-of-core capabilities, and its impressive performance.

The workshop will be hands-on, and the outline is as follows:

  1. The problem of maintainability in data science code
  2. What is Kedro?
  3. Quick data I/O with Polars
  4. Introducing the Kedro catalog and the Jupyter integration
  5. Creating pipelines in Kedro
  6. More exploratory data analysis with Polars
  7. Plots in Kedro Viz

What is the level of your talk?:

Intermediate

What topics define your talk the best?:

python, open source, PyData, optimization and speed, data science, machine learning, data engineering

Juan Luis (he/him/él) is an Aerospace Engineer with a passion for STEM, programming, outreach, and sustainability. He works as a Developer Advocate for Kedro, an opinionated data science framework, at QuantumBlack, AI by McKinsey. He has worked as a Developer Advocate at Read the Docs, as a software engineer in the space, consulting, and banking industries, and as a Python trainer for several private and public entities.

Apart from being a long-time user of and contributor to many projects in the scientific Python stack (NumPy, SciPy, Astropy), he has published several open-source packages, the most important one being poliastro, an open-source Python library for Orbital Mechanics used in academia and industry.

Finally, Juan Luis is the founder and former chair of the Python España association (the point of contact for the Spanish Python community), a former organizer of PyCon Spain, which attracted 800 attendees in its last in-person edition in 2022, and the current organizer of the PyData Madrid monthly meetups.