Developing pandas extensions in Rust
2023-08-14, Aula

pandas is a batteries-included dataframe library that implements hundreds of generic operations for tabular data, such as math and string operations, aggregations and window functions. In some cases, domain-specific code may benefit from user-defined functions (UDFs) that implement custom logic. Sometimes these functions can be built from simpler pandas vectorized operations and will be reasonably fast. In other cases, a Python function that works on individual values is needed, and it will run orders of magnitude slower than an equivalent vectorized version. In this tutorial we will see how to implement functions in Rust that operate on individual dataframe values but run at the speed of vectorized code, and in some cases faster.


While this tutorial covers complex topics from low-level programming languages like Rust, it is aimed at a beginner audience. No previous knowledge of Rust is required; a basic understanding of pandas is all that is needed to follow the tutorial.

The tutorial will cover how libraries written in a low-level programming language like Rust can be called from Python, the basics of the internal representation of pandas dataframes, the Apache Arrow C data interface, and how to write a simple function in Rust.
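To give a flavour of the kind of function involved, here is a minimal sketch in Rust and not the tutorial's actual code. It assumes a column of 64-bit integers stored as a contiguous buffer plus an Arrow-style validity bitmap (least-significant bit first, 1 meaning valid). The names `transform`, `apply_udf` and `apply_udf_c` are hypothetical, and the raw pointer signature is a simplification: the real Arrow C data interface passes `ArrowArray` and `ArrowSchema` structs instead.

```rust
use std::slice;

/// Hypothetical per-value logic (an assumption for illustration):
/// clamp negative values to zero and double everything else.
fn transform(value: i64) -> i64 {
    if value < 0 {
        0
    } else {
        value * 2
    }
}

/// Returns true when bit `i` of an Arrow-style validity bitmap is set
/// (least-significant bit first, 1 = valid, 0 = null).
fn is_valid(bitmap: &[u8], i: usize) -> bool {
    (bitmap[i / 8] >> (i % 8)) & 1 == 1
}

/// Applies `transform` in place to every non-null value.
fn apply_udf(values: &mut [i64], validity: &[u8]) {
    for (i, v) in values.iter_mut().enumerate() {
        if is_valid(validity, i) {
            *v = transform(*v);
        }
    }
}

/// C ABI entry point that a Python process could call through a foreign
/// function interface. In a real shared library the function would also
/// need a `no_mangle` attribute so the symbol is exported under this name.
///
/// Safety: `values` must point to `len` writable i64 values and `validity`
/// to at least `(len + 7) / 8` readable bytes.
pub unsafe extern "C" fn apply_udf_c(values: *mut i64, validity: *const u8, len: usize) {
    if values.is_null() || validity.is_null() || len == 0 {
        return;
    }
    // Rebuild safe slices from the raw pointers handed over by the caller.
    let values = unsafe { slice::from_raw_parts_mut(values, len) };
    let validity = unsafe { slice::from_raw_parts(validity, (len + 7) / 8) };
    apply_udf(values, validity);
}
```

The loop visits each value individually, as a Python UDF would, but it compiles to native code that works directly on the buffer, with no Python object created per element. Null entries are skipped by consulting the bitmap and are never given a sentinel value.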

To follow the hands-on part of this tutorial, participants should bring a laptop with a working Python installation, recent versions of pandas and PyArrow, and a Rust compiler.


Abstract as a tweet:

Learn how to develop a high-performance pandas extension in Rust from pandas core developer @datapythonista.

Category [High Performance Computing]:

Vector and Array Manipulation

Category [Community, Education, and Outreach]:

Learning and Teaching Scientific Python

Category [Machine and Deep Learning]:

Supervised Learning

Category [Scientific Applications]:

Other

Category [Data Science and Visualization]:

Data Analysis and Data Engineering

Expected audience expertise: Domain:

none

Expected audience expertise: Python:

some

Marc is a pandas core developer and the release manager for pandas 1.5 and 2.0. He is also an Ibis and ASV core developer, a fellow of the Python Software Foundation, and the VP of infrastructure at NumFOCUS. Marc works as an independent software and data consultant for clients such as Bank of America, Unilever, Bumble, Tesco and NTT Communications.