2023-08-17, HS 119 - Maintainer track
This slot will cover ongoing efforts around interoperability in the scientific Python ecosystem. Topics:
- Using the Array API for array-producing and array-consuming libraries
- DataFrame interchange and namespace APIs
- Apache Arrow: connecting and accelerating dataframe libraries across the PyData ecosystem
- Entry Points: Enabling backends and plugins for your libraries
Using the Array API for array-producing and array-consuming libraries
Already using the Array API or wondering if you should in a project you maintain? Join this maintainer track session to share your experience and exchange knowledge and tips around building array libraries that implement the standard or libraries that consume arrays.
DataFrame-agnostic code using the DataFrame API standard
The DataFrame Standard provides a minimal, strict, and predictable API for writing code that works regardless of whether the caller uses pandas, Polars, or some other library.
DataFrame Interchange protocol and Apache Arrow
The DataFrame interchange protocol and the Arrow C Data interface are two ways to interchange data between dataframe libraries. What are the challenges and requirements that maintainers encounter when integrating them into consuming libraries?
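For illustration, here is a minimal sketch of the consuming side of the interchange protocol; it assumes pandas (1.5 or later) as the consumer and Polars as the producer, but any library exposing the __dataframe__ method would work:

    # Consume an arbitrary dataframe through the interchange protocol.
    import pandas as pd
    import polars as pl

    pl_df = pl.DataFrame({"station": ["Basel", "Ghent"], "temp": [24.5, 21.0]})

    # Any object exposing __dataframe__ can be converted, without the
    # consuming code depending on the producing library's own API.
    pd_df = pd.api.interchange.from_dataframe(pl_df)
    print(pd_df)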
Entry Points: Enabling backends and plugins for your libraries
In this talk, we will discuss how NetworkX used entry points to let more efficient computation backends plug into NetworkX.
Using the Array API for array-producing and array-consuming libraries
This session is for maintainers of projects that implement the Array API (NumPy, CuPy, PyTorch, etc.) or that consume Array API inputs (scikit-learn, SciPy, etc.), as well as anyone wondering whether to start investing in the Array API for their project.
The Array API standard aims to specify a common API for multidimensional arrays, solving the problem of subtle API differences between the many array libraries that exist. It provides a minimum set of functions and behaviours for array libraries to implement. As a result, array-consuming libraries do not need code to handle slight differences: they can rely on these functions and behaviours existing, and being standard compliant, in all array libraries.
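As a concrete illustration, here is a minimal sketch of array-agnostic code; it assumes the array-api-compat helper package, which exposes standard-compliant namespaces for NumPy, CuPy, PyTorch, and others:

    # Array-agnostic standardization written against the Array API standard.
    import numpy as np
    from array_api_compat import array_namespace

    def standardize(x):
        # Look up the standard-compliant namespace of whichever library
        # produced x (NumPy, CuPy, PyTorch, ...).
        xp = array_namespace(x)
        # Only standard functions (mean, std) are used, so the same code
        # runs unchanged on any compliant array library.
        return (x - xp.mean(x)) / xp.std(x)

    print(standardize(np.asarray([1.0, 2.0, 3.0])))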
The Array API standard is not yet in widespread use. Adoption across the ecosystem has only just started.
This session is a place to discuss and share your experience using the Array API standard, whether in an array library you maintain or in a library that consumes arrays.
Apache Arrow: connecting and accelerating dataframe libraries across the PyData ecosystem
The Apache Arrow (https://arrow.apache.org/) project specifies a standardized language-independent columnar memory format for tabular data. It enables shared computational libraries, zero-copy shared memory, efficient (inter-process) communication without serialization overhead, etc. Nowadays, Apache Arrow is supported by many programming languages and projects, and is becoming the de facto standard for tabular data.
But what does that mean in practice? There is a growing set of tools in the Python bindings, PyArrow, and a growing number of projects that use (Py)Arrow to accelerate data interchange and actual data processing. This talk will give an overview of recent developments both in Apache Arrow itself and in how it is being adopted in the PyData ecosystem (and beyond), and how it can improve your day-to-day data analytics workflows.
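As a small taste of what this looks like in practice, here is a minimal sketch of round-tripping tabular data through PyArrow; it assumes recent pandas (2.x) and pyarrow releases:

    # Share tabular data through Arrow's columnar format.
    import pandas as pd
    import pyarrow as pa

    df = pd.DataFrame({"city": ["Basel", "Ghent"], "temp": [24.5, 21.0]})

    # Convert to an Arrow Table: a language-independent columnar
    # representation that other Arrow-aware libraries can consume.
    table = pa.Table.from_pandas(df)

    # Going back to pandas with Arrow-backed dtypes keeps the data in
    # Arrow memory instead of converting to NumPy arrays.
    roundtrip = table.to_pandas(types_mapper=pd.ArrowDtype)
    print(roundtrip.dtypes)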
Entry Points: Enabling backends and plugins for your libraries
As a maintainer of an open source library, you always need to wrestle with the fact that there are new experimental things which could really help your users, but which may just be too experimental for now (especially for old, established projects).
There are always questions like:
- If we add this new requirement, we can use the new X feature, but now we depend on that library.
- A new change will help a lot of users, but this will utterly destroy all the code written using the library in the last 20 years.
- A new fork comes out and splits the community.
- You don't want to write and maintain C/Rust/Fortran yourself and would rather ship that bit out to other packages.
and many more!
In this talk, we will explore one option: giving your user community a workflow to develop plugins directly for your package.
We will look at an example case study from NetworkX, specifically using the entry points mechanism for plugin discovery. In NetworkX these plugins are currently used to swap in the computation bits.
- (2 minutes) Quick introduction to NetworkX and GraphBLAS
- (3 minutes) Quick introduction to entry_points
- (5 minutes) Use the NetworkX API, but get the speed of GraphBLAS
- (5 minutes) Demoing the plugin mechanism and implementation details
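To make the mechanism concrete, here is a minimal sketch of the discovery side of such a plugin system; the group name "myproject.backends" is hypothetical (NetworkX uses its own group), and Python 3.10 or later is assumed for the entry_points keyword API:

    # Discover plugins that registered themselves via entry points.
    # A plugin package would declare in its pyproject.toml, for example:
    #   [project.entry-points."myproject.backends"]
    #   fast = "my_fast_backend:Backend"
    from importlib.metadata import entry_points

    def load_backends():
        backends = {}
        for ep in entry_points(group="myproject.backends"):
            # ep.load() imports the object the plugin advertised, without
            # the core library ever importing the plugin package directly.
            backends[ep.name] = ep.load()
        return backends

    print(load_backends())  # {} unless a plugin package is installed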
Parallel Computing
Category [Community, Education, and Outreach]: Learning and Teaching Scientific Python
Abstract as a tweet: Interoperability in the Scientific Python Ecosystem
Category [Machine and Deep Learning]: Supervised Learning
Category [Scientific Applications]: Astronomy
Category [Data Science and Visualization]: Data Analysis and Data Engineering
Expected audience expertise: Domain: none
Expected audience expertise: Python: some
I am a core contributor to pandas and Apache Arrow, and maintainer of GeoPandas. I did a PhD at Ghent University and VITO in air quality research and worked at the Paris-Saclay Center for Data Science. Currently, I work at Voltron Data, contributing to Apache Arrow, and am a freelance teacher of Python (pandas).
I contribute to scikit-learn. In the past I helped build mybinder.org and scikit-optimize. Way back in the history of time I was a particle physicist at CERN and Fermilab.
Machine Learning software engineer at Inria and member of the maintainers' team of the scikit-learn open source project.
I graduated as a machine learning research engineer in 2016, with a specialization in NLP. I co-founded Sancare, a start-up that brings NLP-based solutions for medical data analysis to hospitals and has made a place for itself in the market with a performant NLP-powered billing assistant for medical stays. I'm now working at INRIA, France, as a Machine Learning Research Engineer focused on performance computing.
I am currently working on the NetworkX open source project (work funded through a grant from the Chan Zuckerberg Initiative!). I am also collaborating with folks from the Scientific Python project (Berkeley Institute for Data Science) and Anaconda Inc. Before this, I worked on GESIS Notebooks and gesis.mybinder.org.
I am also interested in the development and maintenance of the open source data & science software ecosystem, and I try to help out with the scientific open source ecosystem wherever possible. To share my love of Python and Network Science, I have presented workshops at multiple conferences such as PyCon, (Euro)SciPy, PyData London, and many more!
Sebastian Berg is a NumPy maintainer and steering council member working at NVIDIA. He started contributing to NumPy during his undergrad and PhD in Physics, and continued working on NumPy at the Berkeley Institute for Data Science before moving to NVIDIA.