2022-05-26, PyData Room
The pandas library is one of the key factors that enabled the growth of Python in the Data Science industry and continues to help data scientists thrive almost 15 years after its creation. Because of this success, nowadays there are several open-source projects that claim to improve pandas in various ways.
In this talk we will go over some of the most widely used dataframe Python libraries beyond pandas, clarify the relationship between them, compare them in terms of project scope and proximity to the original pandas API, and offer advice on when to use each of them.
These libraries try to improve on pandas in different ways: by bringing it to a distributed computing setting (Dask), by accelerating its performance with minimal code changes (Modin), or by offering a slightly different API that addresses some of its shortcomings (Polars).
The outline of the talk goes as follows:
- Short introduction to the importance of pandas, and brief recollection of its main pain points (5 minutes)
- Enumeration of some alternatives, description of our classification (pandas-like vs bespoke, single-node vs distributed) (5 minutes)
- Presentation of the libraries using brief code snippets, visualization of the dependency relationships between them (20 minutes)
- Recommendations and conclusions (5 minutes)
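
The two classification axes named in the outline (pandas-like vs bespoke, single-node vs distributed) can be sketched as a small lookup table. This is only an illustration: the class, field, and function names are hypothetical, and each library's placement is an assumption read off the one-line descriptions in this abstract, not the classification presented in the talk. Where the abstract does not say whether a library is distributed, the field is left as `None`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataFrameLibrary:
    """Hypothetical record placing a library on the two axes from the outline."""
    name: str
    pandas_like: bool            # True: mirrors the pandas API; False: bespoke API
    distributed: Optional[bool]  # None: not stated in this abstract


# Assumed placements, inferred from the abstract's wording only.
LIBRARIES = [
    DataFrameLibrary("Dask", pandas_like=True, distributed=True),
    DataFrameLibrary("Modin", pandas_like=True, distributed=None),
    DataFrameLibrary("Polars", pandas_like=False, distributed=None),
]


def candidates(need_pandas_api: bool, need_distributed: bool) -> List[str]:
    """Return the names of libraries compatible with both requirements."""
    result = []
    for lib in LIBRARIES:
        if need_pandas_api and not lib.pandas_like:
            continue  # a bespoke API implies a larger migration effort
        if need_distributed and lib.distributed is not True:
            continue  # keep only libraries known to be distributed
        result.append(lib.name)
    return result
```

For example, `candidates(need_pandas_api=True, need_distributed=True)` keeps only Dask, while relaxing both requirements returns all three libraries.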
After the talk, you will know more about how some of the modern alternatives to pandas fit into the ecosystem, understand which ones provide the easiest migration path for an existing codebase, and be better prepared to judge which one to use for your next project. Prior exposure to pandas will help you make the most of the presentation.
python, PyData, data science
Juan Luis (he/him/él) is an Aerospace Engineer with a passion for STEM, programming, outreach, and sustainability. He works as Data Scientist Advocate at Orchest, where he empowers data scientists by building an open-source, scalable, easy-to-use workflow orchestrator. He has worked as a Developer Advocate at Read the Docs and, before that, as a software engineer in the space, consulting, and banking industries and as a Python trainer for several private and public entities.
Apart from being a long-time user of and contributor to many projects in the scientific Python stack (NumPy, SciPy, Astropy), he has published several open-source packages, the most important one being poliastro, an open-source Python library for Orbital Mechanics used in academia and industry.
Finally, Juan Luis is the founder and former chair of the Python España association, the point of contact for the Spanish Python community; a former organizer of PyCon Spain, which attracted more than 800 attendees in its last in-person edition in 2019; and the current organizer of the PyData Madrid monthly meetups.
