On the structure and reproducibility of Python packages - data crunch
2024-09-25, Louis Armand 2 - Ouest

Did you know that all top PyPI packages declare their third-party dependencies? In contrast, only about 53% of scientific projects do the same. This raises the question: how can we reproduce Python-based scientific experiments if we don't know which libraries the environment needs?
In this talk, we delve into the Python packaging ecosystem and employ a data-driven approach to analyze the structure and reproducibility of packages. We compare two distinct groups of Python packages: the most popular ones on PyPI, which we anticipate to adhere more closely to best practices, and a selection from biomedical experiments. Through our analysis, we uncover common development patterns in Python projects and utilize our open-source library, FawltyDeps, to identify undeclared dependencies and assess the reproducibility of these projects.
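To make this concrete, here is a minimal, hypothetical example of an undeclared dependency (the file and package names are ours, not taken from the study):

    # analysis.py - a toy script from a hypothetical scientific project
    import numpy as np   # third-party import
    import pandas as pd  # third-party import

    def summarize(csv_path):
        """Load a CSV file and return the mean of each numeric column."""
        df = pd.read_csv(csv_path)
        return np.mean(df.select_dtypes("number").to_numpy(), axis=0)

If the project's requirements.txt declares only pandas, this script will usually still run, because numpy happens to be installed as a transitive dependency of pandas; but the project no longer states its true requirements, and anyone recreating the environment relies on an accident of the dependency tree. Running the fawltydeps command-line tool in such a project reports numpy as an undeclared dependency.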
This discussion is especially valuable for enthusiasts of clean Python code, as well as for data scientists and engineers eager to adopt best practices and enhance reproducibility. Attendees will leave with actionable insights for improving the transparency and reliability of their Python projects, thereby advancing the cause of reproducible scientific research.


Python has emerged as one of the most popular programming languages, with an extensive ecosystem of packages and projects catering to diverse domains. However, the structure and reproducibility of Python projects vary significantly, posing challenges for developers and researchers alike. In this talk, we perform a meta-analysis of Python packages in two different domains to gain a better understanding of widespread patterns and to quantify the impact of missing or misconfigured dependencies on reproducibility.
The FawltyDeps team was driven by a nagging question: how reproducible are Python packages, and is there a discernible difference between top packages and those developed for scientific research? So we did what any data-loving person would do: we collected data and analyzed it with the core logic of FawltyDeps and the indispensable Jupyter notebook. Our analysis covers packages from the biomedical domain, drawing on the article by Sheeba and Mietchen, while for the top PyPI packages we used public BigQuery data.
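The core of that analysis can be sketched in a few lines of Python. The snippet below is a deliberately simplified illustration, not FawltyDeps' actual implementation: the real tool also filters out standard-library and first-party imports, maps import names to PyPI package names (think sklearn vs. scikit-learn), and parses declaration formats beyond requirements.txt. The project path is a placeholder:

    import ast
    import re
    from pathlib import Path

    def imported_modules(project_dir):
        """Collect the top-level names of modules imported anywhere in a project."""
        imports = set()
        for py_file in Path(project_dir).rglob("*.py"):
            tree = ast.parse(py_file.read_text(), filename=str(py_file))
            for node in ast.walk(tree):
                if isinstance(node, ast.Import):
                    imports.update(alias.name.split(".")[0] for alias in node.names)
                elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
                    imports.add(node.module.split(".")[0])
        return imports

    def declared_dependencies(requirements_file):
        """Parse declared package names from a requirements.txt-style file."""
        deps = set()
        for line in Path(requirements_file).read_text().splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                # keep the package name; drop version specifiers, extras, markers
                deps.add(re.split(r"[\s\[;=<>~!]", line, maxsplit=1)[0].lower())
        return deps

    # Imports with no matching declaration are candidate undeclared dependencies.
    undeclared = imported_modules("some_project") - declared_dependencies(
        "some_project/requirements.txt"
    )

Comparing these two sets in both directions is what lets us quantify undeclared (imported but not declared) and unused (declared but not imported) dependencies across thousands of projects.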
Our goal is to compare and contrast the data-related projects with the top packages, to unearth how these two groups differ, both internally and relative to each other, in structure and dependency configuration. The data-related sample showcases data science projects crafted by developers of varying expertise levels, where we anticipate greater variability in both structure and reproducibility. For instance, we found that nearly 20% of the data science projects store all of their code and notebook files in the project's top-level directory.

Outline:
Introduction (0-2’):
We introduce the speakers and provide an overview of the talk's objectives.
Reproducibility of Python Projects (3-10’):
We delve into the importance of reproducibility in Python projects and discuss common challenges faced by developers.
Methodology and Data Collection (11-15’):
We outline the experimental methodology employed, including how the biomedical experiments and PyPI packages were selected for analysis and which types of analysis were performed on them.
Results and Insights (16-23’):
We present the findings of our analysis, highlighting common patterns in Python project development and assessing their reproducibility using FawltyDeps.
Key Takeaways (24-25’):
We emphasize the key lessons learned from our analysis, particularly the importance of structured project development and dependency declarations.
Question and Answer Session (26-30’):
We open the floor to questions from the audience.

See also: a comparison of which files are used to declare third-party dependencies (requirements.txt, setup.py, setup.cfg, pyproject.toml) and in which combinations.
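For readers who want to reproduce that comparison in spirit, a short, hypothetical sketch suffices: given a directory of checked-out projects (projects/ is a placeholder path), count which combination of declaration files each project ships.

    from collections import Counter
    from pathlib import Path

    DECLARATION_FILES = ("requirements.txt", "setup.py", "setup.cfg", "pyproject.toml")

    def declaration_combination(project_dir):
        """Return the tuple of dependency-declaration files present in one project."""
        present = tuple(f for f in DECLARATION_FILES if (Path(project_dir) / f).is_file())
        return present or ("none",)

    # Tally the combinations across all checked-out projects.
    counts = Counter(
        declaration_combination(p) for p in Path("projects").iterdir() if p.is_dir()
    )
    for combo, n in counts.most_common():
        print(", ".join(combo), "->", n)

Projects that fall into the ("none",) bucket are precisely the ones whose environments cannot be recreated from the repository alone.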

Maria's professional goal is to improve the environment by first understanding it in
the language of mathematics and then applying the knowledge gained.
After graduating in applied mathematics, Maria began research on two-phase turbulent flows.
Her knowledge of mathematical modeling helped her better understand small-scale physical effects
and allowed her to model two-phase turbulence more accurately while reducing computational costs.

After completing her PhD, Maria began working as a data scientist.
She was responsible for all stages of data processing, from creating ETL pipelines, through modeling,
to visualizing the results, and led projects of two to five people.
Her inclination towards the implementation and design aspects drew her towards functional programming.
She integrated Haskell into parts of the data processing pipelines, finding its type system and
expressiveness more akin to mathematical language. Maria is also dedicated to maintaining neat,
reusable, and well-documented code.

Outside of her technical pursuits, Maria is passionate about promoting diversity in the IT industry
and inspiring girls and women to engage in programming. Balancing her career with being a mother of three,
she finds limited but cherished time for personal hobbies. When the opportunity arises,
Maria enjoys the thrill of motorcycle rides beyond the city limits.


Zhihan is a data scientist/engineer with a rich academic background in
pure mathematics. Having earned a PhD in Geometric Topology and Probabilities
from Telecom Paris, she defined random walks on simplicial complexes and
applied machine learning algorithms for in-depth analysis.

At Tweag (the Open Source Program Office of Modus Create),
Zhihan brings mathematical rigor to the forefront of
data science and data engineering. She is actively engaged in developing
innovative data engineering solutions and crafting compelling data
visualizations, all while embracing the challenges of cloud deployment.