Zhihan Zhang
Zhihan is a data scientist and engineer with a strong academic background in
pure mathematics. During her PhD in Geometric Topology and Probabilities
at Telecom Paris, she defined random walks on simplicial complexes and
applied machine learning algorithms for in-depth analysis.
At Tweag (an Open Source Program Office of Modus Create),
Zhihan brings mathematical rigor to the forefront of
data science and data engineering. She is actively engaged in developing
innovative data engineering solutions and crafting compelling data
visualizations, all while embracing the challenges of cloud deployment.
Session
Did you know that all top PyPI packages declare their third-party dependencies? In contrast, only about 53% of scientific projects do the same. This raises a question: how can we reproduce Python-based scientific experiments if we don't know which libraries the environment needs?
In this talk, we delve into the Python packaging ecosystem and take a data-driven approach to analyzing the structure and reproducibility of packages. We compare two distinct groups of Python packages: the most popular ones on PyPI, which we expect to adhere more closely to best practices, and a selection from biomedical experiments. Through our analysis, we uncover common development patterns in Python projects and use our open-source library, FawltyDeps, to identify undeclared dependencies and assess the reproducibility of these projects.
This talk is especially valuable for enthusiasts of clean Python code, as well as for data scientists and engineers eager to adopt best practices and improve reproducibility. Attendees will leave with actionable insights on making their Python projects more transparent and reliable, thereby advancing the cause of reproducible scientific research.