JupyterLab is very widely used in the Python scientific community. Most, if not all, of the other tutorials will use Jupyter as a tool. Therefore, a solid understanding of the basics is very helpful for the rest of the conference as well as for your later daily work.
This tutorial provides an overview of important basic Jupyter features.
Through the use of NetworkX's API, tutorial participants will learn about the basics of graph theory and its use in applied network science. Starting with a computationally-oriented definition of a graph and its associated methods, we will build out into progressively more advanced concepts (path and structure finding). We will also discuss new advances that speed up NetworkX code by dispatching to alternate computation backends like GraphBLAS. This will be a hands-on tutorial, so stretch your muscles and get ready to go through the exercises!
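As a small, illustrative taste of the API the tutorial builds on (not the tutorial's actual exercises):

```python
import networkx as nx

# Build a small undirected graph and query basic structural properties
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("a", "d"), ("d", "e")])

print(G.number_of_nodes(), G.number_of_edges())  # 5 nodes, 5 edges
print(nx.shortest_path(G, "a", "e"))             # path finding, e.g. ['a', 'd', 'e']
print(nx.degree_centrality(G))                   # a simple per-node structure measure
```

Recent NetworkX releases can route calls like these to faster backends when one is installed; the tutorial covers how that dispatching works.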
This tutorial offers a thorough introduction to the srai library for Geospatial Artificial Intelligence. Participants will learn how to use this library for geospatial tasks like downloading and processing OpenStreetMap data, extracting features from GTFS data, dividing an area into smaller regions, and representing regions in a vector space using various spatial features. Additionally, participants will learn to pre-train embedding models and train predictive models for downstream tasks.
This tutorial will provide an introduction to Python intended for beginners.
It will notably introduce the following aspects:
- built-in types
- control flow (e.g. conditions, loops)
- built-in functions
- basic Python classes
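For instance, a few lines touching each of these aspects (illustrative only):

```python
# built-in types
numbers = [1, 2, 3, 4]            # list
name = "EuroSciPy"                # str

# control flow: loop + condition
total = 0
for n in numbers:
    if n % 2 == 0:
        total += n
print(total)                      # 6

# built-in functions
print(len(numbers), sum(numbers), max(numbers))

# a basic Python class
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1.0, 2.0)
print(p.x, p.y)
```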
pandas is a batteries-included dataframe library, implementing hundreds of generic operations for tabular data, such as math and string operations, aggregations, and window functions. In some cases, domain-specific code may benefit from user-defined functions (UDFs) that implement some particular logic. These functions can sometimes be expressed with more basic pandas vectorized operations, in which case they will be reasonably fast; in other cases a Python function working on the individual values needs to be implemented, and those run orders of magnitude slower than their equivalent vectorized versions. In this tutorial we will see how to implement functions in Rust that operate on dataframe values at the individual level but run at the speed of vectorized code, and in some cases faster.
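To make the gap concrete, here is a toy comparison between a value-by-value Python UDF and its vectorized pandas equivalent (column names and logic are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": np.random.default_rng(0).uniform(1, 100, 1_000_000)})

# Python-level UDF applied value by value: flexible, but slow
def discount(price):
    return price * 0.9 if price > 50 else price

slow = df["price"].map(discount)

# Equivalent vectorized expression: typically orders of magnitude faster
fast = df["price"].where(df["price"] <= 50, df["price"] * 0.9)

assert np.allclose(slow, fast)
```

The Rust-backed functions discussed in the tutorial target the cases where no such vectorized rewrite exists.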
NumPy is one of the foundational packages for doing data science with Python. It enables numerical computing by providing powerful N-dimensional arrays and a suite of numerical computing tools. In this tutorial, you'll be introduced to NumPy arrays and learn how to create and manipulate them. Then, you'll see some of the tools that NumPy provides, including random number generators and linear algebra routines.
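For instance, a few of the operations the tutorial introduces (a minimal sketch, not the tutorial material itself):

```python
import numpy as np

# Create and manipulate an N-dimensional array
a = np.arange(12).reshape(3, 4)
print(a.shape, a.dtype)          # (3, 4) and an integer dtype
print(a[:, 1], a.sum(axis=0))    # slicing and axis-wise reductions

# Random number generation and a linear algebra routine
rng = np.random.default_rng(seed=42)
m = rng.normal(size=(3, 3))
print(np.linalg.eigvals(m))
```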
Working with data can be challenging: it often doesn’t come in the best format for analysis, and understanding it well enough to extract insights requires both time and the skills to filter, aggregate, reshape, and visualize it. This session will equip you with the knowledge you need to effectively use pandas – a powerful library for data analysis in Python – to make this process easier.
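A tiny sketch of the filter / aggregate / reshape workflow the session covers (data and column names are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Basel", "Basel", "Zurich", "Zurich"],
    "year": [2022, 2023, 2022, 2023],
    "value": [10.0, 12.5, 8.0, 9.5],
})

recent = df[df["year"] == 2023]                                # filter rows
means = df.groupby("city")["value"].mean()                     # aggregate
wide = df.pivot(index="city", columns="year", values="value")  # reshape
print(recent, means, wide, sep="\n\n")
```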
This tutorial will introduce how to train machine learning models for time-to-event prediction tasks (health care, predictive maintenance, marketing, insurance...) without introducing bias from censored training (and evaluation) data.
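The abstract does not name the libraries used; as one common illustration of handling right-censored data, a Kaplan-Meier estimate with the lifelines package looks like this (toy data):

```python
from lifelines import KaplanMeierFitter

# Durations until the event of interest; event_observed=0 marks right-censored
# samples for which the event had not yet happened when observation stopped.
durations = [5, 6, 6, 2, 4, 4, 7, 10, 12, 3]
event_observed = [1, 0, 1, 1, 1, 0, 1, 0, 1, 1]

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=event_observed)
print(kmf.survival_function_)        # estimate that accounts for censoring
print(kmf.median_survival_time_)
```

Naively dropping censored rows, or treating them as events, is exactly the bias the tutorial aims to avoid.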
Ibis provides a common dataframe-like interface to many popular databases and analytics tools (BigQuery, Snowflake, Spark, DuckDB, …). This lets users analyze data using the same consistent API, regardless of which backend they’re using, and without ever having to learn SQL. No more pain rewriting pandas code to something else when you run into performance issues; write your code once using Ibis and run it on any supported backend. In this tutorial users will get experience writing queries using Ibis on a number of local and remote database engines.
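A minimal sketch of such a query, assuming a recent Ibis version where an expression over an in-memory table executes on the default DuckDB backend (table and column names are invented):

```python
import ibis

t = ibis.memtable({"user": ["a", "a", "b"], "amount": [3.0, 1.0, 5.0]})

expr = (
    t.group_by("user")
     .aggregate(total=t.amount.sum())
     .order_by("user")
)
print(expr.execute())  # the same expression could run on another supported backend
```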
This tutorial explains the fundamental ideas and concepts of matplotlib. It's suited for complete beginners to get started as well as existing users who want to improve their plotting abilities and learn about best practices.
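For instance, the object-oriented Figure/Axes interface usually recommended as best practice (a minimal sketch, not the tutorial notebook):

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("amplitude")
ax.legend()
plt.show()
```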
Update: here is a prepared Jupyter notebook for you to fill with code during the tutorial: https://github.com/StefanieSenger/Talks/blob/main/2023_EuroSciPy/2023_EuroSciPy_Intro_to_scikit-learn_fillout-notebook.ipynb. Please download it and have it at hand when the tutorial starts. You can still download it during the introduction part of the tutorial.
This tutorial will provide a beginner's introduction to scikit-learn, a Python package for machine learning. We will talk about what machine learning is and how scikit-learn can implement it. In the practical part we will learn how to create a predictive modelling pipeline and how to fine-tune its hyperparameters to improve the model's score.
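As a flavour of what such a pipeline can look like (an illustrative sketch, not the tutorial notebook):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Predictive modelling pipeline: preprocessing + estimator
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Fine-tune a hyperparameter with cross-validated grid search
search = GridSearchCV(pipe, {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```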
Privacy guarantees are the most crucial requirement when it comes to analysing sensitive data. However, data anonymisation techniques alone do not always provide complete privacy protection; moreover, machine learning models can be exploited to leak sensitive data when they are attacked and no countermeasure is applied. Privacy-preserving machine learning (PPML) methods hold the promise to overcome all these issues, allowing machine learning models to be trained with full privacy guarantees. In this tutorial we will explore several methods for privacy-preserving data analysis and see how these techniques can be used to safely train ML models without actually seeing the data.
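The specific methods covered are not listed here; as a toy illustration of one building block in this space, the Laplace mechanism of differential privacy adds calibrated noise to an aggregate before releasing it:

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=1_000)          # stand-in for sensitive data

def dp_mean(values, lower, upper, epsilon):
    """Release a differentially private mean via the Laplace mechanism."""
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(values)  # sensitivity of the clipped mean
    noise = rng.laplace(scale=sensitivity / epsilon)
    return clipped.mean() + noise

print(ages.mean(), dp_mean(ages, 18, 90, epsilon=1.0))
```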
Do you test your data pipeline? Do you use Hypothesis? In this workshop, we will use Hypothesis, a property-based testing framework, to generate pandas DataFrames for your tests, without involving any real data.
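A small sketch of the idea, using the hypothesis.extra.pandas strategies to generate DataFrames for a toy pipeline step (function and column names are invented):

```python
import hypothesis.strategies as st
from hypothesis import given
from hypothesis.extra.pandas import column, data_frames


def normalize(df):
    """Toy pipeline step under test: scale the 'value' column to [0, 1]."""
    span = df["value"].max() - df["value"].min()
    if span == 0:
        return df.assign(value=0.0)
    return df.assign(value=(df["value"] - df["value"].min()) / span)


@given(data_frames([column("value", elements=st.floats(0, 1e6))]))
def test_normalize_stays_in_unit_interval(df):
    result = normalize(df)
    assert result["value"].between(0, 1).all()
```

Hypothesis generates many DataFrames, including empty and single-row edge cases, without any real data being involved.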
In this hands-on tutorial, participants will delve into numerical optimization fundamentals and engage with the optimization libraries scipy.optimize and estimagic. estimagic provides a unified interface to many popular libraries such as nlopt or pygmo and provides additional diagnostic tools and convenience features. Throughout the tutorial, participants will get the opportunity to solve problems, enabling the immediate application of acquired knowledge. Topics covered include core optimization concepts, running an optimization with scipy.optimize and estimagic, diagnostic tools, algorithm selection, and advanced features of estimagic, such as bounds, constraints, and global optimization.
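For instance, solving a small unconstrained problem with scipy.optimize (an illustrative sketch, not the tutorial's exercises):

```python
import numpy as np
from scipy.optimize import minimize

def sphere(x):
    """A simple smooth test function with its minimum at the origin."""
    return np.sum(x ** 2)

x0 = np.array([2.0, -1.5, 0.5])
result = minimize(sphere, x0, method="L-BFGS-B")
print(result.x, result.fun, result.nfev)
```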
Learn how to show your work with the MERCURY framework. This open-source tool works directly with your computational notebook (e.g., written in Jupyter Notebook). Without knowledge of frontend technologies, you can present your results as a web app (with interactive widgets), dashboard, or report. Learn how to improve your notebook and make your work understandable for non-technical colleagues. Python only!
This tutorial explores scikit-image, the numpy-native library in the scientific Python ecosystem for visual data analysis and manipulation.
Designed for beginners and advanced users alike, it builds image analysis skills and offers insights into the scikit-image documentation.
It covers basic concepts like image histogram, contrast, filtering, segmentation, and descriptors through practical exercises.
The tutorial concludes with advanced performance optimization techniques.
Familiarity with numpy arrays is essential as they are the underlying data representation.
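A short example in the spirit of the exercises (illustrative only):

```python
from skimage import data, filters
from skimage.measure import label, regionprops

# Load a sample image, smooth it, threshold it, and measure the segmented regions
image = data.coins()
smoothed = filters.gaussian(image, sigma=1)
threshold = filters.threshold_otsu(smoothed)
segments = label(smoothed > threshold)

print(f"{segments.max()} connected regions")
print([round(region.area) for region in regionprops(segments)][:5])
```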
In the rapidly evolving landscape of Machine Learning (ML), significant advancements like Large Language Models (LLMs) are gaining critical importance in both industrial and academic spheres. However, the rush towards deploying advanced models harbors inherent ethical tensions and potential adverse societal impacts. The keynote will start with a brief introduction to the principles of ethics, viewed through the lens of philosophy, emphasizing how these fundamental concepts find application within ML. Grounding our discussion in tangible realities, we will delve into pertinent case studies, including the BigScience open science initiative, elucidating the practical application of ethical considerations. Additionally, the keynote will touch upon findings from my recent research, which investigates the synergy between ethical charters, legal tools, and technical documentation in the context of ML development and deployment.
From sensor data to epidemic outbreaks, particle dynamics to environmental monitoring, much of the crucial real-world data has a temporal nature. Fundamental challenges facing data specialists dealing with time series include not only predicting future values, but also determining when these values are alarming. Standard anomaly detection algorithms and common rule-based heuristics often fall short in addressing this problem effectively. In this talk, we will closely examine this domain, exploring its unique characteristics and challenges. You will learn to apply some of the most promising techniques for detecting time series anomalies, as well as relevant scientific Python tools that can help you with it.
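The talk's concrete techniques are not listed in this abstract; as a simple baseline for comparison, a rolling z-score flags points that deviate strongly from their recent history:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
ts = pd.Series(np.sin(np.linspace(0, 20, 500)) + rng.normal(0, 0.1, 500))
ts.iloc[250] += 5.0                      # inject an anomaly

rolling = ts.rolling(window=50)
zscore = (ts - rolling.mean()) / rolling.std()
anomalies = ts[zscore.abs() > 4]
print(anomalies)                         # should report the injected point
```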
Let's Talk Inclusivity and Mental Health.
What's beyond the lines of code? Let's explore the spectrum of experiences, from contributors to volunteers, developers to conference attendees.
Join us to share your insights, experiences, and solutions for a more supportive and inclusive scientific Python ecosystem. Let's empower one another and shape a community that thrives on empathy, understanding, and collaboration.
We love to use Python in our day jobs, but that enterprise database you run your ETL job against may have other ideas. It probably speaks SQL, because SQL is ubiquitous, it’s been around for a while, it’s standardized, and it’s concise.
But is it really standardized? And is it always concise? No!
Do we still need to use it? Probably!
What’s a data-person to do? String-templated SQL?
print(f”That way lies {{ m̴͕̰̻̏́ͅa̸̟̜͉͑d̵̨̫̑n̵̖̲̒͑̾e̸̘̼̭͌s̵͇̖̜̽s̸̢̲̖͗͌̏̊͜ }}”.)
Instead, come and learn about Ibis! It offers a dataframe-like interface to construct concise and composable queries and then executes them against a wide variety of backends (Postgres, DuckDB, Spark, Snowflake, BigQuery, you name it).
When operating a classifier in a production setting (i.e. the predictive phase), practitioners are interested in two potentially different outputs: a "hard" decision used to drive a business decision and/or a "soft" decision to get a confidence score linked to each potential decision (e.g. usually related to class probabilities).
Scikit-learn does not provide any flexibility to go from "soft" to "hard" predictions: it uses a cut-off point at a confidence score of 0.5 (or 0 when using decision_function) to get class labels. However, optimizing a classifier to get a confidence score close to the true probabilities (i.e. a calibrated classifier) does not guarantee accurate "hard" predictions using this heuristic. Conversely, training a classifier for optimum "hard" prediction accuracy (with the cut-off constraint at 0.5) does not guarantee obtaining a calibrated classifier.
In this talk, we will present a new scikit-learn meta-estimator allowing us to get the best of both worlds: a calibrated classifier providing optimum "hard" predictions. This meta-estimator will land in a future version of scikit-learn: https://github.com/scikit-learn/scikit-learn/pull/26120.
We will provide some insights regarding how to obtain accurate probabilities and predictions, and also illustrate how to use this model in practice on different use cases: cost-sensitive problems and imbalanced classification problems.
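The meta-estimator itself is not released yet (see the PR above); to illustrate the underlying idea, the decision threshold can be scanned manually on a validation set instead of hard-coding 0.5 (a simplified sketch, not the upcoming scikit-learn API):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_val)[:, 1]

# The default "hard" prediction uses a fixed 0.5 cut-off; scan alternatives
thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y_val, proba >= t) for t in thresholds]
best = thresholds[int(np.argmax(scores))]
print(f"F1 at 0.5: {f1_score(y_val, proba >= 0.5):.3f}, "
      f"F1 at tuned {best:.2f}: {max(scores):.3f}")
```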
pandas reached its 2.0 milestone in 2023. But what does that mean? And what is coming after 2.0? This talk will give an overview of what happened in the latest releases of pandas and highlight some topics and major new features the pandas project is working on.
Have you ever wanted to write a DataFrame-agnostic function, which should perform the same operation regardless of whether the input is pandas / polars / something else? Did you get stuck with special-casing to handle all the different APIs? Fear not, the DataFrame Standard is here to help!
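To make the pain point concrete, this is the kind of per-library branching the Standard aims to make unnecessary (an illustration of the problem, not of the Standard's API):

```python
import pandas as pd
import polars as pl

def add_total_column(df):
    """Same logical operation, written once per supported library."""
    if isinstance(df, pd.DataFrame):
        return df.assign(total=df["price"] * df["quantity"])
    if isinstance(df, pl.DataFrame):
        return df.with_columns((pl.col("price") * pl.col("quantity")).alias("total"))
    raise TypeError(f"Unsupported dataframe type: {type(df)}")

data = {"price": [2.0, 3.5], "quantity": [3, 2]}
print(add_total_column(pd.DataFrame(data)))
print(add_total_column(pl.DataFrame(data)))
```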
Since its release, ChatGPT has been widely adopted as "the" text generation tool used across all industries and businesses. This also includes the domain of scientific research, where we observe more and more scientific papers partially or even fully generated by AI. The same applies to the peer-review reports created while reviewing a paper.
What are the guidelines in the scientific research world? What is now the meaning of the written word and how do we build a model that can identify whether a text is AI-generated? What are the potential solutions to solve this important issue?
In this talk, we discuss how to detect AI-generated text and how to create a scalable architecture integrating this tool.
This maintainer track aims to lead discussions about the current needs for sparse data in the scientific Python ecosystem. It will present achievements and the continuation of the work initiated at the first Scientific Python Developer Summit, which took place from 22 to 28 May 2023.
Scientific code is often complex, resource-intensive, and sensitive to performance issues, making accurate timing and benchmarking critical for optimising performance and ensuring reproducibility. However, benchmarking scientific code presents several challenges, including variability in input data, hardware and software dependencies, and optimisation trade-offs. In this talk, I discuss the importance of timing and benchmarking for scientific code and outline strategies for addressing these challenges. Specifically, I emphasise the need for representative input data, controlled benchmarking environments, appropriate metrics, and careful documentation of the benchmarking process. By following these strategies, developers can effectively optimise code performance, select efficient algorithms and data structures, and ensure the reliability and reproducibility of scientific computations.
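Even before a full benchmarking setup, the standard library already supports repeatable micro-timings; a minimal sketch (the actual workloads discussed in the talk may differ):

```python
import timeit

setup = "import numpy as np; x = np.random.default_rng(0).random(1_000_000)"

# repeat() returns one total time per run; the minimum is usually the figure
# least affected by background noise on the machine.
python_sum = timeit.repeat("sum(x)", setup=setup, number=10, repeat=5)
numpy_sum = timeit.repeat("x.sum()", setup=setup, number=10, repeat=5)
print(f"builtin sum: {min(python_sum):.3f}s, numpy sum: {min(numpy_sum):.3f}s (10 calls each)")
```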
This talk will discuss the importance of Continuous Integration and Continuous Delivery (CI/CD) principles in the development of scientific applications, with a focus on creating robust and reproducible code that can withstand rigorous testing and scrutiny. The presentation will cover best practices for project structure and code organization, as well as strategies for ensuring reproducibility, collaboration, and managing dependencies. By implementing CI/CD principles in scientific application development processes, researchers can improve efficiency, reliability, and maintainability, ultimately accelerating research.
Python is slow. We feel the performance limitations when doing computationally intensive work. There are many libraries and methods to accelerate your computations, but which way to go? This talk serves as a navigation guide through the world of speeding up Python. At the end, you should have a high-level understanding of performance aspects and know which way to go when you want to speed up your code next time.
Solara is a pure Python web framework designed to scale complex applications. Leveraging a React-like API, Solara offers the scalability, component-based coding, and simple state management that have made React a standard for large web applications. Solara uses a pure Python implementation of React, Reacton, to create ipywidgets-based applications that work both in the Jupyter Notebook environment and as standalone web apps with frameworks like FastAPI. This talk will explore the design principles of Solara, illustrate its potential with case studies and live examples, and provide resources for attendees to incorporate Solara into their own projects. Whether you're a researcher developing interactive visualizations or a data scientist building complex web applications, Solara provides a Python-centric solution for scaling your projects effectively.
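As a flavour of the API, a minimal component with reactive state might look like this (a sketch assuming Solara's `solara.reactive` and `@solara.component` interface; run it with the `solara run` command or inside Jupyter):

```python
import solara

# Application state lives in a reactive value; components re-render when it changes
clicks = solara.reactive(0)

@solara.component
def Page():
    solara.Button(f"Clicked {clicks.value} times",
                  on_click=lambda: clicks.set(clicks.value + 1))
```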
NumPy is planning a 2.0 release early next year, replacing the 1.x releases. While we hope that the release will not be disruptive to most users, we do plan some larger changes that may affect many. These changes include modifications to the Python and C APIs, for example making the NumPy promotion rules more consistent around scalar values.
Chalk'it is an open-source framework that transforms Python scripts into distributable web app dashboards. It utilizes drag-and-drop widgets to establish an interface linked to a dataflow connecting Python code and various data sources. Chalk'it supports multiple Python graphics libraries, including Plotly, Matplotlib and Folium for interactive mapping and visualization. The framework operates entirely in web browsers using Pyodide. In our presentation, we will showcase Chalk'it, emphasizing its primary features, software architecture, and key applications, with a special focus on geospatial data visualization.
estimagic is a Python package for nonlinear optimization with or without constraints. It is particularly suited to solving difficult nonlinear estimation problems. On top of that, it provides functionality to perform statistical inference on estimated parameters.
In this presentation, we give a tour through estimagic's most notable features and explain its position in the ecosystem of Python libraries for numerical optimization.
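A minimal sketch of what running an optimization can look like, assuming estimagic's `minimize(criterion=..., params=..., algorithm=...)` interface (treat the exact signature as an assumption and check the current documentation):

```python
import estimagic as em
import numpy as np

def sphere(params):
    return params @ params            # scalar criterion to minimize

res = em.minimize(
    criterion=sphere,
    params=np.arange(5, dtype=float),
    algorithm="scipy_lbfgsb",
)
print(res.params)
```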
Polars is the "relatively" new fast dataframe implementation that redefines what DataFrames are able to do on a single machine, both in regard to performance and dataset size.
In this talk, we will dive into polars and see what makes them so efficient. It will touch on technologies like Arrow, Rust, parallelism, data structures, query optimization and more.
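For instance, a lazy query lets the engine optimize and parallelize the whole plan before executing it (a small sketch; method names such as `group_by` assume a recent Polars release):

```python
import polars as pl

df = pl.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

query = (
    df.lazy()
      .filter(pl.col("value") > 1.0)
      .group_by("group")
      .agg(pl.col("value").mean().alias("mean_value"))
)
print(query.collect())
```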
So you don't know JavaScript but know how to use Python? Do you want to build an app where you can draw molecules for some application like property prediction? Then come to this talk where I'll show you how to use Ketcher (EPAM's tool for small-molecule drawing), PyScript and RDKit for your next drug discovery app.
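On the chemistry side, the pieces fit together roughly like this: the drawn structure can be exported from Ketcher as SMILES, and RDKit turns it into properties (a minimal sketch of the RDKit part only):

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

smiles = "CC(=O)Oc1ccccc1C(=O)O"        # aspirin, as it might come from the sketcher
mol = Chem.MolFromSmiles(smiles)

print(Descriptors.MolWt(mol))           # molecular weight
print(Descriptors.MolLogP(mol))         # crude lipophilicity estimate
```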
Zarr is an API and cloud-optimized data storage format for large, N-dimensional, typed arrays, based on an open-source technical specification. In the last 4 years it grew from a Python implementation to a large ecosystem. In this talk, we want to share how this transformation happened and our lessons learned from this journey. Today, Zarr is driven by an active community, defined by an extensible specification, has implementations in C++, C, Java, Javascript, Julia, and Python, and is used across domains such as Geospatial, Bio-imaging, Genomics and other Data Science domains.
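A small example of the core API (illustrative; it creates a chunked array on local disk):

```python
import numpy as np
import zarr

# Create a chunked, N-dimensional on-disk array and read back a slice
z = zarr.open("example.zarr", mode="w", shape=(10_000, 10_000),
              chunks=(1_000, 1_000), dtype="f4")
z[0:1_000, 0:1_000] = np.random.default_rng(0).random((1_000, 1_000))
print(z.info)
print(z[0, :5])
```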
This slot will cover the effort regarding interoperability in the scientific Python ecosystem. Topics:
- Using the Array API for array-producing and array-consuming libraries
- DataFrame interchange and namespace APIs
- Apache Arrow: connecting and accelerating dataframe libraries across the PyData ecosystem
- Entry Points: Enabling backends and plugins for your libraries
Using the Array API for array-producing and array-consuming libraries
Already using the Array API or wondering if you should in a project you maintain? Join this maintainer track session to share your experience and exchange knowledge and tips around building array libraries that implement the standard or libraries that consume arrays.
DataFrame-agnostic code using the DataFrame API standard
The DataFrame Standard provides you with a minimal, strict, and predictable API, to write code that will work regardless of whether the caller uses pandas, polars, or some other library.
DataFrame Interchange protocol and Apache Arrow
The DataFrame interchange protocol and Arrow C Data interface are two ways to interchange data between dataframe libraries. What are the challenges and requirements that maintainers encounter when integrating this into consuming libraries?
Entry Points: Enabling backends and plugins for your libraries
In this talk, we will discuss how NetworkX used entry points to enable more efficient computation backends to plug into NetworkX.
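The mechanics are ordinary Python packaging machinery: plugins register themselves under an agreed entry-point group, and the host library discovers them at runtime. A generic sketch (the group name below is hypothetical, and `entry_points(group=...)` needs Python 3.10+ or the importlib_metadata backport):

```python
from importlib.metadata import entry_points

# "myproject.backends" is a hypothetical group name used for illustration
available = entry_points(group="myproject.backends")
backends = {ep.name: ep.load() for ep in available}
print(sorted(backends))   # the host library can now dispatch to any of these
```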
Today, state-of-the-art scientific research as well as industrial software development strongly depend on open source libraries. The demographic of the contributors to these libraries is predominantly white and male. In order to increase participation of groups who have been historically underrepresented in this domain, PyLadies Berlin, a volunteer-run community group focused on helping marginalised people professionally establish themselves in tech, has been running monthly hands-on open source hack nights for more than a year. After some initial challenges, the initiative yielded encouraging results. This talk summarises the learnings and shows how they can be applied in the wider open source community.
In this project we develop tools to identify birds within the spatial extent of a meteorological radar. Using the opportunities created by modern dual-polarization radars, we build graph neural networks to identify bird flocks. For this, the original point cloud data is converted to multiple undirected graphs following a set of predefined rules, which are then used as input to a graph convolutional neural network (Kipf and Welling, 2017, https://doi.org/10.48550/arXiv.1609.02907). Each node has a set of features such as range, x, y, z coordinates and several radar-specific parameters, e.g. differential reflectivity and phase shift, which are used to build the model and conduct graph-level classification. This tool will alleviate the problem of manual identification and labelling, which is tedious and time-intensive. Going forward we also focus on using the temporal information in the radar data. Repeated radar measurements enable us to track these movements across space and time. This makes it possible for regional movement studies to bridge the methodological gap between fine-scale, individual-based tracking studies and continental-scale monitoring of bird migration. In particular, it enables novel studies of the roles of habitat, topography and environmental stressors on movements that are not feasible with current methodology. Ultimately, we want to apply the methodology to data from continental radar networks to study movement across scales.
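For orientation, graph-level classification with the Kipf & Welling convolution can be sketched with PyTorch Geometric along these lines (an illustrative architecture, not the project's actual model):

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class GraphClassifier(torch.nn.Module):
    def __init__(self, num_node_features, hidden, num_classes):
        super().__init__()
        self.conv1 = GCNConv(num_node_features, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, num_classes)

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(x, edge_index))     # message passing over the graph
        x = F.relu(self.conv2(x, edge_index))
        x = global_mean_pool(x, batch)            # pool node features per graph
        return self.head(x)                       # graph-level class logits
```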
Have you ever wondered what type of data you can get about a certain location on the globe? What if I told you that you can access an enormous amount of information while sitting right there at your laptop? In this talk, I'll show you how to use Google Earth Engine to enrich your dataset. Whether you're exploring or planning your next ML project, geospatial data can provide you with a lot of information you did not know you had access to. Let me show you how!
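For example, building a median Sentinel-2 composite around a point and summarising a vegetation index takes only a few lines with the Earth Engine Python API (the dataset ID and coordinates are just examples; an Earth Engine account is required):

```python
import ee

ee.Authenticate()                         # interactive, once per machine
ee.Initialize()

point = ee.Geometry.Point(7.59, 47.56)    # lon/lat, roughly Basel

composite = (
    ee.ImageCollection("COPERNICUS/S2_SR")
      .filterBounds(point)
      .filterDate("2023-06-01", "2023-09-01")
      .median()
)
ndvi = composite.normalizedDifference(["B8", "B4"]).rename("NDVI")
print(ndvi.reduceRegion(ee.Reducer.mean(), point.buffer(500), scale=10).getInfo())
```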
The gallery of your project might group the examples by module, by use case, or some other logic. But as examples grow in complexity, they may be relevant for several groups. In this talk we discuss some possible solutions and their drawbacks to motivate the introduction of a new feature to sphinx-gallery: a content-based recommendation system.
By using Dask to scale out RAPIDS workloads on Kubernetes you can accelerate your workloads across many GPUs on many machines. In this talk, we will discuss how to install and configure Dask on your Kubernetes cluster and use it to run accelerated GPU workloads on your cluster.
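A rough sketch of the workflow, assuming the dask-kubernetes operator is installed on the cluster (the cluster name, image tag and sizes below are illustrative, not a recommendation):

```python
from dask.distributed import Client
from dask_kubernetes.operator import KubeCluster
import dask.array as da

# Spin up Dask workers as pods on the Kubernetes cluster (illustrative values)
cluster = KubeCluster(
    name="rapids-demo",
    image="rapidsai/rapidsai:23.08-cuda11.8-runtime-ubuntu22.04-py3.10",
    n_workers=4,
)
client = Client(cluster)

x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))
print(x.mean().compute())      # work is scheduled across the workers
```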
AI is poised to be "Our final invention," either the key to a never-ending utopia or a direct road to dystopia (or apocalypse). Even without the eschatological framing, it's still a revolutionary technology increasingly embedded in every aspect of our life, from smartphones to smart cities, from autonomous agents to autonomous weapons. In the face of acceleration, there can be no delay: if we want AI to shape a better tomorrow, we must discuss safety today.
Pyodide is a Python distribution for the browser and Node.js based on WebAssembly / Emscripten.
Pyodide supports most commonly used scientific Python packages, like numpy, scipy, scikit-learn and matplotlib, and there is growing interest in using it to improve package documentation through interactivity.
In this talk we will describe the work we have done in the past nine months to improve the state of Pyodide in a scientific Python context, namely:
- running the scikit-learn and scipy test suites with Node.js to get a view of what currently works, what does not, and what can hopefully be fixed one day
- packaging OpenBLAS in Pyodide and using it for the Pyodide scipy package to improve its stability, maintainability and performance
- adding JupyterLite functionality to sphinx-gallery, which is used for the example galleries of popular scientific Python packages like scikit-learn, matplotlib, scikit-image, etc.
- enabling the sphinx-gallery JupyterLite functionality for the scikit-learn example gallery
We will also mention some of the Pyodide sharp bits and conclude with some of the ideas we have to use it even more widely.
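For readers who have not tried it: inside a Pyodide-based environment (the Pyodide console, JupyterLite, ...), packages are installed with micropip rather than pip, and top-level await is available:

```python
# Run inside a Pyodide runtime, e.g. the Pyodide console or a JupyterLite notebook
import micropip
await micropip.install("scikit-learn")

from sklearn.linear_model import LinearRegression
print(LinearRegression())
```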
Pickle files can be evil: simply loading them can run arbitrary code on your system. This talk presents why that is, how it can be exploited, and how skops is tackling the issue for scikit-learn/statistical ML models. We go through some lower-level pickle-related machinery, and go into detail on how the new format works.
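The core of the problem fits in a few lines: unpickling calls whatever `__reduce__` returned, so loading an untrusted file is enough to execute attacker-controlled code (harmless demonstration below):

```python
import os
import pickle

class Exploit:
    def __reduce__(self):
        # Whatever is returned here is *called* at load time
        return (os.system, ("echo arbitrary code executed at load time",))

payload = pickle.dumps(Exploit())
pickle.loads(payload)   # merely loading the bytes runs the command
```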
Imagine a world where there are tools allowing any researcher to easily produce high-quality scientific websites. Where it's trivial to include rich interactive figures that connect to Jupyter servers or run in-browser with WASM and pyodide, all from a local folder of markdown files and Jupyter notebooks.
We introduce MyST Markdown (https://mystmd.org/), a set of open-source, community-driven tools designed for open scientific communication.
It's a powerful authoring framework that supports blogs, online books, scientific papers, preprints, reports and journal articles. It includes thebe, a minimal connector library for Jupyter, and thebe-lite, which bundles a JupyterLite server with pyodide into any web page for in-browser Python. It also provides publication-ready TeX and PDF generation from the same content base, minimising the rework of publishing to the web and to traditional services.
Python versioning is a critical aspect of maintaining a consistent ecosystem of packages, yet it can be challenging to get right. In this talk, we will explore the difficulties of Python versioning, including the need for upper bounds, and discuss mitigation strategies such as lockfiles in the Python packaging ecosystem (pip, poetry, and conda / mamba). We will also highlight a new community effort to analyze Python libraries dynamically and statically to detect the symbols (or libraries) they are using. By analyzing symbol usage, we can predict when package combinations will start breaking with each other, achieving a high rate of correct predictions. Our goal is to gather more community inputs to create a robust compatibility matrix. Additionally, we are doing similar work in C/C++ using libabigail to address ABI problems.
Rigid transformations in 3D are complicated due to the multitude of different conventions and because they often form complex graphs that are difficult to manage. In this talk I will give a brief introduction to the topic and present the library pytransform3d as a set of tools that can help you tame the complexity. Throughout the talk I will use examples from robotics (imitation learning, collision detection, state estimation, kinematics) to motivate the discussed features, even though the presented solutions are useful beyond robotics.
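As a taste of the library, transforms between named frames can be chained and inverted automatically with the TransformManager (a minimal sketch; frame names and poses are invented):

```python
import numpy as np
from pytransform3d.transformations import transform_from
from pytransform3d.transform_manager import TransformManager

# Homogeneous 4x4 transforms between named frames (identity rotations for brevity)
cam2robot = transform_from(R=np.eye(3), p=np.array([0.3, 0.0, 0.5]))
object2cam = transform_from(R=np.eye(3), p=np.array([0.0, 0.1, 1.0]))

tm = TransformManager()
tm.add_transform("camera", "robot", cam2robot)
tm.add_transform("object", "camera", object2cam)

# The manager walks the frame graph, chaining and inverting as needed
print(tm.get_transform("object", "robot"))
```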
Could scikit-learn's future be GPU-powered? This talk will discuss the performance improvements that GPU computing could bring to existing scikit-learn algorithms, and will describe a plugin-based design that is being envisioned to open up scikit-learn to faster compute backends, with special concern for user-friendliness, ease of installation, and interoperability.
In this talk, we will discuss incident management using Hawkes processes within an IT infrastructure. We show how a model previously applied for earthquake predictions can help answer the question ‘what caused what’ in a major European bank.
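For reference, a linear Hawkes process with the commonly used exponential kernel is defined by the conditional intensity (standard notation, not necessarily the exact parametrisation used in the talk):

```latex
\lambda(t) = \mu + \sum_{t_i < t} \alpha \, e^{-\beta (t - t_i)}
```

where \mu is the baseline event rate, each past event at time t_i temporarily raises the intensity by \alpha, and \beta controls how quickly that excitation decays; this self-exciting structure is what lets the model attribute which incidents triggered which.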
The graphic server protocol is a proposal to mutualize efforts across scientific visualization libraries, languages and platforms such as to provide a unified intermediate-level protocol to render graphical primitives independently of the specifics of the high-level visualization interfaces.
This talk discusses using the pandas API on Apache Spark to handle big data, and the introduction of Pandas Function APIs. Presented by an Apache Spark committer and a product manager, it offers technical and managerial insights.
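A flavour of both APIs (a short sketch assuming a working Spark installation; data and schema are invented):

```python
import pyspark.pandas as ps

# pandas-like syntax, executed by Spark
psdf = ps.DataFrame({"group": ["a", "a", "b"], "value": [1.0, 2.0, 3.0]})
print(psdf.groupby("group")["value"].mean())

# Pandas Function API: apply a pandas function to each batch of a Spark DataFrame
sdf = psdf.to_spark()

def add_one(batches):
    for pdf in batches:                 # each batch arrives as a pandas DataFrame
        yield pdf.assign(value=pdf["value"] + 1)

sdf.mapInPandas(add_one, schema="group string, value double").show()
```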
Handling and analyzing massive data sets is highly important for the vast majority of research communities, but it is also challenging, especially for those communities without a background in high-performance computing (HPC). The Helmholtz Analytics Toolkit (Heat) library offers a solution to this problem by providing memory-distributed and hardware-accelerated array manipulation, data analytics, and machine learning algorithms in Python, targeting the usage by non-experts in HPC.
In this presentation, we will provide an overview of Heat's current features and capabilities and discuss its role in the ecosystem of distributed array computing and machine learning in Python.
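A minimal sketch of the programming model, assuming Heat's NumPy-like API where a `split` axis distributes an array across MPI processes (check the Heat documentation for the exact interface):

```python
import heat as ht

# The split argument distributes the array along axis 0 across the MPI processes;
# launch with e.g.: mpirun -n 4 python this_script.py
x = ht.arange(1_000_000, split=0)
print(x.sum())     # reductions run on the distributed array
print(x.mean())
```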
The use of AI documentation, such as repository cards (model and dataset cards), as a means of transparently discussing ethical and inclusivity problems that may be found in the outputs and/or during the creation of AI artefacts, with the aim of inclusivity, fairness and accountability, has increasingly become part of the ML discourse. Documentation approaches centred on limitations and risks have become more standard and are now anticipated with the launch of new developments, e.g. the ChatGPT/GPT-4 system card and other LLM model cards.
This talk highlights the inclusive approaches that the broader open source community could explore when thinking about their aims when creating documentation.