JupyterLab is very widely used in the Python scientific community. Most, if not all, of the other tutorials will use Jupyter as a tool. Therefore, a solid understanding of the basics is very helpful for the rest of the conference as well as for your later daily work.
This tutorial provides an overview of important basic Jupyter features.
Every scientific conference has seen a massive uptick in applications that use some type of machine learning. Whether it’s a linear regression using scikit-learn, a transformer from Hugging Face, or a custom convolutional neural network in Jax, the breadth of applications is as vast as the quality of contributions.
This tutorial aims to provide easy ways to increase the quality of scientific contributions that use machine learning methods. The reproducibility aspect will make it easy for fellow researchers to use and iterate on a publication, increasing citations of the published work. Appropriate validation techniques and improved code quality accelerate the review process during publication and help avoid rejection due to deficiencies in the methodology. Making models, code and possibly data available increases the visibility of the work and enables easier collaboration on future work.
This work to make machine learning applications reproducible has an outsized impact compared to the limited additional work that is required using existing Python libraries.
This tutorial will provide an introduction to Python intended for beginners.
It will notably introduce the following aspects (a short illustrative sketch follows the list):
- built-in types
- control flow (e.g. conditions, loops)
- built-in functions
- basic Python classes
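A minimal, self-contained sketch touching each of these topics (the names and values are purely illustrative):

```python
numbers = [1, 2, 3, 5, 8]          # a built-in list type
for n in numbers:                  # control flow: a loop ...
    if n % 2 == 0:                 # ... and a condition
        print(n, "is even")

print(len(numbers), sum(numbers))  # built-in functions

class Greeter:                     # a basic Python class
    def __init__(self, name):
        self.name = name

    def hello(self):
        return f"Hello, {self.name}!"

print(Greeter("world").hello())
```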
This tutorial will introduce how to leverage scikit-learn's powerful histogram-based gradient-boosted regression trees with various loss functions (least squares, Poisson, and the pinball loss for quantile estimation) on a time series forecasting problem. We will see how to use pandas to build lag and windowing features, and how to use scikit-learn's time-series cross-validation and other model evaluation tools.
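A minimal sketch of this workflow (the series, feature names and quantile level are illustrative; the quantile loss requires a recent scikit-learn version):

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_pinball_loss, make_scorer
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# hypothetical hourly demand series
df = pd.DataFrame({"demand": range(500)},
                  index=pd.date_range("2022-01-01", periods=500, freq="H"))
# lag and windowing features built with pandas
df["lag_24"] = df["demand"].shift(24)
df["rolling_mean_24"] = df["demand"].shift(1).rolling(24).mean()
df = df.dropna()

X, y = df[["lag_24", "rolling_mean_24"]], df["demand"]
model = HistGradientBoostingRegressor(loss="quantile", quantile=0.95)  # pinball loss
cv = TimeSeriesSplit(n_splits=5)
pinball = make_scorer(mean_pinball_loss, alpha=0.95, greater_is_better=False)
print(cross_val_score(model, X, y, cv=cv, scoring=pinball).mean())
```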
This tutorial will guide you towards good evaluation of machine-learning models, choosing metrics and procedures that match the intended usage, with code examples using the latest scikit-learn features. We will discuss how good metrics should characterize all aspects of error, e.g. on the positive and negative class: the probability of a detection, or the probability of a true event given a detection; and how they may need to cater for class imbalance. Metrics may also evaluate confidence scores, e.g. calibration. Model-evaluation procedures should gauge not only the expected generalization performance, but also its variations.
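A short sketch of the kind of evaluation covered, on a made-up imbalanced dataset:

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
clf = LogisticRegression(max_iter=1000)

# complementary metrics, evaluated over repeated splits to see their variation
scores = cross_validate(clf, X, y, cv=10,
                        scoring=["balanced_accuracy", "average_precision", "neg_brier_score"])
for name, values in scores.items():
    if name.startswith("test_"):
        print(name, round(values.mean(), 3), "+/-", round(values.std(), 3))

# calibration of the predicted probabilities on a held-out part of the data
clf.fit(X[:1500], y[:1500])
prob_true, prob_pred = calibration_curve(y[1500:], clf.predict_proba(X[1500:])[:, 1], n_bins=10)
```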
This tutorial will provide an introduction to the NumPy library intended for beginners.
NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
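For instance, a few lines already show several of these capabilities:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)        # a multidimensional array
print(a.sum(axis=0))                   # fast operations along an axis
print(a.T @ a)                         # basic linear algebra
rng = np.random.default_rng(0)         # random simulation
print(np.sort(rng.normal(size=5)))     # sorting
```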
The audio (& speech) domain is going through a massive shift in terms of end-user performance. It is at the same tipping point as NLP was in 2017, before the Transformers revolution took over. We’ve gone from needing copious amounts of data to create Spoken Language Understanding systems to just needing a 10-minute snippet.
This tutorial will help you create strong code-first & scientific foundations in dealing with audio data and build real-world applications like Automatic Speech Recognition (ASR), Audio Classification, and Speaker Verification using backbone models like Wav2Vec2.0, HuBERT, etc.
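As a taste of the code-first approach, a hedged sketch of speech recognition with a Wav2Vec2.0 backbone via the Hugging Face pipeline API (the model name and audio file are illustrative):

```python
from transformers import pipeline

# "facebook/wav2vec2-base-960h" is one publicly available Wav2Vec2.0 checkpoint
asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
result = asr("sample.wav")   # any local audio file
print(result["text"])
```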
This tutorial is an introduction to pandas intended for beginners.
pandas is one of Python's core packages for data science. pandas organizes data into DataFrames and provides powerful methods for manipulating them. The library is built on top of NumPy. It'll be helpful for the tutorial if you have some experience with NumPy arrays, for example, by following the Introduction to NumPy tutorial.
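A tiny example of the DataFrame workflow (the data are made up):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Basel", "Basel", "Bern"],
                   "temp": [21.5, 23.1, 19.8]})
print(df.groupby("city")["temp"].mean())   # split-apply-combine in one line
print(df["temp"].to_numpy())               # columns are NumPy arrays underneath
```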
In this tutorial we will go through the main features of the PyTorch framework for deep learning. We will start by learning how to build a neural network from the ground up, diving deep into torch.tensor, Dataset and optimisers. We will analyse data from different domains (e.g. numerical, images), introducing different neural network layers and architectures. Last but not least, a few tips from a pure data-science perspective will be shared, to appreciate the wonderful integration PyTorch has with the Python data model!
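A compact sketch of the kind of training loop built during the tutorial (data and architecture are illustrative):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# made-up regression data
X = torch.randn(256, 3)
y = X @ torch.tensor([1.0, -2.0, 0.5]) + 0.1 * torch.randn(256)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))
optimiser = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for xb, yb in loader:
        optimiser.zero_grad()
        loss = loss_fn(model(xb).squeeze(-1), yb)
        loss.backward()
        optimiser.step()
```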
This tutorial will provide an introduction to SciPy intended for beginners.
SciPy is a collection of mathematical algorithms and convenience functions built on the NumPy extension of Python. It adds significant power to the interactive Python session by providing the user with high-level commands and classes for manipulating and visualizing data.
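For example, two of its submodules in action:

```python
from scipy import optimize, stats

# optimization: minimum of a simple one-dimensional function
res = optimize.minimize_scalar(lambda x: (x - 2) ** 2 + 1)
print(res.x)

# statistics: fit a normal distribution to (simulated) data
sample = stats.norm.rvs(loc=5, scale=2, size=1000, random_state=0)
print(stats.norm.fit(sample))
```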
This tutorial is an introduction to geospatial data analysis, with a focus on tabular vector data using GeoPandas. It will show how GeoPandas and related libraries can improve your workflow (importing GIS data, visualizing, joining and preparing for analysis, exploring spatial relationships, …).
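A minimal sketch of such a workflow (the file name and column are placeholders):

```python
import geopandas as gpd

gdf = gpd.read_file("districts.geojson")   # import GIS data (any vector format)
gdf = gdf.to_crs(epsg=3857)                # reproject before measuring
gdf["area_km2"] = gdf.geometry.area / 1e6  # prepare for analysis
gdf.plot(column="area_km2", legend=True)   # quick choropleth for exploration
```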
This tutorial will provide a beginner introduction to scikit-learn. Scikit-learn is a Python package for machine learning.
This tutorial will be subdivided into three parts. First, we will present how to design a predictive modeling pipeline that deals with heterogeneous types of data. Then, we will go more into detail in the evaluation of models and the type of trade-off to consider. Finally, we will show how to tune the hyperparameters of the pipeline.
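A sketch of such a pipeline (the column names are illustrative):

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# heterogeneous data: scale the numerical columns, encode the categorical ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country", "job"]),
])
pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

# hyperparameter tuning of the whole pipeline
search = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5)
# search.fit(X_train, y_train)   # X_train / y_train are assumed to exist
```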
Image data are used in many scientific fields such as astronomy, life sciences or material sciences. This tutorial will walk you through image processing with the scikit-image library, which is the numpy-native image processing library of the scientific python ecosystem.
The first hour of the tutorial will be accessible to beginners in image processing (some experience with NumPy arrays is a prerequisite), and will focus on basic concepts of digital image manipulation and processing (filters, segmentation, measures). In the last half hour, we will focus on more advanced aspects, and in particular Emma will speak about performance and acceleration of image processing.
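The basic concepts fit in a few lines, here on a sample image shipped with scikit-image:

```python
from skimage import data, filters, measure

image = data.coins()                                 # sample image
smooth = filters.gaussian(image, sigma=1)            # filtering
binary = smooth > filters.threshold_otsu(smooth)     # segmentation by thresholding
labels = measure.label(binary)                       # connected components
print(len(measure.regionprops(labels)), "objects")   # measures per object
```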
This tutorial explains the fundamental ideas and concepts of matplotlib. It's suited for complete beginners to get started as well as existing users who want to improve their plotting abilities and learn about best practices.
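For instance, the object-oriented interface that the tutorial builds on:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 200)
fig, ax = plt.subplots()               # Figure and Axes objects
ax.plot(x, np.sin(x), label="sin")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
plt.show()
```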
This workshop is for data scientists and other programmers who want to add another tool to their data science toolkit: modelling, analysing and visualising data as networks! Network Science deals with analysing network data, and the data can come from different fields like politics, finance, computer science, law and even Game of Thrones!
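A tiny NetworkX example in that spirit (the edges are made up):

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("Jon", "Sansa"), ("Jon", "Arya"),
                  ("Sansa", "Arya"), ("Tyrion", "Jon")])
print(nx.degree_centrality(G))           # who is most connected?
print(list(nx.connected_components(G)))  # community structure at a glance
```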
Modern accelerators (graphics processing units and tensor processing units) allow for high performance computing at massive scale. JAX traces computation in Python programs through the familiar numpy API, and uses XLA to compile programs that run efficiently on these accelerators. A set of composable function transformations allows for expressing versatile scientific computing with an elegant syntax.
Flax provides abstractions on top of JAX that make it easy to handle weights and other state that is required for solving problems using neural networks.
This talk first presents the basic JAX API that allows for computing gradients, compiling functions, or vectorizing computation. It then proceeds to cover other parts of the JAX ecosystem commonly used for neural network programming, such as basic building blocks and optimizers.
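The core transformations mentioned above compose freely, as in this small sketch:

```python
import jax
import jax.numpy as jnp

def loss(w, x):
    return jnp.sum((x @ w) ** 2)

grad_loss = jax.grad(loss)                    # automatic differentiation
fast_loss = jax.jit(loss)                     # XLA compilation
batched = jax.vmap(loss, in_axes=(None, 0))   # vectorise over a batch

w = jnp.ones(3)
x = jnp.arange(12.0).reshape(4, 3)
print(grad_loss(w, x), fast_loss(w, x), batched(w, x))
```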
Automatic image processing is a common task in many scientific and technological fields such as life sciences (with medical imaging), satellite imaging, etc. While machine learning is often used for efficient processing of such data sets, building a high-quality training set is an important task. Specialized software (such as rootpainter or ilastik) exists in different communities to build such training sets from user annotations drawn on images.
In this talk, I will show how to use the open-source libraries plotly and dash to build custom interactive applications for interactive image annotation, and how to combine these tools with libraries such as scikit-image or machine learning/deep learning libraries for building a whole image processing pipeline.
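A minimal sketch of such an annotation app (Dash >= 2.0; the sample image and drawing mode are illustrative):

```python
from dash import Dash, dcc, html
import plotly.express as px
from skimage import data

fig = px.imshow(data.camera(), binary_string=True)
fig.update_layout(dragmode="drawclosedpath")   # let the user draw annotation outlines

app = Dash(__name__)
app.layout = html.Div([
    dcc.Graph(figure=fig,
              config={"modeBarButtonsToAdd": ["drawclosedpath", "eraseshape"]}),
])

if __name__ == "__main__":
    app.run(debug=True)   # drawn shapes arrive through the graph's relayoutData callback
```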
Python is the most popular programming language in the data space and is one of the major drivers of many advancements in machine learning. However, it is much less known that the Python library Pyomo is a great tool for solving mathematical optimization problems common in operations research.
In this talk I will demonstrate how Pyomo can be used to find optimal decisions when data is uncertain and how to combine data-driven forecasts with optimal decision making.
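To give a flavour of Pyomo, a toy deterministic model (the numbers are made up; a solver such as GLPK or CBC must be installed separately):

```python
from pyomo.environ import (ConcreteModel, Var, Objective, Constraint,
                           NonNegativeReals, maximize, SolverFactory)

m = ConcreteModel()
m.x = Var(within=NonNegativeReals)   # units of product A
m.y = Var(within=NonNegativeReals)   # units of product B
m.profit = Objective(expr=40 * m.x + 30 * m.y, sense=maximize)
m.capacity = Constraint(expr=m.x + 2 * m.y <= 100)
m.demand = Constraint(expr=m.x <= 40)

# SolverFactory("glpk").solve(m)
# print(m.x(), m.y())
```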
This session focuses on issues related to education in the ecosystem from three different angles, covering recent advances as well as existing and upcoming challenges.
- Materials: how projects are dealing with documentation and educational materials
- Methods: What should we do to make our materials more accessible to underrepresented and/or historically marginalised groups?
- Tools: What are the existing tools in the ecosystem helping us achieve the above goals, and what do we need to develop?
We will give an overview of these different aspects.
In my current work as a contributor experience lead, I am supporting and growing Matplotlib’s and Pandas’ communities by organizing events, meetings, and proactive engagement with a focus on equity and inclusion of historically marginalized groups. In my talk I’ll give an introduction to this new role, the grant that supports it, and some of the work done so far…
I will share takeaways for maintainers and contributors: from simple changes that can be implemented relatively easily, to bigger topics one might want to learn more about in order to slowly yet proactively facilitate changes that tweak the contributor experience for a project.
The Pythran compiler is used to speed up generic Python scientific kernels across the world. Through ten code samples taken from the scipy and scikit-image codebases and Stack Overflow snippets, this talk will demonstrate the major features of the compiler, as well as some technical nits!
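As a flavour of how Pythran is used, a hedged sketch of a kernel (the function is illustrative; it stays valid Python/NumPy when not compiled):

```python
# kernel.py -- compile with `pythran kernel.py` to obtain a native extension module
# pythran export pairwise_distance(float64[:,:])
import numpy as np

def pairwise_distance(X):
    """Naive pairwise Euclidean distances; Pythran turns the loops into fast native code."""
    n = X.shape[0]
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = np.sqrt(np.sum((X[i] - X[j]) ** 2))
    return out
```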
The conda-forge project is one of the fastest growing Open Source communities out there – and most data scientists have probably heard of it. In this talk we explain the inner workings of conda-forge, its relationship to conda and PyPI, and we will explain how everyone can package software with conda-forge.
Today, state-of-the-art scientific research strongly depends on open source libraries. The demographic of the contributors to these libraries is predominantly white and male [1][2][3][4]. In recent years there have been a number of recommendations and initiatives to increase the participation in open source projects of groups who are underrepresented in this domain [1][3][5][6]. While these efforts are valuable and much needed, contributor diversity remains a challenge in open source communities [2][3][7]. This talk highlights the underlying problems and explores how we can overcome them.
This is part of the maintainers track.
In this session, we want to share some updates on the DataFrame ecosystem: the DataFrame interchange protocol (https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html) and Arrow C Data interface (https://arrow.apache.org/docs/format/CDataInterface.html), and the integration of those interoperability protocols with different libraries. Further, we want to have an open conversation about challenges and requirements related to DataFrame interoperability and supporting multiple DataFrame libraries in projects.
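As a quick illustration of the interchange protocol with pandas (>= 1.5); other libraries expose the same `__dataframe__` entry point:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
interchange_object = df.__dataframe__()   # the protocol entry point
print(interchange_object.num_columns(), list(interchange_object.column_names()))

# a consumer library rebuilds its own DataFrame from any protocol-compliant object
roundtrip = pd.api.interchange.from_dataframe(df)
```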
Mamba is a fast, cross-platform and language independent package manager that is fully compatible with conda packages.
It has enabled the conda-forge project to scale way beyond what was previously possible.
In this talk we present further innovations in the mamba ecosystem, including boa, a new build tool based on mamba, and quetz, an open-source and extensible package server for conda packages.
SymPy is an open source computer algebra system (CAS) written in Python.
The recent addition of the array expression module provides an alternative to the matrix expression module, with generalized support for higher dimensions (matrices are constrained to 2 dimensions).
Given the importance of multidimensional arrays in machine learning and mathematical optimization problems, this talk will illustrate examples of tensorial expressions in mathematics and how they can be manipulated using either module, or in index-explicit form.
Conversion tools have been provided to SymPy to allow users to switch an expression between the array form and either the matrix or index-explicit form. In particular, the conversion from array to matrix form attempts to represent contractions, diagonalizations and axis-permutations with operations commonly used in matrix algebra, such as matrix multiplication, transposition, trace, Hadamard and Kronecker products.
A gradient algorithm for array expressions has been implemented, returning a closed-form array expression equivalent to the derivative of arrays by arrays. The derivative algorithm for matrix expressions now uses this algorithm, attempting to convert the array back to matrix form if trivial dimensions can be dropped.
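A small, hedged illustration of a matrix-expression derivative backed by this machinery (the exact helper names for the array/matrix conversions vary between SymPy versions):

```python
from sympy import MatrixSymbol, Trace

X = MatrixSymbol("X", 3, 3)
A = MatrixSymbol("A", 3, 3)
expr = Trace(X.T * A * X)
print(expr.diff(X))   # a closed-form matrix expression, e.g. A*X + A.T*X
```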
We will explain a mechanism for generating neural network glyphs, like the glyphs we use in human languages. Glyphs are purposeful marks, images with 2D structures used to communicate information. We will use neural networks to generate those structured images, by optimizing for robustness.
How can a discrete event simulation help mining companies reduce their dependence on diesel as a fuel for their large haulage trucks? Using open source software, mining environments are modelled to support decision making for building an all-electric mine, where diesel-powered vehicles are made obsolete.
Memory-mapped files are an underused tool in machine learning projects. They offer very fast I/O operations, making them suitable for storing datasets that do not fit into memory during training.
In this talk, we will discuss the benefits of using memory maps, their downsides, and how to address them.
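A minimal NumPy memory-map sketch (the file name and shape are placeholders):

```python
import numpy as np

# create a memory-mapped array backed by a file on disk
arr = np.memmap("features.dat", dtype="float32", mode="w+", shape=(100_000, 128))
arr[:10] = np.random.rand(10, 128)
arr.flush()

# later, reopen read-only without loading everything into RAM
ro = np.memmap("features.dat", dtype="float32", mode="r", shape=(100_000, 128))
batch = ro[0:256]   # only the touched pages are read from disk
```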
Deep learning can assist radiology doctors in interpreting and analyzing radiology images. We will present use cases which are used today in clinical practice. These range from organ segmentation to image classification.
This talk will take state-of-the-art Python models and show how, through advanced inference techniques, we can drastically increase the performance of the models at runtime. You’ll learn about the open source MLServer project and see live how easily it helps serve Python-based machine learning models.
This session is part of the maintainers track.
Recently it became possible to run Python and the scientific Python packages in the browser thanks to WebAssembly and Emscripten. This is done in particular in the Pyodide and emscripten-forge projects. It allows for a scientific Python application, or a compute environment such as JupyterLite, to be seamlessly accessible to a large number of users with very little effort or infrastructure requirements.
At the same time, the scientific Python ecosystem did not evolve with the web in mind. We will discuss some of the challenges package maintainers may face when trying to run their package in the browser, and what could be done to overcome these.
Computer chips are created using photolithography. Today's lithography machines are highly complex machines containing ultra-high precision optics. How do you create and in particular measure these optics? That's easy, you build the world's best interferometer. But what if that's not enough?
Identifying the right tools for high-performance machine learning can be overwhelming as the ecosystem continues to grow at break-neck speed. This becomes particularly pronounced when dealing with the increasingly popular large language and image-generation models such as GPT2, OPT and DALL-E, among others. In this session we will dive into a practical showcase where we will productionise the large image-generation model DALL-E, and demonstrate some optimizations that can be introduced, as well as considerations as the use cases scale. By the end of this session practitioners will be able to run their own DALL-E powered applications, as well as integrate them with functionality from other large language models like GPT2. We will be leveraging key tools in the Python ecosystem to achieve this, including Pytorch, HuggingFace, FastAPI and MLServer.
Privacy is becoming an increasingly pressing topic in data collection and data science. Thankfully, Privacy Enhancing Technologies (or PETs) are maturing alongside the growing demand and concern. In this keynote, we’ll explore what possibilities emerge when using Privacy Enhancing Technology like differential privacy, encrypted computation and federated learning and investigate how these technologies could change the face of data science today.
This talk explains why Python is a good choice for research and development. It spans the arc from a conceptual, almost philosophical, understanding of the software needs of research and development up to concrete organizational strategies.
What would the world look like if Russia had won the cold war? If the Boston Tea Party never happened? And where would we all be if Guido van Rossum had decided to pursue a career in theatre? Unfortunately we don't have the technology to slide into parallel worlds and explore alternative histories. However it turns out we do have the tools to simulate parallel realities and give decent answers to intriguing 'what if' questions. This talk will provide a gentle introduction to these tools, professionally known as Causal Inference.
The Scientific Python project aims to better coordinate the ecosystem and grow the community. This session focuses on our efforts to better coordinate project development, and to improve shared infrastructure. In this session together we will discuss project goals and recent technical work.
The Scientific Python project’s vision is to help pave the way towards a unified, expanded scientific Python community. It focuses its efforts along two primary axes: (i) to create a joint community around all scientific projects and (ii) to support maintainers by building cross-cutting technical infrastructure and tools. In this session we mostly focus on the second aspect.
The project has already launched a process whereby projects can, voluntarily, adopt reference guidelines; these are known as SPECs or Scientific Python Ecosystem Coordination documents. SPECs are similar to project-specific guidelines like PEPs, NEPs, SLEPs, and SKIPs, to name a few. The distinction is that SPECs have a broader scope, targeted at all (or most) projects in the scientific Python ecosystem.
The project also provides and maintains tools to help maintainers. This includes a theme for the project websites (used on, e.g., numpy.org and scipy.org), a self-hosted privacy-friendly web analytics platform, a community discussions forum, a technical blog, and project development statistics.
We present these tools, discuss various upcoming SPECs, and highlight the project’s future potential.
This talk will cover how to build predictive models that handle missing values well, using scikit-learn. It will give, on the one side, the statistical considerations, both classic statistical missing-values theory and recent developments in machine learning, and on the other side how to efficiently code solutions.
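A brief sketch of the two main coding strategies (toy data for illustration):

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# option 1: impute (keeping a missingness indicator), then fit a linear model
linear = make_pipeline(SimpleImputer(strategy="mean", add_indicator=True), Ridge()).fit(X, y)

# option 2: gradient-boosted trees in scikit-learn handle NaN natively
trees = HistGradientBoostingRegressor().fit(X, y)
```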
The Mission Support System Software (MSS) is a client/server application developed in the community to collaboratively create flight plans based on model data. Through conda-forge, the components of MSS can be used on different platforms.
Extreme weather events are a well-known source of human suffering, loss of life, and financial hardship. Amongst these, tropical cyclones are notoriously impactful, leading to significant interest in predicting the genesis, tracks, and intensity of these storms - a task which continues to present significant challenges. In particular, tropical cyclogenesis (TCG) can be described as a "needle in a haystack" problem, and steps must be taken to make predictions tractable. Previously, the filtering of non-genesis points by thresholding predictive variables has been described, with thresholds being selected to reduce the number of discarded TCG cases. In practice, this thresholding has often been carried out empirically, which, while effective, relies on domain knowledge. This talk instead proposes a systematic, machine-learning-based approach implemented in Python. The method is designed to be interpretable to the point of becoming transparent machine learning. Threshold values that minimize the false-alarm rate and maintain a high recall are found, and then combined in a forward selection algorithm. As other extreme events in the geosciences are also needle-in-a-haystack problems, the described approach can be of use in reducing the variable space in which to study and predict such events. Finally, the transparent nature of the proposed approach can provide simple insight into the conditions in which these events occur.
Mathematical optimization is the selection of the best alternative with respect to some criterion, among a set of candidate options.
There are multiple applications of mathematical optimization. For example, in investment portfolio optimization, we search for the best way to invest capital given different alternatives. In this case, an optimization problem will allow us to choose a portfolio that minimizes risk (or maximizes profit), among all possible allocations that meet the defined requirements.
In most cases, mathematical optimization is used as a tool to facilitate decision-making. Sometimes these decisions can be made automatically in real-time.
This talk will explore how to formulate and solve mathematical optimization problems with Python, using different optimization libraries.
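For instance, a toy minimum-risk portfolio can be written with scipy.optimize (one of several possible libraries; the numbers are made up):

```python
import numpy as np
from scipy.optimize import minimize

cov = np.array([[0.10, 0.02, 0.04],    # made-up covariance of asset returns
                [0.02, 0.08, 0.01],
                [0.04, 0.01, 0.12]])
mu = np.array([0.05, 0.07, 0.06])      # made-up expected returns

def risk(w):
    return w @ cov @ w                 # portfolio variance

constraints = [
    {"type": "eq", "fun": lambda w: w.sum() - 1},       # fully invested
    {"type": "ineq", "fun": lambda w: w @ mu - 0.06},   # minimum expected return
]
res = minimize(risk, x0=np.ones(3) / 3, bounds=[(0, 1)] * 3, constraints=constraints)
print(res.x)   # minimum-variance weights meeting the return requirement
```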
This is part of the maintainers track.
Most of us have been hearing about Diversity, Equity and Inclusion (DEI) for some years now, and have even had access to many resources by now. Our projects have codes of conduct, and some have been doing sprints and mentorships. But how much has fundamentally changed?
Let’s meet for an honest conversation about the challenges of DEI actions, and culture change. How do we achieve long-term impact? What are low-hanging fruit? We can share hard-to-ask questions, effective tools, experiences that shaped our approach, and see if we can all nudge each other forward a little.
Inclusion happens at the community level, also when we want to address DEI itself. So, we will need to create a safe space for hard questions and leave judgment at the door.
Thanks to our grant to advance an inclusive culture in the scientific Python ecosystem, we have created the contributor experience lead role. We have been working with NumPy, SciPy, Matplotlib, and pandas to learn how to integrate this new role into a project, and how to introduce contributor hospitality techniques. We are working on creating widely available resources, and we would benefit from hearing from maintainers from the wider community.
At present, energy prices are continuously rising, so it is important to optimize the use of heat pumps, both in domestic and industrial environments. Using a suitably labelled dataset of accelerometer, speed, or relative-position measurements over time from a cheap sensor, it is possible to estimate the I/O state of any heating or cooling engine. This new real-time measurement then makes it possible to compute the energy consumption and to study the cheapest usage scheme.
In this presentation we will show a real-case implementation of some fast binary classifiers, from basic statistics to machine learning, assessing the performance of each method in terms of computational time, precision and accuracy levels.
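A hedged sketch of the comparison, with synthetic stand-ins for the sensor features and labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))                    # e.g. RMS and variance per window
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)     # stand-in for labelled on/off states

# baseline: a single threshold on one statistic
baseline = (X[:, 0] > 0).astype(int)
print("threshold accuracy:", (baseline == y).mean())

# machine-learning classifier for comparison
print("logistic regression accuracy:",
      cross_val_score(LogisticRegression(), X, y, cv=5).mean())
```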
In this talk, I'll give an overview of software quality and why it's important - especially for scientists. I'll provide best practices and libraries to dive deeper into, hype to ignore, and simple guidelines to follow to write code that your peers will love.
After the talk, the audience will have a guide on how to develop better code and be aware of potential blind spots.
In this talk, we will look at the growing Python in the browser ecosystem, with a focus on the Pyodide project. We will discuss the remaining challenges as well as new possibilities it offers for scientific computing, education, and research.
JupyterLite is a Jupyter distribution that runs entirely in the web browser, backed by in-browser language kernels including WebAssembly powered Jupyter Xeus kernels and Pyodide.
JupyterLite enables data science and interactive computing with the PyData scientific stack, directly in the browser, without installing anything or running a server.
JupyterLite leverages the Emscripten and Conda Forge infrastructure, making it possible to easily install custom packages with binary extensions in the browser, such as numpy, scipy and scikit-learn.
Fairness, accountability, and transparency in machine learning have become a major part of the ML discourse. Since these issues have attracted attention from the public, and certain legislation is being put in place to regulate the usage of machine learning in certain domains, the industry has been catching up with the topic and a few groups have been developing toolboxes to allow practitioners to incorporate fairness constraints into their pipelines and make their models more transparent and accountable. Some examples are fairlearn, AIF360, LiFT, fairness-indicators (TF), ...
This talk explores some of the tools existing in this domain and discusses work being done in scikit-learn to make it easier for practitioners to adopt these tools.
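For example, fairlearn's MetricFrame disaggregates any metric by a sensitive attribute (the data here are made up):

```python
from fairlearn.metrics import MetricFrame, selection_rate
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]
group = ["a", "a", "a", "b", "b", "b", "b", "a"]

mf = MetricFrame(metrics={"accuracy": accuracy_score, "selection_rate": selection_rate},
                 y_true=y_true, y_pred=y_pred, sensitive_features=group)
print(mf.by_group)        # per-group metrics
print(mf.difference())    # largest gap between groups
```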
Since the announcement of PyScript, it has gained a lot of attention and sparked the imagination about how we can run Python applications in the browser. Out of everything that I have come across, most of the use cases are data visualisation. Let's see how we can up our data viz game with PyScript.
scikit-learn is an open-source scientific library for machine learning in Python. In this talk, we will present the recent work carried over by the scikit-learn core-developers team to improve its native performance.
We all know and love our carefully designed CI pipelines, which test our code and make sure that adding some code or fixing a bug doesn’t introduce a regression in the codebase. But we often don’t give benchmarking the same treatment as we give to correctness. The benchmarking tests are usually one-off scripts written to test a specific change. In this talk, we will discuss various strategies to test our code for performance regressions using ASV (airspeed velocity) for Python projects.
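A minimal, hypothetical ASV benchmark file, run with `asv run` and compared across commits with `asv compare`:

```python
# benchmarks/bench_example.py
import numpy as np

class TimeSuite:
    """ASV collects methods prefixed with time_ and reports their runtime."""
    def setup(self):
        self.data = np.random.rand(10_000)

    def time_sort(self):
        np.sort(self.data)

class MemSuite:
    """Methods prefixed with peakmem_ track peak memory usage instead."""
    def peakmem_copy(self):
        np.random.rand(10_000).copy()
```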
Panel is one of the leading choices for building dashboards in Python. In this talk, we discuss the practical aspects of complex data-driven dashboards. There are tutorials and guides available which help teach new users the basics, but this talk focuses on the challenges of building more complex, industry-ready, deployed dashboards. There are a variety of niche issues which arise when you push the limits of complexity, and we will share the solutions we have developed. We will demonstrate these solutions as we walk through the entire lifecycle from data ingestion, through exploratory analysis, to deployment as a finished website.
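Even a complex dashboard starts from the same small core, sketched here (the content is illustrative; `panel serve app.py` deploys it):

```python
import numpy as np
import matplotlib.pyplot as plt
import panel as pn

pn.extension()
freq = pn.widgets.FloatSlider(name="Frequency", start=0.1, end=5, value=1)

def plot(f):
    # the bound function re-renders whenever the slider value changes
    fig, ax = plt.subplots()
    x = np.linspace(0, 10, 500)
    ax.plot(x, np.sin(f * x))
    return fig

dashboard = pn.Column("# Sine explorer", freq, pn.bind(plot, freq))
dashboard.servable()
```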