“A Comparison of Mixing Algorithms for Fixed-Point Iterations in Self-Consistent Electronic Structure Calculations”
Robert Cimrman;
Poster
In ab-initio calculations of electronic structure and material properties within the density-functional theory (DFT) framework, a self-consistent stationary state of a many-electron system is sought by a fixed-point iteration of the Kohn-Sham equations, the so-called DFT loop. One of the key components needed for fast convergence is a suitable mixing of new and previous states in the DFT loop.
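As a rough illustration of what mixing means in a fixed-point iteration, the sketch below applies simple linear mixing to a scalar toy problem. The helper name `mixed_fixed_point`, the toy map `g` and the mixing parameter `alpha` are assumptions made for this example; they are not the algorithms compared on the poster, and a real DFT loop mixes densities or potentials rather than a single number.

```python
def mixed_fixed_point(g, x0, alpha=0.5, tol=1e-10, max_iter=200):
    """Solve x = g(x) with linear mixing (hypothetical helper).

    Each step blends the previous state with the newly computed one:
        x_next = (1 - alpha) * x + alpha * g(x)
    alpha = 1 recovers the plain, unmixed fixed-point iteration.
    """
    x = x0
    for iteration in range(1, max_iter + 1):
        x_next = (1.0 - alpha) * x + alpha * g(x)
        if abs(x_next - x) < tol:
            return x_next, iteration
        x = x_next
    raise RuntimeError("fixed-point iteration did not converge")


def g(x):
    # Toy map with fixed point x = 1. Its slope has magnitude 1.5 > 1,
    # so the unmixed iteration (alpha = 1) moves away from the fixed point.
    return -1.5 * x + 2.5


solution, n_iter = mixed_fixed_point(g, x0=0.0, alpha=0.5)
print(f"converged to {solution:.8f} after {n_iter} iterations")
```

With `alpha=0.5` the mixed map has slope -0.25 and contracts towards the fixed point, whereas `alpha=1.0` raises the `RuntimeError` for this toy map.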
“Advanced Machine learning with Scikit-learn”
Yam Peleg;
Tutorial
Supervised learning is a branch in computer science that studies the design of algorithms that can optimize based on labeled examples.
In this tutorial we will review the methods and techniques used to deploy Machine learning models on "real life" problems.
The problems that will be addressed include: how to avoid overfitting on small datasets, what the best practices are for tackling misclassifications in real-life scenarios, and how to boost performance with meta-learning.
“Apache Parquet as a columnar storage for large datasets”
Peter Hoffmann;
Talk
Apache Parquet has become the de facto columnar storage format for large
data processing. With the support in Pandas and Dask through Apache Arrow and
fastparquet, Python has gained an efficient binary DataFrame storage format.
This talk will outline the Apache Parquet data format and show how
it can be used in Python to work with data larger than memory
and larger than local disk space.
“Benchmarking and performance analysis for scientific Python”
Roman Yurchak;
Talk
In this talk we will review performance analysis tools available in the
scientific Python ecosystem, as well as useful metrics that can be used
to guide the code optimization process. The neurtu
package will be introduced, aiming to facilitate time and memory complexity estimations
together with parametric benchmarks.
“Big geospatial data visualization and analysis in Jupyter”
Davide De Marchi;
Talk
We present the interactive component of the JEODPP (JRC Earth Observation Data and Processing Platform), developed at the Joint Research Center of the European Commission (Ispra, VA, Italy), which allows users to visualize and interactively process big geospatial datasets. We demonstrate the use of Python code inside Jupyter notebooks to browse, display, analyse and combine huge vector and raster image collections coming from the Copernicus Programme and the Sentinel satellites.
“CatBoost - the new generation of Gradient Boosting”
Anna Veronika Dorogush;
Talk
CatBoost (http://catboost.yandex) is a new open-source gradient boosting library that outperforms existing publicly available implementations of gradient boosting in terms of quality.
The talk will cover a broad description of gradient boosting and its areas of usage and the differences between CatBoost and other gradient boosting libraries. We will also briefly explain the details of the proprietary algorithm that leads to a boost in quality.
“CFFI, Ctypes, Cython, Cppyy: The good, the bad, the mighty and the unknown”
Matti Picus;
Tutorial
Create a fractal image using a C-based function to calculate the RGB color of each pixel, but call the function from Python. There are many ways to accomplish this task; we will explore a few of them, as well as demonstrate a pure-Python version in PyPy.
“Databases for Data Scientists (Overview & SQL)”
Alexander CS Hendorf;
Tutorial
This tutorial will provide a crash course on the major differences between the various database systems, such as relational, NoSQL and
graph databases. We will cover relational database design and the SQL query language.
“Data collection, cleaning and mining, from remote locations to web based dashboards.”
Giuseppe Antonello;
Poster
The data flow from a remote location to a web-based dashboard is presented, with all the intermediate operations on the datasets.
“Data Science Security — Protect Against Data Privacy Breaches”
Justin Mayer;
Talk
Data science often involves personal, private data that we have an obligation to protect from security breaches and other unintended redistribution. Multi-factor authentication, VPNs, full-disk encryption, and other measures can be utilized to ensure the personal data entrusted to us is treated with proper care and respect.
“Data visualization -- from default and suboptimal to efficient and awesome”
Boris Gorelik;
Tutorial
Data visualization is an indispensable tool for any data scientist. It serves as a means to convey a message or explain a concept. You would never settle for default settings of a machine learning algorithm. Instead, you would tweak them to obtain optimal results. Similarly, you should never stop with the default results you receive from a data visualization framework. Doing so leads to suboptimal results and makes you and your message less convincing.
After this tutorial, you will be able to name the four most common mistakes in data visualization and know how to avoid them in your graphs. We will use matplotlib for this tutorial.
“Data visualizations for the web with Altair and Vega(-Lite)”
Patrick Muehlbauer;
Talk
Altair is a declarative visualization library for Python built on top of Vega-Lite.
We will show how Altair and Vega enable data scientists and frontend developers to work efficiently together to build beautiful customer-facing data science dashboards.
“Deep Diving into GANs: From Theory to Production”
Michele "Ubik" De Simoni, Paolo Galeone;
Tutorial
With our accrued experience with GANs, we would like to guide you through the required steps to go from theory to production with this revolutionary technology.
Starting from the very basics of what a GAN is, we will pass through a TensorFlow implementation using the most cutting-edge APIs available in the framework, and finish with production-ready serving at scale using Google Cloud ML Engine.
“Deep Learning for Human Pose Estimation”
Ale Solano;
Talk
The use of deep learning to detect human body keypoints (such as eyes, neck, shoulders or knees) has outperformed classic methods of human pose estimation. In this talk we'll cover the process from receiving an RGB image to making a real robot detect a person's position and orientation and finally approach her. All of it with Python.
“Deep Learning in Python using Chainer”
Crissman Loomis;
Tutorial
Learn how to do Deep Learning using Chainer, an open-source Python AI framework. Coded almost entirely in Python, Chainer provides intuitive coding with superior scaling for faster results.
“Detecting anomalies using statistical distances”
Charles Masson;
Talk
Statistical distances are distances between distributions or data samples and are used in a variety of machine learning applications. In this talk, we will show how we use SciPy's statistical distance functions—some of which we recently contributed—to design powerful and production-ready anomaly detection algorithms. With visual illustrations, we will describe the inner workings and the properties of a few common statistical distances and explain what makes them convenient to use, yet powerful to solve various problems. We will also show real-life applications and concrete examples of the anomalous patterns that such algorithms are able to detect in performance-monitoring and business-metric time series.
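To make the idea of a statistical distance between two data samples concrete, here is a minimal pure-Python sketch of the one-dimensional Wasserstein (earth mover's) distance for equal-size samples, used in a toy threshold rule. SciPy provides `scipy.stats.wasserstein_distance` for the general case; the helper names `wasserstein_equal_size` and `is_anomalous`, the sample values and the threshold below are assumptions for illustration and are not taken from the talk.

```python
def wasserstein_equal_size(sample_a, sample_b):
    """1-D Wasserstein-1 distance between two samples of equal size.

    For equal-size samples with uniform weights, the distance is the mean
    absolute difference between the sorted values (simplified stand-in
    for a library implementation).
    """
    if len(sample_a) != len(sample_b) or len(sample_a) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(a - b) for a, b in pairs) / len(sample_a)


def is_anomalous(reference, window, threshold):
    """Flag a window whose distribution drifted away from the reference."""
    return wasserstein_equal_size(reference, window) > threshold


# Hypothetical latency samples (arbitrary units).
reference = [10.0, 11.0, 9.0, 10.5, 9.5]
normal_window = [9.8, 10.2, 10.9, 9.1, 10.4]
shifted_window = [15.0, 16.0, 14.0, 15.5, 14.5]

print(is_anomalous(reference, normal_window, threshold=1.0))
print(is_anomalous(reference, shifted_window, threshold=1.0))
```

A fixed threshold is the simplest possible decision rule; it is used here only to keep the sketch short.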
“Doing bioinformatics with scikit-bio and BioPython”
Joris Vankerschaver;
Tutorial
Scikit-bio and BioPython are two packages for bioinformatics in Python. In this talk, we will explore some of their functionality, and use it to answer some biological questions.
“Efficient Biomedical Named Entity Recognition in Python”
Tilia Ellendorff;
Talk
We present two tools implemented in Python for processing scientific literature in the biomedical domain: the Bio Term Hub is an automated aggregator of terminologies from life science databases; OGER is a fast, efficient, and accurate entity recognition and linking system.
“Efficient GPU-based Sparse Recovery Methods using a Python Package for Matrix-free Operators”
Sebastian Semper;
Poster
We demonstrate the application of the Python package "fastmat", which exploits structure in matrices, to implement certain linear transformations in a matrix-free fashion on GPUs. We show how this enables an efficient iterative scheme to detect material defects in a specimen using ultrasound measurements.
“Empowered Analytics: Blending MongoDB's Aggregation Framework with Numpy and Pandas”
Anna Herlihy, Nathan Leniz;
Tutorial
Learn how to use MongoDB's Aggregation framework to preprocess, transform, and compute values prior to bringing the data down from the cloud. Then use Pandas, NumPy, and Seaborn to further analyze the data, create beautiful visualizations, improve your overall analysis, and deliver stunning reports.
“Explaining model predictions using Shapley values”
Ankur Ankan;
Talk
We can answer a lot of interesting questions by understanding the effects of features on the output of our models. Shapley values allow us to compute the contribution of the features in our dataset towards the predictions.
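The following sketch shows how Shapley values attribute a prediction to individual features, by averaging each feature's marginal contribution over all feature orderings. The toy `model`, the `instance`, the all-zero `baseline` and the helper names are assumptions made for this example; exact enumeration is only feasible for a handful of features, and practical tools rely on approximations.

```python
from itertools import permutations


def shapley_values(value, n_features):
    """Exact Shapley values by enumerating all feature orderings."""
    totals = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        coalition = set()
        for feature in order:
            before = value(frozenset(coalition))
            coalition.add(feature)
            after = value(frozenset(coalition))
            totals[feature] += after - before  # marginal contribution
    return [total / len(orderings) for total in totals]


def model(x):
    # Hypothetical model with one interaction term between x[0] and x[2].
    return 2 * x[0] + 3 * x[1] + x[0] * x[2]


instance = (1.0, 2.0, 4.0)
baseline = (0.0, 0.0, 0.0)


def coalition_value(coalition):
    # Features outside the coalition are replaced by their baseline value.
    mixed = [instance[i] if i in coalition else baseline[i] for i in range(3)]
    return model(mixed)


contributions = shapley_values(coalition_value, n_features=3)
print(contributions)
```

The contributions sum to the difference between the prediction for `instance` and the prediction for `baseline`, and the interaction term is split evenly between the two features involved.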
“Extending Python 3.7s Data Classes”
Geir Arne Hjelle;
Talk
Data classes are introduced in Python 3.7 and offer a way of creating classes while writing minimal amounts of code. We will show how to use data classes, and how they work together with Python's data analysis stack.
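As a small illustration of how little code a data class needs, the sketch below defines a record type with the standard-library `dataclasses` module. The class name `Measurement` and its fields are hypothetical and chosen only for this example.

```python
from dataclasses import asdict, dataclass, field


@dataclass(order=True)
class Measurement:
    """Hypothetical record type; the field names are illustrative only."""

    station: str
    temperature_c: float
    tags: list = field(default_factory=list)  # fresh list per instance

    @property
    def temperature_k(self):
        return self.temperature_c + 273.15


first = Measurement("A", 21.5)
second = Measurement("A", 21.5)

print(first)            # generated __repr__
print(first == second)  # generated __eq__ compares field by field
print(asdict(first))    # plain dict of the fields
```

The decorator generates `__init__`, `__repr__` and `__eq__`, and `order=True` adds the comparison methods. A list of `asdict(...)` results is a list of plain dictionaries, which is one simple way to hand such records over to tabular tools.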
“F2x - Automated FORTRAN wrapping without limits”
Michael Meinel;
Talk
F2x has replaced f2py in our internal projects. Using a full FORTRAN grammar and template-based code generation, it can overcome the limitations of f2py and more.
“Fission track counting in mineral samples using pytracks”
Alexandre de Siqueira;
Talk
Procedures for measuring and counting tracks in minerals are time-consuming and involve practical problems. Here we present pytracks, a package based on numpy, scipy, scikit-image and other packages, that is capable of counting these tracks automatically.
“From exploratory computing to performances, a tour of Python profiling and optimization”
Antonino Ingargiola;
Tutorial
Python is an excellent language for exploratory and interactive computing. In this Jupyter-based hands-on tutorial we will see how to apply different optimization tools and techniques to combine interactivity and performance.
“Getting Started with the Jupyter Notebook”
Mike Müller;
Tutorial
The Jupyter Notebook is used for essentially all other tutorials at EuroSciPy. This tutorial gives an overview of the basic functionality and shows how to use some of the many tools it provides to simplify your Python programming workflow.
“GMDH Neural Network for Short-term Electricity Load Forecasting”
Kostas Passadis;
Talk
Group Method of Data Handling (GMDH) is a family of algorithms used for modelling complex non-linear systems, pattern recognition and function approximation. Throughout the talk we will implement from scratch the GMDH multilayered algorithm, which is a type of feedforward artificial neural network. Along the way we will discuss some fundamental concepts of machine learning, and at the end of the talk we will build a model for forecasting electricity load.
“Going full Python for Machine Learning in Biomedical engineering”
Jeremy Laforet;
Talk
Reflections gathered at the start of the European project CHRONOS on the specific constraints of ML applications in biomedical engineering, and our approach to them using Python.
“GranFilm : the modeling of the optical properties of thin granular films.”
Alexis CVETKOV-ILIEV;
Poster
GranFilm is a numerical tool which simulates the optical response of supported nanoparticles. It aims at understanding thin film growth during sputtering.
“How PyPy can help for high-performance computing”
Antonio Cuni;
Talk
PyPy is an alternative implementation of Python which is famous for its speed:
thanks to its JIT compiler and fast GC, it can run Python programs up to 100x
faster than CPython.
This talk will cover the following topics, with a particular focus on
scientific applications:
- What is PyPy, and the current status of scientific libraries
- Why and when it is fast
- When it is slow, and how the PyPy team is handling the problem
- Future roadmap
“How to not screw up with machine learning in production (and more about engineering in data science)”
Denys Kovalenko;
Talk
Some problems that data science teams face are of an engineering nature. By applying software engineering best practices to machine learning infrastructure, we can make data science teams more successful and productive.
“Imbalanced-learn: a scikit-learn-contrib to tackle learning from imbalanced data set”
Guillaume Lemaitre;
Talk
Overview of the imbalanced-learn package and what's new in release 0.4.
“Introduction to matplotlib”
Alexandre de Siqueira;
Tutorial
This is a matplotlib tutorial for beginners. We will use several functions to create different kinds of plots using this library.
“Introduction to Python”
Mojdeh Rastgoo;
Tutorial
This tutorial will be a plain introduction to the Python language.
“KPIs implementation and decision tree algorithms as support tools in wastewater treatment plants management.”
Giuseppe Antonello;
Talk
A set of algorithms, fully developed in Python, for implementing KPIs and decision trees is presented, with use cases from wastewater treatment plant management.
“Listening to Quasars and Shooting Satellites With Lasers”
Geir Arne Hjelle;
Talk
Quasars, lasers and satellites are all used to keep track of the Earth as it tumbles through space. By combining observations of far away objects, some billions of light-years away, we are able to monitor the centimeter changes at the surface of our planet. Python plays an increasing part in this analysis.
“Machine Learning for microcontrollers with Python and C”
Jon Nordby;
Talk
How to deploy efficient machine learning models on tiny microcontrollers,
using standard Python and scikit-learn workflows.
“Machine Learning Methods for User Experience and Performance Monitoring”
Susanne Greiner;
Poster
Performance monitoring is playing an increasing role in the time of IoT and the cloud. Unfortunately, data are commonly unlabeled, or labels added by a human expert who has been using the data for troubleshooting are only available for short periods.
User experience can close this gap by providing detailed information on how the current constellation of performance metric values is perceived by the user, making it a starting point for several investigations based on machine learning methods.
scikit-learn is used to explore synthetic monitoring data together with performance metrics from heterogeneous sources.
“MNE-python, a toolkit for neurophysiological data”
Joan Massich;
Poster
mne-python is an open-source package for exploring, visualizing, and analyzing human neurophysiological data: MEG, EEG, sEEG, ECoG, and more.
“modAL: A modular active learning framework for Python”
Tivadar Danka;
Talk
modAL is an active learning framework for Python, built on top of scikit-learn. In this talk, we are going to take a look at how active learning can help you bring out the best from your unlabelled data and how you can rapidly build active learning workflows with nearly complete freedom using modAL.
“Modelling Signal Acquisition Frontends for Compressed Sensing Applications using Open Source”
Christoph Wagner;
Poster
Compressed sensing is a nifty concept aiming to reduce the amount of data digitized during the acquisition of an analogue signal.
This poster describes a holistic approach to the simulative evaluation of mixed-signal systems using publicly available open-source modules from the Python ecosystem. For algorithmic evaluation, numpy, scipy and fastmat will be used. The analogue frontend portion is split into multiple blocks and evaluated using BMSpy.
“ModelXplore, a python based model exploration”
Nicolas Cellier;
Talk
ModelXplore is a helper library that provides tools to facilitate the exploration of time-expensive models (or experiments).
It gives access to a variety of samplers and regression functions (called meta-models), offers easy access to sensitivity analysis, and makes it easy to compute response surfaces.
“Navigating the Magical Data Visualisation Forest”
Margriet Groenendijk;
Talk
Data visualization is fun but can take up a lot of time, especially when you are exploring new data. The magic forest is much easier to navigate with PixieDust, a free open-source Python library that makes it quick and simple to explore data with any visualization library without writing code in a Jupyter notebook. Learn how PixieDust takes out some of the coding, how to contribute, and how to make and share visualizations in seconds.
“Numpy - where we are and where we want to be”
Matti Picus;
Talk
NumPy is one of the core Python libraries. It has been given a boost of full-time developer time to move the library forward. Where will this lead?
“Parallel Data Analysis with Dask”
Ian Stokes Rees;
Tutorial
The libraries that power data analysis in Python are essentially limited to a single CPU core and to datasets that fit in RAM. Attendees will see how dask can parallelize their workflows, while still writing what looks like normal Python, NumPy, or pandas code.
Dask is a parallel computing framework, with a focus on analytical computing. We'll start with dask.delayed, which helps parallelize your existing Python code. We’ll demonstrate dask.delayed on a small example, introducing the concepts at the heart of dask like the task graph and the schedulers that execute tasks. We’ll compare this approach to the simpler, but less flexible, parallelization methods available in the standard library like concurrent.futures.
Attendees will see the high-level collections dask provides for writing regular Python, NumPy, or Pandas code that is then executed in parallel on datasets that may be larger than memory. These high level collections provide a familiar API, but the execution model is very different. We'll discuss concepts like the GIL, serialization, and other headaches that come up with parallel programming. We’ll use dask’s various schedulers to illustrate the differences between multi-threaded, multi-processes, and distributed computing.
Dask includes a distributed scheduler for executing task graphs on a cluster of machines. We’ll provide each person access to their own cluster.
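For orientation, this is roughly what the standard-library baseline mentioned in the abstract, `concurrent.futures`, looks like on a toy workload. The function `slow_square` and the worker count are assumptions for this sketch; it does not show dask itself.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(x):
    """Stand-in for an expensive, independent task (hypothetical)."""
    time.sleep(0.05)  # simulate waiting on I/O
    return x * x


# map() submits one task per input and returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    squares = list(pool.map(slow_square, range(8)))

total = sum(squares)
print(squares, total)
```

Threads suit this toy task because it only sleeps; CPU-bound pure-Python work is constrained by the GIL, which is one of the concerns the tutorial lists. The executor also only handles flat collections of independent tasks, whereas a task graph can express dependencies between them.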
“Parselmouth: an efficient Python interface to the Praat phonetics software package”
Yannick Jadoul;
Talk
Parselmouth is a Python interface to Praat, a scientific application for computational phonetics used in a wide range of academic fields related to speech. Using the pybind11 library, we have created an efficient yet natural Python interface around the Praat C/C++ codebase, allowing the integration of Praat functionality with the Python scientific ecosystem.
“Phase-space analysis of chaotic deterministic dynamics with Python: the case of biological systems with many degrees of freedom”
Paola Lecca, Unnamed user;
Talk
We present a Python program that generates and analyses the phase portrait of systems of ordinary differential equations (ODEs) describing dynamics affected by deterministic chaos. Deterministic chaos manifests itself in an irregular behaviour of the dynamics arising from a strictly deterministic time evolution without any source of noise or external stochasticity. This irregularity is expressed in an extremely sensitive dependence on the initial conditions, which precludes any long-term prediction of the dynamics. Deterministic chaos can be found in systems with very few degrees of freedom, and is then usually termed low-dimensional deterministic chaos, as it is attributable to the dynamics of a small fraction of the total system components. Low-dimensional chaos is, as expected, common in systems with few degrees of freedom, but rare in systems with many degrees of freedom, such as medium- or large-size dynamic biological networks.
The possibility of detecting low dimensional chaos in biological networks is of great interest to the community of biologists and biotechnologists because, since deterministic chaos is generated by an underlying deterministic process, it is potentially controllable.
Our Python program analyses the phase portrait, performs a stability analysis of the ODEs describing the dynamics of biologically interacting agents, and detects the major drivers of chaos. Identifying these drivers makes it possible to improve the predictability and controllability of the system dynamics.
We will show the performance of our code on an accurate ordinary differential equation model of the gene network regulating sporulation initiation in Bacillus subtilis. Results of the analysis will be presented at the conference.
“Privacy for Data Scientists”
Katharine Jarmul;
Tutorial
What does data privacy mean within the realm of data science? How can we continue to do high-performance data science in a post-GDPR world? In this 3-hour tutorial, we will cover privacy basics for data scientists: from an introduction to the theories and algorithms defining privacy to practical steps you can take to better preserve privacy in your data science.
“Pyccel, a Fortran static compiler for scientific High-Performance Computing”
Dr. Ing. Ratnani Ahmed;
Talk
Presenting Pyccel, a source-to-source Python-to-Fortran compiler and DSL enabling HPC capabilities.
“Pythonizing workflows with modern and legacy chemistry softwares.”
Olav Vahtras;
Talk
Initiatives to combine legacy and modern software in computational chemistry, with Python as the glue, are described.
“Python Tools for Climate Science”
Robert Gieseke;
Talk
Python plays an increasing role in Climate Science -- this talk focuses on Simple Climate Models and the temperature, greenhouse gas concentrations and emissions data needed to use these models.
“Remapping or Regridding between Spherical Grids for Earth Modeling”
Ki-Hwan Kim;
Talk
I will introduce a novel toolkit for remapping or regridding between general spherical grids. It is very useful for earth modeling and simulations.
“Reproducibility and exploratory computing with a Jupyter-based workflow”
Antonino Ingargiola;
Talk
Reproducibility and exploratory computing are often two irreconcilable driving needs in data analysis. Here, I will present a Jupyter workflow to manage the project lifecycle, from the initial exploratory phase to a mature and stable codebase. I will show how, by combining git, Google Drive, conda environments, Python packaging and batch notebook execution, it is possible to manage the complexity of a quickly growing project while keeping track of the analysis for reproducibility.
“Reproducible science with Binder, Docker and Jupyter applied to neuronal simulations using NeuronEAP”
Maria Teleńczuk;
Poster
NeuronEAP is a library for simulating the electric field generated during normal activity of the brain. To make this library accessible to anyone, we employed current technology for building scientific environments online (Binder, Docker, Jupyter notebook).
“RepSeP - Reproducible Self-Publishing for Python-Based Research.”
Horea Christian;
Talk
RepSeP enables and exemplifies the compilation and distribution-friendly packaging of reproducible scientific publications which employ live elements (e.g. plots, tables, or inline statistics) generated from Python analysis scripts.
The repository ships PythonTeX boilerplate code (for high quality typesetting), and showcases a defined set of Python analysis scripts used throughout the main science communication formats --- article, poster, and presentation slide set --- with appropriate content-variant styling.
“RPackUtils: R package dependencies manager and Bioconductor/CRAN mirroring tool”
Sylvain Gubian;
Poster
RPackUtils is an R package dependencies manager developed in Python with reproducibility in mind. RPackUtils can manage several public and private repositories...
“Scalable Data Science with Python”
Alejandro Saucedo;
Talk
This talk will provide insights on some of the key learnings I've obtained throughout my career building & scaling machine learning pipelines. I will provide a deep dive on how to support and scale complex data pipelines as your data science team / projects grow using Airflow and Celery.
“Scientific computing for quantum technology”
Nathan Shammah;
Talk
In this talk we discuss the emerging open-source quantum-tech ecosystem and its challenges. We introduce QuTiP, the quantum toolbox in Python, and discuss how the wider scientific computing community can get involved in this ecosystem.
“Scikit-learn and tabular data: closing the gap”
Joris Van den Bossche;
Talk
This talk will give an overview of the challenges and current bottlenecks when working with tabular data and scikit-learn. Then it will show the ongoing developments in scikit-learn to improve this situation and highlight some third-party libraries that try to ease those problems.
“Searching efficiently through (genomic) sequences with vantage point trees”
Joris Vankerschaver;
Talk
Vantage point trees provide a fast lookup mechanism to find the most similar match from a set of sequences for a given query sequence. Although virtually unknown in the literature, they can be efficiently implemented using only a little bit of Numpy and Cython.
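To show the lookup mechanism in miniature, here is a pure-Python sketch of a vantage point tree with a nearest-neighbour query. It is an illustrative assumption rather than the speaker's implementation: it uses neither Numpy nor Cython, the class `VPTree`, the `hamming` distance and the example sequences are made up for this sketch, and the input must be non-empty.

```python
import random


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


class VPTree:
    """Minimal vantage point tree (illustrative sketch, not optimized)."""

    def __init__(self, items, distance, rng=None):
        rng = rng or random.Random(0)
        items = list(items)
        self.distance = distance
        # Pick a random vantage point and remove it from the pool.
        self.vantage = items.pop(rng.randrange(len(items)))
        self.radius = 0
        self.inside = self.outside = None
        if not items:
            return
        # Split the remaining items at the median distance to the vantage point.
        items.sort(key=lambda item: distance(self.vantage, item))
        mid = len(items) // 2
        self.radius = distance(self.vantage, items[mid])
        if items[:mid]:
            self.inside = VPTree(items[:mid], distance, rng)  # d <= radius
        self.outside = VPTree(items[mid:], distance, rng)     # d >= radius

    def nearest(self, query, best=(float("inf"), None)):
        """Return (distance, item) for the stored item closest to query."""
        d = self.distance(query, self.vantage)
        if d < best[0]:
            best = (d, self.vantage)
        # The triangle inequality lets us skip a branch that cannot
        # contain anything closer than the current best match.
        if self.inside is not None and d - best[0] <= self.radius:
            best = self.inside.nearest(query, best)
        if self.outside is not None and d + best[0] >= self.radius:
            best = self.outside.nearest(query, best)
        return best


sequences = ["ACGTAC", "ACGTTT", "TTTTTT", "GGGCAC", "ACGAAC"]
tree = VPTree(sequences, hamming)
dist, match = tree.nearest("ACGTAA")
print(dist, match)
```

The pruning is only valid for a true metric, that is, a distance that satisfies the triangle inequality, which the Hamming distance does.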
“Simpler data science: dirty categories and scikit-learn updates”
Gaël Varoquaux;
Talk
This talk will touch upon two packages that hope to make machine learning
easier, dirty_cat [1], and the lesser-known scikit-learn.
Dirty-cat strives to make it easier to work with categorical data that
contain variations in the categories, such as typos, variants of company
names, or open-ended input. It uses simple, off-the-shelf, vectorization
of the categories in a way that is robust to morphological variants.
I will also give an update on scikit-learn: upcoming features, and the
thriving health of a happy community.
[1] https://dirty-cat.github.io
“Simulation of Rarefied Gases”
Thomas Sasse;
Poster
Presentation of our Python3 implementation for the simulation of rarefied gases (Boltzmann Equation).
“Teaching programming with Jupyterhub and Nbgrader”
Gert-Ludwig Ingold;
Talk
The tools provided by Project Jupyter offer exciting possibilities for teaching and at the same time invite us to rethink some teaching concepts.
“Teaching with JupyterHub - lessons learned”
Martin Christen;
Talk
In this talk, experiences using JupyterHub, the multi-user Jupyter Notebook server, for teaching are shared.
“The Hitchhiker's Guide to Parallelism with Python”
Declan Valters;
Tutorial
This tutorial will introduce a range of Python modules suitable for implementing different styles of parallel programming with Python. Four topics are covered: multiprocessing, numba, mpi4py, and cython with OpenMP. The tutorial is intended to give a taster session of each approach.
“Three most common mistakes in data visualization”
Boris Gorelik;
Talk
Communication is a crucial part of our jobs. Data visualization plays an important role in such communication. In this lecture, you will learn about the three biggest visualization anti-patterns that I have been able to identify during more than 15 years of my professional career. I will accompany each anti-pattern with a case study. After attending this lecture, you will be able to identify and fix common mistakes in your and your colleagues' graphs.
“Understanding and diagnosing your machine-learning models”
Gaël Varoquaux;
Tutorial
Often achieving a good prediction is only half of the job. Questions immediately arise: How to improve this prediction? What drives the prediction? Can we make changes to the system based on the predictions? All these questions require understanding how good the model's prediction is, and how the model predicts.
This tutorial assumes basic knowledge of scikit-learn. It will focus on statistics, tests, and interpretation rather than improving the prediction.
“Useful Decorators for Data Science”
Uri Goren;
Talk
Aspect oriented programming is a very useful concept, and Python enables it via the decorator feature.
In this talk we will go through several useful decorators that should be in the toolbox of every data scientist.
We will look at automatic caching of results, logging of runtime errors, interactive graphs for Jupyter notebooks, PySpark user-defined functions and remote code execution.
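Two of the listed use cases, caching of results and logging of runtime errors, can be sketched with the standard library alone. The decorator `log_errors` and the functions `expensive_square` and `safe_ratio` are hypothetical names for this example and are not taken from the talk; `functools.lru_cache` is the standard-library caching decorator.

```python
import functools
import logging


def log_errors(func):
    """Log any exception raised by func, then re-raise it."""

    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            logging.exception("error in %s", func.__name__)
            raise

    return wrapper


@functools.lru_cache(maxsize=None)
def expensive_square(x):
    """Stand-in for a slow computation; results are cached per argument."""
    return x * x


@log_errors
def safe_ratio(a, b):
    return a / b


expensive_square(12)  # computed
expensive_square(12)  # served from the cache
print(expensive_square.cache_info())
print(safe_ratio(1, 2))
```

Both decorators add behaviour around a function without touching its body, which is the aspect-oriented idea the abstract refers to.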
“Using OpenCV + Games to help Parkinson's Patients”
Jayaditya Gupta;
Talk
Although Parkinson's disease is not curable, its progression can be delayed with the help of physical exercise. The project features a simple exercise game in which the user has to lift his/her hands in the air, and the movements are tracked with the help of OpenCV.
“Using Python for Wind Resource Assessment”
Neil Davis;
Poster
Using Python as an interface layer to expose Fortran based model APIs and work with modern (XML) and legacy (ASCII) file formats.
“When less is more: dimensionality reduction in neuroscience”
Pietro Marchesi;
Talk
Python-powered dimensionality reduction can help us gain insight into the complex dynamics of neural ensembles.
“Workflow for Optimizing Data Visualization and Documentation”
Sayako Kodera, Dominique Albert-Weiß;
Poster
The presented workflow is used in our research group and makes it possible to optimize the visualization and documentation process in scientific writing.