Welcome to our schedule sneak peek!

We prepared a list of exciting talks, so you can get a feel for our conference. Please keep in mind that this is not our full schedule. We will follow up with the full schedule in time, so stay tuned!

»Introduction to matplotlib«
Alexandre de Siqueira; Tutorial (90 minutes)

This is a matplotlib tutorial for beginners. We will use several functions to create different kinds of plots using this library.


»Machine Learning for microcontrollers with Python and C«
Jon Nordby; Talk (15 minutes)

How to deploy efficient machine learning models on tiny microcontrollers, using standard Python and scikit-learn workflows.


»Getting Started with the Jupyter Notebook«
Mike Müller; Tutorial (90 minutes)

The Jupyter Notebook is used for essentially all other tutorials at EuroSciPy. This tutorial gives an overview of the basic functionality and shows how to use some of the many tools it provides to simplify your Python programming workflow.


»F2x - Automated FORTRAN wrapping without limits«
Michael Meinel; Talk (15 minutes)

F2x has replaced f2py in our internal projects. Using a full FORTRAN grammar and template-based code generation, it overcomes the limitations of f2py and goes beyond them.


»The Hitchhiker's Guide to Parallelism with Python«
Declan Valters; Tutorial (90 minutes)

This tutorial will introduce a range of Python modules suitable for implementing different styles of parallel programming with Python. Four topics are covered, including `multiprocessing`, `numba`, `mpi4py`, and `cython` with OpenMP. The tutorial is intended to give a taster session of each approach.
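
As a taste of the first of these approaches, here is a minimal sketch using only the standard-library `multiprocessing` module. The function `square` and the helper `parallel_squares` are hypothetical names chosen for illustration; they are not taken from the tutorial material.

```python
from multiprocessing import Pool


def square(x):
    """CPU-bound stand-in for real work (hypothetical example function)."""
    return x * x


def parallel_squares(values, processes=2):
    """Map `square` over `values` using a pool of worker processes."""
    with Pool(processes=processes) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block
    # when they import the main module.
    print(parallel_squares(range(5)))
```

Process pools sidestep the GIL at the cost of pickling arguments and results, so they tend to pay off only when each task does a meaningful amount of work.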


»RPackUtils: R package dependencies manager and Bioconductor/CRAN mirroring tool«
Sylvain Gubian; Poster (90 minutes)

RPackUtils is an R package dependencies manager developed in Python with reproducibility in mind. RPackUtils can manage several public and private repositories...


»Scalable Data Science with Python«
Alejandro Saucedo; Talk (30 minutes)

This talk will provide insights on some of the key learnings I've obtained throughout my career building & scaling machine learning pipelines. I will provide a deep dive on how to support and scale complex data pipelines as your data science team / projects grow using Airflow and Celery.


»CatBoost - the new generation of Gradient Boosting«
Anna Veronika Dorogush; Talk (30 minutes)

CatBoost (http://catboost.yandex) is a new open-source gradient boosting library that outperforms existing publicly available implementations of gradient boosting in terms of quality. The talk will cover a broad description of gradient boosting and its areas of usage and the differences between CatBoost and other gradient boosting libraries. We will also briefly explain the details of the proprietary algorithm that leads to a boost in quality.


»Efficient GPU-based Sparse Recovery Methods using a Python Package for Matrix-free Operators«
Sebastian Semper; Poster (90 minutes)

We demonstrate the application of the Python package "fastmat", which makes it possible to exploit structure in matrices, to implement certain linear transformations in a matrix-free fashion on GPUs. We show how this allows an efficient iterative scheme to detect material defects in a specimen using ultrasound measurements.


»Detecting anomalies using statistical distances«
Charles Masson; Talk (30 minutes)

Statistical distances are distances between distributions or data samples and are used in a variety of machine learning applications. In this talk, we will show how we use SciPy's statistical distance functions—some of which we recently contributed—to design powerful and production-ready anomaly detection algorithms. With visual illustrations, we will describe the inner workings and the properties of a few common statistical distances and explain what makes them convenient to use, yet powerful to solve various problems. We will also show real-life applications and concrete examples of the anomalous patterns that such algorithms are able to detect in performance-monitoring and business-metric time series.
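
To make the idea concrete, here is a minimal pure-Python sketch of one statistical distance, the one-dimensional Wasserstein (earth mover's) distance, restricted to two samples of equal size. SciPy offers a general version as `scipy.stats.wasserstein_distance`; the sketch avoids that dependency. The helper `is_anomalous`, the sample values, and the threshold are assumptions made for illustration and are not the speaker's algorithm.

```python
def wasserstein_1d(sample_a, sample_b):
    """Wasserstein-1 distance between two equally sized 1-D samples.

    For equal sample sizes this reduces to the mean absolute difference
    between the sorted samples.
    """
    if not sample_a or len(sample_a) != len(sample_b):
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(a - b) for a, b in pairs) / len(sample_a)


def is_anomalous(reference, window, threshold):
    """Flag a window whose distribution drifted away from the reference."""
    return wasserstein_1d(reference, window) > threshold


# Hypothetical latency samples (arbitrary units).
reference = [10, 11, 9, 10, 12, 10]
reordered = [11, 10, 10, 9, 12, 10]   # same values, different order
shifted = [15, 16, 14, 15, 17, 15]    # every value moved up by 5

print(wasserstein_1d(reference, reordered))  # 0.0
print(wasserstein_1d(reference, shifted))    # 5.0
print(is_anomalous(reference, shifted, threshold=2.0))  # True
```

Because the distance compares whole distributions rather than individual points, reordering the values inside a window does not change the result.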


»Simulation of Rarefied Gases«
Thomas Sasse; Poster (90 minutes)

Presentation of our Python3 implementation for the simulation of rarefied gases (Boltzmann Equation).


»Parselmouth: an efficient Python interface to the Praat phonetics software package«
Yannick Jadoul; Talk (15 minutes)

Parselmouth is a Python interface to Praat, a scientific application for computational phonetics used in a wide range of academic fields related to speech. Using the pybind11 library, we have created an efficient yet natural Python interface around the Praat C/C++ codebase, allowing the integration of Praat functionality with the Python scientific ecosystem.


»Remapping or Regridding between Spherical Grids for Earth Modeling«
Ki-Hwan Kim; Talk (15 minutes)

I will introduce a novel toolkit for remapping or regridding between general spherical grids. It is very useful for earth modeling and simulations.


»From exploratory computing to performances, a tour of Python profiling and optimization«
Antonino Ingargiola; Tutorial (90 minutes)

Python is an excellent language for exploratory and interactive computing. In this Jupyter-based hands-on tutorial we will see how to apply different optimization tools and techniques to combine interactivity and performance.


»Advanced Machine learning with Scikit-learn«
Yam Peleg; Tutorial (90 minutes)

Supervised learning is a branch of computer science that studies the design of algorithms that can optimize based on labeled examples. In this tutorial we will review the methods and techniques used to deploy machine learning models on "real life" problems. The problems addressed include: how to avoid overfitting on small datasets, what the best practices are for tackling misclassifications in real-life scenarios, and how to boost performance with meta-learning.


»Fission track counting in mineral samples using pytracks«
Alexandre de Siqueira; Talk (15 minutes)

Procedures for measuring and counting tracks in minerals are time-consuming and involve practical problems. Here we present pytracks, a package based on numpy, scipy, scikit-image and other packages, that is capable of counting these tracks automatically.


»Imbalanced-learn: a scikit-learn-contrib to tackle learning from imbalanced data set«
Guillaume Lemaitre; Talk (15 minutes)

Overview of the imbalanced-learn package and what's new in release 0.4.


»A Comparison of Mixing Algorithms for Fixed-Point Iterations in Self-Consistent Electronic Structure Calculations«
Robert Cimrman; Poster (90 minutes)

In ab-initio calculations of electronic structure and material properties within the density-functional theory (DFT) framework, a self-consistent stationary state of a many-electron system is sought by a fixed-point iteration of Kohn-Sham equations, the so-called DFT loop. One of the key components needed for fast convergence is to apply a suitable mixing of new and previous states in the DFT loop.
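
The role of mixing can be illustrated on a scalar toy problem. The sketch below implements simple linear mixing, x_new = (1 - alpha) * x + alpha * F(x), which is only one of many possible schemes and is not claimed to be among those compared in the poster. The map `toy_map` and all parameter values are assumptions chosen so that the unmixed iteration diverges.

```python
def mixed_fixed_point(func, x0, alpha=0.3, tol=1e-10, max_iter=200):
    """Solve x = func(x) by iterating with linear mixing.

    alpha = 1.0 is the plain fixed-point iteration; smaller values
    blend the new state with the previous one.
    """
    x = x0
    for iteration in range(1, max_iter + 1):
        x_new = (1.0 - alpha) * x + alpha * func(x)
        if abs(x_new - x) < tol:
            return x_new, iteration
        x = x_new
    raise RuntimeError("no convergence within max_iter iterations")


def toy_map(x):
    """Toy map with fixed point x = 1 and slope -2 (plain iteration diverges)."""
    return -2.0 * x + 3.0


solution, steps = mixed_fixed_point(toy_map, x0=0.0, alpha=0.3)
print(solution, steps)
```

With `alpha=0.3` the effective slope of the mixed map is 0.1, so the error shrinks by a factor of ten per step, whereas `alpha=1.0` doubles the error each step and never converges.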


»KPIs implementation and decision tree algorithms as support tools in wastewater treatment plants management.«
Giuseppe Antonello; Talk (15 minutes)

A set of algorithms, fully developed in Python, for implementing KPIs and decision trees is presented, with use cases in wastewater treatment plant management.


»Data collection, cleaning and mining, from remote locations to web based dashboards.«
Giuseppe Antonello; Poster (90 minutes)

The dataflow from remote locations to a web-based dashboard is presented, with all the intermediate operations on the datasets.


»GMDH Neural Network for Short-term Electricity Load Forecasting«
Kostas Passadis; Talk (15 minutes)

Group Method of Data Handling is a family of algorithms used for modelling complex non-linear systems, pattern recognition and function approximation. Throughout the talk we will implement from scratch the GMDH multilayered algorithm, which is a type of feedforward artificial neural network. Along the way we will discuss some fundamental concepts of machine learning, and at the end of the talk we will build a model for forecasting electricity load.


»Scientific computing for quantum technology«
Nathan Shammah and Shahnawaz Ahmed; Talk (30 minutes)

In this talk we discuss the emerging open-source quantum-tech ecosystem and its challenges. We introduce QuTiP, the quantum toolbox in Python, and discuss how the wider scientific computing community can get involved in this ecosystem.


»CFFI, Ctypes, Cython, Cppyy: The good, the bad, the mighty and the unknown«
Matti Picus; Tutorial (90 minutes)

Create a fractal image using a C-based function to calculate the RGB color of each pixel, but call the function from Python. There are many ways to accomplish this task; we will explore a few of them and demonstrate a pure-Python version in PyPy.


»Numpy - where we are and where we want to be«
Matti Picus; Talk (15 minutes)

NumPy is one of the core Python libraries. It has been given a boost of full-time developer time to move the library forward. Where will this lead?


»Deep Learning for Human Pose Estimation«
Ale Solano; Talk (30 minutes)

The use of deep learning to detect human body keypoints (such as eyes, neck, shoulders or knees) has outperformed classic methods of human pose estimation. In this talk we'll cover the process from receiving an RGB image to making a real robot detect a person's position and orientation and finally approach her. And everything with Python.


»Deep Learning in Python using Chainer«
Crissman Loomis; Tutorial (90 minutes)

Learn how to do deep learning using Chainer, an open-source Python AI framework. Coded almost entirely in Python, Chainer provides intuitive coding with superior scaling for faster results.


»Searching efficiently through (genomic) sequences with vantage point trees«
Joris Vankerschaver; Talk (15 minutes)

Vantage point trees provide a fast lookup mechanism to find the most similar match from a set of sequences for a given query sequence. Although virtually unknown in the literature, they can be efficiently implemented using only a little bit of Numpy and Cython.
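
The lookup idea can be sketched in plain Python. The version below is a minimal illustration, not the speaker's NumPy/Cython implementation: it uses the Hamming distance on equal-length strings (an assumption; any true metric would do), and the class name `VPTree` and the example sequences are made up for this sketch.

```python
import random


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


class VPTree:
    """Minimal vantage point tree over a non-empty collection of items."""

    def __init__(self, items, distance, rng=None):
        rng = rng or random.Random(0)
        items = list(items)
        self.distance = distance
        # Pick a vantage point and split the rest by distance to it.
        self.vantage = items.pop(rng.randrange(len(items)))
        self.radius = 0
        self.inside = self.outside = None
        if not items:
            return
        dists = [(distance(self.vantage, item), item) for item in items]
        dists.sort(key=lambda pair: pair[0])
        self.radius = dists[len(dists) // 2][0]  # median distance
        inner = [item for d, item in dists if d < self.radius]
        outer = [item for d, item in dists if d >= self.radius]
        if inner:
            self.inside = VPTree(inner, distance, rng)
        if outer:
            self.outside = VPTree(outer, distance, rng)

    def nearest(self, query, best=None):
        """Return (distance, item) for the stored item closest to `query`."""
        d = self.distance(query, self.vantage)
        if best is None or d < best[0]:
            best = (d, self.vantage)
        # Visit the side that contains the query first.
        if d < self.radius:
            first, second = self.inside, self.outside
        else:
            first, second = self.outside, self.inside
        if first is not None:
            best = first.nearest(query, best)
        # By the triangle inequality, the other side can only hold a
        # closer item if the current best ball reaches the boundary.
        if second is not None and abs(d - self.radius) <= best[0]:
            best = second.nearest(query, best)
        return best


sequences = ["ACGTAC", "ACGTTT", "TTGTAC", "GGGGGG", "ACGAAC"]
tree = VPTree(sequences, hamming)
print(tree.nearest("ACGTAA"))  # (1, 'ACGTAC')
```

The pruning step is what saves work: whole subtrees are skipped whenever the triangle inequality guarantees they cannot contain a closer match.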


»Teaching programming with Jupyterhub and Nbgrader«
Gert-Ludwig Ingold; Talk (15 minutes)

The tools provided by Project Jupyter offer exciting possibilities for teaching and at the same time invite us to rethink some teaching concepts.


»MNE-python, a toolkit for neurophysiological data«
Joan Massich; Poster (90 minutes)

mne-python is an open-source package for exploring, visualizing, and analyzing human neurophysiological data: MEG, EEG, sEEG, ECoG, and more.


»Using Python for Wind Resource Assessment«
Neil Davis; Poster (90 minutes)

Using Python as an interface layer to expose Fortran based model APIs and work with modern (XML) and legacy (ASCII) file formats.


»Doing bioinformatics with scikit-bio and BioPython«
Joris Vankerschaver; Tutorial (90 minutes)

Scikit-bio and BioPython are two packages for bioinformatics in Python. In this talk, we will explore some of their functionality, and use them to answer some biological questions.


»Teaching with JupyterHub - lessons learned«
Martin Christen; Talk (15 minutes)

In this talk, experiences using JupyterHub, the multi-user Jupyter Notebook server, for teaching are shared.


»When less is more: dimensionality reduction in neuroscience«
Pietro Marchesi; Talk (30 minutes)

Python-powered dimensionality reduction can help us gain insight into the complex dynamics of neural ensembles.


»Listening to Quasars and Shooting Satellites With Lasers«
Geir Arne Hjelle; Talk (15 minutes)

Quasars, lasers and satellites are all used to keep track of the Earth as it tumbles through space. By combining observations of far away objects, some billions of light-years away, we are able to monitor the centimeter changes at the surface of our planet. Python plays an increasing part in this analysis.


»Extending Python 3.7's Data Classes«
Geir Arne Hjelle; Talk (15 minutes)

Data classes are introduced in Python 3.7 and offer a way of creating classes while writing minimal amounts of code. We will show how to use data classes, and how they work together with Python's data analysis stack.
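
For readers who have not met them yet, here is a minimal sketch of a data class using only the standard library. The class `Measurement` and its fields are hypothetical and chosen purely for illustration.

```python
from dataclasses import asdict, dataclass, field


@dataclass(order=True)
class Measurement:
    """Hypothetical record type, used only for illustration."""

    station: str
    temperature_c: float
    tags: list = field(default_factory=list)

    @property
    def temperature_k(self):
        """Temperature converted to kelvin."""
        return self.temperature_c + 273.15


first = Measurement("north", 21.5)
second = Measurement("east", 19.0, tags=["calibrated"])

# __init__, __repr__, __eq__ and the ordering methods are generated.
print(first)
print(sorted([first, second])[0].station)  # east
print(asdict(first))
```

`asdict` turns an instance into a plain dictionary, which is a convenient shape for handing records to table-oriented tools.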


»Python Tools for Climate Science«
Robert Gieseke; Talk (15 minutes)

Python plays an increasing role in Climate Science -- this talk focuses on Simple Climate Models and the temperature, greenhouse gas concentrations and emissions data needed to use these models.


»Going full Python for Machine Learning in Biomedical engineering«
Jeremy Laforet; Talk (15 minutes)

Reflections gathered at the start of the European project CHRONOS on the specific constraints on ML applications in biomedical engineering, and our approach through Python.


»Data Science Security — Protect Against Data Privacy Breaches«
Justin Mayer; Talk (15 minutes)

Data science often contains personal, private data that we have an obligation to protect from security breaches and other unintended redistribution. Multi-factor authentication, VPNs, full-disk encryption, and other measures can be utilized to ensure the personal data entrusted to us is treated with proper care and respect.


»How to not screw up with machine learning in production (and more about engineering in data science)«
Denys Kovalenko; Talk (15 minutes)

Some problems that data science teams face are of engineering nature. By applying best software engineering practices to machine learning infrastructure we can make data science teams more successful and productive.


»Big geospatial data visualization and analysis in Jupyter«
Davide De Marchi; Talk (15 minutes)

Present the interactive component of the JEODPP (JRC Earth Observation Data and Processing Platform), developed at the Joint Research Center of the European Commission (Ispra, VA, Italy), which allows users to visualize and interactively process big geospatial datasets. Demonstrate usage of Python code inside Jupyter notebooks to browse, display, analyse and combine huge vector and raster image collections coming from the Copernicus Programme and the Sentinel satellites.


»Useful Decorators for Data Science«
Uri Goren; Talk (30 minutes)

Aspect-oriented programming is a very useful concept, and Python enables it via the decorator feature. In this talk we will go through several useful decorators that should be in the toolbox of every data scientist. We will see how they enable automatic caching of results, logging of runtime errors, interactive graphs for Jupyter notebooks, PySpark user-defined functions, and remote code execution.
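
Two of these ideas, result caching and error logging, can be sketched with the standard library alone. The decorators `memoize` and `log_errors` and the function `reciprocal` below are illustrative names, not the speaker's code; for caching, the standard library also ships `functools.lru_cache`.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def memoize(func):
    """Cache results keyed by the positional (hashable) arguments."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper


def log_errors(func):
    """Log any exception raised by `func`, then re-raise it."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            logger.exception("%s failed", func.__name__)
            raise

    return wrapper


calls = []  # records every real (non-cached) evaluation


@log_errors
@memoize
def reciprocal(x):
    calls.append(x)
    return 1.0 / x


reciprocal(4)  # computed
reciprocal(4)  # served from the cache
print(calls)   # [4]
```

Stacking the decorators keeps each concern in one place: the function body stays free of caching and logging code.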


»modAL: A modular active learning framework for Python«
Tivadar Danka; Talk (15 minutes)

[modAL](https://cosmic-cortex.github.io/modAL/) is an active learning framework for Python, built on top of scikit-learn. In this talk, we are going to take a look at how active learning can help you bring out the best from your unlabelled data and how you can rapidly build active learning workflows with nearly complete freedom using [modAL](https://cosmic-cortex.github.io/modAL/).


»Apache Parquet as a columnar storage for large datasets«
Peter Hoffmann; Talk (30 minutes)

Apache Parquet has become the de facto columnar storage format for large data processing. With the support in Pandas and Dask through Apache Arrow and fastparquet, Python has gained an efficient binary DataFrame storage format. This talk will outline the Apache Parquet data format and show how it can be used in Python to work with data larger than memory and larger than local disk space.


»Data visualization -- from default and suboptimal to efficient and awesome«
Boris Gorelik; Tutorial (90 minutes)

Data visualization is an indispensable tool for any data scientist. It serves as a means to convey a message or explain a concept. You would never settle for default settings of a machine learning algorithm. Instead, you would tweak them to obtain optimal results. Similarly, you should never stop with the default results you receive from a data visualization framework. Doing so leads to suboptimal results and makes you and your message less convincing. After this tutorial, you will be able to name the four most common mistakes in data visualization and know how to avoid them in your graphs. We will use matplotlib for this tutorial.


»Three most common mistakes in data visualization«
Boris Gorelik; Talk (15 minutes)

Communication is a crucial part of our jobs. Data visualization plays an important role in such communication. In this lecture, you will learn about the three biggest visualization anti-patterns that I have been able to identify during more than 15 years of my professional career. I will accompany each anti-pattern with a case study. After attending this lecture, you will be able to identify and fix common mistakes in your and your colleagues' graphs.


»GranFilm : the modeling of the optical properties of thin granular films.«
Alexis CVETKOV-ILIEV; Poster (90 minutes)

GranFilm is a numerical tool which simulates the optical response of supported nanoparticles. It aims at understanding thin film growth during sputtering.


»Understanding and diagnosing your machine-learning models«
Gaël Varoquaux; Tutorial (90 minutes)

Often achieving a good prediction is only half of the job. Questions immediately arise: How to improve this prediction? What drives the prediction? Can we make changes to the system based on the predictions? All these questions require understanding how good the model prediction is and how the model predicts. This tutorial assumes basic knowledge of scikit-learn. It will focus on statistics, tests, and interpretation rather than improving the prediction.


»ModelXplore, a python based model exploration«
Nicolas Cellier; Talk (15 minutes)

ModelXplore is a helper library that provides tools to facilitate the exploration of time-expensive models (or experiments). It gives access to a variety of samplers and regression functions (called meta-models), offers easy access to sensitivity analysis, and simplifies the computation of response surfaces.


»Pyccel, a Fortran static compiler for scientific High-Performance Computing«
Dr. Ing. Ratnani Ahmed; Talk (15 minutes)

Presenting Pyccel, a source-to-source Python-to-Fortran compiler and DSL enabling HPC capabilities.


»Reproducibility and exploratory computing with a Jupyter-based workflow«
Antonino Ingargiola; Talk (15 minutes)

Reproducibility and exploratory computing are often two irreconcilable driving needs in data analysis. Here, I will present a Jupyter-based workflow to manage the project lifecycle, from the initial exploratory phase to a mature and stable codebase. I will show how, by combining git, Google Drive, conda environments, Python packaging and batch notebook execution, it is possible to manage the complexity of a quickly growing project while keeping track of the analysis for reproducibility.


»Scikit-learn and tabular data: closing the gap«
Joris Van den Bossche; Talk (30 minutes)

This talk will give an overview of the challenges and current bottlenecks when working with tabular data and scikit-learn. Then it will show the ongoing developments in scikit-learn to improve this situation and highlight some third-party libraries that try to ease those problems.


»Explaining model predictions using Shapley values«
Ankur Ankan; Talk (15 minutes)

We can answer a lot of interesting questions by understanding the effects of features on the output of our models. Shapley values allow us to compute the contribution of the features in our dataset towards the predictions.
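
For a handful of features, Shapley values can be computed exactly by averaging each feature's marginal contribution over all orderings. The sketch below does this for a made-up model; the function `toy_value`, its coefficients, and the convention that absent features sit at a baseline of zero are all assumptions for illustration. Practical libraries rely on approximations, since the number of orderings grows factorially.

```python
from itertools import permutations


def shapley_values(value, players):
    """Exact Shapley values via marginal contributions over all orderings."""
    players = list(players)
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for player in order:
            extended = coalition | {player}
            totals[player] += value(extended) - value(coalition)
            coalition = extended
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(present):
    """Prediction of a toy model 2a + 3b + a*b for the instance a = b = c = 1.

    Features not in `present` are set to the baseline 0; feature c is
    ignored by the model entirely.
    """
    a = 1.0 if "a" in present else 0.0
    b = 1.0 if "b" in present else 0.0
    return 2.0 * a + 3.0 * b + a * b


phi = shapley_values(toy_value, ["a", "b", "c"])
print(phi)  # {'a': 2.5, 'b': 3.5, 'c': 0.0}
```

The interaction term `a*b` is split evenly between the two features involved, and the contributions add up to the difference between the full prediction (6) and the baseline prediction (0).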


»Pythonizing workflows with modern and legacy chemistry software.«
Olav Vahtras; Talk (30 minutes)

We describe initiatives to combine legacy and modern software in computational chemistry, with Python as the glue.


»Introduction to Python«
Mojdeh Rastgoo; Tutorial (90 minutes)

This tutorial will be a plain introduction to the Python language.


»Reproducible science with Binder, Docker and Jupyter applied to neuronal simulations using NeuronEAP«
Maria Teleńczuk; Poster (90 minutes)

NeuronEAP is a library for simulating the electric field generated during normal activity of the brain. To make this library accessible to anyone, we employed current technology for building scientific environments online (Binder, Docker, Jupyter notebook).


»Data visualizations for the web with Altair and Vega(-Lite)«
Patrick Muehlbauer; Talk (30 minutes)

Altair is a declarative visualization library for Python built on top of Vega-Lite. We will show how Altair and Vega enable data scientists and frontend developers to work efficiently together to build beautiful customer-facing data science dashboards.


»Using OpenCV + Games to help Parkinson's Patients«
Jayaditya Gupta; Talk (15 minutes)

Although Parkinson's disease is not curable, it can be delayed with the help of physical exercises. The project features a simple exercise game where the user has to lift his/her hands in the air, and the movements are tracked with the help of OpenCV.


»Navigating the Magical Data Visualisation Forest«
Margriet Groenendijk; Talk (15 minutes)

Data visualization is fun but can take up a lot of time, especially when you are exploring new data. The magic forest is much easier to navigate with [PixieDust](https://ibm-watson-data-lab.github.io/pixiedust/index.html), a free open-source Python library that makes it quick and simple to explore data with any visualization library without writing code in a Jupyter notebook. Learn how PixieDust takes out some of the coding, how to contribute, and how to make and share visualizations in seconds.


»Modelling Signal Acquisition Frontends for Compressed Sensing Applications using Open Source«
Christoph Wagner; Poster (90 minutes)

Compressed sensing is a nifty concept aiming to reduce the amount of data digitized during the acquisition of an analogue signal. This poster describes a holistic approach to perform the simulative evaluation of mixed-signal systems using publicly available open source modules from the Python ecosystem. For algorithmic evaluation numpy, scipy and fastmat will be used. The analogue frontend portion is split into multiple blocks and evaluated using BMSpy.


»Deep Diving into GANs: From Theory to Production«
Paolo Galeone, Michele De Simoni; Tutorial (90 minutes)

With our accrued experience with GANs, we would like to guide you through the required steps to go from theory to production with this revolutionary technology. Starting from the very basics of what a GAN is, we pass through a TensorFlow implementation using the most cutting-edge APIs available in the framework, and finally reach production-ready serving at scale using Google Cloud ML Engine.


»Efficient Biomedical Named Entity Recognition in Python«
Tilia Ellendorff; Talk (15 minutes)

We present two tools implemented in python for processing scientific literature in the biomedical domain: the Bio Term Hub is an automated aggregator of terminologies from life science databases; OGER is a fast, efficient, and accurate entity recognition and linking system.


»RepSeP - Reproducible Self-Publishing for Python-Based Research.«
Horea Christian; Talk (30 minutes)

**RepSeP** enables and exemplifies the compilation and distribution-friendly packaging of reproducible scientific publications which employ live elements (e.g. plots, tables, or inline statistics) generated from Python analysis scripts. The repository ships PythonTeX boilerplate code (for high quality typesetting), and showcases a defined set of Python analysis scripts used throughout the main science communication formats --- article, poster, and presentation slide set --- with appropriate content-variant styling.


»Workflow for Optimizing Data Visualization and Documentation«
Dominique Albert-Weiß, Sayako Kodera; Poster (90 minutes)

The presented workflow is used in our research group and makes it possible to optimize the visualization and documentation process in scientific writing.


»Benchmarking and performance analysis for scientific Python«
Roman Yurchak; Talk (15 minutes)

In this talk we will review performance analysis tools available in the scientific Python ecosystem, as well as useful metrics that can be used to guide the code optimization process. The [neurtu](https://github.com/symerio/neurtu) package will be introduced, aiming to facilitate time and memory complexity estimations together with parametric benchmarks.


»Machine Learning Methods for User Experience and Performance Monitoring«
Susanne Greiner; Poster (90 minutes)

Performance monitoring is playing an increasing role in the time of IoT and the cloud. Unfortunately, data are commonly unlabeled, or labels added by a human expert who has been using the data for troubleshooting are only available for short periods. User experience can close this gap by providing detailed information on how the current constellation of performance metric values is perceived by the user, making it a starting point for several investigations based on machine learning methods. Scikit-learn is used to explore synthetic monitoring data together with performance metrics from heterogeneous sources.


»Phase-space analysis of chaotic deterministic dynamics with Python: the case of biological systems with many degrees of freedom«
Unnamed user, Paola Lecca; Talk (15 minutes)

We present a Python program that generates and analyses the phase portrait of systems of ordinary differential equations (ODEs) describing dynamics affected by deterministic chaos. Deterministic chaos manifests itself in an irregular behaviour of the dynamics arising from a strictly deterministic time evolution without any source of noise or external stochasticity. This irregularity is expressed in an extremely sensitive dependence on the initial conditions, which precludes any long-term prediction of the dynamics. Deterministic chaos can be found in systems with very few degrees of freedom, and is usually termed low-dimensional deterministic chaos, as it is attributable to the dynamics of a small fraction of the total system components. Low-dimensional chaos is expected to be common in systems with few degrees of freedom, but rare in systems with many degrees of freedom such as medium- or large-size dynamic biological networks. The possibility of detecting low-dimensional chaos in biological networks is of great interest to the community of biologists and biotechnologists because, since deterministic chaos is generated by an underlying deterministic process, it is potentially controllable. Our Python program analyses the phase portrait and performs a stability analysis of the ODEs of the dynamics of biologically interacting agents and detects the major drivers of chaos. The identification of these drivers makes it possible to improve the predictability and controllability of the system dynamics. We will show the performance of our code on an accurate ordinary differential equation model of the gene network regulating sporulation initiation in Bacillus subtilis. Results of the analysis will be presented at the conference.


»How PyPy can help for high-performance computing«
Antonio Cuni; Talk (30 minutes)

PyPy is an alternative implementation of Python which is famous for its speed: thanks to its JIT compiler and fast GC, it can run Python programs up to 100x faster than CPython. This talk will cover the following topics, with a particular focus on scientific applications:

- What is PyPy, and the current status of scientific libraries
- Why and when it is fast
- When it is slow, and how the PyPy team is handling the problem
- Future roadmap


»Privacy for Data Scientists«
Katharine Jarmul; Tutorial (90 minutes)

What does data privacy mean within the realm of data science? How can we continue to do high-performance data science in a post-GDPR world? In this 3-hour tutorial, we will cover privacy basics for data scientists: from an introduction to the theories and algorithms defining privacy to practical steps you can take to better preserve privacy in your data science.


»Parallel Data Analysis with Dask«
Ian Stokes Rees; Tutorial (90 minutes)

The libraries that power data analysis in Python are essentially limited to a single CPU core and to datasets that fit in RAM. Attendees will see how dask can parallelize their workflows, while still writing what looks like normal python, NumPy, or pandas code. Dask is a parallel computing framework, with a focus on analytical computing. We'll start with dask.delayed, which helps parallelize your existing Python code. We’ll demonstrate dask.delayed on a small example, introducing the concepts at the heart of dask like the task graph and the schedulers that execute tasks. We’ll compare this approach to the simpler, but less flexible, parallelization methods available in the standard library like concurrent.futures. Attendees will see the high-level collections dask provides for writing regular Python, NumPy, or Pandas code that is then executed in parallel on datasets that may be larger than memory. These high level collections provide a familiar API, but the execution model is very different. We'll discuss concepts like the GIL, serialization, and other headaches that come up with parallel programming. We’ll use dask’s various schedulers to illustrate the differences between multi-threaded, multi-processes, and distributed computing. Dask includes a distributed scheduler for executing task graphs on a cluster of machines. We’ll provide each person access to their own cluster.
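
For comparison, the standard-library baseline mentioned above can be sketched in a few lines with `concurrent.futures`. The function `load_chunk` is a hypothetical stand-in for an I/O-bound step and is not part of the tutorial material.

```python
from concurrent.futures import ThreadPoolExecutor


def load_chunk(index):
    """Hypothetical stand-in for loading one chunk of data."""
    return list(range(index))


def total_length(n_chunks, max_workers=4):
    """Load chunks concurrently, then combine the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        chunks = list(executor.map(load_chunk, range(n_chunks)))
    return sum(len(chunk) for chunk in chunks)


print(total_length(5))  # 10
```

This covers a flat map over independent tasks. Expressing dependencies between tasks is left to the caller, which is the kind of bookkeeping a task graph takes over.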


»Databases for Data Scientists (Overview & SQL)«
Alexander Hendorf; Tutorial (90 minutes)

This tutorial will provide a crash course on the major differences between the various database systems, such as relational, NoSQL and graph databases. We will cover relational database design and the SQL query language.
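
As a small taste of relational design and SQL, the sketch below builds a two-table schema in an in-memory SQLite database through the standard-library `sqlite3` module and runs a join with aggregation. The schema, the table names, and the rows are invented for illustration and are not taken from the tutorial.

```python
import sqlite3


def build_schedule_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE speaker (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE session (
            id         INTEGER PRIMARY KEY,
            title      TEXT NOT NULL,
            minutes    INTEGER NOT NULL,
            speaker_id INTEGER NOT NULL REFERENCES speaker(id)
        );
        """
    )
    conn.executemany(
        "INSERT INTO speaker (id, name) VALUES (?, ?)",
        [(1, "Speaker A"), (2, "Speaker B")],
    )
    conn.executemany(
        "INSERT INTO session (title, minutes, speaker_id) VALUES (?, ?, ?)",
        [("Tutorial X", 90, 1), ("Talk Y", 15, 1), ("Talk Z", 30, 2)],
    )
    return conn


def minutes_per_speaker(conn):
    """Join both tables and aggregate session time per speaker."""
    query = """
        SELECT sp.name, COUNT(*), SUM(se.minutes)
        FROM speaker AS sp
        JOIN session AS se ON se.speaker_id = sp.id
        GROUP BY sp.name
        ORDER BY sp.name
    """
    return conn.execute(query).fetchall()


connection = build_schedule_db()
print(minutes_per_speaker(connection))
# [('Speaker A', 2, 105), ('Speaker B', 1, 30)]
```

Storing speakers once and referencing them by key avoids repeating the name in every session row, which is the core idea behind normalised relational design.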


»Empowered Analytics: Blending MongoDB's Aggregation Framework with Numpy and Pandas«
Nathan Leniz, Anna Herlihy; Tutorial (90 minutes)

Learn how to use MongoDB's Aggregation framework to preprocess, transform, and compute values prior to bringing the data down from the cloud. Then use Pandas, Numpy, and Seaborn to further analyze the data, create beautiful visualizations, improve your overall analysis, and deliver stunning reports.


»Simpler data science: dirty categories and scikit-learn updates«
Gaël Varoquaux; Talk (15 minutes)

This talk will touch upon two packages that hope to make machine learning easier, dirty_cat [1], and the lesser-known scikit-learn. Dirty-cat strives to make it easier to work with categorical data that contain variations in the categories, such as typos, variants of company names, or open-ended input. It uses simple, off-the-shelf vectorization of the categories in a way that is robust to morphological variants. I will also give an update on scikit-learn: upcoming features, and the thriving health of a happy community. [1] https://dirty-cat.github.io