2019-09-05, Track 2 (Baroja)
Modern hardware is multi-core. It is crucial for Python to provide
efficient parallelism. This talk presents the current state of Python
parallelism and its recent advances, in order to help practitioners and
developers make better decisions on this matter.
Parallel computing in Python: Current state and recent advances
Modern hardware is multi-core. It is crucial for Python to provide
high-performance parallelism. This talk will present, to both data scientists
and library developers, the current state of affairs and the recent advances in
parallel computing with Python. The goal is to help practitioners and
developers make better decisions on this matter.
I will first cover how Python can interface with parallelism, from leveraging
the external parallelism of C-extensions (especially the BLAS family) to Python's
multiprocessing and multithreading APIs. I will touch upon use cases, e.g.
single- vs. multi-machine setups, as well as the pros and cons of the various
solutions for each use case. Most of these considerations will be backed by
benchmarks from the scikit-learn machine learning library.
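
As a rough illustration of the multithreading and multiprocessing APIs mentioned above, here is a minimal sketch that runs the same CPU-bound function in a thread pool and in a process pool. The names `count_primes` and `run_with` are hypothetical helpers written for this example, not part of any library, and the workload is a toy one rather than one of the scikit-learn benchmarks.

```python
import math
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool


def count_primes(limit: int) -> int:
    """CPU-bound, pure-Python task: count the primes below `limit`."""
    return sum(
        1
        for n in range(2, limit)
        if all(n % d for d in range(2, math.isqrt(n) + 1))
    )


def run_with(pool_factory, limits, workers=2):
    """Map `count_primes` over `limits` using the given pool class."""
    with pool_factory(workers) as pool:
        return pool.map(count_primes, limits)


if __name__ == "__main__":
    limits = [20_000] * 4
    # Threads share memory, but pure-Python code contends for the GIL.
    print(run_with(ThreadPool, limits))
    # Processes sidestep the GIL, but arguments and results are pickled.
    print(run_with(Pool, limits))
```

Both pools expose the same `map` interface, so switching between them is a one-argument change. Which one is faster depends on whether the work releases the GIL (as BLAS calls typically do) and on how much data has to cross process boundaries.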
From these low-level interfaces emerged higher-level parallel processing
libraries, such as concurrent.futures, joblib and loky (used by dask and
scikit-learn). These libraries make it easy for Python programmers to use safe
and reliable parallelism in their code. They can even work in more exotic
situations, such as interactive sessions, in which Python’s native
multiprocessing support tends to fail. I will describe their purpose as well as
the canonical use cases they address.
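
To give a flavour of the higher-level interface, here is a minimal sketch using concurrent.futures from the standard library. The helpers `square` and `parallel_squares` are hypothetical names chosen for this example. joblib offers a comparable entry point through `Parallel` and `delayed`, which is not shown here so that the example needs no third-party install.

```python
from concurrent.futures import (
    ProcessPoolExecutor,
    ThreadPoolExecutor,
    as_completed,
)


def square(x: int) -> int:
    """Toy task standing in for real work."""
    return x * x


def parallel_squares(values, executor_cls=ThreadPoolExecutor, max_workers=2):
    """Return {value: value ** 2}, computing each entry in a worker."""
    results = {}
    with executor_cls(max_workers=max_workers) as executor:
        # Map each future back to the input that produced it.
        futures = {executor.submit(square, v): v for v in values}
        for future in as_completed(futures):
            # result() re-raises any exception raised in the worker.
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    print(parallel_squares(range(5)))
    print(parallel_squares(range(5), executor_cls=ProcessPoolExecutor))
```

The executor hides worker start-up, task dispatch and shutdown behind `submit` and the `with` block, so the calling code stays the same whether the workers are threads or processes.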
The last part of this talk will focus on the most recent advances in the Python
standard library, addressing one of the principal performance bottlenecks of
multi-core/multi-machine processing, which is data communication. I will
present a new API for shared-memory management between different Python
processes, and performance improvements for the serialization of large Python
objects (PEP 574, pickle extensions). These performance improvements will be
leveraged by distributed data science frameworks such as dask, ray and pyspark.
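
The two standard-library features above can be sketched as follows, assuming Python 3.8 or later, where `multiprocessing.shared_memory` and pickle protocol 5 are available. The function names `shared_memory_roundtrip` and `pickle_out_of_band` are hypothetical helpers for this example. To keep the example self-contained, the second handle is opened in the same process, where a real program would attach from another process using the block's name.

```python
import pickle
from multiprocessing import shared_memory


def shared_memory_roundtrip(payload: bytes) -> bytes:
    """Write bytes into a shared block and read them back via a second handle."""
    block = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        block.buf[:len(payload)] = payload
        # Another process would attach the same way, given only block.name.
        attached = shared_memory.SharedMemory(name=block.name)
        try:
            return bytes(attached.buf[:len(payload)])
        finally:
            attached.close()
    finally:
        block.close()
        block.unlink()  # release the block once every user is done


def pickle_out_of_band(data: bytearray):
    """Pickle `data` with protocol 5, keeping its bytes out of the stream."""
    buffers = []
    payload = pickle.dumps(
        pickle.PickleBuffer(data),
        protocol=5,
        buffer_callback=buffers.append,  # falsy return value: out-of-band
    )
    return payload, buffers


if __name__ == "__main__":
    print(shared_memory_roundtrip(b"hello"))
    data = bytearray(1_000_000)
    payload, buffers = pickle_out_of_band(data)
    restored = pickle.loads(payload, buffers=buffers)
    # The pickle stream is tiny; the large buffer travels separately.
    print(len(payload), bytes(memoryview(restored)) == bytes(data))
```

With out-of-band buffers, the pickle stream carries only metadata, and the transport layer is free to send the large buffers without extra copies. This is the property distributed frameworks can build on.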
Python Skill Level: professional
Domain Expertise: none
Domains: Parallel computing / HPC
Hi! My name is Pierre. I currently work as a research engineer in the Parietal Team at INRIA, a French research institute. You may know my team, as we created many machine-learning and scientific computing libraries, including scikit-learn, joblib and nilearn. I am currently improving Python's multiprocessing tools across the whole scientific computing ecosystem. I have notably contributed to scikit-learn, joblib, numpy, cpython, cloudpickle and many other libraries. You can follow me on Twitter (https://twitter.com/PierreGlaser) and GitHub (https://github.com/pierreglaser).