Modern Data Science: A new approach to DataFrames and pipelines
2019-09-04, Track 1 (Mitxelena)

We will demonstrate how to explore and analyse massive datasets (>150GB) on a laptop with the Vaex library in Python. Using computational graphs, efficient algorithms and storage (Apache Arrow / hdf5) Vaex can easily handle up to a billion rows.


Working with datasets comprising millions or billions of samples is an increasingly common task, one that is typically tackled with distributed computing. Nodes in high-performance computing clusters have enough RAM to run intensive and well-tested data analysis workflows. More often than not, however, this is preceded by the scientific process of cleaning, filtering, grouping, and otherwise transforming the data, guided by continuous visualisations and correlation analysis. In today’s work environments, many data scientists prefer to do this on their laptops or workstations, so as to use their time more effectively and not depend on a spotty internet connection to reach remote data and compute resources. Modern laptops have sufficiently fast SSD storage, but upgrading RAM is expensive or impossible.

Combining computational graphs, which are common in neural network libraries, with delayed (a.k.a. lazy) evaluation in a DataFrame library enables efficient memory and CPU usage. Together with memory-mapped storage (Apache Arrow, hdf5) and out-of-core algorithms, this lets us process considerably larger datasets with fewer resources. As an added bonus, the computational graph ‘remembers’ all operations applied to a DataFrame, meaning that data processing pipelines can be generated automatically.
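As a minimal sketch of what this looks like in practice: adding a virtual column in Vaex only records an expression, and nothing is computed or allocated until a statistic is requested. The file name and column names below are placeholders, not part of the talk’s dataset.

```python
import vaex

# Open a memory-mapped dataset (hdf5 or Apache Arrow); no data is loaded into RAM yet.
df = vaex.open('big_dataset.hdf5')  # placeholder file name

# A "virtual column" only adds an expression to the computational graph;
# no new array is materialised, so the memory cost is negligible.
df['speed'] = df.distance / df.duration

# Only when a statistic is requested does Vaex stream over the data in
# chunks (out of core) and evaluate the expression on the fly.
print(df.mean(df.speed))

# The accumulated state (expressions, filters, transformations) can be
# exported and re-applied to new data, which is what makes automatically
# generated pipelines possible.
state = df.state_get()
```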

In this talk, we will demonstrate Vaex, an open-source DataFrame library that embodies these concepts. Using data from the New York City YellowCab taxi service comprising 1.1 billion samples and taking up over 170 GB on disk, we will showcase how one can conduct an exploratory data analysis, complete with filtering, grouping, calculations of statistics and interactive visualisations on a single laptop in real time. Finally we will show an example of how one can automatically build a machine learning pipeline as a by-product of the exploratory data analysis using the computational graphs in Vaex.
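A rough sketch of this kind of exploratory workflow is shown below. The file path is a placeholder, the column names follow the public NYC Yellow Cab schema, and the plotting call reflects the Vaex API around the time of the talk (it may differ in newer releases).

```python
import vaex

# Placeholder path to the ~170 GB memory-mapped taxi dataset.
df = vaex.open('yellow_taxi_2009_2015.hdf5')

# Filtering is lazy: this creates a view, no copy of the billion rows is made.
df_clean = df[(df.fare_amount > 0) & (df.trip_distance > 0)]

# Group-by with out-of-core aggregation over the full dataset.
stats = df_clean.groupby(by='passenger_count',
                         agg={'mean_fare': vaex.agg.mean('fare_amount')})
print(stats)

# A binned heatmap of pick-up locations, computed over all rows.
df_clean.plot(df_clean.pickup_longitude, df_clean.pickup_latitude,
              limits=[[-74.05, -73.75], [40.58, 40.90]], shape=256)
```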


Project Homepage / Git

https://github.com/vaexio/vaex

Abstract as a tweet

We will use #Vaex to explore and analyse a massive (>150GB) dataset on a laptop in real time. Using computational graphs, efficient algorithms and storage (Apache Arrow / hdf5) Vaex can easily handle up to a billion rows.

Python Skill Level

basic

Domain Expertise

none

Domains

Astronomy, Big Data, Data Visualisation, Jupyter, Machine Learning, Open Source, Vector and array manipulation

Jovan is a senior data scientist & researcher at XebiaLabs, where he creates predictive models related to DevOps pipelines. Working mostly with Python in the Jupyter ecosystem, he has considerable experience in clustering analysis and predictive modeling. Jovan has a PhD in Astrophysics, is a co-founder of vaex.io, and is interested in novel machine learning technologies and applications.

Maarten Breddels is an entrepreneur and freelance developer/consultant/data scientist working mostly with Python, C++ and JavaScript in the Jupyter ecosystem. He is the creator of ipyvolume and vaex, and founder of vaex.io. His expertise ranges from fast numerical computation and API design to 3D visualisation. He holds a Bachelor's in ICT and a Master's and PhD in Astronomy, and likes to code and solve problems.
