2019-09-04, 15:45–16:00, Track 1 (Mitxelena)
Modern data systems tend to heavily focus on optimizing for the system’s time. In this talk, we discuss the design of Modin, a DataFrame library, and how to optimize for the human system.
Modern data systems tend to heavily focus on optimizing for the system’s time. Some of these optimizations, however, are counterproductive to the end user’s workflow and thought process. In this talk, we discuss the design of Modin, a DataFrame library, and how to optimize for the human system.
Modin is a project at UC Berkeley's RISELab designed to optimize for the data scientist’s time. Often when building a data system, the system designers will follow a set of “best practices” in order to optimize performance. These “best practices” often require data scientists to understand and hand-tune concepts and system components that are not central to extracting value from their data.
The fundamental goal of data science is to extract value from data. Despite this, data systems are being built with user requirements such as: (1) knowledge of partitioning, (2) an understanding of laziness and what triggers computation, (3) familiarity with an entirely new API, and (4) awareness of where their code is running (e.g., locally, on an on-prem cluster, or in the cloud). This overhead is passed to the data scientist, even though there is no overlap between these new requirements and the fundamental goal of their profession.
In this talk, we will discuss how we think about the problem of large-scale data science and optimizing for the human system. We will discuss the system design of Modin, which enables pluggable backends, runtimes, and APIs. The system is designed to meet the needs of the data science community regardless of an individual user’s environment. Currently, Modin supports the pandas API, and a proof of concept for SQL has been implemented. Modin is completely open-source and can be found on GitHub: https://github.com/modin-project/modin.
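To make the idea of pluggable backends behind a stable API more concrete, here is a minimal, purely illustrative sketch in plain Python. It is not Modin's actual architecture or code: the names `Backend`, `SerialBackend`, `ThreadBackend`, and `Column` are hypothetical and exist only for this example. The point it tries to show is that partitioning and the choice of execution engine can stay internal, so the user-facing call looks the same either way.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Protocol, Sequence


class Backend(Protocol):
    """Hypothetical execution-backend interface (not Modin's real one)."""

    def map(self, fn: Callable[[list], list], parts: Sequence[list]) -> List[list]: ...


class SerialBackend:
    """Runs each partition one after another in the calling thread."""

    def map(self, fn, parts):
        return [fn(p) for p in parts]


class ThreadBackend:
    """Runs partitions on a thread pool; results keep partition order."""

    def __init__(self, workers: int = 4):
        self.workers = workers

    def map(self, fn, parts):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(fn, parts))


class Column:
    """Toy column type whose partitioning is an internal detail."""

    def __init__(self, values, backend=None, n_parts=4):
        values = list(values)
        self._backend = backend or SerialBackend()
        size = max(1, -(-len(values) // n_parts))  # ceiling division
        self._parts = [values[i:i + size] for i in range(0, len(values), size)]

    def apply(self, fn):
        # The user supplies an element-wise function; the backend decides
        # how the partitions are actually executed.
        new_parts = self._backend.map(lambda part: [fn(v) for v in part], self._parts)
        out = Column([], backend=self._backend)
        out._parts = new_parts
        return out

    def to_list(self):
        return [v for part in self._parts for v in part]


# Same user code, different backend: only the constructor argument changes.
squares_serial = Column(range(10)).apply(lambda x: x * x).to_list()
squares_threaded = Column(range(10), backend=ThreadBackend()).apply(lambda x: x * x).to_list()
assert squares_serial == squares_threaded
```

In Modin itself, the documented way to get the pandas API is to replace `import pandas as pd` with `import modin.pandas as pd`; the sketch above only mirrors that "same API, swappable engine" idea at toy scale.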