PyConDE & PyData Berlin 2024

Data valuation for machine learning
2024-04-22, B07-B08

Data valuation techniques compute the contribution of individual training points to the final performance of a machine learning model. They are part of so-called data-centric ML, with immediate applications in data engineering, such as data pruning and improved collection processes, as well as in model debugging and development. In this talk we demonstrate how the open source library pyDVL can be used to detect mislabeled and out-of-distribution samples with little effort. We cover the core ideas behind the most successful algorithms and illustrate how they can be used to inspect your data and get the most out of it.


The core idea of so-called data-centric machine learning is that effort spent on improving the quality of the data used to train a model is often better invested than the same effort spent on improving the model itself. This time-tested rule of thumb is particularly relevant for applications where data is scarce, expensive to acquire or difficult to annotate.

Concepts of the usefulness of a datum, or of its influence on the outcome of a prediction, have a long history in statistics and ML, in particular through the notion of the influence function. However, only recently have rigorous and practical notions of value for data, and in particular for datasets, appeared in the ML literature. The core idea is to look at data points known to be "useful" in some sense (for instance, because they substantially contribute to the final performance of a model) and focus acquisition or labelling efforts around similar ones, while eliminating or "cleaning" the less useful ones.

In a nutshell, data valuation for machine learning is the task of assigning a scalar to each element of a training set that reflects its contribution to the final performance of a model trained on it. These values can be used to repair or prune corrupt or superfluous data, or to guide data collection, for example with active-learning strategies when labelling is expensive.
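
For illustration, here is a minimal, deliberately naive sketch of the idea using only scikit-learn and NumPy (it is not pyDVL's API): the value of a training point is taken to be the change in test accuracy when that point is left out of the training set.

```python
# Minimal leave-one-out valuation sketch (illustrative only, not pyDVL's API):
# the value of a point is the drop in test accuracy when it is removed.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def test_accuracy(X, y):
    model = LogisticRegression(max_iter=1000).fit(X, y)
    return model.score(X_test, y_test)

full_score = test_accuracy(X_train, y_train)
values = np.array([
    full_score - test_accuracy(np.delete(X_train, i, axis=0), np.delete(y_train, i))
    for i in range(len(X_train))
])

# The lowest-valued points hurt performance the most and are natural
# candidates for mislabeled or out-of-distribution samples.
print(np.argsort(values)[:10])
```

Exact leave-one-out is already expensive, and game-theoretic refinements such as Shapley values require averaging marginal contributions over many subsets, which is where approximation algorithms like those implemented in pyDVL come in.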

While many exact methods have exponential time complexity in the size of the training set, recent advances provide good approximation strategies or introduce alternative approaches that are starting to make the field relevant in practice. In this context, pyDVL is an LGPL library aiming to provide robust, parallel implementations of every relevant method for easy use in applications and research. In this talk we showcase how it can be used to detect issues in data pipelines and to improve final performance. pyDVL is still in an early stage of development, but it already provides over a dozen algorithms, runs in parallel using ray, and supports sklearn-compatible interfaces as well as large pytorch models with out-of-core computation thanks to dask.
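
As a rough sketch of what usage can look like, the following is loosely based on pyDVL's documented 0.x interface. The names used here (Dataset.from_sklearn, Utility, compute_shapley_values, MaxUpdates) are assumptions recalled from that documentation and may change between releases, so please check the project docs before relying on them.

```python
# Hedged sketch based on pyDVL's 0.x interface; names and signatures are
# assumptions and should be verified against the current documentation.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

from pydvl.utils import Dataset, Utility
from pydvl.value import MaxUpdates, compute_shapley_values

# Wrap the data and the model into pyDVL's abstractions.
data = Dataset.from_sklearn(load_breast_cancer(), train_size=0.8)
utility = Utility(LogisticRegression(max_iter=1000), data)

# Approximate Shapley values with truncated Monte Carlo sampling.
result = compute_shapley_values(
    utility, mode="truncated_montecarlo", done=MaxUpdates(500)
)

# Low-valued training points are candidates for inspection: mislabeled,
# noisy or out-of-distribution samples tend to end up at the bottom.
result.sort()  # ascending by value (assumed default)
print(result.to_dataframe().head(10))
```

The resulting values can then be fed back into a cleaning or pruning step, for instance by retraining after dropping the lowest-valued points and checking whether test performance improves.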


Expected audience expertise: Domain

Intermediate

Expected audience expertise: Python

Intermediate

Abstract as a tweet (X) or toot (Mastodon)

pyDVL is the library for data valuation in machine learning. Use it to clean, prune and select your data to improve model performance.

Public link to supporting material, e.g. videos, Github, etc.

https://pydata2024.pydvl.org

After several years working as a software developer, Miguel pursued studies in pure mathematics in Madrid and Munich. After finishing his PhD in mathematics and a short research stay in machine learning, he transitioned into the field and ended up working as an applied researcher at the appliedAI Initiative, where he went on to found and head the TransferLab.

After completing his PhD in applied mathematics, specializing in applied harmonic and numerical analysis, Kristof developed a keen interest in the rapidly evolving field of artificial intelligence. This interest inspired him to transition his career towards AI engineering, where he spent the next five years working on various machine learning projects. In May 2023, he joined the TransferLab team at the appliedAI Institute.