JuliaCon 2023

OndaBatches.jl: Continuous, repeatable, and distributed batching
2023-07-28 , 32-124

At Beacon Biosignals we don't want to reinvent the wheel for data loading and
batch randomization every time we stand up a new machine learning project. So
we've collected a set of patterns that have proven useful across multiple
projects into OndaBatches.jl, which serves as a foundation for building the
specific batch randomization, featurization, and data movement systems that
each machine learning project requires.


Our typical machine learning task involves time series datasets composed of at
least thousands of multichannel recordings, each of which has on the order of
100 million individual samples, with accompanying dense or sparse labels. While
not the largest machine learning datasets known to humankind, these are large
enough to be generally inconvenient. The size, shape, and structure of these
datasets (and the associated learning tasks) require some modifications to a
typical machine learning workflow (e.g. one in which the entire dataset is
processed in each training epoch).

In this talk, I will present OndaBatches.jl, a Julia package that implements a
set of patterns that have proven to be useful across a number of projects at
Beacon. OndaBatches.jl serves as a foundation for building the specific batch
randomization, featurization, and data movement systems that each machine
learning project requires. Its purpose is to build and serve batches for
machine learning workflows based on densely labeled time series data, in a way
that is:
- distributed (cloud native: throw more resources at it to make sure data
  movement is not the bottleneck)
- scalable (handle out-of-core datasets, both for signal data and labels)
- deterministic + reproducible (pseudo-random)
- resumable
- flexible and extensible via Julia's ordinary mechanism of multiple dispatch

This talk focuses on two aspects of OndaBatches.jl design and development.
First, I'll describe the process of moving a local workflow into a distributed
setting in order to support scalability. Second, I'll discuss how Julia's
composability has shaped the design and functionality of OndaBatches.jl. In
particular, OndaBatches.jl builds on...
- ...Onda.jl to represent both the multi-channel time series that is the input
  data and the regularly-sampled labels.
- ...Distributed.jl to compose well with various cluster managers (including
  Kubernetes via K8sClusterManagers.jl) in service of scalability.
- ...base Julia patterns around iteration in order to separate batch state from
  batch content (in service of reproducibility and resumability).
- ...Julia's multiple dispatch pattern to allow our machine learning teams to
  customize behavior where needed without having to re-invent basic
  functionality every time.
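
To make the last two points concrete, here is a minimal sketch of how Julia's iteration protocol can keep batch state separate from batch content, and how multiple dispatch can customize the step that turns a batch description into data. Every name in it (`BatchSpec`, `BatchSpecs`, `BatchState`, `materialize`, `load_samples`, and the featurizer types) is hypothetical and chosen for illustration; it is not the OndaBatches.jl API, and the "signal loading" is a stand-in that fabricates numbers rather than reading recordings.

```julia
using Random

# Hypothetical names throughout; this is NOT the OndaBatches.jl API.

# A lightweight description of one batch: which recording, which sample range.
struct BatchSpec
    recording::Int
    span::UnitRange{Int}
end

# Iterator configuration. It carries no position, only parameters.
struct BatchSpecs
    n_recordings::Int
    n_samples::Int    # samples per recording
    batch_len::Int    # samples per batch
    seed::Int
end

# All iteration state lives in this small value, which can be saved and reused.
struct BatchState
    step::Int
end

Base.IteratorSize(::Type{BatchSpecs}) = Base.IsInfinite()
Base.eltype(::Type{BatchSpecs}) = BatchSpec

function Base.iterate(b::BatchSpecs, state::BatchState=BatchState(1))
    # Seed a fresh RNG from (seed, step), so a batch depends only on the state.
    rng = MersenneTwister(hash((b.seed, state.step)))
    recording = rand(rng, 1:b.n_recordings)
    start = rand(rng, 1:(b.n_samples - b.batch_len + 1))
    spec = BatchSpec(recording, start:(start + b.batch_len - 1))
    return spec, BatchState(state.step + 1)
end

# Turning a spec into content is a separate step, specialized by dispatch.
abstract type Featurizer end
struct RawSamples <: Featurizer end
struct Centered <: Featurizer end

# Stand-in for reading signal data; a real project would hit storage here.
load_samples(spec::BatchSpec) = Float64.(spec.span) .+ spec.recording

materialize(::RawSamples, spec::BatchSpec) = load_samples(spec)
function materialize(::Centered, spec::BatchSpec)
    x = load_samples(spec)
    return x .- sum(x) / length(x)
end

# Step the iterator by hand so the state can be captured at any point.
function take_from(specs::BatchSpecs, state::BatchState, n::Int)
    out = BatchSpec[]
    for _ in 1:n
        spec, state = iterate(specs, state)
        push!(out, spec)
    end
    return out, state
end

specs = BatchSpecs(1000, 100_000_000, 4096, 42)
first_run = collect(Iterators.take(specs, 5))

# Stop after two batches, keep the state, and continue later.
head, saved = take_from(specs, BatchState(1), 2)
tail, _ = take_from(specs, saved, 3)
```

In this sketch, `vcat(head, tail)` matches `first_run`, because each batch is a pure function of the configuration and the state. Since a `BatchSpec` is small, it could be generated in one place and materialized elsewhere (for example on a worker process), though that part is not shown here.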

Research Scientist at Beacon Biosignals and recovering academic.