JuliaCon 2024

KSVD.jl: A case study in performance optimization.
07-10, 16:30–16:40 (Europe/Amsterdam), Function (4.1)

We present KSVD.jl, an extremely fast implementation of the K-SVD algorithm, featuring extensive single-core optimizations, shared-state multithreading, pipelined GPU offloading, and an optional distributed executor. With this implementation, we outperform existing NumPy-based implementations by roughly 100x and scale to datasets with millions of samples.


The K-SVD algorithm has recently gained attention for dictionary learning in the context of large-scale transformer models (see, e.g., Bricken et al., 2023). However, as it stands, it has been deemed computationally infeasible for large datasets with millions (or billions) of samples.
And indeed, current implementations do not seem up to the task: the implementation available as sklearn.decomposition.MiniBatchDictionaryLearning, which extensively leverages NumPy and joblib, takes over 3 minutes for ten iterations on a dataset with ten thousand elements (despite multi-threading).

For this reason, we present KSVD.jl, an implementation of the K-SVD algorithm in Julia.
This implementation outperforms sklearn’s implementation by about $50\times$ on the same problem, gains an additional $2\times$ by reducing the precision from Float64 to Float32, and scales across many compute nodes with almost linear speedup.
That means if, for example, eight compute nodes are available, we can expect a speedup of $(50 \cdot 2 \cdot 8)\times = 800\times$ for moderate to large datasets.
Further, KSVD.jl employs several algorithmic modifications that, to the author’s knowledge, lead to faster convergence for the same number of compute operations.
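
To make this concrete, a minimal usage sketch is given below. Note that the entry-point name `ksvd`, its keyword arguments, and the result fields are assumptions for illustration; the actual API is documented in the package README.

```julia
using KSVD

# Minimal usage sketch. The entry point `ksvd`, its keyword arguments, and the
# result fields below are assumptions for illustration; see the package README
# for the actual API.
Y = rand(Float32, 128, 100_000)      # data matrix, one sample per column
res = ksvd(Y, 256; maxiters = 10)    # learn a dictionary with 256 atoms
D, X = res.D, res.X                  # dense dictionary and sparse coefficients
```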

In this talk, we will give a breakdown of the optimizations and their effects, as well as guidance on the performance optimization workflow in Julia. In particular, we will treat the K-SVD algorithm as a case study to discuss several different aspects of performance optimization and their applicability in Julia. Specifically, this will include the points below; illustrative code sketches for several of them follow the list:
- careful changes to the execution order and small algorithmic adjustments,
- single-core optimizations like aggressive buffer preallocations, exploiting cache locality, improving the memory layout, and reducing memory movements,
- careful multi-threading using small batch updates with frequent cross-communication implemented with Julia’s efficient task scheduling,
- a custom multi-threaded dense-sparse matrix multiplication implementation (ThreadedDenseSparseMul.jl),
- pipelined GPU-offloading for large matrix multiplications (currently unused in the fastest version),
- a custom distributed executor that allows the computation to be spread over many compute nodes.
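
As an illustration of the single-core optimizations listed above, a typical pattern is to hoist buffer allocations out of the hot loop and use in-place, five-argument `mul!`. The sketch below is a generic example of that pattern, not code taken from KSVD.jl.

```julia
using LinearAlgebra

# Illustrative sketch of buffer preallocation and in-place updates in a hot loop;
# not code from KSVD.jl itself.
function residual_norms!(out::Vector{Float32}, D::Matrix{Float32},
                         X::Matrix{Float32}, Y::Matrix{Float32})
    buf = similar(Y, size(Y, 1))             # allocated once, reused for every column
    @inbounds for j in axes(Y, 2)
        # buf = Y[:, j] - D * X[:, j], computed without temporaries
        copyto!(buf, @view Y[:, j])
        mul!(buf, D, @view(X[:, j]), -1.0f0, 1.0f0)  # 5-arg mul!: buf = -D*x + buf
        out[j] = norm(buf)
    end
    return out
end
```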
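
The batched multi-threading point can be sketched with Julia's task scheduler as follows. The per-column update here is a stand-in, and the real cross-batch communication in KSVD.jl is more involved; this only shows the spawn/sync pattern.

```julia
using Base.Threads, LinearAlgebra

# Sketch of small-batch multi-threading with Julia's task scheduler. The
# per-column update is a placeholder, not a KSVD.jl internal.
normalize_column!(X, j) = (X[:, j] ./= (norm(@view X[:, j]) + eps(eltype(X))))

function threaded_batch_pass!(X::Matrix{Float32}; batchsize = 64)
    for batch in Iterators.partition(axes(X, 2), batchsize)
        @sync for j in batch
            Threads.@spawn normalize_column!(X, j)   # one lightweight task per column
        end
        # synchronization point between batches: shared state would be merged here
    end
    return X
end
```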
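
The idea behind the threaded dense-sparse multiplication can be sketched as below: each output column depends only on the corresponding column of the sparse factor, so the outer loop parallelizes without locking. This is a conceptual sketch, not the actual implementation or API of ThreadedDenseSparseMul.jl.

```julia
using SparseArrays, Base.Threads

# Conceptual sketch of a threaded dense * sparse product, parallelized over the
# columns of the sparse factor; not ThreadedDenseSparseMul.jl's implementation.
function threaded_densesparse_mul!(C::Matrix{T}, D::Matrix{T},
                                   X::SparseMatrixCSC{T}) where {T}
    fill!(C, zero(T))
    rows, vals = rowvals(X), nonzeros(X)
    @threads for j in 1:size(X, 2)
        for k in nzrange(X, j)                  # visit only the stored entries of column j
            @views C[:, j] .+= vals[k] .* D[:, rows[k]]
        end
    end
    return C
end
```

Since every task writes to a distinct output column, no synchronization is needed; a call would look like `threaded_densesparse_mul!(similar(D, size(D, 1), size(X, 2)), D, X)`.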
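
For the GPU offloading, a simplified sketch of chunked, overlapping transfers and GEMMs with CUDA.jl is shown below. CUDA.jl runs each Julia task on its own stream, so uploads, GEMMs, and downloads of different chunks can overlap in a pipeline; this illustrates the idea and is not the code used in KSVD.jl.

```julia
using CUDA

# Hypothetical sketch of pipelined GPU offloading for a large product D * X;
# not the code used in KSVD.jl.
function chunked_gpu_mul(D::Matrix{Float32}, X::Matrix{Float32}; chunk = 8_192)
    d_D = CuArray(D)                              # dictionary stays resident on the GPU
    out = Matrix{Float32}(undef, size(D, 1), size(X, 2))
    @sync for cols in Iterators.partition(1:size(X, 2), chunk)
        Threads.@spawn begin
            d_X = CuArray(X[:, cols])             # host -> device for this chunk
            out[:, cols] = Array(d_D * d_X)       # GEMM on the device, then copy back
        end
    end
    return out
end
```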
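
Finally, the shard-and-reduce idea behind the distributed execution can be sketched with Distributed.jl. KSVD.jl uses a custom executor, so the following is only illustrative; the worker count, shard size, and per-shard computation are placeholders.

```julia
using Distributed
addprocs(4)   # on a cluster, workers on other nodes would be added via a cluster manager

# Conceptual shard-and-reduce sketch; the per-shard computation is a placeholder.
Y = rand(Float32, 128, 100_000)
shards = collect(Iterators.partition(axes(Y, 2), cld(size(Y, 2), nworkers())))
partials = pmap(shards) do cols
    sum(abs2, Y[:, cols])        # stand-in for the real per-shard work
end
total = sum(partials)
```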

We hope that the insights we share will empower others to improve and optimize their own code where applicable, and will provide some guidance on breaking the process down step by step.

The package is available at https://github.com/RomeoV/KSVD.jl.

See also: The prezzy 😎
  • PhD Student at the Stanford Intelligent Systems Lab (SISL), developing certification guidelines for employing data-driven algorithms in safety-critical applications.