Why GPU Clusters Don't Need to Go Brrr? Leverage Compound Sparsity to Achieve the Fastest Inference Performance on CPUs
04-19, 10:00–10:45 (Europe/Berlin), A1

Forget specialized hardware. Get GPU-class performance on your commodity CPUs with compound sparsity and sparsity-aware inference execution.
This talk will demonstrate the power of compound sparsity for model compression and inference speedup in the NLP and CV domains, with a special focus on the recently popular Large Language Models. The combination of structured and unstructured pruning (to 90%+ sparsity), quantization, and knowledge distillation can be used to create models that run an order of magnitude faster than their dense counterparts, without a noticeable drop in accuracy. Session participants will learn the theory behind compound sparsity, the state-of-the-art techniques involved, and how to apply them in practice using the Neural Magic platform.


By intelligently applying SOTA compound sparsity techniques, we can remove 95%+ of the weights from modern models such as BERT and reduce the remaining 5% to 8-bit precision, while maintaining 99%+ of the baseline accuracy. In this talk, we’ll cover how to build up to this extreme sparsity and how to harness it to achieve an order-of-magnitude speedup for CPU inference.
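As a rough, self-contained illustration of the three ingredients (the talk itself relies on Neural Magic's SparseML recipes rather than this hand-rolled version), here is a minimal sketch using plain PyTorch utilities; the toy two-layer model and the 90% / temperature-2.0 hyperparameters are placeholders:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torch.nn.utils.prune as prune

    # Toy stand-in for a transformer feed-forward block (placeholder model).
    model = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768))

    # 1) Unstructured magnitude pruning: zero out 90% of each Linear layer's weights.
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.90)
            prune.remove(module, "weight")  # bake the zeros into the weight tensor

    # 2) Quantization: store the surviving weights in 8-bit precision.
    quantized = torch.ao.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )

    # 3) Knowledge distillation: train the sparse "student" to match a dense
    #    "teacher" on softened logits, which recovers most of the lost accuracy.
    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        return F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2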

This talk tells the success story of utilizing sparsity to run fast inference of modern neural networks on CPUs. We will pay particular attention to the popular Large Language Models, with the goal of showing how the recent state of the art in model compression can dramatically lower the computational budget of model inference.

Today’s ML hardware acceleration is headed towards chips that apply a petaflop of compute to a cell-phone-sized memory. Our brains, on the other hand, are biologically the equivalent of applying a cell phone’s worth of compute to a petabyte of memory. In this sense, the direction taken by hardware designers is the opposite of the one proven by nature. Why? Simply because we don’t know the algorithms nature uses.
GPUs move data in and out quickly, but have little locality of reference because of their small caches. They are geared towards applying a lot of compute to a small amount of data, not a small amount of compute to a lot of data. Networks are designed to run on them one full layer after another in order to saturate the computational pipeline.
CPUs, on the other hand, have much larger and faster caches than GPUs, and an abundance of memory (terabytes). A typical CPU server can have memory equivalent to tens or even hundreds of GPUs. CPUs are perfect for a brain-like ML world in which parts of an extremely large network are executed piecemeal, as needed.

This is the problem Neural Magic set out to solve, and the perspective that led to the creation of DeepSparse, a custom computational engine designed to mimic, on commodity hardware, the way brains compute. It exploits neural network sparsity combined with locality of reference, making use of the CPU’s large, fast caches and its very large memory.
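For a sense of what this looks like from Python, here is a minimal usage sketch assuming DeepSparse's Pipeline API; the SparseZoo model stub below is a placeholder, so check the Neural Magic SparseZoo for real stubs of pruned-quantized models:

    from deepsparse import Pipeline

    # Placeholder SparseZoo stub for a pruned + quantized BERT; substitute a real one.
    pipeline = Pipeline.create(
        task="text-classification",
        model_path="zoo:placeholder/pruned-quantized-bert-stub",
    )

    print(pipeline(sequences=["Sparse models can run surprisingly fast on CPUs."]))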


Expected audience expertise: Domain

Intermediate

Expected audience expertise: Python

Novice

Abstract as a tweet

Fun fact: you can remove 90% of a neural network's weights without losing much accuracy! With model sparsity, you can even run these networks on your CPU with GPU-level performance. Learn about compound sparsity (pruning, quantization, knowledge distillation) for faster inference

Public link to supporting material

https://github.com/neuralmagic

Engineer, roboticist, software developer, and problem solver. Previous experience in autonomous driving (Argo AI), AI in industrial robotics (Arrival), and building machines that build machines (Tesla). Currently working at Neural Magic, focusing on the sparse future of AI computation.
Works towards unlocking creative and economic potential with intelligent robotics while avoiding the uprising of sentient machines.