The Race to the Bottom - Low Latency in the age of the Transformer
06-13, 16:50–17:10 (Europe/Berlin), Maschinenhaus

So you want to deploy a large language model and keep your latency SLA? NLP adds enormous value to customers, but getting it to work efficiently is fraught with uncertainty and high cost. As transformers and other big neural network architectures make their way into your platform, you may be finding it difficult to get the speed and throughput you need within your budget, or even to understand why it is so expensive.

This talk will give an overview of the latency and throughput challenges and how to solve them. We will cover the product and cost implications as well as the technical improvements that can be used to get things running fast. We will compare solutions and help make sense of difficult-to-understand technology.

The audience will walk away with the information they need to decide on the best direction for inference in their production platform.

Keywords: MLOps, Inference, Latency

Max Irwin is the founder of https://max.io and a contributing author of the book "AI-Powered Search". Prior to founding MAX.IO, he was Managing Consultant at OpenSource Connections and the founding leader of the Search Center of Excellence at Wolters Kluwer.

Max has over 20 years of experience directing the delivery and strategy of large-scale applications in various industries, including 10 years managing large, diverse teams globally to improve search quality and drive results. He has deep, practical, hands-on technical expertise in search relevance, customer experience, natural language processing, and growing a quality-focused culture.