DevFest Berlin 2024

Running open large language models in production with serverless GPUs
2024-11-23 17:10–17:50, Tresor

Many developers are interested in running open large language models, such as Google's Gemma and Meta's Llama. Open models give you full control over the deployment options, the timing of model upgrades, the private data that goes into the model, and the ability to fine-tune on specific tasks such as data extraction. Hugging Face TGI is a popular open-source LLM inference server, and Hugging Face TRL is excellent for fine-tuning. You’ll learn how to build and deploy an application that uses an open model on Google Cloud Run with cost-effective GPUs that scale down to zero instances.

Wietse Venema

Wietse Venema is an engineer at Google Cloud. He wrote the O’Reilly book on Cloud Run.
