2024-11-16, LT9
Language: English
This talk will cover various aspects of optimizing Large Language Models (LLMs) with Python, including getting started quickly, availability optimization, and throughput optimization. It explores cutting-edge techniques in areas such as model compilation, model compression, inference batching, distributed training, and Large Model Inference (LMI) containers, and presents practical examples of optimizing open-source models using LMI containers, Low-Rank Adaptation (LoRA), Fully Sharded Data Parallel (FSDP), Paged Attention, Rolling Batch, and more.
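To make the LoRA topic concrete, a minimal sketch in the spirit of what the talk covers might use the Hugging Face peft and transformers libraries; the base model name and hyperparameters below are illustrative assumptions, not the talk's exact configuration:

```python
# Minimal LoRA fine-tuning sketch (illustrative; not the talk's exact setup).
# Assumes the `transformers` and `peft` libraries are installed.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "facebook/opt-350m"  # hypothetical example model
model = AutoModelForCausalLM.from_pretrained(base_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# LoRA injects small low-rank adapter matrices into selected projection
# layers, so only a tiny fraction of the parameters is trained.
lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update
    lora_alpha=16,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of parameters
```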
The talk will delve into techniques and strategies for optimizing Large Language Models (LLMs) with Python. It focuses on addressing the computational challenges associated with training and deploying these models efficiently.
One key aspect discussed is model parallelism, which involves distributing the model across multiple devices or instances to overcome memory limitations. Tensor parallelism, a form of model parallelism in which individual tensors are split across devices, is explored. Pipeline parallelism, another technique, enables different stages of the model to execute concurrently on separate devices.
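The core idea behind tensor parallelism can be shown in a few lines of plain PyTorch. The sketch below simulates two "devices" with tensor chunks on CPU; in a real deployment each shard would live on a different GPU and the concatenation would be an all-gather over torch.distributed:

```python
# Conceptual sketch of tensor (column) parallelism in plain PyTorch.
import torch

torch.manual_seed(0)
x = torch.randn(4, 1024)        # a batch of activations
W = torch.randn(1024, 4096)     # a large linear-layer weight

# Split the weight column-wise into two shards, one per (simulated) device.
W_shards = torch.chunk(W, chunks=2, dim=1)

# Each device computes a partial matmul with its own shard...
partial_outputs = [x @ shard for shard in W_shards]

# ...and the results are concatenated (an all-gather in a real setup).
y_parallel = torch.cat(partial_outputs, dim=1)

# The sharded computation matches the single-device result.
assert torch.allclose(y_parallel, x @ W, atol=1e-5)
```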
The talk will also cover distributed training strategies, such as data parallelism and tensor-parallel language models, which leverage multiple devices to accelerate training. Techniques for reducing memory footprint, such as quantization, pruning, and distillation, are explored as ways to optimize LLM deployment.
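As one small, hedged example of the memory-reduction theme, post-training dynamic quantization in PyTorch converts Linear weights to int8. The toy model below is a stand-in; quantizing real LLMs typically relies on dedicated tooling (for example bitsandbytes or GPTQ-style libraries), which the sketch does not attempt to reproduce:

```python
# Sketch of post-training dynamic quantization with PyTorch.
import torch
import torch.nn as nn

# A toy fp32 model standing in for a much larger network.
model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
)

# Convert Linear weights from fp32 to int8; activations are quantized
# dynamically at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 4096)
print(quantized(x).shape)  # same interface, smaller weights
```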
Optimizations for inference are also discussed, including model compression methods such as quantization-aware training, pruning, and distillation. Kernel fusion, which combines multiple operations into a single optimized kernel, is highlighted as a way to improve inference performance. Additionally, the talk explores accelerated inference on hardware accelerators.
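A simple way to see kernel fusion in practice, assuming PyTorch 2.x, is torch.compile: its TorchInductor backend can fuse chains of element-wise operations such as the bias-add, GELU, and scaling below into fewer kernels. The function and shapes here are illustrative, not taken from the talk:

```python
# Sketch of operator/kernel fusion via torch.compile (requires PyTorch 2.x).
import torch

def feedforward_block(x, w, b):
    # Several element-wise ops that a fusing compiler can merge.
    y = x @ w + b
    y = torch.nn.functional.gelu(y)
    return y * 0.5

compiled_block = torch.compile(feedforward_block)

x = torch.randn(8, 1024)
w = torch.randn(1024, 1024)
b = torch.randn(1024)
out = compiled_block(x, w, b)  # first call triggers compilation
print(out.shape)
```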
The talk aims to provide guidance on leveraging Python's capabilities for efficient LLM training, deployment, and inference. It covers a range of strategies and techniques to address the computational challenges associated with these models, enabling researchers and practitioners to optimize LLMs for improved performance and cost-effectiveness.
Haowen Huang is currently a Senior Developer Advocate at Amazon Web Services (AWS). He has over 20 years of experience in the telecommunications, internet, and cloud computing industries. He has previously worked for companies such as Microsoft, Sun, and China Telecom. He currently focuses on creating and sharing technical content in the areas of generative AI, large language models (LLMs), machine learning, and data science, and empowering developers around the world.