PyCon Lithuania 2024

Gabriel Martín Blázquez

Gabriel is a Machine Learning Engineer focused on NLP. After moving from academia to industry, he now works at Argilla, where he has contributed to the Argilla backend and to the design and development of distilabel, a library for generating synthetic data using LLMs.


Twitter handle

@gabrielmbmb_

Notable open source projects

https://github.com/argilla-io/argilla
https://github.com/argilla-io/distilabel
https://github.com/zenml-io/zenml


Session

04-05
14:00
25min
🧼 From GPU-poor to data-rich: data quality practices for LLM fine-tuning
Gabriel Martín Blázquez, David Berenstein

If you are GPU-poor, you need to become data-rich. I will give an overview of what we learned from looking at Alpaca, LIMA, Dolly, UltraFeedback and Zephyr, and how we applied those lessons to fine-tune two state-of-the-art open source LLMs, Notus and Notux, by becoming data-rich.

Data
Room 111