PyCon Lithuania 2024

🧼 From GPU-poor to data-rich: data quality practices for LLM fine-tuning
2024-04-05, Room 111

If you are GPU-poor, you need to become data-rich. I will give an overview of what we learned from looking at Alpaca, LIMA, Dolly, UltraFeedback and Zephyr, and how we applied those lessons by becoming data-rich while fine-tuning state-of-the-art open-source LLMs called Notus and Notux.


GPUs are in high demand and low supply, but being GPU-poor can be offset by focusing on data quality and becoming data-rich. Looking at efforts like Alpaca, LIMA, Dolly, UltraFeedback and Zephyr, we see again and again that data quality often does not get the attention it deserves.

1) Alpaca was made up of synthetic data that was not representative of real-world usage. 2) LIMA, standing for Less Is More for Alignment, showed that a small, high-quality curated dataset, only a fraction of the usual size, could outperform much larger datasets in alignment tasks. 3) For Dolly, Databricks employees seemed to misunderstand the annotation task at hand. 4) UltraFeedback showed that synthetic data at scale was possible and that GPT-4 could be used to curate data aligned with human judgement. 5) Zephyr was trained on UltraFeedback but overlooked a bug in the dataset. 6) We trained Notus by resolving this bug, but overlooked the fact that training data was present in the benchmarks. 7) We started distilabel and worked on Notux.
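The UltraFeedback ranking issue can be sketched in a few lines. This is a minimal, hypothetical example (the field names and scores are made up for illustration, not the actual dataset schema): when each completion carries both a single overall score and several per-aspect ratings, the "chosen" response in a preference pair can flip depending on which signal you rank by.

```python
# Hypothetical, simplified completions in the spirit of UltraFeedback:
# each candidate response has per-aspect ratings plus a separate overall score.
completions = [
    {"text": "Response A", "overall_score": 10.0,
     "ratings": {"helpfulness": 2, "honesty": 3, "truthfulness": 2}},
    {"text": "Response B", "overall_score": 4.0,
     "ratings": {"helpfulness": 5, "honesty": 5, "truthfulness": 4}},
]

def mean_rating(completion):
    """Average the per-aspect ratings for one completion."""
    ratings = completion["ratings"].values()
    return sum(ratings) / len(ratings)

# Ranking by the single overall score vs. the mean of per-aspect ratings
# can pick different "chosen" responses for the same prompt.
chosen_by_overall = max(completions, key=lambda c: c["overall_score"])
chosen_by_mean = max(completions, key=mean_rating)

print(chosen_by_overall["text"])  # Response A
print(chosen_by_mean["text"])     # Response B
```

If the overall scores are buggy or inconsistent with the per-aspect ratings, every downstream preference pair inherits the error, which is the kind of discrepancy that curating the data before fine-tuning is meant to catch.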

Gabriel is a Machine Learning Engineer focused on NLP. Having moved from academia to industry, he now works at Argilla, where he has contributed to the backend of Argilla and to the development and design of distilabel, a library for generating synthetic data using LLMs.

Hi there 👋

From failing to study medicine ➡️ BSc industrial engineer ➡️ MSc computer scientist.
Life can be strange, so better enjoy it.
I'm sure I do by: 👨🏽‍🍳 Cooking, 👨🏽‍💻 Coding, 🏆 Committing.