Leveraging LLMs to build supervised datasets suitable for smaller models
2024-09-25, Louis Armand 1 - Est

For some natural language processing (NLP) tasks, depending on your production constraints, a simpler custom model can be a good contender against off-the-shelf large language models (LLMs), as long as you have enough high-quality data to build it. The stumbling block: how do you obtain such data? Going over some practical cases, we will see how we can leverage LLMs during this phase of an NLP project. How can they help us select the data to work on, or (pre-)annotate it? Which model is suitable for which task? What are the common pitfalls, and where should you put your efforts and focus?


Lately, large language models (LLMs) have seemed to work wonders for just about any natural language processing (NLP) task. But for many reasons, one may be unwilling or unable to use such models in production. That is not a problem if the task can be tackled fairly well by a different kind of model, as long as one has enough high-quality training and testing data.

But how do you obtain said data? This is where LLMs come back into play. We will see how language models can help us create good datasets for different NLP tasks (classification, named entity recognition) at a fraction of the cost, time, and pain this would normally require. From data selection to data annotation, we will go over some practical cases. We will try to answer a few questions along the way: which model for which task, how much data do I need, and is this the ultimate dream or do you still need to put in a bit of effort? (Spoiler alert: you do.)
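To make the pre-annotation idea concrete, here is a minimal sketch of LLM-assisted labeling for a text classification dataset. The label set, prompt, and model name are illustrative assumptions, not the actual setup presented in the talk.

    # Illustrative sketch: using an LLM to pre-annotate texts for a
    # classification dataset. Labels, prompt, and model are assumptions.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    LABELS = ["job_offer", "resume", "training_course"]  # hypothetical label set

    def preannotate(text: str) -> str:
        """Ask the LLM for a single label; a human then reviews the result."""
        prompt = (
            "Classify the following text into exactly one of these categories: "
            f"{', '.join(LABELS)}. Answer with the category name only.\n\n"
            f"Text: {text}"
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model; any capable chat model works
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # deterministic output helps annotation consistency
        )
        label = response.choices[0].message.content.strip()
        # Guard against answers outside the label set; flag them for review.
        return label if label in LABELS else "NEEDS_REVIEW"

    texts = ["Senior Python developer wanted in Rennes, permanent contract."]
    dataset = [(t, preannotate(t)) for t in texts]
    print(dataset)

The resulting labels are only a starting point: reviewing a sample by hand, and flagging any answer that falls outside the expected label set, is what keeps such a dataset usable for training a smaller model.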

Justine leads the data science team at HelloWork, a digital provider of employment, recruitment, and training solutions. She has spent the last 10+ years enjoying machine learning, Python, and other data science fun stuff in various fields. Her current work includes a good deal of natural language processing.

Senior data scientist at HelloWork