PyConDE & PyData Berlin 2024

Is GenAI All You Need to Classify Text? Some Learnings from the Trenches
2024-04-24 , A1

In recent times, GenAI has sparked fervent excitement, sometimes touted as the panacea for all natural language processing (NLP) tasks. This presentation explores a practical text classification scenario at Malt, highlighting the hurdles encountered when employing GenAI (latency, environmental impact, and budgetary constraints). To overcome these obstacles, a smaller, dedicated model emerged as a viable solution. We'll delve into the construction and optimization (quantization, graph optimization) of this multilingual model. Finally, we'll see how GenAI's unparalleled zero-shot capabilities enable its continuous adaptation.


In recent times, GenAI has sparked fervent excitement, sometimes touted as the panacea for all natural language processing (NLP) tasks. This presentation explores a practical text classification scenario at Malt, first highlighting the hurdles encountered when employing GenAI (latency, environmental impact, and budgetary constraints).

In a second part, we’ll cover how we overcame these obstacles by building a small dedicated model on top of a pre-trained SentenceBERT [1], a model trained for semantic similarity. We'll explain how training a classification network on top of it preserves the original language alignment [2], enabling multilingual generalization.
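To make the idea concrete, here is a minimal sketch of training a classification head on top of frozen sentence embeddings. The embedding dimension and the two-class setup are illustrative assumptions; in practice the vectors would come from a frozen multilingual SentenceBERT encoder (e.g. via the sentence-transformers library), and random vectors stand in for them here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 384  # typical embedding size of a MiniLM-based SentenceBERT

# Two synthetic classes, roughly separated along every axis; in the real
# setting these would be sentence embeddings of labeled texts.
X_pos = rng.normal(loc=+1.0, size=(100, dim))
X_neg = rng.normal(loc=-1.0, size=(100, dim))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 100 + [0] * 100)

# The classification head: a simple linear classifier trained on top of
# the frozen embeddings. Because the encoder itself is never fine-tuned,
# the cross-lingual alignment of the embedding space is preserved, so a
# head trained on one language transfers to the others.
clf = LogisticRegression(max_iter=1000).fit(X, y)
train_acc = clf.score(X, y)
print(train_acc)
```

The key design choice is freezing the encoder: only the lightweight head is trained, which is cheap and keeps the multilingual geometry of the embedding space intact.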

Next, we'll unveil the secret to unlocking even more efficiency: quantization and graph optimization techniques, thanks to the ONNX ecosystem [3]. These optimizations further reduce the latency and resource consumption of the dedicated model, enabling it to be deployed with just a CPU.
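As a rough illustration of what weight quantization buys, here is a NumPy sketch of symmetric int8 quantization. This is a simplified stand-in for what tooling such as ONNX Runtime performs on the exported graph, not the actual implementation; the matrix shape is an arbitrary example.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: map float weights to [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(384, 128)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage is 4x smaller than float32, at the cost of a bounded
# rounding error of at most half a quantization step per weight.
print(q.nbytes, w.nbytes)
max_err = float(np.abs(w - w_hat).max())
print(max_err)
```

The 4x size reduction (and the cheaper integer arithmetic it enables at inference time) is what makes CPU-only deployment of the dedicated model practical.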

Finally, we’ll see that GenAI still plays a relevant role in our text classification journey. Its unparalleled zero-shot capabilities allow us to continuously adapt our dedicated model, ensuring it remains relevant amidst an ever-changing product.
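One way to use those zero-shot capabilities is to have an LLM label fresh examples that the dedicated model can later be retrained on. The sketch below only builds the classification prompt; the category names and the example message are made up for illustration, and the actual call to a GenAI endpoint is deliberately left out since it depends on the provider.

```python
# Example label set -- an assumption, not Malt's actual taxonomy.
LABELS = ["project brief", "question", "spam", "other"]

def build_prompt(text: str, labels: list[str]) -> str:
    """Zero-shot classification prompt: ask the LLM to pick one label."""
    return (
        "Classify the following message into exactly one of these "
        f"categories: {', '.join(labels)}.\n"
        f"Message: {text}\n"
        "Answer with the category name only."
    )

prompt = build_prompt("Looking for a freelance data engineer.", LABELS)
print(prompt)
```

The LLM's answers then serve as labels for new or drifting categories, so the small model can be periodically retrained without a manual annotation campaign.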

[1] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks.
[2] Reimers, N., & Gurevych, I. (2020). Making monolingual sentence embeddings multilingual using knowledge distillation.
[3] https://onnx.ai/onnx/


Expected audience expertise: Python:

Novice

Expected audience expertise: Domain:

Intermediate

Abstract as a tweet (X) or toot (Mastodon):

GenAI is sometimes touted as the panacea for all natural language processing (NLP) tasks. This presentation explores a practical text classification scenario at Malt, highlighting the practical hurdles encountered when employing GenAI and how we overcame these obstacles.

Marc Palyart is the Head of Data Science at Malt, the freelancer marketplace, where he leads the search and matching team. With over a decade of data-wizardry under his belt, he's ventured into the depths of academia and scaled the heights of industry where he's had the pleasure of collaborating with some truly remarkable people.

Kateryna is a Data Scientist at Malt, the freelancer marketplace, where she works in the search and matching team. She has a background in bioinformatics and is passionate about beautiful code.