PyConDE & PyData Berlin 2024

How to Do Monolingual, Multilingual, and Cross-lingual Text Classification in April 2024
2024-04-23, B07-B08

In 2023, the field of NLP was stirred up once again -- the appearance of powerful closed- and open-source LLMs opened new possibilities for text processing. However, many questions about these models' usability for typical NLP tasks are still open. One of them is quite simple -- if we want a classification model for some task, can we rely on LLMs, or is it still better to fine-tune our own model? It might be easy enough to obtain a classifier for English, but what if my target language is not so resource-rich? This presentation describes the main "recipes" for obtaining the best text classifier depending on the language and data availability.


We will answer three main questions:

  1. If I want a text classifier for English texts, which is better -- fine-tuning a model or prompting an LLM? And if fine-tuning, which model should I choose? (See the prompting sketch after this list.)

  2. If my data is not in English, i.e. in a less resource-rich language, what should I do? Can I utilize LLMs? Do I need to obtain data somehow? Or can I somehow transfer knowledge from existing English data?

  3. If I want a multilingual model covering several languages, again, what is the choice -- LLMs or my own model? And which model then?
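To make the first question concrete, here is a minimal sketch of the prompting option, assuming the OpenAI Python client. The model name, prompt wording, and label set are illustrative assumptions, not the exact setup evaluated in the talk.

```python
# Minimal sketch of the "prompt an LLM" option for binary toxicity
# classification. Model name, prompt, and labels are illustrative
# assumptions, not the setup evaluated in the talk.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Classify the following text as 'toxic' or 'non-toxic'. "
    "Answer with a single word.\n\nText: {text}"
)

def classify_with_llm(text: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
        temperature=0,  # deterministic answers suit classification
    )
    return response.choices[0].message.content.strip().lower()

print(classify_with_llm("Have a wonderful day!"))  # expected: non-toxic
```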

The findings and comparisons will be illustrated on three tasks -- toxic speech, formal speech, and fluent speech detection -- for two languages -- English (as a resource-rich language) and Ukrainian (as a low-resource language in terms of data availability). We will present tests of closed- and open-source LLMs together with fine-tuned open-source models like BERT and RoBERTa.
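For comparison, here is a minimal sketch of the fine-tuning option with Hugging Face transformers. The CSV file names and hyperparameters are placeholder assumptions, and the data is expected to have "text" and "label" columns; this is not the exact training configuration from the talk.

```python
# Minimal sketch of fine-tuning an encoder for binary toxicity
# classification; file names and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# For a language like Ukrainian, swapping in a multilingual encoder such as
# xlm-roberta-base and training on English data is one common route to
# cross-lingual transfer (question 2).
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="toxicity-clf",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,  # enables dynamic padding of batches
)
trainer.train()
```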


Expected audience expertise: Domain:

Intermediate

Expected audience expertise: Python:

Intermediate

Abstract as a tweet (X) or toot (Mastodon):

If I want a text classifier in 2024, what should I choose -- an LLM or a pre-LLM-era classifier? Is the answer the same for English and other languages? We will provide a recipe for finding your classifier depending on the target language and data availability.

Hi, I'm Daryna 👋🇺🇦 I am a postdoctoral researcher at the Social Computing Research Group at the Technical University of Munich 🇩🇪. Before that, I obtained my PhD at the Skolkovo Institute of Science and Technology under the supervision of Alexander Panchenko, with the thesis "Method for Fighting Harmful Multilingual Textual Content" 📜. Currently, I continue to follow this research vector, participating in an eXplainable AI (XAI) project and working on multilingual NLP, developing models for the Ukrainian language.