2024-04-05, Room 228
A presentation about how we (a few local NLP enthusiasts) trained a language Transformer to generate meaningful text in Lithuanian. Everything was based on volunteer work with a strong R&D flavor.
In this presentation I will cover what data we used to train the model and what results we got, and I will also present other initiatives we are driving in the NLP field. I will try to make the presentation both technical and interactive.
At this point it is difficult to commit to an exact format for the presentation. There are two main topics (parts) that I would like to present:
- Part 1. A sequence-to-sequence Transformer that can generate text in Lithuanian (locally). We trained this model in two modes: 1) generating the headline of an article from its text, and 2) generating text from a headline. I will cover both modes during the presentation. This is not an LLM; it is a Transformer that has learned the rules of the language and can generate sequences in Lithuanian.
- Part 2. A Lithuanian character predictor. Lithuanian has several letters with diacritics (ą, č, ę, ė, į, š, ų, ū, ž), and we often skip them when typing and rely on autocorrect to fix this. We trained a model that restores diacritics in text, e.g., it corrects suris to sūris. It is not an autocorrect tool but a lighter and more sophisticated way to turn 'lisping' text into correct text. It is 99% accurate and can be used, e.g., in chatbots where people type without diacritics.
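To make the diacritic restoration task in Part 2 concrete, here is a minimal Python sketch of a word-lookup baseline. It is not the trained model described above: it only shows how stripping diacritics from correct text yields (plain, correct) pairs, and how the most frequent correct form can be looked up for each plain word. The function names (`strip_diacritics`, `build_lookup`, `restore`) and the tiny word list are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

# Lithuanian letters with diacritics mapped to their plain ASCII counterparts.
STRIP_MAP = str.maketrans("ąčęėįšųūžĄČĘĖĮŠŲŪŽ", "aceeisuuzACEEISUUZ")


def strip_diacritics(text: str) -> str:
    """Return the 'lisping' form of the text, with diacritics removed."""
    return text.translate(STRIP_MAP)


def build_lookup(corpus_words):
    """Map each plain word to its most frequent correctly spelled form."""
    counts = defaultdict(Counter)
    for word in corpus_words:
        counts[strip_diacritics(word)][word] += 1
    return {plain: forms.most_common(1)[0][0] for plain, forms in counts.items()}


def restore(text: str, lookup: dict) -> str:
    """Restore diacritics word by word; unknown words are left unchanged."""
    return " ".join(lookup.get(word, word) for word in text.split())


# Hypothetical toy corpus of correctly spelled words.
corpus = ["sūris", "sūris", "šuo", "ačiū"]
lookup = build_lookup(corpus)
print(restore("suris aciu", lookup))
```

A lookup like this cannot choose between two valid words that share the same plain form, because it ignores the surrounding context. A model that reads the whole sentence can use that context to pick the right one.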
I would like to present these two topics in 25 minutes. The talk will serve as an introduction showing that a few local NLP enthusiasts are interested in our language from an ML perspective.
I'm a Senior Data Scientist at IBM Lithuania with a PhD in Technology Sciences. In recent years my main focus areas have been GenAI with Large Language Models, Natural Language Processing/Understanding, computer vision, and MLOps. I also hold two AWS certifications, in Machine Learning and DevOps.