Why you should (not) train your own BERT model for different languages or domains

Language models like BERT can capture general language knowledge and transfer it to new data and tasks. However, applying a pre-trained BERT to non-English text has limitations. Is training from scratch a good (and feasible) way to overcome them?


Large pre-trained language models play an important role in recent NLP research. They can learn general language features from unlabeled data and then be easily fine-tuned for any supervised NLP task, e.g. classification or named entity recognition. BERT (Bidirectional Encoder Representations from Transformers) is one example of such a language model, and its pre-trained weights are available for everyone to use.
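
To make the fine-tuning idea concrete, here is a minimal sketch of adapting a pre-trained BERT for text classification with the Hugging Face transformers library. The example texts, labels and hyperparameters are placeholders, not part of the experiment described later:

```python
# Minimal sketch: fine-tuning a pre-trained BERT for binary text classification.
# The texts and labels below are toy placeholders for your own labeled data.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["The movie was great!", "A complete waste of time."]
labels = torch.tensor([1, 0])

# Tokenize the texts into a padded batch of input tensors
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# One gradient step: the model returns the classification loss when labels are given
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
```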

BERT has been shown to successfully transfer knowledge from pre-training to many different NLP benchmark tasks. But how well does this transfer work for non-English or domain-specific texts?

We will have a look at how exactly input text is fed to and processed by BERT, so prior knowledge of the BERT architecture is not necessary; we do, however, assume some experience with deep learning and NLP. This introduction gives the background needed to discuss the limitations you will face when fine-tuning a pre-trained BERT on your data, e.g. those related to vocabulary and input text length.
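
A small illustration of both limitations, assuming the standard English bert-base-uncased tokenizer; the example sentence is made up and only meant to show how words outside the vocabulary are split into subword pieces:

```python
# Sketch: how BERT's WordPiece tokenizer handles text that its (English)
# vocabulary covers poorly, and the fixed maximum input length.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A domain-specific term is broken into several subword pieces, each of which
# counts towards the maximum input length.
print(tokenizer.tokenize("The patient received acetylsalicylic acid."))

# Maximum sequence length the pre-trained model accepts (512 tokens for BERT base)
print(tokenizer.model_max_length)
```

The more pieces a word is split into, the less room is left for the rest of the text, which is exactly where non-English or domain-specific inputs tend to suffer.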

An alternative to simply fine-tuning a pre-trained model is to train your own version of BERT from scratch, thereby overcoming some of the limitations discussed. The hope of improved performance is countered by the time and hardware requirements that training from scratch brings with it. I was curious enough to give it a try and want to share the findings from this experiment. We will go over the achieved improvements and the actual effort, so you can decide whether you should train your own BERT.
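
As a taste of what training from scratch involves, the usual first step is building a new WordPiece vocabulary from your own corpus. A minimal sketch with the Hugging Face tokenizers library is shown below; the file name, vocabulary size and output directory are placeholder assumptions, not the settings used in the experiment:

```python
# Sketch: learning a new WordPiece vocabulary from your own unlabeled corpus,
# the first step towards pre-training a BERT from scratch.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=True)

# "my_corpus.txt" stands in for your own collection of raw text files
tokenizer.train(files=["my_corpus.txt"], vocab_size=30_000, min_frequency=2)

# Save the vocabulary so it can be reused for pre-training and later fine-tuning
tokenizer.save_model("my-bert-tokenizer")
```

With a vocabulary tailored to your language or domain, words that the English tokenizer would shred into many pieces can be kept as single tokens, which is one of the main motivations for going down this road at all.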


Domains:

Artificial Intelligence, Deep Learning, Data Science, Natural Language Processing, Machine Learning, Science

Domain Expertise:

expert

Python Skill Level:

basic

Abstract as a tweet:

Language models like BERT can capture general language knowledge and transfer it to new data and tasks. However, applying a pre-trained BERT to non-English text has limitations. Is training from scratch a good (and feasible) way to overcome them?