papluca/xlm-roberta-base-language-detection · How can I add some new languages to your model?

20 days ago

I need a few more languages, such as Korean or Malay.
How can I add them?

Owner 9 days ago

Hi,

This model is a fine-tuned version of xlm-roberta-base and language identification is tackled as a multi-class classification problem. I hence believe there are two options for adding new languages: (i) retrain the model from scratch (i.e. starting from FacebookAI/xlm-roberta-base) or (ii) fine-tune the existing model (i.e. papluca/xlm-roberta-base-language-detection). In any case, the time-consuming part is data collection, i.e. getting a reasonable number of text samples in the languages you are interested in.
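
For option (ii), one possible way to keep what the existing model has already learned is to grow its final linear layer so that the old class rows are copied over and only the rows for the new languages start from a fresh initialization. The snippet below is a minimal sketch assuming PyTorch; `extend_linear` is a hypothetical helper written for illustration, not something taken from the model card's training code.

```python
import torch
from torch import nn


def extend_linear(old: nn.Linear, num_new: int) -> nn.Linear:
    """Return a copy of `old` with `num_new` extra output units.

    Rows for the existing classes are copied; the extra rows keep
    the default nn.Linear initialization.
    """
    has_bias = old.bias is not None
    new = nn.Linear(old.in_features, old.out_features + num_new, bias=has_bias)
    with torch.no_grad():
        new.weight[: old.out_features].copy_(old.weight)
        if has_bias:
            new.bias[: old.out_features].copy_(old.bias)
    return new


# Toy check with a small layer (3 existing classes, 2 new ones).
old_head = nn.Linear(4, 3)
new_head = extend_linear(old_head, num_new=2)

# With the real model, the final layer of an XLM-RoBERTa sequence
# classifier in transformers is `model.classifier.out_proj`, so the
# swap would look roughly like this (not executed here):
#   model.classifier.out_proj = extend_linear(model.classifier.out_proj, 2)
#   model.num_labels = model.config.num_labels = new_head.out_features
```

After such a swap, the `id2label` and `label2id` entries in the model config would also need the new languages, and the model should still be fine-tuned on a mix of old and new data so that the existing classes are not forgotten.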

In the absence of clever tricks to avoid full retraining, I'd go for option (i), which is computationally more expensive than option (ii), but will likely lead to a more accurate model. This involves adding the new language data along with the original dataset, which is available here on HF, and extending the last linear layer to account for the new classes. You can find the dataset information, as well as the complete training code, on the model card.
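
As a rough illustration of option (i), the sketch below builds the extended label mapping and loads the base checkpoint with a classification head sized for all classes. The helper names (`build_label_maps`, `load_base_classifier`) are hypothetical, the label list is a shortened placeholder rather than the model's real label set, and `ko` / `ms` are the ISO 639-1 codes for Korean and Malay. The actual training loop is on the model card and is not reproduced here.

```python
from typing import Dict, List, Tuple


def build_label_maps(
    original: List[str], new: List[str]
) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Append new language codes after the original ones, skipping duplicates."""
    labels = list(original)
    for lang in new:
        if lang not in labels:
            labels.append(lang)
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


def load_base_classifier(id2label: Dict[int, str], label2id: Dict[str, int]):
    """Load xlm-roberta-base with a fresh head covering every class.

    Imported lazily so the label-mapping part runs without transformers.
    """
    from transformers import AutoModelForSequenceClassification

    return AutoModelForSequenceClassification.from_pretrained(
        "FacebookAI/xlm-roberta-base",
        num_labels=len(id2label),
        id2label=id2label,
        label2id=label2id,
    )


# Placeholder subset of the original labels (assumption, for illustration).
original_labels = ["en", "fr", "de"]
new_labels = ["ko", "ms"]

id2label, label2id = build_label_maps(original_labels, new_labels)
# model = load_base_classifier(id2label, label2id)  # downloads the checkpoint
```

The training set would then be the original dataset concatenated with the new Korean and Malay samples, with every label mapped through `label2id`.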

Hope this helps you!

papluca

Owner 9 days ago

P.S.: When this model was created, the datasets I found included “only” 21 languages. However, I am committed to improving its accuracy and expanding its coverage to additional languages. If you know of high-quality data sources, I would greatly appreciate your recommendations! 😉