Ukrainian flair embeddings (forward, large)

Trained for 10 epochs on the texts from ubertext2.0 and corpus of Ukrainian scraped texts from Stefan Schweter (54GB in total).

This is the forward version of the embeddings. You can find the backward version here

The characters dictionary used for training is in flair_dictionary.pkl file

The model params are:

    is_forward_lm=True,
    hidden_size=2048,
    sequence_length=250,
    mini_batch_size=1024,
    max_epochs=30

For smaller size flair embeddings of the Ukrainian language please check uk-forward

For more information on flair embeddings, see the article or the paper below:

@inproceedings{akbik2018coling,
  title={Contextual String Embeddings for Sequence Labeling},
  author={Akbik, Alan and Blythe, Duncan and Vollgraf, Roland},
  booktitle = {{COLING} 2018, 27th International Conference on Computational Linguistics},
  pages     = {1638--1649},
  year      = {2018}
}

For more information on UberText 2.0 please see:

@inproceedings{chaplynskyi-2023-introducing,
    title = "Introducing {U}ber{T}ext 2.0: A Corpus of {M}odern {U}krainian at Scale",
    author = "Chaplynskyi, Dmytro",
    booktitle = "Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP)",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.unlp-1.1",
    pages = "1--10",
    abstract = "This paper addresses the need for massive corpora for a low-resource language and presents the publicly available UberText 2.0 corpus for the Ukrainian language and discusses the methodology of its construction. While the collection and maintenance of such a corpus is more of a data extraction and data engineering task, the corpus itself provides a solid foundation for natural language processing tasks. It can enable the creation of contemporary language models and word embeddings, resulting in a better performance of numerous downstream tasks for the Ukrainian language. In addition, the paper and software developed can be used as a guidance and model solution for other low-resource languages. The resulting corpus is available for download on the project page. It has 3.274 billion tokens, consists of 8.59 million texts and takes up 32 gigabytes of space.",
}