Canine for Language Identification

Canine model (SebOchs/canine-c-lang-id) trained on the WiLI-2018 dataset to identify the language of a text.

Preprocessing

10% of the training data held out as a validation set (stratified sampling)
max sequence length: 512
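
The stratified hold-out described above can be sketched in plain Python. This is only an illustration of the idea: the function name stratified_split, the seed, and the toy labels are assumptions, and the tooling actually used to build the validation set is not stated here.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.1, seed=0):
    """Return (train_idx, val_idx) with the same share of every label in both parts."""
    # Group sample indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Hold out the same fraction of each label.
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy data (hypothetical, not WiLI-2018 itself): 3 languages, 20 samples each.
labels = ["eng"] * 20 + ["deu"] * 20 + ["fra"] * 20
train_idx, val_idx = stratified_split(labels)
```

Because the split is done per label, every language keeps the same proportion in the training and validation parts. Libraries such as scikit-learn offer the same behaviour through train_test_split with its stratify argument.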

Hyperparameters

epochs: 4
learning rate: 3e-5
batch size: 16
gradient accumulation steps: 4
optimizer: AdamW with default settings
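
As a rough worked example of what these settings imply, the arithmetic below derives the effective batch size and the number of optimizer updates. It assumes that the batch size of 16 is the per-step size (so accumulation multiplies it), that the WiLI-2018 training split holds 117,500 paragraphs with 10% held out for validation, and that a final partial accumulation group still triggers an update. None of these details are confirmed above.

```python
import math

per_step_batch_size = 16    # "batch size" from the list above (assumed per step)
gradient_accumulation = 4   # micro-batches accumulated before each optimizer update
epochs = 4

# Gradients from 4 micro-batches of 16 are combined into one update.
effective_batch_size = per_step_batch_size * gradient_accumulation

# Assumption: 117,500 training paragraphs, 10% of them held out for validation.
train_examples = 117_500 - 117_500 // 10

micro_batches_per_epoch = math.ceil(train_examples / per_step_batch_size)
updates_per_epoch = math.ceil(micro_batches_per_epoch / gradient_accumulation)
total_updates = updates_per_epoch * epochs

print(effective_batch_size, updates_per_epoch, total_updates)
```

Under these assumptions each optimizer update sees 64 examples, which is the figure to compare against when reproducing the run with a different per-step batch size.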

Test Results

Accuracy: 94.92%
Macro F1-score: 94.91%
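
For readers unfamiliar with the two metrics, here is a minimal pure-Python sketch of how accuracy and macro F1 are defined. The function names and the toy labels are illustrative assumptions; this is not the evaluation script that produced the numbers above.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[t] += 1  # true label t was missed
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (hypothetical labels): one English sample is mistaken for German.
y_true = ["eng", "eng", "deu", "fra"]
y_pred = ["eng", "deu", "deu", "fra"]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Macro averaging gives every language the same weight regardless of how many samples it has. On a class-balanced test set the two metrics tend to land close together.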

Inference

Function that builds a dictionary mapping each label id to the English name of its language:

import datasets
import pycountry


def int_to_lang():
    """Return a dict mapping each WiLI-2018 label id to an English language name."""
    dataset = datasets.load_dataset('wili_2018')
    # names for languages not in ISO 639-3, taken from Wikipedia
    non_iso_languages = {'roa-tara': 'Tarantino', 'zh-yue': 'Cantonese', 'map-bms': 'Banyumasan',
                         'nds-nl': 'Dutch Low Saxon', 'be-tarask': 'Belarusian'}
    # create dictionary from dataset labels to language names
    lab_to_lang = {}
    for i, lang in enumerate(dataset['train'].features['label'].names):
        # pycountry returns None when the code is not a known ISO 639-3 code
        full_lang = pycountry.languages.get(alpha_3=lang)
        if full_lang:
            lab_to_lang[i] = full_lang.name
        else:
            lab_to_lang[i] = non_iso_languages[lang]
    return lab_to_lang
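
One possible way to combine that dictionary with the model is sketched below, using the text-classification pipeline from the transformers library. The helper names (load_classifier, label_to_id, predict_language) are hypothetical. The sketch assumes that the checkpoint loads through the pipeline and that it reports labels in the default LABEL_<id> form; if the model config defines its own label names, label_to_id needs adjusting. The max_length of 512 mirrors the preprocessing setting.

```python
def load_classifier(model_id="SebOchs/canine-c-lang-id"):
    """Build a text-classification pipeline (downloads the model on first use)."""
    # Imported lazily because transformers is a heavy dependency.
    from transformers import pipeline
    return pipeline("text-classification", model=model_id)


def label_to_id(label):
    """Parse a label such as 'LABEL_17' into 17 (assumed default naming)."""
    return int(label.rsplit("_", 1)[-1])


def predict_language(text, classifier, lab_to_lang):
    """Return (language name, score) for the top prediction of one text."""
    # Truncate long inputs to the sequence length used during training.
    result = classifier(text, truncation=True, max_length=512)[0]
    return lab_to_lang[label_to_id(result["label"])], result["score"]
```

A call would then look like predict_language("Das ist ein kurzer Satz.", load_classifier(), int_to_lang()). Passing the classifier in as an argument means the model is loaded only once, and a stub can stand in for it during testing.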

Credit to

@article{clark-etal-2022-canine,
    title = "Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation",
    author = "Clark, Jonathan H.  and
      Garrette, Dan  and
      Turc, Iulia  and
      Wieting, John",
    journal = "Transactions of the Association for Computational Linguistics",
    volume = "10",
    year = "2022",
    address = "Cambridge, MA",
    publisher = "MIT Press",
    url = "https://aclanthology.org/2022.tacl-1.5",
    doi = "10.1162/tacl_a_00448",
    pages = "73--91",
    abstract = "Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model{'}s ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences{---}without explicit tokenization or vocabulary{---}and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. Canine outperforms a comparable mBert model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters.",
}
@dataset{thoma_martin_2018_841984,
  author       = {Thoma, Martin},
  title        = {{WiLI-2018 - Wikipedia Language Identification 
                   database}},
  month        = jan,
  year         = 2018,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.841984},
  url          = {https://doi.org/10.5281/zenodo.841984}
}

