B-GPT_el_en_sequential

This is a bilingual GPT-2-style model. For the first half of training, the model was trained only on Greek data; for the second half, it was trained only on English data. By the end of training, 50% of the training data seen by the model is Greek and 50% is English. The tokenizer was trained on the same overall proportions of data that the language model has seen at the final step.

Model details:

All models are trained with a [CLS] (same as [BOS]) token prepended, and a [SEP] (same as [EOS]) token separating sequences. For best results, make sure that [CLS] is prepended to your input sequence (see sample usage linked above)! Details for this model specifically:

Architecture: gpt2
Parameters: 124770816
Maximum sequence length: 512 tokens
Training tokens: 12B
Vocabulary size: 50000
Compute cost: ~9 NVIDIA A6000 GPU hours
CO2 Emission: 1.17 kg

Training dataset: OSCAR 2021/09

Checkpoints are taken at training steps: 0, 10000, 20000, 30000, 40000, 50000, 64000, 64010, 64020, 64030, 64040, 64050, 64060, 64070, 64080, 64090, 64100, 64110, 64120, 64130, 64140, 64150, 64160, 64170, 64180, 64190, 64200, 64300, 64400, 64500, 64600, 64700, 64800, 64900, 65000, 66000, 67000, 68000, 69000, 70000, 80000, 90000, 100000, 110000, 120000, 128000.
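
The schedule above is densest just after step 64000, the midpoint of the 128000-step run. Given the half-Greek, half-English training order, that is presumably where the training data switches languages. The following is a minimal sketch that rebuilds the checkpoint list from its regular segments and labels each step by phase. The names `CHECKPOINTS`, `SWITCH_STEP`, and `phase` are illustrative and not part of the model repository, and the exact switch step is an assumption inferred from the description, not a documented value.

```python
# Rebuild the checkpoint schedule listed above from its regular segments.
CHECKPOINTS = (
    list(range(0, 50001, 10000))         # 0 .. 50000, every 10000 steps
    + list(range(64000, 64201, 10))      # 64000 .. 64200, every 10 steps
    + list(range(64300, 65001, 100))     # 64300 .. 65000, every 100 steps
    + list(range(66000, 70001, 1000))    # 66000 .. 70000, every 1000 steps
    + list(range(80000, 120001, 10000))  # 80000 .. 120000, every 10000 steps
    + [128000]                           # final checkpoint
)

# Assumption: the Greek-to-English switch happens at the midpoint of training.
SWITCH_STEP = 128000 // 2


def phase(step):
    """Label a checkpoint by the data seen so far (assumed switch point)."""
    return "greek-only" if step <= SWITCH_STEP else "greek+english"


# Revision names are the checkpoint steps written as strings.
revisions = [str(step) for step in CHECKPOINTS]
greek_only = [s for s in CHECKPOINTS if phase(s) == "greek-only"]
```

Under this assumption, the checkpoints from 64010 onward trace how the model changes during its first exposure to English.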

Use This Model

Load the model:

Note: if you do not specify a revision, it will load the final checkpoint of the model. See above for the list of checkpoints. The checkpoint step is the name of the revision.

from transformers import AutoTokenizer, AutoModel

# The revision name is the checkpoint step; "128000" is the final checkpoint.
tokenizer = AutoTokenizer.from_pretrained("catherinearnett/B-GPT_el_en_sequential")
model = AutoModel.from_pretrained("catherinearnett/B-GPT_el_en_sequential", revision="128000")

Text Generation:

from transformers import pipeline

# No revision is given, so this loads the final checkpoint.
pipe = pipeline("text-generation", model="catherinearnett/B-GPT_el_en_sequential")

pipe("I am a")
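
The model details above recommend prepending [CLS] to the input. This card does not state whether the tokenizer or the pipeline does so automatically, so the following is a minimal sketch of one way to enforce it at the token-id level. The helper `ensure_cls` is hypothetical, and the commented usage assumes that `[CLS]` appears in the tokenizer vocabulary under exactly that string.

```python
def ensure_cls(input_ids, cls_id):
    """Return a copy of input_ids that begins with cls_id."""
    ids = list(input_ids)
    if ids and ids[0] == cls_id:
        # Already starts with [CLS]; leave the sequence unchanged.
        return ids
    return [cls_id] + ids


# Usage sketch (downloads the tokenizer, so it is left commented out):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("catherinearnett/B-GPT_el_en_sequential")
# cls_id = tokenizer.convert_tokens_to_ids("[CLS]")  # assumes "[CLS]" is in the vocabulary
# ids = ensure_cls(tokenizer("I am a")["input_ids"], cls_id)
```

Working with ids and not with the raw string avoids relying on how the tokenizer parses a literal "[CLS]" inside the text.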

Citation

If you use this model, please cite:

@article{arnett2025acquisition,
  author = {Catherine Arnett and Tyler A. Chang and James A. Michaelov and Benjamin K. Bergen},
  title = {On the Acquisition of Shared Grammatical Representations in Bilingual Language Models},
  journal = {arXiv preprint arXiv:2503.03962},
  year = {2025},
  url = {https://arxiv.org/abs/2503.03962}
}
