BLOOM-CLP German (6.4B parameters)
This is a monolingual German language model trained using the CLP-Transfer method based on BLOOM-7b1.
You can try out the model at European Language Grid.
UPDATE: We recently released an instruction-tuned version of this model: malteos/bloom-6b4-clp-german-oasst-v0.1.
How to use
You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:
>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='malteos/bloom-6b4-clp-german')
>>> set_seed(42)
>>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=3)
[{'generated_text': "Hello, I'm a language model, a language for thinking, a language for expressing thoughts."},
{'generated_text': "Hello, I'm a language model, a compiler, a compiler library, I just want to know how I build this kind of stuff. I don"},
{'generated_text': "Hello, I'm a language model, and also have more than a few of your own, but I understand that they're going to need some help"}]
Training dataset
- ca. 50B German tokens
- Web-crawled content from the German subset of OSCAR v22.01 (excluding content tagged as header, footer, noisy, or adult)
- Web-crawled content from the GC4 Corpus (including only the head and middle parts)
- Both Web-crawled datasets are deduplicated with Google's suffix array implementation
- German court decisions from Open Legal Data
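The card does not say how the suffix array tool was configured. Purely to illustrate the idea behind suffix-array deduplication, the sketch below finds repeated substrings in a single string with a naively built suffix array. It is a toy, not Google's implementation; the function names and the `min_len` threshold are hypothetical.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return suffix start positions sorted by suffix (naive, fine for short strings only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(text: str, a: int, b: int) -> int:
    """Length of the longest common prefix of the suffixes starting at a and b."""
    n = 0
    while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
        n += 1
    return n


def duplicated_spans(text: str, min_len: int) -> list[tuple[int, int, int]]:
    """Find pairs of positions that share a substring of at least min_len characters.

    Suffixes that are adjacent in sorted order share the longest prefixes,
    so only neighbouring entries of the suffix array need to be compared.
    Returns (first_position, second_position, shared_length) tuples.
    """
    sa = build_suffix_array(text)
    spans = []
    for prev, cur in zip(sa, sa[1:]):
        shared = common_prefix_len(text, prev, cur)
        if shared >= min_len:
            spans.append((min(prev, cur), max(prev, cur), shared))
    return spans


if __name__ == "__main__":
    # "abc" occurs at positions 0 and 4.
    print(duplicated_spans("abcXabcY", min_len=3))  # [(0, 4, 3)]
```

A production pipeline would work on tokenized corpora, build the suffix array with a linear-time or external-memory algorithm, and then remove all but one copy of each duplicated span.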
Code
Hardware
- 32xA100-40GB GPUs
- 12.5 days
- Tensorboard logs
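As a back-of-the-envelope figure, the hardware numbers above can be converted into GPU-hours. The sketch assumes that all 32 GPUs were in use for the full 12.5 days, which the card does not state explicitly; the helper name `gpu_hours` is hypothetical.

```python
def gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU-hours, assuming every GPU runs for the whole period."""
    return num_gpus * days * 24.0


if __name__ == "__main__":
    total = gpu_hours(num_gpus=32, days=12.5)
    print(f"{total:.0f} GPU-hours")  # 9600 GPU-hours
```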
Evaluation
[Figure: validation PPL compared to from-scratch training (lower is better)]
Additional evaluations can be found in our paper.
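For readers unfamiliar with the metric, perplexity (PPL) is conventionally the exponential of the mean per-token negative log-likelihood. The following is a minimal sketch of that definition only; it is not the evaluation code used for this model, and the function name and input format are assumptions.

```python
import math
from typing import Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity from per-token negative log-likelihoods (natural log).

    PPL = exp(mean NLL); lower values mean the model assigns
    higher probability to the evaluated text.
    """
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


if __name__ == "__main__":
    # A model that gives every token probability 1/4 has perplexity 4.
    print(perplexity([math.log(4)] * 3))
```

In practice the per-token losses would come from the model's cross-entropy loss on a held-out validation set.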
How to cite
If you are using our code or models, please cite our paper:
@misc{Ostendorff2023clp,
doi = {10.48550/ARXIV.2301.09626},
author = {Ostendorff, Malte and Rehm, Georg},
title = {Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning},
publisher = {arXiv},
year = {2023}
}
License