FlauBERT-Oral models: Using ASR-Generated Text for Spoken Language Modeling

FlauBERT-Oral is a family of French BERT models trained on a very large amount of automatically transcribed speech: the ASR output of 350,000 hours of diverse French TV shows. The models were trained with the FlauBERT software using the same hyperparameters as the flaubert-base-uncased model (12 layers, 12 attention heads, hidden size 768, 137M parameters, uncased).
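
As a rough sanity check on these figures, the sketch below computes the per-head dimension and an approximate parameter count for the encoder stack alone. It assumes a standard BERT-base layout (feed-forward width of 4 x 768 = 3072, biases on every linear layer, two layer norms per layer), which is not stated above; all names are illustrative.

```python
# Rough arithmetic only; the layout below is an assumption, not read from the checkpoint.
N_LAYERS = 12
N_HEADS = 12
HIDDEN = 768
FFN = 4 * HIDDEN  # assumed feed-forward width (3072), as in BERT-base

head_dim = HIDDEN // N_HEADS  # dimension handled by each attention head

# Q, K, V and output projections, each with weights and biases
attention_params = 4 * (HIDDEN * HIDDEN + HIDDEN)
# two linear layers: HIDDEN -> FFN and FFN -> HIDDEN, with biases
ffn_params = (HIDDEN * FFN + FFN) + (FFN * HIDDEN + HIDDEN)
# two layer norms, each with a scale and a shift vector
layer_norm_params = 2 * (2 * HIDDEN)

params_per_layer = attention_params + ffn_params + layer_norm_params
encoder_params = N_LAYERS * params_per_layer

print(f"head dimension: {head_dim}")
print(f"encoder stack: {encoder_params / 1e6:.1f}M parameters")
```

Under these assumptions the encoder layers account for roughly 85M parameters; the rest of the 137M would then come mostly from the embedding tables, whose size depends on the vocabulary.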

Available FlauBERT-Oral models

flaubert-oral-asr : trained from scratch on ASR data, keeping the BPE tokenizer and vocabulary of flaubert-base-uncased
flaubert-oral-asr_nb : trained from scratch on ASR data, with a BPE tokenizer also trained on that corpus
flaubert-oral-mixed : trained from scratch on a mixed corpus of ASR and text data, with a BPE tokenizer also trained on that corpus
flaubert-oral-ft : flaubert-base-uncased fine-tuned for a few epochs on ASR data
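
The four variants can be kept in a small lookup table so that a script switches between them by name. This is a minimal sketch: the `nherve/` namespace is confirmed in this document only for flaubert-oral-asr, and applying it to the other three variants is an assumption to verify on the model hub. `VARIANTS` and `hub_id` are hypothetical helper names.

```python
# Short descriptions taken from the list above.
VARIANTS = {
    "flaubert-oral-asr": "ASR data, tokenizer and vocabulary of flaubert-base-uncased",
    "flaubert-oral-asr_nb": "ASR data, BPE tokenizer trained on the ASR corpus",
    "flaubert-oral-mixed": "mixed ASR and text data, BPE tokenizer trained on that corpus",
    "flaubert-oral-ft": "flaubert-base-uncased fine-tuned on ASR data",
}

# Assumption: every variant is published under the same namespace.
HUB_NAMESPACE = "nherve"


def hub_id(variant: str) -> str:
    """Build the identifier to pass to from_pretrained for a known variant."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown FlauBERT-Oral variant: {variant!r}")
    return f"{HUB_NAMESPACE}/{variant}"


for name, description in VARIANTS.items():
    print(f"{hub_id(name)}: {description}")
```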

Usage for sequence classification

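from transformers import FlaubertForSequenceClassification, FlaubertTokenizer

flaubert_tokenizer = FlaubertTokenizer.from_pretrained("nherve/flaubert-oral-asr")
# num_labels is task-specific: set it to the number of classes in your dataset
flaubert_classif = FlaubertForSequenceClassification.from_pretrained("nherve/flaubert-oral-asr", num_labels=14)
# Pool the token representations by averaging them before the classification layer
flaubert_classif.sequence_summary.summary_type = 'mean'
# Then fine-tune the model on your labeled data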
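
To illustrate what the summary_type setting changes, here is a minimal NumPy sketch of the two pooling strategies. It is not the transformers implementation: it assumes that 'first' keeps the hidden state of the first token and that 'mean' averages over every position of the sequence (padding included), which is worth checking against your installed version. The `summarize` function and the random hidden states are purely illustrative.

```python
import numpy as np


def summarize(hidden_states: np.ndarray, summary_type: str = "first") -> np.ndarray:
    """Reduce (batch, seq_len, hidden) token states to one (batch, hidden) vector."""
    if summary_type == "first":
        # keep only the representation of the first token
        return hidden_states[:, 0]
    if summary_type == "mean":
        # average over all sequence positions
        return hidden_states.mean(axis=1)
    raise ValueError(f"unsupported summary_type: {summary_type!r}")


# Random stand-in for encoder outputs: 2 sequences, 5 tokens, hidden size 768
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 5, 768))

first_vec = summarize(hidden, "first")
mean_vec = summarize(hidden, "mean")
print(first_vec.shape, mean_vec.shape)
```

Mean pooling lets every token contribute to the sequence representation, which can be a reasonable choice for ASR transcripts where no single position carries a reliable summary of the utterance.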

References

If you use FlauBERT-Oral models in a scientific publication, or if you find the resources in this repository useful, please cite the following paper:

@InProceedings{herve2022flaubertoral,
  author    = {Herv\'{e}, Nicolas and Pelloin, Valentin and Favre, Benoit and Dary, Franck and Laurent, Antoine and Meignier, Sylvain and Besacier, Laurent},
  title     = {Using ASR-Generated Text for Spoken Language Modeling},
  booktitle = {Proceedings of "Challenges \& Perspectives in Creating Large Language Models" ACL 2022 Workshop},
  month     = {May},
  year      = {2022}
}