wav2vec2-xls-r-300m-cv8-turkish

Model description

This ASR model is a fine-tuned version of facebook/wav2vec2-xls-r-300m for the Turkish language.

Training and evaluation data

The following datasets were used for finetuning:

Training procedure

To support the datasets above, custom pre-processing and loading steps were performed; the wav2vec2-turkish repo was used for that purpose.

Training hyperparameters

The following hyperparameters were used for finetuning:

  • learning_rate 2.5e-4
  • num_train_epochs 20
  • warmup_steps 500
  • freeze_feature_extractor
  • mask_time_prob 0.1
  • mask_feature_prob 0.1
  • feat_proj_dropout 0.05
  • attention_dropout 0.05
  • final_dropout 0.1
  • activation_dropout 0.05
  • per_device_train_batch_size 8
  • per_device_eval_batch_size 8
  • gradient_accumulation_steps 8
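The batch-size and gradient-accumulation values above combine into an effective batch size per optimizer step. A minimal sketch of that arithmetic follows; the function name `effective_batch_size` is hypothetical, and the number of devices is an assumption, since this card does not state it.

```python
def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Number of samples contributing to one optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# Values from the hyperparameter list above.
# Assumption: a single device (the card does not say how many were used).
print(effective_batch_size(8, 8, num_devices=1))
```

With more devices the result scales linearly, so the value for a single device is a lower bound on what was used in training.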

Framework versions

  • Transformers 4.17.0.dev0
  • Pytorch 1.10.1
  • Datasets 1.17.0
  • Tokenizers 0.10.3

Language Model

An N-gram language model was trained on Turkish Wikipedia articles using KenLM. The ngram-lm-wiki repo was used to generate the ARPA LM and convert it into binary format.

Evaluation Commands

Please install the unicode_tr package before running evaluation. It is used for Turkish text processing.
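To illustrate why Turkish-aware text processing matters, the sketch below shows how default lowercasing mishandles the dotted and dotless I. It is a standalone illustration and does not use the unicode_tr API; the helper name `turkish_lower` is hypothetical.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish casing rules for I and İ."""
    # Map the two Turkish-specific capitals first, then lowercase the rest.
    return text.replace("I", "ı").replace("İ", "i").lower()


# Default lowercasing turns "I" into "i", which is wrong for Turkish.
print("ISPARTA".lower())          # isparta
print(turkish_lower("ISPARTA"))   # ısparta
print(turkish_lower("İSTANBUL"))  # istanbul
```

Normalizing references and predictions the same way before scoring avoids counting such casing differences as recognition errors.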

  1. To evaluate on mozilla-foundation/common_voice_8_0 with split test:
     python eval.py --model_id mpoyraz/wav2vec2-xls-r-300m-cv8-turkish --dataset mozilla-foundation/common_voice_8_0 --config tr --split test
  2. To evaluate on speech-recognition-community-v2/dev_data:
     python eval.py --model_id mpoyraz/wav2vec2-xls-r-300m-cv8-turkish --dataset speech-recognition-community-v2/dev_data --config tr --split validation --chunk_length_s 5.0 --stride_length_s 1.0
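The second command passes --chunk_length_s 5.0 and --stride_length_s 1.0, which split long audio into overlapping windows. The sketch below shows roughly how such windows could be laid out. It is not the code of eval.py or of any library: the function name `chunk_windows`, the 16 kHz sampling rate, and the assumption that the stride applies to both sides of each chunk are all illustrative.

```python
def chunk_windows(num_samples: int,
                  sampling_rate: int = 16_000,
                  chunk_length_s: float = 5.0,
                  stride_length_s: float = 1.0) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of overlapping chunks."""
    chunk_len = int(round(chunk_length_s * sampling_rate))
    stride = int(round(stride_length_s * sampling_rate))
    # Assumption: the stride is applied on the left and right of each chunk,
    # so consecutive chunks advance by the chunk length minus both strides.
    step = chunk_len - 2 * stride
    if step <= 0:
        raise ValueError("chunk length must exceed twice the stride")

    windows = []
    start = 0
    while True:
        end = min(start + chunk_len, num_samples)
        windows.append((start, end))
        if end == num_samples:
            break
        start += step
    return windows


# A 12-second clip at 16 kHz yields windows covering 0-5 s, 3-8 s, 6-11 s, 9-12 s.
for start, end in chunk_windows(12 * 16_000):
    print(start / 16_000, end / 16_000)
```

The overlap gives the model context on both sides of each chunk boundary, so predictions near the edges can be discarded when the chunk outputs are merged.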

Evaluation results:

Dataset                                 WER     CER
Common Voice 8 TR test split            10.61   2.67
Speech Recognition Community dev data   36.46   12.38
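For readers unfamiliar with these metrics, the sketch below computes WER and CER for a single utterance from the Levenshtein edit distance. It assumes the values in the table are percentages and is not the implementation used by eval.py, which may aggregate over the whole dataset; the function names and the example sentences are illustrative.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent; the reference must not be empty."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent; the reference must not be empty."""
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


# Made-up example: one wrong character in a four-word sentence.
print(wer("bugün hava çok güzel", "bugün hava cok güzel"))
print(cer("bugün hava çok güzel", "bugün hava cok güzel"))
```

A single wrong character makes the whole word count as an error, which is why WER is typically much higher than CER for the same output.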