---
library_name: transformers
base_model:
- facebook/wav2vec2-xls-r-300m
tags:
- ASR
- Nepali ASR
- OpenSLR Nepali
- Nepali ASR Wav2Vec2
- XLS-R
model-index:
- name: Wav2Vec2_XLS-R-300m_Nepali_ASR
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: OpenSLR Nepali ASR (SLR54)
      type: iamTangsang/OpenSLR54-Nepali-ASR
    metrics:
    - type: wer
      value: 16.82
      name: Test WER
    - type: cer
      value: 2.72
      name: Test CER
datasets:
- iamTangsang/OpenSLR54-Nepali-ASR
- mozilla-foundation/common_voice_17_0
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
license: mit
language:
- ne
---

# Wav2Vec2_XLS-R-300m_Nepali_ASR

This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co./facebook/wav2vec2-xls-r-300m) on:
- [Large Nepali ASR training data set from OpenSLR (SLR-54)](https://www.openslr.org/54/)
- [Common Voice Corpus 17.0](https://huggingface.co./datasets/mozilla-foundation/common_voice_17_0)

## Model description

The model is the 300-million-parameter Wav2Vec2 XLS-R checkpoint fine-tuned for Nepali automatic speech recognition. The reported results are on the OpenSLR test split:

- WER on OpenSLR: 16.82%
- CER on OpenSLR: 2.72%

## Intended uses & limitations

- Research on Nepali ASR
- Transcription of Nepali audio
- Further fine-tuning

### Limitations

- The model is trained on the OpenSLR Nepali ASR dataset, which upon inspection was found to be quite noisy and inconsistent.
- Due to resource limitations, utterances longer than 5 seconds were filtered out of the dataset during training and evaluation.
- Numerals were filtered out as well.
- The vocabulary does not contain all Nepali characters.
- The model may perform poorly on audio segments longer than 5 seconds, or it needs a mechanism to segment audio into 5-second chunks, which can increase processing time (see the usage sketch below).
- It may struggle with background noise and overlapping speech.
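## How to use

A minimal inference sketch using the `transformers` ASR pipeline. The repository id below is an assumption (the model name under the dataset owner's namespace); replace it with the actual path of this checkpoint. Passing `chunk_length_s`/`stride_length_s` makes the pipeline split longer recordings into overlapping windows close to the 5-second limit discussed above.

```python
from transformers import pipeline

# Assumed repository id; point this at the actual location of the checkpoint.
asr = pipeline(
    "automatic-speech-recognition",
    model="iamTangsang/Wav2Vec2_XLS-R-300m_Nepali_ASR",
)

# Chunk long audio into ~5 s windows; the 1 s strides overlap adjacent chunks
# so that words are not cut off at the window boundaries.
result = asr("nepali_sample.wav", chunk_length_s=5, stride_length_s=(1, 1))
print(result["text"])
```

The pipeline decodes the file with ffmpeg and resamples it to the 16 kHz rate that XLS-R expects, so most common audio formats work directly.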
## Training and evaluation data

### Common Voice v17.0

- This model has been fine-tuned on OpenSLR-54 (Nepali ASR training dataset) and Common Voice Corpus v17.0.
- Initially, the model was trained on [Common Voice v17.0 ne-NP](https://huggingface.co./datasets/mozilla-foundation/common_voice_17_0/viewer/ne-NP), which consists of about 2 hours of voice data, of which about 1 hour has been manually validated.
- Because the dataset is very small, we first combined the `validated` and `other` splits, giving a total of 1,337 utterances.
- We preprocessed the data by removing all punctuation and symbols.
- We then used 80% of the utterances for training and 10% for evaluation.
- We used the `test` split, consisting of 217 utterances, for testing. (Some of it may also have been present in the `train` split.)
- The model was trained for 30 epochs; the WER fluctuated around 37% to 39%.

### OpenSLR Nepali ASR training data

- The model was then further trained on the larger OpenSLR Nepali ASR training dataset, which has about 157,000 utterances.
- First, numerals were removed, as those utterances were inconsistent with their transcriptions.
- Segments longer than 5 seconds were removed because of resource limitations.
- Less frequently used characters were removed to reduce the vocabulary size.
- This left 136,083 utterances in total. The filtered dataset has been uploaded [here](https://huggingface.co./datasets/iamTangsang/OpenSLR54-Nepali-ASR).
- 80% was used for training, 10% for evaluation, and 10% for testing.

## Training procedure

### Training on Common Voice 17.0

The following hyperparameters were used during training:
- learning_rate: 3e-04
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 400
- num_epochs: 30
- mixed_precision_training: Native AMP

### Initial training on OpenSLR-54 for 16 epochs

The following hyperparameters were used:
- learning_rate: 3e-04
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 16
- mixed_precision_training: Native AMP

### Further training on OpenSLR-54 for 3 more epochs

The following hyperparameters were used:
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 700
- num_epochs: 3
- mixed_precision_training: Native AMP

### Framework versions

- Transformers 4.44.2
- PyTorch 2.4.1+cu121
- Datasets 3.0.0
- Tokenizers 0.19.1
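For reference, the sketch below shows how the hyperparameters above map onto `transformers.TrainingArguments`, using the initial 16-epoch OpenSLR-54 run as an example. This is not the exact training script; `output_dir` is a placeholder, and the Adam betas and epsilon listed above are the `Trainer` defaults, so they are not set explicitly.

```python
from transformers import TrainingArguments

# Mirrors the hyperparameters of the initial 16-epoch OpenSLR-54 run above.
# Assumes a CUDA GPU, which fp16 mixed precision requires.
training_args = TrainingArguments(
    output_dir="wav2vec2-xls-r-300m-nepali",  # placeholder
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,            # 16 x 2 = effective batch size 32
    learning_rate=3e-4,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=16,
    seed=42,
    fp16=True,                                # native AMP mixed precision
)
```

These arguments are passed to `transformers.Trainer` together with a `Wav2Vec2ForCTC` model, a processor, and a CTC padding data collator.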
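The reported numbers are standard word and character error rates. Below is a minimal sketch of computing them with the `evaluate` library (the metrics require `jiwer`); the strings are illustrative placeholders, not data from the test set.

```python
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Placeholder transcriptions: one substituted word out of four gives WER = 0.25.
predictions = ["मेरो नाम राम हो"]
references = ["मेरो नाम श्याम हो"]

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))
```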