---
library_name: transformers
base_model:
- facebook/wav2vec2-xls-r-300m
tags:
- ASR
- Nepali ASR
- OpenSLR Nepali
- Nepali ASR Wav2Vec2
- XLS-R
model-index:
- name: Wav2Vec2_XLS-R-300m_Nepali_ASR
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: OpenSLR Nepali ASR (SLR54)
      type: iamTangsang/OpenSLR54-Nepali-ASR
    metrics:
    - type: wer
      value: 16.82
      name: Test WER
    - type: cer
      value: 2.72
      name: Test CER
datasets:
- iamTangsang/OpenSLR54-Nepali-ASR
- mozilla-foundation/common_voice_17_0
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
license: mit
language:
- ne
---

# Wav2Vec2_XLS-R-300m_Nepali_ASR

This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co./facebook/wav2vec2-xls-r-300m) on:
- [Large Nepali ASR training data set from OpenSLR (SLR-54)](https://www.openslr.org/54/)
- [Common Voice Corpus 17.0](https://huggingface.co./datasets/mozilla-foundation/common_voice_17_0)

## Model description

The model is the 300-million-parameter Wav2Vec2 XLS-R checkpoint fine-tuned for Nepali automatic speech recognition. The reported results are on the OpenSLR test split:

- WER on OpenSLR: 16.82%
- CER on OpenSLR: 2.72%

## Intended uses & limitations

- Research on Nepali ASR
- Transcription of Nepali audio
- Further fine-tuning

### Limitations

- The model is trained on the OpenSLR Nepali ASR dataset, which upon inspection was found to be quite noisy and inconsistent.
- Due to resource limitations, utterances longer than 5 seconds were filtered out of the dataset during training and evaluation.
- Numerals were filtered out as well.
- The vocabulary does not contain all Nepali characters.
- The model may perform poorly on audio segments longer than 5 seconds, or it needs a mechanism to segment audio into 5-second chunks, which can increase processing time (see the usage sketch below).
- It may struggle with background noise and overlapping speech.
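## How to use

A minimal inference sketch using the `transformers` ASR pipeline. The repository id below is an assumption (the model name under the dataset owner's namespace); replace it with the actual path of this checkpoint. Passing `chunk_length_s`/`stride_length_s` makes the pipeline split longer recordings into overlapping windows close to the 5-second limit discussed above.

```python
from transformers import pipeline

# Assumed repository id; point this at the actual location of the checkpoint.
asr = pipeline(
    "automatic-speech-recognition",
    model="iamTangsang/Wav2Vec2_XLS-R-300m_Nepali_ASR",
)

# Chunk long audio into ~5 s windows; the 1 s strides overlap adjacent chunks
# so that words are not cut off at the window boundaries.
result = asr("nepali_sample.wav", chunk_length_s=5, stride_length_s=(1, 1))
print(result["text"])
```

The pipeline decodes the file with ffmpeg and resamples it to the 16 kHz rate that XLS-R expects, so most common audio formats work directly.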
## Training and evaluation data

### Common Voice v17.0

- This model has been fine-tuned on OpenSLR-54 (Nepali ASR training dataset) and Common Voice Corpus v17.0.
- Initially, the model was trained on [Common Voice v17.0 ne-NP](https://huggingface.co./datasets/mozilla-foundation/common_voice_17_0/viewer/ne-NP), which consists of about 2 hours of voice data, of which about 1 hour has been manually validated.
- Because the dataset is very small, we first combined the `validated` and `other` splits, giving a total of 1,337 utterances.
- We preprocessed the data by removing all punctuation and symbols.
- We then used 80% of the utterances for training and 10% for evaluation.
- We used the `test` split, consisting of 217 utterances, for testing. (Some of it may also have been present in the `train` split.)
- The model was trained for 30 epochs; the WER fluctuated around 37% to 39%.

### OpenSLR Nepali ASR training data

- The model was then further trained on the larger OpenSLR Nepali ASR training dataset, which has about 157,000 utterances.
- First, numerals were removed, as those utterances were inconsistent with their transcriptions.
- Segments longer than 5 seconds were removed because of resource limitations.
- Less frequently used characters were removed to reduce the vocabulary size.
- This left 136,083 utterances in total. The filtered dataset has been uploaded [here](https://huggingface.co./datasets/iamTangsang/OpenSLR54-Nepali-ASR).
- 80% was used for training, 10% for evaluation, and 10% for testing.

## Training procedure

### Training on Common Voice 17.0

The following hyperparameters were used during training:
- learning_rate: 3e-04
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 400
- num_epochs: 30
- mixed_precision_training: Native AMP

### Initial training on OpenSLR-54 for 16 epochs

The following hyperparameters were used:
- learning_rate: 3e-04
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 16
- mixed_precision_training: Native AMP

### Further training on OpenSLR-54 for 3 more epochs

The following hyperparameters were used:
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 700
- num_epochs: 3
- mixed_precision_training: Native AMP

### Framework versions

- Transformers 4.44.2
- PyTorch 2.4.1+cu121
- Datasets 3.0.0
- Tokenizers 0.19.1
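For reference, the sketch below shows how the hyperparameters above map onto `transformers.TrainingArguments`, using the initial 16-epoch OpenSLR-54 run as an example. This is not the exact training script; `output_dir` is a placeholder, and the Adam betas and epsilon listed above are the `Trainer` defaults, so they are not set explicitly.

```python
from transformers import TrainingArguments

# Mirrors the hyperparameters of the initial 16-epoch OpenSLR-54 run above.
# Assumes a CUDA GPU, which fp16 mixed precision requires.
training_args = TrainingArguments(
    output_dir="wav2vec2-xls-r-300m-nepali",  # placeholder
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,            # 16 x 2 = effective batch size 32
    learning_rate=3e-4,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=16,
    seed=42,
    fp16=True,                                # native AMP mixed precision
)
```

These arguments are passed to `transformers.Trainer` together with a `Wav2Vec2ForCTC` model, a processor, and a CTC padding data collator.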
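The reported numbers are standard word and character error rates. Below is a minimal sketch of computing them with the `evaluate` library (the metrics require `jiwer`); the strings are illustrative placeholders, not data from the test set.

```python
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Placeholder transcriptions: one substituted word out of four gives WER = 0.25.
predictions = ["मेरो नाम राम हो"]
references = ["मेरो नाम श्याम हो"]

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))
```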