Whisper Small Arabic - Neethu VM

This model is a fine-tuned version of openai/whisper-small on the Arabic Common Voice 11.0 dataset. It achieves the following results on the evaluation set:

Loss: 0.3402
WER (%): 44.8627

Model description

This model is a fine-tuned version of openai/whisper-small, tailored specifically for Arabic speech recognition tasks. The model was trained using the Arabic subset of the Common Voice 11.0 dataset, which is a large-scale, open-source collection of transcribed speech data provided by the Mozilla Foundation.

Intended uses & limitations

Speech-to-Text Conversion: This model is designed to transcribe spoken Arabic into written text. Given the evaluation WER of about 44.9%, transcripts may need review in applications that require high accuracy.

Voice-Activated Interfaces: Enhance applications and devices with voice recognition capabilities, enabling users to interact with technology in Arabic.

Accessibility Tools: Assist in making audio content accessible to those with hearing impairments or in environments where audio cannot be played.

Content Creation and Archiving: Streamline the transcription process for content creators, journalists, and researchers working with Arabic audio materials.
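
A minimal sketch of how the model might be used for transcription through the Transformers `pipeline` API. The repository id `neethuvm/whisper-small-arnw` is taken from the name shown at the end of this card and is an assumption, as are the `language` and `task` generation settings. The `asr` parameter is a hypothetical hook so that an already loaded pipeline can be reused.

```python
MODEL_ID = "neethuvm/whisper-small-arnw"  # assumed repository id


def load_asr(model_id=MODEL_ID):
    """Build a speech-recognition pipeline (downloads weights on first use)."""
    # Imported lazily so the module can be loaded without `transformers`.
    from transformers import pipeline

    return pipeline("automatic-speech-recognition", model=model_id)


def transcribe(audio_path, asr=None):
    """Return the transcript of one audio file as a stripped string."""
    if asr is None:
        asr = load_asr()
    # Assumed generation settings: force Arabic transcription (not translation).
    result = asr(
        audio_path,
        generate_kwargs={"language": "arabic", "task": "transcribe"},
    )
    return result["text"].strip()
```

Calling `transcribe("sample.wav")` would transcribe a local file, where `sample.wav` is a placeholder name. Loading the pipeline once with `load_asr()` and passing it as `asr` avoids reloading the weights for every file.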

Training and evaluation data

Dataset: The model was fine-tuned using the Arabic subset of the Common Voice 11.0 dataset, a large-scale, open-source dataset created by Mozilla.

Data Characteristics: The Common Voice dataset is a diverse collection of voice recordings contributed by volunteers worldwide, encompassing a wide range of speakers, accents, and environments. The Arabic subset includes various dialects and speech styles, contributing to the model's ability to generalize across different Arabic-speaking regions.

Preprocessing: The audio data was preprocessed to standardize sampling rates and formats, ensuring compatibility with the Whisper model's input requirements.

Evaluation Dataset: The evaluation was conducted using a designated test split of the Common Voice Arabic dataset, so the reported metrics reflect performance on data not seen during training.

Metrics: The primary metric used for evaluating the model's performance is the Word Error Rate (WER), which measures the accuracy of the transcriptions by comparing the predicted text to the ground truth.
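
To make the metric concrete: WER is the number of word substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. The sketch below computes it with a word-level edit distance in plain Python. It is an illustration only; the card does not say which WER implementation produced the reported numbers, and the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # substitution or exact match
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

For the reference `"a b c d"` and the hypothesis `"a x c"`, there is one substitution and one deletion over four reference words, so the function returns 0.5. The values reported in this card are this ratio multiplied by 100.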

Training procedure

Steps Involved

Data Preparation:

Data Collection: Gathered the Arabic subset from the Common Voice 11.0 dataset.

Preprocessing: Standardized the audio data by normalizing sampling rates and formats. Transcriptions were cleaned and aligned with the audio files to ensure accurate training pairs.

Model Setup:

Base Model: The Whisper-small model was used as the base model due to its capability to handle diverse speech recognition tasks.

Environment Configuration: Training was conducted on a machine equipped with a suitable GPU to handle the model's computational requirements efficiently.

Fine-Tuning:

Hyperparameters: The learning rate, batch size, and other training hyperparameters were chosen to balance performance and training time.

Training Process: The model was trained for 4000 steps (about 1.66 epochs), with regular checkpoints to save progress and an evaluation on the validation set every 1000 steps.

Loss Function: Cross-entropy loss was used to optimize the model's predictions against the ground truth transcriptions.

Evaluation:

Validation Set: A portion of the dataset was reserved for validation to monitor the model's performance and avoid overfitting.

Metrics: Word Error Rate (WER) and validation loss were used as the primary metrics to assess the model's accuracy and generalization capability.

Optimization:

Early Stopping: Implemented to prevent overfitting, stopping the training when the validation loss ceased to improve significantly.

Fine-Tuning Adjustments: Hyperparameters and learning strategies were adjusted based on validation performance to enhance model accuracy.
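
The data preparation step described above might be sketched as a per-row mapping function. This is an assumption about the approach, not the author's actual script: it takes a Whisper feature extractor and tokenizer as arguments and relies on the Common Voice column names `audio` and `sentence`. The function name `prepare_example` is hypothetical.

```python
def prepare_example(example, feature_extractor, tokenizer):
    """Turn one Common Voice row into Whisper model inputs and label ids."""
    # Each audio entry is a dict with a waveform and its sampling rate.
    audio = example["audio"]
    features = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    )
    return {
        # The extractor returns a batch; keep the single item's features.
        "input_features": features.input_features[0],
        # Token ids of the transcript serve as training labels.
        "labels": tokenizer(example["sentence"]).input_ids,
    }
```

In practice the two helpers would likely be `WhisperFeatureExtractor` and `WhisperTokenizer` loaded from `openai/whisper-small`, and the audio column would first be resampled to the 16 kHz rate that Whisper expects, for example with `dataset.cast_column("audio", Audio(sampling_rate=16000))` from the Datasets library.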

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-05
train_batch_size: 16
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
training_steps: 4000
mixed_precision_training: Native AMP
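
As a rough illustration, the values above could be expressed as keyword arguments for `Seq2SeqTrainingArguments` from Transformers. The choice of that class, the mapping of `train_batch_size` to a per-device value (which assumes a single GPU), and the helper names are assumptions. The `linear_lr` function sketches what a linear schedule with 500 warmup steps over 4000 total steps implies for the learning rate at a given step.

```python
# Values copied from the hyperparameter list; the keyword names follow
# transformers.Seq2SeqTrainingArguments (an assumed choice of trainer).
TRAINING_KWARGS = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 16,  # assumes training on a single device
    "per_device_eval_batch_size": 8,
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "warmup_steps": 500,
    "max_steps": 4000,
    "fp16": True,  # native AMP mixed precision; needs a supported GPU
}


def build_training_args(output_dir):
    """Create training arguments; `output_dir` is a caller-chosen path."""
    # Imported lazily so the module can be loaded without `transformers`.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)


def linear_lr(step, base_lr=1e-5, warmup_steps=500, total_steps=4000):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

Under this schedule the learning rate rises from zero to its 1e-05 peak at step 500 and then falls linearly, reaching zero at step 4000.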

Training results

The table below shows the model's training and validation progress over 4000 training steps (about 1.66 epochs), with both loss and Word Error Rate (WER) improving as training progressed.

Training Loss	Epoch	Step	Validation Loss	WER (%)
0.3059	0.4156	1000	0.4141	49.8008
0.2894	0.8313	2000	0.3603	46.8148
0.1908	1.2469	3000	0.3519	46.4806
0.1699	1.6625	4000	0.3402	44.8627

Analysis

Training Loss: This metric reflects the model's performance on the training data. A decrease in training loss over time indicates that the model is learning to fit the training data more accurately.

Validation Loss: This metric indicates how well the model generalizes to unseen data. The consistent decrease in validation loss suggests improved generalization.

Word Error Rate (WER): This is the key metric for evaluating the model's accuracy in transcribing speech. WER fell from 49.80% at step 1000 to 44.86% at step 4000, a reduction of about 4.94 percentage points (roughly 10% relative), showing that transcription accuracy continued to improve throughout training.

These results showcase the model's learning curve and highlight its increased proficiency with further training. This information can help users understand the model's training dynamics and its expected performance in practical applications.
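
The trends discussed in this analysis can be checked directly from the table. The short sketch below copies the reported validation loss and WER values and computes the absolute and relative WER reduction between the first and last evaluation; the helper names are hypothetical.

```python
# (step, validation loss, WER in percent) copied from the results table
RESULTS = [
    (1000, 0.4141, 49.8008),
    (2000, 0.3603, 46.8148),
    (3000, 0.3519, 46.4806),
    (4000, 0.3402, 44.8627),
]


def wer_improvement(results):
    """Return the WER drop from first to last row.

    The first value is in percentage points, the second is the drop
    relative to the starting WER, in percent.
    """
    first_wer = results[0][2]
    last_wer = results[-1][2]
    absolute = first_wer - last_wer
    relative = 100 * absolute / first_wer
    return absolute, relative


def is_strictly_decreasing(values):
    """True if every value is smaller than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))
```

On these numbers the WER drops by about 4.94 percentage points, or about 9.9% relative to the starting value, and both the validation loss and the WER decrease at every evaluation step.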

Framework versions

Transformers 4.41.1
Pytorch 2.2.1+cu121
Datasets 2.19.1
Tokenizers 0.19.1

neethuvm/whisper-small-arnw