Whisper-medium-et

This is the Whisper-medium model (openai/whisper-medium) fine-tuned on around 800 hours of diverse Estonian data.

Model description

This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.

Intended uses & limitations

This model is intended for general-purpose speech recognition, such as broadcast conversations, interviews, talks, etc.

How to use

Use it like any other Whisper model via Hugging Face transformers, or with a faster decoder such as faster-whisper.
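The two options above can be sketched as follows. This is a minimal, non-authoritative example: the repository ID in MODEL_ID, the audio file name, and the CTranslate2 model directory are assumptions that do not come from this card and should be replaced with real values. The faster-whisper variant additionally assumes the model has already been converted to CTranslate2 format.

```python
# Sketch only: MODEL_ID and all file paths below are assumptions, not from this card.
MODEL_ID = "TalTechNLP/whisper-medium-et"  # assumed Hub repository ID


def build_generate_kwargs(language="et", task="transcribe", num_beams=1):
    """Decoding options passed to Whisper's generate(); num_beams=1 is greedy."""
    return {"language": language, "task": task, "num_beams": num_beams}


def transcribe_with_transformers(audio_path, model_id=MODEL_ID):
    """Transcribe one audio file with the Hugging Face ASR pipeline."""
    from transformers import pipeline  # imported lazily; requires `transformers`

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # split long recordings into 30-second chunks
    )
    result = asr(audio_path, generate_kwargs=build_generate_kwargs())
    return result["text"]


def transcribe_with_faster_whisper(audio_path, ct2_model_dir):
    """Transcribe with faster-whisper.

    ct2_model_dir is a hypothetical local directory holding a CTranslate2
    conversion of the model; the conversion step is not shown here.
    """
    from faster_whisper import WhisperModel  # requires `faster-whisper`

    model = WhisperModel(ct2_model_dir, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path, language="et", beam_size=1)
    # `segments` is a generator; decoding happens while it is consumed.
    return " ".join(segment.text.strip() for segment in segments)


if __name__ == "__main__":
    # "example.wav" is a placeholder file name.
    print(transcribe_with_transformers("example.wav"))
```

Setting the language explicitly avoids relying on Whisper's automatic language detection, which may misfire on short or noisy clips.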

Limitations and bias

Since this model was trained on mostly broadcast speech and texts from the web, it might have problems correctly decoding the following:

  • Speech containing technical and other domain-specific terms
  • Children's speech
  • Non-native speech
  • Speech recorded under very noisy conditions or with a microphone far from the speaker
  • Very spontaneous and overlapping speech

Training data

Acoustic training data:

Type                     Amount (h)
Broadcast speech         591
Spontaneous speech       53
Elderly speech corpus    53
Talks, lectures          49
Parliament speeches      31
Total                    761

Training procedure

Fine-tuned using ESPnet and then converted to transformers format using this script. The fine-tuning procedure is similar to that of this model.

Evaluation results

WER

WER results below are obtained using greedy decoding (i.e., beam size 1).

Dataset              WER
Common Voice 8.0     13.8
Common Voice 11.0    14.7
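For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, insertions) between the reference and the hypothesis, divided by the number of reference words. The sketch below illustrates that definition only. It is not the evaluation script behind the figures above; it assumes plain whitespace tokenization with no text normalization, which this card does not specify, and it returns a fraction where the table presumably reports percentages.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref_words)


if __name__ == "__main__":
    # One substituted word out of three reference words.
    print(word_error_rate("tere tulemast tallinna", "tere tulemast tartu"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.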