Whisper-medium-et

This is a Whisper-medium model (openai/whisper-medium) finetuned on around 800 hours of diverse Estonian data.

Model description

This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.

This model is intended for general-purpose speech recognition of material such as broadcast conversations, interviews, and talks.

It can be used like any other Whisper model via Hugging Face transformers, or with a faster decoder such as faster-whisper.
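A minimal sketch of the transformers route, for orientation only. The Hub identifier `TalTechNLP/whisper-medium-et`, the helper names, and the file name in the usage note are assumptions that this card does not state; substitute the actual repository id or a local path. Beam size 1 corresponds to greedy decoding.

```python
# Assumption: the Hub repository id is not given in this card; replace as needed.
MODEL_ID = "TalTechNLP/whisper-medium-et"


def generation_options(beam_size: int = 1) -> dict:
    """Build decoding options; beam_size=1 means greedy decoding."""
    if beam_size < 1:
        raise ValueError("beam_size must be at least 1")
    return {"num_beams": beam_size}


def transcribe(audio_path: str, model_id: str = MODEL_ID, beam_size: int = 1) -> str:
    """Transcribe one audio file with the transformers ASR pipeline."""
    # Imported here so the module loads even where transformers is not installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # split long recordings into 30-second windows
    )
    result = asr(audio_path, generate_kwargs=generation_options(beam_size))
    return result["text"]
```

Calling `transcribe("interview.wav")` (a hypothetical file) downloads the model on first use and returns the recognized text as a single string. Decoding audio files from disk also requires ffmpeg to be installed.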

Since this model was trained mostly on broadcast speech and texts from the web, it may have trouble correctly decoding the following:

Speech containing technical and other domain-specific terms
Children's speech
Non-native speech
Speech recorded under very noisy conditions or with a microphone far from the speaker
Very spontaneous and overlapping speech

Acoustic training data:

Finetuned using ESPnet, and then converted to transformers format using this script. The finetuning procedure is similar to that of this model.

WER results below are obtained using greedy decoding (i.e., beam size 1).

Dataset	WER (%)
Common Voice 8.0	13.8
Common Voice 11.0	14.7
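
For reference, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below implements that standard definition over a whole test set. This card does not say which scoring tool or text normalization (casing, punctuation) produced the numbers above, so the function names and the whitespace tokenization here are illustrative assumptions, not the evaluation setup actually used.

```python
from typing import Iterable, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance between a reference and a hypothesis."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def wer_percent(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER in percent over (reference, hypothesis) text pairs.

    Assumption: words are split on whitespace with no further normalization.
    """
    total_errors = 0
    total_words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        total_errors += edit_distance(ref_words, hypothesis.split())
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("reference text contains no words")
    return 100.0 * total_errors / total_words
```

Errors are summed over the whole set before dividing, so long utterances weigh more than short ones; averaging per-utterance WER would give a different number.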