metadata
language: sv
tags:
- speech
- audio
- automatic-speech-recognition
Wav2Vec 2.0 XLSR Swedish
A Swedish version of Wav2Vec 2.0 XLSR, fine-tuned on NST Swedish Dictation and evaluated on Common Voice.
WER: 23.3%
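For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that definition, not the evaluation script behind the figure above; the function name `word_error_rate` and the absence of any text normalization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Reported WER figures are normally computed over a whole test set (total errors divided by total reference words) rather than averaged per sentence, and they depend on how casing and punctuation are normalized beforehand.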
The model does not work in the browser for some reason, but it can be used from Python as follows (code adapted from Hugging Face):
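#!/usr/bin/env python3
from sys import argv, exit

import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

if __name__ == '__main__':
    if len(argv) < 3:
        print(f'usage: {argv[0]} <model> <file 1>')
        exit(1)

    # Load the processor (feature extractor + tokenizer) and the CTC model.
    processor = Wav2Vec2Processor.from_pretrained(argv[1])
    model = Wav2Vec2ForCTC.from_pretrained(argv[1])

    # Read the audio file. Passing the sampling rate lets the processor
    # reject audio recorded at a rate the model was not trained on.
    f = argv[2]
    s, sample_rate = sf.read(f)
    input_values = processor(s, sampling_rate=sample_rate, return_tensors="pt").input_values

    # Inference only, so gradient tracking is not needed.
    with torch.no_grad():
        logits = model(input_values).logits

    # Greedy decoding: pick the most likely token per frame, then decode to text.
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.decode(predicted_ids[0])
    print(transcription)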
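The last two steps of the script (per-frame `argmax` followed by `processor.decode`) amount to greedy CTC decoding: repeated frame predictions are collapsed, blank tokens are dropped, and the word delimiter becomes a space. The sketch below illustrates that idea in plain Python. The helper `greedy_ctc_decode`, the toy vocabulary, and the token names `<pad>` and `|` are assumptions chosen for illustration; the model's real vocabulary lives in its tokenizer files.

```python
def greedy_ctc_decode(ids, id_to_token, blank="<pad>", word_delimiter="|"):
    """Turn per-frame token ids into text, CTC style."""
    # 1. Collapse runs of identical consecutive ids into a single token.
    tokens = []
    previous = None
    for token_id in ids:
        if token_id != previous:
            tokens.append(id_to_token[token_id])
        previous = token_id

    # 2. Drop blank tokens; a blank between two equal letters keeps both.
    text = "".join(token for token in tokens if token != blank)

    # 3. Replace the word delimiter with spaces and tidy the whitespace.
    return " ".join(text.replace(word_delimiter, " ").split())


# Hypothetical five-symbol vocabulary, for illustration only.
toy_vocab = {0: "<pad>", 1: "|", 2: "h", 3: "e", 4: "j"}
frame_ids = [2, 2, 0, 3, 3, 3, 4, 0, 1, 1, 2, 3, 4]
print(greedy_ctc_decode(frame_ids, toy_vocab))
```

Greedy decoding is the simplest option and uses no language model; the transcription quality it gives is what the script above produces.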