language: sv
tags:
  - speech
  - audio
  - automatic-speech-recognition

Wav2Vec 2.0 XLSR Swedish

Swedish version of Wav2Vec 2.0 XLSR, fine-tuned on NST Swedish Dictation and evaluated on Common Voice.

WER: 23.3%
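For readers unfamiliar with the metric: word error rate (WER) is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that definition. The function name `word_error_rate` is hypothetical, and this is not necessarily the script or text normalization that produced the figure above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); no text normalization is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a five-word reference.
    print(word_error_rate("det är en bra dag", "det var en dag"))
```

In practice, casing and punctuation are usually normalized before scoring, since both can change the result noticeably.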

The model does not work in the browser (the cause is unknown), but it can be used locally as follows (code adapted from the Hugging Face examples):

#!/usr/bin/env python3
  
from sys import argv, exit

import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

if __name__ == '__main__':
    if len(argv) < 3:
        print(f'usage: {argv[0]} <model> <audio file>')
        exit(1)

    processor = Wav2Vec2Processor.from_pretrained(argv[1])
    model = Wav2Vec2ForCTC.from_pretrained(argv[1])

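    # Read the audio file; the model expects mono audio at the processor's sampling rate
    audio_path = argv[2]
    samples, sample_rate = sf.read(audio_path)
    # Passing sampling_rate lets the processor reject audio recorded at a different rate
    input_values = processor(samples, sampling_rate=sample_rate, return_tensors="pt").input_values

    # Inference only, so skip gradient tracking
    with torch.no_grad():
        logits = model(input_values).logits
    # Greedy CTC decoding: take the most likely token at every frame
    predicted_ids = torch.argmax(logits, dim=-1)

    transcription = processor.decode(predicted_ids[0])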

    print(transcription)
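The script above passes the audio to the processor exactly as read from disk. Wav2Vec 2.0 XLSR models are pretrained on 16 kHz audio, so a stereo or differently sampled file would likely need converting first. The following is a rough sketch of such a step, assuming a 16 kHz mono target. The helper name `to_mono_16k` and the constant `TARGET_RATE` are hypothetical, and the linear interpolation is a crude stand-in for a proper resampler.

```python
import numpy as np

# Assumption: the model's feature extractor expects 16 kHz mono input.
TARGET_RATE = 16_000


def to_mono_16k(samples: np.ndarray, sample_rate: int) -> np.ndarray:
    """Downmix to mono and resample to TARGET_RATE (hypothetical helper)."""
    samples = np.asarray(samples, dtype=np.float32)
    if samples.ndim == 2:
        # soundfile returns multi-channel audio as (frames, channels);
        # average the channels to get a mono signal.
        samples = samples.mean(axis=1)
    if sample_rate == TARGET_RATE:
        return samples

    # Linear interpolation onto the new time grid. This applies no
    # anti-aliasing filter, so it is only a rough approximation.
    n_out = int(round(len(samples) * TARGET_RATE / sample_rate))
    old_times = np.arange(len(samples)) / sample_rate
    new_times = np.arange(n_out) / TARGET_RATE
    return np.interp(new_times, old_times, samples).astype(np.float32)
```

It could be applied to the output of `sf.read` before calling the processor, with `sampling_rate=TARGET_RATE`. For real use, a dedicated resampler such as those in librosa or torchaudio would be preferable.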