Wav2Vec2-Large-Tedlium

The Wav2Vec2 large model fine-tuned on the TEDLIUM corpus.

The model is initialised with Facebook's Wav2Vec2 large LV-60k checkpoint pre-trained on 60,000h of audiobooks from the LibriVox project. It is fine-tuned on 452h of TED talks from the TEDLIUM corpus (Release 3). When using the model, make sure that your speech input is sampled at 16Khz.

The model achieves a word error rate (WER) of 8.4% on the dev set and 8.2% on the test set. Training logs document the training and evaluation progress over 50k steps of fine-tuning.

See this notebook for more information on how this model was fine-tuned.

Usage

To transcribe audio files the model can be used as a standalone acoustic model as follows:

 from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
 from datasets import load_dataset
 import torch
 
 # load model and processor
 processor = Wav2Vec2Processor.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium")
 model = Wav2Vec2ForCTC.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium")
     
 # load dummy dataset
 ds = load_dataset("sanchit-gandhi/tedlium_dummy", split="validation")
 
 # process audio inputs
 input_values = processor(ds[0]["audio"]["array"], return_tensors="pt", padding="longest").input_values  # Batch size 1
 
 # retrieve logits
 logits = model(input_values).logits
 
 # take argmax and decode
 predicted_ids = torch.argmax(logits, dim=-1)
 transcription = processor.batch_decode(predicted_ids)
 print("Target: ", ds["text"][0])
 print("Transcription: ", transcription[0])

Evaluation

This code snippet shows how to evaluate Wav2Vec2-Large-Tedlium on the TEDLIUM test data.

from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

tedlium_eval = load_dataset("LIUM/tedlium", "release3", split="test")
model = Wav2Vec2ForCTC.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium")
def map_to_pred(batch):
    input_values = processor(batch["audio"]["array"], return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch
result = tedlium_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["speech"])
print("WER:", wer(result["text"], result["transcription"]))
Downloads last month
16
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Dataset used to train sanchit-gandhi/wav2vec2-large-tedlium

Space using sanchit-gandhi/wav2vec2-large-tedlium 1