WhisperD

WhisperD is a fine-tuned version of whisper-large-v2 that transcribes multi-speaker, conversational speech. It was used to generate synthetic transcriptions for training Parakeet. The model performs diarization implicitly: speaker tags such as "[S1]" and "[S2]" in the output denote speaker identity. WhisperD can often transcribe non-speech events as well, e.g. "(coughs)" and "(laughs)", and its outputs include disfluencies.

Example Output:

[S1] What's sort of cool is that, uh, you can produce coughs if you have to. [S2] What do you mean? [S1] Well, (coughs) there, I just coughed.
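Because speaker tags and non-speech events are plain text in the output, they can be recovered with ordinary string processing. The following is a minimal sketch, not part of the WhisperD release: the helper names `split_turns` and `non_speech_events` are hypothetical, and the regular expressions assume that tags always look like "[S1]" and that events always appear in parentheses, as in the example above.

```python
import re

# A speaker tag such as "[S1]", followed by that speaker's text, up to the
# next speaker tag or the end of the string.
TURN_PATTERN = re.compile(r"\[S(\d+)\]\s*(.*?)\s*(?=\[S\d+\]|$)", re.DOTALL)

# Any parenthesized span, e.g. "(coughs)". Assumes events are never nested.
EVENT_PATTERN = re.compile(r"\(([^()]+)\)")


def split_turns(transcript):
    """Return a list of (speaker_number, text) tuples in spoken order."""
    return [(int(speaker), text) for speaker, text in TURN_PATTERN.findall(transcript)]


def non_speech_events(transcript):
    """Return the labels of parenthesized events, without the parentheses."""
    return EVENT_PATTERN.findall(transcript)


example = (
    "[S1] What's sort of cool is that, uh, you can produce coughs if you have to. "
    "[S2] What do you mean? [S1] Well, (coughs) there, I just coughed."
)

for speaker, text in split_turns(example):
    print(f"Speaker {speaker}: {text}")
```

Since the model only "often" emits non-speech events and the tag format is inferred from a single example, downstream code should probably tolerate transcripts in which these patterns match nothing.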

More details can be found in the WhisperD blog post.

Caution:

This model has only been tested on segments up to 30 seconds long. Conditioning on previous text was not included during fine-tuning, so the model may handle it poorly. If a pipeline or codebase relies on this feature to transcribe audio longer than 30 seconds, generation quality may be poor.
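One way to stay within this limit on longer recordings is to cut the audio into independent windows of at most 30 seconds and transcribe each one separately, with no text carried over between windows. The sketch below only computes the window boundaries. The function name `chunk_bounds` is hypothetical, and fixed-length cutting is an assumption made for illustration, not a method described by the authors.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_seconds=30):
    """Return (start, end) sample indices that cover the audio in
    consecutive windows of at most `chunk_seconds` seconds each."""
    chunk_len = sample_rate * chunk_seconds
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]


# A 70-second recording at 16 kHz yields two full windows and one 10-second tail.
for start, end in chunk_bounds(70 * 16000):
    print(start, end)
```

Each `audio[start:end]` slice could then go through the same processor and `model.generate` calls shown under Usage. Two caveats apply to this approach: a fixed cut can land in the middle of a word, and since each window is transcribed independently, speaker tags such as "[S1]" are unlikely to refer to the same person consistently across windows.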

Usage:

import torch
import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor, WhisperTokenizer

model = WhisperForConditionalGeneration.from_pretrained('jordand/whisper-d-v1a', torch_dtype=torch.float16).cuda()
processor = WhisperProcessor.from_pretrained('openai/whisper-large-v2')
tokenizer = WhisperTokenizer.from_pretrained('openai/whisper-large-v2')

# Clear the token suppression list and forced decoder prompt from the generation config
model.generation_config.suppress_tokens = None
model.generation_config.forced_decoder_ids = None

# Load the audio, downmix to mono, and resample to the 16 kHz rate Whisper expects
audio, sr = torchaudio.load('PATH_TO_AUDIO_FILE')
audio = audio.mean(dim=0, keepdim=True)
audio = torchaudio.transforms.Resample(sr, 16000)(audio)
audio = audio[0, :16000*30] # WhisperD can only handle up to 30 seconds of audio

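# Compute log-mel input features; pass the sampling rate explicitly so the processor can verify it
inputs = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")
# Match the model's device and float16 dtype before generating
model_out = model.generate(inputs['input_features'].cuda().half())
text = tokenizer.decode(model_out[0], skip_special_tokens=True)
print(text)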

Citation:

For now, please cite:

@misc{darefsky2024parakeet,
    author = {Darefsky, Jordan and Zhu, Ge and Duan, Zhiyao},
    title = {Parakeet},
    year = {2024},
    url = {https://jordandarefsky.com/blog/2024/parakeet/}
}