Vietnamese Automatic Lyrics Transcription

This project performs automatic lyrics transcription on Vietnamese songs. The pre-trained model used for this task is Whisper, introduced in the paper "Robust Speech Recognition via Large-Scale Weak Supervision".

Fine-Tuning

The model is fine-tuned on 8,000 Vietnamese songs scraped from zingmp3.vn (a Vietnamese music streaming service similar to Spotify). The average song duration is 4.7 minutes, with an average of 90.7 words per minute.

7,000 songs are used for training and 1,000 songs for validation. The metrics reported below are computed on the 1,000 validation songs.
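As a quick sanity check on these figures, the sketch below derives the implied words per song and the approximate audio hours per split. It assumes the corpus-wide averages (4.7 minutes, 90.7 words per minute) hold equally for the training and validation splits, which the text does not state explicitly.

```python
# Figures quoted in the text above.
avg_minutes_per_song = 4.7
words_per_minute = 90.7
train_songs, val_songs = 7_000, 1_000

# Assumption: the corpus-wide averages apply equally to both splits.
words_per_song = avg_minutes_per_song * words_per_minute
train_hours = train_songs * avg_minutes_per_song / 60
val_hours = val_songs * avg_minutes_per_song / 60

print(f"~{words_per_song:.0f} words per song")
print(f"~{train_hours:.0f} h of training audio, ~{val_hours:.0f} h of validation audio")
```

Under that assumption the training estimate comes out near 548 hours, in line with the roughly 550 hours quoted under Training Data.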

Evaluation

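| Model            | WER (Lowercase) | WER (Case-Sensitive) | CER (Lowercase) | CER (Case-Sensitive) |
|------------------|-----------------|----------------------|-----------------|----------------------|
| whisper-medium   | 23.15           | 26.42                | 17.01           | 17.03                |
| whisper-large-v2 | 20.52           | 24.61                | 16.09           | 17.14                |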
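For readers unfamiliar with the metrics, here is a minimal sketch of how word error rate (WER) and character error rate (CER) can be computed in lowercase and case-sensitive variants. It uses a plain Levenshtein distance and whitespace tokenization; the function names are illustrative, and the evaluation script and text normalization actually used for the numbers above are not described in this card, so they may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis, lowercase=False):
    """Word error rate in percent; tokens are split on whitespace."""
    if lowercase:
        reference, hypothesis = reference.lower(), hypothesis.lower()
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis, lowercase=False):
    """Character error rate in percent; spaces count as characters."""
    if lowercase:
        reference, hypothesis = reference.lower(), hypothesis.lower()
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


# Hand-written example strings, not taken from the dataset.
reference = "Hello World again"
hypothesis = "hello world again"
print(wer(reference, hypothesis))                  # case-sensitive
print(wer(reference, hypothesis, lowercase=True))  # case-insensitive
```

Because casing mistakes count as substitutions, the case-sensitive score can never be lower than the lowercase score for the same pair of strings.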

Lyrics Transcription

To transcribe a song, we can use the Transformers pipeline. Chunking is enabled by setting chunk_length_s=30 when instantiating the pipeline, and with chunking enabled the pipeline can run batched inference. The pipeline can also predict sequence-level timestamps by passing return_timestamps=True. The following example instead passes return_timestamps="word", which returns the start and end time of each individual word in the audio.

>>> import torch
>>> from transformers import pipeline
>>> model_id = "xyzDivergence/whisper-medium-vietnamese-lyrics-transcription"
>>> # Use the GPU when one is available, otherwise fall back to the CPU.
>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> asr_pipeline = pipeline(
...     "automatic-speech-recognition",
...     model=model_id,
...     tokenizer=model_id,
...     chunk_length_s=30,  # split long audio into 30-second chunks
...     device=device,
... )
>>> transcription = asr_pipeline("sample_audio.mp3", return_timestamps="word")
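With return_timestamps="word", the pipeline returns a dictionary with the full "text" and a "chunks" list, where each chunk holds one word and a (start, end) timestamp in seconds. The sketch below formats such a result for display. The helper name format_word_timestamps and the sample dictionary are illustrative stand-ins, not output from this model.

```python
def format_word_timestamps(result):
    """Render each word chunk as '[start - end] word'."""
    lines = []
    for chunk in result["chunks"]:
        start, end = chunk["timestamp"]
        # The end time of the final chunk can be missing (None).
        end_str = f"{end:.2f}" if end is not None else "?"
        lines.append(f"[{start:.2f} - {end_str}] {chunk['text'].strip()}")
    return lines


# Hand-written stand-in for the pipeline output (not real model output).
sample = {
    "text": " xin chào Việt Nam",
    "chunks": [
        {"text": " xin", "timestamp": (0.0, 0.42)},
        {"text": " chào", "timestamp": (0.42, 0.9)},
        {"text": " Việt", "timestamp": (0.9, 1.3)},
        {"text": " Nam", "timestamp": (1.3, None)},
    ],
}

for line in format_word_timestamps(sample):
    print(line)
```

In practice the same helper would be called on the transcription object returned by asr_pipeline. To use batched inference over the 30-second chunks, a batch_size argument can be passed in the same call, for example asr_pipeline("sample_audio.mp3", batch_size=8, return_timestamps="word"), where 8 is an arbitrary value to tune for the available memory.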

Training Data

The training dataset consists of 7,000 Vietnamese songs, totaling roughly 550 hours of audio, across various Vietnamese music genres, dialects, and accents. Due to copyright concerns, the raw data is not publicly available. However, CSV files containing links to the songs and lyrics are available in our repository and can be used to download them. Each song includes lyrics with line-level timestamps, which allow audio segments to be mapped to their corresponding lyric lines.
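One way line-level timestamps could be turned into training examples is to group consecutive lyric lines into segments that fit Whisper's 30-second input window. The sketch below shows that idea under stated assumptions: the LyricLine fields, the group_lines helper, and the greedy grouping rule are all hypothetical, and the card does not describe the CSV layout or the segmentation actually used for fine-tuning.

```python
from dataclasses import dataclass


@dataclass
class LyricLine:
    start: float  # line start time in seconds
    end: float    # line end time in seconds
    text: str     # lyric text for this line


def group_lines(lines, max_duration=30.0):
    """Greedily merge consecutive lines into segments of at most max_duration seconds.

    A single line longer than max_duration is kept as its own segment.
    Returns a list of (segment_start, segment_end, joined_text) tuples.
    """
    segments, current = [], []
    for line in lines:
        # Close the current segment if adding this line would exceed the limit.
        if current and line.end - current[0].start > max_duration:
            segments.append(current)
            current = []
        current.append(line)
    if current:
        segments.append(current)
    return [(seg[0].start, seg[-1].end, " ".join(l.text for l in seg))
            for seg in segments]


# Hand-written example lines, not taken from the dataset.
example = [
    LyricLine(0.0, 10.0, "line one"),
    LyricLine(10.0, 22.0, "line two"),
    LyricLine(22.0, 35.0, "line three"),
    LyricLine(35.0, 41.0, "line four"),
]
for start, end, text in group_lines(example):
    print(f"{start:.1f}-{end:.1f}s: {text}")
```

Each resulting tuple gives the audio span to cut and the text to use as its target transcript.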

Technical report coming soon. This project was made through equal contributions from:
