Question about manifest files of commonvoice

#12
by Jaelyn96 - opened

Hi, excellent work! I have a few questions about the manifest files (CV_*.tsv) of CommonVoice:
Manifest File Differences: I noticed that different CV_*.tsv files contain data for different languages. Does the number in the filename (e.g., CV_12.tsv) simply represent the file number, or does it indicate a specific version of the CommonVoice dataset?

Language-Specific Usage: If I only need data for a few specific languages, can I download and use only the manifest files that contain those languages (e.g., files marked with the corresponding language codes)? Or is there a recommended way to filter the dataset for specific languages?

Thank you for your help!

I have an additional question regarding the manifest files and dataset preparation. Do the file paths specified in the manifest directly match the file paths of the dataset I downloaded from the official website linked in the paper, aside from converting .mp3 files to .wav format?

Specifically, I would like to know whether any additional processing has been applied to the dataset, such as:

Concatenating shorter audio files into longer ones while retaining the original file names.

Performing other modifications to the audios.

Is it sufficient to download the corresponding manifest files and dataset, and simply match the paths to obtain a training dataset consistent with the one used in the paper? Or are there additional steps required to ensure compatibility?

Thank you for your clarification!

UTTER - Unified Transcription and Translation for Extended Reality org

Hi,

Thanks for the interest!
Answering the questions from the first post first:

Manifest File Differences: it's simply a number. The number of audio files was too large, so I split it into several files.
However, you will not see the same language appearing in several of them (that is, all English data is in one CV_X.tsv file; it is not spread across several).
The version of commonvoice used is indicated in the paper.

Language-Specific Usage: you can launch training with a subset of the files listed, or you can even add new ones. Just make sure you have a corresponding, aligned CV_*.km for every CV_*.tsv you keep; that is the only thing that matters. However, if you are restarting training with a different collection of languages, you might want to change the language_upsampling_factor and dataset_upsampling_factor parameters (https://github.com/utter-project/fairseq/blob/main/examples/mHuBERT-147/config/pretrain/fp32_fix_3rd_iter.yaml).
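The language-subset advice above can be sketched as follows. This is a minimal illustration, not the project's own tooling, and it assumes the usual fairseq manifest layout (first .tsv line is the root directory, then one "relative/path<TAB>nframes" entry per clip), that the .km file has one label line per .tsv entry in the same order, and that the language code is the first component of each relative path; check your actual files before relying on these assumptions.

```python
# Sketch: keep only selected languages from an aligned CV_*.tsv / CV_*.km pair,
# preserving the line-by-line alignment between manifest and labels.

def filter_manifest(tsv_lines, km_lines, keep_langs):
    """Return (tsv_out, km_out) restricted to keep_langs, order preserved."""
    root, entries = tsv_lines[0], tsv_lines[1:]
    # The .km file must stay aligned with the .tsv entries.
    assert len(entries) == len(km_lines), "manifest and labels are misaligned"
    tsv_out, km_out = [root], []
    for entry, labels in zip(entries, km_lines):
        # Assumed path layout: "<lang>/<...>/<clip>.wav<TAB><nframes>"
        lang = entry.split("\t", 1)[0].split("/", 1)[0]
        if lang in keep_langs:
            tsv_out.append(entry)
            km_out.append(labels)
    return tsv_out, km_out

# Toy example with made-up paths and labels:
tsv = ["/data/cv", "de/clips/a.wav\t48000", "en/clips/b.wav\t32000"]
km = ["12 7 7 3", "5 5 9"]
new_tsv, new_km = filter_manifest(tsv, km, {"de"})
```

Filtering both files in one pass is the point: dropping .tsv rows without dropping the matching .km lines would silently misalign labels with audio.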

UTTER - Unified Transcription and Translation for Extended Reality org

For the pre-processing, you can find more information on section 2.1 of the extended paper (https://arxiv.org/pdf/2406.06371).
CV is not one of the datasets where we performed concatenation; we only filtered it to the [2, 30] s duration range.
The pre-processing scripts are here: https://github.com/utter-project/mHuBERT-147-scripts/tree/main/01_dataset_preproc
Also be aware that several takedown requests to CommonVoice have happened since we collected this data in 2023, so you might not find some of the wave files listed in our manifests.
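The [2, 30] s duration filter mentioned above boils down to a simple check on each clip's length. A minimal sketch, assuming 16 kHz audio and that each manifest entry carries its frame count (the sample rate and manifest layout are assumptions, not taken from the scripts):

```python
# Sketch of the [2, 30] s duration filter on manifest frame counts.

SR = 16_000  # assumed sample rate (Hz)

def in_range(nframes, lo=2.0, hi=30.0, sr=SR):
    """True if the clip duration in seconds lies inside [lo, hi]."""
    duration = nframes / sr
    return lo <= duration <= hi

assert not in_range(1 * SR)   # 1 s clip: too short, dropped
assert in_range(10 * SR)      # 10 s clip: kept
assert not in_range(31 * SR)  # 31 s clip: too long, dropped
```

The actual pre-processing scripts in the linked repository are the authoritative reference; this only illustrates the filtering criterion.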

All the best!

Thank you for your prompt response!
Regarding the pre-processing described in Section 2.1, I would like to confirm my understanding. If no additional pre-processing steps are mentioned in this section, does it mean that the dataset was only filtered to remove audio files outside the [2, 30] second duration range, and no other processing (e.g., concatenation, removing noisy utterances) was applied?

In other words, can I assume that the dataset is simply a filtered version of the original data, with no further transformations unless otherwise specified?

Thank you for clarifying!

UTTER - Unified Transcription and Translation for Extended Reality org

Hi again,

Yes, that's correct. For this specific dataset, only filtering was applied.

Thanks a lot! May I kindly ask the reason for filtering out audio files that fall outside the [2, 30] s range? If the dataset I want to use does not have many samples (e.g., 80,000 samples), do I need to do the filtering? Thank you!
