What is Retrieval-based Voice Conversion WebUI?

Community Article Published August 18, 2024

Retrieval-based Voice Conversion WebUI is an open-source framework designed to make voice conversion simple and efficient. Built on the VITS model, it provides an easy-to-use interface for both inference and training, making it accessible even to those with limited experience in machine learning or audio processing. The WebUI supports a range of features, including voice conversion, real-time voice changing, and the ability to train models using small datasets.

UI preview

Training and inference WebUI: go-web.bat (runs infer-web.py)
Real-time voice changing GUI: go-realtime-gui.bat

Features:

  • Reduce tone leakage by replacing source features with training-set features via top-1 retrieval;
  • Easy and fast training, even on weak graphics cards;
  • Training with small amounts of data (at least 10 minutes of low-noise speech recommended);
  • Model fusion to change timbres (via the ckpt processing tab -> ckpt merge);
  • Easy-to-use WebUI;
  • UVR5 model to quickly separate vocals and instruments;
  • High-pitch voice extraction with the InterSpeech 2023 RMVPE algorithm, which prevents the muted-sound problem, gives significantly better results than Crepe_full, and runs faster with lower resource consumption;
  • AMD/Intel graphics card acceleration supported;
  • Intel Arc graphics card acceleration with IPEX supported.

Getting Started with Inference and Training

1. Set Up the Environment

To start using the Retrieval-based Voice Conversion WebUI, you’ll first need to prepare your environment. The framework requires Python 3.8 or higher.

Install Core Dependencies:

For NVIDIA GPUs:

pip install torch torchvision torchaudio

For AMD GPUs on Linux:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2

Install Other Dependencies:

Install using poetry:

curl -sSL https://install.python-poetry.org | python3 -
poetry install

Or using pip:

pip install -r requirements.txt

2. Download Pre-trained Models

The WebUI requires several pre-trained models to function properly. You can download these automatically using a provided script:

python tools/download_models.py

Alternatively, download the models manually from Hugging Face.

3. Install FFmpeg

FFmpeg is necessary for handling audio files. Installation steps vary depending on your operating system:

  • Ubuntu/Debian:
    sudo apt install ffmpeg
    
  • macOS:
    brew install ffmpeg
    
  • Windows: Download ffmpeg.exe and ffprobe.exe from Hugging Face and place them in the root folder.

4. Start the WebUI

Once your environment is set up and the necessary models are downloaded, you can start the WebUI.

For general usage:

python infer-web.py

For Windows users, you can also start the WebUI by double-clicking go-web.bat.

Training a New Model

Training your own voice conversion model with Retrieval-based Voice Conversion WebUI is straightforward and can be done with as little as 10 minutes of low-noise speech data.

  1. Prepare Your Dataset: Collect and preprocess your audio data.
  2. Start the Training Interface: Launch the WebUI as described above and navigate to the training section.
  3. Set Training Parameters: Configure the model parameters and training options based on your dataset.
  4. Begin Training: Start the training process. The WebUI will guide you through each step, providing feedback on the model's progress.

Links and Resources

Instructions and tips for RVC training

These tips explain how training on your data is performed.

Training flow

This walkthrough follows the steps in the training tab of the GUI.

step1

Set the experiment name here.

You can also set here whether the model should take pitch into account. A model that does not consider pitch is lighter, but not suitable for singing.

Data for each experiment is placed in /logs/your-experiment-name/.

step2a

Loads and preprocesses audio.

load audio

If you specify a folder with audio, the audio files in that folder will be read automatically. For example, if you specify C:\Users\hoge\voices, then C:\Users\hoge\voices\voice.mp3 will be loaded, but C:\Users\hoge\voices\dir\voice.mp3 will not be loaded (subfolders are not scanned).

Since ffmpeg is used internally for reading audio, any extension supported by ffmpeg will be read automatically. After decoding to int16 with ffmpeg, the samples are converted to float32 and normalized to the range -1 to 1.
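The int16-to-float32 normalization step can be sketched as follows (a minimal illustration; in the real pipeline the int16 samples come from ffmpeg's decoded output):

```python
import numpy as np

def normalize_int16(samples: np.ndarray) -> np.ndarray:
    """Convert int16 PCM samples to float32 in [-1, 1]."""
    return samples.astype(np.float32) / 32768.0

pcm = np.array([0, 16384, -32768, 32767], dtype=np.int16)
audio = normalize_int16(pcm)
# -> [0.0, 0.5, -1.0, ~0.99997]
```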

denoising

The audio is smoothed by scipy's filtfilt.
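A minimal sketch of this kind of zero-phase filtering with scipy's filtfilt; the exact filter design RVC uses may differ, so the high-pass Butterworth below (cutoff and order included) is an assumption for illustration:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def highpass(audio: np.ndarray, sr: int, cutoff: float = 100.0) -> np.ndarray:
    """Zero-phase high-pass filter: filtfilt runs the filter forward
    and backward, so the output has no phase distortion."""
    b, a = butter(N=4, Wn=cutoff, btype="high", fs=sr)
    return filtfilt(b, a, audio)

sr = 16000
t = np.arange(sr) / sr
# 10 Hz drift (below cutoff) plus a 440 Hz tone (above cutoff)
x = 0.5 * np.sin(2 * np.pi * 10 * t) + 0.1 * np.sin(2 * np.pi * 440 * t)
y = highpass(x, sr)  # drift removed, tone preserved
```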

Audio Split

First, the input audio is divided by detecting silent parts that last longer than a certain period (max_sil_kept=5 seconds?). After splitting on silence, the audio is sliced every 4 seconds with an overlap of 0.3 seconds. Each resulting segment of up to 4 seconds is volume-normalized and written as a wav file to /logs/your-experiment-name/0_gt_wavs, then resampled to 16k and written as a wav file to /logs/your-experiment-name/1_16k_wavs.
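The 4-second window with 0.3-second overlap can be sketched as below; the window arithmetic is an assumption based on the description above, not RVC's exact slicing code:

```python
import numpy as np

def slice_audio(audio: np.ndarray, sr: int,
                win_s: float = 4.0, overlap_s: float = 0.3):
    """Cut audio into win_s-second windows whose starts advance by
    win_s - overlap_s, so consecutive slices share overlap_s seconds."""
    win = int(win_s * sr)
    hop = int((win_s - overlap_s) * sr)
    slices = []
    for start in range(0, len(audio), hop):
        slices.append(audio[start:start + win])
        if start + win >= len(audio):
            break  # last window already covers the end of the audio
    return slices

sr = 16000
audio = np.zeros(10 * sr, dtype=np.float32)  # 10 s of audio
parts = slice_audio(audio, sr)  # 3 slices: two full 4 s windows + remainder
```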

step2b

Extract pitch

Extract the pitch information (=f0) from the wav files using the method built into parselmouth or pyworld and save it in /logs/your-experiment-name/2a_f0. The pitch information is then logarithmically converted to an integer between 1 and 255 and saved in /logs/your-experiment-name/2b-f0nsf.
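The logarithmic quantization to the 1..255 range can be sketched roughly as below; the mel-scale formula and f0 bounds are assumptions based on common pitch-coarsening code, not necessarily RVC's exact constants:

```python
import numpy as np

F0_MIN, F0_MAX = 50.0, 1100.0
MEL_MIN = 1127.0 * np.log(1.0 + F0_MIN / 700.0)
MEL_MAX = 1127.0 * np.log(1.0 + F0_MAX / 700.0)

def coarse_f0(f0: np.ndarray) -> np.ndarray:
    """Map f0 in Hz to integers 1..255 on a log (mel) scale.
    Unvoiced frames (f0 == 0) map to 1."""
    mel = 1127.0 * np.log(1.0 + f0 / 700.0)
    mel = np.where(f0 > 0,
                   (mel - MEL_MIN) * 254.0 / (MEL_MAX - MEL_MIN) + 1.0,
                   1.0)
    return np.rint(np.clip(mel, 1, 255)).astype(np.int64)

coarse = coarse_f0(np.array([0.0, 50.0, 220.0, 1100.0]))
# unvoiced -> 1, 50 Hz -> 1, 1100 Hz -> 255; 220 Hz lands in between
```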

Extract feature_print

Convert the wav files to embeddings in advance using HuBERT. Read the wav files saved in /logs/your-experiment-name/1_16k_wavs, convert them to 256-dimensional features with HuBERT, and save them in npy format in /logs/your-experiment-name/3_feature256.

step3

Train the model.

Glossary for Beginners

In deep learning, the dataset is divided and learning proceeds little by little. In one model update (step), batch_size samples are retrieved, and prediction and error correction are performed. Processing the entire dataset once in this way counts as one epoch.

Therefore, the learning time is the time per step x (the number of samples in the dataset / batch size) x the number of epochs. In general, a larger batch size makes learning more stable, and the time per sample (step time ÷ batch size) becomes smaller, but it uses more GPU memory. GPU memory usage can be checked with the nvidia-smi command. Training can finish sooner by increasing the batch size as much as the execution machine allows.
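The arithmetic above, as a quick sanity check (the numbers are purely illustrative):

```python
import math

def steps_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of model updates needed to see the whole dataset once."""
    return math.ceil(n_samples / batch_size)

def total_training_time(n_samples: int, batch_size: int,
                        epochs: int, sec_per_step: float) -> float:
    """Total time = time per step x steps per epoch x number of epochs."""
    return sec_per_step * steps_per_epoch(n_samples, batch_size) * epochs

# 200 clips, batch size 8, 100 epochs, 0.5 s per step
total = total_training_time(200, 8, 100, 0.5)  # -> 1250.0 seconds
```

Doubling the batch size here halves the steps per epoch, which is why a larger batch (within GPU memory limits) shortens training.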

Specify pretrained model

RVC starts training the model from pretrained weights instead of from scratch, so it can be trained with a small dataset.

By default

  • If you consider pitch, it loads rvc-location/pretrained/f0G40k.pth and rvc-location/pretrained/f0D40k.pth.
  • If you don't consider pitch, it loads rvc-location/pretrained/G40k.pth and rvc-location/pretrained/D40k.pth.

During training, model parameters are saved to logs/your-experiment-name/G_{}.pth and logs/your-experiment-name/D_{}.pth every save_every_epoch epochs. By specifying these paths as the pretrained weights, you can resume training, or start training from model weights learned in a different experiment.

learning index

RVC saves the HuBERT feature values used during training, and during inference, searches for feature values that are similar to the feature values used during training to perform inference. In order to perform this search at high speed, the index is learned in advance. For index learning, we use the approximate neighborhood search library faiss. Read the feature values of logs/your-experiment-name/3_feature256 and use them to learn the index, and save it as logs/your-experiment-name/add_XXX.index.

(From the 20230428 update onward, it is read from the index, and separate saving/specifying is no longer necessary.)
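What the index accelerates is a nearest-neighbor lookup over the saved HuBERT features. A minimal NumPy sketch of exact top-1 retrieval (faiss performs this same search approximately and much faster; the arrays here are random stand-ins for the real 3_feature256 data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the saved training features (N frames x 256 dims)
train_feats = rng.standard_normal((1000, 256)).astype(np.float32)
# A query feature close to training frame 42, as during inference
query = train_feats[42] + 0.01 * rng.standard_normal(256).astype(np.float32)

# Exact top-1 search: index of the training feature closest to the query
dists = np.sum((train_feats - query) ** 2, axis=1)
best = int(np.argmin(dists))  # -> 42
# The source feature is then replaced (or blended) with train_feats[best],
# which is what reduces tone leakage at inference time.
```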

Button description

  • Train model: After executing step2b, press this button to train the model.
  • Train feature index: After training the model, perform index learning.
  • One-click training: step2b, model training and feature index training all at once.

Conclusion

Retrieval-based Voice Conversion (RVC) is an open-source voice conversion AI algorithm that enables realistic speech-to-speech transformations while accurately preserving the intonation and audio characteristics of the original speaker. For more detailed instructions and updates, visit the official GitHub repository.