What is Retrieval-based Voice Conversion WebUI?
Retrieval-based Voice Conversion WebUI is an open-source framework designed to make voice conversion simple and efficient. Built on the VITS model, it provides an easy-to-use interface for both inference and training, making it accessible even to those with limited experience in machine learning or audio processing. The WebUI supports a range of features, including voice conversion, real-time voice changing, and the ability to train models using small datasets.
UI preview
| Training and inference WebUI | Real-time voice changing GUI |
| :--- | :--- |
| go-web.bat / infer-web.py | go-realtime-gui.bat |
Features:
- Reduce tone leakage by replacing the source features with training-set features using top-1 retrieval;
- Easy + fast training, even on poor graphics cards;
- Training with a small amount of data (at least 10 minutes of low-noise speech recommended);
- Model fusion to change timbres (using ckpt processing tab->ckpt merge);
- Easy-to-use WebUI;
- UVR5 model to quickly separate vocals and instruments;
- High-pitch voice extraction with the InterSpeech2023-RMVPE algorithm to prevent the muted-sound problem. It provides significantly better results than Crepe_full while being faster and using fewer resources;
- AMD/Intel graphics cards acceleration supported;
- Intel ARC graphics cards acceleration with IPEX supported.
Getting Started with Inference and Training
1. Set Up the Environment
To start using the Retrieval-based Voice Conversion WebUI, you’ll first need to prepare your environment. The framework requires Python 3.8 or higher.
Install Core Dependencies:
For NVIDIA GPUs:
pip install torch torchvision torchaudio
For AMD GPUs on Linux:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2
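Before moving on, you can confirm that the installed PyTorch build actually detects your GPU. This is only a quick sanity check, not part of the official setup:

```python
import torch

# Quick sanity check that PyTorch sees an accelerator.
# On ROCm builds of PyTorch, AMD GPUs are also exposed through the torch.cuda API.
print("PyTorch version:", torch.__version__)
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```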
Install Other Dependencies:
Install using Poetry:
curl -sSL https://install.python-poetry.org | python3 -
poetry install
Or using pip:
pip install -r requirements.txt
2. Download Pre-trained Models
The WebUI requires several pre-trained models to function properly. You can download these automatically using a provided script:
python tools/download_models.py
Alternatively, download the models manually from Hugging Face.
3. Install FFmpeg
FFmpeg is necessary for handling audio files. Installation steps vary depending on your operating system:
- Ubuntu/Debian:
sudo apt install ffmpeg
- macOS:
brew install ffmpeg
- Windows:
Download ffmpeg.exe and ffprobe.exe from Hugging Face and place them in the root folder.
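If you are unsure whether FFmpeg is reachable, a quick check (purely illustrative, not part of the project) is:

```python
import shutil

# Report whether ffmpeg and ffprobe are on PATH.
# On Windows they may instead sit next to infer-web.py in the repository root.
for tool in ("ffmpeg", "ffprobe"):
    print(tool, "->", shutil.which(tool) or "not found on PATH")
```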
4. Start the WebUI
Once your environment is set up and the necessary models are downloaded, you can start the WebUI.
For general usage:
python infer-web.py
For Windows users, you can also start the WebUI by double-clicking go-web.bat.
Training a New Model
Training your own voice conversion model with Retrieval-based Voice Conversion WebUI is straightforward and can be done with as little as 10 minutes of low-noise speech data.
- Prepare Your Dataset: Collect and preprocess your audio data.
- Start the Training Interface: Launch the WebUI as described above and navigate to the training section.
- Set Training Parameters: Configure the model parameters and training options based on your dataset.
- Begin Training: Start the training process. The WebUI will guide you through each step, providing feedback on the model's progress.
Links and Resources
- Colab Notebook: Run the WebUI in a Colab environment.
- GitHub Repository: Access the source code and documentation.
- Hugging Face Models: Download pre-trained models and other necessary files.
Instructions and tips for RVC training
These tips explain how data training is done.
Training flow
The explanation below follows the steps in the training tab of the GUI.
step1
Set the experiment name here.
You can also set here whether the model should take pitch into account. If the model doesn't consider pitch, the model will be lighter, but not suitable for singing.
Data for each experiment is placed in /logs/your-experiment-name/.
step2a
Loads and preprocesses audio.
load audio
If you specify a folder with audio, the audio files in that folder will be read automatically.
For example, if you specify C:\Users\hoge\voices, then C:\Users\hoge\voices\voice.mp3 will be loaded, but C:\Users\hoge\voices\dir\voice.mp3 will not be loaded.
Since ffmpeg is used internally for reading audio, any file whose extension is supported by ffmpeg will be read automatically. After converting to int16 with ffmpeg, the audio is converted to float32 and normalized to the range -1 to 1.
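As a rough illustration of that pipeline (the helper name and the 40 kHz target rate are assumptions for this sketch, not the repository's actual function):

```python
import subprocess
import numpy as np

def load_audio(path: str, sr: int = 40000) -> np.ndarray:
    """Decode any ffmpeg-supported file to mono int16 PCM, then convert to
    float32 normalized to the range [-1, 1], as described above."""
    cmd = [
        "ffmpeg", "-nostdin", "-i", path,
        "-f", "s16le", "-ac", "1", "-ar", str(sr), "-",  # raw 16-bit PCM to stdout
    ]
    raw = subprocess.run(cmd, capture_output=True, check=True).stdout
    pcm = np.frombuffer(raw, dtype=np.int16)
    return pcm.astype(np.float32) / 32768.0  # int16 full scale -> [-1, 1)
```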
denoising
The audio is smoothed by scipy's filtfilt.
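For reference, zero-phase filtering with filtfilt looks roughly like this; the filter order and 48 Hz high-pass cutoff below are assumptions, not necessarily the values used in the repository:

```python
import numpy as np
from scipy import signal

def highpass(audio: np.ndarray, sr: int = 40000) -> np.ndarray:
    """Zero-phase high-pass filtering with scipy's filtfilt (illustrative values)."""
    b, a = signal.butter(N=3, Wn=48, btype="high", fs=sr)
    return signal.filtfilt(b, a, audio)
```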
Audio Split
First, the input audio is divided by detecting sections of silence that last longer than a certain period (max_sil_kept=5 seconds?). After splitting the audio on silence, it is sliced every 4 seconds with an overlap of 0.3 seconds. For each segment of up to 4 seconds, the volume is normalized and the wav file is written to /logs/your-experiment-name/0_gt_wavs, and it is then resampled to 16 kHz and written to /logs/your-experiment-name/1_16k_wavs as a wav file.
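A minimal sketch of the fixed-length slicing step (silence detection and volume normalization omitted; the function name is illustrative):

```python
import numpy as np

def slice_audio(audio: np.ndarray, sr: int, chunk_s: float = 4.0, overlap_s: float = 0.3):
    """Cut audio into 4-second windows that overlap by 0.3 seconds."""
    chunk = int(chunk_s * sr)
    hop = int((chunk_s - overlap_s) * sr)
    pieces = []
    for start in range(0, len(audio), hop):
        pieces.append(audio[start:start + chunk])
        if start + chunk >= len(audio):  # the last piece may be shorter
            break
    return pieces
```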
step2b
Extract pitch
Extract the pitch information (=f0) from the wav files using the method built into parselmouth or pyworld and save it in /logs/your-experiment-name/2a_f0. Then logarithmically convert the pitch information to an integer between 1 and 255 and save it in /logs/your-experiment-name/2b-f0nsf.
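A sketch of that step using pyworld; the mel-style log mapping and per-file scaling below are illustrative assumptions, not necessarily the exact quantization used in the repository:

```python
import numpy as np
import pyworld

def extract_f0_coarse(wav: np.ndarray, sr: int = 16000):
    """Extract f0 with pyworld, then log-compress it to integers in [1, 255]."""
    x = wav.astype(np.float64)
    f0, t = pyworld.dio(x, sr)                      # raw pitch track
    f0 = pyworld.stonemask(x, f0, t, sr)            # refined f0 per frame
    f0_mel = 1127.0 * np.log(1.0 + f0 / 700.0)      # logarithmic (mel-like) scale
    voiced = f0_mel > 0
    if voiced.any():
        lo, hi = f0_mel[voiced].min(), f0_mel[voiced].max()
        f0_mel[voiced] = (f0_mel[voiced] - lo) / max(hi - lo, 1e-8) * 254.0 + 1.0
    coarse = np.rint(f0_mel).astype(np.int32)       # 0 = unvoiced, 1..255 = pitch bins
    return f0, coarse
```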
Extract feature_print
Convert the wav files to embeddings in advance using HuBERT. Read the wav files saved in /logs/your-experiment-name/1_16k_wavs, convert them to 256-dimensional features with HuBERT, and save them in npy format in /logs/your-experiment-name/3_feature256.
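A rough sketch with the Hugging Face HuBERT implementation; the repository itself loads its own hubert_base.pt checkpoint, so the model name and file path below are assumptions for demonstration only:

```python
import torch
import soundfile as sf
from transformers import HubertModel

model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

# Hypothetical 16 kHz mono file produced by the preprocessing step.
wav, sr = sf.read("logs/your-experiment-name/1_16k_wavs/example.wav")
x = torch.from_numpy(wav).float().unsqueeze(0)      # (1, samples)
with torch.no_grad():
    feats = model(x).last_hidden_state              # (1, frames, 768) hidden features

# The 256-dimensional features saved in 3_feature256 come from a projection inside
# the HuBERT checkpoint used by RVC; that projection step is omitted in this sketch.
```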
step3
Train the model.
Glossary for Beginners
In deep learning, the dataset is divided and learning proceeds little by little. In one model update (step), batch_size items of data are retrieved, and prediction and error correction are performed. Doing this once over the whole dataset counts as one epoch.
Therefore, the learning time is the learning time per step × (the number of data items in the dataset ÷ the batch size) × the number of epochs. In general, a larger batch size makes learning more stable (the learning time per data item becomes smaller) but uses more GPU memory. GPU RAM can be checked with the nvidia-smi command. Learning can be completed in a shorter time by increasing the batch size as much as the GPU of the execution environment allows.
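For example, plugging made-up numbers into that formula:

```python
# Back-of-the-envelope training-time estimate (all numbers are invented examples).
n_samples = 300        # audio slices in the dataset after preprocessing
batch_size = 8
epochs = 200
sec_per_step = 0.5     # measured on your own GPU

steps_per_epoch = n_samples / batch_size      # 37.5
total_steps = steps_per_epoch * epochs        # 7500
print(f"estimated time: {total_steps * sec_per_step / 3600:.1f} hours")  # ~1.0 hours
```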
Specify pretrained model
RVC starts training the model from pretrained weights instead of from 0, so it can be trained with a small dataset.
By default:
- If you consider pitch, it loads rvc-location/pretrained/f0G40k.pth and rvc-location/pretrained/f0D40k.pth.
- If you don't consider pitch, it loads rvc-location/pretrained/G40k.pth and rvc-location/pretrained/D40k.pth.
During training, model parameters are saved in logs/your-experiment-name/G_{}.pth and logs/your-experiment-name/D_{}.pth every save_every_epoch epochs. By specifying these paths, you can resume training, or start training from model weights learned in a different experiment.
learning index
RVC saves the HuBERT feature values used during training, and during inference it searches for feature values similar to those used during training. To perform this search at high speed, the index is learned in advance.
For index learning, we use the approximate nearest-neighbor search library faiss. Read the feature values from logs/your-experiment-name/3_feature256, use them to learn the index, and save it as logs/your-experiment-name/add_XXX.index.
(From the 20230428 update version, it is read from the index, and saving / specifying it separately is no longer necessary.)
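A minimal sketch of building such an index with faiss; the IVF index type, nlist value, and output file name are assumptions, not necessarily the repository's exact settings:

```python
import glob
import numpy as np
import faiss

# Load all extracted HuBERT features and stack them into one float32 matrix.
feats = np.concatenate([
    np.load(p) for p in glob.glob("logs/your-experiment-name/3_feature256/*.npy")
]).astype("float32")

dim = feats.shape[1]                             # 256 for v1 features
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFFlat(quantizer, dim, 256)  # 256 IVF clusters (assumption)
index.train(feats)                               # learn cluster centroids
index.add(feats)                                 # add all training-set features
faiss.write_index(index, "logs/your-experiment-name/added_example.index")  # hypothetical name
```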
Button description
- Train model: After executing step2b, press this button to train the model.
- Train feature index: After training the model, perform index learning.
- One-click training: step2b, model training and feature index training, all at once.
Conclusion
Retrieval-based Voice Conversion (RVC) is an open source voice conversion AI algorithm that enables realistic speech-to-speech transformations, accurately preserving the intonation and audio characteristics of the original speaker. For more detailed instructions and updates, visit the official GitHub repository.