Model Summary
AUTOVC is a many-to-many voice style transfer algorithm.
This model is used to extract a speaker-agnostic content representation from an audio file. A good way to think about "speaker-agnostic" is that, for example, no matter who speaks the word "Ha!", the lips are expected to open: the opening motion of the mouth is dictated only by the content, not by the speaker.
The AutoVC_Conversion model is designed to capture this neutral, general motion of just the lips and nearby regions. It leverages AutoVC from Qian et al. 2019. This specific pre-trained model comes from the open-source audio-driven talking-head project, MakeItTalk. You can demo the space and check out how it's used in the code here.
The framework consists of three modules:
- a content encoder $E_c(\cdot)$ that produces a content embedding from speech,
- a speaker encoder $E_s(\cdot)$ that produces a speaker embedding from speech, and
- a decoder $D(\cdot, \cdot)$ that produces speech from the content and speaker embeddings.
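To make the structure concrete, here is a minimal PyTorch sketch of the three modules. The layer sizes, mel-spectrogram input format, and module definitions are illustrative assumptions, not the exact AutoVC/MakeItTalk implementation.

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """E_c(.): maps speech (mel frames) to a frame-level content embedding."""
    def __init__(self, n_mels=80, dim=64):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, dim, num_layers=2, batch_first=True)

    def forward(self, mel):                    # mel: (batch, frames, n_mels)
        content, _ = self.lstm(mel)
        return content                         # (batch, frames, dim)

class SpeakerEncoder(nn.Module):
    """E_s(.): maps speech to a single speaker embedding per utterance."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, dim, batch_first=True)

    def forward(self, mel):
        out, _ = self.lstm(mel)
        return out[:, -1]                      # (batch, dim)

class Decoder(nn.Module):
    """D(., .): reconstructs speech from content and speaker embeddings."""
    def __init__(self, content_dim=64, speaker_dim=256, n_mels=80):
        super().__init__()
        self.lstm = nn.LSTM(content_dim + speaker_dim, 512, batch_first=True)
        self.proj = nn.Linear(512, n_mels)

    def forward(self, content, speaker):
        # Broadcast the utterance-level speaker embedding over all frames.
        speaker = speaker.unsqueeze(1).expand(-1, content.size(1), -1)
        out, _ = self.lstm(torch.cat([content, speaker], dim=-1))
        return self.proj(out)                  # reconstructed mel frames
```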
The network uses an LSTM-based encoder that compresses the input audio into a compact representation trained to discard the original speaker identity while preserving the content. It extracts a content embedding $A \in \mathbb{R}^{T \times D}$ from the AutoVC network, where $T$ is the total number of input audio frames and $D$ is the content dimension.
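Reusing the hypothetical `ContentEncoder` from the sketch above, extracting a content embedding of shape $(T, D)$ might look like this; the frame count of 300 and dimension of 64 are illustrative values only.

```python
# Hypothetical usage: extract a content embedding A of shape (T, D) for one
# utterance, with T = 300 input frames and D = 64 (illustrative values).
mel = torch.randn(1, 300, 80)                  # one utterance, 300 mel frames
content_encoder = ContentEncoder(n_mels=80, dim=64)
A = content_encoder(mel).squeeze(0)            # -> shape (300, 64), i.e. (T, D)
```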
Training
The speaker encoder is a pre-trained model provided by Wan et al. [2018]; only the content encoder and the decoder are trained. A source utterance from a dataset of speakers is passed through the content encoder. Another utterance from the same source speaker is then used to extract the speaker embedding, which is fed to the decoder together with the content embedding to reconstruct the original source. Training deliberately assumes that no parallel data is available, so only self-reconstruction is required.
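A sketch of this self-reconstruction training step, reusing the hypothetical modules above, is shown below. The Adam settings and plain MSE loss are assumptions; the original AutoVC objective also includes additional terms such as a content-consistency loss.

```python
import torch.nn.functional as F

content_encoder = ContentEncoder()
speaker_encoder = SpeakerEncoder()             # stands in for the pre-trained model
decoder = Decoder()
params = list(content_encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)  # assumed optimizer settings

def train_step(source_mel, other_mel_same_speaker):
    # Content comes from one utterance; the speaker embedding comes from a
    # different utterance of the same speaker; the decoder then reconstructs
    # the original source utterance.
    content = content_encoder(source_mel)
    with torch.no_grad():                      # the speaker encoder is frozen
        speaker = speaker_encoder(other_mel_same_speaker)
    recon = decoder(content, speaker)
    loss = F.mse_loss(recon, source_mel)       # self-reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```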
Performance
The evaluation of AutoVC was performed on the VCTK corpus (Veaux et al., 2016), which contains 44 hours of utterances from 109 speakers. Each speaker reads a different set of sentences.
Two subjective tests on Amazon Mechanical Turk (MTurk) were performed. In the first test, the mean opinion score (MOS) test, the subjects are presented with converted utterances. For each utterance, the subjects are asked to assign a score of 1-5 on the naturalness of the converted speech. In the second test, the similarity test, the subjects are presented with pairs of utterances. In each pair, there is one converted utterance and one utterance from the target speaker uttering the same sentence. For each pair, the subjects are asked to assign a score of 1-5 on the voice similarity. The subjects are explicitly asked to focus on the voice rather than intonation and accent.
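As a toy illustration of how the MOS is aggregated (the ratings here are made up, not the paper's data): a system's MOS is simply the mean of the 1-5 scores the subjects assigned to it.

```python
# Toy illustration (made-up ratings, not the paper's data): a system's MOS is
# the mean of the 1-5 naturalness scores assigned by the subjects.
ratings = {"converted_speech": [4, 3, 4, 5, 3], "baseline": [2, 3, 2, 3, 2]}
mos = {system: sum(r) / len(r) for system, r in ratings.items()}
print(mos)  # {'converted_speech': 3.8, 'baseline': 2.4}
```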
The MOS scores of AUTOVC are above 3 for all groups, whereas those for the baselines almost all fall below 3. The MOS for 16kHz natural speech is around 4.5. The MOS scores of the current state-of-the-art speech synthesizers are between 4 and 4.5. These subjective evaluation results show that AUTOVC approaches the performance of parallel conversion systems in terms of naturalness, and is much better than existing non-parallel conversion systems.