Speaker inconsistency limits usefulness

#17
by liambarryarm - opened

Even with a fixed seed and near-identical prompts, there is a noticeable difference between runs that use the same description and preset speaker voice, e.g. Brenda.

This limits the usefulness of the model - are there planned improvements or tips for ensuring consistency?

I write the voice name multiple times in the prompt. I take it that it is a tag Parler uses. Even so, that does not mean the voice will be maintained consistently.
https://huggingface.co./spaces/Pendrokar/TTS-Spaces-Arena/discussions/8

Parler TTS org

Hey @liambarryarm,
For the Parler-TTS versions highlighted in this demo, some speakers are more consistent than others; the lists are below. Brenda doesn't seem to rank that high (not present in the top 20 of the Mini version). Hope that helps!

Large Model - Top 20 Speakers

| Speaker | Similarity Score |
|---|---|
| Will | 0.906055 |
| Eric | 0.887598 |
| Laura | 0.877930 |
| Alisa | 0.877393 |
| Patrick | 0.873682 |
| Rose | 0.873047 |
| Jerry | 0.871582 |
| Jordan | 0.870703 |
| Lauren | 0.867432 |
| Jenna | 0.866455 |
| Karen | 0.866309 |
| Rick | 0.863135 |
| Bill | 0.862207 |
| James | 0.856934 |
| Yann | 0.856787 |
| Emily | 0.856543 |
| Anna | 0.848877 |
| Jon | 0.848828 |
| Brenda | 0.848291 |
| Barbara | 0.847998 |

Mini Model - Top 20 Speakers

| Speaker | Similarity Score |
|---|---|
| Jon | 0.908301 |
| Lea | 0.904785 |
| Gary | 0.903516 |
| Jenna | 0.901807 |
| Mike | 0.885742 |
| Laura | 0.882666 |
| Lauren | 0.878320 |
| Eileen | 0.875635 |
| Alisa | 0.874219 |
| Karen | 0.872363 |
| Barbara | 0.871509 |
| Carol | 0.863623 |
| Emily | 0.854932 |
| Rose | 0.852246 |
| Will | 0.851074 |
| Patrick | 0.850977 |
| Eric | 0.845459 |
| Rick | 0.845020 |
| Anna | 0.844922 |
| Tina | 0.839160 |
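
If you switch between the Large and Mini checkpoints, one rough way to pick a voice is to keep only the speakers that appear in both lists above and rank them by their weaker score. The sketch below does that with the numbers copied from the two lists. Ranking by the minimum of the two scores is only an illustrative heuristic, not something the Parler-TTS team recommends, and the names `LARGE`, `MINI`, and `consistent_in_both` are made up for this example.

```python
# Scores copied from the two top-20 lists above (speaker -> similarity score).
LARGE = {
    "Will": 0.906055, "Eric": 0.887598, "Laura": 0.877930, "Alisa": 0.877393,
    "Patrick": 0.873682, "Rose": 0.873047, "Jerry": 0.871582, "Jordan": 0.870703,
    "Lauren": 0.867432, "Jenna": 0.866455, "Karen": 0.866309, "Rick": 0.863135,
    "Bill": 0.862207, "James": 0.856934, "Yann": 0.856787, "Emily": 0.856543,
    "Anna": 0.848877, "Jon": 0.848828, "Brenda": 0.848291, "Barbara": 0.847998,
}
MINI = {
    "Jon": 0.908301, "Lea": 0.904785, "Gary": 0.903516, "Jenna": 0.901807,
    "Mike": 0.885742, "Laura": 0.882666, "Lauren": 0.878320, "Eileen": 0.875635,
    "Alisa": 0.874219, "Karen": 0.872363, "Barbara": 0.871509, "Carol": 0.863623,
    "Emily": 0.854932, "Rose": 0.852246, "Will": 0.851074, "Patrick": 0.850977,
    "Eric": 0.845459, "Rick": 0.845020, "Anna": 0.844922, "Tina": 0.839160,
}


def consistent_in_both(large, mini):
    """Rank speakers present in both lists by their weaker (minimum) score.

    Hypothetical helper: the min-score ranking is an assumption of this sketch.
    """
    shared = large.keys() & mini.keys()
    ranked = [(name, min(large[name], mini[name])) for name in shared]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


ranked = consistent_in_both(LARGE, MINI)
for name, score in ranked[:5]:
    print(f"{name}: {score:.6f}")
```

Brenda drops out of this ranking because she is in the Large top 20 but not in the Mini one, which matches the observation above.
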
ylacombe changed discussion status to closed
ylacombe changed discussion status to open
Parler TTS org

@Pendrokar, speaker consistency doesn't work with speakers that are not present in the training dataset, and Elisabeth is not. I'd recommend using another speaker for voice consistency!
And in that case, no need to repeat the name in the prompt.

For example, you could do: Jenna speaks in a monotone tone at a slightly slower than normal pace, with the recording coming across as very clear and very close-sounding.
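
A minimal sketch of how that description could be used, assuming the `parler_tts` package and the `parler-tts/parler-tts-mini-v1` checkpoint (the thread does not say which checkpoint the demo runs, so treat that name as an assumption). `build_description` and `synthesize` are hypothetical helpers written for this example. The heavy imports sit inside `synthesize` so the description logic can be tried without downloading a model.

```python
STYLE = (
    "speaks in a monotone tone at a slightly slower than normal pace, "
    "with the recording coming across as very clear and very close-sounding."
)


def build_description(speaker: str, style: str = STYLE) -> str:
    """Name a training-set speaker once, then describe the delivery."""
    return f"{speaker} {style.strip()}"


def synthesize(text, description, seed=0,
               checkpoint="parler-tts/parler-tts-mini-v1"):
    """Generate audio for `text`; returns (waveform, sampling_rate).

    Assumes `parler_tts`, `transformers`, and `torch` are installed.
    """
    import torch
    from parler_tts import ParlerTTSForConditionalGeneration
    from transformers import AutoTokenizer, set_seed

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    model = ParlerTTSForConditionalGeneration.from_pretrained(checkpoint).to(device)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    # The description conditions the voice; the prompt is the text to speak.
    input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)

    set_seed(seed)  # fix the sampling seed right before generation
    generation = model.generate(input_ids=input_ids,
                                prompt_input_ids=prompt_input_ids)
    return generation.cpu().numpy().squeeze(), model.config.sampling_rate


description = build_description("Jenna")
print(description)
```

Calling `synthesize("Hey, how are you doing today?", description)` would then produce the audio. As discussed above, a fixed seed alone does not guarantee the same voice, so a speaker from the lists is still the safer choice.
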
