Curb Your Expectations: Data & Architecture Limitations
Two things you should be aware of
- Kokoro's data mix is mostly synthetic, neutral speech.
- Kokoro has 82M params and relies mostly on explicit G2P preprocessing rather than predicting latent audio tokens (more on this later).
What does this mean?
- The model likely cannot laugh at inference, because it has seen about 0 laughs in its training dataset.
- The model cannot sound extremely angry, because there are few to no clips of true anger in its training dataset.
- Ditto for sad.
- Most questions along the lines of "why can't the model sound like this?" can probably be answered with: it wasn't in the training dataset. Keep in mind the training set for the first model (v0.19) is <100 hours; this will grow to a few hundred hours for the next one, but all of the above still applies.
Could the model do those things if it was trained on more emotional data?
Maybe, but if the audio-text data loses alignment (e.g. there are laughs in audio but unlabeled in text, or vice versa) then you will start to see artifacts and hallucinations. It is relatively easy to acquire in-the-wild data with a wide emotional range. It is more difficult to do so in a licensed way, and it is separately difficult to label these things with perfect alignment.
There are architectural limitations
Kokoro relies on G2P preprocessing, i.e. another model or ruleset converts raw text into tokens first. For v0.19, the vocabulary is the same as the 178 tokens declared here: https://github.com/yl4579/StyleTTS2/blob/main/text_utils.py#L3-L6
Because of the G2P, Kokoro is very fast, but the types of speech it can deliver are also restricted to that relatively small vocabulary.
Piper https://github.com/rhasspy/piper is an even smaller model (various sizes from 5-32M params) that I think also consumes espeak-ng phonemes. It uses a VITS architecture and probably won't sound as natural, but it is definitely going to be faster than 82M.
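To make the G2P preprocessing step concrete, here is a minimal sketch of a lookup-based pipeline: text is converted to phonemes, then each phoneme symbol is mapped to an integer ID from a small fixed vocabulary. The lexicon, the symbol list, and the function names `g2p` and `encode` are all hypothetical and only for illustration; this is not espeak-ng and not Kokoro's actual vocabulary.

```python
# Toy lexicon: word -> phoneme string. Hypothetical entries, for illustration only.
LEXICON = {
    "hello": "həlˈoʊ",
    "world": "wˈɜːld",
}

# Small fixed symbol vocabulary built from the lexicon; index 0 is reserved for padding.
SYMBOLS = ["<pad>", " "] + sorted({ch for ph in LEXICON.values() for ch in ph})
SYMBOL_TO_ID = {symbol: i for i, symbol in enumerate(SYMBOLS)}


def g2p(text: str) -> str:
    """Convert text to phonemes by dictionary lookup; unknown words raise KeyError."""
    return " ".join(LEXICON[word] for word in text.lower().split())


def encode(phonemes: str) -> list[int]:
    """Map each phoneme symbol to its integer ID in the fixed vocabulary."""
    return [SYMBOL_TO_ID[ch] for ch in phonemes]


# The same input always yields the same IDs; nothing is sampled.
ids = encode(g2p("Hello world"))
print(ids)
```

In this toy version, a word outside the lexicon (say, "haha") fails loudly with a `KeyError`, and anything the symbol set cannot spell simply cannot be requested from the model downstream.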
Latent audio tokens
Larger TTS models often take raw text as input and predict latent audio tokens over a codebook, then decode the latent audio tokens into audio. Each "latent audio token" represents some sound. Here are some models that do this:
- OuteTTS, 500M, uses Qwen-0.5B to predict audio tokens: https://hf.co/OuteAI/OuteTTS-0.3-500M
- Fish Speech, 500M, I think uses Llama to do the same: https://speech.fish.audio/finetune/
- Tortoise TTS, >779M params, a fairly old model which uses GPT: https://nonint.com/2022/04/25/tortoise-architectural-design-doc/
- Parakeet, 3B, described here: https://jordandarefsky.com/blog/2024/parakeet/
- Llasa-3B "is a text-to-speech (TTS) system that extends the text-based LLaMA (1B,3B, and 8B) language model by incorporating speech tokens from the XCodec2 codebook, which contains 65,536 tokens." https://hf.co/HKUST-Audio/Llasa-3B
- Multimodal models like GPT-4o and https://hf.co/openbmb/MiniCPM-o-2_6 are also predicting latent audio tokens, if I understand correctly.
These models are usually bigger, trained on more audio data, and can perform some of the emotions you might be looking for. But, since predicting the next latent audio token is done via transformer/softmax, hallucination is a risk. (Whereas lookup or rules-based G2P fails in deterministic, maybe more interpretable ways.)
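One simplified way to see where that risk comes from: a softmax turns scores over the codebook into probabilities, and sampling from them will sometimes pick a token other than the most likely one. The sketch below uses a hypothetical 4-entry codebook and made-up logits; real codebooks are far larger and the logits come from the transformer.

```python
import math
import random


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits: list[float], rng: random.Random) -> int:
    """Draw one token index at random, weighted by softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def greedy_token(logits: list[float]) -> int:
    """Always pick the highest-scoring token index."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [2.0, 1.5, 0.1, -1.0]  # hypothetical scores over a 4-entry codebook
probs = softmax(logits)
rng = random.Random(0)
samples = [sample_token(logits, rng) for _ in range(1000)]

# Fraction of draws that did NOT pick the most likely token.
off_top = sum(1 for s in samples if s != greedy_token(logits)) / len(samples)
print(probs, off_top)
```

Each off-top draw is a small deviation, and over hundreds of audio tokens per utterance those deviations can compound. A lookup table has no equivalent: it either has the entry or it does not.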
I'll wrap up with a gun analogy
- Pistol: Fast and light, but cannot fire over long distances.
- Sniper rifle: Can deliver a bullet e.g. 500 meters, but is heavier to carry and thus offers less mobility.
Different tools for different situations. You could be in a situation where the sniper rifle is always better, or vice versa. Or it just depends on what your need is in the moment.
Maybe use an AK-47: fast, light, and easy to carry along 😝.
Jokes apart, if we had properly annotated data that had laugh tokens in the text, would it be possible for us to fine-tune the model on that data? (If the encoder is also open-sourced.)
Would it be possible to add PAUSE tags, in SSML or some other format? E.g. [PAUSE=3]
The model likely cannot laugh at inference because it has seen about 0 laughs in its training dataset.
But if you use pseudo sounds, like “ha ha” for a laugh, and some others like ugh, hmm, oh, then wouldn't it sound the way it should?
Nah, that won't work. I tried it, and instead of laughing fluently for "haha", it tried to pronounce "haha".
Kokoro has 82M params and relies mostly on explicit G2P preprocessing rather than predicting latent audio tokens
Also, what does this mean? Is it that Kokoro, like a Vocaloid, cannot generate different audio every time for the same text because it doesn't have abstract tokens like the others, or what?
Let me clarify the differences side-by-side.
Kokoro:
- Has 82 million parameters
- Has a vocabulary with 178 tokens (of which fewer than half are actually being used): https://github.com/yl4579/StyleTTS2/blob/main/text_utils.py#L3-L6
- Relies on a G2P engine like espeak-ng (a few MB in size of rules/dictionaries) to convert raw text to those <178 tokens before ever reaching Kokoro's 82 million parameters. Hence, preprocessing.
Llasa 3B (excellent model btw, you should check it out): https://huggingface.co./HKUSTAudio/Llasa-3B
- First uses Llama 3B (could also be 1B or 8B) to predict "speech tokens from the XCodec2 codebook, which contains 65,536 tokens" at "50 Tokens per Second": https://huggingface.co./HKUSTAudio/xcodec2
- Then XCodec2 decodes these tokens into audio. It has a 3.2 GB torch checkpoint, which (assuming no gradients are stored) works out to about 800M params if FP32, or 1.6B if FP16.
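The parameter estimate for XCodec2 is just checkpoint size divided by bytes per parameter. A quick sketch of that arithmetic, assuming the checkpoint holds only weights (no optimizer state or gradients) and treating 3.2 GB as 3.2e9 bytes; the function name `estimate_params` is made up for this example.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def estimate_params(checkpoint_bytes: float, dtype: str) -> float:
    """Rough parameter count, assuming the file contains only weights."""
    return checkpoint_bytes / BYTES_PER_PARAM[dtype]


checkpoint_bytes = 3.2e9  # 3.2 GB, taken as decimal gigabytes

for dtype in BYTES_PER_PARAM:
    millions = estimate_params(checkpoint_bytes, dtype) / 1e6
    print(f"{dtype}: ~{millions:.0f}M params")
```

Either way, the decoder alone is roughly 10x to 20x the size of all of Kokoro's 82M parameters.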
Kokoro does incorporate PL-BERT https://github.com/yl4579/PL-BERT but this is relatively tiny, if I recall just 6.3M parameters. This is a few OOMs down from Llama 3B, or 1B/8B, and instead of operating on raw text tokens, PL-BERT operates on phonemes that were pre-converted by a separate G2P engine.
The differences in vocabulary size and token resolution are also notable. Kokoro has a 178-token vocabulary (really just a few dozen tokens actually being used), and each of these tokens might end up being decoded to audio on the order of 0.1 to 1 second (ballpark OOM). In Llasa/XCodec2, the vocab size is 65k, and there are 50 speech tokens per second, or 1 speech token every 0.02 seconds.
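The resolution numbers can be worked out directly from the figures quoted above. The "bits per token" framing below is my own back-of-envelope addition: it is only an upper bound on what one token can distinguish (log2 of the vocabulary size), not a measured property of either model.

```python
import math

# Llasa / XCodec2 figures quoted above.
codebook_size = 65_536
tokens_per_second = 50

seconds_per_token = 1 / tokens_per_second              # 0.02 s of audio per token
bits_per_token = math.log2(codebook_size)              # 16.0 bits, since 65,536 = 2**16
bits_per_second = bits_per_token * tokens_per_second   # 800.0 bits per second of audio

# Kokoro: 178 declared tokens, each covering very roughly 0.1 to 1 second.
kokoro_vocab = 178
kokoro_bits_per_token = math.log2(kokoro_vocab)        # about 7.5 bits

print(seconds_per_token, bits_per_token, bits_per_second, kokoro_bits_per_token)
```

So each Llasa token covers a much shorter slice of audio and is drawn from a much larger set of options, which is part of why that approach has room for things like laughter, and also more room to go wrong.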
Again, pistol vs sniper rifle. Kokoro and Llasa are both Text to Speech models, but not only is there a size difference; the architectures are also quite different.
I think the models OuteTTS and Fish Speech lower the size of the LM from 3B down to 500M, while still using the "predicting audio tokens" approach. IMO, Fish is likely better than Oute in this 500M size class due to dataset size/quality and additional rounds of Reinforcement Learning.
Like with LLMs and other types of models, the bigger model could easily be the better option if it fits within your compute and latency budgets.
Also, what does this mean? Is it that Kokoro, like a Vocaloid, cannot generate different audio every time for the same text because it doesn't have abstract tokens like the others, or what?
In theory, StyleTTS2 models have a diffusion component, which Kokoro leaves out. Even without diffusion, you can generate slightly different audio by varying the style vector, or by varying the text with different punctuation / stress. But since Kokoro is trained mostly on synthetic & neutral speech, it is almost entirely going to be throwing 90 mph fastballs down the middle, not curveballs.
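As a rough sketch of what "varying the style vector" could look like, the snippet below blends two vectors and adds a little noise. Everything here is an assumption for illustration: the 4-number vectors, the names `blend` and `jitter`, and the noise scale are hypothetical, and this does not load real voice packs or call the model. Whether a given perturbation sounds good is something you would have to check by ear.

```python
import random


def blend(style_a: list[float], style_b: list[float], alpha: float) -> list[float]:
    """Linear interpolation: alpha=0 returns style_a, alpha=1 returns style_b."""
    if len(style_a) != len(style_b):
        raise ValueError("style vectors must have the same length")
    return [(1 - alpha) * a + alpha * b for a, b in zip(style_a, style_b)]


def jitter(style: list[float], scale: float, rng: random.Random) -> list[float]:
    """Add small Gaussian noise to each dimension of the style vector."""
    return [s + rng.gauss(0.0, scale) for s in style]


# Hypothetical, tiny stand-ins for two voices' style vectors.
voice_a = [0.2, -0.5, 1.0, 0.0]
voice_b = [0.6, 0.1, -0.4, 0.8]

rng = random.Random(42)
mixed = blend(voice_a, voice_b, 0.25)          # mostly voice_a, a bit of voice_b
variant = jitter(mixed, scale=0.01, rng=rng)   # slightly different on each call
print(mixed, variant)
```

Note that this only moves around within whatever the existing vectors already cover. It does not add emotions the training data never contained.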
Then could we use a programmatic diffuser that varies the style (the voice pack vector) on each generation, so that every generation has a somewhat different style?
So, as with the Nicole voice, could we get an angry voice or a sad voice by just creating a style pack?
Does this mean that the encoder used to train Kokoro is actually a diffusion model?
So, as with the Nicole voice, could we get an angry voice or a sad voice by just creating a style pack?
If by "just creating a style pack" for Nicole, you mean ~10 hours worth of nearly perfectly labeled audio-text pairs for that specific voice trained into the model for ~20 epochs, then yes. If this data is somehow lying around with similar label quality and volume, and it is trained into the model for a similar number of epochs, then you can probably expect similar results at inference.
Are you also auto-labeling with a Whisper model? Or is there a better ASR English model?
He's getting both the text and the audio (the "prompt" and the "output") from, for example, ElevenLabs, so there is no need for ASR.