The SOTA Text-to-speech and Zero Shot Voice cloning model that no one knows about...

Community Article Published January 20, 2025

Hello everyone, I've been having a lot of fun lately playing around with Llasa (https://huggingface.co/HKUST-Audio/Llasa-3B), an open-source Llama 3.2 3B finetune that acts as a text-to-speech model. Not only does it do incredibly realistic text-to-speech, it can also clone a voice from only a few seconds of sample audio.

It's so good that I had to sign up for Hugging Face Pro, get ZeroGPU access and write a blog post to show it off to the community. The authors note that their paper is coming soon, but that didn't stop me from tinkering and figuring out how to use this model.

Voice Cloning

Llasa is a Llama 3.2 3B finetune/continued pretrain that adapts the model to generate speech tokens without any change to the model architecture. The only addition is the audio tokenizer, xcodec2.
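To make "speech tokens" a bit more concrete: the inference code later in this post writes each codec ID as a text token of the form <|s_12345|>. Here is a minimal sketch of that mapping and its inverse in plain Python. The function names and the regex are my own illustration, not part of the Llasa or xcodec2 code, and the token format is assumed from that inference code.

```python
import re

# Assumed token format, taken from the inference code in this post: <|s_12345|>
SPEECH_TOKEN_RE = re.compile(r"<\|s_(\d+)\|>")


def ids_to_token_string(speech_ids):
    """Join integer codec IDs into one string of <|s_N|> tokens."""
    return "".join(f"<|s_{int(i)}|>" for i in speech_ids)


def token_string_to_ids(text):
    """Recover the integer IDs, ignoring anything that is not a speech token."""
    return [int(m) for m in SPEECH_TOKEN_RE.findall(text)]


ids = [12345, 7, 65535]
encoded = ids_to_token_string(ids)
print(encoded)

# A trailing end marker is skipped because it does not match the pattern
decoded = token_string_to_ids(encoded + "<|SPEECH_GENERATION_END|>")
print(decoded)
```

Parsing with a regex instead of string slicing means a stray end marker or whitespace in the generated text will not break the conversion back to integers.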

Before I ramble about all the cool things I discovered it can do: I set up a Space for people to try here, and here are some in-the-wild voice clones I made. (These are not real people; I used sample audio from ElevenLabs voices.)

Alex

Reference: Let me know in the comment section below. This is the COD Archive, and I'll see you tomorrow. Take care.

Clone: Hey guys, what's up? Alex here, back at it again with another video. Today we will be learning how to clone voices with a state-of-the-art text-to-speech model. Exciting, right? Let's dive right in.

Amelia

Reference: Hi! I'm Amelia, a super high quality English voice. I love to read. Seriously, I'm a total bookworm. So what are you waiting for? Get me reading!

Clone: All you need is a short clean audio sample of just 5 to 10 seconds. Then the model can generate a high quality speech sample mimicking the voice, tone and style of speech and even accent.

Russel

Reference: it is not enough to have a good mind the main thing is to use it well

Clone: The model was trained on 160,000 hours of audio (corrected figure: 250,000 hours) tokenized by Xcodec2, which converts audio to tokens at a very efficient 50 tokens per second.

Varying style of speech

Whisper

The sample audio you provide is very important: it dictates how the generated audio that follows will sound. So whispers in equals whispers out.

Emotions

Confusion: I don't know what to say. It will be sunny? Or rainy? The weather is completely unpredictable. I'm just so confused.

Anger: I don't know what to say. It will be sunny? Or rainy? The weather is completely unpredictable. I'm just so annoyed.

Laughing: I don't know what to say. It will be sunny? Or rainy? The weather is completely unpredictable. It's actually quite funny.

Optimus Prime? This is an example of where the model struggles a lot. It can't quite capture how Peter Cullen voices Optimus Prime.

8B?

The authors have an 8B model space, which is currently empty. It would be interesting to see how good that is, given that the 3B is already so good for most voices. Also, does LoRA finetuning work? Can we merge and mix voices? There is so much to tinker with and I can't wait for the official paper to come out.

Hope you enjoyed my first blog post/ramble.

P.S. I love that it's basically just a Llama model in disguise.

As I mentioned earlier, the only addition is the xcodec2 audio tokenizer model; everything else is just Llama 3 inference with the correct prompt templating and tokenisation. See my ZERO Space's app.py file for inference code in HF transformers. But since it's just a Llama 3 model, there's nothing stopping us from using a more optimised inference library like vLLM, like this:

Note: I cloned the repos into my profile since the originals are gated, so it's easier to run as a demo.

from transformers import pipeline, AutoTokenizer
import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model
from IPython import display
import torchaudio
from vllm import LLM, SamplingParams

llm = LLM(model="srinivasbilla/llasa-3b", gpu_memory_utilization=0.5, max_model_len=4096)
tokenizer = AutoTokenizer.from_pretrained('srinivasbilla/llasa-3b')

model_path = "srinivasbilla/xcodec2"
 
Codec_model = XCodec2Model.from_pretrained(model_path)
Codec_model.eval().cuda()

whisper_turbo_pipe = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3-turbo", device='cuda')

sampling_params = SamplingParams(temperature=0.8, top_p=1, max_tokens=2048, stop=['<|SPEECH_GENERATION_END|>'], stop_token_ids=[128261])


def ids_to_speech_tokens(speech_ids):
    # Map each integer codec ID (e.g. 12345) to its text token, "<|s_12345|>"
    return [f"<|s_{int(speech_id)}|>" for speech_id in speech_ids]

def extract_speech_ids(speech_tokens_str):
    # Inverse mapping: "<|s_1|><|s_2|>" -> [1, 2]. Assumes the string contains only speech tokens.
    return [int(x.replace('s_', '')) for x in speech_tokens_str[2:-2].split('|><|')]


def text_to_speech(sample_audio_path, target_text, sampling_params, prompt_text=None):
    waveform, sample_rate = torchaudio.load(sample_audio_path)

    # Check if the audio is stereo (i.e., has more than one channel)
    if waveform.size(0) > 1:
        # Convert stereo to mono by averaging the channels
        waveform_mono = torch.mean(waveform, dim=0, keepdim=True)
    else:
        # If already mono, just use the original waveform
        waveform_mono = waveform

    # Resample to 16 kHz, the only sample rate the codec supports
    waveform_16k = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(waveform_mono)
    # Scratch file; change this path to a writable location on your machine
    torchaudio.save('/local_disk0/input.wav', waveform_16k, 16000)

    prompt_wav, sr = sf.read("/local_disk0/input.wav")
    prompt_wav = torch.from_numpy(prompt_wav).float().unsqueeze(0)

    if prompt_text is None:
        prompt_text = whisper_turbo_pipe('/local_disk0/input.wav')['text'].strip()
        print(prompt_text)

    input_text = prompt_text + ' ' + target_text

    #TTS start!
    with torch.no_grad():
        # Encode the prompt wav
        vq_code_prompt = Codec_model.encode_code(input_waveform=prompt_wav)

        vq_code_prompt = vq_code_prompt[0,0,:]
        # Convert int 12345 to token <|s_12345|>
        speech_ids_prefix = ids_to_speech_tokens(vq_code_prompt)

        formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"

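        # Build the chat: the assistant turn is pre-filled with the reference speech tokens
        chat = [
            {"role": "user", "content": "Convert the text to speech:" + formatted_text},
            {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + ''.join(speech_ids_prefix)}
        ]

        # Render the chat template to a plain string (tokenize=False); vLLM tokenizes it itself.
        # continue_final_message=True leaves the assistant turn open so generation continues it.
        prompt = tokenizer.apply_chat_template(
            chat,
            tokenize=False,
            continue_final_message=True
        )

        outputs = llm.generate([prompt], sampling_params)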

        generated_text = outputs[0].outputs[0].text

        speech_tokens = extract_speech_ids(generated_text)
        speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)
        gen_wav = Codec_model.decode_code(speech_tokens)
    
    return gen_wav[0, 0, :].cpu().numpy()

And then inference:

target_text = """The model was trained on a 160 *thousand* hours of audio tokenized by X codec 2. Which converts audio to tokens at a very efficient 50 tokens per second."""
audio_out = text_to_speech(
    "./sample_voices/voice_preview_neal.mp3",
    target_text,
    sampling_params=sampling_params,
    prompt_text="it is not enough to have a good mind the main thing is to use it well",
)
display.Audio(audio_out, rate=16000)
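One thing worth working out is how much audio the settings above can actually produce. At the 50 tokens per second quoted earlier, max_tokens=2048 caps a single generation at about 41 seconds, and the reference clip also eats into the 4096-token context because its speech tokens are part of the prompt. A rough sketch of that arithmetic follows; the helper names are mine, and the text token count is a made-up estimate, since the real number depends on the tokenizer and the chat template.

```python
TOKENS_PER_SECOND = 50  # XCodec2 rate quoted in this post


def seconds_for_tokens(num_tokens, tokens_per_second=TOKENS_PER_SECOND):
    """Audio duration covered by a given number of speech tokens."""
    return num_tokens / tokens_per_second


def tokens_for_seconds(seconds, tokens_per_second=TOKENS_PER_SECOND):
    """Speech tokens needed to represent a clip of the given length."""
    return int(round(seconds * tokens_per_second))


def remaining_context(max_model_len, reference_seconds, text_tokens, max_new_tokens):
    """Context left after the reference audio, the text and the generation budget.

    text_tokens is a hypothetical estimate, not a measured value.
    """
    used = tokens_for_seconds(reference_seconds) + text_tokens + max_new_tokens
    return max_model_len - used


# Longest clip a single call can generate with max_tokens=2048
print(seconds_for_tokens(2048))

# Prompt tokens consumed by a 10 second reference clip
print(tokens_for_seconds(10))

# Headroom with max_model_len=4096, a 10 s reference and roughly 200 text tokens (assumed)
print(remaining_context(4096, 10, 200, 2048))
```

If the headroom goes negative, the likely fixes are a shorter reference clip, a smaller max_tokens, or a larger max_model_len.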

Community

Holy moly, my goodness, this model is amazing. Thank you for writing this blog and hosting a demo. This is literally the best TTS I've ever seen, 10 times better than any other model I've seen.

The 8B repo is empty and the dataset is empty too. Well, it's a little off from SOTA; tbh glm4voice had better results, but it's certainly an "ok" PoC.

gh repo empty / no paper

Article author

Yeah, the authors said the 8B would come by the end of the month; not sure about the paper. I haven't heard of glm4voice, to be fair. I'll check it out.

I really want to run this but i'm having a really hard time getting vllm and xcodec2 to run in the same environment. Can anyone possibly help me out with what versions would work together?

Just limit vLLM to one GPU and run the rest on another one, or use -gmu.

Can you explain how you achieved the emotions? Was it the reference voice that had the emotion or was it in the text prompt?

Did you figure this out? The emotion changes for me depending on the content sometimes but I haven't been able to guide it successfully toward a specific emotion

If it's "just llama", presumably that means you can use existing control vector implementations to steer the model. Would that give you prosody and emotion control independent of the input sample?

Article author

That is interesting. I'm not sure; I've never tried a control vector implementation. Can you give an example of how to do it?

Where does this directory come from? "./sample_voices/voice_preview_neal.mp3"

I tried some Chinese, and it seems the effect is much worse than English.
