The SOTA Text-to-speech and Zero Shot Voice cloning model that no one knows about...
Quick Links:
- Spaces DEMO : https://huggingface.co./spaces/srinivasbilla/llasa-3b-tts
- Model : https://huggingface.co./HKUST-Audio/Llasa-3B
- Github : https://github.com/zhenye234/LLaSA_training
Hello everyone, I've been having a lot of fun lately playing around with Llasa (https://huggingface.co./HKUST-Audio/Llasa-3B), an open source Llama 3 3B finetune that acts as a text-to-speech model. Not only does it do incredibly realistic text to speech, it can also clone any voice with only a couple of seconds of sample audio.
It's so good that I had to sign up for Hugging Face Pro, get ZeroGPU access and write a blog post to show it off to the community. The authors note that their paper is coming soon, but that didn't stop me from tinkering and figuring out how to use this model.
Voice Cloning
This is a Llama 3.2 3B finetune/continued pretrain that adapts the model to generate speech tokens without any change to the model architecture. The only addition is the audio tokenizer, xcodec2.
Before I ramble about all the cool things I discovered it can do, I set up a space for people to try here, and here are some in-the-wild sample voice clones I made. (These are not real people; I used sample audio from ElevenLabs voices.)
Alex
Reference: Let me know in the comment section below. This is the COD Archive, and I'll see you tomorrow. Take care.
Clone: Hey guys, what's up? Alex here, back at it again with another video. Today we will be learning how to clone voices with a state-of-the-art text-to-speech model. Exciting, right? Let's dive right in.
Amelia
Reference: Hi! I'm Amelia, a super high quality English voice. I love to read. Seriously, I'm a total bookworm. So what are you waiting for? Get me reading!
Clone: All you need is a short clean audio sample of just 5 to 10 seconds. Then the model can generate a high quality speech sample mimicking the voice, tone and style of speech and even accent.
Russel
Reference
it is not enough to have a good mind the main thing is to use it well
Clone
The model was trained on 250,000 hours of audio (corrected from 160,000) tokenized by Xcodec2, which converts audio to tokens at a very efficient 50 tokens per second.
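To put that token rate in perspective, here is a rough back-of-the-envelope sketch. It only assumes the 50 tokens per second figure quoted above; the function names and the example values (10 seconds, 2048 tokens) are mine, picked purely for illustration.

```python
TOKENS_PER_SECOND = 50  # Xcodec2 rate quoted above


def speech_tokens_for(seconds: float) -> int:
    """Approximate number of speech tokens for a clip of the given length."""
    return round(seconds * TOKENS_PER_SECOND)


def seconds_for(tokens: int) -> float:
    """Approximate audio duration covered by a given number of speech tokens."""
    return tokens / TOKENS_PER_SECOND


print(speech_tokens_for(10))  # 500
print(seconds_for(2048))      # 40.96
```

So a 10 second reference clip costs roughly 500 tokens of context, and a 2048-token generation budget works out to about 41 seconds of audio.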
Varying style of speech
Whisper
The sample audio you provide is very important. It dictates how the audio that follows sounds. So whispers in equals whispers out.
Emotions
Confusion: I don't know what to say. It will be sunny? Or rainy? The weather is completely unpredictable. I'm just so confused.
Anger: I don't know what to say. It will be sunny? Or rainy? The weather is completely unpredictable. I'm just so annoyed.
Laughing: I don't know what to say. It will be sunny? Or rainy? The weather is completely unpredictable. It's actually quite funny.
Optimus Prime? This is an example of where the model struggles a lot. It can't quite capture how Peter Cullen voices Optimus Prime.
8B?
The authors have an 8B model space, which is currently empty. It would be interesting to see how good that is, given that the 3B is already so good for most voices. Also, does LoRA finetuning work? Can we merge and mix voices? There is so much to tinker with and I can't wait for the official paper to come out.
Hope you enjoyed my first blog post/ramble.
P.S. I love that it's basically just a Llama model in disguise.
As I mentioned earlier, the only addition is the xcodec2 audio tokenizer model; everything else is just Llama 3 inference with the correct prompt templating and tokenisation. See my ZERO Space's app.py file for inference code in HF transformers. But since it's just a Llama 3 model, there's nothing stopping us from using a more optimised inference library like vLLM, like this:
Note: I cloned the repos into my profile since they are gated, so it's easier to run as a demo...
from transformers import pipeline, AutoTokenizer
import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model
from IPython import display
import torchaudio
from vllm import LLM, SamplingParams
llm = LLM(model="srinivasbilla/llasa-3b", gpu_memory_utilization=0.5, max_model_len=4096)
tokenizer = AutoTokenizer.from_pretrained('srinivasbilla/llasa-3b')
model_path = "srinivasbilla/xcodec2"
Codec_model = XCodec2Model.from_pretrained(model_path)
Codec_model.eval().cuda()
whisper_turbo_pipe = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3-turbo", device='cuda')
sampling_params = SamplingParams(temperature=0.8, top_p=1, max_tokens=2048, stop=['<|SPEECH_GENERATION_END|>'], stop_token_ids=[128261])

def ids_to_speech_tokens(speech_ids):
    speech_tokens_str = []
    for speech_id in speech_ids:
        speech_tokens_str.append(f"<|s_{speech_id}|>")
    return speech_tokens_str

def extract_speech_ids(speech_tokens_str):
    return [int(x.replace('s_', '')) for x in speech_tokens_str[2:-2].split('|><|')]

def text_to_speech(sample_audio_path, target_text, sampling_params, prompt_text=None):
    waveform, sample_rate = torchaudio.load(sample_audio_path)

    # Check if the audio is stereo (i.e., has more than one channel)
    if waveform.size(0) > 1:
        # Convert stereo to mono by averaging the channels
        waveform_mono = torch.mean(waveform, dim=0, keepdim=True)
    else:
        # If already mono, just use the original waveform
        waveform_mono = waveform

    waveform_16k = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(waveform_mono)
    torchaudio.save('/local_disk0/input.wav', waveform_16k, 16000)

    # only 16khz speech support!
    prompt_wav, sr = sf.read("/local_disk0/input.wav")  # English prompt
    prompt_wav = torch.from_numpy(prompt_wav).float().unsqueeze(0)

    if prompt_text is None:
        prompt_text = whisper_turbo_pipe('/local_disk0/input.wav')['text'].strip()
        print(prompt_text)

    input_text = prompt_text + ' ' + target_text

    # TTS start!
    with torch.no_grad():
        # Encode the prompt wav
        vq_code_prompt = Codec_model.encode_code(input_waveform=prompt_wav)
        vq_code_prompt = vq_code_prompt[0, 0, :]

        # Convert int 12345 to token <|s_12345|>
        speech_ids_prefix = ids_to_speech_tokens(vq_code_prompt)

        formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"

        # Tokenize the text and the speech prefix
        chat = [
            {"role": "user", "content": "Convert the text to speech:" + formatted_text},
            {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + ''.join(speech_ids_prefix)}
        ]

        input_ids = tokenizer.apply_chat_template(
            chat,
            tokenize=False,
            continue_final_message=True
        )

        outputs = llm.generate([input_ids], sampling_params)
        generated_text = outputs[0].outputs[0].text

        speech_tokens = extract_speech_ids(generated_text)
        speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)

        gen_wav = Codec_model.decode_code(speech_tokens)

        return gen_wav[0, 0, :].cpu().numpy()

and then inference:
target_text = """The model was trained on a 160 *thousand* hours of audio tokenized by X codec 2. Which converts audio to tokens at a very efficient 50 tokens per second."""
# text_to_speech returns a single NumPy waveform, so there is only one value to unpack
audio_out = text_to_speech(
    "./sample_voices/voice_preview_neal.mp3",
    target_text,
    sampling_params=sampling_params,
    prompt_text="it is not enough to have a good mind the main thing is to use it well",
)
display.Audio(audio_out, rate=16000)
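If you want to poke at the prompt format without a GPU, the string handling can be checked on its own. The sketch below copies the two token helpers from above and wraps the chat construction in a small function; `build_chat` is a name I made up for illustration, and the speech ids and texts are dummy values, not real Xcodec2 codes.

```python
def ids_to_speech_tokens(speech_ids):
    # Same idea as the helper above: 12345 -> "<|s_12345|>"
    return [f"<|s_{speech_id}|>" for speech_id in speech_ids]


def extract_speech_ids(speech_tokens_str):
    # Strip the leading "<|" and trailing "|>", then split on the token boundary
    return [int(x.replace("s_", "")) for x in speech_tokens_str[2:-2].split("|><|")]


def build_chat(prompt_text, target_text, prompt_speech_ids):
    # Hypothetical helper: mirrors the chat layout used in text_to_speech above
    input_text = prompt_text + " " + target_text
    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"
    speech_prefix = "".join(ids_to_speech_tokens(prompt_speech_ids))
    return [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + speech_prefix},
    ]


fake_ids = [12345, 7, 65535]  # made-up codes, for illustration only
joined = "".join(ids_to_speech_tokens(fake_ids))
print(joined)  # <|s_12345|><|s_7|><|s_65535|>

# The two helpers should be inverses of each other
assert extract_speech_ids(joined) == fake_ids

chat = build_chat("hello there", "general kenobi", fake_ids)
print(chat[1]["content"])  # <|SPEECH_GENERATION_START|><|s_12345|><|s_7|><|s_65535|>
```

The assistant turn is left open on purpose: it starts with the reference audio's speech tokens, and the model simply continues that sequence with the tokens for the target text. One thing to watch for is that `extract_speech_ids` raises a `ValueError` on an empty string, so an empty generation needs handling before decoding.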