
Truncated audio and latency in speech generation

#1
by tushar310 - opened

Hi Team,
I'm using the llama-cpp-python version of this model in the Q4 variant, and the results seem very different: the audio is 90%+ truncated, and at the same time it takes a good number of seconds to generate a single sentence. Should we conclude this won't suffice for the real-time use cases that the likes of Azure, ElevenLabs, and Google can handle? If I'm wrong, please suggest an appropriate implementation strategy.
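On the latency question, one way to make the comparison concrete is to measure the real-time factor (RTF): wall-clock generation time divided by the duration of the produced audio. An RTF below 1.0 means generation keeps up with real time. A minimal, generic sketch (the timing helper is not OuteTTS-specific; plug in whatever generate call you use):

```python
import time

def real_time_factor(gen_seconds: float, audio_seconds: float) -> float:
    """RTF = generation time / audio duration; < 1.0 keeps up with real time."""
    return gen_seconds / audio_seconds

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, if a 4-second clip took 10 seconds to generate, real_time_factor(10, 4) is 2.5, i.e. 2.5x slower than real time.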

OuteAI org

What do you mean by "audio is 90%+ truncated"? Sounds very unusual. What hardware specs are you running this model on?

Hey Edwko, I had a similar issue with the audio being truncated: it cuts off about half a second at the beginning and half a second at the end of the audio clip.

OuteAI org

@rakker Are you playing the audio via output.play() or the saved file? If you're playing with .play(), it's probably related to this issue: https://github.com/edwko/OuteTTS/issues/45#issuecomment-2525099911 — there may be compatibility issues with the sounddevice library.

@edwko No, I'm playing the saved file.

OuteAI org

@rakker Then it seems like a playback issue on your end. Try resampling the audio; maybe your device doesn't handle a 24 kHz sample rate well:

# generate audio with the interface as usual ...

import torchaudio

new_sr = 44100  # target sample rate; widely supported by playback devices

# build a resampler from the model's native rate (output.sr) to 44.1 kHz,
# on the same device as the audio tensor
resampler = torchaudio.transforms.Resample(orig_freq=output.sr, new_freq=new_sr).to(output.audio.device)
resampled_audio = resampler(output.audio)

# overwrite the output fields so save() writes the new rate into the file header
output.sr = new_sr
output.audio = resampled_audio
output.save("output.wav")
edwko changed discussion status to closed
