You could try to use the inference code from:
https://huggingface.co./MultiLlasa/Llasa-1B-Multilingual-German
or duplicate the space and replace the model. There you can also adjust the temperature:
https://huggingface.co./spaces/SebastianBodza/Kartoffel-1B-v0.1-llasa-1b-tts
You should not really need LigerKernels for inference. Keep in mind: the worse the model training is, the higher the temperature needs to be.
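As a rough illustration of why temperature matters (a minimal sketch of temperature scaling in general, not the model's actual sampling code): a higher temperature flattens the next-token distribution, so a weaker model's noisier logits still produce varied samples, at the cost of stability.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature before the softmax:
    # T < 1 sharpens the distribution, T > 1 flattens it.
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()  # subtract max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

logits = [2.0, 1.0, 0.1]  # made-up example logits
print(softmax_with_temperature(logits, 0.5))   # peaked: mostly the top token
print(softmax_with_temperature(logits, 1.5))   # flatter: more diverse sampling
```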
Your first example just contains noise. I tried it out in a Colab notebook with the following code:
```python
import torch
from xcodec2.modeling_xcodec2 import XCodec2Model
import soundfile as sf
import re

model_path = "HKUST-Audio/xcodec2"
Codec_model = XCodec2Model.from_pretrained(model_path)
Codec_model.cuda()

def extract_speech_ids(speech_tokens_str):
    # Pull the integer ids out of tokens like <|s_12345|>
    pattern = r"<\|s_(\d+)\|>"
    matches = re.findall(pattern, speech_tokens_str)
    return [int(num) for num in matches]

speech_tokens = "<|s_62770|><|s_63794|><|s_60710|><|s_43305|><|s_59942|><|s_15051|><|s_64054|><|s_62770|><|s_65078|><|s_61235|><|s_59702|><|s_55594|><|s_64822|><|s_59702|>"
speech_tokens = extract_speech_ids(speech_tokens)
# decode_code expects shape (batch, num_quantizers, time)
speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)
gen_wav = Codec_model.decode_code(speech_tokens)
# xcodec2 outputs audio at 16 kHz
sf.write("generation.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
```
train_input_ids_shape.npy and train_input_ids.memmap are correct. How many samples and how many epochs did you use? For mine, I used 2.5M examples.
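If you want to sanity-check those two files against each other, a minimal sketch could look like this (file names from above; the dtype and shape here are made-up assumptions for illustration):

```python
import numpy as np
import os
import tempfile

tmp = tempfile.mkdtemp()

# --- write a toy dataset (stand-in for the real preprocessing output) ---
shape = (1000, 512)  # hypothetical: 1000 samples of 512 tokens each
data = np.memmap(os.path.join(tmp, "train_input_ids.memmap"),
                 dtype=np.int32, mode="w+", shape=shape)
data[:] = 0
data.flush()
np.save(os.path.join(tmp, "train_input_ids_shape.npy"), np.array(shape))

# --- read it back the way a training script would ---
loaded_shape = tuple(np.load(os.path.join(tmp, "train_input_ids_shape.npy")))
arr = np.memmap(os.path.join(tmp, "train_input_ids.memmap"),
                dtype=np.int32, mode="r", shape=loaded_shape)
print("samples:", arr.shape[0])  # number of training examples
```

The shape file is what lets you reopen the raw memmap correctly, so the sample count is just the first dimension.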