SebastianBodza

AI & ML interests

Senior Data Scientist @ Qnovi GmbH

Recent Activity

updated a dataset about 8 hours ago
SebastianBodza/auto_data_pipe_de_ger_v1
published a model about 14 hours ago
SebastianBodza/Kartoffel-1B-v0.3
updated a model about 15 hours ago
SebastianBodza/Kartoffel-1B-v0.3

Organizations

StyleTTS 2 Community · AI Starter Pack · MultiLlasa

SebastianBodza's activity


You could try to use the inference code from:
https://huggingface.co./MultiLlasa/Llasa-1B-Multilingual-German
or duplicate the Space and replace the model; there you can also adjust the temperature:
https://huggingface.co./spaces/SebastianBodza/Kartoffel-1B-v0.1-llasa-1b-tts

You shouldn't really need LigerKernels for inference. Keep in mind: the worse the model's training, the higher the temperature you need.
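Temperature here is the usual scaling of the logits before sampling. As a rough pure-Python illustration (not code from the repo, just the standard formula):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature before the softmax:
    # temperature > 1 flattens the distribution (more diverse samples),
    # temperature < 1 sharpens it (more deterministic output).
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
low = softmax_with_temperature(logits, 0.5)
high = softmax_with_temperature(logits, 1.5)
# The top token's probability drops as the temperature rises,
# so a poorly trained model gets more chances to escape bad modes.
print(max(low), max(high))
```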

Your first example contains only noise. I tried it out in a Colab notebook with the following code:

import torch
from xcodec2.modeling_xcodec2 import XCodec2Model
import soundfile as sf
import re

model_path = "HKUST-Audio/xcodec2"
codec_model = XCodec2Model.from_pretrained(model_path)
codec_model.cuda()

def extract_speech_ids(speech_tokens_str):
    # Each speech token looks like <|s_12345|>; capture the number.
    pattern = r"<\|s_(\d+)\|>"
    matches = re.findall(pattern, speech_tokens_str)
    return [int(num) for num in matches]


speech_tokens = "<|s_62770|><|s_63794|><|s_60710|><|s_43305|><|s_59942|><|s_15051|><|s_64054|><|s_62770|><|s_65078|><|s_61235|><|s_59702|><|s_55594|><|s_64822|><|s_59702|>"
speech_tokens = extract_speech_ids(speech_tokens)
# decode_code expects a (batch, channel, length) tensor of codec IDs
speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)
gen_wav = codec_model.decode_code(speech_tokens)

# XCodec2 decodes to 16 kHz audio
sf.write("generation.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
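For reference, the parsing step on its own just pulls the integer IDs out of the token string; a quick standalone check using the same regex as above:

```python
import re

def extract_speech_ids(speech_tokens_str):
    # Each speech token looks like <|s_12345|>; capture the number.
    return [int(num) for num in re.findall(r"<\|s_(\d+)\|>", speech_tokens_str)]

print(extract_speech_ids("<|s_62770|><|s_63794|><|s_60710|>"))
# → [62770, 63794, 60710]
```

If this returns an empty list for your string, the tokens aren't in the expected `<|s_N|>` format, which would also explain noise-only output.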

train_input_ids_shape.npy and train_input_ids.memmap are correct. How many samples and how many epochs did you use? For mine I used 2.5M examples.