SebastianBodza

AI & ML interests

Senior Data Scientist @ Qnovi GmbH

Recent Activity

updated a dataset about 8 hours ago
SebastianBodza/auto_data_pipe_de_ger_v1
published a model about 14 hours ago
SebastianBodza/Kartoffel-1B-v0.3
updated a model about 15 hours ago
SebastianBodza/Kartoffel-1B-v0.3

Organizations

StyleTTS 2 Community · AI Starter Pack · MultiLlasa

SebastianBodza's activity


You could try to use the inference code from:
https://huggingface.co./MultiLlasa/Llasa-1B-Multilingual-German
or duplicate the Space and replace the model; there you can also adjust the temperature:
https://huggingface.co./spaces/SebastianBodza/Kartoffel-1B-v0.1-llasa-1b-tts

You shouldn't really need LigerKernels for inference. Keep in mind: the worse the model's training, the higher the temperature you need.
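Temperature here is the usual scaling of the logits before sampling. As a rough pure-Python illustration (not code from the repo, just the standard formula):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature before the softmax:
    # temperature > 1 flattens the distribution (more diverse samples),
    # temperature < 1 sharpens it (more deterministic output).
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
low = softmax_with_temperature(logits, 0.5)
high = softmax_with_temperature(logits, 1.5)
# The top token's probability drops as the temperature rises,
# so a poorly trained model gets more chances to escape bad modes.
print(max(low), max(high))
```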

Your first example contains only noise. I tried it out in a Colab notebook with the following code:

import torch
from xcodec2.modeling_xcodec2 import XCodec2Model
import soundfile as sf
import re

model_path = "HKUST-Audio/xcodec2"
codec_model = XCodec2Model.from_pretrained(model_path)
codec_model.cuda()

def extract_speech_ids(speech_tokens_str):
    # Each speech token looks like <|s_12345|>; capture the number.
    pattern = r"<\|s_(\d+)\|>"
    matches = re.findall(pattern, speech_tokens_str)
    return [int(num) for num in matches]


speech_tokens = "<|s_62770|><|s_63794|><|s_60710|><|s_43305|><|s_59942|><|s_15051|><|s_64054|><|s_62770|><|s_65078|><|s_61235|><|s_59702|><|s_55594|><|s_64822|><|s_59702|>"
speech_tokens = extract_speech_ids(speech_tokens)
# decode_code expects a (batch, channel, length) tensor of codec IDs
speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)
gen_wav = codec_model.decode_code(speech_tokens)

# XCodec2 decodes to 16 kHz audio
sf.write("generation.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
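For reference, the parsing step on its own just pulls the integer IDs out of the token string; a quick standalone check using the same regex as above:

```python
import re

def extract_speech_ids(speech_tokens_str):
    # Each speech token looks like <|s_12345|>; capture the number.
    return [int(num) for num in re.findall(r"<\|s_(\d+)\|>", speech_tokens_str)]

print(extract_speech_ids("<|s_62770|><|s_63794|><|s_60710|>"))
# → [62770, 63794, 60710]
```

If this returns an empty list for your string, the tokens aren't in the expected `<|s_N|>` format, which would also explain noise-only output.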

train_input_ids_shape.npy and train_input_ids.memmap are correct. How many samples and how many epochs did you use? For mine I used 2.5M examples.