Dataset for German language

#86

by junkstage - opened 17 days ago

17 days ago

Hi @hexgrad,
Since I read that you also trained the new model on "public domain" audio, I searched for good German public domain datasets and found some here: https://opendata.iisys.de/dataset/hui-audio-corpus-german/

It's a good-quality selection from the LibriVox dataset, put together by the university in Hof. The "clean" datasets in particular are high quality.
I also asked one of the authors of the subset about the license, and he said it's public domain (https://github.com/iisys-hof/HUI-Audio-Corpus-German/issues/4).

Steffstoff

16 days ago

Yes, it would be awesome to have German voices as well!

James3

15 days ago

I second that.

James3

15 days ago

Here are also 23+ hours of German voice data with a permissive license: https://www.thorsten-voice.de/datasets/

alexkyk

14 days ago

I third that

emasuriano

11 days ago

+1 for German support. It would be amazing to add it to the list of voices.

neffetzz

10 days ago

Hi,
Is there a chance of getting a German voice for Kokoro?

jnkstr

6 days ago

Yes, +1 for German. But for general-purpose use I think we need better samples. The ones posted here as examples sound more like audiobook readings.

nesterran

2 days ago

The lack of a German voice is the only thing preventing me from using it. It would be great if you could add it!

hexgrad

Owner 2 days ago

edited 2 days ago

I can only add eligible audio that I find, or that others give to me. So far, I have received no German audio in response to https://hf.co/posts/hexgrad/846477530846098

Edit: There are quality standards for audio that enters training, and things like Common Voice or Thorsten Voice do not meet that bar.

nesterran

2 days ago

Are you aware of the following datasets?

•	M-AILABS German and CSS10 German for additional single- or multi-speaker data: https://github.com/imdatceleste/m-ailabs-dataset
•	HUI-Audio-Corpus-German for a large, multi-speaker resource with excellent audio quality: https://github.com/iisys-hof/HUI-Audio-Corpus-German
•	LibriVoxDeEn for audiobook-based, read speech data: https://www.cl.uni-heidelberg.de/statnlpgroup/librivoxdeen/

junkstage

about 11 hours ago

edited about 11 hours ago

@hexgrad
Here is a dataset created with the OpenAI TTS API: https://huggingface.co./datasets/laion/laions_got_talent
There is a lot of German audio in this dataset.

Would that data qualify for your training?
