---
language:
- ja
pipeline_tag: automatic-speech-recognition
---

WIP: frozen turbo encoder + 2 decoder layers. Trained for 2^19 steps at batch size 8 (~160 hours on a 3060). Almost certainly undertrained.

# Goals
* Japanese transcription
* Focus on anime-adjacent domains
* No hallucination
* Drop-in replacement (trained 50% with prompt, 25% notimestamps)

# Acknowledgements
* Train sets: OOPPEENN, Reazon, Common Voice 19, 小虫哥_, deepghs
* Validation sets: simon3000, grider-withourai, kotoba-tech
* Test sets: KitsuneX07, TEDxJP

# Test set
|            | air  | himanatsu | kanon | proseka | sakuuta | tedxjp |
|------------|------|-----------|-------|---------|---------|--------|
|turbo_b1    | 25.8 | 60.6      | 22.5  | 13.1    | 21.1    | 10.8   |
|turbo_b5    | 20.9 | 48.3      | 19.1  | 11.8    | 18.9    |        |
|turbo_b1_nt | 25.8 | 61.6      | 23.1  | 13.6    | 20.4    |        |
|turbo_b5_nt | 17.1 | 25.8      | 23.5  | 9.4     | 12.5    |        |
|anime_b1    | 15.9 | 20.2      | 12.8  | 8.9     | 10.9    | 41.8   |
|anime_b5    | 14.4 | 18.3      | 12.6  | 8.6     | 10.0    |        |
|anime_b1_n5 | 15.0 | 18.4      | 12.7  | 8.9     | 10.1    |        |
|anime_b5_n5 | 14.4 | 18.1      | 12.5  | 8.6     | 10.0    |        |
|anime_b1_nt | 14.4 | 18.7      | 11.4  | 8.3     | 10.1    |        |
|anime_b5_nt | 13.4 | 17.5      | 11.4  | 8.1     | 9.6     |        |
||
|b1          | 15.6 | 20.1      | 11.8  | 8.8     | 10.5    | 11.5   |
|b5          | 15.2 | 19.8      | 11.6  | 8.8     | 10.7    |        |
|b1_nt       | 15.6 | 20.1      | 11.9  | 8.7     | 10.5    |        |
|b5_nt       | 15.3 | 19.4      | 11.8  | 8.6     | 10.5    |        |

* b1: beam_size=1
* b5: beam_size=5
* n5: no_repeat_ngram_size=5
* nt: `<|notimestamps|>`
* Anime sets: equal to or worse than anime-whisper, better than turbo (for which they are out of domain).
* tedxjp: 273 videos from TEDxJP-10K with YouTube subtitles, used to test long-form transcription with faster-whisper.
  * Slightly worse than turbo. Kotoba/anime-whisper are not trained for long form.

# Validation set
Used only for hyperparameter optimization.
|                                                                                   | bluearchive | genshin5.1 | nekopara | genshin | starrail | reazon | jsut | cv8  | cv19 | jsl  | loopers | tedx10 |
|-----------------------------------------------------------------------------------|-------------|------------|----------|---------|----------|--------|------|------|------|------|---------|--------|
| [large-v3_b1](https://huggingface.co./openai/whisper-large-v3)                    | 12.2        | 10.1       | 70.8     | 11.9    | 10.0     | 16.0   | 7.1  | 8.6  | 15.1 | 12.2 |         | 7.7    |
| large-v3_b5                                                                       | 11.0        | 10.0       | 63.7     | 11.6    | 9.8      | 14.1   | 7.1  | 8.3  | 14.8 | 11.0 |         |        |
| [large-v2_b1](https://huggingface.co./openai/whisper-large-v2)                    |             | 14.4       | 103.4    | 18.3    | 12.9     | 31.6   | 8.2  | 9.8  | 18.5 | 18.0 |         | 8.0    |
| large-v2_b5                                                                       |             | 12.7       | 100.9    | 16.8    | 12.9     | 28.0   | 8.0  | 9.5  | 17.5 | 16.2 |         |        |
| [turbo_b1](https://huggingface.co./openai/whisper-large-v3-turbo)                 | 12.8        | 11.1       | 72.3     | 11.6    | 11.1     | 11.6   | 7.3  | 9.6  | 17.5 | 12.0 | 28.0    | 7.9    |
| turbo_b5                                                                          | 10.4        | 10.0       | 64.3     | 12.0    | 10.2     | 10.4   | 7.2  | 9.1  | 16.6 | 10.8 | 20.2    | 8.8    |
| [kotoba-v1_b1](https://huggingface.co./kotoba-tech/kotoba-whisper-v1.0)           | 8.5         | 9.4        | 27.8     | 9.9     | 10.3     | 12.7   | 8.4  | 9.5  | 17.1 | 12.2 |         | 34.9   |
| kotoba-v1_b5                                                                      | 8.4         | 9.3        | 27.8     | 9.8     | 10.3     | 12.3   | 8.3  | 9.3  | 16.7 | 12.1 |         |        |
| [kotoba-v2_b1](https://huggingface.co./kotoba-tech/kotoba-whisper-v2.0)           | 8.5         | 9.6        | 27.7     | 10.2    | 10.4     | 11.6   | 8.2  | 9.2  | 16.9 | 12.3 |         | 25.3   |
| kotoba-v2_b5                                                                      | 8.6         | 9.5        | 27.7     | 10.1    | 10.5     | 11.4   | 8.2  | 9.0  | 16.6 | 12.2 |         |        |
| [kotoba-bi_b1](https://huggingface.co./kotoba-tech/kotoba-whisper-bilingual-v1.0) | 8.9         | 10.1       | 28.1     | 10.5    | 10.8     | 17.5   | 9.1  | 9.8  | 17.5 | 12.7 |         | 27.8   |
| kotoba-bi_b5                                                                      | 8.8         | 10.0       | 28.0     | 10.5    | 10.7     | 17.1   | 9.1  | 9.6  | 17.2 | 12.6 |         |        |
| [anime_b1](https://huggingface.co./litagin/anime-whisper)                         | 7.5         | 11.5       | 24.7     | 11.0    | 11.2     | 30.1   | 8.0  | 10.0 | 19.1 | 9.0  | 18.9    | 32.0   |
| anime_b5                                                                          | 7.2         | 10.4       | 22.0     | 10.3    | 10.4     | 26.6   | 7.8  | 9.8  | 18.8 | 8.5  | 15.3    | 51.8   |
||
| b1                                                                                | 6.9         | 6.3        | 22.8     | 6.7     | 7.4      | 16.2   | 7.1  | 8.9  | 17.1 | 8.5  | 14.7    | 8.2    |
| b5                                                                                | 7.5         | 6.2        | 22.8     | 6.6     | 7.3      | 15.7   | 7.0  | 8.7  | 17.0 | 8.5  | 14.5    | 9.1    |

* bluearchive.wiki: beam 5 is worse because of extra kana usage. Learnt from MiHoYo games?
* genshin5.1: Trained on 5.0; this set is new audio from 5.1, with possible minor overlap.
* nekopara: Hallucination test. anime-whisper would score better if not for increased hallucination. The OpenAI models are unusable.
* genshin/starrail: Mostly in the train set.
* reazon: Significantly higher CER from transcribing background/secondary audio.
* jsut: Surprisingly good?
* cv8: The cv19 train set includes some of the cv8 test set.
* cv19: No contamination. Struggles with accents.
* jsl: Anime set.
* loopers: Anime set with hallucination-prone audio.
* tedxjp (tedx10 column): 10-video subset. See the comments in the test set section. In this column b1=batched and b5=sequential, both with beam_size=1, temperature=0, condition_on_previous_text=False.
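
A rough sketch of how one table cell might be reproduced, not the evaluation script actually used. It maps the row suffixes from the legend (b1, b5, n5, nt) to keyword arguments of faster-whisper's `WhisperModel.transcribe`, and computes a plain character error rate. Assumptions: the scores are CER in percent, `language="ja"` is passed, no text normalization is applied, and temperature=0 / condition_on_previous_text=False are only added when `long_form=True` because the card states them only for tedxjp. The helper names and the model path in the comments are hypothetical.

```python
def decode_options(tag: str, long_form: bool = False) -> dict:
    """Map a row suffix such as "b5_nt" to faster-whisper transcribe() kwargs."""
    opts = {"language": "ja"}  # assumption: language forced to Japanese
    for part in tag.split("_"):
        if part[:1] == "b" and part[1:].isdigit():
            opts["beam_size"] = int(part[1:])             # b1 / b5
        elif part[:1] == "n" and part[1:].isdigit():
            opts["no_repeat_ngram_size"] = int(part[1:])  # n5
        elif part == "nt":
            opts["without_timestamps"] = True             # <|notimestamps|>
        else:
            raise ValueError(f"unknown option: {part!r}")
    if long_form:
        # Settings the card lists for the tedxjp runs only.
        opts["temperature"] = 0
        opts["condition_on_previous_text"] = False
    return opts


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance, one DP row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, without any text normalization."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


# Usage with faster-whisper (not executed here; the path is a placeholder):
# from faster_whisper import WhisperModel
# model = WhisperModel("path/to/converted-model")
# segments, info = model.transcribe("clip.wav", **decode_options("b5_nt"))
# text = "".join(segment.text for segment in segments)
# print(cer("reference transcript", text))
```

Real scores depend heavily on how punctuation, whitespace and kana/kanji variants are normalized before scoring, so numbers from this sketch should not be expected to match the tables exactly.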