---
language:
- ja
pipeline_tag: automatic-speech-recognition
---
Work in progress: frozen turbo (whisper-large-v3-turbo) encoder + 2 decoder layers. Trained for 2^19 steps at batch size 8 (~160 hours on a 3060). Almost certainly undertrained.

# Goals
* Japanese transcription
* Focus on anime adjacent domain
* No hallucination
* Drop-in replacement (50% of training used a prompt, 25% used `<|notimestamps|>`)

# Acknowledgements

* Train sets: OOPPEENN, Reazon, Common Voice 19, 小虫哥_, deepghs
* Validation sets: simon3000, grider-withourai, kotoba-tech
* Test sets: KitsuneX07, TEDxJP

# Test set
|            | air  | himanatsu | kanon | proseka | sakuuta | tedxjp |
|------------|------|-----------|-------|---------|---------|--------|
|turbo_b1    | 25.8 | 60.6      | 22.5  | 13.1    | 21.1    | 10.8   |
|turbo_b5    | 20.9 | 48.3      | 19.1  | 11.8    | 18.9    |        |
|turbo_b1_nt | 25.8 | 61.6      | 23.1  | 13.6    | 20.4    |        |
|turbo_b5_nt | 17.1 | 25.8      | 23.5  | 9.4     | 12.5    |        |
|anime_b1    | 15.9 | 20.2      | 12.8  | 8.9     | 10.9    | 41.8   |
|anime_b5    | 14.4 | 18.3      | 12.6  | 8.6     | 10.0    |        |
|anime_b1_n5 | 15.0 | 18.4      | 12.7  | 8.9     | 10.1    |        |
|anime_b5_n5 | 14.4 | 18.1      | 12.5  | 8.6     | 10.0    |        |
|anime_b1_nt | 14.4 | 18.7      | 11.4  | 8.3     | 10.1    |        |
|anime_b5_nt | 13.4 | 17.5      | 11.4  | 8.1     | 9.6     |        |
||
|b1          | 15.6 | 20.1      | 11.8  | 8.8     | 10.5    | 11.5   |
|b5          | 15.2 | 19.8      | 11.6  | 8.8     | 10.7    |        |
|b1_nt       | 15.6 | 20.1      | 11.9  | 8.7     | 10.5    |        |
|b5_nt       | 15.3 | 19.4      | 11.8  | 8.6     | 10.5    |        |

* b1: `beam_size=1`
* b5: `beam_size=5`
* n5: `no_repeat_ngram_size=5`
* nt: `<|notimestamps|>`
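The row suffixes above can be read as decoding options. The sketch below maps a row name to keyword arguments, assuming faster-whisper's `WhisperModel.transcribe` parameter names (this card names faster-whisper only for the long-form test, so the tool used for the short-form rows is an assumption). `decode_options`, `transcribe`, and the model path are hypothetical names for illustration.

```python
def decode_options(tag: str) -> dict:
    """Map a row name such as 'anime_b5_nt' to transcribe() keyword arguments.

    Parts that are not one of the documented suffixes (e.g. the model
    name 'anime' or 'turbo') are ignored.
    """
    options = {}
    for part in tag.split("_"):
        if part in ("b1", "b5"):
            options["beam_size"] = int(part[1:])
        elif part == "n5":
            options["no_repeat_ngram_size"] = 5
        elif part == "nt":
            # Assumption: <|notimestamps|> corresponds to without_timestamps=True.
            options["without_timestamps"] = True
    return options


def transcribe(audio_path: str, tag: str, model_path: str = "path/to/converted-model") -> str:
    """Transcribe one file with the options implied by `tag` (needs faster-whisper)."""
    from faster_whisper import WhisperModel  # imported lazily so the mapping works without it

    model = WhisperModel(model_path)  # placeholder path, not a real repository id
    segments, _info = model.transcribe(audio_path, language="ja", **decode_options(tag))
    return "".join(segment.text for segment in segments)
```

For example, `decode_options("b5_nt")` gives `{"beam_size": 5, "without_timestamps": True}`.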

* On the anime sets, this model is roughly equal to or worse than anime-whisper and better than turbo, for which these sets are out of domain.
* tedxjp: 273 videos from TEDxJP-10K with YouTube subtitles, used to test long-form transcription with faster-whisper.
* On tedxjp, this model is slightly worse than turbo. Kotoba and anime-whisper were not trained for long form.
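The scores in these tables appear to be character error rates in percent (the validation notes refer to CER). A minimal sketch of how such a score is commonly computed follows. It assumes plain Levenshtein distance over characters with no text normalization, which may differ from the scoring actually used here. `edit_distance` and `cer` are illustrative names.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion: reference character missing
                curr[j - 1] + 1,         # insertion: extra hypothesis character
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

Because insertions count as errors, a hallucinated or repeated hypothesis can score above 100: `cer("はい", "はいはいはいはい")` is 300.0.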


# Validation set

Used only for hyperparameter optimization.


|                                                                                   | bluearchive | genshin5.1 | nekopara | genshin | starrail | reazon | jsut | cv8   | cv19  | jsl   | loopers | tedx10 |
|-----------------------------------------------------------------------------------|-------------|------------|----------|---------|----------|--------|------|-------|-------|-------|---------|--------|
| [large-v3_b1](https://huggingface.co./openai/whisper-large-v3)                     | 12.2        | 10.1       | 70.8     | 11.9    | 10.0     | 16.0   | 7.1  | 8.6   | 15.1  | 12.2  |         | 7.7    |
| large-v3_b5                                                                       | 11.0        | 10.0       | 63.7     | 11.6    | 9.8      | 14.1   | 7.1  | 8.3   | 14.8  | 11.0  |         |        |
| [large-v2_b1](https://huggingface.co./openai/whisper-large-v2)                     |             | 14.4       | 103.4    | 18.3    | 12.9     | 31.6   | 8.2  | 9.8   | 18.5  | 18.0  |         | 8.0    |
| large-v2_b5                                                                       |             | 12.7       | 100.9    | 16.8    | 12.9     | 28.0   | 8.0  | 9.5   | 17.5  | 16.2  |         |        |
| [turbo_b1](https://huggingface.co./openai/whisper-large-v3-turbo)                  | 12.8        | 11.1       | 72.3     | 11.6    | 11.1     | 11.6   | 7.3  | 9.6   | 17.5  | 12.0  | 28.0    | 7.9    |
| turbo_b5                                                                          | 10.4        | 10.0       | 64.3     | 12.0    | 10.2     | 10.4   | 7.2  | 9.1   | 16.6  | 10.8  | 20.2    | 8.8    |
| [kotoba-v1_b1](https://huggingface.co./kotoba-tech/kotoba-whisper-v1.0)            | 8.5         | 9.4        | 27.8     | 9.9     | 10.3     | 12.7   | 8.4  | 9.5   | 17.1  | 12.2  |         | 34.9   |
| kotoba-v1_b5                                                                      | 8.4         | 9.3        | 27.8     | 9.8     | 10.3     | 12.3   | 8.3  | 9.3   | 16.7  | 12.1  |         |        |
| [kotoba-v2_b1](https://huggingface.co./kotoba-tech/kotoba-whisper-v2.0)            | 8.5         | 9.6        | 27.7     | 10.2    | 10.4     | 11.6   | 8.2  | 9.2   | 16.9  | 12.3  |         | 25.3   |
| kotoba-v2_b5                                                                      | 8.6         | 9.5        | 27.7     | 10.1    | 10.5     | 11.4   | 8.2  | 9.0   | 16.6  | 12.2  |         |        |
| [kotoba-bi_b1](https://huggingface.co./kotoba-tech/kotoba-whisper-bilingual-v1.0)  | 8.9         | 10.1       | 28.1     | 10.5    | 10.8     | 17.5   | 9.1  | 9.8   | 17.5  | 12.7  |         | 27.8   |
| kotoba-bi_b5                                                                      | 8.8         | 10.0       | 28.0     | 10.5    | 10.7     | 17.1   | 9.1  | 9.6   | 17.2  | 12.6  |         |        |
| [anime_b1](https://huggingface.co./litagin/anime-whisper)                          | 7.5         | 11.5       | 24.7     | 11.0    | 11.2     | 30.1   | 8.0  | 10.0  | 19.1  | 9.0   | 18.9    | 32.0   |
| anime_b5                                                                          | 7.2         | 10.4       | 22.0     | 10.3    | 10.4     | 26.6   | 7.8  | 9.8   | 18.8  | 8.5   | 15.3    | 51.8   |
||
| b1                                                                                | 6.9         | 6.3        | 22.8     | 6.7     | 7.4      | 16.2   | 7.1  | 8.9   | 17.1  | 8.5   | 14.7    | 8.2    |
| b5                                                                                | 7.5         | 6.2        | 22.8     | 6.6     | 7.3      | 15.7   | 7.0  | 8.7   | 17.0  | 8.5   | 14.5    | 9.1    |

* bluearchive.wiki: Beam 5 is worse due to extra use of kana. Learnt from the MiHoYo games?
* genshin5.1: Trained on 5.0, new audio from 5.1, possible minor overlap.
* nekopara: Hallucination test. anime-whisper would score better if not for increased hallucination. The OpenAI models are unusable.
* genshin/starrail: Mostly in the train set.
* reazon: Significantly higher CER from transcribing background/secondary audio.
* jsut: Surprisingly good?
* cv8: cv19 train includes some of cv8 test.
* cv19: No contamination, struggles with accents.
* jsl: Anime set.
* loopers: Anime set, has hallucination prone audio.
* tedxjp (tedx10 column): 10-video subset. See the comments under the test set. For this column, b1 rows are batched and b5 rows are sequential, both with beam_size=1, temperature=0, condition_on_previous_text=False.
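The long-form settings in the tedxjp note could be passed to faster-whisper roughly as follows. This is a sketch under assumptions: `long_form_kwargs`, `transcribe_long_form`, the model path, and `batch_size=8` are hypothetical and not taken from this card. `condition_on_previous_text` is passed only on the sequential path, since batched decoding processes chunks independently.

```python
def long_form_kwargs(batched: bool, batch_size: int = 8) -> dict:
    """Keyword arguments for the batched or sequential long-form run."""
    kwargs = {
        "language": "ja",
        "beam_size": 1,
        "temperature": 0,
    }
    if batched:
        kwargs["batch_size"] = batch_size  # assumption: value not stated in the card
    else:
        kwargs["condition_on_previous_text"] = False
    return kwargs


def transcribe_long_form(audio_path: str, model_path: str, batched: bool) -> str:
    """Transcribe a long recording (needs faster-whisper with batched support)."""
    from faster_whisper import BatchedInferencePipeline, WhisperModel

    model = WhisperModel(model_path)  # placeholder path to a converted model
    runner = BatchedInferencePipeline(model=model) if batched else model
    segments, _info = runner.transcribe(audio_path, **long_form_kwargs(batched))
    return "".join(segment.text for segment in segments)
```

Keeping the options in one helper makes it easy to see the one difference between the two runs.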