Long-Form Test Sets - a distil-whisper Collection

updated Mar 21

A collection of long-form (samples > 30s) datasets used to evaluate the Distil-Whisper models.

distil-whisper/earnings21

Updated Oct 13, 2023

Note: Config: "full", Split: "test"

distil-whisper/earnings22

Updated Oct 13, 2023

Note: Config: "full", Split: "test"

distil-whisper/meanwhile

Updated Oct 17, 2023

Note: Config: "default", Split: "test"

distil-whisper/rev16

Updated Oct 17, 2023

Note: Config: "whisper_subset", Split: "test". We evaluate on a subset of 16 of the 30 podcast episodes. The Whisper paper states that the audio and labels do not match in the other files, so they are excluded from the benchmark. This 16-file subset corresponds to the config "whisper_subset".

distil-whisper/tedlium-long-form

Updated May 22, 2023
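
The config and split notes above map directly onto arguments of `load_dataset` from the Hugging Face `datasets` library. The following is a minimal sketch, not an official loader: the mapping `LONG_FORM_TEST_SETS` and the helper `load_long_form_test_set` are hypothetical names introduced here for illustration, and streaming is assumed as the default only to avoid downloading full audio archives. distil-whisper/tedlium-long-form is left out because the collection gives no config or split note for it.

```python
from typing import Dict, Tuple

# Config and split for each dataset, copied from the notes in this collection.
# tedlium-long-form is omitted: the collection does not state its config or split.
LONG_FORM_TEST_SETS: Dict[str, Tuple[str, str]] = {
    "distil-whisper/earnings21": ("full", "test"),
    "distil-whisper/earnings22": ("full", "test"),
    "distil-whisper/meanwhile": ("default", "test"),
    "distil-whisper/rev16": ("whisper_subset", "test"),
}


def load_long_form_test_set(repo_id: str, streaming: bool = True):
    """Load one test set using the config and split noted in the collection.

    Hypothetical helper; streaming=True is an assumption made to avoid
    downloading every audio file up front.
    """
    # Imported lazily so the mapping above is usable without `datasets` installed.
    from datasets import load_dataset

    config, split = LONG_FORM_TEST_SETS[repo_id]
    return load_dataset(repo_id, config, split=split, streaming=streaming)
```

For example, `load_long_form_test_set("distil-whisper/rev16")` would request the 16-file "whisper_subset" config described in the note above. The call needs network access and the `datasets` package.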