---
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- pytorch
- audio
- speech
- automatic-speech-recognition
- whisper
- wav2vec2
model-index:
- name: whisper_large_v2_fp16_transformers
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: librispeech_asr
      name: LibriSpeech (clean)
      config: clean
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 0
      name: Test WER
      description: Word Error Rate
    - type: mer
      value: 0
      name: Test MER
      description: Match Error Rate
    - type: wil
      value: 0
      name: Test WIL
      description: Word Information Lost
    - type: wip
      value: 0
      name: Test WIP
      description: Word Information Preserved
    - type: cer
      value: 0
      name: Test CER
      description: Character Error Rate
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: librispeech_asr
      name: LibriSpeech (other)
      config: other
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 0
      name: Test WER
      description: Word Error Rate
    - type: mer
      value: 0
      name: Test MER
      description: Match Error Rate
    - type: wil
      value: 0
      name: Test WIL
      description: Word Information Lost
    - type: wip
      value: 0
      name: Test WIP
      description: Word Information Preserved
    - type: cer
      value: 0
      name: Test CER
      description: Character Error Rate
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: mozilla-foundation/common_voice_14_0
      name: Common Voice (14.0) (Hindi)
      config: hi
      split: test
      args:
        language: hi
    metrics:
    - type: wer
      value: 44.64
      name: Test WER
      description: Word Error Rate
    - type: mer
      value: 41.69
      name: Test MER
      description: Match Error Rate
    - type: wil
      value: 59.53
      name: Test WIL
      description: Word Information Lost
    - type: wip
      value: 40.46
      name: Test WIP
      description: Word Information Preserved
    - type: cer
      value: 16.80
      name: Test CER
      description: Character Error Rate
widget:
- example_title: Hinglish Sample
  src: https://huggingface.co/devasheeshG/whisper_large_v2_fp16_transformers/resolve/main/test.wav
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
language:
- en
- zh
- de
- es
- ru
- ko
- fr
- ja
- pt
- tr
- pl
- ca
- nl
- ar
- sv
- it
- id
- hi
- fi
- vi
- he
- uk
- el
- ms
- cs
- ro
- da
- hu
- ta
- "no"
- th
- ur
- hr
- bg
- lt
- la
- mi
- ml
- cy
- sk
- te
- fa
- lv
- bn
- sr
- az
- sl
- kn
- et
- mk
- br
- eu
- is
- hy
- ne
- mn
- bs
- kk
- sq
- sw
- gl
- mr
- pa
- si
- km
- sn
- yo
- so
- af
- oc
- ka
- be
- tg
- sd
- gu
- am
- yi
- lo
- uz
- fo
- ht
- ps
- tk
- nn
- mt
- sa
- lb
- my
- bo
- tl
- mg
- as
- tt
- haw
- ln
- ha
- ba
- jw
- su
---

## Versions:

- CUDA: 12.1
- cuDNN: 8.9.2.26_1.0-1_amd64
- tensorflow: 2.12.0
- torch: 2.1.0.dev20230606+cu12135
- transformers: 4.30.2
- accelerate: 0.20.3

## Model Benchmarks:

- RAM: 3 GB (Original_Model: 6 GB)
- VRAM: 3.7 GB (Original_Model: 11 GB)
- test.wav: 23 s (multilingual speech, i.e. English + Hindi)
- **Processing time in seconds on each device**

| Device Name       | float32 (Original) | float16 | CUDA cores | Tensor cores |
| ----------------- | ------------------ | ------- | ---------- | ------------ |
| 3060              | 2.2                | 1.3     | 3,584      | 112          |
| 1660 Super        | OOM                | 6       | 1,408      | N/A          |
| Colab (Tesla T4)  | -                  | -       | 2,560      | 320          |
| Colab (CPU)       | -                  | N/A     | N/A        | N/A          |
| M1 (CPU)          | -                  | -       | N/A        | N/A          |
| M1 (GPU -> 'mps') | -                  | -       | N/A        | N/A          |

- **NOTE: Tensor cores are efficient in mixed-precision calculations.**
- **CPU: `torch.float16` is not supported on CPU (AMD Ryzen 5 3600 or Colab CPU).**
- Punctuation: sometimes incorrect (the exact cause is not known).

## Model Error Benchmarks:

- **WER: Word Error Rate**
- **MER: Match Error Rate**
- **WIL: Word Information Lost**
- **WIP: Word Information Preserved**
- **CER: Character Error Rate**

### Hindi to Hindi (test.tsv) [Common Voice 14.0](https://commonvoice.mozilla.org/en/datasets)

**Test done on RTX 3060 on 1000 Samples**

|                         | WER   | MER   | WIL   | WIP   | CER   |
| ----------------------- | ----- | ----- | ----- | ----- | ----- |
| Original_Model (30 min) | 43.99 | 41.65 | 59.47 | 40.52 | 16.23 |
| This_Model (20 min)     | 44.64 | 41.69 | 59.53 | 40.46 | 16.80 |

### Hindi to English (test.csv) [Custom Dataset](https://huggingface.co/datasets/devasheeshG/common_voices_14_0_hi2en_hi2hi)

**Test done on RTX 3060 on 1000 Samples**

|                         | WER | MER | WIL | WIP | CER |
| ----------------------- | --- | --- | --- | --- | --- |
| Original_Model (30 min) | -   | -   | -   | -   | -   |
| This_Model (20 min)     | -   | -   | -   | -   | -   |

### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-clean)

**Test done on RTX 3060 on \_\_\_ Samples**

|                | WER | MER | WIL | WIP | CER |
| -------------- | --- | --- | --- | --- | --- |
| Original_Model | -   | -   | -   | -   | -   |
| This_Model     | -   | -   | -   | -   | -   |

### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-other)

**Test done on RTX 3060 on \_\_\_ Samples**

|                | WER | MER | WIL | WIP | CER |
| -------------- | --- | --- | --- | --- | --- |
| Original_Model | -   | -   | -   | -   | -   |
| This_Model     | -   | -   | -   | -   | -   |

- **The `jiwer` library is used for these calculations.**

## Code for conversion:

- [Will soon be uploaded to GitHub](https://github.com/devasheeshG)

## Usage

This repo contains an `__init__.py` file with all the code needed to use this model.

First, clone this repo and place all the files inside a folder.

### Make sure you have git-lfs installed (https://git-lfs.com)

```bash
git lfs install
git clone https://huggingface.co/devasheeshG/whisper_large_v2_fp16_transformers
```

**Please try this in a Jupyter notebook.**

```python
# Import the model wrapper and audio helpers from the cloned repo folder
from whisper_large_v2_fp16_transformers import Model, load_audio, pad_or_trim
```

```python
# Initialize the model from the cloned folder on the first CUDA device
model = Model(
    model_name_or_path='whisper_large_v2_fp16_transformers',
    cuda_visible_device="0",
    device='cuda',
)
```

```python
# Load the audio file, then pad or trim it with the repo's helper
audio = load_audio('whisper_large_v2_fp16_transformers/test.wav')
audio = pad_or_trim(audio)
```

```python
# Transcribe (the first transcription takes longer than later ones)
model.transcribe(audio)
```

## Credits

This is an fp16 version of `openai/whisper-large-v2`.
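
## Example: how the error metrics are defined

The benchmark tables in this card were produced with `jiwer`. To make the five metrics concrete, here is a minimal pure-Python sketch of their standard definitions, assuming whitespace tokenization and no text normalization. It is not the evaluation script used for this card, and the helper names `edit_counts` and `error_metrics` are hypothetical. The CER here counts every character, spaces included, which is also an assumption.

```python
def edit_counts(ref, hyp):
    """Return (hits, substitutions, deletions, insertions) for one
    minimum-cost alignment of two sequences (lists of words or strings)."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # Each cell holds (cost, hits, subs, dels, ins) for ref[:i] vs hyp[:j].
    table = [[(0, 0, 0, 0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = (i, 0, 0, i, 0)  # delete everything in ref[:i]
    for j in range(1, cols):
        table[0][j] = (j, 0, 0, 0, j)  # insert everything in hyp[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            c, h, s, d, n = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (c, h + 1, s, d, n)      # hit
            else:
                diagonal = (c + 1, h, s + 1, d, n)  # substitution
            c, h, s, d, n = table[i - 1][j]
            deletion = (c + 1, h, s, d + 1, n)
            c, h, s, d, n = table[i][j - 1]
            insertion = (c + 1, h, s, d, n + 1)
            # Keep the cheapest option; ties prefer the diagonal move.
            table[i][j] = min(diagonal, deletion, insertion,
                              key=lambda cell: cell[0])
    return table[-1][-1][1:]


def error_metrics(reference, hypothesis):
    """Compute WER, MER, WIL, WIP and CER as fractions in [0, 1] (WER and CER may exceed 1)."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    h, s, d, i = edit_counts(ref_words, hyp_words)
    errors = s + d + i
    wer = errors / (h + s + d)       # errors per reference word
    mer = errors / (h + s + d + i)   # errors per aligned position
    # WIP: share of reference words hit times share of hypothesis words hit.
    wip = (h / len(ref_words)) * (h / len(hyp_words)) if hyp_words else 0.0
    wil = 1.0 - wip

    # CER: the same edit-distance ratio, computed over characters.
    ch, cs, cd, ci = edit_counts(reference, hypothesis)
    cer = (cs + cd + ci) / (ch + cs + cd)

    return {"wer": wer, "mer": mer, "wil": wil, "wip": wip, "cer": cer}


metrics = error_metrics("the cat sat on the mat", "the cat sit on mat")
for name, value in metrics.items():
    # The tables in this card report these values as percentages.
    print(f"{name.upper()}: {100 * value:.2f}")
```

Since WIL is defined as 1 minus WIP, the two columns in the tables above sum to roughly 100 in each row.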