---
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- pytorch
- audio
- speech
- automatic-speech-recognition
- whisper
- wav2vec2

model-index:
- name: whisper_large_v2_fp16_transformers
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: librispeech_asr
      name: LibriSpeech (clean)
      config: clean
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 0
      name: Test WER
      description: Word Error Rate
    - type: mer
      value: 0
      name: Test MER
      description: Match Error Rate
    - type: wil
      value: 0
      name: Test WIL
      description: Word Information Lost
    - type: wip
      value: 0
      name: Test WIP
      description: Word Information Preserved
    - type: cer
      value: 0
      name: Test CER
      description: Character Error Rate

  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: librispeech_asr
      name: LibriSpeech (other)
      config: other
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 0
      name: Test WER
      description: Word Error Rate
    - type: mer
      value: 0
      name: Test MER
      description: Match Error Rate
    - type: wil
      value: 0
      name: Test WIL
      description: Word Information Lost
    - type: wip
      value: 0
      name: Test WIP
      description: Word Information Preserved
    - type: cer
      value: 0
      name: Test CER
      description: Character Error Rate

  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: mozilla-foundation/common_voice_14_0
      name: Common Voice (14.0) (Hindi)
      config: hi
      split: test
      args:
        language: hi
    metrics:
    - type: wer
      value: 44.64
      name: Test WER
      description: Word Error Rate
    - type: mer
      value: 41.69
      name: Test MER
      description: Match Error Rate
    - type: wil
      value: 59.53
      name: Test WIL
      description: Word Information Lost
    - type: wip
      value: 40.46
      name: Test WIP
      description: Word Information Preserved
    - type: cer
      value: 16.80
      name: Test CER
      description: Character Error Rate

widget:
- example_title: Hinglish Sample
  src: https://huggingface.co./devasheeshG/whisper_large_v2_fp16_transformers/resolve/main/test.wav
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac

language:
- en
- zh
- de
- es
- ru
- ko
- fr
- ja
- pt
- tr
- pl
- ca
- nl
- ar
- sv
- it
- id
- hi
- fi
- vi
- he
- uk
- el
- ms
- cs
- ro
- da
- hu
- ta
- "no"
- th
- ur
- hr
- bg
- lt
- la
- mi
- ml
- cy
- sk
- te
- fa
- lv
- bn
- sr
- az
- sl
- kn
- et
- mk
- br
- eu
- is
- hy
- ne
- mn
- bs
- kk
- sq
- sw
- gl
- mr
- pa
- si
- km
- sn
- yo
- so
- af
- oc
- ka
- be
- tg
- sd
- gu
- am
- yi
- lo
- uz
- fo
- ht
- ps
- tk
- nn
- mt
- sa
- lb
- my
- bo
- tl
- mg
- as
- tt
- haw
- ln
- ha
- ba
- jw
- su
---

## Versions:

- CUDA: 12.1
- cuDNN Version: 8.9.2.26_1.0-1_amd64

* tensorflow Version: 2.12.0
* torch Version: 2.1.0.dev20230606+cu12135
* transformers Version: 4.30.2
* accelerate Version: 0.20.3

## Model Benchmarks:

- RAM: 3 GB (Original_Model: 6 GB)
- VRAM: 3.7 GB (Original_Model: 11 GB)
- test.wav: 23 s (multilingual speech, i.e. English + Hindi)

- **Time in seconds for processing on each device**

| Device Name       | float32 (Original) | float16 | CudaCores | TensorCores |
| ----------------- | ------------------ | ------- | --------- | ----------- |
| 3060              | 2.2                | 1.3     | 3,584     | 112         |
| 1660 Super        | OOM                | 6       | 1,408     | N/A         |
| Colab (Tesla T4)  | -                  | -       | 2,560     | 320         |
| Colab (CPU)       | -                  | N/A     | N/A       | N/A         |
| M1 (CPU)          | -                  | -       | N/A       | N/A         |
| M1 (GPU -> 'mps') | -                  | -       | N/A       | N/A         |

- **NOTE: TensorCores are efficient in mixed-precision calculations**
- **CPU -> torch.float16 is not supported on CPU (AMD Ryzen 5 3600 or Colab CPU)**
- Punctuation: sometimes incorrect (the exact reason for this is not yet known)

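
The RAM figures above are roughly what the raw weights alone would occupy at each precision. The sketch below shows that arithmetic; `PARAM_COUNT` is an assumption (about 1.55 billion parameters, the size OpenAI reports for Whisper large models), and the helper names are hypothetical. Real usage also includes activations and framework overhead, which is why the measured VRAM is higher.

```python
# Rough weight-memory estimate for a model stored in different precisions.
# PARAM_COUNT is an assumption: roughly 1.55 billion parameters.
PARAM_COUNT = 1_550_000_000
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_size_gb(param_count: int, dtype: str) -> float:
    """Return the size of the raw weights in gigabytes (10**9 bytes)."""
    return param_count * BYTES_PER_PARAM[dtype] / 1e9


def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """Return how many times faster the new timing is than the baseline."""
    return baseline_seconds / new_seconds


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_size_gb(PARAM_COUNT, dtype):.1f} GB")
    # RTX 3060 timings for test.wav, taken from the table above.
    print(f"3060 speedup: {speedup(2.2, 1.3):.2f}x")
```

Under that assumption the weights come to about 6.2 GB in float32 and 3.1 GB in float16, and the 3060 timings correspond to a speedup of about 1.69x.
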
## Model Error Benchmarks:

- **WER: Word Error Rate**
- **MER: Match Error Rate**
- **WIL: Word Information Lost**
- **WIP: Word Information Preserved**
- **CER: Character Error Rate**

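
The five metrics above can all be derived from a minimum-edit alignment between a reference transcript and a hypothesis. The numbers in this card were produced with `jiwer`; the sketch below is a plain-Python illustration of the usual definitions, not `jiwer`'s implementation. The function names are hypothetical, and the character-level variant here counts spaces as characters, which is an assumption.

```python
def align_counts(ref, hyp):
    """Return (hits, substitutions, deletions, insertions) for one
    minimum-edit alignment of two token sequences."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the end to count each kind of operation.
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] == hyp[j - 1]:
                hits += 1
            else:
                subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return hits, subs, dels, ins


def error_metrics(reference: str, hypothesis: str) -> dict:
    """Compute WER, MER, WIL, WIP and CER as fractions (not percentages)."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    h, s, d, i = align_counts(ref_words, hyp_words)
    # WIP = (hits / reference length) * (hits / hypothesis length)
    wip = (h / (h + s + d)) * (h / (h + s + i)) if hyp_words else 0.0
    ch, cs, cd, ci = align_counts(list(reference), list(hypothesis))
    return {
        "wer": (s + d + i) / (h + s + d),
        "mer": (s + d + i) / (h + s + d + i),
        "wil": 1.0 - wip,
        "wip": wip,
        "cer": (cs + cd + ci) / (ch + cs + cd),
    }


if __name__ == "__main__":
    print(error_metrics("the cat sat on the mat", "the cat sit on mat"))
```

Since WIL is defined as 1 - WIP, the two columns in each results table should sum to roughly 100 when expressed as percentages.
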
### Hindi to Hindi (test.tsv) [Common Voice 14.0](https://commonvoice.mozilla.org/en/datasets)

**Test done on RTX 3060 on 1000 Samples**

|                         | WER   | MER   | WIL   | WIP   | CER   |
| ----------------------- | ----- | ----- | ----- | ----- | ----- |
| Original_Model (30 min) | 43.99 | 41.65 | 59.47 | 40.52 | 16.23 |
| This_Model (20 min)     | 44.64 | 41.69 | 59.53 | 40.46 | 16.80 |

### Hindi to English (test.csv) [Custom Dataset](https://huggingface.co./datasets/devasheeshG/common_voices_14_0_hi2en_hi2hi)

**Test done on RTX 3060 on 1000 Samples**

|                         | WER | MER | WIL | WIP | CER |
| ----------------------- | --- | --- | --- | --- | --- |
| Original_Model (30 min) | -   | -   | -   | -   | -   |
| This_Model (20 min)     | -   | -   | -   | -   | -   |

### English ([LibriSpeech](https://huggingface.co./datasets/librispeech_asr) -> test-clean)

**Test done on RTX 3060 on \_\_\_ Samples**

|                | WER | MER | WIL | WIP | CER |
| -------------- | --- | --- | --- | --- | --- |
| Original_Model | -   | -   | -   | -   | -   |
| This_Model     | -   | -   | -   | -   | -   |

### English ([LibriSpeech](https://huggingface.co./datasets/librispeech_asr) -> test-other)

**Test done on RTX 3060 on \_\_\_ Samples**

|                | WER | MER | WIL | WIP | CER |
| -------------- | --- | --- | --- | --- | --- |
| Original_Model | -   | -   | -   | -   | -   |
| This_Model     | -   | -   | -   | -   | -   |

- **'jiwer' library is used for calculations**

## Code for conversion:

- ### [Will soon be uploaded to GitHub](https://github.com/devasheeshG)

## Usage

This repo includes an `__init__.py` file that contains all the code needed to use this model.

First, clone this repo and keep all of its files in a single folder.

### Make sure you have git-lfs installed (https://git-lfs.com)

```bash
git lfs install
git clone https://huggingface.co./devasheeshG/whisper_large_v2_fp16_transformers
```

**Please try this in a Jupyter notebook**

```python
# Import the model
from whisper_large_v2_fp16_transformers import Model, load_audio, pad_or_trim
```

```python
# Initialize the model
model = Model(
    model_name_or_path='whisper_large_v2_fp16_transformers',
    cuda_visible_device="0",
    device='cuda',
)
```

```python
# Load audio
audio = load_audio('whisper_large_v2_fp16_transformers/test.wav')
audio = pad_or_trim(audio)
```

```python
# Transcribe (the first transcription takes longer)
model.transcribe(audio)
```

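
For readers wondering what the `pad_or_trim` step is for: Whisper models take 16 kHz audio in fixed 30-second windows, so shorter clips are zero-padded and longer ones are cut. The sketch below illustrates that idea on a plain Python list. It assumes the repo's `pad_or_trim` behaves like the OpenAI Whisper helper of the same name; `pad_or_trim_sketch` is a hypothetical stand-in, not the code in `__init__.py`.

```python
SAMPLE_RATE = 16_000                      # Hz, the rate Whisper models expect
CHUNK_SECONDS = 30                        # Whisper's fixed input window
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # 480,000 samples


def pad_or_trim_sketch(samples, length=N_SAMPLES):
    """Return a copy of `samples` that is exactly `length` items long.

    Hypothetical illustration: longer input is cut at `length`,
    shorter input is padded with zeros (silence) at the end.
    """
    if len(samples) >= length:
        return list(samples[:length])
    return list(samples) + [0.0] * (length - len(samples))


if __name__ == "__main__":
    # A 23-second clip such as test.wav has 23 * 16,000 samples.
    clip = [0.1] * (23 * SAMPLE_RATE)
    padded = pad_or_trim_sketch(clip)
    print(len(clip), "->", len(padded))
```

If the assumption holds, audio longer than 30 seconds would need to be split into windows before calling `model.transcribe`, since anything past the first window is dropped by this step.
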
## Credits

This is an fp16 version of ``openai/whisper-large-v2``.