devasheeshG committed on
Commit
b289b78
1 Parent(s): 0838193

Upload 2 files

Files changed (2)
  1. README.md +344 -0
  2. __init__.py +125 -0
README.md ADDED
@@ -0,0 +1,344 @@
---
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- pytorch
- audio
- speech
- automatic-speech-recognition
- whisper
- wav2vec2

model-index:
- name: whisper_large_v2_fp16_transformers
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: librispeech_asr
      name: LibriSpeech (clean)
      config: clean
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 0
      name: Test WER
      description: Word Error Rate
    - type: mer
      value: 0
      name: Test MER
      description: Match Error Rate
    - type: wil
      value: 0
      name: Test WIL
      description: Word Information Lost
    - type: wip
      value: 0
      name: Test WIP
      description: Word Information Preserved
    - type: cer
      value: 0
      name: Test CER
      description: Character Error Rate

  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: librispeech_asr
      name: LibriSpeech (other)
      config: other
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 0
      name: Test WER
      description: Word Error Rate
    - type: mer
      value: 0
      name: Test MER
      description: Match Error Rate
    - type: wil
      value: 0
      name: Test WIL
      description: Word Information Lost
    - type: wip
      value: 0
      name: Test WIP
      description: Word Information Preserved
    - type: cer
      value: 0
      name: Test CER
      description: Character Error Rate

  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: mozilla-foundation/common_voice_14_0
      name: Common Voice (14.0) (Hindi)
      config: hi
      split: test
      args:
        language: hi
    metrics:
    - type: wer
      value: 44.64
      name: Test WER
      description: Word Error Rate
    - type: mer
      value: 41.69
      name: Test MER
      description: Match Error Rate
    - type: wil
      value: 59.53
      name: Test WIL
      description: Word Information Lost
    - type: wip
      value: 40.46
      name: Test WIP
      description: Word Information Preserved
    - type: cer
      value: 16.80
      name: Test CER
      description: Character Error Rate

widget:
- example_title: Hinglish Sample
  src: https://huggingface.co/devasheeshG/whisper_large_v2_fp16_transformers/resolve/main/test.wav
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
language:
- en
- zh
- de
- es
- ru
- ko
- fr
- ja
- pt
- tr
- pl
- ca
- nl
- ar
- sv
- it
- id
- hi
- fi
- vi
- he
- uk
- el
- ms
- cs
- ro
- da
- hu
- ta
- "no"
- th
- ur
- hr
- bg
- lt
- la
- mi
- ml
- cy
- sk
- te
- fa
- lv
- bn
- sr
- az
- sl
- kn
- et
- mk
- br
- eu
- is
- hy
- ne
- mn
- bs
- kk
- sq
- sw
- gl
- mr
- pa
- si
- km
- sn
- yo
- so
- af
- oc
- ka
- be
- tg
- sd
- gu
- am
- yi
- lo
- uz
- fo
- ht
- ps
- tk
- nn
- mt
- sa
- lb
- my
- bo
- tl
- mg
- as
- tt
- haw
- ln
- ha
- ba
- jw
- su
---
## Versions:

- CUDA: 12.1
- cuDNN Version: 8.9.2.26_1.0-1_amd64

- tensorflow Version: 2.12.0
- torch Version: 2.1.0.dev20230606+cu12135
- transformers Version: 4.30.2
- accelerate Version: 0.20.3

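
If you want to confirm that a local environment roughly matches the versions above, a minimal check (a sketch; it assumes `torch`, `transformers`, and `accelerate` are installed) is:

```python
# Print the locally installed versions for comparison with the list above.
import torch
import transformers
import accelerate

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("CUDA available:", torch.cuda.is_available(), "| CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
```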
## Model Benchmarks:

- RAM: 3 GB (Original_Model: 6 GB)
- VRAM: 3.7 GB (Original_Model: 11 GB)
- test.wav: 23 s (multilingual speech, i.e. English + Hindi)

- **Time in seconds for processing on each device**

| Device Name       | float32 (Original) | float16 | CUDA Cores | Tensor Cores |
| ----------------- | ------------------ | ------- | ---------- | ------------ |
| 3060              | 2.2                | 1.3     | 3,584      | 112          |
| 1660 Super        | OOM                | 6       | 1,408      | N/A          |
| Colab (Tesla T4)  | -                  | -       | 2,560      | 320          |
| Colab (CPU)       | -                  | N/A     | N/A        | N/A          |
| M1 (CPU)          | -                  | -       | N/A        | N/A          |
| M1 (GPU -> 'mps') | -                  | -       | N/A        | N/A          |

- **NOTE: Tensor Cores are efficient at mixed-precision calculations**
- **CPU: torch.float16 is not supported on CPU (AMD Ryzen 5 3600 or Colab CPU)**
- Punctuation: sometimes missing (I don't know the exact reason why this happens)
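
The timings above were taken on test.wav. A rough wall-clock sketch of how such a number can be reproduced (not the exact benchmarking script; it assumes the `Model`, `load_audio`, and `pad_or_trim` helpers described in the Usage section below):

```python
import time
from whisper_large_v2_fp16_transformers import Model, load_audio, pad_or_trim

model = Model(model_name_or_path='whisper_large_v2_fp16_transformers', device='cuda')
audio = pad_or_trim(load_audio('whisper_large_v2_fp16_transformers/test.wav'))

model.transcribe(audio)  # warm-up: the first call is noticeably slower

start = time.perf_counter()
model.transcribe(audio)
print(f"Transcription took {time.perf_counter() - start:.2f} s")
```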
## Model Error Benchmarks:

- **WER: Word Error Rate**
- **MER: Match Error Rate**
- **WIL: Word Information Lost**
- **WIP: Word Information Preserved**
- **CER: Character Error Rate**

### Hindi to Hindi (test.tsv) [Common Voice 14.0](https://commonvoice.mozilla.org/en/datasets)

**Test done on RTX 3060 on 1000 Samples**

|                         | WER   | MER   | WIL   | WIP   | CER   |
| ----------------------- | ----- | ----- | ----- | ----- | ----- |
| Original_Model (30 min) | 43.99 | 41.65 | 59.47 | 40.52 | 16.23 |
| This_Model (20 min)     | 44.64 | 41.69 | 59.53 | 40.46 | 16.80 |

### Hindi to English (test.csv) [Custom Dataset](https://huggingface.co/datasets/devasheeshG/common_voices_14_0_hi2en_hi2hi)

**Test done on RTX 3060 on 1000 Samples**

|                         | WER | MER | WIL | WIP | CER |
| ----------------------- | --- | --- | --- | --- | --- |
| Original_Model (30 min) | -   | -   | -   | -   | -   |
| This_Model (20 min)     | -   | -   | -   | -   | -   |

### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-clean)

**Test done on RTX 3060 on \_\_\_ Samples**

|                | WER | MER | WIL | WIP | CER |
| -------------- | --- | --- | --- | --- | --- |
| Original_Model | -   | -   | -   | -   | -   |
| This_Model     | -   | -   | -   | -   | -   |

### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-other)

**Test done on RTX 3060 on \_\_\_ Samples**

|                | WER | MER | WIL | WIP | CER |
| -------------- | --- | --- | --- | --- | --- |
| Original_Model | -   | -   | -   | -   | -   |
| This_Model     | -   | -   | -   | -   | -   |

- **The `jiwer` library is used for these calculations** (see the sketch below)
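
As a rough illustration (not the exact evaluation script used for the tables above), the five metrics can be computed with `jiwer` like this:

```python
import jiwer

# Toy reference/hypothesis pair just to show the API.
reference = "the weather is very nice today"
hypothesis = "the weather is nice today"

print("WER:", jiwer.wer(reference, hypothesis))
print("MER:", jiwer.mer(reference, hypothesis))
print("WIL:", jiwer.wil(reference, hypothesis))
print("WIP:", jiwer.wip(reference, hypothesis))
print("CER:", jiwer.cer(reference, hypothesis))
```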
## Code for conversion:

- ### [Will soon be uploaded to GitHub](https://github.com/devasheeshG)
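
Until that code is published, the block below is only a minimal sketch of one plausible way to produce an fp16 checkpoint of this kind with `transformers` (the output folder name is an assumption, and this is not necessarily the author's actual conversion script):

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the original checkpoint in float16 and save it back to disk.
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2", torch_dtype=torch.float16
)
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")

model.save_pretrained("whisper_large_v2_fp16_transformers")
processor.save_pretrained("whisper_large_v2_fp16_transformers")
```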
## Usage

This repo contains an `__init__.py` file with all the code needed to use this model.

First, clone this repo and place all the files inside a folder.

### Make sure you have git-lfs installed (https://git-lfs.com)

```bash
git lfs install
git clone https://huggingface.co/devasheeshG/whisper_large_v2_fp16_transformers
```

**Please try it in a Jupyter notebook**

```python
# Import the Model
from whisper_large_v2_fp16_transformers import Model, load_audio, pad_or_trim
```

```python
# Initialise the model
model = Model(
    model_name_or_path='whisper_large_v2_fp16_transformers',
    cuda_visible_device="0",
    device='cuda',
)
```

```python
# Load Audio
audio = load_audio('whisper_large_v2_fp16_transformers/test.wav')
audio = pad_or_trim(audio)
```

```python
# Transcribe (the first transcription takes longer)
model.transcribe(audio)
```
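
`transcribe` defaults to English. For Hindi audio (as in the benchmarks above), pass the language explicitly; this continues from the objects created above, and `hindi_sample.wav` is a hypothetical path:

```python
# Transcribe a Hindi recording (hypothetical file path)
audio_hi = pad_or_trim(load_audio('whisper_large_v2_fp16_transformers/hindi_sample.wav'))
model.transcribe(audio_hi, language="hindi")
```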
## Credits

This is the fp16 version of [`openai/whisper-large-v2`](https://huggingface.co/openai/whisper-large-v2).
__init__.py ADDED
@@ -0,0 +1,125 @@
from transformers import (
    WhisperForConditionalGeneration,
    WhisperProcessor,
    WhisperConfig,
)
import torch
import torch.nn.functional as F
import ffmpeg
import numpy as np
import os

# load_audio and pad_or_trim functions
SAMPLE_RATE = 16000
CHUNK_LENGTH = 30  # 30-second chunks
N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE  # 480000 samples in a 30-second chunk


# audio = whisper.load_audio('test.wav')
def load_audio(file: str, sr: int = SAMPLE_RATE, start_time: int = 0, dtype=np.float16):
    """
    Load an audio file into a numpy array at the specified sampling rate.
    """
    try:
        # This launches a subprocess to decode audio while down-mixing and resampling as necessary.
        # Requires the ffmpeg CLI and `ffmpeg-python` package to be installed.
        out, _ = (
            ffmpeg.input(file, ss=start_time, threads=0)
            .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sr)
            .run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True)
        )
    except ffmpeg.Error as e:
        raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e

    # return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0
    return np.frombuffer(out, np.int16).flatten().astype(dtype) / 32768.0


# audio = whisper.pad_or_trim(audio)
def pad_or_trim(array, length: int = N_SAMPLES, *, axis: int = -1):
    """
    Pad or trim the audio array to N_SAMPLES, as expected by the encoder.
    """
    if torch.is_tensor(array):
        if array.shape[axis] > length:
            array = array.index_select(
                dim=axis, index=torch.arange(length, device=array.device)
            )

        if array.shape[axis] < length:
            pad_widths = [(0, 0)] * array.ndim
            pad_widths[axis] = (0, length - array.shape[axis])
            array = F.pad(array, [pad for sizes in pad_widths[::-1] for pad in sizes])
    else:
        if array.shape[axis] > length:
            array = array.take(indices=range(length), axis=axis)

        if array.shape[axis] < length:
            pad_widths = [(0, 0)] * array.ndim
            pad_widths[axis] = (0, length - array.shape[axis])
            array = np.pad(array, pad_widths)

    return array


class Model:
    def __init__(
        self,
        model_name_or_path: str,
        cuda_visible_device: str = "0",
        device: str = "cuda",  # torch.device("cuda" if torch.cuda.is_available() else "cpu")
    ):
        os.environ["CUDA_VISIBLE_DEVICES"] = cuda_visible_device
        self.DEVICE = device

        self.processor = WhisperProcessor.from_pretrained(model_name_or_path)
        self.tokenizer = self.processor.tokenizer

        self.config = WhisperConfig.from_pretrained(model_name_or_path)

        self.model = WhisperForConditionalGeneration.from_pretrained(
            pretrained_model_name_or_path=model_name_or_path,
            torch_dtype=self.config.torch_dtype,
            # device_map=DEVICE,  # 'balanced', 'balanced_low_0', 'sequential', 'cuda', 'cpu'
            low_cpu_mem_usage=True,
        )

        # Move model to GPU if it is not already on the requested device
        if self.model.device.type != self.DEVICE:
            print(f"Moving model to {self.DEVICE}")
            self.model = self.model.to(self.DEVICE)
        else:
            print(f"Model is already on {self.DEVICE}")
        self.model.eval()

        print("dtype of model according to config: ", self.config.torch_dtype)
        print("dtype of loaded model: ", self.model.dtype)

    def transcribe(
        self, audio, language: str = "english", skip_special_tokens: bool = True
    ) -> str:
        input_features = (
            self.processor(audio, sampling_rate=SAMPLE_RATE, return_tensors="pt")
            .input_features.half()
            .to(self.DEVICE)
        )
        with torch.no_grad():
            predicted_ids = self.model.generate(
                input_features,
                num_beams=1,
                language=language,
                task="transcribe",
                use_cache=True,
                is_multilingual=True,
                return_timestamps=True,
            )

        transcription = self.tokenizer.batch_decode(
            predicted_ids, skip_special_tokens=skip_special_tokens
        )[0]
        return transcription.strip()