tc-mb committed · Commit e91b0b6 · verified · 1 Parent(s): 6646dc4
README.md CHANGED
@@ -1 +1,1402 @@
1
- We are still uploading and the upload is completed today
1
+ ---
2
+ pipeline_tag: any-to-any
3
+ datasets:
4
+ - openbmb/RLAIF-V-Dataset
5
+ library_name: transformers
6
+ language:
7
+ - multilingual
8
+ tags:
9
+ - minicpm-o
10
+ - omni
11
+ - vision
12
+ - ocr
13
+ - multi-image
14
+ - video
15
+ - custom_code
16
+ - audio
17
+ - speech
18
+ - voice cloning
19
+ - live Streaming
20
+ - realtime speech conversation
21
+ - asr
22
+ - tts
23
+ ---
24
+
25
+ <h1>A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone</h1>
26
+
27
+ [GitHub](https://github.com/OpenBMB/MiniCPM-o) | [Online Demo](https://minicpm-omni-webdemo-us.modelbest.cn)
28
+
29
+
30
+ ## MiniCPM-o 2.6
31
+
32
+ **MiniCPM-o 2.6** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:
33
+
34
+ - 🔥 **Leading Visual Capability.**
35
+ MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows promising in-context learning capability.
36
+
37
+ - 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual real-time speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and STT translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc.
38
+
39
+ - 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support real-time speech interaction**. It **outperforms GPT-4o-202408 and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding.
40
+
41
+ - 💪 **Strong OCR Capability and Others.**
42
+ Advancing the popular visual capabilities of the MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405**.
43
+ Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages.
44
+
45
+
46
+ - 🚀 **Superior Efficiency.**
47
+ In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPad.
48
+
49
+ - 💫 **Easy Usage.**
50
+ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train.md), (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web demo on [server](https://minicpm-omni-webdemo-us.modelbest.cn/).
51
+
52
+
53
+ **Model Architecture.**
54
+
55
+ - **End-to-end Omni-modal Architecture.** Different modality encoder/decoders are connected and trained in an **end-to-end** fashion to fully exploit rich multimodal knowledge.
56
+ - **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoder/decoders into online ones for **streaming inputs/outputs**. (2) We devise a **time-division multiplexing (TDM) mechanism** for omni-modality streaming processing in the LLM backbone. It divides parallel omni-modality streams into sequential information within small periodic time slices (see the sketch after the framework figure below).
57
+ - **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including a traditional text system prompt and **a new audio system prompt that determines the assistant's voice**. This enables flexible voice configuration at inference time, and also facilitates end-to-end voice cloning and description-based voice creation.
58
+
59
+ <div align="center">
60
+ <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/minicpm-o-26-framework.png" width=80%>
61
+ </div>
62
+
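+
+ The live streaming mechanism can be pictured with a minimal, illustrative sketch (not the model's internal code): parallel per-second video frames and audio chunks are sliced into small periodic time units and fed to the LLM backbone as one sequential stream, mirroring the `get_video_chunk_content` helper in the usage section below. The function name `tdm_interleave` is ours, used only for illustration.
+
+ ```python
+ from typing import Any, List, Sequence
+
+ def tdm_interleave(video_frames: Sequence[Any], audio_chunks: Sequence[Any]) -> List[Any]:
+     """Flatten parallel omni-modality streams into sequential 1-second units (TDM-style)."""
+     units: List[Any] = []
+     for frame, audio in zip(video_frames, audio_chunks):  # one frame + 1s audio per time slice
+         units.extend(["<unit>", frame, audio])
+     return units
+ ```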
63
+
64
+ ### Evaluation <!-- omit in toc -->
65
+
66
+ <div align="center">
67
+ <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/radar.jpg" width=90% />
68
+ </div>
69
+
70
+ <details>
71
+ <summary>Click to view visual understanding results.</summary>
72
+
73
+ **Image Understanding**
74
+
75
+ <div align="center">
76
+ <table style="margin: 0px auto;">
77
+ <thead>
78
+ <tr>
79
+ <th align="left">Model</th>
80
+ <th>Size</th>
81
+ <th>Token Density<sup>+</sup></th>
82
+ <th>OpenCompass</th>
83
+ <th>OCRBench</th>
84
+ <th>MathVista mini</th>
85
+ <th>ChartQA</th>
86
+ <th>MMVet</th>
87
+ <th>MMStar</th>
88
+ <th>MME</th>
89
+ <th>MMB1.1 test</th>
90
+ <th>AI2D</th>
91
+ <th>MMMU val</th>
92
+ <th>HallusionBench</th>
93
+ <th>TextVQA val</th>
94
+ <th>DocVQA test</th>
95
+ <th>MathVerse mini</th>
96
+ <th>MathVision</th>
97
+ <th>MMHal Score</th>
98
+ </tr>
99
+ </thead>
100
+ <tbody align="center">
101
+ <tr>
102
+ <td colspan="19" align="left"><strong>Proprietary</strong></td>
103
+ </tr>
104
+ <tr>
105
+ <td nowrap="nowrap" align="left">GPT-4o-20240513</td>
106
+ <td>-</td>
107
+ <td>1088</td>
108
+ <td><u>69.9</u></td>
109
+ <td>736</td>
110
+ <td>61.3</td>
111
+ <td>85.7</td>
112
+ <td><strong>69.1</strong></td>
113
+ <td>63.9</td>
114
+ <td>2328.7</td>
115
+ <td>82.2</td>
116
+ <td>84.6</td>
117
+ <td><strong>69.2</strong></td>
118
+ <td><strong>55.0</strong></td>
119
+ <td>-</td>
120
+ <td>92.8</td>
121
+ <td><strong>50.2</strong></td>
122
+ <td><strong>30.4</strong></td>
123
+ <td><u>3.6</u></td>
124
+ </tr>
125
+ <tr>
126
+ <td nowrap="nowrap" align="left">Claude3.5-Sonnet</td>
127
+ <td>-</td>
128
+ <td>750</td>
129
+ <td>67.9</td>
130
+ <td>788</td>
131
+ <td>61.6</td>
132
+ <td><strong>90.8</strong></td>
133
+ <td>66.0</td>
134
+ <td>62.2</td>
135
+ <td>1920.0</td>
136
+ <td>78.5</td>
137
+ <td>80.2</td>
138
+ <td><u>65.9</u></td>
139
+ <td>49.9</td>
140
+ <td>-</td>
141
+ <td><strong>95.2</strong></td>
142
+ <td>-</td>
143
+ <td>-</td>
144
+ <td>3.4</td>
145
+ </tr>
146
+ <tr>
147
+ <td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
148
+ <td>-</td>
149
+ <td>-</td>
150
+ <td>64.4</td>
151
+ <td>754</td>
152
+ <td>57.7</td>
153
+ <td>81.3</td>
154
+ <td>64.0</td>
155
+ <td>59.1</td>
156
+ <td>2110.6</td>
157
+ <td>73.9</td>
158
+ <td>79.1</td>
159
+ <td>60.6</td>
160
+ <td>45.6</td>
161
+ <td>73.5</td>
162
+ <td>86.5</td>
163
+ <td>-</td>
164
+ <td>19.2</td>
165
+ <td>-</td>
166
+ </tr>
167
+ <tr>
168
+ <td nowrap="nowrap" align="left">GPT-4o-mini-20240718</td>
169
+ <td>-</td>
170
+ <td>1088</td>
171
+ <td>64.1</td>
172
+ <td>785</td>
173
+ <td>52.4</td>
174
+ <td>-</td>
175
+ <td>66.9</td>
176
+ <td>54.8</td>
177
+ <td>2003.4</td>
178
+ <td>76.0</td>
179
+ <td>77.8</td>
180
+ <td>60.0</td>
181
+ <td>46.1</td>
182
+ <td>-</td>
183
+ <td>-</td>
184
+ <td>-</td>
185
+ <td>-</td>
186
+ <td>3.3</td>
187
+ </tr>
188
+ <tr>
189
+ <td colspan="19" align="left"><strong>Open Source</strong></td>
190
+ </tr>
191
+ <tr>
192
+ <td nowrap="nowrap" align="left">Cambrian-34B</td>
193
+ <td>34B</td>
194
+ <td><u>1820</u></td>
195
+ <td>58.3</td>
196
+ <td>591</td>
197
+ <td>50.3</td>
198
+ <td>75.6</td>
199
+ <td>53.2</td>
200
+ <td>54.2</td>
201
+ <td>2049.9</td>
202
+ <td>77.8</td>
203
+ <td>79.5</td>
204
+ <td>50.4</td>
205
+ <td>41.6</td>
206
+ <td>76.7</td>
207
+ <td>75.5</td>
208
+ <td>-</td>
209
+ <td>-</td>
210
+ <td>-</td>
211
+ </tr>
212
+ <tr>
213
+ <td nowrap="nowrap" align="left">GLM-4V-9B</td>
214
+ <td>13B</td>
215
+ <td>784</td>
216
+ <td>59.1</td>
217
+ <td>776</td>
218
+ <td>51.1</td>
219
+ <td>-</td>
220
+ <td>58.0</td>
221
+ <td>54.8</td>
222
+ <td>2018.8</td>
223
+ <td>67.9</td>
224
+ <td>71.2</td>
225
+ <td>46.9</td>
226
+ <td>45.0</td>
227
+ <td>-</td>
228
+ <td>-</td>
229
+ <td>-</td>
230
+ <td>-</td>
231
+ <td>-</td>
232
+ </tr>
233
+ <tr>
234
+ <td nowrap="nowrap" align="left">Pixtral-12B</td>
235
+ <td>12B</td>
236
+ <td>256</td>
237
+ <td>61.0</td>
238
+ <td>685</td>
239
+ <td>56.9</td>
240
+ <td>81.8</td>
241
+ <td>58.5</td>
242
+ <td>54.5</td>
243
+ <td>-</td>
244
+ <td>72.7</td>
245
+ <td>79.0</td>
246
+ <td>51.1</td>
247
+ <td>47.0</td>
248
+ <td>75.7</td>
249
+ <td>90.7</td>
250
+ <td>-</td>
251
+ <td>-</td>
252
+ <td>-</td>
253
+ </tr>
254
+ <tr>
255
+ <td nowrap="nowrap" align="left">DeepSeek-VL2-27B (4B)</td>
256
+ <td>27B</td>
257
+ <td>672</td>
258
+ <td>66.4</td>
259
+ <td>809</td>
260
+ <td>63.9</td>
261
+ <td>86.0</td>
262
+ <td>60.0</td>
263
+ <td>61.9</td>
264
+ <td>2253.0</td>
265
+ <td>81.2</td>
266
+ <td>83.8</td>
267
+ <td>54.0</td>
268
+ <td>45.3</td>
269
+ <td><u>84.2</u></td>
270
+ <td>93.3</td>
271
+ <td>-</td>
272
+ <td>-</td>
273
+ <td>3.0</td>
274
+ </tr>
275
+ <tr>
276
+ <td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
277
+ <td>8B</td>
278
+ <td>784</td>
279
+ <td>67.1</td>
280
+ <td><u>866</u></td>
281
+ <td>58.2</td>
282
+ <td>83.0</td>
283
+ <td>62.0</td>
284
+ <td>60.7</td>
285
+ <td>2326.0</td>
286
+ <td>81.8</td>
287
+ <td>83.0</td>
288
+ <td>54.1</td>
289
+ <td>50.6</td>
290
+ <td><strong>84.3</strong></td>
291
+ <td><u>94.5</u></td>
292
+ <td>31.9</td>
293
+ <td>16.3</td>
294
+ <td>3.2</td>
295
+ </tr>
296
+ <tr>
297
+ <td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
298
+ <td>72B</td>
299
+ <td>182</td>
300
+ <td>68.1</td>
301
+ <td>741</td>
302
+ <td>67.5</td>
303
+ <td>83.7</td>
304
+ <td>60.6</td>
305
+ <td><strong>65.8</strong></td>
306
+ <td>2261.0</td>
307
+ <td><strong>85.0</strong></td>
308
+ <td><u>85.6</u></td>
309
+ <td>56.8</td>
310
+ <td>49.0</td>
311
+ <td>80.5</td>
312
+ <td>91.3</td>
313
+ <td>39.1</td>
314
+ <td>-</td>
315
+ <td>3.5</td>
316
+ </tr>
317
+ <tr>
318
+ <td nowrap="nowrap" align="left">InternVL2.5-8B</td>
319
+ <td>8B</td>
320
+ <td>706</td>
321
+ <td>68.3</td>
322
+ <td>822</td>
323
+ <td><u>64.4</u></td>
324
+ <td>84.8</td>
325
+ <td>62.8</td>
326
+ <td>62.8</td>
327
+ <td>2344.0</td>
328
+ <td><u>83.6</u></td>
329
+ <td>84.5</td>
330
+ <td>56.0</td>
331
+ <td>50.1</td>
332
+ <td>79.1</td>
333
+ <td>93.0</td>
334
+ <td>39.5</td>
335
+ <td>19.7</td>
336
+ <td>3.4</td>
337
+ </tr>
338
+ <tr>
339
+ <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
340
+ <td>8B</td>
341
+ <td><strong>2822</strong></td>
342
+ <td>65.2</td>
343
+ <td>852*</td>
344
+ <td>60.6</td>
345
+ <td>79.4</td>
346
+ <td>60.0</td>
347
+ <td>57.5</td>
348
+ <td><u>2348.4*</u></td>
349
+ <td>78.0</td>
350
+ <td>82.1</td>
351
+ <td>49.8*</td>
352
+ <td>48.1*</td>
353
+ <td>80.1</td>
354
+ <td>90.8</td>
355
+ <td>25.7</td>
356
+ <td>18.3</td>
357
+ <td>3.6</td>
358
+ </tr>
359
+ <tr>
360
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
361
+ <td>8B</td>
362
+ <td><strong>2822</strong></td>
363
+ <td><strong>70.2</strong></td>
364
+ <td><strong>897*</strong></td>
365
+ <td><strong>71.9*</strong></td>
366
+ <td><u>86.9*</u></td>
367
+ <td><u>67.5</u></td>
368
+ <td><u>64.0</u></td>
369
+ <td><strong>2372.0*</strong></td>
370
+ <td>80.5</td>
371
+ <td><strong>85.8</strong></td>
372
+ <td>50.4*</td>
373
+ <td><u>51.9</u></td>
374
+ <td>82.0</td>
375
+ <td>93.5</td>
376
+ <td><u>41.4*</u></td>
377
+ <td><u>23.1*</u></td>
378
+ <td><strong>3.8</strong></td>
379
+ </tr>
380
+ </tbody>
381
+ </table>
382
+ </div>
383
+ * We evaluate these benchmarks using chain-of-thought prompting. Specifically, for MME, we use this technique only for the Cognition set.
384
+
385
+
386
+ <sup>+</sup> Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.
387
+
388
+ Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.
389
+
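+
+ As a concrete check of this definition, the MiniCPM-o 2.6 entry in the table can be reproduced from the numbers quoted earlier in this card (a 1.8M-pixel maximum resolution encoded into 640 visual tokens); a quick sketch:
+
+ ```python
+ # Token density = # pixels at maximum resolution / # visual tokens
+ max_pixels = 1344 * 1344      # ~1.8 million pixels, the maximum resolution mentioned above
+ visual_tokens = 640           # visual tokens produced for such an image
+ print(round(max_pixels / visual_tokens))  # -> 2822, matching the table
+ ```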
390
+
391
+ **Multi-image and Video Understanding**
392
+
393
+ <div align="center">
394
+
395
+ <table style="margin: 0px auto;">
396
+ <thead>
397
+ <tr>
398
+ <th align="left">Model</th>
399
+ <th>Size</th>
400
+ <th>BLINK val</th>
401
+ <th>Mantis Eval</th>
402
+ <th>MIRB</th>
403
+ <th>Video-MME (wo / w subs)</th>
404
+ </tr>
405
+ </thead>
406
+ <tbody align="center">
407
+ <tr>
408
+ <td colspan="6" align="left"><strong>Proprietary</strong></td>
409
+ </tr>
410
+ <tr>
411
+ <td nowrap="nowrap" align="left">GPT-4o-20240513</td>
412
+ <td>-</td>
413
+ <td><strong>68.0</strong></td>
414
+ <td>-</td>
415
+ <td>-</td>
416
+ <td><strong>71.9/77.2</strong></td>
417
+ </tr>
418
+ <tr>
419
+ <td nowrap="nowrap" align="left">GPT-4V</td>
420
+ <td>-</td>
421
+ <td>54.6</td>
422
+ <td>62.7</td>
423
+ <td>53.1</td>
424
+ <td>59.9/63.3</td>
425
+ </tr>
426
+ <tr>
427
+ <td colspan="6" align="left"><strong>Open-source</strong></td>
428
+ </tr>
429
+ <tr>
430
+ <td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave 14B</td>
431
+ <td>14B</td>
432
+ <td>52.6</td>
433
+ <td>66.4</td>
434
+ <td>30.2</td>
435
+ <td>-</td>
436
+ </tr>
437
+ <tr>
438
+ <td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
439
+ <td>72B</td>
440
+ <td>55.4</td>
441
+ <td><strong>77.6</strong></td>
442
+ <td>-</td>
443
+ <td><u>66.2/69.5</u></td>
444
+ </tr>
445
+ <tr>
446
+ <td nowrap="nowrap" align="left">MANTIS 8B</td>
447
+ <td>8B</td>
448
+ <td>49.1</td>
449
+ <td>59.5</td>
450
+ <td>34.8</td>
451
+ <td>-</td>
452
+ </tr>
453
+ <tr>
454
+ <td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
455
+ <td>8B</td>
456
+ <td>53.2</td>
457
+ <td>69.6*</td>
458
+ <td><strong>67.6*</strong></td>
459
+ <td>63.3/69.0</td>
460
+ </tr>
461
+ <tr>
462
+ <td nowrap="nowrap" align="left">InternVL2.5-8B</td>
463
+ <td>8B</td>
464
+ <td>54.8</td>
465
+ <td>67.7</td>
466
+ <td>52.5</td>
467
+ <td>64.2/66.9</td>
468
+ </tr>
469
+ <tr>
470
+ <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
471
+ <td>8B</td>
472
+ <td>53.0</td>
473
+ <td>69.1</td>
474
+ <td>53.8</td>
475
+ <td>60.9/63.6</td>
476
+ </tr>
477
+ <tr>
478
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
479
+ <td>8B</td>
480
+ <td><u>56.7</u></td>
481
+ <td><u>71.9</u></td>
482
+ <td><u>58.6</u></td>
483
+ <td>63.9/67.9</td>
484
+ </tr>
485
+ </tbody>
486
+ </table>
487
+
488
+ </div>
489
+ * We evaluate the officially released checkpoints ourselves.
490
+
491
+ </details>
492
+
493
+
494
+ <details>
495
+ <summary>Click to view audio understanding and speech conversation results.</summary>
496
+
497
+ **Audio Understanding**
498
+
499
+ <div align="center">
500
+ <table style="margin: 0px auto;">
501
+ <thead>
502
+ <tr>
503
+ <th align="left">Task</th>
504
+ <th>Size</th>
505
+ <th colspan="3">ASR (zh)</th>
506
+ <th colspan="3">ASR (en)</th>
507
+ <th colspan="2">AST</th>
508
+ <th>Emotion</th>
509
+ </tr>
510
+ <tr>
511
+ <th align="left">Metric</th>
512
+ <td></td>
513
+ <th colspan="3">CER↓</th>
514
+ <th colspan="3">WER↓</th>
515
+ <th colspan="2">BLEU↑</th>
516
+ <th>ACC↑</th>
517
+ </tr>
518
+ <tr>
519
+ <th align="left">Dataset</th>
520
+ <td></td>
521
+ <th>AISHELL-1</th>
522
+ <th>Fleurs zh</th>
523
+ <th>WenetSpeech test-net</th>
524
+ <th>LibriSpeech test-clean</th>
525
+ <th>GigaSpeech</th>
526
+ <th>TED-LIUM</th>
527
+ <th>CoVoST en2zh</th>
528
+ <th>CoVoST zh2en</th>
529
+ <th>MELD emotion</th>
530
+ </tr>
531
+ </thead>
532
+ <tbody align="center">
533
+ <tr>
534
+ <td colspan="11" align="left"><strong>Proprietary</strong></td>
535
+ </tr>
536
+ <tr>
537
+ <td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
538
+ <td>-</td>
539
+ <td>7.3*</td>
540
+ <td><u>5.4*</u></td>
541
+ <td>28.9*</td>
542
+ <td>2.6*</td>
543
+ <td>12.9*</td>
544
+ <td>4.8*</td>
545
+ <td>37.1*</td>
546
+ <td>15.7*</td>
547
+ <td>33.2*</td>
548
+ </tr>
549
+ <tr>
550
+ <td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
551
+ <td>-</td>
552
+ <td>4.5*</td>
553
+ <td>5.9*</td>
554
+ <td>14.3*</td>
555
+ <td>2.9*</td>
556
+ <td>10.6*</td>
557
+ <td><strong>3.0*</strong></td>
558
+ <td><u>47.3*</u></td>
559
+ <td>22.6*</td>
560
+ <td>48.4*</td>
561
+ </tr>
562
+ <tr>
563
+ <td colspan="11" align="left"><strong>Open-Source</strong></td>
564
+ </tr>
565
+ <tr>
566
+ <td nowrap="nowrap" align="left">Qwen2-Audio-Base</td>
567
+ <td>8B</td>
568
+ <td>-</td>
569
+ <td>7.5</td>
570
+ <td>-</td>
571
+ <td><strong>1.6</strong></td>
572
+ <td>-</td>
573
+ <td>-</td>
574
+ <td>45.2</td>
575
+ <td><u>24.4</u></td>
576
+ <td><strong>55.3</strong></td>
577
+ </tr>
578
+ <tr>
579
+ <td nowrap="nowrap" align="left">Qwen2-Audio-Instruction</td>
580
+ <td>8B</td>
581
+ <td>2.6*</td>
582
+ <td>6.9*</td>
583
+ <td><u>10.3*</u></td>
584
+ <td>3.1*</td>
585
+ <td><u>9.7</u>*</td>
586
+ <td>5.9*</td>
587
+ <td>39.5*</td>
588
+ <td>22.9*</td>
589
+ <td>17.4*</td>
590
+ </tr>
591
+ <tr>
592
+ <td nowrap="nowrap" align="left">GLM-4-Voice-Base</td>
593
+ <td>9B</td>
594
+ <td><u>2.5</u></td>
595
+ <td>-</td>
596
+ <td>-</td>
597
+ <td>2.8</td>
598
+ <td>-</td>
599
+ <td>-</td>
600
+ <td>-</td>
601
+ <td>-</td>
+ <td>-</td>
602
+ </tr>
603
+ <tr>
604
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
605
+ <td>8B</td>
606
+ <td><strong>1.6</strong></td>
607
+ <td><strong>4.4</strong></td>
608
+ <td><strong>6.9</strong></td>
609
+ <td><u>1.7</u></td>
610
+ <td><strong>8.7</strong></td>
611
+ <td><strong>3.0</strong></td>
612
+ <td><strong>48.2</strong></td>
613
+ <td><strong>27.2</strong></td>
614
+ <td><u>52.4</u></td>
615
+ </tr>
616
+ </tbody>
617
+ </table>
618
+ </div>
619
+ * We evaluate the officially released checkpoints ourselves.<br><br>
620
+
621
+ **Speech Generation**
622
+
623
+ <div align="center">
624
+ <table style="margin: 0px auto;">
625
+ <thead>
626
+ <tr>
627
+ <th align="left">Task</th>
628
+ <th>Size</th>
629
+ <th colspan="9">SpeechQA</th>
630
+ </tr>
631
+ <tr>
632
+ <th align="left">Metric</th>
633
+ <th></th>
634
+ <th colspan="3">ACC↑</th>
635
+ <th>G-Eval (10 point)↑</th>
636
+ <th>Semantic ELO score↑</th>
637
+ <th>Acoustic ELO score↑</th>
638
+ <th>Overall ELO score↑</th>
639
+ <th>UTMOS↑</th>
640
+ <th>ASR-WER↓</th>
641
+ </tr>
642
+ <tr>
643
+ <th align="left">Dataset</th>
644
+ <th></th>
645
+ <th>Speech Llama Q.</th>
646
+ <th>Speech Web Q.</th>
647
+ <th>Speech Trivia QA</th>
648
+ <th>Speech AlpacaEval</th>
649
+ <th colspan="5">AudioArena</th>
650
+ </tr>
651
+ </thead>
652
+ <tbody align="center">
653
+ <tr>
654
+ <td colspan="11" align="left"><strong>Proprietary</strong></td>
655
+ </tr>
656
+ <tr>
657
+ <td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
658
+ <td></td>
659
+ <td><strong>71.7</strong></td>
660
+ <td><strong>51.6</strong></td>
661
+ <td><strong>69.7</strong></td>
662
+ <td><strong>7.4</strong></td>
663
+ <td><strong>1157</strong></td>
664
+ <td><strong>1203</strong></td>
665
+ <td><strong>1200</strong></td>
666
+ <td><strong>4.2</strong></td>
667
+ <td><strong>2.3</strong></td>
668
+ </tr>
669
+ <tr>
670
+ <td colspan="11" align="left"><strong>Open-Source</strong></td>
671
+ </tr>
672
+ <tr>
673
+ <td nowrap="nowrap" align="left">GLM-4-Voice</td>
674
+ <td>9B</td>
675
+ <td>50.0</td>
676
+ <td>32.0</td>
677
+ <td>36.4</td>
678
+ <td><u>5.1</u></td>
679
+ <td>999</td>
680
+ <td>1147</td>
681
+ <td>1035</td>
682
+ <td><u>4.1</u></td>
683
+ <td><u>11.7</u></td>
684
+ </tr>
685
+ <tr>
686
+ <td nowrap="nowrap" align="left">Llama-Omni</td>
687
+ <td>8B</td>
688
+ <td>45.3</td>
689
+ <td>22.9</td>
690
+ <td>10.7</td>
691
+ <td>3.9</td>
692
+ <td>960</td>
693
+ <td>878</td>
694
+ <td>897</td>
695
+ <td>3.2</td>
696
+ <td>24.3</td>
697
+ </tr>
698
+ <tr>
699
+ <td nowrap="nowrap" align="left">Moshi</td>
700
+ <td>7B</td>
701
+ <td>43.7</td>
702
+ <td>23.8</td>
703
+ <td>16.7</td>
704
+ <td>2.4</td>
705
+ <td>871</td>
706
+ <td>808</td>
707
+ <td>875</td>
708
+ <td>2.8</td>
709
+ <td>8.2</td>
710
+ </tr>
711
+ <tr>
712
+ <td nowrap="nowrap" align="left">Mini-Omni</td>
713
+ <td>1B</td>
714
+ <td>22.0</td>
715
+ <td>12.8</td>
716
+ <td>6.9</td>
717
+ <td>2.5</td>
718
+ <td>926</td>
719
+ <td>803</td>
720
+ <td>865</td>
721
+ <td>3.4</td>
722
+ <td>10.0</td>
723
+ </tr>
724
+ <tr>
725
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
726
+ <td>8B</td>
727
+ <td><u>61.0</u></td>
728
+ <td><u>40.0</u></td>
729
+ <td><u>40.2</u></td>
730
+ <td><u>5.1</u></td>
731
+ <td><u>1088</u></td>
732
+ <td><u>1163</u></td>
733
+ <td><u>1131</u></td>
734
+ <td><strong>4.2</strong></td>
735
+ <td>9.8</td>
736
+ </tr>
737
+ </tbody>
738
+ </table>
739
+ </div>
740
+ All results are from AudioEvals, and the evaluation methods along with further details can be found in <a href="https://github.com/OpenBMB/UltraEval-Audio" target="_blank">AudioEvals</a>.<br><br>
741
+
742
+ **End-to-end Voice Cloning**
743
+
744
+ <div align="center">
745
+ <table style="margin: 0px auto;">
746
+ <thead>
747
+ <tr>
748
+ <th align="left">Task</th>
749
+ <th colspan="2">Voice cloning</th>
750
+ </tr>
751
+ <tr>
752
+ <th align="left">Metric</th>
753
+ <th>SIMO↑</th>
754
+ <th>SIMO↑</th>
755
+ </tr>
756
+ <tr>
757
+ <th align="left">Dataset</th>
758
+ <th>Seed-TTS test-zh</th>
759
+ <th>Seed-TTS test-en</th>
760
+ </tr>
761
+ </thead>
762
+ <tbody align="center">
763
+ <tr>
764
+ <td nowrap="nowrap" align="left">F5-TTS</td>
765
+ <td><strong>76</strong></td>
766
+ <td><strong>67</strong></td>
767
+ </tr>
768
+ <tr>
769
+ <td nowrap="nowrap" align="left">CosyVoice</td>
770
+ <td><u>75</u></td>
771
+ <td><u>64</u></td>
772
+ </tr>
773
+ <tr>
774
+ <td nowrap="nowrap" align="left">FireRedTTS</td>
775
+ <td>63</td>
776
+ <td>46</td>
777
+ </tr>
778
+ <tr>
779
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
780
+ <td>57</td>
781
+ <td>47</td>
782
+ </tr>
783
+ </tbody>
784
+ </table>
785
+ </div>
786
+
787
+ </details>
788
+
789
+ <details>
790
+ <summary>Click to view multimodal live streaming results.</summary>
791
+
792
+ **Multimodal Live Streaming**: results on StreamingBench
793
+
794
+ <table style="margin: 0px auto;">
795
+ <thead>
796
+ <tr>
797
+ <th align="left">Model</th>
798
+ <th>Size</th>
799
+ <th>Real-Time Video Understanding</th>
800
+ <th>Omni-Source Understanding</th>
801
+ <th>Contextual Understanding</th>
802
+ <th>Overall</th>
803
+ </tr>
804
+ </thead>
805
+ <tbody align="center">
806
+ <tr>
807
+ <td colspan="6" align="left"><strong>Proprietary</strong></td>
808
+ </tr>
809
+ <tr>
810
+ <td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
811
+ <td>-</td>
812
+ <td><u>77.4</u></td>
813
+ <td><strong>67.8</strong></td>
814
+ <td><strong>51.1</strong></td>
815
+ <td><strong>70.3</strong></td>
816
+ </tr>
817
+ <tr>
818
+ <td nowrap="nowrap" align="left">GPT-4o-202408</td>
819
+ <td>-</td>
820
+ <td>74.5</td>
821
+ <td>51.0</td>
822
+ <td><u>48.0</u></td>
823
+ <td>64.1</td>
824
+ </tr>
825
+ <tr>
826
+ <td nowrap="nowrap" align="left">Claude-3.5-Sonnet</td>
827
+ <td>-</td>
828
+ <td>74.0</td>
829
+ <td>41.4</td>
830
+ <td>37.8</td>
831
+ <td>59.7</td>
832
+ </tr>
833
+ <tr>
834
+ <td colspan="6" align="left"><strong>Open-source</strong></td>
835
+ </tr>
836
+ <tr>
837
+ <td nowrap="nowrap" align="left">VILA-1.5</td>
838
+ <td>8B</td>
839
+ <td>61.5</td>
840
+ <td>37.5</td>
841
+ <td>26.7</td>
842
+ <td>49.5</td>
843
+ </tr>
844
+ <tr>
845
+ <td nowrap="nowrap" align="left">LongVA</td>
846
+ <td>7B</td>
847
+ <td>63.1</td>
848
+ <td>35.9</td>
849
+ <td>30.2</td>
850
+ <td>50.7</td>
851
+ </tr>
852
+ <tr>
853
+ <td nowrap="nowrap" align="left">LLaVA-Next-Video-34B</td>
854
+ <td>34B</td>
855
+ <td>69.8</td>
856
+ <td>41.7</td>
857
+ <td>34.3</td>
858
+ <td>56.7</td>
859
+ </tr>
860
+ <tr>
861
+ <td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
862
+ <td>8B</td>
863
+ <td>71.2</td>
864
+ <td>40.7</td>
865
+ <td>33.1</td>
866
+ <td>57.0</td>
867
+ </tr>
868
+ <tr>
869
+ <td nowrap="nowrap" align="left">InternVL2-8B</td>
870
+ <td>8B</td>
871
+ <td>70.1</td>
872
+ <td>42.7</td>
873
+ <td>34.1</td>
874
+ <td>57.0</td>
875
+ </tr>
876
+ <tr>
877
+ <td nowrap="nowrap" align="left">VITA-1.5</td>
878
+ <td>8B</td>
879
+ <td>70.9</td>
880
+ <td>40.8</td>
881
+ <td>35.8</td>
882
+ <td>57.4</td>
883
+ </tr>
884
+ <tr>
885
+ <td nowrap="nowrap" align="left">LLaVA-OneVision-7B</td>
886
+ <td>8B</td>
887
+ <td>74.3</td>
888
+ <td>40.8</td>
889
+ <td>31.0</td>
890
+ <td>58.4</td>
891
+ </tr>
892
+ <tr>
893
+ <td nowrap="nowrap" align="left">InternLM-XC2.5-OL-7B</td>
894
+ <td>8B</td>
895
+ <td>75.4</td>
896
+ <td>46.2</td>
897
+ <td>33.6</td>
898
+ <td>60.8</td>
899
+ </tr>
900
+ <tr>
901
+ <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
902
+ <td>8B</td>
903
+ <td>72.4</td>
904
+ <td>40.2</td>
905
+ <td>33.4</td>
906
+ <td>57.7</td>
907
+ </tr>
908
+ <tr>
909
+ <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
910
+ <td>8B</td>
911
+ <td><strong>79.9</strong></td>
912
+ <td><u>53.4</u></td>
913
+ <td>38.5</td>
914
+ <td><u>66.0</u></td>
915
+ </tr>
916
+ </tbody>
917
+ </table>
918
+
919
+ </details>
920
+
921
+
922
+ ### Examples <!-- omit in toc -->
923
+
924
+ We deploy MiniCPM-o 2.6 on end devices. The demo video is an unedited, raw-speed recording on an iPad Pro and a web demo.
925
+
926
+ <div align="center">
927
+ <a href="https://youtu.be/JFJg9KZ_iZk"><img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/o-2dot6-demo-video-preview.png" width=70%></a>
928
+ </div>
929
+
930
+ <br>
931
+
932
+
933
+ <div style="display: flex; flex-direction: column; align-items: center;">
934
+ <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/minicpmo2_6/minicpmo2_6_math_intersect.png" alt="math" style="margin-bottom: 5px;">
935
+ <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/minicpmo2_6/minicpmo2_6_diagram_train_NN.png" alt="diagram" style="margin-bottom: 5px;">
936
+ <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/minicpmo2_6/minicpmo2_6_multi-image_bike.png" alt="bike" style="margin-bottom: 5px;">
937
+ </div>
938
+
939
+
940
+
941
+
942
+ ## Online Demo
943
+ Click here to try the online demo of [MiniCPM-o 2.6](https://minicpm-omni-webdemo-us.modelbest.cn).
944
+
945
+
946
+ ## Usage
947
+ Inference using Hugging Face Transformers on NVIDIA GPUs. Requirements tested on Python 3.10:
948
+ ```
949
+ Pillow==10.1.0
950
+ torch==2.2.0
951
+ torchaudio==2.2.0
952
+ torchvision==0.17.0
953
+ transformers==4.44.2
954
+ librosa==0.9.0
955
+ soundfile==0.12.1
956
+ vector-quantize-pytorch==1.18.5
957
+ vocos==0.1.0
958
+ decord
959
+ moviepy
960
+ ```
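+
+ A quick way to confirm that the pinned versions above are what is actually installed (a convenience sketch, not part of the official setup):
+
+ ```python
+ # Print the installed version of each pinned requirement
+ import importlib.metadata as md
+
+ for pkg in ["Pillow", "torch", "torchaudio", "torchvision", "transformers",
+             "librosa", "soundfile", "vector-quantize-pytorch", "vocos", "decord", "moviepy"]:
+     try:
+         print(f"{pkg}=={md.version(pkg)}")
+     except md.PackageNotFoundError:
+         print(f"{pkg} is not installed")
+ ```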
961
+
962
+
963
+ ### Model initialization
964
+ ```python
965
+
966
+ import torch
967
+ from PIL import Image
968
+ from transformers import AutoModel, AutoTokenizer
969
+
970
+ # Load the omni model by default; init_vision/init_audio/init_tts all default to True.
971
+ # To load a vision-only model, set init_audio=False and init_tts=False.
972
+ # To load an audio-only model, set init_vision=False.
973
+ model = AutoModel.from_pretrained(
974
+ 'openbmb/MiniCPM-o-2_6',
975
+ trust_remote_code=True,
976
+ attn_implementation='sdpa', # sdpa or flash_attention_2
977
+ torch_dtype=torch.bfloat16,
978
+ init_vision=True,
979
+ init_audio=True,
980
+ init_tts=True
981
+ )
982
+
983
+
984
+ model = model.eval().cuda()
985
+ tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
986
+
987
+ # Except in vision-only mode, the TTS processor and vocos also need to be initialized
988
+ model.init_tts()
989
+ model.tts.float()
990
+ ```
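+
+ If you only need one modality, the same entry point can be used with the corresponding `init_*` flags turned off, following the comments in the snippet above (a minimal sketch):
+
+ ```python
+ import torch
+ from transformers import AutoModel
+
+ # Vision-only: skip the audio encoder and the TTS head
+ model = AutoModel.from_pretrained(
+     'openbmb/MiniCPM-o-2_6',
+     trust_remote_code=True,
+     attn_implementation='sdpa',
+     torch_dtype=torch.bfloat16,
+     init_vision=True,
+     init_audio=False,
+     init_tts=False
+ )
+
+ # Audio-only: skip the vision encoder by passing init_vision=False instead
+ ```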
991
+ ### Omni mode
992
+ We provide two inference modes: chat and streaming.
993
+
994
+ #### Chat inference
995
+ ```python
996
+ import math
997
+ import numpy as np
998
+ from PIL import Image
999
+ from moviepy.editor import VideoFileClip
1000
+ import tempfile
1001
+ import librosa
1002
+ import soundfile as sf
1003
+
1004
+ def get_video_chunk_content(video_path, flatten=True):
1005
+ video = VideoFileClip(video_path)
1006
+ print('video_duration:', video.duration)
1007
+
1008
+ with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
1009
+ temp_audio_file_path = temp_audio_file.name
1010
+ video.audio.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000)
1011
+ audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True)
1012
+ num_units = math.ceil(video.duration)
1013
+
1014
+ # 1 frame + 1s audio chunk
1015
+ contents= []
1016
+ for i in range(num_units):
1017
+ frame = video.get_frame(i+1)
1018
+ image = Image.fromarray((frame).astype(np.uint8))
1019
+ audio = audio_np[sr*i:sr*(i+1)]
1020
+ if flatten:
1021
+ contents.extend(["<unit>", image, audio])
1022
+ else:
1023
+ contents.append(["<unit>", image, audio])
1024
+
1025
+ return contents
1026
+
1027
+ video_path="/path/to/video"
1028
+ # To use a voice-clone prompt, set ref_audio
1029
+ ref_audio_path = 'assets/demo.wav'
1030
+ ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)
1031
+ sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en')
1032
+ # or use default prompt
1033
+ # sys_msg = model.get_sys_prompt(mode='omni', language='en')
1034
+
1035
+ contents = get_video_chunk_content(video_path)
1036
+ msg = {"role":"user", "content": contents}
1037
+ msgs = [sys_msg, msg]
1038
+
1039
+ # please set generate_audio=True and output_audio_path to save the tts result
1040
+ generate_audio = True
1041
+ output_audio_path = 'output.wav'
1042
+
1043
+ res = model.chat(
1044
+ msgs=msgs,
1045
+ tokenizer=tokenizer,
1046
+ sampling=True,
1047
+ temperature=0.5,
1048
+ max_new_tokens=4096,
1049
+ omni_input=True, # set omni_input=True for omni inference
1050
+ use_tts_template=True,
1051
+ generate_audio=generate_audio,
1052
+ output_audio_path=output_audio_path,
1053
+ max_slice_nums=1,
1054
+ use_image_id=False,
1055
+ return_dict=True
1056
+ )
1057
+ print(res)
1058
+ ```
1059
+ #### Streaming inference
1060
+ ```python
1061
+ # A new conversation needs to reset the session first; this resets the KV cache
1062
+ model.reset_session()
1063
+
1064
+ contents = get_video_chunk_content(video_path, flatten=False)
1065
+ session_id = '123'
1066
+ generate_audio = True
1067
+
1068
+ # 1. prefill system prompt
1069
+ res = model.streaming_prefill(
1070
+ session_id=session_id,
1071
+ msgs=[sys_msg],
1072
+ tokenizer=tokenizer
1073
+ )
1074
+
1075
+ # 2. prefill video/audio chunks
1076
+ for content in contents:
1077
+ msgs = [{"role":"user", "content": content}]
1078
+ res = model.streaming_prefill(
1079
+ session_id=session_id,
1080
+ msgs=msgs,
1081
+ tokenizer=tokenizer
1082
+ )
1083
+
1084
+ # 3. generate
1085
+ res = model.streaming_generate(
1086
+ session_id=session_id,
1087
+ tokenizer=tokenizer,
1088
+ temperature=0.5,
1089
+ generate_audio=generate_audio
1090
+ )
1091
+
1092
+ audios = []
1093
+ text = ""
1094
+
1095
+ if generate_audio:
1096
+ for r in res:
1097
+ audio_wav = r.audio_wav
1098
+ sampling_rate = r.sampling_rate
1099
+ txt = r.text
1100
+
1101
+ audios.append(audio_wav)
1102
+ text += txt
1103
+
1104
+ res = np.concatenate(audios)
1105
+ sf.write("output.wav", res, samplerate=sampling_rate)
1106
+ print("text:", text)
1107
+ print("audio saved to output.wav")
1108
+ else:
1109
+ for r in res:
1110
+ text += r['text']
1111
+ print("text:", text)
1112
+
1113
+ ```
1114
+
1115
+ ### Audio-Only mode
1116
+ #### Mimick
1117
+ The `Mimick` task reflects a model's end-to-end speech modeling capability. The model takes audio input, outputs an ASR transcription, and then reconstructs the original audio with high similarity. The higher the similarity between the reconstructed audio and the original, the stronger the model's foundational capability in end-to-end speech modeling.
1118
+ ```python
1119
+ mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
1120
+ audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)
1121
+ msgs = [{'role': 'user', 'content': [mimick_prompt,audio_input]}]
1122
+
1123
+ res = model.chat(
1124
+ msgs=msgs,
1125
+ tokenizer=tokenizer,
1126
+ sampling=True,
1127
+ max_new_tokens=128,
1128
+ use_tts_template=True,
1129
+ temperature=0.3,
1130
+ generate_audio=True,
1131
+ output_audio_path='output.wav', # save the tts result to output_audio_path
1132
+ )
1133
+ ```
1134
+
1135
+ #### General Speech Conversation with Configurable Voices
1136
+ <details> <summary>Click to view the Python code for enabling MiniCPM-o 2.6 to interact with you in a specified voice.</summary>
1137
+
1138
+ ```python
1139
+ ref_audio, _ = librosa.load('assets/demo.wav', sr=16000, mono=True) # load the reference audio
1140
+
1141
+ # Choose the mode you want to use
1142
+ # Audio RolePlay: with this mode, the model will role-play the character based on the audio prompt. (More human-like conversation, but unstable)
1143
+ # sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en')
1144
+ # user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
1145
+
1146
+ # Audio Assistant: with this mode, the model will speak with the voice in ref_audio as an AI assistant. (Stable and more suitable for general conversation)
1147
+ sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en')
1148
+ user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # Try to ask something by recording it in 'xxx.wav'!!!
1149
+ ```
1150
+ ```python
1151
+ msgs = [sys_prompt, user_question]
1152
+ # round one
1153
+ res = model.chat(
1154
+ msgs=msgs,
1155
+ tokenizer=tokenizer,
1156
+ sampling=True,
1157
+ max_new_tokens=128,
1158
+ use_tts_template=True,
1159
+ generate_audio=True,
1160
+ temperature=0.3,
1161
+ output_audio_path='result.wav',
1162
+ )
1163
+
1164
+ # round two
1165
+ # list.append returns None, so extend msgs in place instead of reassigning it
+ msgs.append({'role': 'assistant', 'content': res})
+ user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
+ msgs.append(user_question)
1168
+ res = model.chat(
1169
+ msgs=msgs,
1170
+ tokenizer=tokenizer,
1171
+ sampling=True,
1172
+ max_new_tokens=128,
1173
+ use_tts_template=True,
1174
+ generate_audio=True,
1175
+ temperature=0.3,
1176
+ output_audio_path='result_round_2.wav',
1177
+ )
1178
+ print(res)
1179
+ ```
1180
+
1181
+ </details>
1182
+
1183
+ #### Addressing various audio tasks
1184
+ <details>
1185
+ <summary> Click to show Python code running MiniCPM-o 2.6 on specific audio QA tasks. </summary>
1186
+
1187
+ ```python
1188
+ '''
1189
+ Audio Understanding Task Prompt:
1190
+ Speech:
1191
+ ASR with ZH(same as AST en2zh): 请仔细听这段音频片段,并将其内容逐字记录。
1192
+ ASR with EN(same as AST zh2en): Please listen to the audio snippet carefully and transcribe the content.
1193
+ Speaker Analysis: Based on the speaker's content, speculate on their gender, condition, age range, and health status.
1194
+ General Audio:
1195
+ Audio Caption: Summarize the main content of the audio.
1196
+ Sound Scene Tagging: Utilize one keyword to convey the audio's content or the associated scene.
1197
+ '''
1198
+ task_prompt = "" # Choose the task prompt above
1199
+ audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)
1200
+
1201
+ msgs = [{'role': 'user', 'content': [task_prompt,audio_input]}]
1202
+
1203
+ res = model.chat(
1204
+ msgs=msgs,
1205
+ tokenizer=tokenizer,
1206
+ sampling=True,
1207
+ max_new_tokens=128,
1208
+ use_tts_template=True,
1209
+ generate_audio=True,
1210
+ temperature=0.3,
1211
+ output_audio_path='result.wav',
1212
+ )
1213
+ print(res)
1214
+ ```
1215
+ ```python
1216
+ '''
1217
+ Speech Generation Task Prompt:
1218
+ Human Instruction-to-Speech: see https://voxinstruct.github.io/VoxInstruct/
1219
+ Example:
1220
+ # 在新闻中,一个年轻男性兴致勃勃地说:“祝福亲爱的祖国母亲美丽富强!”他用低音调和低音量,慢慢地说出了这句话。
1221
+ # Delighting in a surprised tone, an adult male with low pitch and low volume comments:"One even gave my little dog a biscuit" This dialogue takes place at a leisurely pace, delivering a sense of excitement and surprise in the context.
1222
+
1223
+ Voice Cloning or Voice Conversion: With this mode, model will act like a TTS model.
1224
+ '''
1225
+ # Human Instruction-to-Speech:
1226
+ task_prompt = '' #Try to make some Human Instruction-to-Speech prompt (Voice Creation)
1227
+ msgs = [{'role': 'user', 'content': [task_prompt]}] # you can also try to ask the same audio question
1228
+
1229
+ # Voice Cloning mode:
1230
+ # sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')
1231
+ # text_prompt = f"Please read the text below."
1232
+ # user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]} # using same voice in sys_prompt to read the text. (Voice Cloning)
1233
+ # user_question = {'role': 'user', 'content': [text_prompt, librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # using same voice in sys_prompt to read 'xxx.wav'. (Voice Conversion)
1234
+ # msgs = [sys_prompt, user_question]
1235
+
1236
+ res = model.chat(
1237
+ msgs=msgs,
1238
+ tokenizer=tokenizer,
1239
+ sampling=True,
1240
+ max_new_tokens=128,
1241
+ use_tts_template=True,
1242
+ generate_audio=True,
1243
+ temperature=0.3,
1244
+ output_audio_path='result.wav',
1245
+ )
1246
+
1247
+
1248
+ ```
1249
+
1250
+ </details>
1251
+
1252
+ ### Vision-Only mode
1253
+
1254
+ `MiniCPM-o-2_6` supports the same inference methods as `MiniCPM-V-2_6`.
1255
+
1256
+ #### Chat with single image
1257
+ ```python
1258
+ # test.py
1259
+ image = Image.open('xx.jpg').convert('RGB')
1260
+ question = 'What is in the image?'
1261
+ msgs = [{'role': 'user', 'content': [image, question]}]
1262
+ res = model.chat(
1263
+ image=None,
1264
+ msgs=msgs,
1265
+ tokenizer=tokenizer
1266
+ )
1267
+ print(res)
1268
+
1269
+ ## if you want to use streaming, please make sure sampling=True and stream=True
1270
+ ## the model.chat will return a generator
1271
+ res = model.chat(
1272
+ msgs=msgs,
1273
+ tokenizer=tokenizer,
1274
+ sampling=True,
1275
+ stream=True
1276
+ )
1277
+ generated_text = ""
1278
+ for new_text in res:
1279
+ generated_text += new_text
1280
+ print(new_text, flush=True, end='')
1281
+ ```
1282
+
1283
+ #### Chat with multiple images
1284
+ <details>
1285
+ <summary> Click to show Python code running MiniCPM-o 2.6 with multiple images input. </summary>
1286
+
1287
+ ```python
1288
+ image1 = Image.open('image1.jpg').convert('RGB')
1289
+ image2 = Image.open('image2.jpg').convert('RGB')
1290
+ question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
1291
+ msgs = [{'role': 'user', 'content': [image1, image2, question]}]
1292
+ answer = model.chat(
1293
+ msgs=msgs,
1294
+ tokenizer=tokenizer
1295
+ )
1296
+ print(answer)
1297
+ ```
1298
+ </details>
1299
+
1300
+ #### In-context few-shot learning
1301
+ <details>
1302
+ <summary> Click to view Python code running MiniCPM-o 2.6 with few-shot input. </summary>
1303
+
1304
+ ```python
1305
+ question = "production date"
1306
+ image1 = Image.open('example1.jpg').convert('RGB')
1307
+ answer1 = "2023.08.04"
1308
+ image2 = Image.open('example2.jpg').convert('RGB')
1309
+ answer2 = "2007.04.24"
1310
+ image_test = Image.open('test.jpg').convert('RGB')
1311
+ msgs = [
1312
+ {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
1313
+ {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
1314
+ {'role': 'user', 'content': [image_test, question]}
1315
+ ]
1316
+ answer = model.chat(
1317
+ msgs=msgs,
1318
+ tokenizer=tokenizer
1319
+ )
1320
+ print(answer)
1321
+ ```
1322
+ </details>
1323
+
1324
+ #### Chat with video
1325
+ <details>
1326
+ <summary> Click to view Python code running MiniCPM-o 2.6 with video input. </summary>
1327
+
1328
+ ```python
1329
+ from decord import VideoReader, cpu # video decoding backend used below
+
+ MAX_NUM_FRAMES = 64 # if CUDA OOM, set a smaller number
1330
+ def encode_video(video_path):
1331
+ def uniform_sample(l, n):
1332
+ gap = len(l) / n
1333
+ idxs = [int(i * gap + gap / 2) for i in range(n)]
1334
+ return [l[i] for i in idxs]
1335
+ vr = VideoReader(video_path, ctx=cpu(0))
1336
+ sample_fps = round(vr.get_avg_fps() / 1) # FPS
1337
+ frame_idx = [i for i in range(0, len(vr), sample_fps)]
1338
+ if len(frame_idx) > MAX_NUM_FRAMES:
1339
+ frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
1340
+ frames = vr.get_batch(frame_idx).asnumpy()
1341
+ frames = [Image.fromarray(v.astype('uint8')) for v in frames]
1342
+ print('num frames:', len(frames))
1343
+ return frames
1344
+ video_path ="video_test.mp4"
1345
+ frames = encode_video(video_path)
1346
+ question = "Describe the video"
1347
+ msgs = [
1348
+ {'role': 'user', 'content': frames + [question]},
1349
+ ]
1350
+ # Set decode params for video
1351
+ params={}
1352
+ params["use_image_id"] = False
1353
+ params["max_slice_nums"] = 2 # use 1 if cuda OOM and video resolution > 448*448
1354
+ answer = model.chat(
1355
+ msgs=msgs,
1356
+ tokenizer=tokenizer,
1357
+ **params
1358
+ )
1359
+ print(answer)
1360
+ ```
1361
+ </details>
1362
+
1363
+ Please see [GitHub](https://github.com/OpenBMB/MiniCPM-o) for more details about usage.
1364
+
1365
+
1366
+ ## Inference with llama.cpp<a id="llamacpp"></a>
1367
+ MiniCPM-o 2.6 (vision-only mode) can run with llama.cpp. See our fork of [llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpm-omni) and the [readme](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) for more details.
1368
+
1369
+
1370
+ ## Int4 quantized version
1371
+ Download the int4 quantized version for lower GPU memory (7GB) usage: [MiniCPM-o-2_6-int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4).
1372
+
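+
+ Loading the int4 checkpoint follows the same pattern as the initialization example above; only the repository name changes. A sketch, assuming the int4 repository keeps the same remote-code interface (see the linked repository card for the exact loading arguments):
+
+ ```python
+ from transformers import AutoModel, AutoTokenizer
+
+ # Assumption: the int4 repo exposes the same remote-code interface as the full model
+ model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6-int4', trust_remote_code=True)
+ model = model.eval()
+ tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6-int4', trust_remote_code=True)
+ ```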
1373
+
1374
+ ## License
1375
+ #### Model License
1376
+ * The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
1377
+ * The usage of MiniCPM-o and MiniCPM-V series model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
1378
+ * The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, MiniCPM-o 2.6 weights are also available for free commercial use.
1379
+
1380
+
1381
+ #### Statement
1382
+ * As an LMM, MiniCPM-o 2.6 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions, or make value judgements. Anything generated by MiniCPM-o 2.6 does not represent the views and positions of the model developers.
1383
+ * We will not be liable for any problems arising from the use of the MiniCPM-o/V models, including but not limited to data security issues, public opinion risks, or any risks and problems arising from the misguidance, misuse, dissemination, or abuse of the models.
1384
+
1385
+ ## Key Techniques and Other Multimodal Projects
1386
+
1387
+ 👏 Welcome to explore key techniques of MiniCPM-o 2.6 and other multimodal projects of our team:
1388
+
1389
+ [VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)
1390
+
1391
+ ## Citation
1392
+
1393
+ If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!
1394
+
1395
+ ```bib
1396
+ @article{yao2024minicpm,
1397
+ title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
1398
+ author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
1399
+ journal={arXiv preprint arXiv:2408.01800},
1400
+ year={2024}
1401
+ }
1402
+ ```
added_tokens.json ADDED
@@ -0,0 +1,59 @@
1
+ {
2
+ "</asr>": 151682,
3
+ "</box>": 151670,
4
+ "</image>": 151666,
5
+ "</image_id>": 151678,
6
+ "</point>": 151674,
7
+ "</quad>": 151672,
8
+ "</query>": 151684,
9
+ "</ref>": 151668,
10
+ "</slice>": 151676,
11
+ "</tool_call>": 151658,
12
+ "</unit>": 151680,
13
+ "<asr>": 151681,
14
+ "<box>": 151669,
15
+ "<image>": 151665,
16
+ "<image_id>": 151677,
17
+ "<point>": 151673,
18
+ "<quad>": 151671,
19
+ "<query>": 151683,
20
+ "<ref>": 151667,
21
+ "<reserved_43>": 151698,
22
+ "<reserved_53>": 151699,
23
+ "<slice>": 151675,
24
+ "<tool_call>": 151657,
25
+ "<unit>": 151679,
26
+ "<|audio_end|>": 151687,
27
+ "<|audio_start|>": 151685,
28
+ "<|audio|>": 151686,
29
+ "<|box_end|>": 151649,
30
+ "<|box_start|>": 151648,
31
+ "<|endoftext|>": 151643,
32
+ "<|file_sep|>": 151664,
33
+ "<|fim_middle|>": 151660,
34
+ "<|fim_pad|>": 151662,
35
+ "<|fim_prefix|>": 151659,
36
+ "<|fim_suffix|>": 151661,
37
+ "<|im_end|>": 151645,
38
+ "<|im_start|>": 151644,
39
+ "<|image_pad|>": 151655,
40
+ "<|interrupt|>": 151695,
41
+ "<|listen|>": 151693,
42
+ "<|object_ref_end|>": 151647,
43
+ "<|object_ref_start|>": 151646,
44
+ "<|quad_end|>": 151651,
45
+ "<|quad_start|>": 151650,
46
+ "<|repo_name|>": 151663,
47
+ "<|speak|>": 151694,
48
+ "<|spk_bos|>": 151688,
49
+ "<|spk_eos|>": 151690,
50
+ "<|spk|>": 151689,
51
+ "<|tts_bos|>": 151691,
52
+ "<|tts_eos|>": 151692,
53
+ "<|vad_end|>": 151697,
54
+ "<|vad_start|>": 151696,
55
+ "<|video_pad|>": 151656,
56
+ "<|vision_end|>": 151653,
57
+ "<|vision_pad|>": 151654,
58
+ "<|vision_start|>": 151652
59
+ }
config.json ADDED
@@ -0,0 +1,208 @@
1
+ {
2
+ "_name_or_path": "openbmb/MiniCPM-o-2_6",
3
+ "architectures": [
4
+ "MiniCPMO"
5
+ ],
6
+
7
+ "attention_dropout": 0.0,
8
+ "bos_token_id": 151643,
9
+ "eos_token_id": 151645,
10
+ "hidden_act": "silu",
11
+ "hidden_size": 3584,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 18944,
14
+ "max_position_embeddings": 32768,
15
+ "max_window_layers": 28,
16
+ "num_attention_heads": 28,
17
+ "num_hidden_layers": 28,
18
+ "num_key_value_heads": 4,
19
+ "rms_norm_eps": 1e-06,
20
+ "rope_theta": 1000000.0,
21
+ "sliding_window": 131072,
22
+ "tie_word_embeddings": false,
23
+ "use_sliding_window": false,
24
+ "vocab_size": 151700,
25
+ "batch_vision_input": true,
26
+ "drop_vision_last_layer": false,
27
+ "image_size": 448,
28
+
29
+ "audio_chunk_length": 1.0,
30
+ "audio_config": {
31
+ "_name_or_path": "openai/whisper-medium",
32
+ "architectures": [
33
+ "MiniCPMWhisperEncoder"
34
+ ],
35
+ "begin_suppress_tokens": [
36
+ 220,
37
+ 50257
38
+ ],
39
+ "bos_token_id": 50257,
40
+ "d_model": 1024,
41
+ "decoder_attention_heads": 16,
42
+ "decoder_ffn_dim": 4096,
43
+ "decoder_layers": 24,
44
+ "decoder_start_token_id": 50258,
45
+ "encoder_attention_heads": 16,
46
+ "encoder_ffn_dim": 4096,
47
+ "encoder_layers": 24,
48
+ "eos_token_id": 50257,
49
+ "forced_decoder_ids": [
50
+ [
51
+ 1,
52
+ 50259
53
+ ],
54
+ [
55
+ 2,
56
+ 50359
57
+ ],
58
+ [
59
+ 3,
60
+ 50363
61
+ ]
62
+ ],
63
+ "max_length": 448,
64
+ "model_type": "whisper",
65
+ "num_hidden_layers": 24,
66
+ "pad_token_id": 50257,
67
+ "suppress_tokens": [
68
+ 1,
69
+ 2,
70
+ 7,
71
+ 8,
72
+ 9,
73
+ 10,
74
+ 14,
75
+ 25,
76
+ 26,
77
+ 27,
78
+ 28,
79
+ 29,
80
+ 31,
81
+ 58,
82
+ 59,
83
+ 60,
84
+ 61,
85
+ 62,
86
+ 63,
87
+ 90,
88
+ 91,
89
+ 92,
90
+ 93,
91
+ 359,
92
+ 503,
93
+ 522,
94
+ 542,
95
+ 873,
96
+ 893,
97
+ 902,
98
+ 918,
99
+ 922,
100
+ 931,
101
+ 1350,
102
+ 1853,
103
+ 1982,
104
+ 2460,
105
+ 2627,
106
+ 3246,
107
+ 3253,
108
+ 3268,
109
+ 3536,
110
+ 3846,
111
+ 3961,
112
+ 4183,
113
+ 4667,
114
+ 6585,
115
+ 6647,
116
+ 7273,
117
+ 9061,
118
+ 9383,
119
+ 10428,
120
+ 10929,
121
+ 11938,
122
+ 12033,
123
+ 12331,
124
+ 12562,
125
+ 13793,
126
+ 14157,
127
+ 14635,
128
+ 15265,
129
+ 15618,
130
+ 16553,
131
+ 16604,
132
+ 18362,
133
+ 18956,
134
+ 20075,
135
+ 21675,
136
+ 22520,
137
+ 26130,
138
+ 26161,
139
+ 26435,
140
+ 28279,
141
+ 29464,
142
+ 31650,
143
+ 32302,
144
+ 32470,
145
+ 36865,
146
+ 42863,
147
+ 47425,
148
+ 49870,
149
+ 50254,
150
+ 50258,
151
+ 50358,
152
+ 50359,
153
+ 50360,
154
+ 50361,
155
+ 50362
156
+ ],
157
+ "torch_dtype": "float32"
158
+ },
159
+ "audio_pool_step": 2,
160
+ "auto_map": {
161
+ "AutoConfig": "configuration_minicpm.MiniCPMOConfig",
162
+ "AutoModel": "modeling_minicpmo.MiniCPMO",
163
+ "AutoModelForCausalLM": "modeling_minicpmo.MiniCPMO"
164
+ },
165
+ "chunk_input": true,
166
+ "listen_speak_type": "asr",
167
+ "model_type": "minicpmo",
168
+ "patch_size": 14,
169
+ "query_num": 64,
170
+ "slice_config": {
171
+ "max_slice_nums": 9,
172
+ "model_type": "minicpmv"
173
+ },
174
+ "slice_mode": true,
175
+ "torch_dtype": "bfloat16",
176
+ "transformers_version": "4.44.2",
177
+ "tts_config": {
178
+ "model_type": "conditional_chattts",
179
+ "llm_dim": 3584
180
+ },
181
+ "use_cache": true,
182
+ "use_image_id": true,
183
+ "version": 2.6,
184
+ "vision_batch_size": 16,
185
+ "vision_config": {
186
+ "hidden_size": 1152,
187
+ "image_size": 980,
188
+ "intermediate_size": 4304,
189
+ "model_type": "siglip_vision_model",
190
+ "num_attention_heads": 16,
191
+ "num_hidden_layers": 27,
192
+ "patch_size": 14
193
+ },
194
+ "quantization_config": {
195
+ "bits": 4,
196
+ "checkpoint_format": "gptq",
197
+ "damp_percent": 0.01,
198
+ "desc_act": true,
199
+ "group_size": -1,
200
+ "lm_head": false,
201
+ "model_file_base_name": null,
202
+ "model_name_or_path": null,
203
+ "quant_method": "gptq",
204
+ "static_groups": false,
205
+ "sym": true,
206
+ "true_sequential": true
207
+ }
208
+ }
configuration_minicpm.py ADDED
@@ -0,0 +1,210 @@
1
+ # coding=utf-8
2
+ # Copyright 2025 The OpenBMB Team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+
16
+ import os
17
+ from typing import Union
18
+
19
+ from transformers import PretrainedConfig
20
+ from transformers import Qwen2Config
21
+ from transformers import WhisperConfig
22
+ from transformers.utils import logging
23
+
24
+ from .modeling_navit_siglip import SiglipVisionConfig
25
+
26
+ logger = logging.get_logger(__name__)
27
+
28
+
29
+ class MiniCPMVSliceConfig(PretrainedConfig):
30
+ model_type = "minicpmv"
31
+
32
+ def __init__(
33
+ self,
34
+ patch_size=14,
35
+ max_slice_nums=9,
36
+ scale_resolution=448,
37
+ **kwargs,
38
+ ):
39
+ super().__init__(**kwargs)
40
+ self.patch_size = patch_size
41
+ self.max_slice_nums = max_slice_nums
42
+ self.scale_resolution = scale_resolution
43
+
44
+ @classmethod
45
+ def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
46
+ cls._set_token_in_kwargs(kwargs)
47
+
48
+ config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
49
+
50
+ if config_dict.get("model_type") == "minicpmv":
51
+ config_dict = config_dict["slice_config"]
52
+
53
+ if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
54
+ logger.warning(
55
+ f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
56
+ f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
57
+ )
58
+
59
+ return cls.from_dict(config_dict, **kwargs)
60
+
61
+
62
+ class ConditionalChatTTSConfig(PretrainedConfig):
63
+ model_type = "conditional_chattts"
64
+
65
+ def __init__(
66
+ self,
67
+ llm_dim: int = 2560,
68
+ hidden_size: int = 768,
69
+ intermediate_size: int = 3072,
70
+ num_attention_heads: int = 12,
71
+ num_hidden_layers: int = 20,
72
+ max_position_embeddings: int = 4096,
73
+ num_audio_tokens: int = 626,
74
+ num_text_tokens: int = 21178,
75
+ num_mel_bins: int = 100,
76
+ num_vq: int = 4,
77
+ use_speaker_embedding: bool = True,
78
+ use_llm_hidden_state: bool = False,
79
+ spk_emb_token_id: int = 21143,
80
+ num_spk_embs: int = 1,
81
+ audio_bos_token_id: int = 21132,
82
+ text_eos_token_id: int = 21133,
83
+ use_text: bool = True,
84
+ streaming: bool = True,
85
+ streaming_text_chunk_size: int = 10,
86
+ streaming_text_reserved_len: int = 300,
87
+ streaming_audio_chunk_size: int = 50,
88
+ attn_implementation: str = "sdpa",
89
+ use_mlp: bool = True,
90
+ aug_loss_weight: bool = True,
91
+ do_sample: bool = True,
92
+ top_p: float = 0.7,
93
+ top_k: int = 20,
94
+ repetition_penalty: float = 1.0,
95
+ **kwargs,
96
+ ):
97
+ super().__init__(**kwargs)
98
+
99
+ self.llm_dim = llm_dim
100
+ self.hidden_size = hidden_size
101
+ self.intermediate_size = intermediate_size
102
+ self.num_attention_heads = num_attention_heads
103
+ self.num_hidden_layers = num_hidden_layers
104
+ self.max_position_embeddings = max_position_embeddings
105
+ self.num_audio_tokens = num_audio_tokens
106
+ self.num_text_tokens = num_text_tokens
107
+ self.num_mel_bins = num_mel_bins
108
+ self.num_vq = num_vq
109
+ self.use_speaker_embedding = use_speaker_embedding
110
+ self.use_llm_hidden_state = use_llm_hidden_state
111
+ self.spk_emb_token_id = spk_emb_token_id
112
+ self.num_spk_embs = num_spk_embs
113
+ self.audio_bos_token_id = audio_bos_token_id
114
+ self.text_eos_token_id = text_eos_token_id
115
+ self.use_text = use_text
116
+ self.streaming = streaming
117
+ self.streaming_text_chunk_size = streaming_text_chunk_size
118
+ self.streaming_text_reserved_len = streaming_text_reserved_len
119
+ self.streaming_audio_chunk_size = streaming_audio_chunk_size
120
+ self.attn_implementation = attn_implementation
121
+ self.use_mlp = use_mlp
122
+ self.aug_loss_weight = aug_loss_weight
123
+ self.do_sample = do_sample
124
+ self.top_p = top_p
125
+ self.top_k = top_k
126
+ self.repetition_penalty = repetition_penalty
127
+
128
+
129
+ class MiniCPMOConfig(Qwen2Config):
130
+ model_type = "minicpmo"
131
+ keys_to_ignore_at_inference = ["past_key_values"]
132
+
133
+ default_vision_config = {
134
+ "hidden_size": 1152,
135
+ "image_size": 980,
136
+ "intermediate_size": 4304,
137
+ "model_type": "siglip",
138
+ "num_attention_heads": 16,
139
+ "num_hidden_layers": 27,
140
+ "patch_size": 14,
141
+ }
142
+
143
+ def __init__(
144
+ self,
145
+ use_cache=True,
146
+ query_num=64,
147
+ image_size=448,
148
+ drop_vision_last_layer=True,
149
+ batch_vision_input=True,
150
+ slice_config=None,
151
+ vision_config=None,
152
+ audio_config=None,
153
+ tts_config=None,
154
+ use_image_id=True,
155
+ vision_batch_size=16,
156
+ audio_pool_step=2,
157
+ audio_chunk_length=1.0,
158
+ stream_input=False,
159
+ init_vision=True,
160
+ init_audio=True,
161
+ init_tts=True,
162
+ **kwargs,
163
+ ):
164
+ self.use_cache = use_cache
165
+ self.query_num = query_num
166
+ self.image_size = image_size
167
+ self.drop_vision_last_layer = drop_vision_last_layer
168
+ self.batch_vision_input = batch_vision_input
169
+ self.use_image_id = use_image_id
170
+ self.vision_batch_size = vision_batch_size
171
+ self.audio_pool_step = audio_pool_step
172
+ self.audio_chunk_length = audio_chunk_length
173
+ self.stream_input = stream_input
174
+ self.init_vision = init_vision
175
+ self.init_audio = init_audio
176
+ self.init_tts = init_tts
177
+
178
+ if slice_config is None:
179
+ self.slice_config = MiniCPMVSliceConfig(max_slice_nums=1)
180
+ else:
181
+ self.slice_config = MiniCPMVSliceConfig(**slice_config)
182
+ self.slice_mode = True
183
+
184
+ # same as HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit add tgt_sizes
185
+ if vision_config is None:
186
+ self.vision_config = SiglipVisionConfig(**self.default_vision_config)
187
+ logger.info("vision_config is None, using default vision config")
188
+ elif isinstance(vision_config, dict):
189
+ self.vision_config = SiglipVisionConfig(**vision_config)
190
+ elif isinstance(vision_config, SiglipVisionConfig):
191
+ self.vision_config = vision_config
192
+
193
+ # same as openai/whisper-medium add use_cache
194
+ if audio_config is None:
195
+ self.audio_config = WhisperConfig()
196
+ elif isinstance(audio_config, dict):
197
+ self.audio_config = WhisperConfig(**audio_config)
198
+ elif isinstance(audio_config, WhisperConfig):
199
+ self.audio_config = audio_config
200
+
201
+ if tts_config is None:
202
+ self.tts_config = ConditionalChatTTSConfig()
203
+ elif isinstance(tts_config, dict):
204
+ self.tts_config = ConditionalChatTTSConfig(**tts_config)
205
+ elif isinstance(tts_config, ConditionalChatTTSConfig):
206
+ self.tts_config = tts_config
207
+
208
+ self.patch_size = self.vision_config.patch_size
209
+
210
+ super().__init__(**kwargs)
gitattributes ADDED
@@ -0,0 +1,39 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ *png filter=lfs diff=lfs merge=lfs -text
37
+ *jpg filter=lfs diff=lfs merge=lfs -text
38
+ *gif filter=lfs diff=lfs merge=lfs -text
39
+ *.wav filter=lfs diff=lfs merge=lfs -text
image_processing_minicpmv.py ADDED
@@ -0,0 +1,407 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # coding=utf-8
2
+ # Copyright 2025 The OpenBMB Team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+
16
+ import math
17
+ from typing import Any
18
+ from typing import Dict
19
+ from typing import List
20
+ from typing import Optional
21
+ from typing import Union
22
+
23
+ import numpy as np
24
+ import PIL
25
+ import PIL.Image
26
+ import PIL.ImageSequence
27
+ import torch
28
+ from PIL import Image
29
+ from transformers import AutoImageProcessor
30
+ from transformers.image_processing_utils import BaseImageProcessor
31
+ from transformers.image_processing_utils import BatchFeature
32
+ from transformers.image_transforms import to_channel_dimension_format
33
+ from transformers.image_utils import ChannelDimension
34
+ from transformers.image_utils import infer_channel_dimension_format
35
+ from transformers.image_utils import is_torch_tensor
36
+ from transformers.image_utils import to_numpy_array
37
+ from transformers.image_utils import valid_images
38
+ from transformers.utils import is_torch_device
39
+ from transformers.utils import is_torch_dtype
40
+ from transformers.utils import requires_backends
41
+ from transformers.utils import TensorType
42
+
43
+
44
+ def recursive_converter(converter, value):
45
+ if isinstance(value, list):
46
+ new_value = []
47
+ for v in value:
48
+ new_value += [recursive_converter(converter, v)]
49
+ return new_value
50
+ else:
51
+ return converter(value)
52
+
53
+
54
+ class MiniCPMOBatchFeature(BatchFeature):
55
+ r"""
56
+ Extend from BatchFeature for supporting various image size
57
+ """
58
+
59
+ def __init__(self, data: Optional[Dict[str, Any]] = None, tensor_type: Union[None, str, TensorType] = None):
60
+ super().__init__(data)
61
+ self.convert_to_tensors(tensor_type=tensor_type)
62
+
63
+ def convert_to_tensors(self, tensor_type: Optional[Union[str, TensorType]] = None):
64
+ if tensor_type is None:
65
+ return self
66
+
67
+ is_tensor, as_tensor = self._get_is_as_tensor_fns(tensor_type)
68
+
69
+ def converter(value):
70
+ try:
71
+ if not is_tensor(value):
72
+ tensor = as_tensor(value)
73
+ return tensor
74
+ except: # noqa E722
75
+ if key == "overflowing_values":
76
+ raise ValueError("Unable to create tensor returning overflowing values of different lengths. ")
77
+ raise ValueError(
78
+ "Unable to create tensor, you should probably activate padding "
79
+ "with 'padding=True' to have batched tensors with the same length."
80
+ )
81
+
82
+ for key, value in self.items():
83
+ self[key] = recursive_converter(converter, value)
84
+ return self
85
+
86
+ def to(self, *args, **kwargs) -> "MiniCPMOBatchFeature":
87
+ requires_backends(self, ["torch"])
88
+ import torch
89
+
90
+ def cast_tensor(v):
91
+ # check if v is a floating point
92
+ if torch.is_floating_point(v):
93
+ # cast and send to device
94
+ return v.to(*args, **kwargs)
95
+ elif device is not None:
96
+ return v.to(device=device)
97
+ else:
98
+ return v
99
+
100
+ new_data = {}
101
+ device = kwargs.get("device")
102
+ # Check if the args are a device or a dtype
103
+ if device is None and len(args) > 0:
104
+ # device should be always the first argument
105
+ arg = args[0]
106
+ if is_torch_dtype(arg):
107
+ # The first argument is a dtype
108
+ pass
109
+ elif isinstance(arg, str) or is_torch_device(arg) or isinstance(arg, int):
110
+ device = arg
111
+ else:
112
+ # it's something else
113
+ raise ValueError(f"Attempting to cast a BatchFeature to type {str(arg)}. This is not supported.")
114
+ # We cast only floating point tensors to avoid issues with tokenizers casting `LongTensor` to `FloatTensor`
115
+ for k, v in self.items():
116
+ new_data[k] = recursive_converter(cast_tensor, v)
117
+ self.data = new_data
118
+ return self
119
+
120
+
121
+ class MiniCPMVImageProcessor(BaseImageProcessor):
122
+ model_input_names = ["pixel_values"]
123
+
124
+ def __init__(self, max_slice_nums=9, scale_resolution=448, patch_size=14, **kwargs):
125
+ super().__init__(**kwargs)
126
+ self.max_slice_nums = max_slice_nums
127
+ self.scale_resolution = scale_resolution
128
+ self.patch_size = patch_size
129
+ self.use_image_id = kwargs.pop("use_image_id", False)
130
+ self.image_feature_size = kwargs.pop("image_feature_size", 64)
131
+ self.im_start_token = kwargs.pop("im_start", "<image>")
132
+ self.im_end_token = kwargs.pop("im_end", "</image>")
133
+ self.slice_start_token = kwargs.pop("slice_start", "<slice>")
134
+ self.slice_end_token = kwargs.pop("slice_end", "</slice>")
135
+ self.unk_token = kwargs.pop("unk", "<unk>")
136
+ self.im_id_start = kwargs.pop("im_id_start", "<image_id>")
137
+ self.im_id_end = kwargs.pop("im_id_end", "</image_id>")
138
+ self.slice_mode = kwargs.pop("slice_mode", True)
139
+
140
+ self.mean = np.array(kwargs.pop("norm_mean", [0.5, 0.5, 0.5]))
141
+ self.std = np.array(kwargs.pop("norm_std", [0.5, 0.5, 0.5]))
142
+ self.version = kwargs.pop("version", 2.0)
143
+
144
+ def ensure_divide(self, length, patch_size):
145
+ return max(round(length / patch_size) * patch_size, patch_size)
146
+
147
+ def find_best_resize(self, original_size, scale_resolution, patch_size, allow_upscale=False):
148
+ width, height = original_size
149
+ if (width * height > scale_resolution * scale_resolution) or allow_upscale:
150
+ r = width / height
151
+ height = int(scale_resolution / math.sqrt(r))
152
+ width = int(height * r)
153
+ best_width = self.ensure_divide(width, patch_size)
154
+ best_height = self.ensure_divide(height, patch_size)
155
+ return (best_width, best_height)
156
+
157
+ def get_refine_size(self, original_size, grid, scale_resolution, patch_size, allow_upscale=False):
158
+ width, height = original_size
159
+ grid_x, grid_y = grid
160
+
161
+ refine_width = self.ensure_divide(width, grid_x)
162
+ refine_height = self.ensure_divide(height, grid_y)
163
+
164
+ grid_width = refine_width / grid_x
165
+ grid_height = refine_height / grid_y
166
+
167
+ best_grid_size = self.find_best_resize(
168
+ (grid_width, grid_height), scale_resolution, patch_size, allow_upscale=allow_upscale
169
+ )
170
+ refine_size = (best_grid_size[0] * grid_x, best_grid_size[1] * grid_y)
171
+ return refine_size
172
+
173
+ def split_to_patches(self, image, grid):
174
+ patches = []
175
+ width, height = image.size
176
+ grid_x = int(width / grid[0])
177
+ grid_y = int(height / grid[1])
178
+ for i in range(0, height, grid_y):
179
+ images = []
180
+ for j in range(0, width, grid_x):
181
+ box = (j, i, j + grid_x, i + grid_y)
182
+ patch = image.crop(box)
183
+ images.append(patch)
184
+ patches.append(images)
185
+ return patches
186
+
187
+ def slice_image(self, image, max_slice_nums=9, scale_resolution=448, patch_size=14, never_split=False):
188
+ original_size = image.size
189
+ source_image = None
190
+ best_grid = self.get_sliced_grid(original_size, max_slice_nums, never_split)
191
+ patches = []
192
+
193
+ if best_grid is None:
194
+ # dont need to slice, upsample
195
+ best_size = self.find_best_resize(original_size, scale_resolution, patch_size, allow_upscale=True)
196
+ source_image = image.resize(best_size, resample=Image.Resampling.BICUBIC)
197
+ else:
198
+ # source image, down-sampling and ensure divided by patch_size
199
+ best_resize = self.find_best_resize(original_size, scale_resolution, patch_size)
200
+ source_image = image.copy().resize(best_resize, resample=Image.Resampling.BICUBIC)
201
+ refine_size = self.get_refine_size(
202
+ original_size, best_grid, scale_resolution, patch_size, allow_upscale=True
203
+ )
204
+ refine_image = image.resize(refine_size, resample=Image.Resampling.BICUBIC)
205
+ patches = self.split_to_patches(refine_image, best_grid)
206
+
207
+ return source_image, patches, best_grid
208
+
209
+ def get_grid_placeholder(self, grid):
210
+ if grid is None:
211
+ return ""
212
+ slice_image_placeholder = (
213
+ self.slice_start_token + self.unk_token * self.image_feature_size + self.slice_end_token
214
+ )
215
+
216
+ cols = grid[0]
217
+ rows = grid[1]
218
+ slices = []
219
+ for i in range(rows):
220
+ lines = []
221
+ for j in range(cols):
222
+ lines.append(slice_image_placeholder)
223
+ slices.append("".join(lines))
224
+
225
+ slice_placeholder = "\n".join(slices)
226
+ return slice_placeholder
227
+
228
+ def get_image_id_placeholder(self, idx=0):
229
+ return f"{self.im_id_start}{idx}{self.im_id_end}"
230
+
231
+ def get_sliced_images(self, image, max_slice_nums=None):
232
+ slice_images = []
233
+
234
+ if not self.slice_mode:
235
+ return [image]
236
+
237
+ max_slice_nums = self.max_slice_nums if max_slice_nums is None else int(max_slice_nums)
238
+ assert max_slice_nums > 0
239
+ source_image, patches, sliced_grid = self.slice_image(
240
+ image, max_slice_nums, self.scale_resolution, self.patch_size # default: 9 # default: 448 # default: 14
241
+ )
242
+
243
+ slice_images.append(source_image)
244
+ if len(patches) > 0:
245
+ for i in range(len(patches)):
246
+ for j in range(len(patches[0])):
247
+ slice_images.append(patches[i][j])
248
+ return slice_images
249
+
250
+ def get_sliced_grid(self, image_size, max_slice_nums, nerver_split=False):
251
+ original_width, original_height = image_size
252
+ log_ratio = math.log(original_width / original_height)
253
+ ratio = original_width * original_height / (self.scale_resolution * self.scale_resolution)
254
+ multiple = min(math.ceil(ratio), max_slice_nums)
255
+ if multiple <= 1 or nerver_split:
256
+ return None
257
+ candidate_split_grids_nums = []
258
+ for i in [multiple - 1, multiple, multiple + 1]:
259
+ if i == 1 or i > max_slice_nums:
260
+ continue
261
+ candidate_split_grids_nums.append(i)
262
+
263
+ candidate_grids = []
264
+ for split_grids_nums in candidate_split_grids_nums:
265
+ m = 1
266
+ while m <= split_grids_nums:
267
+ if split_grids_nums % m == 0:
268
+ candidate_grids.append([m, split_grids_nums // m])
269
+ m += 1
270
+
271
+ best_grid = [1, 1]
272
+ min_error = float("inf")
273
+ for grid in candidate_grids:
274
+ error = abs(log_ratio - math.log(grid[0] / grid[1]))
275
+ if error < min_error:
276
+ best_grid = grid
277
+ min_error = error
278
+
279
+ return best_grid
280
+
281
+ def get_slice_image_placeholder(self, image_size, image_idx=0, max_slice_nums=None, use_image_id=None):
282
+ max_slice_nums = self.max_slice_nums if max_slice_nums is None else int(max_slice_nums)
283
+ assert max_slice_nums > 0
284
+ grid = self.get_sliced_grid(image_size=image_size, max_slice_nums=max_slice_nums)
285
+
286
+ image_placeholder = self.im_start_token + self.unk_token * self.image_feature_size + self.im_end_token
287
+ use_image_id = self.use_image_id if use_image_id is None else bool(use_image_id)
288
+ if use_image_id:
289
+ final_placeholder = self.get_image_id_placeholder(image_idx) + image_placeholder
290
+ else:
291
+ final_placeholder = image_placeholder
292
+
293
+ if self.slice_mode:
294
+ final_placeholder = final_placeholder + self.get_grid_placeholder(grid=grid)
295
+ return final_placeholder
296
+
297
+ def to_pil_image(self, image, rescale=None) -> PIL.Image.Image:
298
+ """
299
+ Converts `image` to a PIL Image. Optionally rescales it and puts the channel dimension back as the last axis if
300
+ needed.
301
+
302
+ Args:
303
+ image (`PIL.Image.Image` or `numpy.ndarray` or `torch.Tensor`):
304
+ The image to convert to the PIL Image format.
305
+ rescale (`bool`, *optional*):
306
+ Whether or not to apply the scaling factor (to make pixel values integers between 0 and 255). Will
307
+ default to `True` if the image type is a floating type, `False` otherwise.
308
+ """
309
+ if isinstance(image, PIL.Image.Image):
310
+ return image
311
+ if is_torch_tensor(image):
312
+ image = image.numpy()
313
+
314
+ if isinstance(image, np.ndarray):
315
+ if rescale is None:
316
+ # rescale default to the array being of floating type.
317
+ rescale = isinstance(image.flat[0], np.floating)
318
+ # If the channel as been moved to first dim, we put it back at the end.
319
+ if image.ndim == 3 and image.shape[0] in [1, 3]:
320
+ image = image.transpose(1, 2, 0)
321
+ if rescale:
322
+ image = image * 255
323
+ image = image.astype(np.uint8)
324
+ return PIL.Image.fromarray(image)
325
+ return image
326
+
327
+ def reshape_by_patch(self, image):
328
+ """
329
+ :param image: shape [3, H, W]
330
+ :param patch_size:
331
+ :return: [3, patch_size, HW/patch_size]
332
+ """
333
+ image = torch.from_numpy(image)
334
+ patch_size = self.patch_size
335
+ patches = torch.nn.functional.unfold(image, (patch_size, patch_size), stride=(patch_size, patch_size))
336
+
337
+ patches = patches.reshape(image.size(0), patch_size, patch_size, -1)
338
+ patches = patches.permute(0, 1, 3, 2).reshape(image.size(0), patch_size, -1)
339
+ return patches.numpy()
340
+
341
+ def preprocess(
342
+ self,
343
+ images: Union[Image.Image, List[Image.Image], List[List[Image.Image]]],
344
+ do_pad: Optional[bool] = True,
345
+ max_slice_nums: int = None,
346
+ return_tensors: Optional[Union[str, TensorType]] = None,
347
+ **kwargs,
348
+ ) -> MiniCPMOBatchFeature:
349
+ if isinstance(images, Image.Image):
350
+ images_list = [[images]]
351
+ elif isinstance(images[0], Image.Image):
352
+ images_list = [images]
353
+ else:
354
+ images_list = images
355
+
356
+ new_images_list = []
357
+ image_sizes_list = []
358
+ tgt_sizes_list = []
359
+
360
+ for _images in images_list:
361
+ if _images is None or len(_images) == 0:
362
+ new_images_list.append([])
363
+ image_sizes_list.append([])
364
+ tgt_sizes_list.append([])
365
+ continue
366
+ if not valid_images(_images):
367
+ raise ValueError(
368
+ "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
369
+ "torch.Tensor, tf.Tensor or jax.ndarray."
370
+ )
371
+
372
+ _images = [self.to_pil_image(image).convert("RGB") for image in _images]
373
+ input_data_format = infer_channel_dimension_format(np.array(_images[0]))
374
+
375
+ new_images = []
376
+ image_sizes = [image.size for image in _images]
377
+ tgt_sizes = []
378
+ for image in _images:
379
+ image_patches = self.get_sliced_images(image, max_slice_nums)
380
+ image_patches = [to_numpy_array(image).astype(np.float32) / 255 for image in image_patches]
381
+ image_patches = [
382
+ self.normalize(image=image, mean=self.mean, std=self.std, input_data_format=input_data_format)
383
+ for image in image_patches
384
+ ]
385
+ image_patches = [
386
+ to_channel_dimension_format(image, ChannelDimension.FIRST, input_channel_dim=input_data_format)
387
+ for image in image_patches
388
+ ]
389
+ for slice_image in image_patches:
390
+ new_images.append(self.reshape_by_patch(slice_image))
391
+ tgt_sizes.append(
392
+ np.array((slice_image.shape[1] // self.patch_size, slice_image.shape[2] // self.patch_size))
393
+ )
394
+
395
+ if tgt_sizes:
396
+ tgt_sizes = np.vstack(tgt_sizes)
397
+
398
+ new_images_list.append(new_images)
399
+ image_sizes_list.append(image_sizes)
400
+ tgt_sizes_list.append(tgt_sizes)
401
+ return MiniCPMOBatchFeature(
402
+ data={"pixel_values": new_images_list, "image_sizes": image_sizes_list, "tgt_sizes": tgt_sizes_list},
403
+ tensor_type=return_tensors,
404
+ )
405
+
406
+
407
+ AutoImageProcessor.register("MiniCPMVImageProcessor", MiniCPMVImageProcessor)
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
modeling_minicpmo.py ADDED
The diff for this file is too large to render. See raw diff
 
modeling_navit_siglip.py ADDED
@@ -0,0 +1,940 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # coding=utf-8
2
+ # Copyright 2024 Google AI and The HuggingFace Team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ """ PyTorch Siglip model. """
16
+ # Copied from HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit and add tgt_sizes
17
+
18
+
19
+ import math
20
+ import os
21
+ import warnings
22
+ from dataclasses import dataclass
23
+ from typing import Optional
24
+ from typing import Tuple
25
+ from typing import Union
26
+
27
+ import numpy as np
28
+ import torch
29
+ import torch.nn.functional as F
30
+ import torch.utils.checkpoint
31
+ from torch import nn
32
+ from torch.nn.init import _calculate_fan_in_and_fan_out
33
+ from transformers.activations import ACT2FN
34
+ from transformers.configuration_utils import PretrainedConfig
35
+ from transformers.modeling_attn_mask_utils import _prepare_4d_attention_mask
36
+ from transformers.modeling_outputs import BaseModelOutput
37
+ from transformers.modeling_outputs import BaseModelOutputWithPooling
38
+ from transformers.modeling_utils import PreTrainedModel
39
+ from transformers.utils import add_start_docstrings
40
+ from transformers.utils import add_start_docstrings_to_model_forward
41
+ from transformers.utils import is_flash_attn_2_available
42
+ from transformers.utils import logging
43
+ from transformers.utils import ModelOutput
44
+ from transformers.utils import replace_return_docstrings
45
+
46
+ logger = logging.get_logger(__name__)
47
+
48
+
49
+ class SiglipVisionConfig(PretrainedConfig):
50
+ r"""
51
+ This is the configuration class to store the configuration of a [`SiglipVisionModel`]. It is used to instantiate a
52
+ Siglip vision encoder according to the specified arguments, defining the model architecture. Instantiating a
53
+ configuration with the defaults will yield a similar configuration to that of the vision encoder of the Siglip
54
+ [google/siglip-base-patch16-224](https://huggingface.co/google/siglip-base-patch16-224) architecture.
55
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
56
+ documentation from [`PretrainedConfig`] for more information.
57
+ Args:
58
+ hidden_size (`int`, *optional*, defaults to 768):
59
+ Dimensionality of the encoder layers and the pooler layer.
60
+ intermediate_size (`int`, *optional*, defaults to 3072):
61
+ Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
62
+ num_hidden_layers (`int`, *optional*, defaults to 12):
63
+ Number of hidden layers in the Transformer encoder.
64
+ num_attention_heads (`int`, *optional*, defaults to 12):
65
+ Number of attention heads for each attention layer in the Transformer encoder.
66
+ num_channels (`int`, *optional*, defaults to 3):
67
+ Number of channels in the input images.
68
+ image_size (`int`, *optional*, defaults to 224):
69
+ The size (resolution) of each image.
70
+ patch_size (`int`, *optional*, defaults to 16):
71
+ The size (resolution) of each patch.
72
+ hidden_act (`str` or `function`, *optional*, defaults to `"gelu_pytorch_tanh"`):
73
+ The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
74
+ `"relu"`, `"selu"` and `"gelu_new"` ``"quick_gelu"` are supported.
75
+ layer_norm_eps (`float`, *optional*, defaults to 1e-06):
76
+ The epsilon used by the layer normalization layers.
77
+ attention_dropout (`float`, *optional*, defaults to 0.0):
78
+ The dropout ratio for the attention probabilities.
79
+ Example:
80
+ ```python
81
+ >>> from transformers import SiglipVisionConfig, SiglipVisionModel
82
+ >>> # Initializing a SiglipVisionConfig with google/siglip-base-patch16-224 style configuration
83
+ >>> configuration = SiglipVisionConfig()
84
+ >>> # Initializing a SiglipVisionModel (with random weights) from the google/siglip-base-patch16-224 style configuration
85
+ >>> model = SiglipVisionModel(configuration)
86
+ >>> # Accessing the model configuration
87
+ >>> configuration = model.config
88
+ ```"""
89
+
90
+ model_type = "siglip_vision_model"
91
+
92
+ def __init__(
93
+ self,
94
+ hidden_size=768,
95
+ intermediate_size=3072,
96
+ num_hidden_layers=12,
97
+ num_attention_heads=12,
98
+ num_channels=3,
99
+ image_size=224,
100
+ patch_size=16,
101
+ hidden_act="gelu_pytorch_tanh",
102
+ layer_norm_eps=1e-6,
103
+ attention_dropout=0.0,
104
+ **kwargs,
105
+ ):
106
+ super().__init__(**kwargs)
107
+
108
+ self.hidden_size = hidden_size
109
+ self.intermediate_size = intermediate_size
110
+ self.num_hidden_layers = num_hidden_layers
111
+ self.num_attention_heads = num_attention_heads
112
+ self.num_channels = num_channels
113
+ self.patch_size = patch_size
114
+ self.image_size = image_size
115
+ self.attention_dropout = attention_dropout
116
+ self.layer_norm_eps = layer_norm_eps
117
+ self.hidden_act = hidden_act
118
+
119
+ @classmethod
120
+ def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
121
+ cls._set_token_in_kwargs(kwargs)
122
+
123
+ config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
124
+
125
+ # get the vision config dict if we are loading from SiglipConfig
126
+ if config_dict.get("model_type") == "siglip":
127
+ config_dict = config_dict["vision_config"]
128
+
129
+ if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
130
+ logger.warning(
131
+ f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
132
+ f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
133
+ )
134
+
135
+ return cls.from_dict(config_dict, **kwargs)
136
+
137
+
138
+ _CHECKPOINT_FOR_DOC = "google/siglip-base-patch16-224"
139
+
140
+ SIGLIP_PRETRAINED_MODEL_ARCHIVE_LIST = [
141
+ "google/siglip-base-patch16-224",
142
+ # See all SigLIP models at https://huggingface.co/models?filter=siglip
143
+ ]
144
+
145
+ if is_flash_attn_2_available():
146
+ from flash_attn import flash_attn_func
147
+ from flash_attn import flash_attn_varlen_func
148
+ from flash_attn.bert_padding import index_first_axis # noqa
149
+ from flash_attn.bert_padding import pad_input
150
+ from flash_attn.bert_padding import unpad_input
151
+
152
+
153
+ # Copied from transformers.models.llama.modeling_llama._get_unpad_data
154
+ def _get_unpad_data(attention_mask):
155
+ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
156
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
157
+ max_seqlen_in_batch = seqlens_in_batch.max().item()
158
+ cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.torch.int32), (1, 0))
159
+ return (
160
+ indices,
161
+ cu_seqlens,
162
+ max_seqlen_in_batch,
163
+ )
164
+
165
+
166
+ def _trunc_normal_(tensor, mean, std, a, b):
167
+ # Cut & paste from PyTorch official master until it's in a few official releases - RW
168
+ # Method based on https://people.sc.fsu.edu/~jburkardt/presentations/truncated_normal.pdf
169
+ def norm_cdf(x):
170
+ # Computes standard normal cumulative distribution function
171
+ return (1.0 + math.erf(x / math.sqrt(2.0))) / 2.0
172
+
173
+ if (mean < a - 2 * std) or (mean > b + 2 * std):
174
+ warnings.warn(
175
+ "mean is more than 2 std from [a, b] in nn.init.trunc_normal_. "
176
+ "The distribution of values may be incorrect.",
177
+ stacklevel=2,
178
+ )
179
+
180
+ # Values are generated by using a truncated uniform distribution and
181
+ # then using the inverse CDF for the normal distribution.
182
+ # Get upper and lower cdf values
183
+ l = norm_cdf((a - mean) / std)
184
+ u = norm_cdf((b - mean) / std)
185
+
186
+ # Uniformly fill tensor with values from [l, u], then translate to
187
+ # [2l-1, 2u-1].
188
+ tensor.uniform_(2 * l - 1, 2 * u - 1)
189
+
190
+ # Use inverse cdf transform for normal distribution to get truncated
191
+ # standard normal
192
+ if tensor.dtype in [torch.float16, torch.bfloat16]:
193
+ # The `erfinv_` op is not (yet?) defined in float16+cpu, bfloat16+gpu
194
+ og_dtype = tensor.dtype
195
+ tensor = tensor.to(torch.float32)
196
+ tensor.erfinv_()
197
+ tensor = tensor.to(og_dtype)
198
+ else:
199
+ tensor.erfinv_()
200
+
201
+ # Transform to proper mean, std
202
+ tensor.mul_(std * math.sqrt(2.0))
203
+ tensor.add_(mean)
204
+
205
+ # Clamp to ensure it's in the proper range
206
+ if tensor.dtype == torch.float16:
207
+ # The `clamp_` op is not (yet?) defined in float16+cpu
208
+ tensor = tensor.to(torch.float32)
209
+ tensor.clamp_(min=a, max=b)
210
+ tensor = tensor.to(torch.float16)
211
+ else:
212
+ tensor.clamp_(min=a, max=b)
213
+
214
+
215
+ def trunc_normal_tf_(
216
+ tensor: torch.Tensor, mean: float = 0.0, std: float = 1.0, a: float = -2.0, b: float = 2.0
217
+ ) -> torch.Tensor:
218
+ """Fills the input Tensor with values drawn from a truncated
219
+ normal distribution. The values are effectively drawn from the
220
+ normal distribution :math:`\\mathcal{N}(\text{mean}, \text{std}^2)`
221
+ with values outside :math:`[a, b]` redrawn until they are within
222
+ the bounds. The method used for generating the random values works
223
+ best when :math:`a \\leq \text{mean} \\leq b`.
224
+ NOTE: this 'tf' variant behaves closer to Tensorflow / JAX impl where the
225
+ bounds [a, b] are applied when sampling the normal distribution with mean=0, std=1.0
226
+ and the result is subsquently scaled and shifted by the mean and std args.
227
+ Args:
228
+ tensor: an n-dimensional `torch.Tensor`
229
+ mean: the mean of the normal distribution
230
+ std: the standard deviation of the normal distribution
231
+ a: the minimum cutoff value
232
+ b: the maximum cutoff value
233
+ """
234
+ with torch.no_grad():
235
+ _trunc_normal_(tensor, 0, 1.0, a, b)
236
+ tensor.mul_(std).add_(mean)
237
+
238
+
239
+ def variance_scaling_(tensor, scale=1.0, mode="fan_in", distribution="normal"):
240
+ fan_in, fan_out = _calculate_fan_in_and_fan_out(tensor)
241
+ if mode == "fan_in":
242
+ denom = fan_in
243
+ elif mode == "fan_out":
244
+ denom = fan_out
245
+ elif mode == "fan_avg":
246
+ denom = (fan_in + fan_out) / 2
247
+
248
+ variance = scale / denom
249
+
250
+ if distribution == "truncated_normal":
251
+ # constant is stddev of standard normal truncated to (-2, 2)
252
+ trunc_normal_tf_(tensor, std=math.sqrt(variance) / 0.87962566103423978)
253
+ elif distribution == "normal":
254
+ with torch.no_grad():
255
+ tensor.normal_(std=math.sqrt(variance))
256
+ elif distribution == "uniform":
257
+ bound = math.sqrt(3 * variance)
258
+ with torch.no_grad():
259
+ tensor.uniform_(-bound, bound)
260
+ else:
261
+ raise ValueError(f"invalid distribution {distribution}")
262
+
263
+
264
+ def lecun_normal_(tensor):
265
+ variance_scaling_(tensor, mode="fan_in", distribution="truncated_normal")
266
+
267
+
268
+ def default_flax_embed_init(tensor):
269
+ variance_scaling_(tensor, mode="fan_in", distribution="normal")
270
+
271
+
272
+ @dataclass
273
+ # Copied from transformers.models.clip.modeling_clip.CLIPVisionModelOutput with CLIP->Siglip
274
+ class SiglipVisionModelOutput(ModelOutput):
275
+ """
276
+ Base class for vision model's outputs that also contains image embeddings of the pooling of the last hidden states.
277
+ Args:
278
+ image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
279
+ The image embeddings obtained by applying the projection layer to the pooler_output.
280
+ last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
281
+ Sequence of hidden-states at the output of the last layer of the model.
282
+ hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
283
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
284
+ one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
285
+ Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
286
+ attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
287
+ Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
288
+ sequence_length)`.
289
+ Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
290
+ heads.
291
+ """
292
+
293
+ image_embeds: Optional[torch.FloatTensor] = None
294
+ last_hidden_state: torch.FloatTensor = None
295
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
296
+ attentions: Optional[Tuple[torch.FloatTensor]] = None
297
+
298
+
299
+ class SiglipVisionEmbeddings(nn.Module):
300
+ def __init__(self, config: SiglipVisionConfig):
301
+ super().__init__()
302
+ self.config = config
303
+ self.embed_dim = config.hidden_size
304
+ self.image_size = config.image_size
305
+ self.patch_size = config.patch_size
306
+
307
+ self.patch_embedding = nn.Conv2d(
308
+ in_channels=config.num_channels,
309
+ out_channels=self.embed_dim,
310
+ kernel_size=self.patch_size,
311
+ stride=self.patch_size,
312
+ padding="valid",
313
+ )
314
+
315
+ self.num_patches_per_side = self.image_size // self.patch_size
316
+ self.num_patches = self.num_patches_per_side**2
317
+ self.num_positions = self.num_patches
318
+ self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
319
+
320
+ def forward(
321
+ self,
322
+ pixel_values: torch.FloatTensor,
323
+ patch_attention_mask: torch.BoolTensor,
324
+ tgt_sizes: Optional[torch.IntTensor] = None,
325
+ ) -> torch.Tensor:
326
+ batch_size = pixel_values.size(0)
327
+
328
+ patch_embeds = self.patch_embedding(pixel_values)
329
+ embeddings = patch_embeds.flatten(2).transpose(1, 2)
330
+
331
+ max_im_h, max_im_w = pixel_values.size(2), pixel_values.size(3)
332
+ max_nb_patches_h, max_nb_patches_w = max_im_h // self.patch_size, max_im_w // self.patch_size
333
+ boundaries = torch.arange(1 / self.num_patches_per_side, 1.0, 1 / self.num_patches_per_side)
334
+ position_ids = torch.full(
335
+ size=(
336
+ batch_size,
337
+ max_nb_patches_h * max_nb_patches_w,
338
+ ),
339
+ fill_value=0,
340
+ )
341
+
342
+ for batch_idx, p_attn_mask in enumerate(patch_attention_mask):
343
+ if tgt_sizes is not None:
344
+ nb_patches_h = tgt_sizes[batch_idx][0]
345
+ nb_patches_w = tgt_sizes[batch_idx][1]
346
+ else:
347
+ nb_patches_h = p_attn_mask[:, 0].sum()
348
+ nb_patches_w = p_attn_mask[0].sum()
349
+
350
+ fractional_coords_h = torch.arange(0, 1 - 1e-6, 1 / nb_patches_h)
351
+ fractional_coords_w = torch.arange(0, 1 - 1e-6, 1 / nb_patches_w)
352
+
353
+ bucket_coords_h = torch.bucketize(fractional_coords_h, boundaries, right=True)
354
+ bucket_coords_w = torch.bucketize(fractional_coords_w, boundaries, right=True)
355
+
356
+ pos_ids = (bucket_coords_h[:, None] * self.num_patches_per_side + bucket_coords_w).flatten()
357
+ position_ids[batch_idx][p_attn_mask.view(-1).cpu()] = pos_ids
358
+
359
+ position_ids = position_ids.to(self.position_embedding.weight.device)
360
+
361
+ embeddings = embeddings + self.position_embedding(position_ids)
362
+ return embeddings
363
+
364
+
365
+ class SiglipAttention(nn.Module):
366
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
367
+
368
+ # Copied from transformers.models.clip.modeling_clip.CLIPAttention.__init__
369
+ def __init__(self, config):
370
+ super().__init__()
371
+ self.config = config
372
+ self.embed_dim = config.hidden_size
373
+ self.num_heads = config.num_attention_heads
374
+ self.head_dim = self.embed_dim // self.num_heads
375
+ if self.head_dim * self.num_heads != self.embed_dim:
376
+ raise ValueError(
377
+ f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:"
378
+ f" {self.num_heads})."
379
+ )
380
+ self.scale = self.head_dim**-0.5
381
+ self.dropout = config.attention_dropout
382
+
383
+ self.k_proj = nn.Linear(self.embed_dim, self.embed_dim)
384
+ self.v_proj = nn.Linear(self.embed_dim, self.embed_dim)
385
+ self.q_proj = nn.Linear(self.embed_dim, self.embed_dim)
386
+ self.out_proj = nn.Linear(self.embed_dim, self.embed_dim)
387
+
388
+ def forward(
389
+ self,
390
+ hidden_states: torch.Tensor,
391
+ attention_mask: Optional[torch.Tensor] = None,
392
+ output_attentions: Optional[bool] = False,
393
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
394
+ """Input shape: Batch x Time x Channel"""
395
+
396
+ batch_size, q_len, _ = hidden_states.size()
397
+
398
+ query_states = self.q_proj(hidden_states)
399
+ key_states = self.k_proj(hidden_states)
400
+ value_states = self.v_proj(hidden_states)
401
+
402
+ query_states = query_states.view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
403
+ key_states = key_states.view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
404
+ value_states = value_states.view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
405
+
406
+ k_v_seq_len = key_states.shape[-2]
407
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) * self.scale
408
+
409
+ if attn_weights.size() != (batch_size, self.num_heads, q_len, k_v_seq_len):
410
+ raise ValueError(
411
+ f"Attention weights should be of size {(batch_size, self.num_heads, q_len, k_v_seq_len)}, but is"
412
+ f" {attn_weights.size()}"
413
+ )
414
+
415
+ if attention_mask is not None:
416
+ if attention_mask.size() != (batch_size, 1, q_len, k_v_seq_len):
417
+ raise ValueError(
418
+ f"Attention mask should be of size {(batch_size, 1, q_len, k_v_seq_len)}, but is {attention_mask.size()}"
419
+ )
420
+ attn_weights = attn_weights + attention_mask
421
+
422
+ # upcast attention to fp32
423
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
424
+ attn_weights = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training)
425
+ attn_output = torch.matmul(attn_weights, value_states)
426
+
427
+ if attn_output.size() != (batch_size, self.num_heads, q_len, self.head_dim):
428
+ raise ValueError(
429
+ f"`attn_output` should be of size {(batch_size, self.num_heads, q_len, self.head_dim)}, but is"
430
+ f" {attn_output.size()}"
431
+ )
432
+
433
+ attn_output = attn_output.transpose(1, 2).contiguous()
434
+ attn_output = attn_output.reshape(batch_size, q_len, self.embed_dim)
435
+
436
+ attn_output = self.out_proj(attn_output)
437
+
438
+ return attn_output, attn_weights
439
+
440
+
441
+ class SiglipFlashAttention2(SiglipAttention):
442
+ """
443
+ Llama flash attention module. This module inherits from `LlamaAttention` as the weights of the module stays
444
+ untouched. The only required change would be on the forward pass where it needs to correctly call the public API of
445
+ flash attention and deal with padding tokens in case the input contains any of them.
446
+ """
447
+
448
+ def __init__(self, *args, **kwargs):
449
+ super().__init__(*args, **kwargs)
450
+ self.is_causal = False # Hack to make sure we don't use a causal mask
451
+
452
+ def forward(
453
+ self,
454
+ hidden_states: torch.Tensor,
455
+ attention_mask: Optional[torch.LongTensor] = None,
456
+ position_ids: Optional[torch.LongTensor] = None,
457
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
458
+ output_attentions: bool = False,
459
+ use_cache: bool = False,
460
+ **kwargs,
461
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
462
+ output_attentions = False
463
+
464
+ bsz, q_len, _ = hidden_states.size()
465
+
466
+ query_states = self.q_proj(hidden_states)
467
+ key_states = self.k_proj(hidden_states)
468
+ value_states = self.v_proj(hidden_states)
469
+
470
+ # Flash attention requires the input to have the shape
471
+ # batch_size x seq_length x head_dim x hidden_dim
472
+ # therefore we just need to keep the original shape
473
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
474
+ key_states = key_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
475
+ value_states = value_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
476
+
477
+ kv_seq_len = key_states.shape[-2]
478
+ if past_key_value is not None:
479
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
480
+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
481
+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
482
+
483
+ # if past_key_value is not None:
484
+ # cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
485
+ # key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
486
+
487
+ # TODO: These transpose are quite inefficient but Flash Attention requires the layout [batch_size, sequence_length, num_heads, head_dim]. We would need to refactor the KV cache
488
+ # to be able to avoid many of these transpose/reshape/view.
489
+ query_states = query_states.transpose(1, 2)
490
+ key_states = key_states.transpose(1, 2)
491
+ value_states = value_states.transpose(1, 2)
492
+
493
+ dropout_rate = self.dropout if self.training else 0.0
494
+
495
+ # In PEFT, usually we cast the layer norms in float32 for training stability reasons
496
+ # therefore the input hidden states gets silently casted in float32. Hence, we need
497
+ # cast them back in the correct dtype just to be sure everything works as expected.
498
+ # This might slowdown training & inference so it is recommended to not cast the LayerNorms
499
+ # in fp32. (LlamaRMSNorm handles it correctly)
500
+
501
+ input_dtype = query_states.dtype
502
+ if input_dtype == torch.float32:
503
+ if torch.is_autocast_enabled():
504
+ target_dtype = torch.get_autocast_gpu_dtype()
505
+ # Handle the case where the model is quantized
506
+ elif hasattr(self.config, "_pre_quantization_dtype"):
507
+ target_dtype = self.config._pre_quantization_dtype
508
+ else:
509
+ target_dtype = self.q_proj.weight.dtype
510
+
511
+ logger.warning_once(
512
+ "The input hidden states seems to be silently casted in float32, this might be related to the fact"
513
+ " you have upcasted embedding or layer norm layers in float32. We will cast back the input in"
514
+ f" {target_dtype}."
515
+ )
516
+
517
+ query_states = query_states.to(target_dtype)
518
+ key_states = key_states.to(target_dtype)
519
+ value_states = value_states.to(target_dtype)
520
+
521
+ attn_output = self._flash_attention_forward(
522
+ query_states, key_states, value_states, attention_mask, q_len, dropout=dropout_rate
523
+ )
524
+
525
+ attn_output = attn_output.reshape(bsz, q_len, self.embed_dim).contiguous()
526
+ attn_output = self.out_proj(attn_output)
527
+
528
+ if not output_attentions:
529
+ attn_weights = None
530
+
531
+ return attn_output, attn_weights
532
+
533
+ def _flash_attention_forward(
534
+ self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None
535
+ ):
536
+ """
537
+ Calls the forward method of Flash Attention - if the input hidden states contain at least one padding token
538
+ first unpad the input, then computes the attention scores and pad the final attention scores.
539
+ Args:
540
+ query_states (`torch.Tensor`):
541
+ Input query states to be passed to Flash Attention API
542
+ key_states (`torch.Tensor`):
543
+ Input key states to be passed to Flash Attention API
544
+ value_states (`torch.Tensor`):
545
+ Input value states to be passed to Flash Attention API
546
+ attention_mask (`torch.Tensor`):
547
+ The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
548
+ position of padding tokens and 1 for the position of non-padding tokens.
549
+ dropout (`int`, *optional*):
550
+ Attention dropout
551
+ softmax_scale (`float`, *optional*):
552
+ The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
553
+ """
554
+
555
+ # TODO: Remove the `query_length != 1` check once Flash Attention for RoCm is bumped to 2.1. For details, please see the comment in LlamaFlashAttention2 __init__.
556
+ causal = self.is_causal and query_length != 1
557
+
558
+ # Contains at least one padding token in the sequence
559
+ if attention_mask is not None:
560
+ batch_size = query_states.shape[0]
561
+ query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
562
+ query_states, key_states, value_states, attention_mask, query_length
563
+ )
564
+
565
+ cu_seqlens_q, cu_seqlens_k = cu_seq_lens
566
+ max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
567
+
568
+ attn_output_unpad = flash_attn_varlen_func(
569
+ query_states,
570
+ key_states,
571
+ value_states,
572
+ cu_seqlens_q=cu_seqlens_q,
573
+ cu_seqlens_k=cu_seqlens_k,
574
+ max_seqlen_q=max_seqlen_in_batch_q,
575
+ max_seqlen_k=max_seqlen_in_batch_k,
576
+ dropout_p=dropout,
577
+ softmax_scale=softmax_scale,
578
+ causal=causal,
579
+ )
580
+
581
+ attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
582
+ else:
583
+ attn_output = flash_attn_func(
584
+ query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=causal
585
+ )
586
+
587
+ return attn_output
588
+
589
+ def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
590
+ indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
591
+ batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
592
+
593
+ key_layer = index_first_axis(
594
+ key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
595
+ )
596
+ value_layer = index_first_axis(
597
+ value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
598
+ )
599
+ if query_length == kv_seq_len:
600
+ query_layer = index_first_axis(
601
+ query_layer.reshape(batch_size * kv_seq_len, self.num_heads, head_dim), indices_k
602
+ )
603
+ cu_seqlens_q = cu_seqlens_k
604
+ max_seqlen_in_batch_q = max_seqlen_in_batch_k
605
+ indices_q = indices_k
606
+ elif query_length == 1:
607
+ max_seqlen_in_batch_q = 1
608
+ cu_seqlens_q = torch.arange(
609
+ batch_size + 1, dtype=torch.int32, device=query_layer.device
610
+ ) # There is a memcpy here, that is very bad.
611
+ indices_q = cu_seqlens_q[:-1]
612
+ query_layer = query_layer.squeeze(1)
613
+ else:
614
+ # The -q_len: slice assumes left padding.
615
+ attention_mask = attention_mask[:, -query_length:]
616
+ query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
617
+
618
+ return (
619
+ query_layer,
620
+ key_layer,
621
+ value_layer,
622
+ indices_q,
623
+ (cu_seqlens_q, cu_seqlens_k),
624
+ (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
625
+ )
626
+
627
+
628
+ # Copied from transformers.models.clip.modeling_clip.CLIPMLP with CLIP->Siglip
629
+ class SiglipMLP(nn.Module):
630
+ def __init__(self, config):
631
+ super().__init__()
632
+ self.config = config
633
+ self.activation_fn = ACT2FN[config.hidden_act]
634
+ self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size)
635
+ self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size)
636
+
637
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
638
+ hidden_states = self.fc1(hidden_states)
639
+ hidden_states = self.activation_fn(hidden_states)
640
+ hidden_states = self.fc2(hidden_states)
641
+ return hidden_states
642
+
643
+
644
+ # Copied from transformers.models.clip.modeling_clip.CLIPEncoderLayer with CLIP->Siglip
645
+ class SiglipEncoderLayer(nn.Module):
646
+ def __init__(self, config: SiglipVisionConfig):
647
+ super().__init__()
648
+ self.embed_dim = config.hidden_size
649
+ self._use_flash_attention_2 = config._attn_implementation == "flash_attention_2"
650
+ self.self_attn = SiglipAttention(config) if not self._use_flash_attention_2 else SiglipFlashAttention2(config)
651
+ self.layer_norm1 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
652
+ self.mlp = SiglipMLP(config)
653
+ self.layer_norm2 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
654
+
655
+ def forward(
656
+ self,
657
+ hidden_states: torch.Tensor,
658
+ attention_mask: torch.Tensor,
659
+ output_attentions: Optional[bool] = False,
660
+ ) -> Tuple[torch.FloatTensor]:
661
+ """
662
+ Args:
663
+ hidden_states (`torch.FloatTensor`):
664
+ Input to the layer of shape `(batch, seq_len, embed_dim)`.
665
+ attention_mask (`torch.FloatTensor`):
666
+ Attention mask of shape `(batch, 1, q_len, k_v_seq_len)` where padding elements are indicated by very large negative values.
667
+ output_attentions (`bool`, *optional*, defaults to `False`):
668
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
669
+ returned tensors for more detail.
670
+ """
671
+ residual = hidden_states
672
+
673
+ hidden_states = self.layer_norm1(hidden_states)
674
+ hidden_states, attn_weights = self.self_attn(
675
+ hidden_states=hidden_states,
676
+ attention_mask=attention_mask,
677
+ output_attentions=output_attentions,
678
+ )
679
+ hidden_states = residual + hidden_states
680
+
681
+ residual = hidden_states
682
+ hidden_states = self.layer_norm2(hidden_states)
683
+ hidden_states = self.mlp(hidden_states)
684
+ hidden_states = residual + hidden_states
685
+
686
+ outputs = (hidden_states,)
687
+
688
+ if output_attentions:
689
+ outputs += (attn_weights,)
690
+
691
+ return outputs
692
+
693
+
694
+ class SiglipPreTrainedModel(PreTrainedModel):
695
+ """
696
+ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
697
+ models.
698
+ """
699
+
700
+ config_class = SiglipVisionConfig
701
+ base_model_prefix = "siglip"
702
+ supports_gradient_checkpointing = True
703
+
704
+ def _init_weights(self, module):
705
+ """Initialize the weights"""
706
+
707
+ if isinstance(module, SiglipVisionEmbeddings):
708
+ width = self.config.hidden_size
709
+ nn.init.normal_(module.position_embedding.weight, std=1 / np.sqrt(width))
710
+ elif isinstance(module, nn.Embedding):
711
+ default_flax_embed_init(module.weight)
712
+ elif isinstance(module, SiglipAttention):
713
+ nn.init.normal_(module.q_proj.weight)
714
+ nn.init.normal_(module.k_proj.weight)
715
+ nn.init.normal_(module.v_proj.weight)
716
+ nn.init.normal_(module.out_proj.weight)
717
+ nn.init.zeros_(module.q_proj.bias)
718
+ nn.init.zeros_(module.k_proj.bias)
719
+ nn.init.zeros_(module.v_proj.bias)
720
+ nn.init.zeros_(module.out_proj.bias)
721
+ elif isinstance(module, SiglipMLP):
722
+ nn.init.normal_(module.fc1.weight)
723
+ nn.init.normal_(module.fc2.weight)
724
+ nn.init.normal_(module.fc1.bias, std=1e-6)
725
+ nn.init.normal_(module.fc2.bias, std=1e-6)
726
+ elif isinstance(module, (nn.Linear, nn.Conv2d)):
727
+ lecun_normal_(module.weight)
728
+ if module.bias is not None:
729
+ nn.init.zeros_(module.bias)
730
+ elif isinstance(module, nn.LayerNorm):
731
+ module.bias.data.zero_()
732
+ module.weight.data.fill_(1.0)
733
+
734
+
735
+ SIGLIP_START_DOCSTRING = r"""
736
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
737
+ library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
738
+ etc.)
739
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
740
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
741
+ and behavior.
742
+ Parameters:
743
+ config ([`SiglipVisionConfig`]): Model configuration class with all the parameters of the model.
744
+ Initializing with a config file does not load the weights associated with the model, only the
745
+ configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
746
+ """
747
+
748
+
749
+ SIGLIP_VISION_INPUTS_DOCSTRING = r"""
750
+ Args:
751
+ pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
752
+ Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using
753
+ [`AutoImageProcessor`]. See [`CLIPImageProcessor.__call__`] for details.
754
+ output_attentions (`bool`, *optional*):
755
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
756
+ tensors for more detail.
757
+ output_hidden_states (`bool`, *optional*):
758
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
759
+ more detail.
760
+ return_dict (`bool`, *optional*):
761
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
762
+ """
763
+
764
+
765
+ # Copied from transformers.models.clip.modeling_clip.CLIPEncoder with CLIP->Siglip
766
+ class SiglipEncoder(nn.Module):
767
+ """
768
+ Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a
769
+ [`SiglipEncoderLayer`].
770
+ Args:
771
+ config: SiglipConfig
772
+ """
773
+
774
+ def __init__(self, config: SiglipVisionConfig):
775
+ super().__init__()
776
+ self.config = config
777
+ self.layers = nn.ModuleList([SiglipEncoderLayer(config) for _ in range(config.num_hidden_layers)])
778
+ self.gradient_checkpointing = False
779
+
780
+ # Ignore copy
781
+ def forward(
782
+ self,
783
+ inputs_embeds,
784
+ attention_mask: Optional[torch.Tensor] = None,
785
+ output_attentions: Optional[bool] = None,
786
+ output_hidden_states: Optional[bool] = None,
787
+ return_dict: Optional[bool] = None,
788
+ ) -> Union[Tuple, BaseModelOutput]:
789
+ r"""
790
+ Args:
791
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
792
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
793
+ This is useful if you want more control over how to convert `input_ids` indices into associated vectors
794
+ than the model's internal embedding lookup matrix.
795
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
796
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
797
+ - 1 for tokens that are **not masked**,
798
+ - 0 for tokens that are **masked**.
799
+ [What are attention masks?](../glossary#attention-mask)
800
+ output_attentions (`bool`, *optional*):
801
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
802
+ returned tensors for more detail.
803
+ output_hidden_states (`bool`, *optional*):
804
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
805
+ for more detail.
806
+ return_dict (`bool`, *optional*):
807
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
808
+ """
809
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
810
+ output_hidden_states = (
811
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
812
+ )
813
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
814
+
815
+ encoder_states = () if output_hidden_states else None
816
+ all_attentions = () if output_attentions else None
817
+
818
+ hidden_states = inputs_embeds
819
+ for encoder_layer in self.layers:
820
+ if output_hidden_states:
821
+ encoder_states = encoder_states + (hidden_states,)
822
+ if self.gradient_checkpointing and self.training:
823
+ layer_outputs = self._gradient_checkpointing_func(
824
+ encoder_layer.__call__,
825
+ hidden_states,
826
+ attention_mask,
827
+ output_attentions,
828
+ )
829
+ else:
830
+ layer_outputs = encoder_layer(
831
+ hidden_states,
832
+ attention_mask,
833
+ output_attentions=output_attentions,
834
+ )
835
+
836
+ hidden_states = layer_outputs[0]
837
+
838
+ if output_attentions:
839
+ all_attentions = all_attentions + (layer_outputs[1],)
840
+
841
+ if output_hidden_states:
842
+ encoder_states = encoder_states + (hidden_states,)
843
+
844
+ if not return_dict:
845
+ return tuple(v for v in [hidden_states, encoder_states, all_attentions] if v is not None)
846
+ return BaseModelOutput(last_hidden_state=hidden_states, hidden_states=encoder_states, attentions=all_attentions)
847
+
848
+
849
+ @add_start_docstrings("""The vision model from SigLIP without any head or projection on top.""", SIGLIP_START_DOCSTRING)
850
+ class SiglipVisionTransformer(SiglipPreTrainedModel):
851
+ config_class = SiglipVisionConfig
852
+ main_input_name = "pixel_values"
853
+ _supports_flash_attn_2 = True
854
+ _no_split_modules = []
855
+
856
+ def __init__(self, config: SiglipVisionConfig):
857
+ super().__init__(config)
858
+ self.config = config
859
+ embed_dim = config.hidden_size
860
+
861
+ self.embeddings = SiglipVisionEmbeddings(config)
862
+ self.encoder = SiglipEncoder(config)
863
+ self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)
864
+ self._use_flash_attention_2 = config._attn_implementation == "flash_attention_2"
865
+
866
+ # Initialize weights and apply final processing
867
+ self.post_init()
868
+
869
+ def get_input_embeddings(self) -> nn.Module:
870
+ return self.embeddings.patch_embedding
871
+
872
+ @add_start_docstrings_to_model_forward(SIGLIP_VISION_INPUTS_DOCSTRING)
873
+ @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=SiglipVisionConfig)
874
+ def forward(
875
+ self,
876
+ pixel_values,
877
+ patch_attention_mask: Optional[torch.BoolTensor] = None,
878
+ tgt_sizes: Optional[torch.IntTensor] = None,
879
+ output_attentions: Optional[bool] = None,
880
+ output_hidden_states: Optional[bool] = None,
881
+ return_dict: Optional[bool] = None,
882
+ ) -> Union[Tuple, BaseModelOutputWithPooling]:
883
+ r"""
884
+ Returns:
885
+ """
886
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
887
+ output_hidden_states = (
888
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
889
+ )
890
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
891
+
892
+ batch_size = pixel_values.size(0)
893
+ if patch_attention_mask is None:
894
+ patch_attention_mask = torch.ones(
895
+ size=(
896
+ batch_size,
897
+ pixel_values.size(2) // self.config.patch_size,
898
+ pixel_values.size(3) // self.config.patch_size,
899
+ ),
900
+ dtype=torch.bool,
901
+ device=pixel_values.device,
902
+ )
903
+
904
+ hidden_states = self.embeddings(
905
+ pixel_values=pixel_values, patch_attention_mask=patch_attention_mask, tgt_sizes=tgt_sizes
906
+ )
907
+
908
+ patch_attention_mask = patch_attention_mask.view(batch_size, -1)
909
+ # The call to `_upad_input` in `_flash_attention_forward` is expensive
910
+ # So when the `patch_attention_mask` is full of 1s (i.e. attending to the whole sequence),
911
+ # avoiding passing the attention_mask, which is equivalent to attending to the full sequence
912
+ if not torch.any(~patch_attention_mask):
913
+ attention_mask = None
914
+ else:
915
+ attention_mask = (
916
+ _prepare_4d_attention_mask(patch_attention_mask, hidden_states.dtype)
917
+ if not self._use_flash_attention_2
918
+ else patch_attention_mask
919
+ )
920
+
921
+ encoder_outputs = self.encoder(
922
+ inputs_embeds=hidden_states,
923
+ attention_mask=attention_mask,
924
+ output_attentions=output_attentions,
925
+ output_hidden_states=output_hidden_states,
926
+ return_dict=return_dict,
927
+ )
928
+
929
+ last_hidden_state = encoder_outputs[0]
930
+ last_hidden_state = self.post_layernorm(last_hidden_state)
931
+
932
+ if not return_dict:
933
+ return (last_hidden_state, None) + encoder_outputs[1:]
934
+
935
+ return BaseModelOutputWithPooling(
936
+ last_hidden_state=last_hidden_state,
937
+ pooler_output=None,
938
+ hidden_states=encoder_outputs.hidden_states,
939
+ attentions=encoder_outputs.attentions,
940
+ )
preprocessor_config.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "image_processor_type": "MiniCPMVImageProcessor",
+   "auto_map": {
+     "AutoProcessor": "processing_minicpmo.MiniCPMOProcessor",
+     "AutoImageProcessor": "image_processing_minicpmv.MiniCPMVImageProcessor"
+   },
+   "processor_class": "MiniCPMOProcessor",
+   "max_slice_nums": 9,
+   "scale_resolution": 448,
+   "patch_size": 14,
+   "use_image_id": true,
+   "image_feature_size": 64,
+   "im_start": "<image>",
+   "im_end": "</image>",
+   "slice_start": "<slice>",
+   "slice_end": "</slice>",
+   "unk": "<unk>",
+   "im_id_start": "<image_id>",
+   "im_id_end": "</image_id>",
+   "slice_mode": true,
+   "norm_mean": [0.5, 0.5, 0.5],
+   "norm_std": [0.5, 0.5, 0.5],
+   "version": 2.6
+ }
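The `auto_map` above is what routes `AutoProcessor` / `AutoImageProcessor` to the custom classes shipped in this repo. A minimal loading sketch, assuming the checkpoint is pulled from the Hub; the repo id below is a placeholder assumption, not taken from this file:

```python
# Illustrative only: loading the custom processor declared in preprocessor_config.json.
from transformers import AutoImageProcessor, AutoProcessor, AutoTokenizer

repo_id = "openbmb/MiniCPM-o-2_6"  # placeholder repo id, replace with the actual one
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(repo_id, trust_remote_code=True)
# trust_remote_code=True is needed because auto_map points at processing_minicpmo.py
# and image_processing_minicpmv.py rather than at classes built into transformers.
```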
processing_minicpmo.py ADDED
@@ -0,0 +1,505 @@
1
+ # coding=utf-8
2
+ # Copyright 2025 The OpenBMB Team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ """
16
+ Processor class for MiniCPMO.
17
+ """
18
+
19
+ import math
20
+ import re
21
+ from typing import List
22
+ from typing import Literal
23
+ from typing import Optional
24
+ from typing import Union
25
+
26
+ import numpy as np
27
+ import torch
28
+ import torchaudio
29
+ from transformers.image_utils import ImageInput
30
+ from transformers.processing_utils import ProcessorMixin
31
+ from transformers.tokenization_utils_base import PreTokenizedInput
32
+ from transformers.tokenization_utils_base import TextInput
33
+ from transformers.utils import TensorType
34
+
35
+ from .image_processing_minicpmv import MiniCPMOBatchFeature
36
+
37
+
38
+ class MiniCPMOProcessor(ProcessorMixin):
39
+ r"""
40
+ Constructs a MiniCPMV processor which wraps a MiniCPMV image processor and a MiniCPMV tokenizer into a single processor.
41
+
42
+ [`MiniCPMVProcessor`] offers all the functionalities of [`MiniCPMVImageProcessor`] and [`LlamaTokenizerWrapper`]. See the
43
+ [`~MiniCPMVProcessor.__call__`] and [`~MiniCPMVProcessor.decode`] for more information.
44
+
45
+ Args:
46
+ image_processor ([`MiniCPMVImageProcessor`], *optional*):
47
+ The image processor is a required input.
48
+ tokenizer ([`LlamaTokenizerWrapper`], *optional*):
49
+ The tokenizer is a required input.
50
+ """
51
+
52
+ attributes = ["image_processor", "feature_extractor", "tokenizer"]
53
+ feature_extractor_class = "WhisperFeatureExtractor"
54
+ image_processor_class = "AutoImageProcessor"
55
+ tokenizer_class = "AutoTokenizer"
56
+
57
+ def __init__(self, image_processor=None, feature_extractor=None, tokenizer=None):
58
+ super().__init__(image_processor, feature_extractor, tokenizer)
59
+ self.version = image_processor.version
60
+
61
+ def __call__(
62
+ self,
63
+ text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]],
64
+ images: ImageInput = None,
65
+ audios: Union[np.ndarray, List[np.ndarray], List[List[np.ndarray]]] = None,
66
+ audio_parts: Optional[list] = None,
67
+ max_length: Optional[int] = None,
68
+ do_pad: Optional[bool] = True,
69
+ max_slice_nums: int = None,
70
+ use_image_id: bool = True,
71
+ chunk_input: bool = False,
72
+ return_tensors: Optional[Union[str, TensorType]] = TensorType.PYTORCH,
73
+ sampling_rate: Optional[int] = 16000,
74
+ **kwargs,
75
+ ) -> MiniCPMOBatchFeature:
76
+ if images is not None:
77
+ image_inputs = self.image_processor(
78
+ images, do_pad=do_pad, max_slice_nums=max_slice_nums, return_tensors=return_tensors
79
+ )
80
+ else:
81
+ image_inputs = None
82
+
83
+ if audios is not None:
84
+ audio_features, audio_feature_lens, audio_phs = self.audio_feature_extract(
85
+ audios, audio_parts, chunk_input, sampling_rate
86
+ )
87
+ else:
88
+ audio_features, audio_feature_lens, audio_phs = [], [], []
89
+
90
+ model_inputs = self._convert_omni_to_inputs(
91
+ image_inputs,
92
+ audio_phs,
93
+ text,
94
+ max_slice_nums=max_slice_nums,
95
+ use_image_id=use_image_id,
96
+ max_length=max_length,
97
+ **kwargs,
98
+ )
99
+
100
+ model_inputs["audio_features"] = audio_features
101
+ model_inputs["audio_feature_lens"] = audio_feature_lens
102
+
103
+ return MiniCPMOBatchFeature(data={**model_inputs})
104
+
105
+ def audio_feature_extract(
106
+ self,
107
+ audios: Union[np.ndarray, List[np.ndarray], List[List[np.ndarray]]],
108
+ audio_parts: Optional[list] = None,
109
+ chunk_input: Optional[bool] = False,
110
+ sampling_rate: Optional[int] = None,
111
+ chunk_length: Optional[int] = 1,
112
+ **kwargs,
113
+ ):
114
+ def get_audio_placeholder(audio_lens, chunk_input):
115
+ pool_step = 2
116
+ feature_lens = math.ceil(audio_lens / self.feature_extractor.hop_length)
117
+
118
+ feature_lens = (feature_lens - 1) // 2 + 1
119
+ output_lens = (feature_lens - pool_step) // pool_step + 1
120
+
121
+ if chunk_input:
122
+ fbank_feat_in_chunk = int(chunk_length * 100)
123
+ cnn_feat_in_chunk = (fbank_feat_in_chunk - 1) // 2 + 1
124
+ audio_embeds_in_chunk = (cnn_feat_in_chunk - pool_step) // pool_step + 1
125
+ num_audio_chunks = (output_lens + audio_embeds_in_chunk - 1) // audio_embeds_in_chunk
126
+
127
+ place_holders = ""
128
+ total_unk_len = 0
129
+ for _ in range(num_audio_chunks):
130
+ unk_len = min(audio_embeds_in_chunk, output_lens - total_unk_len)
131
+ place_holders += self.tokenizer.audio_start + "<unk>" * unk_len + self.tokenizer.audio_end
132
+ total_unk_len += unk_len
133
+ audio_placeholder = place_holders
134
+ else:
135
+ audio_placeholder = self.tokenizer.audio_start + "<unk>" * output_lens + self.tokenizer.audio_end
136
+
137
+ return audio_placeholder
138
+
139
+ if isinstance(audios, np.ndarray):
140
+ audios_list = [[audios]]
141
+ elif isinstance(audios[0], np.ndarray):
142
+ audios_list = [audios]
143
+ else:
144
+ audios_list = audios
145
+
146
+ if audio_parts is not None:
147
+ assert len(audio_parts) == len(audios_list)
148
+ for parts, audios in zip(audio_parts, audios_list):
149
+ assert len(parts) == len(audios)
150
+
151
+ audio_feature_lens_list = []
152
+ audio_ph_list = []
153
+
154
+ audio_features_all = []
155
+
156
+ # audio placeholder not dependent on audio_parts
157
+ for audios in audios_list:
158
+ if audios:
159
+ audio_ph_list.append([get_audio_placeholder(len(a), chunk_input) for a in audios])
160
+ else:
161
+ audio_ph_list.append([])
162
+
163
+ for idx, audios in enumerate(audios_list):
164
+ if audio_parts is not None:
165
+ # same audio part merge
166
+ audio_part = audio_parts[idx]
167
+ merge_audio = []
168
+ cur_audio = []
169
+ for aid, (part, audio) in enumerate(zip(audio_part, audios)):
170
+ if aid == 0 or audio_part[aid] == audio_part[aid - 1]:
171
+ cur_audio.append(audio)
172
+ else:
173
+ merge_audio.append(np.hstack(cur_audio))
174
+ cur_audio = [audio]
175
+ if cur_audio:
176
+ merge_audio.append(np.hstack(cur_audio))
177
+
178
+ else:
179
+ merge_audio = audios
180
+
181
+ audio_feature_lens = []
182
+
183
+ # If the audio exceeds 30 seconds, split it into chunks every 30 seconds.
184
+ final_merge_audio = []
185
+ max_audio_inp_len = 30 * sampling_rate
186
+ for audio in merge_audio:
187
+ if len(audio) <= max_audio_inp_len:
188
+ final_merge_audio.append(audio)
189
+ else:
190
+ for i in range(math.ceil(len(audio) / max_audio_inp_len)):
191
+ final_merge_audio.append(audio[i * max_audio_inp_len : (i + 1) * max_audio_inp_len])
192
+
193
+ if audios:
194
+ audio_inputs = self.feature_extractor(
195
+ final_merge_audio,
196
+ sampling_rate=sampling_rate,
197
+ return_attention_mask=True,
198
+ padding="max_length",
199
+ return_tensors="pt",
200
+ **kwargs,
201
+ )
202
+ audio_feature = audio_inputs["input_features"]
203
+ actual_lens = audio_inputs["attention_mask"].sum(dim=1)
204
+
205
+ for feat, lens in zip(audio_feature, actual_lens):
206
+ audio_features_all.append(feat[:, :lens])
207
+ audio_feature_lens.append(lens)
208
+
209
+ audio_feature_lens = torch.hstack(audio_feature_lens)
210
+ audio_feature_lens_list.append(audio_feature_lens)
211
+ else:
212
+ audio_feature_lens_list.append([])
213
+
214
+ if audio_features_all:
215
+ audio_features = [i.permute(1, 0) for i in audio_features_all]
216
+ audio_features = torch.nn.utils.rnn.pad_sequence(
217
+ audio_features, batch_first=True, padding_value=0.0
218
+ ).permute(0, 2, 1)
219
+ else:
220
+ audio_features = []
221
+
222
+ return audio_features, audio_feature_lens_list, audio_ph_list
223
+
224
+ # Copied from transformers.models.clip.processing_clip.CLIPProcessor.batch_decode with CLIP->Llama
225
+ def batch_decode(self, *args, **kwargs):
226
+ """
227
+ This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
228
+ refer to the docstring of this method for more information.
229
+ """
230
+ output_ids = args[0]
231
+ result_text = []
232
+ for result in output_ids:
233
+ result = result[result != 0]
234
+ if result[0] == self.tokenizer.bos_id:
235
+ result = result[1:]
236
+ if result[-1] == self.tokenizer.eos_id:
237
+ result = result[:-1]
238
+ result_text.append(self.tokenizer.decode(result, *args[1:], **kwargs).strip())
239
+ return result_text
240
+ # return self.tokenizer.batch_decode(*args, **kwargs)
241
+
242
+ # Copied from transformers.models.clip.processing_clip.CLIPProcessor.decode with CLIP->Llama
243
+ def decode(self, *args, **kwargs):
244
+ """
245
+ This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
246
+ the docstring of this method for more information.
247
+ """
248
+ result = args[0]
249
+ result = result[result != 0]
250
+ if result[0] == self.tokenizer.bos_id:
251
+ result = result[1:]
252
+ if result[-1] == self.tokenizer.eos_id or (
253
+ hasattr(self.tokenizer, "eot_id") and result[-1] == self.tokenizer.eot_id
254
+ ):
255
+ result = result[:-1]
256
+ return self.tokenizer.decode(result, *args[1:], **kwargs).strip()
257
+
258
+ def _convert(self, input_str, max_inp_length: Optional[int] = None, **kwargs):
259
+ input_ids = self.tokenizer.encode(input_str, **kwargs)
260
+ if max_inp_length is not None:
261
+ input_ids = input_ids[:max_inp_length]
262
+ input_ids = torch.tensor(input_ids, dtype=torch.int32)
263
+
264
+ ## image bound
265
+ start_cond = (input_ids == self.tokenizer.im_start_id) | (input_ids == self.tokenizer.slice_start_id)
266
+ end_cond = (input_ids == self.tokenizer.im_end_id) | (input_ids == self.tokenizer.slice_end_id)
267
+
268
+ image_start_idx = torch.where(start_cond)[0]
269
+ image_start_idx += 1
270
+ image_end_idx = torch.where(end_cond)[0]
271
+
272
+ valid_image_nums = max(len(image_start_idx), len(image_end_idx))
273
+
274
+ image_bounds = torch.hstack(
275
+ [
276
+ image_start_idx[:valid_image_nums].unsqueeze(-1),
277
+ image_end_idx[:valid_image_nums].unsqueeze(-1),
278
+ ]
279
+ )
280
+
281
+ ## audio bound
282
+ audio_start_idx = torch.where(input_ids == self.tokenizer.audio_start_id)[0]
283
+ audio_end_idx = torch.where(input_ids == self.tokenizer.audio_end_id)[0]
284
+ assert len(audio_start_idx) == len(audio_end_idx)
285
+ audio_bounds = torch.hstack([(audio_start_idx + 1).unsqueeze(-1), audio_end_idx.unsqueeze(-1)])
286
+
287
+ spk_start_idx = torch.where(input_ids == self.tokenizer.spk_start_id)[0]
288
+ spk_end_idx = torch.where(input_ids == self.tokenizer.spk_end_id)[0]
289
+ assert len(spk_start_idx) == len(spk_end_idx)
290
+ spk_bounds = torch.hstack([(spk_start_idx + 1).unsqueeze(-1), spk_end_idx.unsqueeze(-1)])
291
+
292
+ return input_ids, image_bounds, audio_bounds, spk_bounds
293
+
294
+ def _convert_omni_to_inputs(
295
+ self,
296
+ images,
297
+ audio_phs,
298
+ texts: Union[str, List[str]],
299
+ truncation=None,
300
+ max_length=None,
301
+ max_slice_nums=None,
302
+ use_image_id=None,
303
+ return_tensors=None,
304
+ **kwargs,
305
+ ):
306
+ if images is None and audio_phs is None:
307
+ model_inputs = self.tokenizer(
308
+ texts, return_tensors=return_tensors, truncation=truncation, max_length=max_length, **kwargs
309
+ )
310
+ return MiniCPMOBatchFeature(data={**model_inputs})
311
+
312
+ image_tag = "(<image>./</image>)"
313
+ image_pattern = r"\(<image>./</image>\)"
314
+ audio_tag = "(<audio>./</audio>)"
315
+ audio_pattern = r"\(<audio>./</audio>\)"
316
+ split_pattern = f"({image_pattern}|{audio_pattern})"
317
+
318
+ if isinstance(texts, str):
319
+ texts = [texts]
320
+
321
+ bs = len(texts)
322
+ if images is not None:
323
+ images, image_sizes, tgt_sizes = images["pixel_values"], images["image_sizes"], images["tgt_sizes"]
324
+ else:
325
+ images, image_sizes, tgt_sizes = [[]] * bs, [[]] * bs, [[]] * bs
326
+
327
+ input_ids_list = []
328
+ image_bounds_list = []
329
+ audio_bounds_list = []
330
+ spk_bounds_list = []
331
+
332
+ for index, text in enumerate(texts):
333
+ text_chunks = re.split(split_pattern, text)
334
+
335
+ image_tags = re.findall(image_pattern, text)
336
+ audio_tags = re.findall(audio_pattern, text)
337
+
338
+ if image_tags:
339
+ assert images is not None
340
+ assert len(image_tags) == len(image_sizes[index])
341
+ if audio_tags:
342
+ assert audio_phs is not None
343
+ assert len(audio_tags) == len(audio_phs[index])
344
+
345
+ image_id = 0
346
+ audio_id = 0
347
+ for i, chunk in enumerate(text_chunks):
348
+ if chunk == image_tag:
349
+ image_placeholder = self.image_processor.get_slice_image_placeholder(
350
+ image_sizes[index][image_id], image_id, max_slice_nums, use_image_id
351
+ )
352
+ image_id += 1
353
+ text_chunks[i] = image_placeholder
354
+ elif chunk == audio_tag:
355
+ audio_placeholder = audio_phs[index][audio_id]
356
+ audio_id += 1
357
+ text_chunks[i] = audio_placeholder
358
+
359
+ final_text = "".join(text_chunks)
360
+ input_ids, image_bounds, audio_bounds, spk_bounds = self._convert(final_text, max_length, **kwargs)
361
+
362
+ input_ids_list.append(input_ids)
363
+ image_bounds_list.append(image_bounds)
364
+ audio_bounds_list.append(audio_bounds)
365
+ spk_bounds_list.append(spk_bounds)
366
+
367
+ padded_input_ids, padding_lengths = self.pad(input_ids_list, padding_side="left")
368
+ attention_mask = torch.ones_like(padded_input_ids, dtype=torch.bool)
369
+ for i, length in enumerate(padding_lengths):
370
+ image_bounds_list[i] = image_bounds_list[i] + length
371
+ audio_bounds_list[i] = audio_bounds_list[i] + length
372
+ spk_bounds_list[i] = spk_bounds_list[i] + length
373
+ attention_mask[i, :length] = False
374
+
375
+ data = {
376
+ "input_ids": padded_input_ids,
377
+ "attention_mask": attention_mask,
378
+ "pixel_values": images,
379
+ "image_sizes": image_sizes,
380
+ "image_bound": image_bounds_list,
381
+ "tgt_sizes": tgt_sizes,
382
+ "audio_bounds": audio_bounds_list,
383
+ "spk_bounds": spk_bounds_list,
384
+ }
385
+
386
+ return data
387
+
388
+ @property
389
+ # Copied from transformers.models.clip.processing_clip.CLIPProcessor.model_input_names
390
+ def model_input_names(self):
391
+ tokenizer_input_names = self.tokenizer.model_input_names
392
+ image_processor_input_names = self.image_processor.model_input_names
393
+ feature_extractor_input_names = self.feature_extractor.model_input_names
394
+ return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names + feature_extractor_input_names))
395
+
396
+ def pad(self, inputs, max_length=None, padding_value=0, padding_side="left"):
397
+ items = []
398
+ if isinstance(inputs[0], list):
399
+ assert isinstance(inputs[0][0], torch.Tensor)
400
+ for it in inputs:
401
+ for tr in it:
402
+ items.append(tr)
403
+ else:
404
+ assert isinstance(inputs[0], torch.Tensor)
405
+ items = inputs
406
+
407
+ batch_size = len(items)
408
+ shape = items[0].shape
409
+ dim = len(shape)
410
+ assert dim <= 2
411
+ if max_length is None:
412
+ max_length = 0
413
+ max_length = max(max_length, max(item.shape[-1] for item in items))
414
+ min_length = min(item.shape[-1] for item in items)
415
+ dtype = items[0].dtype
416
+
417
+ if dim == 0:
418
+ return torch.stack([item for item in items], dim=0), [0]
419
+ elif dim == 1:
420
+ if max_length == min_length:
421
+ return torch.stack([item for item in items], dim=0), [0] * batch_size
422
+ tensor = torch.zeros((batch_size, max_length), dtype=dtype) + padding_value
423
+ else:
424
+ tensor = torch.zeros((batch_size, max_length, shape[-1]), dtype=dtype) + padding_value
425
+
426
+ padding_length = []
427
+ for i, item in enumerate(items):
428
+ if dim == 1:
429
+ if padding_side == "left":
430
+ tensor[i, -len(item) :] = item.clone()
431
+ else:
432
+ tensor[i, : len(item)] = item.clone()
433
+ elif dim == 2:
434
+ if padding_side == "left":
435
+ tensor[i, -len(item) :, :] = item.clone()
436
+ else:
437
+ tensor[i, : len(item), :] = item.clone()
438
+ padding_length.append(tensor.shape[-1] - len(item))
439
+
440
+ return tensor, padding_length
441
+
442
+
443
+ class MelSpectrogramFeatures(torch.nn.Module):
444
+ def __init__(
445
+ self,
446
+ sample_rate=24000,
447
+ n_fft=1024,
448
+ hop_length=256,
449
+ n_mels=100,
450
+ padding: Literal["center", "same"] = "center",
451
+ ):
452
+ super().__init__()
453
+ if padding not in ["center", "same"]:
454
+ raise ValueError("Padding must be 'center' or 'same'.")
455
+ self.padding = padding
456
+ self.mel_spec = torchaudio.transforms.MelSpectrogram(
457
+ sample_rate=sample_rate,
458
+ n_fft=n_fft,
459
+ hop_length=hop_length,
460
+ n_mels=n_mels,
461
+ center=padding == "center",
462
+ power=1,
463
+ )
464
+
465
+ def __call__(self, audio: torch.Tensor) -> torch.Tensor:
466
+ """
467
+ audio: Tensor([num_channels, num_samples])
468
+ """
469
+ return super().__call__(audio)
470
+
471
+ def forward(self, audio: torch.Tensor) -> torch.Tensor:
472
+ """
473
+ audio: Tensor([num_channels, num_samples])
474
+ """
475
+ mel: torch.Tensor = self.mel_spec(audio)
476
+ features = torch.log(torch.clip(mel, min=1e-5))
477
+ return features
478
+
479
+
480
+ class ChatTTSProcessor:
481
+ def __init__(self, text_tokenizer):
482
+ self.audio_processor = MelSpectrogramFeatures()
483
+ self.text_tokenizer = text_tokenizer
484
+
485
+ def __call__(self, text_list, audio_list):
486
+ assert len(text_list) == len(audio_list)
487
+ input_ids_varlen = []
488
+ for text in text_list:
489
+ input_ids_ = self.text_tokenizer.encode(text, return_tensors="pt", add_special_tokens=False) # [1, seq_len]
490
+ input_ids_ = input_ids_.squeeze(0) # [seq_len]
491
+ input_ids_varlen.append(input_ids_)
492
+
493
+ audio_features_varlen = []
494
+ for audio in audio_list:
495
+ assert audio.shape.__len__() == 1 # [seq_len]
496
+ try:
497
+ mel = self.audio_processor(audio) # [100(num_mel_bins), seq_len_mel]
498
+ except Exception as e:
499
+ raise e
500
+ audio_features_varlen.append(mel)
501
+
502
+ return {
503
+ "tts_input_ids_varlen": input_ids_varlen, # return List[Tensor]
504
+ "tts_input_features_varlen": audio_features_varlen, # return List[Tensor]
505
+ }
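For intuition, here is a small standalone sketch of the `<unk>` placeholder-length arithmetic used in `get_audio_placeholder` above. It assumes the Whisper feature extractor's usual 16 kHz input and `hop_length = 160`; the real values come from `self.feature_extractor` at runtime.

```python
import math

def audio_placeholder_len(num_samples: int, hop_length: int = 160, pool_step: int = 2) -> int:
    """Number of <unk> audio tokens for one (non-chunked) audio segment (illustrative)."""
    feature_lens = math.ceil(num_samples / hop_length)        # log-mel frames
    feature_lens = (feature_lens - 1) // 2 + 1                # stride-2 conv downsampling
    return (feature_lens - pool_step) // pool_step + 1        # average pooling by pool_step

# 10 seconds of 16 kHz audio -> 1000 frames -> 500 -> 250 placeholder tokens
print(audio_placeholder_len(16_000 * 10))  # 250
```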
quantize_config.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "bits": 4,
+   "group_size": -1,
+   "damp_percent": 0.01,
+   "desc_act": true,
+   "static_groups": false,
+   "sym": true,
+   "true_sequential": true,
+   "lm_head": false,
+   "model_name_or_path": null,
+   "model_file_base_name": null,
+   "quant_method": "gptq",
+   "checkpoint_format": "gptq"
+ }
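These fields describe a 4-bit GPTQ checkpoint: symmetric quantization, activation-order (`desc_act`) quantization, and no sub-row grouping since `group_size` is -1. The quantized-model loader reads this JSON directly; purely for illustration, roughly the same settings expressed with the `GPTQConfig` class from transformers (an assumption about tooling, not something this repo requires):

```python
# Illustrative mapping of quantize_config.json onto transformers' GPTQConfig.
from transformers import GPTQConfig

gptq_config = GPTQConfig(
    bits=4,                # 4-bit weights
    group_size=-1,         # -1: a single quantization group per weight row
    damp_percent=0.01,
    desc_act=True,         # quantize columns in order of decreasing activation magnitude
    sym=True,              # symmetric quantization
    true_sequential=True,  # quantize layers sequentially within each block
)
```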
resampler.py ADDED
@@ -0,0 +1,864 @@
1
+ # coding=utf-8
2
+ # Copyright 2025 The OpenBMB Team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+
16
+ import warnings
17
+ from functools import partial
18
+ from typing import Optional
19
+ from typing import Tuple
20
+
21
+ import numpy as np
22
+ import torch
23
+ import torch.nn.functional as F
24
+ from torch import nn
25
+ from torch import Tensor
26
+ from torch.nn.functional import *
27
+ from torch.nn.init import trunc_normal_
28
+ from torch.nn.modules.activation import *
29
+ from transformers.integrations import is_deepspeed_zero3_enabled
30
+
31
+
32
+ def get_2d_sincos_pos_embed(embed_dim, image_size):
33
+ """
34
+ image_size: image_size or (image_height, image_width)
35
+ return:
36
+ pos_embed: [image_height, image_width, embed_dim]
37
+ """
38
+ if isinstance(image_size, int):
39
+ grid_h_size, grid_w_size = image_size, image_size
40
+ else:
41
+ grid_h_size, grid_w_size = image_size[0], image_size[1]
42
+
43
+ grid_h = np.arange(grid_h_size, dtype=np.float32)
44
+ grid_w = np.arange(grid_w_size, dtype=np.float32)
45
+ grid = np.meshgrid(grid_w, grid_h) # here w goes first
46
+ grid = np.stack(grid, axis=0)
47
+
48
+ pos_embed = get_2d_sincos_pos_embed_from_grid(embed_dim, grid)
49
+ return pos_embed
50
+
51
+
52
+ def get_2d_sincos_pos_embed_from_grid(embed_dim, grid):
53
+ assert embed_dim % 2 == 0
54
+
55
+ # use half of dimensions to encode grid_h
56
+ emb_h = get_1d_sincos_pos_embed_from_grid_new(embed_dim // 2, grid[0]) # (H, W, D/2)
57
+ emb_w = get_1d_sincos_pos_embed_from_grid_new(embed_dim // 2, grid[1]) # (H, W, D/2)
58
+
59
+ emb = np.concatenate([emb_h, emb_w], axis=-1) # (H, W, D)
60
+ return emb
61
+
62
+
63
+ def get_1d_sincos_pos_embed_from_grid_new(embed_dim, pos):
64
+ """
65
+ embed_dim: output dimension for each position
66
+ pos: a list of positions to be encoded: size (H, W)
67
+ out: (H, W, D)
68
+ """
69
+ assert embed_dim % 2 == 0
70
+ omega = np.arange(embed_dim // 2, dtype=np.float32)
71
+ omega /= embed_dim / 2.0
72
+ omega = 1.0 / 10000**omega # (D/2,)
73
+
74
+ out = np.einsum("hw,d->hwd", pos, omega) # (H, W, D/2), outer product
75
+
76
+ emb_sin = np.sin(out) # (H, W, D/2)
77
+ emb_cos = np.cos(out) # (H, W, D/2)
78
+
79
+ emb = np.concatenate([emb_sin, emb_cos], axis=-1) # (H, W, D)
80
+ return emb
81
+
82
+
83
+ class Resampler(nn.Module):
84
+ """
85
+ A 2D perceiver-resampler network with a single cross-attention layer over
+ learnable queries and 2D sincos position embeddings.
87
+ Outputs:
88
+ A tensor with the shape of (batch_size, num_queries, embed_dim)
89
+ """
90
+
91
+ def __init__(
92
+ self,
93
+ num_queries,
94
+ embed_dim,
95
+ num_heads,
96
+ kv_dim=None,
97
+ norm_layer=partial(nn.LayerNorm, eps=1e-6),
98
+ adaptive=False,
99
+ max_size=(70, 70),
100
+ ):
101
+ super().__init__()
102
+ self.num_queries = num_queries
103
+ self.embed_dim = embed_dim
104
+ self.num_heads = num_heads
105
+ self.adaptive = adaptive
106
+ self.max_size = max_size
107
+
108
+ self.query = nn.Parameter(torch.zeros(self.num_queries, embed_dim))
109
+
110
+ if kv_dim is not None and kv_dim != embed_dim:
111
+ self.kv_proj = nn.Linear(kv_dim, embed_dim, bias=False)
112
+ else:
113
+ self.kv_proj = nn.Identity()
114
+
115
+ self.attn = MultiheadAttention(embed_dim, num_heads)
116
+ self.ln_q = norm_layer(embed_dim)
117
+ self.ln_kv = norm_layer(embed_dim)
118
+
119
+ self.ln_post = norm_layer(embed_dim)
120
+ self.proj = nn.Parameter((embed_dim**-0.5) * torch.randn(embed_dim, embed_dim))
121
+
122
+ self._set_2d_pos_cache(self.max_size)
123
+
124
+ def _set_2d_pos_cache(self, max_size, device="cpu"):
125
+ if is_deepspeed_zero3_enabled():
126
+ device = "cuda"
127
+ pos_embed = torch.from_numpy(get_2d_sincos_pos_embed(self.embed_dim, max_size)).float().to(device)
128
+ self.register_buffer("pos_embed", pos_embed, persistent=False)
129
+
130
+ def _adjust_pos_cache(self, tgt_sizes, device):
131
+ max_h = torch.max(tgt_sizes[:, 0])
132
+ max_w = torch.max(tgt_sizes[:, 1])
133
+ if max_h > self.max_size[0] or max_w > self.max_size[1]:
134
+ self.max_size = [max(max_h, self.max_size[0]), max(max_w, self.max_size[1])]
135
+ self._set_2d_pos_cache(self.max_size, device)
136
+
137
+ def _init_weights(self, m):
138
+ if isinstance(m, nn.Linear):
139
+ trunc_normal_(m.weight, std=0.02)
140
+ if isinstance(m, nn.Linear) and m.bias is not None:
141
+ nn.init.constant_(m.bias, 0)
142
+ elif isinstance(m, nn.LayerNorm):
143
+ nn.init.constant_(m.bias, 0)
144
+ nn.init.constant_(m.weight, 1.0)
145
+
146
+ def forward(self, x, tgt_sizes=None):
147
+ assert x.shape[0] == tgt_sizes.shape[0]
148
+ bs = x.shape[0]
149
+
150
+ device = x.device
151
+ dtype = x.dtype
152
+
153
+ patch_len = tgt_sizes[:, 0] * tgt_sizes[:, 1]
154
+
155
+ self._adjust_pos_cache(tgt_sizes, device=device)
156
+
157
+ max_patch_len = torch.max(patch_len)
158
+ key_padding_mask = torch.zeros((bs, max_patch_len), dtype=torch.bool, device=device)
159
+
160
+ pos_embed = []
161
+ for i in range(bs):
162
+ tgt_h, tgt_w = tgt_sizes[i]
163
+ pos_embed.append(self.pos_embed[:tgt_h, :tgt_w, :].reshape((tgt_h * tgt_w, -1)).to(dtype)) # patches * D
164
+ key_padding_mask[i, patch_len[i] :] = True
165
+
166
+ pos_embed = torch.nn.utils.rnn.pad_sequence(pos_embed, batch_first=True, padding_value=0.0).permute(
167
+ 1, 0, 2
168
+ ) # BLD => L * B * D
169
+
170
+ x = self.kv_proj(x) # B * L * D
171
+ x = self.ln_kv(x).permute(1, 0, 2) # L * B * D
172
+
173
+ q = self.ln_q(self.query) # Q * D
174
+
175
+ out = self.attn(
176
+ self._repeat(q, bs), # Q * B * D
177
+ x + pos_embed, # L * B * D + L * B * D
178
+ x,
179
+ key_padding_mask=key_padding_mask,
180
+ )[0]
181
+ # out: Q * B * D
182
+ x = out.permute(1, 0, 2) # B * Q * D
183
+
184
+ x = self.ln_post(x)
185
+ x = x @ self.proj
186
+ return x
187
+
188
+ def _repeat(self, query, N: int):
189
+ return query.unsqueeze(1).repeat(1, N, 1)
190
+
191
+
192
+ class MultiheadAttention(nn.MultiheadAttention):
193
+ def __init__(
194
+ self,
195
+ embed_dim,
196
+ num_heads,
197
+ dropout=0.0,
198
+ bias=True,
199
+ add_bias_kv=False,
200
+ add_zero_attn=False,
201
+ kdim=None,
202
+ vdim=None,
203
+ batch_first=False,
204
+ device=None,
205
+ dtype=None,
206
+ ):
207
+ super().__init__(
208
+ embed_dim, num_heads, dropout, bias, add_bias_kv, add_zero_attn, kdim, vdim, batch_first, device, dtype
209
+ )
210
+
211
+ # re-create out_proj as a plain nn.Linear (the parent class uses NonDynamicallyQuantizableLinear)
212
+ self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias, device=device, dtype=dtype)
213
+
214
+ def forward(
215
+ self,
216
+ query: Tensor,
217
+ key: Tensor,
218
+ value: Tensor,
219
+ key_padding_mask: Optional[Tensor] = None,
220
+ need_weights: bool = True,
221
+ attn_mask: Optional[Tensor] = None,
222
+ average_attn_weights: bool = True,
223
+ is_causal: bool = False,
224
+ ) -> Tuple[Tensor, Optional[Tensor]]:
225
+ why_not_fast_path = ""
226
+ if (
227
+ (attn_mask is not None and torch.is_floating_point(attn_mask))
228
+ or (key_padding_mask is not None)
229
+ and torch.is_floating_point(key_padding_mask)
230
+ ):
231
+ why_not_fast_path = "floating-point masks are not supported for fast path."
232
+
233
+ is_batched = query.dim() == 3
234
+
235
+ key_padding_mask = _canonical_mask(
236
+ mask=key_padding_mask,
237
+ mask_name="key_padding_mask",
238
+ other_type=F._none_or_dtype(attn_mask),
239
+ other_name="attn_mask",
240
+ target_type=query.dtype,
241
+ )
242
+
243
+ attn_mask = _canonical_mask(
244
+ mask=attn_mask,
245
+ mask_name="attn_mask",
246
+ other_type=None,
247
+ other_name="",
248
+ target_type=query.dtype,
249
+ check_other=False,
250
+ )
251
+
252
+ if not is_batched:
253
+ why_not_fast_path = f"input not batched; expected query.dim() of 3 but got {query.dim()}"
254
+ elif query is not key or key is not value:
255
+ # When lifting this restriction, don't forget to either
256
+ # enforce that the dtypes all match or test cases where
257
+ # they don't!
258
+ why_not_fast_path = "non-self attention was used (query, key, and value are not the same Tensor)"
259
+ elif self.in_proj_bias is not None and query.dtype != self.in_proj_bias.dtype:
260
+ why_not_fast_path = (
261
+ f"dtypes of query ({query.dtype}) and self.in_proj_bias ({self.in_proj_bias.dtype}) don't match"
262
+ )
263
+ elif self.in_proj_weight is None:
264
+ why_not_fast_path = "in_proj_weight was None"
265
+ elif query.dtype != self.in_proj_weight.dtype:
266
+ # this case will fail anyway, but at least they'll get a useful error message.
267
+ why_not_fast_path = (
268
+ f"dtypes of query ({query.dtype}) and self.in_proj_weight ({self.in_proj_weight.dtype}) don't match"
269
+ )
270
+ elif self.training:
271
+ why_not_fast_path = "training is enabled"
272
+ elif (self.num_heads % 2) != 0:
273
+ why_not_fast_path = "self.num_heads is not even"
274
+ elif not self.batch_first:
275
+ why_not_fast_path = "batch_first was not True"
276
+ elif self.bias_k is not None:
277
+ why_not_fast_path = "self.bias_k was not None"
278
+ elif self.bias_v is not None:
279
+ why_not_fast_path = "self.bias_v was not None"
280
+ elif self.add_zero_attn:
281
+ why_not_fast_path = "add_zero_attn was enabled"
282
+ elif not self._qkv_same_embed_dim:
283
+ why_not_fast_path = "_qkv_same_embed_dim was not True"
284
+ elif query.is_nested and (key_padding_mask is not None or attn_mask is not None):
285
+ why_not_fast_path = "supplying both src_key_padding_mask and src_mask at the same time \
286
+ is not supported with NestedTensor input"
287
+ elif torch.is_autocast_enabled():
288
+ why_not_fast_path = "autocast is enabled"
289
+
290
+ if not why_not_fast_path:
291
+ tensor_args = (
292
+ query,
293
+ key,
294
+ value,
295
+ self.in_proj_weight,
296
+ self.in_proj_bias,
297
+ self.out_proj.weight,
298
+ self.out_proj.bias,
299
+ )
300
+ # We have to use list comprehensions below because TorchScript does not support
301
+ # generator expressions.
302
+ if torch.overrides.has_torch_function(tensor_args):
303
+ why_not_fast_path = "some Tensor argument has_torch_function"
304
+ elif _is_make_fx_tracing():
305
+ why_not_fast_path = "we are running make_fx tracing"
306
+ elif not all(_check_arg_device(x) for x in tensor_args):
307
+ why_not_fast_path = (
308
+ "some Tensor argument's device is neither one of "
309
+ f"cpu, cuda or {torch.utils.backend_registration._privateuse1_backend_name}"
310
+ )
311
+ elif torch.is_grad_enabled() and any(_arg_requires_grad(x) for x in tensor_args):
312
+ why_not_fast_path = (
313
+ "grad is enabled and at least one of query or the "
314
+ "input/output projection weights or biases requires_grad"
315
+ )
316
+ if not why_not_fast_path:
317
+ merged_mask, mask_type = self.merge_masks(attn_mask, key_padding_mask, query)
318
+
319
+ if self.in_proj_bias is not None and self.in_proj_weight is not None:
320
+ return torch._native_multi_head_attention(
321
+ query,
322
+ key,
323
+ value,
324
+ self.embed_dim,
325
+ self.num_heads,
326
+ self.in_proj_weight,
327
+ self.in_proj_bias,
328
+ self.out_proj.weight,
329
+ self.out_proj.bias,
330
+ merged_mask,
331
+ need_weights,
332
+ average_attn_weights,
333
+ mask_type,
334
+ )
335
+
336
+ any_nested = query.is_nested or key.is_nested or value.is_nested
337
+ assert not any_nested, (
338
+ "MultiheadAttention does not support NestedTensor outside of its fast path. "
339
+ + f"The fast path was not hit because {why_not_fast_path}"
340
+ )
341
+
342
+ if self.batch_first and is_batched:
343
+ # make sure that the transpose op does not affect the "is" property
344
+ if key is value:
345
+ if query is key:
346
+ query = key = value = query.transpose(1, 0)
347
+ else:
348
+ query, key = (x.transpose(1, 0) for x in (query, key))
349
+ value = key
350
+ else:
351
+ query, key, value = (x.transpose(1, 0) for x in (query, key, value))
352
+
353
+ if not self._qkv_same_embed_dim:
354
+ attn_output, attn_output_weights = self.multi_head_attention_forward(
355
+ query,
356
+ key,
357
+ value,
358
+ self.embed_dim,
359
+ self.num_heads,
360
+ self.in_proj_weight,
361
+ self.in_proj_bias,
362
+ self.bias_k,
363
+ self.bias_v,
364
+ self.add_zero_attn,
365
+ self.dropout,
366
+ self.out_proj.weight,
367
+ self.out_proj.bias,
368
+ training=self.training,
369
+ key_padding_mask=key_padding_mask,
370
+ need_weights=need_weights,
371
+ attn_mask=attn_mask,
372
+ use_separate_proj_weight=True,
373
+ q_proj_weight=self.q_proj_weight,
374
+ k_proj_weight=self.k_proj_weight,
375
+ v_proj_weight=self.v_proj_weight,
376
+ average_attn_weights=average_attn_weights,
377
+ is_causal=is_causal,
378
+ )
379
+ else:
380
+ attn_output, attn_output_weights = self.multi_head_attention_forward(
381
+ query,
382
+ key,
383
+ value,
384
+ self.embed_dim,
385
+ self.num_heads,
386
+ self.in_proj_weight,
387
+ self.in_proj_bias,
388
+ self.bias_k,
389
+ self.bias_v,
390
+ self.add_zero_attn,
391
+ self.dropout,
392
+ self.out_proj.weight,
393
+ self.out_proj.bias,
394
+ training=self.training,
395
+ key_padding_mask=key_padding_mask,
396
+ need_weights=need_weights,
397
+ attn_mask=attn_mask,
398
+ average_attn_weights=average_attn_weights,
399
+ is_causal=is_causal,
400
+ )
401
+ if self.batch_first and is_batched:
402
+ return attn_output.transpose(1, 0), attn_output_weights
403
+ else:
404
+ return attn_output, attn_output_weights
405
+
406
+ def multi_head_attention_forward(
407
+ self,
408
+ query: Tensor,
409
+ key: Tensor,
410
+ value: Tensor,
411
+ embed_dim_to_check: int,
412
+ num_heads: int,
413
+ in_proj_weight: Optional[Tensor],
414
+ in_proj_bias: Optional[Tensor],
415
+ bias_k: Optional[Tensor],
416
+ bias_v: Optional[Tensor],
417
+ add_zero_attn: bool,
418
+ dropout_p: float,
419
+ out_proj_weight: Tensor,
420
+ out_proj_bias: Optional[Tensor],
421
+ training: bool = True,
422
+ key_padding_mask: Optional[Tensor] = None,
423
+ need_weights: bool = True,
424
+ attn_mask: Optional[Tensor] = None,
425
+ use_separate_proj_weight: bool = False,
426
+ q_proj_weight: Optional[Tensor] = None,
427
+ k_proj_weight: Optional[Tensor] = None,
428
+ v_proj_weight: Optional[Tensor] = None,
429
+ static_k: Optional[Tensor] = None,
430
+ static_v: Optional[Tensor] = None,
431
+ average_attn_weights: bool = True,
432
+ is_causal: bool = False,
433
+ ) -> Tuple[Tensor, Optional[Tensor]]:
434
+ tens_ops = (query, key, value, in_proj_weight, in_proj_bias, bias_k, bias_v, out_proj_weight, out_proj_bias)
435
+
436
+ is_batched = _mha_shape_check(query, key, value, key_padding_mask, attn_mask, num_heads)
437
+
438
+ # For unbatched input, we unsqueeze at the expected batch-dim to pretend that the input
439
+ # is batched, run the computation and before returning squeeze the
440
+ # batch dimension so that the output doesn't carry this temporary batch dimension.
441
+ if not is_batched:
442
+ # unsqueeze if the input is unbatched
443
+ query = query.unsqueeze(1)
444
+ key = key.unsqueeze(1)
445
+ value = value.unsqueeze(1)
446
+ if key_padding_mask is not None:
447
+ key_padding_mask = key_padding_mask.unsqueeze(0)
448
+
449
+ # set up shape vars
450
+ tgt_len, bsz, embed_dim = query.shape
451
+ src_len, _, _ = key.shape
452
+
453
+ key_padding_mask = _canonical_mask(
454
+ mask=key_padding_mask,
455
+ mask_name="key_padding_mask",
456
+ other_type=F._none_or_dtype(attn_mask),
457
+ other_name="attn_mask",
458
+ target_type=query.dtype,
459
+ )
460
+
461
+ if is_causal and attn_mask is None:
462
+ raise RuntimeError(
463
+ "Need attn_mask if specifying the is_causal hint. "
464
+ "You may use the Transformer module method "
465
+ "`generate_square_subsequent_mask` to create this mask."
466
+ )
467
+
468
+ if is_causal and key_padding_mask is None and not need_weights:
469
+ # when we have a kpm or need weights, we need attn_mask
470
+ # Otherwise, we use the is_causal hint go as is_causal
471
+ # indicator to SDPA.
472
+ attn_mask = None
473
+ else:
474
+ attn_mask = _canonical_mask(
475
+ mask=attn_mask,
476
+ mask_name="attn_mask",
477
+ other_type=None,
478
+ other_name="",
479
+ target_type=query.dtype,
480
+ check_other=False,
481
+ )
482
+
483
+ if key_padding_mask is not None:
484
+ # We have the attn_mask, and use that to merge kpm into it.
485
+ # Turn off use of is_causal hint, as the merged mask is no
486
+ # longer causal.
487
+ is_causal = False
488
+
489
+ assert (
490
+ embed_dim == embed_dim_to_check
491
+ ), f"was expecting embedding dimension of {embed_dim_to_check}, but got {embed_dim}"
492
+ if isinstance(embed_dim, torch.Tensor):
493
+ # embed_dim can be a tensor when JIT tracing
494
+ head_dim = embed_dim.div(num_heads, rounding_mode="trunc")
495
+ else:
496
+ head_dim = embed_dim // num_heads
497
+ assert head_dim * num_heads == embed_dim, f"embed_dim {embed_dim} not divisible by num_heads {num_heads}"
498
+ if use_separate_proj_weight:
499
+ # allow MHA to have different embedding dimensions when separate projection weights are used
500
+ assert (
501
+ key.shape[:2] == value.shape[:2]
502
+ ), f"key's sequence and batch dims {key.shape[:2]} do not match value's {value.shape[:2]}"
503
+ else:
504
+ assert key.shape == value.shape, f"key shape {key.shape} does not match value shape {value.shape}"
505
+
506
+ #
507
+ # compute in-projection
508
+ #
509
+ if not use_separate_proj_weight:
510
+ assert in_proj_weight is not None, "use_separate_proj_weight is False but in_proj_weight is None"
511
+ q, k, v = _in_projection_packed(query, key, value, in_proj_weight, in_proj_bias)
512
+ else:
513
+ assert q_proj_weight is not None, "use_separate_proj_weight is True but q_proj_weight is None"
514
+ assert k_proj_weight is not None, "use_separate_proj_weight is True but k_proj_weight is None"
515
+ assert v_proj_weight is not None, "use_separate_proj_weight is True but v_proj_weight is None"
516
+ if in_proj_bias is None:
517
+ b_q = b_k = b_v = None
518
+ else:
519
+ b_q, b_k, b_v = in_proj_bias.chunk(3)
520
+ q, k, v = _in_projection(query, key, value, q_proj_weight, k_proj_weight, v_proj_weight, b_q, b_k, b_v)
521
+
522
+ # prep attention mask
523
+
524
+ if attn_mask is not None:
525
+ # ensure attn_mask's dim is 3
526
+ if attn_mask.dim() == 2:
527
+ correct_2d_size = (tgt_len, src_len)
528
+ if attn_mask.shape != correct_2d_size:
529
+ raise RuntimeError(
530
+ f"The shape of the 2D attn_mask is {attn_mask.shape}, but should be {correct_2d_size}."
531
+ )
532
+ attn_mask = attn_mask.unsqueeze(0)
533
+ elif attn_mask.dim() == 3:
534
+ correct_3d_size = (bsz * num_heads, tgt_len, src_len)
535
+ if attn_mask.shape != correct_3d_size:
536
+ raise RuntimeError(
537
+ f"The shape of the 3D attn_mask is {attn_mask.shape}, but should be {correct_3d_size}."
538
+ )
539
+ else:
540
+ raise RuntimeError(f"attn_mask's dimension {attn_mask.dim()} is not supported")
541
+
542
+ # add bias along batch dimension (currently second)
543
+ if bias_k is not None and bias_v is not None:
544
+ assert static_k is None, "bias cannot be added to static key."
545
+ assert static_v is None, "bias cannot be added to static value."
546
+ k = torch.cat([k, bias_k.repeat(1, bsz, 1)])
547
+ v = torch.cat([v, bias_v.repeat(1, bsz, 1)])
548
+ if attn_mask is not None:
549
+ attn_mask = pad(attn_mask, (0, 1))
550
+ if key_padding_mask is not None:
551
+ key_padding_mask = pad(key_padding_mask, (0, 1))
552
+ else:
553
+ assert bias_k is None
554
+ assert bias_v is None
555
+
556
+ #
557
+ # reshape q, k, v for multihead attention and make em batch first
558
+ #
559
+ q = q.view(tgt_len, bsz * num_heads, head_dim).transpose(0, 1)
560
+ if static_k is None:
561
+ k = k.view(k.shape[0], bsz * num_heads, head_dim).transpose(0, 1)
562
+ else:
563
+ # TODO finish disentangling control flow so we don't do in-projections when statics are passed
564
+ assert (
565
+ static_k.size(0) == bsz * num_heads
566
+ ), f"expecting static_k.size(0) of {bsz * num_heads}, but got {static_k.size(0)}"
567
+ assert static_k.size(2) == head_dim, f"expecting static_k.size(2) of {head_dim}, but got {static_k.size(2)}"
568
+ k = static_k
569
+ if static_v is None:
570
+ v = v.view(v.shape[0], bsz * num_heads, head_dim).transpose(0, 1)
571
+ else:
572
+ # TODO finish disentangling control flow so we don't do in-projections when statics are passed
573
+ assert (
574
+ static_v.size(0) == bsz * num_heads
575
+ ), f"expecting static_v.size(0) of {bsz * num_heads}, but got {static_v.size(0)}"
576
+ assert static_v.size(2) == head_dim, f"expecting static_v.size(2) of {head_dim}, but got {static_v.size(2)}"
577
+ v = static_v
578
+
579
+ # add zero attention along batch dimension (now first)
580
+ if add_zero_attn:
581
+ zero_attn_shape = (bsz * num_heads, 1, head_dim)
582
+ k = torch.cat([k, torch.zeros(zero_attn_shape, dtype=k.dtype, device=k.device)], dim=1)
583
+ v = torch.cat([v, torch.zeros(zero_attn_shape, dtype=v.dtype, device=v.device)], dim=1)
584
+ if attn_mask is not None:
585
+ attn_mask = pad(attn_mask, (0, 1))
586
+ if key_padding_mask is not None:
587
+ key_padding_mask = pad(key_padding_mask, (0, 1))
588
+
589
+ # update source sequence length after adjustments
590
+ src_len = k.size(1)
591
+
592
+ # merge key padding and attention masks
593
+ if key_padding_mask is not None:
594
+ assert key_padding_mask.shape == (
595
+ bsz,
596
+ src_len,
597
+ ), f"expecting key_padding_mask shape of {(bsz, src_len)}, but got {key_padding_mask.shape}"
598
+ key_padding_mask = (
599
+ key_padding_mask.view(bsz, 1, 1, src_len)
600
+ .expand(-1, num_heads, -1, -1)
601
+ .reshape(bsz * num_heads, 1, src_len)
602
+ )
603
+ if attn_mask is None:
604
+ attn_mask = key_padding_mask
605
+ else:
606
+ attn_mask = attn_mask + key_padding_mask
607
+
608
+ # adjust dropout probability
609
+ if not training:
610
+ dropout_p = 0.0
611
+
612
+ #
613
+ # (deep breath) calculate attention and out projection
614
+ #
615
+
616
+ if need_weights:
617
+ B, Nt, E = q.shape
618
+ q_scaled = q / math.sqrt(E)
619
+
620
+ assert not (is_causal and attn_mask is None), "FIXME: is_causal not implemented for need_weights"
621
+
622
+ if attn_mask is not None:
623
+ attn_output_weights = torch.baddbmm(attn_mask, q_scaled, k.transpose(-2, -1))
624
+ else:
625
+ attn_output_weights = torch.bmm(q_scaled, k.transpose(-2, -1))
626
+ attn_output_weights = softmax(attn_output_weights, dim=-1)
627
+ if dropout_p > 0.0:
628
+ attn_output_weights = dropout(attn_output_weights, p=dropout_p)
629
+
630
+ attn_output = torch.bmm(attn_output_weights, v)
631
+
632
+ attn_output = attn_output.transpose(0, 1).contiguous().view(tgt_len * bsz, embed_dim)
633
+ attn_output = self.out_proj(attn_output)
634
+ attn_output = attn_output.view(tgt_len, bsz, attn_output.size(1))
635
+
636
+ # optionally average attention weights over heads
637
+ attn_output_weights = attn_output_weights.view(bsz, num_heads, tgt_len, src_len)
638
+ if average_attn_weights:
639
+ attn_output_weights = attn_output_weights.mean(dim=1)
640
+
641
+ if not is_batched:
642
+ # squeeze the output if input was unbatched
643
+ attn_output = attn_output.squeeze(1)
644
+ attn_output_weights = attn_output_weights.squeeze(0)
645
+ return attn_output, attn_output_weights
646
+ else:
647
+ # attn_mask can be either (L,S) or (N*num_heads, L, S)
648
+ # if attn_mask's shape is (1, L, S) we need to unsqueeze to (1, 1, L, S)
649
+ # in order to match the input for SDPA of (N, num_heads, L, S)
650
+ if attn_mask is not None:
651
+ if attn_mask.size(0) == 1 and attn_mask.dim() == 3:
652
+ attn_mask = attn_mask.unsqueeze(0)
653
+ else:
654
+ attn_mask = attn_mask.view(bsz, num_heads, -1, src_len)
655
+
656
+ q = q.view(bsz, num_heads, tgt_len, head_dim)
657
+ k = k.view(bsz, num_heads, src_len, head_dim)
658
+ v = v.view(bsz, num_heads, src_len, head_dim)
659
+
660
+ attn_output = F.scaled_dot_product_attention(q, k, v, attn_mask, dropout_p, is_causal)
661
+ attn_output = attn_output.permute(2, 0, 1, 3).contiguous().view(bsz * tgt_len, embed_dim)
662
+
663
+ attn_output = self.out_proj(attn_output)
664
+ attn_output = attn_output.view(tgt_len, bsz, attn_output.size(1))
665
+ if not is_batched:
666
+ # squeeze the output if input was unbatched
667
+ attn_output = attn_output.squeeze(1)
668
+ return attn_output, None
669
+
670
+
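For reference, the two branches above compute the same attention: the `need_weights` path materializes the softmax weights explicitly, while the other path delegates to `F.scaled_dot_product_attention`. A minimal sketch of that equivalence, with hypothetical shapes and no mask or dropout:

```python
import math
import torch
import torch.nn.functional as F

# (batch, num_heads, tgt_len, head_dim) -- arbitrary illustrative sizes
q = torch.randn(2, 4, 10, 16)
k = torch.randn(2, 4, 12, 16)
v = torch.randn(2, 4, 12, 16)

# explicit path: softmax(q k^T / sqrt(head_dim)) @ v
weights = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)
manual = weights @ v

# fused path used when attention weights are not requested
sdpa = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(manual, sdpa, atol=1e-5))  # True (up to float tolerance)
```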
671
+ def _mha_shape_check(
672
+ query: Tensor,
673
+ key: Tensor,
674
+ value: Tensor,
675
+ key_padding_mask: Optional[Tensor],
676
+ attn_mask: Optional[Tensor],
677
+ num_heads: int,
678
+ ):
679
+ # Verifies the expected shape for `query`, `key`, `value`, `key_padding_mask` and `attn_mask`
680
+ # and returns if the input is batched or not.
681
+ # Raises an error if `query` is not 2-D (unbatched) or 3-D (batched) tensor.
682
+
683
+ # Shape check.
684
+ if query.dim() == 3:
685
+ # Batched Inputs
686
+ is_batched = True
687
+ assert key.dim() == 3 and value.dim() == 3, (
688
+ "For batched (3-D) `query`, expected `key` and `value` to be 3-D"
689
+ f" but found {key.dim()}-D and {value.dim()}-D tensors respectively"
690
+ )
691
+ if key_padding_mask is not None:
692
+ assert key_padding_mask.dim() == 2, (
693
+ "For batched (3-D) `query`, expected `key_padding_mask` to be `None` or 2-D"
694
+ f" but found {key_padding_mask.dim()}-D tensor instead"
695
+ )
696
+ if attn_mask is not None:
697
+ assert attn_mask.dim() in (2, 3), (
698
+ "For batched (3-D) `query`, expected `attn_mask` to be `None`, 2-D or 3-D"
699
+ f" but found {attn_mask.dim()}-D tensor instead"
700
+ )
701
+ elif query.dim() == 2:
702
+ # Unbatched Inputs
703
+ is_batched = False
704
+ assert key.dim() == 2 and value.dim() == 2, (
705
+ "For unbatched (2-D) `query`, expected `key` and `value` to be 2-D"
706
+ f" but found {key.dim()}-D and {value.dim()}-D tensors respectively"
707
+ )
708
+
709
+ if key_padding_mask is not None:
710
+ assert key_padding_mask.dim() == 1, (
711
+ "For unbatched (2-D) `query`, expected `key_padding_mask` to be `None` or 1-D"
712
+ f" but found {key_padding_mask.dim()}-D tensor instead"
713
+ )
714
+
715
+ if attn_mask is not None:
716
+ assert attn_mask.dim() in (2, 3), (
717
+ "For unbatched (2-D) `query`, expected `attn_mask` to be `None`, 2-D or 3-D"
718
+ f" but found {attn_mask.dim()}-D tensor instead"
719
+ )
720
+ if attn_mask.dim() == 3:
721
+ expected_shape = (num_heads, query.shape[0], key.shape[0])
722
+ assert (
723
+ attn_mask.shape == expected_shape
724
+ ), f"Expected `attn_mask` shape to be {expected_shape} but got {attn_mask.shape}"
725
+ else:
726
+ raise AssertionError(
727
+ f"query should be unbatched 2D or batched 3D tensor but received {query.dim()}-D query tensor"
728
+ )
729
+
730
+ return is_batched
731
+
732
+
733
+ def _canonical_mask(
734
+ mask: Optional[Tensor],
735
+ mask_name: str,
736
+ other_type: Optional[DType],
737
+ other_name: str,
738
+ target_type: DType,
739
+ check_other: bool = True,
740
+ ) -> Optional[Tensor]:
741
+
742
+ if mask is not None:
743
+ _mask_dtype = mask.dtype
744
+ _mask_is_float = torch.is_floating_point(mask)
745
+ if _mask_dtype != torch.bool and not _mask_is_float:
746
+ raise AssertionError(f"only bool and floating types of {mask_name} are supported")
747
+ if check_other and other_type is not None:
748
+ if _mask_dtype != other_type:
749
+ warnings.warn(
750
+ f"Support for mismatched {mask_name} and {other_name} "
751
+ "is deprecated. Use same type for both instead."
752
+ )
753
+ if not _mask_is_float:
754
+ mask = torch.zeros_like(mask, dtype=target_type).masked_fill_(mask, float("-inf"))
755
+ return mask
756
+
757
+
758
+ def _in_projection_packed(
759
+ q: Tensor,
760
+ k: Tensor,
761
+ v: Tensor,
762
+ w: Tensor,
763
+ b: Optional[Tensor] = None,
764
+ ) -> List[Tensor]:
765
+ r"""
766
+ Performs the in-projection step of the attention operation, using packed weights.
767
+ Output is a triple containing projection tensors for query, key and value.
768
+ Args:
769
+ q, k, v: query, key and value tensors to be projected. For self-attention,
770
+ these are typically the same tensor; for encoder-decoder attention,
771
+ k and v are typically the same tensor. (We take advantage of these
772
+ identities for performance if they are present.) Regardless, q, k and v
773
+ must share a common embedding dimension; otherwise their shapes may vary.
774
+ w: projection weights for q, k and v, packed into a single tensor. Weights
775
+ are packed along dimension 0, in q, k, v order.
776
+ b: optional projection biases for q, k and v, packed into a single tensor
777
+ in q, k, v order.
778
+ Shape:
779
+ Inputs:
780
+ - q: :math:`(..., E)` where E is the embedding dimension
781
+ - k: :math:`(..., E)` where E is the embedding dimension
782
+ - v: :math:`(..., E)` where E is the embedding dimension
783
+ - w: :math:`(E * 3, E)` where E is the embedding dimension
784
+ - b: :math:`E * 3` where E is the embedding dimension
785
+ Output:
786
+ - in output list :math:`[q', k', v']`, each output tensor will have the
787
+ same shape as the corresponding input tensor.
788
+ """
789
+ E = q.size(-1)
790
+ if k is v:
791
+ if q is k:
792
+ # self-attention
793
+ proj = linear(q, w, b)
794
+ # reshape to 3, E and not E, 3 is deliberate for better memory coalescing and keeping same order as chunk()
795
+ proj = proj.unflatten(-1, (3, E)).unsqueeze(0).transpose(0, -2).squeeze(-2).contiguous()
796
+ return proj[0], proj[1], proj[2]
797
+ else:
798
+ # encoder-decoder attention
799
+ w_q, w_kv = w.split([E, E * 2])
800
+ if b is None:
801
+ b_q = b_kv = None
802
+ else:
803
+ b_q, b_kv = b.split([E, E * 2])
804
+ q_proj = linear(q, w_q, b_q)
805
+ kv_proj = linear(k, w_kv, b_kv)
806
+ # reshape to 2, E and not E, 2 is deliberate for better memory coalescing and keeping same order as chunk()
807
+ kv_proj = kv_proj.unflatten(-1, (2, E)).unsqueeze(0).transpose(0, -2).squeeze(-2).contiguous()
808
+ return (q_proj, kv_proj[0], kv_proj[1])
809
+ else:
810
+ w_q, w_k, w_v = w.chunk(3)
811
+ if b is None:
812
+ b_q = b_k = b_v = None
813
+ else:
814
+ b_q, b_k, b_v = b.chunk(3)
815
+ return linear(q, w_q, b_q), linear(k, w_k, b_k), linear(v, w_v, b_v)
816
+
817
+
818
+ def _in_projection(
819
+ q: Tensor,
820
+ k: Tensor,
821
+ v: Tensor,
822
+ w_q: Tensor,
823
+ w_k: Tensor,
824
+ w_v: Tensor,
825
+ b_q: Optional[Tensor] = None,
826
+ b_k: Optional[Tensor] = None,
827
+ b_v: Optional[Tensor] = None,
828
+ ) -> Tuple[Tensor, Tensor, Tensor]:
829
+ r"""
830
+ Performs the in-projection step of the attention operation. This is simply
831
+ a triple of linear projections, with shape constraints on the weights which
832
+ ensure embedding dimension uniformity in the projected outputs.
833
+ Output is a triple containing projection tensors for query, key and value.
834
+ Args:
835
+ q, k, v: query, key and value tensors to be projected.
836
+ w_q, w_k, w_v: weights for q, k and v, respectively.
837
+ b_q, b_k, b_v: optional biases for q, k and v, respectively.
838
+ Shape:
839
+ Inputs:
840
+ - q: :math:`(Qdims..., Eq)` where Eq is the query embedding dimension and Qdims are any
841
+ number of leading dimensions.
842
+ - k: :math:`(Kdims..., Ek)` where Ek is the key embedding dimension and Kdims are any
843
+ number of leading dimensions.
844
+ - v: :math:`(Vdims..., Ev)` where Ev is the value embedding dimension and Vdims are any
845
+ number of leading dimensions.
846
+ - w_q: :math:`(Eq, Eq)`
847
+ - w_k: :math:`(Eq, Ek)`
848
+ - w_v: :math:`(Eq, Ev)`
849
+ - b_q: :math:`(Eq)`
850
+ - b_k: :math:`(Eq)`
851
+ - b_v: :math:`(Eq)`
852
+ Output: in output triple :math:`(q', k', v')`,
853
+ - q': :math:`[Qdims..., Eq]`
854
+ - k': :math:`[Kdims..., Eq]`
855
+ - v': :math:`[Vdims..., Eq]`
856
+ """
857
+ Eq, Ek, Ev = q.size(-1), k.size(-1), v.size(-1)
858
+ assert w_q.shape == (Eq, Eq), f"expecting query weights shape of {(Eq, Eq)}, but got {w_q.shape}"
859
+ assert w_k.shape == (Eq, Ek), f"expecting key weights shape of {(Eq, Ek)}, but got {w_k.shape}"
860
+ assert w_v.shape == (Eq, Ev), f"expecting value weights shape of {(Eq, Ev)}, but got {w_v.shape}"
861
+ assert b_q is None or b_q.shape == (Eq,), f"expecting query bias shape of {(Eq,)}, but got {b_q.shape}"
862
+ assert b_k is None or b_k.shape == (Eq,), f"expecting key bias shape of {(Eq,)}, but got {b_k.shape}"
863
+ assert b_v is None or b_v.shape == (Eq,), f"expecting value bias shape of {(Eq,)}, but got {b_v.shape}"
864
+ return linear(q, w_q, b_q), linear(k, w_k, b_k), linear(v, w_v, b_v)
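To make the two in-projection helpers above concrete: for self-attention, `_in_projection_packed` runs one fused linear with the q/k/v weights stacked along dim 0 and then splits the result, which matches the three separate projections performed by `_in_projection`. A minimal sketch with hypothetical shapes:

```python
import torch
import torch.nn.functional as F

E = 64
x = torch.randn(10, 2, E)        # (seq_len, batch, embed_dim); q is k is v for self-attention
w = torch.randn(3 * E, E)        # packed q, k, v weights (stacked along dim 0)
b = torch.randn(3 * E)           # packed q, k, v biases

# packed: one matmul, then split the last dimension into q / k / v
q, k, v = F.linear(x, w, b).chunk(3, dim=-1)

# unpacked: three independent projections
w_q, w_k, w_v = w.chunk(3)
b_q, b_k, b_v = b.chunk(3)
q_ref = F.linear(x, w_q, b_q)

print(torch.allclose(q, q_ref))  # True
```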
special_tokens_map.json ADDED
@@ -0,0 +1,264 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ {
4
+ "content": "<image>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false
9
+ },
10
+ {
11
+ "content": "</image>",
12
+ "lstrip": false,
13
+ "normalized": false,
14
+ "rstrip": false,
15
+ "single_word": false
16
+ },
17
+ {
18
+ "content": "<ref>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ },
24
+ {
25
+ "content": "</ref>",
26
+ "lstrip": false,
27
+ "normalized": false,
28
+ "rstrip": false,
29
+ "single_word": false
30
+ },
31
+ {
32
+ "content": "<box>",
33
+ "lstrip": false,
34
+ "normalized": false,
35
+ "rstrip": false,
36
+ "single_word": false
37
+ },
38
+ {
39
+ "content": "</box>",
40
+ "lstrip": false,
41
+ "normalized": false,
42
+ "rstrip": false,
43
+ "single_word": false
44
+ },
45
+ {
46
+ "content": "<quad>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false
51
+ },
52
+ {
53
+ "content": "</quad>",
54
+ "lstrip": false,
55
+ "normalized": false,
56
+ "rstrip": false,
57
+ "single_word": false
58
+ },
59
+ {
60
+ "content": "<point>",
61
+ "lstrip": false,
62
+ "normalized": false,
63
+ "rstrip": false,
64
+ "single_word": false
65
+ },
66
+ {
67
+ "content": "</point>",
68
+ "lstrip": false,
69
+ "normalized": false,
70
+ "rstrip": false,
71
+ "single_word": false
72
+ },
73
+ {
74
+ "content": "<slice>",
75
+ "lstrip": false,
76
+ "normalized": false,
77
+ "rstrip": false,
78
+ "single_word": false
79
+ },
80
+ {
81
+ "content": "</slice>",
82
+ "lstrip": false,
83
+ "normalized": false,
84
+ "rstrip": false,
85
+ "single_word": false
86
+ },
87
+ {
88
+ "content": "<image_id>",
89
+ "lstrip": false,
90
+ "normalized": false,
91
+ "rstrip": false,
92
+ "single_word": false
93
+ },
94
+ {
95
+ "content": "</image_id>",
96
+ "lstrip": false,
97
+ "normalized": false,
98
+ "rstrip": false,
99
+ "single_word": false
100
+ },
101
+ {
102
+ "content": "<unit>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false
107
+ },
108
+ {
109
+ "content": "</unit>",
110
+ "lstrip": false,
111
+ "normalized": false,
112
+ "rstrip": false,
113
+ "single_word": false
114
+ },
115
+ {
116
+ "content": "<asr>",
117
+ "lstrip": false,
118
+ "normalized": false,
119
+ "rstrip": false,
120
+ "single_word": false
121
+ },
122
+ {
123
+ "content": "</asr>",
124
+ "lstrip": false,
125
+ "normalized": false,
126
+ "rstrip": false,
127
+ "single_word": false
128
+ },
129
+ {
130
+ "content": "<query>",
131
+ "lstrip": false,
132
+ "normalized": false,
133
+ "rstrip": false,
134
+ "single_word": false
135
+ },
136
+ {
137
+ "content": "</query>",
138
+ "lstrip": false,
139
+ "normalized": false,
140
+ "rstrip": false,
141
+ "single_word": false
142
+ },
143
+ {
144
+ "content": "<|audio_start|>",
145
+ "lstrip": false,
146
+ "normalized": false,
147
+ "rstrip": false,
148
+ "single_word": false
149
+ },
150
+ {
151
+ "content": "<|audio|>",
152
+ "lstrip": false,
153
+ "normalized": false,
154
+ "rstrip": false,
155
+ "single_word": false
156
+ },
157
+ {
158
+ "content": "<|audio_end|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false
163
+ },
164
+ {
165
+ "content": "<|spk_bos|>",
166
+ "lstrip": false,
167
+ "normalized": false,
168
+ "rstrip": false,
169
+ "single_word": false
170
+ },
171
+ {
172
+ "content": "<|spk|>",
173
+ "lstrip": false,
174
+ "normalized": false,
175
+ "rstrip": false,
176
+ "single_word": false
177
+ },
178
+ {
179
+ "content": "<|spk_eos|>",
180
+ "lstrip": false,
181
+ "normalized": false,
182
+ "rstrip": false,
183
+ "single_word": false
184
+ },
185
+ {
186
+ "content": "<|tts_bos|>",
187
+ "lstrip": false,
188
+ "normalized": false,
189
+ "rstrip": false,
190
+ "single_word": false
191
+ },
192
+ {
193
+ "content": "<|tts_eos|>",
194
+ "lstrip": false,
195
+ "normalized": false,
196
+ "rstrip": false,
197
+ "single_word": false
198
+ },
199
+ {
200
+ "content": "<|listen|>",
201
+ "lstrip": false,
202
+ "normalized": false,
203
+ "rstrip": false,
204
+ "single_word": false
205
+ },
206
+ {
207
+ "content": "<|speak|>",
208
+ "lstrip": false,
209
+ "normalized": false,
210
+ "rstrip": false,
211
+ "single_word": false
212
+ },
213
+ {
214
+ "content": "<|interrupt|>",
215
+ "lstrip": false,
216
+ "normalized": false,
217
+ "rstrip": false,
218
+ "single_word": false
219
+ },
220
+ {
221
+ "content": "<|vad_start|>",
222
+ "lstrip": false,
223
+ "normalized": false,
224
+ "rstrip": false,
225
+ "single_word": false
226
+ },
227
+ {
228
+ "content": "<|vad_end|>",
229
+ "lstrip": false,
230
+ "normalized": false,
231
+ "rstrip": false,
232
+ "single_word": false
233
+ },
234
+ {
235
+ "content": "<reserved_43>",
236
+ "lstrip": false,
237
+ "normalized": false,
238
+ "rstrip": false,
239
+ "single_word": false
240
+ },
241
+ {
242
+ "content": "<reserved_53>",
243
+ "lstrip": false,
244
+ "normalized": false,
245
+ "rstrip": false,
246
+ "single_word": false
247
+ }
248
+ ],
249
+ "eos_token": {
250
+ "content": "<|im_end|>",
251
+ "lstrip": false,
252
+ "normalized": false,
253
+ "rstrip": false,
254
+ "single_word": false
255
+ },
256
+ "pad_token": {
257
+ "content": "<|endoftext|>",
258
+ "lstrip": false,
259
+ "normalized": false,
260
+ "rstrip": false,
261
+ "single_word": false
262
+ },
263
+ "unk_token": "<unk>"
264
+ }
tokenization_minicpmo_fast.py ADDED
@@ -0,0 +1,110 @@
1
+ # coding=utf-8
2
+ # Copyright 2025 The OpenBMB Team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+
16
+ from transformers import Qwen2TokenizerFast
17
+
18
+
19
+ class MiniCPMOTokenizerFast(Qwen2TokenizerFast):
20
+ def __init__(self, **kwargs):
21
+ super().__init__(**kwargs)
22
+ # image
23
+ self.im_start = "<image>"
24
+ self.im_end = "</image>"
25
+ self.ref_start = "<ref>"
26
+ self.ref_end = "</ref>"
27
+ self.box_start = "<box>"
28
+ self.box_end = "</box>"
29
+ self.quad_start = "<quad>"
30
+ self.quad_end = "</quad>"
31
+ self.slice_start = "<slice>"
32
+ self.slice_end = "</slice>"
33
+ self.im_id_start = "<image_id>"
34
+ self.im_id_end = "</image_id>"
35
+
36
+ # audio
37
+ self.audio_start = "<|audio_start|>"
38
+ self.audio_end = "<|audio_end|>"
39
+ self.spk_start = "<|spk_bos|>"
40
+ self.spk_end = "<|spk_eos|>"
41
+ self.tts_start = "<|tts_bos|>"
42
+ self.tts_end = "<|tts_eos|>"
43
+
44
+ @property
45
+ def eos_id(self):
46
+ return self.eos_token_id
47
+
48
+ @property
49
+ def bos_id(self):
50
+ return self.bos_token_id
51
+
52
+ @property
53
+ def unk_id(self):
54
+ return self.unk_token_id
55
+
56
+ @property
57
+ def im_start_id(self):
58
+ return self.convert_tokens_to_ids(self.im_start)
59
+
60
+ @property
61
+ def im_end_id(self):
62
+ return self.convert_tokens_to_ids(self.im_end)
63
+
64
+ @property
65
+ def slice_start_id(self):
66
+ return self.convert_tokens_to_ids(self.slice_start)
67
+
68
+ @property
69
+ def slice_end_id(self):
70
+ return self.convert_tokens_to_ids(self.slice_end)
71
+
72
+ @property
73
+ def im_id_start_id(self):
74
+ return self.convert_tokens_to_ids(self.im_id_start)
75
+
76
+ @property
77
+ def im_id_end_id(self):
78
+ return self.convert_tokens_to_ids(self.im_id_end)
79
+
80
+ @property
81
+ def audio_start_id(self):
82
+ return self.convert_tokens_to_ids(self.audio_start)
83
+
84
+ @property
85
+ def audio_end_id(self):
86
+ return self.convert_tokens_to_ids(self.audio_end)
87
+
88
+ @property
89
+ def spk_start_id(self):
90
+ return self.convert_tokens_to_ids(self.spk_start)
91
+
92
+ @property
93
+ def spk_end_id(self):
94
+ return self.convert_tokens_to_ids(self.spk_end)
95
+
96
+ @property
97
+ def tts_start_id(self):
98
+ return self.convert_tokens_to_ids(self.tts_start)
99
+
100
+ @property
101
+ def tts_end_id(self):
102
+ return self.convert_tokens_to_ids(self.tts_end)
103
+
104
+ @staticmethod
105
+ def escape(text: str) -> str:
106
+ return text
107
+
108
+ @staticmethod
109
+ def unescape(text: str) -> str:
110
+ return text
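A usage sketch for the tokenizer class above (the repo id and `trust_remote_code=True` are assumptions based on the `auto_map` entry in `tokenizer_config.json`):

```python
from transformers import AutoTokenizer

# Hypothetical repo id; any checkout shipping this tokenization file works the same way.
tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-o-2_6", trust_remote_code=True)

# The convenience properties resolve the multimodal special tokens to single ids.
print(tokenizer.im_start_id, tokenizer.im_end_id)        # ids of <image> / </image>
print(tokenizer.audio_start_id, tokenizer.audio_end_id)  # ids of <|audio_start|> / <|audio_end|>

# Each special token should encode to exactly one id (never split by the BPE).
assert len(tokenizer.encode("<image>", add_special_tokens=False)) == 1
```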
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,523 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "128244": {
6
+ "content": "<unk>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151643": {
14
+ "content": "<|endoftext|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151644": {
22
+ "content": "<|im_start|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "151645": {
30
+ "content": "<|im_end|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "151646": {
38
+ "content": "<|object_ref_start|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "151647": {
46
+ "content": "<|object_ref_end|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "151648": {
54
+ "content": "<|box_start|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "151649": {
62
+ "content": "<|box_end|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "151650": {
70
+ "content": "<|quad_start|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "151651": {
78
+ "content": "<|quad_end|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "151652": {
86
+ "content": "<|vision_start|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "151653": {
94
+ "content": "<|vision_end|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "151654": {
102
+ "content": "<|vision_pad|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "151655": {
110
+ "content": "<|image_pad|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "151656": {
118
+ "content": "<|video_pad|>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": true
124
+ },
125
+ "151657": {
126
+ "content": "<tool_call>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "151658": {
134
+ "content": "</tool_call>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "151659": {
142
+ "content": "<|fim_prefix|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "151660": {
150
+ "content": "<|fim_middle|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "151661": {
158
+ "content": "<|fim_suffix|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "151662": {
166
+ "content": "<|fim_pad|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "151663": {
174
+ "content": "<|repo_name|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ },
181
+ "151664": {
182
+ "content": "<|file_sep|>",
183
+ "lstrip": false,
184
+ "normalized": false,
185
+ "rstrip": false,
186
+ "single_word": false,
187
+ "special": false
188
+ },
189
+ "151665": {
190
+ "content": "<image>",
191
+ "lstrip": false,
192
+ "normalized": false,
193
+ "rstrip": false,
194
+ "single_word": false,
195
+ "special": true
196
+ },
197
+ "151666": {
198
+ "content": "</image>",
199
+ "lstrip": false,
200
+ "normalized": false,
201
+ "rstrip": false,
202
+ "single_word": false,
203
+ "special": true
204
+ },
205
+ "151667": {
206
+ "content": "<ref>",
207
+ "lstrip": false,
208
+ "normalized": false,
209
+ "rstrip": false,
210
+ "single_word": false,
211
+ "special": true
212
+ },
213
+ "151668": {
214
+ "content": "</ref>",
215
+ "lstrip": false,
216
+ "normalized": false,
217
+ "rstrip": false,
218
+ "single_word": false,
219
+ "special": true
220
+ },
221
+ "151669": {
222
+ "content": "<box>",
223
+ "lstrip": false,
224
+ "normalized": false,
225
+ "rstrip": false,
226
+ "single_word": false,
227
+ "special": true
228
+ },
229
+ "151670": {
230
+ "content": "</box>",
231
+ "lstrip": false,
232
+ "normalized": false,
233
+ "rstrip": false,
234
+ "single_word": false,
235
+ "special": true
236
+ },
237
+ "151671": {
238
+ "content": "<quad>",
239
+ "lstrip": false,
240
+ "normalized": false,
241
+ "rstrip": false,
242
+ "single_word": false,
243
+ "special": true
244
+ },
245
+ "151672": {
246
+ "content": "</quad>",
247
+ "lstrip": false,
248
+ "normalized": false,
249
+ "rstrip": false,
250
+ "single_word": false,
251
+ "special": true
252
+ },
253
+ "151673": {
254
+ "content": "<point>",
255
+ "lstrip": false,
256
+ "normalized": false,
257
+ "rstrip": false,
258
+ "single_word": false,
259
+ "special": true
260
+ },
261
+ "151674": {
262
+ "content": "</point>",
263
+ "lstrip": false,
264
+ "normalized": false,
265
+ "rstrip": false,
266
+ "single_word": false,
267
+ "special": true
268
+ },
269
+ "151675": {
270
+ "content": "<slice>",
271
+ "lstrip": false,
272
+ "normalized": false,
273
+ "rstrip": false,
274
+ "single_word": false,
275
+ "special": true
276
+ },
277
+ "151676": {
278
+ "content": "</slice>",
279
+ "lstrip": false,
280
+ "normalized": false,
281
+ "rstrip": false,
282
+ "single_word": false,
283
+ "special": true
284
+ },
285
+ "151677": {
286
+ "content": "<image_id>",
287
+ "lstrip": false,
288
+ "normalized": false,
289
+ "rstrip": false,
290
+ "single_word": false,
291
+ "special": true
292
+ },
293
+ "151678": {
294
+ "content": "</image_id>",
295
+ "lstrip": false,
296
+ "normalized": false,
297
+ "rstrip": false,
298
+ "single_word": false,
299
+ "special": true
300
+ },
301
+ "151679": {
302
+ "content": "<unit>",
303
+ "lstrip": false,
304
+ "normalized": false,
305
+ "rstrip": false,
306
+ "single_word": false,
307
+ "special": true
308
+ },
309
+ "151680": {
310
+ "content": "</unit>",
311
+ "lstrip": false,
312
+ "normalized": false,
313
+ "rstrip": false,
314
+ "single_word": false,
315
+ "special": true
316
+ },
317
+ "151681": {
318
+ "content": "<asr>",
319
+ "lstrip": false,
320
+ "normalized": false,
321
+ "rstrip": false,
322
+ "single_word": false,
323
+ "special": true
324
+ },
325
+ "151682": {
326
+ "content": "</asr>",
327
+ "lstrip": false,
328
+ "normalized": false,
329
+ "rstrip": false,
330
+ "single_word": false,
331
+ "special": true
332
+ },
333
+ "151683": {
334
+ "content": "<query>",
335
+ "lstrip": false,
336
+ "normalized": false,
337
+ "rstrip": false,
338
+ "single_word": false,
339
+ "special": true
340
+ },
341
+ "151684": {
342
+ "content": "</query>",
343
+ "lstrip": false,
344
+ "normalized": false,
345
+ "rstrip": false,
346
+ "single_word": false,
347
+ "special": true
348
+ },
349
+ "151685": {
350
+ "content": "<|audio_start|>",
351
+ "lstrip": false,
352
+ "normalized": false,
353
+ "rstrip": false,
354
+ "single_word": false,
355
+ "special": true
356
+ },
357
+ "151686": {
358
+ "content": "<|audio|>",
359
+ "lstrip": false,
360
+ "normalized": false,
361
+ "rstrip": false,
362
+ "single_word": false,
363
+ "special": true
364
+ },
365
+ "151687": {
366
+ "content": "<|audio_end|>",
367
+ "lstrip": false,
368
+ "normalized": false,
369
+ "rstrip": false,
370
+ "single_word": false,
371
+ "special": true
372
+ },
373
+ "151688": {
374
+ "content": "<|spk_bos|>",
375
+ "lstrip": false,
376
+ "normalized": false,
377
+ "rstrip": false,
378
+ "single_word": false,
379
+ "special": true
380
+ },
381
+ "151689": {
382
+ "content": "<|spk|>",
383
+ "lstrip": false,
384
+ "normalized": false,
385
+ "rstrip": false,
386
+ "single_word": false,
387
+ "special": true
388
+ },
389
+ "151690": {
390
+ "content": "<|spk_eos|>",
391
+ "lstrip": false,
392
+ "normalized": false,
393
+ "rstrip": false,
394
+ "single_word": false,
395
+ "special": true
396
+ },
397
+ "151691": {
398
+ "content": "<|tts_bos|>",
399
+ "lstrip": false,
400
+ "normalized": false,
401
+ "rstrip": false,
402
+ "single_word": false,
403
+ "special": true
404
+ },
405
+ "151692": {
406
+ "content": "<|tts_eos|>",
407
+ "lstrip": false,
408
+ "normalized": false,
409
+ "rstrip": false,
410
+ "single_word": false,
411
+ "special": true
412
+ },
413
+ "151693": {
414
+ "content": "<|listen|>",
415
+ "lstrip": false,
416
+ "normalized": false,
417
+ "rstrip": false,
418
+ "single_word": false,
419
+ "special": true
420
+ },
421
+ "151694": {
422
+ "content": "<|speak|>",
423
+ "lstrip": false,
424
+ "normalized": false,
425
+ "rstrip": false,
426
+ "single_word": false,
427
+ "special": true
428
+ },
429
+ "151695": {
430
+ "content": "<|interrupt|>",
431
+ "lstrip": false,
432
+ "normalized": false,
433
+ "rstrip": false,
434
+ "single_word": false,
435
+ "special": true
436
+ },
437
+ "151696": {
438
+ "content": "<|vad_start|>",
439
+ "lstrip": false,
440
+ "normalized": false,
441
+ "rstrip": false,
442
+ "single_word": false,
443
+ "special": true
444
+ },
445
+ "151697": {
446
+ "content": "<|vad_end|>",
447
+ "lstrip": false,
448
+ "normalized": false,
449
+ "rstrip": false,
450
+ "single_word": false,
451
+ "special": true
452
+ },
453
+ "151698": {
454
+ "content": "<reserved_43>",
455
+ "lstrip": false,
456
+ "normalized": false,
457
+ "rstrip": false,
458
+ "single_word": false,
459
+ "special": true
460
+ },
461
+ "151699": {
462
+ "content": "<reserved_53>",
463
+ "lstrip": false,
464
+ "normalized": false,
465
+ "rstrip": false,
466
+ "single_word": false,
467
+ "special": true
468
+ }
469
+ },
470
+ "additional_special_tokens": [
471
+ "<image>",
472
+ "</image>",
473
+ "<ref>",
474
+ "</ref>",
475
+ "<box>",
476
+ "</box>",
477
+ "<quad>",
478
+ "</quad>",
479
+ "<point>",
480
+ "</point>",
481
+ "<slice>",
482
+ "</slice>",
483
+ "<image_id>",
484
+ "</image_id>",
485
+ "<unit>",
486
+ "</unit>",
487
+ "<asr>",
488
+ "</asr>",
489
+ "<query>",
490
+ "</query>",
491
+ "<|audio_start|>",
492
+ "<|audio|>",
493
+ "<|audio_end|>",
494
+ "<|spk_bos|>",
495
+ "<|spk|>",
496
+ "<|spk_eos|>",
497
+ "<|tts_bos|>",
498
+ "<|tts_eos|>",
499
+ "<|listen|>",
500
+ "<|speak|>",
501
+ "<|interrupt|>",
502
+ "<|vad_start|>",
503
+ "<|vad_end|>",
504
+ "<reserved_43>",
505
+ "<reserved_53>"
506
+ ],
507
+ "bos_token": "<|im_start|>",
508
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- message.content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
509
+ "clean_up_tokenization_spaces": false,
510
+ "eos_token": "<|im_end|>",
511
+ "errors": "replace",
512
+ "model_max_length": 131072,
513
+ "pad_token": "<|endoftext|>",
514
+ "split_special_tokens": false,
515
+ "auto_map": {
516
+ "AutoTokenizer": [
517
+ "tokenization_minicpmo_fast.MiniCPMOTokenizerFast",
518
+ null
519
+ ]
520
+ },
521
+ "tokenizer_class": "MiniCPMOTokenizerFast",
522
+ "unk_token": "<unk>"
523
+ }
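The `chat_template` above is the Qwen-style ChatML template, so prompts can be rendered with `apply_chat_template`. A brief sketch (repo id assumed, as before):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-o-2_6", trust_remote_code=True)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Describe the image."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # <|im_start|>system ... <|im_start|>assistant\n
```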
utils.py ADDED
@@ -0,0 +1,203 @@
1
+ # coding=utf-8
2
+ # Copyright 2025 The OpenBMB Team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+
16
+ import logging
17
+ import re
18
+
19
+ import librosa
20
+ import numpy as np
21
+
22
+ logger = logging.getLogger(__name__)
23
+
24
+
25
+ def is_silent(data):
26
+ if np.abs(data).max() < 3e-3:
27
+ return True
28
+ else:
29
+ return False
30
+
31
+
32
+ def sentence_end(txt):
33
+ for c in [".", "。", "!", "?", "!", "?"]:
34
+ if c in txt:
35
+ if c == ".":  # make sure the "." is not part of a number such as "1."
36
+ idx = txt.find(c)
37
+ if idx > 0:
38
+ if txt[idx - 1].isdigit():
39
+ continue
40
+ return c
41
+ return ""
42
+
43
+
44
+ class NumberToTextConverter:
45
+ r"""
46
+ A helper class to ensure text-to-speech (TTS) systems read numeric digits
47
+ in the desired language (Chinese or English) digit-by-digit. It forcibly
48
+ replaces all numeric substrings in text with their language-specific
49
+ textual representations, thereby reducing the likelihood of TTS mistakes
50
+ on numbers.
51
+ Note: MiniCPM-o 2.6 only uses this in streaming mode.
52
+
53
+ Attributes:
54
+ num_to_chinese (dict):
55
+ Mapping from digit (str) to its Chinese textual form (str).
56
+ num_to_english (dict):
57
+ Mapping from digit (str) to its English textual form (str).
58
+
59
+ Example:
60
+ >>> converter = NumberToTextConverter()
61
+ >>> converter.replace_numbers_with_text("我有2个苹果", language="chinese")
62
+ '我有二个苹果'
63
+ >>> converter.replace_numbers_with_text("I have 23 books", language="english")
64
+ 'I have two three books'
65
+ """
66
+
67
+ def __init__(self):
68
+ self.num_to_chinese = {
69
+ "0": "零",
70
+ "1": "一",
71
+ "2": "二",
72
+ "3": "三",
73
+ "4": "四",
74
+ "5": "五",
75
+ "6": "六",
76
+ "7": "七",
77
+ "8": "八",
78
+ "9": "九",
79
+ }
80
+ self.num_to_english = {
81
+ "0": "zero",
82
+ "1": "one",
83
+ "2": "two",
84
+ "3": "three",
85
+ "4": "four",
86
+ "5": "five",
87
+ "6": "six",
88
+ "7": "seven",
89
+ "8": "eight",
90
+ "9": "nine",
91
+ }
92
+
93
+ def number_to_chinese_digit_by_digit(self, num_str):
94
+ result = ""
95
+ for char in num_str:
96
+ if char in self.num_to_chinese:
97
+ result += self.num_to_chinese[char]
98
+ return result
99
+
100
+ def number_to_english_digit_by_digit(self, num_str):
101
+ result = []
102
+ for char in num_str:
103
+ if char in self.num_to_english:
104
+ result.append(self.num_to_english[char])
105
+ return " ".join(result)
106
+
107
+ def detect_language(self, text):
108
+ chinese_count = len(re.findall(r"[\u4e00-\u9fff]", text))
109
+ english_count = len(re.findall(r"[a-zA-Z]", text))
110
+ return "chinese" if chinese_count >= english_count else "english"
111
+
112
+ def replace_numbers_with_text(self, text, language=None):
113
+ if language is None:
114
+ language = self.detect_language(text)
115
+ numbers = re.findall(r"\d+", text)
116
+
117
+ for num in numbers:
118
+ if language == "chinese":
119
+ replacement = self.number_to_chinese_digit_by_digit(num)
120
+ else:
121
+ replacement = self.number_to_english_digit_by_digit(num)
122
+ text = text.replace(num, replacement, 1)
123
+
124
+ return text
125
+
126
+
127
+ class VoiceChecker:
128
+ r"""
129
+ A simple utility class to detect silence or low variation in consecutive audio chunks by comparing
130
+ the mel-spectrogram distances. It keeps track of consecutive zero-distance and low-distance chunks
131
+ to decide if the audio is considered "bad" (e.g., overly silent or not changing enough).
132
+
133
+ Attributes:
134
+ previous_mel (`np.ndarray` or `None`):
135
+ Holds the previously observed mel-spectrogram in decibel scale. Used to compute
136
+ the next distance; reset via :meth:`reset`.
137
+ consecutive_zeros (`int`):
138
+ The number of consecutive chunks that were detected as silent (distance = 0).
139
+ consecutive_low_distance (`int`):
140
+ The number of consecutive chunks whose distance was below the threshold.
141
+
142
+ Example:
143
+ >>> checker = VoiceChecker()
144
+ >>> # Suppose we have audio_wav (list or np.ndarray) and mel_spec (np.ndarray)
145
+ >>> # We split them into chunks and call checker.is_bad(...)
146
+ >>> is_audio_bad = checker.is_bad(audio_wav, mel_spec, chunk_size=2560, thresh=100.0)
147
+ >>> if is_audio_bad:
148
+ ... print("Audio deemed bad!")
149
+ >>> # Reset states if needed
150
+ >>> checker.reset()
151
+ """
152
+
153
+ def __init__(self):
154
+ self.previous_mel = None
155
+ self.consecutive_zeros = 0
156
+ self.consecutive_low_distance = 0
157
+
158
+ def compute_distance(self, audio_chunk, mel_spec):
159
+ if is_silent(audio_chunk):
160
+ return 0.0  # silent (blank) segment: report zero distance
161
+
162
+ mel_db = librosa.power_to_db(mel_spec)
163
+ if self.previous_mel is None:
164
+ self.previous_mel = mel_db
165
+ return -1.0
166
+
167
+ distance = np.linalg.norm(np.mean(mel_db, axis=1) - np.mean(self.previous_mel, axis=1))
168
+ self.previous_mel = mel_db
169
+ return distance
170
+
171
+ def is_bad(self, audio_wav, mel_spec, chunk_size=2560, thresh=100.0):
172
+ num_chunks = len(audio_wav) // chunk_size
173
+ mel_chunk_size = mel_spec.shape[-1] // num_chunks
174
+ for i in range(num_chunks):
175
+ audio_chunk = audio_wav[i * chunk_size : (i + 1) * chunk_size]
176
+ mel_spec_chunk = mel_spec[:, i * mel_chunk_size : (i + 1) * mel_chunk_size]
177
+
178
+ distance = self.compute_distance(audio_chunk, mel_spec_chunk)
179
+ logger.warning(
180
+ f"mel dist: {distance:.1f}, zero: {self.consecutive_zeros}, low: {self.consecutive_low_distance}"
181
+ )
182
+ if distance == 0:
183
+ self.consecutive_low_distance = 0 # reset
184
+ self.consecutive_zeros += 1
185
+ if self.consecutive_zeros >= 12:
186
+ logger.warning("VoiceChecker detected 1.2 s silent. Marking as failed.")
187
+ return True
188
+ elif distance < thresh:
189
+ self.consecutive_zeros = 0
190
+ self.consecutive_low_distance += 1
191
+ if self.consecutive_low_distance >= 5:
192
+ logger.warning("VoiceChecker detected 5 consecutive low distance chunks. Marking as failed.")
193
+ return True
194
+ else:
195
+ self.consecutive_low_distance = 0
196
+ self.consecutive_zeros = 0
197
+
198
+ return False
199
+
200
+ def reset(self):
201
+ self.previous_mel = None
202
+ self.consecutive_zeros = 0
203
+ self.consecutive_low_distance = 0
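A minimal end-to-end sketch of `VoiceChecker` (assuming a 16 kHz mono waveform and an 80-bin mel spectrogram computed with librosa; the numbers are illustrative, not the model's actual settings):

```python
import librosa
import numpy as np

from utils import VoiceChecker  # the class defined above; assumes this repo dir is on the path

sr = 16000
audio_wav = np.random.randn(sr).astype(np.float32)  # 1 second of placeholder audio
mel_spec = librosa.feature.melspectrogram(y=audio_wav, sr=sr, n_mels=80)

checker = VoiceChecker()
if checker.is_bad(audio_wav, mel_spec, chunk_size=2560, thresh=100.0):
    print("chunk rejected: silent or insufficient spectral variation")
checker.reset()  # clear state before the next utterance
```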
vocab.json ADDED
The diff for this file is too large to render. See raw diff