update readme
README.md (CHANGED)
<h1>A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone</h1>

[GitHub](https://github.com/OpenBMB/MiniCPM-V) | Online Demo [US](https://minicpm-omni-webdemo-us.modelbest.cn)/[CN](https://minicpm-omni-webdemo.modelbest.cn)

## MiniCPM-o 2.6
In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPad.
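
As a quick sanity check of those numbers (the 1344×1344 resolution below is only an illustrative stand-in for a ~1.8M-pixel image; the 640-token and 75% figures come from the paragraph above):

```python
# token density = number of image pixels encoded into each visual token
pixels = 1344 * 1344                    # ~1.8 million pixels (assumed example resolution)
minicpm_tokens = 640                    # visual tokens reported for such an image

print(pixels / minicpm_tokens)          # ~2822 pixels per visual token

# "75% fewer tokens than most models" implies a typical model needs ~4x as many tokens:
typical_tokens = minicpm_tokens / (1 - 0.75)
print(typical_tokens)                   # 2560.0 tokens, i.e. only ~706 pixels per token
```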
- 💫 **Easy Usage.**
MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train.md), (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web demo on [CN](https://minicpm-omni-webdemo.modelbest.cn/) server and [US](https://minicpm-omni-webdemo-us.modelbest.cn/) server.

**Model Architecture.**

- **End-to-end Omni-modal Architecture.** Different modality encoders/decoders are connected and trained in an **end-to-end** fashion to fully exploit rich multimodal knowledge.
- **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoders/decoders into online ones for **streaming inputs/outputs.** (2) We devise a **time-division multiplexing (TDM) mechanism** for omni-modality streaming processing in the LLM backbone. It divides parallel omni-modality streams into sequential information within small periodic time slices (a toy sketch of this interleaving follows this list).
- **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including the traditional text system prompt and **a new audio system prompt to determine the assistant's voice**. This enables flexible voice configuration at inference time, and also facilitates voice cloning and description-based voice creation.
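
As referenced above, here is a toy sketch of the time-division multiplexing idea: parallel per-modality streams are cut into small periodic time slices and emitted sequentially, slice by slice. The `Chunk` type, the modality names, and the 1-second slice length are illustrative assumptions, not the model's actual implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    modality: str    # e.g. "video" or "audio"
    t_start: float   # chunk start time in seconds
    data: bytes      # stand-in for encoded features

def tdm_interleave(video: List[Chunk], audio: List[Chunk], slice_len: float = 1.0) -> List[Chunk]:
    """Flatten parallel modality streams into one sequential stream, slice by slice."""
    horizon = max(c.t_start for c in video + audio) + slice_len
    ordered: List[Chunk] = []
    t = 0.0
    while t < horizon:
        # within each periodic time slice, every modality's chunks are emitted sequentially
        ordered += [c for c in video if t <= c.t_start < t + slice_len]
        ordered += [c for c in audio if t <= c.t_start < t + slice_len]
        t += slice_len
    return ordered

video = [Chunk("video", t, b"") for t in (0.0, 0.5, 1.0, 1.5)]
audio = [Chunk("audio", t, b"") for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
for c in tdm_interleave(video, audio):
    print(c.modality, c.t_start)   # slice 0 first (video then audio), then slice 1, ...
```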

<div align="center">
  <img src="https://github.com/OpenBMB/MiniCPM-V/blob/main/assets/minicpm-o-26-framework.png" width="80%">
</div>

### Evaluation <!-- omit in toc -->

<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><strong>1.6</strong></td>

<td>3.4</td>
<td>10.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><u>61.0</u></td>

<td>63</td>
<td>46</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>57</td>
<td>47</td>

<td>33.4</td>
<td>57.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><strong>79.9</strong></td>

<div style="display: flex; flex-direction: column; align-items: center;">
  <img src="https://github.com/OpenBMB/MiniCPM-V/blob/main/assets/minicpmo2_6/minicpmo2_6_math_intersect.png" alt="math" style="margin-bottom: 5px;">
  <img src="https://github.com/OpenBMB/MiniCPM-V/blob/main/assets/minicpmo2_6/minicpmo2_6_diagram_train_NN.png" alt="diagram" style="margin-bottom: 5px;">
  <img src="https://github.com/OpenBMB/MiniCPM-V/blob/main/assets/minicpmo2_6/minicpmo2_6_multi-image_bike.png" alt="bike" style="margin-bottom: 5px;">
</div>
### Omni mode
We provide two inference modes: chat and streaming.
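
Both modes below assume an already-initialized `model` and `tokenizer`. As a minimal sketch of the standard Transformers remote-code loading path (the dtype and device choices are common-usage assumptions, not prescriptions from this README excerpt):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# assumed loading sketch for the openbmb/MiniCPM-o-2_6 checkpoint
model = AutoModel.from_pretrained(
    'openbmb/MiniCPM-o-2_6',
    trust_remote_code=True,       # the checkpoint ships custom chat/streaming code
    torch_dtype=torch.bfloat16,   # assumed dtype; adjust to your hardware
)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
```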

#### Chat inference
```python
import math
import numpy as np

# ... (intervening code omitted in this excerpt)
res = model.chat(
    # ...
)
print(res)
```

#### Streaming inference
```python
# a new conversation needs a session reset first; this clears the kv-cache
model.reset_session()
# ... (rest of the streaming example omitted in this excerpt)
```

`MiniCPM-o-2_6` has the same inference methods as `MiniCPM-V-2_6`.

#### Chat with single image
```python
# test.py
image = Image.open('xx.jpg').convert('RGB')
# ...
```
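
The excerpt ends mid-example. Purely for illustration, a single-image chat in the style of `MiniCPM-V-2_6` typically continues along these lines (the exact `model.chat` argument list is an assumption and should be checked against the full model card):

```python
# hypothetical continuation, reusing `model` and `tokenizer` from the loading sketch above
question = 'What is in the image?'                        # illustrative prompt
msgs = [{'role': 'user', 'content': [image, question]}]   # image first, then the text question

res = model.chat(msgs=msgs, tokenizer=tokenizer)          # argument names assumed from MiniCPM-V-2_6 usage
print(res)
```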