finalf0 committed
Commit 8fd104c · 1 Parent(s): c8c7670

update readme

Files changed (1): README.md (+16 -17)
README.md CHANGED
@@ -17,7 +17,7 @@ tags:
 
  <h1>A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone</h1>
 
- [GitHub](https://github.com/OpenBMB/MiniCPM-V) | [Online Demo](https://minicpm-omni-webdemo-us.modelbest.cn)</a>
+ [GitHub](https://github.com/OpenBMB/MiniCPM-V) | Online Demo [US](https://minicpm-omni-webdemo-us.modelbest.cn) / [CN](https://minicpm-omni-webdemo.modelbest.cn)
 
 
  ## MiniCPM-o 2.6
@@ -40,18 +40,17 @@ Advancing popular visual capabilities from the MiniCPM-V series, MiniCPM-o 2.6 can pr
  In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., the number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M-pixel image, which is 75% fewer than most models.** This directly improves inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as the iPad.
 
  - 💫 **Easy Usage.**
- MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](XXX) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train.md), (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web demo on [CN](https://minicpm-omni-webdemo.modelbest.cn/
- ) server and [US](https://minicpm-omni-webdemo-us.modelbest.cn/) server.
+ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train.md), (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web demos on the [CN](https://minicpm-omni-webdemo.modelbest.cn/) and [US](https://minicpm-omni-webdemo-us.modelbest.cn/) servers.
 
 
  **Model Architecture.**
 
- - **End-to-end Omni-modal Architecture.** Different modality encoder/decoders are connected and trained in an end-to-end fashion to fully exploit rich multimodal knowledge.
- - **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoder/decoders into online ones for streaminig inputs/outputs. (2) We devise a time-division multiplexing (TDM) mechanism for omni-modality streaminig processing in the LLM backbone. It divides parallel omni-modality streams into sequential info within small periodic time slices.
- - **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including traditional text system prompt, and a new audio system prompt to determine the assistant voice. This enables flexible voice configurations in inference time, and also facilitates voice cloning and description-based voice creation.
+ - **End-to-end Omni-modal Architecture.** Different modality encoders/decoders are connected and trained in an **end-to-end** fashion to fully exploit rich multimodal knowledge.
+ - **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoders/decoders into online ones for **streaming inputs/outputs**. (2) We devise a **time-division multiplexing (TDM) mechanism** for omni-modal streaming processing in the LLM backbone. It divides parallel omni-modal streams into sequential information within small periodic time slices.
+ - **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including a traditional text system prompt and **a new audio system prompt that determines the assistant's voice**. This enables flexible voice configuration at inference time, and also facilitates voice cloning and description-based voice creation.
 
  <div align="center">
- <img src="https://github.com/yiranyyu/MiniCPM-V-private/blob/main/assets/minicpm-o-26-framework.png" , width=80%>
+ <img src="https://github.com/OpenBMB/MiniCPM-V/blob/main/assets/minicpm-o-26-framework.png" width="80%">
  </div>
 
  ### Evaluation <!-- omit in toc -->
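
For reference, the plain Transformers path mentioned in the usage list above is the usual way to load this checkpoint. The snippet below is a minimal sketch assuming the interface described on the model card; the `attn_implementation` choice and the `init_tts()` call are assumptions here (only `model.tts.float()` is confirmed by the hunk header further down), not guarantees.

```python
# Minimal loading sketch; assumptions are marked in the comments.
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    'openbmb/MiniCPM-o-2_6',
    trust_remote_code=True,
    attn_implementation='sdpa',   # assumption: 'flash_attention_2' also possible if installed
    torch_dtype=torch.bfloat16,
)
model = model.eval().cuda()

tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

model.init_tts()    # assumption: initializes the TTS decoder before speech output
model.tts.float()   # matches the snippet referenced by the "@@ -979" hunk header below
```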
@@ -593,7 +592,7 @@ Note: For proprietary models, we calculate token density based on the image enco
   <td>-</td>
   <td>-</td>
   </tr>
- <tr style="background-color: #e6f2ff;">
+ <tr>
   <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
   <td>8B</td>
   <td><strong>1.6</strong></td>
@@ -714,7 +713,7 @@ Note: For proprietary models, we calculate token density based on the image enco
   <td>3.4</td>
   <td>10.0</td>
   </tr>
- <tr style="background-color: #e6f2ff;">
+ <tr>
   <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
   <td>8B</td>
   <td><u>61.0</u></td>
@@ -768,7 +767,7 @@ All results are from AudioEvals, and the evaluation methods along with further d
   <td>63</td>
   <td>46</td>
   </tr>
- <tr style="background-color: #e6f2ff;">
+ <tr>
   <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
   <td>57</td>
   <td>47</td>
@@ -899,7 +898,7 @@ Note: Mimick Task: Takes audio input, and outputs both an ASR transcription and
   <td>33.4</td>
   <td>57.7</td>
   </tr>
- <tr style="background-color: #e6f2ff;">
+ <tr>
   <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
   <td>8B</td>
   <td><strong>79.9</strong></td>
@@ -919,9 +918,9 @@ We deploy MiniCPM-o 2.6 on end devices. The demo video is the raw screen recordi
 
 
  <div style="display: flex; flex-direction: column; align-items: center;">
- <img src="https://github.com/yiranyyu/MiniCPM-V-private/blob/main/assets/minicpmo2_6/minicpmo2_6_math_intersect.png" alt="math" style="margin-bottom: 5px;">
- <img src="https://github.com/yiranyyu/MiniCPM-V-private/blob/main/assets/minicpmo2_6/minicpmo2_6_diagram_train_NN.png" alt="diagram" style="margin-bottom: 5px;">
- <img src="https://github.com/yiranyyu/MiniCPM-V-private/blob/main/assets/minicpmo2_6/minicpmo2_6_multi-image_bike.png" alt="bike" style="margin-bottom: 5px;">
+ <img src="https://github.com/OpenBMB/MiniCPM-V/blob/main/assets/minicpmo2_6/minicpmo2_6_math_intersect.png" alt="math" style="margin-bottom: 5px;">
+ <img src="https://github.com/OpenBMB/MiniCPM-V/blob/main/assets/minicpmo2_6/minicpmo2_6_diagram_train_NN.png" alt="diagram" style="margin-bottom: 5px;">
+ <img src="https://github.com/OpenBMB/MiniCPM-V/blob/main/assets/minicpmo2_6/minicpmo2_6_multi-image_bike.png" alt="bike" style="margin-bottom: 5px;">
  </div>
 
 
@@ -979,7 +978,7 @@ model.tts.float()
  ### Omni mode
  We provide two inference modes: chat and streaming.
 
- #### chat inference
+ #### Chat inference
 ```python
 import math
 import numpy as np
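# --- Hedged sketch (not the elided lines of the original example) ---
# The full chat example ends in a model.chat(...) call (see the next hunk); a
# condensed single-turn version mixing an image, an audio clip and a text
# question could look like this. The kwargs use_tts_template / generate_audio /
# output_audio_path follow the model card's omni example and are assumptions
# here; the file paths are placeholders. Reuses model and tokenizer from setup.
import librosa
from PIL import Image

image = Image.open('assets/example_frame.png').convert('RGB')                 # placeholder path
audio, _ = librosa.load('assets/example_question.wav', sr=16000, mono=True)   # placeholder path

msgs = [{'role': 'user', 'content': [image, audio, 'Please answer the question in the audio.']}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,        # assumption: use the speech-aware chat template
    generate_audio=True,          # assumption: also synthesize a spoken answer
    output_audio_path='answer.wav',
)
print(res)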
@@ -1044,7 +1043,7 @@ res = model.chat(
 )
 print(res)
 ```
- #### streaming inference
+ #### Streaming inference
 ```python
 # a new conversation needs to call reset_session() first; this clears the KV cache
 model.reset_session()
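# --- Hedged sketch of the streaming loop (not the elided original code) ---
# reset_session() above starts a fresh conversation by clearing the KV cache.
# The streaming_prefill()/streaming_generate() calls below follow the model
# card's streaming example; their exact signatures are assumptions here.
import uuid

session_id = str(uuid.uuid4())   # placeholder session identifier

# Placeholder input stream: a single text chunk. In practice this would be a
# sequence of ~1 s audio/video slices produced as they arrive.
input_chunks = [{'role': 'user', 'content': ['Hello! Please introduce yourself.']}]

for msg in input_chunks:
    # assumption: prefill each chunk into the session as it arrives
    model.streaming_prefill(session_id=session_id, msgs=[msg], tokenizer=tokenizer)

# assumption: then decode the reply incrementally (text and/or audio pieces)
for piece in model.streaming_generate(session_id=session_id, tokenizer=tokenizer):
    print(piece)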
@@ -1238,7 +1237,7 @@ res = model.chat(
 
 `MiniCPM-o-2_6` has the same inference methods as `MiniCPM-V-2_6`.
 
- #### chat with single image
+ #### Chat with single image
 ```python
 # test.py
 image = Image.open('xx.jpg').convert('RGB')
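# --- Hedged sketch completing the single-image chat call ---
# As noted above, MiniCPM-o-2_6 keeps the MiniCPM-V-2_6 chat interface; a
# minimal completion of the snippet could look like this. The question string
# is a placeholder and sampling=True mirrors the V-2.6 examples (an assumption
# here, not a requirement). PIL's Image, model and tokenizer come from the
# earlier setup in the README.
question = 'What is in this image?'
msgs = [{'role': 'user', 'content': [image, question]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
)
print(res)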
 