update readme
README.md (CHANGED)
<h1>A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone</h1>

[GitHub](https://github.com/OpenBMB/MiniCPM-V) | Online Demo [US](https://minicpm-omni-webdemo-us.modelbest.cn)/[CN](https://minicpm-omni-webdemo.modelbest.cn)

## MiniCPM-o 2.6
In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPad.
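
As a quick sanity check of those numbers (the 1344×1344 resolution below is only an illustrative stand-in for a ~1.8M-pixel image; the 640-token and 75% figures come from the paragraph above):

```python
# token density = number of image pixels encoded into each visual token
pixels = 1344 * 1344                    # ~1.8 million pixels (assumed example resolution)
minicpm_tokens = 640                    # visual tokens reported for such an image

print(pixels / minicpm_tokens)          # ~2822 pixels per visual token

# "75% fewer tokens than most models" implies a typical model needs ~4x as many tokens:
typical_tokens = minicpm_tokens / (1 - 0.75)
print(typical_tokens)                   # 2560.0 tokens, i.e. only ~706 pixels per token
```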
- 💫 **Easy Usage.**
MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train.md), (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web demo on [CN](https://minicpm-omni-webdemo.modelbest.cn/) server and [US](https://minicpm-omni-webdemo-us.modelbest.cn/) server.

**Model Architecture.**

- **End-to-end Omni-modal Architecture.** Different modality encoders/decoders are connected and trained in an **end-to-end** fashion to fully exploit rich multimodal knowledge.
- **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoders/decoders into online ones for **streaming inputs/outputs.** (2) We devise a **time-division multiplexing (TDM) mechanism** for omni-modality streaming processing in the LLM backbone. It divides parallel omni-modality streams into sequential information within small periodic time slices (a toy sketch of this interleaving follows this list).
- **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including the traditional text system prompt and **a new audio system prompt to determine the assistant's voice**. This enables flexible voice configuration at inference time, and also facilitates voice cloning and description-based voice creation.
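
As referenced above, here is a toy sketch of the time-division multiplexing idea: parallel per-modality streams are cut into small periodic time slices and emitted sequentially, slice by slice. The `Chunk` type, the modality names, and the 1-second slice length are illustrative assumptions, not the model's actual implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    modality: str    # e.g. "video" or "audio"
    t_start: float   # chunk start time in seconds
    data: bytes      # stand-in for encoded features

def tdm_interleave(video: List[Chunk], audio: List[Chunk], slice_len: float = 1.0) -> List[Chunk]:
    """Flatten parallel modality streams into one sequential stream, slice by slice."""
    horizon = max(c.t_start for c in video + audio) + slice_len
    ordered: List[Chunk] = []
    t = 0.0
    while t < horizon:
        # within each periodic time slice, every modality's chunks are emitted sequentially
        ordered += [c for c in video if t <= c.t_start < t + slice_len]
        ordered += [c for c in audio if t <= c.t_start < t + slice_len]
        t += slice_len
    return ordered

video = [Chunk("video", t, b"") for t in (0.0, 0.5, 1.0, 1.5)]
audio = [Chunk("audio", t, b"") for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
for c in tdm_interleave(video, audio):
    print(c.modality, c.t_start)   # slice 0 first (video then audio), then slice 1, ...
```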

<div align="center">
  <img src="https://github.com/OpenBMB/MiniCPM-V/blob/main/assets/minicpm-o-26-framework.png" width="80%">
</div>

### Evaluation <!-- omit in toc -->

<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><strong>1.6</strong></td>

<td>3.4</td>
<td>10.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><u>61.0</u></td>

<td>63</td>
<td>46</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>57</td>
<td>47</td>

<td>33.4</td>
<td>57.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><strong>79.9</strong></td>

<div style="display: flex; flex-direction: column; align-items: center;">
  <img src="https://github.com/OpenBMB/MiniCPM-V/blob/main/assets/minicpmo2_6/minicpmo2_6_math_intersect.png" alt="math" style="margin-bottom: 5px;">
  <img src="https://github.com/OpenBMB/MiniCPM-V/blob/main/assets/minicpmo2_6/minicpmo2_6_diagram_train_NN.png" alt="diagram" style="margin-bottom: 5px;">
  <img src="https://github.com/OpenBMB/MiniCPM-V/blob/main/assets/minicpmo2_6/minicpmo2_6_multi-image_bike.png" alt="bike" style="margin-bottom: 5px;">
</div>
### Omni mode
We provide two inference modes: chat and streaming.
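
Both modes below assume an already-initialized `model` and `tokenizer`. As a minimal sketch of the standard Transformers remote-code loading path (the dtype and device choices are common-usage assumptions, not prescriptions from this README excerpt):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# assumed loading sketch for the openbmb/MiniCPM-o-2_6 checkpoint
model = AutoModel.from_pretrained(
    'openbmb/MiniCPM-o-2_6',
    trust_remote_code=True,       # the checkpoint ships custom chat/streaming code
    torch_dtype=torch.bfloat16,   # assumed dtype; adjust to your hardware
)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
```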

#### Chat inference
```python
import math
import numpy as np

# ... (intervening code omitted in this excerpt)
res = model.chat(
    # ...
)
print(res)
```

#### Streaming inference
```python
# a new conversation needs a session reset first; this clears the kv-cache
model.reset_session()
# ... (rest of the streaming example omitted in this excerpt)
```

`MiniCPM-o-2_6` has the same inference methods as `MiniCPM-V-2_6`.

#### Chat with single image
```python
# test.py
image = Image.open('xx.jpg').convert('RGB')
# ...
```
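
The excerpt ends mid-example. Purely for illustration, a single-image chat in the style of `MiniCPM-V-2_6` typically continues along these lines (the exact `model.chat` argument list is an assumption and should be checked against the full model card):

```python
# hypothetical continuation, reusing `model` and `tokenizer` from the loading sketch above
question = 'What is in the image?'                        # illustrative prompt
msgs = [{'role': 'user', 'content': [image, question]}]   # image first, then the text question

res = model.chat(msgs=msgs, tokenizer=tokenizer)          # argument names assumed from MiniCPM-V-2_6 usage
print(res)
```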