update readme
README.md
CHANGED
@@ -31,7 +31,7 @@ tags:
 - 🔥 **Leading Visual Capability.**
 MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows promising in-context learning capability.
 
-- 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual realtime speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and speech-to-text (STT) translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, voice cloning, role play, etc.
+- 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual realtime speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and speech-to-text (STT) translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc.
 
 - 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support realtime speech interaction**. It **outperforms GPT-4o-realtime and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding.
 
@@ -51,7 +51,7 @@ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github
 
 - **End-to-end Omni-modal Architecture.** Different modality encoders/decoders are connected and trained in an **end-to-end** fashion to fully exploit rich multimodal knowledge.
 - **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoders/decoders into online ones for **streaming inputs/outputs**. (2) We devise a **time-division multiplexing (TDM) mechanism** for omni-modality streaming processing in the LLM backbone. It divides parallel omni-modality streams into sequential information within small periodic time slices.
-- **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including a traditional text system prompt and **a new audio system prompt to determine the assistant voice**. This enables flexible voice configuration at inference time, and also facilitates voice cloning and description-based voice creation.
+- **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including a traditional text system prompt and **a new audio system prompt to determine the assistant voice**. This enables flexible voice configuration at inference time, and also facilitates end-to-end voice cloning and description-based voice creation.
 
 <div align="center">
 <img src="https://github.com/OpenBMB/MiniCPM-V/blob/main/assets/minicpm-o-26-framework.png" width=80%>
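
To make the TDM mechanism described in the hunk above concrete, here is a minimal Python sketch of the general idea; `Chunk` and `time_division_multiplex` are illustrative names, not the project's actual code. Parallel per-modality streams are cut into small periodic time slices and flattened into one sequential stream for the LLM backbone.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    """One encoded chunk of a single modality stream."""
    modality: str  # "video" or "audio"
    start: float   # chunk start time in seconds

def time_division_multiplex(streams, slice_len=1.0, horizon=3.0):
    """Flatten parallel modality streams into one sequence.

    Within each periodic time slice [t, t + slice_len), every stream's
    chunks are emitted back to back, so the backbone consumes a single
    sequential stream while all modalities stay time-aligned.
    """
    sequence = []
    t = 0.0
    while t < horizon:
        for stream in streams:  # fixed modality order inside each slice
            sequence.extend(c for c in stream if t <= c.start < t + slice_len)
        t += slice_len
    return sequence

# Toy run: 1 s video chunks and 0.5 s audio chunks over a 3 s horizon.
video = [Chunk("video", t) for t in (0.0, 1.0, 2.0)]
audio = [Chunk("audio", k * 0.5) for k in range(6)]
for c in time_division_multiplex([video, audio]):
    print(c.modality, c.start)
# -> video 0.0, audio 0.0, audio 0.5, video 1.0, audio 1.0, audio 1.5, ...
```

The fixed per-slice ordering is what turns parallel streams into "sequential information within small periodic time slices" while keeping the modalities aligned in time.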
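
Similarly, a hedged sketch of the configurable speech modeling design: the system prompt becomes a small multimodal record whose audio part determines the assistant voice. `MultimodalSysPrompt` and `build_sys_prompt` are hypothetical stand-ins for illustration, not the model's real inference API; see the repository for actual usage.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MultimodalSysPrompt:
    """A system prompt with both a text part and an audio part."""
    text: str                          # traditional text system prompt
    ref_audio: Optional[bytes] = None  # reference recording that sets the voice
    voice_desc: str = ""               # or a textual description of the voice

def build_sys_prompt(text, ref_audio=None, voice_desc=""):
    """Hypothetical helper: choose the assistant voice at inference time."""
    return MultimodalSysPrompt(text=text, ref_audio=ref_audio, voice_desc=voice_desc)

# End-to-end voice cloning: condition generation on a short reference
# recording (placeholder bytes stand in for real waveform data here).
clone_prompt = build_sys_prompt(
    "You are a helpful assistant.",
    ref_audio=b"<raw bytes of a short reference recording>",
)

# Description-based voice creation: no reference audio needed.
desc_prompt = build_sys_prompt(
    "You are a helpful assistant.",
    voice_desc="a warm, slightly fast female voice",
)
```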
@@ -735,7 +735,7 @@ Note: For proprietary models, we calculate token density based on the image enco
 </div>
 All results are from AudioEvals, and the evaluation methods along with further details can be found in <a href="https://github.com/OpenBMB/UltraEval-Audio" target="_blank">AudioEvals</a>.<br><br>
 
-**Voice Cloning**
+**End-to-end Voice Cloning**
 
 <div align="center">
 <table style="margin: 0px auto;">
@@ -1390,4 +1390,4 @@ If you find our work helpful, please consider citing our papers 📝 and liking
 journal={arXiv preprint arXiv:2408.01800},
 year={2024}
 }
-```
+```