update readme
README.md
CHANGED
@@ -31,7 +31,7 @@ tags:
 - 🔥 **Leading Visual Capability.**
 MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows promising in-context learning capability.
 
-- 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual realtime speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and speech-to-text (STT) translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, voice cloning, role play, etc.
+- 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual realtime speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and speech-to-text (STT) translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc.
 
 - 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support realtime speech interaction**. It **outperforms GPT-4o-realtime and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding.
 
@@ -51,7 +51,7 @@ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github
 
 - **End-to-end Omni-modal Architecture.** Different modality encoders/decoders are connected and trained in an **end-to-end** fashion to fully exploit rich multimodal knowledge.
 - **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoders/decoders into online ones for **streaming inputs/outputs**. (2) We devise a **time-division multiplexing (TDM) mechanism** for omni-modality streaming processing in the LLM backbone. It divides parallel omni-modality streams into sequential information within small periodic time slices.
-- **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including a traditional text system prompt and **a new audio system prompt to determine the assistant voice**. This enables flexible voice configuration at inference time, and also facilitates voice cloning and description-based voice creation.
+- **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including a traditional text system prompt and **a new audio system prompt to determine the assistant voice**. This enables flexible voice configuration at inference time, and also facilitates end-to-end voice cloning and description-based voice creation.
 
 <div align="center">
 <img src="https://github.com/OpenBMB/MiniCPM-V/blob/main/assets/minicpm-o-26-framework.png" width=80%>
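
To make the TDM mechanism described in the hunk above concrete, here is a minimal Python sketch of the general idea; `Chunk` and `time_division_multiplex` are illustrative names, not the project's actual code. Parallel per-modality streams are cut into small periodic time slices and flattened into one sequential stream for the LLM backbone.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    """One encoded chunk of a single modality stream."""
    modality: str  # "video" or "audio"
    start: float   # chunk start time in seconds

def time_division_multiplex(streams, slice_len=1.0, horizon=3.0):
    """Flatten parallel modality streams into one sequence.

    Within each periodic time slice [t, t + slice_len), every stream's
    chunks are emitted back to back, so the backbone consumes a single
    sequential stream while all modalities stay time-aligned.
    """
    sequence = []
    t = 0.0
    while t < horizon:
        for stream in streams:  # fixed modality order inside each slice
            sequence.extend(c for c in stream if t <= c.start < t + slice_len)
        t += slice_len
    return sequence

# Toy run: 1 s video chunks and 0.5 s audio chunks over a 3 s horizon.
video = [Chunk("video", t) for t in (0.0, 1.0, 2.0)]
audio = [Chunk("audio", k * 0.5) for k in range(6)]
for c in time_division_multiplex([video, audio]):
    print(c.modality, c.start)
# -> video 0.0, audio 0.0, audio 0.5, video 1.0, audio 1.0, audio 1.5, ...
```

The fixed per-slice ordering is what turns parallel streams into "sequential information within small periodic time slices" while keeping the modalities aligned in time.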
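
Similarly, a hedged sketch of the configurable speech modeling design: the system prompt becomes a small multimodal record whose audio part determines the assistant voice. `MultimodalSysPrompt` and `build_sys_prompt` are hypothetical stand-ins for illustration, not the model's real inference API; see the repository for actual usage.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MultimodalSysPrompt:
    """A system prompt with both a text part and an audio part."""
    text: str                          # traditional text system prompt
    ref_audio: Optional[bytes] = None  # reference recording that sets the voice
    voice_desc: str = ""               # or a textual description of the voice

def build_sys_prompt(text, ref_audio=None, voice_desc=""):
    """Hypothetical helper: choose the assistant voice at inference time."""
    return MultimodalSysPrompt(text=text, ref_audio=ref_audio, voice_desc=voice_desc)

# End-to-end voice cloning: condition generation on a short reference
# recording (placeholder bytes stand in for real waveform data here).
clone_prompt = build_sys_prompt(
    "You are a helpful assistant.",
    ref_audio=b"<raw bytes of a short reference recording>",
)

# Description-based voice creation: no reference audio needed.
desc_prompt = build_sys_prompt(
    "You are a helpful assistant.",
    voice_desc="a warm, slightly fast female voice",
)
```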
@@ -735,7 +735,7 @@ Note: For proprietary models, we calculate token density based on the image enco
 </div>
 All results are from AudioEvals, and the evaluation methods along with further details can be found in <a href="https://github.com/OpenBMB/UltraEval-Audio" target="_blank">AudioEvals</a>.<br><br>
 
-**Voice Cloning**
+**End-to-end Voice Cloning**
 
 <div align="center">
 <table style="margin: 0px auto;">
@@ -1390,4 +1390,4 @@ If you find our work helpful, please consider citing our papers 📝 and liking
 journal={arXiv preprint arXiv:2408.01800},
 year={2024}
 }
-```
+```