Open-source Omni-modal Foundation Model Supporting Text, Image, Video, and Audio Inputs as Well as Text and Audio Outputs

Baichuan-Omni-1.5 🤗 | Baichuan-Omni-1.5-Base 🤗 | GitHub 📖 | Report 📖

OpenMM-Medical 🤗 | OpenAudioBench 🤗

Baichuan-Omni-1.5

Baichuan-Omni-1.5 is the latest and best-performing model in the Baichuan-omni series. It is trained and run for inference end to end. Compared with Baichuan-omni, it significantly improves text, image, audio, and video understanding as well as text and audio generation, and it adds new features such as controllable real-time voice conversation and multimodal real-time interaction. The main features of Baichuan-Omni-1.5 include:

  • 🔥 Multimodal Understanding and Interaction Capabilities. Baichuan-Omni-1.5 accepts images, videos, text, and audio as input and generates high-quality text and voice output. It also supports continuous video and audio streaming and real-time voice interaction with users. On OmniBench, a comprehensive benchmark for omnimodal understanding, Baichuan-Omni-1.5 ranks among the best open-source models and surpasses GPT-4o-mini.

  • 💪 Strong Visual Capability. Baichuan-Omni-1.5 has an average score of 73.3 on the OpenCompass list (an aggregate of 10 mainstream multimodal evaluation benchmarks). At a size of 7B, it surpasses mainstream closed-source commercial multimodal models such as GPT-4o-mini, Gemini 1.5 Pro, and Claude 3.5 Sonnet in single-image understanding. Its video understanding performance also exceeds that of GPT-4V, Claude 3.5 Sonnet, and open-source omnimodal models.

  • 🚀 Leading Medical Image Understanding Capabilities. Baichuan-Omni-1.5 achieves leading performance on GMAI-MMBench and OpenMM-Medical. On OpenMM-Medical, using only a 7B LLM, it scores 83.8%, about 3 percentage points above Qwen2-VL-72B (80.7%).

  • 🎙 Excellent Voice Capabilities. Baichuan-Omni-1.5 supports high-quality, controllable, real-time bilingual voice conversations in Chinese and English. It outperforms GPT-4o-realtime on speech understanding tasks (such as ASR and STT) and achieves the highest speech generation performance among open-source models in semantic and acoustic evaluations of voice conversations.

  • 🎬 Powerful Real-world Understanding and Other Features. Baichuan-Omni-1.5 further improves the visual understanding capabilities of Baichuan-omni. It can process images of any aspect ratio with up to 1.8 million pixels (such as 1344x1344). It scores 68.8 on RealWorldQA, surpassing closed-source commercial models such as GPT-4o-mini as well as recently open-sourced omnimodal models. It scores 85.6/83.6 on the English/Chinese evaluation subsets of MMBench, respectively, which places it in the top tier of models of the same size.

  • 💫 Provides 🤗 Base Model and 🤗 Instruct Model. Baichuan-Omni-1.5-Base is a high-performance omni-modal foundation model. Building on this base, Baichuan-Omni-1.5 is trained end to end on high-quality omnimodal alignment data for multimodal instruction tuning.
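
The image-size limit in the feature list above (any aspect ratio, up to about 1.8 million pixels, such as 1344x1344) can be illustrated with a short sketch. The helper below is hypothetical and is not part of the Baichuan-Omni-1.5 code: it assumes the limit is a total pixel budget equal to 1344 x 1344 and shows one way to downscale a larger image while keeping its aspect ratio. How the model's own preprocessor enforces the limit is not stated here.

```python
import math

# Assumed pixel budget: 1344 x 1344 = 1,806,336 (about 1.8 million pixels).
MAX_PIXELS = 1344 * 1344


def fit_to_pixel_budget(width, height, max_pixels=MAX_PIXELS):
    """Return (width, height) scaled down, if needed, to fit the pixel budget.

    Hypothetical helper for illustration only.
    It keeps the aspect ratio (up to rounding) and never upscales.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width * height <= max_pixels:
        return width, height
    # Scaling both sides by sqrt(budget / area) brings the area down to the budget.
    scale = math.sqrt(max_pixels / (width * height))
    new_w = max(1, math.floor(width * scale))
    new_h = max(1, math.floor(height * scale))
    return new_w, new_h


if __name__ == "__main__":
    for size in [(1344, 1344), (4000, 3000), (672, 2688)]:
        print(size, "->", fit_to_pixel_budget(*size))
```

Because the budget is on total pixels rather than on each side, a tall or wide image such as 672x2688 fits without resizing under this assumption.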

Model Architecture


  • End-to-end Omni-modal Architecture. We design a multi-stage, progressive, end-to-end training scheme for the encoding and decoding modules of each modality, so that the model makes full use of the rich knowledge in different modalities and that knowledge complements each other across modalities. Notably, the model is trained fully end to end with the NTP (next-token prediction) loss throughout the pre-training stage.
  • High-quality Controllable Audio Solution. The multimodal system prompt has been redesigned to include both a traditional text system prompt and a speech system prompt that specifies the model's voice. This makes it possible to control voice style through text or speech samples at inference time, and it supports advanced capabilities such as end-to-end voice cloning and timbre creation.
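
The NTP (next-token prediction) loss mentioned above is, in its standard form, the average cross-entropy between the model's predicted distribution at each position and the token that actually follows. The sketch below computes it in plain Python for toy inputs. It is illustrative only and is not the Baichuan-Omni-1.5 training code; the function name and the input layout are assumptions.

```python
import math


def ntp_loss(logits, targets):
    """Mean next-token cross-entropy over a sequence.

    logits:  list of per-position score lists, one score per vocabulary item
    targets: list of the token ids that actually come next at each position
    (Hypothetical helper; names and layout are assumptions for illustration.)
    """
    if len(logits) != len(targets) or not targets:
        raise ValueError("logits and targets must be non-empty and equal in length")
    total = 0.0
    for scores, target in zip(logits, targets):
        # Numerically stable log-sum-exp gives the log of the softmax normalizer.
        peak = max(scores)
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        # Negative log-probability of the correct next token.
        total += log_z - scores[target]
    return total / len(targets)


if __name__ == "__main__":
    # Toy example: vocabulary of 4 tokens, 2 positions.
    toy_logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 0.0, 0.0]]
    toy_targets = [0, 3]
    print(ntp_loss(toy_logits, toy_targets))
```

With uniform scores over a vocabulary of size V, the loss at that position equals log(V), which is a handy sanity check for any implementation.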

Open-source Evaluation Datasets

OpenMM-Medical

To comprehensively evaluate the model's multi-modal medical capabilities, we have constructed OpenMM-Medical, which includes data from 42 publicly available medical image datasets such as ACRIMA (retinal images), BioMediTech (microscope images), and CoronaHack (X-rays), totaling 88,996 images.

OpenAudioBench

To efficiently assess the model's "IQ", that is, its knowledge and reasoning ability when the input is speech, we developed OpenAudioBench. It comprises five end-to-end audio understanding sub-datasets: four public benchmarks (Llama Questions, Web Questions, TriviaQA, AlpacaEval) and a speech logical reasoning dataset created internally by the Baichuan team, totaling 2,701 entries. Together, these sub-datasets reflect the model's overall "IQ" level.

Evaluation

We suggest that readers refer to our GitHub repository for more details.



Pure Text Understanding

Comprehensive Tasks
Model Size MMLU (Acc.) CMMLU (Acc.) AGIEval (Acc.) C-Eval (Acc.) GAOKAO (Acc.)
Proprietary Models
GPT 4o - 88.0♢ 78.3♢ 62.3♢ 86.0♢ -
GPT 4o mini - 82.0 67.6 52.2 63.6 70.8
Open-source Models (Pure text)
MAP-Neo 7B 58.2 55.1 33.9 57.5 -
Qwen1.5-Chat 7B 61.5 68.0 39.3 68.8 -
Llama3-Instruct 8B 67.1 51.7 38.4 50.7 -
OLMo 7B 28.4 25.6 19.9 27.3 -
Open-source Models (Omni-modal)
VITA 8x7B 71.0* 46.6 46.2* 56.7* -
VITA-1.5 7B 71.0 75.1 47.9 65.6 57.4
Baichuan-Omni 7B 65.3 72.2 47.7 68.9 -
MiniCPM-o 2.6 7B 65.3 63.3 50.9 61.5 56.3
Baichuan-Omni-1.5 7B 72.2 75.5 54.4 73.1 73.5

Image Understanding

Multi-choice & Yes-or-No Question
Model Size MMBench-EN (Acc.) MMBench-CN (Acc.) SEED-IMG (Acc.) MMMU-val (Acc.) HallusionBench (Acc.)
Proprietary Models
GPT-4o - 83.4♢ 82.1♢ - 69.1♢ 55.0♢
GPT-4o-mini - 77.7 76.9 72.3 60.0♢ 46.1♢
Open Source Models (Vision-Language)
Qwen2-VL-7B 7B 86.4 81.9 76.5 52.7 50.6∗
MiniCPM-Llama3-V 2.5 8B 76.7 73.3 72.4 45.8∗ 42.5
Open Source Models (Omni-modal)
VITA 8x7B 74.7 71.4 72.6 45.3 39.7∗
VITA-1.5 7B 80.8 80.2 74.2 53.1 44.1
Baichuan-Omni 7B 76.2 74.9 74.1 47.3 47.8
MiniCPM-o 2.6 7B 83.6 81.8 75.4 51.1 50.1
Baichuan-Omni-1.5 7B 85.6 83.6 75.7 53.9 49.7

Visual Question Answering
Model Size RealWorldQA (Acc.) MathVista-mini (Acc.) TextVQA-val (Acc.) ChartQA (Acc.) OCRBench (Acc.)
Proprietary Models
GPT-4o - 75.4♢ 63.8♢ - 85.7♢ 73.6♢
GPT-4o-mini - 66.3 53.4 66.8 - 77.4
Open Source Models (Vision-Language)
Qwen2-VL-7B 7B 69.7 58.2∗ 84.3∗ 83.0∗ 84.5∗
MiniCPM-Llama3-V 2.5 8B 63.5 54.3∗ 76.6 72.0 72.5
Open Source Models (Omni-modal)
VITA 8x7B 59.0 44.9∗ 71.8 76.6 68.5∗
VITA-1.5 7B 66.8 66.5 74.9 79.6 73.3
Baichuan-Omni 7B 62.6 51.9 74.3 79.6 70.0
MiniCPM-o 2.6 7B 67.7 64.6 80.1 87.6 89.7∗
Baichuan-Omni-1.5 7B 68.8 63.6 83.2 84.9 84.0

Video Understanding

General VQA   
Model Size # Frames MVBench (Acc.) Egoschema (Acc.) VideoMME (Acc.) Perception-Test (Acc.)
Proprietary Models
Gemini 1.5 Pro - - 81.3♢ 63.2* 75.0♢ -
GPT 4o mini - - 55.2 58.5 63.6 48.2
GPT 4o - - - 77.2* 71.9♢ -
GPT 4V - - 43.7♢ 55.6* 59.9♢ -
Open-source Models (Vision-language)
Qwen2-VL-7B 7B 2 fps (max 768) 67.0* | 64.4 66.7* | 66.6 63.3* | 59.0 62.3* | 60.3
AnyGPT 8B 48 33.2 32.1 29.8 29.1
VideoLLaMA 2 7B 16 54.6* 51.7* 46.6* 51.4*
VideoChat2 7B 16 51.1* 42.1♢ 33.7♢ 47.3♢
LLaVA-NeXT-Video 7B 32 46.5♢ 43.9♢ 33.7♢ 48.8♢
Video-LLaVA 7B 8 41.0♢ 38.4♢ 39.9♢ 44.3♢
Open-source Models (Omni-modal)
VITA 8x7B 1 fps (max 32) 53.4 53.9 56.1 56.2
VITA-1.5 7B 1 fps (max 32) 55.5 54.7 57.3 57.6
Baichuan-Omni 7B 1 fps (max 32) 60.9 58.8 58.2 56.8
MiniCPM-o 2.6 7B 1 fps (max 64) 58.6 50.7 63.4 66.6
Baichuan-Omni-1.5 7B 1 fps (max 32) 63.7 62.4 60.1 68.9
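
The "# Frames" column above gives sampling settings such as "1 fps (max 32)". One plausible reading is: take frames at the stated rate and, if the video is long enough to exceed the cap, thin them out evenly. The sketch below implements that reading. The function name is hypothetical, and uniform subsampling is an assumption; the exact sampling procedure used for each model is not described here.

```python
def sample_frame_times(duration_s, fps=1.0, max_frames=32):
    """Return timestamps (in seconds) of the frames to keep.

    Hypothetical helper: samples at `fps`, then uniformly subsamples
    down to `max_frames` if the video yields more frames than the cap.
    """
    if duration_s <= 0 or fps <= 0 or max_frames <= 0:
        return []
    step = 1.0 / fps
    # Number of frames at the nominal rate (at least one for very short clips).
    n = max(1, int(duration_s * fps))
    times = [i * step for i in range(n)]
    if n <= max_frames:
        return times
    # Assumed capping rule: pick max_frames indices spread evenly over the clip.
    indices = [i * n // max_frames for i in range(max_frames)]
    return [times[i] for i in indices]


if __name__ == "__main__":
    print(len(sample_frame_times(10)))    # short clip: every sampled frame is kept
    print(len(sample_frame_times(120)))   # long clip: capped at max_frames
```

Under this reading, two models with the same frame rate but different caps (for example max 32 versus max 64) see the same frames for short clips and differ only on videos longer than the smaller cap.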

Open-ended VQA
Model Size # Frames ActivityNet-QA (Acc.) ActivityNet-QA (Score) MSVD-QA (Acc.) MSVD-QA (Score)
Proprietary Models
Gemini 1.5 Pro - - 56.7* - - -
GPT 4o mini - 1 fps (max 32) 62.1 3.1 67.5 3.3
GPT 4o - - 61.9* - - -
GPT 4V - - 59.5* - - -
Open-source Models (Vision-language)
Qwen2 VL 7B 2 fps (max 768) 17.4 1.9 61.1 3.5
VideoLLaMA 2 7B 16 50.2* 3.3* 70.9* 3.8*
VideoChat2 7B 16 49.1* 3.3* 70.0* 3.9*
LLaVA-NeXT-Video 7B 32 53.5* 3.2* 67.4 3.4
Video-LLaVA 7B 8 45.3* 3.3* 70.7* 3.9*
Open-source Models (Omni-modal)
VITA 8x7B 1 fps (max 32) 55.0 3.5 63.9 3.7
VITA-1.5 7B 1 fps (max 32) 59.6 3.0 67.6 3.3
Baichuan-Omni 7B 1 fps (max 48) 58.6 3.7 72.2 4.0
MiniCPM-o 2.6 7B 1 fps (max 64) 63.0 3.1 73.7 3.6
Baichuan-Omni-1.5 7B 1 fps (max 48) 62.0 3.1 74.2 3.6

Audio Comprehension and Speech Generation

Audio Comprehension Capacity
Model Size Reasoning QA (s→t) Reasoning QA (s→s) Llama Questions (s→t) Llama Questions (s→s) Web Questions (s→t) Web Questions (s→s) TriviaQA (s→t) TriviaQA (s→s) AlpacaEval (s→t) AlpacaEval (s→s)
Proprietary Models
GPT-4o-Audio - 55.6 - 88.4 - 8.10 - 9.06 - 8.01 -
Open-source Models (Pure Audio)
GLM-4-Voice 9B - 26.5 - 71.0 - 5.15 - 4.66 - 4.89
Open-source Models (Omni-modal)
VITA-1.5 7B 41.0 - 74.2 - 5.73 - 4.68 - 6.82 -
MiniCPM-o 2.6 7B 38.6 - 77.8 - 6.86 - 6.19 - 5.18 -
Baichuan-Omni-1.5 7B 50.0 40.9 78.5 75.3 5.91 5.52 5.72 5.31 7.79 6.94

Omni-modal Understanding

Omni-Understanding
Model Size Image & Audio Image Caption & Audio Image & Audio Transcript Image Caption & Audio Transcript
Proprietary Models
GPT4o-mini - - - 37.0 37.7
Open-source Models (Omni-modal)
VITA 8x7B 33.1 31.8 42.0 44.2
VITA-1.5 7B 33.4 29.6 48.5 47.2
Baichuan-Omni 7B 32.2 26.5 42.6 44.2
MiniCPM-o 2.6 7B 40.5 30.8 53.2 46.3
Baichuan-Omni-1.5 7B 42.9 37.7 47.9 46.9

Medical Image Understanding Capabilities

Medical Understanding   
Model Size GMAI-MMB-VAL (Acc.) OpenMM-Medical (Acc.)
Proprietary Models
GPT4o-mini - 46.4 74.3
Open-source Models (Vision-Language)
Qwen2 VL 7B 46.3 76.9
Qwen2 VL 72B 50.7 80.7
Open-source Models (Omni-modal)
VITA-1.5 7B 36.7 67.1
MiniCPM-o 2.6 7B 41.5 73.6
Baichuan-Omni-1.5 7B 49.9 83.8

Examples


[Example images: pipeline, math, fly_bill]

🚀 Quick Start

We recommend that interested scholars visit our GitHub repository for more details.

Statement

  • We hereby declare that our team has not developed any applications based on the Baichuan-Omni-1.5/Baichuan-Omni-1.5-base models, whether on iOS, Android, the web, or any other platform. We strongly urge all users not to use the Baichuan-Omni-1.5/Baichuan-Omni-1.5-base models for any activities that harm national or social security or violate the law. We also ask users not to use the Baichuan-Omni-1.5/Baichuan-Omni-1.5-base models for Internet services that have not undergone appropriate security reviews and filings. We hope that all users will abide by these principles so that technology develops in a regulated and legal environment.

  • We have done our best to ensure the compliance of the data used in the model training process. However, despite our considerable efforts, there may still be some unforeseeable issues due to the complexity of the model and data. Therefore, if any problems arise due to the use of Baichuan-Omni-1.5/Baichuan-Omni-1.5-base open-source models, including but not limited to data security issues, public opinion risks, or any risks and problems brought about by the model being misled, abused, spread or improperly exploited, we will not assume any responsibility.

License

Community use of Baichuan-Omni-1.5/Baichuan-Omni-1.5-base requires adherence to Apache 2.0 and the Community License for Baichuan-Omni-1.5 Models. The Baichuan-Omni-1.5/Baichuan-Omni-1.5-base models support commercial use. If you plan to use the Baichuan-Omni-1.5/Baichuan-Omni-1.5-base models or their derivatives for commercial purposes, please ensure that your entity meets the following conditions:

  1. The Daily Active Users (DAU) of your or your affiliate's service or product is less than 1 million.
  2. Neither you nor your affiliates are software service providers or cloud service providers.
  3. Neither you nor your affiliates may sublicense or otherwise reauthorize the commercial license granted to you to third parties without Baichuan's permission.

Upon meeting the above conditions, you need to submit the application materials required by the Baichuan-Omni-1.5 Model Community License Agreement via the following contact email: [email protected]. Once approved, Baichuan will hereby grant you a non-exclusive, global, non-transferable, non-sublicensable, revocable commercial copyright license.

