游雁 committed on
Commit
d02de2d
·
1 Parent(s): a7f5585
README.md CHANGED
@@ -1,17 +1,33 @@
1
- ---
2
- license: other
3
- license_name: model-license
4
- license_link: https://github.com/modelscope/FunASR/blob/main/MODEL_LICENSE
5
- ---
6
 
7
  # Introduction
8
 
9
  SenseVoice is a speech foundation model with multiple speech understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event detection (AED).
10
 
11
- <div align="center"><img src="fig/sensevoice.png" width="1000"/> </div>
12
 
13
 
14
- # Highlights
 
15
  **SenseVoice** focuses on high-accuracy multilingual speech recognition, speech emotion recognition, and audio event detection.
16
  - **Multilingual Speech Recognition:** Trained with over 400,000 hours of data, supporting more than 50 languages, the recognition performance surpasses that of the Whisper model.
17
  - **Rich transcribe:**
@@ -21,189 +37,168 @@ SenseVoice is a speech foundation model with multiple speech understanding capab
21
  - **Convenient Finetuning:** Provide convenient finetuning scripts and strategies, allowing users to easily address long-tail sample issues according to their business scenarios.
22
  - **Service Deployment:** Offer service deployment pipeline, supporting multi-concurrent requests, with client-side languages including Python, C++, HTML, Java, and C#, among others.
23
 
 
 
 
 
 
24
 
25
- ## <strong>[SenseVoice Project]()</strong>
26
- <strong>[SenseVoice]()</strong> is a speech foundation model with multiple speech understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and acoustic event detection (AED).
27
 
28
- [**github**]()
29
- | [**What's New**]()
30
- | [**Requirements**]()
31
 
 
 
 
32
 
33
- # SenseVoice Model
34
- SenseVoice-Small is an encoder-only speech foundation model designed for rapid voice understanding. It encompasses a variety of features including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), acoustic event detection (AED), and Inverse Text Normalization (ITN). SenseVoice-Small supports multilingual recognition for Chinese, English, Cantonese, Japanese, and Korean.
35
 
 
36
 
37
- <p align="center">
38
- <img src="fig/sensevoice.png" width="1500" />
39
- </p>
40
 
41
- The SenseVoice-Small model is based on a non-autoregressive end-to-end framework. For a specified task, we prepend four embeddings as input to the encoder:
42
 
43
- LID: For predicting the language id of the audio.
 
 
44
 
45
- SER: For predicting the emotion label of the audio.
46
 
47
- AED: For predicting the event label of the audio.
48
 
49
- ITN: Used to specify whether the recognition output text is subjected to inverse text normalization.
 
 
50
 
51
- # Usage
52
 
53
- ## Inference
54
 
55
- ### Method 1
 
 
56
 
57
- ```python
58
- from model import SenseVoiceSmall
59
 
60
- model_dir = "iic/SenseVoiceSmall"
61
- m, kwargs = SenseVoiceSmall.from_pretrained(model=model_dir)
62
 
 
 
 
63
 
64
- res = m.inference(
65
- data_in="https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav",
66
- language="auto", # "zn", "en", "yue", "ja", "ko", "nospeech"
67
- use_itn=False,
68
- **kwargs,
69
- )
70
 
71
- print(res)
72
- ```
73
 
74
- ### Method 2
75
 
76
  ```python
77
  from funasr import AutoModel
 
78
 
79
- model_dir = "iic/SenseVoiceSmall"
80
- input_file = (
81
- "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav"
82
- )
83
 
84
- model = AutoModel(model=model_dir,
85
- vad_model="fsmn-vad",
86
- vad_kwargs={"max_single_segment_time": 30000},
87
- trust_remote_code=True, device="cuda:0")
88
89
  res = model.generate(
90
- input=input_file,
91
  cache={},
92
- language="auto", # "zn", "en", "yue", "ja", "ko", "nospeech"
93
- use_itn=False,
94
- batch_size_s=0,
 
 
95
  )
96
-
97
- print(res)
98
  ```
99
 
100
- The funasr version has integrated the VAD (Voice Activity Detection) model and supports audio input of any duration, with `batch_size_s` in seconds.
101
- If all inputs are short audios, and batch inference is needed to speed up inference efficiency, the VAD model can be removed, and `batch_size` can be set accordingly.
 
 
 
 
 
102
 
 
103
  ```python
104
- model = AutoModel(model=model_dir, trust_remote_code=True, device="cuda:0")
105
 
106
  res = model.generate(
107
- input=input_file,
108
  cache={},
109
- language="auto", # "zn", "en", "yue", "ja", "ko", "nospeech"
110
  use_itn=False,
111
- batch_size=64,
 
112
  )
113
  ```
114
 
115
- For more usage, please ref to [docs](https://github.com/modelscope/FunASR/blob/main/docs/tutorial/README.md)
116
 
 
117
 
118
- ### Export and Test
119
 
120
  ```python
121
- # pip3 install -U funasr-onnx
122
- from funasr_onnx import SenseVoiceSmall
123
 
124
- model_dir = "iic/SenseVoiceCTC"
125
- model = SenseVoiceSmall(model_dir, batch_size=1, quantize=True)
126
 
127
- wav_path = [f'~/.cache/modelscope/hub/{model_dir}/example/asr_example.wav']
128
 
129
- result = model(wav_path)
130
- print(result)
131
- ```
 
 
 
132
 
 
 
 
133
 
 
 
134
  ## Service
135
 
136
- Undo
137
-
138
 
139
  ## Finetune
140
 
141
- ### Requirements
142
-
143
- ```shell
144
- git clone https://github.com/alibaba/FunASR.git && cd FunASR
145
- pip3 install -e ./
146
- ```
147
-
148
- ### Data prepare
149
 
150
- Data examples
151
-
152
- ```text
153
- {"key": "YOU0000008470_S0000238_punc_itn", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|withitn|>", "target": "Including legal due diligence, subscription agreement, negotiation.", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/YOU0000008470_S0000238.wav", "target_len": 7, "source_len": 140}
154
- {"key": "AUD0000001556_S0007580", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", "target": "there is a tendency to identify the self or take interest in what one has got used to", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/AUD0000001556_S0007580.wav", "target_len": 18, "source_len": 360}
155
- ```
156
-
157
- Full ref to `data/train_example.jsonl`
158
-
159
- ### Finetune
160
-
161
- Ensure to modify the train_tool in finetune.sh to the absolute path of `funasr/bin/train_ds.py` from the FunASR installation directory you have set up earlier.
162
 
163
  ```shell
164
- bash finetune.sh
165
  ```
166
 
 
167
 
168
- # Performance
169
-
170
- ## Multilingual Speech Recognition
171
-
172
- We compared the performance of multilingual speech recognition between SenseVoice and Whisper on open-source benchmark datasets, including AISHELL-1, AISHELL-2, Wenetspeech, LibriSpeech, and Common Voice. n terms of Chinese and Cantonese recognition, the SenseVoice-Small model has advantages.
173
-
174
- <div align="center">
175
- <img src="fig/asr_results.png" width="1000" />
176
- </div>
177
-
178
-
179
-
180
- ## Speech Emotion Recognition
181
-
182
- Due to the current lack of widely-used benchmarks and methods for speech emotion recognition, we conducted evaluations across various metrics on multiple test sets and performed a comprehensive comparison with numerous results from recent benchmarks. The selected test sets encompass data in both Chinese and English, and include multiple styles such as performances, films, and natural conversations. Without finetuning on the target data, SenseVoice was able to achieve and exceed the performance of the current best speech emotion recognition models.
183
-
184
- <div align="center">
185
- <img src="fig/ser_table.png" width="1000" />
186
- </div>
187
 
188
- Furthermore, we compared multiple open-source speech emotion recognition models on the test sets, and the results indicate that the SenseVoice-Large model achieved the best performance on nearly all datasets, while the SenseVoice-Small model also surpassed other open-source models on the majority of the datasets.
189
 
190
- <div align="center">
191
- <img src="fig/ser_figure.png" width="500" />
192
- </div>
193
 
194
- ## Audio Event Detection
195
 
196
- Although trained exclusively on speech data, SenseVoice can still function as a standalone event detection model. We compared its performance on the environmental sound classification ESC-50 dataset against the widely used industry models BEATS and PANN. The SenseVoice model achieved commendable results on these tasks. However, due to limitations in training data and methodology, its event classification performance has some gaps compared to specialized AED models.
197
-
198
- <div align="center">
199
- <img src="fig/aed_figure.png" width="500" />
200
- </div>
201
-
202
-
203
- ## Computational Efficiency
204
-
205
- The SenseVoice-Small model non-autoregressive end-to-end architecture, resulting in extremely low inference latency. With a similar number of parameters to the Whisper-Small model, it infers 7 times faster than Whisper-Small and 17 times faster than Whisper-Large.
206
-
207
- <div align="center">
208
- <img src="fig/inference.png" width="1000" />
209
- </div>
 
1
+ ([简体中文](./README_zh.md)|English)
2
+
 
 
 
3
 
4
  # Introduction
5
 
6
  SenseVoice is a speech foundation model with multiple speech understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event detection (AED).
7
 
8
+ <img src="image/sensevoice2.png">
9
+
10
+ [//]: # (<div align="center"><img src="image/sensevoice.png" width="700"/> </div>)
11
+
12
+ <div align="center">
13
+ <h4>
14
+ <a href="https://www.modelscope.cn/studios/iic/SenseVoice"> Online Demo </a>
15
+ |<a href="https://fun-audio-llm.github.io/"> Homepage </a>
16
+ |<a href="#What's News"> What's New </a>
17
+ |<a href="#Benchmarks"> Benchmarks </a>
18
+ |<a href="#Install"> Install </a>
19
+ |<a href="#Usage"> Usage </a>
20
+ |<a href="#Community"> Community </a>
21
+ </h4>
22
+
23
+ Model Zoo:
24
+ [modelscope](https://www.modelscope.cn/models/iic/SenseVoiceSmall), [huggingface](https://huggingface.co/FunAudioLLM/SenseVoiceSmall)
25
+
26
+ </div>
27
 
28
 
29
+ <a name="Highlights"></a>
30
+ # Highlights 🎯
31
  **SenseVoice** focuses on high-accuracy multilingual speech recognition, speech emotion recognition, and audio event detection.
32
  - **Multilingual Speech Recognition:** Trained with over 400,000 hours of data, supporting more than 50 languages, the recognition performance surpasses that of the Whisper model.
33
  - **Rich transcribe:**
 
37
  - **Convenient Finetuning:** Provide convenient finetuning scripts and strategies, allowing users to easily address long-tail sample issues according to their business scenarios.
38
  - **Service Deployment:** Offer service deployment pipeline, supporting multi-concurrent requests, with client-side languages including Python, C++, HTML, Java, and C#, among others.
39
 
40
+ <a name="What's News"></a>
41
+ # What's New 🔥
42
+ - 2024/7: The [SenseVoice-Small](https://www.modelscope.cn/models/iic/SenseVoiceSmall) voice understanding model is open-sourced. It offers high-precision multilingual speech recognition, emotion recognition, and audio event detection for Mandarin, Cantonese, English, Japanese, and Korean, with exceptionally low inference latency.
43
+ - 2024/7: CosyVoice focuses on natural speech generation with multi-language, timbre, and emotion control. It excels in multilingual voice generation, zero-shot voice generation, cross-lingual voice cloning, and instruction following. [CosyVoice repo](https://github.com/FunAudioLLM/CosyVoice) and [CosyVoice space](https://www.modelscope.cn/studios/iic/CosyVoice-300M).
44
+ - 2024/7: [FunASR](https://github.com/modelscope/FunASR) is a fundamental speech recognition toolkit that offers a variety of features, including speech recognition (ASR), Voice Activity Detection (VAD), Punctuation Restoration, Language Models, Speaker Verification, Speaker Diarization and multi-talker ASR.
45
 
46
+ <a name="Benchmarks"></a>
47
+ # Benchmarks 📝
48
 
49
+ ## Multilingual Speech Recognition
50
+ We compared the performance of multilingual speech recognition between SenseVoice and Whisper on open-source benchmark datasets, including AISHELL-1, AISHELL-2, Wenetspeech, LibriSpeech, and Common Voice. In terms of Chinese and Cantonese recognition, the SenseVoice-Small model has advantages.
 
51
 
52
+ <div align="center">
53
+ <img src="image/asr_results1.png" width="400" /><img src="image/asr_results2.png" width="400" />
54
+ </div>
55
 
56
+ ## Speech Emotion Recognition
 
57
 
58
+ Due to the current lack of widely-used benchmarks and methods for speech emotion recognition, we conducted evaluations across various metrics on multiple test sets and performed a comprehensive comparison with numerous results from recent benchmarks. The selected test sets encompass data in both Chinese and English, and include multiple styles such as performances, films, and natural conversations. Without finetuning on the target data, SenseVoice was able to achieve and exceed the performance of the current best speech emotion recognition models.
59
 
60
+ <div align="center">
61
+ <img src="image/ser_table.png" width="1000" />
62
+ </div>
63
 
64
+ Furthermore, we compared multiple open-source speech emotion recognition models on the test sets, and the results indicate that the SenseVoice-Large model achieved the best performance on nearly all datasets, while the SenseVoice-Small model also surpassed other open-source models on the majority of the datasets.
65
 
66
+ <div align="center">
67
+ <img src="image/ser_figure.png" width="500" />
68
+ </div>
69
 
70
+ ## Audio Event Detection
71
 
72
+ Although trained exclusively on speech data, SenseVoice can still function as a standalone event detection model. We compared its performance on the environmental sound classification ESC-50 dataset against the widely used industry models BEATS and PANN. The SenseVoice model achieved commendable results on these tasks. However, due to limitations in training data and methodology, its event classification performance has some gaps compared to specialized AED models.
73
 
74
+ <div align="center">
75
+ <img src="image/aed_figure.png" width="500" />
76
+ </div>
77
 
78
+ ## Computational Efficiency
79
 
80
+ The SenseVoice-Small model uses a non-autoregressive end-to-end architecture, resulting in extremely low inference latency. With a similar number of parameters to the Whisper-Small model, it runs inference more than 5 times faster than Whisper-Small and 15 times faster than Whisper-Large.
81
 
82
+ <div align="center">
83
+ <img src="image/inference.png" width="1000" />
84
+ </div>
85
 
 
 
86
 
87
+ # Requirements
 
88
 
89
+ ```shell
90
+ pip install -r requirements.txt
91
+ ```
92
 
93
+ <a name="Usage"></a>
94
+ # Usage
 
 
 
 
95
 
96
+ ## Inference
 
97
 
98
+ Supports input of audio in any format and of any duration.
99
 
100
  ```python
101
  from funasr import AutoModel
102
+ from funasr.utils.postprocess_utils import rich_transcription_postprocess
103
 
104
+ model_dir = "FunAudioLLM/SenseVoiceSmall"
 
 
 
105
 
 
 
 
 
106
 
107
+ model = AutoModel(
108
+ model=model_dir,
109
+ vad_model="fsmn-vad",
110
+ vad_kwargs={"max_single_segment_time": 30000},
111
+ device="cuda:0",
112
+ hub="hf",
113
+ )
114
+
115
+ # en
116
  res = model.generate(
117
+ input=f"{model.model_path}/example/en.mp3",
118
  cache={},
119
+ language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
120
+ use_itn=True,
121
+ batch_size_s=60,
122
+ merge_vad=True, # merge short segments produced by the VAD model
123
+ merge_length_s=15,
124
  )
125
+ text = rich_transcription_postprocess(res[0]["text"])
126
+ print(text)
127
  ```
128
 
129
+ Parameter Description:
130
+ - `model_dir`: The name of the model, or the path to the model on the local disk.
131
+ - `vad_model`: Enables VAD (Voice Activity Detection), which splits long audio into shorter clips. With VAD enabled, the measured inference time covers both VAD and SenseVoice and represents the end-to-end latency. To measure SenseVoice's inference time separately, disable the VAD model.
132
+ - `vad_kwargs`: Configuration for the VAD model. `max_single_segment_time` is the maximum duration of an audio segment produced by `vad_model`, in milliseconds (ms).
133
+ - `use_itn`: Whether the output includes punctuation and inverse text normalization.
134
+ - `batch_size_s`: Enables dynamic batching; the value is the total duration of audio in a batch, in seconds (s).
135
+ - `merge_vad`: Whether to merge the short audio fragments produced by the VAD model; the merged length is `merge_length_s`, in seconds (s).
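
To make `merge_vad` / `merge_length_s` and `batch_size_s` more concrete, the following is a minimal pure-Python sketch of the two ideas. It is not FunASR's implementation: the function names `merge_segments` and `batch_by_duration`, the greedy strategy, and the sample segment list are all assumptions made for illustration.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start_ms, end_ms), as a VAD model might emit (assumed shape)


def merge_segments(segments: List[Segment], merge_length_s: float) -> List[Segment]:
    """Greedily merge consecutive segments while the merged span stays within merge_length_s."""
    max_ms = int(merge_length_s * 1000)
    merged: List[Segment] = []
    for start, end in segments:
        if merged and end - merged[-1][0] <= max_ms:
            merged[-1] = (merged[-1][0], end)  # extend the current merged span
        else:
            merged.append((start, end))  # start a new span
    return merged


def batch_by_duration(segments: List[Segment], batch_size_s: float) -> List[List[Segment]]:
    """Group segments so the summed duration of each batch stays within batch_size_s."""
    budget_ms = int(batch_size_s * 1000)
    batches: List[List[Segment]] = []
    current: List[Segment] = []
    used = 0
    for seg in segments:
        duration = seg[1] - seg[0]
        if current and used + duration > budget_ms:
            batches.append(current)  # budget exceeded: close the current batch
            current, used = [], 0
        current.append(seg)
        used += duration
    if current:
        batches.append(current)
    return batches


# Hypothetical VAD output in milliseconds.
vad_output = [(0, 4000), (4500, 9000), (9500, 16000), (20000, 31000)]
merged = merge_segments(vad_output, merge_length_s=15)
batches = batch_by_duration(merged, batch_size_s=60)
print(merged)
print(batches)
```

With these sample values the first two segments merge into one span, and all merged spans fit into a single 60-second batch. A smaller `batch_size_s` would split them across several batches.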
136
 
137
+ If all inputs are short audio clips (<30s) and batch inference is needed to speed things up, the VAD model can be removed and `batch_size` set accordingly.
138
  ```python
139
+ model = AutoModel(model=model_dir, device="cuda:0", hub="hf")
140
 
141
  res = model.generate(
142
+ input=f"{model.model_path}/example/en.mp3",
143
  cache={},
144
+ language="zh", # "zh", "en", "yue", "ja", "ko", "nospeech"
145
  use_itn=False,
146
+ batch_size=64,
147
+ hub="hf",
148
  )
149
  ```
150
 
151
+ For more usage details, refer to the [docs](https://github.com/modelscope/FunASR/blob/main/docs/tutorial/README.md).
152
 
153
+ ### Inference directly
154
 
155
+ Supports audio input in any format, with input duration limited to 30 seconds or less.
156
 
157
  ```python
158
+ from model import SenseVoiceSmall
159
+ from funasr.utils.postprocess_utils import rich_transcription_postprocess
160
 
161
+ model_dir = "FunAudioLLM/SenseVoiceSmall"
162
+ m, kwargs = SenseVoiceSmall.from_pretrained(model=model_dir, device="cuda:0", hub="hf")
163
 
 
164
 
165
+ res = m.inference(
166
+ data_in=f"{kwargs['model_path']}/example/en.mp3",
167
+ language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
168
+ use_itn=False,
169
+ **kwargs,
170
+ )
171
 
172
+ text = rich_transcription_postprocess(res[0][0]["text"])
173
+ print(text)
174
+ ```
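
The examples above pass the raw `text` field through `rich_transcription_postprocess`. As a rough illustration of why, the sketch below separates `<|...|>` markers from the spoken text with a regular expression. It assumes the raw output carries prefix tags in the same form as the `<|en|>`, `<|NEUTRAL|>`, `<|Speech|>` and `<|withitn|>` values shown in the fine-tuning data example of the previous README. The helper `split_rich_tags` is hypothetical and is not part of funasr, and the sample string is assembled by hand from that data example.

```python
import re
from typing import List, Tuple

# Matches markers of the form <|...|>.
TAG_PATTERN = re.compile(r"<\|([^|]*)\|>")


def split_rich_tags(raw: str) -> Tuple[List[str], str]:
    """Return (tags, plain_text) for a string containing <|...|> markers."""
    tags = TAG_PATTERN.findall(raw)
    plain = TAG_PATTERN.sub("", raw).strip()
    return tags, plain


# Hand-built sample, not real model output.
sample = (
    "<|en|><|NEUTRAL|><|Speech|><|withitn|>"
    "Including legal due diligence, subscription agreement, negotiation."
)
tags, plain = split_rich_tags(sample)
print(tags)
print(plain)
```

For real use, `rich_transcription_postprocess` remains the supported path; this sketch only shows the general shape of the tagged text it consumes.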
175
 
176
+ ### Export and Test (*Ongoing*)
177
+ Refer to [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)
178
  ## Service
179
 
180
+ Refer to [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)
 
181
 
182
  ## Finetune
183
 
184
+ Refer to [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)
 
 
 
 
 
 
 
185
 
186
+ ## WebUI
187
 
188
  ```shell
189
+ python webui.py
190
  ```
191
 
192
+ <div align="center"><img src="image/webui.png" width="700"/> </div>
193
 
194
+ <a name="Community"></a>
195
+ # Community
196
+ If you run into problems, you can open an issue directly on the GitHub page.
197
 
198
+ You can also scan the following DingTalk group QR code to join the community group for communication and discussion.
199
 
200
+ | FunAudioLLM | FunASR |
201
+ |:----------------------------------------------------------------:|:--------------------------------------------------------:|
202
+ | <div align="left"><img src="image/dingding_sv.png" width="250"/> | <img src="image/dingding_funasr.png" width="250"/></div> |
203
 
 
204
README_zh.md ADDED
@@ -0,0 +1,213 @@
1
+ # SenseVoice
2
+
3
+ 「Simplified Chinese」|「[English](./README.md)」
4
+
5
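+ SenseVoice is an audio foundation model with audio understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and acoustic event classification (AEC) or acoustic event detection (AED). This project introduces the SenseVoice model, reports benchmarks on test sets for multiple tasks, and describes the environment setup and inference methods needed to try the model.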
6
+
7
+ <div align="center">
8
+ <img src="image/sensevoice2.png">
9
+
10
+ [//]: # (<div align="center"><img src="image/sensevoice2.png" width="700"/> </div>)
11
+
12
+ <h4>
13
+ <a href="https://www.modelscope.cn/studios/iic/SenseVoice"> Online Demo </a>
14
+ |<a href="#What's New"> Documentation Home </a>
15
+ |<a href="#核心功能"> Key Features </a>
16
+ </h4>
17
+ <h4>
18
+ <a href="#On Going"> What's New </a>
19
+ |<a href="#Benchmark"> Benchmark </a>
20
+ |<a href="#环境安装"> Installation </a>
21
+ |<a href="#用法教程"> Usage </a>
22
+ |<a href="#联系我们"> Contact Us </a>
23
+ </h4>
24
+
25
+ Model Zoo: [modelscope](https://www.modelscope.cn/models/iic/SenseVoiceSmall) is recommended for users in mainland China, and [huggingface](https://huggingface.co/FunAudioLLM/SenseVoiceSmall) for users elsewhere.
26
+ </div>
27
+
28
+ <a name="核心功能"></a>
29
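+ # Key Features 🎯
30
+ **SenseVoice** focuses on high-accuracy multilingual speech recognition, speech emotion recognition, and audio event detection.
31
+ - **Multilingual Speech Recognition:** Trained on over 400,000 hours of data and supporting more than 50 languages, with recognition performance that surpasses the Whisper model.
32
+ - **Rich Transcription:**
33
+   - Excellent emotion recognition, matching or exceeding the current best emotion recognition models on test data.
34
+   - Sound event detection, covering common human-computer interaction events such as music, applause, laughter, crying, coughing, and sneezing.
35
+ - **Efficient Inference:** SenseVoice-Small uses a non-autoregressive end-to-end framework with extremely low inference latency: 10 s of audio takes only 70 ms, 15 times faster than Whisper-Large.
36
+ - **Convenient Finetuning:** Convenient finetuning scripts and strategies let users fix long-tail sample issues for their business scenarios.
37
+ - **Service Deployment:** A complete service deployment pipeline that supports multi-concurrent requests, with client-side languages including Python, C++, HTML, Java, and C#.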
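
The latency figure quoted in the feature list above (10 s of audio processed in 70 ms) can be restated as a real-time factor. The snippet below is only an arithmetic sketch: the variable names are illustrative, and the numbers are the README's own claim rather than a new measurement.

```python
audio_seconds = 10.0     # audio duration quoted in the feature list
latency_seconds = 0.070  # 70 ms inference time quoted in the feature list

# Real-time factor: processing time divided by audio duration (lower is faster).
rtf = latency_seconds / audio_seconds
# Equivalent view: how many seconds of audio are processed per second of compute.
speedup_over_realtime = audio_seconds / latency_seconds

print(f"RTF = {rtf:.3f}")
print(f"about {speedup_over_realtime:.0f}x faster than real time")
```

Under the quoted numbers this gives an RTF of roughly 0.007, or about 143 seconds of audio per second of compute.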
38
+
39
+ <a name="最新动态"></a>
40
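+ # What's New 🔥
41
+ - 2024/7: The [SenseVoice-Small](https://www.modelscope.cn/models/iic/SenseVoiceSmall) multilingual audio understanding model is open-sourced. It supports multilingual speech recognition, emotion recognition, and event detection for Mandarin, Cantonese, English, Japanese, and Korean, with extremely low inference latency.
42
+ - 2024/7: CosyVoice focuses on natural speech generation with multi-language, timbre, and emotion control. It excels in multilingual voice generation, zero-shot voice generation, cross-lingual voice cloning, and instruction following. [CosyVoice repo](https://github.com/FunAudioLLM/CosyVoice) and [CosyVoice online demo](https://www.modelscope.cn/studios/iic/CosyVoice-300M).
43
+ - 2024/7: [FunASR](https://github.com/modelscope/FunASR) is a fundamental speech recognition toolkit that offers a variety of features, including speech recognition (ASR), voice activity detection (VAD), punctuation restoration, language models, speaker verification, speaker diarization, and multi-talker ASR.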
44
+
45
+ <a name="Benchmarks"></a>
46
+ # Benchmarks 📝
47
+
48
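+ ## Multilingual Speech Recognition
49
+
50
+ We compared the multilingual speech recognition performance and inference efficiency of SenseVoice and Whisper on open-source benchmark datasets, including AISHELL-1, AISHELL-2, Wenetspeech, LibriSpeech, and Common Voice. For Chinese and Cantonese recognition, the SenseVoice-Small model has a clear advantage.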
51
+
52
+ <div align="center">
53
+ <img src="image/asr_results1.png" width="400" /><img src="image/asr_results2.png" width="400" />
54
+ </div>
55
+
56
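+ ## Speech Emotion Recognition
57
+
58
+ Because widely used benchmarks and methods for speech emotion recognition are currently lacking, we evaluated multiple metrics on multiple test sets and compared them comprehensively with numerous results from recent benchmarks. The selected test sets contain data in both Chinese and English and cover multiple styles such as performances, films and TV dramas, and natural conversations. Without finetuning on the target data, SenseVoice matched or exceeded the current best speech emotion recognition models on the test data.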
59
+
60
+ <div align="center">
61
+ <img src="image/ser_table.png" width="1000" />
62
+ </div>
63
+
64
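+ Furthermore, we compared multiple open-source speech emotion recognition models on the test sets. The results show that the SenseVoice-Large model achieved the best performance on nearly all datasets, while the SenseVoice-Small model also surpassed other open-source models on most datasets.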
65
+
66
+ <div align="center">
67
+ <img src="image/ser_figure.png" width="500" />
68
+ </div>
69
+
70
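+ ## Audio Event Detection
71
+
72
+ Although SenseVoice is trained only on speech data, it can still be used on its own as an event detection model. We compared it on the ESC-50 environmental sound classification dataset against BEATS and PANN, two models widely used in industry. SenseVoice achieves good results on these tasks, but because of limitations in training data and training methodology, its event classification performance still lags behind specialized event detection models.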
73
+
74
+ <div align="center">
75
+ <img src="image/aed_figure.png" width="500" />
76
+ </div>
77
+
78
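+ ## Computational Efficiency
79
+
80
+ The SenseVoice-Small model uses a non-autoregressive end-to-end architecture, resulting in extremely low inference latency. With a parameter count comparable to Whisper-Small, it runs inference 5 times faster than Whisper-Small and 15 times faster than Whisper-Large. In addition, the inference time of SenseVoice-Small does not increase noticeably as audio duration grows.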
81
+
82
+ <div align="center">
83
+ <img src="image/inference.png" width="1000" />
84
+ </div>
85
+
86
+ <a name="环境安装"></a>
87
+ # Install Dependencies 🐍
88
+
89
+ ```shell
90
+ pip install -r requirements.txt
91
+ ```
92
+
93
+ <a name="用法教程"></a>
94
+ # Usage 🛠️
95
+
96
+ ## Inference
97
+
98
+
99
+
100
+ ### Inference with funasr
101
+
102
+ Supports audio input in any format and of any duration.
103
+
104
+ ```python
105
+ from funasr import AutoModel
106
+ from funasr.utils.postprocess_utils import rich_transcription_postprocess
107
+
108
+ model_dir = "FunAudioLLM/SenseVoiceSmall"
109
+
110
+
111
+ model = AutoModel(
112
+ model=model_dir,
113
+ trust_remote_code=True,
114
+ vad_kwargs={"max_single_segment_time": 30000},
115
+ device="cuda:0",
116
+ hub="hf",
117
+ )
118
+
119
+ # en
120
+ res = model.generate(
121
+ input=f"{model.model_path}/example/en.mp3",
122
+ cache={},
123
+ language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
124
+ use_itn=True,
125
+ batch_size_s=60,
126
+ merge_vad=True, # merge short segments produced by the VAD model
127
+ merge_length_s=15,
128
+ )
129
+ text = rich_transcription_postprocess(res[0]["text"])
130
+ print(text)
131
+ ```
132
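+ Parameter description:
133
+ - `model_dir`: The model name, or the path to the model on local disk.
134
+ - `vad_model`: Enables VAD (Voice Activity Detection), which splits long audio into shorter clips. In this case the measured inference time covers both VAD and SenseVoice, i.e. the end-to-end latency. To measure SenseVoice's inference time on its own, disable the VAD model.
135
+ - `vad_kwargs`: Configuration for the VAD model. `max_single_segment_time` is the maximum duration of a segment cut by `vad_model`, in milliseconds (ms).
136
+ - `use_itn`: Whether the output includes punctuation and inverse text normalization.
137
+ - `batch_size_s`: Enables dynamic batching; the value is the total audio duration in a batch, in seconds (s).
138
+ - `merge_vad`: Whether to merge the short audio fragments cut by the VAD model; the merged length is `merge_length_s`, in seconds (s).
139
+
140
+ If all inputs are short audio clips (under 30 s) and batch inference is needed, the VAD model can be removed and `batch_size` set to speed up inference.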
141
+
142
+ ```python
143
+ model = AutoModel(model=model_dir, trust_remote_code=True, device="cuda:0", hub="hf")
144
+
145
+ res = model.generate(
146
+ input=f"{model.model_path}/example/en.mp3",
147
+ cache={},
148
+ language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
149
+ use_itn=True,
150
+ batch_size=64,
151
+ )
152
+ ```
153
+
154
+ For more detailed usage, refer to the [docs](https://github.com/modelscope/FunASR/blob/main/docs/tutorial/README.md)
155
+
156
+ ### Inference directly
157
+
158
+ Supports audio input in any format, with input duration limited to 30 s or less.
159
+
160
+ ```python
161
+ from model import SenseVoiceSmall
162
+ from funasr.utils.postprocess_utils import rich_transcription_postprocess
163
+
164
+ model_dir = "FunAudioLLM/SenseVoiceSmall"
165
+ m, kwargs = SenseVoiceSmall.from_pretrained(model=model_dir, device="cuda:0", hub="hf")
166
+
167
+
168
+ res = m.inference(
169
+ data_in=f"{kwargs['model_path']}/example/en.mp3",
170
+ language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
171
+ use_itn=False,
172
+ **kwargs,
173
+ )
174
+
175
+ text = rich_transcription_postprocess(res[0][0]["text"])
176
+ print(text)
177
+ ```
178
+
179
+
180
+ ## Service Deployment
181
+
182
+ Refer to [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)
183
+
184
+ ### Export and Test (*Ongoing*)
185
+
186
+ Refer to [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)
187
+
188
+ ### Deployment
189
+
190
+ Refer to [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)
191
+
192
+ ## Finetune
193
+ Refer to [SenseVoice](https://github.com/FunAudioLLM/SenseVoice)
194
+
195
+
196
+ ## WebUI
197
+
198
+ ```shell
199
+ python webui.py
200
+ ```
201
+
202
+ <div align="center"><img src="image/webui.png" width="700"/> </div>
203
+
204
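+ # Contact Us
205
+
206
+ If you run into problems, you can open an issue directly on the GitHub page. Speech enthusiasts are welcome to scan the DingTalk group QR codes below to join the community group for communication and discussion.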
207
+
208
+ | FunAudioLLM | FunASR |
209
+ |:----------------------------------------------------------------:|:--------------------------------------------------------:|
210
+ | <div align="left"><img src="image/dingding_sv.png" width="250"/> | <img src="image/dingding_funasr.png" width="250"/></div> |
211
+
212
+
213
+
{fig → image}/aed_figure.png RENAMED
File without changes
{fig → image}/asr_results.png RENAMED
File without changes
{fig → image}/inference.png RENAMED
File without changes
{fig → image}/sensevoice.png RENAMED
File without changes
{fig → image}/ser_figure.png RENAMED
File without changes
{fig → image}/ser_table.png RENAMED
File without changes
requirements.txt ADDED
@@ -0,0 +1,8 @@
1
+ torch>=1.13
2
+ torchaudio
3
+ modelscope
4
+ huggingface
5
+ huggingface_hub
6
+ funasr>=1.1.2
7
+ numpy<=1.26.4
8
+ gradio
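
After `pip install -r requirements.txt`, it can be useful to confirm that the pinned packages actually satisfy the version bounds listed above. The following is a minimal sketch using only the standard library. The helper names `version_tuple` and `installed_version` are hypothetical, and the simplified numeric comparison is an assumption that ignores pre-release and local-version rules (a full check would follow PEP 440).

```python
from importlib import metadata
from typing import Optional, Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Parse leading numeric components, e.g. '2.1.0+cu118' -> (2, 1, 0)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Version bounds copied from requirements.txt: (package, operator, version).
PINS = [("torch", ">=", "1.13"), ("funasr", ">=", "1.1.2"), ("numpy", "<=", "1.26.4")]

for name, op, wanted in PINS:
    found = installed_version(name)
    if found is None:
        print(f"{name}: not installed")
        continue
    have, need = version_tuple(found), version_tuple(wanted)
    ok = have >= need if op == ">=" else have <= need
    status = "ok" if ok else f"expected {op}{wanted}"
    print(f"{name} {found}: {status}")
```

The unpinned entries (`torchaudio`, `modelscope`, `gradio`, and so on) are left out because they carry no bound to check.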