openbmb/MiniCPM-o-2_6

hfw committed 1 day ago

Commit 4d54644

Parent: a884350

update audio demo

Files changed (3)

README.md +4 -4
assets/audio_understanding.mp3 +0 -0
assets/mimick.wav +3 -0

README.md CHANGED

@@ -1127,7 +1127,7 @@ else:
 `Mimick` task reflects a model's end-to-end speech modeling capability. The model takes audio input, and outputs an ASR transcription and subsequently reconstructs the original audio with high similarity. The higher the similarity between the reconstructed audio and the original audio, the stronger the model's foundational capability in end-to-end speech modeling.
 ```python
 mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
-audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)
 msgs = [{'role': 'user', 'content': [mimick_prompt,audio_input]}]
 res = model.chat(
@@ -1155,7 +1155,7 @@ ref_audio, _ = librosa.load('assets/demo.wav', sr=16000, mono=True) # load the r
 Audio Assistant: # With this mode, model will speak with the voice in ref_audio as a AI assistant. (Stable and more suitable for general conversation)
 sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en')
-user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # Try to ask something by recording it in 'xxx.wav'!!!
 ```
 ```python
 msgs = [sys_prompt, user_question]
@@ -1205,8 +1205,8 @@ General Audio:
     Audio Caption: Summarize the main content of the audio.
     Sound Scene Tagging: Utilize one keyword to convey the audio's content or the associated scene.
 '''
-task_prompt = "" # Choose the task prompt above
-audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)
 msgs = [{'role': 'user', 'content': [task_prompt,audio_input]}]

 `Mimick` task reflects a model's end-to-end speech modeling capability. The model takes audio input, and outputs an ASR transcription and subsequently reconstructs the original audio with high similarity. The higher the similarity between the reconstructed audio and the original audio, the stronger the model's foundational capability in end-to-end speech modeling.
 ```python
 mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
+audio_input, _ = librosa.load('assets/mimick.wav', sr=16000, mono=True)
 msgs = [{'role': 'user', 'content': [mimick_prompt,audio_input]}]
 res = model.chat(
 Audio Assistant: # With this mode, model will speak with the voice in ref_audio as a AI assistant. (Stable and more suitable for general conversation)
 sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en')
+user_question = {'role': 'user', 'content': [librosa.load('assets/qa.wav', sr=16000, mono=True)[0]]} # Try to ask something by recording it in 'xxx.wav'!!!
 ```
 ```python
 msgs = [sys_prompt, user_question]
     Audio Caption: Summarize the main content of the audio.
     Sound Scene Tagging: Utilize one keyword to convey the audio's content or the associated scene.
 '''
+task_prompt = "Summarize the main content of the audio.\n" # Choose the task prompt above
+audio_input, _ = librosa.load('assets/audio_understanding.mp3', sr=16000, mono=True)
 msgs = [{'role': 'user', 'content': [task_prompt,audio_input]}]

assets/audio_understanding.mp3 ADDED

Binary file (321 kB).

assets/mimick.wav ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:dbb0860cb4dd7c7003b6f0406299fc7c0febc5c6a990e1c670d29b763e84e7ed
+size 384046
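
The three added lines for `assets/mimick.wav` are a Git LFS pointer, not the audio itself: the repository stores the spec version, a SHA-256 object ID, and the byte size, while the real file lives in LFS storage. As a minimal sketch, the pointer could be parsed and a downloaded file checked against it as below. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and not part of this repository or of any Git LFS tooling.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "<key> <value>", separated by a single space.
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    # The oid line has the form "<hash-algorithm>:<hex-digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict) -> bool:
    """Return True if the file has the size and digest recorded in the pointer."""
    data = Path(path).read_bytes()
    if len(data) != pointer["size"]:
        return False
    return hashlib.new(pointer["algo"], data).hexdigest() == pointer["digest"]


# Pointer contents copied from the diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:dbb0860cb4dd7c7003b6f0406299fc7c0febc5c6a990e1c670d29b763e84e7ed
size 384046
"""

pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```

Assuming the actual audio has been fetched locally (for example with `git lfs pull`), `matches_pointer('assets/mimick.wav', pointer)` would report whether the local copy is the 384046-byte file this commit refers to. If it returns `False`, the file on disk may still be the unexpanded pointer text, which `librosa.load` cannot decode.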