---
tags:
- espnet
- audio
- automatic-speech-recognition
- speech-translation
- language-identification
language: multilingual
datasets:
- owsm_v3.2_ctc
base_model:
- espnet/owsm_ctc_v3.2_ft_1B
license: cc-by-4.0
---

[OWSM-CTC](https://aclanthology.org/2024.acl-long.549/) (Peng et al., ACL 2024) is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC.

This model is trained on 180k hours of public audio data for multilingual speech recognition, any-to-any speech translation, and language identification, following the design of the [Open Whisper-style Speech Model (OWSM)](https://www.wavlab.org/activities/2024/owsm/) project.

This model is initialized with [OWSM-CTC v3.1](https://huggingface.co/pyf98/owsm_ctc_v3.1_1B) and then fine-tuned on [v3.2 data](https://arxiv.org/abs/2406.09282) for 225k steps.

To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:
```
librosa
torch
espnet
espnet_model_zoo
```

**The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1

### Example script for batched inference

`Speech2TextGreedySearch` now provides a unified batched inference method `batch_decode`. It performs CTC greedy decoding for a batch of short-form or long-form audio inputs. An input shorter than 30s is padded to 30s; a longer one is split into overlapping segments (same as the "long-form ASR/ST" method below).

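
The pad-or-split rule described above can be illustrated with a minimal NumPy sketch. This is not ESPnet's implementation: `pad_or_split` is a hypothetical helper, and the choice of a hop of 30s minus twice the context length is an assumption made only to show how overlapping windows with left and right context could be laid out.

```python
import numpy as np

SAMPLE_RATE = 16000   # the model expects 16 kHz input
WINDOW_SECS = 30      # fixed input duration used in training


def pad_or_split(speech, context_len_in_secs=4):
    """Return a list of 30 s windows covering `speech` (hypothetical helper)."""
    window = SAMPLE_RATE * WINDOW_SECS
    if len(speech) <= window:
        # Short-form: zero-pad on the right up to exactly 30 s.
        return [np.pad(speech, (0, window - len(speech)))]

    # Long-form (assumed layout): consecutive windows share
    # 2 * context samples, so each window carries left and right context.
    hop = window - 2 * context_len_in_secs * SAMPLE_RATE
    segments = []
    for start in range(0, len(speech), hop):
        chunk = speech[start:start + window]
        segments.append(np.pad(chunk, (0, window - len(chunk))))
        if start + window >= len(speech):
            break  # this window already reaches the end of the audio
    return segments


short_segments = pad_or_split(np.zeros(SAMPLE_RATE * 5, dtype=np.float32))
long_segments = pad_or_split(np.zeros(SAMPLE_RATE * 70, dtype=np.float32))
print(len(short_segments), short_segments[0].shape)  # 1 (480000,)
print(len(long_segments), long_segments[0].shape)    # 3 (480000,)
```

With `batch_decode`, none of this needs to be done by hand; the sketch only shows why every segment that reaches the model has the same 30s length.
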
```python
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    use_flash_attn=False,  # set to True for better efficiency if flash attn is installed and dtype is float16 or bfloat16
    lang_sym='<eng>',  # language token, e.g. <eng> for English
    task_sym='<asr>',  # task token, e.g. <asr> for speech recognition
)

res = s2t.batch_decode(
    "audio.wav",  # a single audio (path or 1-D array/tensor) as input
    batch_size=16,
    context_len_in_secs=4,
)  # res is a single str, i.e., the predicted text without special tokens

res = s2t.batch_decode(
    ["audio1.wav", "audio2.wav", "audio3.wav"],  # a list of audios as input
    batch_size=16,
    context_len_in_secs=4,
)  # res is a list of str

# Please check the code of `batch_decode` for all supported inputs
```

### Example script for short-form ASR/ST/LID

Our models are trained on 16kHz audio with a fixed duration of 30s. When using the pre-trained model, please ensure the input speech is 16kHz and pad or truncate it to 30s.

```python
import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',  # language token, e.g. <eng> for English
    task_sym='<asr>',  # task token, e.g. <asr> for speech recognition
)

# NOTE: OWSM-CTC is trained on 16kHz audio with a fixed 30s duration.
# Please ensure your input has the correct sample rate; otherwise resample it to 16k before feeding it to the model
speech, rate = librosa.load("xxx.wav", sr=16000)
speech = librosa.util.fix_length(speech, size=(16000 * 30))

res = s2t(speech)[0]
print(res)
```

### Example script for long-form ASR/ST

```python
import soundfile as sf
import torch
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

context_len_in_secs = 4  # left and right context when doing buffered inference
batch_size = 32  # depends on the GPU memory

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device='cuda' if torch.cuda.is_available() else 'cpu',
    generate_interctc_outputs=False,
    lang_sym='<eng>',  # language token, e.g. <eng> for English
    task_sym='<asr>',  # task token, e.g. <asr> for speech recognition
)

speech, rate = sf.read("xxx.wav")

text = s2t.decode_long_batched_buffered(
    speech,
    batch_size=batch_size,
    context_len_in_secs=context_len_in_secs,
)
print(text)
```

### Example of CTC forced alignment using `ctc-segmentation`

CTC segmentation can be efficiently applied to audio of an arbitrary length.

```python
import soundfile as sf
from espnet2.bin.s2t_ctc_align import CTCSegmentation
from espnet_model_zoo.downloader import ModelDownloader

# Download model first
d = ModelDownloader()
downloaded = d.download_and_unpack("espnet/owsm_ctc_v3.2_ft_1B")

aligner = CTCSegmentation(
    **downloaded,
    fs=16000,
    ngpu=1,
    batch_size=32,  # batched parallel decoding; reduce it if your GPU memory is smaller
    kaldi_style_text=True,
    time_stamps="auto",  # "auto" can be more accurate than "fixed" when converting token index to timestamp
    lang_sym="<eng>",  # language token, e.g. <eng> for English
    task_sym="<asr>",  # task token, e.g. <asr> for speech recognition
    context_len_in_secs=2,  # left and right context in buffered decoding
)

speech, rate = sf.read("./test_utils/ctc_align_test.wav")
print(f"speech duration: {len(speech) / rate:.2f} seconds")

# Kaldi-style text: one utterance per line, utterance ID first
text = """
utt1 THE SALE OF THE HOTELS
utt2 IS PART OF HOLIDAY'S STRATEGY
utt3 TO SELL OFF ASSETS
utt4 AND CONCENTRATE ON PROPERTY MANAGEMENT
"""

segments = aligner(speech, text)
print(segments)
```

## Citations

#### OWSM-CTC

```BibTex
@inproceedings{owsm-ctc,
    title = "{OWSM}-{CTC}: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification",
    author = "Peng, Yifan and Sudo, Yui and Shakeel, Muhammad and Watanabe, Shinji",
    booktitle = "Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)",
    year = "2024",
    month = {8},
    url = "https://aclanthology.org/2024.acl-long.549",
}
```

#### OWSM v3.1 and v3.2

```BibTex
@inproceedings{owsm-v32,
    title={On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models},
    author={Jinchuan Tian and Yifan Peng and William Chen and Kwanghee Choi and Karen Livescu and Shinji Watanabe},
    booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
    year={2024},
    month={9},
    pdf="https://arxiv.org/pdf/2406.09282"
}

@inproceedings{owsm-v31,
    title={{OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer}},
    author={Yifan Peng and Jinchuan Tian and William Chen and Siddhant Arora and Brian Yan and Yui Sudo and Muhammad Shakeel and Kwanghee Choi and Jiatong Shi and Xuankai Chang and Jee-weon Jung and Shinji Watanabe},
    booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
    year={2024},
    month={9},
    pdf="https://arxiv.org/pdf/2401.16658",
}
```

#### Initial OWSM (v1, v2, v3)

```BibTex
@inproceedings{owsm,
    title={Reproducing Whisper-Style Training Using An Open-Source Toolkit And Publicly Available Data},
    author={Yifan Peng and Jinchuan Tian and Brian Yan and Dan Berrebbi and Xuankai Chang and Xinjian Li and Jiatong Shi and Siddhant Arora and William Chen and Roshan Sharma and Wangyou Zhang and Yui Sudo and Muhammad Shakeel and Jee-weon Jung and Soumi Maiti and Shinji Watanabe},
    booktitle={Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
    year={2023},
    month={12},
    pdf="https://arxiv.org/pdf/2309.13876",
}
```