Inquiry on Upcoming Language Support and Fine-Tuning Feasibility
Hi LiveKit team! 👋
First, thank you for your incredible work on the livekit/turn-detector model and the open-source ecosystem around it. The end-of-turn detection capabilities have been a game-changer for our conversational AI projects, especially with the improved accuracy over traditional VAD methods.
I wanted to ask about your plans for expanding language support. I recall seeing a post on X.com suggesting that multilingual support is in the pipeline for the near future. Could you share any updates on this? Many communities would greatly benefit from non-English implementations, and we'd love to hear about timelines or which languages are being prioritized.
Additionally, if broader language support isn’t imminent, is it feasible to fine-tune the current model on a custom language corpus?
For instance:
- Training Requirements: What dataset format and size are recommended (e.g., conversational transcripts with turn boundaries labeled)? A rough sketch of what we have in mind follows this list.
- Annotation Guidelines: Are specific metadata or annotations (e.g., silence duration, speaker roles) needed for training?
- Architecture Constraints: Does the ONNX-based inference setup allow for fine-tuning, or would adjustments to the model architecture be necessary?
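To make the question concrete, here is a rough sketch of the kind of labeled data we could prepare, assuming a per-utterance binary end-of-turn label; the field names (`speaker`, `text`, `silence_ms`, `is_end_of_turn`) are our own guesses rather than anything from the model card:

```python
import json

# Hypothetical JSONL schema for an EOU dataset: one record per utterance,
# with a binary label for whether the speaker's turn ends there.
# Field names are our own guesses, not an official LiveKit format.
examples = [
    {"speaker": "user", "text": "so what I was thinking is", "silence_ms": 120, "is_end_of_turn": False},
    {"speaker": "user", "text": "could we meet tomorrow at ten?", "silence_ms": 800, "is_end_of_turn": True},
    {"speaker": "agent", "text": "sure, ten works for me.", "silence_ms": 600, "is_end_of_turn": True},
]

with open("eou_train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```

If the expected format differs (e.g., full conversations with token-level boundaries rather than per-utterance labels), we're happy to adapt.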
We’re prepared to collaborate on preparing a training set for our target language and would appreciate guidance on best practices.
Thanks again for your transparency and dedication to advancing real-time communication tools! Looking forward to your insights.
Hi johndili, I hope you are doing well.
I wanted to ask whether you have considered training your own EOU (BERT-based) classification model instead of fine-tuning "livekit/turn-detector".
Perhaps using a BERT-like (encoder-only) model pre-trained on your target language? I'd love to hear your thoughts on that. Thanks. 🥰
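To give a concrete idea of what I mean, here is a minimal sketch using Hugging Face Transformers, assuming a binary end-of-turn label per utterance; the checkpoint name, toy data, and hyperparameters are placeholders, not recommendations:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder checkpoint: swap in an encoder-only model pre-trained on your language.
CHECKPOINT = "bert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

# Toy data: label 1 = end of turn, label 0 = speaker still holds the floor.
data = Dataset.from_dict({
    "text": ["could we meet tomorrow at ten?", "so what I was thinking is"],
    "label": [1, 0],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="eou-bert", num_train_epochs=1, per_device_train_batch_size=8),
    train_dataset=data,
)
trainer.train()
```

This is just a sketch of the training loop; evaluating it against real conversational pauses and hesitations would be the harder part.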
Hi mate,
It's a great idea!
I wonder how it would behave in production, though.
Would you recommend training the model from scratch, or just fine-tuning a pre-trained BERT?
Thanks