Speaker diarization for segments shorter than 1s seems inaccurate
Excellent work, thank you! The timestamps are very precise, and the speaker diarization is also quite good. However, the diarization does not seem very accurate for brief interjections of approximately one second within a sentence. Would it be possible to address this by adjusting the inference parameters?
Thanks for your interest!
Short, frequent speaker interchanges (e.g., ~1s interjections within an ongoing sentence) are still challenging for the current model and typically require targeted training data to improve. From an inference-only perspective, parameter tuning is unlikely to reliably fix this issue at the moment—sorry about that. o(╥﹏╥)o
That said, we’re actively iterating on the model, and upcoming versions should deliver better performance for this scenario. ヾ(◍°∇°◍)ノ゙
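Until then, one possible stopgap is to flag the short turns for manual review rather than trying to fix them at inference time. The snippet below is only a rough sketch: it assumes the transcription has already been converted into a list of dicts with the hypothetical keys `start`, `end` (seconds), and `speaker`, which is not necessarily the model's actual output format. The function name `flag_short_interjections` is also made up for illustration. It does not improve the diarization itself; it only marks segments of about one second or less that sit between two segments from the same speaker.

```python
from typing import Dict, List, Union

# Hypothetical segment layout (an assumption, not the model's documented output):
# {"start": seconds, "end": seconds, "speaker": label}
Segment = Dict[str, Union[float, str]]


def flag_short_interjections(
    segments: List[Segment], max_duration: float = 1.0
) -> List[int]:
    """Return indices of short segments sandwiched between two segments
    that share the same speaker label (likely interjections)."""
    flagged = []
    # The first and last segments have no neighbour on both sides, so skip them.
    for i in range(1, len(segments) - 1):
        prev_seg, seg, next_seg = segments[i - 1], segments[i], segments[i + 1]
        duration = seg["end"] - seg["start"]
        if duration <= max_duration and prev_seg["speaker"] == next_seg["speaker"]:
            flagged.append(i)
    return flagged


# Made-up example data, for illustration only.
example = [
    {"start": 0.0, "end": 4.2, "speaker": "A"},
    {"start": 4.2, "end": 5.0, "speaker": "B"},  # ~0.8s turn inside A's speech
    {"start": 5.0, "end": 9.5, "speaker": "A"},
    {"start": 9.5, "end": 13.0, "speaker": "B"},
]
print(flag_short_interjections(example))
```

The flagged indices can then be checked by ear or excluded from downstream statistics. The 1.0 second threshold simply mirrors the case described in this thread and may need adjusting for other recordings.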