Automatic Speech Recognition
Transformers
Safetensors
VibeVoice
ASR
Transcription
Diarization
Speech-to-Text
Is there any parameter to tune for more accurate diarization?
#13
by lingxue156 - opened
Thank you guys for this excellent model, which combines semantic understanding and speaker diarization! I read your paper and noticed that you used HDBSCAN clustering during pre-training with a fixed threshold of 0.67. Right now I'm testing it in a meeting diarization scenario, and I've found the model is a bit too conservative: it tends to identify fewer speakers than are actually present. Even when voices are pretty distinct (like different female speakers), they often end up lumped together.
So I was wondering: is the clustering threshold adjustable? And what other parameters could I tweak to make the diarization part a bit more aggressive? 😊
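To make the question concrete, here is a toy sketch of the effect being described. It is not VibeVoice code and it is not HDBSCAN: it swaps in a simple greedy cosine-similarity clustering. The 2-D "embeddings", the helper name `count_speakers`, and the idea that 0.67 behaves like a similarity cutoff are all assumptions made only for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def count_speakers(embeddings, threshold):
    """Greedy clustering (hypothetical helper, not the model's method).

    Each embedding joins the first cluster whose centroid has cosine
    similarity >= threshold; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster is a list of member embeddings
    for emb in embeddings:
        for members in clusters:
            centroid = [sum(col) / len(members) for col in zip(*members)]
            if cosine(emb, centroid) >= threshold:
                members.append(emb)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([emb])
    return len(clusters)

# Made-up segment embeddings: A and B are similar voices, C is distinct.
segments = [
    (1.00, 0.00), (0.99, 0.10),  # speaker A
    (0.80, 0.60), (0.78, 0.63),  # speaker B, fairly close to A
    (0.00, 1.00), (0.05, 0.99),  # speaker C
]

loose = count_speakers(segments, 0.67)   # permissive cutoff merges A and B
strict = count_speakers(segments, 0.95)  # stricter cutoff keeps them apart
print(loose, strict)
```

In this toy setup a permissive cutoff lumps the two similar voices into one speaker, while a stricter cutoff separates them. That is the kind of knob I'm hoping exists, whatever form it actually takes in the real pipeline.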