metadata
library_name: transformers
tags:
- speech
- automatic-speech-recognition
- speech-language-model
- target-speaker-asr
- multi-talker
- speaker-diarization
- meeting-transcription
- Dixtral
- Voxtral
- DiCoW
- BUT-FIT
pipeline_tag: automatic-speech-recognition
license: apache-2.0
base_model: mistralai/Voxtral-Mini-3B-2507
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
Dixtral: BUT-FIT Diarization-Conditioned Voxtral for Target-Speaker ASR
This repository hosts Dixtral, developed by BUT Speech@FIT. Dixtral couples the Voxtral-Mini-3B spoken-language model with the DiCoW diarization-conditioned encoder, giving the LLM target-speaker awareness in multi-talker audio.
This checkpoint is tuned for target-speaker / multi-talker transcription (TS-ASR) of conversational and meeting recordings. For spoken question answering, use Dixtral_QA instead.
Model Usage
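from transformers import AutoModel, AutoProcessor

MODEL_NAME = "BUT-FIT/Dixtral"

# trust_remote_code=True lets Transformers load the custom model class shipped in the repository.
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)
# The processor prepares the audio and text inputs for the model.
processor = AutoProcessor.from_pretrained(MODEL_NAME)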
For full inference pipelines (diarization → FDDT masks → generation), see the Dixtral GitHub repository.
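To make the middle step of that pipeline more concrete, here is a minimal sketch of how diarization output could be turned into per-frame conditioning masks for one target speaker. It assumes the silence / target / non-target / overlap decomposition used by DiCoW-style models and a frame rate of 50 frames per second; the function names (`segments_to_activity`, `stno_masks`), the segment format, and the frame rate are illustrative assumptions, not the interface of the Dixtral repository, which remains the reference for the real pipeline.

```python
from math import prod


def segments_to_activity(segments, num_speakers, num_frames, frame_rate=50.0):
    """Rasterise (speaker, start_sec, end_sec) segments into a [speaker][frame] grid.

    frame_rate is an assumed value; use whatever rate the encoder actually runs at.
    """
    activity = [[0.0] * num_frames for _ in range(num_speakers)]
    for speaker, start, end in segments:
        first = max(0, int(round(start * frame_rate)))
        last = min(num_frames, int(round(end * frame_rate)))
        for t in range(first, last):
            activity[speaker][t] = 1.0
    return activity


def stno_masks(activity, target):
    """Return per-frame (silence, target, non_target, overlap) probabilities.

    activity[s][t] is the probability that speaker s is active in frame t.
    The four values in each frame sum to 1.
    """
    num_speakers = len(activity)
    num_frames = len(activity[0])
    masks = []
    for t in range(num_frames):
        d_target = activity[target][t]
        # Probability that every speaker other than the target is inactive.
        others_silent = prod(
            1.0 - activity[s][t] for s in range(num_speakers) if s != target
        )
        silence = (1.0 - d_target) * others_silent
        target_only = d_target * others_silent
        non_target = (1.0 - d_target) * (1.0 - others_silent)
        overlap = d_target * (1.0 - others_silent)
        masks.append((silence, target_only, non_target, overlap))
    return masks


# Hypothetical two-speaker example: speaker 0 talks first, speaker 1 overlaps briefly.
example_segments = [(0, 0.00, 0.06), (1, 0.04, 0.10)]
example_activity = segments_to_activity(example_segments, num_speakers=2, num_frames=6)
example_masks = stno_masks(example_activity, target=0)
```

Running `stno_masks` once per diarized speaker would give one mask sequence per target, which matches the idea of decoding each speaker's transcript separately. Soft diarization probabilities can be passed in place of the 0/1 grid without changing the code.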
Model Details
- Base Model: Voxtral-Mini-3B-2507
- Encoder: DiCoW v3 large
- Training Datasets: NOTSOFAR (microsoft/NOTSOFAR) and AMI (edinburghcstr/ami), as listed in the metadata above
Contact
- Email: ipoloka@fit.vut.cz
- Affiliation: BUT Speech@FIT, Brno University of Technology
- GitHub: BUTSpeechFIT