---
library_name: transformers
tags:
- speech
- automatic-speech-recognition
- speech-language-model
- target-speaker-asr
- multi-talker
- speaker-diarization
- meeting-transcription
- Dixtral
- Voxtral
- DiCoW
- BUT-FIT
pipeline_tag: automatic-speech-recognition
license: apache-2.0
base_model: mistralai/Voxtral-Mini-3B-2507
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
---
# 🧠 Dixtral: BUT-FIT Diarization-Conditioned Voxtral for Target-Speaker ASR
This repository hosts **Dixtral**, developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT).
**Dixtral** couples the **Voxtral-Mini-3B** spoken-language model with the **DiCoW** diarization-conditioned encoder, giving the LLM target-speaker awareness in multi-talker audio.
This checkpoint is tuned for **target-speaker / multi-talker transcription (TS-ASR)** of conversational and meeting recordings. For spoken question answering, use [**Dixtral_QA**](https://huggingface.co/BUT-FIT/Dixtral_QA) instead.
## 🛠️ Model Usage
```python
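from transformers import AutoModel, AutoProcessor

MODEL_NAME = "BUT-FIT/Dixtral"

# The repository ships custom modeling code, so remote code must be trusted.
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)
# The processor prepares the audio and text inputs for the model.
processor = AutoProcessor.from_pretrained(MODEL_NAME)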
```
➡️ For full inference pipelines (diarization → FDDT masks → generation), see the
[**Dixtral GitHub repository**](https://github.com/BUTSpeechFIT/Dixtral).
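
To give a rough idea of the middle step, the sketch below turns per-speaker diarization activity into per-frame masks for one target speaker. It assumes the silence / target / non-target / overlap (STNO) formulation used by DiCoW and assumes Dixtral conditions on the same four classes; the function name `stno_masks` and the input layout are hypothetical and are not taken from the Dixtral repository, which remains the reference for the actual pipeline.

```python
from math import prod


def stno_masks(activity, target):
    """Hypothetical helper: convert diarization output into STNO masks.

    activity[s][t] is the probability that speaker s is active at frame t.
    Returns one (silence, target, non_target, overlap) tuple per frame,
    computed from the point of view of the speaker with index `target`.
    """
    n_speakers = len(activity)
    n_frames = len(activity[0])
    masks = []
    for t in range(n_frames):
        d_target = activity[target][t]
        # Probability that every speaker other than the target is silent.
        others_silent = prod(
            1.0 - activity[s][t] for s in range(n_speakers) if s != target
        )
        silence = (1.0 - d_target) * others_silent
        target_only = d_target * others_silent
        non_target = (1.0 - d_target) * (1.0 - others_silent)
        overlap = d_target * (1.0 - others_silent)
        masks.append((silence, target_only, non_target, overlap))
    return masks


# Two speakers, four frames: speaker 0 alone, both, speaker 1 alone, nobody.
activity = [
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
]
masks = stno_masks(activity, target=0)
```

With hard 0/1 activity, each frame falls into exactly one class; with soft diarization probabilities, the four values in each tuple still sum to 1, so they can be used as per-frame weights.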
---
## 📦 Model Details
* **Base Model:** [Voxtral-Mini-3B-2507](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507)
* **Encoder:** DiCoW v3 large
* **Training Datasets:**
  * [NOTSOFAR-1](https://github.com/microsoft/NOTSOFAR1-Challenge)
  * [AMI Meeting Corpus](http://groups.inf.ed.ac.uk/ami/corpus/)
  * [LibriMix / LibriSpeechMix](https://github.com/JorisCos/LibriMix)
---
## 📬 Contact
📧 **Email:** [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz)
🏢 **Affiliation:** [BUT Speech@FIT](https://github.com/BUTSpeechFIT), Brno University of Technology
🔗 **GitHub:** [BUTSpeechFIT](https://github.com/BUTSpeechFIT)