Voxtral Realtime 4B
Streaming speech-to-text model with ~4 billion parameters. Weights in BF16 safetensors format, extracted from mistralai/Voxtral-Mini-4B-Realtime-2602.
Architecture
Pipeline:
WAV β 16kHz β Mel Spectrogram β Conv Stem β Encoder β Downsample 4x β Adapter β Decoder β Tokens
- Audio Encoder: ~0.6B params β causal transformer, 32 layers
- Audio-Language Adapter: 2-layer MLP with 4x downsample
- LLM Decoder: ~3.4B params β Ministral-3 based, 26 layers with GQA
Audio Preprocessing
| Parameter | Value |
|---|---|
| Sample rate | 16,000 Hz |
| Frame rate | 12.5 Hz |
| Mel bins | 128 |
| Hop length | 160 samples (10ms) |
| Window size | 400 samples (25ms) |
| 1 text token | 80ms of audio |
Encoder (Causal Transformer)
| Parameter | Value |
|---|---|
| dim | 1280 |
| layers | 32 |
| heads | 32 (MHA) |
| head_dim | 64 |
| hidden_dim | 5120 |
| FFN | SwiGLU |
| Norm | RMSNorm (eps=1e-5) |
| Position | RoPE (theta=1e6, interleaved) |
| Attention | causal, sliding window=750 |
Conv stem: conv1d(128β1280, k=3, s=1) β GELU β conv1d(1280β1280, k=3, s=2) β GELU
Adapter
[seq/4, 5120] β Linear(5120β3072) β GELU β Linear(3072β3072) β [seq/4, 3072]
Decoder (LLM)
| Parameter | Value |
|---|---|
| dim | 3072 |
| layers | 26 |
| heads | 32 |
| KV heads | 8 (GQA 4:1) |
| head_dim | 128 |
| hidden_dim | 9216 |
| Norm | RMSNorm (eps=1e-5) |
| Position | RoPE (theta=1e6) |
| Attention | causal, sliding window=8192 |
| Vocab size | 131,072 |
| Tied embeddings | yes |
The decoder uses adaptive RMS normalization conditioned on transcription delay (6 delay tokens = 480ms).
Weight Format
consolidated.safetensors(8.3 GB) β 711 tensors, all BF16params.jsonβ model configtekken.json(14.9 MB) β Tekken tokenizer
Tokenizer (Tekken)
| Token | ID |
|---|---|
| BOS | 1 |
| EOS | 2 |
| STREAMING_PAD | 32 |
Token IDs 0β999 are special tokens. IDs 1000+ index into the vocabulary (base64-encoded byte sequences in tekken.json).
Audio Streaming Config
| Parameter | Value |
|---|---|
| sampling_rate | 16,000 |
| frame_rate | 12.5 (80ms per token) |
| transcription_delay_ms | 480 (6 delay tokens) |
| left_pad_tokens | 32 |
| right_pad_tokens (offline) | 17 |
Decode Schedule (Offline)
- Prompt:
[BOS] + [STREAMING_PAD] Γ 38(1 + 32 left-pad + 6 delay) - Prefill: Feed
audio_embed[i] + tok_embed(prompt[i])for positions 0..L-2 - First token: Greedy argmax from position L-1
- Autoregressive decode: For each remaining audio position, feed
audio_embed[pos] + tok_embed(prev_token), greedy argmax - Stop: On EOS or end of audio span
C Implementation
A pure C implementation of this model is available at voxtral.c β runs on Apple Silicon (Metal) and CPU (BLAS), with streaming microphone input.
Credits
Original model by Mistral AI: mistralai/Voxtral-Mini-4B-Realtime-2602
Built for the Mistral Hackathon 2026.