Voxtral Realtime 4B

Streaming speech-to-text model with ~4 billion parameters. Weights in BF16 safetensors format, extracted from mistralai/Voxtral-Mini-4B-Realtime-2602.

Architecture

Pipeline:

WAV → 16 kHz → Mel Spectrogram → Conv Stem → Encoder → Downsample 4x → Adapter → Decoder → Tokens
  • Audio Encoder: ~0.6B params — causal transformer, 32 layers
  • Audio-Language Adapter: 2-layer MLP with 4x downsample
  • LLM Decoder: ~3.4B params — Ministral-3 based, 26 layers with GQA
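The stage dimensions can be traced end to end. This is a shape-only sketch using the numbers from the tables in this card (function name and flooring behavior are illustrative assumptions, not the reference implementation):

```python
def pipeline_shapes(seconds: float):
    """Trace tensor shapes for `seconds` of 16 kHz mono audio through the pipeline."""
    samples = int(seconds * 16_000)
    mel_frames = samples // 160        # hop length 160 -> 100 mel frames/s
    conv_frames = mel_frames // 2      # conv stem stride 2 -> 50 Hz
    tokens = conv_frames // 4          # 4x downsample -> 12.5 tokens/s
    return {
        "wav": (samples,),
        "mel": (mel_frames, 128),      # 128 mel bins
        "encoder": (conv_frames, 1280),  # encoder dim 1280
        "adapter_in": (tokens, 5120),  # 4 encoder frames stacked: 4 * 1280
        "decoder": (tokens, 3072),     # decoder dim 3072
    }
```

One second of audio yields 100 mel frames, 50 encoder frames, and 12 full decoder positions.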

Audio Preprocessing

Parameter      Value
------------   --------------------
Sample rate    16,000 Hz
Frame rate     12.5 Hz
Mel bins       128
Hop length     160 samples (10 ms)
Window size    400 samples (25 ms)
1 text token   80 ms of audio
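These numbers are mutually consistent, which a few lines of arithmetic confirm (a sanity-check sketch; the reference implementation's exact framing/padding may differ):

```python
SR, HOP, WIN, FRAME_RATE = 16_000, 160, 400, 12.5

hop_ms = 1000 * HOP / SR                    # 10 ms between mel frames
win_ms = 1000 * WIN / SR                    # 25 ms analysis window
mel_per_token = (SR / HOP) / FRAME_RATE     # 100 Hz / 12.5 Hz = 8 mel frames
token_ms = mel_per_token * hop_ms           # 8 frames * 10 ms = 80 ms per token
```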

Encoder (Causal Transformer)

Parameter    Value
----------   -------------------------------
dim          1280
layers       32
heads        32 (MHA)
head_dim     64
hidden_dim   5120
FFN          SwiGLU
Norm         RMSNorm (eps=1e-5)
Position     RoPE (theta=1e6, interleaved)
Attention    causal, sliding window=750

Conv stem: conv1d(128→1280, k=3, s=1) → GELU → conv1d(1280→1280, k=3, s=2) → GELU
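The encoder's causal sliding-window attention (window=750) can be pictured as a boolean mask. A minimal sketch, assuming the common convention that each query attends to itself and the preceding window−1 positions:

```python
def sliding_window_mask(seq_len: int, window: int):
    """mask[q][k] is True iff query position q may attend to key position k."""
    return [[0 <= q - k < window for k in range(seq_len)]
            for q in range(seq_len)]

m = sliding_window_mask(5, 3)
# query 4 attends to keys 2, 3, 4; never to future keys
```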

Adapter

[seq/4, 5120] → Linear(5120→3072) → GELU → Linear(3072→3072) → [seq/4, 3072]
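The 5120-d adapter input comes from stacking four consecutive 1280-d encoder frames into one vector. A pure-Python sketch of that 4x downsample; the concatenation order (consecutive frames, in time order) is an assumption, and no MLP weights are modeled:

```python
def downsample4(frames):
    """frames: list of 1280-d vectors -> list of 5120-d vectors (len // 4)."""
    assert len(frames) % 4 == 0, "sequence length must be a multiple of 4"
    return [sum((frames[i + j] for j in range(4)), [])  # concat 4 frames
            for i in range(0, len(frames), 4)]

out = downsample4([[0.0] * 1280 for _ in range(8)])  # 2 vectors of length 5120
```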

Decoder (LLM)

Parameter         Value
---------------   ----------------------------
dim               3072
layers            26
heads             32
KV heads          8 (GQA 4:1)
head_dim          128
hidden_dim        9216
Norm              RMSNorm (eps=1e-5)
Position          RoPE (theta=1e6)
Attention         causal, sliding window=8192
Vocab size        131,072
Tied embeddings   yes

The decoder uses adaptive RMS normalization conditioned on transcription delay (6 delay tokens = 480ms).
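The 4:1 GQA ratio means each of the 8 KV heads serves 4 query heads. A sketch of the head grouping; mapping consecutive query heads to one KV head is the usual convention but an assumption here:

```python
N_HEADS, N_KV_HEADS = 32, 8
GROUP = N_HEADS // N_KV_HEADS  # 4 query heads share each KV head

def kv_head_for(q_head: int) -> int:
    """Which KV head a given query head reads from (block grouping assumed)."""
    return q_head // GROUP

# query heads 0..3 -> KV head 0; query heads 28..31 -> KV head 7
```

Sharing KV heads this way shrinks the KV cache by 4x relative to full MHA, which matters for streaming with an 8192-token window.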

Weight Format

  • consolidated.safetensors (8.3 GB) — 711 tensors, all BF16
  • params.json — model config
  • tekken.json (14.9 MB) — Tekken tokenizer

Tokenizer (Tekken)

Token           ID
-------------   --
BOS             1
EOS             2
STREAMING_PAD   32

Token IDs 0–999 are special tokens. IDs 1000+ index into the vocabulary (base64-encoded byte sequences in tekken.json).
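The ID layout above maps to a simple lookup. A sketch with a stand-in one-entry vocab (the real tekken.json holds the full base64-encoded byte-string list); the fallback label for unnamed special IDs is illustrative:

```python
import base64

SPECIALS = {1: "<bos>", 2: "<eos>", 32: "<streaming_pad>"}
vocab_b64 = [base64.b64encode(b"hello").decode()]  # stand-in for tekken.json

def decode_id(tok_id):
    """IDs < 1000 are special tokens; ID 1000+k is the k-th vocab byte string."""
    if tok_id < 1000:
        return SPECIALS.get(tok_id, f"<special:{tok_id}>")
    return base64.b64decode(vocab_b64[tok_id - 1000])
```

Note that regular IDs decode to raw bytes, not strings: multi-byte UTF-8 characters can span several tokens, so a decoder should concatenate bytes before decoding text.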

Audio Streaming Config

Parameter                    Value
--------------------------   ----------------------
sampling_rate                16,000
frame_rate                   12.5 (80 ms per token)
transcription_delay_ms       480 (6 delay tokens)
left_pad_tokens              32
right_pad_tokens (offline)   17
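The delay and padding values tie together arithmetically (a quick check, not part of the config itself):

```python
FRAME_RATE_HZ = 12.5
TOKEN_MS = 1000 / FRAME_RATE_HZ           # 80 ms of audio per text token
DELAY_TOKENS = int(480 / TOKEN_MS)        # 480 ms delay = 6 tokens
LEFT_PAD = 32
prompt_len = 1 + LEFT_PAD + DELAY_TOKENS  # BOS + 32 left-pad + 6 delay = 39
```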

Decode Schedule (Offline)

  1. Prompt: [BOS] + [STREAMING_PAD] × 38 (BOS + 32 left-pad + 6 delay pads = 39 tokens)
  2. Prefill: Feed audio_embed[i] + tok_embed(prompt[i]) for positions 0..L-2
  3. First token: Greedy argmax from position L-1
  4. Autoregressive decode: For each remaining audio position, feed audio_embed[pos] + tok_embed(prev_token), greedy argmax
  5. Stop: On EOS or end of audio span
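The schedule above can be sketched as one loop. `model_step` and `argmax` are hypothetical stand-ins for a decoder forward pass (taking the summed audio embedding and input token at one position, returning logits) and greedy selection:

```python
BOS, EOS, STREAMING_PAD = 1, 2, 32

def offline_decode(audio_embeds, model_step, argmax):
    """Greedy offline decode following the prompt/prefill schedule above."""
    prompt = [BOS] + [STREAMING_PAD] * 38  # 32 left-pad + 6 delay
    tokens, prev = [], None
    for pos, audio in enumerate(audio_embeds):
        # Prompt positions feed fixed tokens; later positions feed the
        # previously decoded token alongside the audio embedding.
        tok_in = prompt[pos] if pos < len(prompt) else prev
        logits = model_step(pos, audio, tok_in)
        if pos >= len(prompt) - 1:         # first prediction at position L-1
            prev = argmax(logits)
            if prev == EOS:                # stop on EOS
                break
            tokens.append(prev)
    return tokens                          # else: stop at end of audio span
```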

C Implementation

A pure C implementation of this model is available at voxtral.c — runs on Apple Silicon (Metal) and CPU (BLAS), with streaming microphone input.

Credits

Original model by Mistral AI: mistralai/Voxtral-Mini-4B-Realtime-2602

Built for the Mistral Hackathon 2026.
