Voxtral Realtime 4B

Streaming speech-to-text model with ~4 billion parameters. Weights in BF16 safetensors format, extracted from mistralai/Voxtral-Mini-4B-Realtime-2602.

Architecture

Pipeline:

WAV → 16 kHz → Mel Spectrogram → Conv Stem → Encoder → Downsample 4x → Adapter → Decoder → Tokens
  • Audio Encoder: ~0.6B params — causal transformer, 32 layers
  • Audio-Language Adapter: 2-layer MLP with 4x downsample
  • LLM Decoder: ~3.4B params — Ministral-3 based, 26 layers with GQA
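The stage dimensions can be traced end to end. This is a shape-only sketch using the numbers from the tables in this card (function name and flooring behavior are illustrative assumptions, not the reference implementation):

```python
def pipeline_shapes(seconds: float):
    """Trace tensor shapes for `seconds` of 16 kHz mono audio through the pipeline."""
    samples = int(seconds * 16_000)
    mel_frames = samples // 160        # hop length 160 -> 100 mel frames/s
    conv_frames = mel_frames // 2      # conv stem stride 2 -> 50 Hz
    tokens = conv_frames // 4          # 4x downsample -> 12.5 tokens/s
    return {
        "wav": (samples,),
        "mel": (mel_frames, 128),      # 128 mel bins
        "encoder": (conv_frames, 1280),  # encoder dim 1280
        "adapter_in": (tokens, 5120),  # 4 encoder frames stacked: 4 * 1280
        "decoder": (tokens, 3072),     # decoder dim 3072
    }
```

One second of audio yields 100 mel frames, 50 encoder frames, and 12 full decoder positions.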

Audio Preprocessing

Parameter      Value
------------   --------------------
Sample rate    16,000 Hz
Frame rate     12.5 Hz
Mel bins       128
Hop length     160 samples (10 ms)
Window size    400 samples (25 ms)
1 text token   80 ms of audio
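These numbers are mutually consistent, which a few lines of arithmetic confirm (a sanity-check sketch; the reference implementation's exact framing/padding may differ):

```python
SR, HOP, WIN, FRAME_RATE = 16_000, 160, 400, 12.5

hop_ms = 1000 * HOP / SR                    # 10 ms between mel frames
win_ms = 1000 * WIN / SR                    # 25 ms analysis window
mel_per_token = (SR / HOP) / FRAME_RATE     # 100 Hz / 12.5 Hz = 8 mel frames
token_ms = mel_per_token * hop_ms           # 8 frames * 10 ms = 80 ms per token
```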

Encoder (Causal Transformer)

Parameter    Value
----------   -------------------------------
dim          1280
layers       32
heads        32 (MHA)
head_dim     64
hidden_dim   5120
FFN          SwiGLU
Norm         RMSNorm (eps=1e-5)
Position     RoPE (theta=1e6, interleaved)
Attention    causal, sliding window=750

Conv stem: conv1d(128→1280, k=3, s=1) → GELU → conv1d(1280→1280, k=3, s=2) → GELU
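The encoder's causal sliding-window attention (window=750) can be pictured as a boolean mask. A minimal sketch, assuming the common convention that each query attends to itself and the preceding window−1 positions:

```python
def sliding_window_mask(seq_len: int, window: int):
    """mask[q][k] is True iff query position q may attend to key position k."""
    return [[0 <= q - k < window for k in range(seq_len)]
            for q in range(seq_len)]

m = sliding_window_mask(5, 3)
# query 4 attends to keys 2, 3, 4; never to future keys
```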

Adapter

[seq/4, 5120] → Linear(5120→3072) → GELU → Linear(3072→3072) → [seq/4, 3072]
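The 5120-d adapter input comes from stacking four consecutive 1280-d encoder frames into one vector. A pure-Python sketch of that 4x downsample; the concatenation order (consecutive frames, in time order) is an assumption, and no MLP weights are modeled:

```python
def downsample4(frames):
    """frames: list of 1280-d vectors -> list of 5120-d vectors (len // 4)."""
    assert len(frames) % 4 == 0, "sequence length must be a multiple of 4"
    return [sum((frames[i + j] for j in range(4)), [])  # concat 4 frames
            for i in range(0, len(frames), 4)]

out = downsample4([[0.0] * 1280 for _ in range(8)])  # 2 vectors of length 5120
```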

Decoder (LLM)

Parameter         Value
---------------   ----------------------------
dim               3072
layers            26
heads             32
KV heads          8 (GQA 4:1)
head_dim          128
hidden_dim        9216
Norm              RMSNorm (eps=1e-5)
Position          RoPE (theta=1e6)
Attention         causal, sliding window=8192
Vocab size        131,072
Tied embeddings   yes

The decoder uses adaptive RMS normalization conditioned on transcription delay (6 delay tokens = 480ms).
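The 4:1 GQA ratio means each of the 8 KV heads serves 4 query heads. A sketch of the head grouping; mapping consecutive query heads to one KV head is the usual convention but an assumption here:

```python
N_HEADS, N_KV_HEADS = 32, 8
GROUP = N_HEADS // N_KV_HEADS  # 4 query heads share each KV head

def kv_head_for(q_head: int) -> int:
    """Which KV head a given query head reads from (block grouping assumed)."""
    return q_head // GROUP

# query heads 0..3 -> KV head 0; query heads 28..31 -> KV head 7
```

Sharing KV heads this way shrinks the KV cache by 4x relative to full MHA, which matters for streaming with an 8192-token window.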

Weight Format

  • consolidated.safetensors (8.3 GB) — 711 tensors, all BF16
  • params.json — model config
  • tekken.json (14.9 MB) — Tekken tokenizer

Tokenizer (Tekken)

Token           ID
-------------   --
BOS             1
EOS             2
STREAMING_PAD   32

Token IDs 0–999 are special tokens. IDs 1000+ index into the vocabulary (base64-encoded byte sequences in tekken.json).
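The ID layout above maps to a simple lookup. A sketch with a stand-in one-entry vocab (the real tekken.json holds the full base64-encoded byte-string list); the fallback label for unnamed special IDs is illustrative:

```python
import base64

SPECIALS = {1: "<bos>", 2: "<eos>", 32: "<streaming_pad>"}
vocab_b64 = [base64.b64encode(b"hello").decode()]  # stand-in for tekken.json

def decode_id(tok_id):
    """IDs < 1000 are special tokens; ID 1000+k is the k-th vocab byte string."""
    if tok_id < 1000:
        return SPECIALS.get(tok_id, f"<special:{tok_id}>")
    return base64.b64decode(vocab_b64[tok_id - 1000])
```

Note that regular IDs decode to raw bytes, not strings: multi-byte UTF-8 characters can span several tokens, so a decoder should concatenate bytes before decoding text.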

Audio Streaming Config

Parameter                    Value
--------------------------   ----------------------
sampling_rate                16,000
frame_rate                   12.5 (80 ms per token)
transcription_delay_ms       480 (6 delay tokens)
left_pad_tokens              32
right_pad_tokens (offline)   17
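The delay and padding values tie together arithmetically (a quick check, not part of the config itself):

```python
FRAME_RATE_HZ = 12.5
TOKEN_MS = 1000 / FRAME_RATE_HZ           # 80 ms of audio per text token
DELAY_TOKENS = int(480 / TOKEN_MS)        # 480 ms delay = 6 tokens
LEFT_PAD = 32
prompt_len = 1 + LEFT_PAD + DELAY_TOKENS  # BOS + 32 left-pad + 6 delay = 39
```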

Decode Schedule (Offline)

  1. Prompt: [BOS] + [STREAMING_PAD] × 38 (BOS + 32 left-pad + 6 delay pads = 39 tokens)
  2. Prefill: Feed audio_embed[i] + tok_embed(prompt[i]) for positions 0..L-2
  3. First token: Greedy argmax from position L-1
  4. Autoregressive decode: For each remaining audio position, feed audio_embed[pos] + tok_embed(prev_token), greedy argmax
  5. Stop: On EOS or end of audio span
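The schedule above can be sketched as one loop. `model_step` and `argmax` are hypothetical stand-ins for a decoder forward pass (taking the summed audio embedding and input token at one position, returning logits) and greedy selection:

```python
BOS, EOS, STREAMING_PAD = 1, 2, 32

def offline_decode(audio_embeds, model_step, argmax):
    """Greedy offline decode following the prompt/prefill schedule above."""
    prompt = [BOS] + [STREAMING_PAD] * 38  # 32 left-pad + 6 delay
    tokens, prev = [], None
    for pos, audio in enumerate(audio_embeds):
        # Prompt positions feed fixed tokens; later positions feed the
        # previously decoded token alongside the audio embedding.
        tok_in = prompt[pos] if pos < len(prompt) else prev
        logits = model_step(pos, audio, tok_in)
        if pos >= len(prompt) - 1:         # first prediction at position L-1
            prev = argmax(logits)
            if prev == EOS:                # stop on EOS
                break
            tokens.append(prev)
    return tokens                          # else: stop at end of audio span
```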

C Implementation

A pure C implementation of this model is available at voxtral.c — runs on Apple Silicon (Metal) and CPU (BLAS), with streaming microphone input.

Credits

Original model by Mistral AI: mistralai/Voxtral-Mini-4B-Realtime-2602

Built for the Mistral Hackathon 2026.
