MeetingMind Voxtral Transcription Endpoint

GPU-accelerated speech-to-text for the MeetingMind pipeline using Voxtral Realtime 4B. Runs as an HF Inference Endpoint on a T4 GPU with scale-to-zero.

Model weights: mistral-hackaton-2026/voxtral_model — Voxtral Realtime 4B (BF16 safetensors, loaded from /repository/voxtral-model/)

API

GET /health

Returns service status and GPU availability.

curl -H "Authorization: Bearer $HF_TOKEN" $ENDPOINT_URL/health
{"status": "ok", "gpu_available": true}

POST /transcribe

Speech-to-text using Voxtral Realtime 4B. Returns full transcription.

curl -X POST \
  -H "Authorization: Bearer $HF_TOKEN" \
  -F audio=@speech.wav \
  $ENDPOINT_URL/transcribe
{"text": "Hello, this is a test of the voxtral speech to text system."}
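The same call can be made from Python with only the standard library. This is a minimal sketch, assuming the multipart field name "audio" and the JSON response shape shown above; parse_transcription and the boundary string are illustrative, not part of the API.

```python
import json
import os
import urllib.request

# Assumed: ENDPOINT_URL and HF_TOKEN are set in the environment,
# as in the curl example above.
ENDPOINT_URL = os.environ.get("ENDPOINT_URL", "")
HF_TOKEN = os.environ.get("HF_TOKEN", "")


def parse_transcription(raw: bytes) -> str:
    """Extract the transcription text from a /transcribe JSON response."""
    return json.loads(raw)["text"]


def transcribe(wav_path: str) -> str:
    """POST a WAV file as multipart form data to /transcribe."""
    boundary = "voxtralboundary"  # arbitrary multipart boundary
    with open(wav_path, "rb") as f:
        audio = f.read()
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="audio"; '
        f'filename="{os.path.basename(wav_path)}"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode()
    body = head + audio + f"\r\n--{boundary}--\r\n".encode()
    req = urllib.request.Request(
        f"{ENDPOINT_URL}/transcribe",
        data=body,
        headers={
            "Authorization": f"Bearer {HF_TOKEN}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return parse_transcription(resp.read())
```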

POST /transcribe/stream

Streaming speech-to-text via SSE. Tokens are emitted as they are generated.

curl -X POST \
  -H "Authorization: Bearer $HF_TOKEN" \
  -F audio=@speech.wav \
  $ENDPOINT_URL/transcribe/stream

Events: token (a partial token of the transcription), done (the final transcribed text), error (failure details).
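A client needs to split the SSE stream into (event, data) pairs before it can act on token/done/error. The sketch below parses the standard SSE wire format (event:/data: lines, blank-line delimited); the exact payload shapes are assumptions, not confirmed by the endpoint.

```python
def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of SSE text lines.

    Events are delimited by blank lines; multiple data: lines are
    joined with newlines, per the SSE wire format.
    """
    event, chunks = "message", []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            chunks.append(line[len("data:"):].strip())
        elif line == "":
            if chunks:
                yield event, "\n".join(chunks)
            event, chunks = "message", []
```

A caller would accumulate the data of token events for a live partial transcript and stop at done.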

Environment Variables

Variable           Default                    Description
VOXTRAL_MODEL_DIR  /repository/voxtral-model  Path to Voxtral model weights
Architecture

  • Base image: pytorch/pytorch:2.4.0-cuda12.4-cudnn9-runtime
  • Transcription: Voxtral Realtime 4B via direct safetensors loading (~8GB VRAM)
  • Scale-to-zero: 15 min idle timeout (~$0.60/hr when active)
  • Diarization & embeddings: Served separately by the GPU service on machine "tanti"
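The ~8GB VRAM figure follows from the model size and precision; a back-of-envelope check (weights only, ignoring activations and KV cache):

```python
# Voxtral Realtime 4B in BF16: 2 bytes per parameter.
params = 4e9            # ~4 billion parameters
bytes_per_param = 2     # BF16 = 16 bits = 2 bytes
weights_gb = params * bytes_per_param / 1e9
# ~8 GB for the weights alone, which fits on a 16 GB T4
```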