Moshika MLX: Mixed Precision (q4 bulk + q8 sensitive)
Mixed-precision MLX checkpoint of Kyutai Moshika. Targets ~5 GB of weights, making the 7B Moshi practical on 8 GB Apple Silicon Macs while keeping the quality-critical layers at q8.
Quantization policy
| Path pattern | Bits | Group |
|---|---|---|
| text_emb | 8 | 64 |
| text_linear | 8 | 64 |
| audio_embs.* | 8 | 64 |
| depformer.* (all Linear / Embedding) | 8 | 64 |
| All transformer.layers.* Linear / Embedding | 4 | 32 |
| Norms, Mimi conv layers | - | - |
The Mimi codec is not part of this checkpoint. Pair with the Mimi tokenizer from kyutai/moshika-mlx-bf16 (tokenizer-e351c8d8-checkpoint125.safetensors).
Memory (weights only)
| Variant | Approx weights | Notes |
|---|---|---|
| BF16 | ~14 GB | Reference |
| q8 (group 64), uniform | ~7.4 GB | kyutai/moshika-mlx-q8 |
| mixed (this) | ~5.0 GB | q4 bulk + q8 sensitive |
| q4 (group 32), uniform | ~3.7 GB | kyutai/moshika-mlx-q4 |
Usage
moshi-swift GUI
Select Moshi q4/q8 mixed in the app's model picker. The first run downloads the checkpoint to the Hugging Face cache.
moshi-swift CLI
moshi-cli run hf://strumecki/moshika-mlx-mp/model.mp.safetensors \
--config moshi7b \
--mimi-model hf://kyutai/moshika-mlx-bf16/tokenizer-e351c8d8-checkpoint125.safetensors \
--input mic
Reproducibility
Produced from kyutai/moshika-mlx-bf16/model.safetensors using scripts/convert_mixed_precision.py in the moshi-swift fork. The conversion runs two passes of mlx.nn.quantize with mutually exclusive class_predicate filters.
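As a rough illustration of the two-pass approach, the sketch below defines two path-based predicates and applies them in sequence. It is not the contents of scripts/convert_mixed_precision.py: the helper names (`is_q8`, `is_q4`, `quantize_mixed`) are hypothetical, the path prefixes are copied from the policy table, and it assumes quantizable MLX modules (Linear, Embedding) expose a `to_quantized` method while norms do not.

```python
# Hypothetical sketch of two mutually exclusive quantization passes.
# Path prefixes are taken from the policy table; the actual module names
# in the conversion script may differ.
Q8_EXACT = {"text_emb", "text_linear"}
Q8_PREFIXES = ("audio_embs.", "depformer.")
Q4_PREFIXES = ("transformer.layers.",)


def is_q8(path: str, module) -> bool:
    """True for quantizable modules in the quality-sensitive (q8) set."""
    in_scope = path in Q8_EXACT or path.startswith(Q8_PREFIXES)
    # Assumption: only quantizable layers expose `to_quantized`.
    return in_scope and hasattr(module, "to_quantized")


def is_q4(path: str, module) -> bool:
    """True for quantizable modules in the main transformer (q4) set."""
    return path.startswith(Q4_PREFIXES) and hasattr(module, "to_quantized")


def quantize_mixed(model):
    """Quantize `model` in place: q8/group 64 first, then q4/group 32."""
    import mlx.nn as nn  # imported lazily so the predicates work without MLX

    nn.quantize(model, group_size=64, bits=8, class_predicate=is_q8)
    nn.quantize(model, group_size=32, bits=4, class_predicate=is_q4)
    return model
```

Because the two predicates never accept the same path, the order of the passes should not matter, and a module quantized in the first pass is not touched by the second.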
Limitations
- All 32 main-LM transformer layers are q4. Quality may differ from uniform q8; subjective listening tests are recommended before relying on this for production.
- Quantization is structural: the on-disk format records no bits/group_size metadata. Loading in any runtime other than moshi-swift requires applying the same predicate before update(parameters:).
- macOS / Apple Silicon only (mlx-swift requirement).
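Since the file carries no bits/group_size metadata, another runtime could in principle recover the setting per module from tensor shapes alone. The sketch below is one way to do that, under stated assumptions rather than a documented feature of this checkpoint: it assumes MLX's usual layout, where a quantized weight is packed into 32-bit words (last dimension `in * bits / 32`) and its scales have last dimension `in / group_size`, and that tensors are named `<prefix>.weight` and `<prefix>.scales`. The function `infer_quant` and the example prefixes are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Under the assumed layout, packed_width * 32 / scales_width == bits * group_size.
# That product separates the two settings used in this checkpoint, but it is
# ambiguous in general (for example, 2 bits with group 64 also gives 128).
KNOWN_SETTINGS = {8 * 64: (8, 64), 4 * 32: (4, 32)}


def infer_quant(
    shapes: Dict[str, Tuple[int, ...]], prefix: str
) -> Optional[Tuple[int, int]]:
    """Guess (bits, group_size) for the module at `prefix`.

    Returns None when the module has no `.scales` tensor, which is taken
    here to mean it was left unquantized.
    """
    weight = shapes.get(f"{prefix}.weight")
    scales = shapes.get(f"{prefix}.scales")
    if weight is None or scales is None:
        return None
    product = weight[-1] * 32 // scales[-1]
    if product not in KNOWN_SETTINGS:
        raise ValueError(f"unrecognized quantization layout at {prefix!r}")
    return KNOWN_SETTINGS[product]


# Example with made-up shapes for a 4096 -> 4096 linear layer.
example = {
    "a.weight": (4096, 512), "a.scales": (4096, 128),   # 4-bit, group 32
    "b.weight": (4096, 1024), "b.scales": (4096, 64),   # 8-bit, group 64
    "norm.weight": (4096,),                             # no scales
}
```

In practice the `shapes` dictionary would be built from the loaded checkpoint, for instance by mapping each tensor name to its shape. The policy table remains the authoritative source; a shape check like this is only a sanity test.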
Attribution
All credit for the underlying model architecture, training, and the Mimi codec goes to Kyutai and the authors of the Moshi paper.