Step-3.7-Flash_dq4

This model is a DQ4-quantized version of Step-3.7-Flash. It was quantized locally, from a local copy of the original model, using the mlx_lm library.

Quantization Methodology (DQ4)

This model was quantized using the dynamic DQ4 (4-bit / 5-bit / 6-bit / 8-bit mixed) approach, inspired by the methodology described in the mlx-community/Kimi-K2.5-mlx-DQ3_K_M-q8 repository.

Bit widths are assigned per MLX module:

  • Expert layers (switch_mlp / experts / shared experts) are quantized to 4-bit by default.
  • Exception: expert down_proj in the first 5 layers is kept at higher quality (6-bit).
  • Exception: expert down_proj in every 5th layer is kept at medium quality (5-bit).
  • All other quantized layers (e.g. attention projections, routers, embeddings, lm_head, dense MLPs) remain at 8-bit to serve as the "8-bit brain". Normalization layers do not appear in the quantization map.
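The rules above can be sketched as a small path-based function. This is a minimal illustration, not the script that was actually used: the function names `dq4_bits` and `enumerate_paths` are hypothetical, "every 5th layer" is assumed to mean `layer % 5 == 0`, and the three dense (non-MoE) layers are assumed to be the first three of 45. Under those assumptions the sketch reproduces the bit distribution reported in the per-layer map.

```python
import re
from collections import Counter

LAYER_RE = re.compile(r"\.layers\.(\d+)\.")

def dq4_bits(path: str) -> int:
    """Return the bit width for one module path under the DQ4 rules."""
    is_expert = ".switch_mlp." in path or ".share_expert." in path
    if not is_expert:
        return 8  # attention, routers, embeddings, lm_head, dense MLPs
    match = LAYER_RE.search(path)
    if path.endswith(".down_proj") and match:
        layer = int(match.group(1))
        if layer < 5:
            return 6  # first 5 layers: higher quality
        if layer % 5 == 0:  # ASSUMPTION: "every 5th layer" means index % 5 == 0
            return 5
    return 4  # default for expert weights

def enumerate_paths(num_layers: int = 45, num_dense: int = 3) -> list[str]:
    """Build the module paths from the table.

    ASSUMPTION: the dense MLP layers are the first `num_dense` layers.
    """
    prefix = "language_model"
    paths = [f"{prefix}.lm_head", f"{prefix}.model.embed_tokens"]
    for i in range(num_layers):
        base = f"{prefix}.model.layers.{i}"
        paths += [f"{base}.self_attn.{p}_proj" for p in "gkoqv"]
        if i < num_dense:
            paths += [f"{base}.mlp.{p}_proj" for p in ("gate", "up", "down")]
        else:
            paths.append(f"{base}.mlp.gate.gate")  # router
            for block in ("share_expert", "switch_mlp"):
                paths += [f"{base}.mlp.{block}.{p}_proj"
                          for p in ("gate", "up", "down")]
    return paths

bit_counts = Counter(dq4_bits(p) for p in enumerate_paths())
print(dict(sorted(bit_counts.items())))
# {4: 232, 5: 16, 6: 4, 8: 278}
```

mlx_lm's conversion entry point accepts a quantization predicate that can return per-module settings; its exact signature varies between versions, so a function like this would need a thin adapter for the installed release.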

The table below is generated from the actual output config.json, so it reflects exactly what was quantized.

Per-layer quantization map

  • Group size: 64
  • Quantized weight matrices: 530
  • Bit distribution: 4-bit ×232, 5-bit ×16, 6-bit ×4, 8-bit ×278
  • The 8-bit rows below (attention projections, embeddings, lm_head, routers, dense MLPs) form the high-precision backbone. Modules not listed below, such as norms, are not quantized.

| Module pattern | Bits | Count |
|---|---|---|
| language_model.lm_head | 8-bit | 1 |
| language_model.model.embed_tokens | 8-bit | 1 |
| language_model.model.layers.{i}.mlp.down_proj | 8-bit | 3 |
| language_model.model.layers.{i}.mlp.gate.gate | 8-bit | 42 |
| language_model.model.layers.{i}.mlp.gate_proj | 8-bit | 3 |
| language_model.model.layers.{i}.mlp.share_expert.down_proj | 4-bit | 32 |
| language_model.model.layers.{i}.mlp.share_expert.down_proj | 5-bit | 8 |
| language_model.model.layers.{i}.mlp.share_expert.down_proj | 6-bit | 2 |
| language_model.model.layers.{i}.mlp.share_expert.gate_proj | 4-bit | 42 |
| language_model.model.layers.{i}.mlp.share_expert.up_proj | 4-bit | 42 |
| language_model.model.layers.{i}.mlp.switch_mlp.down_proj | 4-bit | 32 |
| language_model.model.layers.{i}.mlp.switch_mlp.down_proj | 5-bit | 8 |
| language_model.model.layers.{i}.mlp.switch_mlp.down_proj | 6-bit | 2 |
| language_model.model.layers.{i}.mlp.switch_mlp.gate_proj | 4-bit | 42 |
| language_model.model.layers.{i}.mlp.switch_mlp.up_proj | 4-bit | 42 |
| language_model.model.layers.{i}.mlp.up_proj | 8-bit | 3 |
| language_model.model.layers.{i}.self_attn.g_proj | 8-bit | 45 |
| language_model.model.layers.{i}.self_attn.k_proj | 8-bit | 45 |
| language_model.model.layers.{i}.self_attn.o_proj | 8-bit | 45 |
| language_model.model.layers.{i}.self_attn.q_proj | 8-bit | 45 |
| language_model.model.layers.{i}.self_attn.v_proj | 8-bit | 45 |
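
A summary like the per-layer map can be rebuilt from config.json with a few lines of Python. The sketch below is illustrative only: it assumes the `quantization` section maps each module path to a dict containing a `bits` entry (next to global keys such as `group_size`), and the helper names `load_config`, `summarize_quantization` and `format_table` are hypothetical. The example runs on a small hand-made dict, not the real file.

```python
import json
import re
from collections import Counter

def load_config(path: str) -> dict:
    """Read a config.json from disk."""
    with open(path) as f:
        return json.load(f)

def summarize_quantization(config: dict) -> Counter:
    """Count modules per (pattern, bits), collapsing layer indices to {i}."""
    counts = Counter()
    for path, params in config.get("quantization", {}).items():
        if not isinstance(params, dict) or "bits" not in params:
            continue  # skip global keys such as "group_size" and "bits"
        pattern = re.sub(r"\.layers\.\d+\.", ".layers.{i}.", path)
        counts[(pattern, params["bits"])] += 1
    return counts

def format_table(counts: Counter) -> str:
    """Render the counts as plain-text rows sorted by pattern, then bits."""
    lines = ["Module pattern | Bits | Count"]
    for (pattern, bits), n in sorted(counts.items()):
        lines.append(f"{pattern} | {bits}-bit | {n}")
    return "\n".join(lines)

# Hand-made stand-in for the "quantization" section (ASSUMED layout).
example_config = {
    "quantization": {
        "group_size": 64,
        "bits": 4,
        "language_model.lm_head": {"group_size": 64, "bits": 8},
        "language_model.model.layers.3.mlp.switch_mlp.down_proj": {"group_size": 64, "bits": 6},
        "language_model.model.layers.4.mlp.switch_mlp.down_proj": {"group_size": 64, "bits": 6},
        "language_model.model.layers.5.mlp.switch_mlp.down_proj": {"group_size": 64, "bits": 5},
    }
}

example_counts = summarize_quantization(example_config)
print(format_table(example_counts))
```

For the real model, `summarize_quantization(load_config("config.json"))` would be the equivalent call, provided the file follows the assumed layout.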

Model size: 197B params. Tensor types: BF16, U32, F32. Format: Safetensors (MLX).