Qwen3.6-27B-RFP458 (4.5 bpw)

A 4.5-bit-per-weight RFP458 quantization of Qwen/Qwen3.6-27B, a hybrid vision-language model (Gated-DeltaNet linear-attention + full-attention layers, with a vision tower and MTP head).

Summary

  • Format: RFP458 (rfp458-pack-quantized): iq4_nl non-uniform 4-bit codebook, group size 16, signed-int8 block mantissa + per-channel int8 exponent, with hadamard16 weight rotation.
  • Size: ~20.5 GB (vs ~27 GB for the FP8 build); ~9.4 GiB per card on a 2x 32 GB setup.
  • What is quantized: the MLP linears, the GDN in_proj_qkv / in_proj_z / out_proj, and the full self-attention q/k/v/o projections. Embeddings, lm_head, the GDN gating projections (in_proj_a / in_proj_b), conv1d, norms, and the entire vision tower are kept in bf16.
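The format description above can be illustrated with a small pure-Python sketch. It is not the RFP458 packing code. It assumes the 16-entry iq4_nl codebook as defined in llama.cpp, a Sylvester-construction Hadamard matrix for the `hadamard16` rotation, and a plain float scale per group standing in for the int8 block mantissa and per-channel exponent. The names `rotate`, `quantize_group`, `dequantize_group`, and `bits_per_weight` are hypothetical.

```python
import math

# iq4_nl codebook values as defined in llama.cpp (kvalues_iq4nl).
IQ4_NL = [-127, -104, -83, -65, -49, -35, -22, -10,
          1, 13, 25, 38, 53, 69, 89, 113]
GROUP = 16  # group size stated in the format description


def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


H16 = hadamard(GROUP)


def rotate(group):
    """Apply the orthonormal rotation H16 / sqrt(16); it is its own inverse."""
    s = 1.0 / math.sqrt(GROUP)
    return [s * sum(h * x for h, x in zip(row, group)) for row in H16]


def quantize_group(weights):
    """Rotate one group of 16 weights, then snap each value to the codebook."""
    rotated = rotate(weights)
    amax = max(abs(v) for v in rotated)
    scale = amax / 127.0  # assumption: float scale, not the real int8 mantissa/exponent
    codes = [min(range(16), key=lambda i: abs(IQ4_NL[i] * scale - v))
             for v in rotated]
    return codes, scale


def dequantize_group(codes, scale):
    """Look up codebook values, rescale, and undo the rotation."""
    return rotate([IQ4_NL[c] * scale for c in codes])


def bits_per_weight(index_bits=4, mantissa_bits=8, group=GROUP):
    """4-bit index per weight plus one 8-bit block value shared by the group."""
    return index_bits + mantissa_bits / group


print(bits_per_weight())  # 4.5, ignoring the amortized per-channel exponent
```

Under the assumption that one 8-bit block value is stored per group of 16, the arithmetic lands on the 4.5 bpw in the model name: 4 bits of codebook index plus 8/16 bits of shared scale per weight.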

Quality

WikiText-2 perplexity (llama.cpp-compatible, n_ctx 2048, full test set):

Build                             Size       PPL
RFP458 (this model)               20.5 GB    6.936
FP8 (RedHatAI/Qwen3.6-27B-FP8)    ~27 GB     7.071

RFP458 scores a slightly lower (better) perplexity than the FP8 build, 6.936 vs 7.071, with a weight footprint roughly 24 percent smaller (20.5 GB vs ~27 GB).
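For readers who want to check the comparison, here is a minimal sketch of the arithmetic. It uses the standard definition of perplexity as the exponential of the mean negative log-likelihood per token, and does not reproduce llama.cpp's exact windowing. The helper names `perplexity` and `relative_reduction` are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability of the reference tokens)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_reduction(baseline, value):
    """Fraction by which `value` is smaller than `baseline`."""
    return (baseline - value) / baseline


# Figures taken from the table above.
size_saving = relative_reduction(27.0, 20.5)    # weight footprint, GB
ppl_saving = relative_reduction(7.071, 6.936)   # WikiText-2 perplexity

print(f"size: {size_saving:.1%} smaller, PPL: {ppl_saving:.1%} lower")
```

A perplexity gap of this size is small, so the result is best read as parity with FP8 on this benchmark rather than a clear win.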

Serving

Built for and validated on a vLLM build with native RFP458 dequant kernels on AMD RDNA4 (gfx1201, Radeon AI PRO R9700), tensor-parallel 2. The smaller weights free enough VRAM to serve the full 262K context with a large KV pool. Note that 4-bit-class formats carry a dequant cost, so single-stream decode runs roughly half the speed of the FP8 build on the same hardware; this is a capacity, footprint, and quality choice rather than a speed one.
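The capacity argument can be sketched as a per-card memory budget. The 9.4 GiB weight share comes from the summary above. The 32 GiB card capacity (reading "32 GB" as GiB) and the 0.9 memory-utilization fraction, which mirrors vLLM's default `gpu_memory_utilization`, are assumptions, and the function name `kv_budget_gib` is hypothetical. Activation and runtime overhead are not modeled, so the result is an upper bound on the KV pool.

```python
def kv_budget_gib(card_gib=32.0, weights_gib=9.4, utilization=0.9):
    """Memory left per card for KV cache and activations, in GiB.

    card_gib:     assumed usable capacity of one card
    weights_gib:  per-card weight share reported for tensor-parallel 2
    utilization:  assumed fraction of the card the server may claim
    """
    return card_gib * utilization - weights_gib


rfp458 = kv_budget_gib()
print(f"RFP458: about {rfp458:.1f} GiB per card before activations")
```

Passing a larger `weights_gib` shows how quickly the remaining pool shrinks for a heavier build on the same cards.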

License

Inherits the license of the base model, Qwen/Qwen3.6-27B. See the base model card for terms.

This is a community quantization and is not affiliated with the original model authors.
