Is it possible to convert to a 2-bit quantized version?

#1
by hehua2008 - opened
MLX Community org

Is it possible to convert to a 2-bit quantized version?

MLX Community org

It is. Would you like to?

MLX Community org

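The Trade-Offs at a Glance
4-bit Quantization (The Sweet Spot): Usually stays close to baseline model accuracy while cutting weight memory to roughly a quarter of a 16-bit model (about half of an 8-bit one). It is a common default for local inference.
2-bit Quantization (The Extreme Limit): Maximizes memory savings to fit very large models on small hardware, but typically at the cost of badly degraded output, including incoherent text and more "hallucinations."
Key Differences
Memory Footprint: 2-bit needs roughly half the RAM/VRAM of 4-bit for the weights. It is not exactly half, because group-wise quantization also stores a scale and bias per group of weights, and some layers may be kept at higher precision.
Perplexity (Accuracy): 4-bit usually stays close to the unquantized model's perplexity and keeps most of its reasoning ability. 2-bit typically shows a large quality drop, often breaking logic and coding skills.
Use Case: Use 4-bit for everyday use and for coding tasks that need to be reliable. Use 2-bit only as a last resort to fit a very large model onto low-end hardware.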
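
To put rough numbers on the memory comparison above, here is a minimal back-of-the-envelope sketch. It assumes group-wise quantization with a group size of 64 and one 16-bit scale plus one 16-bit bias stored per group; the 7B parameter count is a hypothetical example, and the estimate ignores any layers left unquantized, so treat the output as approximate.

```python
def bits_per_weight(bits: int, group_size: int = 64, overhead_bits: int = 32) -> float:
    """Effective bits per weight under group-wise quantization.

    Assumption: every group of `group_size` weights carries `overhead_bits`
    of metadata (here one 16-bit scale and one 16-bit bias).
    """
    return bits + overhead_bits / group_size


def weight_gib(n_params: float, bits: int, group_size: int = 64) -> float:
    """Approximate weight memory in GiB for a fully quantized model."""
    return n_params * bits_per_weight(bits, group_size) / 8 / 2**30


N_PARAMS = 7e9  # hypothetical 7B-parameter model, for illustration only

for bits in (4, 2):
    print(f"{bits}-bit: {bits_per_weight(bits):.2f} bits/weight, "
          f"about {weight_gib(N_PARAMS, bits):.2f} GiB of weights")

# Ratio of 2-bit to 4-bit weight memory under these assumptions
ratio = bits_per_weight(2) / bits_per_weight(4)
print(f"2-bit / 4-bit memory ratio: {ratio:.2f}")
```

Under these assumptions the per-group metadata is the same size at both bit widths, so it takes up a larger share of the 2-bit model and the saving comes out somewhat smaller than a clean halving.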

MLX Community org

Hey, I recently came across a library that compresses the KV cache to 2 bits or even 1 bit, which reduces the size of the KV cache. Maybe you can try this technique:

https://pypi.org/project/VeloxQuant-MLX/
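
For a sense of scale, here is a rough sketch of how KV cache size depends on bit width. It does not use the linked library's API; the model dimensions are hypothetical, and the estimate ignores any scales or offsets a real quantized cache would also store.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bits: int) -> float:
    """Approximate KV cache size in bytes for one sequence.

    Each layer stores two tensors (keys and values), each holding
    n_kv_heads * head_dim values per token.
    """
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return elements * bits / 8


# Hypothetical model shape and context length, for illustration only
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)

for bits in (16, 2, 1):
    mib = kv_cache_bytes(bits=bits, **shape) / 2**20
    print(f"{bits:>2}-bit KV cache: about {mib:.0f} MiB")
```

The cache grows linearly with context length, so the saving from a lower bit width matters most for long prompts.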
