Is it possible to convert to a 2-bit quantized version?

by hehua2008 - opened Apr 24

Discussion

hehua2008

MLX Community org Apr 24

Is it possible to convert to a 2-bit quantized version?

machiabeli

MLX Community org 5 days ago

It is. Would you like to?

datagram

MLX Community org 4 days ago

The Trade-Offs at a Glance
4-bit Quantization (The Sweet Spot): Stays close to baseline accuracy for most models while shrinking the weights to roughly a quarter of their 16-bit size. It is the most common choice for local inference.
2-bit Quantization (The Extreme Limit): Maximizes memory savings to fit large models on small hardware, but usually degrades output noticeably: less coherent text and more hallucinations.
Key Differences
Memory Footprint: 2-bit roughly halves the weight memory that 4-bit needs. The saving is somewhat less than half in practice, because per-group scales and biases add overhead and the KV cache and activations are not affected.
Perplexity (Accuracy): 4-bit retains most reasoning ability. 2-bit typically shows a large jump in perplexity and often breaks logic and coding skills.
Use Case: Use 4-bit for everyday use and reliable coding tasks. Use 2-bit only as a last resort to fit a very large model onto memory-constrained hardware.
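
To put rough numbers on the memory footprint point, here is a minimal sketch of the arithmetic. It assumes group-wise affine quantization with a group size of 64 and one 16-bit scale plus one 16-bit bias per group (my understanding of the MLX defaults, so treat those values as assumptions). The 7B parameter count is a hypothetical example, not the model in this repo.

```python
def bits_per_weight(bits: int, group_size: int = 64, overhead_bits: int = 32) -> float:
    """Effective bits per weight for group-wise quantization.

    overhead_bits is the per-group metadata: an assumed 16-bit scale
    plus an assumed 16-bit bias, shared by group_size weights.
    """
    return bits + overhead_bits / group_size


def weight_gib(n_params: float, effective_bits: float) -> float:
    """Weight storage in GiB for n_params weights at effective_bits each."""
    return n_params * effective_bits / 8 / 2**30


n_params = 7e9  # hypothetical 7B-parameter model

fp16 = weight_gib(n_params, 16)                 # unquantized, no group overhead
q4 = weight_gib(n_params, bits_per_weight(4))   # 4.5 effective bits
q2 = weight_gib(n_params, bits_per_weight(2))   # 2.5 effective bits

print(f"fp16 : {fp16:.2f} GiB")
print(f"4-bit: {q4:.2f} GiB")
print(f"2-bit: {q2:.2f} GiB")
print(f"2-bit / 4-bit ratio: {q2 / q4:.2f}")
```

Under these assumptions 4-bit costs 4.5 bits per weight and 2-bit costs 2.5, so 2-bit weights take about 56% of the 4-bit size rather than exactly half. This counts weights only, not the KV cache or activations.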

rajveer43

MLX Community org 4 days ago

Hey, I recently came across a library that compresses the KV cache to 2-bit or 1-bit, which reduces the size of the KV cache. Maybe you can try this technique:

https://pypi.org/project/VeloxQuant-MLX/
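
For a sense of scale, the KV cache size can be estimated with a few lines of arithmetic. This is a rough sketch that is independent of the library linked above and does not use its API. The model shape (32 layers, 8 KV heads, head dimension 128) and the 8192-token context are hypothetical values, and the estimate ignores any scale or zero-point metadata a real quantized cache would store.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bits: int, batch: int = 1) -> float:
    """Approximate KV cache size in bytes.

    The factor of 2 accounts for storing both keys and values.
    Quantization metadata (scales, zero points) is not included.
    """
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch
    return elements * bits / 8


# Hypothetical model shape and context length, for illustration only.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)

for bits in (16, 2, 1):
    mib = kv_cache_bytes(bits=bits, **shape) / 2**20
    print(f"{bits:>2}-bit KV cache: {mib:.0f} MiB")
```

Note that this shrinks the cache, which grows with context length, and leaves the model weights at whatever precision they were converted to.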
