--- license: other license_name: cohere-license license_link: https://huggingface.co/CohereLabs/command-a-plus-05-2026 base_model: CohereLabs/command-a-plus-05-2026 tags: - quantization - int2 - int4 - mixture-of-experts - command-a-plus library_name: command-a-plus-lite --- # Command-A-Plus-Lite (int2 experts / int4 resident) Pre-quantized weights for running Cohere's **Command-A-Plus** (218B-parameter Mixture-of-Experts, 25B active) on a **single 24GB GPU**. | Component | Precision | Where | |---|---|---| | Routed experts (128/layer) | **int2**, group-wise (g=64) | CPU RAM, streamed per active expert | | Attention q/k/v/o + shared experts + embedding | **int4**, group-wise (g=64) | GPU-resident | | Router gate / layernorms | fp16 | GPU-resident | ``` weights on disk ~67 GB resident VRAM ~8.4 GB host RAM (pinned) ~61 GB (peaks ~108 GB during load) decode speed ~0.3 tok/s (single 24GB GPU, --pin --gemlite) ``` Decode is **transfer-bound** (CPU→GPU expert streaming dominates), so this is a capacity play — fitting a 218B model on one 24GB card — not a throughput one. ## Usage Install the runtime: ```bash pip install -e ".[gemlite]" hf download kizuna-intelligence/Command-A-Plus-Lite --local-dir ./cmda_int4 ``` ```python import torch from command_a_plus_lite import load_quantized model = load_quantized("./cmda_int4", device="cuda:0", dtype=torch.float16, pin_experts=True, use_gemlite=True) ``` The tokenizer is **not** included here — use the one from the base model [`CohereLabs/command-a-plus-05-2026`](https://huggingface.co/CohereLabs/command-a-plus-05-2026). ## License The model weights are governed by **Cohere's license** for Command-A-Plus. The runtime code is MIT (see the GitHub repository). int2 routed experts are blind RTN (no calibration); quality is below the bf16 original.