---
license: apache-2.0
tags:
- diffusion
- llada
- gguf
- cpu-inference
- diffuse-cpp
language:
- en
base_model: GSAI-ML/LLaDA-8B-Instruct
pipeline_tag: text-generation
---
# LLaDA-8B-Instruct-GGUF
GGUF quantizations of GSAI-ML/LLaDA-8B-Instruct for use with diffuse-cpp, the first C++ inference engine for Diffusion Language Models.
LLaDA is a masked diffusion language model based on the Llama backbone. Unlike autoregressive models that generate one token at a time, LLaDA generates all tokens in parallel through iterative refinement — making it compute-bound rather than memory-bound on CPU.
On a 12-core CPU, LLaDA with diffuse-cpp reaches 27.7 tok/s on translation tasks — 3.3x faster than llama.cpp (8.51 tok/s) on the same hardware.
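To make the generation scheme concrete, here is a minimal NumPy sketch of masked-diffusion decoding with confidence-based unmasking. This is an illustration of the general idea only, not diffuse-cpp's actual algorithm: `toy_model` is a random stand-in for the transformer, and the unmask-half-per-step schedule is an assumption.

```python
import numpy as np

MASK = -1   # stand-in for LLaDA's mask token (real ID: 126336)
VOCAB = 16  # toy vocabulary size for the sketch
rng = np.random.default_rng(0)

def toy_model(seq):
    """Stand-in for the transformer: random logits per position."""
    return rng.normal(size=(len(seq), VOCAB))

def diffusion_generate(n_tokens, n_steps):
    seq = np.full(n_tokens, MASK)                  # start fully masked
    for _ in range(n_steps):
        masked = seq == MASK
        if not masked.any():
            break                                   # fully decoded: exit early
        logits = toy_model(seq)                     # one full forward pass
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        conf = probs.max(-1)                        # per-position confidence
        # commit the most confident masked positions this step
        k = max(1, masked.sum() // 2)
        idx = np.argsort(np.where(masked, conf, -np.inf))[-k:]
        seq[idx] = probs[idx].argmax(-1)
    return seq
```

Every step runs a full forward pass over all positions, which is why the workload is compute-bound: unlike autoregressive decoding, there is no single-token memory-bandwidth bottleneck.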
## Available Quantizations
| File | Type | Size | Description |
|---|---|---|---|
| `llada-8b-f16.gguf` | F16 | ~14.9 GB | Full precision, best quality |
| `llada-8b-q8_0.gguf` | Q8_0 | ~8.4 GB | 8-bit quantization, near-lossless |
| `llada-8b-q4km.gguf` | Q4_K_M | ~5.1 GB | 4-bit mixed, best speed/quality ratio |
**Recommended:** Q4_K_M for most users.
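A quick back-of-envelope check of what the file sizes in the table imply in bits per weight. The assumptions here are mine: sizes are treated as GiB and the parameter count as exactly 8e9, so the results are rough.

```python
# Effective bits per weight implied by each file size in the table above.
# Assumptions: sizes are GiB (2**30 bytes), parameter count is ~8e9.
PARAMS = 8.0e9

def bits_per_weight(size_gib):
    return size_gib * 2**30 * 8 / PARAMS

for name, gib in [("F16", 14.9), ("Q8_0", 8.4), ("Q4_K_M", 5.1)]:
    print(f"{name}: ~{bits_per_weight(gib):.1f} bits/weight")
```

Under these assumptions F16 lands near 16 bits/weight and Q8_0 slightly above 8, consistent with quantization formats that store per-block scales alongside the quantized weights.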
## Quick Start
```bash
# Download
huggingface-cli download diffuse-cpp/LLaDA-8B-Instruct-GGUF llada-8b-q4km.gguf

# Build diffuse-cpp
git clone --recursive https://github.com/iafiscal1212/diffuse-cpp.git
cd diffuse-cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Run
./build/diffuse-cli -m ../llada-8b-q4km.gguf \
  --tokens "128000,3923,374,279,6864,315,9822,30" \
  -n 256 -s 16 -t 12 --remasking entropy_exit
```
## Performance
Benchmarked on AMD EPYC 4465P 12-Core, Q4_K_M, entropy_exit + inter-step cache, B=256:
| Prompt | No-Cache | Cache | Steps | vs llama.cpp |
|---|---|---|---|---|
| Capital of France? | 17.5 | 24.4 tok/s | 3 | 2.9x |
| Translate to French | 25.9 | 27.7 tok/s | 2 | 3.3x |
| 15 x 23? | 12.8 | 15.7 tok/s | 4 | 1.8x |
| Translate to Spanish | 7.6 | 22.9 tok/s | 7 | 2.7x |
| Python is_prime() | 3.2 | 4.9 tok/s | 16 | 0.6x |
| Poem about ocean | 3.2 | 5.3 tok/s | 16 | 0.6x |
| Why is sky blue? | 3.3 | 12.0 tok/s | 16 | 1.4x |
| List the planets | 3.3 | 9.4 tok/s | 15 | 1.1x |
| Average | 9.6 | 15.3 tok/s | 1.8x |
- Inter-step cache: 1.6x average speedup with no quality degradation
- 6 of 8 prompts outperform llama.cpp (8.51 tok/s baseline)
- LLaDA excels at translation tasks (converges in 2-5 steps)
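The `entropy_exit` flag suggests an early-exit criterion based on predictive entropy. The sketch below is my guess at the underlying idea (stop refining once the model's per-position distributions are sharp enough), not diffuse-cpp's actual implementation; the threshold value is arbitrary.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max(-1, keepdims=True))
    return z / z.sum(-1, keepdims=True)

def should_exit(logits, threshold=0.5):
    """Early-exit test: mean per-position entropy (in nats) below threshold.

    Low entropy means the model is confident everywhere, so further
    refinement steps are unlikely to change the output.
    """
    p = softmax(logits)
    entropy = -(p * np.log(p + 1e-12)).sum(-1)
    return entropy.mean() < threshold
```

A criterion like this would explain the step counts in the table: easy prompts (translation, short facts) sharpen within 2-5 steps, while open-ended generation (poems, explanations) stays high-entropy and runs the full step budget.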
## Model Details
- **Architecture:** Llama backbone with bidirectional (non-causal) attention
- **Parameters:** 8B
- **Layers:** 32
- **Hidden size:** 4096
- **Attention:** MHA (32 query heads, 32 KV heads)
- **FFN:** SwiGLU, intermediate 12288
- **Vocabulary:** 126,464 tokens
- **RoPE theta:** 500,000
- **Mask token ID:** 126336
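As a sanity check, the listed dimensions reproduce the 8B parameter count. The sketch below ignores norms and biases, and an untied LM head is my assumption rather than something the card states.

```python
# Rough parameter count from the spec above (norms/biases ignored;
# untied LM head is an assumption, not confirmed by the card).
vocab, hidden, ffn, layers = 126_464, 4096, 12_288, 32

embed   = vocab * hidden       # token embeddings
attn    = 4 * hidden * hidden  # Q, K, V, O projections (MHA: no GQA shrink)
swiglu  = 3 * hidden * ffn     # gate, up, down projections
lm_head = vocab * hidden       # output head (untied: assumption)

total = embed + layers * (attn + swiglu) + lm_head
print(f"~{total / 1e9:.2f}B parameters")  # prints "~8.02B parameters"
```

Note that MHA with 32 KV heads means the K and V projections are full `hidden x hidden` matrices; a GQA model of the same width (like the Dream model below) would spend fewer parameters there.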
## Also Available
- Dream-v0-Instruct-7B-GGUF — Qwen2.5 backbone, GQA. Excels at math and code (21.6 tok/s, correctly solves arithmetic in 2 steps).
## Citation
```bibtex
@software{diffuse_cpp_2026,
  title={diffuse-cpp: High-Performance Inference for Diffusion Language Models},
  author={Carmen Esteban},
  year={2026},
  url={https://github.com/iafiscal1212/diffuse-cpp}
}
```
## License
Apache 2.0