Carmenest committed
Commit f80c9ca · verified · 1 parent: 1c65114

docs: improve model card with quickstart, benchmarks, Apache-2.0

Files changed (1): README.md (+50 -40)
README.md CHANGED
@@ -1,21 +1,24 @@
  ---
  license: apache-2.0
  tags:
- - diffusion
- - masked-diffusion
- - llada
- - llama
- - gguf
- - diffuse-cpp
  base_model: GSAI-ML/LLaDA-8B-Instruct
  pipeline_tag: text-generation
  ---

  # LLaDA-8B-Instruct-GGUF

- GGUF quantizations of [GSAI-ML/LLaDA-8B-Instruct](https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct) for use with [diffuse-cpp](https://github.com/iafiscal1212/diffuse-cpp), a CPU inference engine for Diffusion Language Models.

- LLaDA is a masked diffusion language model based on the Llama backbone with Multi-Head Attention (MHA, 32/32 heads).

  ## Available Quantizations

@@ -23,63 +26,70 @@ LLaDA is a masked diffusion language model based on the Llama backbone with Mult
  |------|------|------|-------------|
  | `llada-8b-f16.gguf` | F16 | ~14.9 GB | Full precision, best quality |
  | `llada-8b-q8_0.gguf` | Q8_0 | ~8.4 GB | 8-bit quantization, near-lossless |
- | `llada-8b-q4km.gguf` | Q4_K_M | ~5.1 GB | 4-bit mixed quantization, best quality/size ratio |
-
- **Recommended:** Q4_K_M for most users. Q8_0 if you have enough RAM and want minimal quality loss.

- ## Performance
-
- Benchmarked on diffuse-cpp with entropy_exit + inter-step KV cache, Q4_K_M, B=256, 12 threads, seed=42:
-
- | Prompt | No-Cache tok/s | Cache tok/s | Steps | vs llama.cpp |
- |--------|----------------|-------------|-------|-------------|
- | Capital of France? | 17.5 | **24.4** | 3 | 2.9x |
- | Translate to French | 25.9 | **27.7** | 2 | 3.3x |
- | 15 x 23? | 12.8 | **15.7** | 4 | 1.8x |
- | Translate to Spanish | 7.6 | **22.9** | 7 | 2.7x |
- | Python is_prime() | 3.2 | **4.9** | 16 | 0.6x |
- | Poem about ocean | 3.2 | **5.3** | 16 | 0.6x |
- | Why is sky blue? | 3.3 | **12.0** | 16 | 1.4x |
- | List the planets | 3.3 | **9.4** | 15 | 1.1x |
- | **Average** | **9.6** | **15.3** | | **1.8x** |
-
- - **Inter-step cache: 1.6x average speedup** (9.6 -> 15.3 tok/s)
- - Easy prompts: **15-28 tok/s** (up to 3.3x faster than llama.cpp)
- - 6 of 8 prompts outperform llama.cpp (8.51 tok/s baseline)
- - Cache enabled by default, no quality degradation

- ## Usage

  ```bash
  # Download
  huggingface-cli download diffuse-cpp/LLaDA-8B-Instruct-GGUF llada-8b-q4km.gguf

- # Run (requires diffuse-cpp)
- ./diffuse-cli -m llada-8b-q4km.gguf -p "What is the capital of France?" -n 256 -s 16
  ```

  ## Model Details

- - **Architecture:** Llama backbone with bidirectional attention
  - **Parameters:** 8B
  - **Layers:** 32
  - **Hidden size:** 4096
- - **Attention:** MHA (32 query heads, 32 KV heads, head dim 128)
- - **FFN:** SwiGLU, intermediate size 12288
  - **Vocabulary:** 126,464 tokens
  - **RoPE theta:** 500,000
  - **Mask token ID:** 126336

  ## Also Available

- - **[Dream-v0-Instruct-7B-GGUF](https://huggingface.co/diffuse-cpp/Dream-v0-Instruct-7B-GGUF)** — Qwen2.5 backbone, GQA, 7.6B params. Excels at factual and math prompts (21.6 tok/s).

  ## Citation

  ```bibtex
  @software{diffuse_cpp_2026,
    title={diffuse-cpp: High-Performance Inference for Diffusion Language Models},
-   author={Carmen Estévez},
    year={2026},
    url={https://github.com/iafiscal1212/diffuse-cpp}
  }
  ```
@@ -87,4 +97,4 @@ huggingface-cli download diffuse-cpp/LLaDA-8B-Instruct-GGUF llada-8b-q4km.gguf

  ## License

- Apache 2.0, following the original LLaDA model license.

  ---
  license: apache-2.0
  tags:
+ - diffusion
+ - llada
+ - gguf
+ - cpu-inference
+ - diffuse-cpp
+ language:
+ - en
  base_model: GSAI-ML/LLaDA-8B-Instruct
  pipeline_tag: text-generation
  ---

  # LLaDA-8B-Instruct-GGUF

+ GGUF quantizations of [GSAI-ML/LLaDA-8B-Instruct](https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct) for use with [diffuse-cpp](https://github.com/iafiscal1212/diffuse-cpp), the first C++ inference engine for Diffusion Language Models.

+ LLaDA is a masked diffusion language model built on a Llama backbone. Unlike autoregressive models that generate one token at a time, LLaDA generates all tokens in parallel through iterative refinement, which makes inference compute-bound rather than memory-bound on CPU.

+ **On a 12-core CPU, LLaDA with diffuse-cpp reaches 27.7 tok/s on translation tasks, 3.3x faster than llama.cpp (8.51 tok/s) on the same hardware.**
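To make the "parallel generation through iterative refinement" idea concrete, here is a minimal, self-contained sketch of masked-diffusion decoding. It is not diffuse-cpp's implementation: the model forward pass is faked with random scores, and the unmask-half-each-step schedule and confidence ranking are illustrative placeholders for the real remasking strategies.

```python
import random

MASK = 126336  # LLaDA's mask token ID (see Model Details)

def fake_model(tokens):
    """Stand-in for a model forward pass: yields a (predicted token,
    confidence) pair per position. A real model returns vocab logits."""
    return [(random.randrange(1000), random.random()) for _ in tokens]

def diffusion_decode(prompt, gen_len=8, steps=4):
    # Start from the prompt followed by a fully masked generation region.
    tokens = prompt + [MASK] * gen_len
    for step in range(steps):
        preds = fake_model(tokens)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break  # converged early (cf. diffuse-cpp's entropy_exit)
        # Reveal the most confident masked positions; last step reveals all.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        keep = len(masked) if step == steps - 1 else max(1, len(masked) // 2)
        for i in masked[:keep]:
            tokens[i] = preds[i][0]
    return tokens

random.seed(0)
out = diffusion_decode([1, 2, 3], gen_len=8, steps=4)
print(out)  # 11 tokens, no MASK left
```

Every step runs one forward pass over the whole sequence, which is why step count (3-16 in the benchmarks below) rather than token count drives latency.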

  ## Available Quantizations

  | File | Type | Size | Description |
  |------|------|------|-------------|
  | `llada-8b-f16.gguf` | F16 | ~14.9 GB | Full precision, best quality |
  | `llada-8b-q8_0.gguf` | Q8_0 | ~8.4 GB | 8-bit quantization, near-lossless |
+ | `llada-8b-q4km.gguf` | Q4_K_M | ~5.1 GB | 4-bit mixed, best speed/quality ratio |

+ **Recommended:** Q4_K_M for most users.
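As a rough sanity check on the sizes above: a GGUF file is approximately parameters × bits-per-weight / 8. The bits-per-weight values below are generic ballpark figures (F16 = 16, Q8_0 ≈ 8.5, Q4_K_M ≈ 4.8), not measured from these files, and real sizes differ because k-quants mix tensor types and files carry metadata.

```python
def approx_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rule of thumb: size ~= parameters * bits-per-weight / 8 bytes."""
    return n_params * bits_per_weight / 8 / 1e9

N = 8.0e9  # 8B parameters
for name, bpw in [("F16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
    print(f"{name}: ~{approx_gguf_size_gb(N, bpw):.1f} GB")
```

The estimates land in the same ballpark as the table, not exactly on it.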
 
+ ## Quick Start

  ```bash
  # Download
  huggingface-cli download diffuse-cpp/LLaDA-8B-Instruct-GGUF llada-8b-q4km.gguf

+ # Build diffuse-cpp
+ git clone --recursive https://github.com/iafiscal1212/diffuse-cpp.git
+ cd diffuse-cpp
+ cmake -B build -DCMAKE_BUILD_TYPE=Release
+ cmake --build build -j$(nproc)
+
+ # Run
+ ./build/diffuse-cli -m ../llada-8b-q4km.gguf \
+     --tokens "128000,3923,374,279,6864,315,9822,30" \
+     -n 256 -s 16 -t 12 --remasking entropy_exit
  ```
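The `--tokens` flag takes the prompt as comma-separated token IDs (the IDs in the Quick Start appear to be a Llama-3-style encoding of "What is the capital of France?" with BOS 128000 — an assumption, not documented here). If you already have IDs from a tokenizer, a hypothetical helper for assembling the argument looks like this:

```python
# Hypothetical helper: format a token-ID list into the comma-separated
# string that diffuse-cli's --tokens flag expects. The IDs are the ones
# used in the Quick Start above.
ids = [128000, 3923, 374, 279, 6864, 315, 9822, 30]

def to_tokens_arg(token_ids):
    return ",".join(str(t) for t in token_ids)

cmd = [
    "./build/diffuse-cli", "-m", "llada-8b-q4km.gguf",
    "--tokens", to_tokens_arg(ids),
    "-n", "256", "-s", "16", "-t", "12", "--remasking", "entropy_exit",
]
print(" ".join(cmd))
```

Passing the list through `subprocess.run(cmd)` avoids shell-quoting issues with the comma-separated string.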

+ ## Performance
+
+ Benchmarked on an AMD EPYC 4465P 12-core CPU, Q4_K_M, entropy_exit + inter-step cache, B=256:
+
+ | Prompt | No-cache (tok/s) | Cache (tok/s) | Steps | vs llama.cpp |
+ |--------|------------------|---------------|-------|--------------|
+ | Capital of France? | 17.5 | **24.4** | 3 | 2.9x |
+ | Translate to French | 25.9 | **27.7** | 2 | **3.3x** |
+ | 15 x 23? | 12.8 | **15.7** | 4 | 1.8x |
+ | Translate to Spanish | 7.6 | **22.9** | 7 | 2.7x |
+ | Python is_prime() | 3.2 | **4.9** | 16 | 0.6x |
+ | Poem about ocean | 3.2 | **5.3** | 16 | 0.6x |
+ | Why is sky blue? | 3.3 | **12.0** | 16 | 1.4x |
+ | List the planets | 3.3 | **9.4** | 15 | 1.1x |
+ | **Average** | **9.6** | **15.3** | | **1.8x** |
+
+ - Inter-step cache: 1.6x average speedup with no quality degradation
+ - 6 of 8 prompts outperform llama.cpp (8.51 tok/s baseline)
+ - LLaDA excels at translation tasks (converges in 2-5 steps)
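The headline figures follow directly from the table — averaging the two throughput columns reproduces the 1.6x cache speedup and the 1.8x average over the llama.cpp baseline:

```python
no_cache = [17.5, 25.9, 12.8, 7.6, 3.2, 3.2, 3.3, 3.3]  # tok/s, per prompt
cache    = [24.4, 27.7, 15.7, 22.9, 4.9, 5.3, 12.0, 9.4]
llama_cpp_baseline = 8.51  # tok/s

avg_nc = sum(no_cache) / len(no_cache)
avg_c = sum(cache) / len(cache)
print(f"avg no-cache:  {avg_nc:.1f} tok/s")                 # 9.6
print(f"avg cache:     {avg_c:.1f} tok/s")                  # 15.3
print(f"cache speedup: {avg_c / avg_nc:.1f}x")              # 1.6x
print(f"vs llama.cpp:  {avg_c / llama_cpp_baseline:.1f}x")  # 1.8x
```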

  ## Model Details

+ - **Architecture:** Llama backbone with bidirectional (non-causal) attention
  - **Parameters:** 8B
  - **Layers:** 32
  - **Hidden size:** 4096
+ - **Attention:** MHA (32 query heads, 32 KV heads)
+ - **FFN:** SwiGLU, intermediate size 12288
  - **Vocabulary:** 126,464 tokens
  - **RoPE theta:** 500,000
  - **Mask token ID:** 126336
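The listed dimensions account for the 8B parameter count. A back-of-the-envelope tally (norms and any biases ignored, and an untied output head assumed — both assumptions, not stated in the card):

```python
hidden, layers, inter, vocab = 4096, 32, 12288, 126_464

embed   = vocab * hidden        # token embeddings
attn    = 4 * hidden * hidden   # Wq, Wk, Wv, Wo (MHA, so no GQA shrink on K/V)
ffn     = 3 * hidden * inter    # SwiGLU: gate, up, down projections
lm_head = vocab * hidden        # assuming an untied output head

total = embed + layers * (attn + ffn) + lm_head
print(f"~{total / 1e9:.1f}B parameters")  # ~8.0B
```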

  ## Also Available

+ - **[Dream-v0-Instruct-7B-GGUF](https://huggingface.co/diffuse-cpp/Dream-v0-Instruct-7B-GGUF)** — Qwen2.5 backbone, GQA. Excels at math and code (21.6 tok/s; correctly solves arithmetic in 2 steps).

  ## Citation

  ```bibtex
  @software{diffuse_cpp_2026,
    title={diffuse-cpp: High-Performance Inference for Diffusion Language Models},
+   author={Carmen Esteban},
    year={2026},
    url={https://github.com/iafiscal1212/diffuse-cpp}
  }
  ```

  ## License

+ Apache 2.0