Carmenest committed
Commit f80c9ca · verified · 1 parent: 1c65114

docs: improve model card with quickstart, benchmarks, Apache-2.0

Files changed (1): README.md (+50 -40)
README.md CHANGED
@@ -1,21 +1,24 @@
  ---
  license: apache-2.0
  tags:
- - diffusion
- - masked-diffusion
- - llada
- - llama
- - gguf
- - diffuse-cpp
  base_model: GSAI-ML/LLaDA-8B-Instruct
  pipeline_tag: text-generation
  ---

  # LLaDA-8B-Instruct-GGUF

- GGUF quantizations of [GSAI-ML/LLaDA-8B-Instruct](https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct) for use with [diffuse-cpp](https://github.com/iafiscal1212/diffuse-cpp), a CPU inference engine for Diffusion Language Models.

- LLaDA is a masked diffusion language model based on the Llama backbone with Multi-Head Attention (MHA, 32/32 heads).

  ## Available Quantizations

@@ -23,63 +26,70 @@ LLaDA is a masked diffusion language model based on the Llama backbone with Mult
  |------|------|------|-------------|
  | `llada-8b-f16.gguf` | F16 | ~14.9 GB | Full precision, best quality |
  | `llada-8b-q8_0.gguf` | Q8_0 | ~8.4 GB | 8-bit quantization, near-lossless |
- | `llada-8b-q4km.gguf` | Q4_K_M | ~5.1 GB | 4-bit mixed quantization, best quality/size ratio |
-
- **Recommended:** Q4_K_M for most users. Q8_0 if you have enough RAM and want minimal quality loss.

- ## Performance
-
- Benchmarked on diffuse-cpp with entropy_exit + inter-step KV cache, Q4_K_M, B=256, 12 threads, seed=42:
-
- | Prompt | No-Cache tok/s | Cache tok/s | Steps | vs llama.cpp |
- |--------|----------------|-------------|-------|-------------|
- | Capital of France? | 17.5 | **24.4** | 3 | 2.9x |
- | Translate to French | 25.9 | **27.7** | 2 | 3.3x |
- | 15 x 23? | 12.8 | **15.7** | 4 | 1.8x |
- | Translate to Spanish | 7.6 | **22.9** | 7 | 2.7x |
- | Python is_prime() | 3.2 | **4.9** | 16 | 0.6x |
- | Poem about ocean | 3.2 | **5.3** | 16 | 0.6x |
- | Why is sky blue? | 3.3 | **12.0** | 16 | 1.4x |
- | List the planets | 3.3 | **9.4** | 15 | 1.1x |
- | **Average** | **9.6** | **15.3** | | **1.8x** |
-
- - **Inter-step cache: 1.6x average speedup** (9.6 -> 15.3 tok/s)
- - Easy prompts: **15-28 tok/s** (up to 3.3x faster than llama.cpp)
- - 6 of 8 prompts outperform llama.cpp (8.51 tok/s baseline)
- - Cache enabled by default, no quality degradation

- ## Usage

  ```bash
  # Download
  huggingface-cli download diffuse-cpp/LLaDA-8B-Instruct-GGUF llada-8b-q4km.gguf

- # Run (requires diffuse-cpp)
- ./diffuse-cli -m llada-8b-q4km.gguf -p "What is the capital of France?" -n 256 -s 16
  ```

  ## Model Details

- - **Architecture:** Llama backbone with bidirectional attention
  - **Parameters:** 8B
  - **Layers:** 32
  - **Hidden size:** 4096
- - **Attention:** MHA (32 query heads, 32 KV heads, head dim 128)
- - **FFN:** SwiGLU, intermediate size 12288
  - **Vocabulary:** 126,464 tokens
  - **RoPE theta:** 500,000
  - **Mask token ID:** 126336

  ## Also Available

- - **[Dream-v0-Instruct-7B-GGUF](https://huggingface.co/diffuse-cpp/Dream-v0-Instruct-7B-GGUF)** — Qwen2.5 backbone, GQA, 7.6B params. Excels at factual and math prompts (21.6 tok/s).

  ## Citation

  ```bibtex
  @software{diffuse_cpp_2026,
    title={diffuse-cpp: High-Performance Inference for Diffusion Language Models},
-   author={Carmen Estévez},
    year={2026},
    url={https://github.com/iafiscal1212/diffuse-cpp}
  }
  ```
@@ -87,4 +97,4 @@ huggingface-cli download diffuse-cpp/LLaDA-8B-Instruct-GGUF llada-8b-q4km.gguf

  ## License

- Apache 2.0, following the original LLaDA model license.

  ---
  license: apache-2.0
  tags:
+ - diffusion
+ - llada
+ - gguf
+ - cpu-inference
+ - diffuse-cpp
+ language:
+ - en
  base_model: GSAI-ML/LLaDA-8B-Instruct
  pipeline_tag: text-generation
  ---

  # LLaDA-8B-Instruct-GGUF

+ GGUF quantizations of [GSAI-ML/LLaDA-8B-Instruct](https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct) for use with [diffuse-cpp](https://github.com/iafiscal1212/diffuse-cpp), the first C++ inference engine for Diffusion Language Models.

+ LLaDA is a masked diffusion language model built on a Llama backbone. Unlike autoregressive models that generate one token at a time, LLaDA generates all tokens in parallel through iterative refinement, which makes inference compute-bound rather than memory-bound on CPU.

+ **On a 12-core CPU, LLaDA with diffuse-cpp reaches 27.7 tok/s on translation tasks, 3.3x faster than llama.cpp (8.51 tok/s) on the same hardware.**
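To make the "parallel generation through iterative refinement" idea concrete, here is a minimal, self-contained sketch of masked-diffusion decoding. It is not diffuse-cpp's implementation: the model forward pass is faked with random scores, and the unmask-half-each-step schedule and confidence ranking are illustrative placeholders for the real remasking strategies.

```python
import random

MASK = 126336  # LLaDA's mask token ID (see Model Details)

def fake_model(tokens):
    """Stand-in for a model forward pass: yields a (predicted token,
    confidence) pair per position. A real model returns vocab logits."""
    return [(random.randrange(1000), random.random()) for _ in tokens]

def diffusion_decode(prompt, gen_len=8, steps=4):
    # Start from the prompt followed by a fully masked generation region.
    tokens = prompt + [MASK] * gen_len
    for step in range(steps):
        preds = fake_model(tokens)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break  # converged early (cf. diffuse-cpp's entropy_exit)
        # Reveal the most confident masked positions; last step reveals all.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        keep = len(masked) if step == steps - 1 else max(1, len(masked) // 2)
        for i in masked[:keep]:
            tokens[i] = preds[i][0]
    return tokens

random.seed(0)
out = diffusion_decode([1, 2, 3], gen_len=8, steps=4)
print(out)  # 11 tokens, no MASK left
```

Every step runs one forward pass over the whole sequence, which is why step count (3-16 in the benchmarks below) rather than token count drives latency.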

  ## Available Quantizations

  | File | Type | Size | Description |
  |------|------|------|-------------|
  | `llada-8b-f16.gguf` | F16 | ~14.9 GB | Full precision, best quality |
  | `llada-8b-q8_0.gguf` | Q8_0 | ~8.4 GB | 8-bit quantization, near-lossless |
+ | `llada-8b-q4km.gguf` | Q4_K_M | ~5.1 GB | 4-bit mixed, best speed/quality ratio |

+ **Recommended:** Q4_K_M for most users.
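As a rough sanity check on the sizes above: a GGUF file is approximately parameters × bits-per-weight / 8. The bits-per-weight values below are generic ballpark figures (F16 = 16, Q8_0 ≈ 8.5, Q4_K_M ≈ 4.8), not measured from these files, and real sizes differ because k-quants mix tensor types and files carry metadata.

```python
def approx_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rule of thumb: size ~= parameters * bits-per-weight / 8 bytes."""
    return n_params * bits_per_weight / 8 / 1e9

N = 8.0e9  # 8B parameters
for name, bpw in [("F16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
    print(f"{name}: ~{approx_gguf_size_gb(N, bpw):.1f} GB")
```

The estimates land in the same ballpark as the table, not exactly on it.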
 
+ ## Quick Start

  ```bash
  # Download
  huggingface-cli download diffuse-cpp/LLaDA-8B-Instruct-GGUF llada-8b-q4km.gguf

+ # Build diffuse-cpp
+ git clone --recursive https://github.com/iafiscal1212/diffuse-cpp.git
+ cd diffuse-cpp
+ cmake -B build -DCMAKE_BUILD_TYPE=Release
+ cmake --build build -j$(nproc)
+
+ # Run
+ ./build/diffuse-cli -m ../llada-8b-q4km.gguf \
+     --tokens "128000,3923,374,279,6864,315,9822,30" \
+     -n 256 -s 16 -t 12 --remasking entropy_exit
  ```
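The `--tokens` flag takes the prompt as comma-separated token IDs (the IDs in the Quick Start appear to be a Llama-3-style encoding of "What is the capital of France?" with BOS 128000 — an assumption, not documented here). If you already have IDs from a tokenizer, a hypothetical helper for assembling the argument looks like this:

```python
# Hypothetical helper: format a token-ID list into the comma-separated
# string that diffuse-cli's --tokens flag expects. The IDs are the ones
# used in the Quick Start above.
ids = [128000, 3923, 374, 279, 6864, 315, 9822, 30]

def to_tokens_arg(token_ids):
    return ",".join(str(t) for t in token_ids)

cmd = [
    "./build/diffuse-cli", "-m", "llada-8b-q4km.gguf",
    "--tokens", to_tokens_arg(ids),
    "-n", "256", "-s", "16", "-t", "12", "--remasking", "entropy_exit",
]
print(" ".join(cmd))
```

Passing the list through `subprocess.run(cmd)` avoids shell-quoting issues with the comma-separated string.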

+ ## Performance
+
+ Benchmarked on an AMD EPYC 4465P 12-core CPU, Q4_K_M, entropy_exit + inter-step cache, B=256:
+
+ | Prompt | No-cache (tok/s) | Cache (tok/s) | Steps | vs llama.cpp |
+ |--------|------------------|---------------|-------|--------------|
+ | Capital of France? | 17.5 | **24.4** | 3 | 2.9x |
+ | Translate to French | 25.9 | **27.7** | 2 | **3.3x** |
+ | 15 x 23? | 12.8 | **15.7** | 4 | 1.8x |
+ | Translate to Spanish | 7.6 | **22.9** | 7 | 2.7x |
+ | Python is_prime() | 3.2 | **4.9** | 16 | 0.6x |
+ | Poem about ocean | 3.2 | **5.3** | 16 | 0.6x |
+ | Why is sky blue? | 3.3 | **12.0** | 16 | 1.4x |
+ | List the planets | 3.3 | **9.4** | 15 | 1.1x |
+ | **Average** | **9.6** | **15.3** | | **1.8x** |
+
+ - Inter-step cache: 1.6x average speedup with no quality degradation
+ - 6 of 8 prompts outperform llama.cpp (8.51 tok/s baseline)
+ - LLaDA excels at translation tasks (converges in 2-5 steps)
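The headline figures follow directly from the table — averaging the two throughput columns reproduces the 1.6x cache speedup and the 1.8x average over the llama.cpp baseline:

```python
no_cache = [17.5, 25.9, 12.8, 7.6, 3.2, 3.2, 3.3, 3.3]  # tok/s, per prompt
cache    = [24.4, 27.7, 15.7, 22.9, 4.9, 5.3, 12.0, 9.4]
llama_cpp_baseline = 8.51  # tok/s

avg_nc = sum(no_cache) / len(no_cache)
avg_c = sum(cache) / len(cache)
print(f"avg no-cache:  {avg_nc:.1f} tok/s")                 # 9.6
print(f"avg cache:     {avg_c:.1f} tok/s")                  # 15.3
print(f"cache speedup: {avg_c / avg_nc:.1f}x")              # 1.6x
print(f"vs llama.cpp:  {avg_c / llama_cpp_baseline:.1f}x")  # 1.8x
```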

  ## Model Details

+ - **Architecture:** Llama backbone with bidirectional (non-causal) attention
  - **Parameters:** 8B
  - **Layers:** 32
  - **Hidden size:** 4096
+ - **Attention:** MHA (32 query heads, 32 KV heads)
+ - **FFN:** SwiGLU, intermediate size 12288
  - **Vocabulary:** 126,464 tokens
  - **RoPE theta:** 500,000
  - **Mask token ID:** 126336
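The listed dimensions account for the 8B parameter count. A back-of-the-envelope tally (norms and any biases ignored, and an untied output head assumed — both assumptions, not stated in the card):

```python
hidden, layers, inter, vocab = 4096, 32, 12288, 126_464

embed   = vocab * hidden        # token embeddings
attn    = 4 * hidden * hidden   # Wq, Wk, Wv, Wo (MHA, so no GQA shrink on K/V)
ffn     = 3 * hidden * inter    # SwiGLU: gate, up, down projections
lm_head = vocab * hidden        # assuming an untied output head

total = embed + layers * (attn + ffn) + lm_head
print(f"~{total / 1e9:.1f}B parameters")  # ~8.0B
```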

  ## Also Available

+ - **[Dream-v0-Instruct-7B-GGUF](https://huggingface.co/diffuse-cpp/Dream-v0-Instruct-7B-GGUF)** — Qwen2.5 backbone, GQA. Excels at math and code (21.6 tok/s; correctly solves arithmetic in 2 steps).

  ## Citation

  ```bibtex
  @software{diffuse_cpp_2026,
    title={diffuse-cpp: High-Performance Inference for Diffusion Language Models},
+   author={Carmen Esteban},
    year={2026},
    url={https://github.com/iafiscal1212/diffuse-cpp}
  }
  ```

  ## License

+ Apache 2.0