BLADE-LM
Block-Level Autoregressive Diffusion with Causal DiscovEry
BLADE-LM is a hybrid language model built on Qwen2.5-0.5B that combines autoregressive (AR) generation with a diffusion-based think segment. It enables real-time discovery of causal dependencies between token blocks during inference.
⚠️ This model requires a custom inference script. Standard model.generate() will fall back to plain AR generation without the 4-quadrant mask. See Usage below.
Model Details
| Base model | Qwen/Qwen2.5-0.5B |
| Architecture | Dual-path: AR clean segment + Diffusion think segment |
| Sequence length | 1024 (extensible via --seq_len) |
| Block size | 8 tokens/block (128 blocks) |
| Parameters | ~0.5B |
| License | MIT |
How It Works
BLADE-LM processes sequences as two parallel segments:
Input: [ clean segment (L) | think segment (L) ]
           real tokens         all [MASK]
A fixed 4-quadrant attention mask governs interactions between segments:
| Quadrant | Direction | Rule |
|---|---|---|
| Top-Left | clean → clean | AR causal (lower-triangular) |
| Top-Right | clean → think | clean[i] attends think[j < i] |
| Bottom-Left | think → clean | think[i] attends clean[j < i] |
| Bottom-Right | think → think | intra-block bidirectional only |
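The four rules in the table can be sketched as a boolean mask over the concatenated 2L sequence. This is only an illustration, not the repository's implementation: the function name `build_four_quadrant_mask` is hypothetical, and the sketch assumes that `i` and `j` in the two cross-segment rules are block indices (the table does not say whether they index tokens or blocks).

```python
def build_four_quadrant_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return a (2L x 2L) mask where mask[q][k] is True if query q may attend key k.

    Positions 0..L-1 are the clean segment, positions L..2L-1 the think segment.
    ASSUMPTION: the cross-segment rules (j < i) compare block indices.
    """
    if seq_len % block_size != 0:
        raise ValueError("seq_len must be divisible by block_size")

    total = 2 * seq_len
    mask = [[False] * total for _ in range(total)]

    for q in range(total):
        q_is_think = q >= seq_len
        q_pos = q - seq_len if q_is_think else q
        q_block = q_pos // block_size

        for k in range(total):
            k_is_think = k >= seq_len
            k_pos = k - seq_len if k_is_think else k
            k_block = k_pos // block_size

            if not q_is_think and not k_is_think:
                # Top-Left: standard causal (lower-triangular) attention.
                mask[q][k] = k_pos <= q_pos
            elif q_is_think and k_is_think:
                # Bottom-Right: bidirectional, but only inside the same block.
                mask[q][k] = k_block == q_block
            else:
                # Top-Right and Bottom-Left: only strictly earlier blocks
                # of the other segment are visible.
                mask[q][k] = k_block < q_block

    return mask
```

With `seq_len=1024` and `block_size=8` this yields a 2048 × 2048 mask covering 128 blocks per segment.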
Inference uses a two-round process:
- Round 1: Token-by-token draft generation using beta-mixed AR/Diff logits
- Round 2 (optional): Single eager forward pass with the draft as input. This activates K_diff via future Q_clean, enabling causal graph extraction
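As a rough illustration of the beta mixing in Round 1, the sketch below interpolates linearly between the two logit vectors. Linear interpolation in logit space is an assumption; the actual script may mix probabilities or log-probabilities instead. The helper names `mix_logits` and `greedy_token` are hypothetical.

```python
def mix_logits(ar_logits: list[float], diff_logits: list[float], beta: float) -> list[float]:
    """Blend AR and Diff logits; beta=1 is pure AR, beta=0 is pure Diff.

    ASSUMPTION: a simple linear interpolation in logit space.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if len(ar_logits) != len(diff_logits):
        raise ValueError("logit vectors must have the same vocabulary size")
    return [beta * a + (1.0 - beta) * d for a, d in zip(ar_logits, diff_logits)]


def greedy_token(logits: list[float]) -> int:
    """Pick the index of the largest logit (greedy decoding)."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy three-token vocabulary: the two paths disagree on the best token.
ar = [2.0, 0.5, 0.0]
diff = [0.0, 0.5, 3.0]
next_token = greedy_token(mix_logits(ar, diff, beta=0.6))
```

Under this reading, raising beta pushes the draft toward the AR path's preference and lowering it favors the diffusion path.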
Training
| Data | NuminaMath-TIR, StackMathQA, CLadder/e-CARE causal reasoning |
| Mix ratio | 3 : 5 : 2 |
| Steps | 40,000 optimizer steps |
| Batch size | 16 × grad_accum 2 |
| Optimizer | AdamW (betas 0.9/0.95) |
| LR | 2e-5 with cosine decay |
| Loss | l_diff + 0.05 × l_ar + 0.1 × L_intra |
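The loss weighting and learning-rate schedule from the table can be written out as follows. This is a sketch only: the function names are hypothetical, and the schedule assumes no warmup and a decay to zero over the 40,000 steps, neither of which the table specifies.

```python
import math


def total_loss(l_diff: float, l_ar: float, l_intra: float,
               w_ar: float = 0.05, w_intra: float = 0.1) -> float:
    """Weighted sum from the table: l_diff + 0.05 * l_ar + 0.1 * L_intra."""
    return l_diff + w_ar * l_ar + w_intra * l_intra


def cosine_lr(step: int, total_steps: int = 40_000,
              peak_lr: float = 2e-5, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr to min_lr.

    ASSUMPTION: no warmup phase and min_lr = 0; the source only says
    "2e-5 with cosine decay".
    """
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the learning rate is 2e-5 at step 0, about 1e-5 at the midpoint, and 0 at step 40,000.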
Usage
Install dependencies and clone the inference code:
pip install transformers torch pyyaml matplotlib
git clone https://github.com/xiaziye/BLADE-LM.git
cd BLADE-LM
Basic generation
# Using blade_run.py (recommended)
python blade_run.py \
--model_path Hengzongshu/BLADE-LM \
--prompt "Solve: 2x + 5 = 13. Show your work.\n" \
--beta 0.6 \
--max_tokens 200
With causal graph analysis
python blade_run.py \
--model_path Hengzongshu/BLADE-LM \
--prompt "Solve: 2x + 5 = 13. Show your work.\n" \
--beta 0.6 \
--causal \
--causal_out causal.png
Parameters
| Parameter | Default | Description |
|---|---|---|
| --beta | 0.6 | AR/Diff mix ratio. 0 = pure Diff, 1 = pure AR. Recommended: 0.49–0.90 |
| --max_tokens | 200 | Maximum tokens to generate |
| --seq_len | 1024 | Sequence length (must be divisible by 8) |
| --causal | false | Run causal graph analysis after generation |
Causal Discovery
After generation, BLADE-LM can produce a block-level causal graph showing which past blocks influenced which future blocks. Example output on math reasoning:
[1] block 1 → block 33 gap=32 strength=0.0017
    cause j=1: ' + 5 = 13.'
    effect i=33: '2x + 5 = 1'
[2] block 2 → block 21 gap=19 strength=0.0013
    cause j=2: ' Show your work. To solve the equation'
    effect i=21: '4\n \]\n\n3. **'
The model exhibits adaptive causal depth: simple queries produce shallow, linear graphs, while multi-step reasoning produces richer long-range dependencies. This behavior emerges naturally without explicit supervision.
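This card does not spell out how the strength values are computed. One plausible reading, offered purely as a sketch, is to average the attention that clean-segment queries in block i place on think-segment keys in an earlier block j, then rank the resulting edges. The function names, the use of a plain mean, and the layout of the attention matrix (rows are queries, clean segment first) are all assumptions.

```python
def block_strengths(attn: list[list[float]], seq_len: int, block_size: int) -> list[list[float]]:
    """Average clean-query -> think-key attention per (effect block i, cause block j).

    ASSUMPTION: attn is a (2L x 2L) matrix with rows as queries, the clean
    segment at positions 0..L-1 and the think segment at L..2L-1.
    """
    n_blocks = seq_len // block_size
    strengths = [[0.0] * n_blocks for _ in range(n_blocks)]
    for i in range(n_blocks):
        for j in range(i):  # only strictly earlier blocks can be causes
            total = 0.0
            for q in range(i * block_size, (i + 1) * block_size):
                for k in range(seq_len + j * block_size, seq_len + (j + 1) * block_size):
                    total += attn[q][k]
            strengths[i][j] = total / (block_size * block_size)
    return strengths


def top_edges(strengths: list[list[float]], k: int) -> list[dict]:
    """Return the k strongest cause -> effect edges with their block gap."""
    edges = [
        (s, j, i)
        for i, row in enumerate(strengths)
        for j, s in enumerate(row)
        if s > 0.0
    ]
    edges.sort(reverse=True)
    return [
        {"cause": j, "effect": i, "gap": i - j, "strength": s}
        for s, j, i in edges[:k]
    ]
```

Each returned edge carries the same fields as the printed example (cause block, effect block, gap, strength), which makes it easy to compare against the script's own output.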
Experimental Results
Tested across math reasoning, causal reasoning, and logical sorting tasks:
- Global anchors: Prompt tokens consistently influence conclusion blocks across long gaps (gap > 30)
- Structure propagation: Early format-establishing blocks influence all subsequent similar blocks
- Conclusion aggregation: Final answer blocks receive contributions from multiple intermediate reasoning steps
- Adaptive depth: Causal graph complexity scales naturally with task difficulty
Causal signal strength is on the order of 0.001 at 0.5B scale: meaningful but weak. Larger scale is expected to amplify the signal.
Limitations
- Causal signal strength is relatively weak at 0.5B scale
- No KV cache support (custom 2L mask is incompatible with standard DynamicCache)
- Inference is slower than standard AR models due to full 2L forward per token
- Must use blade_run.py for correct inference; model.generate() bypasses the 4-quadrant mask
- Beta may require tuning for different prompt styles
Citation
@misc{blade2025,
title = {BLADE-LM: Block-Level Autoregressive Diffusion with Causal Discovery},
author = {Xia Ziye},
year = {2025},
url = {https://huggingface.co/Hengzongshu/BLADE-LM}
}