BLADE-LM

Block-Level Autoregressive Diffusion with Causal DiscovEry

BLADE-LM is a hybrid language model built on Qwen2.5-0.5B that combines autoregressive (AR) generation with a diffusion-based think segment. It can also surface causal dependencies between token blocks at inference time.

โš ๏ธ This model requires a custom inference script. Standard model.generate() will fall back to plain AR generation without the 4-quadrant mask. See Usage below.


Model Details

Base model        Qwen/Qwen2.5-0.5B
Architecture      Dual-path: AR clean segment + Diffusion think segment
Sequence length   1024 (extensible via --seq_len)
Block size        8 tokens/block (128 blocks)
Parameters        ~0.5B
License           MIT

How It Works

BLADE-LM processes sequences as two parallel segments:

Input:  [ clean segment (L) | think segment (L) ]
         real tokens           all [MASK]
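A minimal sketch of assembling the 2L-token input shown above. It assumes the clean segment is right-padded to L with a pad id and the think segment is L copies of a mask id; `build_blade_input`, `mask_id`, and `pad_id` are hypothetical names, and the actual token ids and padding scheme used by blade_run.py may differ.

```python
from typing import List, Sequence


def build_blade_input(
    clean_ids: Sequence[int], seq_len: int, mask_id: int, pad_id: int
) -> List[int]:
    """Return [clean segment (seq_len) | think segment (seq_len)] as one list.

    Assumption: the clean segment is right-padded with pad_id; the think
    segment is seq_len copies of mask_id (the "all [MASK]" half).
    """
    if len(clean_ids) > seq_len:
        raise ValueError("clean segment longer than seq_len")
    clean = list(clean_ids) + [pad_id] * (seq_len - len(clean_ids))
    think = [mask_id] * seq_len
    return clean + think


# Toy ids for illustration only; these are not real Qwen2.5 vocabulary ids.
ids = build_blade_input([5, 6, 7], seq_len=8, mask_id=99, pad_id=0)
print(len(ids))  # 16
```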

A fixed 4-quadrant attention mask governs interactions between segments:

Quadrant       Direction        Rule
Top-Left       clean → clean    AR causal (lower-triangular)
Top-Right      clean → think    clean[i] attends think[j < i]
Bottom-Left    think → clean    think[i] attends clean[j < i]
Bottom-Right   think → think    intra-block bidirectional only
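The four rules can be written as a boolean mask over the 2L positions. This is a sketch under stated assumptions: positions 0..L-1 are the clean segment and L..2L-1 the think segment, True means "query may attend key", and blocks are contiguous runs of block_size tokens. `build_blade_mask` is a hypothetical helper; the repository may build the mask differently (for example as an additive float mask).

```python
from typing import List


def build_blade_mask(L: int, block_size: int = 8) -> List[List[bool]]:
    """Build a 2L x 2L boolean mask; mask[q][k] is True if query q may attend key k.

    Positions 0..L-1 form the clean segment, L..2L-1 the think segment.
    """
    if L % block_size != 0:
        raise ValueError("L must be divisible by block_size")
    n = 2 * L
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        q_think, i = q >= L, q % L  # segment of the query, index within segment
        for k in range(n):
            k_think, j = k >= L, k % L
            if not q_think and not k_think:
                allowed = j <= i  # Top-Left: AR causal, lower-triangular
            elif not q_think and k_think:
                allowed = j < i  # Top-Right: clean[i] sees think[j < i]
            elif q_think and not k_think:
                allowed = j < i  # Bottom-Left: think[i] sees clean[j < i]
            else:
                allowed = i // block_size == j // block_size  # Bottom-Right: same block
            mask[q][k] = allowed
    return mask


m = build_blade_mask(L=16, block_size=8)
print(m[3][16 + 2], m[3][16 + 3])  # True False
```

Building the mask once per sequence length and reusing it is possible because, per the table, the mask is fixed rather than content-dependent.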

Inference uses a two-round process:

  • Round 1: token-by-token draft generation from beta-mixed AR/Diff logits
  • Round 2 (optional): a single eager forward pass with the draft as input. This activates K_diff via future Q_clean, which enables causal graph extraction
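Round 1 can be sketched as a greedy decoding loop over blended logits. The sketch assumes the blend is a linear interpolation of logits, mixed = beta × AR + (1 − beta) × Diff, which matches the documented endpoints (beta = 0 is pure Diff, beta = 1 is pure AR) but is not confirmed as the exact formula. Greedy argmax selection, the `step` callback returning both logit vectors, and `toy_step` are likewise hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical model interface: token ids so far -> (ar_logits, diff_logits).
StepFn = Callable[[Sequence[int]], Tuple[Sequence[float], Sequence[float]]]


def mix_logits(ar: Sequence[float], diff: Sequence[float], beta: float) -> List[float]:
    """Assumed linear blend: beta = 1 is pure AR, beta = 0 is pure Diff."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if len(ar) != len(diff):
        raise ValueError("AR and Diff logits must cover the same vocabulary")
    return [beta * a + (1.0 - beta) * d for a, d in zip(ar, diff)]


def argmax(xs: Sequence[float]) -> int:
    return max(range(len(xs)), key=lambda idx: xs[idx])


def draft_round1(prompt: Sequence[int], step: StepFn, beta: float,
                 max_tokens: int, eos_id: Optional[int] = None) -> List[int]:
    """Greedy token-by-token draft; each step re-runs the model on the full sequence."""
    tokens = list(prompt)
    draft: List[int] = []
    for _ in range(max_tokens):
        ar, diff = step(tokens)
        next_id = argmax(mix_logits(ar, diff, beta))
        tokens.append(next_id)
        draft.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return draft


def toy_step(tokens: Sequence[int]) -> Tuple[List[float], List[float]]:
    """Toy stand-in model: AR alternates between ids 0 and 1, Diff always prefers id 2."""
    ar = [1.0, 0.0, 0.0] if len(tokens) % 2 == 0 else [0.0, 1.0, 0.0]
    return ar, [0.0, 0.0, 2.0]


print(draft_round1([], toy_step, beta=1.0, max_tokens=4))  # [0, 1, 0, 1]
print(draft_round1([], toy_step, beta=0.0, max_tokens=4))  # [2, 2, 2, 2]
```

Re-running the model on the full sequence at every step mirrors the lack of KV cache support noted under Limitations.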

Training

Data          NuminaMath-TIR, StackMathQA, CLadder/e-CARE causal reasoning
Mix ratio     3 : 5 : 2
Steps         40,000 optimizer steps
Batch size    16 × grad_accum 2
Optimizer     AdamW (betas 0.9/0.95)
LR            2e-5 with cosine decay
Loss          L_diff + 0.05 × L_ar + 0.1 × L_intra
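The table's numbers can be turned into a small configuration sketch. It assumes a single device (so the effective batch is 16 × 2 = 32 sequences per optimizer step, or 1,280,000 sequences over 40,000 steps, not necessarily unique), and a cosine schedule with no warmup that decays to zero; neither the warmup nor the LR floor is stated above. All names are illustrative.

```python
import math

# Values from the training table; dataset keys abbreviate the listed sources.
LR_MAX = 2e-5
TOTAL_STEPS = 40_000
MICRO_BATCH, GRAD_ACCUM = 16, 2
MIX = {"NuminaMath-TIR": 3, "StackMathQA": 5, "CLadder/e-CARE": 2}
W_AR, W_INTRA = 0.05, 0.1


def effective_batch() -> int:
    """Sequences per optimizer step (assumes a single device)."""
    return MICRO_BATCH * GRAD_ACCUM


def mix_probs() -> dict:
    """Turn the 3:5:2 ratio into per-source sampling probabilities."""
    total = sum(MIX.values())
    return {name: w / total for name, w in MIX.items()}


def total_loss(l_diff: float, l_ar: float, l_intra: float) -> float:
    """L_diff + 0.05 * L_ar + 0.1 * L_intra."""
    return l_diff + W_AR * l_ar + W_INTRA * l_intra


def cosine_lr(step: int) -> float:
    """Cosine decay from LR_MAX to 0; assumes no warmup and no LR floor."""
    t = min(max(step, 0), TOTAL_STEPS) / TOTAL_STEPS
    return LR_MAX * 0.5 * (1.0 + math.cos(math.pi * t))


print(effective_batch())                # 32
print(effective_batch() * TOTAL_STEPS)  # 1280000
```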

Usage

Install dependencies and clone the inference code:

pip install transformers torch pyyaml matplotlib
git clone https://github.com/xiaziye/BLADE-LM.git
cd BLADE-LM

Basic generation

# Using blade_run.py (recommended)
python blade_run.py \
    --model_path Hengzongshu/BLADE-LM \
    --prompt "Solve: 2x + 5 = 13. Show your work.\n" \
    --beta 0.6 \
    --max_tokens 200

With causal graph analysis

python blade_run.py \
    --model_path Hengzongshu/BLADE-LM \
    --prompt "Solve: 2x + 5 = 13. Show your work.\n" \
    --beta 0.6 \
    --causal \
    --causal_out causal.png

Parameters

Parameter      Default   Description
--beta         0.6       AR/Diff mix ratio. 0 = pure Diff, 1 = pure AR. Recommended: 0.49–0.90
--max_tokens   200       Maximum tokens to generate
--seq_len      1024      Sequence length (must be divisible by 8)
--causal       false     Run causal graph analysis after generation
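The documented defaults and constraints can be checked with a small argparse mirror. This is a hypothetical parser written for illustration, not blade_run.py's actual argument handling; it only covers the four flags in the table, and the range warning reflects the recommended 0.49–0.90 band.

```python
import argparse
import warnings
from typing import List, Optional


def parse_blade_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Hypothetical parser mirroring the documented flags and their constraints."""
    p = argparse.ArgumentParser(description="Sketch of blade_run.py options")
    p.add_argument("--beta", type=float, default=0.6,
                   help="AR/Diff mix: 0 = pure Diff, 1 = pure AR")
    p.add_argument("--max_tokens", type=int, default=200)
    p.add_argument("--seq_len", type=int, default=1024)
    p.add_argument("--causal", action="store_true",
                   help="run causal graph analysis after generation")
    args = p.parse_args(argv)
    if not 0.0 <= args.beta <= 1.0:
        p.error("--beta must lie in [0, 1]")  # exits with SystemExit
    if args.seq_len % 8 != 0:
        p.error("--seq_len must be divisible by the 8-token block size")
    if not 0.49 <= args.beta <= 0.90:
        warnings.warn("--beta is outside the recommended 0.49-0.90 range")
    return args


args = parse_blade_args(["--beta", "0.7"])
print(args.seq_len // 8)  # 128 blocks
```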

Causal Discovery

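After generation, BLADE-LM can produce a block-level causal graph showing which earlier blocks influenced which later blocks. Each edge lists the cause block j, the effect block i, the gap i − j in blocks, and a strength score. Example output on a math-reasoning prompt: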

[1] block 1 → block 33  gap=32  strength=0.0017
  cause  j=1:  ' + 5 = 13.'
  effect i=33: '2x + 5 = 1'

[2] block 2 → block 21  gap=19  strength=0.0013
  cause  j=2:  ' Show your work. To solve the equation'
  effect i=21: '4\n   \]\n\n3. **'
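One way to obtain edges in this format is to average a clean-query × think-key attention map (the clean → think quadrant that Round 2 activates) over block pairs and rank the results. This is a hypothetical reconstruction: the strength formula (a plain mean over the 8 × 8 block pair), the assumption that heads and layers are already averaged, and the names `block_edges` and `format_edges` are not taken from blade_run.py.

```python
from typing import List, Sequence, Tuple

Edge = Tuple[int, int, float]  # (cause block j, effect block i, strength)


def block_edges(attn: Sequence[Sequence[float]], block_size: int = 8) -> List[Edge]:
    """Average a clean-query x think-key attention map over block pairs (j < i).

    attn[q][k] is assumed to be the (already head/layer-averaged) weight from
    clean query q to think key k from the Round 2 forward pass.
    """
    L = len(attn)
    if L % block_size != 0:
        raise ValueError("attention size must be divisible by block_size")
    n_blocks = L // block_size
    edges: List[Edge] = []
    for i in range(n_blocks):          # effect block
        for j in range(i):             # cause blocks strictly earlier than i
            q_rows = range(i * block_size, (i + 1) * block_size)
            k_cols = range(j * block_size, (j + 1) * block_size)
            total = sum(attn[q][k] for q in q_rows for k in k_cols)
            edges.append((j, i, total / (block_size * block_size)))
    edges.sort(key=lambda e: e[2], reverse=True)  # strongest edges first
    return edges


def format_edges(edges: List[Edge], top_k: int = 2) -> List[str]:
    return [f"[{r}] block {j} → block {i}  gap={i - j}  strength={s:.4f}"
            for r, (j, i, s) in enumerate(edges[:top_k], start=1)]


# Toy 6x6 map with block_size=2 (3 blocks); real maps are L x L with 8-token blocks.
attn = [[0.0] * 6 for _ in range(6)]
attn[4][0] = 0.8   # a block-2 query attends a block-0 key
attn[2][1] = 0.4   # a block-1 query attends a block-0 key
for line in format_edges(block_edges(attn, block_size=2)):
    print(line)
```

Averaging over the whole block pair keeps strengths small when only a few token pairs carry weight, which is one plausible reason scores land near the 0.001 range.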

The model exhibits adaptive causal depth: simple queries produce shallow, linear graphs, while multi-step reasoning produces richer long-range dependencies. This behavior emerges without explicit supervision.


Experimental Results

Patterns observed across math reasoning, causal reasoning, and logical sorting tasks:

  • Global anchors: prompt tokens consistently influence conclusion blocks across long gaps (gap > 30 blocks)
  • Structure propagation: Early format-establishing blocks influence all subsequent similar blocks
  • Conclusion aggregation: Final answer blocks receive contributions from multiple intermediate reasoning steps
  • Adaptive depth: Causal graph complexity scales naturally with task difficulty

Causal signal strength is on the order of 0.001 at 0.5B scale: meaningful but weak. Larger models are expected to amplify the signal.


Limitations

  • Causal signal strength is relatively weak at 0.5B scale
  • No KV cache support (the custom 2L mask is incompatible with the standard DynamicCache)
  • Inference is slower than standard AR models because every generated token requires a full forward pass over all 2L positions
  • Correct inference requires blade_run.py; model.generate() bypasses the 4-quadrant mask
  • Beta may require tuning for different prompt styles
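To give a rough sense of the slowdown, the sketch below counts token positions pushed through the model, a crude proxy that ignores attention's quadratic cost and hardware effects. It assumes Round 1 runs one full 2L forward per generated token, as stated above, and compares that with a KV-cached AR decoder that prefills the prompt once and then feeds one position per additional token. Both helper names and the 20-token prompt are hypothetical.

```python
def blade_positions(new_tokens: int, seq_len: int = 1024, round2: bool = False) -> int:
    """Positions processed: one full 2L forward per generated token, plus one more for Round 2."""
    forwards = new_tokens + (1 if round2 else 0)
    return forwards * 2 * seq_len


def cached_ar_positions(prompt_len: int, new_tokens: int) -> int:
    """Positions processed by a KV-cached AR decoder: prefill once, then one position per later token."""
    return prompt_len + max(new_tokens - 1, 0)


print(blade_positions(200))               # 409600
print(blade_positions(200, round2=True))  # 411648
print(cached_ar_positions(20, 200))       # 219
```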

Citation

@misc{blade2025,
  title  = {BLADE-LM: Block-Level Autoregressive Diffusion with Causal Discovery},
  author = {Xia Ziye},
  year   = {2025},
  url    = {https://huggingface.co/Hengzongshu/BLADE-LM}
}
Weights are distributed in Safetensors format (0.5B params, BF16).