IDLM-DCD

IDLM-DCD is an Inverse-distilled Diffusion Language Model (IDLM) distilled from a Duo-DCD-style teacher checkpoint. It is released with the paper "IDLM: Inverse-distilled Diffusion Language Models".

IDLM extends inverse distillation to discrete token spaces. This checkpoint targets a stronger distilled Duo-DCD teacher and is the fastest of the released OpenWebText IDLM checkpoints in the reported low-step setting.

Model Details

  • Model family: IDLM, discrete diffusion language model
  • Teacher checkpoint: s-sahoo/duo-distilled
  • Diffusion type: uniform-state / Duo-DCD-style diffusion
  • Training data: OpenWebText
  • Tokenizer: GPT-2 tokenizer
  • Context length: 1024 tokens
  • Parameters: 169,627,250
  • Tensor type: F32 Safetensors
  • Architecture config: 12 blocks, 12 heads, hidden size 768, conditioning dimension 128, dropout 0.1
  • License: MIT
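As a rough sizing aid, the parameter count and tensor type listed above imply the weight footprint computed below. This is plain arithmetic on the two listed values; it covers weights only and says nothing about activations or sampler buffers.

```python
# Parameter count and dtype are taken from the Model Details list.
NUM_PARAMS = 169_627_250
BYTES_PER_F32 = 4  # a 32-bit float occupies 4 bytes

weight_bytes = NUM_PARAMS * BYTES_PER_F32
weight_mib = weight_bytes / 2**20

print(f"F32 weights: {weight_bytes:,} bytes (~{weight_mib:.0f} MiB)")
```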

Intended Use

This checkpoint is intended for research on diffusion language models, inverse distillation, and very low-step discrete diffusion sampling.

Installation

The sampling code depends on CUDA and FlashAttention.

git clone https://github.com/David-cripto/IDLM.git
cd IDLM

conda create -n idlm python=3.12
conda activate idlm
conda install nvidia/label/cuda-12.4.0::cuda-toolkit
pip install -r requirements.txt
pip install flash_attn==2.7.4.post1

Loading the Checkpoint

The Hugging Face repository contains custom model code, so pass trust_remote_code=True when loading the model.

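from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "kekchpek/idlm-dcd"

# GPT-2 tokenizer shipped with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_id)

# trust_remote_code=True lets transformers run the repository's custom model code.
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    trust_remote_code=True,
)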

Direct AutoModelForMaskedLM loading exposes the denoising network. For text generation, use the sampler in the official IDLM repository.

Generate Samples

mkdir -p samples

python -m main \
  mode=sample_eval \
  loader.batch_size=2 \
  loader.eval_batch_size=8 \
  data=openwebtext-split \
  algo=duo \
  algo.backbone=hf_dit \
  eval.checkpoint_path=kekchpek/idlm-dcd \
  sampling.steps=4 \
  sampling.num_sample_batches=10 \
  sampling.noise_removal=greedy \
  +wandb.offline=true \
  eval.generated_samples_path=samples/idlm_dcd_4steps.json

The generation script can be swept with different sampling steps. The paper reports both ancestral (a) and Greedy-Tail (g) sampling variants.
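One way to sweep the step count is to build one command per setting, as in the sketch below. It reuses the overrides from the command above unchanged and varies only sampling.steps and the output path. The helper name build_sample_command and the list of step counts are illustrative choices, not part of the IDLM repository. The sketch keeps sampling.noise_removal=greedy throughout, because the override value for the ancestral variant is not given here.

```python
import shlex

CHECKPOINT = "kekchpek/idlm-dcd"
STEP_COUNTS = [4, 8, 16, 32]  # step counts listed in the Evaluation section


def build_sample_command(steps: int) -> list[str]:
    """Return the sampling command from this card with a given step count.

    Hypothetical helper: only sampling.steps and the output path change.
    """
    return [
        "python", "-m", "main",
        "mode=sample_eval",
        "loader.batch_size=2",
        "loader.eval_batch_size=8",
        "data=openwebtext-split",
        "algo=duo",
        "algo.backbone=hf_dit",
        f"eval.checkpoint_path={CHECKPOINT}",
        f"sampling.steps={steps}",
        "sampling.num_sample_batches=10",
        "sampling.noise_removal=greedy",
        "+wandb.offline=true",
        f"eval.generated_samples_path=samples/idlm_dcd_{steps}steps.json",
    ]


commands = [build_sample_command(steps) for steps in STEP_COUNTS]

# Print the commands instead of running them, so this sketch has no side effects.
for cmd in commands:
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell from the IDLM repository root, or each list in commands can be passed to subprocess.run.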

Evaluation

The paper reports generation perplexity (GenPPL, lower is better) and sample entropy (higher is better) on OpenWebText-style generation. The released evaluation code defaults to gpt2-large for GenPPL.
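To make the two metrics concrete, here is a minimal sketch of how they are commonly computed. It assumes GenPPL is the exponential of the mean per-token negative log-likelihood under the scoring model, and that sample entropy is the Shannon entropy (in nats) of the empirical token distribution within one generated sample. Both definitions are assumptions for illustration; the released evaluation code is the reference for the exact procedure. The function names are hypothetical.

```python
import math
from collections import Counter


def generative_perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) over tokens.

    token_log_probs: per-token log-probabilities assigned by a scoring
    model (for example gpt2-large) to a generated sample.
    """
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def sample_entropy(token_ids):
    """Shannon entropy (nats) of the empirical token distribution of one sample."""
    counts = Counter(token_ids)
    total = len(token_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Toy inputs: every token scored with probability 1/50, and a sample
# that uses four distinct token ids equally often.
toy_log_probs = [math.log(1 / 50)] * 8
toy_tokens = [11, 22, 33, 44, 11, 22, 33, 44]

print(generative_perplexity(toy_log_probs))
print(sample_entropy(toy_tokens))
```

Under these definitions a degenerate sample that repeats a single token has entropy 0, which is why entropy is reported next to GenPPL: a low GenPPL alone can be reached by repetitive text.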

Sampling steps  Sampler      GenPPL (lower is better)  Entropy (higher is better)
32              Greedy-Tail  38.57                     5.35
16              Greedy-Tail  43.21                     5.41
8               Greedy-Tail  53.55                     5.41
4               Greedy-Tail  77.49                     5.28
32              Ancestral    42.03                     5.41
16              Ancestral    51.86                     5.44
8               Ancestral    66.31                     5.42
4               Ancestral    111.01                    5.32

For comparison, the Duo-DCD teacher/baseline is reported at 32 steps with GenPPL 46.31 / entropy 5.38 under Greedy-Tail sampling and GenPPL 61.31 / entropy 5.52 under ancestral sampling.
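The comparison can be read off mechanically. The sketch below copies the GenPPL values reported above and finds, for each sampler, the smallest IDLM-DCD step count whose GenPPL is at or below the 32-step Duo-DCD value. It considers GenPPL only and ignores the entropy column; the helper name is hypothetical.

```python
# GenPPL values copied from the Evaluation section of this card.
idlm_genppl = {
    "Greedy-Tail": {32: 38.57, 16: 43.21, 8: 53.55, 4: 77.49},
    "Ancestral": {32: 42.03, 16: 51.86, 8: 66.31, 4: 111.01},
}
teacher_genppl_32 = {"Greedy-Tail": 46.31, "Ancestral": 61.31}


def fewest_steps_matching_teacher(sampler):
    """Smallest step count with GenPPL at or below the 32-step teacher value."""
    matching = [
        steps
        for steps, ppl in idlm_genppl[sampler].items()
        if ppl <= teacher_genppl_32[sampler]
    ]
    return min(matching) if matching else None


for sampler in idlm_genppl:
    print(f"{sampler}: {fewest_steps_matching_teacher(sampler)} steps")
```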

Training Summary

IDLM-DCD was trained by initializing both the student and the fake model from a Duo-DCD-style teacher, then alternating between:

  1. Updating the fake model on student-generated samples using the teacher diffusion loss.
  2. Updating the student using the teacher-fake loss gap.
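The alternation above can be illustrated with a deliberately tiny toy. This is not the IDLM objective: it replaces token sequences with scalars, replaces the diffusion loss with a squared error, and treats each model as a single number. The teacher is a fixed constant predictor, the student generates samples around a learnable mean, and the fake model is a constant predictor fitted to student samples. Unlike the real setup, the toy student starts away from the teacher so that the effect of the updates is visible. All names and hyperparameters are hypothetical.

```python
import random

rng = random.Random(0)

TEACHER_MEAN = 2.0   # toy teacher: constant predictor that is optimal on "real" data
student_mean = -1.0  # toy student: mean of the distribution it samples from
fake_mean = -1.0     # toy fake model: constant predictor fitted to student samples
LR_FAKE, LR_STUDENT, BATCH = 0.25, 0.05, 256


def loss(pred, samples):
    """Squared-error stand-in for the diffusion loss of a constant predictor."""
    return sum((pred - x) ** 2 for x in samples) / len(samples)


for _ in range(300):
    # Student samples via reparameterization: x = student_mean + noise.
    samples = [student_mean + rng.gauss(0.0, 1.0) for _ in range(BATCH)]
    batch_mean = sum(samples) / BATCH

    # Step 1: update the fake model on student samples with the same loss.
    # d/d(fake_mean) loss(fake_mean, samples) = 2 * (fake_mean - batch_mean)
    fake_mean -= LR_FAKE * 2.0 * (fake_mean - batch_mean)

    # Step 2: update the student on the gap loss(teacher) - loss(fake),
    # differentiating through the samples with both predictors held fixed.
    # d/d(student_mean) of the gap = 2 * (fake_mean - TEACHER_MEAN)
    student_mean -= LR_STUDENT * 2.0 * (fake_mean - TEACHER_MEAN)

# Evaluate the remaining teacher-fake loss gap on a fresh student batch.
samples = [student_mean + rng.gauss(0.0, 1.0) for _ in range(BATCH)]
gap = loss(TEACHER_MEAN, samples) - loss(fake_mean, samples)
print(f"student_mean={student_mean:.3f}, loss gap={gap:.4f}")
```

In this toy the gap shrinks as the student's samples become ones for which the teacher is already the best predictor, which is the intuition behind using the teacher-fake loss gap as the student's training signal.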

This checkpoint is designed for the very low-step setting, especially Greedy-Tail sampling.

Citation

@article{li2026idlm,
  title={IDLM: Inverse-distilled Diffusion Language Models},
  author={Li, David and Gushchin, Nikita and Abulkhanov, Dmitry and Moulines, Eric and Oseledets, Ivan and Panov, Maxim and Korotin, Alexander},
  journal={arXiv preprint arXiv:2602.19066},
  year={2026}
}