IDLM-DCD

IDLM-DCD is an Inverse-distilled Diffusion Language Model (IDLM) distilled from a Duo-DCD-style teacher checkpoint. It is released alongside the paper "IDLM: Inverse-distilled Diffusion Language Models".

IDLM extends inverse distillation to discrete token spaces. This checkpoint targets a stronger distilled Duo-DCD teacher and is the fastest of the released OpenWebText IDLM checkpoints in the reported low-step setting.

Project page: https://david-cripto.github.io/idlm-project-page/
Code: https://github.com/David-cripto/IDLM
Paper: https://arxiv.org/abs/2602.19066

Model Details

Model family: IDLM, discrete diffusion language model
Teacher checkpoint: s-sahoo/duo-distilled
Diffusion type: uniform-state / Duo-DCD-style diffusion
Training data: OpenWebText
Tokenizer: GPT-2 tokenizer
Context length: 1024 tokens
Parameters: 169,627,250
Tensor type: F32 Safetensors
Architecture config: 12 blocks, 12 heads, hidden size 768, conditioning dimension 128, dropout 0.1
License: MIT
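
As a rough sanity check on the figures above, the weight footprint follows directly from the parameter count and the F32 tensor type. The sketch below is only arithmetic on the listed values; the variable names are illustrative, and it ignores activations, optimizer state, and any framework overhead.

```python
# Back-of-the-envelope weight size from the values listed in Model Details.
NUM_PARAMETERS = 169_627_250   # "Parameters" entry above
BYTES_PER_F32 = 4              # F32 stores each weight in 4 bytes
HIDDEN_SIZE = 768
NUM_HEADS = 12

weight_bytes = NUM_PARAMETERS * BYTES_PER_F32
weight_mib = weight_bytes / 2**20
head_dim = HIDDEN_SIZE // NUM_HEADS  # per-head width, assuming an even split

print(f"weights: {weight_bytes:,} bytes (~{weight_mib:.0f} MiB)")
print(f"per-head dimension: {head_dim}")
```

This works out to 678,509,000 bytes, or about 647 MiB of weights.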

Intended Use

This checkpoint is intended for research on diffusion language models, inverse distillation, and very low-step discrete diffusion sampling.

Installation

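The sampling code requires CUDA and FlashAttention. The steps below create a Python 3.12 conda environment with the CUDA 12.4 toolkit and flash_attn 2.7.4.post1.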

git clone https://github.com/David-cripto/IDLM.git
cd IDLM

conda create -n idlm python=3.12
conda activate idlm
conda install nvidia/label/cuda-12.4.0::cuda-toolkit
pip install -r requirements.txt
pip install flash_attn==2.7.4.post1
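
Before running the sampler, it can help to confirm that the environment actually contains the packages the steps above are meant to provide. The following is a minimal sketch using only the standard library. The helper name missing_packages is hypothetical, and the assumption that requirements.txt installs torch and transformers is not stated above.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Assumed import names: flash_attn comes from the pip command above;
# torch and transformers are assumed to be pulled in by requirements.txt.
required = ["torch", "transformers", "flash_attn"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

This checks only that the packages are installed. It does not verify that a CUDA device is visible or that FlashAttention was built against the installed CUDA toolkit.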

Loading the Checkpoint

The Hugging Face repository contains custom model code, so pass trust_remote_code=True when loading the model.

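from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "kekchpek/idlm-dcd"
# The checkpoint uses the GPT-2 tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_id)
# trust_remote_code=True lets transformers run the custom model code shipped in the repository.
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    trust_remote_code=True,
)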

Loading through AutoModelForMaskedLM gives you only the denoising network, not a generation loop. For text generation, use the sampler in the official IDLM repository.

Generate Samples

From the root of the cloned IDLM repository, the command below generates samples with 4 sampling steps and greedy noise removal, and writes them to samples/idlm_dcd_4steps.json.

mkdir -p samples

python -m main \
  mode=sample_eval \
  loader.batch_size=2 \
  loader.eval_batch_size=8 \
  data=openwebtext-split \
  algo=duo \
  algo.backbone=hf_dit \
  eval.checkpoint_path=kekchpek/idlm-dcd \
  sampling.steps=4 \
  sampling.num_sample_batches=10 \
  sampling.noise_removal=greedy \
  +wandb.offline=true \
  eval.generated_samples_path=samples/idlm_dcd_4steps.json

To evaluate other step counts, rerun the command above with a different sampling.steps value. The paper reports two sampling variants: ancestral (a) and Greedy-Tail (g).
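
One way to script such a sweep is to rebuild the sampling command once per step count. The sketch below only assembles the argument lists and prints them. The helper name build_sample_command and the output-file naming for other step counts are assumptions that generalize the 4-step example; every other override is copied from the command above.

```python
import shlex

def build_sample_command(steps, out_dir="samples"):
    """Assemble the sampling command shown above for a given step count."""
    return [
        "python", "-m", "main",
        "mode=sample_eval",
        "loader.batch_size=2",
        "loader.eval_batch_size=8",
        "data=openwebtext-split",
        "algo=duo",
        "algo.backbone=hf_dit",
        "eval.checkpoint_path=kekchpek/idlm-dcd",
        f"sampling.steps={steps}",
        "sampling.num_sample_batches=10",
        "sampling.noise_removal=greedy",
        "+wandb.offline=true",
        # Assumed naming pattern, generalized from idlm_dcd_4steps.json.
        f"eval.generated_samples_path={out_dir}/idlm_dcd_{steps}steps.json",
    ]

# Step counts reported in the Evaluation table.
for steps in (4, 8, 16, 32):
    command = build_sample_command(steps)
    print(shlex.join(command))
    # To actually run it from the IDLM repository root:
    # subprocess.run(command, check=True)  # requires `import subprocess`
```

Printing the commands first makes it easy to inspect the overrides before committing GPU time to the full sweep.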

Evaluation

The paper reports generation perplexity (GenPPL, lower is better) and sample entropy (higher is better) on OpenWebText-style generation. The released evaluation code defaults to gpt2-large for GenPPL.
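
For orientation, the two metrics can be sketched in a few lines. This assumes GenPPL is the exponential of the mean per-token negative log-likelihood assigned by the evaluator model, and that sample entropy is the Shannon entropy (in nats) of the token frequencies within one generated sequence. The paper and the released evaluation code may differ in details such as averaging and the logarithm base. The function names are hypothetical, and the inputs are plain Python lists rather than model outputs.

```python
import math
from collections import Counter

def generative_perplexity(token_log_probs):
    """exp(mean negative log-likelihood) over evaluator log-probabilities (nats)."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

def sample_entropy(token_ids):
    """Shannon entropy, in nats, of the token frequencies in one sequence."""
    counts = Counter(token_ids)
    total = len(token_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Toy inputs: an evaluator that gives every token probability 0.5,
# and a sequence that uses two distinct tokens equally often.
print(generative_perplexity([math.log(0.5)] * 4))
print(sample_entropy([7, 7, 9, 9]))
```

Under this reading, a low GenPPL paired with a sharply lower entropy would suggest repetitive samples, which is a common reason for reporting the two numbers together.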

Sampling steps	Sampler	GenPPL (lower is better)	Entropy (higher is better)
32	Greedy-Tail	38.57	5.35
16	Greedy-Tail	43.21	5.41
8	Greedy-Tail	53.55	5.41
4	Greedy-Tail	77.49	5.28
32	Ancestral	42.03	5.41
16	Ancestral	51.86	5.44
8	Ancestral	66.31	5.42
4	Ancestral	111.01	5.32

For comparison, the Duo-DCD teacher/baseline is reported at 32 steps with GenPPL 46.31 / entropy 5.38 under Greedy-Tail sampling and GenPPL 61.31 / entropy 5.52 under ancestral sampling.
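
The comparison can be made concrete with a little arithmetic on the numbers above. The sketch below hard-codes the table and the 32-step teacher figures, then reports the relative GenPPL reduction at 32 steps and the smallest listed step count at which this checkpoint's GenPPL is already below the teacher's 32-step value. It looks at GenPPL only and ignores entropy, so treat it as a reading aid rather than a full quality comparison. The variable names are illustrative.

```python
# GenPPL of IDLM-DCD by sampler and step count, copied from the table above.
student_genppl = {
    "Greedy-Tail": {32: 38.57, 16: 43.21, 8: 53.55, 4: 77.49},
    "Ancestral": {32: 42.03, 16: 51.86, 8: 66.31, 4: 111.01},
}
# Duo-DCD teacher GenPPL at 32 steps, from the paragraph above.
teacher_genppl_32 = {"Greedy-Tail": 46.31, "Ancestral": 61.31}

summary = {}
for sampler, by_steps in student_genppl.items():
    teacher = teacher_genppl_32[sampler]
    reduction = (teacher - by_steps[32]) / teacher
    # Smallest listed step count whose GenPPL is below the teacher's 32-step value.
    matching = min(steps for steps, ppl in by_steps.items() if ppl < teacher)
    summary[sampler] = (reduction, matching)
    print(f"{sampler}: {reduction:.1%} lower GenPPL at 32 steps; "
          f"below the teacher's 32-step GenPPL from {matching} steps")
```

With these numbers, the reduction at 32 steps is roughly 16.7% for Greedy-Tail and 31.4% for ancestral sampling, and in both cases the 16-step result is already below the teacher's 32-step GenPPL.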

Training Summary

IDLM-DCD was trained by initializing both the student and the fake model from a Duo-DCD-style teacher, then alternating between two updates:

1. Updating the fake model on student-generated samples using the teacher diffusion loss.
2. Updating the student using the teacher-fake loss gap.
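
Written as objectives, the two alternating updates could look as follows. This is a sketch under assumptions: the notation is introduced here for illustration ($p_\theta$ for the distribution of student samples, $f^{*}$ for the frozen teacher, $f_\phi$ for the fake model, $\mathcal{L}_{\text{diff}}$ for the teacher's diffusion loss), and the sign convention of the gap follows the usual inverse-distillation formulation rather than anything stated above. The paper gives the exact objective.

```latex
% Fake-model step: fit f_phi to student samples with the diffusion loss.
\min_{\phi}\; \mathcal{L}_{\text{diff}}\bigl(f_{\phi};\, p_{\theta}\bigr)

% Student step: shrink the gap between teacher and fake losses on student samples.
\min_{\theta}\; \Bigl[\, \mathcal{L}_{\text{diff}}\bigl(f^{*};\, p_{\theta}\bigr)
  - \mathcal{L}_{\text{diff}}\bigl(f_{\phi};\, p_{\theta}\bigr) \Bigr]
```

In practice each line would be approximated by a few gradient steps rather than solved exactly. If the fake model is optimal for the student's samples, the gap is non-negative and vanishes when the teacher is itself optimal for those samples, which is the intuition for driving it toward zero.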

This checkpoint is designed for the very low-step setting, especially Greedy-Tail sampling.

Citation

@article{li2026idlm,
  title={IDLM: Inverse-distilled Diffusion Language Models},
  author={Li, David and Gushchin, Nikita and Abulkhanov, Dmitry and Moulines, Eric and Oseledets, Ivan and Panov, Maxim and Korotin, Alexander},
  journal={arXiv preprint arXiv:2602.19066},
  year={2026}
}

Paper: IDLM: Inverse-distilled Diffusion Language Models (arXiv:2602.19066, published Feb 22)