IDLM-MDLM

IDLM-MDLM is an Inverse-distilled Diffusion Language Model (IDLM) obtained from the pretrained MDLM OpenWebText checkpoint. It accompanies the paper "IDLM: Inverse-distilled Diffusion Language Models".

Diffusion Language Models (DLMs) can produce high-quality text, but standard reverse diffusion requires many sampling steps. IDLM trains a few-step student generator from a pretrained DLM teacher using an inverse distillation objective with an auxiliary fake model. This checkpoint targets fast generation from an absorbing-state (masked) diffusion teacher.

Project page: https://david-cripto.github.io/idlm-project-page/
Code: https://github.com/David-cripto/IDLM
Paper: https://arxiv.org/abs/2602.19066

Model Details

Model family: IDLM, discrete diffusion language model
Teacher checkpoint: kuleshov-group/mdlm-owt
Diffusion type: absorbing-state / masked diffusion
Training data: OpenWebText
Tokenizer: GPT-2 tokenizer
Context length: 1024 tokens
Parameters: 169,627,250
Tensor type: F32 Safetensors
Architecture config: 12 blocks, 12 heads, hidden size 768, conditioning dimension 128, dropout 0.1
License: MIT
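
As a rough sizing aid, the parameter count and tensor type above imply the raw weight footprint. The sketch below assumes 4 bytes per F32 value and covers the weights only, ignoring activations and any framework overhead; the helper name `weight_bytes` is illustrative and not part of the IDLM code.

```python
# Rough weight-memory estimate from the figures in Model Details.
PARAMETERS = 169_627_250   # "Parameters" entry above
BYTES_PER_F32 = 4          # one F32 value occupies 4 bytes


def weight_bytes(num_parameters: int, bytes_per_value: int = BYTES_PER_F32) -> int:
    """Return the raw size of the weights in bytes (weights only)."""
    return num_parameters * bytes_per_value


total_bytes = weight_bytes(PARAMETERS)
# Report both decimal megabytes and binary mebibytes.
print(f"{total_bytes / 1e6:.1f} MB ({total_bytes / 2**20:.1f} MiB)")
```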

Intended Use

This checkpoint is intended for research on discrete diffusion language models, few-step diffusion sampling, and reproduction of the IDLM paper experiments.

Installation

The sampling code depends on CUDA and FlashAttention.

git clone https://github.com/David-cripto/IDLM.git
cd IDLM

conda create -n idlm python=3.12
conda activate idlm
conda install nvidia/label/cuda-12.4.0::cuda-toolkit
pip install -r requirements.txt
pip install flash_attn==2.7.4.post1
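
Before running the sampler, it can help to confirm that the environment actually provides the dependencies named above. The following is a minimal sketch and not part of the IDLM repository: `check_environment` is a hypothetical helper that only tests whether `torch` and `flash_attn` are importable and whether PyTorch reports a usable CUDA device.

```python
import importlib.util


def check_environment() -> dict:
    """Report whether torch, flash_attn, and a CUDA device are available."""
    status = {
        # find_spec returns None when a top-level package is not installed.
        "torch": importlib.util.find_spec("torch") is not None,
        "flash_attn": importlib.util.find_spec("flash_attn") is not None,
        "cuda": False,
    }
    if status["torch"]:
        # Import torch only when it is present, then ask it about CUDA.
        import torch

        status["cuda"] = bool(torch.cuda.is_available())
    return status


for name, available in check_environment().items():
    print(f"{name}: {'ok' if available else 'missing'}")
```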

Loading the Checkpoint

The Hugging Face repository contains custom model code, so pass trust_remote_code=True when loading the model.

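from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "kekchpek/idlm-mdlm"
# The checkpoint uses the GPT-2 tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_id)
# trust_remote_code=True is needed because the model class ships with the repository.
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    trust_remote_code=True,
)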

Loading through AutoModelForMaskedLM exposes the denoising network only. For text generation, use the sampler in the official IDLM repository.

Generate Samples

mkdir -p samples

python -m main \
  mode=sample_eval \
  loader.batch_size=2 \
  loader.eval_batch_size=8 \
  data=openwebtext-split \
  algo=mdlm \
  algo.backbone=hf_dit \
  eval.checkpoint_path=kekchpek/idlm-mdlm \
  sampling.steps=16 \
  sampling.num_sample_batches=10 \
  sampling.predictor=ancestral_cache \
  sampling.noise_removal=ancestral \
  +wandb.offline=true \
  eval.generated_samples_path=samples/idlm_mdlm_16steps.json

To compare quality across step counts, rerun the command with different values of sampling.steps and a matching eval.generated_samples_path.
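
One way to run such a sweep is to generate the command once per step count. The sketch below reuses only the overrides from the command above. The step values 4, 8, 16, and 32 match the Evaluation table; the per-run output filename pattern and the helper name `build_command` are assumptions made for illustration.

```python
import shlex

STEP_COUNTS = [4, 8, 16, 32]  # step counts reported in the Evaluation table


def build_command(steps: int) -> list[str]:
    """Return the sampling command from above with `steps` substituted."""
    return [
        "python", "-m", "main",
        "mode=sample_eval",
        "loader.batch_size=2",
        "loader.eval_batch_size=8",
        "data=openwebtext-split",
        "algo=mdlm",
        "algo.backbone=hf_dit",
        "eval.checkpoint_path=kekchpek/idlm-mdlm",
        f"sampling.steps={steps}",
        "sampling.num_sample_batches=10",
        "sampling.predictor=ancestral_cache",
        "sampling.noise_removal=ancestral",
        "+wandb.offline=true",
        # Assumed naming pattern: one output file per step count.
        f"eval.generated_samples_path=samples/idlm_mdlm_{steps}steps.json",
    ]


for steps in STEP_COUNTS:
    # Print a copy-pasteable line for each run.
    print(shlex.join(build_command(steps)))
```

To execute the runs instead of printing them, each list can be passed to `subprocess.run(..., check=True)` from the directory where `python -m main` is normally invoked.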

Evaluation

The paper reports generation perplexity (GenPPL, lower is better) and sample entropy (higher is better) on OpenWebText-style generation. The released evaluation code defaults to gpt2-large for GenPPL.
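
To make the two metrics concrete, the sketch below computes them on toy inputs. It assumes GenPPL is the exponential of the mean per-token negative log-likelihood under the scoring model, and that entropy is the Shannon entropy (in nats) of the token frequency distribution within one sample. The released evaluation code may differ in details such as batching, truncation, or averaging, and the function names here are illustrative.

```python
import math
from collections import Counter


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood; inputs are natural-log probabilities."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


def token_entropy(token_ids):
    """Shannon entropy (nats) of the empirical token distribution in one sample."""
    counts = Counter(token_ids)
    total = len(token_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Toy inputs: a scorer that assigns probability 0.25 to every token,
# and a sample that uses four distinct tokens equally often.
log_probs = [math.log(0.25)] * 8
sample = [10, 11, 12, 13, 10, 11, 12, 13]

print(perplexity(log_probs))   # approximately 4.0
print(token_entropy(sample))   # approximately log(4)
```

Under these definitions a repetitive sample has low entropy, which is one reason to read GenPPL together with entropy: a low GenPPL can also be reached by degenerate, repetitive text.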

Sampling steps	GenPPL (lower is better)	Entropy (higher is better)
32	20.37	5.23
16	32.74	5.42
8	79.42	5.61
4	310.38	5.78

For comparison, the MDLM teacher is reported at 1024 steps with GenPPL 41.29 and entropy 5.28.
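
The trade-off can be read off the reported numbers directly. The sketch below only restates the values from the table and the teacher comparison above, computes how many times fewer sampling steps each setting uses than the 1024-step teacher, and flags which settings reach a lower GenPPL than the teacher. It introduces no new measurements.

```python
TEACHER_STEPS = 1024
TEACHER_GENPPL = 41.29

# (sampling steps, GenPPL, entropy) as reported in the table above
STUDENT_RESULTS = [
    (32, 20.37, 5.23),
    (16, 32.74, 5.42),
    (8, 79.42, 5.61),
    (4, 310.38, 5.78),
]

summary = []
for steps, genppl, entropy in STUDENT_RESULTS:
    reduction = TEACHER_STEPS // steps        # how many times fewer steps
    beats_teacher = genppl < TEACHER_GENPPL   # lower GenPPL is better
    summary.append((steps, reduction, beats_teacher))
    print(f"{steps:>2} steps: {reduction}x fewer steps, "
          f"GenPPL below teacher: {beats_teacher}")
```

On these figures, the 16-step and 32-step settings have lower GenPPL than the 1024-step teacher, while the teacher's entropy (5.28) sits between the 32-step and 16-step values.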

Training Summary

IDLM-MDLM was trained by initializing the student and fake model from the pretrained MDLM teacher and alternating between:

1. Updating the fake model on student-generated samples using the teacher diffusion loss.
2. Updating the student using the teacher-fake loss gap.

This follows the inverse distillation objective described in the paper and uses the absorbing-state masked diffusion formulation.
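
The alternation described above can be sketched as a simple schedule. This is a structural illustration only: `update_fake` and `update_student` are hypothetical callbacks standing in for the two optimizer steps, the `fake_steps_per_round` ratio is an assumption rather than a setting taken from the paper, and the actual losses are defined in the official repository.

```python
from typing import Callable, List


def alternate_updates(
    num_rounds: int,
    update_fake: Callable[[int], None],
    update_student: Callable[[int], None],
    fake_steps_per_round: int = 1,  # assumed ratio, not taken from the paper
) -> List[str]:
    """Run the two updates in alternation and return the order they ran in."""
    order: List[str] = []
    for round_idx in range(num_rounds):
        # First, fit the fake model to the current student's samples.
        for _ in range(fake_steps_per_round):
            update_fake(round_idx)
            order.append("fake")
        # Then, update the student from the teacher-fake loss gap.
        update_student(round_idx)
        order.append("student")
    return order


# Placeholder callbacks so the sketch runs without any model code.
schedule = alternate_updates(
    num_rounds=2,
    update_fake=lambda round_idx: None,
    update_student=lambda round_idx: None,
)
print(schedule)
```

In a real training loop the callbacks would wrap the forward pass, loss computation, and optimizer step for each model, with the teacher kept frozen.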

Citation

@article{li2026idlm,
  title={IDLM: Inverse-distilled Diffusion Language Models},
  author={Li, David and Gushchin, Nikita and Abulkhanov, Dmitry and Moulines, Eric and Oseledets, Ivan and Panov, Maxim and Korotin, Alexander},
  journal={arXiv preprint arXiv:2602.19066},
  year={2026}
}

