SIFT-VTON: Geometric Correspondence Supervision on Cross-Attention for Virtual Try-On

ICPR 2026

SIFT-VTON is a diffusion-based virtual try-on model that uses SIFT feature correspondences between a garment image and a person image to supervise cross-attention maps during training, improving geometric alignment in the generated results.

Paper: arXiv:2605.01296

This model is derived from StableVITON and built on a Stable Diffusion backbone.

The code repository is available at https://github.com/takesukeDS/SIFT-VTON.

Model Files

| File | Description |
|---|---|
| `model.ckpt` | Model checkpoint |
| `config.yaml` | Model architecture config |

Requirements

Clone the code repository and set up the environment:

git clone https://github.com/takesukeDS/SIFT-VTON
cd SIFT-VTON

conda create -n siftvton python==3.12.8 -y
conda activate siftvton

pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu128
pip install matplotlib einops omegaconf yacs
pip install pytorch-lightning==2.5.2
pip install open-clip-torch==3.1.0
pip install diffusers==0.34.0
pip install scipy==1.16.1
pip install transformers==4.55.0
conda install -c anaconda ipython -y
pip install scikit-image clean-fid albumentations==2.0.8
pip install -U xformers==0.0.31.post1
pip install tensorboard
pip install accelerate==1.10.0
pip install numpy==2.2.6
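
Because the commands above pin many packages, a quick check that the environment matches the pins can help before running inference. The following is a minimal sketch, not part of the repository: the names `PINNED`, `installed_version`, and `find_mismatches` are hypothetical, and the pins are copied from the install commands above.

```python
from importlib import metadata

# Pins copied from the install commands above (pip distribution names).
PINNED = {
    "torch": "2.7.1",
    "torchvision": "0.22.1",
    "pytorch-lightning": "2.5.2",
    "open-clip-torch": "3.1.0",
    "diffusers": "0.34.0",
    "scipy": "1.16.1",
    "transformers": "4.55.0",
    "albumentations": "2.0.8",
    "xformers": "0.0.31.post1",
    "accelerate": "1.10.0",
    "numpy": "2.2.6",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package that deviates from its pin to (expected, found)."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        # Ignore local build tags such as "+cu128" that CUDA wheels append.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

Run inside the activated siftvton environment, the script prints nothing when every pinned package matches.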

Data

Download the VITON-HD dataset and prepare the following directory structure:

[data_root_dir]
└── test
    ├── image
    ├── image-densepose
    ├── agnostic-v3.2
    ├── agnostic-mask
    ├── cloth
    └── cloth-mask

A pairs file named yahavton_test_pairs.txt is also required under [data_root_dir]. Each line lists one person image filename and one cloth filename, separated by a space:

image_00001.jpg cloth_00001.jpg
image_00002.jpg cloth_00002.jpg
...
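
The layout and pairs file described above can be checked before a long inference run. The sketch below is an illustration under stated assumptions rather than code from the repository: `read_pairs` and `check_dataset` are hypothetical helpers, and it assumes that the first filename on each line resolves to test/image and the second to test/cloth.

```python
from pathlib import Path

# Sub-directories of [data_root_dir]/test listed in the tree above.
REQUIRED_SUBDIRS = (
    "image", "image-densepose", "agnostic-v3.2",
    "agnostic-mask", "cloth", "cloth-mask",
)


def read_pairs(pairs_file):
    """Parse 'image_name cloth_name' lines, skipping blank lines."""
    pairs = []
    lines = Path(pairs_file).read_text().splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(
                f"{pairs_file}:{line_no}: expected 2 filenames, got {len(fields)}"
            )
        pairs.append((fields[0], fields[1]))
    return pairs


def check_dataset(data_root_dir, pairs_name="yahavton_test_pairs.txt", phase="test"):
    """Return a list of problems; an empty list means the layout looks complete."""
    root = Path(data_root_dir)
    problems = [
        f"missing directory: {root / phase / sub}"
        for sub in REQUIRED_SUBDIRS
        if not (root / phase / sub).is_dir()
    ]
    pairs_file = root / pairs_name
    if not pairs_file.is_file():
        problems.append(f"missing pairs file: {pairs_file}")
        return problems
    # Assumption: person images live in <phase>/image, garments in <phase>/cloth.
    for image_name, cloth_name in read_pairs(pairs_file):
        if not (root / phase / "image" / image_name).is_file():
            problems.append(f"missing person image: {image_name}")
        if not (root / phase / "cloth" / cloth_name).is_file():
            problems.append(f"missing cloth image: {cloth_name}")
    return problems
```

Calling `check_dataset("[data_root_dir]")` with the real path returns an empty list when every directory and every listed file is present.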

Inference

python inference_hf.py \
    --repo_id takesukeDS/SIFT-VTON \
    --data_root_dir [data_root_dir] \
    --save_dir [output_dir] \
    --phase test \
    --batch_size 4 \
    --start_from_noised_agn \
    --cfg_scale 1.5 \
    --repaint
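
When the same command has to be run several times, for example once paired and once unpaired, it can be convenient to assemble it programmatically. The helper below is a hypothetical sketch: `build_inference_command` is not part of the repository, it uses only flags that appear in this README, and the paths in the demo are placeholders.

```python
import shlex


def build_inference_command(data_root_dir, save_dir, *, repo_id="takesukeDS/SIFT-VTON",
                            phase="test", batch_size=4, cfg_scale=1.5,
                            start_from_noised_agn=True, repaint=True, unpair=False):
    """Assemble the argv list for inference_hf.py from the documented flags."""
    argv = [
        "python", "inference_hf.py",
        "--repo_id", repo_id,
        "--data_root_dir", str(data_root_dir),
        "--save_dir", str(save_dir),
        "--phase", phase,
        "--batch_size", str(batch_size),
        "--cfg_scale", str(cfg_scale),
    ]
    # Boolean switches take no value; they are appended only when enabled.
    for flag, enabled in (("--start_from_noised_agn", start_from_noised_agn),
                          ("--repaint", repaint),
                          ("--unpair", unpair)):
        if enabled:
            argv.append(flag)
    return argv


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for unpair, out in ((False, "results/paired"), (True, "results/unpaired")):
        print(shlex.join(build_inference_command("data/viton-hd", out, unpair=unpair)))
```

The returned list can be passed directly to `subprocess.run(argv, check=True)` from the repository root.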

The model and config are downloaded automatically from this Hub repository on the first run and cached locally under ~/.cache/huggingface/hub/.
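
To see whether the files are already cached, the cache directory can be inspected directly. This is a minimal sketch that relies on the standard Hub cache layout (models--<owner>--<name>/snapshots/<revision>/); the helpers `hub_cache_root` and `cached_files` are hypothetical names, and the sketch ignores less common overrides such as XDG_CACHE_HOME.

```python
import os
from pathlib import Path


def hub_cache_root():
    """Resolve the Hub cache directory, honoring HF_HUB_CACHE and HF_HOME if set."""
    if "HF_HUB_CACHE" in os.environ:
        return Path(os.environ["HF_HUB_CACHE"])
    hf_home = os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface"))
    return Path(hf_home) / "hub"


def cached_files(repo_id, filename, cache_root=None):
    """List cached copies of `filename` for a model repo, one per downloaded revision."""
    root = Path(cache_root) if cache_root is not None else hub_cache_root()
    # Model repos are stored as models--<owner>--<name>/snapshots/<revision>/...
    repo_dir = root / ("models--" + repo_id.replace("/", "--"))
    return sorted(repo_dir.glob(f"snapshots/*/{filename}"))


if __name__ == "__main__":
    for name in ("model.ckpt", "config.yaml"):
        hits = cached_files("takesukeDS/SIFT-VTON", name)
        print(name, "->", hits[0] if hits else "not cached yet")
```

Deleting the models--takesukeDS--SIFT-VTON directory forces a fresh download on the next run.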

Key inference arguments

| Argument | Default | Description |
|---|---|---|
| `--repo_id` | - | This Hub repo (`takesukeDS/SIFT-VTON`) |
| `--phase` | `test` | `test` for the test split, `train` for the training split |
| `--cfg_scale` | `1.0` | Classifier-free guidance scale |
| `--denoise_steps` | `50` | Number of PLMS denoising steps |
| `--start_from_noised_agn` | off | Start denoising from noised agnostic image instead of pure noise (recommended) |
| `--repaint` | off | Paste back the unmasked region from the original image after generation (recommended) |
| `--unpair` | off | Run unpaired inference (person and garment from different samples) |
| `--batch_size` | `16` | Batch size |
| `--seed` | `1235` | Random seed |
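
To make two of these options concrete, the sketch below shows the standard classifier-free guidance mix and a simple paste-back composite in NumPy. It illustrates the general techniques, not the repository's implementation. The function names are hypothetical, images are assumed to be float arrays of shape (H, W, C), and the mask is assumed to be 1 inside the region to be regenerated and 0 elsewhere.

```python
import numpy as np


def classifier_free_guidance(eps_uncond, eps_cond, cfg_scale):
    """Standard CFG mix: move from the unconditional prediction toward the conditional one."""
    return eps_uncond + cfg_scale * (eps_cond - eps_uncond)


def repaint(generated, original, inpaint_mask):
    """Keep generated pixels where inpaint_mask is 1 and original pixels where it is 0."""
    # Assumption: mask polarity (1 = regenerated region) may differ from the dataset's masks.
    mask = inpaint_mask.astype(generated.dtype)
    if mask.ndim == generated.ndim - 1:
        mask = mask[..., None]  # broadcast an (H, W) mask over the channel axis
    return mask * generated + (1.0 - mask) * original
```

With a scale of 1.0 the guidance formula returns the conditional prediction unchanged, so values above 1.0, such as the 1.5 used in the command above, push the result further toward the conditioning.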

Citation

@misc{takemoto2026siftvton,
  title         = {{SIFT-VTON}: Geometric Correspondence Supervision on Cross-Attention for Virtual Try-On},
  author        = {Takemoto, Kosuke and Koshinaka, Takafumi},
  year          = {2026},
  eprint        = {2605.01296},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2605.01296}
}

License

Licensed under the CC BY-NC-SA 4.0 license.

