SIFT-VTON: Geometric Correspondence Supervision on Cross-Attention for Virtual Try-On

ICPR 2026

SIFT-VTON is a diffusion-based virtual try-on model that uses SIFT feature correspondences between a garment image and a person image to supervise cross-attention maps during training, improving geometric alignment in the generated results.

Paper: arXiv:2605.01296

This model is derived from StableVITON and built on a Stable Diffusion backbone.

The code repository is available at https://github.com/takesukeDS/SIFT-VTON.

Model Files

File         Description
model.ckpt   Model checkpoint
config.yaml  Model architecture config

Requirements

Clone the code repository and set up the environment:

git clone https://github.com/takesukeDS/SIFT-VTON
cd SIFT-VTON

conda create -n siftvton python==3.12.8 -y
conda activate siftvton

pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu128
pip install matplotlib einops omegaconf yacs
pip install pytorch-lightning==2.5.2
pip install open-clip-torch==3.1.0
pip install diffusers==0.34.0
pip install scipy==1.16.1
pip install transformers==4.55.0
conda install -c anaconda ipython -y
pip install scikit-image clean-fid albumentations==2.0.8
pip3 install -U xformers==0.0.31.post1
pip install tensorboard
pip install accelerate==1.10.0
pip install numpy==2.2.6
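
After installation, it can help to confirm that the environment matches the pinned versions. The following is a minimal sketch using only the standard library; `PINNED` and `find_mismatches` are hypothetical names, not part of the SIFT-VTON repository, and the pins are copied from the commands above. It assumes that a local version suffix such as `+cu128` on CUDA wheels should be ignored when comparing.

```python
from importlib import metadata

# Pins copied from the pip commands above (unpinned packages are omitted).
PINNED = {
    "torch": "2.7.1",
    "torchvision": "0.22.1",
    "pytorch-lightning": "2.5.2",
    "open-clip-torch": "3.1.0",
    "diffusers": "0.34.0",
    "scipy": "1.16.1",
    "transformers": "4.55.0",
    "albumentations": "2.0.8",
    "xformers": "0.0.31.post1",
    "accelerate": "1.10.0",
    "numpy": "2.2.6",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for missing or mismatched packages."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Assumption: strip a local suffix such as "+cu128" before comparing.
        base = found.split("+")[0] if found else None
        if base != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means every pinned package is installed at the expected version.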

Data

Download the VITON-HD dataset and prepare the following directory structure:

[data_root_dir]
└── test
    |-- image
    |-- image-densepose
    |-- agnostic-v3.2
    |-- agnostic-mask
    |-- cloth
    |-- cloth-mask

A pairs file, yahavton_test_pairs.txt, is also required under [data_root_dir]. It lists one image filename and one cloth filename per line:

image_00001.jpg cloth_00001.jpg
image_00002.jpg cloth_00002.jpg
...
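
Before running inference, the layout and pairs file can be sanity-checked with a short script. This is a sketch under stated assumptions: `read_pairs` and `missing_inputs` are hypothetical helpers, not functions from the repository, and the check assumes that the first filename on each line lives under `image/` and the second under `cloth/`, which the directory names suggest but the text above does not state.

```python
from pathlib import Path

# Subdirectories listed in the layout above.
EXPECTED_SUBDIRS = [
    "image",
    "image-densepose",
    "agnostic-v3.2",
    "agnostic-mask",
    "cloth",
    "cloth-mask",
]


def read_pairs(pairs_path):
    """Parse 'image cloth' lines into (image, cloth) tuples, skipping blank lines."""
    pairs = []
    lines = Path(pairs_path).read_text().splitlines()
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 filenames, got {len(fields)}")
        pairs.append((fields[0], fields[1]))
    return pairs


def missing_inputs(data_root, pairs, phase="test"):
    """List expected subdirectories and paired files that are absent."""
    root = Path(data_root) / phase
    missing = [str(root / d) for d in EXPECTED_SUBDIRS if not (root / d).is_dir()]
    # Assumption: image names resolve under image/, cloth names under cloth/.
    for image_name, cloth_name in pairs:
        for sub, name in (("image", image_name), ("cloth", cloth_name)):
            path = root / sub / name
            if not path.is_file():
                missing.append(str(path))
    return missing
```

Calling `missing_inputs(data_root, read_pairs(pairs_path))` returns an empty list when every directory and every paired file is present.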

Inference

python inference_hf.py \
    --repo_id takesukeDS/SIFT-VTON \
    --data_root_dir [data_root_dir] \
    --save_dir [output_dir] \
    --phase test \
    --batch_size 4 \
    --start_from_noised_agn \
    --cfg_scale 1.5 \
    --repaint

The model and config are downloaded automatically from this Hub repository on the first run and cached locally under ~/.cache/huggingface/hub/.
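
To fetch the two files outside of the inference script, something like the following may work. It is a sketch, not the repository's own loading code: `fetch_model_files` is a hypothetical helper, and it assumes the `huggingface_hub` package is installed (it is not pinned in the requirements above, though `diffusers` and `transformers` normally pull it in). `hf_hub_download` returns the local path of the cached file and skips the download when the file is already cached.

```python
def fetch_model_files(repo_id="takesukeDS/SIFT-VTON",
                      filenames=("model.ckpt", "config.yaml"),
                      download=None):
    """Download each file (or reuse the cached copy) and return {filename: local path}."""
    if download is None:
        # Imported lazily so the helper can be exercised without the package.
        from huggingface_hub import hf_hub_download
        download = hf_hub_download
    return {name: download(repo_id=repo_id, filename=name) for name in filenames}
```

The `download` parameter exists only so that the helper can be tested offline with a stand-in function.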

Key inference arguments

Argument                 Default  Description
--repo_id                (none)   This Hub repo (takesukeDS/SIFT-VTON)
--phase                  test     test for the test split, train for the training split
--cfg_scale              1.0      Classifier-free guidance scale
--denoise_steps          50       Number of PLMS denoising steps
--start_from_noised_agn  off      Start denoising from noised agnostic image instead of pure noise (recommended)
--repaint                off      Paste back the unmasked region from the original image after generation (recommended)
--unpair                 off      Run unpaired inference (person and garment from different samples)
--batch_size             16       Batch size
--seed                   1235     Random seed
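
Two of the options above, --cfg_scale and --repaint, correspond to simple array operations. The NumPy sketch below illustrates the general idea only; it is not taken from the SIFT-VTON code. `apply_cfg` uses the standard classifier-free guidance formula, and `repaint` assumes a mask that is 1 inside the region to be generated and 0 elsewhere (the mask polarity used by the actual script is not stated here). Both function names are hypothetical.

```python
import numpy as np


def apply_cfg(eps_uncond, eps_cond, cfg_scale):
    """Classifier-free guidance: extrapolate from the unconditional prediction."""
    return eps_uncond + cfg_scale * (eps_cond - eps_uncond)


def repaint(generated, original, inpaint_mask):
    """Keep generated pixels where inpaint_mask == 1 and original pixels elsewhere."""
    # Assumption: inpaint_mask is 1 in the generated region, 0 in the kept region.
    mask = inpaint_mask.astype(generated.dtype)
    return mask * generated + (1.0 - mask) * original
```

With a scale of 1.0, `apply_cfg` returns the conditional prediction unchanged; larger values push the result further from the unconditional prediction.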

Citation

@misc{takemoto2026siftvton,
  title         = {{SIFT-VTON}: Geometric Correspondence Supervision on Cross-Attention for Virtual Try-On},
  author        = {Takemoto, Kosuke and Koshinaka, Takafumi},
  year          = {2026},
  eprint        = {2605.01296},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2605.01296}
}

License

Licensed under the CC BY-NC-SA 4.0 license.
