SIFT-VTON: Geometric Correspondence Supervision on Cross-Attention for Virtual Try-On
ICPR 2026
SIFT-VTON is a diffusion-based virtual try-on model that uses SIFT feature correspondences between a garment image and a person image to supervise cross-attention maps during training, improving geometric alignment in the generated results.
Paper: arXiv:2605.01296
This model is derived from StableVITON and built on a Stable Diffusion backbone.
The code repository is available at takesukeDS/SIFT-VTON.
Model Files
| File | Description |
|---|---|
| model.ckpt | Model checkpoint |
| config.yaml | Model architecture config |
Requirements
Clone the code repository and set up the environment:
git clone https://github.com/takesukeDS/SIFT-VTON
cd SIFT-VTON
conda create -n siftvton python==3.12.8 -y
conda activate siftvton
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu128
pip install matplotlib einops omegaconf yacs
pip install pytorch-lightning==2.5.2
pip install open-clip-torch==3.1.0
pip install diffusers==0.34.0
pip install scipy==1.16.1
pip install transformers==4.55.0
conda install -c anaconda ipython -y
pip install scikit-image clean-fid albumentations==2.0.8
pip3 install -U xformers==0.0.31.post1
pip install tensorboard
pip install accelerate==1.10.0
pip install numpy==2.2.6
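After installation, it can help to confirm that the pinned versions actually ended up in the environment. The following is a minimal sketch, not part of the repository: the helper name `check_pins` is hypothetical, the pins are copied from the commands above, and it assumes local build tags such as `+cu128` should be ignored when comparing.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install commands above (distribution name -> version).
PINNED = {
    "torch": "2.7.1",
    "torchvision": "0.22.1",
    "pytorch-lightning": "2.5.2",
    "open-clip-torch": "3.1.0",
    "diffusers": "0.34.0",
    "scipy": "1.16.1",
    "transformers": "4.55.0",
    "albumentations": "2.0.8",
    "xformers": "0.0.31.post1",
    "accelerate": "1.10.0",
    "numpy": "2.2.6",
}


def check_pins(pins=PINNED, get_version=version):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None  # package is not installed at all
        # Ignore local build tags such as "+cu128" when comparing (assumption).
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_pins().items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means every pinned package matches; any printed line names a package to reinstall.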
Data
Download the VITON-HD dataset and prepare the following directory structure:
[data_root_dir]
|-- test
|   |-- image
|   |-- image-densepose
|   |-- agnostic-v3.2
|   |-- agnostic-mask
|   |-- cloth
|   |-- cloth-mask
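The layout above can be checked before launching inference. This is a small sketch under the assumption that all six subdirectories sit directly under `[data_root_dir]/test`; the helper name `missing_subdirs` is hypothetical and does not come from the repository.

```python
from pathlib import Path

# Subdirectories listed in the layout above.
EXPECTED_SUBDIRS = (
    "image",
    "image-densepose",
    "agnostic-v3.2",
    "agnostic-mask",
    "cloth",
    "cloth-mask",
)


def missing_subdirs(data_root_dir, phase="test"):
    """Return the expected subdirectories that are absent under <root>/<phase>."""
    phase_dir = Path(data_root_dir) / phase
    return [name for name in EXPECTED_SUBDIRS if not (phase_dir / name).is_dir()]
```

An empty list means the directory tree matches the structure shown above.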
A pairs file, yahavton_test_pairs.txt, is also required under [data_root_dir]. Each line lists one image filename and one cloth filename, separated by a space:
image_00001.jpg cloth_00001.jpg
image_00002.jpg cloth_00002.jpg
...
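The pairs format above is simple enough to validate with a few lines of Python. The sketch below assumes whitespace-separated fields and that images live under `test/image` and garments under `test/cloth`; `read_pairs` and `missing_files` are hypothetical helpers, not functions from the repository.

```python
from pathlib import Path


def read_pairs(pairs_file):
    """Parse 'image cloth' lines into a list of (image_name, cloth_name) tuples."""
    pairs = []
    lines = Path(pairs_file).read_text().splitlines()
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(
                f"{pairs_file}:{line_no}: expected 2 fields, got {len(fields)}"
            )
        pairs.append((fields[0], fields[1]))
    return pairs


def missing_files(data_root_dir, pairs, phase="test"):
    """List 'subdir/filename' entries referenced by the pairs but not on disk."""
    root = Path(data_root_dir) / phase
    missing = []
    for image_name, cloth_name in pairs:
        for subdir, name in (("image", image_name), ("cloth", cloth_name)):
            if not (root / subdir / name).is_file():
                missing.append(f"{subdir}/{name}")
    return missing
```

Running `missing_files(root, read_pairs(pairs_path))` before inference surfaces filename typos early instead of partway through a batch.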
Inference
python inference_hf.py \
--repo_id takesukeDS/SIFT-VTON \
--data_root_dir [data_root_dir] \
--save_dir [output_dir] \
--phase test \
--batch_size 4 \
--start_from_noised_agn \
--cfg_scale 1.5 \
--repaint
The model and config are downloaded automatically from this Hub repository on the first run and cached locally under ~/.cache/huggingface/hub/.
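To fetch the two files ahead of time, for example on a machine that will later run offline, the `huggingface_hub` client can be called directly. This is a sketch under assumptions: `fetch_model_files` is a hypothetical helper, and inference_hf.py may retrieve the files in a different way. Only `hf_hub_download(repo_id=..., filename=...)` is an actual library call.

```python
def fetch_model_files(
    repo_id="takesukeDS/SIFT-VTON",
    filenames=("model.ckpt", "config.yaml"),
    download=None,
):
    """Return {filename: local_path} for the requested repository files."""
    if download is None:
        # Imported lazily so the helper can be exercised without network access.
        from huggingface_hub import hf_hub_download

        download = hf_hub_download
    # Each call returns the path of the cached copy, downloading it if needed.
    return {name: download(repo_id=repo_id, filename=name) for name in filenames}
```

Calling `fetch_model_files()` once populates the local cache, so later runs should find the files without downloading them again.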
Key inference arguments
| Argument | Default | Description |
|---|---|---|
| --repo_id | none | This Hub repo (takesukeDS/SIFT-VTON) |
| --phase | test | test for the test split, train for the training split |
| --cfg_scale | 1.0 | Classifier-free guidance scale |
| --denoise_steps | 50 | Number of PLMS denoising steps |
| --start_from_noised_agn | off | Start denoising from the noised agnostic image instead of pure noise (recommended) |
| --repaint | off | Paste back the unmasked region from the original image after generation (recommended) |
| --unpair | off | Run unpaired inference (person and garment from different samples) |
| --batch_size | 16 | Batch size |
| --seed | 1235 | Random seed |
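Two of the options above are easier to see numerically. The sketch below shows the standard classifier-free guidance combination and a mask-based paste-back of the kind --repaint describes. It is an illustration under assumptions, not the code in inference_hf.py: the function names are hypothetical, and the mask convention (1 marks the region to synthesize) may differ from the one the script uses.

```python
import numpy as np


def cfg_combine(eps_uncond, eps_cond, cfg_scale=1.0):
    """Standard classifier-free guidance mix of two noise predictions.

    With cfg_scale == 1.0 the result equals eps_cond (no extra guidance).
    """
    return eps_uncond + cfg_scale * (eps_cond - eps_uncond)


def repaint(generated, original, inpaint_mask):
    """Keep generated pixels where inpaint_mask == 1, original pixels elsewhere.

    Assumed convention: 1 marks the try-on region to synthesize, 0 marks the
    region to copy back from the original image.
    """
    mask = inpaint_mask.astype(generated.dtype)
    return mask * generated + (1.0 - mask) * original
```

With this formulation, a scale above 1.0, such as the 1.5 used in the inference command, pushes the prediction further toward the conditional branch.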
Citation
@misc{takemoto2026siftvton,
title = {{SIFT-VTON}: Geometric Correspondence Supervision on Cross-Attention for Virtual Try-On},
author = {Takemoto, Kosuke and Koshinaka, Takafumi},
year = {2026},
eprint = {2605.01296},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2605.01296}
}
License
Licensed under the CC BY-NC-SA 4.0 license.