
SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation
Jiongze Yu1, Xiangbo Gao1, Pooja Verlani2, Akshay Gadde2, Yilin Wang2, Balu Adsumilli2, Zhengzhong Tu†,1
1Texas A&M University 2YouTube, Google
† Corresponding author
News
- 2026.03.17: This repo is released.
Abstract: Video Super-Resolution (VSR) aims to restore high-quality video frames from low-resolution (LR) inputs, yet most existing VSR approaches behave like black boxes at inference time: users cannot reliably correct unexpected artifacts and can only accept whatever the model produces. In this paper, we propose a novel interactive VSR framework, dubbed SparkVSR, that makes sparse keyframes a simple and expressive control signal. Users can optionally super-resolve a small set of keyframes using any off-the-shelf image super-resolution (ISR) model; SparkVSR then propagates the keyframe priors to the entire video sequence while remaining grounded in the motion of the original LR video. Concretely, we introduce a two-stage, keyframe-conditioned training pipeline that first fuses LR video latents with sparsely encoded HR keyframe latents in latent space to learn robust cross-space propagation, and then refines perceptual details in pixel space. At inference time, SparkVSR supports flexible keyframe selection (manual specification, codec I-frame extraction, or random sampling) and a reference-free guidance mechanism that continuously balances keyframe adherence and blind restoration, ensuring robust performance even when reference keyframes are absent or imperfect. Experiments on multiple VSR benchmarks demonstrate improved temporal consistency and strong restoration quality, surpassing baselines by up to 24.6%, 21.8%, and 5.6% on CLIP-IQA, DOVER, and MUSIQ, respectively, enabling controllable, keyframe-driven video super-resolution. Moreover, SparkVSR serves as a generic interactive, keyframe-conditioned video processing framework: it can be applied out of the box to unseen tasks such as old-film restoration and video style transfer.
Inference Pipeline
Training Pipeline
TODO
- [x] Release inference code.
- [x] Release pre-trained models.
- [x] Release training code.
- [x] Release project page.
Dependencies
- Python 3.10+
- PyTorch >= 2.5.0
- Diffusers
- Other dependencies (see `requirements.txt`)

```shell
# Clone the GitHub repo and enter the directory
git clone https://github.com/taco-group/SparkVSR
cd SparkVSR
# Create and activate the conda environment
conda create -n sparkvsr python=3.10
conda activate sparkvsr
# Install all required dependencies
pip install -r requirements.txt
```
Datasets

Training Datasets

Our model is trained on the same datasets as DOVE: HQ-VSR and DIV2K-HR. Place all training datasets under `datasets/train/`.
| Dataset | Type | # Videos / Images | Download |
|---|---|---|---|
| HQ-VSR | Video | 2,055 | Google Drive |
| DIV2K-HR | Image | 800 | Official Link |
All datasets should follow this structure:
```
datasets/
└── train/
    ├── HQ-VSR/
    └── DIV2K_train_HR/
```
Test Datasets

We use several real-world and synthetic test datasets for evaluation. All datasets follow a consistent directory structure:
| Dataset | Type | # Videos | Average Frames | Download |
|---|---|---|---|---|
| UDM10 | Synthetic | 10 | 32 | Google Drive |
| SPMCS | Synthetic | 30 | 32 | Google Drive |
| YouHQ40 | Synthetic | 40 | 32 | Google Drive |
| RealVSR | Real-world | 50 | 50 | Google Drive |
| MovieLQ | Old-movie | 10 | 192 | Google Drive |
Make sure the path (`datasets/test/`) is correct before running inference.
The directory structure is as follows:
```
datasets/
└── test/
    └── [DatasetName]/
        ├── GT/        # Ground Truth: folder of high-quality frames (one per clip)
        ├── GT-Video/  # Ground Truth (video version): lossless MKV format
        ├── LQ/        # Low-Quality Input: folder of degraded frames (one per clip)
        └── LQ-Video/  # Low-Quality Input (video version): lossless MKV format
```
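Mismatched paths are a common source of silent failures, so it can help to sanity-check the layout before running anything. The helper below is our own illustrative snippet, not part of the repo (the function name is an assumption; the subfolder names are from the structure above):

```python
# Minimal sketch: verify a test dataset follows the expected layout above.
from pathlib import Path

EXPECTED = ["GT", "GT-Video", "LQ", "LQ-Video"]

def check_test_dataset(root):
    """Return the list of missing subfolders for one test dataset."""
    root = Path(root)
    return [name for name in EXPECTED if not (root / name).is_dir()]

# Example:
# missing = check_test_dataset("datasets/test/UDM10")
# if missing:
#     print("Missing subfolders:", missing)
```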
Dataset Preparation (Path Lists)

Before training or testing, generate .txt files containing the relative paths of all valid video and image files in your dataset directories. These text lists act as the index for the dataloader during training and inference. Run the following commands:
```shell
# Train datasets
python finetune/scripts/prepare_dataset.py --dir datasets/train/HQ-VSR
python finetune/scripts/prepare_dataset.py --dir datasets/train/DIV2K_train_HR
# Test datasets (repeat for other test datasets as needed)
python finetune/scripts/prepare_dataset.py --dir datasets/test/UDM10/GT-Video
python finetune/scripts/prepare_dataset.py --dir datasets/test/UDM10/LQ-Video
```
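The actual indexing logic lives in `finetune/scripts/prepare_dataset.py`; for intuition, a minimal stand-in might look like the sketch below (the extension list and output format are assumptions, not the repo's behavior):

```python
# Hypothetical sketch of a path-list generator: walk a dataset directory
# and write the relative paths of media files to a .txt index.
from pathlib import Path

MEDIA_EXTS = {".mp4", ".mkv", ".png", ".jpg"}  # assumed extensions

def write_path_list(dataset_dir, out_txt):
    """Collect relative media paths under dataset_dir and write them to out_txt."""
    root = Path(dataset_dir)
    paths = sorted(
        str(p.relative_to(root))
        for p in root.rglob("*")
        if p.suffix.lower() in MEDIA_EXTS
    )
    Path(out_txt).write_text("\n".join(paths) + "\n")
    return paths
```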
Models
Our model is built upon the CogVideoX1.5-5B-I2V base model. We provide pretrained weights for SparkVSR at different training stages.
| Model Name | Description | HuggingFace |
|---|---|---|
| CogVideoX1.5-5B-I2V | Base model used for initialization | zai-org/CogVideoX1.5-5B-I2V |
| SparkVSR (Stage-1) | SparkVSR Stage-1 trained weights | JiongzeYu/SparkVSR-S1 |
| SparkVSR (Stage-2) | SparkVSR Stage-2 final weights | JiongzeYu/SparkVSR |
Placement of models:
- Place the base model (`CogVideoX1.5-5B-I2V`) into the `pretrained_weights/` folder.
- Place the downloaded SparkVSR weights (Stage-1 and Stage-2) into the `checkpoints/` folder.
Training

Note: Training requires 4×A100 GPUs.
Stage-1 (Latent-Space): Keyframe-Conditioned Adaptation. Enter the `finetune/` directory and start training:

```shell
cd finetune/
bash sparkvsr_train_s1_ref.sh
```

This stage adapts the base model to VSR by learning to fuse LR video latents with sparse HR keyframe latents for robust cross-space propagation.
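Conceptually, the latent fusion in this stage can be pictured as placing HR keyframe latents into their temporal slots and attaching a binary availability mask. The sketch below is purely illustrative; the shapes and channel layout are our assumptions, not the actual SparkVSR code:

```python
# Illustrative sketch (not the actual SparkVSR implementation): fuse LR video
# latents with sparse HR keyframe latents plus a binary availability mask.
import numpy as np

def fuse_latents(lr_latents, keyframe_latents, keyframe_indices):
    """lr_latents: (T, C, H, W); keyframe_latents: (K, C, H, W)."""
    T, C, H, W = lr_latents.shape
    ref = np.zeros_like(lr_latents)                    # empty slots for non-keyframes
    mask = np.zeros((T, 1, H, W), dtype=lr_latents.dtype)
    for k, t in enumerate(keyframe_indices):
        ref[t] = keyframe_latents[k]                   # place HR keyframe latent at its slot
        mask[t] = 1.0                                  # mark the slot as available
    # Condition channels: [LR latent | keyframe latent | mask]
    return np.concatenate([lr_latents, ref, mask], axis=1)
```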
Stage-2 (Pixel-Space): Detail Refinement. First, convert the Stage-1 checkpoint into a loadable SFT weight format:

```shell
python scripts/prepare_sft_ckpt.py --checkpoint_dir ../checkpoint/SparkVSR-s1/checkpoint-10000
```

(Adjust the path and step number to match your actual training output.) You can skip Stage-1 by downloading our SparkVSR Stage-1 weights as the starting point for Stage-2.
Then, run the second-stage fine-tuning:

```shell
bash sparkvsr_train_s2_ref.sh
```

This stage refines perceptual details in pixel space, ensuring adherence to provided keyframes while maintaining strong blind-SR capability when keyframes are absent or imperfect.

Finally, convert the Stage-2 checkpoint for inference:

```shell
python scripts/prepare_sft_ckpt.py --checkpoint_dir ../checkpoint/SparkVSR-s2/checkpoint-500
```
Inference

- Before running inference, make sure you have downloaded the corresponding pre-trained models and test datasets.
- The full inference commands are provided in the shell script `sparkvsr_inference.sh`.
SparkVSR supports flexible keyframe propagation through three primary inference modes (`--ref_mode`).
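As a rough illustration of keyframe selection (manual specification and random sampling; codec I-frame extraction would additionally need a tool such as ffprobe and is omitted here), the sketch below picks indices while keeping a minimum spacing. It is our own illustration, not the repo's selection code; `min_gap=5` mirrors the rule that reference indices must be more than 4 frames apart:

```python
import random

def select_keyframes(num_frames, mode="random", k=3, manual=None, min_gap=5, seed=0):
    """Pick keyframe indices: either a user-specified list or spaced random samples."""
    if mode == "manual":
        return sorted(manual)
    rng = random.Random(seed)
    chosen, candidates = [], list(range(num_frames))
    while candidates and len(chosen) < k:
        idx = rng.choice(candidates)
        chosen.append(idx)
        # drop candidates too close to the chosen index to keep indices spaced out
        candidates = [c for c in candidates if abs(c - idx) >= min_gap]
    return sorted(chosen)
```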
Global Customization Flags

Regardless of the mode you choose, you can customize the temporal propagation behavior using these flags:

- `--ref_indices`: Specifies the indices of the keyframes to use as references (0-indexed).
  - Example: `--ref_indices 0 16 32`
  - Important: the interval between any two reference frame indices must be strictly greater than 4.
- `--ref_guidance_scale`: Controls the strength of the reference keyframes' influence on the output video (default: 1.0). Increasing this value forces the model to adhere more strictly to the provided keyframes.
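The guidance scale behaves like a classifier-free-guidance-style interpolation between the blind-restoration branch and the keyframe-conditioned branch. The snippet below is a hedged illustration of the two flags (the exact guidance formulation is defined in the paper, not here), plus a checker for the index-spacing rule:

```python
import numpy as np

def check_ref_indices(indices):
    """Enforce the rule above: gaps between reference indices must exceed 4."""
    idx = sorted(indices)
    return all(b - a > 4 for a, b in zip(idx, idx[1:]))

def apply_ref_guidance(pred_no_ref, pred_ref, scale=1.0):
    """CFG-style blend: larger scale pushes the output toward the keyframe branch."""
    return pred_no_ref + scale * (pred_ref - pred_no_ref)
```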
1. No-Ref Mode (`--ref_mode no_ref`)

Performs blind video super-resolution without any reference keyframes.

```shell
MODEL_PATH="checkpoints/sparkvsr-s2/ckpt-500-sft"
CUDA_VISIBLE_DEVICES=0 python sparkvsr_inference_script.py \
    --input_dir datasets/test/UDM10/LQ-Video \
    --model_path $MODEL_PATH \
    --output_path results/UDM10/no_ref \
    --gt_dir datasets/test/UDM10/GT-Video \
    --is_vae_st \
    --ref_mode no_ref \
    --ref_prompt_mode fixed \
    --ref_guidance_scale 1.0 \
    --eval_metrics psnr,ssim,lpips,dists,clipiqa \
    --upscale 4
```
2. API Mode (`--ref_mode api`)

Uses keyframes restored by a commercial API as the conditioning signal. SparkVSR defaults to the `fal-ai/nano-banana-pro/edit` endpoint.

Setup requirements:
- Open `finetune/utils/ref_utils.py`.
- Locate the configuration block at the top of the file.
- Replace `'your_fal_key'` with your actual API key.
- (Optional) Customize the `TASK_PROMPT` in the same file to better guide the restoration process.
```shell
MODEL_PATH="checkpoints/sparkvsr-s2/ckpt-500-sft"
CUDA_VISIBLE_DEVICES=0 python sparkvsr_inference_script.py \
    --input_dir datasets/test/UDM10/LQ-Video \
    --model_path $MODEL_PATH \
    --output_path results/UDM10/api_ref \
    --gt_dir datasets/test/UDM10/GT-Video \
    --is_vae_st \
    --ref_mode api \
    --ref_prompt_mode fixed \
    --ref_guidance_scale 1.0 \
    --eval_metrics psnr,ssim,lpips,dists,clipiqa \
    --upscale 4 \
    --ref_indices 0
```
3. PiSA-SR Mode (`--ref_mode pisasr`)

Uses keyframes restored by the open-source PiSA-SR model.

Setup requirements:
- Clone the PiSA-SR repository and follow its instructions to install dependencies in a separate conda environment.
- Download its pre-trained weights (`stable-diffusion-2-1-base` and `pisa_sr.pkl`).
- Update the `--pisa_*` flags in `sparkvsr_inference.sh` to point to your actual cloned PiSA-SR directory, environment, and desired GPU.
```shell
MODEL_PATH="checkpoints/sparkvsr-s2/ckpt-500-sft"
CUDA_VISIBLE_DEVICES=0 python sparkvsr_inference_script.py \
    --input_dir datasets/test/UDM10/LQ-Video \
    --model_path $MODEL_PATH \
    --output_path results/UDM10/pisa_ref \
    --gt_dir datasets/test/UDM10/GT-Video \
    --is_vae_st \
    --ref_mode pisasr \
    --ref_prompt_mode fixed \
    --ref_guidance_scale 1.0 \
    --eval_metrics psnr,ssim,lpips,dists,clipiqa \
    --upscale 4 \
    --ref_indices 0 \
    --pisa_python_executable "path/to/your/pisasr/conda/env/bin/python" \
    --pisa_script_path "path/to/your/PiSA-SR/test_pisasr.py" \
    --pisa_sd_model_path "path/to/your/PiSA-SR/preset/models/stable-diffusion-2-1-base" \
    --pisa_chkpt_path "path/to/your/PiSA-SR/preset/models/pisa_sr.pkl" \
    --pisa_gpu "0"
```
Note: All three inference modes and their complete execution commands are organized and ready to run in the `sparkvsr_inference.sh` script.
Metric Evaluation

To quantitatively evaluate the super-resolved videos, we provide a unified evaluation script: `run_eval_all.sh`.

Evaluation setup requirements: to compute DOVER and FastVQA/FasterVQA scores, you must clone their respective repositories and place them (along with their weights) into the `metrics/` directory.

- Clone VQAssessment/DOVER into `metrics/DOVER`.
- Clone VQAssessment/FAST-VQA-and-FasterVQA into `metrics/FastVQA`.
- Download the pre-trained weights specified in their repositories into their respective nested algorithm folders.

Once the metrics are set up, run `run_eval_all.sh` to calculate the scores. The results will be saved as `all_metrics_results.json` in your specified output directory.
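The full-reference distortion metrics in `--eval_metrics` are standard; for intuition, PSNR can be computed with plain NumPy (learned metrics such as LPIPS, DOVER, or CLIP-IQA require their respective packages):

```python
# Reference PSNR between two frames; assumes 8-bit range by default.
import numpy as np

def psnr(gt, pred, max_val=255.0):
    """Peak signal-to-noise ratio: 10*log10(max_val^2 / MSE)."""
    gt = gt.astype(np.float64)
    pred = pred.astype(np.float64)
    mse = np.mean((gt - pred) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10((max_val ** 2) / mse)
```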
Citation

If you find the code helpful in your research or work, please cite the following paper.

```bibtex
@misc{yu2026sparkvsrinteractivevideosuperresolution,
  title={SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation},
  author={Jiongze Yu and Xiangbo Gao and Pooja Verlani and Akshay Gadde and Yilin Wang and Balu Adsumilli and Zhengzhong Tu},
  year={2026},
  eprint={2603.16864},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.16864},
}
```
Acknowledgements
Our work is built upon the solid foundations laid by DOVE and CogVideoX. We sincerely thank the authors for their excellent open-source contributions.