arxiv:2606.14024

ViT-Up: Faithful Feature Upsampling for Vision Transformers

Published on Jun 12

· Submitted by

Wandel on Jun 18

Shanghai Jiao Tong University

Upvote

Authors:

Krispin Wandel ,

Abstract

ViT-Up is a feature upsampling framework for Vision Transformers that uses layer-wise query construction from hidden states to improve dense prediction tasks, outperforming existing image-guided methods.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, ViTs are commonly operated on relatively small patch-token grids due to the quadratic cost of global self-attention, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation. This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states. This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments demonstrate that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers across dense prediction and semantic correspondence. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK@0.10, demonstrating that ViT-Up scales favorably with backbone capacity.

View arXiv page View PDF Project page GitHub 6 Add to collection

Community

Krispin

Paper author Paper submitter 1 day ago

Introducing ViT-Up: A state-of-the-art task-agnostic feature upsampler for Vision Transformers.
ViT-Up predicts features at ⭐ arbitrary continuous image coordinates ⭐, enabling dense feature maps at any resolution and sample-aware vision pipelines that query features only where they are needed.

Pretrained through self-supervised feature distillation on over one million ImageNet-1K images, it supports data-constrained dense prediction and fine-grained correspondence by letting downstream heads operate directly on dense DINOv3 features.

ViT-Up outperforms prior state-of-the-art feature upsamplers across dense prediction and semantic correspondence benchmarks. On DINOv3-S+, ViT-Up improves over prior methods by up to:

+2.07 mIoU on Cityscapes
+4.17 PCK@0.10 on SPair-71k

The project page includes pretrained checkpoints, code for training and evaluation, quantitative results, qualitative comparisons, the arXiv preprint, and a Google Colab demo:

https://vitup.papers.discuna.com/