ViT-5: Vision Transformers for the Mid-2020s
Official checkpoint release.

📄 Paper: https://arxiv.org/abs/2602.08071
💻 Code: https://github.com/wangf3014/ViT-5


Overview

ViT-5 is a modernized Vision Transformer backbone that preserves the canonical Attention–FFN block structure while systematically upgrading its internal components using best practices from recent large-scale vision modeling research.

Rather than proposing a new paradigm, ViT-5 focuses on refining and consolidating improvements that have emerged over the past few years into a clean, scalable, and reproducible ViT design suitable for mid-2020s workloads.

This repository provides pretrained ViT-5 checkpoints for image recognition and for use as a general-purpose vision backbone.
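
The snippet below is a minimal loading sketch. The constructor name `vit5_base`, the `models` import, and the checkpoint filename are assumptions for illustration only; see the official repository (https://github.com/wangf3014/ViT-5) for the actual interface.

```python
import torch

# Hypothetical import: the repository's code is assumed to expose a
# `vit5_base` constructor. Check the repo for the real entry point.
from models import vit5_base

model = vit5_base(num_classes=1000)                            # build ViT-5-Base
state = torch.load("vit5_base_224.pth", map_location="cpu")    # hypothetical filename
model.load_state_dict(state)
model.eval()

# Run a single 224x224 RGB image through the model.
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = model(x)
print(logits.shape)  # expected: torch.Size([1, 1000])
```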


Model Architecture

ViT-5 retains the standard Transformer encoder structure:

Patch Embedding → [Attention → FFN] × L → Classification Head

but modernizes key components, including:

  • Improved normalization strategy
  • Updated positional encoding
  • Refined activation design
  • Architectural stabilization techniques
  • Training refinements

Full architectural details are described in the paper.
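
For reference, the block structure that ViT-5 retains is the canonical pre-norm Attention–FFN encoder block sketched below. This is not the exact ViT-5 implementation (the modernized normalization, positional encoding, and activation choices are described in the paper), just the shared skeleton.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Canonical pre-norm [Attention -> FFN] block; ViT-5 keeps this
    structure while upgrading the internal components (see the paper)."""

    def __init__(self, dim: int, num_heads: int, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual self-attention sub-block
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Residual feed-forward sub-block
        x = x + self.ffn(self.norm2(x))
        return x
```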


Available Checkpoints

Model       | Input Resolution | Params | Top-1 (ImageNet-1K) | Notes
------------|------------------|--------|---------------------|---------------
ViT-5-Small | 224              | 22M    | 82.2%               |
ViT-5-Base  | 224              | 87M    | 84.2%               |
ViT-5-Base  | 384              | 87M    | 85.4%               |
ViT-5-Large | 224              | 304M   | 84.9%               |
ViT-5-Large | 384              | 304M   | 86.0%               | Available soon

Please refer to the paper for detailed training configuration.


Intended Use

ViT-5 is designed as a general-purpose vision backbone and can be used for:

  • Image classification (fine-tuning or linear probing; see the linear-probe sketch after this list)
  • Transfer learning to detection and segmentation
  • Vision-language modeling
  • Generative modeling backbones (e.g., diffusion transformers)
  • Research on Transformer scaling and representation learning
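
As an illustration of the linear-probing use case, the sketch below freezes a backbone and trains only a linear classification head. The `backbone` object, its pooled-feature output, and the `embed_dim` value are placeholders, not the actual ViT-5 interface.

```python
import torch
import torch.nn as nn

def build_linear_probe(backbone: nn.Module, embed_dim: int, num_classes: int) -> nn.Module:
    """Freeze the pretrained backbone and attach a trainable linear head.
    Assumes `backbone(x)` returns pooled features of size `embed_dim`."""
    for p in backbone.parameters():
        p.requires_grad = False          # keep pretrained weights fixed
    head = nn.Linear(embed_dim, num_classes)
    return nn.Sequential(backbone, head)

# Usage sketch: only the head's parameters go to the optimizer.
# probe = build_linear_probe(backbone, embed_dim=768, num_classes=1000)
# optimizer = torch.optim.AdamW(probe[1].parameters(), lr=1e-3)
```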

Citation

If you use this model, please cite:

@article{wang2026vit5,
  title={ViT-5: Vision Transformers for the Mid-2020s},
  author={Wang, Feng and Ren, Sucheng and Zhang, Tiezheng and Neskovic, Predrag and Bhattad, Anand and Xie, Cihang and Yuille, Alan},
  journal={arXiv preprint arXiv:2602.08071},
  year={2026}
}

Acknowledgements

This work builds on the foundation of Vision Transformers and recent advances in scalable Transformer design.
