DINO-SAE: DINO Spherical Autoencoder for High-Fidelity Image Reconstruction and Generation
Abstract
A novel vision autoencoder framework combines semantic representation with pixel-level reconstruction using a spherical latent space and Riemannian flow matching, improving fidelity and efficiency.
Recent studies have explored using pretrained Vision Foundation Models (VFMs) such as DINO for generative autoencoders, showing strong generative performance. Unfortunately, existing approaches often suffer from limited reconstruction fidelity due to the loss of high-frequency details. In this work, we present the DINO Spherical Autoencoder (DINO-SAE), a framework that bridges semantic representation and pixel-level reconstruction. Our key insight is that the semantic information in contrastive representations is primarily encoded in the direction of feature vectors, while forcing strict magnitude matching can hinder the encoder from preserving fine-grained details. To address this, we introduce a Hierarchical Convolutional Patch Embedding module that enhances local structure and texture preservation, and a Cosine Similarity Alignment objective that enforces semantic consistency while allowing flexible feature magnitudes for detail retention. Furthermore, leveraging the observation that SSL-based foundation model representations intrinsically lie on a hypersphere, we employ Riemannian Flow Matching to train a Diffusion Transformer (DiT) directly on this spherical latent manifold. Experiments on ImageNet-1K demonstrate that our approach achieves state-of-the-art reconstruction quality, reaching 0.37 rFID and 26.2 dB PSNR, while maintaining strong semantic alignment with the pretrained VFM. Notably, our Riemannian Flow Matching-based DiT converges efficiently, achieving a gFID of 3.47 at 80 epochs.
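To make the two autoencoder-side ideas concrete, here is a minimal PyTorch sketch of a hierarchical convolutional patch embedding: the single large-stride patchify convolution of a standard ViT is replaced by a stack of small-stride convolutions, so local structure and texture survive the embedding. The class name, channel schedule, and layer choices below are illustrative assumptions; the abstract does not specify the module's exact architecture.

```python
import torch
import torch.nn as nn

class HierarchicalConvPatchEmbed(nn.Module):
    """Hypothetical sketch: embed 16x16 patches with four stride-2 convs
    instead of one stride-16 conv, preserving local detail."""
    def __init__(self, in_ch: int = 3, dim: int = 768, patch: int = 16):
        super().__init__()
        layers, ch, s, width = [], in_ch, patch, dim // 4
        while s > 1:  # progressively downsample by 2 until one patch = one token
            layers += [nn.Conv2d(ch, width, 3, stride=2, padding=1), nn.GELU()]
            ch, s, width = width, s // 2, min(width * 2, dim)
        layers += [nn.Conv2d(ch, dim, 1)]  # final channel projection
        self.proj = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                      # (B, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)   # (B, N, dim) patch tokens
```

The cosine-similarity alignment objective follows directly from the stated insight: compare only the direction of encoder features against the frozen DINO features, leaving the magnitude free to encode detail. A minimal sketch, assuming (B, N, D) patch tokens from both networks (function and argument names are illustrative):

```python
import torch.nn.functional as F

def cosine_alignment_loss(enc_feats, dino_feats):
    """1 - cosine similarity penalizes directional mismatch only,
    so feature norms remain free to carry fine-grained detail."""
    return (1.0 - F.cosine_similarity(enc_feats, dino_feats, dim=-1)).mean()
```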
Community
DINO-SAE bridges semantic directions and pixel fidelity via spherical latent diffusion with hierarchical patch embedding and cosine alignment, achieving state-of-the-art reconstruction while preserving semantic alignment.
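Because the latents live on the unit hypersphere, flow matching can be trained along geodesics rather than straight lines. The sketch below shows the standard spherical construction: interpolate noise and data by slerp and regress the DiT's velocity prediction onto the geodesic's time derivative. This is a generic Riemannian flow matching sketch under the unit-norm assumption, not the authors' code; `spherical_fm_target` and `dit` are hypothetical names.

```python
import torch
import torch.nn.functional as F

def spherical_fm_target(x1: torch.Tensor, t: torch.Tensor):
    """Sample (x_t, u_t) for flow matching on the unit hypersphere.

    x1: (B, D) unit-norm latents; t: (B, 1) times in [0, 1).
    x_t = slerp(x0, x1, t); u_t = d/dt slerp, tangent to the sphere.
    """
    x0 = F.normalize(torch.randn_like(x1), dim=-1)   # uniform noise on S^{D-1}
    cos = (x0 * x1).sum(-1, keepdim=True).clamp(-1 + 1e-6, 1 - 1e-6)
    theta = torch.acos(cos)                          # geodesic distance
    sin = torch.sin(theta)
    xt = (torch.sin((1 - t) * theta) * x0 + torch.sin(t * theta) * x1) / sin
    ut = theta * (torch.cos(t * theta) * x1 - torch.cos((1 - t) * theta) * x0) / sin
    return xt, ut

# Training step (velocity regression; `dit` is any velocity predictor):
#   xt, ut = spherical_fm_target(z, t)
#   loss = ((dit(xt, t) - ut) ** 2).mean()
```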
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation (2025)
- Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing (2025)
- VAE-REPA: Variational Autoencoder Representation Alignment for Efficient Diffusion Training (2026)
- RecTok: Reconstruction Distillation along Rectified Flow (2025)
- RePack: Representation Packing of Vision Foundation Model Features Enhances Diffusion Transformer (2025)
- REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion (2025)
- Distribution Matching Variational AutoEncoder (2025)