Abstract
Finite Scalar Quantization with an improved activation mapping (iFSQ) enables unified modeling of discrete and continuous image generation, revealing an optimal representation balance near 4 bits per dimension and contrasting convergence behavior between AR and diffusion models.
The field of image generation is currently bifurcated into autoregressive (AR) models operating on discrete tokens and diffusion models utilizing continuous latents. This divide, rooted in the distinction between VQ-VAEs and VAEs, hinders unified modeling and fair benchmarking. Finite Scalar Quantization (FSQ) offers a theoretical bridge, yet vanilla FSQ suffers from a critical flaw: its equal-interval quantization is mismatched to the distribution of encoder activations and can cause activation collapse. This mismatch forces a trade-off between reconstruction fidelity and information efficiency. In this work, we resolve this dilemma by simply replacing the activation function in the original FSQ with a distribution-matching mapping that enforces a uniform prior. Termed iFSQ, this strategy requires just one line of code yet mathematically guarantees both optimal bin utilization and reconstruction precision. Leveraging iFSQ as a controlled benchmark, we uncover two key insights: (1) the optimal equilibrium between discrete and continuous representations lies at approximately 4 bits per dimension; (2) under identical reconstruction constraints, AR models exhibit rapid initial convergence, whereas diffusion models achieve a superior performance ceiling, suggesting that strict sequential ordering may limit the upper bounds of generation quality. Finally, we extend our analysis by adapting Representation Alignment (REPA) to AR models, yielding LlamaGen-REPA. Code is available at https://github.com/Tencent-Hunyuan/iFSQ
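To make the one-line fix concrete, here is a minimal PyTorch sketch of FSQ versus iFSQ. It assumes the distribution-matching map is the Gaussian CDF (so roughly standard-normal encoder activations become uniform on [-1, 1]) and uses 16 levels per dimension, i.e., the 4 bits/dim the abstract identifies as optimal; the exact mapping and level count in the official code may differ:

```python
import torch

def fsq(x: torch.Tensor, levels: int = 16) -> torch.Tensor:
    """Vanilla FSQ: tanh bounding + equal-interval rounding."""
    z = torch.tanh(x)  # mass concentrates near 0, so outer bins go underused
    z_q = torch.round((z + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1
    return z + (z_q - z).detach()  # straight-through estimator

def ifsq(x: torch.Tensor, levels: int = 16) -> torch.Tensor:
    """iFSQ sketch: swap the activation for a distribution-matching map."""
    # erf(x / sqrt(2)) = 2*Phi(x) - 1, which is uniform on [-1, 1] if x ~ N(0, 1),
    # so every equal-interval bin receives equal probability mass (assumption:
    # the paper's mapping is CDF-based; only this one line differs from fsq).
    z = torch.erf(x * 0.7071067811865476)
    z_q = torch.round((z + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1
    return z + (z_q - z).detach()
```

Under a uniform pre-quantization distribution, equal-interval bins are used with equal frequency, which is what "optimal bin utilization" refers to in the abstract.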
Community
AR or Diffusion?
It's been hard to judge because of different tokenizers (VQ vs. VAE). Enter iFSQ, with just 1 line of code! We found: (1) AR wins on efficiency, but Diffusion hits a higher quality ceiling. (2) The sweet spot for representations is ~4 bits per dimension.
We brought REPA to LlamaGen and solved the missing piece: Where to align?
It turns out there's no fixed layer, but a Golden Ratio!
We found the optimal alignment depth is consistently 1/3 of total layers for both AR & Diffusion.
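A hedged sketch of what that alignment could look like in PyTorch, following the REPA recipe (project an intermediate hidden state, then maximize cosine similarity with frozen encoder features such as DINOv2 patch tokens). `REPAHead`, the MLP shape, and the target encoder are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

class REPAHead(torch.nn.Module):
    """Hypothetical projection from the generator's hidden dim to the frozen
    encoder's feature dim (the learned alignment head in REPA)."""
    def __init__(self, hidden_dim: int, feat_dim: int):
        super().__init__()
        self.proj = torch.nn.Sequential(
            torch.nn.Linear(hidden_dim, hidden_dim), torch.nn.SiLU(),
            torch.nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.proj(h)

def repa_loss(hidden_states: list[torch.Tensor],
              target_feats: torch.Tensor,
              head: REPAHead) -> torch.Tensor:
    """Align the hidden state at 1/3 of the network depth with frozen encoder
    features via a negative-cosine-similarity loss."""
    h = hidden_states[len(hidden_states) // 3]  # the "1/3 of total layers" finding
    return -F.cosine_similarity(head(h), target_feats, dim=-1).mean()
```

This auxiliary loss would be added to the AR (or diffusion) training objective with a weighting coefficient; the 1/3-depth index is the only part taken directly from the finding above.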
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SFTok: Bridging the Performance Gap in Discrete Tokenizers (2025)
- Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing (2025)
- Improving Flexible Image Tokenizers for Autoregressive Image Generation (2026)
- Soft Tail-dropping for Adaptive Visual Tokenization (2026)
- ResTok: Learning Hierarchical Residuals in 1D Visual Tokenizers for Autoregressive Image Generation (2026)
- NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation (2026)
- Multi-Scale Local Speculative Decoding for Image Generation (2026)
arXivlens breakdown of this paper: https://arxivlens.com/PaperView/Details/ifsq-improving-fsq-for-image-generation-with-1-line-of-code-7518-9c6ef569
- Executive Summary
- Detailed Breakdown
- Practical Applications