OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder
Abstract
OneWorld enables 3D scene generation by performing diffusion in a unified 3D representation space using a 3D Unified Representation Autoencoder and specialized consistency losses.
Existing diffusion-based 3D scene generation methods primarily operate in 2D image/video latent spaces, which makes maintaining cross-view appearance and geometric consistency inherently challenging. To bridge this gap, we present OneWorld, a framework that performs diffusion directly within a coherent 3D representation space. Central to our approach is the 3D Unified Representation Autoencoder (3D-URAE), which leverages pretrained 3D foundation models and augments their geometry-centric nature by injecting appearance and distilling semantics into a unified 3D latent space. Furthermore, we introduce a token-level Cross-View-Correspondence (CVC) consistency loss to explicitly enforce structural alignment across views, and propose Manifold-Drift Forcing (MDF), which mitigates train-inference exposure bias and shapes a robust 3D manifold by mixing drifted and original representations. Comprehensive experiments demonstrate that OneWorld generates high-quality 3D scenes with superior cross-view consistency compared to state-of-the-art 2D-based methods. Our code will be available at https://github.com/SensenGao/OneWorld.
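The abstract names two training-time mechanisms: the CVC loss, which pulls structurally corresponding latent tokens from different views together, and MDF, which mixes drifted representations into training to counter exposure bias. Since the code is not yet released, the PyTorch sketch below only illustrates what such components could look like; every shape, name, and hyperparameter (the cosine-similarity form of the loss, the Gaussian drift, `drift_scale`, `mix_prob`) is an assumption for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def cvc_consistency_loss(tokens_a: torch.Tensor,
                         tokens_b: torch.Tensor,
                         correspondence: torch.Tensor) -> torch.Tensor:
    """Hypothetical token-level Cross-View-Correspondence (CVC) loss.

    tokens_a, tokens_b: (B, N, D) latent tokens from two views.
    correspondence:     (B, N) long tensor mapping each token in view A to
                        its structurally corresponding token in view B
                        (assumed to come from known camera geometry).
    """
    idx = correspondence.unsqueeze(-1).expand(-1, -1, tokens_b.size(-1))
    matched_b = torch.gather(tokens_b, dim=1, index=idx)  # (B, N, D)
    # Pull corresponding tokens together in the unified 3D latent space.
    return 1.0 - F.cosine_similarity(tokens_a, matched_b, dim=-1).mean()


def manifold_drift_forcing(latent: torch.Tensor,
                           drift_scale: float = 0.1,
                           mix_prob: float = 0.5) -> torch.Tensor:
    """Hypothetical Manifold-Drift Forcing (MDF): expose the model to
    slightly drifted latents during training, mixed per-sample with clean
    ones, so small inference-time errors stay on the learned manifold."""
    drifted = latent + drift_scale * torch.randn_like(latent)
    mask = torch.rand(latent.size(0), 1, 1, device=latent.device) < mix_prob
    return torch.where(mask, drifted, latent)


if __name__ == "__main__":
    B, N, D = 2, 16, 64
    a, b = torch.randn(B, N, D), torch.randn(B, N, D)
    corr = torch.randint(0, N, (B, N))
    print(cvc_consistency_loss(a, b, corr))          # scalar loss
    print(manifold_drift_forcing(a).shape)           # (2, 16, 64)
```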
Community
The following papers were recommended by the Semantic Scholar API:
- BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model (2026)
- SceneTok: A Compressed, Diffusable Token Space for 3D Scenes (2026)
- VAR-3D: View-aware Auto-Regressive Model for Text-to-3D Generation via a 3D Tokenizer (2026)
- LaVR: Scene Latent Conditioned Generative Video Trajectory Re-Rendering using Large 4D Reconstruction Models (2026)
- One-Shot Refiner: Boosting Feed-forward Novel View Synthesis via One-Step Diffusion (2026)
- GeoNVS: Geometry Grounded Video Diffusion for Novel View Synthesis (2026)
- SemanticNVS: Improving Semantic Scene Understanding in Generative Novel View Synthesis (2026)