Abstract
An object-centric 3D generative model is extended with adaptive latent space and iterative refinement to generate complete 3D scenes from single images, incorporating noise-aware completion and 3D-aware optimization for improved fidelity.
In this paper, we propose Extend3D, a training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the limitations of fixed-size latent spaces in object-centric models for representing wide scenes, we extend the latent space in the x and y directions. Then, by dividing the extended latent space into overlapping patches, we apply the object-centric 3D generative model to each patch and couple the patches at each time step. Since patch-wise 3D generation with image conditioning requires strict spatial alignment between image and latent patches, we initialize the scene using a point cloud prior from a monocular depth estimator and iteratively refine occluded regions through SDEdit. We find that treating the incompleteness of the 3D structure as noise during 3D refinement enables 3D completion, a concept we term under-noising. Furthermore, to address the sub-optimality of object-centric models for sub-scene generation, we optimize the extended latent during denoising, ensuring that the denoising trajectories remain consistent with the sub-scene dynamics. To this end, we introduce 3D-aware optimization objectives for improved geometric structure and texture fidelity. We demonstrate that our method yields better results than prior methods, as evidenced by human preference studies and quantitative experiments.
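The patch-wise step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that "coupling" means averaging the per-patch outputs wherever patches overlap (the abstract does not specify the rule), and it uses a 2D grid of scalars as a stand-in for the latent extended in x and y. The names `patch_origins`, `coupled_step`, and `denoise_patch` are hypothetical; `denoise_patch` stands in for one denoising step of the object-centric model on a fixed-size patch.

```python
from typing import Callable, List

Grid = List[List[float]]


def patch_origins(extent: int, patch: int, stride: int) -> List[int]:
    """Start indices of overlapping windows that cover [0, extent)."""
    if patch > extent:
        raise ValueError("patch must not exceed the latent extent")
    origins = list(range(0, extent - patch + 1, stride))
    if origins[-1] != extent - patch:
        # Add a final window flush with the border so no cell is left uncovered.
        origins.append(extent - patch)
    return origins


def coupled_step(
    latent: Grid,
    patch: int,
    stride: int,
    denoise_patch: Callable[[Grid], Grid],
) -> Grid:
    """Run one step on every overlapping patch, then average the overlaps.

    Averaging is an assumed coupling rule, chosen only for illustration.
    """
    height, width = len(latent), len(latent[0])
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for y0 in patch_origins(height, patch, stride):
        for x0 in patch_origins(width, patch, stride):
            window = [row[x0:x0 + patch] for row in latent[y0:y0 + patch]]
            out = denoise_patch(window)
            for dy in range(patch):
                for dx in range(patch):
                    total[y0 + dy][x0 + dx] += out[dy][dx]
                    count[y0 + dy][x0 + dx] += 1
    # Every cell is covered by at least one window, so count is never zero.
    return [
        [total[y][x] / count[y][x] for x in range(width)]
        for y in range(height)
    ]
```

With a stride smaller than the patch size, neighbouring patches share cells, and the averaging keeps those shared cells consistent from one time step to the next. In a full sampler this function would be called once per time step, with each patch conditioned on its spatially aligned image crop.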
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors (2026)
- LaS-Comp: Zero-shot 3D Completion with Latent-Spatial Consistency (2026)
- OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder (2026)
- LiTo: Surface Light Field Tokenization (2026)
- Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image (2026)
- Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas (2026)
- TexSpot: 3D Texture Enhancement with Spatially-uniform Point Latent Representation (2026)
Get this paper in your agent:
hf papers read 2603.29387
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
