PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction
Abstract
PixARMesh enables end-to-end 3D indoor scene mesh reconstruction from single RGB images using a unified model with cross-attention and autoregressive token generation.
We introduce PixARMesh, a method to autoregressively reconstruct complete 3D indoor scene meshes directly from a single RGB image. Unlike prior methods that rely on implicit signed distance fields and post-hoc layout optimization, PixARMesh jointly predicts object layout and geometry within a unified model, producing coherent and artist-ready meshes in a single forward pass. Building on recent advances in mesh generative models, we augment a point-cloud encoder with pixel-aligned image features and global scene context via cross-attention, enabling accurate spatial reasoning from a single image. Scenes are generated autoregressively from a unified token stream containing context, pose, and mesh, yielding compact meshes with high-fidelity geometry. Experiments on synthetic and real-world datasets show that PixARMesh achieves state-of-the-art reconstruction quality while producing lightweight, high-quality meshes ready for downstream applications.
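The abstract describes fusing point-cloud encoder features with pixel-aligned image features and global scene context via cross-attention. The paper does not publish this layer's exact form; below is a minimal NumPy sketch of the general mechanism, where point tokens act as queries over image tokens. All weight matrices are random stand-ins for learned projections, and shapes/names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(point_feats, image_feats, d_k=32, rng=None):
    """Fuse point tokens (queries) with image tokens (keys/values).

    point_feats: (N, D) per-point features from a point-cloud encoder
    image_feats: (M, D) pixel-aligned / global image features
    Weights here are random placeholders for learned projections.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    D = point_feats.shape[1]
    Wq = rng.standard_normal((D, d_k)) / np.sqrt(D)
    Wk = rng.standard_normal((D, d_k)) / np.sqrt(D)
    Wv = rng.standard_normal((D, D)) / np.sqrt(D)

    Q = point_feats @ Wq                       # (N, d_k)
    K = image_feats @ Wk                       # (M, d_k)
    V = image_feats @ Wv                       # (M, D)
    attn = softmax(Q @ K.T / np.sqrt(d_k))     # (N, M) attention weights
    return point_feats + attn @ V              # residual update, (N, D)

# toy example: 8 point tokens attend over 16 image tokens
pts = np.random.default_rng(1).standard_normal((8, 64))
img = np.random.default_rng(2).standard_normal((16, 64))
out = cross_attention(pts, img)
print(out.shape)  # (8, 64)
```

In a real model the projections would be trained jointly with the rest of the network; the point of the sketch is only that each point token gathers image evidence before mesh tokens are decoded.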
Community
PixARMesh is a mesh-native autoregressive framework for single-view 3D scene reconstruction.
Instead of reconstructing via intermediate volumetric or implicit representations, PixARMesh directly models instances with a native mesh representation. Object poses and meshes are predicted in a unified autoregressive sequence.
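The unified sequence containing context, pose, and mesh tokens can be pictured as a single token stream that the model emits autoregressively. The special tokens and per-object layout below (`<BOS>`, `<OBJ>`, `<EOS>`, pose-then-mesh ordering) are hypothetical illustrations, not the paper's actual vocabulary:

```python
def build_scene_sequence(objects, context_tokens):
    """Assemble one autoregressive token stream for a whole scene.

    objects: list of (pose_tokens, mesh_tokens) pairs, one per instance
    context_tokens: scene-level conditioning tokens
    Token names are hypothetical placeholders.
    """
    seq = ["<BOS>"] + list(context_tokens)
    for pose_tokens, mesh_tokens in objects:
        seq.append("<OBJ>")          # start of a new object instance
        seq += list(pose_tokens)     # layout: e.g. translation/rotation/scale
        seq += list(mesh_tokens)     # geometry: serialized mesh tokens
    seq.append("<EOS>")
    return seq

objects = [
    (["pose_t", "pose_r", "pose_s"], ["v1", "v2", "f1"]),
    (["pose_t", "pose_r", "pose_s"], ["v1", "v2", "v3", "f1"]),
]
stream = build_scene_sequence(objects, ["ctx0", "ctx1"])
print(stream[:6])  # ['<BOS>', 'ctx0', 'ctx1', '<OBJ>', 'pose_t', 'pose_r']
```

Predicting pose and geometry in one stream is what lets a single decoder keep layout and mesh generation mutually consistent, rather than optimizing layout post hoc.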
The following papers were recommended by the Semantic Scholar API
- Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image (2026)
- ShapeR: Robust Conditional 3D Shape Generation from Casual Captures (2026)
- FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation (2026)
- NOVA3R: Non-pixel-aligned Visual Transformer for Amodal 3D Reconstruction (2026)
- LaVR: Scene Latent Conditioned Generative Video Trajectory Re-Rendering using Large 4D Reconstruction Models (2026)
- MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE (2026)
- Hand3R: Online 4D Hand-Scene Reconstruction in the Wild (2026)