This is a model checkpoint accompanying the CVPR 2026 paper “LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis”. The accompanying code is available on GitHub.
The model takes as input a set of 2D images, optionally with corresponding camera poses, and builds an internal representation of the scene they depict. Given a new camera pose from which a user wishes to view the scene, it then renders the observed content from that viewpoint, effectively estimating what it would look like in 3D from an unseen angle. This checkpoint was trained with 1–10 input views, with and without input camera poses, at 512-pixel resolution (longer side) on a mix of datasets. It is intended for general use on any static scene, works with or without known source camera poses, and supports aspect ratios in the [0.5, 2.0] range. It is released under the FAIR research license.
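The stated input constraints (1–10 views, longer side resized to 512 px, aspect ratio within [0.5, 2.0]) can be checked before inference with a small helper like the sketch below. The function name and interface are illustrative, not part of the released API.

```python
def check_and_target_size(width, height, n_views):
    """Validate an input against the checkpoint's stated constraints and
    return the (width, height) the image should be resized to.

    Illustrative helper only; not part of the released LagerNVS API.
    """
    if not (1 <= n_views <= 10):
        raise ValueError("checkpoint was trained with 1-10 input views")
    aspect = width / height
    if not (0.5 <= aspect <= 2.0):
        raise ValueError("aspect ratio must be within [0.5, 2.0]")
    # Scale so the longer side becomes 512 px, preserving aspect ratio.
    scale = 512 / max(width, height)
    return round(width * scale), round(height * scale)


# Example: a 1920x1080 view with 2 input images maps to 512x288.
print(check_and_target_size(1920, 1080, 2))
```

Resizing by the longer side keeps both dimensions at or below 512 while preserving the (supported) aspect ratio of the source images.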
The expected performance on the splits detailed in the paper is:
| Dataset | Views | Posed | PSNR | SSIM | LPIPS |
|---|---|---|---|---|---|
| Re10k | 2 | ✓ | 29.05 | 0.901 | 0.147 |
| Re10k | 2 | ✗ | 28.28 | 0.885 | 0.155 |
| Re10k | 2 | ✓ | 26.40 | 0.867 | 0.188 |
| Re10k | 2 | ✗ | 25.64 | 0.848 | 0.201 |
| DL3DV | 2 | ✓ | 21.77 | 0.692 | 0.287 |
| DL3DV | 2 | ✗ | 21.33 | 0.670 | 0.301 |
| DL3DV | 4 | ✓ | 24.94 | 0.780 | 0.188 |
| DL3DV | 4 | ✗ | 23.99 | 0.744 | 0.206 |
| DL3DV | 6 | ✓ | 26.14 | 0.808 | 0.159 |
| DL3DV | 6 | ✗ | 24.97 | 0.769 | 0.178 |
| DL3DV | 16 | ✓ | 25.42 | 0.782 | 0.171 |
| DL3DV | 16 | ✗ | 23.49 | 0.719 | 0.211 |
| CO3D | 3 | ✓ | 21.31 | 0.691 | 0.386 |
| CO3D | 3 | ✗ | 20.22 | 0.667 | 0.431 |
| CO3D | 6 | ✓ | 23.65 | 0.733 | 0.317 |
| CO3D | 6 | ✗ | 21.65 | 0.684 | 0.377 |
| CO3D | 9 | ✓ | 24.74 | 0.747 | 0.292 |
| CO3D | 9 | ✗ | 22.37 | 0.697 | 0.352 |
| Mip360 | 3 | ✓ | 18.08 | 0.434 | 0.497 |
| Mip360 | 3 | ✗ | 17.45 | 0.413 | 0.531 |
| Mip360 | 6 | ✓ | 19.39 | 0.469 | 0.436 |
| Mip360 | 6 | ✗ | 18.97 | 0.447 | 0.466 |
| Mip360 | 9 | ✓ | 20.39 | 0.493 | 0.402 |
| Mip360 | 9 | ✗ | 19.68 | 0.462 | 0.438 |
Known limitations:
- The model is trained with deterministic (i.e. not generative) losses, so it cannot generate plausible completions of unobserved regions of the scene.
- The model is suited only to static scenes; dynamic content is not supported.
- The training data did not include humans or animals, nor images with lens distortion (e.g., fish-eye). As a consequence, we do not expect the model to work well in such scenarios.
- When rendering a video, block artifacts and flicker occasionally appear. These often stem from uncertainty in the estimated geometry, unobserved regions, or difficulty estimating the source camera poses.
- Regions with high-frequency patterns, such as grass or trees, are systematically poorly represented by our model—investigating this failure mode is an interesting avenue for future work.