---
title: Awesome Depth Anything 3
emoji: 🌊
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Metric 3D reconstruction from images/video
---

Awesome Depth Anything 3

Optimized fork of Depth Anything 3 with production-ready features

PyPI Python 3.10+ License Tests Open In Colab HF Spaces

Demo · Tutorial · Benchmarks · Original Paper


This is an optimized fork of Depth Anything 3 by ByteDance. All credit for the model architecture, training, and research goes to the original authors (see Credits below). This fork focuses on production optimization, developer experience, and ease of deployment.

🚀 What's New in This Fork

Feature | Description
Model Caching | ~200x faster model loading after first use
Adaptive Batching | Automatic batch size optimization based on GPU memory
PyPI Package | pip install awesome-depth-anything-3
CLI Improvements | Batch processing options, better error handling
Apple Silicon Optimized | Smart CPU/GPU preprocessing for best MPS performance
Comprehensive Benchmarks | Detailed performance analysis across devices

Performance Improvements

Metric | Upstream | This Fork | Improvement
Cached model load | ~1s | ~5ms | 200x faster
Batch 4 inference (MPS) | 3.32 img/s | 3.78 img/s | 1.14x faster
Cold model load | 1.28s | 0.77s | 1.7x faster
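
As a rough way to reproduce the cached-load numbers, the sketch below simply times two consecutive from_pretrained calls; it assumes the fork populates its cache transparently on the first call, so no dedicated caching API is needed, and it reuses the repository ID shown later in the Quick Start.

# Minimal sketch: measuring cold vs. cached model load.
# Assumption: the cache is filled transparently by the first from_pretrained()
# call; absolute timings depend on disk speed, device, and model size.
import time

from depth_anything_3.api import DepthAnything3

for label in ("cold load", "cached load"):
    start = time.perf_counter()
    model = DepthAnything3.from_pretrained("depth-anything/DA3-LARGE")
    print(f"{label}: {time.perf_counter() - start:.3f}s")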

Original Depth Anything 3

Recovering the Visual Space from Any Views

Haotong Lin* · Sili Chen* · Jun Hao Liew* · Donny Y. Chen* · Zhenyu Li · Guang Shi · Jiashi Feng
Bingyi Kang*†

†project lead *Equal Contribution

Paper PDF Project Page

This work presents Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from arbitrary visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights:

  • 💎 A single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization,
  • ✨ A singular depth-ray representation obviates the need for complex multi-task learning.

🏆 DA3 significantly outperforms DA2 for monocular depth estimation, and VGGT for multi-view depth estimation and pose estimation. All models are trained exclusively on public academic datasets.

[Figure: Depth Anything 3 teaser]

📰 News

  • 30-11-2025: Add use_ray_pose and ref_view_strategy (reference view selection for multi-view inputs).
  • 25-11-2025: Add Awesome DA3 Projects, a community-driven section featuring DA3-based applications.
  • 14-11-2025: Paper, project page, code and models are all released.

✨ Highlights

🏆 Model Zoo

We release three series of models, each tailored for specific use cases in visual geometry.

  • 🌟 DA3 Main Series (DA3-Giant, DA3-Large, DA3-Base, DA3-Small): These are our flagship foundation models, trained with a unified depth-ray representation. By varying the input configuration, a single model can perform a wide range of tasks:

    • 🌊 Monocular Depth Estimation: Predicts a depth map from a single RGB image.
    • 🌊 Multi-View Depth Estimation: Generates consistent depth maps from multiple images for high-quality fusion.
    • 🎯 Pose-Conditioned Depth Estimation: Achieves superior depth consistency when camera poses are provided as input.
    • 📷 Camera Pose Estimation: Estimates camera extrinsics and intrinsics from one or more images.
    • 🟡 3D Gaussian Estimation: Directly predicts 3D Gaussians, enabling high-fidelity novel view synthesis.
  • 📐 DA3 Metric Series (DA3Metric-Large): A specialized model fine-tuned for metric depth estimation in monocular settings, ideal for applications requiring real-world scale.

  • 🔍 DA3 Monocular Series (DA3Mono-Large): A dedicated model for high-quality relative monocular depth estimation. Unlike disparity-based models (e.g., Depth Anything 2), it directly predicts depth, resulting in superior geometric accuracy.

🔗 Leveraging these models, we also developed a nested series (DA3Nested-Giant-Large), which combines an any-view giant model with a metric model to reconstruct visual geometry at real-world metric scale.
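
To make the "one model, many tasks" idea concrete, here is a minimal sketch that runs the same checkpoint on a single image (monocular depth) and on several images of one scene (multi-view depth and pose). It reuses the inference API shown in the Quick Start below; the image paths are placeholders.

# Minimal sketch: one DA3 model, two input configurations.
# The image paths are placeholders; substitute your own files.
import torch
from depth_anything_3.api import DepthAnything3

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DepthAnything3.from_pretrained("depth-anything/DA3-LARGE").to(device)

# Monocular depth: one image in, one depth map out.
mono = model.inference(["scene/frame_000.png"])
print(mono.depth.shape)        # (1, H, W)

# Multi-view depth + camera pose: several images of the same scene.
multi = model.inference([f"scene/frame_{i:03d}.png" for i in range(4)])
print(multi.depth.shape)       # (4, H, W)
print(multi.extrinsics.shape)  # (4, 3, 4), OpenCV world-to-camera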

🛠️ Codebase Features

Our repository is designed to be a powerful and user-friendly toolkit for both practical application and future research.

  • 🎨 Interactive Web UI & Gallery: Visualize model outputs and compare results with an easy-to-use Gradio-based web interface.
  • Flexible Command-Line Interface (CLI): Powerful and scriptable CLI for batch processing and integration into custom workflows.
  • 💾 Multiple Export Formats: Save your results in various formats, including glb, npz, depth images, ply, and 3DGS videos, to seamlessly connect with other tools.
  • 🔧 Extensible and Modular Design: The codebase is structured to facilitate future research and the integration of new models or functionalities.

🚀 Quick Start

📦 Installation

# From PyPI (recommended)
pip install awesome-depth-anything-3

# With Gradio web UI
pip install awesome-depth-anything-3[app]

# With CUDA optimizations (xformers + gsplat)
pip install awesome-depth-anything-3[cuda]

# Everything
pip install awesome-depth-anything-3[all]
Development installation
git clone https://github.com/Aedelon/awesome-depth-anything-3.git
cd awesome-depth-anything-3
pip install -e ".[dev]"

# Optional: 3D Gaussian Splatting head
pip install --no-build-isolation git+https://github.com/nerfstudio-project/gsplat.git@0b4dddf

For detailed model information, please refer to the Model Cards section below.

💻 Basic Usage

import glob, os, torch
from depth_anything_3.api import DepthAnything3
device = torch.device("cuda")
model = DepthAnything3.from_pretrained("depth-anything/DA3NESTED-GIANT-LARGE")
model = model.to(device=device)
example_path = "assets/examples/SOH"
images = sorted(glob.glob(os.path.join(example_path, "*.png")))
prediction = model.inference(
    images,
)
# prediction.processed_images : [N, H, W, 3] uint8   array
print(prediction.processed_images.shape)
# prediction.depth            : [N, H, W]    float32 array
print(prediction.depth.shape)  
# prediction.conf             : [N, H, W]    float32 array
print(prediction.conf.shape)  
# prediction.extrinsics       : [N, 3, 4]    float32 array # opencv w2c or colmap format
print(prediction.extrinsics.shape)
# prediction.intrinsics       : [N, 3, 3]    float32 array
print(prediction.intrinsics.shape)

export MODEL_DIR=depth-anything/DA3NESTED-GIANT-LARGE
# This can be a Hugging Face repository or a local directory
# If you encounter network issues, consider using the following mirror: export HF_ENDPOINT=https://hf-mirror.com
# Alternatively, you can download the model directly from Hugging Face
export GALLERY_DIR=workspace/gallery
mkdir -p $GALLERY_DIR

# CLI auto mode with backend reuse
da3 backend --model-dir ${MODEL_DIR} --gallery-dir ${GALLERY_DIR} # Cache model to gpu
da3 auto assets/examples/SOH \
    --export-format glb \
    --export-dir ${GALLERY_DIR}/TEST_BACKEND/SOH \
    --use-backend

# CLI video processing with feature visualization
da3 video assets/examples/robot_unitree.mp4 \
    --fps 15 \
    --use-backend \
    --export-dir ${GALLERY_DIR}/TEST_BACKEND/robo \
    --export-format glb-feat_vis \
    --feat-vis-fps 15 \
    --process-res-method lower_bound_resize \
    --export-feat "11,21,31"

# CLI auto mode without backend reuse
da3 auto assets/examples/SOH \
    --export-format glb \
    --export-dir ${GALLERY_DIR}/TEST_CLI/SOH \
    --model-dir ${MODEL_DIR}

The model architecture is defined in DepthAnything3Net and specified with a YAML config file located in src/depth_anything_3/configs. Input and output processing are handled by DepthAnything3. To customize the model architecture, simply create a new config file (e.g., path/to/new/config) as follows:

__object__:
  path: depth_anything_3.model.da3
  name: DepthAnything3Net
  args: as_params

net:
  __object__:
    path: depth_anything_3.model.dinov2.dinov2
    name: DinoV2
    args: as_params

  name: vitb
  out_layers: [5, 7, 9, 11]
  alt_start: 4
  qknorm_start: 4
  rope_start: 4
  cat_token: True

head:
  __object__:
    path: depth_anything_3.model.dualdpt
    name: DualDPT
    args: as_params

  dim_in: &head_dim_in 1536
  output_dim: 2
  features: &head_features 128
  out_channels: &head_out_channels [96, 192, 384, 768]

Then, the model can be created with the following code snippet.

from depth_anything_3.cfg import create_object, load_config

Model = create_object(load_config("path/to/new/config"))

📚 Useful Documentation

🗂️ Model Cards

Generally, you should observe that DA3-LARGE achieves comparable results to VGGT.

The Nested series uses an Any-view model to estimate pose and depth, and a monocular metric depth estimator for scaling.

🗃️ Model Name | Series | 📏 Params | 📄 License
DA3NESTED-GIANT-LARGE | Nested | 1.40B | CC BY-NC 4.0
DA3-GIANT | Any-view | 1.15B | CC BY-NC 4.0
DA3-LARGE | Any-view | 0.35B | CC BY-NC 4.0
DA3-BASE | Any-view | 0.12B | Apache 2.0
DA3-SMALL | Any-view | 0.08B | Apache 2.0
DA3METRIC-LARGE | Monocular Metric Depth | 0.35B | Apache 2.0
DA3MONO-LARGE | Monocular Depth | 0.35B | Apache 2.0

⚡ Performance Benchmarks

Inference throughput measured on Apple Silicon (MPS) with PyTorch 2.9.0. For detailed benchmarks, see BENCHMARKS.md.

Apple Silicon (MPS) - Batch Size 1

Model | Latency | Throughput
DA3-Small | 46 ms | 22 img/s
DA3-Base | 93 ms | 11 img/s
DA3-Large | 265 ms | 3.8 img/s
DA3-Giant | 618 ms | 1.6 img/s

Cross-Device Comparison (DA3-Large)

Device | Throughput | vs CPU
CPU | 0.3 img/s | 1.0x
Apple Silicon (MPS) | 3.8 img/s | 13x
NVIDIA L4 (CUDA) | 10.3 img/s | 34x
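
For reference, a single-view throughput number like the ones above can be approximated with a simple timing loop such as the sketch below. It assumes a local test image and a warmed-up model, so absolute figures will differ from the controlled runs in BENCHMARKS.md.

# Minimal sketch: approximate single-image throughput (img/s).
# Assumptions: "test.png" is a local image and the model is warmed up first;
# per-call overhead is included, unlike a tightly controlled benchmark.
import time

import torch
from depth_anything_3.api import DepthAnything3

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model = DepthAnything3.from_pretrained("depth-anything/DA3-LARGE").to(device)

model.inference(["test.png"])  # warm-up run
n_runs = 10
start = time.perf_counter()
for _ in range(n_runs):
    model.inference(["test.png"])
print(f"{n_runs / (time.perf_counter() - start):.2f} img/s")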

Batch Processing

from depth_anything_3.api import DepthAnything3

model = DepthAnything3.from_pretrained("depth-anything/DA3-LARGE")

# Adaptive batching (recommended for large image sets)
results = model.batch_inference(
    images=image_paths,
    batch_size="auto",  # Automatically selects optimal batch size
    target_memory_utilization=0.85,
)

# Fixed batch size
results = model.batch_inference(
    images=image_paths,
    batch_size=4,
)

See BENCHMARKS.md for comprehensive benchmarks including preprocessing, attention mechanisms, and adaptive batching strategies.

❓ FAQ

  • Monocular Metric Depth: To obtain metric depth in meters from DA3METRIC-LARGE, use metric_depth = focal * net_output / 300., where focal is the focal length in pixels (typically the average of fx and fy from the camera intrinsic matrix K); a short worked sketch follows this list. Note that the output of DA3NESTED-GIANT-LARGE is already in meters.

  • Ray Head (use_ray_pose): Our API and CLI support the use_ray_pose argument. When enabled, the model derives the camera pose from the ray head, which is generally slightly slower but more accurate. The default is False for faster inference.

    AUC3 Results for DA3NESTED-GIANT-LARGE
    Model | HiRoom | ETH3D | DTU | 7Scenes | ScanNet++
    ray_head | 84.4 | 52.6 | 93.9 | 29.5 | 89.4
    cam_head | 80.3 | 48.4 | 94.1 | 28.5 | 85.0
  • Older GPUs without XFormers support: See Issue #11. Thanks to @S-Mahoney for the solution!
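
Following the metric-depth formula above, a minimal conversion sketch might look like the following; the function and variable names (to_metric_depth, net_output, K) are hypothetical, and net_output stands for the raw DA3METRIC-LARGE prediction for one image.

# Minimal sketch of the FAQ formula: metric_depth = focal * net_output / 300.
# Assumptions: net_output is the raw [H, W] model output for one image and
# K is the corresponding 3x3 camera intrinsic matrix.
import numpy as np

def to_metric_depth(net_output: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Convert raw DA3METRIC-LARGE output to depth in meters."""
    focal = (K[0, 0] + K[1, 1]) / 2.0  # average of fx and fy, in pixels
    return focal * net_output / 300.0

# Example with dummy data:
net_output = np.ones((480, 640), dtype=np.float32)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
print(to_metric_depth(net_output, K).mean())  # ≈ 1.67 m for this dummy input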

🏢 Awesome DA3 Projects

A community-curated list of Depth Anything 3 integrations across 3D tools, creative pipelines, robotics, and web/VR viewers. You are welcome to submit your DA3-based project via PR, and we will review and feature it if applicable.

  • DA3-blender: Blender addon for DA3-based 3D reconstruction from a set of images.

  • ComfyUI-DepthAnythingV3: ComfyUI nodes for Depth Anything 3, supporting single/multi-view and video-consistent depth with optional point‑cloud export.

  • DA3-ROS2-Wrapper: Real-time DA3 depth in ROS2 with multi-camera support.

  • VideoDepthViewer3D: Streams video with DA3 metric depth to a Three.js/WebXR 3D viewer for VR/stereo playback.

📝 Credits

Original Authors

This package is built on top of Depth Anything 3, created by the ByteDance Seed team.

All model weights, architecture, and core algorithms are their work. This fork only adds production optimizations and deployment tooling.

Fork Maintainer

This optimized fork is maintained by Delanoe Pirard (Aedelon).

Contributions:

  • Model caching system
  • Adaptive batching
  • Apple Silicon (MPS) optimizations
  • PyPI packaging and CI/CD
  • Comprehensive benchmarking

Citation

If you use Depth Anything 3 in your research, please cite the original paper:

@article{depthanything3,
  title={Depth Anything 3: Recovering the visual space from any views},
  author={Haotong Lin and Sili Chen and Jun Hao Liew and Donny Y. Chen and Zhenyu Li and Guang Shi and Jiashi Feng and Bingyi Kang},
  journal={arXiv preprint arXiv:2511.10647},
  year={2025}
}

If you specifically use features from this fork (caching, batching, MPS optimizations), you may additionally reference:

awesome-depth-anything-3: https://github.com/Aedelon/awesome-depth-anything-3