# Depth Anything 3: DA3-GIANT








## Model Description



DA3 Giant model for multi-view depth estimation, camera pose estimation, and 3D Gaussian estimation. This is the flagship foundation model with unified depth-ray representation.

| Property | Value |
|----------|-------|
| Model Series | Any-view Model |
| Parameters | 1.15B |
| License | CC BY-NC 4.0 |

> ⚠️ **Non-commercial use only** due to the CC BY-NC 4.0 license.

## Capabilities



- ✅ Relative Depth
- ✅ Pose Estimation
- ✅ Pose Conditioning
- ✅ 3D Gaussians


## Quick Start



### Installation



```bash
git clone https://github.com/ByteDance-Seed/depth-anything-3
cd depth-anything-3
pip install -e .
```
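A quick way to confirm the editable install worked is to import the public API class (the same import path used in the Basic Example below):

```python
# Post-install sanity check: this import should succeed if `pip install -e .` worked.
from depth_anything_3.api import DepthAnything3

print("Import OK:", DepthAnything3.__name__)
```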


### Basic Example



```python
import torch
from depth_anything_3.api import DepthAnything3

# Load model from Hugging Face Hub
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DepthAnything3.from_pretrained("depth-anything/da3-giant")
model = model.to(device=device)

# Run inference on images
# (list of image paths, PIL Images, or numpy arrays)
images = ["image1.jpg", "image2.jpg"]

prediction = model.inference(
    images,
    export_dir="output",
    export_format="glb",  # Options: glb, npz, ply, mini_npz, gs_ply, gs_video
)

# Access results
print(prediction.depth.shape)       # Depth maps: [N, H, W] float32
print(prediction.conf.shape)        # Confidence maps: [N, H, W] float32
print(prediction.extrinsics.shape)  # Camera poses (w2c): [N, 3, 4] float32
print(prediction.intrinsics.shape)  # Camera intrinsics: [N, 3, 3] float32
```
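The documented shapes above are enough to lift each view into a world-space point cloud with standard pinhole geometry. The helper below is a minimal sketch, not part of the DA3 API; it assumes `extrinsics` holds a world-to-camera `[R | t]` matrix (as noted above) and that depth is measured along the camera z-axis:

```python
import numpy as np

def depth_to_world_points(depth, intrinsics, extrinsics):
    """Unproject one [H, W] depth map to world-space points of shape [H*W, 3].

    Assumptions (ours, not from the DA3 docs): pinhole camera model,
    z-axis depth, and world-to-camera extrinsics [R | t].
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)

    # Back-project pixels into camera space: X_cam = d * K^-1 [u, v, 1]^T
    rays = pixels @ np.linalg.inv(intrinsics).T
    cam_points = rays * depth.reshape(-1, 1)

    # Invert the w2c pose: X_world = R^T (X_cam - t)
    R, t = extrinsics[:, :3], extrinsics[:, 3]
    return (cam_points - t) @ R

# e.g., point cloud for the first view of the prediction above:
# points = depth_to_world_points(prediction.depth[0],
#                                prediction.intrinsics[0],
#                                prediction.extrinsics[0])
```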



### Command Line Interface



```bash
# Process images with auto mode
da3 auto path/to/images \
  --export-format glb \
  --export-dir output \
  --model-dir depth-anything/da3-giant

# Use backend for faster repeated inference
da3 backend --model-dir depth-anything/da3-giant
da3 auto path/to/images --export-format glb --use-backend
```


## Model Details



- **Developed by:** ByteDance Seed Team
- **Model Type:** Vision Transformer for Visual Geometry
- **Architecture:** Plain transformer with unified depth-ray representation
- **Training Data:** Public academic datasets only


## Key Insights



> 💎 A single plain transformer (e.g., a vanilla DINO encoder) is sufficient as a backbone, without architectural specialization.

> ✨ A singular depth-ray representation obviates the need for complex multi-task learning.
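To make the depth-ray idea concrete (our shorthand reading, not notation from the paper): if the model predicts, for every pixel $(u, v)$, a ray direction $\mathbf{r}(u, v)$ with origin $\mathbf{o}$ together with a scalar depth $d(u, v)$, the 3D point follows directly:

$$\mathbf{X}(u, v) = \mathbf{o} + d(u, v)\,\mathbf{r}(u, v)$$

Depth maps and camera geometry are then two readings of the same prediction target, which is one way to see why separate task-specific heads become unnecessary.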

## Performance



🏆 Depth Anything 3 significantly outperforms:

- **Depth Anything 2** for monocular depth estimation
- **VGGT** for multi-view depth estimation and pose estimation

For detailed benchmarks, please refer to our paper.



## Limitations



- The model is trained on public academic datasets only and may underperform on certain domain-specific images.
- Performance may vary with image quality, lighting conditions, and scene complexity.
- ⚠️ Non-commercial use only due to the CC BY-NC 4.0 license.


## Citation



    If you find Depth Anything 3 useful in your research or projects, please cite:

```bibtex
@article{depthanything3,
  title={Depth Anything 3: Recovering the visual space from any views},
  author={Haotong Lin and Sili Chen and Jun Hao Liew and Donny Y. Chen and Zhenyu Li and Guang Shi and Jiashi Feng and Bingyi Kang},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}
```

## Links



- 🏠 Project Page
- 📄 Paper
- 💻 [GitHub Repository](https://github.com/ByteDance-Seed/depth-anything-3)
- 🤗 Hugging Face Demo
- 📚 Documentation


## Authors



Haotong Lin · Sili Chen · Jun Hao Liew · Donny Y. Chen · Zhenyu Li · Guang Shi · Jiashi Feng · Bingyi Kang


## Files & Weights

| Filename | Size |
|----------|------|
| `model.safetensors` | 5.05 GB |