depth-anything

depth-anything/DA3-SMALL


Model Documentation

Depth Anything 3: DA3-SMALL





Project Page Paper Demo




Model Description



DA3 Small model for multi-view depth estimation and camera pose estimation. Efficient foundation model with unified depth-ray representation.

| Property | Value |
|----------|-------|
| Model Series | Any-view Model |
| Parameters | 0.08B |
| License | Apache 2.0 |



Capabilities



  • ✅ Relative Depth
  • ✅ Pose Estimation
  • ✅ Pose Conditioning


Quick Start



    Installation



    bash
    git clone https://github.com/ByteDance-Seed/depth-anything-3
    cd depth-anything-3
    pip install -e .
    


    Basic Example



    python
    import torch
    from depth_anything_3.api import DepthAnything3

    # Load model from Hugging Face Hub
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = DepthAnything3.from_pretrained("depth-anything/da3-small")
    model = model.to(device=device)

    # Run inference on images
    images = ["image1.jpg", "image2.jpg"]  # List of image paths, PIL Images, or numpy arrays
    prediction = model.inference(
        images,
        export_dir="output",
        export_format="glb",  # Options: glb, npz, ply, mini_npz, gs_ply, gs_video
    )

    # Access results
    print(prediction.depth.shape)       # Depth maps: [N, H, W] float32
    print(prediction.conf.shape)        # Confidence maps: [N, H, W] float32
    print(prediction.extrinsics.shape)  # Camera poses (w2c): [N, 3, 4] float32
    print(prediction.intrinsics.shape)  # Camera intrinsics: [N, 3, 3] float32
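The arrays returned above usually need a little post-processing before they feed into other tools. The following is a minimal sketch, not part of the DA3 API: it assumes the extrinsics follow the convention `X_cam = R @ X_world + t`, and it runs on synthetic NumPy arrays with the documented shapes rather than real model outputs. The helper names `w2c_to_c2w` and `mask_low_confidence` and the quantile threshold are hypothetical choices for illustration.

```python
import numpy as np


def w2c_to_c2w(extrinsics):
    """Invert [N, 3, 4] world-to-camera matrices into [N, 4, 4] camera-to-world."""
    n = extrinsics.shape[0]
    rot = extrinsics[:, :, :3]                   # [N, 3, 3]
    trans = extrinsics[:, :, 3]                  # [N, 3]
    rot_inv = np.transpose(rot, (0, 2, 1))       # inverse of a rotation is its transpose
    c2w = np.tile(np.eye(4, dtype=extrinsics.dtype), (n, 1, 1))
    c2w[:, :3, :3] = rot_inv
    c2w[:, :3, 3] = -np.einsum("nij,nj->ni", rot_inv, trans)
    return c2w


def mask_low_confidence(depth, conf, quantile=0.2):
    """Replace depth with NaN where confidence is below a per-view quantile.

    The quantile threshold is an illustrative assumption, not a DA3 default.
    """
    flat = conf.reshape(conf.shape[0], -1)
    thresh = np.quantile(flat, quantile, axis=1)  # one threshold per view, shape [N]
    return np.where(conf >= thresh[:, None, None], depth, np.nan)


# Synthetic stand-ins with the documented shapes (N=2 views, 4x4 pixels).
extrinsics = np.zeros((2, 3, 4), dtype=np.float32)
extrinsics[:, :, :3] = np.eye(3)
extrinsics[1, :, 3] = [1.0, 0.0, 0.0]            # second view: X_cam = X_world + (1, 0, 0)

c2w = w2c_to_c2w(extrinsics)
camera_centers = c2w[:, :3, 3]                   # camera positions in world space, [N, 3]

depth = np.ones((2, 4, 4), dtype=np.float32)
conf = np.linspace(0.0, 1.0, 32, dtype=np.float32).reshape(2, 4, 4)
masked = mask_low_confidence(depth, conf, quantile=0.5)
```

With real predictions, `extrinsics`, `depth`, and `conf` would be replaced by `prediction.extrinsics`, `prediction.depth`, and `prediction.conf`, converted to NumPy first if they are returned as tensors.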



    Command Line Interface



    bash
    # Process images with auto mode
    da3 auto path/to/images \
        --export-format glb \
        --export-dir output \
        --model-dir depth-anything/da3-small

    # Use backend for faster repeated inference
    da3 backend --model-dir depth-anything/da3-small
    da3 auto path/to/images --export-format glb --use-backend
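When the CLI is driven from a script, for example to process many folders, it can help to build the argument list in Python. The sketch below is a hypothetical helper, not part of the package. It uses only the flags and export formats shown in this card, and it assumes that `--model-dir` can be omitted when `--use-backend` is set, mirroring the backend command above.

```python
import shlex

# Export formats listed in the Basic Example above.
EXPORT_FORMATS = ("glb", "npz", "ply", "mini_npz", "gs_ply", "gs_video")


def build_da3_auto_command(image_dir, export_dir="output", export_format="glb",
                           model_dir="depth-anything/da3-small", use_backend=False):
    """Return the argument list for a `da3 auto` call (hypothetical helper)."""
    if export_format not in EXPORT_FORMATS:
        raise ValueError(f"unsupported export format: {export_format!r}")
    cmd = ["da3", "auto", image_dir,
           "--export-format", export_format,
           "--export-dir", export_dir]
    if use_backend:
        # Assumption: a running backend already holds the model.
        cmd.append("--use-backend")
    else:
        cmd += ["--model-dir", model_dir]
    return cmd


cmd = build_da3_auto_command("path/to/images")
print(shlex.join(cmd))
# To execute the command: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with image paths that contain spaces.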


Model Details



  • Developed by: ByteDance Seed Team
  • Model Type: Vision Transformer for Visual Geometry
  • Architecture: Plain transformer with unified depth-ray representation
  • Training Data: Public academic datasets only
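One way to read "depth-ray representation" is that each pixel carries a depth value and a ray, and the 3D point is the ray origin plus depth times the ray direction. The sketch below illustrates that idea under stated assumptions and is not the model's actual parameterization: it assumes a pinhole camera, z-depth (so directions have unit z in the camera frame rather than unit length), pixel centers at half-integer coordinates, and world-to-camera extrinsics with `X_cam = R @ X_world + t`. The function names are hypothetical.

```python
import numpy as np


def pixel_rays(K, w2c, height, width):
    """World-space ray origins and directions for every pixel of one camera.

    K: [3, 3] intrinsics, w2c: [3, 4] world-to-camera extrinsics.
    Returns origins [H, W, 3] and directions [H, W, 3].
    """
    rot, trans = w2c[:, :3], w2c[:, 3]
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous pixels, [H, W, 3]
    dirs_cam = pix @ np.linalg.inv(K).T                # K^-1 [u, v, 1], z component is 1
    dirs_world = dirs_cam @ rot                        # R^T d, written for row vectors
    origin = -rot.T @ trans                            # camera center in world space
    origins = np.broadcast_to(origin, dirs_world.shape)
    return origins, dirs_world


def depth_to_points(depth, K, w2c):
    """Combine a [H, W] depth map with per-pixel rays into [H, W, 3] world points."""
    height, width = depth.shape
    origins, dirs = pixel_rays(K, w2c, height, width)
    return origins + depth[..., None] * dirs


# Synthetic example: 2x2 image, camera shifted along x, constant depth of 2.
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0]])
w2c = np.array([[1.0, 0.0, 0.0, 1.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]])
depth = np.full((2, 2), 2.0)
points = depth_to_points(depth, K, w2c)
```

Projecting `points` back through `w2c` and `K` should recover the original pixel coordinates and depth, which is a convenient sanity check when adapting the sketch to a different camera convention.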


Key Insights

  • 💎 A single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization.
  • ✨ A singular depth-ray representation obviates the need for complex multi-task learning.

Performance

🏆 Depth Anything 3 significantly outperforms:
  • Depth Anything 2 for monocular depth estimation
  • VGGT for multi-view depth estimation and pose estimation

For detailed benchmarks, please refer to our paper.



Limitations

  • The model is trained only on public academic datasets, so accuracy may degrade on domain-specific images that fall outside that data.
  • Performance may vary with image quality, lighting conditions, and scene complexity.


Citation



    If you find Depth Anything 3 useful in your research or projects, please cite:

    bibtex
    @article{depthanything3,
      title={Depth Anything 3: Recovering the visual space from any views},
      author={Haotong Lin and Sili Chen and Jun Hao Liew and Donny Y. Chen and Zhenyu Li and Guang Shi and Jiashi Feng and Bingyi Kang},
      journal={arXiv preprint arXiv:XXXX.XXXXX},
      year={2025}
    }


Links

  • 🏠 Project Page
  • 📄 Paper
  • 💻 GitHub Repository
  • 🤗 Hugging Face Demo
  • 📚 Documentation


Authors

Haotong Lin · Sili Chen · Jun Hao Liew · Donny Y. Chen · Zhenyu Li · Guang Shi · Jiashi Feng · Bingyi Kang

Files & Weights

| Filename | Size |
|----------|------|
| model.safetensors | 0.13 GB |