
RDT2-VQ: Vision-Language-Action with Residual VQ Action Tokens



RDT2-VQ is an autoregressive Vision-Language-Action (VLA) model adapted from Qwen2.5-VL-7B-Instruct and trained on large-scale UMI bimanual manipulation data. It predicts a short-horizon relative action chunk (24 steps, 20 dims/step) from binocular wrist-camera RGB and a natural-language instruction. Actions are discretized with a lightweight Residual VQ (RVQ) tokenizer, enabling robust zero-shot transfer across unseen embodiments for simple, open-vocabulary skills (e.g., pick, place, shake, wipe).

Home • GitHub • Discord • Paper

---

Table of contents

* Highlights
* Model details
* Hardware & software requirements
* Quickstart (inference)
* Precision settings
* Intended uses & limitations
* Troubleshooting
* Changelog
* Citation
* Contact

---

    Highlights



* Zero-shot cross-embodiment: Demonstrated on bimanual UR5e and Franka Research 3 setups; designed to generalize further with correct hardware calibration.
* UMI scale: Trained on 10k+ hours of human manipulation data collected with the UMI gripper across 100+ indoor scenes.
* Residual VQ action tokenizer: Compact, stable action codes; open-vocabulary instruction following via the Qwen2.5-VL-7B backbone.

    ---

    Model details



    Architecture



* Backbone: Qwen2.5-VL-7B-Instruct (vision-language).
* Observation: two wrist-camera RGB images (right/left), 384×384, JPEG-like statistics (a loading sketch follows this list).
* Instruction: short imperative text, recommended format "Verb + Object." (e.g., "Pick up the apple.").
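The repository's own pipeline handles preprocessing during deployment; as a rough, non-authoritative sketch (assuming plain PIL/NumPy and hypothetical file paths), each wrist frame can be resized to 384×384 and batched into the (1, 384, 384, 3) uint8 layout used in the quickstart below:

```python
import numpy as np
from PIL import Image

def load_wrist_image(path: str) -> np.ndarray:
    """Load one RGB wrist-camera frame as a (1, 384, 384, 3) uint8 array."""
    img = Image.open(path).convert("RGB").resize((384, 384), Image.BILINEAR)
    return np.asarray(img, dtype=np.uint8)[None]  # add a leading batch/time dim

# Hypothetical file names; camera0 = right arm, camera1 = left arm (UMI convention).
camera0_rgb = load_wrist_image("right_wrist.jpg")
camera1_rgb = load_wrist_image("left_wrist.jpg")
assert camera0_rgb.shape == (1, 384, 384, 3)
```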

    Action representation (UMI bimanual, per 24-step chunk)



* 20-D per step = right arm (10) + left arm (10), each composed of:
  * pos (x, y, z): 3
  * rot (6D rotation): 6
  * gripper width: 1
* Output tensor shape: (T=24, D=20), relative deltas, float32 (see the slicing sketch below).
* The RVQ tokenizer yields a fixed-length token sequence; see the tokenizer card for exact code lengths.
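As a reading aid, here is a minimal sketch of splitting a predicted (24, 20) chunk into per-arm components, using the index layout documented in the quickstart comments below (the helper name is ours; verify the indices against the repo):

```python
import torch

def split_action_chunk(action_chunk: torch.Tensor) -> dict:
    """Split a (24, 20) action chunk into per-arm pos / 6D rot / gripper width."""
    assert action_chunk.shape == (24, 20)
    arms = {}
    for name, offset in (("right", 0), ("left", 10)):
        arms[name] = {
            "pos": action_chunk[:, offset:offset + 3],         # (24, 3) xyz in meters
            "rot6d": action_chunk[:, offset + 3:offset + 9],   # (24, 6) 6D rotation
            "gripper_width": action_chunk[:, offset + 9],      # (24,) in meters
        }
    return arms
```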

    Tokenizer



* Tokenizer repo: robotics-diffusion-transformer/RVQActionTokenizer
* Use float32 for the VQ model.
* Provide a LinearNormalizer for action scaling (UMI convention).

    ---

    Hardware & software requirements



    Approximate single-GPU requirements (Qwen2.5-VL-7B-Instruct scale):

| Mode      | RAM     | VRAM    | Example GPU             |
| --------- | ------: | ------: | ----------------------- |
| Inference | ≥ 32 GB | ≥ 16 GB | RTX 4090                |
| LoRA FT   | –       | ≥ 32 GB | A100 40GB               |
| Full FT   | –       | ≥ 80 GB | A100 80GB / H100 / B200 |

    > For deployment on real robots, follow your platform’s end-effector + camera choices and perform hardware setup & calibration (camera stand/pose, flange, etc.) before running closed-loop policies.

    Tested OS: Ubuntu 24.04.

    ---

    Quickstart (inference)



```python
# Run under the official repository: https://github.com/thu-ml/RDT2
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

from vqvae import MultiVQVAE
from models.normalizer import LinearNormalizer
from utils import batch_predict_action

# Assuming GPU 0 is used.
device = "cuda:0"

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "robotics-diffusion-transformer/RDT2-VQ",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map=device,
).eval()
vae = MultiVQVAE.from_pretrained("robotics-diffusion-transformer/RVQActionTokenizer").eval()
vae = vae.to(device=device, dtype=torch.float32)

valid_action_id_length = (
    vae.pos_id_len + vae.rot_id_len + vae.grip_id_len
)

# TODO: modify to your own downloaded normalizer path.
# Download from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
normalizer = LinearNormalizer.from_pretrained("umi_normalizer_wo_downsample_indentity_rot.pt")

result = batch_predict_action(
    model,
    processor,
    vae,
    normalizer,
    examples=[
        {
            "obs": {
                # Following the UMI convention, camera0_rgb is the right arm and camera1_rgb is the left arm.
                # Each image is an RGB np.ndarray of shape (1, 384, 384, 3) with dtype=np.uint8.
                "camera0_rgb": ...,
                "camera1_rgb": ...,
            },
            "meta": {"num_camera": 2},
        },
        # Batch inference is supported, so you can pass a list of examples.
        ...,
    ],
    valid_action_id_length=valid_action_id_length,
    # The model is trained mostly on JPEG images; we suggest enabling this for better performance.
    apply_jpeg_compression=True,
    # We suggest instructions in "Verb + Object." format with a capitalized first letter and trailing period.
    instruction="Pick up the apple.",
)

# Get the predicted action chunk for example 0.
action_chunk = result["action_pred"][0]
# action_chunk is a torch.FloatTensor of shape (T, D) with T=24, D=20, dtype=torch.float32.
# T=24: the chunk covers the next 0.8 s at 30 fps, i.e. 24 frames.
# D=20: following the UMI setting, both arms are predicted, right before left:
#   [0-2]:   RIGHT arm end-effector position x, y, z (unit: m)
#   [3-8]:   RIGHT arm end-effector rotation (6D rotation representation)
#   [9]:     RIGHT arm gripper width (unit: m)
#   [10-12]: LEFT arm end-effector position x, y, z (unit: m)
#   [13-18]: LEFT arm end-effector rotation (6D rotation representation)
#   [19]:    LEFT arm gripper width (unit: m)

# Rescale gripper width from [0, 0.088] to [0, 0.1].
for robot_idx in range(2):
    action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
```
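The 6D rotation entries must be converted to rotation matrices (or another pose format) before they can be composed with the current end-effector pose. The exact convention is defined in the RDT2/UMI code; the sketch below uses the common Gram-Schmidt reconstruction of the continuous 6D representation and should be treated as an assumption, not the repo's implementation:

```python
import numpy as np

def rot6d_to_matrix(rot6d: np.ndarray) -> np.ndarray:
    """Convert a 6-vector (two stacked 3-vectors) to a 3x3 rotation matrix via
    Gram-Schmidt orthogonalization. Column/row convention should be verified
    against the repo before use on hardware."""
    a1, a2 = rot6d[:3], rot6d[3:6]
    b1 = a1 / np.linalg.norm(a1)
    a2 = a2 - np.dot(b1, a2) * b1
    b2 = a2 / np.linalg.norm(a2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=-1)  # columns are the orthonormal basis vectors
```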


    > For installation and fine-tuning instructions, please refer to the official GitHub repository.

    ---

    Intended uses & limitations



    Intended uses

* Research in robot manipulation and VLA modeling.
* Zero-shot or few-shot deployment on bimanual systems, following the repo's hardware calibration steps.

    Limitations

* Open-world robustness depends on calibration quality, camera placement, and gripper specifics.
* Requires correct normalization and RVQ code compatibility (see the sanity-check sketch after this list).
* Safety-critical deployment requires supervision, interlocks, and conservative velocity/force limits.
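As a concrete example of such a compatibility check, the sketch below validates the documented chunk shape and the [0, 0.088] m gripper-width range from the quickstart before any rescaling; the tolerance values are illustrative, not part of the official pipeline:

```python
import torch

def sanity_check_chunk(action_chunk: torch.Tensor) -> None:
    """Raise if a predicted chunk deviates from the documented format or gripper range."""
    assert action_chunk.shape == (24, 20), f"unexpected shape {tuple(action_chunk.shape)}"
    assert action_chunk.dtype == torch.float32
    for offset in (9, 19):  # right / left gripper-width columns
        width = action_chunk[:, offset]
        if width.min() < -1e-3 or width.max() > 0.088 + 1e-3:
            raise ValueError(
                f"gripper width outside [0, 0.088] m: "
                f"[{width.min().item():.4f}, {width.max().item():.4f}]"
            )
```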

    Safety & responsible use

    * Always test in simulation or with hardware limits engaged (reduced speed, gravity compensation, E-stop within reach).

    ---

    Troubleshooting



| Symptom                            | Likely cause   | Suggested fix                                                        |
| ---------------------------------- | -------------- | -------------------------------------------------------------------- |
| Drifting / unstable gripper widths | Scale mismatch | Apply the LinearNormalizer; rescale widths ([0, 0.088] → [0, 0.1]).  |
| Poor instruction following         | Prompt format  | Use "Verb + Object." with capitalization and a trailing period.      |
| No improvement after FT            | OOD actions    | Check RVQ bounds and reconstruction error; verify normalization.     |
| Vision brittleness                 | JPEG gap       | Enable `--image_corruption`; ensure 384×384 inputs.                  |

    ---

    Changelog



    * 2025-09: Initial release of RDT2-VQ on Hugging Face.

    ---

    Citation



```bibtex
@article{liu2026rdt2,
  title={RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization},
  author={Liu, Songming and Li, Bangguo and Ma, Kai and Wu, Lingxuan and Tan, Hengkai and Ouyang, Xiao and Su, Hang and Zhu, Jun},
  journal={arXiv preprint arXiv:2602.03310},
  year={2026}
}
```


    ---

    Contact



* Project page: https://rdt-robotics.github.io/rdt2/
* Organization: https://huggingface.co/robotics-diffusion-transformer
* Discord: https://discord.gg/vsZS3zmf9A

    Files & Weights

| Filename                         | Size    |
| -------------------------------- | ------- |
| model-00001-of-00004.safetensors | 4.63 GB |
| model-00002-of-00004.safetensors | 4.65 GB |
| model-00003-of-00004.safetensors | 4.59 GB |
| model-00004-of-00004.safetensors | 1.58 GB |