RDT2-VQ: Vision-Language-Action with Residual VQ Action Tokens
RDT2-VQ is an autoregressive Vision-Language-Action (VLA) model adapted from Qwen2.5-VL-7B-Instruct and trained on large-scale UMI bimanual manipulation data. It predicts a short-horizon relative action chunk (24 steps, 20 dims/step) from binocular wrist-camera RGB and a natural-language instruction. Actions are discretized with a lightweight Residual VQ (RVQ) tokenizer, enabling robust zero-shot transfer across unseen embodiments for simple, open-vocabulary skills (e.g., pick, place, shake, wipe).
---
Table of contents
* Highlights
* Model details
* Hardware & software requirements
* Quickstart (inference)
* Precision settings
* Intended uses & limitations
* Troubleshooting
* Changelog
* Citation
* Contact
---
Highlights
* Zero-shot cross-embodiment: Demonstrated on bimanual UR5e and Franka Research 3 setups; designed to generalize further with correct hardware calibration.
* UMI scale: Trained on 10k+ hours from 100+ indoor scenes of human manipulation with the UMI gripper.
* Residual VQ action tokenizer: Compact, stable action codes; open-vocabulary instruction following via the Qwen2.5-VL-7B backbone.
---
Model details
Architecture
* Backbone: Qwen2.5-VL-7B-Instruct (vision-language).
* Observation: Two wrist-camera RGB images (right/left), 384×384, JPEG-like statistics.
* Instruction: Short imperative text, recommended format "Verb + Object." (e.g., "Pick up the apple.").
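For context, a minimal preprocessing sketch (assuming PIL and NumPy; the file names are hypothetical) that produces the expected 384×384 uint8 wrist-camera arrays:

```python
# Minimal preprocessing sketch (assumption: wrist-camera frames only need
# resizing to 384x384; your capture pipeline may differ).
import numpy as np
from PIL import Image

def load_wrist_image(path: str) -> np.ndarray:
    """Load an RGB image and resize it to the (1, 384, 384, 3) uint8 layout."""
    img = Image.open(path).convert("RGB").resize((384, 384), Image.BILINEAR)
    return np.asarray(img, dtype=np.uint8)[None]  # add a leading axis

right_img = load_wrist_image("right_wrist.jpg")  # hypothetical file name
left_img = load_wrist_image("left_wrist.jpg")    # hypothetical file name
assert right_img.shape == (1, 384, 384, 3)
```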
Action representation (UMI bimanual, per 24-step chunk)
* 20-D per step = right (10) + left (10):
  * pos (x, y, z): 3
  * rot (6D rotation): 6
  * gripper width: 1
* Output tensor shape: (T=24, D=20), relative deltas, float32.
* The RVQ tokenizer yields a fixed-length token sequence; see the tokenizer card for exact code lengths.
Tokenizer
* Tokenizer repo: robotics-diffusion-transformer/RVQActionTokenizer
* Use float32 for the VQ model.
* Provide a LinearNormalizer for action scaling (UMI convention).
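For clarity, here is a small sketch of how a (24, 20) chunk decomposes into per-arm components under this convention (plain NumPy indexing; not a helper from the RDT2 codebase):

```python
# Sketch of the per-step action layout (assumption: plain NumPy indexing,
# not an official RDT2 helper).
import numpy as np

action_chunk = np.zeros((24, 20), dtype=np.float32)  # placeholder chunk

def split_arm(chunk: np.ndarray, arm: int) -> dict:
    """arm=0 -> right, arm=1 -> left (UMI convention: right arm first)."""
    base = arm * 10
    return {
        "pos": chunk[:, base:base + 3],        # relative x, y, z (m)
        "rot6d": chunk[:, base + 3:base + 9],  # 6D rotation representation
        "gripper": chunk[:, base + 9],         # gripper width (m)
    }

right, left = split_arm(action_chunk, 0), split_arm(action_chunk, 1)
```
---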
Hardware & software requirements
Approximate single-GPU requirements (Qwen2.5-VL-7B-Instruct scale):
| Mode | RAM | VRAM | Example GPU |
| --- | --- | --- | --- |
> For deployment on real robots, match your platform's end-effector and camera choices, and complete hardware setup & calibration (camera stand/pose, flange, etc.) before running closed-loop policies.
Tested OS: Ubuntu 24.04.
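As a rough sanity check before loading the 7B backbone in bfloat16, you can query the available VRAM (a sketch; the actual requirement depends on attention implementation, batch size, and KV cache):

```python
# Rough VRAM sanity check (assumption: a single CUDA device at index 0).
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024 ** 3
    print(f"{props.name}: {total_gb:.1f} GB VRAM")
    # A 7B backbone in bf16 needs roughly 14-16 GB for weights alone,
    # plus headroom for activations and the KV cache.
else:
    print("No CUDA device found; CPU-only inference will be very slow.")
```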
---
Quickstart (inference)
```python
# Run under the repository: https://github.com/thu-ml/RDT2
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

from vqvae import MultiVQVAE
from models.normalizer import LinearNormalizer
from utils import batch_predict_action

# assuming GPU 0 is used
device = "cuda:0"

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "robotics-diffusion-transformer/RDT2-VQ",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map=device,
).eval()

vae = MultiVQVAE.from_pretrained("robotics-diffusion-transformer/RVQActionTokenizer").eval()
vae = vae.to(device=device, dtype=torch.float32)
valid_action_id_length = (
    vae.pos_id_len + vae.rot_id_len + vae.grip_id_len
)

# TODO: modify to your own downloaded normalizer path;
# download from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
normalizer = LinearNormalizer.from_pretrained("umi_normalizer_wo_downsample_indentity_rot.pt")

result = batch_predict_action(
    model,
    processor,
    vae,
    normalizer,
    examples=[
        {
            "obs": {
                # NOTE: following the UMI setting, camera0_rgb is the right arm, camera1_rgb is the left arm
                "camera0_rgb": ...,  # RGB image as np.ndarray of shape (1, 384, 384, 3) with dtype=np.uint8
                "camera1_rgb": ...,  # RGB image as np.ndarray of shape (1, 384, 384, 3) with dtype=np.uint8
            },
            "meta": {
                "num_camera": 2
            }
        },
        ...,  # batch inference is supported, so you can pass a list of examples
    ],
    valid_action_id_length=valid_action_id_length,
    apply_jpeg_compression=True,
    # since the model is trained mostly on JPEG images, we suggest toggling this on for better performance
    instruction="Pick up the apple."
    # we suggest instructions in the format "Verb + Object." with a capitalized first letter and a trailing period
)

# get the predicted action from example 0
action_chunk = result["action_pred"][0]  # torch.FloatTensor of shape (24, 20) with dtype=torch.float32
# action_chunk is (T, D) with T=24, D=20
#   T=24: the action chunk predicts the next 0.8 s at fps=30, i.e. 24 frames
#   D=20: following the UMI setting, we predict actions for both arms, right then left
#     [0-2]:   RIGHT ARM end-effector position in x, y, z (unit: m)
#     [3-8]:   RIGHT ARM end-effector rotation in 6D rotation representation
#     [9]:     RIGHT ARM gripper width (unit: m)
#     [10-12]: LEFT ARM end-effector position in x, y, z (unit: m)
#     [13-18]: LEFT ARM end-effector rotation in 6D rotation representation
#     [19]:    LEFT ARM gripper width (unit: m)

# rescale gripper width from [0, 0.088] to [0, 0.1]
for robot_idx in range(2):
    action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
```
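The 6D rotation entries can be converted back to rotation matrices with the standard Gram-Schmidt construction (Zhou et al., 2019). The helper below is a sketch, not part of the RDT2 API, and the row/column convention must match the one used for training:

```python
# Convert a 6D rotation representation to a 3x3 rotation matrix via
# Gram-Schmidt orthonormalization (sketch; not part of the RDT2 codebase).
import torch
import torch.nn.functional as F

def rot6d_to_matrix(rot6d: torch.Tensor) -> torch.Tensor:
    """rot6d: (..., 6) -> rotation matrices of shape (..., 3, 3)."""
    a1, a2 = rot6d[..., :3], rot6d[..., 3:]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    # NOTE: stacking along dim=-2 follows the common PyTorch3D-style layout;
    # verify it matches the convention the policy was trained with.
    return torch.stack((b1, b2, b3), dim=-2)

# Example: recover per-arm rotation matrices from the predicted chunk.
right_rot = rot6d_to_matrix(action_chunk[:, 3:9])   # (24, 3, 3)
left_rot = rot6d_to_matrix(action_chunk[:, 13:19])  # (24, 3, 3)
```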
> For installation and fine-tuning instructions, please refer to the official GitHub repository.
---
Intended uses & limitations
Intended uses
* Research in robot manipulation and VLA modeling.
* Zero-shot or few-shot deployment on bimanual systems following the repo's hardware calibration steps.
Limitations
* Open-world robustness depends on calibration quality, camera placement, and gripper specifics.
* Requires correct normalization and RVQ code compatibility.
* Safety-critical deployment requires supervision, interlocks, and conservative velocity/force limits.
Safety & responsible use
* Always test in simulation or with hardware limits engaged (reduced speed, gravity compensation, E-stop within reach).
---
Troubleshooting
| Symptom | Likely cause | Suggested fix |
| --- | --- | --- |
|  |  | `--image_corruption`; ensure 384×384 inputs. |

---
Changelog
* 2025-09: Initial release of RDT2-VQ on Hugging Face.
---
Citation
```bibtex
@article{liu2026rdt2,
  title={RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization},
  author={Liu, Songming and Li, Bangguo and Ma, Kai and Wu, Lingxuan and Tan, Hengkai and Ouyang, Xiao and Su, Hang and Zhu, Jun},
  journal={arXiv preprint arXiv:2602.03310},
  year={2026}
}
```
---
Contact
* Project page: https://rdt-robotics.github.io/rdt2/
* Organization: https://huggingface.co/robotics-diffusion-transformer
* Discord: https://discord.gg/vsZS3zmf9A
Files & Weights
| Filename | Size |
|---|---|
| model-00001-of-00004.safetensors | 4.63 GB |
| model-00002-of-00004.safetensors | 4.65 GB |
| model-00003-of-00004.safetensors | 4.59 GB |
| model-00004-of-00004.safetensors | 1.58 GB |