
RDT2-VQ: Vision-Language-Action with Residual VQ Action Tokens



RDT2-VQ is an autoregressive Vision-Language-Action (VLA) model adapted from Qwen2.5-VL-7B-Instruct and trained on large-scale UMI bimanual manipulation data. It predicts a short-horizon relative action chunk (24 steps, 20 dims/step) from binocular wrist-camera RGB and a natural-language instruction. Actions are discretized with a lightweight Residual VQ (RVQ) tokenizer, enabling robust zero-shot transfer across unseen embodiments for simple, open-vocabulary skills (e.g., pick, place, shake, wipe).

Home • GitHub • Discord • Paper

---

Table of contents

* Highlights
* Model details
* Hardware & software requirements
* Quickstart (inference)
* Precision settings
* Intended uses & limitations
* Troubleshooting
* Changelog
* Citation
* Contact

---

    Highlights



* Zero-shot cross-embodiment: Demonstrated on bimanual UR5e and Franka Research 3 setups; designed to generalize further with correct hardware calibration.
* UMI scale: Trained on 10k+ hours of human manipulation data collected with the UMI gripper across 100+ indoor scenes.
* Residual VQ action tokenizer: Compact, stable action codes; open-vocabulary instruction following via the Qwen2.5-VL-7B backbone.

    ---

    Model details



    Architecture



* Backbone: Qwen2.5-VL-7B-Instruct (vision-language).
* Observation: two wrist-camera RGB images (right/left), 384×384, JPEG-like statistics (a loading sketch follows this list).
* Instruction: short imperative text, recommended format "Verb + Object." (e.g., "Pick up the apple.").
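The repository's own pipeline handles preprocessing during deployment; as a rough, non-authoritative sketch (assuming plain PIL/NumPy and hypothetical file paths), each wrist frame can be resized to 384×384 and batched into the (1, 384, 384, 3) uint8 layout used in the quickstart below:

```python
import numpy as np
from PIL import Image

def load_wrist_image(path: str) -> np.ndarray:
    """Load one RGB wrist-camera frame as a (1, 384, 384, 3) uint8 array."""
    img = Image.open(path).convert("RGB").resize((384, 384), Image.BILINEAR)
    return np.asarray(img, dtype=np.uint8)[None]  # add a leading batch/time dim

# Hypothetical file names; camera0 = right arm, camera1 = left arm (UMI convention).
camera0_rgb = load_wrist_image("right_wrist.jpg")
camera1_rgb = load_wrist_image("left_wrist.jpg")
assert camera0_rgb.shape == (1, 384, 384, 3)
```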

    Action representation (UMI bimanual, per 24-step chunk)



* 20-D per step = right arm (10) + left arm (10), each composed of:
  * pos (x, y, z): 3
  * rot (6D rotation): 6
  * gripper width: 1
* Output tensor shape: (T=24, D=20), relative deltas, float32 (see the slicing sketch below).
* The RVQ tokenizer yields a fixed-length token sequence; see the tokenizer card for exact code lengths.
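As a reading aid, here is a minimal sketch of splitting a predicted (24, 20) chunk into per-arm components, using the index layout documented in the quickstart comments below (the helper name is ours; verify the indices against the repo):

```python
import torch

def split_action_chunk(action_chunk: torch.Tensor) -> dict:
    """Split a (24, 20) action chunk into per-arm pos / 6D rot / gripper width."""
    assert action_chunk.shape == (24, 20)
    arms = {}
    for name, offset in (("right", 0), ("left", 10)):
        arms[name] = {
            "pos": action_chunk[:, offset:offset + 3],         # (24, 3) xyz in meters
            "rot6d": action_chunk[:, offset + 3:offset + 9],   # (24, 6) 6D rotation
            "gripper_width": action_chunk[:, offset + 9],      # (24,) in meters
        }
    return arms
```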

    Tokenizer



* Tokenizer repo: robotics-diffusion-transformer/RVQActionTokenizer
* Use float32 for the VQ model.
* Provide a LinearNormalizer for action scaling (UMI convention).

    ---

    Hardware & software requirements



    Approximate single-GPU requirements (Qwen2.5-VL-7B-Instruct scale):

| Mode      | RAM     | VRAM    | Example GPU             |
| --------- | ------: | ------: | ----------------------- |
| Inference | ≥ 32 GB | ≥ 16 GB | RTX 4090                |
| LoRA FT   | –       | ≥ 32 GB | A100 40GB               |
| Full FT   | –       | ≥ 80 GB | A100 80GB / H100 / B200 |

    > For deployment on real robots, follow your platform’s end-effector + camera choices and perform hardware setup & calibration (camera stand/pose, flange, etc.) before running closed-loop policies.

    Tested OS: Ubuntu 24.04.

    ---

    Quickstart (inference)



```python
# Run under the official repository: https://github.com/thu-ml/RDT2
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

from vqvae import MultiVQVAE
from models.normalizer import LinearNormalizer
from utils import batch_predict_action

# Assuming GPU 0 is used.
device = "cuda:0"

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "robotics-diffusion-transformer/RDT2-VQ",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map=device,
).eval()
vae = MultiVQVAE.from_pretrained("robotics-diffusion-transformer/RVQActionTokenizer").eval()
vae = vae.to(device=device, dtype=torch.float32)

valid_action_id_length = (
    vae.pos_id_len + vae.rot_id_len + vae.grip_id_len
)

# TODO: modify to your own downloaded normalizer path.
# Download from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
normalizer = LinearNormalizer.from_pretrained("umi_normalizer_wo_downsample_indentity_rot.pt")

result = batch_predict_action(
    model,
    processor,
    vae,
    normalizer,
    examples=[
        {
            "obs": {
                # Following the UMI convention, camera0_rgb is the right arm and camera1_rgb is the left arm.
                # Each image is an RGB np.ndarray of shape (1, 384, 384, 3) with dtype=np.uint8.
                "camera0_rgb": ...,
                "camera1_rgb": ...,
            },
            "meta": {"num_camera": 2},
        },
        # Batch inference is supported, so you can pass a list of examples.
        ...,
    ],
    valid_action_id_length=valid_action_id_length,
    # The model is trained mostly on JPEG images; we suggest enabling this for better performance.
    apply_jpeg_compression=True,
    # We suggest instructions in "Verb + Object." format with a capitalized first letter and trailing period.
    instruction="Pick up the apple.",
)

# Get the predicted action chunk for example 0.
action_chunk = result["action_pred"][0]
# action_chunk is a torch.FloatTensor of shape (T, D) with T=24, D=20, dtype=torch.float32.
# T=24: the chunk covers the next 0.8 s at 30 fps, i.e. 24 frames.
# D=20: following the UMI setting, both arms are predicted, right before left:
#   [0-2]:   RIGHT arm end-effector position x, y, z (unit: m)
#   [3-8]:   RIGHT arm end-effector rotation (6D rotation representation)
#   [9]:     RIGHT arm gripper width (unit: m)
#   [10-12]: LEFT arm end-effector position x, y, z (unit: m)
#   [13-18]: LEFT arm end-effector rotation (6D rotation representation)
#   [19]:    LEFT arm gripper width (unit: m)

# Rescale gripper width from [0, 0.088] to [0, 0.1].
for robot_idx in range(2):
    action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
```
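The 6D rotation entries must be converted to rotation matrices (or another pose format) before they can be composed with the current end-effector pose. The exact convention is defined in the RDT2/UMI code; the sketch below uses the common Gram-Schmidt reconstruction of the continuous 6D representation and should be treated as an assumption, not the repo's implementation:

```python
import numpy as np

def rot6d_to_matrix(rot6d: np.ndarray) -> np.ndarray:
    """Convert a 6-vector (two stacked 3-vectors) to a 3x3 rotation matrix via
    Gram-Schmidt orthogonalization. Column/row convention should be verified
    against the repo before use on hardware."""
    a1, a2 = rot6d[:3], rot6d[3:6]
    b1 = a1 / np.linalg.norm(a1)
    a2 = a2 - np.dot(b1, a2) * b1
    b2 = a2 / np.linalg.norm(a2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=-1)  # columns are the orthonormal basis vectors
```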


    > For installation and fine-tuning instructions, please refer to the official GitHub repository.

    ---

    Intended uses & limitations



    Intended uses

* Research in robot manipulation and VLA modeling.
* Zero-shot or few-shot deployment on bimanual systems, following the repo's hardware calibration steps.

    Limitations

* Open-world robustness depends on calibration quality, camera placement, and gripper specifics.
* Requires correct normalization and RVQ code compatibility (see the sanity-check sketch after this list).
* Safety-critical deployment requires supervision, interlocks, and conservative velocity/force limits.
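As a concrete example of such a compatibility check, the sketch below validates the documented chunk shape and the [0, 0.088] m gripper-width range from the quickstart before any rescaling; the tolerance values are illustrative, not part of the official pipeline:

```python
import torch

def sanity_check_chunk(action_chunk: torch.Tensor) -> None:
    """Raise if a predicted chunk deviates from the documented format or gripper range."""
    assert action_chunk.shape == (24, 20), f"unexpected shape {tuple(action_chunk.shape)}"
    assert action_chunk.dtype == torch.float32
    for offset in (9, 19):  # right / left gripper-width columns
        width = action_chunk[:, offset]
        if width.min() < -1e-3 or width.max() > 0.088 + 1e-3:
            raise ValueError(
                f"gripper width outside [0, 0.088] m: "
                f"[{width.min().item():.4f}, {width.max().item():.4f}]"
            )
```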

    Safety & responsible use

    * Always test in simulation or with hardware limits engaged (reduced speed, gravity compensation, E-stop within reach).

    ---

    Troubleshooting



| Symptom                            | Likely cause   | Suggested fix                                                        |
| ---------------------------------- | -------------- | -------------------------------------------------------------------- |
| Drifting / unstable gripper widths | Scale mismatch | Apply the LinearNormalizer; rescale widths ([0, 0.088] → [0, 0.1]).  |
| Poor instruction following         | Prompt format  | Use "Verb + Object." with capitalization and a trailing period.      |
| No improvement after FT            | OOD actions    | Check RVQ bounds and reconstruction error; verify normalization.     |
| Vision brittleness                 | JPEG gap       | Enable `--image_corruption`; ensure 384×384 inputs.                  |

    ---

    Changelog



    * 2025-09: Initial release of RDT2-VQ on Hugging Face.

    ---

    Citation



```bibtex
@article{liu2026rdt2,
  title={RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization},
  author={Liu, Songming and Li, Bangguo and Ma, Kai and Wu, Lingxuan and Tan, Hengkai and Ouyang, Xiao and Su, Hang and Zhu, Jun},
  journal={arXiv preprint arXiv:2602.03310},
  year={2026}
}
```


    ---

    Contact



* Project page: https://rdt-robotics.github.io/rdt2/
* Organization: https://huggingface.co/robotics-diffusion-transformer
* Discord: https://discord.gg/vsZS3zmf9A

    Files & Weights

| Filename                         | Size    |
| -------------------------------- | ------- |
| model-00001-of-00004.safetensors | 4.63 GB |
| model-00002-of-00004.safetensors | 4.65 GB |
| model-00003-of-00004.safetensors | 4.59 GB |
| model-00004-of-00004.safetensors | 1.58 GB |