RLinf

RLinf/RLinf-OpenVLAOFT-LIBERO-130-Base-Lora


Model Documentation



RLinf: Reinforcement Learning Infrastructure for Agentic AI



RLinf is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.

RLinf-overview


Model Description

The RLinf-openvlaoft-libero series is trained with RLinf starting from two sets of base checkpoints: RLinf/RLinf-OpenVLAOFT-LIBERO-xxx-Base-Lora (for libero90 and libero130) and Haozhan72/Openvla-oft-SFT-libero-xxx-traj1 (for libero10, libero-object, libero-goal and libero-spatial). These are the same base models and training datasets as verl. On the LIBERO suites reported in the evaluation section, training with RLinf yields SOTA performance.

We apply a mask so that only valid action tokens contribute to the objective, and compute a token-level loss based on the Group Relative Policy Optimization (GRPO) advantage function. The goal is to improve the model's performance on spatial reasoning, object generalization, instruction generalization, and long-horizon tasks.
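The combination of a validity mask and a group-relative advantage can be sketched as follows. This is a minimal illustration in plain Python, not RLinf's implementation: the function names, the PPO-style clipping, and the clip range of 0.2 are assumptions made for the example. It shows rewards being standardized within one group of rollouts, and the per-token loss being averaged over tokens whose mask is 1.

```python
import math

def grpo_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def masked_token_loss(logp_new, logp_old, advantages, mask, clip=0.2):
    """Clipped surrogate loss averaged over valid tokens only.

    logp_new, logp_old, mask: one list per rollout, one entry per token.
    advantages: one scalar per rollout, shared by all of its tokens.
    The clipping is an assumption for this sketch.
    """
    total, count = 0.0, 0
    for lp_new, lp_old, adv, m in zip(logp_new, logp_old, advantages, mask):
        for new, old, valid in zip(lp_new, lp_old, m):
            if not valid:
                continue  # padded or invalid action tokens are ignored
            ratio = math.exp(new - old)
            clipped = max(min(ratio, 1 + clip), 1 - clip)
            total += -min(ratio * adv, clipped * adv)
            count += 1
    return total / max(count, 1)

# Four rollouts of the same task, with binary success rewards.
advantages = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the advantage is computed relative to the group mean, successful rollouts receive a positive advantage and failed ones a negative advantage, without a separate value network.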

Evaluation and Results

We trained six models using RLinf:

  • RLinf-OpenVLAOFT-GRPO-LIBERO-90 Model (based on RLinf/RLinf-OpenVLAOFT-LIBERO-90-Base-Lora)
      • Recommended sampling settings: temperature = 1.6, top_p = 1.0

  • RLinf-OpenVLAOFT-LIBERO-130 Model (based on RLinf/RLinf-OpenVLAOFT-LIBERO-130-Base-Lora)
      • Recommended sampling settings: temperature = 1.6, top_p = 1.0

  • RLinf-OpenVLAOFT-GRPO-LIBERO-object Model (based on Haozhan72/Openvla-oft-SFT-libero-object-traj1)
      • Recommended sampling settings: temperature = 1.6, top_p = 1.0

  • RLinf-OpenVLAOFT-GRPO-LIBERO-spatial Model (based on Haozhan72/Openvla-oft-SFT-libero-spatial-traj1)
      • Recommended sampling settings: temperature = 1.6, top_p = 1.0

  • RLinf-OpenVLAOFT-GRPO-LIBERO-goal Model (based on Haozhan72/Openvla-oft-SFT-libero-goal-traj1)
      • Recommended sampling settings: temperature = 1.6, top_p = 1.0

  • RLinf-OpenVLAOFT-GRPO-LIBERO-long Model (based on Haozhan72/Openvla-oft-SFT-libero10-traj1)
      • Recommended sampling settings: temperature = 1.6, top_p = 1.0
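To give a feel for what the recommended sampling settings do, here is a generic sketch of temperature sampling over a categorical distribution. It is not RLinf's sampler; the function names and the example logits are hypothetical. With top_p = 1.0 no nucleus truncation is applied, so the token is drawn from the full temperature-scaled distribution, and a temperature above 1 flattens that distribution.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.6, rng=random):
    """Draw one token index from the full distribution (i.e. top_p = 1.0)."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

example_logits = [2.0, 1.0, 0.0]  # hypothetical logits for three tokens
probs_default = softmax_with_temperature(example_logits, 1.0)
probs_recommended = softmax_with_temperature(example_logits, 1.6)
```

In this example the most likely token keeps the highest probability at temperature 1.6, but its share of the probability mass is smaller than at temperature 1.0, so rollouts are more diverse.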


Benchmark Results



The SFT models for LIBERO-90 and LIBERO-130 were trained by ourselves following the training recipe from OpenVLA-OFT. The other SFT models are from SimpleVLA-RL.

> We evaluate each model according to its training configuration, using libero_seed = 0 and evaluating 500 episodes for each of the Object, Spatial, Goal, and Long suites, 4,500 episodes for LIBERO-90, and 6,500 episodes for LIBERO-130.
>
> For the SFT-trained (LoRA-base) models, we set do_sample = False.
>
> For the RL-trained models, we set do_sample = True, temperature = 1.6, and rollout_epoch = 2; the final results are reported as the average across the two runs.

| Model | Object | Spatial | Goal | Long | 90 | Average |
| ------------------ | ------ | ------- | ----- | ----- | ----- | ------- |
| SFT models | 28.83 | 52.22 | 49.40 | 14.92 | 79.28 | 66.07 |
| trained with RLinf | 97.68 | 94.76 | 93.96 | 90.93 | 96.44 | 95.79 |

In addition, we trained a single model, which we call the libero-130 model, on all tasks in LIBERO.

| libero-130 model | Object | Spatial | Goal | Long | 90 | 130(all) |
| ------------------ | ------ | ------- | ----- | ----- | ----- | -------- |
| SFT models | 50.20 | 51.61 | 49.40 | 11.90 | 42.67 | 42.09 |
| trained with RLinf | 99.60 | 98.69 | 98.09 | 93.45 | 98.02 | 97.85 |
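The tables do not state how the Average and 130(all) columns are computed. The reported values are consistent with an episode-weighted mean over the five suites, using the episode counts from the evaluation protocol (500 each for Object, Spatial, Goal, and Long, and 4,500 for LIBERO-90). The sketch below reproduces that arithmetic; the weighting scheme is an inference from the numbers, and the helper name is hypothetical.

```python
# Episodes evaluated per suite, as given in the evaluation protocol.
EPISODES = {"Object": 500, "Spatial": 500, "Goal": 500, "Long": 500, "90": 4500}

def weighted_average(success_rates, episodes=EPISODES):
    """Episode-weighted mean success rate across suites (assumed scheme)."""
    total_episodes = sum(episodes.values())
    weighted_sum = sum(success_rates[s] * episodes[s] for s in episodes)
    return weighted_sum / total_episodes

rows = {
    "per-suite SFT":    {"Object": 28.83, "Spatial": 52.22, "Goal": 49.40, "Long": 14.92, "90": 79.28},
    "per-suite RLinf":  {"Object": 97.68, "Spatial": 94.76, "Goal": 93.96, "Long": 90.93, "90": 96.44},
    "libero-130 SFT":   {"Object": 50.20, "Spatial": 51.61, "Goal": 49.40, "Long": 11.90, "90": 42.67},
    "libero-130 RLinf": {"Object": 99.60, "Spatial": 98.69, "Goal": 98.09, "Long": 93.45, "90": 98.02},
}

averages = {name: round(weighted_average(rates), 2) for name, rates in rows.items()}
```

Under this assumption the LIBERO-90 suite carries 4,500 of the 6,500 episodes, so it dominates the last column of both tables.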

    RLinf-libero-result


How to Use

Please integrate the provided model with the RLinf codebase. To do so, modify the following parameters in the configuration file examples/embodiment/config/libero_10_grpo_openvlaoft.yaml:

  • Set rollout.model.model_path, actor.model.model_path, and actor.tokenizer.tokenizer_model to the path of the model checkpoint.

  • Note: If you intend to evaluate the model directly, make sure to set actor.model.is_lora to false.
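The overrides listed above can be pictured as edits to a nested configuration. The sketch below uses a plain Python dictionary rather than RLinf's own config loader, so the helper set_by_path and the checkpoint path are hypothetical; only the four dotted keys come from the instructions above.

```python
def set_by_path(config, dotted_key, value):
    """Set a value in a nested dict using a dotted key such as 'a.b.c'."""
    node = config
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value

# Hypothetical local path to the downloaded checkpoint.
checkpoint_dir = "/path/to/model/checkpoint"

config = {}
for key in (
    "rollout.model.model_path",
    "actor.model.model_path",
    "actor.tokenizer.tokenizer_model",
):
    set_by_path(config, key, checkpoint_dir)

# For direct evaluation of the released weights, LoRA is disabled.
set_by_path(config, "actor.model.is_lora", False)
```

In practice the same four values would be edited directly in the YAML file; the dictionary form only makes the nesting of the keys explicit.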

License

This code repository and the model weights are licensed under the MIT License.

Files & Weights

| Filename | Size |
| -------------------------------- | ------- |
| model-00001-of-00004.safetensors | 4.59 GB |
| model-00002-of-00004.safetensors | 4.61 GB |
| model-00003-of-00004.safetensors | 4.61 GB |
| model-00004-of-00004.safetensors | 0.24 GB |