Publication record · 18.cifr/2023.brohan.rt2-vla
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. We propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks. We express actions as text tokens and incorporate them directly into the training set. We refer to such models as vision-language-action models (VLA) and instantiate RT-2. Evaluation over 6k trials shows emergent capabilities including novel object generalization, icon/symbol interpretation, and multi-stage semantic reasoning.
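The abstract's core mechanism is representing robot actions as text tokens so a vision-language model can emit them like any other text. A minimal sketch of what such an action tokenizer could look like, assuming a fixed per-dimension range and a uniform 256-bin discretization (the bin count, ranges, and function names here are illustrative assumptions, not the paper's exact scheme):

```python
import numpy as np

def tokenize_action(action, low, high, n_bins=256):
    """Discretize each continuous action dimension into one of n_bins
    integer bins, then render the bin indices as a space-separated
    string that a language model can emit as ordinary text tokens."""
    a = np.clip(np.asarray(action, dtype=float), low, high)
    # Map each dimension linearly onto integer bins in [0, n_bins - 1].
    bins = np.round((a - low) / (high - low) * (n_bins - 1)).astype(int)
    return " ".join(str(b) for b in bins)

def detokenize_action(text, low, high, n_bins=256):
    """Invert tokenize_action: parse bin indices back to continuous values."""
    bins = np.array([int(t) for t in text.split()])
    return low + bins / (n_bins - 1) * (high - low)
```

With this encoding, a decoded action string such as `"128 17 255 ..."` can be appended to the model's text output and mapped back to a continuous command at control time; quantization error shrinks as the bin count grows.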
RT-2's inference latency remains a bottleneck for real-time control; model distillation and parallel token decoding are natural next steps. The mixing ratio between robot trajectory data and web data during co-fine-tuning still lacks a principled treatment. Scaling chain-of-thought reasoning to longer task horizons and to safety-critical manipulation remains an open problem.