Publication record · 18.cifr/2023.brohan.rt2-vla
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. We propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks. We express actions as text tokens and incorporate them directly into the training set. We refer to such models as vision-language-action models (VLA) and instantiate RT-2. Evaluation over 6k trials shows emergent capabilities including novel object generalization, icon/symbol interpretation, and multi-stage semantic reasoning.
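The abstract's core mechanism is representing robot actions as text tokens so a vision-language model can emit them like any other text. A minimal sketch of what such an action tokenizer could look like, assuming a fixed per-dimension range and a uniform 256-bin discretization (the bin count, ranges, and function names here are illustrative assumptions, not the paper's exact scheme):

```python
import numpy as np

def tokenize_action(action, low, high, n_bins=256):
    """Discretize each continuous action dimension into one of n_bins
    integer bins, then render the bin indices as a space-separated
    string that a language model can emit as ordinary text tokens."""
    a = np.clip(np.asarray(action, dtype=float), low, high)
    # Map each dimension linearly onto integer bins in [0, n_bins - 1].
    bins = np.round((a - low) / (high - low) * (n_bins - 1)).astype(int)
    return " ".join(str(b) for b in bins)

def detokenize_action(text, low, high, n_bins=256):
    """Invert tokenize_action: parse bin indices back to continuous values."""
    bins = np.array([int(t) for t in text.split()])
    return low + bins / (n_bins - 1) * (high - low)
```

With this encoding, a decoded action string such as `"128 17 255 ..."` can be appended to the model's text output and mapped back to a continuous command at control time; quantization error shrinks as the bin count grows.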
RT-2's inference latency remains a bottleneck for real-time control; model distillation and parallel token decoding are natural next steps. The mixing ratio between robot trajectory data and web data during co-fine-tuning still lacks a principled treatment. Scaling chain-of-thought reasoning to longer task horizons and to safety-critical manipulation remains an open problem.