Publication record · 18.cifr/2022.jiang.vima-multimodal
Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. Yet task specification in robotics comes in various forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. They are often considered different tasks and tackled by specialized models. We show that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts, interleaving textual and visual tokens. Accordingly, we develop a new simulation benchmark that consists of thousands of procedurally-generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and a four-level evaluation protocol for systematic generalization. We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively. VIMA features a recipe that achieves strong model scalability and data efficiency. It outperforms alternative designs in the hardest zero-shot generalization setting by up to 2.9x task success rate given the same training data.
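As a rough illustration of the agent described above (a conditional transformer that attends to an interleaved text/visual prompt and decodes discretized motor actions autoregressively), the PyTorch sketch below is an assumption-laden simplification: the class name, dimensions, vocabulary and action-bin sizes, and the plain concatenation of text and visual tokens are illustrative placeholders, not the paper's implementation.

```python
import torch
import torch.nn as nn


class MultimodalPromptAgent(nn.Module):
    """Sketch: encode an interleaved text/visual prompt, decode actions autoregressively."""

    def __init__(self, d_model=256, vocab_size=1000, img_feat_dim=512,
                 n_action_bins=256, n_heads=8, n_layers=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)    # word tokens of the prompt
        self.image_proj = nn.Linear(img_feat_dim, d_model)     # visual token features -> model dim
        self.action_embed = nn.Embedding(n_action_bins, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, n_action_bins)   # logits over discretized actions

    def encode_prompt(self, text_ids, image_feats):
        # Interleaving is simplified here to concatenation: [text tokens ; visual tokens].
        text_tok = self.text_embed(text_ids)                   # (B, T_text, d)
        img_tok = self.image_proj(image_feats)                 # (B, T_img, d)
        return torch.cat([text_tok, img_tok], dim=1)           # (B, T_prompt, d)

    def forward(self, text_ids, image_feats, action_history):
        # action_history: (B, T_act) ids of the discretized actions taken so far
        prompt = self.encode_prompt(text_ids, image_feats)
        act_tok = self.action_embed(action_history)
        causal = nn.Transformer.generate_square_subsequent_mask(
            act_tok.size(1)).to(act_tok.device)
        h = self.decoder(tgt=act_tok, memory=prompt, tgt_mask=causal)
        return self.action_head(h)                             # (B, T_act, n_action_bins)


# Illustrative usage: 12 text tokens and 6 visual tokens in the prompt, 5 past actions.
agent = MultimodalPromptAgent()
logits = agent(torch.randint(0, 1000, (2, 12)),
               torch.randn(2, 6, 512),
               torch.randint(0, 256, (2, 5)))                  # -> shape (2, 5, 256)
```

At inference time, one would sample or argmax from the logits step by step and append each chosen action to the history, which is what "outputs motor actions autoregressively" refers to; the prompt conditioning stays fixed across steps.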
Bridging the sim-to-real gap for physical robot deployment remains an open problem. Extending beyond tabletop settings to mobile manipulation and to richer sensory modalities (tactile, depth) is another needed direction. Prompt-tuning or few-shot adaptation of VIMA's weights without full retraining is a natural next step; a possible form of this is sketched below.
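One way the prompt-tuning direction could look, sketched under the assumption of a frozen transformer-decoder backbone: the pretrained weights stay fixed and only a small set of learned soft-prompt vectors, prepended to the prompt memory, is optimized on the new task. The class name, shapes, and the stand-in decoder are hypothetical, not VIMA's API.

```python
import torch
import torch.nn as nn


class PromptTunedPolicy(nn.Module):
    """Freeze a pretrained decoder; learn only a few soft-prompt vectors."""

    def __init__(self, frozen_decoder: nn.TransformerDecoder, d_model=256, n_soft_tokens=10):
        super().__init__()
        self.decoder = frozen_decoder
        for p in self.decoder.parameters():        # pretrained weights are not updated
            p.requires_grad = False
        # The only trainable parameters: a handful of prompt vectors.
        self.soft_prompt = nn.Parameter(torch.randn(n_soft_tokens, d_model) * 0.02)

    def forward(self, action_tokens, prompt_tokens):
        # action_tokens: (B, T_act, d) embedded action history
        # prompt_tokens: (B, T_prompt, d) embedded multimodal prompt
        B = prompt_tokens.size(0)
        soft = self.soft_prompt.unsqueeze(0).expand(B, -1, -1)
        memory = torch.cat([soft, prompt_tokens], dim=1)       # prepend learned tokens
        return self.decoder(tgt=action_tokens, memory=memory)


# Illustrative usage: only `soft_prompt` receives gradients during adaptation.
layer = nn.TransformerDecoderLayer(d_model=256, nhead=8, batch_first=True)
pretrained = nn.TransformerDecoder(layer, num_layers=2)
policy = PromptTunedPolicy(pretrained)
out = policy(torch.randn(2, 5, 256), torch.randn(2, 12, 256))
trainable = [n for n, p in policy.named_parameters() if p.requires_grad]  # ['soft_prompt']
```

The appeal of such an approach is that the adaptation footprint is a few thousand parameters per task rather than a full copy of the model, which is what makes adapting without full retraining attractive in the first place.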