RT-2 (Robotics Transformer 2) is a vision-language-action model that lets a robot follow text commands, built on an integrated large language and vision model. The idea is to "directly train vision-language models designed for open-vocabulary visual question answering and visual dialogue to output low-level robot actions."

"Although such models are typically trained to produce natural language tokens, we can train them on robotic trajectories by tokenizing the actions into text tokens and creating 'multimodal sentences' that 'respond' to robotic instructions paired with camera observations by producing corresponding actions."

"Robotic policies derived from such vision-language models exhibit a range of remarkable capabilities, combining the physical motions learned from the robot data with the ability to interpret images and text learned from web data into a single model."

Examples of commands the robot successfully obeys: "Put strawberry into the correct bowl", "Pick up the bag about to fall off the table", "Move the apple to Denver Nuggets" (Denver Nuggets placemat -- go Denver Nuggets!), "Pick robot" (pick up toy robot), "Place orange in matching bowl", "Move Redbull can to H" (what do you call these letters that you use to put text on houses?), "Move soccer ball to basketball", "Move banana to Germany" (flag placemats), "Move cup to the wine bottle", "Pick animal with different color" (one black, two yellow; it correctly picks up the black one), "Move Coke can to Taylor Swift" (portraits), "Move Coke can to X" (letters), "Move bag to Google" (bag with "Google" logo on it), "Move banana to the sum of two plus one" (pieces of paper with numbers on them), and "Pick land animal".

RT-2: Vision-language-action models

#solidstatelife #ai #robotics #llms
