Mobile ALOHA: Your housekeeping robot. Opening the blinds, watering the plants, vacuuming, making coffee, cleaning up spilled milk, taking dishes out of the dishwasher, pouring ketchup, taking out the trash, putting laundry in the washing machine, taking clothes out of the dryer, putting sheets on the bed, putting the pillow in the pillowcase, hanging clothes in the closet, folding clothes and putting them in a drawer, and turning the light off. Oh, and it plays with the cat, too. The video is taking the internet by storm.

But, not so fast. The "robot" in this video is not autonomous; it's tele-operated. However, if you go to the project's website (link below), you'll find many tasks the robot can do autonomously. So, what's going on?

Autonomous skills: Cook shrimp, wipe wine, call elevator, use cabinets, rinse pan, push chairs, high-five humans.

What's going on is that the tele-operation and the autonomy are related. What these researchers built is a robot for training a robot: the tele-operation produces demonstration data, which is then used to train the robot to perform some tasks autonomously. The researchers noticed that "imitation learning" algorithms existed, but there was no affordable hardware platform for collecting the training data those algorithms need. So they decided to build one. This is actually their second such system. The first system, called "ALOHA", was a pair of robotic arms mounted on a tabletop. "ALOHA" stood for "A low-cost open-source hardware..." (the full phrase ends with "system for bimanual teleoperation", but that would have made it ALOHASFBIT).

The problem with the tabletop mounting is that many household tasks combine hand movement with whole-body movement. For example, to open a cabinet, the robot needs to back up while pulling the two cabinet doors open by their handles. And of course it has to navigate to the cabinet in the first place. And if it's putting a pot in the cabinet, it has to put the pot down, open the cabinet, pick the pot up again, place it in the cabinet, and close the cabinet. Most household tasks are like this. So they got the idea for "Mobile ALOHA".

To go from "ALOHA" to "Mobile Aloha", they took the previous tabletop ALOHA system and mounted it on a wheeled base. An AgileX Tracer AGV (automatic guided vehicle), from Trossen Robotics, to be precise. It is designed for indoor autonomous logistics and warehousing applications. It has two 150W brushless servo motors and can carry a payload of 100 kg (220 lbs) for 4 hours. It can move at about the same speed as a walking human. It can go 1.6 meters per second and the average walking speed for a human is 1.4 meters per second. By adding extra weight low to the ground, the researchers found they could increase the "tip-over" stability, enabling Mobile ALOHA to, for example, get on an elevator, where the elevator floor wasn't exactly level with the building floor.

With the robotic arms attached, the total weight is 75 kg (165 lbs), the robotic arms can extend 100 cm (40 inches) from the base, and they can lift 750 g (26 oz).

How then is the system tele-operated, and how, once the tele-operation generates data, is that data used to train the system?

Well, first of all, the human operator can't use their hands to drive the base, because their hands are busy controlling the robotic arms and grippers. So the researchers came up with a "tether" that attaches to the operator's waist: the operator drives the base by simply walking, pulling it along behind them.

How about the training? Well, the original ALOHA system represented the 14-degrees-of-freedom input from the robotic arms and grippers as a 14-dimensional vector. For Mobile ALOHA, that's extended by 2 more dimensions for the base: its linear and angular velocity. If you're wondering where the number 14 came from: the robotic arms are "6-degrees-of-freedom" robotic arms -- more precisely, the 6-degrees-of-freedom ViperX 300 Robot Arm from Trossen Robotics. That's 6, and adding the gripping force makes 7. There are 2 of these robotic arms, so multiply by 2 and you get 14. Add the 2 degrees of freedom for the base, and you're up to 16.
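
To make that concrete, here's a sketch of how one timestep of the 16-dimensional action vector could be laid out. This is my own illustration, not the authors' code, so treat the names and the ordering of the components as assumptions:

```python
import numpy as np

def pack_action(left_arm_joints, left_grip, right_arm_joints, right_grip,
                base_linear_vel, base_angular_vel):
    """Build the 16-D action: 7 per arm (6 joints + gripper) + 2 for the base."""
    assert len(left_arm_joints) == 6 and len(right_arm_joints) == 6
    return np.concatenate([
        left_arm_joints,      # 6 joint targets, left ViperX 300 arm
        [left_grip],          # 1 gripper command
        right_arm_joints,     # 6 joint targets, right ViperX 300 arm
        [right_grip],         # 1 gripper command
        [base_linear_vel],    # base linear velocity (m/s)
        [base_angular_vel],   # base angular velocity (rad/s)
    ])

a = pack_action(np.zeros(6), 0.5, np.zeros(6), 0.5, 0.3, 0.0)
print(a.shape)  # (16,)
```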

To expedite the training, they first trained the system on the training data from the original 14-DoF ALOHA. They called this step "co-training with static ALOHA". The training algorithms themselves aren't new to this paper, although one of the three used here, called ACT, was invented by these researchers for the original ALOHA system. ACT stands for Action Chunking with Transformers. As you might guess from the "transformers" part of the name, the system uses transformers like the GPT models we're familiar with (remember, the "T" in "GPT" stands for "transformer" -- GPT stands for "generative pre-trained transformer"). The idea is to group actions into "chunks": instead of predicting one action at a time, the transformer predicts a whole chunk of future actions at once, in a manner analogous to how a large language model like GPT generates a sequence of tokens for text.
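
To make "action chunking" concrete, here's a minimal sketch of the idea. To be clear, this is my own illustration, not the authors' ACT implementation (the real ACT is a conditional VAE with an image-conditioned transformer encoder-decoder, and all the sizes below are made up):

```python
import torch
import torch.nn as nn

ACTION_DIM, CHUNK_SIZE, D_MODEL = 16, 50, 256  # illustrative; chunk size assumed

class ChunkingPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        # one learned query embedding per future action slot in the chunk
        self.queries = nn.Parameter(torch.randn(CHUNK_SIZE, D_MODEL))
        self.head = nn.Linear(D_MODEL, ACTION_DIM)

    def forward(self, obs_tokens):               # obs_tokens: (batch, seq, D_MODEL)
        batch = obs_tokens.size(0)
        x = torch.cat([obs_tokens, self.queries.expand(batch, -1, -1)], dim=1)
        x = self.backbone(x)                     # self-attention over obs + queries
        return self.head(x[:, -CHUNK_SIZE:])     # (batch, CHUNK_SIZE, ACTION_DIM)

policy = ChunkingPolicy()
chunk = policy(torch.randn(2, 4, D_MODEL))       # 4 observation tokens in,
print(chunk.shape)                               # 50 actions out: [2, 50, 16]
```

At execution time, ACT also overlaps successive chunks and averages the overlapping predictions ("temporal ensembling"), which smooths out the robot's motion.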

The other two algorithms are called Diffusion Policy and VINN.

If the name "Diffusion Policy" makes you think of diffusion models like Stable Diffusion or DALL-E, you're on the right track. Except unlike those models which generate images, Diffusion Policy generates "policies", which in the parlance of reinforcement learning is the function that maps the state of the agent and envorinment to actions -- the word "strategy" would make more sense in colloquial, non-technical contexts. The idea is that you represent a visual-motor action strategy as a denoising process.

VINN stands for Visual Imitation through Nearest Neighbors. The basic idea is that an encoder network is trained on the demonstration data to embed camera images into vectors. Then, at execution time, the current image is embedded and a "nearest neighbor" search finds the demonstration frames closest to the current situation; the robot performs a distance-weighted average of the actions that were recorded with those frames.
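
In sketch form, the test-time step looks something like this. The function and variable names are mine, and the real VINN trains its encoder with self-supervised learning (BYOL) rather than the stand-in used here:

```python
import numpy as np

def vinn_act(encoder, obs, demo_embeddings, demo_actions, k=5):
    z = encoder(obs)                                     # embed current frame
    dists = np.linalg.norm(demo_embeddings - z, axis=1)  # distance to each demo frame
    nearest = np.argsort(dists)[:k]                      # k nearest neighbors
    weights = np.exp(-dists[nearest])                    # closer frames weigh more
    weights /= weights.sum()
    return weights @ demo_actions[nearest]               # weighted-average action

# Toy usage with a stand-in encoder and random demonstration data:
rng = np.random.default_rng(0)
demo_z = rng.normal(size=(100, 32))                      # 100 demo frame embeddings
demo_a = rng.normal(size=(100, 16))                      # their recorded 16-D actions
action = vinn_act(lambda o: rng.normal(size=32), obs=None,
                  demo_embeddings=demo_z, demo_actions=demo_a)
print(action.shape)  # (16,)
```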

Mobile ALOHA: Your housekeeping robot

#solidstatelife #ai #genai #reinforcementlearning #imitationlearning #robotics