#reinforcementlearning

waynerad@diasp.org

"LMRL Gym: Benchmarks for multi-turn reinforcement learning with language models."

One of their benchmarks was "Guess My City" so I decided to try that.


Me:

Guess my city.

ChatGPT:

Sure, I'd be happy to play a game of guessing your city! Please provide some clues or descriptions about your city, such as landmarks, geographical features, climate, or any other distinctive characteristics. The more details you can provide, the better I'll be able to make an accurate guess.

Me:

Landmarks: It has a building known as the "cash register". Geographical features: it has mountains to the west. Climate: cold winters and moderate summers with occasional hot days, with a lot of variation all year.

ChatGPT:

Based on your clues, I believe your city might be Denver, Colorado.

The "cash register" building you're referring to is likely the Wells Fargo Center, known for its unique architecture that resembles a cash register. The geographical feature of mountains to the west aligns with Denver's location at the eastern foot of the Rocky Mountains. Additionally, Denver is known for its cold winters and moderate summers, with a climate that can vary greatly, which matches your description of the climate in your city.

Is this correct?

Me:

Yes

ChatGPT:

Great! I'm glad I guessed correctly. Denver is a fascinating city with its unique blend of urban and natural landscapes. If you have any more questions or need information about Denver or anything else, feel free to ask!


I also tried this with Claude, Bard, and DeepSeek.

In the interest of brevity, I'm going to skip the full transcripts of the conversations.

Claude guessed Dallas, Atlanta, and San Francisco before guessing Denver. It thought "cash register" could refer to buildings in those cities and had to be given a hint that my city had mountains to the west. It didn't think to ask about geographical features right off the bat, which would have prompted me to tell it that.

Bard did something totally unexpected: it ran Google Maps searches. This (surprisingly or unsurprisingly?) was unhelpful. It guessed Los Angeles because there is an "American Cash Register Company" there. It had to be told that "cash register" is a nickname.

DeepSeek (the Chinese large language model) initially just said sorry, I don't have enough information to guess your city. Unlike the others it didn't immediately recognize "guess my city" as a game. I had to explain it's a game where it was supposed to ask me questions to guess my city. When I followed up with the country and geography, it responded with encyclopedic information on the geography of the western United States. Once it properly understood it was supposed to ask questions, it asked whether my city had a nickname. I said yes, "The Mile High City," and it immediately guessed Denver from there.

By the way, all the large language models gave a different name for Denver's "cash register" building (so called because it has the silhouette of a cash register if viewed from the proper angle), and I don't know which is correct because I don't know the true name of the building and can't be bothered to figure it out.

What this is all about is "evaluating capabilities enabled by reinforcement learning". As you may or may not know, what enables large language models to function as "chatbots" is not just their "predict the next token" language training (which is called self-supervised training, for historical reasons -- don't worry if the term makes no sense), but an additional technique called reinforcement learning from human feedback (RLHF). In RLHF, humans rank the model's outputs, those rankings are used to train a second model that scores responses, and that "reward model" is then flipped around and used as the reward signal for further training of the original model, which teaches it to behave "helpfully". This is why ChatGPT and its ilk come across as so eager to please you. It's a complicated system, but what's important for the moment are the words "reinforcement learning".

Reinforcement learning is the field of AI that led to the systems that beat humans at the Chinese game of Go, as well as chess and shogi -- and it beat the best human-made chess engine, Stockfish. Reinforcement learning works by getting input from an environment along with a reward signal -- for example, the screen pixels of Atari games, plus the score as the reward signal. Anyway, these researchers got the idea that, since large language models are trained with reinforcement learning, they might design some tests looking for characteristics of reinforcement learning and see if they can find evidence of reinforcement-learning-generated behavior from large language models.
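
To make the "input from an environment plus a reward signal" loop concrete, here's a minimal sketch using the Gymnasium API (the environment and the random action choice are just placeholders -- a real agent would use a learned policy):

```python
# Minimal sketch of the reinforcement learning loop: observe, act, get reward.
import gymnasium as gym

env = gym.make("CartPole-v1")           # stand-in for "Atari pixels plus score"
observation, info = env.reset(seed=0)

total_reward = 0.0
for step in range(200):
    action = env.action_space.sample()  # a real agent would use its learned policy here
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward              # the reward signal is what the agent learns to maximize
    if terminated or truncated:
        observation, info = env.reset()

env.close()
print("return collected this run:", total_reward)
```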

Here's the list of "core capabilities that reinforcement learning can enable in large language models" that they decided to look for:

"Strategic decision making. Reinforcement learning shines in goal-directed tasks that require multi-step planning and strategic decision making. Strategic decision-making can range from simple choices like asking follow-up questions to gather information (e.g., in the 20 Questions task), to complex strategy in chess."

"Complex language. Our benchmark includes realistic language and interaction scenarios, requiring large language models to combine their knowledge from pretraining to help solve tasks during reinforcement learning finetuning. Rather than focusing entirely on causal logic and strategy found in text games, several of our tasks specifically emphasize the use of realistic language."

"Credit assignment. In reinforcement learning, rewards are often delayed relative to the action that was pivotal to the outcome. For example, a seller agent might state a particularly compelling feature of the product and then, several turns later, complete a successful sale. Reinforcement learning must determine the statements that led to the good outcome, and reinforce them."

"Partial observability. In language tasks, the state consists of the entire history of tokens, and an agent may need to examine this entire context to infer the correct state. For example, the mental states of a speaker in a dialogue (e.g., whether the buyer is impatient in a selling task), previously observed facts in a guessing game, and other hidden variables might induce partial observability."

"Trajectory stitching. In a dataset with many suboptimal trajectories, it is necessary to join optimal actions from different suboptimal trajectories together to form the most optimal trajectory. An algorithm capable of trajectory stitching should be able to learn from optimal actions taken in unsuccessful trajectories and avoid suboptimal actions that occurred in successful trajectories."

They came up with 8 "tasks", called "Maze", "Text-Based Navigation", "Wordle", "Chess", "Chess Endgames", "Twenty Questions", "Guess My City", and "Car Dealer". Yes, they really did come up with a text-based way of playing chess (there's actually a standardized notation for chess moves). They even used Stockfish to generate data. And yes, Wordle is exactly the online Wordle game you're familiar with, where you get 6 attempts to guess a hidden 5-letter word, and after each guess you're told whether each letter you guessed is in the right position, in the word but not in the right position, or not in the hidden word at all.
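
Just to make that feedback rule concrete, here's a quick sketch of it in Python (my own toy function, not anything from the benchmark; repeated letters are handled the usual Wordle way):

```python
# Toy implementation of the Wordle feedback rule described above.
def wordle_feedback(guess: str, hidden: str) -> list[str]:
    feedback = ["absent"] * 5
    remaining = {}                       # letters of the hidden word not yet matched

    # First pass: exact-position matches ("green").
    for i, (g, h) in enumerate(zip(guess, hidden)):
        if g == h:
            feedback[i] = "correct"
        else:
            remaining[h] = remaining.get(h, 0) + 1

    # Second pass: right letter, wrong position ("yellow"), respecting letter counts.
    for i, g in enumerate(guess):
        if feedback[i] != "correct" and remaining.get(g, 0) > 0:
            feedback[i] = "present"
            remaining[g] -= 1
    return feedback

print(wordle_feedback("crane", "cider"))
# ['correct', 'present', 'absent', 'absent', 'present']
```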

They have a grid (on page 4) showing, for each of the 8 tasks, which of the 5 "capabilities" it exercises (strategic decision making, complex language, credit assignment, partial observability, and trajectory stitching). For the task I tried above, "Guess My City", it exercises more than most: the first four (strategic decision making, complex language, credit assignment, and partial observability), but maybe not the last one, trajectory stitching.

LMRL Gym: Benchmarks for multi-turn reinforcement learning with language models

#solidstatelife #ai #genai #llms #rlhf #reinforcementlearning

waynerad@diasp.org

Mobile ALOHA: Your housekeeping robot. Opening the blinds, watering the plants, vacuuming, making coffee, cleaning up spilled milk, taking dishes out of the dishwasher, pouring ketchup, taking out the trash, putting laundry in the washing machine, taking clothes out of the dryer, putting sheets on the bed, putting the pillow in the pillowcase, hanging clothes in the closet, folding clothes and putting them in a drawer, and turning the light off. Oh, and it plays with the cat, too. The video is taking the internet by storm.

But, not so fast. The "robot" in this video is not autonomous, it's tele-operated. But, if you go to the website of the project (link below), you find many tasks the robot can do autonomously. So, what's going on?

Autonomous skills: Cook shrimp, wipe wine, call elevator, use cabinets, rinse pan, push chairs, high-five humans.

What's going on is the tele-operation and the autonomy are related. What these researchers did was build a robot for training a robot to do things. So the tele-operation creates training data that then gets turned into a robot that can perform some tasks autonomously. The researchers noticed that "imitation learning" algorithms had been created, but there wasn't any affordable platform for creating the training data for those algorithms, so they decided to make one. This is actually their second system. The first system, called "ALOHA", was a pair of robotic arms mounted on a tabletop. "ALOHA" stood for "a low-cost open-source hardware system for bimanual teleoperation" (which, strictly speaking, would be ALOHASFBIT).

The problem with the tabletop mounting is that many household tasks combine hand movement with whole-body movement. For example, to open a cabinet, the robot needs to back up while opening the two cabinet doors by the two door handles. And of course it has to navigate to the cabinet in the first place. And if it's putting a pot in the cabinet, it has to put the pot down, open the cabinet, pick the pot up again, put it in the cabinet, and close the cabinet. Most household tasks are like this. So they got the idea for "Mobile ALOHA".

To go from "ALOHA" to "Mobile Aloha", they took the previous tabletop ALOHA system and mounted it on a wheeled base. An AgileX Tracer AGV (automatic guided vehicle), from Trossen Robotics, to be precise. It is designed for indoor autonomous logistics and warehousing applications. It has two 150W brushless servo motors and can carry a payload of 100 kg (220 lbs) for 4 hours. It can move at about the same speed as a walking human. It can go 1.6 meters per second and the average walking speed for a human is 1.4 meters per second. By adding extra weight low to the ground, the researchers found they could increase the "tip-over" stability, enabling Mobile ALOHA to, for example, get on an elevator, where the elevator floor wasn't exactly level with the building floor.

With the robotic arms attached, the total weight is 75 kg (165 lbs), the robotic arms can extend 100 cm (40 inches) from the base, and they can lift 750 g (26 oz).

How then is the system tele-operated, and how, once the tele-operation generates data, is that data used to train the system?

Well, first of all the controller of the tele-operation system can't use their hands to control the base movement because their hands are controlling the robotic arms and grippers. So the researchers came up with a "tether" that attaches to the controller's waist, and that's how the robotic base is controlled.

How about the training? Well, the original ALOHA system represented the 14-degrees-of-freedom input from the robotic arms and grippers as a 14-dimensional vector. For Mobile ALOHA, that's extended by 2 more dimensions for the base. If you're wondering where the number 14 came from: the robotic arms are "6-degrees-of-freedom" robotic arms -- more precisely, the 6-degrees-of-freedom ViperX 300 Robot Arm from Trossen Robotics. That's 6, and to get 7 they added the gripping force. There are 2 of these robotic arms, so multiply by 2 and you get 14. Add 2 more degrees of freedom for the base unit, and you're up to 16 degrees of freedom.
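
Here's that counting written out as a tiny sketch (the ordering and names are my own illustration, not the authors' actual vector layout):

```python
import numpy as np

# How the 16-dimensional Mobile ALOHA action vector is composed, per the counting above.
JOINTS_PER_ARM = 6    # 6-DoF ViperX 300 arm
GRIPPER_PER_ARM = 1   # gripping force
ARMS = 2
BASE_DOF = 2          # e.g., linear and angular velocity of the wheeled base

action_dim = ARMS * (JOINTS_PER_ARM + GRIPPER_PER_ARM) + BASE_DOF
assert action_dim == 16

action = np.zeros(action_dim)   # one tele-operation (or policy) command per timestep
```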

To expedite the training, they first trained the system on the training data from the original 14-DoF ALOHA system. They called this step "Co-training with static ALOHA". They didn't invent anything new for the training itself, although they had come up with a novel algorithm for the original ALOHA system, and that is one of the three algorithms used here. That algorithm is called ACT, which stands for Action Chunking with Transformers. As you might guess from the "transformers" part of the name, the system uses transformers like the GPT models we're familiar with (remember the "T" in "GPT" stands for "transformer" -- GPT stands for "generative pre-trained transformer"). The idea is to break actions into "chunks" represented by tokens and then have the transformer model generate tokens for actions in a manner analogous to how a large language model like GPT generates tokens for text.
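
To show just the "chunking" part of that idea, here's a toy sketch (the real ACT is a conditional VAE with a transformer over camera images; this stripped-down version only illustrates predicting a whole chunk of future actions at once, with made-up dimensions):

```python
import torch
import torch.nn as nn

CHUNK, OBS_DIM, ACT_DIM = 20, 64, 16   # 16 matches Mobile ALOHA's action vector

class ChunkPolicy(nn.Module):
    """Toy policy that emits the next CHUNK actions from one observation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM, 256), nn.ReLU(),
            nn.Linear(256, CHUNK * ACT_DIM),
        )

    def forward(self, obs):                   # obs: (batch, OBS_DIM)
        return self.net(obs).view(-1, CHUNK, ACT_DIM)

policy = ChunkPolicy()
action_chunk = policy(torch.randn(1, OBS_DIM))
print(action_chunk.shape)                     # torch.Size([1, 20, 16])
```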

The other two algorithms are called Diffusion Policy, and VINN.

If the name "Diffusion Policy" makes you think of diffusion models like Stable Diffusion or DALL-E, you're on the right track. Except unlike those models which generate images, Diffusion Policy generates "policies", which in the parlance of reinforcement learning is the function that maps the state of the agent and envorinment to actions -- the word "strategy" would make more sense in colloquial, non-technical contexts. The idea is that you represent a visual-motor action strategy as a denoising process.

VINN stands for Visual Imitation through Nearest Neighbors. The basic idea is that an encoder network is trained on all the training data. Then, when it comes time to act after training, the current observation is run through the encoder, a "nearest neighbor" search finds the training examples whose embeddings are closest to the current situation, and the robot takes the actions recorded for those examples.
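
A sketch of that idea (the identity "encoder" and the distance-weighted averaging are stand-ins; in VINN the encoder is a trained visual network):

```python
import numpy as np

def encoder(obs):
    return obs                                   # stub: identity embedding for illustration

train_embeddings = np.random.randn(1000, 32)     # embeddings of demonstration frames
train_actions = np.random.randn(1000, 16)        # actions recorded at those frames

def vinn_action(obs, k=5):
    query = encoder(obs)
    dists = np.linalg.norm(train_embeddings - query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest])            # closer neighbors count more
    weights /= weights.sum()
    return weights @ train_actions[nearest]      # weighted average of neighbors' actions

print(vinn_action(np.random.randn(32)).shape)    # (16,)
```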

Mobile ALOHA: Your housekeeping robot

#solidstatelife #ai #genai #reinforcementlearning #imitationlearning #robotics

waynerad@diasp.org

DenseWiki is a new site that aims to provide simple, plain-English explanations of popular concepts in machine learning. They've started by adding explanations of a few popular concepts in reinforcement learning, starting with "Actor-Critic Methods".

"As a human, when you get better at playing a game (say soccer or boxing), isn't the improvement also usually accompanied by getting better at evaluating games -- i.e. answering questions such as 'which side is doing better' at any given point in a game?"

"It also goes the other way around -- being good at evaluating your own performance during a game also enables you to coach yourself, thus being able to try new things and get better over time -- without necessarily needing external supervision."

"And that is the fundamental intuition behind 'actor critic' methods. In essence, being your own critic helps you grow as an actor, growing as an actor makes you a better critic, and the cycle continues."

Actor Critic Methods -- A simple explanation

#solidstatelife #ai #aieducation #reinforcementlearning

waynerad@diasp.org

Artificial Intelligence beating people in the physical world -- sort of. A labyrinth game is hooked up to two motors that act as "hands", a camera that acts as its "eyes", and a computer with a "model-based reinforcement learning" algorithm that acts as the "brain".

The key thing here is that the reinforcement learning algorithm practices in the physical world, not in simulation, just like humans. After 6 hours of practice, it outperforms humans. It found ways to "cheat" by skipping certain parts of the maze and had to be explicitly instructed not to take any of those shortcuts.

The reinforcement learning algorithm incorporated is something called DreamerV3. It is an actor-critic system: it collects experience from the physical world, "replays" that experience out of a replay buffer, and "augments" it with generated "dreams". This reduces the amount of external experience the system needs to learn. (In reinforcement learning parlance, it increases the "sample efficiency".)

DreamerV3 actually consists of 3 neural networks: the world model, the critic, and the actor. All three are trained separately without sharing parameters or gradients. The system contains additional circuitry to dynamically adjust the balance of these 3 objectives without a human having to set "hyperparameters". DreamerV3 was famously demonstrated on Minecraft. This labyrinth-playing system built on it is called CyberRunner.
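
To give a feel for how those pieces fit together, here's a toy, runnable outline of the replay-plus-imagination loop (none of this is DreamerV3's real code; the world model, actor, and critic are fake stand-ins):

```python
import random

def environment_step(action):                  # stand-in for the real labyrinth
    return random.random(), random.random()    # (next observation, reward)

def actor(obs):                                # stand-in policy network
    return random.choice([-1.0, 1.0])

def world_model_imagine(start_obs, horizon=15):
    # A trained world model would predict future latent states and rewards here.
    return [(start_obs + 0.01 * i, random.random()) for i in range(horizon)]

replay_buffer = []
obs = 0.0
for step in range(200):
    action = actor(obs)
    next_obs, reward = environment_step(action)
    replay_buffer.append((obs, action, reward, next_obs))
    obs = next_obs

    # Replay real experience (this is where the world model would be trained) ...
    batch = random.sample(replay_buffer, k=min(16, len(replay_buffer)))
    # ... then "dream" forward from those states; the critic learns from the
    # imagined returns and the actor learns to maximize what the critic predicts.
    for start_obs, _, _, _ in batch:
        dream = world_model_imagine(start_obs)
        imagined_return = sum(r for _, r in dream)
```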

#solidstatelife #ai #robotics #reinforcementlearning

https://www.youtube.com/watch?v=zQMKfuWZRdA

waynerad@diasp.org

"Factory: Fast contact for robotic assembly." Assembly, as they refer to it here, is things like peg insertion, electrical connector insertion, tightening of nuts and bolts ("threaded fastener mating"), wire processing, cable routing, soldering, etc.

An "essential, but highly challenging area of manufacturing." Highly challenging for robots, that is. It's physically complex and demands strict reliability requirements.

"The power of physics simulation has not substantially impacted robotic assembly. For assembly, a simulator must accurately and efficiently simulate contact-rich interactions, a longstanding challenge in robotics, particularly for geometrically-complex, tight-clearance bodies." By "tight-clearance", they mean, for example, the small amount of space between a nut and a bolt combined with the small amount of space between the threads of the nut and bolt.

"To simulate real-world motion phases (e.g., initial mating, rundown) and associated pathologies (e.g., cross-threading, jamming), collisions between the threads must be simulated. However, high-quality surface meshes for a nut-and-bolt may consist of 10k-50k triangles; a naive collision scheme may easily exceed memory and compute limits. Moreover, for reinforcement learning training, a numerical solver may need to satisfy non-penetration constraints for 1,000 environments in real-time (i.e., at the same rate as the underlying physical dynamics). Despite the omnipresence of threaded fasteners in the world, no existing simulator achieves this performance."

That is the goal here. Factory is a new set of physics simulation methods to achieve this.

Factory consists of 3 primary components: a physics simulation module, a robot learning suite, and proof-of-concept reinforcement learning policies.

They say their physics simulation achieves "fast, accurate simulations of contact-rich interactions through a novel synthesis of signed distance function (SDF)-based collisions, contact reduction, and a Gauss-Seidel solver."

The signed distance function is a mathematical function that tells you how far a point is from a surface, and which side of the surface the point is on. A Gauss-Seidel solver is a method for solving a system of linear equations, named after, yes, that Gauss, Carl Friedrich Gauss; the Seidel is Philipp Ludwig von Seidel. Remember "SDF" because you're going to be seeing it a lot.
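
A tiny illustration of a signed distance function, for the simplest possible shape, a sphere (my own example, nothing to do with the voxelized SDFs Factory builds from meshes):

```python
import numpy as np

# SDF of a sphere: distance from the point to the center, minus the radius.
# Negative means inside the surface, positive means outside, zero means on it.
def sphere_sdf(p, center, radius):
    return np.linalg.norm(np.asarray(p) - np.asarray(center)) - radius

print(sphere_sdf([0.0, 0.0, 0.0], center=[0.0, 0.0, 2.0], radius=1.0))  #  1.0 (outside)
print(sphere_sdf([0.0, 0.0, 1.5], center=[0.0, 0.0, 2.0], radius=1.0))  # -0.5 (inside)
```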

"The module is accessible within the PhysX physics engine and Isaac Gym. We demonstrate simulator performance on a wide range of challenging scenes. As an example, we simulate 1,000 simultaneous nut-and-bolt assemblies in real-time on a single GPU, whereas the prior state-of-the-art was a single nut-and-bolt assembly at 1/20 real-time."

The robot learning suite consists of "a Franka robot and all rigid-body assemblies from the NIST Assembly Task Board 1, the established benchmark for robotic assembly. The suite includes 60 carefully-designed assets, 3 robotic assembly environments, and 7 classical robot controllers. The suite is accessible within Isaac Gym. User-defined assets, environments, and controllers can be added and simulated as desired."

The proof-of-concept reinforcement learning policies are for "a simulated Franka robot to solve the most contact-rich task on the NIST board, nut-and-bolt assembly." Also in Isaac Gym. Presumably you could use the physics and robot learning assets to do your own reinforcement learning, but it's nice that they've given you some pre-trained "policies". (In the world of reinforcement learning, the word "policy", rather than "model", is used. More precisely, a neural network learns a "model", but a "policy" is a more general concept and can apply to learning algorithms that are not neural networks. I always tell people a "policy" corresponds to what we in normal life would call a "strategy" -- a method of deciding what action to take to win the game from any given situation. What strategy might you take to win a Go game or a poker game? The reinforcement learning framework is general enough that any "reward" signal can be used. Here you get the reward and "win" the game by successfully assembling items for manufacturing.)

They say they compared the contact forces generated by executing their policies against real-world forces and found them consistent.

That's a brief overview. Taking a closer look at the physics contact simulation, they give the following explanation for why they use voxel-based SDFs rather than the standard triangle-based SDFs: "Using SDFs for collisions requires precomputing SDFs offline from a mesh, which can be time- and memory-intensive. Moreover, collision schemes typically test the vertices of a trimesh against the SDF to generate contacts. For sharp objects, simply sampling vertices can cause penetration to occur, motivating iterative per-triangle contact generation. We use discrete, voxel-based SDFs as our geometric representation and demonstrate that they provide efficient, robust collision detection for challenging assets in robotic assembly."

The next technique they employ is contact reduction. They borrow 3 techniques from video games to reduce the number of contacts that have to be checked to see if a collision between objects has occurred. Those techniques are called normal similarity, penetration depth, and an area-based metric.

Contact clustering, as the name implies, groups contacts into clusters and then reduces the number of contacts in each cluster to just a few that need to be checked. Normal similarity is a clustering technique that assigns surfaces with the same surface normal to the same bin. (A surface normal is a vector that points "straight up" from a point on the surface.)

Once binned into clusters, the penetration depth technique culls bins that have "negligible penetration."

Ok, at this point, I'm not sure exactly why, but in addition to the SDF, a direct solver is required to actually resolve all the collisions. They look at two options, the Jacobi solver and the aforementioned Gauss-Seidel solver, which you already know, because it is aforementioned, is the one they selected. The Jacobi solver was the more efficient of the two on a large number of contact points, but they discovered that, using their contact reduction techniques, they could reduce the number of contact points to a sufficiently low number that the Gauss-Seidel solver was actually faster. For example, for nut-and-bolt assembly, they could reduce the number of contact points that needed to be checked from 16,000 to 300.
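
For reference, here's what a generic Gauss-Seidel iteration for solving A x = b looks like (my own textbook version, nothing to do with PhysX's actual contact solver). The key difference from Jacobi is that each updated component is used immediately within the same sweep, which is part of why it wins when the number of unknowns (contacts) is small:

```python
import numpy as np

def gauss_seidel(A, b, iterations=50):
    n = len(b)
    x = np.zeros(n)
    for _ in range(iterations):
        for i in range(n):
            # Use the freshly updated values of x for indices < i.
            sigma = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x[i] = (b[i] - sigma) / A[i, i]
    return x

A = np.array([[4.0, 1.0], [2.0, 3.0]])   # diagonally dominant, so the iteration converges
b = np.array([1.0, 2.0])
print(gauss_seidel(A, b))                # close to the exact solution [0.1, 0.6]
```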

They describe how they tested the system on: 1,024 parallel 4-mm peg-in-hole assemblies, 1,024 parallel M16 nut-and-bolt assemblies, 1,024 parallel VGA-style D-subminiature (D-sub) connectors, 1,024 parallel 2-stage gear assemblies, 1,024 M16 nuts, 1,024 bowls falling into a pile (not something you probably actually want to see in a real manufacturing plant, but it makes a cool demonstration video), 1,024 toruses falling into a pile, and 128 parallel Franka robot + M16 nut-and-bolt assemblies.

Moving on to the robot and environment assets, they lament how the set of computer-aided-design (CAD) models called NIST Task Board 1 is not good enough for high-accuracy physics simulation. "The models for the nuts, bolts, pegs, and gear assembly do not conform to real-world tolerances and clearances; in assembly, mating parts together with tight clearances is precisely the most significant challenge. Furthermore, the models for the electrical connectors were sourced from public repositories rather than manufacturers, were geometrically incompatible, were incomplete, and/or were designed using hand measurements." This motivated them to create their own CAD models for nuts, bolts, pegs, gearshafts, electrical connectors, etc.

In addition they provide 3 environments, with the Pythonic names "FrankaNutBoltEnv", "FrankaInsertionEnv", and "FrankaGearsEnv". As you might guess, all of these involve the Franka robot. The first is for training a Franka robot to do nut-and-bolt assemblies. The second is for insertion assemblies, which means things like USB plugs and sockets, RJ45 plugs and sockets, BNC plugs and sockets, D-sub plugs and sockets, etc. USB you're probably familiar with, RJ45 is the connector at the end of ethernet cables, BNC is a connector used on coaxial cables, and D-sub is the connector used for VGA cables, if you remember those, though there are variations on the plug used for other things. And the third is for training a Franka robot to assemble gear assemblies. It comes with a 4-part gear assembly.

Before we get to the reinforcement learning, we have to talk about controllers for a moment, because the actions available to the controller will determine the actions available to the reinforcement learning algorithm. The researchers looked around at what controllers were being used in the real world. They came up with the following list: Joint-space inverse differential kinematics (IK) motion controller, joint-space inverse dynamics (ID) controller, task-space impedance controller, operational-space (OSC) motion controller, open-loop force controller, closed-loop P force controller, and hybrid force-motion controller.

I didn't actually look at the mathematical formulations of these controllers. From the descriptions in the paper, it sounds like they vary in the way they incorporate gravity, inertia, and errors into their calculations for how much torque to apply to a robot joint.

Ok, now we get to the reinforcement learning. The approach they took was to train the reinforcement learning system on 3 subtasks, and then teach the system to combine the 3 subtasks into a sequence. The 3 subtasks are "pick", "place", and "screw". To train these, they used the nut & bolt environment. For "pick", the robot has to grasp a nut placed at any random location on a work surface. For "place", the robot has to place the nut on top of a bolt at a fixed location. For "screw", the robot has to screw down the nut, engaging the mating threads and tightening the appropriate amount until the nut is firmly in place at the base. These are all done with a 6-degrees-of-freedom Franka robot hand.

For "pick", a reward function was fashioned that is based on the distance between the robotic fingertips and the nut. Further reward was granted if the nut remained in the robot hand's grasp after lifting.

For "place", a reward function was fashioned that was based not just on the distance to the bolt, but a number of distances to a number of "keypoints", which also reward the robot for getting the nut in the right orientation.

For "screw", a reward function was fashioned that was based on keypoint distances, this time between the nut and the base of the bolt, to reward the robot for screwing it down, and also between the rest of the bolt and the nut, to make the tightening process more stable.

As a result, the robot was able to learn how to generate "the precise torques along the 7 arm joints to allow the high-inertia robot links to maintain appropriate posture of the gripper." This is not to say there weren't problems. "As a simplifying assumption, the joint limit of the end-effector was removed, allowing the Franka to avoid regrasping." Not something you could do in the real world. But...

"Nevertheless, training was replete with a diverse range of pathologies, including high-energy collision with the bolt shank, roll-pitch misalignment of the nut when first engaging the bolt threads, jamming of the nut during tightening, and precession of the gripper around the bolt during tightening, which induced slip between the gripper and nut."

To address these issues, the researchers embarked on a "systematic exploration of controllers/gains, observation/action spaces, and baseline rewards." "The highest performing agents consistently used an OSC motion controller with low proportional gains, an observation space consisting of pose and velocity of the gripper and nut, a 2-degrees-of-freedom action space (Z-translation and yaw), and a linear baseline reward."

In an effort to further speed things up, they put in a limit on the number of gradient updates to the policy and put in an early termination rule.

Anyway, combining the 3 subtasks in a sequence, the researchers said they were able to achieve an end-to-end pick + place + screw success rate of 74.2%.

Throughout all of this, they recorded the contact forces involved. "Although the reward functions for the reinforcement learning agents never involved contact forces, the robots learned policies that generated forces in the middle of human ranges; the much higher variance of human forces was likely due to more diverse strategies adopted by humans."

All in all, a big step forward for robotics for manufacturing assembly.

Factory: Fast contact for robotic assembly

#solidstatelife #ai #robotics #manufacturing #reinforcementlearning

waynerad@pluspora.com

"Rapid motor adaptation for legged robots" (RMA). This work was inspired by the observation that animals can rapidly adapt to different soils, moving uphill or downhill, carrying loads, moving with rested or tired muscles, responding to injury, and so on, while robotics systems require immense training to adapt to any of these, which just can't be done on time scales of fractions of a second.

The way their solution works: instead of just having a "policy" (the term that in the reinforcement learning field refers to the function that decides which action the agent takes in any given state), they have a "policy" plus an "adaptation module". The "policy" is trained in simulation, except it is allowed to magically know information about its environment in the simulation, such as surface friction, the weight of its payload, and so on. The job of the "adaptation module" is to estimate this environmental information, which they call the "extrinsics". Furthermore, these "extrinsics" are estimated based on the difference between what the robot joints are commanded to do vs what they actually do.

What's more, this "adaptation module" can also be trained in simulation. This is because the simulation can be constructed in such a way that the "policy" training procedure is allowed to know the "privileged" information about the "extrinsics", but the "adaptation module" is denied this knowledge and has to learn to estimate it from experience.

Of course all this depends on having a rich enough simulation environment that it encompasses everything the robot will encounter in the real world. The researchers created a fractal terrain generator that creates a wide variety of physical contexts for the robot to experience with wide variation in parameters like mass and friction.

Since this is a reinforcement learning system, you might be wondering what the all-important reward function is for the policy portion. The reward function rewards the robot for moving forward and penalizes it for jerky or inefficient motions. More specifically, it is rewarded for going close to 0.35 m/s, which was chosen as the maximum speed, and penalized for lateral movement of joints, rotation of joints, joint speed, tipping sideways, vertical acceleration, foot slippage, and overall expenditure of energy.
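
Here's a rough sketch of a reward function along those lines, written from the prose only -- all the weights are made up, and the paper's actual terms and coefficients differ:

```python
import numpy as np

def rma_reward(forward_vel, lateral_vel, yaw_rate, joint_speeds,
               roll, pitch, vertical_accel, foot_slip, torques):
    r = min(forward_vel, 0.35)                        # reward forward progress, capped at 0.35 m/s
    r -= 0.5 * abs(lateral_vel) + 0.5 * abs(yaw_rate) # penalize lateral drift and spinning
    r -= 0.01 * float(np.sum(np.square(joint_speeds)))
    r -= 0.2 * (abs(roll) + abs(pitch))               # penalize tipping sideways
    r -= 0.05 * abs(vertical_accel)
    r -= 0.1 * foot_slip
    r -= 0.001 * float(np.sum(np.square(torques)))    # overall energy expenditure
    return r

print(rma_reward(0.3, 0.02, 0.01, np.zeros(12), 0.05, 0.02, 0.1, 0.0, np.zeros(12)))
```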

The training of the adaptation module works by giving it access to the robot's internal state and its recent actions, but not the actual environmental "extrinsics". Since in simulation the "ground truth" is known, the adaptation module can be trained using supervised learning instead of reinforcement learning. The full set of "extrinsics" is: friction, payload mass, center of mass, motor strength, position and velocity from the motor encoders, roll and pitch from the IMU sensor, and the foot contact indicators from the foot sensors.

In the real world, the robot used is the A1 robot from Unitree, which is simulated using a simulator called RaiSim. The robot's internal state consists of the joint positions (12 values), joint velocities (12 values), roll and pitch of the torso, and binary foot contact indicators (4 values), and the actions it has available are position controls for the 12 robot joints.

The neural network for the "policy" is a 3-layer fully connected network, while the neural network for the "adaptation module" is a 3-layer convolutional neural network.
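
Here's a sketch of those two networks with made-up layer sizes and dimensions (the input layout, the history length, and the final linear projection in the adaptation module are my assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, EXTRINSICS_DIM, HISTORY = 30, 12, 8, 50

# Policy: a small fully connected network that sees the state plus the (estimated) extrinsics.
policy = nn.Sequential(
    nn.Linear(STATE_DIM + EXTRINSICS_DIM, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, ACTION_DIM),
)

# Adaptation module: a small 1-D conv net over a window of recent states and actions,
# producing an estimate of the extrinsics (the final projection layer is my addition).
adaptation_module = nn.Sequential(
    nn.Conv1d(STATE_DIM + ACTION_DIM, 32, kernel_size=5, stride=2), nn.ReLU(),
    nn.Conv1d(32, 32, kernel_size=5, stride=2), nn.ReLU(),
    nn.Conv1d(32, 32, kernel_size=5, stride=2), nn.ReLU(),
    nn.Flatten(),
    nn.LazyLinear(EXTRINSICS_DIM),
)

history = torch.randn(1, STATE_DIM + ACTION_DIM, HISTORY)   # recent state-action pairs
z_hat = adaptation_module(history)                          # estimated extrinsics
action = policy(torch.cat([torch.randn(1, STATE_DIM), z_hat], dim=-1))
print(action.shape)                                         # torch.Size([1, 12])
```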

For the results, well, just watch the videos.

RMA: rapid motor adaptation for legged robots

#solidstatelife #ai #robotics #quadrupeds #reinforcementlearning #simulation