#reinforcementlearning

waynerad@diasp.org

"How AlphaChip transformed computer chip design."

"AlphaChip" is the name Google has bestowed on their reinforcement learning system for doing chip layouts for semiconductor manufacturing.

It's dramatically accelerating the pace of chip design by shortening the time it takes to do the chip "floorplanning" process, with results superior to what human designers can produce.

"Similar to AlphaGo and AlphaZero, which learned to master the games of Go, chess and shogi, we built AlphaChip to approach chip floorplanning as a kind of game."

"Starting from a blank grid, AlphaChip places one circuit component at a time until it's done placing all the components. Then it's rewarded based on the quality of the final layout. A novel "edge-based" graph neural network allows AlphaChip to learn the relationships between interconnected chip components and to generalize across chips, letting AlphaChip improve with each layout it designs."

"AlphaChip has generated superhuman chip layouts used in every generation of Google's TPU since its publication in 2020. These chips make it possible to massively scale-up AI models based on Google's Transformer architecture."

"TPUs lie at the heart of our powerful generative AI systems, from large language models, like Gemini, to image and video generators, Imagen and Veo. These AI accelerators also lie at the heart of Google's AI services and are available to external users via Google Cloud."

"To design TPU layouts, AlphaChip first practices on a diverse range of chip blocks from previous generations, such as on-chip and inter-chip network blocks, memory controllers, and data transport buffers. This process is called pre-training. Then we run AlphaChip on current TPU blocks to generate high-quality layouts. Unlike prior approaches, AlphaChip becomes better and faster as it solves more instances of the chip placement task, similar to how human experts do."

"With each new generation of TPU, including our latest Trillium (6th generation), AlphaChip has designed better chip layouts and provided more of the overall floorplan, accelerating the design cycle and yielding higher-performance chips."

"Beyond designing specialized AI accelerators like TPUs, AlphaChip has generated layouts for other chips across Alphabet, such as Google Axion Processors, our first Arm-based general-purpose data center CPUs."

"External organizations are also adopting and building on AlphaChip. For example, MediaTek, one of the top chip design companies in the world, extended AlphaChip to accelerate development of their most advanced chips -- like the Dimensity Flagship 5G used in Samsung mobile phones -- while improving power, performance and chip area."

How AlphaChip transformed computer chip design - Google DeepMind

#solidstatelife #ai #reinforcementlearning #semiconductors

waynerad@diasp.org

If you've been seeing reinforcement learning terms like "state", "agent", "policy", "value", "Q-values" and "deep Q-learning" (DQN), "policy gradient", "domain randomization", and so on, and wondering what it all means, this is the video for you. It explains all this by going back to the beginning and telling the history of reinforcement learning, which is largely the history of getting machines to play games: tic-tac-toe, chess and checkers, backgammon, RoboCup. Everything makes more sense when told in its historical context.

The video ends with speculation that GPT-inspired concepts ("action-GPT") will lead to a breakthrough in robotics comparable to what we've seen in language with the GPT family of models.

How AI Learned to Feel | 75 years of Reinforcement Learning - Art of the Problem

#solidstatelife #ai #reinforcementlearning

waynerad@diasp.org

AlphaProof is a new reinforcement-learning based system for formal math reasoning from DeepMind. AlphaProof + AlphaGeometry 2, an improved version of DeepMind's geometry system, solved 4 out of 6 problems from this year's International Mathematical Olympiad (IMO), achieving the same level as a silver medalist.

"AlphaProof solved two algebra problems and one number theory problem by determining the answer and proving it was correct. This included the hardest problem in the competition, solved by only five contestants at this year's IMO. AlphaGeometry 2 proved the geometry problem, while the two combinatorics problems remained unsolved."

"AlphaProof is a system that trains itself to prove mathematical statements in the formal language Lean. It couples a pre-trained language model with the AlphaZero reinforcement learning algorithm, which previously taught itself how to master the games of chess, shogi and Go."

"Formal languages offer the critical advantage that proofs involving mathematical reasoning can be formally verified for correctness."

"When presented with a problem, AlphaProof generates solution candidates and then proves or disproves them by searching over possible proof steps in Lean. Each proof that was found and verified is used to reinforce AlphaProof's language model, enhancing its ability to solve subsequent, more challenging problems."

"We trained AlphaProof for the IMO by proving or disproving millions of problems, covering a wide range of difficulties and mathematical topic areas over a period of weeks leading up to the competition. The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found."

The blog post seems to have revealed few details of how AlphaProof works. But it sounds like we're about to enter a new era of math proofs, where all kinds of theorems will be discovered and proved.

AI achieves silver-medal standard solving International Mathematical Olympiad problems

#solidstatelife #ai #genai #llms #reinforcementlearning #rl #mathematics #proofs

waynerad@diasp.org

Richard Sutton interviewed by Edan Meyer. Rich Sutton literally half-wrote the book on reinforcement learning -- my textbook on reinforcement learning, Reinforcement Learning: An Introduction, was written by him and Andrew Barto. I've never seen him (or Andrew Barto) on video before so this was interesting to see. (Full disclosure, I only read about half of the book, and I 'cheated' and didn't do all the exercises.)

The thing that I thought was most interesting was his disagreement with the self-supervised learning approach. For those of you not up on the terminology, "self-supervised" is a term that means you take any data, and you mask out some piece of it, and try to train your neural network to "predict" the part that's masked out from the part that isn't masked. The easiest way to do this is to just unmask all the "past" data and mask all the "future" data and ask the neural network to predict the "next word" or "next video frame" or "next" whatever. It's called "self-supervised" because neural network training started with paired inputs and outputs where the "outputs" that the neural network was to learn were written by humans, and this came to be called "supervised" learning. "Unsupervised" learning came to refer to throwing mountains of data at an algorithm and asking it to find whatever patterns are in there. So to describe this alternate mode where it's like "supervised" learning but the "correct answers" are created just by masking out input data, the term "self-supervised" was coined.
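
A minimal sketch of that masking recipe for next-token prediction (my own illustration of the general idea, not any particular model's training code):

```python
# Self-supervised next-token targets: the "label" for each position is just the
# next token of the data itself -- no human annotation needed.
text = "the cat sat on the mat"
tokens = text.split()             # a stand-in for a real tokenizer

inputs  = tokens[:-1]             # everything except the last token
targets = tokens[1:]              # the same sequence shifted left by one

for x, y in zip(inputs, targets):
    print(f"given ...{x!r:>8}  predict {y!r}")
```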

I thought "self-supervised" learning was a very important breakthrough. It's what led directly to ChatGPT and all the other chatbots we know and love (we do love them right?). But Rich Sutton is kind of a downer when it comes to self-suprevised learning.

"Outside of reinforcement learning is lots of guys trying to predict the next observation, or the next video frame. Their fixation on that problem is what I mean by they've done very little, because the thing you want to predict about the world is not the next frame. You want to predict consequential things. Things that matter. Things that you can influence. And things that are happening multiple steps in the future."

"The problem is that you have to interact the world. You have to predict and control it, and you have large sensory sensory motor vectors, then the question is what is my background? Well, if I'm a supervised learning guy, I say, maybe I can apply my supervised learning tools to them. They all want to have labels, and so the labels I have is the very next data point. So I should predict that that next data point. This is is a way of thinking perfectly consistent with their background, but if you're coming from the point of reinforcement learning you think about predicting multiple steps in the future. Just as you predict value functions, predict reward, you should also predict the other events -- these things will be causal. I want to predict, what will happen if I if I drop this? Will it spill? will there be water all over? what might it feel on me? Those are not single step predictions. They involve whole sequences of actions picking things up and then spilling them and then letting them play out. There are consequences, and so to make a model of the world it's not going to be like a video frame. It's not going to be like playing out the video. You model the world at a higher level."

I talked with Rich Sutton - Edan Meyer

#solidstatelife #ai #reinforcementlearning #rl

waynerad@diasp.org

The full text of Dimitri P. Bertsekas's book A Course in Reinforcement Learning is available online for free. It's also available for purchase in print form. About 450 pages. It's the textbook for his course at Arizona State University "Reinforcement Learning and Optimal Control".

I've gone through more than half of Richard Sutton and Andrew Barto's book Reinforcement Learning: An Introduction (though I confess to have 'cheated' and not done all the exercises). It might be worth reading this book, too, to see the same material from an alternate point of view.

"Reinforcement learning can be viewed as the art and science of sequential decision making for large and difficult problems, often in the presence of imprecisely known and changing environment conditions. Dynamic programming is a broad and well-established algorithmic methodology for making optimal sequential decisions, and is the theoretical foundation upon which reinforcement learning rests. This is unlikely to change in the future, despite the rapid pace of technological innovation. In fact, there are strong connections between sequential decision making and the new wave of technological change, generative technology, transformers, GPT applications, and natural language processing ideas, as we will aim to show in this book."

"In dynamic programming there are two principal objects to compute: the optimal value function that provides the optimal cost that can be attained starting from any given initial state, and the optimal policy that provides the optimal decision to apply at any given state and time. Unfortunately, the exact application of dynamic programming runs into formidable computational difficulties, commonly referred to as the curse of dimensionality. To address these, reinforcement learning aims to approximate the optimal value function and policy, by using manageable off-line and/or on-line computation, which often involves neural networks (hence the alternative name Neuro-Dynamic Programming)."

"Thus there are two major methodological approaches in reinforcement learning: approximation in value space, where we approximate in some way the optimal value function, and approximation in policy space, whereby we construct a suboptimal policy by using some form of optimization over a suitably restricted class of policies."

"The book focuses primarily on approximation in value space, with limited coverage of approximation in policy space. However, it is structured so that it can be easily supplemented by an instructor who wishes to go into approximation in policy space in greater detail, using any of a number of available sources."

"An important part of our line of development is a new conceptual framework, which aims to bridge the gaps between the artificial intelligence, control theory, and operations research views of our subject. This framework, the focus of the author's recent monograph 'Lessons from AlphaZero ...',, centers on approximate forms of dynamic programming that are inspired by some of the major successes of reinforcement learning involving games. Primary examples are the recent (2017) AlphaZero program (which plays chess), and the similarly structured and earlier (1990s) TD-Gammon program (which plays backgammon)."

A Course in Reinforcement Learning

#solidstatelife #ai #aieducation #reinforcementlearning #rl

waynerad@diasp.org

Agent Hospital is a simulacrum of a hospital with evolvable medical agents. Alrighty then. And an excuse to use the word "simulacrum".

"Once arrived the Agent Hospital, the patient's journey begins at the triage station. Patients arrive and describe their symptoms to the nursing agents. The instructions guide the nursing staff in their decision-making, enabling them to direct patients to the appropriate specialist departments where medical professional agents are available to conduct further diagnostics."

"After the initial assessment, patients follow the advice from the triage station and proceed to register at the registration counter. They then wait in the designated waiting area for their consultation turn with the specialists from the respective departments."

"When it is their turn for consultation, patients engage in a preliminary dialogue with the physician agents to describe their symptoms and the duration since onset. The physician then determines which medical examination is needed to investigate the cause and assist with diagnosis and treatment. In the current version, only one type of medical examination will be conducted for each patient based on the decisions made by doctor agents."

"After receiving the prescribed list of medical examinations, patients proceed to the relevant department to undergo the tests. The resulting medical data which are pre-generated by LLM are subsequently presented to the patient and the doctor. This process designed to mimic real-time diagnostic feedback, aligns with the presentation of symptoms."

"Subsequent to the medical examination, patients are guided to the respective department where physician agents undertake the diagnostic process. Patients disclose their symptoms and share the results of the medical examination with the physician agents, who then undergo diagnostic processes based on a predefined disease set. The diagnostic result is promptly communicated back to the patient, showcasing the model's capacity to integrate complex medical data and its advanced diagnostic ability."

"The medical agent is presented with the patient's symptoms, results from medical examinations and the diagnosis of the disease they made. In addition, three distinct treatment plans tailored to mild, moderate, and severe conditions are also provided. The doctor is then tasked with selecting the appropriate plan from the mild, moderate, or severe options, according to the patient's specific needs. If any medicine is prescribed, patients proceed to the dispensary to collect it."

"At the end of the diagnostic and treatment process, the patient provides feedback or updates on their health condition for follow-up actions. To mimic the dynamic progression of diseases accurately, the LLM-enhanced simulation involves a few key steps: doctors devise treatment plans based on the patient's detailed health information and test results, and then these details -- specifically the patient's symptoms, the prescribed treatment plan, and the diagnosed disease are incorporated into a template for simulation."

Ok, as you can see, quite an elaborate simulation. But how do the medical agents actually learn? The whole point of doing all this is to get medical agents that actually learn. Here's what they say (big chunk of quotes to follow):

"Doctor agents continuously learn and accumulate experience during the treatment process in Agent Hospital, thereby enhancing their medical capabilities similar to human doctors. We assume that doctor agents are constantly repeating this process during all working hours."

"Apart from improving their skills through clinical practice, doctor agents also proactively accumulate knowledge by reading medical documents outside of work hours. This process primarily involves strategies to avoid parametric knowledge learning for agents."

"To facilitate the evolution of LLM-powered medical agents, we propose MedAgent-Zero strategy MedAgent-Zero is a parameter-free strategy, and no manually labeled data is applied as AlphaGo-Zero."

"There are two important modules in this strategy, namely the Medical Record Library and the Experience Base. Successful cases, which are to be used as references for future medical interventions, are compiled and stored in the medical record library. For cases where treatment fails, doctors are tasked to reflect and analyze the reasons for diagnostic inaccuracies and distill a guiding principle to be used as a cautionary reminder for subsequent treatment processes."

"In the process of administering treatment, it is highly beneficial for doctors to consult and reference previously validated medical records. These medical records contain abundant knowledge and demonstrate the rationale behind accurate and adequate responses to diverse medical conditions. Therefore, we propose to build a medical record library for doctor agents to sharpen their medical abilities, including historical medical records from hospital practices and exemplar cases from medical documents."

"Learning from diagnostic errors is also crucial for the growth of doctors. We believe that LLM-powered medical professional agents can engage in self-reflection from these errors, distilling relevant principles (experience) to ensure correct diagnoses when encountering similar issues in future cases."

"If the answer is wrong, the agent will reflect the initial problem, generated answer, and golden answer to summarize reusable principles. All principles generated are subject to a validation process. Upon generation, the principle is integrated into the original question which was initially answered incorrectly, allowing medical professional agents to re-diagnose. Only if the diagnosis is correct will the principle be added to the experience base."

"To eliminate the influence of noise and maximize the utilization of the experience base, we incorporate additional judgment when utilizing experience. This judgment involves evaluating whether the top-K experience retrieved based on semantic similarity are helpful for the treating process. Helpful experience will be incorporated into the prompt, while unhelpful experience will be excluded."

Ok, so, kind of analogous to how our chatbots are transformers that are first pretrained (by self-supervised training) and then get further training from a reinforcement learning system called RLHF (reinforcement learning from human feedback), here we also have an LLM-based system where reinforcement learning is employed (albeit in a different way) to further train the LLMs.

I have mixed feelings about this. There's part of me that says this is a silly exercise, unlikely to produce anything reliable enough to be useful, and another part of me that says, yeah, but this could be the beginning of how all hospitals are run 20 or 30 years in the future.

Agent Hospital: A simulacrum of hospital with evolvable medical agents

#solidstatelife #ai #genai #llms #medicalai #reinforcementlearning #rl

waynerad@diasp.org

"LMRL Gym: Benchmarks for multi-turn reinforcement learning with language models."

One of their benchmarks was "Guess My City" so I decided to try that.


Me:

Guess my city.

ChatGPT:

Sure, I'd be happy to play a game of guessing your city! Please provide some clues or descriptions about your city, such as landmarks, geographical features, climate, or any other distinctive characteristics. The more details you can provide, the better I'll be able to make an accurate guess.

Me:

Landmarks: It has a building known as the "cash register". Geographical features: it has mountains to the west. Climate: cold winters and moderate summers with occasional hot days, with a lot of variation all year.

ChatGPT:

Based on your clues, I believe your city might be Denver, Colorado.

The "cash register" building you're referring to is likely the Wells Fargo Center, known for its unique architecture that resembles a cash register. The geographical feature of mountains to the west aligns with Denver's location at the eastern foot of the Rocky Mountains. Additionally, Denver is known for its cold winters and moderate summers, with a climate that can vary greatly, which matches your description of the climate in your city.

Is this correct?

Me:

Yes

ChatGPT:

Great! I'm glad I guessed correctly. Denver is a fascinating city with its unique blend of urban and natural landscapes. If you have any more questions or need information about Denver or anything else, feel free to ask!


I also tried this with Claude, Bard, and DeepSeek.

In the interest of brevity, I'm going to skip the full transcripts of the conversations.

Claude guessed Dallas, Atlanta, and San Francisco before guessing Denver. It thought "cash register" could refer to buildings in those cities and had to be given a hint that my city had mountains to the west. It didn't think to ask for "geographical features" right off the bat, which would have resulted in me telling it that.

Bard did something totally unexpected and did Google Maps searches. This (surprisingly or unsurprisingly?) was unhelpful. It guessed Los Angeles because there is an "American Cash Register Company" there. It had to be told "cash register" is a nickname.

DeepSeek (the Chinese large language model) initially just said sorry, I don't have enough information to guess your city. Unlike the others it didn't immediately recognize "guess my city" as a game. I had to explain it's a game where it was supposed to ask me questions to guess my city. When I followed up with the country and geography, it responded with encyclopedic information on the geography of the western United States. Once it properly understood it was supposed to ask questions, it asked whether my city had a nickname. I said yes, "The Mile High City," and it immediately guessed Denver from there.

By the way, all the large language models gave a different name for Denver's "cash register" building (so called because it has the silhouette of a cash register if viewed from the proper angle), and I don't know which is correct because I don't know the true name of the building and can't be bothered to figure it out.

What this is all about is "evaluating capabilities enabled by reinforcement learning". As you may or may not know, what enables large language models to function as "chatbots" is not just their "predict the next token" language training (which is called self-supervised training, for historical reasons, don't worry if the term makes no sense), but an additional technique called reinforcement learning from human feedback (RLHF). This technique uses human preference judgments to train a reward model, which is then flipped around and used as the reward signal for the original model, teaching it to behave "helpfully". This is why ChatGPT and its ilk come across as so eager to please you. It's a complicated system but what's important for the moment here are the words "reinforcement learning". Reinforcement learning is the field of AI that led to the systems that beat humans at the Chinese game of Go, as well as chess and shogi -- and it beat the best human-made chess engine, Stockfish. Reinforcement learning works by getting input from an environment along with a reward signal. For example, the screen pixels of Atari games, plus the score as the reward signal. Anyway, these researchers got the idea that, since large language models are using reinforcement learning, they might design some tests looking for characteristics of reinforcement learning and see if they can find evidence of reinforcement learning-generated behavior from large language models.
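
Here's a heavily simplified sketch of the "flipped around" part: a reward model scoring candidate replies, with the score used as the reinforcement signal. The scoring function and example replies are fake, purely to show the shape of the loop.

```python
# Toy sketch of the RLHF idea: a reward model (trained from human preference
# comparisons) scores candidate replies; the chatbot is then updated to make
# higher-scoring replies more likely. The scoring function here is a dummy.
candidates = [
    "I don't know.",
    "Sure! Here are three ways to approach that, step by step: ...",
]

def reward_model(prompt: str, reply: str) -> float:
    # Stand-in for a learned preference model: longer, more helpful-sounding
    # replies happen to score higher in this fake example.
    return len(reply) / 100 + (1.0 if "step by step" in reply else 0.0)

prompt = "How do I fix a flat bicycle tire?"
scores = [(reward_model(prompt, r), r) for r in candidates]
best_score, best_reply = max(scores)
print(f"reinforce toward: {best_reply!r} (score {best_score:.2f})")
```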

Here's the list of "core capabilities that reinforcement learning can enable in large language models" that they decided to look for:

"Strategic decision making. Reinforcement learning shines in goal-directed tasks that require multi-step planning and strategic decision making. Strategic decision-making can range from simple choices like asking follow-up questions to gather information (e.g., in the 20 Questions task), to complex strategy in chess."

"Complex language. Our benchmark includes realistic language and interaction scenarios, requiring large language models to combine their knowledge from pretraining to help solve tasks during reinforcement learning finetuning. Rather than focusing entirely on causal logic and strategy found in text games, several of our tasks specifically emphasize the use of realistic language."

"Credit assignment. In reinforcement learning, rewards are often delayed relative to the action that was pivotal to the outcome. For example, a seller agent might state a particularly compelling feature of the product and then, several turns later, complete a successful sale. Reinforcement learning must determine the statements that led to the good outcome, and reinforce them."

"Partial observability. In language tasks, the state consists of the entire history of tokens, and an agent may need to examine this entire context to infer the correct state. For example, the mental states of a speaker in a dialogue (e.g., whether the buyer is impatient in a selling task), previously observed facts in a guessing game, and other hidden variables might induce partial observability."

"Trajectory stitching. In a dataset with many suboptimal trajectories, it is necessary to join optimal actions from different suboptimal trajectories together to form the most optimal trajectory. An algorithm capable of trajectory stitching should be able to learn from optimal actions taken in unsuccessful trajectories and avoid suboptimal actions that occurred in successful trajectories."

They came up with 8 "tasks", called "Maze", "Text-Based Navigation", "Wordle", "Chess", "Chess Endgames ", "Twenty Questions", "Guess My City", and "Car Dealer". Yes, they really did come up with a text-based way of playing chess (there's actually a standardized notation for chess moves). They even used Stockfish to generate data. And yes, Wordle is exactly the online Worldle game you are familiar with, where you get 6 attempts to guess a hidden 5-letter word and after each guess, you're told if the letter you guessed is in the right position, in the word but not in the right position, or not in the hidden word at all.

They have a grid (on page 4) showing for each of the 8 tasks, which of the 5 "capabilities" it exercises (strategic decision making, complex language, credit assignment, partial observability, and trajectory stitching). For the task I tried above, "Guess My City", it says it exercises more than most: the first four: strategic decision making, complex language, credit assignment, partial observability, but maybe not the last one, trajectory stitching.

LMRL Gym: Benchmarks for multi-turn reinforcement learning with language models

#solidstatelife #ai #genai #llms #rlhf #reinforcementlearning

waynerad@diasp.org

Mobile ALOHA: Your housekeeping robot. Opening the blinds, watering the plants, vacuuming, making coffee, cleaning up spilled milk, taking dishes out of the dishwasher, pouring ketchup, taking out the trash, putting laundry in the washing machine, taking clothes out of the dryer, putting sheets on the bed, putting the pillow in the pillowcase, hanging clothes in the closet, folding clothes and putting them in a drawer, and turning the light off. Oh, and it plays with the cat, too. The video is taking the internet by storm.

But, not so fast. The "robot" in this video is not autonomous, it's tele-operated. But, if you go to the website of the project (link below), you find many tasks the robot can do autonomously. So, what's going on?

Autonomous skills: Cook shrimp, wipe wine, call elevator, use cabinets, rinse pan, push chairs, high-five humans.

What's going on is the tele-operation and the autonomy are related. What these researchers did was build a robot for training a robot to do things. So the tele-operation creates training data that then gets turned into a robot that can perform some tasks autonomously. The researchers noticed that "imitation learning" algorithms had been created, but there wasn't any affordable platform for creating the training data for the "imitation learning" algorithms. So they decided to make it. This is actually their second system. The first system, called "ALOHA", was a pair of robotic arms mounted on a tabletop. "ALOHA" stood for "A low-cost open-source hardware..." (the full phrase is "... system for bimanual teleoperation", but that would be ALOHASFBIT).

The problem with the tabletop mounting is that many household tasks combine hand movement with whole-body movement. For example, to open a cabinet, the robot needs to back up while opening the two cabinet doors by the two door handles. And of course it has to navigate to the cabinet in the first place. And if it's putting a pot in the cabinet, it has to put the pot down, open the cabinet, pick the pot up again, put it in the cabinet, and close the cabinet. Most household tasks are like this. So they got the idea for "Mobile ALOHA".

To go from "ALOHA" to "Mobile Aloha", they took the previous tabletop ALOHA system and mounted it on a wheeled base. An AgileX Tracer AGV (automatic guided vehicle), from Trossen Robotics, to be precise. It is designed for indoor autonomous logistics and warehousing applications. It has two 150W brushless servo motors and can carry a payload of 100 kg (220 lbs) for 4 hours. It can move at about the same speed as a walking human. It can go 1.6 meters per second and the average walking speed for a human is 1.4 meters per second. By adding extra weight low to the ground, the researchers found they could increase the "tip-over" stability, enabling Mobile ALOHA to, for example, get on an elevator, where the elevator floor wasn't exactly level with the building floor.

With the robotic arms attached, the total weight is 75 kg (165 lbs), the robotic arms can extend 100 cm (40 inches) from the base, and can lift 750 g (26 oz).

How then is the system tele-operated, and how, once the tele-operation generates data, is that data used to train the system?

Well, first of all the controller of the tele-operation system can't use their hands to control the base movement because their hands are controlling the robotic arms and grippers. So the researchers came up with a "tether" that attaches to the controller's waist, and that's how the robotic base is controlled.

How about the training? Well, the original ALOHA system represented the 14-degrees-of-freedom input from the robotic arms and grippers as a 14-dimensional vector. For Mobile ALOHA, that's extended with 2 more for the base. If you're wondering where the number 14 came from, the robotic arms are "6-degrees-of-freedom" robotic arms. More precisely, they're the 6-degrees-of-freedom ViperX 300 Robot Arm from Trossen Robotics. That's 6, and to get 7 they added the gripping force. And there's 2 of these robotic arms, so multiply by 2 and you get 14. Add 2 more degrees of freedom for the base unit, and you're now up to 16 degrees of freedom.
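
Spelled out, that 16-dimensional vector looks something like the following. The ordering is my own guess for illustration; the paper doesn't require this particular layout.

```python
# Hypothetical layout of Mobile ALOHA's 16-dimensional action vector:
# 2 arms x (6 joint targets + 1 gripper) = 14, plus 2 values for the mobile base.
import numpy as np

left_arm_joints  = np.zeros(6)   # 6-DoF ViperX 300 arm
left_gripper     = np.zeros(1)
right_arm_joints = np.zeros(6)
right_gripper    = np.zeros(1)
base_command     = np.zeros(2)   # e.g. linear and angular velocity of the base

action = np.concatenate([left_arm_joints, left_gripper,
                         right_arm_joints, right_gripper,
                         base_command])
assert action.shape == (16,)
```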

To expedite the training, they first trained the system on their data from the original 14-DoF ALOHA training data. They called this step "Co-training with static ALOHA". They didn't do anything else original for the training, although they did come up with an original algorithm for the original ALOHA system and that was one of three used here. That system is called ACT, which stands for Action Chunking with Transformers. As you might guess from the "transformers" part of the name, the system uses transformers like the GPT models we're familiar with (remember the "T" in "GPT" stands for "transformer" -- GPT stands for "generative pre-trained transformer".) The idea is to break actions into "chunks" represented by tokens and then have the transformer model generate tokens for actions in a manner analogous to how a large language model like GPT generates tokens for text.

The other two algorithms are called Diffusion Policy, and VINN.

If the name "Diffusion Policy" makes you think of diffusion models like Stable Diffusion or DALL-E, you're on the right track. Except unlike those models which generate images, Diffusion Policy generates "policies", which in the parlance of reinforcement learning is the function that maps the state of the agent and envorinment to actions -- the word "strategy" would make more sense in colloquial, non-technical contexts. The idea is that you represent a visual-motor action strategy as a denoising process.

VINN stands for Visual Imitation through Nearest Neighbors. The basic idea is that an encoder network is trained on all the training data. Then when it comes time to use the encoder to perform actions after training, it uses a "nearest neighbor" algorithm to find the action that is closest to the current situation.
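
A bare-bones sketch of the VINN idea, with a random projection standing in for the trained visual encoder and random vectors standing in for the tele-operation demos; only the retrieve-nearest-neighbor-and-copy-its-action structure is the point here.

```python
# VINN in miniature: embed the current observation, find the most similar
# embedded training observation, and copy its recorded action.
import numpy as np

def encoder(observation: np.ndarray) -> np.ndarray:
    # Stand-in for a trained visual encoder; here just a fixed random projection.
    rng = np.random.default_rng(0)
    W = rng.normal(size=(8, observation.size))
    return W @ observation.ravel()

# Demonstration dataset: (observation, action) pairs from tele-operation.
rng = np.random.default_rng(1)
demo_obs     = [rng.normal(size=16) for _ in range(100)]
demo_actions = [rng.normal(size=16) for _ in range(100)]
demo_embeds  = np.stack([encoder(o) for o in demo_obs])

def vinn_act(current_obs: np.ndarray) -> np.ndarray:
    query = encoder(current_obs)
    dists = np.linalg.norm(demo_embeds - query, axis=1)
    nearest = int(np.argmin(dists))      # VINN actually averages over k neighbors
    return demo_actions[nearest]

action = vinn_act(rng.normal(size=16))
```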

Mobile ALOHA: Your housekeeping robot

#solidstatelife #ai #genai #reinforcementlearning #imitationlearning #robotics

waynerad@diasp.org

DenseWiki is a new site that aims to provide simple, plain-English explanations of popular concepts in machine learning. They've started by adding explanations of a few popular concepts in reinforcement learning, starting with "Actor-Critic Methods".

"As a human, when you get better at playing a game (say soccer or boxing), isn't the improvement also usually accompanied by getting better at evaluating games -- i.e. answering questions such as 'which side is doing better' at any given point in a game?"

"It also goes the other way around -- being good at evaluating your own performance during a game also enables you to coach yourself, thus being able to try new things and get better over time -- without necessarily needing external supervision."

"And that is the fundamental intuition behind 'actor critic' methods. In essence, being your own critic helps you grow as an actor, growing as an actor makes you a better critic, and the cycle continues."

Actor Critic Methods -- A simple explanation

#solidstatelife #ai #aieducation #reinforcementlearning

waynerad@diasp.org

Artificial Intelligence beating people in the physical world -- sort of. A labyrinth game is hooked up to two motors that act as "hands", a camera that acts as its "eyes", and a computer with a "model-based reinforcement learning" algorithm that acts as the "brain".

The key thing here is that the reinforcement learning algorithm practices in the physical world, not in simulation, just like humans. After 6 hours of practice, it outperforms humans. It found ways to 'cheat' by skipping certain parts of the maze and had to be explicitly instructed not to take any of those shortcuts.

The reinforcement learning algorithm incorporated is something called DreamerV3. It is an actor-critic system, and it collects experience from the physical world, then "replays" that out of a replay buffer, then "augments" that with generated "dreams". This reduces the amount of external experience the system needs to learn. (In reinforcement learning parlance, it increases the "sample efficiency".)

DreamerV3 actually consists of 3 neural networks: the world model, the critic, and the actor. All three are trained separately without sharing parameters or gradients. The system contains additional circuitry to dynamically adjust the balances of these 3 objectives without a human having to set "hyperparameters". The DreamerV3 system was originally trained on Minecraft. This labyrinth-playing system built on it is called CyberRunner.
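
Very roughly, the "replay plus dreams" part of that loop looks like the sketch below, with an ordinary least-squares model standing in for DreamerV3's learned recurrent world model. It's only meant to show why imagined rollouts improve sample efficiency, not how DreamerV3 is actually implemented.

```python
# Toy sketch of model-based data augmentation: store real transitions in a
# replay buffer, fit a one-step dynamics model, then generate "dreamed"
# rollouts from it so the agent needs fewer real-world samples.
import numpy as np

rng = np.random.default_rng(0)
buffer = []                         # real (state, action, next_state) transitions

def true_dynamics(s, a):            # the physical world (unknown to the agent)
    return 0.9 * s + 0.5 * a + rng.normal(scale=0.01, size=s.shape)

# 1. Collect a little real experience.
s = rng.normal(size=4)
for _ in range(200):
    a = rng.normal(size=4)
    s_next = true_dynamics(s, a)
    buffer.append((s, a, s_next))
    s = s_next

# 2. Fit a simple world model (least squares mapping [state, action] -> next state).
X = np.array([np.concatenate([s, a]) for s, a, _ in buffer])
Y = np.array([s_next for _, _, s_next in buffer])
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# 3. "Dream": roll the learned model forward to create imagined transitions.
dreams = []
s = buffer[-1][2]
for _ in range(50):
    a = rng.normal(size=4)
    s = np.concatenate([s, a]) @ W
    dreams.append((a, s))
# The actor and critic would then be trained on these imagined rollouts too.
```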

#solidstatelife #ai #robotics #reinforcementlearning

https://www.youtube.com/watch?v=zQMKfuWZRdA

waynerad@diasp.org

"Factory: Fast contact for robotic assembly." Assembly, as they refer to it here, is things like peg insertion, electrical connector insertion, tightening of nuts and bolts ("threaded fastener mating"), wire processing, cable routing, soldering, etc.

An "essential, but highly challenging area of manufacturing." Highly challenging for robots, that is. It's physically complex and demands strict reliability requirements.

"The power of physics simulation has not substantially impacted robotic assembly. For assembly, a simulator must accurately and efficiently simulate contact-rich interactions, a longstanding challenge in robotics, particularly for geometrically-complex, tight-clearance bodies." By "tight-clearance", they mean, for example, the small amount of space between a nut and a bolt combined with the small amount of space between the threads of the nut and bolt.

"To simulate real-world motion phases (e.g., initial mating, rundown) and associated pathologies (e.g., cross-threading, jamming), collisions between the threads must be simulated. However, high-quality surface meshes for a nut-and-bolt may consist of 10k-50k triangles; a naive collision scheme may easily exceed memory and compute limits. Moreover, for reinforcement learning training, a numerical solver may need to satisfy non-penetration constraints for 1,000 environments in real-time (i.e., at the same rate as the underlying physical dynamics). Despite the omnipresence of threaded fasteners in the world, no existing simulator achieves this performance."

That is the goal here. Factory is a new set of physics simulation methods to achieve this.

Factory consists of 3 primary components: a physics simulation module, a robot learning suite, and proof-of-concept reinforcement learning policies.

They say their physics simulation achieves "fast, accurate simulations of contact-rich interactions through a novel synthesis of signed distance function (SDF)-based collisions, contact reduction, and a Gauss-Seidel solver."

The signed distance function is a mathematical function that determines how far a point is from a boundary, and which side of the surface the point is on. A Gauss-Seidel solver is a method for solving a system of linear equations, named after, yes, that Gauss, Carl Friedrich Gauss, and the Seidel is Philipp Ludwig von Seidel. Remember "SDF" because you're going to be seeing it a lot.
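
For a single sphere, the signed distance function can be written in one line, which gives a feel for why SDFs make collision queries cheap (an illustrative toy; Factory precomputes voxelized SDFs for arbitrary meshes):

```python
# Signed distance to a sphere: negative inside, zero on the surface, positive outside.
# A collision check against the sphere is then just "is sdf(point) < 0?".
import numpy as np

def sphere_sdf(point, center, radius):
    return np.linalg.norm(np.asarray(point) - np.asarray(center)) - radius

print(sphere_sdf([0.0, 0.0, 0.0], center=[0.0, 0.0, 2.0], radius=1.0))   #  1.0 (outside)
print(sphere_sdf([0.0, 0.0, 1.5], center=[0.0, 0.0, 2.0], radius=1.0))   # -0.5 (inside)
```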

"The module is accessible within the PhysX physics engine and Isaac Gym. We demonstrate simulator performance on a wide range of challenging scenes. As an example, we simulate 1,000 simultaneous nut-and-bolt assemblies in real-time on a single GPU, whereas the prior state-of-the-art was a single nut-and-bolt assembly at 1/20 real-time."

The robot learning suite consists of "a Franka robot and all rigid-body assemblies from the NIST Assembly Task Board 1, the established benchmark for robotic assembly. The suite includes 60 carefully-designed assets, 3 robotic assembly environments, and 7 classical robot controllers. The suite is accessible within Isaac Gym. User-defined assets, environments, and controllers can be added and simulated as desired."

Proof-of-concept reinforcement learning policies are for "a simulated Franka robot to solve the most contact-rich task on the NIST board, nut-and-bolt assembly." Also in Isaac Gym. Presumably you could use the physics and robot learning assets to do your own reinforcement learning. But it's nice that they've given you some pre-trained "policies" (as they are called in the world of reinforcement learning.) (In the world of reinforcement learning, the word "policy", rather than "model", is used. More precisely, a neural network learns a "model", but a "policy" is a more general concept and can apply to learning algorithms that are not neural networks. I always tell people a "policy" corresponds to what we in normal life would call a "strategy" -- a method of deciding what action to take to win the game from any given situation. What strategy might you take to win a Go game or poker game? The reinforcement learning framework is general enough that any "reward" signal can be used. Here you get the reward and "win" the game by successfully assembling items for manufacturing.)

They say they have compared the contact forces generated from executing their policies and they are consistent with the real world.

That's a brief overview. Taking a closer look at the physics contact simulation, they give the following explanation for why they use voxel-based SDFs rather than the standard triangle-based SDFs: "Using SDFs for collisions requires precomputing SDFs offline from a mesh, which can be time- and memory-intensive. Moreover, collision schemes typically test the vertices of a trimesh against the SDF to generate contacts. For sharp objects, simply sampling vertices can cause penetration to occur, motivating iterative per-triangle contact generation. We use discrete, voxel-based SDFs as our geometric representation and demonstrate that they provide efficient, robust collision detection for challenging assets in robotic assembly."

The next technique they employ is contact reduction. They use 3 techniques from video games to reduce the number of contacts that have to be checked to see whether a collision between objects has occurred. Those techniques are called normal similarity, penetration depth, and an area-based metric.

Contact clustering, as the name implies, groups contacts into clusters and then reduces the number of contacts in each cluster to just a few that need to be checked. Normal similarity is a clustering technique that assigns surfaces with the same surface normal to the same bin. (A surface normal is a vector that points "straight up" from a point on the surface.)

Once binned into clusters, the penetration depth technique culls bins that have "negligible penetration."
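
A crude sketch of those two reduction steps (bin contacts by quantized surface normal, cull the ones with negligible penetration, keep the deepest contact per bin). The bin resolution and depth threshold are arbitrary numbers I picked, not the values Factory uses.

```python
# Toy contact reduction: (1) bin contacts by quantized surface normal,
# (2) drop contacts whose penetration depth is negligible,
# (3) keep one representative contact per bin (the deepest one).
import numpy as np

rng = np.random.default_rng(0)
contacts = []
for _ in range(1000):
    n = rng.normal(size=3)
    contacts.append({"normal": n / np.linalg.norm(n),
                     "depth": abs(rng.normal(scale=1e-3))})

def reduce_contacts(contacts, depth_threshold=5e-4, bins=4):
    clusters = {}
    for c in contacts:
        if c["depth"] < depth_threshold:          # negligible penetration: cull
            continue
        key = tuple(np.floor((c["normal"] + 1.0) * bins / 2.0).astype(int))
        best = clusters.get(key)
        if best is None or c["depth"] > best["depth"]:
            clusters[key] = c                     # keep deepest contact per normal bin
    return list(clusters.values())

reduced = reduce_contacts(contacts)
print(len(contacts), "->", len(reduced), "contacts")
```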

Ok, at this point, I'm not sure exactly why, but in addition to the SDF, a direct solver is required to actually detect all the collisions. They look at two options, the Jacobi solver, and the aforementioned Gauss-Seidel solver, which you already know, because it is aforementioned, is the one they selected. The Jacobi solver was the more efficient of the two on a large number of contact points, but, they discovered that using their contact reduction techniques, they could reduce the number of contact points to a sufficiently low number that the Gauss-Seidel solver was actually faster. For example, for nut-and-bolt assembly, they could reduce the number of contact points that needed to be checked from 16,000 to 300.

They describe how they tested the system on 1,024 parallel 4-mm peg-in-hole assemblies, 1,024 parallel M16 nut-and-bolt assemblies, 1,024 parallel VGA-style D-subminiature (D-sub) connectors, 1,024 parallel 2-stage gear assemblies, 1,024 M16 nuts, 1,024 bowls falling into a pile (not something you probably actually want to see in a real manufacturing plant, but it makes a cool demonstration video), 1,024 toruses falling into a pile, and 128 parallel Franka robot + M16 nut-and-bolt assemblies.

Moving on to the robot and environment assets, they lament how a set of computer-aided-design (CAD) models called NIST Task Board 1 are not good enough for high-accuracy physics simulation. "The models for the nuts, bolts, pegs, and gear assembly do not conform to real-world tolerances and clearances; in assembly, mating parts together with tight clearances is precisely the most significant challenge. Furthermore, the models for the electrical connectors were sourced from public repositories rather than manufacturers, were geometrically incompatible, were incomplete, and/or were designed using hand measurements." This motivated them to create their own CAD models for nuts, bolts, pegs, gearshafts, electrical connectors, etc.

In addition they provide 3 environments, with the Pythonic names "FrankaNutBoltEnv", "FrankaInsertionEnv", and "FrankaGearsEnv". As you might guess, all of these involve the Franka robot. The first is for training a Franka robot to do nut-and-bolt assemblies. The second is for insertion assemblies, which means things like USB plugs and sockets, RJ45 plugs and sockets, BNC plugs and sockets, D-sub plugs and sockets, etc. USB you're probably familiar with, RJ45 is the connector at the end of ethernet cables, BNC is a bayonet-style connector for coaxial cable (the kind used on video and lab equipment, not the screw-on connector your cable TV company uses), D-sub is the connector used for VGA cables, if you remember those, though there are variations on the plug used for other things. And the third is for training a Franka robot to assemble gear assemblies. It comes with a 4-part gear assembly.

Before we get to the reinforcement learning, we have to talk about controllers for a moment, because the actions available to the controller will determine the actions available to the reinforcement learning algorithm. The researchers looked around at what controllers were being used in the real world. They came up with the following list: Joint-space inverse differential kinematics (IK) motion controller, joint-space inverse dynamics (ID) controller, task-space impedance controller, operational-space (OSC) motion controller, open-loop force controller, closed-loop P force controller, and hybrid force-motion controller.

I didn't actually look at the mathematical formulations of these controllers. From the descriptions in the paper, it sounds like they vary in the way they incorporate gravity, inertia, and errors into their calculations for how much torque to apply to a robot joint.

Ok, now we get to the reinforcement learning. The approach they took was to train the reinforcement learning system on 3 subtasks, and then teach the system to combine the 3 subtasks into a sequence. The 3 subtasks are "pick", "place", and "screw". To train these, they used the nut & bolt environment. For "pick", the robot has to grasp a nut placed at any random location on a work surface. For "place", the robot has to place the nut on top of a bolt at a fixed location. For "screw", the robot has to screw down the nut, engaging the mating threads and tightening the appropriate amount until the nut is firmly in place at the base. These are all done with a 6-degrees-of-freedom Franka robot hand.

For "pick", a reward function was fashioned that is based on the distance between the robotic fingertips and the nut. Further reward was granted if the nut remained in the robot hand's grasp after lifting.

For "place", a reward function was fashioned that was based not just on the distance to the bolt, but a number of distances to a number of "keypoints", which also reward the robot for getting the nut in the right orientation.

For "screw", a reward function was fashioned that was based on keypoint distances, this time between the nut and the base of the bolt, to reward the robot for screwing it down, and also between the rest of the bolt and the nut, to make the tightening process more stable.

As a result, the robot was able to learn how to generate "the precise torques along the 7 arm joints to allow the high-inertia robot links to maintain appropriate posture of the gripper." This is not to say there weren't problems. "As a simplifying assumption, the joint limit of the end-effector was removed, allowing the Franka to avoid regrasping." Not something you could do in the real world. But...

"Nevertheless, training was replete with a diverse range of pathologies, including high-energy collision with the bolt shank, roll-pitch misalignment of the nut when first engaging the bolt threads, jamming of the nut during tightening, and precession of the gripper around the bolt during tightening, which induced slip between the gripper and nut."

To address these issues, the researchers embarked on a "systematic exploration of controllers/gains, observation/action spaces, and baseline rewards." "The highest performing agents consistently used an OSC motion controller with low proportional gains, an observation space consisting of pose and velocity of the gripper and nut, a 2-degrees-of-freedom action space (Z-translation and yaw), and a linear baseline reward."

In an effort to further speed things up, they put in a limit on the number of gradient updates to the policy and put in an early termination rule.

Anyway, combining the 3 subtasks in a sequence, the researchers said that they were able to achieve an end-to-end pick + place + screw success rate of 74.2%.

Throughout all of this, they recorded the contact forces involved. "Although the reward functions for the reinforcement learning agents never involved contact forces, the robots learned policies that generated forces in the middle of human ranges; the much higher variance of human forces was likely due to more diverse strategies adopted by humans."

All in all, a big step forward for robotics for manufacturing assembly.

Factory: Fast contact for robotic assembly

#solidstatelife #ai #robotics #manufacturing #reinforcementlearning

waynerad@pluspora.com

"Rapid motor adaptation for legged robots" (RMA). This work was inspired by the observation that animals can rapidly adapt to different soils, moving uphill or downhill, carrying loads, moving with rested or tired muscles, responding to injury, and so on, while robotics systems require immense training to adapt to any of these, which just can't be done on time scales of fractions of a second.

The way their solution works is that, instead of just having a "policy" -- the term that in the reinforcement learning field refers to the function that determines what action the agent will take in any given state -- they have a "policy" plus an "adaptation module". The "policy" is trained in simulation, except it is allowed to magically know information about its environment in the simulation, such as surface friction, the weight of its payload, and so on. The job of the "adaptation module" is to guess this environmental information, which they call the "extrinsics". Furthermore, these "extrinsics" are estimated based on the difference between what the robot joints are commanded to do vs what they actually do.

What's more, this "adaptation module" can also be trained in simulation. This is because the simulation can be constructed in such a way that the "policy" training procedure is allowed to know the "privileged" information about the "extrinsics" but the "adaptation module" is denied this knowledge and has to learn it from experience.

Of course all this depends on having a rich enough simulation environment that it encompasses everything the robot will encounter in the real world. The researchers created a fractal terrain generator that creates a wide variety of physical contexts for the robot to experience with wide variation in parameters like mass and friction.

Since this is a reinforcement learning system, you might be wondering what the all-important reward function is for the policy portion. The reward function rewards the robot for moving forward and penalizes it for jerky or inefficient motions. More specifically, it is rewarded for going close to 0.35 m/s, which was chosen as the maximum speed, and penalized for lateral movement of joints, rotation of joints, joint speed, tipping sideways, vertical acceleration, foot slippage, and overall expenditure of energy.

The training of the adaptation module works by giving it access to the robot's internal state and its recent actions, but not the actual environmental "extrinsics". Since in simulation the "ground truth" is known, the adaptation module can be trained using supervised learning instead of reinforcement learning. The full set of "extrinsics" are: friction, payload mass, center of mass, motor strength, position and velocity from the motor encoders, roll and pitch from the IMU sensor, and the foot contact indicators from the foot sensors.

In the real world, the robot used is the A1 robot from Unitree, which is simulated by a simulator called RaiSim. The robot's internal state consists of the joint positions (12 values), joint velocities (12 values), roll and pitch of the torso (2 values), and binary foot contact indicators (4 values), and the actions it has available are position controls for the 12 robot joints.

The neural network for the "policy" is a 3-layer fully connected network, while the neural network for the "adaptation module" is a 3-layer convolutional neural network.

For the results, well, just watch the videos.

RMA: rapid motor adaptation for legged robots

#solidstatelife #ai #robotics #quadrupeds #reinforcementlearning #simulation