#robotics

waynerad@diasp.org

OpenAI is apparently partnering with a robotics company, Figure, whose humanoid robot is called Figure 01. The robot seems to do a remarkable job of integrating language with vision and body movements. They say it works entirely with neural networks: no teleoperation, no hand-coded behavior, nothing scripted ahead of time.

"Can I have something to eat?"

"Sure thing."

"Great. Can you explain why you did what you just did while you pick up this trash?"

"On it! So I gave you the apple because it's the only edible item I could provide you with from the table."

"Great. So, based on the scene right now, where do you think the dishes in front of you go next?"

"The dishes on the table, like that plate and cup, are likely to go into the drying rack next."

"Great. Can you put them there?"

"Of course."

OpenAI's 'AGI Robot' Develops SHOCKING NEW ABILITIES | Sam Altman Gives Figure 01 Get a Brain - Wes Roth

#solidstatelife #ai #openai #figure01 #genai #llms #robotics

waynerad@diasp.org

5 robotics startups that caught the attention of Bill Gates.

Agility Robotics: A humanoid robot for logistics work.

Tevel: "Flying autonomous robots that can scan tree canopies and pick ripe apples and stone fruits around the clock, while simultaneously collecting comprehensive harvesting data."

Apptronik: "'General-purpose' humanoid bi-pedal robots".

RoMeLa: "A robot that can navigate rocky and unstable terrain, and retain its balance without falling over."

Field AI: "Developing AI software for other companies' robots that enables them to perceive their environments, navigate without GPS (on land, by water, or in the air), and even communicate with each other."

The start-ups making robots a reality | Bill Gates

#solidstatelife #startups #robotics

waynerad@diasp.org

CES 2024: "Looking into the future".

"They are putting AI in everything."

AI pillow, AI mattress, AI office chair, AI fridges, AI washers & dryers, AI smart lamps, AI grills, AI barbecue, AI cooking gear, AI pressure cookers, AI food processors, AI air fryers, AI stethoscopes, AI bird feeders, AI telescopes, AI backpacks, AI upscaling TVs, AI realtime language translator. Most require an internet connection to work.

Manufacturing and warehouse robots, delivery robots, lawn care robots, lawnmower robots, pool cleaning robots, robot bartender, robot barista, robots that cook stir fry and make you ice cream, robot front desk assistant, hospital robot, robot to throw tennis balls for your dog and feed your dog, robot that can roll around your house and project things on the wall.

Computer vision self-checkout without scanning barcodes, computer vision food tray scanner that tells you how many calories are in the food, whether it has any allergens, and other stuff to do with the food, computer vision for vehicles.

Augmented reality form factor that is just regular glasses, 3D video without glasses, VR roller coaster haptic suits.

A car with all 4 wheels able to move independently so it can rotate in place and move sideways for parallel parking.

A 1-person helicopter, no steering mechanism, autonomous, a car with drone propellers on the roof that can fold into the car.

Giant LED walls everywhere, a ride where people sit on a seat hanging from the ceiling while moving through a world that's all on a giant LED screen.

Transparent TVs -- possibly great for storefronts.

A water maker pulling moisture from the air, home beer making, an automated manicure system, a mouth mask that enables your friends to hear you in a video game but nobody can hear you in real life.

Crazy AI tech everywhere (the CES 2024 experience) - Matt Wolfe

#solidstatelife #ai #genai #robotics #virtualreality #augmentedreality #computervision #ces2024

waynerad@diasp.org

Mobile ALOHA: Your housekeeping robot. Opening the blinds, watering the plants, vacuuming, making coffee, cleaning up spilled milk, taking dishes out of the dishwasher, pouring ketchup, taking out the trash, putting laundry in the washing machine, taking clothes out of the dryer, putting sheets on the bed, putting the pillow in the pillowcase, hanging clothes in the closet, folding clothes and putting them in a drawer, and turning the light off. Oh, and it plays with the cat, too. The video is taking the internet by storm.

But, not so fast. The "robot" in this video is not autonomous, it's tele-operated. But, if you go to the website of the project (link below), you find many tasks the robot can do autonomously. So, what's going on?

Autonomous skills: Cook shrimp, wipe wine, call elevator, use cabinets, rinse pan, push chairs, high-five humans.

What's going on is that the tele-operation and the autonomy are related. What these researchers did was build a robot for training a robot to do things. So the tele-operation creates training data that then gets turned into a robot that can perform some tasks autonomously. The researchers noticed that "imitation learning" algorithms had been created, but there wasn't any affordable platform for creating the training data for those "imitation learning" algorithms. So they decided to make one. This is actually their second system. The first system, called "ALOHA", was a pair of robotic arms mounted on a tabletop. "ALOHA" stood for "A low-cost open-source hardware..." (system for bimanual teleoperation, but that would be ALOHASFBIT).

The problem with the tabletop mounting is that many household tasks combine hand movement with whole-body movement. For example, to open a cabinet, the robot needs to back up while opening the two cabinet doors by the two door handles. And of course it has to navigate to the cabinet in the first place. And if it's putting a pot in the cabinet, it has to put the pot down, open the cabinet, pick the pot up again, put it in the cabinet, and close the cabinet. Most household tasks are like this. So they got the idea for "Mobile ALOHA".

To go from "ALOHA" to "Mobile ALOHA", they took the previous tabletop ALOHA system and mounted it on a wheeled base: an AgileX Tracer AGV (automatic guided vehicle), from Trossen Robotics, to be precise. It is designed for indoor autonomous logistics and warehousing applications. It has two 150W brushless servo motors and can carry a payload of 100 kg (220 lbs) for 4 hours. It can move at about the same speed as a walking human: it can go 1.6 meters per second, and the average walking speed for a human is 1.4 meters per second. By adding extra weight low to the ground, the researchers found they could increase the "tip-over" stability, enabling Mobile ALOHA to, for example, get on an elevator where the elevator floor wasn't exactly level with the building floor.

With the robotic arms attached, the total weight is 75 kg (165 lbs), the robotic arms can extend 100 cm (40 inches) from the base, and they can lift 750 g (26 oz).

How then is the system tele-operated, and how, once the tele-operation generates data, is that data used to train the system?

Well, first of all the controller of the tele-operation system can't use their hands to control the base movement because their hands are controlling the robotic arms and grippers. So the researchers came up with a "tether" that attaches to the controller's waist, and that's how the robotic base is controlled.

How about the training? Well, the original ALOHA system represented the 14-degrees-of-freedom input from the robotic arms and grippers as a 14-dimensional vector. For Mobile ALOHA, that's extended with 2 more dimensions for the base. If you're wondering where the number 14 came from, the robotic arms are "6-degrees-of-freedom" robotic arms. More precisely, they're the 6-degrees-of-freedom ViperX 300 Robot Arm from Trossen Robotics. That's 6, and to get 7 they added the gripping force. There are 2 of these robotic arms, so multiply by 2 and you get 14. Add 2 more degrees of freedom for the base unit, and you're now up to 16 degrees of freedom.
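
If that's a bit abstract, here's a toy sketch (my own illustration, not the researchers' actual code; the function and field names are made up) of what packing one timestep of control into that 16-dimensional vector might look like:

```python
import numpy as np

# Hypothetical illustration of Mobile ALOHA's 16-DoF action vector:
# 2 arms x (6 joint angles + 1 gripper force) + 2 base velocities.
def pack_action(left_joints, left_grip, right_joints, right_grip,
                base_linear_vel, base_angular_vel):
    """Pack one timestep of control into a flat 16-dimensional vector."""
    assert len(left_joints) == 6 and len(right_joints) == 6
    return np.concatenate([
        left_joints,            # 6 DoF, left ViperX 300 arm
        [left_grip],            # 1 DoF, left gripper force
        right_joints,           # 6 DoF, right ViperX 300 arm
        [right_grip],           # 1 DoF, right gripper force
        [base_linear_vel,       # 1 DoF, mobile base forward speed
         base_angular_vel],     # 1 DoF, mobile base turning speed
    ])  # shape: (16,)

action = pack_action(np.zeros(6), 0.0, np.zeros(6), 0.0, 0.3, 0.0)
print(action.shape)  # (16,)
```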

To expedite the training, they first co-trained the system on data collected with the original 14-DoF ALOHA system. They called this step "Co-training with static ALOHA". They didn't come up with anything new for the training itself, although they had previously developed an algorithm for the original ALOHA system, and that's one of the three used here. That algorithm is called ACT, which stands for Action Chunking with Transformers. As you might guess from the "transformers" part of the name, the system uses transformers like the GPT models we're familiar with (remember, the "T" in "GPT" stands for "transformer" -- GPT stands for "generative pre-trained transformer"). The idea is to break action sequences into "chunks" and have the transformer model generate a whole chunk of upcoming actions at a time, in a manner analogous to how a large language model like GPT generates a sequence of tokens for text.
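
Here's a rough sketch of the action-chunking idea in my own made-up Python (not the actual ACT architecture; `policy` and `env` are hypothetical stand-ins): instead of predicting one action per timestep, the policy predicts a whole chunk of upcoming actions from the current observation and executes them before asking the model again.

```python
CHUNK_SIZE = 8      # number of future actions predicted per query (assumed value)
ACTION_DIM = 16     # Mobile ALOHA's 16-DoF action vector

def predict_chunk(policy, observation):
    """Stand-in for a transformer policy: maps one observation to a
    (CHUNK_SIZE, ACTION_DIM) block of upcoming actions."""
    return policy(observation)  # hypothetical callable

def run_episode(policy, env, max_steps=400):
    """Alternate between planning a short chunk and executing it."""
    obs = env.reset()
    for _ in range(max_steps // CHUNK_SIZE):
        chunk = predict_chunk(policy, obs)   # plan a short burst of actions
        for action in chunk:                 # execute the whole chunk
            obs, done = env.step(action)
            if done:
                return
```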

The other two algorithms are called Diffusion Policy, and VINN.

If the name "Diffusion Policy" makes you think of diffusion models like Stable Diffusion or DALL-E, you're on the right track. Except, unlike those models, which generate images, Diffusion Policy generates "policies", which in the parlance of reinforcement learning is the function that maps the state of the agent and environment to actions -- the word "strategy" would make more sense in colloquial, non-technical contexts. The idea is that you represent a visuomotor action strategy as a denoising process.
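
Here's a very rough sketch of the denoising idea (my own illustration, not the Diffusion Policy implementation; `noise_model` is a hypothetical learned network): start from random noise in action space and iteratively refine it, conditioned on the current observation.

```python
import numpy as np

def sample_action_sequence(noise_model, observation, horizon=16,
                           action_dim=16, num_steps=50):
    """Crude sketch of diffusion-style action sampling.
    `noise_model(actions, obs, t)` is a hypothetical learned network that
    predicts the noise present in `actions` at denoising step `t`."""
    actions = np.random.randn(horizon, action_dim)   # start from pure noise
    for t in reversed(range(num_steps)):
        predicted_noise = noise_model(actions, observation, t)
        actions = actions - predicted_noise / num_steps  # crude denoising step
    return actions   # a denoised sequence of future actions
```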

VINN stands for Visual Imitation through Nearest Neighbors. The basic idea is that an encoder network is trained on all the training data. Then, when it comes time to perform actions after training, the system encodes the current observation and uses a "nearest neighbor" search over the training data to find the demonstration frames closest to the current situation, and derives its action from them.
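
Here's a toy sketch of that nearest-neighbor lookup (my own illustration, not the actual VINN code; `encoder`, `demo_embeddings`, and `demo_actions` are assumed to come from the training phase):

```python
import numpy as np

def vinn_action(encoder, demo_embeddings, demo_actions, observation, k=5):
    """Encode the current observation, find the k most similar demonstration
    frames by distance in embedding space, and blend their recorded actions."""
    query = encoder(observation)                          # embed current frame
    dists = np.linalg.norm(demo_embeddings - query, axis=1)
    nearest = np.argsort(dists)[:k]                       # k closest demo frames
    weights = 1.0 / (dists[nearest] + 1e-6)               # closer frames count more
    weights /= weights.sum()
    return (demo_actions[nearest] * weights[:, None]).sum(axis=0)
```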

Mobile ALOHA: Your housekeeping robot

#solidstatelife #ai #genai #reinforcementlearning #imitationlearning #robotics

wazoox@diasp.eu

Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation

#robotics #IA

Imitation learning from human demonstrations has shown impressive performance in robotics. However, most results focus on table-top manipulation, lacking the mobility and dexterity necessary for generally useful tasks. In this work, we develop a system for imitating mobile manipulation tasks that are bimanual and require whole-body control. We first present Mobile ALOHA, a low-cost and whole-body teleoperation system for data collection. It augments the ALOHA system with a mobile base, and a whole-body teleoperation interface. Using data collected with Mobile ALOHA, we then perform supervised behavior cloning and find that co-training with existing static ALOHA datasets boosts performance on mobile manipulation tasks. With 50 demonstrations for each task, co-training can increase success rates by up to 90%, allowing Mobile ALOHA to autonomously complete complex mobile manipulation tasks such as sauteing and serving a piece of shrimp, opening a two-door wall cabinet to store heavy cooking pots, calling and entering an elevator, and lightly rinsing a used pan using a kitchen faucet.

https://mobile-aloha.github.io/

waynerad@diasp.org

Artificial Intelligence beating people in the physical world -- sort of. A labyrinth game is hooked up to two motors that act as "hands", a camera that acts as its "eyes", and a computer with a "model-based reinforcement learning" algorithm that acts as the "brain".

The key thing here is that the reinforcement learning algorithm practices in the physical world, not in simulation, just like humans. After 6 hours of practice, it outperforms humans. It found ways to 'cheat' by skipping certain parts of the maze and had to be explicitly instructed not to take any of those shortcuts.

The reinforcement learning algorithm incorporated is something called DreamerV3. It is an actor-critic system; it collects experience from the physical world, then "replays" that experience out of a replay buffer, then "augments" it with generated "dreams". This reduces the amount of external experience the system needs to learn. (In reinforcement learning parlance, it increases the "sample efficiency".)

DreamerV3 actually consists of 3 neural networks: the world model, the critic, and the actor. All three are trained separately without sharing parameters or gradients. The system contains additional machinery to dynamically adjust the balance of these 3 objectives without a human having to set "hyperparameters". The DreamerV3 system was originally demonstrated on Minecraft. This labyrinth-playing system built on it is called CyberRunner.
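
To give a feel for how those three networks fit together, here's a much-simplified sketch (my own, with hypothetical interfaces, not the DreamerV3 code): the world model learns from replayed real experience, and the actor and critic then learn from short "dreamed" rollouts of that world model.

```python
def train_step(world_model, actor, critic, replay_buffer,
               batch_size=16, dream_horizon=15):
    """Simplified sketch of a DreamerV3-style update (hypothetical interfaces)."""
    # 1. Learn the world model from real, replayed experience.
    real_batch = replay_buffer.sample(batch_size)
    world_model.update(real_batch)

    # 2. "Dream": roll the learned model forward from replayed starting states,
    #    choosing actions with the current actor.
    states = world_model.encode(real_batch)
    dreams = world_model.imagine(states, actor, horizon=dream_horizon)

    # 3. Train critic and actor on the imagined trajectories, which is what
    #    reduces the amount of real-world experience needed.
    critic.update(dreams)
    actor.update(dreams, critic)
```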

#solidstatelife #ai #robotics #reinforcementlearning

https://www.youtube.com/watch?v=zQMKfuWZRdA

waynerad@diasp.org

Anduril Industries announces Roadrunner and Roadrunner-M. This is an unmanned, vertical-take-off-and-landing (VTOL) aerial vehicle that looks kind of like a stubby rocket with big fins. It uses twin turbojet engines. It's billed as an "interceptor" for ground defense against aircraft. There's a Roadrunner-M high-explosive interceptor variant. (Don't know what the "M" stands for -- "munitions" maybe?) It has its own portable "hangar" that it launches from, so it can be launched from anywhere, controlled by one operator, and it can come back and land when it is done with its mission and be reused for the next mission.

Anduril Industries is a company founded by Palmer Luckey, creator of the Oculus Rift and founder of Oculus VR, which was acquired by Facebook in 2014. Apparently the premise of the company is that Silicon Valley tech companies don't want to sell technology to the military, so Anduril was created to step in and fill that void and provide technology to the military that other Silicon Valley companies are unwilling to sell. They already have a bunch of products which I had never heard of: ALTIUS, Anvil, Dive-LD, Ghost, Fury, Menace, Sentry, WISP, and Lattice Command & Control.

ALTIUS (stands for Agile-Launched, Tactically-Integrated Unmanned System) is a drone that can be launched from a helicopter, as well as from the ground or a naval ship, and can function as a loitering munition or perform other functions such as signals intelligence.

Anvil is a quadcopter designed to take out enemy drones. Basically by getting under them and whacking them from the bottom.

Dive-LD is an autonomous underwater vehicle (AUV). It's designed for underwater reconnaissance, seafloor mapping, anti-mine operations, and anti-submarine warfare.

Ghost is an unmanned aerial vehicle (UAV) that is not a quadcopter but uses a design more like traditional helicopters, and is designed for quiet flight. It's for reconnaissance missions. There's a new Ghost-X variant with upgraded propulsion, battery life, communications, multiple payload capability, and additional sensors.

Fury is a high-performance, high-end fighter jet, but unmanned. It can fly up to Mach 0.95 and can pull 9 Gs.

Menace is, eh, it looks like one of those containers that you see on container ships & trains. But what it is is a rapidly deployable "expeditionary command, control, communications, and computing (C4)" station.

Sentry is a family of sensor towers. Anduril promises "edge" AI can deliver highly accurate awareness and identification of aircraft and other "objects of interest".

WISP (stands for Wide-Area Infrared System for Persistent Surveillance) is a rotating, 360-degree infrared sensor that uses AI for threat detection and situational awareness.

Lattice Command & Control is their software system for integrating information and controlling aerial vehicles. "Lattice uses technologies like sensor fusion, computer vision, edge computing, and machine learning and artificial intelligence to detect, track, and classify every object of interest in an operator's vicinity." See below for more on Lattice.

Anduril unveils Roadrunner & Roadrunner-M - Anduril Industries

#solidstatelife #ai #uavs #robotics #militaryai

waynerad@diasp.org

An autonomous excavator built a six-metre-high, sixty-five-metre-long dry-stone wall. It picks up boulders, scans them, algorithmically determines the optimal placement for them, and places them.

The wall is part of "a digitally planned and autonomously excavated landscape and park."

"Using sensors, the excavator can autonomously draw a 3D map of the construction site and localise existing building blocks and stones for the wall's construction. Specifically designed tools and machine vision approaches enable the excavator to scan and grab large stones in its immediate environment. It can also register their approximate weight as well as their centre of gravity. An algorithm determines the best position for each stone, and the excavator then conducts the task itself by placing the stones in the desired location."

"Our geometric planning algorithm uses a combination of constrained registration and signed-distance-field classification to determine how these should be positioned toward the formation of stable and explicitly shaped structures."

The paper is paywalled, but I can tell you (because I was trying to figure out how a code CAD system called SummonScript works) that a signed distance field is a grid of voxels where each cell stores the distance from that cell to the closest point on the surface of an object. By convention, positive numbers represent 'outside' the object, while negative numbers represent 'inside' the object.
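
As a concrete toy example (mine, nothing to do with the ETH Zurich code), here's a signed distance field for a sphere stored on a voxel grid, negative inside and positive outside:

```python
import numpy as np

def sphere_sdf_grid(center, radius, grid_size=32, extent=1.0):
    """Build a voxel grid where each cell stores the signed distance to a
    sphere's surface: positive outside the sphere, negative inside."""
    axis = np.linspace(-extent, extent, grid_size)
    x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
    dist_to_center = np.sqrt((x - center[0])**2 +
                             (y - center[1])**2 +
                             (z - center[2])**2)
    return dist_to_center - radius   # the signed distance field

sdf = sphere_sdf_grid(center=(0.0, 0.0, 0.0), radius=0.5)
print(sdf[16, 16, 16])   # near the center: negative (inside the sphere)
print(sdf[0, 0, 0])      # at a corner of the grid: positive (outside)
```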

As for "constrained registration", I don't know why they call it 'registration', but the basic idea is that you put in two geometric objects, and the algorithm figures out what geometric transformations (translations, rotations, and scaling) turn the first object into something as close as possible to the second object. It's called 'constrained' because you can tack on additional constraints that you want the algorithm to satisfy. These could be angles that the algorithm is not allowed to change, or points that must remain aligned with other points. Since the research paper is paywalled, I can't give any more specifics of the algorithm here. Obviously one of the constraints here is that it can't do scaling, since the size of the stones can't change.
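
For a rough feel of the registration part, here's the standard SVD-based (Kabsch) method for finding the rotation and translation, with no scaling, that best aligns one set of points with another. This is my own sketch of plain rigid registration; the paper's constrained version presumably layers extra constraints on top of something like this.

```python
import numpy as np

def rigid_registration(source, target):
    """Find rotation R and translation t (no scaling) minimizing
    ||R @ p + t - q|| over corresponding points, via the Kabsch/SVD method.
    `source` and `target` are (N, 3) arrays of corresponding points."""
    src_centroid = source.mean(axis=0)
    tgt_centroid = target.mean(axis=0)
    H = (source - src_centroid).T @ (target - tgt_centroid)  # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_centroid - R @ src_centroid
    return R, t
```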

Autonomous excavator constructs a six-metre-high dry-stone wall | ETH Zurich

#solidstatelife #robotics #construction

waynerad@diasp.org

RT-2 (Robotics Transformer 2) is a robot that can follow text commands using an integrated large language model and vision model. The idea is to "directly train vision-language models designed for open-vocabulary visual question answering and visual dialogue to output low-level robot actions."

"Although such models are typically trained to produce natural language tokens, we can train them on robotic trajectories by tokenizing the actions into text tokens and creating 'multimodal sentences' that 'respond' to robotic instructions paired with camera observations by producing corresponding actions."

"Robotic policies derived from such vision-language models exhibit a range of remarkable capabilities, combining the physical motions learned from the robot data with the ability to interpret images and text learned from web data into a single model."

Examples of commands the robot is able to obey successfully are: "Put strawberry into the correct bowl", "Pick up the bag about to fall off the table", "Move the apple to Denver Nuggets" (Denver Nuggets placemat -- go Denver Nuggets!), "Pick robot" (pick up toy robot), "Place orange in matching bowl", "Move Redbull can to H" (what do you call these letters that you use to put text on houses?), "Move soccer ball to basketball", "Move banana to Germany" (flag placemats), "Move cup to the wine bottle", "Pick animal with different color" (one black, two yellow, it correctly picks up the black one), "Move Coke can to Taylor Swift" (portraits), "Move Coke can to X" (letters), "Move bag to Google" (bag with "Google" logo on it), "Move banana to the sum of two plus one" (pieces of paper with numbers on them), and "Pick land animal".

RT-2: Vision-language-action models

#solidstatelife #ai #robotics #llms