#videogames

bliter@diaspora-fr.org

#385 - #TheFlameWars: the #Amiga vs #Atari ST #war, glorified! - #Gunhed TV

https://www.youtube.com/watch?v=9GNFFA4IwK0

If you enjoy my nonsense, know that you can also read it!

The #Chroniques of #GunhedTV
Volume 1: https://amzn.to/3PaALsR
Volume 2: https://amzn.to/3PabkrJ
Volume 3: https://amzn.to/3Hf0pLn
Volume 4: https://amzn.to/3FAwz2o

Si la guerre Amiga vs #AtariST m'était contée ("If the Amiga vs Atari ST war were told to me")
https://amzn.to/3W0hT2g

Babes in #VideoGames
Volume 1: https://amzn.to/40WyXsl
Volume 2: https://amzn.to/3UMOLvI


The Flame Wars is an English-language book chronicling the saga of the Amiga vs Atari ST war. It is also a fine tribute to the #Motorola #68000 #processor.

#retrogaming #retrocomputer #commodore #histoire #story

waynerad@diasp.org

The Scalable, Instructable, Multiworld Agent (SIMA) from DeepMind plays video games for you. You tell it what you want to do in regular language, and it goes into 3D environments, including some provided by commercial video games, and carries out keyboard-and-mouse actions.

Before getting into how they did this, it might be worth citing some of the reasons they thought this was challenging: Video games can be open-ended, visually complex, and have hundreds of different objects. Video games are asynchronous -- no turn-taking as in chess or Go, or in many research environments, which stop and wait while the agent computes its next action. Each instance of a commercial video game needs its own GPU -- no running hundreds or thousands of actors per game per experiment, as has historically been done in reinforcement learning. AI agents see the same screen pixels that a human player gets -- no access to internal game state, rewards, or any other "privileged information". AI agents use the same keyboard-and-mouse controls that humans do -- no handcrafted action spaces or high-level APIs.

In addition to all those challenges, they demanded that their agents follow instructions in regular language, rather than simply pursue a high score in the game, and the agents were not allowed to use simplified grammars or command sets.

"Since the agent-environment interface is human compatible, it allows agents the potential to achieve anything that a human could, and allows direct imitation learning from human behavior."

"A key motivation of SIMA is the idea that learning language and learning about environments are mutually reinforcing. A variety of studies have found that even when language is not necessary for solving a task, learning language can help agents to learn generalizable representations and abstractions, or to learn more efficiently." "Conversely, richly grounded learning can also support language learning."

I figure you're all eager to know what the games were. They were: Goat Simulator 3 (you play the goat), Hydroneer (you run a mining operation and dig for gold), No Man's Sky (you explore a galaxy of procedurally-generated planets), Satisfactory (you attempt to build a space elevator on an alien planet), Teardown (you complete heists by solving puzzles), Valheim (you try to survive in a world of Norse mythology), and Wobbly Life (you complete jobs to earn money to buy your own house).

However, before the games, they trained SIMA in research environments. Those, which you have probably never heard of, are: Construction Lab (agents are challenged to build things from construction blocks), Playhouse (a procedurally-generated house), ProcTHOR (procedurally-generated rooms, such as offices and libraries), and WorldLab (an environment with better simulated physics).

The SIMA agent itself maps visual observations and language instructions to keyboard-and-mouse actions. But it does that in several stages. For input, it takes a language instruction from you, and the pixels of the screen.

The video and the language instruction both go through encoding layers before being input to a single, large, multi-modal transformer. The transformer doesn't output keyboard and mouse actions directly. Instead, it outputs a "state representation" that gets fed into a policy network, which turns the "state" into what in reinforcement learning parlance is called a "policy". A more intuitive everyday word might be "strategy": basically, a function that, given input from the environment, including the agent's state within the environment, outputs an action. Here, the actions are the same actions a human would take with mouse and keyboard.
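
To make that pipeline concrete, here is a minimal sketch of the flow in PyTorch. To be clear, every module, dimension, and the two-head action layout below are my own placeholders for illustration, not DeepMind's actual architecture:

```python
import torch
import torch.nn as nn

class SimaLikeAgent(nn.Module):
    """Toy sketch: pixels + instruction -> multi-modal transformer ->
    policy logits over keyboard/mouse actions. All sizes illustrative."""

    def __init__(self, d_model=512, vocab=32000, n_keys=64, n_mouse_bins=256):
        super().__init__()
        # Stand-in image encoder: small conv stack -> one d_model "image token".
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d_model),
        )
        # Stand-in text encoder: token ids -> d_model vectors.
        self.text_encoder = nn.Embedding(vocab, d_model)
        # The single multi-modal transformer over [text tokens; image token].
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        # Policy heads: which key to press, plus a discretized mouse movement.
        self.key_head = nn.Linear(d_model, n_keys)
        self.mouse_head = nn.Linear(d_model, n_mouse_bins)

    def forward(self, frames, instruction_ids):
        img = self.image_encoder(frames).unsqueeze(1)        # (B, 1, d)
        txt = self.text_encoder(instruction_ids)             # (B, T, d)
        state = self.transformer(torch.cat([txt, img], 1))   # "state representation"
        summary = state[:, -1]                                # read out at the image token
        return self.key_head(summary), self.mouse_head(summary)

agent = SimaLikeAgent()
frames = torch.rand(1, 3, 96, 96)                # one RGB game frame
instruction = torch.randint(0, 32000, (1, 7))    # a tokenized instruction
key_logits, mouse_logits = agent(frames, instruction)
```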

The multi-modal transformer was trained from scratch. They also used a fairly recent technique called Classifier-Free Guidance (CFG), borrowed from diffusion models, where it's used to "condition" the model on the text you, the user, typed in.
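
As I read it, the policy analogue of CFG is: run the policy twice per step, once with the instruction and once with a blank one, then push the action logits away from the language-free version. A hedged sketch reusing the toy agent above (the blank-instruction convention and the lam value are my assumptions):

```python
import torch

def cfg_action_logits(policy, frames, instruction_ids, lam=1.5):
    """Classifier-Free Guidance applied to a policy (illustrative).
    lam > 1 amplifies how strongly the language steers the actions;
    all-zero token ids stand in for the "no instruction" pass."""
    blank = torch.zeros_like(instruction_ids)
    key_cond, _ = policy(frames, instruction_ids)   # conditioned on language
    key_uncond, _ = policy(frames, blank)           # language-free baseline
    return key_uncond + lam * (key_cond - key_uncond)

guided = cfg_action_logits(agent, frames, instruction)
```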

Even in the research environments, it is hard to automate judging whether an agent completed its tasks. Instructions may be things like "make a pile of rocks to mark this spot" or "see if you can jump over this chasm", and the environment may not provide any signal indicating these have been fulfilled. There are some they can handle, though, like "move forward", "lift the green cube", and "use the knife to chop the carrots".
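
For the tractable cases, a checker can key off internal state that a research environment does expose to the evaluator (even though the agent itself never sees it). A sketch of what a "move forward" check might look like; the state fields, names, and threshold are all invented:

```python
import numpy as np

def check_move_forward(pos_before, pos_after, facing, min_dist=0.5):
    """Ground-truth check for "move forward": did the agent advance at
    least min_dist along the direction it was facing?"""
    displacement = np.asarray(pos_after, dtype=float) - np.asarray(pos_before, dtype=float)
    forward = np.asarray(facing, dtype=float)
    forward /= np.linalg.norm(forward)
    return float(displacement @ forward) >= min_dist

# e.g. agent at (0,0,0) facing +x ends at (1.2, 0, 0.1) -> True
print(check_move_forward((0, 0, 0), (1.2, 0, 0.1), (1, 0, 0)))
```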

For commercial video games, all the agent gets is pixels on the screen, just like a human player; it has no access to the game's internal state. The games generally don't allow game state to be saved and restored, something researchers like to have for reproducibility.

For the commercial games, they resorted to detecting on-screen text using OCR (optical character recognition). They did this in particular for two games, No Man's Sky and Valheim, "which both feature a significant amount of on-screen text."
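
As a rough illustration of the idea (pytesseract is my choice of OCR library here, not necessarily theirs, and the success phrases are invented):

```python
import pytesseract
from PIL import Image

# Invented examples of on-screen text that would signal task success.
SUCCESS_PHRASES = {
    "collect carbon": ["carbon received", "+carbon"],
    "chop down a tree": ["you got wood"],
}

def task_succeeded(screenshot_path: str, task: str) -> bool:
    """OCR the frame, then look for text that signals completion."""
    text = pytesseract.image_to_string(Image.open(screenshot_path)).lower()
    return any(phrase in text for phrase in SUCCESS_PHRASES.get(task, []))
```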

Why not just have people look, i.e. have humans judge whether the instructions were followed? It turns out humans were "the slowest and most expensive" option. They were, however, able to get judgments from humans who were experts at the particular game an agent was playing.

For automated judgment, there's a subtlety: if a scene contains a knife, a cutting board, and a carrot, the agent may ascertain the goal ("cut the carrot on the cutting board") without relying on the language instruction at all. This illustrates the need to differentiate between following a language instruction and merely inferring the task from "environmental affordances".

How'd SIMA do? It looks like its success rate got up to about 60% for Playhouse, but only about 30% for Valheim. That's the percentage of tasks completed. The ranking goes Playhouse, WorldLab, Satisfactory, Construction Lab, No Man's Sky, Goat Simulator 3, and Valheim.

"Note that humans would also find some of these tasks challenging, and thus human-level performance would not be 100%."

Grouped by "skill category", movement instructions ("stop", "move", "look") were the easiest, while food and resource gathering instructions ("eat", "cook", "collect", "harvest") were the hardest.

For No Man's Sky, they did a direct comparison with humans. Humans averaged 60%, while SIMA came in around 30%.

How long till the AIs can beat the humans?

A generalist AI agent for 3D virtual environments

#solidstatelife #ai #genai #llms #computervision #multimodal #videogames