Genie 2 is a new foundation "world model" from DeepMind, "capable of generating an endless variety of action-controllable, playable 3D environments for training and evaluating embodied agents. Based on a single prompt image, it can be played by a human or AI agent using keyboard and mouse inputs."
Apparently these generative models that you can interact with like video games have a name now: "world models".
"Until now, world models have largely been confined to modeling narrow domains. In Genie 1, we introduced an approach for generating a diverse array of 2D worlds. Today we introduce Genie 2, which represents a significant leap forward in generality. Genie 2 can generate a vast diversity of rich 3D worlds."
"Genie 2 responds intelligently to actions taken by pressing keys on a keyboard, identifying the character and moving it correctly. For example, our model has to figure out that arrow keys should move the robot and not the trees or clouds."
"We can generate diverse trajectories from the same starting frame, which means it is possible to simulate counterfactual experiences for training agents."
"Genie 2 is capable of remembering parts of the world that are no longer in view and then rendering them accurately when they become observable again."
"Genie 2 generates new plausible content on the fly and maintains a consistent world for up to a minute."
"Genie 2 can create different perspectives, such as first-person view, isometric views, or third person driving videos."
"Genie 2 learned to create complex 3D visual scenes."
"Genie 2 models various object interactions, such as bursting balloons, opening doors, and shooting barrels of explosives."
"Genie 2 models other agents" -- NPCs -- "and even complex interactions with them."
"Genie 2 models water effects."
"Genie 2 models smoke effects."
"Genie 2 models gravity."
"Genie 2 models point and directional lighting."
"Genie 2 models reflections, bloom and coloured lighting."
"Genie 2 can also be prompted with real world images, where we see that it can model grass blowing in the wind or water flowing in a river."
"Genie 2 makes it easy to rapidly prototype diverse interactive experiences."
"Thanks to Genie 2's out-of-distribution generalization capabilities, concept art and drawings can be turned into fully interactive environments."
"By using Genie 2 to quickly create rich and diverse environments for AI agents, our researchers can also generate evaluation tasks that agents have not seen during training."
"The Scalable Instructable Multiworld Agent (SIMA) is designed to complete tasks in a range of 3D game worlds by following natural-language instructions. Here we used Genie 2 to generate a 3D environment with two doors, a blue and a red one, and provided instructions to the SIMA agent to open each of them."
Towards the end of the blog post we get a few hints about how Genie 2 works internally.
"Genie 2 is an autoregressive latent diffusion model, trained on a large video dataset. After passing through an autoencoder, latent frames from the video are passed to a large transformer dynamics model, trained with a causal mask similar to that used by large language models."
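That causal mask is the same trick LLMs use over tokens, applied here over latent frames: frame t can only attend to frames 0..t. Here's a minimal NumPy sketch of the idea — the single head, toy identity Q/K/V projections, and shapes are my own illustrative assumptions, not Genie 2's actual architecture:

```python
import numpy as np

def causal_attention(latent_frames: np.ndarray) -> np.ndarray:
    """Single-head self-attention over a (T, d) sequence of latent
    frames, with a causal mask so frame t only attends to frames <= t.
    Identity Q/K/V projections keep the sketch minimal."""
    T, d = latent_frames.shape
    q = k = v = latent_frames                          # toy projections
    scores = q @ k.T / np.sqrt(d)                      # (T, T) logits
    future = np.triu(np.ones((T, T), dtype=bool), k=1) # True above diagonal
    scores[future] = -np.inf                           # block the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because of the mask, changing a later frame never changes the model's output for earlier frames — which is what makes frame-by-frame autoregressive generation possible at inference time.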
"At inference time, Genie 2 can be sampled in an autoregressive fashion, taking individual actions and past latent frames on a frame-by-frame basis. We use classifier-free guidance to improve action controllability."
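Classifier-free guidance here presumably means running the dynamics model both with and without the action conditioning and extrapolating between the two predictions. A rough sketch of how that sampling loop could look — the guidance scale, the null action, the crude fixed-step denoising loop, and the `denoise` callable are all my assumptions, not published details:

```python
import numpy as np

def cfg_step(denoise, z_noisy, action, past, w=3.0):
    """One classifier-free-guidance step: predict noise with and
    without the action, then push the estimate toward the
    action-conditioned one by guidance scale w (assumed value)."""
    eps_uncond = denoise(z_noisy, None, past)    # null action
    eps_cond = denoise(z_noisy, action, past)
    return eps_uncond + w * (eps_cond - eps_uncond)

def rollout(denoise, first_latent, actions, w=3.0, steps=4):
    """Autoregressive sampling sketch: each new latent frame is
    denoised from noise given the action and all past latents,
    then appended to the context for the next frame."""
    rng = np.random.default_rng(0)
    frames = [first_latent]
    for action in actions:
        z = rng.standard_normal(first_latent.shape)  # start from noise
        for _ in range(steps):                       # toy denoising loop
            z = z - 0.5 * cfg_step(denoise, z, action, frames, w)
        frames.append(z)
    return frames
```

With w = 1 the guided estimate reduces to the plain conditional prediction; w > 1 overweights the action conditioning, which is the "improve action controllability" part of the quote.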