So the claim being made now is that you can take any image -- a photo you've just taken on your phone, a sketch that your child (or you) just drew, or an image you generated using, say, Midjourney or DALL-E 3 -- and hand it to an AI model called Genie that will make the image "interactive". You can control the main character and the scene will change around it. A tortoise made of glass, or maybe a translucent jellyfish floating through a post-apocalyptic cityscape.

"I can't help but point out the speed with which many of us are now becoming accustomed to new announcements and how we're adjusting to them."

"OpenAI Sora model has been out for just over a week and here's a paper where we can imagine it being interactive."

The Genie model is a vision transformer (ViT) model. That means it incorporates the "attention mechanism" we call "transformers" in its neural circuitry. Being a ViT doesn't by itself mean it "tokenizes" video like the Sora model does, but it does that, too. It also uses a particular variation of the transformer called the "ST-transformer" that is supposed to be more efficient for video. They don't spell out what the "ST" stands for, but I'm guessing "spatial-temporal": it contains neural network layers that are dedicated to either spatial or temporal attention processing. This "ST" vision transformer was key to the creation of the video tokenizer: what they did was take a "spatial-only" tokenizer (something called VQ-VAE) and modify it to do "spatial-temporal" tokenization. (They call their tokenizer ST-ViViT.)
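To make that spatial/temporal split concrete, here's a rough sketch (my own toy code, not anything from the paper) of a transformer block that attends over the patches within each frame, then over time at each patch position, using off-the-shelf PyTorch attention layers. The efficiency win of factorizing attention this way is that you never pay for full attention over all frames-times-patches tokens at once.

```python
import torch
import torch.nn as nn

class STBlock(nn.Module):
    """Hypothetical spatial-temporal transformer block (illustrative only).

    Instead of full attention over all (time x space) tokens, it runs
    self-attention over the spatial tokens of each frame, then over the
    temporal sequence at each spatial position -- the efficiency trick
    the "ST" design is meant to capture.
    """
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, space, dim) -- one token per spatial patch per frame
        b, t, s, d = x.shape

        # Spatial attention: each frame's patches attend to each other.
        xs = self.norm1(x).reshape(b * t, s, d)
        x = x + self.spatial_attn(xs, xs, xs)[0].reshape(b, t, s, d)

        # Temporal attention: each patch position attends across frames.
        xt = self.norm2(x).permute(0, 2, 1, 3).reshape(b * s, t, d)
        x = x + self.temporal_attn(xt, xt, xt)[0].reshape(b, s, t, d).permute(0, 2, 1, 3)

        # Standard feed-forward sublayer.
        return x + self.ffn(self.norm3(x))
```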

(VQ-VAE, if you care to know, stands for "Vector Quantised Variational AutoEncoder". The term "autoencoder" means a combination of encoder and corresponding decoder. The "variational" part means the encoding in the middle is considered a latent "variable" that is designed to adhere to a predetermined statistical distribution. The "vector quantized" part means the vectors that come out are discrete, rather than continuous. I don't know how being discrete is advantageous in this context.)
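To make "vector quantized" concrete, here's a tiny numpy sketch of the quantization step: each continuous vector coming out of the encoder gets snapped to the nearest entry in a fixed codebook, and what you keep is that entry's integer index. (One commonly cited advantage of being discrete: it gives you a finite vocabulary of tokens a transformer can predict, the same way a language model predicts words.)

```python
import numpy as np

def vector_quantize(latents: np.ndarray, codebook: np.ndarray):
    """Snap each continuous latent vector to its nearest codebook entry.

    latents:  (n, d) continuous vectors from the encoder
    codebook: (k, d) learned embedding vectors
    Returns integer token ids and the quantized vectors.
    """
    # Squared distance from every latent to every codebook vector.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = dists.argmin(axis=1)          # discrete token per latent
    return ids, codebook[ids]           # quantized (discrete) representation

# Toy example: 4 latents, codebook of 8 entries, 16-dim vectors.
rng = np.random.default_rng(0)
ids, quantized = vector_quantize(rng.normal(size=(4, 16)),
                                 rng.normal(size=(8, 16)))
print(ids)  # e.g. [5 2 7 2] -- these integers are the "tokens"
```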

After this there are two more neural network models: one takes the original video frames, and the other takes the video tokens plus the output from the first model.

The first model is called the "latent action model". It takes video frames as input. Remember "latent" is just another word for "hidden". This is a neural network that is trained by watching videos all day. As it watches, it is challenged to predict later frames of video from the frames that came before. In the process, it is asked to generate some parameters that describe what is being predicted. These are called the "latent actions". The idea is that if you are given a video frame and the corresponding "latent actions", you can predict the next frames.
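Here's a toy sketch of that setup (my own simplification with flattened frames and MLPs, nothing like the real architecture): an encoder sees a frame and the frame after it and squeezes "what changed" into a small latent action, and a decoder has to reconstruct the later frame from the earlier frame plus that latent action. The only way the decoder can succeed is if the latent encodes the action taken between the two frames.

```python
import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    """Toy latent action model (illustrative stand-in, not the paper's)."""
    def __init__(self, frame_dim: int, action_dim: int = 8):
        super().__init__()
        # Encoder sees both frames and compresses "what changed".
        self.encoder = nn.Sequential(
            nn.Linear(2 * frame_dim, 256), nn.ReLU(), nn.Linear(256, action_dim))
        # Decoder must rebuild the later frame from the earlier frame + action.
        self.decoder = nn.Sequential(
            nn.Linear(frame_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, frame_dim))

    def forward(self, frame_t, frame_t1):
        latent_action = self.encoder(torch.cat([frame_t, frame_t1], dim=-1))
        predicted_t1 = self.decoder(torch.cat([frame_t, latent_action], dim=-1))
        return latent_action, predicted_t1

# One unsupervised training step on a batch of consecutive-frame pairs.
model = LatentActionModel(frame_dim=64 * 64)
frames_t = torch.randn(32, 64 * 64)    # stand-ins for real video frames
frames_t1 = torch.randn(32, 64 * 64)
_, pred = model(frames_t, frames_t1)
loss = nn.functional.mse_loss(pred, frames_t1)  # "predict the later frame"
loss.backward()
```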

The second model is called the "dynamics" model. It takes the video tokens and the "latent actions" from the first model, and outputs the video tokens for the frames that follow.
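Its interface looks roughly like this (again a toy stand-in, with a GRU where the real model has the ST-transformer): given the video tokens so far and a latent action at each step, produce a probability distribution over the next frame's tokens, just like a language model predicting the next word.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Toy dynamics model: predict the next frame's token ids.

    vocab_size is the size of the video tokenizer's codebook; each "step"
    below stands in for one frame's worth of tokens plus its latent action.
    """
    def __init__(self, vocab_size: int, action_dim: int = 8, dim: int = 256):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, dim)
        self.action_proj = nn.Linear(action_dim, dim)
        self.core = nn.GRU(dim, dim, batch_first=True)  # stand-in for the ST-transformer
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens, actions):
        # tokens:  (batch, steps)             integer video tokens
        # actions: (batch, steps, action_dim) latent actions from the first model
        x = self.token_embed(tokens) + self.action_proj(actions)
        hidden, _ = self.core(x)
        return self.head(hidden)              # logits over the next token at each step

model = DynamicsModel(vocab_size=1024)
logits = model(torch.randint(0, 1024, (2, 16)), torch.randn(2, 16, 8))
next_tokens = logits[:, -1].argmax(dim=-1)    # greedy pick of the next video token
```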

Once all these models are trained up -- the tokenizer, the latent action model, and the dynamics model -- you're ready to interact.

You put in a photo of a tortoise made of glass, and now you can control it like a video game character.

The image you input serves as the initial frame. It gets tokenized, and everything from that point onward happens on tokens. The system generates new video in a manner analogous to how a large language model generates new text by outputting text tokens. The key, though, is that by using the keyboard to initiate actions, you're inputting actions directly into the "latent actions" parameters. Doing so alters the video tokens that get generated, which alters all the video that comes after.
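Put together, the interactive loop looks something like this sketch, where every function name is a placeholder for one of the trained components above and the keyboard mapping is purely illustrative:

```python
def play(initial_image, read_key, tokenizer, action_codebook, dynamics, decoder, steps=100):
    """Turn a single image into controllable video, one frame at a time."""
    tokens = [tokenizer(initial_image)]        # the photo becomes the first frame's tokens
    for _ in range(steps):
        key = read_key()                        # e.g. "left", "right", "jump"
        action = action_codebook[key]           # keyboard key -> latent action
        next_tokens = dynamics(tokens, action)  # predict the next frame's video tokens
        tokens.append(next_tokens)
        yield decoder(next_tokens)              # decode tokens back into a displayable frame

# Tiny dry run with dummy components, just to show the data flow.
frames = play(
    initial_image="photo.png",
    read_key=lambda: "right",
    tokenizer=lambda img: [0, 1, 2],
    action_codebook={"left": -1.0, "right": 1.0, "jump": 2.0},
    dynamics=lambda toks, a: [t + 1 for t in toks[-1]],
    decoder=lambda toks: toks,
    steps=3,
)
print(list(frames))   # three "frames" of fake tokens
```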

The researchers trained it on videos of 2D platformer games.

The AI "Genie" is out + humanoid robotics step closer

#solidstatelife #ai #genai #computervision