#diffusionmodels

waynerad@diasp.org

Photorealistic AI-generated talking humans. "VLOGGER" is a system for generating video to match audio of a person talking. So you can make video of any arbitrary person saying any arbitrary thing. You just supply the audio (which could itself be AI-generated) and a still image of a person (which also could itself be AI-generated).

Most of the sample videos wouldn't play for me, but the ones in the top section did and seem pretty impressive. You have to "unmute" them to hear the audio and see that the video matches the audio.

They say the system works using a 2-step approach where the first step is to take just the audio signal and use a neural network to predict what facial expressions, gaze, gestures, pose, body language, etc., would be appropriate for that audio, and the second step is to combine the output of the first step with the image you provide to generate the video. Perhaps surprisingly (at least to me), both of these are done with diffusion networks. I would've expected the second step to be done with diffusion networks, but the first to be done with some sort of autoencoder network. But no, they say they used a diffusion network for that step, too.

So the first step is taking the audio signal and converting it to spectrograms. In parallel, the input image is fed into a "reference pose" network that analyses it to determine what the person looks like and what pose the rest of the system has to work with as a starting point.

These are fed into the "motion generation network". The output of this network is "residuals" that describe face and body positions. It generates one set of all these parameters for each frame that will be in the resulting video.

The result of the "motion generation network", along with the reference image and the pose of the person in the reference image, is then passed to the next stage, which is the temporal diffusion network that generates the video. A "temporal diffusion" network is a diffusion network that generates images, but it has been modified so that it maintains consistency from frame to frame, hence the "temporal" word tacked on to the name. In this case, the temporal diffusion network has undergone the additional step of being trained to handle the 3D motion "residual" parameters. Unlike previous non-diffusion-based image generators that simply stretched images in accordance with motion parameters, this network incorporates the "warping" parameters into the training of the neural network itself, resulting in much more realistic renditions of human faces stretching and moving.
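
Pulling the pieces described so far together, here is a rough sketch of the two-stage flow in Python. Every name in it is a placeholder of mine rather than anything from the paper; the two diffusion models are passed in as black-box callables because the point is the control flow, not the networks.

```python
def generate_talking_video(audio_features,    # (num_frames, n_audio_feats) spectrogram slices
                           reference_image,   # (H, W, 3) still image of the person
                           estimate_pose,     # callable: image -> pose/appearance parameters
                           sample_motion,     # stage-1 diffusion: (audio, pose) -> per-frame residuals
                           sample_frames):    # stage-2 temporal diffusion: -> video frames
    # Work out what the person looks like and how they are posed in the reference image.
    reference_pose = estimate_pose(reference_image)

    # Stage 1: diffusion over motion space, one set of face/body "residuals" per frame.
    motion_residuals = sample_motion(audio_features, reference_pose)

    # Stage 2: temporal diffusion over pixels, conditioned on the reference image,
    # its pose, and the per-frame motion residuals.
    frames = sample_frames(reference_image, reference_pose, motion_residuals)
    return frames  # (num_frames, H, W, 3)
```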

This neural network generates a fixed number of frames. To extend the video to any number of frames, they use a technique called "temporal outpainting": the system re-inputs the previously generated frames, minus one, and uses them as context to generate the next frame. In this manner they can generate a video of any length.
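
A hedged sketch of how such a "temporal outpainting" loop could look, assuming the generator produces a fixed window of frames and can be conditioned on frames it has already produced; the window size and the exact conditioning are my guesses, not the paper's values.

```python
def outpaint_video(generate_frames, total_frames, window=16):
    """Sketch of "temporal outpainting". `generate_frames(context)` is assumed to
    return `window` frames, the first len(context) of which reproduce the context
    frames it was given; the window size is illustrative, not the paper's value."""
    frames = list(generate_frames([]))                           # initial window, no prior context
    while len(frames) < total_frames:
        window_out = generate_frames(frames[-(window - 1):])     # re-feed previous frames, minus one
        frames.append(window_out[-1])                            # keep only the newly generated frame
    return frames[:total_frames]
```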

As a final step they incorporate an upscaler to increase the pixel resolution of the output.

VLOGGER: Multimodal diffusion for embodied avatar synthesis

#solidstatelife #ai #computervision #generativeai #diffusionmodels

waynerad@diasp.org

"Sora AI: When progress is a bad thing."

This guy did experiments where he asked people to pick which art was AI-generated and which was human-made. They couldn't tell the difference; almost nobody could.

To be sure, and "just to mess with people", he would tell people AI-generated art was made by humans and human art was made by AI, and ask them to tell him how they could tell. People would proceed to explain all the reasons why an AI-generated art piece was an amazing masterpiece clearly crafted by human hands, with emotions and feelings. And when shown art made by a human and told it was AI-generated, people would write out a paragraph describing to him all the reasons they could clearly tell it was generated by AI.

That's pretty interesting but actually not the point of this video. The point of the video is that AI art generators don't give people the same level of control they have over art they make themselves, yet the AI clearly has an understanding of, for example, what a road is and what a car is, and a basic grasp of physics and cause and effect.

He thinks we're very close to being able to take a storyboard and "shove it into the AI and it just comes up with the perfect 3D model based on the sketch, comes up with the skeletal mesh, comes up with the animations, it infers details of the house based on your terrible drawings, it manages the camera angles, creates the light sources, gives you access to all the key framing data and positions of each object within the scene, and with just a few tweaks you'd have a finished product. The ad would be done in like an hour or two, something that ..."

He's talking about the "Duck Tea" example in the video -- he made up a product called "Duck Tea" that doesn't exist and pondered what would be involved in making an ad for it.

"Would have taken weeks of planning and work, something that would have taken a full team a long time to finish, would take one guy one afternoon."

The solution: Vote for Michelle Obama because she will introduce Universal Basic Income?

Sora AI: When progress is a bad thing - KnowledgeHusk

#solidstatelife #ai #genai #diffusionmodels #computervision #aiethics

waynerad@diasp.org

Reaction video to OpenAI Sora, OpenAI's system for generating video from text.

I encountered the reaction video first; in fact, that's how I discovered Sora exists. But see below for the official announcement from OpenAI.

It's actually kind of interesting and amusing comparing the guesses in the reaction videos about how the system works with the way it actually works. People are guessing based on their knowledge of traditional computer graphics and 3D modeling. However...

The way Sora works is quite fascinating. We don't know the nitty-gritty details but OpenAI has described the system at a high level.

Basically it combines ideas from their image generation and large language model systems.

Their image generation systems, DALL-E 2 and DALL-E 3, are diffusion models. Their large language models, GPT-2, GPT-3, GPT-4, GPT-4-Vision, etc, are transformer models. (In fact "GPT" stands for "generative pretrained transformer").

I haven't seen diffusion and transformer models combined before.

Diffusion models work by having a set of parameters in what they call "latent space" that describe the "meaning" of the image. The word "latent" is another way of saying "hidden". The "latent space" parameters are "hidden" inside the model but they are created in such a way that the images and text descriptions are correlated, which is what makes it possible to type in a text prompt and get an image out. I've elsewhere given high-level hand-wavey descriptions of how the latent space parameters are turned into images through the diffusion process, and how the text and images are correlated (a training method called CLIP), so I won't repeat that here.

Large language models, on the other hand, work by turning words and word pieces into "tokens". The "tokens" are vectors constructed in such a way that the numerical values in the vectors are related to the underlying meaning of the words.
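
The text side is easy to see with an off-the-shelf tokenizer. The snippet below uses OpenAI's tiktoken library; the random embedding table at the end is a made-up stand-in for the learned lookup that turns token IDs into the vectors the model actually operates on.

```python
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")        # tokenizer used by recent OpenAI models
ids = enc.encode("Sora generates video from text")
print(ids)                                        # a short list of integer token IDs

# Inside the model, each ID indexes into a learned embedding table to produce a
# vector. Here the table is just random numbers, purely to show the shapes involved.
vocab_size, d_model = enc.n_vocab, 64             # d_model kept tiny for the demo
embedding_table = np.random.randn(vocab_size, d_model).astype(np.float32)
token_vectors = embedding_table[ids]              # shape: (num_tokens, d_model)
print(token_vectors.shape)
```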

To make a model that combines both of these ideas, they figured out a way of doing something analogous to "tokens" but for video. They call their video "tokens" "patches". So Sora works with visual "patches".

One way to think of "patches" is as video compression, both spatially and temporally. Unlike a video compression algorithm such as MPEG, which does this using pre-determined mathematical formulas (discrete cosine transforms and such), in this system the "compression" process is learned and is made entirely of neural networks.
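
OpenAI hasn't published the architecture details, but the basic idea of chopping a video into spacetime "patches" can be illustrated with plain array slicing. This ignores the learned compression entirely and the patch sizes are made up; it only shows what a patch is, dimensionally.

```python
import numpy as np

def extract_spacetime_patches(video, t_patch=4, h_patch=16, w_patch=16):
    """Toy illustration of spacetime patches: each patch is a small block of pixels
    spanning a few frames, flattened into a vector. In Sora the patches come out of
    a learned encoder, not raw slicing; the patch sizes here are made up."""
    T, H, W, C = video.shape
    patches = []
    for t in range(0, T - T % t_patch, t_patch):
        for y in range(0, H - H % h_patch, h_patch):
            for x in range(0, W - W % w_patch, w_patch):
                block = video[t:t+t_patch, y:y+h_patch, x:x+w_patch, :]
                patches.append(block.reshape(-1))
    return np.stack(patches)

video = np.random.rand(16, 128, 128, 3)           # 16 frames of 128x128 RGB
tokens = extract_spacetime_patches(video)
print(tokens.shape)                               # (256, 3072): 256 patches, 4*16*16*3 values each
```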

So with a large language model, you type in text and it outputs tokens which represent text, which are decoded to text for you. With Sora, you type in text and it outputs tokens, except here the tokens represent visual "patches", and the decoder turns the visual "patches" into pixels for you to view.

Because the "compression" works both ways, in addition to "decoding" patches to get pixels, you can also input pixels and "encode" them into patches. This enables Sora to input video and perform a wide range of video editing tasks. It can create perfectly looping video, it can animate static images (why no Mona Lisa examples, though?), it can extend videos, either forward or backward in time. Sora can gradually interpolate between two input videos, creating seamless transitions between videos with entirely different subjects and scene compositions. I found these to be the most freakishly fascinating examples on their page of sample videos.

They list the following "emerging simulation capabilities":

"3D consistency." "Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space."

This is where they have the scene everyone is reacting to in the reaction videos, where the couple is walking down the street in Japan with the cherry blossoms.

By the way, I was wondering what kind of name is "Sora" so I looked it up on behindthename.com. It says there are two Japanese kanji characters both pronounced "sora" and both of which mean "sky".

"Long-range coherence and object permanence." "For example, our model can persist people, animals and objects even when they are occluded or leave the frame. Likewise, it can generate multiple shots of the same character in a single sample, maintaining their appearance throughout the video."

"Interacting with the world." "Sora can sometimes simulate actions that affect the state of the world in simple ways. For example, a painter can leave new strokes along a canvas that persist over time, or a man can eat a burger and leave bite marks."

"Simulating digital worlds." "Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity."

However they say, "Sora currently exhibits numerous limitations as a simulator." "For example, it does not accurately model the physics of many basic interactions, like glass shattering."

This is incredible - ThePrimeTime

#solidstatelife #ai #genai #diffusionmodels #gpt #llms #computervision #videogeneration #openai

waynerad@diasp.org

MagicAnimate animates humans based on a reference image. See the Mona Lisa jogging or doing yoga.

The way it works is, well, first of all it uses a diffusion network to generate the video. Systems for generating video using GANs (generative adversarial networks) have also been developed. Diffusion networks, however, have recently shown themselves to be better at taking a human-entered text prompt and turning that into an image. The problem, though, is that if you want to make a video, you go frame by frame, and since each frame is independent of the others, that inevitably leads to flickering.

The key insight here is that instead of doing the "diffusion" process frame-by-frame, you do it on the entire video all at once. This enables "temporal consistency" across frames. A couple more elements are necessary to get the whole system to work, though.

One is discarding the normal way diffusion networks use an internal encoding that is tied to a text prompt. In this system, since a reference image is provided instead, there is no text prompt. So the whole system is trained to use an internal encoding that is based on appearance. This enables the system to maintain the appearance of the original video for both the human being animated and the background.

The other key piece that gets the system to work is incorporating a prior system called ControlNet. ControlNet analyzes the provided pose and converts it into a motion signal, which is a dense set of body "keypoints". The first stage of the process analyzes these control points; the second stage then performs joint diffusion of the control points and the reference image.

If you're wondering how the system manages to hold the entire video in memory to do the diffusion process on the whole thing at once, the answer is that it actually doesn't. Because they needed to get the system to work on GPUs with limited memory, the researchers devised a "sliding window" system where it generates overlapping segments of video. The frames are close enough that the overlapping segments can be combined with simple averaging and the end result looks okay.
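
A rough sketch of that sliding-window idea, assuming a denoiser that handles a fixed window of frames. The window length, stride, and the simple overlap averaging are illustrative guesses on my part, not the paper's exact settings.

```python
import numpy as np

def denoise_long_video(denoise_window, noisy_video, window=16, stride=8):
    """Sketch of sliding-window denoising with overlap averaging.

    denoise_window(segment) -> denoised segment of the same shape.
    Overlapping windows are denoised independently and blended by simple
    averaging, which works because neighbouring windows mostly agree."""
    T = noisy_video.shape[0]
    out = np.zeros_like(noisy_video)
    counts = np.zeros((T,) + (1,) * (noisy_video.ndim - 1))   # per-frame overlap counts
    for start in range(0, max(T - window, 0) + 1, stride):
        seg = noisy_video[start:start + window]
        out[start:start + window] += denoise_window(seg)
        counts[start:start + window] += 1
    return out / np.maximum(counts, 1)
```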

Speaking of the researchers, this was a joint team between ByteDance and the National University of Singapore. ByteDance as in, the maker of TikTok. Application of this to TikTok is obvious.

#solidstatelife #ai #genai #diffusionmodels #videoai

https://showlab.github.io/magicanimate/

waynerad@diasp.org

Visual Electric: "A breakthrough interface for generative AI".

"Visual Electric is the first image generator designed for creatives -- a canvas that facilitates the flow of ideas, so you can truly spread out and see where the tool takes you. It's designed to help give form to the vision in your mind's eye... or lead you to something even better."

"We believe that in order to be truly useful, AI needs to augment our existing creative process with all its winding paths, switchbacks and u-turns. This requires tools that embrace ambiguity over certainty. Often our best new ideas appear in the margins."

Commercial product with a free tier.

The interface has such features as autosuggest for your text prompts, "remix" tools that let you change colors and styles while keeping everything else in the image the same, an assortment of "hand-crafted" styles, the ability to do "inpainting" while keeping the rest of the image the same, the ability to create variations that change the "temperature" of the generative network, the ability to upscale the image and make other touch-ups, and they claim to have trained the whole thing on their own "library of stunning images".

Visual Electric

#solidstatelife #ai #genai #diffusionmodels

waynerad@diasp.org

Stable Video from Stability AI, the same company that made Stable Diffusion, has been released. Károly Zsolnai-Fehér of Two Minute Papers does a quick run-down, comparing it with existing systems like Runway, Emu Video, and Imagen Video. Stable Video was trained on 600 million videos.

Imperfections: The videos have to be short. Sometimes instead of real animation, you get camera panning. If you want text in your video, it will have trouble. It requires a lot of GPU memory to run. It can't do iterative edits, which Emu Video can do.

On the plus side, Stable Video is completely open source.

Stable Video AI watched 600,000,000 videos - Two Minute Papers

#solidstatelife #ai #genai #diffusionmodels

waynerad@diasp.org

Optical illusions created with diffusion models. Images that change appearance when flipped or rotated. Actually these researchers created a general-purpose system for making optical illusions for a variety of transformations. They've named their optical illusions "visual anagrams".

Now, I know I told you all I would write up an explanation of how diffusion models work, and I've not yet done that. There's a lot of advanced math that goes into them.

The key thing to understand here about diffusion models is that they work by taking an image and adding Gaussian noise... in reverse. You start with random noise, and then you "de-noise" the image step by step. And you "de-noise" it in the direction of a text prompt.

The way this process works is, you feed in the image and the text prompt, and what the neural network computes is the "noise". Crucially, this "noise" computation isn't a single number, it's a pixel-by-pixel noise estimate -- essentially another image. And the estimate is conditioned on the text prompt: the network is, in effect, guessing what would have to be removed for the image to look like a clean image that matches the prompt. Amazingly enough, using this "error" to "correct" the image and then iterating on the process guides it into an image that fits the text prompt.
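
Here is that "estimate the noise, remove a little of it, repeat" loop in schematic form. The noise-predicting network is a black box passed in as a callable, and the update rule is a deliberately simplified stand-in for a real DDPM/DDIM sampler, which uses carefully derived coefficients rather than the crude ones below.

```python
import numpy as np

def sample_image(predict_noise, text_embedding, shape=(64, 64, 3), steps=50):
    """Schematic reverse-diffusion loop (not a faithful DDPM/DDIM sampler).

    predict_noise(x, t, text_embedding) -> per-pixel noise estimate, same shape as x."""
    x = np.random.randn(*shape)                      # start from pure Gaussian noise
    for t in reversed(range(steps)):
        eps = predict_noise(x, t, text_embedding)    # "what part of x looks like noise?"
        x = x - (1.0 / steps) * eps                  # remove a little of the estimated noise
        if t > 0:
            x = x + 0.01 * np.random.randn(*shape)   # a dash of fresh noise keeps sampling stochastic
    return x
```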

The trick they've done here is, they first take the image and compute the "noise" on it the normal way. Then they take the image and put it through its transformation -- rotation, vertical flipping, or puzzle-piece-like rearrangement (rotation, reflection, and translation), then compute the "noise" on that image (using a different text prompt!) and then they do the reverse transformation on the "noise" image. They then combine the original "noise" and the reverse transformation "noise" by simple averaging.
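
That trick drops neatly into the loop sketched above. Below is a hedged sketch of one denoising step for a flip illusion: predict_noise is the same black-box denoiser, and the two prompts are the two "readings" of the illusion. The actual paper works inside DeepFloyd IF's sampler, so treat this as the idea rather than the implementation.

```python
import numpy as np

def anagram_step(x, t, predict_noise, prompt_a, prompt_b):
    """One denoising step for a flip illusion (sketch, not the paper's code).

    View A: the image as-is, guided by prompt_a.
    View B: the image flipped upside down, guided by prompt_b.
    The flipped view's noise estimate is flipped back so both estimates live in
    the same orientation, then the two are combined by simple averaging."""
    eps_a = predict_noise(x, t, prompt_a)             # noise estimate for the upright reading
    eps_b = predict_noise(np.flipud(x), t, prompt_b)  # noise estimate for the flipped reading
    eps_b = np.flipud(eps_b)                          # undo the transformation
    return 0.5 * (eps_a + eps_b)
```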

This only works for certain transformations. Basically the two conditions the transformation has to satisfy are "linearity" and "statistical consistency". By linearity, they mean diffusion models fundamentally think in terms of "signal + noise" as a linear combination. If your transformation breaks this assumption, your transformation won't work. By "statistical consistency" they mean diffusion networks assume the "noise" is Gaussian, meaning it follows a Gaussian distribution. If your transformation breaks this assumption, it won't work.

These assumptions hold for the 3 transformations I've mentioned so far: rotation, reflection, and translation. It also works for one more: color inversion. Like a photographic negative. The color values have to be kept centered on 0, though. Their examples are only black-and-white.

Another thing they had to do was use a different diffusion model, because Stable Diffusion has "latent space" values that refer to groups of pixels rather than to individual pixels. They used an alternative called DeepFloyd IF, which does its diffusion in pixel space, so the "latent space" values are per-pixel. As far as I can tell, the distinction matters because the transformations (rotations, flips, rearrangements) are defined on pixels, and they only carry over cleanly to the noise estimates when those estimates are themselves per-pixel rather than living in a compressed latent space.

Another thing is that the system also incorporated "negative prompting" in its "noise" estimate, but they discovered you have to be very careful with "negative prompting". Negative prompts tell the system what it must leave out rather than include in the image. An example that illustrates the problem: if one prompt is "oil painting of a dog" and the other is "oil painting of a cat", they both contain "oil painting", so you're telling the system to both include and exclude "oil painting".

The website has lots of animated examples; check it out.

Visual anagrams: Generating multi-view optical illusions with diffusion models

#solidstatelife #ai #genai #diffusionmodels #opticalillusions

waynerad@diasp.org

AI models form "mental models" of the world. An AI trained to play a board game called Othello was opened up and found to be forming a mental model of the board. The AI system here was actually a language model. It was given valid games as training data, and its job was to output a word that would represent a move in the game -- except the AI system was never told there was a game. It was simply given training data that, to it, looked like sequences of words, and it output words. Without knowing that what it was dealing with was a board game, you would expect it to develop a statistical model of likely good moves. But surprisingly, when the researchers opened up the box and looked at what was actually happening inside the language model, they found it was creating a representation of the board, keeping track of whose pieces are in which positions.
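
The usual way researchers "open up the box" is with probes: small classifiers trained to read some property (here, the state of a board square) out of the network's internal activations. The sketch below shows the general idea with scikit-learn on made-up data; it is not the Othello study's actual setup, which extracts real activations from the trained game-playing model and uses somewhat more elaborate probes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative probe: can a simple classifier read the state of one board square
# out of the model's hidden activations? Shapes and data here are made up.
n_examples, hidden_dim = 10_000, 512
activations = np.random.randn(n_examples, hidden_dim)      # stand-in for real hidden states
square_state = np.random.randint(0, 3, size=n_examples)    # 0=empty, 1=mine, 2=theirs (stand-in labels)

probe = LogisticRegression(max_iter=1000).fit(activations, square_state)
print("probe accuracy:", probe.score(activations, square_state))
# With real activations, accuracy well above chance is evidence the board state
# is encoded somewhere in the hidden representation.
```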

In another study, diffusion models (the kind of models that create images, such as DALL-E 2, Midjourney, and Stable Diffusion) were opened up, and it was found that they form 3D depth models of images. In other words, even though diffusion models are trained only on 2-dimensional images, they form "mental models" that are 3-dimensional. Early on in the process of generating an image, they conceive of how the objects in the scene are related in 3 dimensions.

I previously told you all about an interview in which Ilya Sutskever, leader of the research team that created GPT-4, said he believes large language models have real understanding. He said people say these models just learn statistical regularities and have no model of the world, but he disagrees, and says the best way to predict what words will come next is to genuinely understand what is being talked about. So asking models to predict what word comes next is a far bigger deal than meets the eye. To predict, you need to understand the true underlying process that produced the data. Even though language models only see the world through the shadow of text as expressed by human beings on the internet, they are able to develop a shocking degree of understanding.

That was back in April. Now we are starting to see evidence, from looking inside, that language models do indeed form "mental models" of the world they are predicting. Not just language models but diffusion models as well. This may be a general feature of generative models and maybe we will find it more and more.

Beyond surface statistics - AI secretly builds visual models of the world - Wes Roth

#solidstatelife #ai #genai #llms #diffusionmodels #othello

waynerad@diasp.org

"A woman with flowers in her hair in a courtyard, in the style of ..." and then you can pick from 1,590 artists. Aditya Shankar was wondering how Stable Diffusion would draw what would otherwise be the exact same prompt except you can see how 1,500+ artists would have drawn it.

I put a prompt in stable diffusion to see how 1500+ artists would have drawn it

#solidstatelife #ai #generativemodels #diffusionmodels #stablediffusion

waynerad@diasp.org

Make-A-Video, a new AI system from Facebook, "lets people turn text prompts into brief, high-quality video clips."

"The system learns what the world looks like from paired text-image data and how the world moves from video footage with no associated text."

The way the system works is, they first started with a text-to-image system, which is in fact a diffusion network (like DALL-E 2, Midjourney, Imagen, and Stable Diffusion). They extended the system by adding layers: convolutional layers in the part of the system that does image processing with convolutions, and attention layers in the part that does text processing with attention. The additional convolutional layers are actually 1D convolution layers, and the additional attention layers aren't full attention layers either, but use an approximation that requires less computing power. These additions are called "pseudo-convolutional" layers and "pseudo-attention" layers. What they do is take, in addition to the spatial (space-based) information from the existing layers, a "time step" input, so they carry temporal (time-based) information as well. So these layers link together spatial and temporal information. Evidently, doing them as full convolutional and full attention layers would consume too much computing power.
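
As a concrete (and hedged) illustration of the convolution half of that idea, here is what a "pseudo-3D" convolution block can look like in PyTorch: a standard 2D convolution applied to each frame, followed by a 1D convolution along the time axis at each spatial location, instead of a full and far more expensive 3D convolution. The layer sizes, and the identity initialization of the temporal convolution (a common trick so the block initially behaves like the pretrained image model), are my choices for the sketch, not details taken from the paper.

```python
import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    """Factorized space+time convolution (sketch): a 2D conv over each frame,
    then a 1D conv across frames at each spatial location."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.temporal = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        nn.init.dirac_(self.temporal.weight)   # identity init: temporal conv starts as a no-op
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x):                      # x: (batch, channels, time, height, width)
        b, c, t, h, w = x.shape
        y = self.spatial(x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))              # per-frame 2D conv
        y = y.reshape(b, t, c, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        y = self.temporal(y)                                                             # per-pixel 1D conv over time
        return y.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)

x = torch.randn(2, 8, 16, 32, 32)              # (batch, channels, frames, H, W)
print(Pseudo3DConv(8)(x).shape)                # torch.Size([2, 8, 16, 32, 32])
```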

With the original layers trained to do spatial text-to-image generation, the new layers are trained on videos to learn how things move in video. So it learns how ocean waves move, how elephants move, and so on. This training involves an additional frame rate conditioning parameter.

At this point we're almost done. The last step is upscaling, which they call super-resolution. This increases the output to the full resolution. But they don't just do this spatially, they do it temporally as well. So just as a neural network has to imagine what should go into the newly created pixels when you increase the resolution of an image, this system has a neural network that imagines what pixels should go into newly created video frames, to maintain smoothness between frames.
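
For the temporal half of that super-resolution, the simplest possible stand-in is naive frame interpolation: inventing an in-between frame from its neighbors. The real system uses a trained network to imagine the missing frames; the linear blend below only shows where in the pipeline such a network sits.

```python
import numpy as np

def naive_temporal_upsample(frames):
    """Double the frame rate by inserting a blended frame between each pair.
    frames: (T, H, W, C). A trained frame-interpolation network would replace
    the simple average here; this only shows the shape of the operation."""
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        out.append(a)
        out.append(0.5 * (a + b))          # stand-in for a learned in-between frame
    out.append(frames[-1])
    return np.stack(out)

video = np.random.rand(8, 64, 64, 3)
print(naive_temporal_upsample(video).shape)    # (15, 64, 64, 3)
```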

Introducing Make-A-Video: An AI system that generates videos from text

#solidstatelife #ai #generativenetworks #diffusionmodels

waynerad@diasp.org

In addition to doing text-to-image, apparently Stable Diffusion can also do image-to-image. Here's a little collection of MSPaint sketches to concept art.
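
If you want to try this yourself, image-to-image is exposed in the Hugging Face diffusers library roughly as follows. The model ID, prompt, strength, and file names are just typical illustrative choices (and you need a GPU plus the downloaded weights):

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Load a Stable Diffusion img2img pipeline (downloads weights on first run).
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

sketch = Image.open("mspaint_sketch.png").convert("RGB").resize((512, 512))

# strength controls how far the model may wander from the sketch:
# low values stay close to the input, high values take more liberties.
result = pipe(prompt="concept art of a castle on a cliff, dramatic lighting",
              image=sketch, strength=0.75, guidance_scale=7.5).images[0]
result.save("concept_art.png")
```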

Reddit is going crazy turning MSPaint sketches into concept art using Img2img with #stablediffusion and it’s wild (1/n)

#solidstatelife #ai #computervision #generativeai #diffusionmodels

waynerad@diasp.org

The official Stable Diffusion launch announcement, with links to source code, the full model (via HuggingFace), two key research papers on the fundamentals of diffusion models, and the training dataset, which is 5.85 billion CLIP-filtered image-text pairs, 14x bigger than the previous largest dataset.

"Stable Diffusion is a text-to-image model that will empower billions of people to create stunning art within seconds. It is a breakthrough in speed and quality meaning that it can run on consumer GPUs."

"Stable Diffusion runs on under 10 GB of VRAM on consumer GPUs, generating images at 512x512 pixels in a few seconds. This will allow both researchers and soon the public to run this under a range of conditions, democratizing image generation. We look forward to the open ecosystem that will emerge around this and further models to truly explore the boundaries of latent space."

Stable Diffusion launch announcement

#solidstatelife #ai #computervision #generativeai #diffusionmodels