#generativenetworks

waynerad@diasp.org

Make-A-Video, a new AI system from Meta (Facebook), "lets people turn text prompts into brief, high-quality video clips."

"The system learns what the world looks like from paired text-image data and how the world moves from video footage with no associated text."

The way the system works is, they start with a text-to-image system, which is a diffusion network (like DALL-E 2, Midjourney, Imagen, and Stable Diffusion). They extend it by adding layers: new convolutional layers alongside the existing convolutional layers (which handle the image processing), and new attention layers alongside the existing attention layers (which handle the text conditioning). The added convolutional layers are actually 1D convolutions over the time axis, and the added attention layers aren't full attention layers either, but use an approximation that requires less computing power. The paper calls them "pseudo-3D convolutional" layers and "pseudo-3D attention" layers. What these layers do is, in addition to the spatial (space-based) information handled by the existing layers, they operate across video frames and so capture temporal (time-based) information as well. So these layers link spatial and temporal information together, and evidently doing them as full 3D convolutional and full spatio-temporal attention layers would consume too much computing power.
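To make the "pseudo-3D convolution" idea concrete, here is a minimal sketch (my own illustration, not the authors' code; the class name is made up): a pretrained 2D convolution runs over each frame separately, and a new 1D convolution runs over the time axis at each pixel location, which is far cheaper than a full 3D convolution.

```python
# Rough sketch (not the authors' code) of the "pseudo-3D" idea in PyTorch:
# a 2D spatial convolution over each frame, followed by a 1D convolution
# over the time axis, instead of a full (much more expensive) 3D convolution.
import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)   # pretrained text-to-image layer
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)  # new layer, trained on video

    def forward(self, x):
        # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape
        # spatial conv applied to every frame independently
        y = self.spatial(x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))
        y = y.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
        # temporal conv applied at every pixel location independently
        z = y.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, t)
        z = self.temporal(z)
        return z.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)
```

The pseudo-3D attention layers follow the same pattern: the existing spatial attention runs within each frame, and a new, cheaper attention step runs across frames.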

With the original layers already trained to do spatial text-to-image generation, the new layers are trained on video footage to learn how things move: how ocean waves move, how elephants move, and so on. This training also conditions on the frame rate, which gives the model a handle on how much motion to put between frames.
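As a rough guess at what frame-rate conditioning could look like in code (my sketch; the paper describes the actual conditioning scheme): the fps value is turned into a learned embedding and added to the network's existing conditioning vector.

```python
# Illustrative sketch of frame-rate conditioning (not the authors' code):
# the fps value is mapped to a learned embedding and added to the
# conditioning vector (e.g., the diffusion timestep embedding).
import torch
import torch.nn as nn

class FPSConditioning(nn.Module):
    def __init__(self, max_fps=30, dim=512):
        super().__init__()
        self.embed = nn.Embedding(max_fps + 1, dim)

    def forward(self, cond, fps):
        # cond: (batch, dim) existing conditioning, fps: (batch,) integer frame rates
        return cond + self.embed(fps)
```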

At this point we're almost done. The last step is upscaling, which they call super-resolution, to bring the output up to full resolution. But they don't just do this spatially, they do it temporally as well. So just as a neural network upscaling an image has to imagine what should fill the newly created pixels, this system has a neural network that imagines what should fill newly created video frames, to keep the motion smooth between frames.
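Here's a tiny sketch of the temporal side of that (my illustration, not the paper's interpolation network): a small network predicts the in-between frame for each pair of neighboring frames, roughly doubling the frame count.

```python
# Rough sketch of temporal up-sampling by frame interpolation (not the paper's model):
# a small network predicts an in-between frame from each pair of neighbors,
# roughly doubling the frame rate of a clip.
import torch
import torch.nn as nn

class MidFramePredictor(nn.Module):
    def __init__(self, channels=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, channels, kernel_size=3, padding=1),
        )

    def forward(self, frame_a, frame_b):
        # frame_a, frame_b: (batch, channels, height, width) neighboring frames
        return self.net(torch.cat([frame_a, frame_b], dim=1))

def interpolate_clip(model, clip):
    # clip: (batch, channels, frames, height, width) -> roughly 2x the frames
    frames = [clip[:, :, 0]]
    for i in range(clip.shape[2] - 1):
        a, b = clip[:, :, i], clip[:, :, i + 1]
        frames.append(model(a, b))   # imagined in-between frame
        frames.append(b)
    return torch.stack(frames, dim=2)
```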

Introducing Make-A-Video: An AI system that generates videos from text

#solidstatelife #ai #generativenetworks #diffusionmodels

waynerad@pluspora.com

Megapixel portraits. I really like the way they take classical paintings and animate them. The video shows a diagram of how the system works, but it's still a bit hard to follow, so I'm going to try to describe it a different way, based on the paper.

Basically what they do is train a neural network on the "driver" video, where it takes a frame as input and produces the same frame as output. This may sound pointless, but to pull it off the neural network is forced to build an internal 3D model of the person.

The "driver" here refers to the video that the "source" is going to be changed to imitate. The "source" is the still image such as a classical painting that the neural network will be producing a video in the style of. Once the neural network is trained on the "driver", it is then trained on the "source" even though the "source" is a single frame.

The 3D model that is produced internally captures two things, basically: head rotations and facial expressions. (There are a few more pieces that have to be part of the model, but in the interest of being concise I'm going to skip over those.) To make the painting, or whatever the "source" is, change in the manner of the "driver" video, the system has to "3D warp" the source image, so they created a 3D warping generator. This 3D warping generator works from the head-rotation and facial-expression data that come out of the earlier analysis stage.
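Here's a rough sketch of that data flow as I understand it (my paraphrase, not the authors' code; all the modules here are hypothetical stand-ins): the source image becomes a 3D appearance volume, the driver frame yields head rotation and expression, those drive a 3D warp of the volume, and a renderer turns the warped volume back into a 2D image.

```python
# Rough sketch of the data flow described above (my paraphrase, not the authors'
# code): all four modules passed in here are hypothetical stand-ins.
import torch
import torch.nn.functional as F

def animate_frame(appearance_enc, motion_enc, warp_generator, renderer,
                  source_image, driver_frame):
    volume = appearance_enc(source_image)              # (B, C, D, H, W) 3D features of the source
    rotation, expression = motion_enc(driver_frame)    # head pose + expression of the driver
    warp_field = warp_generator(rotation, expression)  # (B, D, H, W, 3) sampling grid in [-1, 1]
    warped = F.grid_sample(volume, warp_field, align_corners=False)
    return renderer(warped)                            # final 2D frame with the source's appearance
```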

That may sound pretty straightforward, but there are more tricks involved in getting this to work. First, they incorporate a face recognition network in order to figure out gaze direction. Second, even though the neural networks that do the 3D warping and the final 2D rendering are regular convolutional networks, a generative adversarial network (GAN) is used as part of the training process to push the resulting images to high enough resolution. So the convolutional network is trained with one of the terms in its loss function coming from an entirely separate neural network, the GAN. The next trick is that there are elaborate math formulas (described in the paper) that serve as motion descriptors, and these also contribute terms to the loss function used to train the convolutional neural network.
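As a hedged illustration of that kind of composite objective (my sketch, not the paper's exact loss terms or weights): a pixel reconstruction term, an adversarial term from a GAN discriminator, and a motion-descriptor term computed by a separate pretrained network, all summed with weights.

```python
# Illustrative composite loss in the spirit described above (not the paper's
# exact terms or weights; discriminator and motion_descriptor are hypothetical inputs).
import torch
import torch.nn.functional as F

def generator_loss(pred_frame, target_frame, discriminator, motion_descriptor,
                   w_rec=1.0, w_adv=0.1, w_motion=1.0):
    # pixel reconstruction term
    rec = F.l1_loss(pred_frame, target_frame)
    # adversarial term: the generator wants the discriminator to score its output as real
    adv = F.softplus(-discriminator(pred_frame)).mean()
    # motion-descriptor term: match the descriptor (e.g., pose/gaze features) of the target
    motion = F.l1_loss(motion_descriptor(pred_frame), motion_descriptor(target_frame))
    return w_rec * rec + w_adv * adv + w_motion * motion
```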

As if that's not enough, they put in another neural network, which they call the "student", with the special purpose of properly distilling the single-frame source picture. This "student" is an image-to-image neural network. It's not included in their diagram in the video and it's not clear to me how it fits into the rest of the system.
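In general terms, distillation of this sort looks something like the following (a generic sketch, not the paper's actual setup): a smaller image-to-image "student" is trained to reproduce the full pipeline's output for one fixed source image.

```python
# Generic distillation sketch (not the paper's setup): a smaller image-to-image
# "student" network learns to reproduce the full pipeline's output for one fixed source.
import torch
import torch.nn.functional as F

def distillation_step(student, full_pipeline, source_image, driver_frame, optimizer):
    with torch.no_grad():
        target = full_pipeline(source_image, driver_frame)  # "teacher" output
    pred = student(driver_frame)            # student only sees the driver frame
    loss = F.l1_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```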

All in all, this is a system that, while it produces marvelous results, has a lot of moving parts that are not intuitive at all. If you were going to try to implement it yourself, you'd spend a lot of time figuring out all those little details.

MegaPortraits: One-shot Megapixel Neural Head Avatars - Никита Дробышев

#solidstatelife #ai #computervision #generativenetworks

waynerad@diasp.org

"This food does not exist." "Recent methods like diffusion and auto-regressive models are all the rage these days: DALL-E 2, Craiyon (formerly DALL-E mini), ruDALL-E... Why not go in this direction? TL;DR: cos we're poor."

"StyleGAN models shine in terms of photorealism, as can be some by some of our food results. For another example, the website ThisPersonDoesNotExist.com produces very believable face images. While GANs are still better at this, diffusion models are catching up and this may change soon."

"Diffusion models offer better control and flexibility, thanks in large part to text guidance. This comes at the cost of larger models and slower generation times."

This food does not exist

#solidstatelife #ai #computervision #generativenetworks #gans

waynerad@diasp.org

AI music video. "In this video, I utilized artificial intelligence to generate an animated music video for the song Canvas by Resonate". Made with a software program called Disco Diffusion V5.2 Turbo, which apparently uses a technique called VQGAN + CLIP, which stands for Vector Quantized Generative Adversarial Network and Contrastive Language-Image Pre-training.

"While this AI is impressive, it still required additional input beyond just the song lyrics to achieve the music video I was looking for. For example, I added keyframes for camera motion throughout the generated world. These keyframes were manually synchronized to the beat by me. I also specified changes to the art style at different moments of the song. Since many of the lyrics are quite non-specific, even a human illustrator would have a hard time making visual representations. To make the lyrics more digestible by the AI, I sometimes modified the phrase to be more coherent, such as specifying a setting or atmosphere."

I asked AI to make a Music Video... the results are trippy

#musicfortoday #solidstatelife #ai #generativenetworks