Make-A-Video, a new AI system from Meta (Facebook), "lets people turn text prompts into brief, high-quality video clips."
"The system learns what the world looks like from paired text-image data and how the world moves from video footage with no associated text."
Here's how the system works. They started with a text-to-image diffusion model (the same family as DALL-E 2, Midjourney, Imagen, and Stable Diffusion) and extended it by adding new layers: extra convolutional layers alongside the existing convolutional layers (the part that processes images) and extra attention layers alongside the existing attention layers (the part that handles the text conditioning). The added convolutional layers are 1D convolutions, and the added attention layers aren't full attention layers either -- they use an approximation that needs much less computing power. The paper calls these "pseudo-3D convolutional" layers and "pseudo-3D attention" layers. What they do is operate along the time axis, across video frames, in addition to the spatial (space-based) processing done by the existing layers, so they carry temporal (time-based) information as well and tie the spatial and temporal dimensions together. Doing this with full 3D convolutions and full spatio-temporal attention would simply consume too much computing power.
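To make the factorization concrete, here is a minimal PyTorch-style sketch of the pseudo-3D convolution idea, assuming video tensors shaped (batch, channels, frames, height, width). The class name, initialization details, and shapes are illustrative rather than taken from Meta's code; the pseudo-3D attention follows the same spatial-then-temporal pattern.

```python
import torch
import torch.nn as nn

class PseudoConv3d(nn.Module):
    """Illustrative sketch of a factorized 'pseudo-3D' convolution: a 2D spatial
    conv over each frame, followed by a 1D temporal conv across frames at each
    pixel location. Far cheaper than a full 3D conv over (time, height, width)."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.spatial = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        # 1D conv along the frame axis, initialized as identity so the new layer
        # initially passes the pretrained image model's output through unchanged.
        self.temporal = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape
        # Spatial conv: fold the frame axis into the batch dimension.
        y = self.spatial(x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))
        y = y.reshape(b, t, c, h, w)
        # Temporal conv: fold the pixel locations into the batch dimension.
        y = y.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        y = self.temporal(y)
        return y.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)


if __name__ == "__main__":
    clip = torch.randn(2, 64, 8, 32, 32)   # (batch, channels, frames, H, W)
    out = PseudoConv3d(64)(clip)
    print(out.shape)                        # torch.Size([2, 64, 8, 32, 32])
```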
With the original layers already trained for spatial text-to-image generation, the new layers are trained on unlabeled videos to learn how things move: how ocean waves roll, how elephants walk, and so on. This training adds a frame-rate (frames per second) conditioning parameter, which also doubles as an extra control over the motion at generation time.
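One way to picture that frame-rate conditioning, as a hypothetical sketch: the clip's frames-per-second value gets its own embedding, much like the diffusion timestep, and is fed to the network alongside the text conditioning. The names and dimensions below are made up for illustration.

```python
import torch
import torch.nn as nn

class FPSConditioning(nn.Module):
    """Hypothetical sketch: embed the clip's frame rate so the denoising
    network can be told (and later asked for) faster or slower motion."""

    def __init__(self, max_fps: int = 30, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(max_fps + 1, dim)

    def forward(self, fps: torch.Tensor) -> torch.Tensor:
        # fps: integer frame rates, shape (batch,)
        return self.embed(fps)


cond = FPSConditioning()
fps_embedding = cond(torch.tensor([4, 24]))  # e.g. clips sampled at 4 fps and 24 fps
print(fps_embedding.shape)                   # torch.Size([2, 256])
# At sampling time, the chosen fps embedding would be combined with the other
# conditioning signals (text embedding, diffusion timestep) fed to the network.
```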
At this point we're almost done. The last step is upscaling, which they call super-resolution, to bring the output up to full resolution. But they don't just do this spatially, they do it temporally as well. Just as a network upscaling an image has to imagine what should fill the newly created pixels, this system has a frame-interpolation network that imagines what should go into newly created frames inserted between the generated ones, to keep the motion smooth.
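A rough sketch of the temporal side of that last step, with invented names: the generated frames are laid out on a finer time grid with blank slots in between, and an interpolation network is then asked to fill the masked slots (the network itself is not shown here).

```python
import torch

def expand_with_blank_frames(frames: torch.Tensor, factor: int = 4):
    """Illustrative setup for temporal super-resolution: place the generated
    key frames on a finer time grid and mark the in-between slots as missing.
    A denoising network (not shown) would then fill the masked slots."""
    b, c, t, h, w = frames.shape
    t_new = (t - 1) * factor + 1
    grid = torch.zeros(b, c, t_new, h, w)
    mask = torch.zeros(b, 1, t_new, 1, 1)   # 1 = known frame, 0 = frame to imagine
    grid[:, :, ::factor] = frames
    mask[:, :, ::factor] = 1.0
    return grid, mask

key_frames = torch.randn(1, 3, 16, 64, 64)        # 16 generated frames
grid, mask = expand_with_blank_frames(key_frames)  # 61 slots; 45 must be imagined
print(grid.shape, int(mask.sum()))                 # torch.Size([1, 3, 61, 64, 64]) 16
```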
Introducing Make-A-Video: An AI system that generates videos from text