Photorealistic AI-generated talking humans. "VLOGGER" is a system for generating video of a person talking to match an audio track. So you can make a video of any arbitrary person saying any arbitrary thing: you just supply the audio (which could itself be AI-generated) and a still image of the person (which could also be AI-generated).
Most of the sample videos wouldn't play for me, but the ones in the top section did and seem pretty impressive. You have to "unmute" them to hear the audio and see that the video matches the audio.
They say the system works using a 2-step approach. The first step takes just the audio signal and uses a neural network to predict what facial expressions, gaze, gestures, pose, body language, etc., would be appropriately associated with that audio. The second step combines the output of the first step with the image you provide to generate the video. Perhaps surprisingly (at least to me), both steps are done with diffusion networks. I would've expected the second step to be done with a diffusion network, but the first to be done with some sort of autoencoder network. But no, they say they used a diffusion network for that step, too.
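As a rough sketch of that two-stage structure (all function names, shapes, and parameters here are my own placeholders, not the paper's actual API), the pipeline looks something like this:

```python
import numpy as np

# Placeholder stand-ins for the two diffusion networks described above.
# Shapes and names are illustrative guesses, not VLOGGER's real interface.

def motion_diffusion(audio_spectrogram, num_frames, param_dim=128):
    """Stage 1: sample per-frame face/body motion parameters from audio."""
    # A real implementation would run iterative diffusion denoising
    # conditioned on the spectrogram; here we just return random values.
    rng = np.random.default_rng(0)
    return rng.standard_normal((num_frames, param_dim))

def video_diffusion(reference_image, motion_params):
    """Stage 2: render video frames from the reference image + motion."""
    num_frames = motion_params.shape[0]
    # A real implementation would be a temporal diffusion model; this
    # stub just repeats the reference image for every frame.
    return np.repeat(reference_image[None], num_frames, axis=0)

# End-to-end: audio features + one still image -> a video tensor.
spectrogram = np.zeros((80, 400))      # e.g. 80 mel bands x 400 time steps
reference = np.zeros((256, 256, 3))    # one still image of the person
motion = motion_diffusion(spectrogram, num_frames=100)
video = video_diffusion(reference, motion)   # shape (100, 256, 256, 3)
```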
So the first step is taking the audio signal and converting it to spectrograms. In parallel with that, the input image is fed into a "reference pose" network that analyses it to determine what the person looks like and what pose the rest of the system has to deal with as a starting point.
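Converting a waveform into a mel spectrogram is a standard preprocessing step. A minimal example with librosa (the exact parameters VLOGGER uses aren't given in this summary, so these are just typical values for speech):

```python
import librosa
import numpy as np

# Load the speech audio; 16 kHz is a common rate for speech models
# (the paper's actual preprocessing settings may differ).
waveform, sample_rate = librosa.load("speech.wav", sr=16000)

# Compute a mel spectrogram: frequency content over time, on a mel scale.
mel = librosa.feature.melspectrogram(
    y=waveform,
    sr=sample_rate,
    n_fft=1024,
    hop_length=256,
    n_mels=80,
)

# Convert to log scale (dB), which is how spectrograms are usually
# fed into neural networks.
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (80, num_time_frames)
```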
These are fed into the "motion generation network". The output of this network is "residuals" that describe face and body positions: it generates one set of these parameters for each frame of the resulting video.
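So the output can be thought of as one vector of motion parameters per video frame, sampled by reverse diffusion conditioned on the audio. A toy sketch of that sampling loop in PyTorch (the denoiser, the dimensions, and the simplified update rule are all made up for illustration, not VLOGGER's architecture):

```python
import torch
import torch.nn as nn

class ToyMotionDenoiser(nn.Module):
    """Stand-in for the motion generation network: predicts the noise in a
    sequence of per-frame motion parameters, conditioned on audio features."""
    def __init__(self, param_dim=64, audio_dim=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(param_dim + audio_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, param_dim),
        )

    def forward(self, noisy_params, audio_feats, t):
        # noisy_params: (T, param_dim), audio_feats: (T, audio_dim)
        t_embed = torch.full((noisy_params.shape[0], 1), float(t))
        return self.net(torch.cat([noisy_params, audio_feats, t_embed], dim=-1))

# Very simplified reverse diffusion: start from noise and iteratively
# denoise into a (T, param_dim) sequence of per-frame residuals.
T, param_dim, audio_dim, steps = 100, 64, 80, 50
model = ToyMotionDenoiser(param_dim, audio_dim)
audio_feats = torch.randn(T, audio_dim)   # per-frame audio features
params = torch.randn(T, param_dim)        # start from pure noise

with torch.no_grad():
    for t in reversed(range(steps)):
        predicted_noise = model(params, audio_feats, t / steps)
        params = params - predicted_noise / steps  # crude, illustrative update

print(params.shape)  # (100, 64): one motion-parameter vector per frame
```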
The result of the "motion generation network", along with the reference image and the pose of the person in the reference image, is then passed to the next stage, which is the temporal diffusion network that generates the video. A "temporal diffusion" network is a diffusion network that generates images, but it has been modified so that it maintains consistency from frame to frame, hence the "temporal" tacked on to the name. In this case, the temporal diffusion network has undergone the additional step of being trained to handle the 3D motion "residual" parameters. Unlike previous non-diffusion-based image generators that simply stretched images in accordance with motion parameters, this network incorporates the "warping" parameters into the training of the neural network itself, resulting in much more realistic renditions of human faces stretching and moving.
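"Temporal" here usually means the image-diffusion layers are interleaved with layers that attend across frames, so the content at each spatial location stays consistent over time. A small sketch of such a temporal attention layer in PyTorch, showing the general pattern used by video diffusion models rather than VLOGGER's exact layers:

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Attention applied across the time axis at each spatial location,
    so each pixel position can stay consistent from frame to frame."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        # Fold spatial positions into the batch so attention runs over frames.
        x = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        normed = self.norm(x)
        attended, _ = self.attn(normed, normed, normed)
        x = x + attended  # residual connection
        # Restore the original (batch, frames, channels, height, width) layout.
        return x.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)

features = torch.randn(2, 8, 32, 16, 16)   # 2 clips, 8 frames, 32 channels
out = TemporalAttention(channels=32)(features)
print(out.shape)  # torch.Size([2, 8, 32, 16, 16])
```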
This neural network generates a fixed number of frames. To extend the video beyond that, they use a technique called "temporal outpainting": the previously generated frames, minus one, are fed back in as conditioning and used to generate the next frame. In this manner they can generate a video of any length, with any number of frames.
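In code, that autoregressive extension might be structured like the loop below. The generator here is a stub, and the exact conditioning window is just my reading of the "previous frames, minus one" description above:

```python
import numpy as np

def generate_clip(conditioning_frames, num_new_frames, frame_shape=(64, 64, 3)):
    """Stub for the temporal diffusion model: given some already-generated
    frames as conditioning, produce the next new frame(s)."""
    rng = np.random.default_rng(len(conditioning_frames))
    return [rng.standard_normal(frame_shape) for _ in range(num_new_frames)]

window = 16            # how many frames the network handles at once
total_frames = 100     # desired length of the final video

# First pass: generate an initial window of frames with no prior frames.
frames = generate_clip(conditioning_frames=[], num_new_frames=window)

# Temporal outpainting: repeatedly feed the most recent frames (all but one
# slot, which is left free for the new frame) back in as conditioning.
while len(frames) < total_frames:
    conditioning = frames[-(window - 1):]
    frames += generate_clip(conditioning, num_new_frames=1)

print(len(frames))  # 100 frames, i.e. a video of arbitrary length
```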
As a final step they incorporate an upscaler to increase the pixel resolution of the output.
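As a sketch of where that fits, the final stage just maps each generated frame to a higher resolution. Here a simple bicubic resize stands in for the learned super-resolution model the real system would use:

```python
import torch
import torch.nn.functional as F

video = torch.randn(100, 3, 128, 128)   # (frames, channels, height, width)

# Stand-in for the learned upscaler: bicubic interpolation to 2x resolution.
upscaled = F.interpolate(video, scale_factor=2, mode="bicubic", align_corners=False)
print(upscaled.shape)  # torch.Size([100, 3, 256, 256])
```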
VLOGGER: Multimodal diffusion for embodied avatar synthesis
#solidstatelife #ai #computervision #generativeai #diffusionmodels