MagicAnimate animates a human from a single reference image, following a provided motion sequence. See the Mona Lisa jogging or doing yoga.

How does it work? First of all, it uses a diffusion network to generate the video. Systems for generating video with GANs (generative adversarial networks) have also been developed, but diffusion networks have recently proven better at taking a human-entered text prompt and turning it into an image. The problem is that if you make a video frame by frame, each frame gets generated independently of the others, and that inevitably leads to flickering.

The key insight here is that instead of doing the "diffusion" process frame-by-frame, you do it on the entire video all at once. This enables "temporal consistency" across frames. A couple more elements are necessary to get the whole system to work, though.
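To make that concrete, here's a minimal sketch (in PyTorch, not the paper's actual code) of a temporal attention layer: every spatial location attends across all frames of the clip, which is the basic mechanism that lets a video diffusion network denoise the whole clip jointly rather than one frame at a time. The shapes, sizes, and layer names are assumptions for illustration.

```python
# Minimal sketch (not MagicAnimate's actual code): a temporal attention
# layer that lets every frame attend to every other frame, so the whole
# clip is denoised jointly instead of one frame at a time.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, height*width, channels) latent features for a clip
        b, f, hw, c = x.shape
        # Treat each spatial location as its own sequence over time,
        # so attention mixes information across frames at that location.
        x = x.permute(0, 2, 1, 3).reshape(b * hw, f, c)
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        x = x + out                      # residual connection
        return x.reshape(b, hw, f, c).permute(0, 2, 1, 3)

# Usage: a 16-frame clip of 32x32 latents with 320 channels (made-up sizes)
clip = torch.randn(1, 16, 32 * 32, 320)
print(TemporalAttention(320)(clip).shape)  # torch.Size([1, 16, 1024, 320])
```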

One is discarding the normal way diffusion networks condition on an internal encoding tied to a text prompt. In this system a reference image is provided instead, so there is no text prompt, and the whole network is trained to condition on an internal encoding of appearance. That's what lets the system maintain the appearance of the reference image, both the person being animated and the background, throughout the generated video.
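Here's a hedged sketch of what that conditioning swap might look like: a cross-attention layer that attends to appearance features extracted from the reference image in the slot where text-prompt embeddings would normally go. The dimensions, names, and the hypothetical appearance-encoder output are illustrative assumptions, not the actual MagicAnimate code.

```python
# Illustrative sketch: cross-attention over appearance features from the
# reference image instead of text-prompt embeddings. All names/sizes assumed.
import torch
import torch.nn as nn

class AppearanceCrossAttention(nn.Module):
    def __init__(self, channels: int, cond_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            channels, num_heads, kdim=cond_dim, vdim=cond_dim, batch_first=True
        )
        self.norm = nn.LayerNorm(channels)

    def forward(self, x, appearance_tokens):
        # x: (batch, tokens, channels) UNet latent features for one frame
        # appearance_tokens: (batch, ref_tokens, cond_dim) features of the
        # reference image, used in place of text-prompt embeddings
        out, _ = self.attn(self.norm(x), appearance_tokens, appearance_tokens)
        return x + out                   # residual connection

latents = torch.randn(1, 1024, 320)
ref_feats = torch.randn(1, 256, 768)       # hypothetical appearance-encoder output
layer = AppearanceCrossAttention(320, 768)
print(layer(latents, ref_feats).shape)     # torch.Size([1, 1024, 320])
```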

The other key piece that makes the system work is incorporating a prior system called ControlNet. ControlNet analyzes the provided pose and converts it into a motion signal, a dense set of body "keypoints". Analyzing these control points is the first stage of the process; the second stage is joint diffusion conditioned on the control points and the reference image.
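For a rough idea of how ControlNet-style conditioning works, here's a sketch: the pose rendering for a frame is run through a small encoder whose output is added to the diffusion UNet's features as a residual. The zero-initialized projection, so conditioning starts as a no-op and is learned, is the trick from the ControlNet paper; everything else here (shapes, names, the toy encoder) is an assumption for illustration.

```python
# Rough sketch of ControlNet-style pose conditioning (hypothetical shapes
# and names): a pose frame is encoded and added to the UNet's features as
# a residual, steering generation toward the desired pose.
import torch
import torch.nn as nn

class PoseConditioner(nn.Module):
    def __init__(self, in_channels: int = 3, feat_channels: int = 320):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, feat_channels, 3, stride=8, padding=1),
        )
        # Zero-initialized projection (as in ControlNet) so the conditioning
        # starts as a no-op and is learned during training.
        self.zero_proj = nn.Conv2d(feat_channels, feat_channels, 1)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, unet_features, pose_frame):
        # unet_features: (batch, feat_channels, h, w)
        # pose_frame:    (batch, 3, H, W) dense pose rendering for this frame
        cond = self.zero_proj(self.encoder(pose_frame))
        return unet_features + cond      # residual conditioning

feats = torch.randn(1, 320, 32, 32)
pose = torch.randn(1, 3, 256, 256)
print(PoseConditioner()(feats, pose).shape)  # torch.Size([1, 320, 32, 32])
```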

If you're wondering how the system manages to hold the entire video in memory to run the diffusion process on the whole thing at once, the answer is that it doesn't. To make the system work on GPUs with limited memory, the researchers devised a "sliding window" scheme that generates overlapping segments of video. Overlapping frames are close enough that they can be combined by simple averaging, and the end result looks fine.
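Here's a minimal sketch of that sliding-window blending, with assumed window and stride values (the paper's exact settings may differ): generate overlapping segments, accumulate them, and divide each frame by how many segments covered it, which is exactly simple averaging in the overlaps.

```python
# Minimal sketch of sliding-window video generation with overlap averaging.
# Window/stride values and the denoise_segment interface are assumptions.
import numpy as np

def sliding_window_generate(num_frames, window=16, stride=12, denoise_segment=None):
    """Assumes num_frames >= window; denoise_segment(start, length) -> (length, H, W, C)."""
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] != num_frames - window:      # make sure the final frames are covered
        starts.append(num_frames - window)
    acc, counts = None, np.zeros(num_frames)
    for start in starts:
        seg = denoise_segment(start, window)   # generate one overlapping segment
        if acc is None:
            acc = np.zeros((num_frames,) + seg.shape[1:])
        acc[start:start + window] += seg       # accumulate overlapping frames
        counts[start:start + window] += 1
    return acc / counts[:, None, None, None]   # average where segments overlap

# Toy "denoiser" that returns constant frames, just to show the plumbing
fake = lambda start, length: np.ones((length, 8, 8, 3))
video = sliding_window_generate(40, denoise_segment=fake)
print(video.shape)  # (40, 8, 8, 3)
```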

Speaking of the researchers, this was a joint team from ByteDance and the National University of Singapore. ByteDance, as in the maker of TikTok. The application to TikTok is obvious.

#solidstatelife #ai #genai #diffusionmodels #videoai

https://showlab.github.io/magicanimate/
