Megapixel portraits. I really like the way they take classical paintings and animate them. The video shows a diagram of how the system works, but it's still a bit hard to follow, so I'm going to try to describe it a different way, based on the paper.

Basically what they do is train a neural network on the "driver" video, where it takes a frame as input and has to produce that same frame as output. This may sound pointless, but to pull it off the neural network is forced to build an internal 3D model of the person.
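
To make that concrete, here is a minimal PyTorch sketch of the reconstruct-the-frame training idea. The module names, layer sizes, and the plain L1 loss are my own placeholders, not the architecture from the paper; it's just meant to show what "frame in, same frame out" looks like as a training step.

```python
import torch
import torch.nn as nn

# Minimal sketch (hypothetical names/sizes) of self-supervised reconstruction:
# the network must reproduce a driver-video frame from its own latent features.
class TinyReconstructor(nn.Module):
    def __init__(self):
        super().__init__()
        self.encode = nn.Sequential(  # frame -> latent features
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decode = nn.Sequential(  # latent features -> frame
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, frame):
        return self.decode(self.encode(frame))

model = TinyReconstructor()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

frame = torch.rand(1, 3, 256, 256)          # stand-in for one driver-video frame
recon = model(frame)
loss = nn.functional.l1_loss(recon, frame)  # "produce the same frame" objective
loss.backward()
opt.step()
```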

The "driver" here refers to the video that the "source" is going to be changed to imitate. The "source" is the still image such as a classical painting that the neural network will be producing a video in the style of. Once the neural network is trained on the "driver", it is then trained on the "source" even though the "source" is a single frame.

The 3D model that is produced internally captures two things, basically: head rotations and facial expressions. (There are a few more pieces that have to be part of the model, but in the interest of being concise I'm going to skip over those.) To make the painting, or whatever the "source" is, move in the manner of the "driver" video, the system has to "3D warp" the source image, so they created a 3D warping generator. This 3D warping generator has to work from the head-rotation and facial-expression data that come out of the earlier analysis stage.
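
Here is a rough PyTorch sketch of what "3D warping" by a head rotation can look like: lift the source into a volume of latent features and resample that volume along rotated coordinates. The shapes, the rotation_grid helper, and the fixed yaw angle are illustrative assumptions, not the paper's actual warping generator, which also has to handle non-rigid, expression-driven warps on top of the rigid rotation.

```python
import math
import torch
import torch.nn.functional as F

def rotation_grid(rot, depth, height, width):
    """Build a (1, D, H, W, 3) sampling grid that applies a 3x3 rotation."""
    zs = torch.linspace(-1, 1, depth)
    ys = torch.linspace(-1, 1, height)
    xs = torch.linspace(-1, 1, width)
    z, y, x = torch.meshgrid(zs, ys, xs, indexing="ij")
    coords = torch.stack([x, y, z], dim=-1)        # (D, H, W, 3), xyz order
    coords = coords.reshape(-1, 3) @ rot.T         # rotate every voxel coordinate
    return coords.reshape(1, depth, height, width, 3)

# Latent "appearance volume" lifted from the source image (illustrative shape).
volume = torch.rand(1, 64, 16, 64, 64)             # (N, C, D, H, W)

angle = 0.3                                         # ~17 degrees of yaw
rot = torch.tensor([[ math.cos(angle), 0.0, math.sin(angle)],
                    [ 0.0,             1.0, 0.0            ],
                    [-math.sin(angle), 0.0, math.cos(angle)]])

grid = rotation_grid(rot, 16, 64, 64)
warped = F.grid_sample(volume, grid, align_corners=True)  # rotated features
print(warped.shape)                                 # torch.Size([1, 64, 16, 64, 64])
```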

That may sound pretty straightforward, but there are more tricks involved in getting this to work. First, they incorporate a face recognition network in order to figure out gaze direction. Second, even though the networks that do the 3D warping and the final 2D rendering are regular convolutional networks, a generative adversarial network (GAN) is used as part of the training process to get the resulting images up to high enough resolution. So the convolutional network is trained with one of the terms in its loss function coming from an entirely separate neural network, the GAN's discriminator. The next trick is that there are complicated math formulas (described in the paper) that serve as motion descriptors, and these also feed into the loss function used to train the convolutional neural network.
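
To illustrate the "loss with many terms" idea, here's a hedged PyTorch sketch of a combined objective: a pixel term, an adversarial term supplied by a separate discriminator network, and a motion-descriptor term. The weights, the toy discriminator, and the simple MSE motion term are placeholders I made up, not the paper's formulas.

```python
import torch
import torch.nn as nn

def total_loss(pred, target, discriminator, pred_motion, target_motion,
               w_rec=1.0, w_adv=0.1, w_motion=1.0):
    rec = nn.functional.l1_loss(pred, target)                     # pixel term
    adv = -discriminator(pred).mean()                             # adversarial term from the critic
    motion = nn.functional.mse_loss(pred_motion, target_motion)   # motion-descriptor term
    return w_rec * rec + w_adv * adv + w_motion * motion

# Toy usage with stand-in tensors and a tiny "discriminator".
disc = nn.Sequential(nn.Conv2d(3, 8, 4, stride=2), nn.ReLU(),
                     nn.Flatten(), nn.LazyLinear(1))
pred = torch.rand(1, 3, 64, 64, requires_grad=True)
target = torch.rand(1, 3, 64, 64)
loss = total_loss(pred, target, disc,
                  pred_motion=torch.rand(1, 16), target_motion=torch.rand(1, 16))
loss.backward()
```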

As if that's not enough, they put in yet another neural network, which they call the "student", with the special purpose of distilling the system down for the single-frame source picture. This one is actually an image-to-image network. It's not included in their diagram in the video, and it's not clear to me how it fits into the rest of the system.
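
For what it's worth, here's what distilling a big model into a small image-to-image "student" generally looks like in PyTorch. The StudentNet architecture and the stand-in teacher function are assumptions for illustration; the real student is trained against the full system's outputs, which is the general idea shown here, but exactly how it plugs into the rest is the part that isn't clear from the diagram.

```python
import torch
import torch.nn as nn

# Small image-to-image "student" trained to match a frozen teacher's output
# for driver frames, so inference becomes cheap (generic distillation sketch).
class StudentNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, driver_frame):
        return self.net(driver_frame)

teacher = lambda driver: torch.rand_like(driver)  # placeholder for the full model
student = StudentNet()
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

driver = torch.rand(1, 3, 256, 256)               # a frame from the driver video
with torch.no_grad():
    target = teacher(driver)                      # teacher's output for this frame
loss = nn.functional.l1_loss(student(driver), target)
loss.backward()
opt.step()
```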

All in all, this is a system that, while it produces marvelous results, has a lot of moving parts that are not intuitive at all, and if you were going to try to implement it yourself, you'd spend a lot of time figuring out all those little details.

MegaPortraits: One-shot Megapixel Neural Head Avatars - Никита Дробышев

#solidstatelife #ai #computervision #generativenetworks
