Deepfake detection with attention. Deepfakes have two kinds of defects: defects within any given video frame, and defects between frames. Defects within frames arise because the generation model is imperfect. Defects between frames, when the frames are sequenced into video, arise because the generator doesn't know what global constraints to apply.
The generator models are so far always generative adversarial networks (GANs) or variational autoencoders (VAEs), and either one will have defects from its upscaling phase. Basically these models have a set of parameters that are used to generate the face and a system for going from a low-resolution image to a high-resolution image, filling in details as it goes. This upscaling is called "deconvolution" (also known as transposed convolution), and it has imperfections, in particular uneven overlap: each input pixel is expanded into a window of output pixels, and neighboring windows overlap. The overlap avoids seams between the windows, but detail is lost where they blend, and when the window size isn't evenly divisible by the stride, some output pixels receive more contributions than others, producing checkerboard-like artifacts. In addition, there are defects you could call "semantic" defects. For example, the model can get the specular reflection in the eyes wrong -- specular reflection in eyes refers to the reflection of the scene the person is looking at. Details of teeth can make them look slightly rougher than real teeth. Some face semantics are correct in their details but only look amiss when you zoom out: generated faces often have less symmetry than real faces. Real human eyes are separated by a consistent distance and have matching colors, but the eyes of a fake face sometimes have the wrong spacing or mismatched colors.
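The uneven-overlap problem can be seen with a few lines of code. This is a minimal sketch (not the paper's code, and the sizes are illustrative): it counts how many transposed-convolution windows contribute to each output pixel in 1-D. When the window size isn't divisible by the stride, the counts alternate, which is the source of checkerboard artifacts.

```python
import numpy as np

def overlap_counts(n_in, kernel_size, stride):
    """Count contributions to each output pixel of a 1-D transposed convolution."""
    n_out = (n_in - 1) * stride + kernel_size
    counts = np.zeros(n_out, dtype=int)
    for i in range(n_in):  # each input pixel writes a kernel-sized window
        counts[i * stride : i * stride + kernel_size] += 1
    return counts

# Kernel size 3, stride 2: kernel not divisible by stride, so overlap is uneven.
print(overlap_counts(5, kernel_size=3, stride=2))  # prints [1 1 2 1 2 1 2 1 2 1 1]

# Kernel size 4, stride 2: divisible, so the interior overlap is even.
print(overlap_counts(5, kernel_size=4, stride=2))
```

The alternating 1s and 2s in the first case are exactly the checkerboard pattern: every other output pixel is brighter or darker depending on how many windows wrote to it.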
Between frames, the defects tend to be things like unnatural eye-blinking frequency, and combinations of head movements and facial expressions that are different from what real humans make. If you're thinking all the system needs to do is faithfully copy these from the source video -- yeah, that's right, but the generative models often fail to do that.
The deepfake detection introduced here is based on the vision transformer architecture. Transformers were originally invented for language translation: translating a sequence of words in one language into a sequence of words in another. It turned out to be useful, when the translation system is generating the next output word, to give it an "attention" mechanism that lets the model look at any word in the input, in any order, paying "attention" to any part of the original sentence at any time. Vision transformers port this idea over to images and video.
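The "look anywhere in the input" idea can be sketched as standard scaled dot-product attention (the generic mechanism, not the paper's exact formulation; all names and sizes here are illustrative). Each query position computes a similarity to every input position, turns those similarities into weights that sum to 1, and takes a weighted average of the inputs:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: each query attends to every key/value pair."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))  # one row per query, one weight per input position
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))  # 3 query positions, dimension 4
K = rng.normal(size=(5, 4))  # 5 input positions
V = rng.normal(size=(5, 4))
out, w = attention(Q, K, V)
# Each row of w sums to 1: the query spreads its "attention" over all 5 inputs at once,
# regardless of their order in the sequence.
```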
The system works by dividing the input image into small patches that correspond to parts of the face. Each "patch" is turned into an "embedding" analogous to word embeddings in language, and each patch embedding has information about its position in the image encoded into it. The embeddings then go through "feature extractors", which are plugged into a "transformation matrix" whose parameters are also learned through a training process. A "global forgery template" is also learned, and combining it with the "transformation matrix" produces an "attention map". The attention mechanism is used both to direct the system's attention to different parts of the same image, to detect single-frame defects, and to build a "long range" attention map that is used to detect defects across frames.
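The patch-embedding step at the front of that pipeline can be sketched like this (a hedged illustration of the generic vision-transformer input stage, not the paper's code; the patch size, embedding dimension, and the use of random matrices in place of learned ones are all assumptions for the sketch):

```python
import numpy as np

def patchify(img, patch):
    """Split an image into non-overlapping patches and flatten each one."""
    H, W, C = img.shape
    patches = []
    for y in range(0, H, patch):
        for x in range(0, W, patch):
            patches.append(img[y:y+patch, x:x+patch].reshape(-1))
    return np.stack(patches)  # (num_patches, patch*patch*C)

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))            # toy 32x32 RGB "face" image
patches = patchify(img, patch=8)         # 16 patches, each 8*8*3 = 192 values
W_embed = rng.normal(size=(192, 64))     # projection matrix (learned in practice, random here)
pos = rng.normal(size=(16, 64))          # position embeddings (also learned in practice)
tokens = patches @ W_embed + pos         # (16, 64): one embedding per patch, position encoded in
```

These per-patch tokens are what the downstream attention layers operate on, in the same way a language transformer operates on word embeddings.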
What finally comes out the other end of the system is a set of activations representing the confidence that each patch is "suspicious". If enough patches are suspicious, the system classifies the video as a deepfake.
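That final decision step amounts to thresholding. A toy sketch (the threshold values here are made up for illustration; the paper's actual classifier head differs): score each patch, count how many exceed a suspiciousness threshold, and flag the video if the fraction is high enough.

```python
import numpy as np

def classify(patch_scores, patch_threshold=0.5, fraction=0.3):
    """Flag a video as fake if enough patches look suspicious (toy decision rule)."""
    suspicious = patch_scores > patch_threshold
    return bool(suspicious.mean() >= fraction)  # True => classified as deepfake

scores = np.array([0.9, 0.8, 0.1, 0.7, 0.2, 0.6])  # per-patch "suspicious" activations
print(classify(scores))  # prints True: 4 of 6 patches exceed the threshold
```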
Detection of deepfake videos using long distance attention
#solidstatelife #ai #deepfakes