Optical illusions created with diffusion models: images that change appearance when flipped or rotated. Actually, these researchers created a general-purpose system for making optical illusions under a variety of transformations. They've named their optical illusions "visual anagrams".
Now, I know I told you all I would write up an explanation of how diffusion models work, and I've not yet done that. There's a lot of advanced math that goes into them.
The key thing to understand here about diffusion models is that they work by taking an image and adding Gaussian noise... in reverse. You start with random noise, and then you "de-noise" the image step by step, in the direction of a text prompt.
The way this process works is, you feed in the image and the text prompt, and what the neural network computes is the "noise". Crucially, this "noise" computation isn't a single number, it's a pixel-by-pixel noise estimate -- essentially another image. "Noise" compared to what? Compared to what the image ought to look like according to the text prompt. Amazingly enough, using this "error" to "correct" the image and then iterating on the process guides it into an image that fits the text prompt.
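To make that concrete, here's a minimal sketch of that loop in Python/numpy. The predict_noise helper is just a random stand-in I made up for the actual neural network (which would be conditioned on the text prompt), so this won't produce real images; it just shows the shape of the process.

```python
import numpy as np

def predict_noise(image, prompt):
    # Stand-in for the diffusion network: a real model returns a per-pixel
    # noise estimate (same shape as the image), conditioned on the text prompt.
    return np.random.standard_normal(image.shape)

def generate(prompt, shape=(64, 64, 3), steps=50):
    # Reverse diffusion, roughly: start from pure Gaussian noise and
    # repeatedly use the predicted "noise" as an error signal to correct
    # the image, step by step, toward the prompt.
    image = np.random.standard_normal(shape)
    for t in range(steps):
        noise_estimate = predict_noise(image, prompt)   # "essentially another image"
        image = image - noise_estimate / steps          # crude correction step
    return image

dog = generate("oil painting of a dog")
```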
The trick they've done here is: first, they take the image and compute the "noise" on it the normal way. Then they put the image through its transformation -- rotation, vertical flipping, or a jigsaw-puzzle-like rearrangement combining rotation, reflection, and translation -- compute the "noise" on that transformed image (using a different text prompt!), and then apply the reverse transformation to that "noise" image. Finally, they combine the original "noise" and the reverse-transformed "noise" by simple averaging.
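In code, that per-step combination might look something like this (again a sketch with made-up helper names, not their actual implementation):

```python
import numpy as np

# Same stand-in for the network as in the previous sketch.
predict_noise = lambda image, prompt: np.random.standard_normal(image.shape)

def anagram_noise_estimate(image, prompt_a, prompt_b, transform, inverse_transform):
    # 1. Compute the "noise" on the image the normal way, for prompt A.
    eps_a = predict_noise(image, prompt_a)
    # 2. Transform the image, compute the "noise" for prompt B, then apply
    #    the reverse transformation to that "noise" image.
    eps_b = inverse_transform(predict_noise(transform(image), prompt_b))
    # 3. Combine the two "noise" estimates by simple averaging.
    return (eps_a + eps_b) / 2

# Example: a 180-degree rotation is its own inverse.
rotate_180 = lambda img: np.rot90(img, k=2)

eps = anagram_noise_estimate(
    np.random.standard_normal((64, 64, 3)),
    "oil painting of a dog",
    "oil painting of a cat",
    transform=rotate_180,
    inverse_transform=rotate_180,
)
```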
This only works for certain transformations. Basically, the transformation has to satisfy two conditions: "linearity" and "statistical consistency". By linearity, they mean diffusion models fundamentally think of the image as a linear combination of "signal + noise", so the transformation has to be linear and act the same way on the signal and on the noise. By "statistical consistency" they mean diffusion networks assume the "noise" is Gaussian (it follows a Gaussian distribution), so the transformed noise still has to look like Gaussian noise. If your transformation breaks either assumption, it won't work.
These assumptions hold for the three transformations I've mentioned so far: rotation, reflection, and translation. They also hold for one more: color inversion, like a photographic negative. The color values have to be kept centered on 0, though. Their examples of this one are only black-and-white.
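You can get a feel for these two conditions with a quick numerical check (my own toy check, not from the paper): apply a candidate transformation to pure Gaussian noise and see whether it's linear and whether the result still looks like standard Gaussian noise. Rotation, reflection, and 0-centered negation pass; something like a blur is linear but fails the statistical test.

```python
import numpy as np

rng = np.random.default_rng(0)

# Candidate "views" applied to a square image:
flip      = lambda img: img[::-1]                        # reflection
rotate_90 = lambda img: np.rot90(img)                    # rotation
negate    = lambda img: -img                             # color inversion around 0
blur      = lambda img: (img + np.roll(img, 1, 0)) / 2   # linear, but breaks the noise statistics

def check(view, trials=10_000):
    # Linearity: view(a + 2b) should equal view(a) + 2*view(b).
    a, b = rng.standard_normal((2, 32, 32))
    linear = np.allclose(view(a + 2 * b), view(a) + 2 * view(b))
    # Statistical consistency (crude check): track one output pixel over many
    # draws of standard Gaussian noise; its variance should stay close to 1.
    samples = [view(rng.standard_normal((32, 32)))[5, 7] for _ in range(trials)]
    return linear, round(float(np.var(samples)), 2)

for name, view in [("flip", flip), ("rotate_90", rotate_90),
                   ("negate", negate), ("blur", blur)]:
    print(name, check(view))   # blur reports a variance near 0.5, not 1
```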
Another thing they had to do was use a different diffusion model, because Stable Diffusion works in a "latent space" where each value refers to a group of pixels. They used an alternative called DeepFloyd IF, which does its diffusion on per-pixel values instead. I haven't figured out exactly what "latent space" values are learned by each of these models, so I can't tell you why this distinction matters.
Another thing is that the system also incorporates "negative prompting" in its "noise" estimate, but they discovered you have to be very careful with it. Negative prompts tell the system what to leave out of the image rather than what to include. An example that illustrates the problem: say your two prompts are "oil painting of a dog" and "oil painting of a cat", and you use one as the negative prompt for the other. They both contain "oil painting", so you're telling the system to both include and exclude "oil painting".
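For context on how a negative prompt gets folded into the "noise" estimate: in Stable-Diffusion-style pipelines this is typically done with classifier-free guidance, where the negative prompt's estimate is what you push away from. Here's a sketch of that idea; the weighting and helper names are my assumptions, not the paper's code.

```python
import numpy as np

# Same stand-in for the network's per-pixel noise estimate as before.
predict_noise = lambda image, prompt: np.random.standard_normal(image.shape)

def guided_noise(image, prompt, negative_prompt="", guidance_scale=7.5):
    # Classifier-free-guidance-style combination: start from the negative
    # prompt's estimate and push toward the positive prompt's estimate.
    eps_neg = predict_noise(image, negative_prompt)
    eps_pos = predict_noise(image, prompt)
    return eps_neg + guidance_scale * (eps_pos - eps_neg)

# The pitfall from above: both prompts share "oil painting", so that concept
# gets pushed toward and away from at the same time.
img = np.random.standard_normal((64, 64, 3))
eps = guided_noise(img, "oil painting of a dog",
                   negative_prompt="oil painting of a cat")
```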
The website has lots of animated examples; check it out.
Visual anagrams: Generating multi-view optical illusions with diffusion models
#solidstatelife #ai #genai #diffusionmodels #opticalillusions