#diffusionmodels

waynerad@diasp.org

Diffusion models are evolutionary algorithms, claims a team of researchers from Tufts, Harvard, and TU Wien.

"At least two processes in the biosphere have been recognized as capable of generalizing and driving novelty: evolution, a slow variational process adapting organisms across generations to their environment through natural selection; and learning, a faster transformational process allowing individuals to acquire knowledge and generalize from subjective experience during their lifetime. These processes are intensively studied in distinct domains within artificial intelligence. Relatively recent work has started drawing parallels between the seemingly unrelated processes of evolution and learning. We here argue that in particular diffusion models, where generative models trained to sample data points through incremental stochastic denoising, can be understood through evolutionary processes, inherently performing natural selection, mutation, and reproductive isolation."

"Both evolutionary processes and diffusion models rely on iterative refinements that combine directed updates with undirected perturbations: in evolution, random genetic mutations introduce diversity while natural selection guides populations toward greater fitness, and in diffusion models, random noise is progressively transformed into meaningful data through learned denoising steps that steer samples toward the target distribution. This parallel raises fundamental questions: Are the mechanisms underlying evolution and diffusion models fundamentally connected? Is this similarity merely an analogy, or does it reflect a deeper mathematical duality between biological evolution and generative modeling?"

"To answer these questions, we first examine evolution from the perspective of generative models. By considering populations of species in the biosphere, the variational evolution process can also be viewed as a transformation of distributions: the distributions of genotypes and phenotypes. Over evolutionary time scales, mutation and selection collectively alter the shape of these distributions. Similarly, many biologically inspired evolutionary algorithms can be understood in the same way: they optimize an objective function by maintaining and iteratively changing a large population's distribution. In fact, this concept is central to most generative models: the transformation of distributions. Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and diffusion models are all trained to transform simple distributions, typically standard Gaussian distributions, into complex distributions, where the samples represent meaningful images, videos, or audio, etc."

"On the other hand, diffusion models can also be viewed from an evolutionary perspective. As a generative model, diffusion models transform Gaussian distributions in an iterative manner into complex, structured data-points that resemble the training data distribution. During the training phase, the data points are corrupted by adding noise, and the model is trained to predict this added noise to reverse the process. In the sampling phase, starting with Gaussiandistributed data points, the model iteratively denoises to incrementally refine the data point samples. By considering noise-free samples as the desired outcome, such a directed denoising can be interpreted as directed selection, with each step introducing slight noise, akin to mutations. Together, this resembles an evolutionary process, where evolution is formulated as a combination of deterministic dynamics and stochastic mutations within the framework of non-equilibrium thermodynamics. This aligns with recent ideas that interpret the genome as a latent space parameterization of a multi-scale generative morphogenetic process, rather than a direct blueprint of an organism. If one were to revert the time direction of an evolutionary process, the evolved population of potentially highly correlated high-fitness solutions will dissolve gradually, i.e., step by step and thus akin to the forward process in diffusion models, into the respectively chosen initial distribution, typically Gaussian noise."

The researchers proceed to present a mathematical representation of diffusion models. Then, "By substituting Equations 8 and 10 into Equation 5, we derive the Diffusion Evolution algorithm: an evolutionary optimization procedure based on iterative error correction akin to diffusion models but without relying on neural networks at all." They present pseudocode for an algorithm to demonstrate this.

Equations 1-3 are about the added noise, equations 4-5 are about reversing the process and using a neural network to estimate and remove the noise, equation 6 represents the process using Bayes' Theorem and introduces a representation using functions (f() and g()), and equations 7-9 are some plugging and chugging, changing the representation of those equations to get the form where you can substitute back into equation 5 as mentioned above.

"When inversely denoising, i.e., evolving from time T to 0, while increasing alpha-sub-t, the Gaussian term will initially have a high variance, allowing global exploration at first. As the evolution progresses, the variance decreases giving lower weight to distant populations, leads to local optimization (exploitation). This locality avoids global competition and thus allows the algorithm to maintain multiple solutions and balance exploration and exploitation. Hence, the denoising process of diffusion models can be understood in an evolutionary manner: x-hat-0 represents an estimated high fitness parameter target. In contrast, x-sub-t can be considered as diffused from high-fitness points. The first two parts in the Equation 5, ..., guide the individuals towards high fitness targets in small steps. The last part of Equation 5, sigma-sub-t-w, is an integral part of diffusion models, perturbing the parameters in our approach similarly to random mutations."

Obviously, consult the paper if you want the mathematical details.

"We conduct two sets of experiments to study Diffusion Evolution in terms of diversity and solving complex reinforcement learning tasks. Moreover, we utilize techniques from the diffusion models literature to improve Diffusion Evolution. In the first experiment, we adopt an accelerated sampling method to significantly reduce the number of iterations. In the second experiment, we propose Latent Space Diffusion Evolution, inspired by latent space diffusion models, allowing us to deploy our approach to complex problems with high-dimensional parameter spaces through exploring a lower-dimensional latent space."

"Our method consistently finds more diverse solutions without sacrificing fitness performance. While CMA-ES shows higher entropy on the Ackley and Rastrigin functions, it finds significantly lower fitness solutions compared to Diffusion Evolution, suggesting it is distracted by multiple solutions rather than finding diverse ones.

"We apply the Diffusion Evolution method to reinforcement learning tasks to train neural networks for controlling the cart-pole system. This system has a cart with a hinged pole, and the objective is to keep the pole vertical as long as possible by moving the cart sideways while not exceeding a certain range."

"Deploying our original Diffusion Evolution method to this problem results in poor performance and lack of diversity. To address this issue, we propose Latent Space Diffusion Evolution: inspired by the latent space diffusion model, we map individual parameters into a lower-dimensional latent space in which we perform the Diffusion Evolution Algorithm. However, this approach requires a decoder and a new fitness function f-prime for z, which can be challenging to obtain."

"We also found that this latent evolution can still operate in a much larger dimensional parameter space, utilizing a three-layer neural network with 17,410 parameters, while still achieving strong performance. Combined with accelerated sampling method, we can solve the cart pole task in only 10 generations, with 512 population size, one fitness evaluation per individual."

"This parallel we draw here between evolution and diffusion models gives rise to several challenges and open questions. While diffusion models, by design, have a finite number of sampling steps, evolution is inherently open-ended. How can Diffusion Evolution be adapted to support open-ended evolution? Could other diffusion model implementations yield different evolutionary methods with diverse and unique features? Can advancements in diffusion models help introduce inductive biases into evolutionary algorithms? How do latent diffusion models correlate with neutral genes? Additionally, can insights from the field of evolution enhance diffusion models?"

Diffusion models are evolutionary algorithms

#solidstatelife #evolution #ai #genia #diffusionmodels

waynerad@diasp.org

"How Kpopalypse determines the use of AI-generated imagery in k-pop music videos."

"Hyuna sorry I mean IU's 'Holssi' has a video which is mainly not AI, but the floating people certainly are AI."

"The dog/wolf/whatever the fuck that is at the start of Kiss Of Life's 'Get Loud', that's AI-generated for sure -- no, not CGI."

"There's lots of floaty AI-generated crap in Odd Youth's 'Best Friendz' video, like random bubbles, confetti, and... people having accidents, how aegyo, much heart shape."

"There's also a technique in AI image generation that I like to call 'detail spam'. Watch the sequence of images in Achii's 'Fly' video from 2:30 to 2:36. This is all AI-generation at work."

"Same again with Jay 'where's my soju' Park and 'Gimme A Minute (to type in this prompt for exploding cars)'."

"XG use AI in their imagery all the time. For an example, check out the 'Princess Mononoke'-inspired foot imagery at 1:20 in the video [to "Howling"]."

"Speaing of all things environment, I'll leave you with environmental expert Chuu's 'Strawberry Rush' which is almost certainly using a fair bit of AI-generated imagery for all the more boilerplate-looking background cartoon shit."

How Kpopalypse determines the use of AI-generated imagery in k-pop music videos

#solidstatelife #computervision #diffusionmodels #aidetection

waynerad@diasp.org

Diffusion Illusions: Flip illusions, rotation overlays, twisting squares, hidden overlays, Parker puzzles...

If you've never heard of "Parker puzzles", Matt Parker, the math YouTuber, asked this research team to make him a jigsaw puzzle with two solutions: one is a teacup, and the other is a doughnut.

The system they made starts with diffusion models, which are the models you use when you type a text prompt in and it generates the image for you. Napoleon as a cat or unicorn astronauts or whatever.

What if you could generate two images at once that are mathematically related somehow?

That's what the Diffusion Illusions system does. Actually it can even do more than two images.

First I must admit, the system uses an image parameterization system called Fourier Features Networks, and I clicked through to the research paper for Fourier Features Networks, but I couldn't understand it. The "Fourier" part suggests sines and cosines, and yes, there's sine and cosine math in there, but there's also "bra-ket" notation, like you normally see in quantum physics, with partial differential equations in the bra-ket notation, and such. So, I don't understand how Fourier Features works.

There's a video of a short talk from SIGGRAPH, and in it (at about 4:30 in), they claim that diffusion models, all by themselves, have "adversarial artifacts" that Fourier Features fixes. I have no idea why diffusion models on their own would have any kind of "adversarial artifacts" problems. So obviously if I have no idea what might cause the problems, I have no idea why Fourier Features might fix them.

Ok, with that out of the way: the system generates output images, which they call "prime" images. The fact that they give them a name implies there's an additional type of image in the system, and there is. They call these other images the "dream target" images. Central to the whole thing is the "arrangement process" formulation. The only requirement of the "arrangement process" function is that it is differentiable, so deep learning methods can be applied to it. It is this "arrangement process" that decides whether you're generating flip illusions, rotation overlay illusions, hidden overlay illusions, twisting squares illusions, Parker puzzles, or something else -- you could define your own.

After this, it runs two training processes concurrently. The first is the standard way this kind of diffusion-based illusion is trained: it calculates an "error", also called a loss, from the target text conditioning, which is called the score distillation loss.

Apparently, however, circumstances exist where it is not trivial for prime images to follow the gradients from the Score Distillation Loss to give you images that create the illusion you are asking for. To get the system unstuck, they added the "dream target loss" training system. The "dream target" images are images made from your text prompts individually. So, let's say you want to make a flip illusion that is a penguin viewed one way and a giraffe when flipped upside down. In this instance, the system will take the "penguin" prompt and create an image from it, and take the "giraffe" prompt and create a separate image for it, and flip it upside down. These become the "dream target" images.

The system then computes a loss on the prime images and "dream target" images, as well as the original score distillation loss. If the system has any trouble converging on the "dream target" images, new "dream target" images are generated from the same original text prompts.
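Here's a rough sketch of the two pieces that are easiest to show in code: a differentiable "arrangement process" (a flip, in this case) and a dream-target loss pulling each arranged view toward its independently generated target image. The score distillation part is omitted, and the images here are random stand-ins rather than outputs of a real diffusion model:

```python
import torch
import torch.nn.functional as F

def arrange_flip(prime):
    # Differentiable "arrangement process" for a flip illusion:
    # view 1 is the prime image as-is, view 2 is the same image upside down.
    return prime, torch.flip(prime, dims=[2, 3])

# Hypothetical stand-ins: in the real system these come from diffusion models.
prime = torch.rand(1, 3, 64, 64, requires_grad=True)   # the image being optimized
dream_penguin = torch.rand(1, 3, 64, 64)               # "penguin" dream target
dream_giraffe = torch.rand(1, 3, 64, 64)               # "giraffe" dream target

view_a, view_b = arrange_flip(prime)
# Dream-target loss: pull each arranged view toward its own target image.
loss = F.mse_loss(view_a, dream_penguin) + F.mse_loss(view_b, dream_giraffe)
loss.backward()            # gradients flow through the arrangement back to the prime image
print(prime.grad.shape)    # torch.Size([1, 3, 64, 64])
```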

In this way, the system creates visual illusions. You can even print the images and turn them into real-life puzzles. For some illusions, you print on transparent plastic and overlap the images using an overhead projector.

Diffusion Illusions

#solidstatelife #ai #computervision #genai #diffusionmodels

waynerad@diasp.org

This looks like the video game Doom, but it is actually the output of a diffusion model.

Not only that, but the idea here isn't just to generate video that looks indistinguishable from Doom gameplay, but to create a "game engine" that actually lets you play the game. In fact this diffusion model "game engine" is called "GameNGen", which you pronounce "game engine".

To do this, they actually made two neural networks. The first is a reinforcement learning agent that plays the actual game Doom. As it does so, its output gets ferried over to the second neural network as "training data". In this manner, the first neural network creates unlimited training data for the second neural network.

The second neural network is the actual diffusion model. They started with Stable Diffusion 1.4, a diffusion model "conditioned on" text, which is what enables it to generate images when you input text. They ripped out the "text" stuff, and replaced it with conditioning on "actions", which are the buttons and mouse movements you make to play the game, and previous frames.

Inside the diffusion model, it creates "latent state" that represents the state of the game -- sort of. That's the idea, but it doesn't actually do a good job of it. It does a good job of remembering state that is actually represented on the screen (health, ammo, available weapons, etc), because it's fed the previous 3 frames of video every time step to generate the next frame of video, but not so good at remembering anything that goes off the screen. Oh, probably should mention, this diffusion model runs fast enough to generate images at "real time" video frame rates.
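As a rough illustration of what "conditioning on actions and previous frames" might look like in code -- the shapes, the channel-concatenation, and the action embedding are my own assumptions, not the GameNGen implementation:

```python
import torch
import torch.nn as nn

class FrameActionConditioning(nn.Module):
    """Illustrative conditioning block: previous-frame latents are concatenated
    along channels, and actions are embedded and used roughly where the text
    embedding would have gone."""
    def __init__(self, n_actions=8, latent_ch=4, n_prev=3, cond_dim=256):
        super().__init__()
        self.action_embed = nn.Embedding(n_actions, cond_dim)
        # Mix the noisy latent with the previous-frame latents.
        self.fuse = nn.Conv2d(latent_ch * (1 + n_prev), latent_ch, kernel_size=1)

    def forward(self, noisy_latent, prev_latents, actions):
        # noisy_latent: (B, C, H, W); prev_latents: (B, n_prev*C, H, W); actions: (B, n_prev)
        x = self.fuse(torch.cat([noisy_latent, prev_latents], dim=1))
        cond = self.action_embed(actions).mean(dim=1)   # (B, cond_dim)
        return x, cond

block = FrameActionConditioning()
x, cond = block(torch.randn(2, 4, 32, 32), torch.randn(2, 12, 32, 32),
                torch.randint(0, 8, (2, 3)))
print(x.shape, cond.shape)   # torch.Size([2, 4, 32, 32]) torch.Size([2, 256])
```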

Because it doesn't use the actual Doom game engine code -- or otherwise represent the game state with conventional code -- but instead represents state inside the neural network, and does so imperfectly for anything that goes off the screen, the game feels like real Doom for short periods, but over any extended length of time, human players can tell it's not real Doom.

GameNGen - Michael Kan

#solidstatelife #ai #genai #computervision #diffusionmodels #videogames #doom

waynerad@diasp.org

NoLabs is "an open source biolab that lets you run experiments with the latest state-of-the-art models and workflow engine for bio research."

"The goal of the project is to accelerate bio research by making inference models easy to use for everyone. We are currently supporting protein workflow components (predicting useful protein properties such as solubility, localisation, gene ontology, folding, etc.), drug discovery components (construct ligands and test binding to target proteins) and small molecules design components (design small molecules given a protein target and check drug-likeness and binding affinity)."

I haven't tried this but figured I'd pass it along to all of you, because if you work in biology it looks useful.

At the center of the system is the Workflow Engine, a visual language where you connect dataflows together.

Next is BioBuddy, "a drug discovery copilot that supports: Downloading data from ChemBL, downloading data from RcsbPDB, questions about drug discovery process, targets, chemical components etc, and writing review reports based on published papers."

There are 11 additional components that come in the form of docker containers that you can plug in: RFdiffusion for protein design, ESMFold for evolutionary scale modeling, ESMAtlas for "metagenomic" structures, Go Model 150M for protein function prediction, ESM Protein Localization model for protein localisation prediction, p2rank for protein binding site prediction, Solubility Model for protein solubility prediction, DiffDock for protein-ligand structure prediction, RoseTTAFold for predicting protein structures based on amino acid sequences, REINVENT4 for doing reinforcement learning on a protein receptor, SC GPT for cell type classification based on genes, and BLAST API for searching using various BLAST (Basic Local Alignment Search Tool) databases.

BasedLabs/NoLabs: Open source biolab

#solidstatelife #ai #genai #llms #diffusionmodels #reinforcementlearning #biology #dna #proteins

waynerad@diasp.org

Hunyuan-DiT is an image generator that generates art with "Chinese elements" using Chinese prompts. It's an open-source model created by Chinese giant Tencent. It's a diffusion model; diffusion models are trained on paired images and text, with the text-image alignment typically learned "contrastively". Hunyuan-DiT was started using an English dataset, and then was "fine-tuned" from there with a Chinese image and Chinese text dataset. Because of this, even though it is optimized for generating Chinese images from Chinese text, it is still capable of generating images from English text. It knows Chinese places, Chinese painting styles, Chinese food, Chinese dragons, traditional Chinese attire, and so on. It looks like if you ask it to generate images of people, it will generate images of Chinese people unless you ask otherwise.

Hunyuan-DiT: A powerful multi-resolution diffusion transformer with fine-grained Chinese understanding

#solidstatelife #ai #genai #computervision #diffusionmodels

waynerad@diasp.org

ToonCrafter: Generative Cartoon Interpolation.

Check out the numerous examples. This looks like something that could really help human animators make cartoons faster without losing their own hand-drawn animation style.

The way the system works is you input two frames, and ask the system to interpolate all the frames in between. You can optionally further augment the input with a sketch.

The way the system works is by using a diffusion model for video generation called DynamiCrafter. DynamiCrafter has an internal "latent representation" that encodes something of the meaning of the frames, which it uses to generate the video frames.

This system, ToonCrafter, uses the first and last frames to work backward to the "latent representations", then interpolates the "latent representations" to get the intermediate frames.
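The interpolation step, at its simplest, could look something like the sketch below. This is just a linear blend of two stand-in latents with made-up shapes; the real system feeds the result into the DynamiCrafter video diffusion model rather than using it directly:

```python
import torch

def interpolate_latents(z_first, z_last, n_frames):
    """Linearly blend the two endpoint latents to seed the in-between frames.
    (The real system conditions a video diffusion model on these; this only
    shows the interpolation idea.)"""
    weights = torch.linspace(0.0, 1.0, n_frames).view(-1, 1, 1, 1)
    return (1 - weights) * z_first + weights * z_last

# Hypothetical latents for the two keyframes (shape is an assumption).
z_first = torch.randn(1, 4, 40, 64)
z_last = torch.randn(1, 4, 40, 64)
z_sequence = interpolate_latents(z_first, z_last, n_frames=16)
print(z_sequence.shape)   # torch.Size([16, 4, 40, 64])
```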

Because DynamiCrafter was trained on live-action video, and there's a huge gap in visual style between live-action and cartoons, such as exaggerated expressions and simplified textures, they had to take pains to "fine tune" the system with a lot of additional training on a high-quality cartoon dataset they constructed themselves.

In addition to the DynamiCrafter video generator, they also added a "detail-injecting" 3D decoder. This is an additional complex part of the system, with multiple 3D residual network layers and upsampling layers.

ToonCrafter: Generative Cartoon Interpolation

#solidstatelife #ai #genai #computervision #diffusionmodels #animation

waynerad@diasp.org

Photorealistic AI-generated talking humans. "VLOGGER" is a system for generating video to match audio of a person talking. So you can make video of any arbitrary person saying any arbitrary thing. You just supply the audio (which could itself be AI-generated) and a still image of a person (which also could itself be AI-generated).

Most of the sample videos wouldn't play for me, but the ones in the top section did and seem pretty impressive. You have to "unmute" them to hear the audio and see that the video matches the audio.

They say the system works using a 2-step approach where the first step is to take just the audio signal, and use a neural network to predict what facial expressions, gaze, gestures, pose, body language, etc, would be appropriately associated with that audio, and the second step is to combine the output of the first step with the image you provide to generate the video. Perhaps surprisingly (at least to me), both of these are done with diffusion networks. I would've expected the second step to be done with diffusion networks, but the first to be done with some sort of autoencoder network. But no, they say they used a diffusion network for that step, too.

So the first step is taking the audio signal and converting it to spectrograms. In parallel with that, the input image is fed into a "reference pose" network that analyses it to determine what the person looks like and what pose the rest of the system has to deal with as a starting point.

These are fed into the "motion generation network". The output of this network is "residuals" that describe face and body positions. It generates one set of all these parameters for each frame that will be in the resulting video.

The result of the "motion generation network", along with the reference image and the pose of the person in the reference image is then passed to the next stage, which is the temporal diffusion network that generates the video. A "temporal diffusion" network is a diffusion network that generates images, but it has been modified so that it maintains consistency from frame to frame, hence the "temporal" word tacked on to the name. In this case, the temporal diffusion network has undergone the additional step of being trained to handle the 3D motion "residual" parameters. Unlike previous non-diffusion-based image generators that simply stretched images in accordance with motion parameters, this network incorporates the "warping" parameters into the training of the neural network itself, resulting in much more realistic renditions of human faces stretching and moving.

This neural network generates a fixed number of frames. They use a technique called "temporal outpainting" to extend the video to any number of frames. The "temporal outpainting" system re-inputs the previous frames, minus 1, and uses that to generate the next frame. In this manner they can generate a video of any length with any number of frames.
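Here's a toy sketch of that "temporal outpainting" loop, with a random-noise stand-in for the actual temporal diffusion network and made-up frame shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
WINDOW = 8   # assumed size of the fixed block the diffusion network produces

def generate_frame(conditioning_frames):
    # Stand-in for the temporal diffusion network: returns one new frame.
    # Here it's just noise shaped like a frame, to keep the sketch runnable.
    return rng.normal(size=conditioning_frames.shape[1:])

def temporal_outpaint(first_block, n_total):
    """Extend a fixed-length block to n_total frames by repeatedly feeding the
    last WINDOW-1 frames back in and generating one more frame."""
    frames = list(first_block)
    while len(frames) < n_total:
        context = np.stack(frames[-(WINDOW - 1):])
        frames.append(generate_frame(context))
    return np.stack(frames)

first_block = rng.normal(size=(WINDOW, 64, 64, 3))   # assumed frame shape
video = temporal_outpaint(first_block, n_total=30)
print(video.shape)   # (30, 64, 64, 3)
```

Each pass conditions on the most recent WINDOW-1 frames and appends one newly generated frame, so the video can keep growing indefinitely.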

As a final step they incorporate an upscaler to increase the pixel resolution of the output.

VLOGGER: Multimodal diffusion for embodied avatar synthesis

#solidstatelife #ai #computervision #generativeai #diffusionmodels

waynerad@diasp.org

"Sora AI: When progress is a bad thing."

This guy did experiments where he asked people to pick which art was AI generated and which art was human made. They couldn't tell the difference. Almost nobody could tell the difference.

To be sure, and "just to mess with people", he would tell people AI-generated art was made by humans and human art was made by AI and ask people to tell him how they could tell. People would proceed to tell him all the reasons why an AI-generated art piece was an amazing masterpiece clearly crafted by human hands -- with emotions and feelings. And when shown art made by a human and told it was AI-generated, people would write out a paragraph describing to me all the reasons how they could clearly tell why this was generated by AI.

That's pretty interesting but actually not the point of this video. The point of the video is that AI art generators don't give people the same level of control over the art as making it themselves, but the AI clearly has an understanding of, for example, what a road is and what a car is, and a basic understanding of physics and cause and effect.

He thinks we're very close to being able to take a storyboard and "shove it into the AI and it just comes up with the perfect 3D model based on the sketch, comes up with the skeletal mesh, comes up with the animations it -- infers details of the house based on your terrible drawings -- it manages the camera angles, creates the light sources, gives you access to all the key framing data and positions of each object within the scene, and with just a few tweaks you'd have a finished product. The ad would be done in like an hour or two, something that ..."

He's talking about the "Duck Tea" example in the video -- he made up a product called "Duck Tea" that doesn't exist and pondered what would be involved in making an ad for it.

"Would have taken weeks of planning and work, something that would have taken a full team a long time to finish, would take one guy one afternoon."

The solution: Vote for Michelle Obama because she will introduce Universal Basic Income?

Sora AI: When progress is a bad thing - KnowledgeHusk

#solidstatelife #ai #genai #diffusionmodels #computervision #aiethics

waynerad@diasp.org

Reaction video to OpenAI Sora, OpenAI's system for generating video from text.

I encountered the reaction video first, in fact I discovered Sora exists from seeing the reaction video, but see below for the official announcement from OpenAI.

It's actually kind of interesting and amusing comparing the guesses in the reaction videos about how the system works with the way it actually works. People are guessing based on their knowledge of traditional computer graphics and 3D modeling. However...

The way Sora works is quite fascinating. We don't know the nitty-gritty details but OpenAI has described the system at a high level.

Basically it combines ideas from their image generation and large language model systems.

Their image generation systems, DALL-E 2 and DALL-E 3, are diffusion models. Their large language models, GPT-2, GPT-3, GPT-4, GPT-4-Vision, etc, are transformer models. (In fact "GPT" stands for "generative pretrained transformer").

I haven't seen diffusion and transformer models combined before.

Diffusion models work by having a set of parameters in what they call "latent space" that describe the "meaning" of the image. The word "latent" is another way of saying "hidden". The "latent space" parameters are "hidden" inside the model but they are created in such a way that the images and text descriptions are correlated, which is what makes it possible to type in a text prompt and get an image out. I've elsewhere given high-level hand-wavey descriptions of how the latent space parameters are turned into images through the diffusion process, and how the text and images are correlated (a training method called CLIP), so I won't repeat that here.

Large language models, on the other hand, work by turning words and word pieces into "tokens". The "tokens" are vectors constructed in such a way that the numerical values in the vectors are related to the underlying meaning of the words.

To make a model that combines both of these ideas, they figured out a way of doing something analogous to "tokens" but for video. They call their video "tokens" "patches". So Sora works with visual "patches".

One way to think of "patches" is as video compression both spatially and temporally. Unlike a video compression algorithm such as mpeg that does this using pre-determined mathematical formulas (discrete Fourier transforms and such), in this system the "compression" process is learned and is all made of neural networks.

So with a large language model, you type in text and it outputs tokens which represent text, which are decoded to text for you. With Sora, you type in text and it outputs tokens, except here the tokens represent visual "patches", and the decoder turns the visual "patches" into pixels for you to view.

Because the "compression" works both ways, in addition to "decoding" patches to get pixels, you can also input pixels and "encode" them into patches. This enables Sora to input video and perform a wide range of video editing tasks. It can create perfectly looping video, it can animate static images (why no Mona Lisa examples, though?), it can extend videos, either forward or backward in time. Sora can gradually interpolate between two input videos, creating seamless transitions between videos with entirely different subjects and scene compositions. I found these to be the most freakishly fascinating examples on their page of sample videos.

They list the following "emerging simulation capabilities":

"3D consistency." "Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space."

This is where they have the scene everyone is reacting to in the reaction videos, where the couple is walking down the street in Japan with the cherry blossoms.

By the way, I was wondering what kind of name is "Sora" so I looked it up on behindthename.com. It says there are two Japanese kanji characters both pronounced "sora" and both of which mean "sky".

"Long-range coherence and object permanence." "For example, our model can persist people, animals and objects even when they are occluded or leave the frame. Likewise, it can generate multiple shots of the same character in a single sample, maintaining their appearance throughout the video."

"Interacting with the world." "Sora can sometimes simulate actions that affect the state of the world in simple ways. For example, a painter can leave new strokes along a canvas that persist over time, or a man can eat a burger and leave bite marks."

"Simulating digital worlds." "Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity."

However they say, "Sora currently exhibits numerous limitations as a simulator." "For example, it does not accurately model the physics of many basic interactions, like glass shattering."

This is incredible - ThePrimeTime

#solidstatelife #ai #genai #diffusionmodels #gpt #llms #computervision #videogeneration #openai

waynerad@diasp.org

MagicAnimate animates humans based on a reference image. See the Mona Lisa jogging or doing yoga.

The way it works is, well, first of all it uses a diffusion network to generate the video. Systems for generating video using GANs (generative adversarial networks) have also been developed. Diffusion networks, however, have recently shown themselves to be better at taking a human-entered text prompt and turning that into an image. The problem, though, is that if you want to make a video, you go frame by frame, and since each frame is independent of the others, that inevitably leads to flickering.

The key insight here is that instead of doing the "diffusion" process frame-by-frame, you do it on the entire video all at once. This enables "temporal consistency" across frames. A couple more elements are necessary to get the whole system to work, though.

One is discarding the normal way diffusion networks use an internal encoding that is tied to a text prompt. In this system, since a reference image is provided instead, there is no text prompt. So the whole system is trained to use an internal encoding that is based on appearance. This enables the system to maintain the appearance of the original video for both the human being animated and the background.

The other key piece that gets the system to work is incorporating a prior system called ControlNet. ControlNet analyzes the pose provided and converts it into a motion signal, which is a dense set of body "keypoints". The first stage of the process analyzes the control points; the second stage is joint diffusion of the control points and the reference image.

If you're wondering how the system manages to hold the entire video in memory to do the diffusion process on the entire video, the answer is that actually it doesn't. Because they needed to get the system to work on GPUs with limited memory, the researchers actually devised a "sliding window" system where it would generate overlapping segments of video. The frames are close enough that they can be combined with simple averaging and the end result looks okay.
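The sliding-window idea is simple enough to sketch directly: generate overlapping segments, then average the frames where segments overlap. Segment length, stride, and frame shapes below are arbitrary:

```python
import numpy as np

def blend_segments(segments, stride):
    """Average overlapping video segments into one sequence.
    segments: list of arrays shaped (seg_len, H, W, C), each starting `stride`
    frames after the previous one. Overlapping frames are simply averaged,
    as described above."""
    seg_len = segments[0].shape[0]
    total = stride * (len(segments) - 1) + seg_len
    acc = np.zeros((total,) + segments[0].shape[1:])
    count = np.zeros((total, 1, 1, 1))
    for i, seg in enumerate(segments):
        start = i * stride
        acc[start:start + seg_len] += seg
        count[start:start + seg_len] += 1
    return acc / count

rng = np.random.default_rng(0)
segments = [rng.normal(size=(16, 8, 8, 3)) for _ in range(4)]  # toy segments
video = blend_segments(segments, stride=8)
print(video.shape)   # (40, 8, 8, 3)
```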

Speaking of the researchers, this was a joint team between ByteDance and the National University of Singapore. ByteDance as in, the maker of TikTok. Application of this to TikTok is obvious.

#solidstatelife #ai #genai #diffusionmodels #videoai

https://showlab.github.io/magicanimate/

waynerad@diasp.org

Visual Electric: "A breakthrough interface for generative AI".

"Visual Electric is the first image generator designed for creatives -- a canvas that facilitates the flow of ideas, so you can truly spread out and see where the tool takes you. It's designed to help give form to the vision in your mind's eye... or lead you to something even better."

"We believe that in order to be truly useful, AI needs to augment our existing creative process with all its winding paths, switchbacks and u-turns. This requires tools that embrace ambiguity over certainty. Often our best new ideas appear in the margins."

Commercial product with a free tier.

The interface has such features as autosuggest for your text prompts, "remix" tools that let you change colors and styles while keeping everything else in the image the same, an assortment of "hand-crafted" styles, the ability to do "inpainting" while keeping the rest of the image the same, the ability to create variations that change the "temperature" of the generative network, the ability to upscale the image and make other touch-ups, and they claim to have trained the whole thing on their own "library of stunning images".

Visual Electric

#solidstatelife #ai #genai #diffusionmodels

waynerad@diasp.org

Stable Video from Stability AI, the same company that made Stable Diffusion, has been released. Károly Zsolnai-Fehér of Two Minute Papers does a quick run-down, comparing it with existing systems like Runway, Emu Video, and Imagen Video. Stable Video was trained on 600 million videos.

Imperfections: The videos have to be short. Sometimes instead of real animation, you get camera panning. If you want text in your video, it will have trouble. It requires a lot of GPU memory to run. It can't do iterative edits, which Emu Video can do.

On the plus side, Stable Video is completely open source.

Stable Video AI watched 600,000,000 videos - Two Minute Papers

#solidstatelife #ai #genai #diffusionmodels

waynerad@diasp.org

Optical illusions created with diffusion models. Images that change appearance when flipped or rotated. Actually these researchers created a general-purpose system for making optical illusions for a variety of transformations. They've named their optical illusions "visual anagrams".

Now, I know I told you all I would write up an explanation of how diffusion models work, and I've not yet done that. There's a lot of advanced math that goes into them.

The key thing to understand, here, about diffusion models, is that they work by taking an image and adding Gaussian noise... in reverse. You start with random noise, and then you "de-noise" the image step by step. And you "de-noise" it in the direction of a text prompt.

The way this process works is, you feed in the image and the text prompt, and what the neural network computes is the "noise". Crucially, this "noise" computation isn't a single number, it's a pixel-by-pixel noise estimate -- essentially another image. "Noise" relative to what? Relative to what the image ought to look like given the text prompt. Amazingly enough, using this "error" to "correct" the image and then iterating on the process guides it into an image that fits the text prompt.

The trick they've done here is, they first take the image and compute the "noise" on it the normal way. Then they take the image and put it through its transformation -- rotation, vertical flipping, or puzzle-piece-like rearrangement (rotation, reflection, and translation), then compute the "noise" on that image (using a different text prompt!) and then they do the reverse transformation on the "noise" image. They then combine the original "noise" and the reverse transformation "noise" by simple averaging.
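That trick is compact enough to sketch. Here the denoiser is a random-noise stand-in and the transformation is a 180-degree rotation (which is its own inverse); with a real model, each call would return the text-conditioned noise prediction:

```python
import torch

def combined_noise_estimate(denoiser, image, prompt_a, prompt_b, transform, inverse):
    """One step of the visual-anagram trick: estimate noise on the image for
    prompt A, estimate noise on the transformed image for prompt B, map the
    second estimate back with the inverse transform, and average the two."""
    eps_a = denoiser(image, prompt_a)
    eps_b = inverse(denoiser(transform(image), prompt_b))
    return 0.5 * (eps_a + eps_b)

# Stand-ins so the sketch runs: a fake denoiser and a 180-degree rotation.
fake_denoiser = lambda img, prompt: torch.randn_like(img)
rotate_180 = lambda img: torch.rot90(img, k=2, dims=(2, 3))

image = torch.randn(1, 3, 64, 64)
eps = combined_noise_estimate(fake_denoiser, image, "an oil painting of a penguin",
                              "an oil painting of a giraffe", rotate_180, rotate_180)
print(eps.shape)   # torch.Size([1, 3, 64, 64])
```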

This only works for certain transformations. Basically the two conditions the transformation has to satisfy are "linearity" and "statistical consistency". By linearity, they mean diffusion models fundamentally think in terms of "signal + noise" as a linear combination. If your transformation breaks this assumption, your transformation won't work. By "statistical consistency" they mean diffusion networks assume the "noise" is Gaussian, meaning it follows a Gaussian distribution. If your transformation breaks this assumption, it won't work.

These assumptions hold for the 3 transformations I've mentioned so far: rotation, reflection, and translation. It also works for one more: color inversion. Like a photographic negative. The color values have to be kept centered on 0, though. Their examples are only black-and-white.

Another thing they had to do was use a different diffusion model because Stable Diffusion actually has "latent space" values that refer to groups of pixels. They used an alternative called DeepFloyd IF, where the "latent space" values are per-pixel. I haven't figured out exactly what "latent space" values are learned by each of these models so I can't tell you why this distinction matters.

Another thing is that the system also incorporated "negative prompting" in its "noise" estimate, but they discovered you have to be very careful with "negative prompting". Negative prompts tell the system what it must leave out rather than include in the image. An example that illustrates the problem: say you used "oil painting of a dog" and "oil painting of a cat". They both have "oil painting", so you're telling the system to both include and exclude "oil painting".

The website has lots of animated examples; check it out.

Visual anagrams: Generating multi-view optical illusions with diffusion models

#solidstatelife #ai #genai #diffusionmodels #opticalillusions

waynerad@diasp.org

AI models form "mental models" of the world. An AI trained to play a board game called Othello was opened up and found to be forming a mental model of the board. The AI system here was actually language model. It was given valid games as training data, and its job was to output a word that would represent a move in the game -- except the AI system was never told there was a game. It was simply given training data that to it looks like sequences of words, and output words. Without knowing what it was dealing with was a board game, you would expect it would develop a statistical model of likely good moves. But surprisingly, when the researchers opened up the box and looked at what was actually happening inside the language model, they found it was creating a representation of the board, keeping track of whose pieces are in which positions.

In another study, diffusion models, the kind of models that create images, such as DALL-E 2, Midjourney, and Stable Diffusion, were opened up, and inside it was found that they form 3D depth models of images. In other words, even though diffusion models are trained on 2-dimensional images only, they form "mental models" that are 3-dimensional. Early on in the process of generating an image, they conceive of how the objects in the scene are related in 3 dimensions.

I previously told you all about an interview where Ilya Sutskever, leader of the research team that created GPT-4, said he believes large language models have real understanding. He said people say these models just learn statistical regularities and have no model of the world, but he disagrees, and says the best way to predict what words will come next is to genuinely understand what is being talked about. So asking models to predict what word comes next is a far bigger deal than meets the eye. To predict, you need to understand the true underlying process that produced the data. Even though language models only see the world through the shadow of text as expressed by human beings on the internet, they are able to develop a shocking degree of understanding.

That was back in April. Now we are starting to see evidence, from looking inside, that language models do indeed form "mental models" of the world they are predicting. Not just language models but diffusion models as well. This may be a general feature of generative models and maybe we will find it more and more.

Beyond surface statistics - AI secretly builds visual models of the world - Wes Roth

#solidstatelife #ai #genai #llms #diffusionmodels #othello

waynerad@diasp.org

"A woman with flowers in her hair in a courtyard, in the style of ..." and then you can pick from 1,590 artists. Aditya Shankar was wondering how Stable Diffusion would draw what would otherwise be the exact same prompt except you can see how 1,500+ artists would have drawn it.

I put a prompt in stable diffusion to see how 1500+ artists would have drawn it

#solidstatelife #ai #generativemodels #diffusionmodels #stablediffusion

waynerad@diasp.org

Make-A-Video, a new AI system from Facebook, "lets people turn text prompts into brief, high-quality video clips."

"The system learns what the world looks like from paired text-image data and how the world moves from video footage with no associated text."

The way the system works is, they first started with a text-to-image system, which is in fact a diffusion network (like DALL-E 2, Midjourney, Imagen, and Stable Diffusion). They extended the system by adding layers -- in this case they added convolutional layers to the part of the system that used convolutional layers (for image processing) and added attention layers to the part of the system that does the text processing using attention layers. Actually the additional convolution layers are 1D convolution layers. The attention layers as well aren't full attention layers, but use an approximation system that requires less computing power. In fact the additional convolutional layers are called "pseudo-convolutional" layers and the additional attention layers are called "pseudo attention" layers. What these layers do is, in addition to the spatial (space-based) information from the existing layers, they get a "time step" input and so have "temporal" (time-based) information as well. So these layers link together spatial and temporal information. And evidently, to do them as full convolutional and full attention layers would consume too much computing power.
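A "pseudo-3D" convolution of this sort is commonly implemented as a pretrained 2D spatial convolution followed by a new 1D convolution across time, which is far cheaper than a full 3D convolution. Here's a generic sketch under that assumption; the sizes are made up and this is not the layer from the paper:

```python
import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    """Illustrative pseudo-3D convolution: a 2D spatial conv followed by a
    1D temporal conv, instead of one expensive full 3D convolution."""
    def __init__(self, channels=64, k=3):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, k, padding=k // 2)
        self.temporal = nn.Conv1d(channels, channels, k, padding=k // 2)

    def forward(self, x):
        # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape
        y = self.spatial(x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))
        y = y.reshape(b, t, c, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        y = self.temporal(y)              # mixes information across time only
        return y.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)

x = torch.randn(1, 64, 8, 16, 16)
print(Pseudo3DConv()(x).shape)            # torch.Size([1, 64, 8, 16, 16])
```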

With the original layers trained to do spatial text-to-image generation, the new layers are trained on videos to learn how things move in video. So it learns how ocean waves move, how elephants move, and so on. This training involves an additional frame rate conditioning parameter.

At this point we're almost done. The last step is to do upscaling, which they call superresolution. This increases the output to the full resolution. But they don't just do this spatially, they do it temporally as well. So just as when you increase the resolution of an image, a neural network has to imagine what should go in the newly created pixels, this system has a neural network that imagines what pixels should go into newly created video frames, to maintain smoothness between frames.

Introducing Make-A-Video: An AI system that generates videos from text

#solidstatelife #ai #generativenetworks #diffusionmodels

waynerad@diasp.org

In addition to doing text-to-image, apparently Stable Diffusion can also do image-to-image. Here's a little collection of MSPaint sketches turned into concept art.

Reddit is going crazy turning MSPaint sketches into concept art using Img2img with #stablediffusion and it’s wild (1/n)

#solidstatelife #ai #computervision #generativeai #diffusionmodels