Text-to-Image Diffusion Models Part II

An Illustrated Guide to Diffusion Models

Jun 03, 2023

You may have noticed that last week's edition didn't arrive in your inbox. I was away on vacation with family and just got back into the thick of things. After a restful break, I'm back to discuss diffusion models. In the last edition, we looked at the high-level intuition behind diffusion models and how they work. This week, we'll look at why these models are called diffusion models, the two processes that govern them, and how they are trained.

This Week on Gradient Ascent:

An Illustrated Guide to Diffusion Models 🌬️
[Check out] Mind Reading? 🧠
[Check out] Controlled Diffusion Models 📝
[Definitely watch] The State of GPT 📽️
[Consider taking] Free LangChain course ⛓️

An Illustrated Guide to Diffusion Models

We've seen the intuition behind diffusion models. But why on earth are they called diffusion models in the first place? Imagine your favorite caffeinated beverage. Yes, coffee. Now imagine the precise moment when a drop of milk falls into a hot, freshly brewed cup of coffee. You can see that exact spot where it touched the coffee's surface for a fraction of a second. Before long, the milk has permeated the coffee, turning it into a beautiful amber color. You can no longer pinpoint where the drop of milk first hit the coffee. This process by which the milk randomly distributes itself within the coffee is called diffusion. Our goal is to recover the original coffee and reverse the mixing of milk in it. This is the idea behind diffusion models too. They gradually try to unmix noise (milk) from the image (coffee).

Going Forward In Reverse

Everything about these models can be explained by two processes that go in opposite directions: the forward process and the reverse process. Yes, very creatively named.

The Forward Process

The forward process governs how these models are trained. We add noise to an image. It's the model's job to figure out the noise that we added. In other words, it has to figure out what must be removed to recover the original image. Neural networks are really good at these kinds of tasks.

Typically, U-Nets are used for this purpose. U-Nets are amazing for image segmentation tasks. Turns out, they're just as amazing for removing noise too! All we need is a small tweak to the network. In a segmentation problem, the model takes in an image as input and produces a segmentation map as output.

Here, it takes a noisy image as input and predicts the noise as output.

So, how do we train this model? We take a dataset of images of our choosing. These can be anything. If we have cat and dog images, the model will learn to create cat and dog images. If we have random cartoon images, it will learn to create random cartoon images. You get the drift.

Let's assume that the dataset has cat and dog images. From this dataset, we choose an image at random. We then add noise to this image. If we add a little bit of noise, we can still see the original image for the most part. If we add a ton of noise, the original image becomes unrecognizable. We no longer know what that cat or dog looked like, or even whether it was a cat or a dog in the first place.

Thus, for the same image, we can add different amounts of noise to create different training examples. So, for each image in our dataset, we can create several training examples for the model to learn from.
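To make this concrete, here's a minimal sketch of turning one image into several training examples. Everything in it is an assumption for illustration: the 4-pixel "image", the `add_noise` helper, and the `signal_keep` knob are made up, and the square-root blend is the mixing rule commonly used in the diffusion literature rather than something this guide has introduced.

```python
import math
import random


def add_noise(image, signal_keep, rng):
    """Blend an image with Gaussian noise (illustrative helper).

    signal_keep is in [0, 1]: 1.0 returns the image untouched and
    0.0 returns pure noise. Returns (noisy_image, noise) so the noise
    can later serve as the training target.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in image]
    noisy = [
        math.sqrt(signal_keep) * pixel + math.sqrt(1.0 - signal_keep) * n
        for pixel, n in zip(image, noise)
    ]
    return noisy, noise


rng = random.Random(0)
image = [0.1, 0.5, 0.9, 0.3]  # a made-up 4-pixel "image"

# One image, several noise levels -> several training examples.
training_examples = []
for signal_keep in (0.9, 0.5, 0.1):
    noisy, noise = add_noise(image, signal_keep, rng)
    training_examples.append((noisy, signal_keep, noise))
```

Each entry pairs a noisy image with the exact noise that was mixed in, which is what the model will be asked to guess.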

Great, we have a dataset to train the model on. Now what?

We show these noisy images to the model and ask it to guess the noise that was added to each image.

At first, the model has no clue how much noise has been added, so it guesses randomly. Here's where our trusty backpropagation and gradient descent come in. We compare the model's predictions with the actual noise we added and use the difference to point out where the model went wrong. The model updates its weights. Rinse and repeat.
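The guess, compare, update loop can be sketched in a few lines. This is a toy under loud assumptions, not how a real diffusion model is trained: the "images" are single made-up pixels, the "model" is a two-parameter line (`w` and `b`) instead of a U-Net, the noise level is fixed, and the weights start at zero rather than at random values so the run is reproducible.

```python
import math
import random

rng = random.Random(0)
SIGNAL_KEEP = 0.5            # fixed noise level, chosen for illustration
PIXELS = [0.2, 0.8]          # made-up one-pixel "images"


def make_example():
    """Return (noisy_pixel, noise) for a randomly chosen clean pixel."""
    clean = rng.choice(PIXELS)
    noise = rng.gauss(0.0, 1.0)
    noisy = math.sqrt(SIGNAL_KEEP) * clean + math.sqrt(1.0 - SIGNAL_KEEP) * noise
    return noisy, noise


def mean_squared_error(w, b, examples):
    """Average squared gap between predicted noise and true noise."""
    return sum((w * x + b - n) ** 2 for x, n in examples) / len(examples)


held_out = [make_example() for _ in range(500)]

w, b = 0.0, 0.0              # untrained toy model: always predicts zero noise
loss_before = mean_squared_error(w, b, held_out)

learning_rate = 0.02
for _ in range(5000):
    noisy, noise = make_example()
    error = (w * noisy + b) - noise      # prediction minus the true noise
    # Gradient of the squared error with respect to w and b.
    w -= learning_rate * 2 * error * noisy
    b -= learning_rate * 2 * error

loss_after = mean_squared_error(w, b, held_out)
```

Comparing `loss_before` with `loss_after` shows the model's guesses getting closer to the noise we actually added. A real setup would swap the line for a U-Net and let a deep learning framework compute the gradients.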

At the end of this process, we have a superb noise-removal network. But what does this have to do with generating new images?

The Reverse Process

The reverse process is what we use in practice to generate images. As the name implies, the goal is to remove noise from the image until we have a noiseless image.

Now that training is complete, let's see how our model does on a few images it's not seen before. First, a simple example. We add a teensy bit of noise to an image of a dog and show the model this noisy image. It effortlessly removes the noise from this image to reveal the dog.

Next up, a harder example. We add quite a bit of noise to a tabby cat. We can still make out that it's a cat underneath all that noise. But only just. When the model sees this corrupted image, it does a pretty good job of denoising it. But some of the details of the original image have changed: the eyes are closed, the stripes are missing, and so on. After all, the model didn't see the original image without the noise. It has to guess how much noise to remove from each pixel of the image. That's why the details are either changed or missing.

Now, here's where the beauty of this method shines through. For the final test, we give the model an image of pure noise. This isn't an image of a cat or a dog where a truckload of noise has been added. This is just an image of pure noise. Nothing else.

What happens when we give this image to the model?

It tries to remove the noise to reveal the original image. The only problem: there's no original image, right? Doesn't matter. Since the model has been trained to think that there's always a cat or dog hiding behind a mountain of noise, it removes noise in a way that produces an image that actually resembles a cat or dog. So it makes sure that there are paws, eyes, ears, and so on. Thus, in this case, the model removes noise to reveal a picture of either a cat or a dog.

From no noise to pure noise, the model always assumes there’s something behind the curtain.

Crucially, we can't control what it generates (for now). All we can be sure of is that the result will be a cat or a dog.

So, to summarize, the model always assumes that there's an image hiding behind noise. What the image is depends on the training dataset. If we give it pure noise, it will remove the noise in such a way that the result is a meaningful image.

In other words, this model paints a picture by removing noise.

One Step or Many?

There might be many questions swimming in your head right now. However, if you've used one of these models before, there might be one question that burns more brightly than the others. If all the model does is remove noise from the image, why does it take so long to produce the final image?

These models don't denoise an image in one shot. They denoise it over a number of steps. At each step, a bit of noise is removed from the image, and the partially denoised image is fed back to the model. This repeats until the image is fully noise-free. Because every step is a full pass through the network, the number of steps largely determines how long generation takes, and researchers are actively working on ways to reduce it.
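Here's a rough sketch of that step-by-step loop. Treat every detail as an assumption: `predict_noise` is a hypothetical oracle that already knows the answer (a made-up 4-pixel image) and stands in for a trained network, the linear `signal_keep` schedule is invented, and the update rule is a deterministic, DDIM-style step that I picked for simplicity, not something taken from this guide.

```python
import math
import random

TARGET = [0.1, 0.5, 0.9, 0.3]   # made-up 4-pixel "image" the oracle knows
NUM_STEPS = 50


def signal_keep(t):
    """Fraction of signal left at step t: 1.0 at t=0, near 0 at t=NUM_STEPS."""
    return 1.0 - t / (NUM_STEPS + 1)


def predict_noise(noisy, t):
    """Oracle stand-in for a trained network (hypothetical).

    It knows TARGET, so it can solve for the noise exactly. A real model
    would have to estimate this from the noisy image alone.
    """
    a = signal_keep(t)
    return [(x - math.sqrt(a) * x0) / math.sqrt(1.0 - a)
            for x, x0 in zip(noisy, TARGET)]


def generate(rng):
    image = [rng.gauss(0.0, 1.0) for _ in TARGET]   # start from pure noise
    for t in range(NUM_STEPS, 0, -1):
        a, a_prev = signal_keep(t), signal_keep(t - 1)
        noise = predict_noise(image, t)
        # Estimate the clean image, then re-noise it to the lighter level t-1.
        clean = [(x - math.sqrt(1.0 - a) * n) / math.sqrt(a)
                 for x, n in zip(image, noise)]
        image = [math.sqrt(a_prev) * c + math.sqrt(1.0 - a_prev) * n
                 for c, n in zip(clean, noise)]
    return image


result = generate(random.Random(0))
```

Because the oracle is perfect, `result` lands on `TARGET` no matter which noise we start from. A trained network only approximates the noise, which is why real systems take many small steps and why each run from fresh noise gives a different image.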

So, we've seen how the diffusion model creates images by subtracting noise. But there are two problems we haven't talked about. First, the system above isn't how Stable Diffusion works. There's a secret ingredient I haven't yet explained. What do you think that is? Second, what happened to the text prompt? The answers to these questions will be revealed in the final installment.

Resources To Consider:

Mind Reading?

Project Page: https://mind-video.com/

Paper: https://arxiv.org/abs/2305.11675

In this crazy futuristic work, researchers use fMRI data to reconstruct high-quality videos from brain signals. Yes, you read that correctly. Brain signals. Just look at the image below to see why I'm so excited by this work.

Controlled Image Generation

Project Page: https://dave.ml/selfguidance/

Paper: https://arxiv.org/abs/2306.00986

The authors propose a novel method to control image generation using diffusion models. Without any extra models or training, they show how their approach can move, resize, and replace objects with items from real images. This is particularly useful for image editing applications, and I'm sure we'll see this tech put into consumer apps in the near future.

The State of GPT

In this talk, Andrej Karpathy covers everything involved in the training of GPT assistants like ChatGPT. This 45-minute talk is packed with information and is a must-watch!

Free LangChain Course

Link: https://www.deeplearning.ai/short-courses/langchain-for-llm-application-development/

DeepLearning.AI just announced a new free short course on LangChain for LLM app development. In this course, the creator of LangChain will teach you how it works, how to apply it to your proprietary data to build apps, and much more. It's free only for a short time, so consider taking it if this is relevant to you.
