For this project, I use the DeepFloyd IF diffusion model. DeepFloyd IF is a two-stage model trained by Stability AI: the first stage produces images of size 64 x 64, and the second stage takes the outputs of the first stage and generates images of size 256 x 256. DeepFloyd was trained as a text-to-image model, which takes text prompts as input and outputs images that are aligned with the text. Because the text encoder is very large and barely fits on a free-tier Colab GPU, I precomputed a few text embeddings to try. In the notebook, I instantiate DeepFloyd's stage_1 and stage_2 objects used for generation, as well as several text prompts for sample generation. Below are three text prompts and their corresponding images. For the first one, I set the number of inference steps to 20: the images are detailed and very colourful, but their definition is not very sharp - they are still a little blurry. For the second one, I increased the number of inference steps to 40: the images' definition improved, as the outlines of the shapes and figures are cleaner. The man wearing the hat looks more realistic than in the previous set of images, whereas the rocket ship looks more like a 2D cartoon. For the third one, I decreased the number of inference steps to 5: the quality of the images dropped considerably, and the colours are very dull and muted. The images are not as clear as the previous sets, and there is some speckling across them that makes it harder to see the key features. The objects also appear more abstract and less realistic. Therefore, increasing the number of inference steps produces more detailed, bright, and realistic images, whereas decreasing it produces more abstract, dull, and less detailed images. I used a seed of 180 throughout.
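As a rough sketch of this setup (a minimal sketch assuming the diffusers DeepFloyd IF pipelines; `prompt_embeds` and `negative_embeds` stand in for one pair of the precomputed text embeddings):

```python
import torch
from diffusers import DiffusionPipeline

# Load the two DeepFloyd IF stages. The text encoder is not loaded here because the
# text embeddings were precomputed ahead of time.
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
).to("cuda")
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
).to("cuda")

generator = torch.manual_seed(180)  # fixed seed for reproducibility

# Stage 1: 64 x 64 image; num_inference_steps is the knob varied above (20, 40, 5).
image_64 = stage_1(
    prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds,
    num_inference_steps=20, generator=generator, output_type="pt",
).images

# Stage 2: upsample the 64 x 64 output to 256 x 256.
image_256 = stage_2(
    image=image_64, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds,
    generator=generator, output_type="pt",
).images
```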
A key part of diffusion is the forward process, which takes a clean image and adds noise to it. In this part, I write a function to implement this. Given a clean image, we get a noisy image at timestep t by sampling from a Gaussian; note that the forward process is not just adding noise - we also scale the image.
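Concretely, the noisy image is x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, with eps ~ N(0, I). A minimal sketch of the function, assuming alphas_cumprod comes from stage_1.scheduler.alphas_cumprod:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0): scale the clean image down and add Gaussian noise."""
    alpha_bar_t = alphas_cumprod[t]
    eps = torch.randn_like(im)  # eps ~ N(0, I)
    return torch.sqrt(alpha_bar_t) * im + torch.sqrt(1 - alpha_bar_t) * eps
```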
In this section, I take the noisy images from timesteps [250, 500, 750] and use Gaussian blur filtering to try to remove the noise. Getting good results this way is quite difficult, if not impossible.
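A minimal sketch of this classical baseline, assuming `noisy_images` maps each timestep to the noisy image from the previous part (kernel size and sigma are illustrative choices):

```python
import torchvision.transforms.functional as TF

blurred = {t: TF.gaussian_blur(noisy_images[t], kernel_size=5, sigma=2.0)
           for t in [250, 500, 750]}
```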
Now, I use a pretrained diffusion model to denoise. The actual denoiser is stage_1.unet, a UNet that has already been trained on a very, very large dataset of image pairs. We can use it to estimate the Gaussian noise in an image, and then remove that estimated noise to recover something close to the original image. Note: this UNet is conditioned on the amount of Gaussian noise by taking the timestep t as an additional input.
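A sketch of this one-step denoise, assuming the diffusers UNet interface and that only the first three output channels are the noise prediction (the IF UNet also outputs learned-variance channels, which are dropped here):

```python
import torch

def one_step_denoise(x_t, t, unet, prompt_embeds, alphas_cumprod):
    """Estimate the noise in x_t, then solve the forward equation for a clean-image estimate."""
    with torch.no_grad():
        model_out = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample
    noise_est = model_out[:, :3]  # keep the noise prediction, drop the variance channels
    alpha_bar_t = alphas_cumprod[t]
    x0_est = (x_t - torch.sqrt(1 - alpha_bar_t) * noise_est) / torch.sqrt(alpha_bar_t)
    return x0_est, noise_est
```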
In the previous part, the denoising UNet does a much better job of projecting the image onto the natural image manifold, but it does get worse as you add more noise. This makes sense, as the problem is much harder with more noise! However, diffusion models are designed to denoise iteratively, which is what I implement in this section.
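A sketch of the iterative loop over strided timesteps (e.g. 990, 960, ..., 0), using the standard DDPM posterior mean; the added variance term is omitted for brevity, and the names are assumptions:

```python
import torch

def iterative_denoise(x, i_start, strided_timesteps, unet, prompt_embeds, alphas_cumprod):
    """Denoise from strided_timesteps[i_start] down to t = 0, skipping steps."""
    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_prev = strided_timesteps[i], strided_timesteps[i + 1]
        alpha_bar_t, alpha_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha_t = alpha_bar_t / alpha_bar_prev
        beta_t = 1 - alpha_t
        with torch.no_grad():
            noise_est = unet(x, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
        # Current estimate of the clean image.
        x0_est = (x - torch.sqrt(1 - alpha_bar_t) * noise_est) / torch.sqrt(alpha_bar_t)
        # Interpolate between the clean-image estimate and the current noisy image.
        x = (torch.sqrt(alpha_bar_prev) * beta_t / (1 - alpha_bar_t)) * x0_est \
            + (torch.sqrt(alpha_t) * (1 - alpha_bar_prev) / (1 - alpha_bar_t)) * x
    return x
```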
Another thing we can do with the iterative_denoise function is to generate images from scratch. We can do this by setting i_start = 0 and passing in random noise. This effectively denoises pure noise.
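For example, with embeddings for a generic text prompt:

```python
# Generate from scratch: start from pure Gaussian noise and denoise all the way down.
x = torch.randn(1, 3, 64, 64, device="cuda", dtype=torch.float16)
sample = iterative_denoise(x, 0, strided_timesteps, stage_1.unet,
                           prompt_embeds, alphas_cumprod)
```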
In order to greatly improve image quality (at the expense of image diversity), we can use a technique called Classifier-Free Guidance. In CFG, we compute both a conditional and an unconditional noise estimate, and combine them using gamma (the strength of the CFG) to calculate our new noise estimate. To get an unconditional noise estimate, we can simply pass an empty prompt embedding to the diffusion model. When gamma > 1, we get much higher quality images.
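The combined estimate is eps = eps_uncond + gamma * (eps_cond - eps_uncond). A sketch (the default gamma is illustrative):

```python
import torch

def cfg_noise_estimate(x, t, unet, cond_embeds, uncond_embeds, gamma=7.0):
    """Classifier-free guidance: push the conditional estimate away from the unconditional one."""
    with torch.no_grad():
        eps_cond = unet(x, t, encoder_hidden_states=cond_embeds).sample[:, :3]
        eps_uncond = unet(x, t, encoder_hidden_states=uncond_embeds).sample[:, :3]
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```

This estimate can simply replace the single-prompt noise estimate inside the iterative loop above.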
In part 1.4, we take a real image, add noise to it, and then denoise. This effectively allows us to make edits to existing images. The more noise we add, the larger the edit will be. This works because in order to denoise an image, the diffusion model must to some extent "hallucinate" new things -- the model has to be "creative." Another way to think about it is that the denoising process "forces" a noisy image back onto the manifold of natural images. Here, I am going to take the original test image, noise it a little, and force it back onto the image manifold without any conditioning. Effectively, I am going to get an image that is similar to the test image (with a low-enough noise level). This follows the SDEdit algorithm.
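Reusing the forward and iterative_denoise sketches from earlier, the loop might look like this (the starting indices are illustrative; a larger i_start means less noise and a smaller edit):

```python
edits = {}
for i_start in [1, 3, 5, 7, 10, 20]:
    noisy = forward(test_im, strided_timesteps[i_start], alphas_cumprod)
    edits[i_start] = iterative_denoise(noisy, i_start, strided_timesteps,
                                       stage_1.unet, prompt_embeds, alphas_cumprod)
```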
This procedure works particularly well if we start with a nonrealistic image (e.g. a painting, a sketch, some scribbles) and project it onto the natural image manifold. I experiment by starting with hand-drawn and other non-realistic images and see how they can be projected onto the natural image manifold in fun ways.
We can use the same procedure to implement inpainting (following the RePaint paper). That is, given an image x and a binary mask m, we can create a new image that has the same content where m is 0, but new content wherever m is 1. To do this, we run the diffusion denoising loop, but at every step, after obtaining the image, we "force" it to have the same pixels as the original image wherever m is 0. Essentially, we leave everything inside the edit mask alone, but we replace everything outside the edit mask with our original image - with the correct amount of noise added for timestep t.
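A sketch of that loop, reusing the forward function from earlier; `denoise_step` is a placeholder for one reverse-diffusion update (e.g. the CFG-guided update above):

```python
def inpaint(x_orig, mask, i_start, strided_timesteps, alphas_cumprod, denoise_step):
    """RePaint-style inpainting: mask is 1 where new content is generated, 0 where the original is kept."""
    x = forward(x_orig, strided_timesteps[i_start], alphas_cumprod)
    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_prev = strided_timesteps[i], strided_timesteps[i + 1]
        x = denoise_step(x, t, t_prev)
        # Outside the mask, overwrite with the original image noised to the next timestep.
        x = mask * x + (1 - mask) * forward(x_orig, t_prev, alphas_cumprod)
    return x
```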
Now, we will do the same thing as SDEdit, but guide the projection with a text prompt. This is no longer pure "projection to the natural image manifold" but also adds control using language.
In this part, we are finally ready to implement Visual Anagrams and create optical illusions with diffusion models. I first create an image that looks like "an oil painting of an old man", but when flipped upside down reveals "an oil painting of people around a campfire". To do this, we denoise an image x at step t normally with the prompt "an oil painting of an old man" to obtain noise estimate e1. At the same time, we flip x upside down and denoise it with the prompt "an oil painting of people around a campfire" to get noise estimate e2. We then flip e2 back so that it is right-side up and average the two noise estimates to get the final estimate. Finally, we perform a reverse/denoising diffusion step with the averaged noise estimate.
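Building on the cfg_noise_estimate sketch above, the averaged estimate might look like this (the gamma value is illustrative):

```python
import torch

def anagram_noise_estimate(x, t, unet, embeds_upright, embeds_flipped, uncond_embeds, gamma=7.0):
    """Average the upright prompt's estimate with the flipped-back estimate for the upside-down prompt."""
    e1 = cfg_noise_estimate(x, t, unet, embeds_upright, uncond_embeds, gamma)
    e2 = cfg_noise_estimate(torch.flip(x, dims=[-2]), t, unet, embeds_flipped, uncond_embeds, gamma)
    return (e1 + torch.flip(e2, dims=[-2])) / 2
```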
In this part we'll implement Factorized Diffusion and create hybrid images just like in project 2. To create hybrid images with a diffusion model, we can use a technique similar to the one above: we create a composite noise estimate by estimating the noise with two different text prompts, and then combining the low frequencies from one noise estimate with the high frequencies of the other.
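Again building on the cfg_noise_estimate sketch, the composite estimate might look like this (the blur kernel size and sigma are illustrative choices):

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(x, t, unet, embeds_low, embeds_high, uncond_embeds,
                          gamma=7.0, kernel_size=33, sigma=2.0):
    """Low frequencies of one prompt's noise estimate plus high frequencies of the other's."""
    e1 = cfg_noise_estimate(x, t, unet, embeds_low, uncond_embeds, gamma)
    e2 = cfg_noise_estimate(x, t, unet, embeds_high, uncond_embeds, gamma)
    low = TF.gaussian_blur(e1, kernel_size, sigma)
    high = e2 - TF.gaussian_blur(e2, kernel_size, sigma)
    return low + high
```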
I used the diagram below to construct the various classes that I need to build the UNet:
I used the diagram below to construct the UnconditionalUNet class:
Now, we will train the model to perform denoising. The objective is to train a denoiser to recover a clean image from a noisy one, where the noisy image is produced by adding Gaussian noise scaled by sigma = 0.5 to the clean image. I use the MNIST dataset via torchvision.datasets.MNIST with flags to access both the training and test sets, and I train only on the training set for 5 epochs. Before creating the dataloader, I shuffle the dataset and use the recommended batch size of 256. I noise the image batches only when they are fetched from the dataloader, so that in every epoch the network sees newly noised images, which improves generalization. For the model, I use the UNet architecture previously defined with the recommended hidden dimension D = 128. For the optimizer, I use Adam with a learning rate of 1e-4.
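A sketch of this training setup; `UnconditionalUNet` and its constructor arguments are assumed names for the architecture defined above:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
train_set = datasets.MNIST(root="./data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=256, shuffle=True)

model = UnconditionalUNet(in_channels=1, num_hiddens=128).to(device)  # assumed signature
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
sigma = 0.5

for epoch in range(5):
    for x, _ in loader:
        x = x.to(device)
        z = x + sigma * torch.randn_like(x)  # noise is sampled fresh each time a batch is fetched
        loss = F.mse_loss(model(z), x)       # L2 loss between the denoised output and the clean image
        opt.zero_grad()
        loss.backward()
        opt.step()
```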
Our denoiser was trained on MNIST digits noised with sigma = 0.5. Now, let's see how the denoiser performs on noise levels it wasn't trained for. I visualize the denoiser's results on test-set digits with varying noise levels sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0].
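Continuing the sketch above, the out-of-distribution test might look like:

```python
test_set = datasets.MNIST(root="./data", train=False, download=True,
                          transform=transforms.ToTensor())
x = torch.stack([test_set[i][0] for i in range(8)]).to(device)  # a few sample digits
sigmas = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
with torch.no_grad():
    denoised = {s: model(x + s * torch.randn_like(x)) for s in sigmas}
```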
We need a way to inject the scalar t into our UNet to condition it. There are many ways to do this, but this is how I implemented it:
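A minimal sketch of this kind of FCBlock-based injection; the block structure and the injection points shown in the comments are assumptions, not necessarily the exact ones used here:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Small MLP that maps the (normalized) scalar timestep to a feature vector."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, t):
        return self.net(t)

# Inside the UNet's forward pass (sketch): embed t, then broadcast it onto
# intermediate feature maps, e.g.
#   t1 = self.fc1_t(t)[..., None, None]   # shape (B, D, 1, 1)
#   t2 = self.fc2_t(t)[..., None, None]
#   unflatten = unflatten + t1
#   up1 = up1 + t2
```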
To make the results better and give us more control over image generation, we can also condition our UNet on the class of the digit (0-9). This requires adding 2 more FCBlocks to our UNet; for the class-conditioning vector, we use a one-hot vector instead of a single scalar. Since we still want our UNet to work without class conditioning, we implement dropout: 10% of the time, we drop the class-conditioning vector by setting it to 0. Training for this section is the same as the time-only case, with the only difference being the added class-conditioning vector and periodically dropping it so the model also learns unconditional generation. The sampling process is the same as in part A, where we saw that conditional results aren't good unless we use classifier-free guidance, so I used CFG with gamma = 5.0 for this part.
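A sketch of the one-hot class vector with 10% dropout (names are assumptions); at sampling time, the conditional and unconditional estimates are then mixed exactly as in part A, with gamma = 5.0:

```python
import torch
import torch.nn.functional as F

def class_conditioning(labels, num_classes=10, p_uncond=0.1):
    """One-hot encode the digit class and zero it out 10% of the time,
    so the network also learns the unconditional case."""
    c = F.one_hot(labels, num_classes).float()
    drop = (torch.rand(c.shape[0], 1, device=c.device) < p_uncond).float()
    return c * (1.0 - drop)
```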