CS180 Project 5A: The Power of Diffusion Models!

Josh Barua

Part 0: Setup

In part 0, I load the DeepFloyd IF diffusion model and generate 3 images from 3 prompts. The model generates images in two stages: the first stage produces images of size 64x64 and the second stage takes the outputs of the first stage and generates images of size 256x256. I use the seed 1203 and will continue to do so for the rest of the project.

3 generated images at 20 inference steps (stage 1 top, stage 2 bottom)

Image 1
Image 2

The outputs are quite faithful to the prompts, and the generated images tend to have a cartoonish style. Further, the stage 2 outputs are very sharp, though they also look somewhat smoothed out. I explored the influence of the num_inference_steps parameter on the prompt "an oil painting of a snowy mountain village", which I have visualized below:

2 inference steps (left), 200 inference steps (middle), 400 inference steps (right)

Image 1 Image 2 Image 2
Image 1 Image 2 Image 2

With very few inference steps the image still resembles noise. As I increase the number of inference steps the image continues to be denoised until it hits a "sweet spot". However, as I increase the number of inference steps beyond this sweet spot, the generated image starts over-smoothing and artifacts begin to appear.

Part 1.1: Implementing the Forward Process

Part 1.1 required me to implement the forward process of diffusion, which progressively adds more and more noise to the image. Below I have visualized some samples of the forward process:

Image 1 Image 1 Image 1 Image 1
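Below is a minimal sketch of this step, assuming `alphas_cumprod` holds the cumulative products ᾱ_t of the DeepFloyd noise schedule (the function and variable names are my own):

```python
import torch

def forward(im, t, alphas_cumprod):
    """Noise a clean image im to timestep t:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I)."""
    alpha_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)                                   # fresh Gaussian noise
    x_t = alpha_bar.sqrt() * im + (1 - alpha_bar).sqrt() * eps   # noisy image at timestep t
    return x_t, eps
```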

Part 1.2: Classical Denoising

Part 1.2 required me to use classical techniques like Gaussian blur filtering to remove the noise added by the forward process. Below I have visualized the results using a sigma value of 3 and a kernel size of 5x5:

Image 1
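As a rough sketch of this baseline (assuming torchvision is available and `x_t` is a noised image from Part 1.1):

```python
import torchvision.transforms.functional as TF

# Classical denoising attempt: Gaussian blur with a 5x5 kernel and sigma = 3.
blurred = TF.gaussian_blur(x_t, kernel_size=5, sigma=3.0)
```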

Part 1.3: One-Step Denoising

In part 1.3, I use a pretrained UNet model to denoise the images. The UNet predicts the noise in x_t; solving for x_0 then effectively recovers the noise added between x_0 and x_t and subtracts it all from x_t in one step.

Image 1
Image 1
Image 1

The one-step denoising method performs significantly better than the classical techniques, but we still see that as more noise is added the quality of the denoised image goes down.
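A minimal sketch of the one-step estimate, assuming `unet(x_t, t)` returns the predicted noise (the real DeepFloyd stage-1 UNet also takes text embeddings, which I omit here):

```python
def one_step_denoise(unet, x_t, t, alphas_cumprod):
    """Estimate the clean image x_0 from the noisy image x_t in a single step."""
    alpha_bar = alphas_cumprod[t]
    eps_hat = unet(x_t, t)                            # predicted noise in x_t
    # Invert the forward process:
    # x_0 = (x_t - sqrt(1 - alpha_bar) * eps_hat) / sqrt(alpha_bar)
    return (x_t - (1 - alpha_bar).sqrt() * eps_hat) / alpha_bar.sqrt()
```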

Part 1.4: Iterative Denoising

In part 1.4, I implemented an iterative denoising technique which more closely aligns with the design of diffusion models as opposed to one-step denoising. To speed up the process and save compute, I use strided timesteps (i.e., I start at step 990 and skip 30 steps each time). Below I have visualized the images at different intermediate steps in the denoising process:

Image 1
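A sketch of a single strided update, where `t` is the current timestep, `t_prev` is the next (less noisy) strided timestep, and `x0_hat` is the current clean-image estimate from the noise prediction; the learned-variance term added by DeepFloyd is left out, and the names are mine:

```python
def iterative_step(x_t, x0_hat, t, t_prev, alphas_cumprod):
    """One denoising update from timestep t to the less-noisy strided timestep t_prev."""
    alpha_bar_t = alphas_cumprod[t]
    alpha_bar_prev = alphas_cumprod[t_prev]
    alpha = alpha_bar_t / alpha_bar_prev          # effective alpha across the stride
    beta = 1 - alpha
    # Interpolate between the clean-image estimate and the current noisy image.
    return (alpha_bar_prev.sqrt() * beta / (1 - alpha_bar_t)) * x0_hat \
         + (alpha.sqrt() * (1 - alpha_bar_prev) / (1 - alpha_bar_t)) * x_t
```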

Now I visualize the results from the 3 different denoising approaches I have explored so far:

Image 1

Part 1.5: Diffusion Model Sampling

In part 1.5, I use the same iterative denoising function from above, except I start at the very first strided timestep (maximum noise) with pure random noise as the input image. I use the prompt "a high quality photo" to see what type of images the model now generates. Below I have visualized the 5 generated images:

Image 1

The generated images are of decent quality but are quite blurry, contrary to the prompt.

Part 1.6: Classifier-Free Guidance

In part 1.6, I use classifier-free guidance (CFG) to improve the quality of the images. I compute both a conditional and an unconditional noise estimate. When gamma is 0, I use only the unconditional estimate, and when gamma is 1, I use only the conditional estimate. However, the images get noticeably better when I set gamma to be greater than 1.

Image 1
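A minimal sketch of the CFG combination, where `eps_cond` and `eps_uncond` are the noise estimates computed with the text prompt and with the null prompt respectively:

```python
def cfg_noise(eps_cond, eps_uncond, gamma):
    # gamma = 0 -> purely unconditional, gamma = 1 -> purely conditional,
    # gamma > 1 -> extrapolate past the conditional estimate (stronger guidance).
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```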

Part 1.7: Image-to-image Translation

In part 1.7, I now use the iterative denoising function with CFG to generate images that become increasingly similar to our original image. As the starting index increases (i.e., less noise is added to the original before denoising begins), the denoised output gets closer and closer to the original image. Below I have visualized the results for 3 images, followed by a short sketch of the procedure:

Campanile

Image 1

Samoyed

Image 1

Prof. Malik

Image 1
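A rough sketch of the overall procedure, reusing the `forward` function from Part 1.1 and assuming an `iterative_denoise(x_t, i_start)` wrapper around the loop from Part 1.4 (names and signatures are my own); `i_start` indexes into the strided timestep schedule:

```python
def image_to_image(im, i_start, strided_timesteps, alphas_cumprod):
    """Noise the original image up to the i_start-th strided timestep,
    then run CFG iterative denoising from there back to a clean image."""
    t = strided_timesteps[i_start]
    x_t, _ = forward(im, t, alphas_cumprod)              # partially noise the original
    return iterative_denoise(x_t, i_start=i_start)       # denoise the remaining steps
```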

Part 1.7.1: Editing Hand-Drawn and Web Images

In part 1.7.1, I use the procedure above to take a nonrealistic image and project it onto the natural image manifold. For the hand-drawn images, I had difficulty running the provided drawing code in my local environment, so I drew the images on my iPad and uploaded them instead. Below I have visualized three of these results:

Pikachu Web Image

Image 1

Hand-Drawn Tree

Image 1

Hand-Drawn Sun

Image 1

Part 1.7.2: Inpainting

In part 1.7.2, I used the iterative denoising function with CFG to perform inpainting. I define a binary mask that is 0 where I want to keep the original image and 1 where I want to fill in newly generated content. Below I have visualized three results, followed by a short sketch of the masking step:

Campanile Inpainted

Image 1

Sun Inpainted

Image 1

Prof. Malik Inpainted

Image 1
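A sketch of the extra step added inside the denoising loop, using the mask convention above (`mask` is 1 where content is regenerated and 0 where the original is kept); variable names are mine:

```python
# After each denoising update at timestep t, force the "keep" region (mask == 0)
# back to an appropriately noised copy of the original image.
x_orig_t, _ = forward(x_orig, t, alphas_cumprod)
x_t = mask * x_t + (1 - mask) * x_orig_t
```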

Part 1.7.3: Text-Conditioned Image-to-image Translation

In part 1.7.3, instead of guiding the translation with the generic prompt "a high quality photo", I perform text-conditioned image-to-image translation with a specific text prompt. As the amount of added noise decreases, I expect the generated images to be more similar to the original image and less dominated by the prompt.

Campanile w/ prompt "a rocket ship"

Image 1

Samoyed w/ prompt "a photo of a dog"

Image 1

Sun w/ prompt "an oil painting of a snowy mountain village"

Image 1

Part 1.8: Visual Anagrams

In part 1.8, I generate visual anagrams by averaging two noise estimates: one for the first prompt on the image, and one for the second prompt on the image flipped upside down (flipped back before averaging). The result is an image that matches the first prompt right-side up and the second prompt when flipped. Below I have visualized the results for 3 anagrams:

Prompt 1 (left), Prompt 2 (right) -- the "(flipped)" was not in the prompt

Image 1
Image 1
Image 1
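A minimal sketch of the combined noise estimate at each denoising step, assuming `unet(x, t, emb)` returns a (CFG-combined) noise estimate for a given prompt embedding (names are mine):

```python
import torch

def anagram_noise(unet, x_t, t, emb1, emb2):
    flip = lambda img: torch.flip(img, dims=[-2])   # flip the image upside down
    eps1 = unet(x_t, t, emb1)                       # prompt 1 on the upright image
    eps2 = flip(unet(flip(x_t), t, emb2))           # prompt 2 on the flipped image, flipped back
    return (eps1 + eps2) / 2                        # average the two estimates
```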

Part 1.9: Hybrid Images

In part 1.9, I generate hybrid images, which are images with a high frequency component (visible from close up) and a low frequency component (visible from far away). I achieve this by applying a low-pass filter to one prompt's noise estimate and a high-pass filter to the other's, then summing the two.

Image 1
Image 1
Image 1
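A sketch of the composite noise estimate, using torchvision's Gaussian blur as the low-pass filter (the kernel size, sigma, and names here are my own choices):

```python
import torchvision.transforms.functional as TF

def hybrid_noise(eps_far, eps_near, ksize=33, sigma=2.0):
    """Combine the low frequencies of one prompt's noise estimate (visible from far away)
    with the high frequencies of the other's (visible up close)."""
    low = TF.gaussian_blur(eps_far, kernel_size=ksize, sigma=sigma)
    high = eps_near - TF.gaussian_blur(eps_near, kernel_size=ksize, sigma=sigma)
    return low + high
```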

CS180 Project 5B: Diffusion Models from Scratch!

Part 1.1: Implementing the UNet

In part 1.1, I implement the UNet architecture:

Image 1

Part 1.2: Using the UNet to Train a Denoiser

In part 1.2, I use the UNet to train a denoiser. I first pass in training pairs (z, x), where z is the noisy MNIST image, and x is the clean MNIST image. For each clean image x, we can add noise controlled by sigma to get z. Below I have visualized the process of adding noise with varying sigma:

Image 1
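A one-line sketch of the noising operation, where `x` is a clean MNIST image:

```python
import torch

def add_noise(x, sigma):
    return x + sigma * torch.randn_like(x)   # z = x + sigma * eps,  eps ~ N(0, I)
```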

Part 1.2.1: Training

In part 1.2.1, I train the model according to the following hyperparameters: batch size of 256, 5 epochs of training, Adam with learning rate of 1e-4, number of hidden dimensions = 128, sigma = 0.5
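A condensed sketch of the training loop under those hyperparameters, assuming `unet` is the Part 1.1 model and `train_loader` yields batches of clean MNIST digits (names and details, including the `device` variable, are mine):

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(unet.parameters(), lr=1e-4)

for epoch in range(5):
    for x, _ in train_loader:                    # digit labels are unused here
        x = x.to(device)
        z = x + 0.5 * torch.randn_like(x)        # noisy input with sigma = 0.5
        loss = F.mse_loss(unet(z), x)            # L2 loss against the clean image
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```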

Loss Curve

Image 1

Results after 1 epoch

Image 1

Results after 5 epochs

Image 1

Part 1.2.2: Out-of-Distribution Testing

In part 1.2.2, I use different sigma values than the model was trained on to test whether the model can generalize:

Image 1

Part 2.1: Adding Time Conditioning to UNet

In part 2.1, I modify the original architecture by adding time conditioning:

Image 1
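A rough sketch of the conditioning block used to inject the timestep; the exact layer sizes and injection points follow the diagram above, so this is only an approximation and the names are mine:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Map the scalar (normalized) timestep to a feature vector used to modulate the UNet."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, t):          # t has shape (batch, 1)
        return self.net(t)

# Inside the UNet forward pass, the timestep embedding is broadcast over the
# spatial dimensions and combined with intermediate feature maps, e.g.:
#   feat = feat + fc_t(t).view(-1, channels, 1, 1)
```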

Part 2.2: Training

In part 2.2, I train the model according to the following hyperparameters: batch size of 128, 20 epochs of training, Adam with learning rate of 1e-3 and exponential learning rate decay scheduler with gamma=0.1^(1.0/num_epochs), number of hidden dimensions = 64:

Image 1
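A condensed sketch of the time-conditioned training loop, assuming a DDPM-style `alphas_cumprod` schedule over T = 300 timesteps and a UNet that takes the normalized timestep t / T (the names, the value of T, and the indexing details are my own):

```python
import torch
import torch.nn.functional as F

T = 300
optimizer = torch.optim.Adam(unet.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.1 ** (1.0 / 20))

for epoch in range(20):
    for x0, _ in train_loader:
        x0 = x0.to(device)
        t = torch.randint(0, T, (x0.shape[0], 1), device=device)   # random timestep per image
        eps = torch.randn_like(x0)
        ab = alphas_cumprod[t].view(-1, 1, 1, 1)                    # alpha_bar_t
        x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps                # forward process
        loss = F.mse_loss(unet(x_t, t.float() / T), eps)            # predict the added noise
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                                                # exponential lr decay per epoch
```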

Part 2.3: Sampling from the UNet

In part 2.3, I implement the sampling algorithm to test how well the UNet is able to generate new samples. Below I have visualized the results at different checkpoints:

Image 1
Image 1

The model does a decent job at generating new samples, but it appears that it hasn't learned to produce distinct, recognizable digits just yet. Thus, in the next section I also condition on the class of each digit to improve sample quality and control which digit is generated.
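For reference, here is a condensed sketch of the standard DDPM sampling loop I use, with the same assumed schedule as above (per-step details such as clipping are omitted, and the names are mine):

```python
import torch

@torch.no_grad()
def sample(unet, n, T, betas, alphas, alphas_cumprod, device="cuda"):
    x = torch.randn(n, 1, 28, 28, device=device)                  # start from pure noise
    for t in range(T - 1, -1, -1):
        ab, a, b = alphas_cumprod[t], alphas[t], betas[t]
        t_norm = torch.full((n, 1), t / T, device=device)          # normalized timestep
        eps_hat = unet(x, t_norm)                                  # predicted noise
        mean = (x - (1 - a) / (1 - ab).sqrt() * eps_hat) / a.sqrt()
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)  # no noise on the final step
        x = mean + b.sqrt() * z
    return x
```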

Part 2.4: Adding Class-Conditioning to UNet

In part 2.4, I add information about the class of the digit I want the model to generate to improve its output. I still want the model to be capable of unconditional generation, so during training I drop the conditioning with probability 0.1 by setting the class-condition vector to zero. Below I have visualized the loss curve from training:

Image 1
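A sketch of how the class condition is masked during training, assuming `labels` are the MNIST digit labels and the UNet takes a one-hot class vector alongside the normalized timestep (names are mine):

```python
import torch
import torch.nn.functional as F

c = F.one_hot(labels, num_classes=10).float()           # (batch, 10) one-hot class vectors
drop = torch.rand(c.shape[0], device=c.device) < 0.1    # drop the condition 10% of the time
c[drop] = 0.0                                           # all-zero vector = unconditional
eps_hat = unet(x_t, t_norm, c)                          # class- and time-conditioned prediction
```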

Part 2.5: Sampling from the Class-Conditioned UNet

In part 2.5, I sample from the model using the same process, but this time with a class condition. I also incorporate classifier-free guidance from part A, since that experiment showed that conditional models struggle to generate good outputs without CFG.

Image 1
Image 1
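A sketch of the CFG step inside the class-conditioned sampling loop; the guidance scale value here is my own choice, and `c_onehot` is the one-hot vector for the digit being generated:

```python
import torch

gamma = 5.0                                                  # guidance scale (my choice)
eps_cond = unet(x, t_norm, c_onehot)                         # conditioned on the target digit
eps_uncond = unet(x, t_norm, torch.zeros_like(c_onehot))     # null (all-zero) class condition
eps_hat = eps_uncond + gamma * (eps_cond - eps_uncond)       # classifier-free guidance
```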

Conclusion

I learned a lot from this project, and it was super interesting to see how we could achieve the same kinds of cool images (e.g. hybrid images) using both classical image processing techniques and diffusion models! The UNet portions of the project also gave me better intuition for why we downsample and upsample images to capture a diverse range of representations, from fine to coarse features.