In part 0, I load the DeepFloyd IF diffusion model and generate 3 images from 3 prompts. The model generates images in two stages: the first stage produces images of size 64x64 and the second stage takes the outputs of the first stage and generates images of size 256x256. I use the seed 1203 and will continue to do so for the rest of the project.
The outputs are quite faithful to the prompts, and the generated images tend toward a cartoonish style. Further, the stage 2 outputs are both very sharp and somewhat smoothed out. I also explored the influence of the num_inference_steps parameter on the prompt "an oil painting of a snowy mountain village", which I have visualized below:
With very few inference steps the image still resembles noise. As I increase the number of steps, the image continues to be denoised until it hits a "sweet spot". Beyond this sweet spot, however, the generated image starts to over-smooth and artifacts begin to appear.
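For reference, here is a minimal sketch of how such a sweep can be run, assuming stage_1 is the DeepFloyd IF stage-1 pipeline already loaded through diffusers; the loading code is omitted and the step counts shown are illustrative, not the exact values I swept:

```python
import torch

# `stage_1` is assumed to be the DeepFloyd IF stage-1 pipeline from diffusers.
prompt = "an oil painting of a snowy mountain village"
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

outputs = {}
for steps in [5, 20, 50, 100]:               # illustrative step counts
    generator = torch.manual_seed(1203)      # same seed as the rest of the project
    outputs[steps] = stage_1(
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_embeds,
        num_inference_steps=steps,
        generator=generator,
        output_type="pt",
    ).images
```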
Part 1.1 required me to implement the forward process of diffusion, which progressively adds more and more noise to the image. Below I have visualized some samples of the forward process:
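For concreteness, a minimal sketch of the forward process, where alphas_cumprod is assumed to be the scheduler's cumulative product of alphas:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Noise a clean image to timestep t:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)."""
    alpha_bar_t = alphas_cumprod[t]
    eps = torch.randn_like(im)
    return alpha_bar_t.sqrt() * im + (1 - alpha_bar_t).sqrt() * eps
```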
Part 1.2 required me to use classical techniques like Gaussian blur filtering to remove the noise added by the forward process. Below I have visualized the results using a sigma value of 3 and a kernel size of 5x5:
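The classical baseline amounts to a single Gaussian blur of the noisy image; a sketch using torchvision:

```python
import torchvision.transforms.functional as TF

def blur_denoise(noisy_im, kernel_size=5, sigma=3.0):
    """'Denoise' classically by smoothing the noisy image with a Gaussian kernel."""
    return TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)
```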
In part 1.3, I use a pretrained UNet to denoise the images. I solve for x_0 from x_t: the UNet effectively recovers the noise added between x_0 and x_t, and I subtract all of it from x_t in one step.
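A sketch of the one-step estimate, treating unet(x_t, t) as a simplified wrapper that returns the predicted noise (the text-conditioning arguments are omitted):

```python
def one_step_denoise(x_t, t, unet, alphas_cumprod):
    """Invert the forward process in a single step:
    x_0 ≈ (x_t - sqrt(1 - alpha_bar_t) * eps_hat) / sqrt(alpha_bar_t)."""
    alpha_bar_t = alphas_cumprod[t]
    eps_hat = unet(x_t, t)                        # predicted noise
    return (x_t - (1 - alpha_bar_t).sqrt() * eps_hat) / alpha_bar_t.sqrt()
```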
The one-step denoising method performs significantly better than the classical techniques, but we still see that as more noise is added the quality of the denoised image goes down.
In part 1.4, I implemented an iterative denoising technique, which more closely aligns with the design of diffusion models than one-step denoising does. To speed up the process and save compute, I use strided timesteps (i.e., I start at step 990 and skip 30 steps each time). Below I have visualized the images at different intermediate steps in the denoising process:
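A sketch of the strided update, again treating unet(x, t) as a simplified noise-prediction wrapper; the model's learned variance term is omitted for brevity:

```python
def iterative_denoise(x, i_start, unet, alphas_cumprod, strided_timesteps):
    """Strided iterative denoising: strided_timesteps runs from 990 down to 0
    in steps of 30, and i_start selects where in the schedule to begin."""
    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_prev = strided_timesteps[i], strided_timesteps[i + 1]
        alpha_bar_t, alpha_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha_t = alpha_bar_t / alpha_bar_prev
        beta_t = 1 - alpha_t

        eps_hat = unet(x, t)                                      # predicted noise
        x0_hat = (x - (1 - alpha_bar_t).sqrt() * eps_hat) / alpha_bar_t.sqrt()

        # Blend the clean estimate with the current noisy image; the learned
        # variance term is omitted here.
        x = (alpha_bar_prev.sqrt() * beta_t / (1 - alpha_bar_t)) * x0_hat + (
            alpha_t.sqrt() * (1 - alpha_bar_prev) / (1 - alpha_bar_t)
        ) * x
    return x
```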
Now I visualize the results from the 3 different denoising approaches I have explored so far:
In part 1.5, I use the same iterative denoising function from above, except starting from the very beginning of the denoising schedule with pure noise as the input image. I use the prompt "a high quality photo" to see what kinds of images the model generates. Below I have visualized the 5 generated images:
The generated images are of decent quality but are quite blurry, contrary to the prompt.
In part 1.6, I use classifier-free guidance (CFG) to improve the quality of the images. I compute both a conditional and an unconditional noise estimate. When gamma is 0, I recover the unconditional estimate, and when gamma is 1, I recover the conditional estimate. However, the images improve noticeably when I set gamma to be greater than 1.
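The combined noise estimate is the standard CFG extrapolation; a sketch, with unet(x_t, t, emb) again a simplified wrapper and gamma = 7.0 just an illustrative value:

```python
def cfg_noise_estimate(unet, x_t, t, cond_emb, uncond_emb, gamma=7.0):
    """Classifier-free guidance: extrapolate from the unconditional estimate
    toward the conditional one. gamma=0 -> unconditional, gamma=1 -> conditional,
    gamma>1 -> amplified conditioning."""
    eps_uncond = unet(x_t, t, uncond_emb)
    eps_cond = unet(x_t, t, cond_emb)
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```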
In part 1.7, I now use the iterative denoising function with CFG to gradually generate images that become increasingly similar to our original image. As the starting index increases (i.e., less noise is added to begin with), the denoised result gets closer and closer to the original. Below I have visualized the results for 3 images:
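Putting the earlier sketches together, the procedure is roughly the following, where original_im stands in for the test image and the starting indices listed are illustrative:

```python
# A larger i_start means less noise is added, so the output stays closer
# to the original image.
results = []
for i_start in [1, 3, 5, 7, 10, 20]:          # indices into strided_timesteps
    x_t = forward(original_im, strided_timesteps[i_start], alphas_cumprod)
    results.append(
        iterative_denoise(x_t, i_start, unet, alphas_cumprod, strided_timesteps))
```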
In part 1.7.1, I use the procedure above to take a nonrealistic image and project it onto the natural image manifold. For the hand-drawn images, I had difficulty running the provided code in my local environment, so I drew the images on my iPad and uploaded them instead. Below I have visualized three of these results:
In part 1.7.2, I use the iterative denoising function with CFG to perform inpainting. I define a binary mask that is 0 where I want to keep the original image and 1 where I want to fill in newly generated content. Below I have visualized three results:
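Inside the denoising loop, the only change is a per-step projection back to the original image outside the masked region; a sketch reusing the forward function from above:

```python
def inpaint_step(x_t, t, original_im, mask, alphas_cumprod):
    """Keep mask == 1 free to change, and force mask == 0 back to the
    (appropriately noised) original image after each denoising update."""
    x_t_orig = forward(original_im, t, alphas_cumprod)
    return mask * x_t + (1 - mask) * x_t_orig
```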
In part 1.7.3, instead of using the generic prompt "a high quality photo" to translate from the source image toward a random natural image, I perform text-conditional image-to-image translation with a specific target prompt. As the starting noise level decreases, I expect the generated images to become more similar to the original image and less similar to the prompt.
In part 1.8, I generate visual anagrams by averaging the noise estimates from two prompts, one computed on the upright image and one on the flipped image. The expected result is that the image matches the first prompt right-side up and reveals the second prompt when flipped. Below I have visualized the results for 3 anagrams:
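A sketch of the combined noise estimate, with unet(x_t, t, emb) the same simplified wrapper as before:

```python
import torch

def anagram_noise_estimate(unet, x_t, t, emb_1, emb_2):
    """Average two noise estimates: prompt 1 on the upright image, and prompt 2
    on the vertically flipped image (flipped back before averaging)."""
    eps_1 = unet(x_t, t, emb_1)
    eps_2 = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, emb_2), dims=[-2])
    return (eps_1 + eps_2) / 2
```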
In part 1.9, I generate hybrid images, which are images with a high frequency component (visible from close up) and a low frequency component (visible from far away). I achieve this by applying high-pass and low-pass filters to the noise estimates of the two prompts.
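A sketch of the hybrid noise estimate; the Gaussian-blur kernel size and sigma shown are illustrative choices:

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, x_t, t, emb_low, emb_high, kernel_size=33, sigma=2.0):
    """Low-pass one prompt's noise estimate, high-pass the other's, and sum them."""
    eps_low = unet(x_t, t, emb_low)
    eps_high = unet(x_t, t, emb_high)
    low = TF.gaussian_blur(eps_low, kernel_size=kernel_size, sigma=sigma)
    high = eps_high - TF.gaussian_blur(eps_high, kernel_size=kernel_size, sigma=sigma)
    return low + high
```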
In part 1.1, I implement the UNet architecture:
In part 1.2, I use the UNet to train a denoiser. I train on pairs (z, x), where z is a noisy MNIST image and x is the corresponding clean image. For each clean image x, I add Gaussian noise controlled by sigma to get z. Below I have visualized the process of adding noise with varying sigma:
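The noising step itself is just additive Gaussian noise:

```python
import torch

def add_noise(x, sigma):
    """Build a training pair: z = x + sigma * eps, with eps ~ N(0, I)."""
    return x + sigma * torch.randn_like(x)
```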
In part 1.2.1, I train the model with the following hyperparameters: batch size of 256, 5 epochs of training, Adam with a learning rate of 1e-4, 128 hidden dimensions, and sigma = 0.5.
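A sketch of the training setup under these hyperparameters; UNet(hidden_dim=128) is a hypothetical stand-in for the part 1.1 model:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
train_set = datasets.MNIST(root="./data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=256, shuffle=True)

model = UNet(hidden_dim=128).to(device)          # hypothetical constructor
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
sigma = 0.5

for epoch in range(5):
    for x, _ in loader:                          # labels are unused for denoising
        x = x.to(device)
        z = x + sigma * torch.randn_like(x)      # noisy input
        loss = torch.nn.functional.mse_loss(model(z), x)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```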
In part 1.2.2, I use different sigma values than the model was trained on to test whether the model can generalize:
In part 2.1, I modify the original architecture by adding time conditioning:
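Roughly, the scalar timestep is pushed through a small MLP and added into intermediate feature maps; a simplified sketch (the exact injection points follow the project spec):

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Small MLP that maps a normalized scalar timestep to a feature vector,
    which is then broadcast-added into an intermediate UNet activation."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, t):
        return self.net(t)

# Inside the UNet's forward pass (sketch):
#   t_emb = self.t_block(t.view(-1, 1))        # (B, C)
#   feat = feat + t_emb.view(-1, C, 1, 1)      # modulate a (B, C, H, W) feature map
```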
In part 2.2, I train the model with the following hyperparameters: batch size of 128, 20 epochs of training, Adam with a learning rate of 1e-3, an exponential learning rate decay scheduler with gamma = 0.1^(1.0/num_epochs), and 64 hidden dimensions:
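A sketch of the time-conditioned training step and learning-rate schedule, reusing model, loader, and device from the earlier sketch; T and alphas_cumprod denote the DDPM schedule (assumed set up elsewhere), and the model is assumed to take the normalized timestep as its second argument:

```python
import torch

num_epochs = 20
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(
    optimizer, gamma=0.1 ** (1.0 / num_epochs))

for epoch in range(num_epochs):
    for x, _ in loader:
        x = x.to(device)
        t = torch.randint(0, T, (x.shape[0],), device=device)      # random timesteps
        eps = torch.randn_like(x)
        alpha_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
        x_t = alpha_bar.sqrt() * x + (1 - alpha_bar).sqrt() * eps  # forward process
        loss = torch.nn.functional.mse_loss(model(x_t, t.float() / T), eps)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```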
In part 2.3, I implement the sampling algorithm to test how well the UNet is able to generate new samples. Below I have visualized the results at different checkpoints:
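A sketch of the sampling loop, assuming alphas, betas, and alphas_cumprod are the schedule tensors and the model again takes a normalized timestep:

```python
import torch

@torch.no_grad()
def sample(model, n, T, alphas, alphas_cumprod, betas, device="cuda"):
    """DDPM-style ancestral sampling: start from pure noise and step backwards
    through all T timesteps using the model's noise prediction."""
    x = torch.randn(n, 1, 28, 28, device=device)
    for t in range(T - 1, -1, -1):
        t_batch = torch.full((n,), t, device=device)
        eps_hat = model(x, t_batch.float() / T)
        coef = (1 - alphas[t]) / (1 - alphas_cumprod[t]).sqrt()
        x = (x - coef * eps_hat) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)   # add noise except at t = 0
    return x
```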
The model does a decent job at generating new samples, but it appears that it hasn't learned to produce distinct digits just yet. Thus, in the next section I also condition on the class of each digit to make generation more controllable and robust.
In part 2.4, I add information about the class of the digit I want the model to generate in order to improve its output. I still want the model to be capable of unconditional generation, so I apply dropout with probability 0.1, setting the condition vector to zero. Below I have visualized the loss curve from training:
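A sketch of how the class condition is built for each batch:

```python
import torch
import torch.nn.functional as F

def make_class_condition(labels, num_classes=10, p_uncond=0.1):
    """One-hot encode the digit labels, then zero out the condition vector with
    probability p_uncond so the model also learns unconditional generation."""
    c = F.one_hot(labels, num_classes).float()
    keep = (torch.rand(labels.shape[0], device=labels.device) > p_uncond).float()
    return c * keep.unsqueeze(1)
```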
In part 2.5, I sample from the model using the same process, but this time with a class condition. I also incorporate classifier-free guidance from part A, since that experiment showed that conditional models struggle to generate good outputs without CFG.
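At each sampling step, the noise estimate is the CFG combination of the class-conditioned and "null"-conditioned predictions; a sketch with an illustrative guidance scale:

```python
import torch

@torch.no_grad()
def class_cfg_eps(model, x_t, t_norm, c, gamma=5.0):
    """CFG for class-conditional sampling: c is the one-hot class vector and an
    all-zeros vector plays the role of the null condition. gamma is illustrative."""
    eps_cond = model(x_t, t_norm, c)
    eps_uncond = model(x_t, t_norm, torch.zeros_like(c))
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```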
I learned a lot from this project and it was super interesting seeing how we could achieve the same kinds of cool images (e.g. hybrid images) using both classical image processing techniques and diffusion models! The UNet components of the project also gave me a better intuition for why we downsample and upsample images to capture a diverse range of representations, from coarse to fine features.