Overview

This project is split into two parts: 5A (The Power of Diffusion Models) and 5B (Diffusion Models from Scratch).

In project 5A, I have done the following:

  1. Setup
  2. Implementing the Forward Process
  3. Classical Denoising
  4. One-Step Denoising
  5. Iterative Denoising
  6. Diffusion Model Sampling
  7. Classifier-Free Guidance (CFG)
  8. Image-to-image Translation
    1. Editing Hand-Drawn and Web Images
    2. Inpainting
    3. Text-Conditional Image-to-image Translation
  9. Visual Anagrams
  10. Hybrid Images

In project 5B, I have done the following:

  1. Training a Single-Step Denoising UNet
    1. Implementing the UNet
    2. Using the UNet to Train a Denoiser
  2. Training a Diffusion Model
    1. Adding Time Conditioning to UNet
    2. Training the UNet
    3. Sampling from the UNet
    4. Adding Class-Conditioning to and Training the UNet
    5. Sampling from the Class-Conditioned UNet

Above: "an oil painting of an old man"
Below: "an oil painting of people around a campfire"

This is an example of a visual anagram!


5A Part 0: Setup

getting things ready to go

In this section, I created my Hugging Face Hub access token, instantiated the DeepFloyd IF diffusion model, and used the precomputed text embeddings to generate our first images. I set my seed to 180 by calling torch.cuda.manual_seed(seed) and torch.manual_seed(seed).
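
A minimal sketch of this setup, assuming the diffusers library (the exact model variants and arguments in the course starter code may differ):

```python
import torch
from diffusers import DiffusionPipeline

seed = 180
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)

# Stage 1 generates 64x64 images; stage 2 super-resolves them to 256x256.
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
).to("cuda")
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", variant="fp16", torch_dtype=torch.float16
).to("cuda")
```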

The prompts used to generate the images are ['an oil painting of a snowy mountain village', 'a man wearing a hat', 'a rocket ship'] (from left to right), and the corresponding images with num_inference_steps set to 20 and 40 are shown below (left and right halves). The first row and second row correspond to images generated from DeepFloyd's stage_1 and stage_2 objects, respectively.

snowy village, stage 1, 20 steps
man wearing hat, stage 1, 20 steps
rocket ship, stage 1, 20 steps
snowy village, stage 1, 40 steps
man wearing hat, stage 1, 40 steps
rocket ship, stage 1, 40 steps
snowy village, stage 2, 20 steps
man wearing hat, stage 2, 20 steps
rocket ship, stage 2, 20 steps
snowy village, stage 2, 40 steps
man wearing hat, stage 2, 40 steps
rocket ship, stage 2, 40 steps

The images generated with a higher number of inference steps tended to show more detailed results, as we can see by comparing the left and right halves of the grid of images above. For the entirety of this project, I used a random seed of 180.
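
For reference, generating one grid entry looks roughly like this, assuming prompt_embeds and negative_prompt_embeds are the precomputed text embeddings; the argument names follow the public diffusers IF pipelines and may differ slightly from the starter notebook:

```python
# Stage 1: text-conditioned generation at 64x64.
image_64 = stage_1(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    num_inference_steps=20,  # 40 for the right half of the grid
    output_type="pt",
).images

# Stage 2: super-resolve the 64x64 result to 256x256.
image_256 = stage_2(
    image=image_64,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    num_inference_steps=20,
    output_type="pt",
).images
```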


5A Part 1: Sampling Loops

Writing sampling loops for the pretrained DeepFloyd denoisers

1.1 Implementing the Forward Process

In this part, we implement the noisy_im = forward(im, t) function. Down below, I have the test image at noise levels [250, 500, 750].

Berkeley Campanile

Noisy Campanile at t=250

Noisy Campanile at t=500

Noisy Campanile at t=750
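
The forward function itself is just the DDPM forward process, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. A minimal sketch, assuming alphas_cumprod is the cumulative product of alphas taken from the stage-1 scheduler:

```python
def forward(im, t):
    """Add noise to a clean image im to reach timestep t."""
    abar_t = alphas_cumprod[t]        # scalar in (0, 1), smaller for larger t
    eps = torch.randn_like(im)        # eps ~ N(0, I)
    return abar_t.sqrt() * im + (1 - abar_t).sqrt() * eps
```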

1.2 Classical Denoising

We now try to denoise these images using classical methods. Down below are the noisy images from above with Gaussian blur filtering.

Noisy Campanile at t=250

Noisy Campanile at t=500

Noisy Campanile at t=750

Gaussian Blur Denoising at t=250

Gaussian Blur Denoising at t=500

Gaussian Blur Denoising at t=750
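
The classical baseline is just a Gaussian blur. A sketch using torchvision (the kernel size and sigma here are illustrative rather than the exact values I tuned):

```python
import torchvision.transforms.functional as TF

# Blur the noisy image in the hope of averaging out the noise.
blurred = TF.gaussian_blur(noisy_im, kernel_size=5, sigma=2.0)
```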

1.3 One-Step Denoising

This time, we use the pretrained diffusion model to denoise. This diffusion model was trained with text conditioning, so we use the prompt "a high quality photo."

Noisy Campanile at t=250

Noisy Campanile at t=500

Noisy Campanile at t=750

One-Step Denoised Campanile at t=250

One-Step Denoised Campanile at t=500

One-Step Denoised Campanile at t=750
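
One-step denoising inverts the forward equation using the UNet's noise estimate. A hedged sketch: DeepFloyd's stage-1 UNet predicts a learned variance alongside the noise, so its output is split in half along the channel dimension, and the exact call signature may differ from the starter code:

```python
with torch.no_grad():
    model_out = stage_1.unet(x_t, t, encoder_hidden_states=prompt_embeds).sample
noise_est, _ = model_out.chunk(2, dim=1)   # keep the noise estimate, drop the variance

# Solve x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps for x_0.
abar_t = alphas_cumprod[t]
x0_est = (x_t - (1 - abar_t).sqrt() * noise_est) / abar_t.sqrt()
```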

1.4 Iterative Denoising

The denoising UNet worked much better, but we can still improve via iterative denoising. Down below are the noisy images at every 5th loop of the iterative_denoise function, the final predicted clean image, the predicted clean image using only a single denoising step, and the predicted clean image using Gaussian blurring.

Noisy Campanile at t=90

Noisy Campanile at t=240

Noisy Campanile at t=390

Noisy Campanile at t=540

Noisy Campanile at t=690

Iteratively Denoised Campanile

One-Step Denoised Campanile

Gaussian Blurred Campanile
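
Each iteration of iterative_denoise moves from a strided timestep t to the next, less noisy timestep t' using the DDPM posterior mean. A sketch of one step, with the names being mine and the added noise term omitted:

```python
def denoise_step(x_t, x0_est, t, t_prime):
    abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
    alpha = abar_t / abar_tp    # effective alpha for this strided step
    beta = 1 - alpha
    # Interpolate between the clean estimate and the current noisy image.
    return (abar_tp.sqrt() * beta / (1 - abar_t)) * x0_est \
         + (alpha.sqrt() * (1 - abar_tp) / (1 - abar_t)) * x_t
```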

1.5 Diffusion Model Sampling

This time, we use the iterative_denoise function and pass in random noise. Here are 5 results.

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

1.6 Classifier-Free Guidance (CFG)

To improve the quality of the generated images, we incorporate CFG, which is implemented in the iterative_denoise_cfg function. Here are another 5 samples using the prompt "a high quality photo" with a CFG scale of gamma=7.

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5
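
The CFG noise estimate combines an unconditional and a conditional pass through the UNet. A sketch, where unet_noise_estimate is a hypothetical helper that returns the UNet's noise prediction for a given prompt embedding:

```python
gamma = 7  # CFG scale; gamma = 1 recovers the ordinary conditional estimate

eps_uncond = unet_noise_estimate(x_t, t, null_prompt_embeds)  # "" prompt
eps_cond = unet_noise_estimate(x_t, t, prompt_embeds)         # text prompt
eps_cfg = eps_uncond + gamma * (eps_cond - eps_uncond)        # extrapolate past eps_cond
```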

1.7 Image-to-image Translation

In this part, we add various amounts of noise to our input image and pass it into iterative_denoise_cfg. Adding more noise causes the model to "hallucinate" new things, forcing it to be "creative". One way to think about it is that the denoising process "forces" a noisy image back onto the manifold of natural images.
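
A sketch of the SDEdit procedure (names are mine and the exact signature of iterative_denoise_cfg may differ): noise the input image to the timestep indexed by i_start, then run the usual CFG denoising loop from there. Larger i_start means less noise is added, so the result stays closer to the original.

```python
def sdedit(im, i_start, prompt_embeds):
    t = strided_timesteps[i_start]   # later entries correspond to less noise
    x_t = forward(im, t)             # noise the input image to level t
    return iterative_denoise_cfg(x_t, i_start, prompt_embeds)
```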

SDEdit with i_start=1

SDEdit with i_start=3

SDEdit with i_start=5

SDEdit with i_start=7

SDEdit with i_start=10

SDEdit with i_start=20

Campanile

SDEdit with i_start=1

SDEdit with i_start=3

SDEdit with i_start=5

SDEdit with i_start=7

SDEdit with i_start=10

SDEdit with i_start=20

Tiramisu/Cake

SDEdit with i_start=1

SDEdit with i_start=3

SDEdit with i_start=5

SDEdit with i_start=7

SDEdit with i_start=10

SDEdit with i_start=20

Golden Gate Bridge

1.7.1 Editing Hand-Drawn and Web Images

We now try the same with nonrealistic images. I have chosen a cartoon drawing of a table and have also drawn three images myself.

Table at i_start=1

Table at i_start=3

Table at i_start=5

Table at i_start=7

Table at i_start=10

Table at i_start=20

Table Clipart

Star and Moon at i_start=1

Star and Moon at i_start=3

Star and Moon at i_start=5

Star and Moon at i_start=7

Star and Moon at i_start=10

Star and Moon at i_start=20

Star and Moon

Car at i_start=1

Car at i_start=3

Car at i_start=5

Car at i_start=7

Car at i_start=10

Car at i_start=20

Car

Flower at i_start=1

Flower at i_start=3

Flower at i_start=5

Flower at i_start=7

Flower at i_start=10

Flower at i_start=20

Flower in a pot

1.7.2 Inpainting

This time, we do inpainting with a mask. We create a new image that keeps the original content wherever m=0 and generates new content wherever m=1. It is pretty cool how the picture of the Campanile became a picture of a person wearing a Campanile dress. Otherwise, the inpainting worked out very nicely for the deer and the sunset.
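
Mechanically, the only change from the usual CFG loop is that after every denoising step we force the pixels outside the mask back to a correspondingly noised copy of the original image. A sketch of that one line (names are mine):

```python
# Inside the denoising loop, after computing x_t at timestep t:
x_t = mask * x_t + (1 - mask) * forward(original_im, t)
# mask == 1 marks the region to fill with new content; mask == 0 is preserved.
```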

Campanile

Mask

Hole to Fill

Campanile Inpainted

Deer

Mask

Hole to Fill

Deer Inpainted

Sunset

Mask

Hole to Fill

Sunset Inpainted

1.7.3 Text-Conditional Image-to-image Translation

This time, we guide the projection with a text prompt. Down below, I have a picture of the Campanile morphing into a rocket ship, a cat morphing into a dog, and a donut morphing into a painting of people sitting around a campfire.

Rocket Ship at i_start=1

Rocket Ship at i_start=3

Rocket Ship at i_start=5

Rocket Ship at i_start=7

Rocket Ship at i_start=10

Rocket Ship at i_start=20

Campanile

Dog at i_start=1

Dog at i_start=3

Dog at i_start=5

Dog at i_start=7

Dog at i_start=10

Dog at i_start=20

Cat

Campfire at i_start=1

Campfire at i_start=3

Campfire at i_start=5

Campfire at i_start=7

Campfire at i_start=10

Campfire at i_start=20

Donut

1.8 Visual Anagrams

Now, we create visual anagrams by obtaining the noise estimates for the upright and upside-down images, averaging them, and performing a reverse/denoising diffusion step with the averaged noise estimate. The first anagram is of "an oil painting of an old man" and "an oil painting of people around a campfire", the second anagram is of "a rocket ship" and "a pencil", and the third anagram is of "a photo of the amalfi coast" and "a photo of a dog".
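
At each denoising step we therefore need two noise estimates, one for the upright image with the first prompt and one for the flipped image with the second prompt. A sketch, with unet_noise_estimate again being a hypothetical helper for the (CFG) noise prediction:

```python
eps_1 = unet_noise_estimate(x_t, t, embeds_old_man)  # upright view
eps_2 = torch.flip(                                  # flip, denoise, flip back
    unet_noise_estimate(torch.flip(x_t, dims=[2]), t, embeds_campfire),
    dims=[2],                                        # dim 2 is H for a (B, C, H, W) tensor
)
eps = (eps_1 + eps_2) / 2   # average, then take an ordinary reverse step with eps
```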

an oil painting of an old man

an oil painting of people around a campfire

a rocket ship

a pencil

a photo of the amalfi coast

a photo of a dog

1.9 Hybrid Images

This time, we implement Factorized Diffusion and create hybrid images. We create three images down below (the first description is what the image looks like from far away, and the second description is what it looks like up close). I tried making many images of the barista and the dog, and I often found that either the dog never really appeared or it was incorporated into the barista's apron.
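
Factorized Diffusion builds the combined noise estimate from the low frequencies of one prompt's estimate and the high frequencies of the other's. A sketch using a Gaussian low-pass filter (the kernel size and sigma are illustrative; unet_noise_estimate is the same hypothetical helper as before):

```python
import torchvision.transforms.functional as TF

def lowpass(x):
    return TF.gaussian_blur(x, kernel_size=33, sigma=2.0)

eps_far = unet_noise_estimate(x_t, t, embeds_skull)       # visible from far away
eps_near = unet_noise_estimate(x_t, t, embeds_waterfall)  # visible up close
eps = lowpass(eps_far) + (eps_near - lowpass(eps_near))   # low + high frequencies
```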

Hybrid image of a skull and a waterfall

Hybrid image of a rocketship and a snowy village

Hybrid image of a barista and dog


5B Part 1: Training a Single-Step Denoising UNet

building our own UNet

1.1 Implementing the UNet

Our unconditional UNet architecture is as follows.

1.2 Using the UNet to Train a Denoiser

Once we are done implementing the UNet, we train it on the MNIST dataset. Down below, we show a visualization of the noising process using sigma = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0], a training loss curve plotted every few iterations over the whole training process, sample results on the test set after the first and the fifth epoch, and sample results on the test set with out-of-distribution noise levels after the model is trained.

visualization of the noising process

training loss curve plot

sample results on the test set after the first epoch

sample results on the test set after the fifth epoch

sample results on the test set with out-of-distribution noise levels
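
Training pairs are built on the fly: each clean digit x is noised as z = x + sigma * eps and the UNet is trained to map z back to x with an L2 loss. A sketch of the training step, assuming unet, optimizer, train_loader, and device are already set up, with sigma fixed at 0.5 during training:

```python
import torch
import torch.nn.functional as F

sigma = 0.5
for x, _ in train_loader:                 # MNIST digits; labels unused here
    x = x.to(device)
    z = x + sigma * torch.randn_like(x)   # noisy input
    loss = F.mse_loss(unet(z), x)         # L2 denoising loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```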

5B Part 2: Training a Diffusion Model

implementing DDPM

2.1 Adding Time Conditioning to UNet

We now inject a time scalar t into our UNet to condition it. The architecture is as follows.
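
As a rough schematic of the idea (not a reproduction of the exact architecture), the normalized timestep t/T is fed through a small fully connected block whose output vector then modulates intermediate feature maps of the UNet:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Hypothetical sketch: map a normalized scalar timestep to a D-dim vector."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, t):
        # t has shape (B, 1) with values t/T in [0, 1]
        return self.net(t)
```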

2.2 Training the UNet and 2.3 Sampling from the UNet

Down below, we have the training loss curve plot for the time-conditioned UNet over the whole training process and sampling results for the time-conditioned UNet for 5 and 20 epochs.

I originally used the same conditioning variable across the entire batch, rather than sampling one per datapoint. This caused my training loss curve to look fairly off, which I have also included. Also note that the axis should say 20 epochs, not 5 epochs; this is a typo.
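
The fix is to draw a separate timestep for every image in the batch. A sketch of the corrected body of each training iteration, assuming the UNet accepts the normalized timestep (names are mine):

```python
# One timestep per datapoint, NOT a single t shared by the whole batch.
t = torch.randint(0, T, (x.shape[0],), device=device)
abar = alphas_cumprod[t].view(-1, 1, 1, 1)        # broadcast over (B, 1, 28, 28)
eps = torch.randn_like(x)
x_t = abar.sqrt() * x + (1 - abar).sqrt() * eps   # forward process, per sample
loss = F.mse_loss(unet(x_t, t.float().view(-1, 1) / T), eps)  # predict the noise
```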

incorrect training loss curve

correct training loss curve

incorrect training loss curve (smoothed)

correct training loss curve (smoothed)

sampling results for 5 epochs

sampling results for 20 epochs

2.4 Adding Class-Conditioning to UNet and 2.5 Sampling from the Class-Conditioned UNet

To improve the results and give us more control over image generation, we can also condition our UNet on the class of the digit (0-9). Down below are the training loss curve plot for the class-conditioned UNet over the whole training process, and sampling results for the class-conditioned UNet for 5 and 20 epochs (4 instances for each digit). Once again, the loss curve has a typo and should say 20 epochs.
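
During training, the class label enters as a one-hot vector that is zeroed out (a "null" class) roughly 10% of the time so the model also learns an unconditional estimate; at sampling time we apply CFG between the null and target classes. A sketch, assuming a UNet signature of unet(x, t, c) and an illustrative guidance scale gamma:

```python
# Training: one-hot class vector, dropped to all zeros with probability 0.1.
c = F.one_hot(labels, num_classes=10).float()
keep = (torch.rand(c.shape[0], device=device) >= 0.1).float().unsqueeze(1)
c = c * keep
loss = F.mse_loss(unet(x_t, t_norm, c), eps)

# Sampling: classifier-free guidance between the null class and the target class.
eps_uncond = unet(x_t, t_norm, torch.zeros_like(c))
eps_cond = unet(x_t, t_norm, c)
eps_cfg = eps_uncond + gamma * (eps_cond - eps_uncond)
```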

training loss curve

sampling results for the class-conditioned UNet for 5 epochs

sampling results for the class-conditioned UNet for 20 epochs


Acknowledgements

This project is part of the Fall 2024 offering of CS180: Intro to Computer Vision and Computational Photography, at UC Berkeley. This website template is modified from HTML5 UP.