
Project 5: Fun With Diffusion Models!

Yuze Zhang

Part A

Part 0: Setup

Objective: Set up the DeepFloyd IF diffusion model and sample images from text prompts.

In this part, we set up the environment to use DeepFloyd IF, a two-stage diffusion model trained by Stability AI. To begin, create a Hugging Face account, log in, and accept the license on the DeepFloyd/IF-I-XL-v1.0 model card. Download the precomputed text embeddings to avoid GPU memory issues. The model generates images from text prompts, with "a high quality photo" serving as a neutral, unconditional prompt for this part. The task is to generate images for three prompts, reflect on the outputs, and experiment with num_inference_steps. The random seed is set to YOUR_SEED = 1.

Sampling Results

[Figures: outputs with num_inference_steps=20]

[Figures: outputs with num_inference_steps=50]

It is evident that clear and detailed text prompts significantly enhance the quality of the generated outputs. Furthermore, increasing num_inference_steps typically leads to better results. To ensure reproducibility, the random seed is fixed at 1.

Part 1: Sampling Loops

Objective: Implement diffusion sampling loops using the pretrained DeepFloyd denoiser.

1.1 Implementing the Forward Process

In this part, the goal is to implement the forward diffusion process, which progressively adds noise to a clean image. The forward process samples from a Gaussian whose variance increases over time: the clean image is scaled by the square root of the cumulative product of alphas (alphas_cumprod) and combined with Gaussian noise scaled by the square root of one minus that product, so smaller timesteps yield an image close to the original (less noise) and larger timesteps yield more noise. The task is to write the function noisy_im = forward(im, t), which takes a clean image and a timestep t as inputs and returns the noisy image at that timestep. The test image, the Campanile, is resized to 64x64, and the function is applied at timesteps [250, 500, 750]; the results are displayed to observe the progression of noise.
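As a minimal sketch (not the reference solution), the forward process can be written as follows; the alphas_cumprod argument is added here so the snippet is self-contained, whereas in the project it comes from the DeepFloyd scheduler:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps.

    `alphas_cumprod` is assumed to be a 1-D tensor of cumulative alpha
    products indexed by timestep; eps is fresh Gaussian noise.
    """
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(im)
    return torch.sqrt(abar_t) * im + torch.sqrt(1 - abar_t) * eps
```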

[Figures: original; noisy at t=250; noisy at t=500; noisy at t=750]

1.2 Classical Denoising

In this part, the goal is to apply classical denoising techniques, specifically Gaussian blur filtering, to the noisy images generated at timesteps [250, 500, 750]. Gaussian blur is a smoothing filter that attempts to reduce noise by averaging pixels within a local neighborhood, but it is unlikely to produce good results due to the complexity of noise in the diffusion process. The task is to apply the torchvision.transforms.functional.gaussian_blur function to each noisy image and display the denoised images with the corresponding noisy images. This will illustrate the challenge of denoising these images using traditional methods.
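For reference, the blur itself is a one-liner; the kernel size and sigma below are illustrative choices, not values prescribed by the project:

```python
import torchvision.transforms.functional as TF

# Larger kernels/sigmas smooth away more noise but also destroy more detail.
denoised = TF.gaussian_blur(noisy_im, kernel_size=5, sigma=2.0)
```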

[Figures: noisy images at t=250, t=500, t=750, and the Gaussian-blur denoised image at each t]

1.3 One-Step Denoising

Part 1.3 uses the pretrained UNet to denoise the noisy images generated at timesteps [250, 500, 750]. The model estimates the noise in the image based on the timestep and a provided text prompt embedding. The noisy image is passed through the UNet to predict the noise, which is then removed to recover an estimate of the original image. Each denoised image is visualized alongside the original and noisy images for comparison.
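A sketch of the recovery step, assuming a helper unet_eps(x, t) that returns the model's noise prediction (for DeepFloyd's stage-1 UNet this means keeping only the noise half of the output channels):

```python
import torch

def one_step_denoise(unet_eps, x_t, t, alphas_cumprod):
    """Invert the forward equation using the predicted noise:
    x0_hat = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t).
    """
    with torch.no_grad():
        eps = unet_eps(x_t, t)          # predicted noise at timestep t
    abar_t = alphas_cumprod[t]
    return (x_t - torch.sqrt(1 - abar_t) * eps) / torch.sqrt(abar_t)
```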

[Figures: original and noisy images at t=250, t=500, t=750; original and one-step denoised images at t=250, t=500, t=750]

The UNet is effective to some extent, but as t increases, the one-step denoising results deteriorate.

1.4 Iterative Denoising

Part 1.4 focuses on implementing iterative denoising for diffusion models. Instead of running the model 1000 times, the process can be sped up by skipping steps: we create a list of timesteps (strided_timesteps), start at the noisiest image, and reduce the noise with the UNet step by step. Each denoising step is an interpolation between the current noisy image and a clean-image estimate, weighted according to the noise level at that timestep.
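One strided update might look like the sketch below; the added variance term is omitted for brevity, and x0_hat is the clean estimate obtained from the current noise prediction:

```python
import torch

def iterative_denoise_step(x_t, x0_hat, t, t_prime, alphas_cumprod):
    """Move from timestep t to an earlier timestep t_prime < t by
    interpolating between the clean estimate and the current image."""
    abar_t = alphas_cumprod[t]
    abar_tp = alphas_cumprod[t_prime]
    alpha = abar_t / abar_tp            # effective alpha over the stride
    beta = 1 - alpha
    return (torch.sqrt(abar_tp) * beta / (1 - abar_t)) * x0_hat \
         + (torch.sqrt(alpha) * (1 - abar_tp) / (1 - abar_t)) * x_t
```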

[Figures: noisy images at t=690, t=540, t=390, t=240, t=90; original; iteratively denoised result; one-step denoised result; Gaussian-blur denoised result]

1.5 Diffusion Model Sampling

Part 1.5 involves using the iterative_denoise function to generate images from scratch by denoising random noise. By setting i_start = 0 and initializing the image with pure noise, the model progressively refines the noisy image into a coherent result. The task requires generating 5 images based on the prompt "a high quality photo."
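Generation from scratch then reduces to starting the Part 1.4 loop from pure noise; the 64x64 RGB shape below matches DeepFloyd's stage-1 input:

```python
import torch

x = torch.randn(1, 3, 64, 64)              # pure Gaussian noise
sample = iterative_denoise(x, i_start=0)   # loop from Part 1.4, all steps
```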

[Figures: Samples 1–5]

1.6 Classifier-Free Guidance (CFG)

Part 1.6 focuses on improving image quality through Classifier-Free Guidance (CFG). CFG enhances the diffusion process by computing both a conditional and an unconditional noise estimate. The final noise estimate is a weighted combination of the two, model_output = scale * cond_noise_est + (1 - scale) * uncond_noise_est with scale > 1. This technique improves the quality of generated images but reduces their diversity.
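Equivalently, the blend can be written as the unconditional estimate plus a scaled correction; the scale value below is a typical choice, not one fixed by the project:

```python
def cfg_noise_estimate(cond_noise_est, uncond_noise_est, scale=7.0):
    """Classifier-free guidance: scale > 1 pushes the estimate toward
    the prompt. Algebraically identical to
    scale * cond + (1 - scale) * uncond."""
    return uncond_noise_est + scale * (cond_noise_est - uncond_noise_est)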

[Figures: Samples 1–5 with CFG]

The samples generated with CFG are noticeably better than the unguided samples from Part 1.5.

1.7 Image-to-image Translation

Part 1.7 focuses on image-to-image translation using the diffusion model with Classifier-Free Guidance (CFG). The goal is to take a noisy version of a test image and iteratively denoise it, gradually making the image resemble the original with varying levels of noise. This process is based on the SDEdit algorithm, which forces a noisy image back onto the manifold of natural images, allowing for creative edits.
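A sketch of the procedure, reusing forward from Part 1.1 and iterative_denoise and strided_timesteps from Part 1.4; a larger i_start means less initial noise, so the output stays closer to the input:

```python
def sdedit(im, i_start):
    """Project im onto the natural image manifold from a chosen noise level."""
    t = strided_timesteps[i_start]   # timestep list from Part 1.4
    x_t = forward(im, t)             # noise the input (Part 1.1)
    return iterative_denoise(x_t, i_start=i_start)
```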

Test Image

[Figures: SDEdit results at i_start = 1, 3, 5, 7, 10, 20, and the original test image]

Own Image

[Figures: two of my own images, each showing SDEdit results at different i_start values alongside the original]

1.7.1 Editing Hand-Drawn and Web Images

Part 1.7.1 focuses on applying the image-to-image translation procedure to hand-drawn or non-realistic images, such as sketches or web images, to project them onto the natural image manifold.

Web Image
[Figures: two web images, each edited at i_start = 1, 3, 5, 7, 10, 20, shown with the original image from the website]

Hand Drawn Image
[Figures: two hand-drawn images, each edited at i_start = 1, 3, 5, 7, 10, 20, shown with the original drawing]

1.7.2 Inpainting

In part 1.7.2, the task is to implement inpainting using the diffusion model. The goal is to take an image x0 and a binary mask m, and create a new image that retains the content of x0 where m = 0 (the unmasked areas) but fills in the masked areas (where m = 1) with new content generated by the model. Concretely, after each denoising step, the pixels where the mask is 0 are replaced with the correspondingly noised version of the original image at timestep t.
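The per-step mask update might look like this sketch, with forward reused from Part 1.1:

```python
def inpaint_step(x_t, x_orig, mask, t):
    """Keep generated content where mask == 1; elsewhere, force the
    pixels back to the appropriately-noised original image."""
    return mask * x_t + (1 - mask) * forward(x_orig, t)
```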

Test Image
[Figures: original, mask, region to replace, and inpainting result]

Own Image
[Figures: two of my own images, each with original, mask, region to replace, and inpainting result]

1.7.3 Text-Conditional Image-to-image Translation

In part 1.7.3, the task is to apply text-conditional image-to-image translation. This is similar to the previous SDEdit approach but now includes a text prompt to guide the process. The goal is to generate images that not only project onto the natural image manifold but also incorporate the characteristics of the provided text prompt.
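In code this is the SDEdit loop with a different conditioning embedding; the prompt_embeds keyword below is an assumed extension of the Part 1.4 function, not its required signature:

```python
# Condition the denoising loop on "a rocket ship" instead of the
# neutral "a high quality photo" prompt.
x_t = forward(test_im, strided_timesteps[i_start])
edited = iterative_denoise(x_t, i_start=i_start, prompt_embeds=rocket_embeds)
```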

Test Image
[Figures: translation result for "a rocket ship" and the original test image]

Own Image
[Figures: two of my own images, translated with "an oil painting of a snowy mountain village" and "a photo of a dog", each shown with the original]

1.8 Visual Anagrams

In part 1.8, the task is to implement Visual Anagrams, a technique that creates optical illusions with diffusion models. The goal is to generate an image that appears as one thing when viewed upright but turns into something completely different when flipped upside down. At each timestep t, the method estimates the noise for the image under one prompt and for its vertically flipped version under a second prompt; the final estimate is the average of the first estimate and the second one after flipping it back.
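A sketch of the combined noise estimate, assuming a helper unet_eps(x, t, embeds) that returns the noise prediction under a given prompt embedding:

```python
import torch

def anagram_noise(unet_eps, x_t, t, embeds_a, embeds_b):
    """Average the upright estimate with the flipped-image estimate
    (un-flipped afterward) so both orientations are denoised at once."""
    eps_upright = unet_eps(x_t, t, embeds_a)
    eps_flipped = unet_eps(torch.flip(x_t, dims=[-2]), t, embeds_b)
    return (eps_upright + torch.flip(eps_flipped, dims=[-2])) / 2
```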

Example

[Figure: visual anagram: "An Oil Painting of an Old Man" upright, "An Oil Painting of People around a Campfire" when flipped]

More Examples

[Figure: visual anagram: "A Photo of a Dog" upright, "A Man Wearing a Hat" when flipped]

[Figure: visual anagram: "A Photo of a Hipster Barista" upright, "A Lithograph of Waterfalls" when flipped]

1.9 Hybrid Images

In part 1.9, the task is to create Hybrid Images using diffusion models. Hybrid images combine the low-frequency content from one image with the high-frequency content from another, producing an image that looks like one thing from a distance and something completely different up close. The estimated noise is obtained by combining the low-frequency noise estimation of one image with the high-frequency noise estimation of another image.
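A sketch of the frequency split using a Gaussian blur as the low-pass filter; the kernel size and sigma are representative choices rather than prescribed values:

```python
import torchvision.transforms.functional as TF

def hybrid_noise(eps_low, eps_high, kernel_size=33, sigma=2.0):
    """Low frequencies from one prompt's noise estimate, high
    frequencies (residual after blurring) from the other's."""
    low = TF.gaussian_blur(eps_low, kernel_size, sigma)
    high = eps_high - TF.gaussian_blur(eps_high, kernel_size, sigma)
    return low + high
```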

Example

[Figure: hybrid image of a skull and a waterfall]

More Examples

[Figure: hybrid image of a man and a skull]

[Figure: hybrid image of a man and a dog]


Part B

Part 1: Single-Step Denoising UNet

Objective: Train a UNet for single-step denoising using MNIST dataset.

The first part of the project focuses on training a UNet model to perform single-step denoising on the MNIST dataset. The objective is to restore clean images from noisy counterparts that have been corrupted with Gaussian noise. This task involves preparing the MNIST dataset by adding noise to simulate real-world degradations. A UNet architecture is then designed to learn the mapping from noisy to clean images. During training, the model minimizes the mean squared error between the predicted and ground truth images. The results are evaluated by comparing denoised outputs at different stages of training, visualizing the loss curve, and testing the model on noise levels unseen during training. The outcomes demonstrate the UNet's ability to effectively learn and improve its denoising performance over time.
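A minimal training loop consistent with this description; the noise level sigma and the optimizer settings are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def train_denoiser(unet, loader, sigma=0.5, epochs=5, lr=1e-4, device="cuda"):
    """Train the UNet to map noisy MNIST digits back to clean ones."""
    opt = torch.optim.Adam(unet.parameters(), lr=lr)
    for _ in range(epochs):
        for x, _ in loader:                        # class labels unused here
            x = x.to(device)
            z = x + sigma * torch.randn_like(x)    # corrupt with Gaussian noise
            loss = F.mse_loss(unet(z), x)          # L2 to the clean image
            opt.zero_grad()
            loss.backward()
            opt.step()
```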

Visualization of Noising Process

[Figure: varying levels of noise applied to MNIST digits]

Training Loss Curve

[Figure: training loss with epoch=5]

[Figure: training loss with epoch=1]

Test Set Results

Results After 1 Epoch

[Figure: results on digits from the test set after 1 epoch of training]

Results After 5 Epochs

[Figure: results on digits from the test set after 5 epochs of training]

Out-of-Distribution Testing

[Figure: results on digits from the test set with varying noise levels]

Part 2: Training a Diffusion Model

Objective: Train a time-conditioned UNet to iteratively denoise images.

In the second part, a time-conditioned UNet is trained to iteratively denoise images as part of a diffusion model framework. The key idea is to generate a sequence of noisy images by progressively adding Gaussian noise, then train the model to reverse this process step by step. The UNet incorporates a time-conditioning mechanism, allowing it to denoise images based on the noise level at a given timestep. The model is trained to minimize the difference between its predictions and the clean images, enabling it to reconstruct high-quality samples from noisy inputs. Training progress is tracked using a loss curve, while the quality of generated images is evaluated by sampling outputs at different stages of training. These results illustrate the model's ability to iteratively improve the image quality through the denoising process.
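One training step might look like the following sketch; the schedule length T and the t/T normalization passed to the UNet are assumptions about the conditioning interface:

```python
import torch
import torch.nn.functional as F

def ddpm_train_step(unet, x0, alphas_cumprod, T=300):
    """Sample a random timestep per image, noise the batch with the
    forward process, and regress the UNet output onto the noise."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(abar) * x0 + torch.sqrt(1 - abar) * eps
    return F.mse_loss(unet(x_t, t.float() / T), eps)
```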

Training Loss Curve

[Figure: time-conditioned UNet training loss curve with epoch=5]

[Figure: time-conditioned UNet training loss curve with epoch=20]

Sampling Results

Diffusion Results After 5 Epochs

[Figure: sampling results after 5 epochs of training]

Diffusion Results After 20 Epochs

[Figure: sampling results after 20 epochs of training]

Class-Conditioned UNet

Objective: Train a class-conditioned UNet with classifier-free guidance.

The final part of the project extends the diffusion model by introducing class conditioning, where the model is guided to generate outputs corresponding to specific classes. This is achieved by incorporating one-hot encoded class labels as an additional input to the UNet model. During training, the model learns to associate noisy images with their respective class labels, allowing it to produce class-specific results during sampling. To further enhance the generated outputs, classifier-free guidance is applied, encouraging the model to better focus on class-related features. The effectiveness of this approach is evaluated by generating samples for each class at different training stages, with visualizations showcasing the diversity and accuracy of the generated images.
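At sampling time the guidance step might look like this sketch; the unconditional pass zeroes the one-hot vector (mirroring the label dropout assumed during training), and the guidance scale is a representative value:

```python
import torch

def class_cfg_estimate(unet, x_t, t, c_onehot, scale=5.0):
    """Blend class-conditional and unconditional noise estimates."""
    eps_cond = unet(x_t, t, c_onehot)
    eps_uncond = unet(x_t, t, torch.zeros_like(c_onehot))
    return eps_uncond + scale * (eps_cond - eps_uncond)
```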

Training Loss Curve

[Figure: class-conditioned UNet training loss curve with epoch=5]

[Figure: class-conditioned UNet training loss curve with epoch=20]

Sampling Results

Diffusion Results After 5 Epochs

[Figure: class-conditioned sampling results after 5 epochs of training]

Diffusion Results After 20 Epochs

[Figure: class-conditioned sampling results after 20 epochs of training]

Bells & Whistles

Time-Conditioned Diffusion GIF

[GIF: time-conditioned diffusion results after 1 epoch]

[GIF: time-conditioned diffusion results after 20 epochs]

Class-Conditioned Diffusion GIF

[GIF: class-conditioned diffusion results after 1 epoch]

[GIF: class-conditioned diffusion results after 20 epochs]