An Introduction to Stable Diffusion

stable_diffusion.ipynb

1 What is Stable Diffusion?

Now, let's go into the theoretical part of Stable Diffusion. Stable Diffusion is based on a particular type of diffusion model called Latent Diffusion, proposed in High-Resolution Image Synthesis with Latent Diffusion Models.

General diffusion models are machine learning systems that are trained to denoise random Gaussian noise step by step to arrive at a sample of interest, such as an image. For a more detailed overview of how they work, check this colab. Diffusion models have been shown to achieve state-of-the-art results for generating image data. But one downside of diffusion models is that the reverse denoising process is slow. In addition, these models consume a lot of memory because they operate in pixel space, which becomes unreasonably expensive when generating high-resolution images. Therefore, it is challenging both to train these models and to use them for inference.

Latent diffusion reduces the memory and compute complexity by applying the diffusion process over a lower-dimensional latent space instead of the actual pixel space. This is the key difference between standard diffusion and latent diffusion models: in latent diffusion the model is trained to generate latent (compressed) representations of the images. There are three main components in latent diffusion; a short loading sketch follows the list below.

  1. A Variational Auto-Encoder (VAE).
  2. A U-Net.
  3. A text-encoder, e.g. CLIP's Text Encoder.
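As a rough sketch of how these components map onto code, each one can be loaded separately from the Hugging Face Hub with diffusers and transformers. The model ID below is an assumption (the widely used v1 checkpoint CompVis/stable-diffusion-v1-4); any Stable Diffusion checkpoint with the same layout should work.

from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel

# Assumed checkpoint; swap in the Stable Diffusion checkpoint you actually use.
model_id = "CompVis/stable-diffusion-v1-4"

# 1. The VAE that maps images to and from the latent space.
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")

# 2. The U-Net that predicts the noise residual in latent space.
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")

# 3. CLIP's tokenizer and text encoder that turn the prompt into embeddings.
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")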

1.1. Variational Auto-Encoder (VAE)

The VAE model has two parts, an encoder and a decoder.

  • The encoder is used to convert the image into a low dimensional latent representation, which will serve as the input to the U-Net model.
  • The decoder, conversely, transforms the latent representation back into an image.

During latent diffusion training, the encoder is used to get the latent representations (latents) of the images for the forward diffusion process, which applies more and more noise at each step. During inference, the denoised latents generated by the reverse diffusion process are converted back into images using the VAE decoder. As we will see, during inference we only need the VAE decoder. A short round-trip sketch through the VAE follows below.
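Here is a minimal sketch of that round trip (not the author's code), assuming the vae loaded above and an image tensor of shape [1, 3, 512, 512] scaled to [-1, 1]; the 0.18215 latent scaling factor is the one used by the Stable Diffusion v1 checkpoints.

import torch

image = torch.randn(1, 3, 512, 512)  # stand-in for a real image scaled to [-1, 1]

with torch.no_grad():
    # Encoder: image -> latent distribution -> sampled latents of shape [1, 4, 64, 64]
    latents = vae.encode(image).latent_dist.sample() * 0.18215

    # Decoder: latents -> reconstructed image of shape [1, 3, 512, 512]
    reconstruction = vae.decode(latents / 0.18215).sample

print(latents.shape, reconstruction.shape)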

1.2. U-Net

The U-Net has an encoder part and a decoder part, both comprised of ResNet blocks. The encoder compresses an image representation into a lower-resolution image representation, and the decoder decodes the lower-resolution representation back into the original higher-resolution image representation, which is supposedly less noisy. More specifically, the U-Net output predicts the noise residual, which can be used to compute the predicted denoised image representation.

To prevent the U-Net from losing important information while downsampling, short-cut connections are usually added from the downsampling ResNets of the encoder to the upsampling ResNets of the decoder. Additionally, the Stable Diffusion U-Net is able to condition its output on text embeddings via cross-attention layers. The cross-attention layers are added to both the encoder and decoder parts of the U-Net, usually between ResNet blocks.
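In code, this conditioning boils down to a single call: the U-Net takes the noisy latents, the current timestep, and the text embeddings, and returns the predicted noise residual. A minimal sketch, assuming the unet loaded earlier and placeholder tensors for the inputs:

import torch

latents = torch.randn(1, 4, 64, 64)        # noisy latent image representation
text_embeddings = torch.randn(1, 77, 768)  # stand-in for the CLIP text embeddings
timestep = torch.tensor(999)               # current diffusion timestep

with torch.no_grad():
    # The U-Net predicts the noise residual, conditioned on the text via cross-attention.
    noise_pred = unet(latents, timestep, encoder_hidden_states=text_embeddings).sample

print(noise_pred.shape)  # torch.Size([1, 4, 64, 64]) - same shape as the latents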

1.3. Text-Encoder

The text-encoder is responsible for transforming the input prompt, e.g. "An astronaut riding a horse", into an embedding space that can be understood by the U-Net. It is usually a simple transformer-based encoder that maps a sequence of input tokens to a sequence of latent text embeddings. Inspired by Imagen, Stable Diffusion does not train the text-encoder during training and simply uses CLIP's already-trained text encoder, CLIPTextModel.

1.4. Why is latent diffusion fast and efficient?

Since the U-Net of latent diffusion models operates on a low-dimensional space, it greatly reduces the memory and compute requirements compared to pixel-space diffusion models. For example, the autoencoder used in Stable Diffusion has a reduction factor of 8. This means that an image of shape (3, 512, 512) becomes (4, 64, 64) in latent space: each spatial dimension shrinks by a factor of 8, so there are 8 × 8 = 64 times fewer spatial positions to process. This is why it's possible to generate 512 × 512 images so quickly, even on 16GB Colab GPUs!
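To make that concrete, here is the arithmetic as a few lines of Python, using the Stable Diffusion channel counts (3 image channels in, 4 latent channels out):

pixel_elements = 3 * 512 * 512   # 786,432 values per image in pixel space
latent_elements = 4 * 64 * 64    # 16,384 values per image in latent space (512 / 8 = 64)

print(pixel_elements / latent_elements)  # 48.0x fewer values overall; spatially it is 8 * 8 = 64x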

1.5. Stable Diffusion during inference

Putting it all together, let's now take a closer look at how the model works in inference by illustrating the logical flow.

[Figure: logical flow of Stable Diffusion during inference] https://i-blog.csdnimg.cn/direct/e6df3fd18dec44b39fdb28233da210b3.png

The Stable Diffusion model takes both a latent seed and a text prompt as input. The latent seed is used to generate a random latent image representation of size 64 × 64, whereas the text prompt is transformed into text embeddings of size 77 × 768 via CLIP's text encoder. Next, the U-Net iteratively denoises the random latent image representation while being conditioned on the text embeddings. The output of the U-Net, being the noise residual, is used to compute a denoised latent image representation via a scheduler algorithm. Many different scheduler algorithms can be used for this computation, each having its pros and cons. For Stable Diffusion, we recommend using one of:

  • PNDM scheduler (used by default).
  • K-LMS scheduler.
  • Heun Discrete scheduler.
  • DPM Solver Multistep scheduler. This scheduler is able to achieve great quality in fewer steps. You can try 25 steps instead of the default 50!

The theory of how the scheduler algorithms function is out of scope for this notebook, but in short one should remember that they compute the predicted denoised image representation from the previous noise representation and the predicted noise residual. For more information, we recommend looking into Elucidating the Design Space of Diffusion-Based Generative Models.

The denoising process is repeated roughly 50 times to retrieve better and better latent image representations step by step. Once complete, the latent image representation is decoded by the decoder part of the variational auto-encoder; a condensed sketch of this loop follows below. After this brief introduction to Latent and Stable Diffusion, let's see how to make advanced use of Hugging Face Diffusers!
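Here is that condensed sketch of the inference loop. It assumes the vae, unet, tokenizer, text_encoder, and model_id from the loading sketch in section 1 plus a diffusers scheduler (PNDMScheduler here), omits classifier-free guidance and device handling to stay short, and should be read as an illustration rather than the reference implementation.

import torch
from diffusers import PNDMScheduler

scheduler = PNDMScheduler.from_pretrained(model_id, subfolder="scheduler")

prompt = ["A picture of a puppy"]
text_input = tokenizer(prompt, padding="max_length",
                       max_length=tokenizer.model_max_length, return_tensors="pt")
with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids)[0]  # shape [1, 77, 768]

# Start from pure Gaussian noise in latent space (the "latent seed").
latents = torch.randn(1, unet.config.in_channels, 64, 64) * scheduler.init_noise_sigma

scheduler.set_timesteps(50)
for t in scheduler.timesteps:
    latent_input = scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        noise_pred = unet(latent_input, t, encoder_hidden_states=text_embeddings).sample
    # The scheduler combines the predicted noise residual with the current latents
    # to compute the previous, less-noisy latent image representation.
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# Decode the final latents back into an image with the VAE decoder.
with torch.no_grad():
    image = vae.decode(latents / 0.18215).sample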

2 Stable Diffusion Deep Dive

Hugging Face Diffusers: Stable Diffusion Deep Dive.ipynb, Textual Inversion

Before running the script, make sure you install the library from source:

(base) yongqiang@yongqiang:/stable_diffusion_work$ git clone https://github.com/huggingface/diffusers.git
(base) yongqiang@yongqiang:/stable_diffusion_work$ cd diffusers/
(base) yongqiang@yongqiang:/stable_diffusion_work/diffusers$ pip install .
(base) yongqiang@yongqiang:/stable_diffusion_work/diffusers$

2.1. token_embedding and position_embedding

We use a text encoder model to turn our text into a set of embeddings which are fed to the diffusion model as conditioning. We begin with tokenization:

# Our text prompt
prompt = 'A picture of a puppy'

# Turn the text into a sequence of tokens
text_input = tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt")

print("\ntokenizer.model_max_length:", tokenizer.model_max_length)
print("text_input['input_ids'].shape:", text_input['input_ids'].shape)
print("text_input['input_ids']:\n", text_input['input_ids'])
print("text_input['attention_mask'].shape:", text_input['attention_mask'].shape)
print("text_input['attention_mask']:\n", text_input['attention_mask'])

tokenizer.model_max_length: 77
text_input['input_ids'].shape: torch.Size([1, 77])
text_input['input_ids']:
 tensor([[49406,   320,  1674,   539,   320,  6829, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407]])
text_input['attention_mask'].shape: torch.Size([1, 77])
text_input['attention_mask']:
 tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

# See the individual tokens
print("")
for t in text_input['input_ids'][0][:10]:  # We'll just look at the first 10 to save you from a wall of '<|endoftext|>'
    print(t, tokenizer.decoder.get(int(t)))
print("")

tensor(49406) <|startoftext|>
tensor(320) a
tensor(1674) picture
tensor(539) of
tensor(320) a
tensor(6829) puppy
tensor(49407) <|endoftext|>
tensor(49407) <|endoftext|>
tensor(49407) <|endoftext|>
tensor(49407) <|endoftext|>

We can grab the final (output) embeddings like so:

# Grab the output embeddings
output_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]
print('output_embeddings.shape:', output_embeddings.shape)
print('output_embeddings:\n', output_embeddings)
print('\ntext_encoder.text_model.embeddings:\n', text_encoder.text_model.embeddings)

The tokens are transformed into a set of input embeddings, which are then fed through the transformer model to get the final output embeddings.

output_embeddings.shape: torch.Size([1, 77, 768])
output_embeddings:
 tensor([[[-0.3884,  0.0229, -0.0522,  ..., -0.4899, -0.3066,  0.0675],
         [ 0.0290, -1.3258,  0.3085,  ..., -0.5257,  0.9768,  0.6652],
         [ 0.6942,  0.3538,  1.0991,  ..., -1.5716, -1.2643, -0.0121],
         ...,
         [-0.0221, -0.0053, -0.0089,  ..., -0.7303, -1.3830, -0.3011],
         [-0.0062, -0.0246,  0.0065,  ..., -0.7326, -1.3745, -0.2953],
         [-0.0536,  0.0269,  0.0444,  ..., -0.7159, -1.3634, -0.3075]]],
       grad_fn=<...>)

text_encoder.text_model.embeddings:
 CLIPTextEmbeddings(
  (token_embedding): Embedding(49408, 768)
  (position_embedding): Embedding(77, 768)
)

  • Token embeddings: the token is fed to the token_embedding layer to transform it into a vector. The function name get_input_embeddings is misleading here, since these token embeddings need to be combined with the position embeddings before they are actually used as inputs to the model!

# Token embeddings

# Access the embedding layer
token_emb_layer = text_encoder.text_model.embeddings.token_embedding

# Vocab size 49408, emb_dim 768
print('\ntoken_emb_layer:', token_emb_layer)

# Embed a token - in this case the one for 'puppy'
one_token_embedding = token_emb_layer(torch.tensor(6829, device=torch_device))

# 768-dim representation
print('\none_token_embedding.shape:', one_token_embedding.shape)
print('one_token_embedding:\n', one_token_embedding)

token_embeddings = token_emb_layer(text_input.input_ids.to(torch_device))

# batch size 1, 77 tokens, 768 values for each
print('\ntoken_embeddings.shape:', token_embeddings.shape)
print('token_embeddings:\n', token_embeddings)

This single token has been mapped to a 768-dimensional vector, the token embedding.

token_emb_layer: Embedding(49408, 768)

one_token_embedding.shape: torch.Size([768])

token_embeddings.shape: torch.Size([1, 77, 768])
token_embeddings:
 tensor([[[ 0.0011,  0.0032,  0.0003,  ..., -0.0018,  0.0003,  0.0019],
         [ 0.0013, -0.0011, -0.0126,  ..., -0.0124,  0.0120,  0.0080],
         [ 0.0235, -0.0118,  0.0110,  ...,  0.0049,  0.0078,  0.0160],
         ...,
         [ 0.0012,  0.0077, -0.0011,  ..., -0.0015,  0.0009,  0.0052],
         [ 0.0012,  0.0077, -0.0011,  ..., -0.0015,  0.0009,  0.0052],
         [ 0.0012,  0.0077, -0.0011,  ..., -0.0015,  0.0009,  0.0052]]],
       grad_fn=<...>)

  • Positional embeddings: positional embeddings tell the model where in the sequence a token is. Much like the token embedding, this is a set of (optionally learnable) parameters, but instead of dealing with ~50k tokens we just need one for each position (77 total).

# Position embeddings
pos_emb_layer = text_encoder.text_model.embeddings.position_embedding
print('\npos_emb_layer:', pos_emb_layer)

position_ids = text_encoder.text_model.embeddings.position_ids[:, :77]
position_embeddings = pos_emb_layer(position_ids)
print('\nposition_embeddings.shape:', position_embeddings.shape)
print('position_embeddings:\n', position_embeddings)

We can get the positional embedding for each position:

pos_emb_layer: Embedding(77, 768)

position_embeddings.shape: torch.Size([1, 77, 768])
position_embeddings:
 tensor([[[ 0.0016,  0.0020,  0.0002,  ..., -0.0013,  0.0008,  0.0015],
         [ 0.0042,  0.0029,  0.0002,  ...,  0.0010,  0.0015, -0.0012],
         [ 0.0018,  0.0007, -0.0012,  ..., -0.0029, -0.0009,  0.0026],
         ...,
         [ 0.0216,  0.0055, -0.0101,  ..., -0.0065, -0.0029,  0.0037],
         [ 0.0188,  0.0073, -0.0077,  ..., -0.0025, -0.0009,  0.0057],
         [ 0.0330,  0.0281,  0.0289,  ...,  0.0160,  0.0102, -0.0310]]],
       grad_fn=<...>)

  • Combining token and position embeddings: combining them in this way gives us the final input embeddings, ready to feed through the transformer model:

# Token embeddings + position embeddings

# And combining them we get the final input embeddings
input_embeddings = token_embeddings + position_embeddings
print('\ninput_embeddings.shape:', input_embeddings.shape)
print('input_embeddings:\n', input_embeddings)

# The following combines all the above steps
input_embeddings_alias = text_encoder.text_model.embeddings(text_input.input_ids.to(torch_device))
print('\ninput_embeddings_alias.shape:', input_embeddings_alias.shape)
print('input_embeddings_alias:\n', input_embeddings_alias)

input_embeddings.shape: torch.Size([1, 77, 768])
input_embeddings:
 tensor([[[ 2.6770e-03,  5.2133e-03,  4.9323e-04,  ..., -3.1321e-03,  1.0659e-03,  3.4316e-03],
         [ 5.5371e-03,  1.7510e-03, -1.2381e-02,  ..., -1.1410e-02,  1.3508e-02,  6.8378e-03],
         [ 2.5356e-02, -1.1019e-02,  9.7663e-03,  ...,  1.9460e-03,  6.8375e-03,  1.8573e-02],
         ...,
         [ 2.2781e-02,  1.3262e-02, -1.1241e-02,  ..., -8.0054e-03, -2.0560e-03,  8.9366e-03],
         [ 2.0026e-02,  1.5015e-02, -8.7638e-03,  ..., -4.0313e-03,  1.8487e-05,  1.0885e-02],
         [ 3.4206e-02,  3.5826e-02,  2.7768e-02,  ...,  1.4465e-02,  1.1110e-02, -2.5745e-02]]],
       grad_fn=<...>)

input_embeddings_alias.shape: torch.Size([1, 77, 768])
input_embeddings_alias:
 tensor([[[ 2.6770e-03,  5.2133e-03,  4.9323e-04,  ..., -3.1321e-03,  1.0659e-03,  3.4316e-03],
         [ 5.5371e-03,  1.7510e-03, -1.2381e-02,  ..., -1.1410e-02,  1.3508e-02,  6.8378e-03],
         [ 2.5356e-02, -1.1019e-02,  9.7663e-03,  ...,  1.9460e-03,  6.8375e-03,  1.8573e-02],
         ...,
         [ 2.2781e-02,  1.3262e-02, -1.1241e-02,  ..., -8.0054e-03, -2.0560e-03,  8.9366e-03],
         [ 2.0026e-02,  1.5015e-02, -8.7638e-03,  ..., -4.0313e-03,  1.8487e-05,  1.0885e-02],
         [ 3.4206e-02,  3.5826e-02,  2.7768e-02,  ...,  1.4465e-02,  1.1110e-02, -2.5745e-02]]],
       grad_fn=<...>)

  • Feeding token_embedding + position_embedding through the transformer model

[Figure: feeding the combined embeddings through the CLIP text transformer] https://i-blog.csdnimg.cn/direct/dcf133ee2ea746aa91286c1455438239.png

def build_causal_attention_mask(bsz, seq_len, dtype):
    mask = torch.empty(bsz, seq_len, seq_len, dtype=dtype)
    mask.fill_(torch.tensor(torch.finfo(dtype).min))  # fill with large negative number (acts like -inf)
    mask = mask.triu_(1)  # zero out the lower diagonal to enforce causality
    print('\nmask.shape:', mask.shape)
    print('mask:\n', mask)
    return mask.unsqueeze(1)  # add a batch dimension


def get_output_embeds(input_embeddings):
    # CLIP's text model uses a causal mask, so we prepare it here
    print('\ninput_embeddings.shape:', input_embeddings.shape)
    bsz, seq_len = input_embeddings.shape[:2]
    causal_attention_mask = build_causal_attention_mask(bsz, seq_len, dtype=input_embeddings.dtype)

    # Getting the output embeddings involves calling the model with output_hidden_states=True
    # so that it doesn't just return the pooled final predictions
    encoder_outputs = text_encoder.text_model.encoder(
        inputs_embeds=input_embeddings,
        attention_mask=None,  # We aren't using an attention mask, so that can be None
        causal_attention_mask=causal_attention_mask.to(torch_device),
        output_attentions=None,
        output_hidden_states=True,  # We want the output embs, not the final output
        return_dict=None,
    )

    # We're interested in the output hidden state only
    output = encoder_outputs[0]

    # There is a final layer norm we need to pass these through
    output = text_encoder.text_model.final_layer_norm(output)

    # And now they're ready!
    return output


out_embs_test = get_output_embeds(input_embeddings)  # Feed through the model with our new function
print('\nout_embs_test.shape:', out_embs_test.shape)
print('out_embs_test:\n', out_embs_test)

input_embeddings.shape: torch.Size([1, 77, 768])

mask.shape: torch.Size([1, 77, 77])
mask:
 tensor([[[ 0.0000e+00, -3.4028e+38, -3.4028e+38,  ..., -3.4028e+38, -3.4028e+38, -3.4028e+38],
         [ 0.0000e+00,  0.0000e+00, -3.4028e+38,  ..., -3.4028e+38, -3.4028e+38, -3.4028e+38],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ..., -3.4028e+38, -3.4028e+38, -3.4028e+38],
         ...,
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00, -3.4028e+38, -3.4028e+38],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,  0.0000e+00, -3.4028e+38],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,  0.0000e+00,  0.0000e+00]]])

out_embs_test.shape: torch.Size([1, 77, 768])
out_embs_test:
 tensor([[[-0.3884,  0.0229, -0.0522,  ..., -0.4899, -0.3066,  0.0675],
         [ 0.0290, -1.3258,  0.3085,  ..., -0.5257,  0.9768,  0.6652],
         [ 0.6942,  0.3538,  1.0991,  ..., -1.5716, -1.2643, -0.0121],
         ...,
         [-0.0221, -0.0053, -0.0089,  ..., -0.7303, -1.3830, -0.3011],
         [-0.0062, -0.0246,  0.0065,  ..., -0.7326, -1.3745, -0.2953],
         [-0.0536,  0.0269,  0.0444,  ..., -0.7159, -1.3634, -0.3075]]],
       grad_fn=<...>)
