An Introduction to Stable Diffusion
stable_diffusion.ipynb
1 What is Stable Diffusion?
Now, let's go into the theoretical part of Stable Diffusion. Stable Diffusion is based on a particular type of diffusion model called Latent Diffusion, proposed in High-Resolution Image Synthesis with Latent Diffusion Models.
General diffusion models are machine learning systems trained to denoise random Gaussian noise step by step until they arrive at a sample of interest, such as an image. For a more detailed overview of how they work, check this colab. Diffusion models have been shown to achieve state-of-the-art results for generating image data. But one downside of diffusion models is that the reverse denoising process is slow. In addition, these models consume a lot of memory because they operate in pixel space, which becomes unreasonably expensive when generating high-resolution images. It is therefore challenging both to train these models and to use them for inference.
Latent diffusion reduces the memory and compute complexity by applying the diffusion process over a lower-dimensional latent space instead of the actual pixel space. This is the key difference between standard diffusion and latent diffusion models: in latent diffusion, the model is trained to generate latent (compressed) representations of the images.
There are three main components in latent diffusion:
- A Variational Auto-Encoder (VAE).
- A U-Net.
- A text-encoder, e.g. CLIP's Text Encoder.
1.1. Variational Auto-Encoder (VAE)
The VAE model has two parts, an encoder and a decoder.
- The encoder is used to convert the image into a low dimensional latent representation, which will serve as the input to the U-Net model.
- The decoder, conversely, transforms the latent representation back into an image.

During latent diffusion training, the encoder is used to get the latent representations (latents) of the images for the forward diffusion process, which applies more and more noise at each step. During inference, the denoised latents produced by the reverse diffusion process are converted back into images with the VAE decoder. As we will see, during inference we only need the VAE decoder.
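To make this concrete, here is a minimal sketch (not part of the original text) of round-tripping a tensor through the VAE with Hugging Face Diffusers. The model id "runwayml/stable-diffusion-v1-5" and the latent scaling factor 0.18215 used by Stable Diffusion v1.x are assumptions of this sketch.

```python
import torch
from diffusers import AutoencoderKL

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load only the VAE component of a Stable Diffusion checkpoint (assumed model id)
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae").to(device)

# A dummy image batch: 1 image, 3 channels, 512x512, values roughly in [-1, 1]
image = torch.randn(1, 3, 512, 512, device=device)

with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * 0.18215  # encoder: (1, 4, 64, 64) latents
    decoded = vae.decode(latents / 0.18215).sample              # decoder: back to (1, 3, 512, 512)

print(latents.shape, decoded.shape)
```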
1.2. U-Net
The U-Net has an encoder part and a decoder part, both comprised of ResNet blocks. The encoder compresses an image representation into a lower-resolution image representation, and the decoder decodes the lower-resolution representation back into the original higher-resolution representation, which is supposedly less noisy. More specifically, the U-Net output predicts the noise residual, which can be used to compute the predicted denoised image representation. To prevent the U-Net from losing important information while downsampling, short-cut connections are usually added between the downsampling ResNets of the encoder and the upsampling ResNets of the decoder. Additionally, the Stable Diffusion U-Net is able to condition its output on text embeddings via cross-attention layers. The cross-attention layers are added to both the encoder and decoder parts of the U-Net, usually between ResNet blocks.
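As a rough, hedged sketch of what this conditioning looks like in code with Hugging Face Diffusers (the model id is an assumption; the shapes are those used by Stable Diffusion v1.x):

```python
import torch
from diffusers import UNet2DConditionModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load only the U-Net component of a Stable Diffusion checkpoint (assumed model id)
unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet").to(device)

latents = torch.randn(1, 4, 64, 64, device=device)        # a noisy latent image representation
text_embeddings = torch.randn(1, 77, 768, device=device)  # placeholder text conditioning
timestep = torch.tensor([10], device=device)              # current denoising step

with torch.no_grad():
    # The U-Net predicts the noise residual, conditioned on the text via cross-attention
    noise_pred = unet(latents, timestep, encoder_hidden_states=text_embeddings).sample

print(noise_pred.shape)  # same shape as the latents: torch.Size([1, 4, 64, 64])
```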
1.3. Text-Encoder
The text-encoder is responsible for transforming the input prompt, e.g. "An astronaut riding a horse", into an embedding space that can be understood by the U-Net. It is usually a simple transformer-based encoder that maps a sequence of input tokens to a sequence of latent text embeddings. Inspired by Imagen, Stable Diffusion does not train the text-encoder during training and simply uses CLIP's already-trained text encoder, CLIPTextModel.
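As a concrete, hedged illustration, the snippet below loads the CLIP tokenizer and text encoder used by Stable Diffusion v1.x and encodes a prompt. The checkpoint name "openai/clip-vit-large-patch14" and the variable names are assumptions of this sketch; Section 2 walks through the same machinery in much more detail.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

torch_device = "cuda" if torch.cuda.is_available() else "cpu"

# Stable Diffusion v1.x uses the text encoder of this CLIP checkpoint (assumed here)
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(torch_device)

prompt = ["An astronaut riding a horse"]
text_input = tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt")

with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]

print(text_embeddings.shape)  # expected: torch.Size([1, 77, 768])
```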
1.4. Why is latent diffusion fast and efficient?
Since the U-Net of latent diffusion models operates on a low-dimensional space, it greatly reduces the memory and compute requirements compared to pixel-space diffusion models. For example, the autoencoder used in Stable Diffusion has a reduction factor of 8: an image of shape (3, 512, 512) becomes a latent of shape (4, 64, 64), so each spatial dimension shrinks by a factor of 8 and the latent contains 8 × 8 = 64 times fewer spatial positions (roughly 48 times fewer values overall, since the latent has 4 channels). This is why it's possible to generate 512 × 512 images so quickly, even on 16 GB Colab GPUs!
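The back-of-the-envelope arithmetic, as a quick sanity check:

```python
pixel_values = 3 * 512 * 512   # 786,432 values per image in pixel space
latent_values = 4 * 64 * 64    # 16,384 values per image in latent space

print(512 // 64)                     # 8    -> each spatial dimension shrinks by 8
print((512 * 512) // (64 * 64))      # 64   -> 8 x 8 = 64x fewer spatial positions
print(pixel_values / latent_values)  # 48.0 -> ~48x fewer values overall
```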
1.5. Stable Diffusion during inference
Putting it all together, let’s now take a closer look at how the model works
in inference by illustrating the logical flow.
The Stable Diffusion model takes both a latent seed and a text prompt as input. The latent seed is used to generate random latent image representations of size 64 × 64, whereas the text prompt is transformed into text embeddings of size 77 × 768 via CLIP's text encoder.
Next the U-Net iteratively denoises the random latent image representations
while being conditioned on the text embeddings. The output of the U-Net,
being the noise residual, is used to compute a denoised latent image
representation via a scheduler algorithm. Many different scheduler
algorithms can be used for this computation, each having its pros and cons.
For Stable Diffusion, we recommend using one of:
- PNDM scheduler (used by default).
- K-LMS scheduler.
- Heun Discrete scheduler.
- DPM Solver Multistep scheduler. This scheduler is able to achieve great quality in fewer steps; you can try 25 instead of the default 50!

The theory of how these scheduler algorithms work is out of scope for this notebook, but in short one should remember that they compute the predicted denoised image representation from the previous noise representation and the predicted noise residual. For more information, we recommend looking into Elucidating the Design Space of Diffusion-Based Generative Models.

The denoising process is repeated roughly 50 times to retrieve, step by step, better latent image representations. Once complete, the latent image representation is decoded by the decoder part of the variational auto-encoder.

After this brief introduction to Latent and Stable Diffusion, let's see how to make advanced use of Hugging Face Diffusers!
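As a first taste of the Diffusers API, here is a minimal, hedged sketch of the full inference flow described above, with the DPM Solver Multistep scheduler swapped in so that 25 steps are enough. The model id "runwayml/stable-diffusion-v1-5", the output filename, and the availability of a CUDA GPU are assumptions of this sketch.

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

# Load a Stable Diffusion checkpoint (assumed model id) in half precision
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# Swap the default PNDM scheduler for DPM Solver Multistep, which needs fewer steps
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")  # assumes a CUDA GPU is available

# Text encoder -> iterative U-Net denoising with the scheduler -> VAE decoding
image = pipe("An astronaut riding a horse", num_inference_steps=25).images[0]
image.save("astronaut.png")
```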
2 Stable Diffusion Deep Dive
Hugging Face Diffusers
Stable Diffusion Deep Dive.ipynb
Textual Inversion
Before running the script, make sure you install the library from source:
```
(base) yongqiang@yongqiang:/stable_diffusion_work$ git clone https://github.com/huggingface/diffusers.git
(base) yongqiang@yongqiang:/stable_diffusion_work$ cd diffusers/
(base) yongqiang@yongqiang:/stable_diffusion_work/diffusers$ pip install .
(base) yongqiang@yongqiang:/stable_diffusion_work/diffusers$
```
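A quick way to confirm the source install worked (a suggested check, not part of the original transcript):

```python
import diffusers
from diffusers import StableDiffusionPipeline  # should import without errors

print(diffusers.__version__)
```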
2.1. token_embedding and position_embedding
We use a text encoder model to turn our text into a set of embeddings which are fed to the diffusion model as conditioning.
We begin with tokenization:
```python
# Our text prompt
prompt = 'A picture of a puppy'

# Turn the text into a sequence of tokens
text_input = tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt")
print("\ntokenizer.model_max_length:", tokenizer.model_max_length)
print("text_input['input_ids'].shape:", text_input['input_ids'].shape)
print("text_input['input_ids']:\n", text_input['input_ids'])
print("text_input['attention_mask'].shape:", text_input['attention_mask'].shape)
print("text_input['attention_mask']:\n", text_input['attention_mask'])
```

```
tokenizer.model_max_length: 77
text_input['input_ids'].shape: torch.Size([1, 77])
text_input['input_ids']:
 tensor([[49406,   320,  1674,   539,   320,  6829, 49407, 49407, 49407, 49407,
          49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
          49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
          49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
          49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
          49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
          49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
          49407, 49407, 49407, 49407, 49407, 49407, 49407]])
text_input['attention_mask'].shape: torch.Size([1, 77])
text_input['attention_mask']:
 tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0]])
```
```python
# See the individual tokens
# (we'll just look at the first 10, to save you from a wall of '<|endoftext|>')
print("")
for t in text_input['input_ids'][0][:10]:
    print(t, tokenizer.decoder.get(int(t)))
print("")
```

```
tensor(49406) <|startoftext|>
tensor(320) a
tensor(1674) picture
tensor(539) of
tensor(320) a
tensor(6829) puppy
tensor(49407) <|endoftext|>
tensor(49407) <|endoftext|>
tensor(49407) <|endoftext|>
tensor(49407) <|endoftext|>
```

We can get the final (output) embeddings like so:
```python
# Grab the output embeddings
output_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]
print('output_embeddings.shape:', output_embeddings.shape)
print('output_embeddings:\n', output_embeddings)
print('\ntext_encoder.text_model.embeddings:\n', text_encoder.text_model.embeddings)
```

The tokens are transformed into a set of input embeddings, which are then fed through the transformer model to get the final output embeddings.

```
output_embeddings.shape: torch.Size([1, 77, 768])
output_embeddings:
 tensor([[[-0.3884,  0.0229, -0.0522,  ..., -0.4899, -0.3066,  0.0675],
         [ 0.0290, -1.3258,  0.3085,  ..., -0.5257,  0.9768,  0.6652],
         [ 0.6942,  0.3538,  1.0991,  ..., -1.5716, -1.2643, -0.0121],
         ...,
         [-0.0221, -0.0053, -0.0089,  ..., -0.7303, -1.3830, -0.3011],
         [-0.0062, -0.0246,  0.0065,  ..., -0.7326, -1.3745, -0.2953],
         [-0.0536,  0.0269,  0.0444,  ..., -0.7159, -1.3634, -0.3075]]],
       grad_fn=<...>)

text_encoder.text_model.embeddings:
 CLIPTextEmbeddings(
  (token_embedding): Embedding(49408, 768)
  (position_embedding): Embedding(77, 768)
)
```
- Token embeddings
Each token is fed to the token_embedding layer to transform it into a vector. The function name get_input_embeddings here is misleading, since these token embeddings need to be combined with the position embeddings before they are actually used as inputs to the model!
```python
# Token embeddings: access the embedding layer
token_emb_layer = text_encoder.text_model.embeddings.token_embedding
print('\ntoken_emb_layer:', token_emb_layer)  # vocab size 49408, emb_dim 768

# Embed a token - in this case the one for 'puppy'
one_token_embedding = token_emb_layer(torch.tensor(6829, device=torch_device))
print('\none_token_embedding.shape:', one_token_embedding.shape)  # 768-dim representation
print('one_token_embedding:\n', one_token_embedding)

token_embeddings = token_emb_layer(text_input.input_ids.to(torch_device))
print('\ntoken_embeddings.shape:', token_embeddings.shape)  # batch size 1, 77 tokens, 768 values each
print('token_embeddings:\n', token_embeddings)
```

This single token has been mapped to a 768-dimensional vector - the token embedding.

```
token_emb_layer: Embedding(49408, 768)

one_token_embedding.shape: torch.Size([768])

token_embeddings.shape: torch.Size([1, 77, 768])
token_embeddings:
 tensor([[[ 0.0011,  0.0032,  0.0003,  ..., -0.0018,  0.0003,  0.0019],
         [ 0.0013, -0.0011, -0.0126,  ..., -0.0124,  0.0120,  0.0080],
         [ 0.0235, -0.0118,  0.0110,  ...,  0.0049,  0.0078,  0.0160],
         ...,
         [ 0.0012,  0.0077, -0.0011,  ..., -0.0015,  0.0009,  0.0052],
         [ 0.0012,  0.0077, -0.0011,  ..., -0.0015,  0.0009,  0.0052],
         [ 0.0012,  0.0077, -0.0011,  ..., -0.0015,  0.0009,  0.0052]]],
       grad_fn=<...>)
```
- Positional embeddings

Positional embeddings tell the model where in a sequence a token is. Much like the token embedding, this is a set of (optionally learnable) parameters. But now instead of dealing with ~50k tokens, we just need one embedding for each position (77 total).
We can get the positional embedding for each position:

```python
# position embeddings
pos_emb_layer = text_encoder.text_model.embeddings.position_embedding
print('\npos_emb_layer:', pos_emb_layer)

position_ids = text_encoder.text_model.embeddings.position_ids[:, :77]
position_embeddings = pos_emb_layer(position_ids)
print('\nposition_embeddings.shape:', position_embeddings.shape)
print('position_embeddings:\n', position_embeddings)
```

```
pos_emb_layer: Embedding(77, 768)

position_embeddings.shape: torch.Size([1, 77, 768])
position_embeddings:
 tensor([[[ 0.0016,  0.0020,  0.0002,  ..., -0.0013,  0.0008,  0.0015],
         [ 0.0042,  0.0029,  0.0002,  ...,  0.0010,  0.0015, -0.0012],
         [ 0.0018,  0.0007, -0.0012,  ..., -0.0029, -0.0009,  0.0026],
         ...,
         [ 0.0216,  0.0055, -0.0101,  ..., -0.0065, -0.0029,  0.0037],
         [ 0.0188,  0.0073, -0.0077,  ..., -0.0025, -0.0009,  0.0057],
         [ 0.0330,  0.0281,  0.0289,  ...,  0.0160,  0.0102, -0.0310]]],
       grad_fn=<...>)
```
- Combining token and position embeddings

Combining them in this way gives us the final input embeddings, ready to feed through the transformer model:
```python
# token embeddings + position embeddings
# Combining them gives us the final input embeddings
input_embeddings = token_embeddings + position_embeddings
print('\ninput_embeddings.shape:', input_embeddings.shape)
print('input_embeddings:\n', input_embeddings)

# The following combines all the above steps in a single call
input_embeddings_alias = text_encoder.text_model.embeddings(text_input.input_ids.to(torch_device))
print('\ninput_embeddings_alias.shape:', input_embeddings_alias.shape)
print('input_embeddings_alias:\n', input_embeddings_alias)
```

```
input_embeddings.shape: torch.Size([1, 77, 768])
input_embeddings:
 tensor([[[ 2.6770e-03,  5.2133e-03,  4.9323e-04,  ..., -3.1321e-03,  1.0659e-03,  3.4316e-03],
         [ 5.5371e-03,  1.7510e-03, -1.2381e-02,  ..., -1.1410e-02,  1.3508e-02,  6.8378e-03],
         [ 2.5356e-02, -1.1019e-02,  9.7663e-03,  ...,  1.9460e-03,  6.8375e-03,  1.8573e-02],
         ...,
         [ 2.2781e-02,  1.3262e-02, -1.1241e-02,  ..., -8.0054e-03, -2.0560e-03,  8.9366e-03],
         [ 2.0026e-02,  1.5015e-02, -8.7638e-03,  ..., -4.0313e-03,  1.8487e-05,  1.0885e-02],
         [ 3.4206e-02,  3.5826e-02,  2.7768e-02,  ...,  1.4465e-02,  1.1110e-02, -2.5745e-02]]],
       grad_fn=<...>)

input_embeddings_alias.shape: torch.Size([1, 77, 768])
input_embeddings_alias:
 tensor([[[ 2.6770e-03,  5.2133e-03,  4.9323e-04,  ..., -3.1321e-03,  1.0659e-03,  3.4316e-03],
         [ 5.5371e-03,  1.7510e-03, -1.2381e-02,  ..., -1.1410e-02,  1.3508e-02,  6.8378e-03],
         [ 2.5356e-02, -1.1019e-02,  9.7663e-03,  ...,  1.9460e-03,  6.8375e-03,  1.8573e-02],
         ...,
         [ 2.2781e-02,  1.3262e-02, -1.1241e-02,  ..., -8.0054e-03, -2.0560e-03,  8.9366e-03],
         [ 2.0026e-02,  1.5015e-02, -8.7638e-03,  ..., -4.0313e-03,  1.8487e-05,  1.0885e-02],
         [ 3.4206e-02,  3.5826e-02,  2.7768e-02,  ...,  1.4465e-02,  1.1110e-02, -2.5745e-02]]],
       grad_fn=<...>)
```
- Feeding token_embedding + position_embedding through the transformer model

```python
def build_causal_attention_mask(bsz, seq_len, dtype):
    mask = torch.empty(bsz, seq_len, seq_len, dtype=dtype)
    mask.fill_(torch.tensor(torch.finfo(dtype).min))  # fill with a large negative number (acts like -inf)
    mask = mask.triu_(1)  # keep -inf only above the diagonal, zero at and below, to enforce causality
    print('\nmask.shape:', mask.shape)
    print('mask:\n', mask)
    return mask.unsqueeze(1)  # add a batch dimension


def get_output_embeds(input_embeddings):
    # CLIP's text model uses a causal mask, so we prepare it here
    print('\ninput_embeddings.shape:', input_embeddings.shape)
    bsz, seq_len = input_embeddings.shape[:2]
    causal_attention_mask = build_causal_attention_mask(bsz, seq_len, dtype=input_embeddings.dtype)

    # Getting the output embeddings involves calling the model with output_hidden_states=True
    # so that it doesn't just return the pooled final predictions
    encoder_outputs = text_encoder.text_model.encoder(
        inputs_embeds=input_embeddings,
        attention_mask=None,  # we aren't using an attention mask, so that can be None
        causal_attention_mask=causal_attention_mask.to(torch_device),
        output_attentions=None,
        output_hidden_states=True,  # we want the output embeddings, not just the final output
        return_dict=None,
    )

    # We're interested in the output hidden state only
    output = encoder_outputs[0]

    # There is a final layer norm we need to pass these through
    output = text_encoder.text_model.final_layer_norm(output)

    # And now they're ready!
    return output


# Feed our input embeddings through the model with the new function
out_embs_test = get_output_embeds(input_embeddings)
print('\nout_embs_test.shape:', out_embs_test.shape)
print('out_embs_test:\n', out_embs_test)
```

```
input_embeddings.shape: torch.Size([1, 77, 768])

mask.shape: torch.Size([1, 77, 77])
mask:
 tensor([[[ 0.0000e+00, -3.4028e+38, -3.4028e+38,  ..., -3.4028e+38, -3.4028e+38, -3.4028e+38],
         [ 0.0000e+00,  0.0000e+00, -3.4028e+38,  ..., -3.4028e+38, -3.4028e+38, -3.4028e+38],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ..., -3.4028e+38, -3.4028e+38, -3.4028e+38],
         ...,
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00, -3.4028e+38, -3.4028e+38],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,  0.0000e+00, -3.4028e+38],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,  0.0000e+00,  0.0000e+00]]])

out_embs_test.shape: torch.Size([1, 77, 768])
out_embs_test:
 tensor([[[-0.3884,  0.0229, -0.0522,  ..., -0.4899, -0.3066,  0.0675],
         [ 0.0290, -1.3258,  0.3085,  ..., -0.5257,  0.9768,  0.6652],
         [ 0.6942,  0.3538,  1.0991,  ..., -1.5716, -1.2643, -0.0121],
         ...,
         [-0.0221, -0.0053, -0.0089,  ..., -0.7303, -1.3830, -0.3011],
         [-0.0062, -0.0246,  0.0065,  ..., -0.7326, -1.3745, -0.2953],
         [-0.0536,  0.0269,  0.0444,  ..., -0.7159, -1.3634, -0.3075]]],
       grad_fn=<...>)
```
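As a sanity check (an addition, not in the original output), the embeddings computed manually by get_output_embeds should match the output_embeddings we obtained earlier by calling the text encoder directly:

```python
# Expected to print True: get_output_embeds reproduces the text encoder's own output
print(torch.allclose(out_embs_test, output_embeddings, atol=1e-4))
```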