Diffusers documentation

T-GATE

T-GATE accelerates inference for Stable Diffusion, PixArt, and Latent Consistency Model pipelines by skipping the cross-attention calculation once it converges. This method doesn’t require any additional training and can speed up inference by 10-50%. T-GATE is also compatible with other optimization methods like DeepCache.
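
The gating idea can be illustrated with a toy denoising loop. The sketch below is not the tgate implementation: `cross_attention`, `denoise`, and the rule that steps with an index below `gate_step` recompute the attention output are assumptions made purely for illustration. It only shows that, once the gate step is reached, a cached cross-attention output can be reused instead of being recomputed.

```python
# Toy illustration of temporal gating; NOT the tgate package's implementation.
# Assumption: steps with index < gate_step compute cross-attention, and later
# steps reuse the last computed output. The real boundary may differ.

calls = {"cross_attention": 0}


def cross_attention(latent, prompt_embedding):
    """Stand-in for an expensive cross-attention computation (hypothetical)."""
    calls["cross_attention"] += 1
    return [0.1 * p for p in prompt_embedding]


def denoise(latent, prompt_embedding, gate_step, num_inference_steps):
    """Run a fake denoising loop that stops recomputing cross-attention at gate_step."""
    cached = None
    for step in range(num_inference_steps):
        if step < gate_step or cached is None:
            cached = cross_attention(latent, prompt_embedding)
        # From the gate step onward, `cached` is reused without another call.
        latent = [0.9 * x + c for x, c in zip(latent, cached)]
    return latent


calls["cross_attention"] = 0
denoise([1.0, 2.0], [0.5, -0.5], gate_step=8, num_inference_steps=25)
gated_calls = calls["cross_attention"]

calls["cross_attention"] = 0
denoise([1.0, 2.0], [0.5, -0.5], gate_step=25, num_inference_steps=25)
ungated_calls = calls["cross_attention"]

print(gated_calls, ungated_calls)  # 8 25
```

In this toy setup, gating at step 8 of 25 replaces 17 of the 25 cross-attention calls with a cache lookup. How much wall-clock time that saves in a real pipeline depends on the model.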

Before you begin, make sure you install T-GATE.

pip install tgate
pip install -U torch diffusers transformers accelerate DeepCache

To use T-GATE with a pipeline, you need to use its corresponding loader.

| Pipeline | T-GATE Loader |
|---|---|
| PixArt | TgatePixArtLoader |
| Stable Diffusion XL | TgateSDXLLoader |
| Stable Diffusion XL + DeepCache | TgateSDXLDeepCacheLoader |
| Stable Diffusion | TgateSDLoader |
| Stable Diffusion + DeepCache | TgateSDDeepCacheLoader |
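
If you select loaders programmatically, the table above can be expressed as a small lookup. This is a sketch only: the `TGATE_LOADERS` dictionary and the `loader_name` helper are hypothetical and not part of the tgate package, and matching each table row to a Diffusers pipeline class name (for example, "PixArt" to `PixArtAlphaPipeline`) is an assumption. The helper returns loader names as strings, so it runs without tgate installed.

```python
# Loader names copied from the table above. The lookup helper itself is
# hypothetical and is not part of the tgate package.
TGATE_LOADERS = {
    ("PixArtAlphaPipeline", False): "TgatePixArtLoader",
    ("StableDiffusionXLPipeline", False): "TgateSDXLLoader",
    ("StableDiffusionXLPipeline", True): "TgateSDXLDeepCacheLoader",
    ("StableDiffusionPipeline", False): "TgateSDLoader",
    ("StableDiffusionPipeline", True): "TgateSDDeepCacheLoader",
}


def loader_name(pipeline_class_name, use_deepcache=False):
    """Return the T-GATE loader name for a pipeline class name, or raise if none is listed."""
    try:
        return TGATE_LOADERS[(pipeline_class_name, use_deepcache)]
    except KeyError:
        raise ValueError(
            f"No T-GATE loader listed for {pipeline_class_name} (DeepCache={use_deepcache})"
        ) from None


print(loader_name("StableDiffusionXLPipeline", use_deepcache=True))  # TgateSDXLDeepCacheLoader
```

The table lists no PixArt + DeepCache combination, so the helper raises a `ValueError` for that case rather than guessing a name.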

Next, create a TgateLoader with a pipeline, the gate step (the time step to stop calculating the cross attention), and the number of inference steps. Then call the tgate method on the pipeline with a prompt, gate step, and the number of inference steps.

Let’s see how to enable this for several different pipelines.

Accelerate PixArtAlphaPipeline with T-GATE:

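import torch
from diffusers import PixArtAlphaPipeline
from tgate import TgatePixArtLoader

# Load the base pipeline in half precision.
pipe = PixArtAlphaPipeline.from_pretrained("PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16)

gate_step = 8         # time step at which cross-attention stops being calculated
inference_step = 25   # total number of inference steps

# Wrap the pipeline with its corresponding T-GATE loader, then move it to the GPU.
pipe = TgatePixArtLoader(
    pipe,
    gate_step=gate_step,
    num_inference_steps=inference_step,
).to("cuda")

# Generate an image with the gated pipeline.
image = pipe.tgate(
    "An alpaca made of colorful building blocks, cyberpunk.",
    gate_step=gate_step,
    num_inference_steps=inference_step,
).images[0]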

T-GATE also supports StableDiffusionPipeline and PixArt-alpha/PixArt-LCM-XL-2-1024-MS.

Benchmarks

| Model | MACs | Params | Latency | Zero-shot 10K-FID on MS-COCO |
|---|---|---|---|---|
| SD-1.5 | 16.938T | 859.520M | 7.032s | 23.927 |
| SD-1.5 w/ T-GATE | 9.875T | 815.557M | 4.313s | 20.789 |
| SD-2.1 | 38.041T | 865.785M | 16.121s | 22.609 |
| SD-2.1 w/ T-GATE | 22.208T | 815.433M | 9.878s | 19.940 |
| SD-XL | 149.438T | 2.570B | 53.187s | 24.628 |
| SD-XL w/ T-GATE | 84.438T | 2.024B | 27.932s | 22.738 |
| Pixart-Alpha | 107.031T | 611.350M | 61.502s | 38.669 |
| Pixart-Alpha w/ T-GATE | 65.318T | 462.585M | 37.867s | 35.825 |
| DeepCache (SD-XL) | 57.888T | - | 19.931s | 23.755 |
| DeepCache w/ T-GATE | 43.868T | - | 14.666s | 23.999 |
| LCM (SD-XL) | 11.955T | 2.570B | 3.805s | 25.044 |
| LCM w/ T-GATE | 11.171T | 2.024B | 3.533s | 25.028 |
| LCM (Pixart-Alpha) | 8.563T | 611.350M | 4.733s | 36.086 |
| LCM w/ T-GATE | 7.623T | 462.585M | 4.543s | 37.048 |

Latency was measured on an NVIDIA 1080 Ti, MACs and Params were calculated with calflops, and FID was calculated with PytorchFID.
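
To read the latency column as relative savings, the arithmetic can be sketched as follows. The latency values are copied from the benchmark table; the `latency_reduction` helper and the choice to report savings as a percentage of the baseline latency are assumptions made for illustration, not part of the benchmark itself.

```python
# Latency values (seconds) copied from the benchmark table: (baseline, with T-GATE).
latencies = {
    "SD-1.5": (7.032, 4.313),
    "SD-2.1": (16.121, 9.878),
    "SD-XL": (53.187, 27.932),
    "Pixart-Alpha": (61.502, 37.867),
    "DeepCache (SD-XL)": (19.931, 14.666),
    "LCM (SD-XL)": (3.805, 3.533),
    "LCM (Pixart-Alpha)": (4.733, 4.543),
}


def latency_reduction(baseline, gated):
    """Percentage of the baseline latency saved when T-GATE is enabled."""
    return 100.0 * (baseline - gated) / baseline


reductions = {name: latency_reduction(b, g) for name, (b, g) in latencies.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% lower latency")
```

Under this definition, the savings in the table range from about 4% for LCM (Pixart-Alpha) to about 47% for SD-XL. The few-step LCM pipelines benefit least.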
