TDD: Target-Driven Distillation

Consistency Distillation with Target Timestep Selection and Decoupled Guidance

Ziyuan Guo2,
Huaxia Li2,
Nemo Chen2,
Xu Tang2,
Yao Hu2
1Zhejiang University, 2Xiaohongshu, 3Shanghai Jiao Tong University

Visual comparison among different methods. We have also released a detailed comparison between our method and TCD; ours demonstrates advantages in both image complexity and clarity.

Abstract

Consistency distillation methods have demonstrated significant success in accelerating generative tasks of diffusion models. However, because previous consistency distillation methods use simple and straightforward strategies for selecting target timesteps, they often suffer from blurriness and loss of detail in generated images. To address these limitations, we introduce Target-Driven Distillation (TDD), which (1) adopts a careful selection strategy for target timesteps, improving training efficiency; (2) utilizes decoupled guidance during training, making the guidance scale tunable at inference time; (3) can optionally be equipped with non-equidistant sampling and x0 clipping, enabling more flexible and accurate image sampling. Experiments verify that TDD achieves state-of-the-art performance in few-step generation, offering a better choice among consistency distillation models.
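The decoupled-guidance idea from point (2) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `unet`, `w`, and `drop_prob` are hypothetical names, and plain floats stand in for tensors and embeddings.

```python
import random

def guided_prediction(unet, x_t, t, text_emb, null_emb, w):
    """Standard classifier-free guidance (CFG) at scale w:
    extrapolate from the unconditional toward the conditional prediction."""
    eps_cond = unet(x_t, t, text_emb)
    eps_uncond = unet(x_t, t, null_emb)
    return eps_uncond + w * (eps_cond - eps_uncond)

def training_condition(text_emb, null_emb, drop_prob=0.1):
    """During distillation, replace a fraction of text conditions with the
    empty (unconditional) prompt, mirroring standard CFG training so the
    guidance scale stays tunable at inference time."""
    return null_emb if random.random() < drop_prob else text_emb
```

Because the student still sees unconditional prompts during training, users can later trade off prompt fidelity against content richness by adjusting `w` at inference, as the paper's guidance-scale tuning technique proposes.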

Pipeline


We propose Target-Driven Distillation (TDD), a multi-target approach that emphasizes carefully selected target timesteps during the distillation process. Our method involves three key designs. First, for any timestep, TDD selects a nearby later timestep that lies on one of a predefined set of few-step equidistant denoising schedules (e.g. 4--8 steps). This eliminates long-distance predictions and focuses only on the timesteps likely to be visited at inference under the different schedules. TDD also incorporates a stochastic offset that pushes the selected timestep further toward the final target timestep, in order to accommodate non-deterministic sampling such as γ-sampling. Second, when distilling classifier-free guidance (CFG) into the student model, TDD additionally replaces a portion of the text conditions with unconditional (i.e. empty) prompts, aligning with the standard CFG training process. This design makes TDD compatible with a proposed inference-time guidance-scale tuning technique, allowing user-specified balances between the accuracy and the richness of image contents conditioned on text prompts. Finally, TDD can optionally use a non-equidistant sampling method that makes short-distance predictions at initial steps and long-distance ones at later steps, which improves overall image quality, and it adopts x0 clipping to prevent out-of-bound predictions and address overexposure.
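The target selection described above can be sketched as follows. This is an assumed, simplified rendering: the 1000-step training discretization, the schedule set `(4, ..., 8)`, and the offset ratio `gamma` are illustrative choices, and the offset rule is a plausible reading of "pushing the target further ahead", not the paper's exact formula.

```python
import random

NUM_TRAIN_TIMESTEPS = 1000  # assumed training discretization

def schedule_timesteps(num_steps, num_train=NUM_TRAIN_TIMESTEPS):
    """Equidistant denoising schedule: the timesteps an num_steps-step
    sampler would actually visit, from high noise toward clean."""
    stride = num_train // num_steps
    return [num_train - 1 - i * stride for i in range(num_steps)]

def select_target_timestep(t, schedules=(4, 5, 6, 7, 8), gamma=0.3):
    """Pick the consistency-loss target for training timestep t:
    the nearest less-noisy timestep lying on a randomly chosen few-step
    schedule, then stochastically pushed further toward 0 (the final
    target) to accommodate non-deterministic gamma-sampling."""
    num_steps = random.choice(schedules)
    candidates = [s for s in schedule_timesteps(num_steps) if s < t]
    if not candidates:
        return 0  # past the last schedule point: target the clean sample
    target = max(candidates)  # nearest schedule timestep ahead of t
    offset = random.uniform(0.0, gamma)  # stochastic push toward t = 0
    return int(target * (1.0 - offset))
```

Restricting targets to schedule points means the student only learns the short jumps it will actually make at inference, rather than arbitrary long-distance predictions.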

Experiment

Qualitative comparison of different methods with NFEs from 4 to 8 steps.

Gallery of Stable Diffusion XL + TDD

Images generated by TDD w/o adv and TDD w/ adv

Gallery of TDD + Different Base/Lora/ControlNet
