LCM-Lookahead for Encoder-based Text-to-Image Personalization

1Tel Aviv University, 2NVIDIA
*Indicates Equal Contribution

TL;DR: We introduce an LCM-based approach for propagating image-space losses to personalization model training and classifier guidance.



Teaser.

Abstract

Recent advancements in diffusion models have introduced fast sampling methods that can effectively produce high-quality images in just one or a few denoising steps. Interestingly, when these are distilled from existing diffusion models, they often maintain alignment with the original model, retaining similar outputs for similar prompts and seeds. These properties present opportunities to leverage fast sampling methods as a shortcut-mechanism, using them to create a preview of denoised outputs through which we can backpropagate image-space losses. In this work, we explore the potential of using such shortcut-mechanisms to guide the personalization of text-to-image models to specific facial identities. We focus on encoder-based personalization approaches, and demonstrate that by tuning them with a lookahead identity loss, we can achieve higher identity fidelity, without sacrificing layout diversity or prompt alignment. We further explore the use of attention sharing mechanisms and consistent data generation for the task of personalization, and find that encoder training can benefit from both.

How does it work?

We fine-tune an IP-Adapter model using an LCM-based "lookahead" identity loss, consistently generated synthetic data, and a self-attention sharing module, in order to improve identity preservation and prompt alignment.

LCM-Lookahead

GAN inversion papers showed significant improvement when augmenting the standard reconstruction goal with image-space perceptual losses, such as an identity loss. We want to use similar losses for diffusion personalization, but diffusion training operates on intermediate noisy images, and their "clean" approximations are often still noisy, blurred, and full of artifacts. We overcome this limitation by leveraging the alignment between diffusion models and their distilled consistency model versions, essentially using LCM-LoRA to create cleaner "previews" of the final diffusion output.
LCM-LoRA predictions closely align with the results of a full DDPM denoising process. This also holds for personalized models.
The key idea is to simply use these LCM previews in place of the standard (e.g., DDPM or DDIM) single-step denoising approximations. This provides a cleaner signal in early diffusion timesteps, improving the performance of end-to-end guidance or training losses. For technical details needed to make this work, please read the paper.
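To make the idea concrete, below is a minimal sketch of how such a lookahead identity loss can be wired up. The names are illustrative assumptions rather than our released code: unet_lcm stands for the denoising U-Net with LCM-LoRA weights applied (including the adapter branch being tuned), scheduler exposes alphas_cumprod, vae decodes latents to pixels, and face_embedder is an ArcFace-style identity network. The epsilon-to-x0 conversion is the standard single-step estimate; the full LCM parameterization adds boundary-condition scalings omitted here.

import torch
import torch.nn.functional as F

def lookahead_identity_loss(z_t, t, cond, ref_embed,
                            unet_lcm, scheduler, vae, face_embedder):
    # One consistency ("LCM") step: predict noise, then map the noisy latent
    # straight to a clean-latent estimate. Because the distilled LCM weights
    # stay aligned with the base model, this preview resembles the final
    # DDPM output even at early, very noisy timesteps.
    eps = unet_lcm(z_t, t, cond)
    a_t = scheduler.alphas_cumprod[t].view(-1, 1, 1, 1)
    x0_preview = (z_t - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()

    # Decode the preview to image space and score it with an identity loss.
    # Gradients flow back through the preview into the adapter being tuned.
    image = vae.decode(x0_preview)
    pred_embed = face_embedder(image)
    return 1.0 - F.cosine_similarity(pred_embed, ref_embed, dim=-1).mean()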

Consistent Data

We fine-tune the baseline IP-Adapter using synthetic data. To generate the data, we exploit SDXL Turbo's mode collapse, which leads it to generate nearly fixed identities for sufficiently complex subject descriptions (e.g., "aboriginal australian male with narrow eyes and chubby cheeks and wavy hair and hair bun and brown hair"). This lets us create the same individual in many styles and settings, preventing the encoder from collapsing to photorealism.
Consistent identities in multiple styles, generated by leveraging SDXL Turbo's mode collapse.
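As a rough sketch of this data-generation recipe, the snippet below reuses one detailed subject description across several style templates with SDXL Turbo's single-step sampling (via the diffusers library). The style templates are illustrative, not our exact prompt list.

import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16
).to("cuda")

subject = ("aboriginal australian male with narrow eyes and chubby cheeks "
           "and wavy hair and hair bun and brown hair")
styles = [
    "a photo of {}",
    "an oil painting of {}",
    "a watercolor drawing of {}",
    "a 3d render of {}",
]

# SDXL Turbo is designed for single-step sampling without classifier-free
# guidance, so guidance_scale is set to 0.
images = [
    pipe(prompt=s.format(subject),
         num_inference_steps=1, guidance_scale=0.0).images[0]
    for s in styles
]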

Attention Sharing

We follow recent appearance transfer and editing work, and share self-attention features between the target identity image and the image generated under the new prompt. This improves identity similarity, at the cost of some editability.
Self-attention keys and values are extracted from a copy of the denoising U-Net which operates on a noised conditioning image. These are concatenated to the keys and values of the newly generated images. The KV extraction network is optimized w/ LoRA.
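A minimal sketch of this extended self-attention is shown below. It assumes a hypothetical extra_kv pair holding the keys and values cached from the LoRA-tuned U-Net copy that processes the noised conditioning image; how those features are cached and routed to each layer is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSelfAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, extra_kv=None):
        b, n, d = x.shape
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        if extra_kv is not None:
            # Keys/values extracted from the conditioning branch are appended,
            # letting the generated image attend to the target identity.
            k_ref, v_ref = extra_kv
            k = torch.cat([k, k_ref], dim=1)
            v = torch.cat([v, v_ref], dim=1)

        def split(t):
            return t.view(b, -1, self.heads, d // self.heads).transpose(1, 2)

        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)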

Results

Personalized images generated using our approach with unseen images from the Unsplash-50 set. Our model was tuned from an IP-Adapter-Plus-Face-SDXL model.

Comparisons to Prior Work

We compare our method with prior SDXL encoder-based personalization works. For all methods except PhotoMaker, we generated a single image for each input and prompt pair. For PhotoMaker, we generated two and chose the one that performed better. All baselines were run through their Hugging Face Spaces implementations, with any additional ControlNets or negative prompts disabled. IP-A refers to IP-Adapter-Plus-Face-SDXL (our backbone), tested with adapter scales of 0.5 and 1.0.
Comparisons to prior SDXL encoder-based personalization works on unseen images from Unsplash-50.
Comparisons to prior SDXL encoder-based personalization works on celebrity images.
Expanding the InstantID comparison chart with PhotoMaker and our own method. Since the original prompts are not public, we made a best effort to replicate them. Here, we keep the column terminology employed by the original InstantID paper: IP-A refers to IP-Adapter-SDXL, IP-A FaceID* is the experimental version of IP-Adapter-SDXL-FaceID, IP-A FaceID is the IP-Adapter-SD1.5-FaceID model, and IP-A FaceID Plus is the IP-Adapter-SD1.5-FaceID-Plus model.

Limitations

Our model attempts to preserve accessories that are unrelated to the identity and may be unwanted in follow-up generations. Since it is trained on synthetic data where the target images may be stylized, it does not always default to photorealism when not prompted with an explicit style. Finally, it still struggles with rare concepts such as extreme makeup and unusual identities.

BibTeX

If you find our work useful, please cite our paper:

@misc{gal2024lcmlookahead,
    title={LCM-Lookahead for Encoder-based Text-to-Image Personalization}, 
    author={Rinon Gal and Or Lichter and Elad Richardson and Or Patashnik and Amit H. Bermano and Gal Chechik and Daniel Cohen-Or},
    year={2024},
    eprint={2404.03620},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}