OminiControl: Minimal and Universal Control for Diffusion Transformer
Abstract
In this paper, we introduce OminiControl, a highly versatile and parameter-efficient framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models. At its core, OminiControl leverages a parameter reuse mechanism, enabling the DiT to encode image conditions using itself as a powerful backbone and process them with its flexible multi-modal attention processors. Unlike existing methods, which rely heavily on additional encoder modules with complex architectures, OminiControl (1) effectively and efficiently incorporates injected image conditions with only ~0.1% additional parameters, and (2) addresses a wide range of image conditioning tasks in a unified manner, including subject-driven generation and spatially-aligned conditions such as edges, depth, and more. Remarkably, these capabilities are achieved by training on images generated by the DiT itself, which is particularly beneficial for subject-driven generation. Extensive evaluations demonstrate that OminiControl outperforms existing UNet-based and DiT-adapted models in both subject-driven and spatially-aligned conditional generation. Additionally, we release our training dataset, Subjects200K, a diverse collection of over 200,000 identity-consistent images, along with an efficient data synthesis pipeline to advance research in subject-consistent generation.
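The core idea — condition-image tokens encoded by the DiT itself and processed jointly with the noisy latents through the model's multi-modal attention — can be illustrated with a minimal sketch. This is not the paper's implementation; the class name, dimensions, and use of `nn.MultiheadAttention` are illustrative assumptions standing in for the DiT's actual attention blocks.

```python
import torch
import torch.nn as nn


class JointAttentionSketch(nn.Module):
    """Illustrative sketch of OminiControl-style conditioning:
    condition-image tokens are concatenated with text and noisy-latent
    tokens, and the (shared) attention block attends over the joint
    sequence, so latents can directly attend to the condition."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        # Stand-in for a DiT multi-modal attention processor.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_tok, latent_tok, cond_tok):
        # One joint sequence: [text | noisy latents | condition tokens].
        seq = torch.cat([text_tok, latent_tok, cond_tok], dim=1)
        out, _ = self.attn(seq, seq, seq)
        # Return only the updated noisy-latent tokens.
        start = text_tok.shape[1]
        return out[:, start:start + latent_tok.shape[1]]


# Example shapes (all hypothetical): batch 2, embedding dim 64.
attn = JointAttentionSketch(dim=64)
text = torch.randn(2, 16, 64)     # text tokens
latent = torch.randn(2, 256, 64)  # noisy image latents
cond = torch.randn(2, 256, 64)    # condition tokens (encoded by the same backbone)
updated = attn(text, latent, cond)
print(updated.shape)  # torch.Size([2, 256, 64])
```

In the actual method, the parameter-efficient aspect comes from reusing the frozen DiT weights for the condition branch and training only a small set of adapter parameters (~0.1% of the model), rather than adding a separate encoder.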
Community
🤗 Gradio Demo: https://huggingface.co./spaces/Yuanshi/OminiControl
💻 Code: https://github.com/Yuanshi9815/OminiControl
📄 Paper: https://arxiv.org/abs/2411.15098
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Any-to-3D Generation via Hybrid Diffusion Supervision (2024)
- LaVin-DiT: Large Vision Diffusion Transformer (2024)
- Stable Flow: Vital Layers for Training-Free Image Editing (2024)
- OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction (2024)
- A Simple Approach to Unifying Diffusion-based Conditional Generation (2024)
- SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers (2024)
- Zoomed In, Diffused Out: Towards Local Degradation-Aware Multi-Diffusion for Extreme Image Super-Resolution (2024)