MonoFormer: One Transformer for Both Diffusion and Autoregression
Abstract
Most existing multimodal methods use separate backbones for autoregression-based discrete text generation and diffusion-based continuous visual generation, or they share one backbone by discretizing the visual data so that autoregression can be used for both text and visual generation. In this paper, we study a simple idea: share one transformer for both autoregression and diffusion. The feasibility comes from two main aspects: (i) transformers have been successfully applied to diffusion-based visual generation, and (ii) transformer training for autoregression and diffusion is very similar; the difference merely lies in the attention mask, with diffusion using a bidirectional mask and autoregression a causal mask. Experimental results show that our approach achieves image generation performance comparable to current state-of-the-art methods while maintaining text generation capability. The project is publicly available at https://monoformer.github.io/.
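As a rough illustration of the shared-backbone idea described in the abstract, the sketch below (PyTorch, not the authors' code) shows a single transformer reused for both modalities, where the only per-modality difference is the attention mask: causal for autoregressive text tokens, bidirectional within the span of diffusion (image) tokens. All module names, dimensions, and the `image_span` convention here are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of sharing one transformer for autoregression and diffusion.
# The only per-modality difference is the attention mask.
import torch
import torch.nn as nn


def build_attention_mask(seq_len: int, image_span: tuple) -> torch.Tensor:
    """Causal mask everywhere, except the image-token span attends bidirectionally."""
    mask = torch.ones(seq_len, seq_len).tril().bool()  # causal base (True = may attend)
    start, end = image_span
    mask[start:end, start:end] = True  # image (diffusion) tokens see each other both ways
    return mask


class SharedTransformer(nn.Module):
    """One transformer backbone reused for text autoregression and image diffusion."""

    def __init__(self, dim: int = 512, depth: int = 6, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x: torch.Tensor, attn_mask: torch.Tensor) -> torch.Tensor:
        # nn.TransformerEncoder expects True = "masked out", so invert our keep-mask.
        return self.blocks(x, mask=~attn_mask)


# Toy usage: 16 text tokens followed by 8 image (diffusion) tokens in one sequence.
model = SharedTransformer()
tokens = torch.randn(2, 24, 512)                      # [batch, seq, dim]
mask = build_attention_mask(24, image_span=(16, 24))  # causal text + bidirectional image
out = model(tokens, mask)
```

In a full system, the text positions would be trained with a next-token prediction loss and the image positions with a diffusion denoising loss, but the backbone weights are shared across both objectives.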
Community
Hi @WenhaoWang, congrats on this work!
Opened a PR to add a model card, feel free to edit/expand! https://huggingface.co./MonoFormer/MonoFormer_ImageNet_256/discussions/1
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation (2024)
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model (2024)
- Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining (2024)
- VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling (2024)
- Scalable Autoregressive Image Generation with Mamba (2024)