EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models
These are the official weights of the ''Edgen'' model trained by EvolveDirector. For more details, please refer to our paper and code repo.
Setup
Requirements
- Build virtual environment for EvolveDirector
```bash
# create virtual environment for EvolveDirector
conda create -n evolvedirector python=3.9
conda activate evolvedirector

# cd to the path of this repo
# install packages
pip install --upgrade pip
pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
pip install -U transformers accelerate diffusers SentencePiece ftfy beautifulsoup4
```
Usage
- Inference
```bash
# put your text prompts in text_prompts.txt, one per line
python Inference/inference.py --image_size=1024 \
    --t5_path "./model" \
    --tokenizer_path "./model/sd-vae-ft-ema" \
    --txt_file "text_prompts.txt" \
    --model_path "model/Edgen_1024px_v1.pth" \
    --save_folder "output/test_model"
```
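The file passed via `--txt_file` holds one text prompt per line. A minimal sketch for generating it (the prompts below are illustrative placeholders, not from the paper):

```python
# Minimal sketch: write prompts, one per line, to the file passed via --txt_file.
# The prompts here are placeholder examples.
prompts = [
    "a cat wearing sunglasses, studio lighting",
    "a watercolor painting of a mountain lake at sunrise",
]

with open("text_prompts.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(prompts) + "\n")
```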
Citation
@article{zhao2024evolvedirector,
title={EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models},
author={Zhao, Rui and Yuan, Hangjie and Wei, Yujie and Zhang, Shiwei and Gu, Yuchao and Ran, Lingmin and Wang, Xiang and Wu, Zhangjie and Zhang, Junhao and Zhang, Yingya and others},
journal={arXiv preprint arXiv:2410.07133},
year={2024}
}
Shoutouts
- This code builds heavily on PixArt-$\alpha$. Thanks for open-sourcing!