How far can we go with ImageNet for Text-to-Image generation?
Lucas Degeorge, Arijit Ghosh, Nicolas Dufour, David Picard, Vicky Kalogeiton
This repo has the code and models for the paper "How far can we go with ImageNet for Text-to-Image generation?"The core idea is that text-to-image generation models typically rely on vast datasets, prioritizing quantity over quality. The usual solution is to gather massive amounts of data. We propose a new approach that leverages strategic data augmentation of small, well-curated datasets to enhance the performance of these models. We show that this method improves the quality of the generated images on several benchmarks.
Paper on Arxiv: https://arxiv.org/pdf/2502.21318
GitHub repository: https://github.com/lucasdegeorge/T2I-ImageNet
Project website: https://lucasdegeorge.github.io/projects/t2i_imagenet/
Install
To install, first create a virtual environment with python (at least 3.9), clone the repository and run
pip install -e .
More details here
Pretrained models
CAD-I model
In this repo, the model is trained with Text Augmentation only. Check the model trained with Text and Image Augmentation here
To use the pre-trained model do the following:
from pipe import T2IPipeline
pipe = T2IPipeline("Lucasdegeorge/CAD-I_TA").to("cuda")
prompt = "An adorable otter, with its sleek, brown fur and bright, curious eyes, playfully interacts with a vibrant bunch of broccoli... "
image = pipe(prompt, cfg=15)
If you just want to download the models, not the sampling pipeline, you can do:
from pipe import CAD
model = CAD.from_pretrained("Lucasdegeorge/CAD-I_TA")
DiT-I model
Coming soon ...
Prompts
Our models have been specifically trained to handle very long and detailed prompts. To get the best performance and results, we encourage you to use them with prompts that are rich in detail. Short or vague prompts may not fully utilize the model's capabilities. Example prompts:
A majestic elephant stands tall and proud in the heart of the African savannah, its wrinkled, gray skin glistening under the intense afternoon sun. The elephant's large, flapping ears and long, sweeping trunk create a sense of grace and power as it gently sways, surveying the vast, golden grasslands stretching out before it. In the distance, a herd of zebras grazes peacefully, their stripes blending with the tall, dry grass. The scene is completed by a lone acacia tree silhouetted against the setting sun, casting long, dramatic shadows across the landscape.
A classic film camera rests on a tripod, its worn leather strap and scratched metal body telling the story of countless adventures and captured moments. The camera is positioned in a scenic landscape, with rolling hills, a winding river, and a distant mountain range bathed in the soft, golden light of sunset. In the foreground, a wildflower meadow sways gently in the breeze, while the camera's lens captures the beauty and tranquility of the scene, preserving it for eternity.
A graceful flamingo stands elegantly in the shallow waters of a tranquil lagoon, its vibrant pink feathers reflecting beautifully in the still water. The flamingo's long, slender legs and curved neck create a picture of poise and balance as it dips its beak into the water, searching for food. Behind the flamingo, a lush mangrove forest stretches out, its dense foliage providing a rich habitat for various wildlife. The scene is completed by a clear blue sky and the gentle rustling of leaves in the breeze
A hearty, overstuffed sandwich sits on a wooden cutting board, its layers of fresh, crisp lettuce, juicy tomatoes, and thinly sliced deli meats peeking out from between two slices of golden-brown bread. The sandwich's tantalizing aroma fills the air, mingling with the scent of freshly baked bread and tangy mustard. In the background, a bustling deli comes to life, with shelves lined with jars of pickles, a gleaming meat slicer, and a chalkboard menu listing the day's specials. The scene is completed by the lively chatter of customers and the clinking of glasses.
A stunning oil painting of a majestic tiger hangs on the wall of a dimly-lit art gallery, its vibrant colors and intricate details drawing the viewer in. The tiger's powerful, muscular body is depicted in mid-stride, its stripes blending seamlessly with the lush jungle foliage surrounding it. The painting captures the tiger's intense, amber eyes and the subtle play of light and shadow on its fur, creating a sense of depth and movement. The background features a dense canopy of trees and a cascading waterfall, adding to the wild, untamed atmosphere of the scene.
A clever magpie perched on a rustic wooden fence post, its iridescent black and white feathers shimmering in the sunlight. The bird tilts its head, holding a shiny trinket in its beak, with a backdrop of a golden wheat field swaying gently in the breeze. A few more curios and found objects are scattered along the fence, hinting at the magpie's treasure trove hidden nearby. A clear blue sky with puffy white clouds completes the scenic countryside atmosphere.
A playful dolphin leaps gracefully out of the sparkling turquoise waters, its sleek, gray body arcing through the air before diving back into the waves with a splash. Nearby, a classic wooden sailboat glides smoothly across the ocean, its white sails billowing in the breeze. The dolphin swims alongside the boat, its joyful antics mirrored by the shimmering sunlight dancing on the water's surface. The scene is completed by a clear blue sky and the distant horizon, where the sea meets the sky
Using the Pipeline
The T2IPipeline
class provides a comprehensive interface for generating images from text prompts. Here's a detailed guide on how to use it:
Basic Usage
from pipe import T2IPipeline
# Initialize the pipeline
pipe = T2IPipeline("Lucasdegeorge/CAD-I_TA").to("cuda")
# Generate an image from a prompt
prompt = "An adorable otter, with its sleek, brown fur and bright, curious eyes, playfully interacts with a vibrant bunch of broccoli... "
image = pipe(prompt, cfg=15)
Advanced Configuration
The pipeline can be initialized with several customization options:
pipe = T2IPipeline(
model_path="Lucasdegeorge/CAD-I_TA",
sampler="ddim", # Options: "ddim", "ddpm", "dpm", "dpm_2S", "dpm_2M"
scheduler="sigmoid", # Options: "sigmoid", "cosine", "linear"
postprocessing="sd_1_5_vae",
scheduler_start=-3,
scheduler_end=3,
scheduler_tau=1.1,
device="cuda"
)
Generation Parameters
The pipeline's __call__
method accepts various parameters to control the generation process:
image = pipe(
cond="A beautiful landscape", # Text prompt or list of prompts
num_samples=4, # Number of images to generate
cfg=15, # Classifier-free guidance scale
guidance_type="constant", # Type of guidance: "constant", "linear"
guidance_start_step=0, # Step to start guidance
coherence_value=1.0, # Coherence value for sampling
uncoherence_value=0.0, # Uncoherence value for sampling
thresholding_type="clamp", # Type of thresholding: "clamp", "dynamic_thresholding", "per_channel_dynamic_thresholding"
clamp_value=1.0, # Clamp value for thresholding
thresholding_percentile=0.995 # Percentile for thresholding
)
Guidance Types
constant
: Applies uniform guidance throughout the sampling processlinear
: Linearly increases guidance strength from start to endexponential
: Exponentially increases guidance strength from start to end
Thresholding Types
clamp
: Clamps values to a fixed range usingclamp_value
dynamic
: Dynamically adjusts thresholds based on the batch statisticspercentile
: Uses percentile-based thresholding withthresholding_percentile
Advanced Parameters
For more control over the generation process, you can also specify:
x_N
: Initial noise tensorlatents
: Previous latents for continuationnum_steps
: Custom number of sampling stepssampler
: Custom sampler functionscheduler
: Custom scheduler functionguidance_start_step
: Step to start guidancegenerator
: Random number generator for reproducibilityunconfident_prompt
: Custom unconfident prompt text
Citation
If you happen to use this repo in your experiments, you can acknowledge us by citing the following paper:
@article{degeorge2025farimagenettexttoimagegeneration,
title ={How far can we go with ImageNet for Text-to-Image generation?},
author ={Lucas Degeorges and Arijit Ghosh and Nicolas Dufour and David Picard and Vicky Kalogeiton},
year ={2025},
journal ={arXiv},
}
- Downloads last month
- 19