Model Card for Renaissance Stable Diffusion
This is a Stable Diffusion model fine-tuned on a custom dataset of {image, caption} pairs. It was built on top of the fine-tuning script provided by Hugging Face and uses the KerasCV implementation of stability.ai's text-to-image model. Compared with other open-source alternatives such as Hugging Face's Diffusers, KerasCV offers advantages such as XLA compilation and mixed-precision support, resulting in state-of-the-art generation speed.
Table of Contents
- Model Card for Renaissance Stable Diffusion
- Table of Contents
- Model Details
- Uses
- Bias, Risks, and Limitations
- Training Details
- Results
- Environmental Impact
- Hardware
- Software
- Citation
- Model Card Authors
- Model Card Contact
- How to Get Started with the Model
Model Details
Model Description
This model is a fine-tuned version of stability.ai's Stable Diffusion v1-4 for generating high-quality Renaissance-style portraits. It was fine-tuned from KerasCV's implementation of Stable Diffusion. KerasCV is a deep learning library built on top of TensorFlow and Keras that provides pre-trained models for image classification, object detection, and segmentation, including a simple, easy-to-use implementation of Stable Diffusion for text-to-image generation. Given any text prompt, this fine-tuned model generates an image in the style of a Renaissance-era portrait.
- Developed by: Martin Gasparyan, Tatev Kyosababyan
- Shared by: Martin Gasparyan, Tatev Kyosababyan
- Model type: Computer Vision Model
- Language(s) (NLP): eng
- License: creativeml-openrail-m
- Parent Model: CompVis/stable-diffusion-v1-4
- Resources for more information: More information needed
Uses
Direct Use
The model is intended for research purposes only. Possible research areas and tasks include
- Safe deployment of models which have the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.
- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
- Research on generative models.
Excluded uses are described below.
Out-of-Scope Use
The model was not trained to produce factual or true representations of people or events; using it to generate such content is therefore out of scope.
Misuse and Malicious Use
Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:
- Generating demeaning, dehumanizing, or otherwise harmful representations of people or their environments, cultures, religions, etc.
- Intentionally promoting or propagating discriminatory content or harmful stereotypes.
- Impersonating individuals without their consent.
- Sexual content without consent of the people who might see it.
- Mis- and disinformation
- Representations of egregious violence and gore
- Sharing of copyrighted or licensed material in violation of its terms of use.
- Sharing content that is an alteration of copyrighted or licensed material in violation of its terms of use.
Bias, Risks, and Limitations
Limitations
- The model does not achieve perfect photorealism
- The model cannot render legible text
- The model does not perform well on more difficult tasks which involve compositionality, such as rendering an image corresponding to “A red cube on top of a blue sphere”
- Faces and people in general may not be generated properly.
- The model was trained mainly with English captions and will not work as well in other languages.
- The autoencoding part of the model is lossy
- The underlying Stable Diffusion model was trained on the large-scale dataset LAION-5B, which contains adult material, and is not fit for product use without additional safety mechanisms and considerations.
- No additional measures were used to deduplicate the dataset. As a result, we observe some degree of memorization for images that are duplicated in the training data. The training data can be searched at https://rom1504.github.io/clip-retrieval/ to possibly assist in the detection of memorized images.
Bias
While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases. Stable Diffusion v1 was trained on subsets of LAION-2B(en), which consists of images that are primarily limited to English descriptions. Texts and images from communities and cultures that use other languages are likely to be insufficiently accounted for. This affects the overall output of the model, as white and western cultures are often set as the default. Further, the ability of the model to generate content with non-English prompts is significantly worse than with English-language prompts.
Training Details
Stable Diffusion v1-4 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training,
- Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of f = 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4 (see the shape check after this list).
- Text prompts are encoded through a ViT-L/14 text-encoder.
- The non-pooled output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
- The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet.
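For example, at the 512x512 training resolution used here, the shapes work out as follows (an illustrative snippet only; the actual tensors live inside the KerasCV model):

# Shape bookkeeping for the latent autoencoder (illustration only).
H, W = 512, 512                     # input image height and width
f = 8                               # relative downsampling factor
latent_shape = (H // f, W // f, 4)  # 4 latent channels
print(latent_shape)                 # (64, 64, 4): the space where diffusion runs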
Training Data
We used 11 Renaissance portraits to train the model and created a .csv file with two columns, one for the image path and the other for the textual description. The dataset can be found at https://huggingface.co/datasets/morj/renaissance_portraits, and its split metadata can be queried with
curl -X GET \
"https://datasets-server.huggingface.co/splits?dataset=morj%2Frenaissance_portraits"
Training Procedure
Note: Only the diffusion model is fine-tuned. The VAE and the text encoder are kept frozen.
Training details: The fine-tuning process involves adapting the Stable Diffusion model to the specific task of generating Renaissance-style portraits from textual descriptions.
During training, a diffusion-model checkpoint is saved at the end of an epoch only if the current loss is lower than the previous best. To avoid out-of-memory errors and to speed up training, we used an A100 GPU in Google Colab. We fine-tuned the model at two resolutions, 256x256 and 512x512, varying only the batch size and the number of epochs between the two runs. The best results were obtained at 512x512 with 72 epochs, a batch size of 1, and mixed precision enabled (a sketch of the optimizer schedule and checkpointing follows the hyperparameters below).
Hardware: A100 GPU
Optimizer: AdamW
Batch size: 1
Learning rate: linear warmup to 1e-4 over the first 10,000 steps, then kept constant
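The warmup schedule and the save-only-on-improvement checkpointing described above can be written with standard Keras utilities. This is a hedged sketch rather than the exact training script; the checkpoint path is a placeholder, and on older TensorFlow versions AdamW lives under keras.optimizers.experimental.

import tensorflow as tf
from tensorflow import keras

# Linear warmup to 1e-4 over 10,000 steps, then constant.
class WarmupConstant(keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, target_lr=1e-4, warmup_steps=10_000):
        super().__init__()
        self.target_lr = target_lr
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup = self.target_lr * step / self.warmup_steps
        return tf.minimum(warmup, self.target_lr)

optimizer = keras.optimizers.AdamW(learning_rate=WarmupConstant())

# Save a diffusion-model checkpoint only when the epoch loss improves.
checkpoint_cb = keras.callbacks.ModelCheckpoint(
    filepath="renaissance_model.h5",  # placeholder path
    monitor="loss",
    save_best_only=True,
    save_weights_only=True,
)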
Results
Please check out the GitHub repository at https://github.com/martingasparyan/Fine-Tune-Stable-Diffusion/wiki
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: A100 PCIe 40/80GB
- Hours used: 50
- Cloud Provider: Google Cloud Platform
- Compute Region: us-west1
- Carbon Emitted: 3.75 kg CO2eq
CO2 Emissions Related to Experiments
Experiments were conducted using Google Cloud Platform in region us-west1, which has a carbon efficiency of 0.3 kg CO2eq/kWh. A cumulative total of 50 hours of computation was performed on hardware of type A100 PCIe 40/80GB (TDP of 250 W). Total emissions are estimated at 3.75 kg CO2eq, 100% of which was directly offset by the cloud provider. Estimates were made using the Machine Learning Impact calculator (https://mlco2.github.io/impact#compute) presented in Lacoste et al. (2019):
@article{lacoste2019quantifying,
  title={Quantifying the Carbon Emissions of Machine Learning},
  author={Lacoste, Alexandre and Luccioni, Alexandra and Schmidt, Victor and Dandres, Thomas},
  journal={arXiv preprint arXiv:1910.09700},
  year={2019}
}
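The 3.75 kg CO2eq total can be reproduced directly from the figures above; a quick arithmetic check:

# 50 h on a 250 W (0.25 kW) TDP GPU in a region at 0.3 kg CO2eq/kWh
hours = 50
power_kw = 0.250
carbon_intensity = 0.3            # kg CO2eq per kWh (us-west1)
energy_kwh = hours * power_kw     # 12.5 kWh
emissions = energy_kwh * carbon_intensity
print(emissions)                  # 3.75 kg CO2eq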
Hardware
A100 PCIe 40/80GB
Software
Google Colab, Jupyter Lab
Model Card Authors
Martin Gasparyan, Tatev Kyosababyan
Model Card Contact
[email protected], [email protected]
How to Get Started with the Model
Use the code below to get started with the model.
1. Install Dependencies
!pip install keras-cv==0.6.0 -q
!pip install -U tensorflow -q
!pip install keras-core -q
2. Imports
from textwrap import wrap
import os
import keras_cv
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow.experimental.numpy as tnp
from keras_cv.models.stable_diffusion.clip_tokenizer import SimpleTokenizer
from keras_cv.models.stable_diffusion.diffusion_model import DiffusionModel
from keras_cv.models.stable_diffusion.image_encoder import ImageEncoder
from keras_cv.models.stable_diffusion.noise_scheduler import NoiseScheduler
from keras_cv.models.stable_diffusion.text_encoder import TextEncoder
from tensorflow import keras
3. Create a base Stable Diffusion model
my_base_model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)
4. Load the weights from the .h5 file hosted on Hugging Face:
my_base_model.diffusion_model.load_weights('path/to/file/renaissance_model.h5')
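If the weights file has not been downloaded yet, it can be fetched from the Hub programmatically. This is a hedged sketch; the exact filename in the morj/renaissance repository should be checked on the model page:

from huggingface_hub import hf_hub_download

# Fetch the fine-tuned diffusion weights from the Hub (filename is assumed).
weights_path = hf_hub_download(
    repo_id="morj/renaissance",
    filename="renaissance_model.h5",
)
my_base_model.diffusion_model.load_weights(weights_path)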
5. Generate an image by specifying the prompt, batch size, number of steps, and seed:
img = my_base_model.text_to_image(
prompt='A woman with an enigmatic smile against a dark background',
batch_size=1, # How many images to generate at once
num_steps=25, # Number of iterations (controls image quality)
seed=123, # Set this to always get the same image from the same prompt
)
6. Display the image using a helper function:
def plot_images(images):
    # text_to_image returns a batch of images; show the first one.
    plt.figure(figsize=(5, 5))
    plt.imshow(images[0])
    plt.axis('off')

plot_images(img)
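Optionally, the generated image can also be written to disk, e.g. with keras.utils.save_img (a minimal sketch; the output filename is arbitrary and relies on the keras import from step 2):

# text_to_image returns a uint8 batch; save the first image.
keras.utils.save_img('renaissance_portrait.png', img[0])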