Recoloring photos with diffusers

Community Article Published October 9, 2024

Another real use case for diffusion models is restoring old photos, and recoloring them is the first step toward that.

This is not something new, there are already some models for it, but nothing beats having one you can run locally. Also, I found that most of them aren't very good and don't give you any control over the final image. The beauty of this method is that you can change the final result to your liking, for example you can prompt for something in a specific color or provide a source image with the colors you want.

As with all my guides, this is just a starting point and not a final product; there's still a lot of room for improvement, but it should be enough to get to the same level as the SOTA in recoloring images.

For people who just want to play and test this, I created a PoC space where you can try these techniques with your own photos. For the space and for the guide I created a custom pipeline to make it easier to use and faster than a normal pipeline, but that also makes it pretty much only usable for this use case.

For this guide I'll use two public domain images that I rescaled and cropped so they are square 1024px images; this is just to simplify the code.
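
If you want to prepare your own photos the same way, a minimal Pillow sketch for the center crop and resize could look like this (the square 1024px size is just what these examples use, not a hard requirement):

from PIL import Image

def to_square_1024(path):
    image = Image.open(path).convert("RGB")
    # center-crop to a square, then resize to 1024x1024
    side = min(image.size)
    left = (image.width - side) // 2
    top = (image.height - side) // 2
    image = image.crop((left, top, left + side, top + side))
    return image.resize((1024, 1024), Image.LANCZOS)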

Migrant Mother by Dorothea Lange
Main street, Rockville, Indiana by Arthur Rothstein

The key points for this technique are:

  • The ReColor ControlNet together with a lightning model (ColorfulXL-Lightning with the TCD scheduler).
  • An IP Adapter with its color/style layer disabled.
  • Disabling the CFG after the second step.
  • A second ControlNet (Union) with a lineart condition.
  • Blending the oversaturated generation back with the original image.

All of these combined are what allows us to get a really good result; now I'll go step by step through why each of them is needed.

If we just use the model and the recolor controlnet, we get these results:


We can clearly see that the model doesn't know how to recolor the images: for the first one it doesn't detect all the people, and for the second one it loses a lot of detail, especially the text. So we need to start fixing these problems.

I know the results are oversaturated and that's fine, we need them like that; you'll see why later.

Using an IP Adapter

I'm going to use an IP Adapter for the sole reason of feeding information about the image to the model; this works almost the same as, or sometimes better than, feeding it a prompt. Since we have a grayscale image, we need to disable the block that feeds the color information.

pipe.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="sdxl_models",
    weight_name="ip-adapter_sdxl_vit-h.safetensors",
    image_encoder_folder="models/image_encoder",
)

# zero out the second layer of up block 0 (the style layer), so the grayscale
# source image doesn't push the result towards monochrome
scale = {
    "up": {"block_0": [1.0, 0.0, 1.0]},
}
pipe.set_ip_adapter_scale(scale)

We get a little more consistency in the image. There are still some weird errors, like the painted hand, but overall it's better. Remember that this solution is not a one-hit wonder; you'll have to generate a couple of times until you get the expected result.

Disabling the CFG after the second step

I did this in the previous guide but didn't write about it, so I'll explain it here.

To accelerate the generation we mostly have two choices that don't degrade quality: compile the model, or disable the CFG as early as possible. Disabling the CFG from the start would effectively double the inference speed; since we're doing it after the second step it's not quite double, but we get roughly the same speed as compiling the model, without the slow first inference and the restrictions that come with it.

As I see it, there's no loss in disabling the CFG after the second step. I don't see any quality degradation, but it will make the generation different than if you used CFG for the whole generation.

You can read more about the effects of disabling the CFG in this paper, and also in the ones recommended by the librarian bot on the same page.

I do this inside the custom pipeline, but it can also be done with callbacks (there's a sketch of that after this snippet).

# inside the custom pipeline's denoising loop: after the second step, keep only
# the conditional half of the embeddings and conditioning images and turn CFG off
if i == 2:
    prompt_embeds = prompt_embeds[-1:]
    add_text_embeds = add_text_embeds[-1:]
    add_time_ids = add_time_ids[-1:]

    added_cond_kwargs = {
        "text_embeds": add_text_embeds,
        "time_ids": add_time_ids,
    }

    controlnet_prompt_embeds = prompt_embeds
    controlnet_added_cond_kwargs = added_cond_kwargs

    image = [single_image[-1:] for single_image in image]
    self._guidance_scale = 0.0
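
For reference, this is roughly how the same trick looks with diffusers' step-end callbacks on a plain SDXL pipeline (just a sketch, not the guide's code). Keep in mind that the ControlNet conditioning images and the IP Adapter embeds are also duplicated for the CFG batch and aren't exposed to callbacks, which is the reason the custom pipeline slices everything directly:

def disable_cfg_after_step_2(pipe, step_index, timestep, callback_kwargs):
    # after the second step, keep only the conditional half of the embeddings
    # and turn CFG off for the remaining steps
    if step_index == 2:
        callback_kwargs["prompt_embeds"] = callback_kwargs["prompt_embeds"][-1:]
        callback_kwargs["add_text_embeds"] = callback_kwargs["add_text_embeds"][-1:]
        callback_kwargs["add_time_ids"] = callback_kwargs["add_time_ids"][-1:]
        pipe._guidance_scale = 0.0
    return callback_kwargs

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=8,
    guidance_scale=2.0,
    callback_on_step_end=disable_cfg_after_step_2,
    callback_on_step_end_tensor_inputs=["prompt_embeds", "add_text_embeds", "add_time_ids"],
).images[0]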

Using a second ControlNet

Adding a second controlnet helps to make a clear delineation between objects and preserve the details, and it also helps the model understand the image better. I haven't tested this with a regular controlnet, but it works really well with ControlNet Union. We're going to use the lineart preprocessor because it produces finer lines than the others; for this I'm going to use a new library we're building for preprocessors and other image and video utilities like upscalers, automatic masking and frame interpolation.

The lineart preprocessor works well for this, but we need to lower the resolution so the lines are thicker; this is mostly because the controlnets are trained with low-resolution images.

from image_gen_aux import LineArtPreprocessor

lineart_preprocessor = LineArtPreprocessor.from_pretrained("OzzyGT/lineart").to("cuda")
# lower the resolution a bit so the extracted lines come out thicker
lineart_image = lineart_preprocessor(source_image, resolution_scale=0.7)[0]
(comparison: resolution_scale=1.0 vs resolution_scale=0.7)

Also, for this controlnet I want to give the model a little freedom, so we're going to set controlnet_conditioning_scale to 0.5 and control_guidance_end to 0.9.
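
Since we're now passing two ControlNets, these arguments become lists in the pipeline call, one entry per ControlNet (recolor first, union/lineart second). This is just an excerpt of the full call at the end of the guide:

image = pipe(
    # ... prompt embeddings and the rest of the arguments ...
    image=[source_image, lineart_image],       # recolor source, lineart condition
    controlnet_conditioning_scale=[1.0, 0.5],  # give the lineart ControlNet less weight
    control_guidance_end=[1.0, 0.9],           # stop the lineart guidance at 90% of the steps
).images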

After this step we get these results:


We're now getting consistent results with the correct colors almost all the time. If you compare these results with other recolor models, I think we're at the same level; personally, I think these are the best results of the ones I've seen, even compared to paid ones.

Still, as always, I'm not completely satisfied with the results. I can see that we lost some details, especially the text and the tiny details in the second image, and the original images also had some fine grain.

So the final step, and what I think is the most important one to get that real look and feel (and the reason I said we need the oversaturated results), is to blend the original image with the generated one. The ratio of the merged images is really important and we're going to lose a lot of the color information, so the more oversaturated the better.

The best ratio I found that preserves the original details is to merge them with the generated image's opacity set to 0.2. When we do this, we won't have much color left, which is why it was important to have an oversaturated image: since we still have some color, we can enhance it even more using Pillow.
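
As a rough sketch, the whole blend with Pillow boils down to this (it mirrors the full code at the end; the saturation factor is the knob to play with):

from PIL import Image, ImageEnhance

def blend_with_original(source, generated, opacity=0.2, saturation=4.0):
    source = source.convert("RGBA")
    generated = generated.convert("RGBA")

    # push the saturation up before blending, since the low opacity
    # washes out most of the color
    generated = ImageEnhance.Color(generated).enhance(saturation)

    # composite the generation over the original at ~20% opacity
    # so the original details and grain are preserved
    alpha = generated.split()[3].point(lambda p: int(p * opacity))
    generated.putalpha(alpha)
    return Image.alpha_composite(source, generated)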

Here's an example of a normal merge, a saturation of 2.0, and a saturation of 3.0:

(comparison: saturation 1.0 vs 2.0 vs 3.0)

The final results for the images (3 samples each, without cherry picking):


As you can see, we get pretty consistent results, but you'll probably need a few tries before you get the result you want. That's why this is good: you can run it locally or in the Hugging Face space as many times as you need, and I made it fast.

Ideas for further improvement:

  • Use different models for different kinds of images.
  • Play with the IP Adapter and ControlNet parameters; the only one that probably shouldn't be changed is the ReColor one.
  • Add the ability to prompt for something specific. I didn't do it here, but you could make the current prompt a template and let the user specify something on top of it, like brunette, green sweater, blue car, etc. This would let the user change or get specific colors for specific parts of the image (see the sketch after this list).
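
For the last idea, a hypothetical version could be as simple as appending the user's color hints to the base prompt (the user_colors string here is just an illustration, it's not part of the guide's code):

base_prompt = "high quality color photo, sharp, detailed, 4k, colorized, remastered"

# hypothetical: let the user add color hints for specific parts of the image
user_colors = "brunette, green sweater, blue car"
prompt = f"{base_prompt}, {user_colors}" if user_colors else base_prompt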

I hope this helps give you some more ideas for cool applications in the future.

Here's the full code:

import torch
from diffusers import (
    AutoencoderKL,
    ControlNetModel,
    StableDiffusionXLControlNetPipeline,
    TCDScheduler,
)
from diffusers.utils import load_image
from image_gen_aux import LineArtPreprocessor
from PIL import Image, ImageEnhance

from controlnet_union import ControlNetModel_Union

# first the recolor ControlNet, then the union ControlNet used for the lineart condition
controlnet = [
    ControlNetModel.from_pretrained(
        "OzzyGT/ControlNet-recolorXL", torch_dtype=torch.float16, variant="fp16"
    ),
    ControlNetModel_Union.from_pretrained(
        "OzzyGT/controlnet-union-promax-sdxl-1.0",
        torch_dtype=torch.float16,
        variant="fp16",
    ),
]

# SDXL VAE fixed to run in fp16 without producing artifacts
vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
).to("cuda")

pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "recoilme/ColorfulXL-Lightning",
    custom_pipeline="OzzyGT/pipeline_sdxl_recolor",
    torch_dtype=torch.float16,
    variant="fp16",
    controlnet=controlnet,
    vae=vae,
).to("cuda")
pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config)

pipe.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="sdxl_models",
    weight_name="ip-adapter_sdxl_vit-h.safetensors",
    image_encoder_folder="models/image_encoder",
)

# disable the second layer of up block 0 (the style layer) so the grayscale
# source doesn't transfer its lack of color
scale = {
    "up": {"block_0": [1.0, 0.0, 1.0]},
}
pipe.set_ip_adapter_scale(scale)

pipe.enable_model_cpu_offload()

source_image = load_image(
    "https://huggingface.co./datasets/OzzyGT/testing-resources/resolve/main/recolor/migrant_mother.jpg?download=true"
)

lineart_preprocessor = LineArtPreprocessor.from_pretrained("OzzyGT/lineart").to("cuda")
lineart_image = lineart_preprocessor(source_image, resolution_scale=0.7)[0]

prompt = "high quality color photo, sharp, detailed, 4k, colorized, remastered"
negative_prompt = "blurry, low resolution, bad quality, pixelated, black and white, b&w, grayscale, monochrome, sepia"

(
    prompt_embeds,
    negative_prompt_embeds,
    pooled_prompt_embeds,
    negative_pooled_prompt_embeds,
) = pipe.encode_prompt(prompt, negative_prompt, "cuda", True)


image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
    image=[source_image, lineart_image],
    ip_adapter_image=source_image,
    num_inference_steps=8,
    guidance_scale=2.0,
    controlnet_conditioning_scale=[1.0, 0.5],
    control_guidance_end=[1.0, 0.9],
).images[0]

if source_image.mode != "RGBA":
    source_image = source_image.convert("RGBA")

if image.mode != "RGBA":
    image = image.convert("RGBA")

# boost the saturation of the generation even more, since the low-opacity
# blend below washes out most of the color
enhancer = ImageEnhance.Color(image)
image = enhancer.enhance(4.0)

# composite the generation over the original at 20% opacity to keep the original details
alpha = image.split()[3]
alpha = alpha.point(lambda p: p * 0.20)
image.putalpha(alpha)

merged_image = Image.alpha_composite(source_image, image)

merged_image.save("recolored.png")