license: other
datasets:
  - MIMIC-CXR
  - NIH-CXR
  - CheXpert
library_name: diffusers
extra_gated_prompt: >-
  Please confirm that you have read and agree to the following disclaimer.

  The model(s) and/or software described in this repository are provided for
  research and development use only. The model(s) and/or software are not
  intended for use in clinical decision-making or for any other clinical use,
  and performance for clinical use has not been established. You bear sole
  responsibility for any use of these model(s) and/or software, including
  incorporation into any product intended for clinical use.
extra_gated_fields:
  I have read and agree to the disclaimer: checkbox

Model card for RadEdit

Model description

RadEdit is a deep learning approach for stress testing biomedical vision models to discover failure cases. It uses a generative text-to-image model to “edit” chest X-rays, using a text description to add or remove abnormalities from a masked region of the image. These edited images can subsequently be used to test whether existing models (e.g. those for disease classification or anatomy segmentation) perform as expected under these different conditions.

RadEdit Banner

To enable this, a text-to-image latent diffusion model is trained from scratch to generate chest X-rays from either the impression section of a radiology report (a short clinically actionable outline of the main findings) or a list of radiographic observations.

RadEdit is described in detail in RadEdit: stress-testing biomedical vision models via diffusion image editing (F. Pérez-García, S. Bond-Taylor, et al., 2024).

We release the weights for the RadEdit model as well as the editing pipeline for stress-testing models.

Uses

Intended Use

The model checkpoints are intended to be used solely for (I) future research on chest X-ray generation and model stress-testing and (II) reproducibility of the experimental results reported in the reference paper. The code and model checkpoints should not be used to provide medical or clinical opinions, and are not designed to replace the role of qualified medical professionals in appropriately identifying, assessing, diagnosing or managing medical conditions. Users remain responsible for any outputs generated by the model.

Primary Intended Use

The primary intended use is to support AI researchers reproducing and building on top of this work. RadEdit and its associated models should be helpful for exploring various biomedical stress-testing tasks via image editing or generation.

Out-of-Scope Use

Any deployed use case of the model, commercial or otherwise, is out of scope. Although we evaluated the models using a broad set of publicly-available research benchmarks, the models and evaluations are intended for research use only and not intended for deployed use cases.

Data

RadEdit was trained on the following public deidentified chest X-ray datasets. Only the frontal view chest X-rays are used, totalling 487,680 training images. For MIMIC-CXR, the impression section of the radiology report (a short clinically actionable outline of the main findings) is used as the input text to the model. For NIH-CXR and CheXpert, the input text is a list of all abnormalities present in an image as indicated by the labels, e.g., “Cardiomegaly. Pneumothorax.”
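As a minimal sketch of how a label-based prompt such as “Cardiomegaly. Pneumothorax.” could be assembled, the snippet below joins the names of all positive labels. The function name `labels_to_prompt`, the treatment of any value other than 1 as "not present", and the `empty_prompt` fallback are assumptions for illustration; the exact preprocessing used for training is not specified here.

```python
from typing import Mapping


def labels_to_prompt(labels: Mapping[str, int], empty_prompt: str = "") -> str:
    """Join the names of all positive labels into a single prompt string.

    Hypothetical helper: only labels equal to 1 count as present, and
    `empty_prompt` is an assumed fallback when nothing is positive.
    """
    positive = [name for name, value in labels.items() if value == 1]
    if not positive:
        return empty_prompt
    # Each abnormality becomes its own short sentence.
    return " ".join(f"{name}." for name in positive)


example_labels = {"Cardiomegaly": 1, "Edema": 0, "Pneumothorax": 1}
print(labels_to_prompt(example_labels))  # Cardiomegaly. Pneumothorax.
```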

MIMIC-CXR

The MIMIC-CXR dataset contains 377,110 image-report pairs from 227,827 radiology studies. A patient may have multiple studies, and each study may contain multiple chest X-ray (CXR) images taken from different views. We follow the standard partition and use the first nine subsets (P10-P18) for training and validation, while reserving the last (P19) for testing.
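The partition described above can be sketched as a small lookup. This assumes that the subset (P10 to P19) is given by the first two digits of the numeric subject ID, which matches the usual MIMIC-CXR folder layout; the function name `mimic_cxr_split` is hypothetical.

```python
def mimic_cxr_split(subject_id: int) -> str:
    """Map a MIMIC-CXR subject ID to 'train_val' (P10-P18) or 'test' (P19).

    Assumes the subset is encoded in the first two digits of the subject ID.
    """
    group = int(str(subject_id)[:2])
    if not 10 <= group <= 19:
        raise ValueError(f"Unexpected subject ID: {subject_id}")
    return "test" if group == 19 else "train_val"


print(mimic_cxr_split(10000032))  # train_val
print(mimic_cxr_split(19000001))  # test
```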

NIH-CXR

The NIH-CXR dataset contains 112,120 X-ray images with 8 automatically generated disease labels from 30,805 unique patients. Since there is no official validation split, we create a random train/validation split, ensuring that no patient appears in both sets.
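A patient-level random split of this kind could look like the sketch below. The validation fraction, the seed, and the function name `split_by_patient` are assumptions for illustration, since the actual split parameters are not stated here. The key point is that patients, not images, are shuffled and assigned to a set.

```python
import random
from collections import defaultdict


def split_by_patient(records, val_fraction=0.1, seed=0):
    """Split (image_id, patient_id) records so that no patient is in both sets.

    `val_fraction` and `seed` are illustrative defaults, not the values
    used to train RadEdit.
    """
    images_by_patient = defaultdict(list)
    for image_id, patient_id in records:
        images_by_patient[patient_id].append(image_id)

    # Shuffle patients (not images) so all images of a patient stay together.
    patients = sorted(images_by_patient)
    random.Random(seed).shuffle(patients)
    num_val = max(1, round(val_fraction * len(patients)))
    val_patients = set(patients[:num_val])

    train_ids, val_ids = [], []
    for patient_id, image_ids in images_by_patient.items():
        target = val_ids if patient_id in val_patients else train_ids
        target.extend(image_ids)
    return train_ids, val_ids
```

Splitting at the image level instead would risk placing different images of the same patient in both sets, which leaks information into validation.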

CheXpert

The CheXpert dataset contains 224,316 chest X-ray images from 65,240 patients together with automatically generated labels indicating the presence of 14 observations in radiology reports. We use the official train/validation split.

Biases, Risks and Limitations

The model was developed using English corpora, and thus may be considered English-only. The model is evaluated on a narrow set of biomedical benchmark tasks, described in the RadEdit paper. As such, it is not suitable for use in any clinical setting. Under some conditions, the model may make inaccurate predictions and display limitations, which may require additional mitigation strategies. The model is also likely to carry many of the limitations of the models from which it is derived: Stable Diffusion v1.5, BioViL-T, and SDXL-VAE. In particular, the SDXL-VAE (which is used to compress images prior to training the diffusion model) can exhibit artefacts in its reconstructions, which can make generated images distinguishable from real images. See Figure 12 in Taming Transformers for High-Resolution Image Synthesis for examples of such artefacts. While evaluation has included clinical input, this is not exhaustive; model performance will vary in different settings, and the model is intended for research use only.

Further, the model inherits the biases of the training datasets. These datasets come from hospitals in the United States; therefore, the model might be biased towards the population in the training data. Underlying biases of the training datasets may not be well characterized. A substantial proportion of the training data comes from inpatient medical records; samples from the model are thus reflective of this population. Due to the automated procedure used to obtain pathology labels, erroneous labels may have been used to train the model, which may affect its performance.

The RadEdit editing pipeline is not applicable to all stress testing scenarios. For example, testing how segmentation models behave under cardiomegaly (an enlarged heart) is not possible, as this would require the segmentation masks to be changed. Other limitations of the editing procedure are discussed in the RadEdit paper.

Other limitations:

  • The model does not achieve perfect photorealism.
  • Model outputs may include errors.
  • The model can fail to produce aligned outputs for more complex prompts.
  • The model can fail to produce outputs matching the text input, particularly if the text differs substantially from the training data.
  • When using the model for image editing, unwanted visual changes may be made.

Getting Started

This repository provides the weights for the U-Net model. The VAE, text encoder, tokenizer, and scheduler have to be loaded separately and combined into the generation pipeline:

from transformers import AutoModel, AutoTokenizer
from diffusers import AutoencoderKL, DDIMScheduler, StableDiffusionPipeline, UNet2DConditionModel

# Load the UNet model
unet_loaded = UNet2DConditionModel.from_pretrained("microsoft/radedit", subfolder="unet")

# Load all other components of the stable diffusion pipeline
vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae")
text_encoder = AutoModel.from_pretrained(
    "microsoft/BiomedVLP-BioViL-T",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/BiomedVLP-BioViL-T",
    model_max_length=128,
    trust_remote_code=True,
)
scheduler = DDIMScheduler(
    beta_schedule="linear",
    clip_sample=False,
    prediction_type="epsilon",
    timestep_spacing="trailing",
    steps_offset=1,
)

generation_pipeline = StableDiffusionPipeline(
    vae=vae,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    unet=unet_loaded,
    scheduler=scheduler,
    safety_checker=None,
    requires_safety_checker=False,
    feature_extractor=None,
)
generation_pipeline.to("cuda")

Sampling Chest X-Rays

The generation pipeline can be used to sample images as follows:

import torch

prompts = [
    "Small right-sided pleural effusion",
    "No acute cardiopulmonary process",
    "Small left-sided pleural effusion",
    "Large right-sided pleural effusion",
    "Bilateral pleural effusions",
    "Large left-sided pleural effusion",
]

torch.manual_seed(0)
images = generation_pipeline(
    prompts,
    num_inference_steps=100,
    guidance_scale=7.5,
).images

RadEdit Samples

Editing

To load the RadEdit editing pipeline, we convert the generation pipeline into the custom pipeline defined in pipeline.py:

from diffusers import DiffusionPipeline
radedit_pipeline = DiffusionPipeline.from_pipe(
    generation_pipeline,
    custom_pipeline="microsoft/radedit",
)

Following this, RadEdit can be used to edit an input image (input_img) using two masks: edit_mask, which defines the region where the editing prompt is applied, and keep_mask, which defines the region where edits are prevented.

# input_img, input_mask and fixed_mask must be prepared by the caller beforehand
prompt = 'No acute cardiopulmonary process'
arrays = radedit_pipeline(
    prompt,
    weights=[7.5],
    image=input_img,
    edit_mask=input_mask,
    keep_mask=fixed_mask,
    num_inference_steps=200,
    invert_prompt='',
    skip_ratio=0.3,
)
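Because edit_mask marks where changes are wanted and keep_mask marks where they are forbidden, it seems reasonable to check that the two masks are binary, equally shaped and non-overlapping before calling the pipeline. The sketch below does this with NumPy arrays on a toy 8×8 example; the helper name `check_masks`, the use of NumPy, and the disjointness requirement are assumptions, as the exact mask type and shape expected by the pipeline are not specified here.

```python
import numpy as np


def check_masks(edit_mask: np.ndarray, keep_mask: np.ndarray) -> None:
    """Raise ValueError if the masks are not binary, same-shaped and disjoint.

    Hypothetical sanity check; it is not part of the RadEdit pipeline.
    """
    if edit_mask.shape != keep_mask.shape:
        raise ValueError("edit_mask and keep_mask must have the same shape")
    for name, mask in (("edit_mask", edit_mask), ("keep_mask", keep_mask)):
        if not np.isin(mask, (0, 1)).all():
            raise ValueError(f"{name} must contain only 0 and 1")
    if np.logical_and(edit_mask, keep_mask).any():
        raise ValueError("edit_mask and keep_mask must not overlap")


# Toy example: edit a central square, keep the bottom two rows fixed.
edit = np.zeros((8, 8), dtype=np.uint8)
edit[2:5, 2:5] = 1
keep = np.zeros((8, 8), dtype=np.uint8)
keep[6:, :] = 1
check_masks(edit, keep)  # passes silently
```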

Training details

We train the U-Net for 300 epochs, monitoring validation loss to avoid overfitting. During training we regularly evaluate a number of different metrics which assess the quality, diversity and alignment between prompt and generation, including FID, precision/recall/density/coverage, and CLIP score to ensure that samples are high quality and diverse.

Environmental impact

  • Hardware type: NVIDIA V100 GPUs
  • Hours used: 318 hours/GPU × 1 node × 8 GPUs/node = 2544 GPU-hours
  • Cloud provider: Azure
  • Compute region: West US 2
  • Carbon emitted: 229 kg CO₂ eq.
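The figures above can be checked with a few lines of arithmetic. The per-GPU-hour emission rate is derived here purely from the two reported totals and is not a number stated in the original; the variable names are illustrative.

```python
hours_per_gpu = 318
num_nodes = 1
gpus_per_node = 8
carbon_kg = 229  # reported total, in kg CO2 eq.

# Total compute: hours per GPU times the number of GPUs used.
gpu_hours = hours_per_gpu * num_nodes * gpus_per_node
print(gpu_hours)  # 2544

# Implied emissions per GPU-hour (derived value, not reported above).
kg_per_gpu_hour = carbon_kg / gpu_hours
print(round(kg_per_gpu_hour, 3))  # 0.09
```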

Compute infrastructure

RadEdit was trained on Azure Machine Learning.

Software

We used SimpleITK and Pydicom for processing of DICOM files.

Citation

BibTeX:

@inproceedings{perez-garcia_bond-taylor_radedit,
    title        = {{RadEdit}: Stress-Testing Biomedical Vision Models via Diffusion Image Editing},
    author       = {P{\'e}rez-Garc{\'i}a, Fernando and Bond-Taylor, Sam and Sanchez, Pedro P. and van Breugel, Boris and Castro, Daniel C. and Sharma, Harshita and Salvatelli, Valentina and Wetscherek, Maria T. A. and Richardson, Hannah and Lungren, Matthew P. and Nori, Aditya and Alvarez-Valle, Javier and Oktay, Ozan and Ilse, Maximilian},
    year         = 2025,
    booktitle    = {Computer Vision -- ECCV 2024},
    publisher    = {Springer Nature Switzerland},
    address      = {Cham},
    pages        = {358--376},
    isbn         = {978-3-031-73254-6},
    editor       = {Leonardis, Ale{\v{s}} and Ricci, Elisa and Roth, Stefan and Russakovsky, Olga and Sattler, Torsten and Varol, G{\"u}l},
    abstract     = {Biomedical imaging datasets are often small and biased, meaning that real-world performance of predictive models can be substantially lower than expected from internal testing. This work proposes using generative image editing to simulate dataset shifts and diagnose failure modes of biomedical vision models; this can be used in advance of deployment to assess readiness, potentially reducing cost and patient harm. Existing editing methods can produce undesirable changes, with spurious correlations learned due to the co-occurrence of disease and treatment interventions, limiting practical applicability. To address this, we train a text-to-image diffusion model on multiple chest X-ray datasets and introduce a new editing method, RadEdit, that uses multiple image masks, if present, to constrain changes and ensure consistency in the edited images, minimising bias. We consider three types of dataset shifts: acquisition shift, manifestation shift, and population shift, and demonstrate that our approach can diagnose failures and quantify model robustness without additional data collection, complementing more qualitative tools for explainable AI.}
}

APA:

Pérez-García, F., Bond-Taylor, S., Sanchez, P. P., van Breugel, B., Castro, D. C., Sharma, H., … Ilse, M. (2025). RadEdit: Stress-Testing Biomedical Vision Models via Diffusion Image Editing. In A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, & G. Varol (Eds.), Computer Vision – ECCV 2024 (pp. 358–376). Cham: Springer Nature Switzerland.

Model card contact

Sam Bond-Taylor ([email protected]).