MSTS: A Multimodal Safety Test Suite for Vision-Language Models
Abstract
Vision-language models (VLMs), which process image and text inputs, are increasingly integrated into chat assistants and other consumer AI applications. Without proper safeguards, however, VLMs may give harmful advice (e.g. how to self-harm) or encourage unsafe behaviours (e.g. to consume drugs). Despite these clear hazards, little work so far has evaluated VLM safety and the novel risks created by multimodal inputs. To address this gap, we introduce MSTS, a Multimodal Safety Test Suite for VLMs. MSTS comprises 400 test prompts across 40 fine-grained hazard categories. Each test prompt consists of a text and an image that only in combination reveal their full unsafe meaning. With MSTS, we find clear safety issues in several open VLMs. We also find some VLMs to be safe by accident, meaning that they are safe because they fail to understand even simple test prompts. We translate MSTS into ten languages, showing non-English prompts to increase the rate of unsafe model responses. We also show models to be safer when tested with text only rather than multimodal prompts. Finally, we explore the automation of VLM safety assessments, finding even the best safety classifiers to be lacking.
Community
🚀 Today, we are releasing MSTS, a new Multimodal Safety Test Suite for Vision-Language Models! MSTS is exciting because it tests for safety risks created by multimodality. Each prompt consists of a text and an image that only in combination reveal their full unsafe meaning. Many thanks to my great co-authors @Paul @g8a9 @avparrish @PSaiml @Bertievidgen !
All of MSTS is permissively licensed and available now. Check out the MSTS preprint for more details, or go to GitHub/HuggingFace to access the dataset. Feel free to share and use our work:
paper: https://arxiv.org/abs/2501.10057
code: https://github.com/paul-rottger/msts-multimodal-safety
dataset: https://huggingface.co/datasets/felfri/MSTS
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- RapGuard: Safeguarding Multimodal Large Language Models via Rationale-aware Defensive Prompting (2024)
- PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models (2025)
- TextMatch: Enhancing Image-Text Consistency Through Multimodal Optimization (2024)
- Guiding Medical Vision-Language Models with Explicit Visual Prompts: Framework Design and Comprehensive Exploration of Prompt Variations (2025)
- Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency (2025)
- VLSBench: Unveiling Visual Leakage in Multimodal Safety (2024)
- AEIOU: A Unified Defense Framework against NSFW Prompts in Text-to-Image Models (2024)