NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
Abstract
Vision-language models (VLMs) have made significant progress in recent visual-question-answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However, are these models truly effective? In this work, we show that VLMs still struggle with natural images and questions that humans can easily answer, which we term natural adversarial samples. We also find it surprisingly easy to generate these VQA samples from natural image-text corpora using off-the-shelf models like CLIP and ChatGPT. We propose a semi-automated approach to collect a new benchmark, NaturalBench, for reliably evaluating VLMs with 10,000 human-verified VQA samples. Crucially, we adopt a vision-centric design by pairing each question with two images that yield different answers, preventing blind solutions from answering without using the images. This makes NaturalBench more challenging than previous benchmarks that can be solved with commonsense priors. We evaluate 53 state-of-the-art VLMs on NaturalBench, showing that models like LLaVA-OneVision, Cambrian-1, Llama3.2-Vision, Molmo, Qwen2-VL, and even GPT-4o lag 50%-70% behind human performance (over 90%). We analyze why NaturalBench is hard from two angles: (1) Compositionality: Solving NaturalBench requires diverse visio-linguistic skills, including understanding attribute bindings, object relationships, and advanced reasoning like logic and counting. To this end, unlike prior work that uses a single tag per sample, we tag each NaturalBench sample with 1 to 8 skill tags for fine-grained evaluation. (2) Biases: NaturalBench exposes severe biases in VLMs, as models often choose the same answer regardless of the image. Lastly, we apply our benchmark curation method to diverse data sources, including long captions (over 100 words) and non-English languages like Chinese and Hindi, highlighting its potential for dynamic evaluations of VLMs.
Community
🚀 Make Vision Matter in Visual-Question-Answering (VQA)!
Introducing NaturalBench, a vision-centric VQA benchmark (NeurIPS'24) that challenges vision-language models with pairs of simple questions about natural imagery. 🌍📸
Here’s what we found after testing 53 models (including GPT-4o, Llama3.2, Qwen2-VL, and Molmo):
1️⃣ All models struggle: They perform only 10-20% above random chance, while human accuracy exceeds 90%!
2️⃣ Prior benchmarks reward language bias: Models appear strong on benchmarks like MME and ScienceQA largely by exploiting language bias; even a blind ChatGPT (without vision) can outperform vision models on these benchmarks.
3️⃣ Debiasing is crucial: Most models prefer "Yes" far more than "No" — correcting this bias can nearly double performance, even for GPT-4o.
Paper: https://arxiv.org/abs/2410.14669
Dataset: https://huggingface.co./datasets/BaiqiL/NaturalBench
Website: https://linzhiqiu.github.io/papers/naturalbench/
Work led by teams at CMU and UW: Baiqi Li, Zhiqiu Lin, Wenxuan Peng, Jean de Dieu Nyandwi, Daniel Jiang, Zixian Ma, Simran Khanuja, Ranjay Krishna, Graham Neubig, Deva Ramanan
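For quick experimentation, the benchmark can be pulled directly from the Hugging Face Hub with the datasets library. The snippet below is a minimal sketch; the split and column names are assumptions, so check the dataset card for the actual schema.

```python
from datasets import load_dataset

# Minimal sketch: load NaturalBench from the Hugging Face Hub.
# The split and column names are assumptions here -- check the dataset card
# at https://huggingface.co./datasets/BaiqiL/NaturalBench for the actual schema.
dataset = load_dataset("BaiqiL/NaturalBench")
split = next(iter(dataset.values()))  # take whichever split is available
print(split.column_names)             # inspect the schema before use
print(split[0])                       # one sample (images, questions, answers)
```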
Popular VQA benchmarks like MME, MMMU, MMBench, and ScienceQA are prone to blind solutions. For example, models can exploit language bias to answer questions like “What is the capital of Massachusetts?” (“Boston”) without looking at the image.
To counter this, NaturalBench pairs each question with two images that yield different answers (and each image with two questions), preventing blind models from succeeding.
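Because of this 2×2 pairing, a natural way to score a model is over the whole group of four (question, image) combinations, so a blind model that ignores the image cannot pass. The sketch below illustrates such paired scoring; the metric details are illustrative, and the paper defines the official question/image/group accuracy metrics.

```python
# Sketch of paired ("vision-centric") scoring for one NaturalBench group:
# two questions x two images, where each question's answer flips across the
# two images. Metric details here are illustrative; see the paper for the
# official accuracy definitions.

def group_correct(pred, gold):
    """pred/gold: dicts mapping (question_idx, image_idx) -> answer string."""
    keys = [(q, i) for q in (0, 1) for i in (0, 1)]
    return all(pred[k].strip().lower() == gold[k].strip().lower() for k in keys)

# A blind model that always answers "yes" fails the group, because each
# question requires "yes" for one image and "no" for the other.
gold = {(0, 0): "yes", (0, 1): "no", (1, 0): "no", (1, 1): "yes"}
blind_pred = {k: "yes" for k in gold}
print(group_correct(blind_pred, gold))  # False
```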
NaturalBench is collected with a simple pipeline over natural image-text datasets like Flickr30K: (1) identify image-text pairs that CLIP fails to match, then (2) prompt ChatGPT to generate questions whose answers differ between the two images.
Since NaturalBench avoids perturbing images or questions, it creates natural adversarial samples—questions about natural images that are easy for humans but challenge models.
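A hedged sketch of this two-stage pipeline is shown below: stage (1) uses a CLIP checkpoint (via the transformers library) to flag image-text groups it mismatches, and stage (2) hands those groups to an LLM such as ChatGPT with a prompt asking for questions whose answers differ between the two images. The checkpoint choice, mismatch rule, and prompt wording are illustrative assumptions, not the authors' exact implementation.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Stage 1 (sketch): flag image-caption groups that CLIP fails to match.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_mismatches(images, captions):
    """Return True if CLIP matches at least one image to the wrong caption."""
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # [num_images, num_captions]
    # Correct matching: image i should score highest on caption i.
    return bool((logits.argmax(dim=1) != torch.arange(len(images))).any())

# Stage 2 (sketch): for each mismatched group, prompt an LLM (e.g., ChatGPT)
# for questions whose answers differ between the two images, then have
# humans verify the generated VQA samples. The prompt below is illustrative.
QUESTION_PROMPT = (
    "Given two image captions:\n1) {cap0}\n2) {cap1}\n"
    "Write a yes/no question whose answer is 'yes' for image 1 "
    "and 'no' for image 2."
)
```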
While previous VQA benchmarks can be solved by fine-tuning a blind GPT-3.5, NaturalBench cannot be!
Most open-source models score only 10–20% above chance, and even GPT-4o (vision-finetuned) falls ~50% behind humans.
Vision-language models show strong answer biases, often favoring “Yes” over “No” regardless of the input image/question. Correcting these biases can boost top models' performance by 2-3x, making NaturalBench a valuable testbed for future debiasing efforts.
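One simple way to picture such a correction: instead of taking the raw argmax over “Yes”/“No” scores, subtract a per-model offset estimated on held-out data so the two answers are predicted at roughly equal rates. The sketch below is a generic illustration of score debiasing, not the paper's exact procedure.

```python
import numpy as np

# Generic illustration of answer debiasing: shift the Yes-vs-No decision
# threshold so predictions are not dominated by one answer. This is a
# sketch, not the paper's exact debiasing procedure.

def fit_bias(yes_scores, no_scores):
    """Pick an offset so the model answers 'Yes' about half the time."""
    return float(np.median(yes_scores - no_scores))

def debiased_predict(yes_scores, no_scores, bias):
    return np.where(yes_scores - bias > no_scores, "Yes", "No")

# Toy scores for a model that always favors "Yes" before debiasing.
yes_s = np.array([2.0, 1.5, 1.8, 2.2])
no_s = np.array([1.0, 1.4, 1.6, 1.9])
bias = fit_bias(yes_s, no_s)
print(debiased_predict(yes_s, no_s, bias))  # roughly balanced Yes/No
```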
NaturalBench offers 1-8 skill tags per question for a fine-grained evaluation of compositional reasoning across dimensions like object, attribute, relationship, reasoning, and more.
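With multiple tags per sample, per-skill accuracy can be aggregated by letting each sample count toward every tag it carries. The tag names in the sketch below are examples, not the benchmark's exact taxonomy.

```python
from collections import defaultdict

# Sketch of per-skill-tag scoring: each sample contributes to every tag it
# is labeled with. Tag names below are illustrative examples only.

def per_tag_accuracy(samples):
    """samples: iterable of dicts with 'tags' (list[str]) and 'correct' (bool)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for s in samples:
        for tag in s["tags"]:
            totals[tag] += 1
            hits[tag] += int(s["correct"])
    return {tag: hits[tag] / totals[tag] for tag in totals}

samples = [
    {"tags": ["attribute", "counting"], "correct": True},
    {"tags": ["counting"], "correct": False},
    {"tags": ["logic", "attribute"], "correct": True},
]
print(per_tag_accuracy(samples))  # {'attribute': 1.0, 'counting': 0.5, 'logic': 1.0}
```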
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- TVBench: Redesigning Video-Language Evaluation (2024)
- MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models (2024)
- Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs (2024)
- VHELM: A Holistic Evaluation of Vision Language Models (2024)
- Trust but Verify: Programmatic VLM Evaluation in the Wild (2024)