The Art of Saying No: Contextual Noncompliance in Language Models
Abstract
Chat-based language models are designed to be helpful, yet they should not comply with every user request. While most existing work primarily focuses on refusal of "unsafe" queries, we posit that the scope of noncompliance should be broadened. We introduce a comprehensive taxonomy of contextual noncompliance describing when and how models should not comply with user requests. Our taxonomy spans a wide range of categories, including incomplete, unsupported, indeterminate, and humanizing requests (in addition to unsafe requests). To test the noncompliance capabilities of language models, we use this taxonomy to develop a new evaluation suite of 1,000 noncompliance prompts. We find that most existing models show notably high compliance rates in certain previously understudied categories, with models like GPT-4 incorrectly complying with as many as 30% of requests. To address these gaps, we explore different training strategies using a synthetically generated training set of requests and expected noncompliant responses. Our experiments demonstrate that while direct finetuning of instruction-tuned models can lead to both over-refusal and a decline in general capabilities, parameter-efficient methods such as low-rank adapters strike a good balance between appropriate noncompliance and other capabilities.
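For concreteness, below is a minimal sketch of the low-rank-adapter strategy the abstract describes, written with the Hugging Face transformers and peft libraries. The base model, the data file "noncompliance.jsonl", and all hyperparameters are placeholder assumptions for illustration, not the paper's actual setup.

```python
# A minimal sketch of the parameter-efficient approach described in the
# abstract: attach low-rank adapters (LoRA) to an instruction-tuned model
# and finetune only the adapter weights on synthetic noncompliance data.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumption: any instruction-tuned model works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Train only low-rank adapter weights rather than all parameters; this is the
# strategy the abstract credits with balancing appropriate noncompliance
# against over-refusal and loss of general capabilities.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

def tokenize(example):
    # Pair each request with its expected noncompliant response.
    text = example["prompt"] + "\n" + example["response"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)

# "noncompliance.jsonl" is a hypothetical dump of the synthetic training set:
# one {"prompt": ..., "response": ...} record per line.
data = (load_dataset("json", data_files="noncompliance.jsonl")["train"]
        .map(tokenize, remove_columns=["prompt", "response"]))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-noncompliance",
                           per_device_train_batch_size=4,
                           num_train_epochs=2, learning_rate=1e-4, bf16=True),
    train_dataset=data,
    # mlm=False gives standard causal-LM labels (prompt tokens included;
    # a real setup might mask the prompt portion of the loss).
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```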
Community
The following papers, similar to this one, were recommended by the Semantic Scholar API:
- SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors (2024)
- OR-Bench: An Over-Refusal Benchmark for Large Language Models (2024)
- MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries? (2024)
- Self and Cross-Model Distillation for LLMs: Effective Methods for Refusal Pattern Alignment (2024)
- How Far Can In-Context Alignment Go? Exploring the State of In-Context Alignment (2024)