Trading Inference-Time Compute for Adversarial Robustness
Abstract
We conduct experiments on the impact of increasing inference-time compute in reasoning models (specifically OpenAI o1-preview and o1-mini) on their robustness to adversarial attacks. We find that across a variety of attacks, increased inference-time compute leads to improved robustness. In many cases (with important exceptions), the fraction of model samples where the attack succeeds tends to zero as the amount of test-time compute grows. We perform no adversarial training for the tasks we study, and we increase inference-time compute by simply allowing the models to spend more compute on reasoning, independently of the form of attack. Our results suggest that inference-time compute has the potential to improve adversarial robustness for Large Language Models. We also explore new attacks directed at reasoning models, as well as settings where inference-time compute does not improve reliability, and speculate on the reasons for these as well as ways to address them.
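To make the measurement the abstract describes concrete, below is a minimal sketch of an evaluation loop of this kind: sample a model on adversarial prompts at several inference-time compute budgets and track the fraction of samples where the attack succeeds. Everything in it is an assumption for illustration: `query_model`, its `reasoning_tokens` knob, the `is_attack_success` judge, and the token budgets are hypothetical placeholders, not the paper's actual harness or any real API, and the stub's decay curve is made up purely to mimic the reported trend.

```python
import random
import statistics

# Hypothetical stand-in for a reasoning-model call with an explicit
# inference-time compute budget (here, a cap on reasoning tokens).
# Neither this function nor the `reasoning_tokens` knob is a real API;
# the made-up decay below only mimics the paper's qualitative finding
# that more reasoning makes a fixed attack less likely to succeed.
def query_model(prompt: str, reasoning_tokens: int) -> str:
    p_success = 0.5 / (1 + reasoning_tokens / 512)  # illustrative, not measured
    return "ATTACK_SUCCEEDED" if random.random() < p_success else "refused"

def is_attack_success(prompt: str, answer: str) -> bool:
    # Placeholder judge; in practice this would be a task-specific grader.
    return answer == "ATTACK_SUCCEEDED"

def attack_success_rate(prompts, budget: int, samples_per_prompt: int = 16) -> float:
    """Fraction of sampled completions on which the attack succeeds
    at a fixed inference-time compute budget."""
    outcomes = [
        is_attack_success(p, query_model(p, budget))
        for p in prompts
        for _ in range(samples_per_prompt)
    ]
    return statistics.mean(outcomes)

adversarial_prompts = ["<adversarial prompt 1>", "<adversarial prompt 2>"]
for budget in (256, 1024, 4096, 16384):
    rate = attack_success_rate(adversarial_prompts, budget)
    print(f"reasoning tokens={budget:>6}  attack success={rate:.3f}")
```

Note that the interesting quantity is the trend of the success fraction as the budget grows, not its absolute value at any single budget; the paper reports that for many attacks this fraction tends to zero, with important exceptions.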