Pay Less Attention to Function Words for Free Robustness of Vision-Language Models
Abstract
Function-word De-Attention (FDA) mitigates adversarial attacks on robust VLMs by differentially subtracting function-word cross-attention, improving robustness with minimal performance trade-offs.
To address the trade-off between robustness and performance in robust VLMs, we observe that function words can make VLMs vulnerable to cross-modal adversarial attacks, and accordingly propose Function-word De-Attention (FDA) to mitigate their impact. Similar to differential amplifiers, FDA computes both the original and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former to obtain better-aligned and more robust VLMs. Comprehensive experiments cover 2 SOTA baselines under 6 different attacks on 2 downstream tasks, 3 datasets, and 3 models. Overall, FDA yields an average 18/13/53% ASR drop with only 0.2/0.3/0.6% performance drops on the 3 tested models on retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We experimentally demonstrate the scalability, generalization, and zero-shot performance of FDA, and provide in-depth ablation studies and analysis. Code will be made publicly available at https://github.com/michaeltian108/FDA.
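To make the mechanism concrete, below is a minimal sketch of the differential subtraction described in the abstract, assuming a standard scaled dot-product cross-attention where image queries attend to text tokens. All names (`fda_cross_attention`, `func_mask`, `alpha`) are illustrative assumptions, not identifiers from the paper or its released code.

```python
import torch

def fda_cross_attention(q, k, v, func_mask, alpha=1.0):
    """Hypothetical sketch of Function-word De-Attention (FDA).

    q:         (batch, heads, n_img, d)  image-side queries
    k, v:      (batch, heads, n_txt, d)  text-side keys / values
    func_mask: (batch, n_txt) bool, True where the token is a function word
    alpha:     assumed subtraction weight (not specified in the abstract)
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d**0.5            # (b, h, n_img, n_txt)
    attn = scores.softmax(dim=-1)                         # original cross-attention

    # Cross-attention restricted to function-word tokens only.
    func_scores = scores.masked_fill(~func_mask[:, None, None, :], float("-inf"))
    func_attn = torch.nan_to_num(func_scores.softmax(dim=-1))  # rows with no function words -> 0

    # Differential subtraction: de-emphasize attention mass on function words.
    attn = (attn - alpha * func_attn).clamp(min=0.0)
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-6)  # renormalize
    return attn @ v
```

The second softmax isolates the attention mass placed on function words, which is then subtracted from the full attention map and renormalized, mirroring the differential-amplifier analogy in the abstract; how the subtraction is weighted and applied per head is an implementation detail the paper itself defines.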
Community
We had an interesting yet largely unexplored observation: lowering the attention that VLMs pay to function words increases robustness and zero-shot performance across several datasets, models, and tasks, causing little or no performance drop and surpassing SOTA adversarial training methods including TeCoA and FARE.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- FPT-Noise: Dynamic Scene-Aware Counterattack for Test-Time Adversarial Defense in Vision-Language Models (2025)
- Enhancing CLIP Robustness via Cross-Modality Alignment (2025)
- Self-Calibrated Consistency can Fight Back for Adversarial Robustness in Vision-Language Models (2025)
- CoDefend: Cross-Modal Collaborative Defense via Diffusion Purification and Prompt Optimization (2025)
- Assimilation Matters: Model-level Backdoor Detection in Vision-Language Pretrained Models (2025)
- Backdoor Attacks on Open Vocabulary Object Detectors via Multi-Modal Prompt Tuning (2025)
- Q-MLLM: Vector Quantization for Robust Multimodal Large Language Model Security (2025)