Lyrie
Research
3 sources verified·1 min read
By Lyrie Threat Intelligence·5/13/2026

Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs

Source: arXiv cs.CR

Published: Wed, 13 May 2026 00:00:00 -0400

Summary

arXiv:2605.10998v1 Announce Type: new

Abstract: Fine-tuning APIs make frontier LLMs easy to customize, but they can also weaken safety alignment during fine-tuning. While prior work shows that benign supervised fine-tuning (SFT) can reduce refusal behavior, deployed fine-tuning pipelines increasingly support preference-based objectives, whose safety risks remain less understood. We show that Direct Preference Optimization (DPO) introduces a stronger and harder-to-audit failure mode. We propose a truly benign DPO attack using only 10 harmless preference pairs, the minimum data scale accepted by OpenAI's fine-tuning service. Each pair contains a benign prompt, a normal helpful answer as the preferred response, and a refusal as the dispreferred response. Unlike prior benign fine-tuning attacks, our data exhibits no suspicious behavior: it is practically indistinguishable from the fine-tuning request of a legitimate user seeking to reduce over-refusal, making harmful intent almost impossible to infer from the request alone. Nevertheless, because DPO directly optimizes the model to prefer helpful answers over refusals, this seemingly benign objective broadly suppresses refusal behavior

Sources

Lyrie Verdict

Lyrie's autonomous defense layer flags this class of exposure the moment it surfaces — no signature update required.