Lyrie
Research
3 sources verified·1 min read
By Lyrie Threat Intelligence·6/2/2026

DataShield: Safety-degrading Data Filtering for LLM Benign Instruction Fine-Tuning

Source: arXiv cs.CR

Published: Tue, 02 Jun 2026 00:00:00 -0400

Summary

arXiv:2606.00160v1 Announce Type: new

Abstract: Large language models (LLMs) suffer from degraded safety capabilities even when fine-tuned with benign datasets. However, existing methods for identifying safety-degrading samples in benign datasets suffer from high computational costs and significant noise issues. In this paper, we propose DataShield to efficiently and effectively identify potential safety-degrading samples. Our key intuition is based on the observation that benign fine-tuning increases the overall response compliance of LLMs. DataShield's key technical insight is to quantify each sample's contribution to the model's compliance behavior as its safety degradation score. DataShield consists of three core components: (1) Compliance Vector Extraction, which captures the LLM's compliance behavior tendency; (2) a novel Compliance-Aware Score (CAS), which automatically identifies the optimal safety-critical layer; and (3) Safety-degrading Sample Filtering, which quantifies the projection shift of training data along the compliance direction. Extensive experimental evaluation on Llama3-8B, Llama3.1-8B, and Qwen2.5-7B using the Alpaca and Dolly benign datasets validates our

Sources

Lyrie Verdict

Lyrie's autonomous defense layer flags this class of exposure the moment it surfaces — no signature update required.