Lightweight Vulnerability Detection from Code Metrics and Token Features
Source: arXiv cs.CR
Published: Thu, 07 May 2026 00:00:00 -0400
Summary
arXiv:2605.04260v1 Announce Type: new
Abstract: Vulnerability detection for C/C++ code increasingly relies on heavy representations such as code graphs and deep models, while many practical workflows still benefit from fast and reproducible ranking baselines for human triage. This preprint studies a lightweight function-level vulnerability triage pipeline that combines sparse token n-grams from raw function text with a small set of inexpensive code metrics, including NLOC, approximate cyclomatic complexity, token count, maximum brace depth, and parameter count. We use TF-IDF token features and a class-weighted logistic regression classifier, avoiding deep learning, transformers, and program graphs.
Using the Devign function-level labels, we evaluate random and cross-project settings, including a FFmpeg-to-QEMU transfer experiment. We emphasize precision-recall AUC and Recall@10% as ranking-oriented metrics for skewed or triage-oriented workloads. On the random split, the best combined variant reaches PR-AUC 0.642 and Recall@10% 0.161, while cross-project generalization is substantially harder, with PR-AUC around 0.436. We further report ablations, test-only identifier-renaming r
Sources
Lyrie Verdict
Lyrie's autonomous defense layer flags this class of exposure the moment it surfaces — no signature update required.