1,726 Tests, Zero Failures: The Engineering Behind Lyrie's Production Platform
Author: Lyrie Threat Intelligence Team
Date: 2026-05-13
Reading time: 10 min
TL;DR
Lyrie's production platform — the ATP runtime, the validation engine, the threat-intel pipeline, and the customer integrations — has, as of this article's publication date, 1,726 tests passing with zero failures in our continuous integration pipeline. The platform's measured 30-day rolling uptime is 99.99% across the customer-fleet decision clusters.
The engineering choices that produced these numbers are not magic and they are not luck. They are the consequence of two specific architectural decisions made early and committed to:
1. The platform runs on OpenClaw-based autonomous operations. We did not build a conventional microservices platform with humans operating it. We built an autonomous-operations platform where most operational decisions are made by agents under bounded authority, with humans escalated to only for novel situations. This is the same operating model we sell to customers (article 11), applied to our own platform as our own dogfood.
2. We trained our own threat-domain LLM rather than fine-tuning a general-purpose model. The intent-verification and threat-classification models that ship in the validation engine are trained from scratch on threat-domain corpora, not fine-tuned on top of GPT-4 or Claude or Gemini. This choice imposes a substantial training-infrastructure cost and produces a model that is much smaller, much faster, much cheaper to operate, and meaningfully better at the narrow domain we care about.
This article walks through the architecture, names the specific engineering tradeoffs, presents the performance metrics with their caveats, and is candid about what we got wrong.
The Architecture, Briefly
The platform has six major components:
1. Customer-side runtime shim. The hook installed on customer hosts and services that validates actions in-line (article 14). Footprint: under 40MB resident, 1.5ms p99 added latency.
2. Customer-region decision cluster. Horizontally scaled cluster running the validation models and policy engines. Most customers run their own cluster; some smaller customers share Lyrie-managed clusters in dedicated regions.
3. Threat-intel ingestion pipeline. Continuous ingestion from public sources (CVE feeds, mailing lists, security news, OSINT), partner sources (ISACs, threat-sharing programs), and internal sources (customer-anonymized telemetry where contracted).
4. Reasoning-and-validation cluster. The heaviest component. Runs the static-analysis and exploit-validation models against ingested vulnerabilities, produces structured threat-intel artifacts, and feeds them downstream to the customer-region decision clusters.
5. Audit substrate. Independent of customer clusters and operator authority. Stores tamper-evident records of every decision the platform makes for any customer.
6. Public-feed publisher. ATP-signed threat-intel and IoC publication to subscribers, including our public research feed and the partner-ISAC feeds.
The components communicate exclusively over ATP-framed messages (articles 13-14). The audit substrate is the only component with cryptographic authority to validate the integrity chain across all the others.
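For illustration only, here is a minimal Python sketch of what a framed, integrity-checked message could look like. The field names and the HMAC scheme are assumptions made for the example, not the published ATP wire format (see articles 13-14).

```python
import hashlib
import hmac
import json
from dataclasses import dataclass, asdict

# Illustrative only: field names and the MAC scheme are assumptions,
# not the published ATP wire format (articles 13-14).
@dataclass
class AtpFrame:
    sender: str        # component identity, e.g. "threat-intel-pipeline"
    capability: str    # capability scope the message is issued under
    payload: dict      # decision, verdict, or threat-intel artifact

    def mac(self, key: bytes) -> str:
        # Canonical serialization so sender and receiver hash the same bytes.
        body = json.dumps(asdict(self), sort_keys=True).encode()
        return hmac.new(key, body, hashlib.sha256).hexdigest()

def verify(frame: AtpFrame, claimed_mac: str, key: bytes) -> bool:
    """Receiver-side check: drop any frame whose MAC does not match."""
    return hmac.compare_digest(frame.mac(key), claimed_mac)

key = b"shared-secret-for-this-example-only"
frame = AtpFrame("threat-intel-pipeline", "publish:ioc", {"cve": "CVE-2026-0001"})
assert verify(frame, frame.mac(key), key)
```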
OpenClaw-Based Autonomous Operations
The operational layer is where most security companies have invisible problems. SRE teams burn out. On-call rotations get heavier as the platform grows. Configuration drift accumulates. We built around this by adopting an autonomous-operations model from day one.
What that looks like in practice:
Cluster scaling, certificate rotation, secret rotation, dependency-update deployment, and incident response are handled by bounded-authority agents. The agents have explicit capability scopes (the same ATP capability scopes we sell to customers). A cluster-scaling agent can spin up additional decision nodes in declared regions up to declared limits; outside the limits, it escalates. A certificate-rotation agent can rotate certificates in declared infrastructure within a declared schedule; novel rotations require human review.
Humans are escalated to for novel situations the agents have not seen before, for situations where the agent's confidence is below threshold, and for policy decisions (when do we accept this CVE for auto-patching, when do we deprecate that customer integration). Routine operations run without human involvement.
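The decision structure of a bounded-authority agent is easy to sketch. The scope fields and limits below are invented for illustration and are not our production schema; the point is that anything outside the declared scope resolves to escalation rather than action.

```python
from dataclasses import dataclass

# Hypothetical capability scope for a cluster-scaling agent.
# Field names and limits are illustrative, not the production schema.
@dataclass
class ScalingScope:
    allowed_regions: set[str]
    max_nodes_per_region: int
    min_confidence: float

def decide_scale_up(scope: ScalingScope, region: str,
                    current_nodes: int, requested: int,
                    confidence: float) -> str:
    """Return 'execute' when the request is inside the declared scope,
    otherwise 'escalate' so a human reviews the novel situation."""
    if region not in scope.allowed_regions:
        return "escalate"                      # undeclared region
    if current_nodes + requested > scope.max_nodes_per_region:
        return "escalate"                      # over the declared limit
    if confidence < scope.min_confidence:
        return "escalate"                      # agent is unsure
    return "execute"

scope = ScalingScope({"eu-west-1", "us-east-2"},
                     max_nodes_per_region=12, min_confidence=0.9)
print(decide_scale_up(scope, "eu-west-1", current_nodes=8, requested=2,
                      confidence=0.97))        # -> execute
```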
Every agent decision is logged to the audit substrate in the same way customer decisions are. Our own audit log is inspectable by customers under their contract terms; this is a deliberate transparency commitment and a moderate operational cost (we cannot quietly fix things; we have to do them properly because the record persists).
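The tamper-evidence property can be pictured as a hash chain over decision records: each record's digest also covers the previous record's digest, so a retroactive edit anywhere breaks every digest after it. This is a simplified sketch with assumed field names, not the audit substrate's actual storage format.

```python
import hashlib
import json

def append_record(log: list[dict], decision: dict) -> None:
    """Append a decision record whose digest covers both the decision
    and the previous record's digest, making retroactive edits detectable."""
    prev = log[-1]["digest"] if log else ""
    body = json.dumps({"decision": decision, "prev": prev}, sort_keys=True)
    log.append({"decision": decision, "prev": prev,
                "digest": hashlib.sha256(body.encode()).hexdigest()})

def chain_intact(log: list[dict]) -> bool:
    """Recompute every digest; any mismatch means the log was altered."""
    prev = ""
    for rec in log:
        body = json.dumps({"decision": rec["decision"], "prev": prev},
                          sort_keys=True)
        if rec["prev"] != prev or \
           rec["digest"] != hashlib.sha256(body.encode()).hexdigest():
            return False
        prev = rec["digest"]
    return True

log: list[dict] = []
append_record(log, {"agent": "cert-rotation", "action": "rotate"})
append_record(log, {"agent": "scaling", "action": "scale_up", "region": "eu-west-1"})
assert chain_intact(log)
```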
The operating model has held at 11 engineers handling 99.99% uptime across 14 customer-region clusters plus the central reasoning cluster. With the Seed A scaling us to ~25 engineers, we expect to handle 60+ customer-region clusters by adding capacity to the autonomous layer rather than by growing operations headcount in proportion.
Why We Trained Our Own Model
The LLM choice is the most-debated engineering decision we have made. The common path — fine-tune GPT-4 or Claude or Gemini on threat-domain data — is faster, cheaper to build, and produces a competent product. We chose differently.
The reasons:
1. Latency. Fine-tuned API-served models have a floor on per-call latency that is set by the provider's serving infrastructure, typically 200-800ms median plus tail. Our validation engine's latency budget per action is under 100ms for the inference call. Hosting our own model on dedicated infrastructure lets us hit ~25-40ms median inference latency at the cost of running our own GPU fleet.
2. Cost. At our query volume (millions of validation calls per day across the fleet), API-served inference costs would dominate our COGS. Self-hosted inference on a model sized appropriately for the threat-domain task (~8B parameters, not 200B+) costs roughly 1/40th of equivalent API calls; the sketch after this list shows the shape of that arithmetic.
3. Reliability. Our customers contractually require us to maintain detection capability during cloud-provider outages. API-served-model dependency creates a single point of failure outside our control. Self-hosted models are operable across multi-region failover under our own management.
4. Specificity. A model trained from scratch on threat-domain data, with the right corpus, outperforms a fine-tuned general model on threat-domain tasks. We measured this carefully. On our internal eval set for intent-vs-action classification, the from-scratch threat-domain model produces 94% accuracy at 8B parameters; the best fine-tuned general model we evaluated reached 89% at 200B+ parameters. The smaller, specialized model is better.
5. Compliance. Several of our customer segments — sovereign government, regulated financial services — have requirements that effectively preclude routing security-critical inference through a third-party API. Self-hosting was a hard requirement for ~30% of our customer pipeline before it was an engineering preference.
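The cost comparison in point 2 reduces to simple arithmetic once you fix a volume and a set of prices. The sketch below shows only the shape of that calculation; every input is deliberately left as a caller-supplied assumption rather than a quoted figure, and our own roughly-1/40th result comes from our fleet's actual volume and GPU costs, which are not reproduced here.

```python
def monthly_inference_cost(calls_per_day: float,
                           api_cost_per_call: float,
                           gpu_fleet_cost_per_month: float) -> dict:
    """Compare API-served inference against a self-hosted GPU fleet.
    Every input is an assumption supplied by the caller, not a quote."""
    api_monthly = calls_per_day * 30 * api_cost_per_call
    return {
        "api_served_per_month": api_monthly,
        "self_hosted_per_month": gpu_fleet_cost_per_month,
        "api_to_self_hosted_ratio": api_monthly / gpu_fleet_cost_per_month,
    }
```

Plug in your own per-call price and fleet cost; at high enough daily volume the ratio, not either absolute number, is what decides the build-versus-buy question.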
The costs we accepted:
- A six-month training-infrastructure buildout before the model could be trained at all.
- Ongoing GPU-fleet operating cost. Significant but manageable at our scale.
- The need for a specialized ML team. We have three full-time researchers; with the Seed A this grows to five.
- The honest acknowledgment that our model is only competent on threat-domain tasks. It cannot help with anything outside that scope, by design.
The model itself is named internally; we have not published the name or the weights. The reason is operational: the model is the product. Publishing it would be like a SaaS company publishing its production database — possible but commercially incoherent. We do publish detailed eval methodology (article 14) so that the performance claims are reproducible by anyone willing to do the comparable training run.
The Specific Performance Numbers
The headline numbers and what they actually mean:
2-second zero-day detection. This is the median wall-clock time from upstream-source publication (CVE feed, mailing list, commit) to validated-actionable threat-intel in our pipeline. The 87 seconds in article 1 was the validator-only stage; the 2-second number is detection-only. Both are real; they measure different things.
This number applies to CVE-class threats from monitored upstream sources. It does not apply to threats we discover internally (those are 0-second by definition) or to threats arriving via partner channels (those depend on partner latency).
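Concretely, the metric is the median of per-threat deltas between upstream publication time and the time the validated artifact lands in the pipeline, restricted to monitored upstream sources per the caveat above. A minimal sketch of that measurement, with field names that are illustrative rather than the pipeline's actual schema:

```python
from datetime import datetime
from statistics import median

def detection_latency_seconds(events: list[dict]) -> float:
    """Median wall-clock delta from upstream publication to validated
    threat-intel artifact, for monitored upstream sources only.
    Timestamps are ISO 8601; field names are illustrative."""
    deltas = [
        (datetime.fromisoformat(e["validated_at"]) -
         datetime.fromisoformat(e["published_at"])).total_seconds()
        for e in events
        if e["source"] == "monitored_upstream"
    ]
    return median(deltas)

events = [
    {"source": "monitored_upstream",
     "published_at": "2026-05-01T12:00:00+00:00",
     "validated_at": "2026-05-01T12:00:02+00:00"},
    {"source": "monitored_upstream",
     "published_at": "2026-05-01T13:30:00+00:00",
     "validated_at": "2026-05-01T13:30:01+00:00"},
]
print(detection_latency_seconds(events))  # -> 1.5
```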
94% novel-exploit accuracy. The ATP-EV-1 validator's precision on confirmed-exploitable bugs that were not in its training set. The 94% is the harmonic mean of 93.8% (confirmed-exploitable bugs flagged correctly) and 98.8% (confirmed-not-exploitable bugs cleared correctly), rounded down. The methodology is in article 3.
99.99% uptime. Rolling 30-day measurement across the customer-region decision clusters, weighted by customer-validated-action volume. The metric does not include the central reasoning cluster (which has lower uptime requirements because customer-side validation falls back to cached verdicts when the central cluster is unreachable) and does not include third-party dependencies (which are documented separately).
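The fallback behavior mentioned above is why the central reasoning cluster is excluded from the figure: the customer-side path keeps validating against cached verdicts when the central cluster is unreachable. A simplified sketch of that fallback, where the cache policy, key scheme, and miss behavior are assumptions rather than the shim's real implementation:

```python
import time

class VerdictCache:
    """Customer-side fallback cache. TTL and key scheme are illustrative
    assumptions, not the shim's real implementation."""
    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[str, float]] = {}

    def put(self, action_sig: str, verdict: str) -> None:
        self._store[action_sig] = (verdict, time.monotonic())

    def get(self, action_sig: str) -> str | None:
        entry = self._store.get(action_sig)
        if entry is None:
            return None
        verdict, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            return None                      # too stale to trust
        return verdict

def validate(action_sig: str, central, cache: VerdictCache) -> str:
    try:
        verdict = central(action_sig)        # normal path: ask the cluster
    except ConnectionError:
        cached = cache.get(action_sig)       # degraded path: cached verdict
        # Miss behavior is policy-dependent; denying is one illustrative choice.
        return cached if cached is not None else "deny"
    cache.put(action_sig, verdict)
    return verdict
```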
1,726 passing tests, zero failures. As of article publication. The count grows over time as we add tests; we are committed to a zero-failure CI state and to not silencing failures to maintain the green count.
What We Got Wrong
The candid section.
1. We under-invested in observability for the first nine months. Building runtime detection without comprehensive internal observability was a mistake. We caught several incidents in customer telemetry that we should have caught in our own telemetry first. We have since substantially expanded internal observability; this is a completed fix, not a lesson learned but not yet applied.
2. We over-invested in the Python implementation in the early days. The Python ATP implementation was the easiest to prototype with, and we made it the early reference implementation. When production performance requirements landed, we had to materially rewrite components in Rust and TypeScript. Today the four implementations are at parity; in 2025 the Python implementation had a 5x throughput disadvantage that constrained early customer deployments.
3. The first version of the customer-side shim had a memory leak. Caught in pre-production testing, never shipped to a customer. The fix took two weeks because the leak was in a third-party library we then had to monkey-patch. We have since rewritten the affected component to not depend on the library at all. The episode is on this list because it nearly broke our launch schedule and reminds us to be skeptical of third-party dependencies on critical paths.
What's Next
- Q3 2026: Open-source release of the OpenClaw operational tooling we use internally, under MIT, for other companies that want to adopt the autonomous-operations model.
- Q3 2026: Public methodology document for our model training (corpus construction, eval methodology, fine-tuning vs. from-scratch trade-offs).
- Q4 2026: Engineering capacity at ~25, geographic R&D distribution complete.
Reach the team: [email protected]; CEO: [email protected].
_Published by Lyrie.ai · lyrie.ai/research · Guy Sheetrit, CEO_