The Vulnerability-Finding Moat Isn't Model Size — It's Orchestration

4 min read 1 source clear_take
├── "Vulnerability pattern matching is a 'flat' capability — smaller models perform comparably to frontier models at finding known security flaws"
│  └── Aisle (aisle.com blog) → read

Aisle's systematic comparison ran models from 7B-70B parameters against the same codebases Mythos was tested on and found smaller models performed within a narrow band of the frontier model's results on known vulnerability patterns (buffer overflows, injection points, auth bypasses). They argue this follows Ethan Mollick's 'jagged frontier' concept: once a model can parse code structure and match against known weakness patterns, additional parameters yield diminishing returns.

├── "Frontier models still hold an edge in explanation quality and novel exploit chain construction"
│  └── Aisle (aisle.com blog) → read

Even as it shows small models matching frontier models on detection, Aisle's own analysis locates the remaining gap in explanation quality and novel chain construction. Larger models produced better reasoning about why a vulnerability exists and were superior at chaining multiple weaknesses into multi-step exploit paths, tasks that move beyond pattern matching into deeper reasoning.

└── "Frontier model access is not a defensible moat for AI security startups"
  ├── Aisle (aisle.com blog) → read

Aisle argues that the wave of enterprise security startups positioning frontier model access as a competitive advantage after the Mythos demonstration may be building on a false premise. If significantly cheaper, smaller models reproduce the same vulnerability findings, then exclusive access to the largest models does not constitute a meaningful competitive moat for AI-powered security tooling.

  └── @dominicq (Hacker News, 989 pts)

By submitting the Aisle post with the title 'Small models also found the vulnerabilities that Mythos found,' dominicq highlights the cost-efficiency angle as the key takeaway, framing the finding as a challenge to the assumption that bigger models are necessary for serious security work. The post garnered nearly 1,000 upvotes, suggesting broad community agreement with this framing.

What happened

Aisle published a detailed breakdown of AI-assisted vulnerability discovery in the wake of the Mythos findings — a high-profile demonstration earlier this year where a large frontier model identified real, exploitable security flaws in production software. The blog post, titled "AI Cybersecurity After Mythos: The Jagged Frontier," landed on Hacker News with nearly 1,000 upvotes, and the core claim caught the security community off guard: smaller, significantly cheaper models reproduced the same classes of vulnerabilities that Mythos found, often with comparable or identical results.

The post walks through a systematic comparison. Aisle ran multiple model tiers — including open-weight models in the 7B-70B parameter range and mid-tier API models — against the same codebases and attack surfaces that Mythos had been tested on. The findings weren't ambiguous. On the specific task of identifying known vulnerability patterns (buffer overflows, injection points, auth bypasses, logic flaws in access control), smaller models performed within a narrow band of the frontier model's results. The gap wasn't in detection — it was in explanation quality and novel chain construction.

Why it matters

The term "jagged frontier" comes from Ethan Mollick's research on AI-augmented knowledge work, where he found that AI capabilities don't scale smoothly — they're excellent at some tasks and mediocre at adjacent ones, regardless of model size. Aisle's analysis argues that vulnerability pattern matching sits squarely on the flat part of the capability curve: once a model is good enough to parse code structure and match against known weakness patterns, adding more parameters yields diminishing returns.

This has immediate implications for how the industry prices and markets AI security tools. The Mythos demonstration spawned a wave of enterprise security startups positioning frontier model access as a competitive moat. If Aisle's findings hold — and the HN discussion surfaced several independent practitioners corroborating the results with their own tooling — that moat is largely illusory for the bread-and-butter work of vulnerability scanning.

The community reaction on Hacker News split into two camps. Practitioners who had built their own AI-assisted fuzzing and audit pipelines largely agreed: they'd seen similar results with smaller models, and the real differentiator was always the scaffolding around the model — the prompt chains, the code parsing pipeline, the feedback loops that re-query with context from previous findings. A second camp pushed back, arguing that Mythos's real value was in discovering *novel* vulnerability chains that smaller models miss — zero-day-class findings that require deeper reasoning about system interactions rather than pattern matching against known CVE templates.

Both camps are probably right, and the distinction matters enormously for buying decisions. If you're running a security team and your primary concern is catching known vulnerability classes before they ship — the vast majority of real-world security work — a well-orchestrated pipeline built on a mid-tier model will likely get you 90% of the way there at 10% of the cost. If you're a dedicated security research team hunting for novel zero-days in complex system interactions, the frontier model's deeper reasoning chains may still justify the price premium.

The "orchestration is the moat" argument deserves unpacking. What Aisle describes — and what multiple HN commenters confirmed from their own setups — is that the surrounding infrastructure does most of the heavy lifting. The model needs to understand code well enough to identify suspicious patterns. But the system that feeds it the right code chunks, maintains context across a large codebase, re-queries with refined hypotheses, cross-references against CVE databases, and validates findings against actual exploit paths — that system is where the real engineering complexity lives. A mediocre model in excellent scaffolding outperforms an excellent model with naive prompting, and it's not close.
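To make that concrete, here is a minimal sketch of such a loop. Everything in it is illustrative: the regex "model" is a stand-in for the actual LLM call, and the chunking and validation steps are drastically simplified versions of what Aisle and the HN commenters describe.

```python
"""Sketch of the orchestration layer the post describes: chunk code,
query a model, carry earlier findings forward as context, validate.
All names here are illustrative assumptions, not a real tool's API."""

import re
from dataclasses import dataclass

@dataclass
class Finding:
    file: str
    line: int
    pattern: str
    validated: bool = False

# Stand-in for the model: known-weakness patterns, the "flat" capability.
KNOWN_PATTERNS = {
    "buffer-overflow": re.compile(r"\bstrcpy\s*\("),
    "sql-injection": re.compile(r"execute\(.*%s.*%"),
}

def chunk_file(path: str, text: str, max_lines: int = 40):
    """Split a file into model-sized chunks. Real pipelines split on
    function and class boundaries, not raw line counts."""
    lines = text.splitlines()
    for start in range(0, len(lines), max_lines):
        yield start + 1, "\n".join(lines[start:start + max_lines])

def query_model(chunk: str, context: list[str]):
    """Pretend model call: match chunk lines against known patterns.
    In a real system, `context` (earlier findings) goes into the prompt
    so later passes can refine hypotheses."""
    for offset, line in enumerate(chunk.splitlines()):
        for name, rx in KNOWN_PATTERNS.items():
            if rx.search(line):
                yield offset, name

def validate(finding: Finding, source_line: str) -> bool:
    """Cheap false-positive filter; real systems re-query the model or
    attempt an actual exploit path."""
    return not source_line.lstrip().startswith(("//", "#"))

def scan(repo: dict[str, str]) -> list[Finding]:
    findings: list[Finding] = []
    context: list[str] = []  # summaries of earlier findings, fed forward
    for path, text in repo.items():
        for base, chunk in chunk_file(path, text):
            for offset, name in query_model(chunk, context):
                f = Finding(path, base + offset, name)
                if validate(f, chunk.splitlines()[offset]):
                    f.validated = True
                    findings.append(f)
                    context.append(f"{f.file}:{f.line} {f.pattern}")
    return findings
```

The point of the sketch is where the lines of code live: almost all of them are chunking, context management, and validation, and the `query_model` call in the middle is the only place a bigger or smaller model would make any difference.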

What this means for your stack

If you're evaluating AI-assisted security tooling — whether buying a product or building internal pipelines — the procurement calculus just changed. Stop asking vendors which foundation model they use. Start asking about their orchestration architecture: how they chunk and contextualize code, how they handle multi-file analysis, how they validate findings to reduce false positives, and how they feed results back into subsequent analysis passes.

For teams building their own tooling, this is genuinely good news. You don't need a six-figure API budget to run continuous AI-assisted security audits. A well-designed pipeline using Llama 3, Mistral, or even a fine-tuned smaller model behind a solid orchestration layer can cover the same ground as premium API access for routine vulnerability scanning. The cost difference between running a 70B parameter model on your own hardware versus paying per-token for a frontier API is roughly an order of magnitude — and for a task where the models perform comparably, that math is hard to argue with.
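As a sketch of how little code the model call itself requires, assuming an OpenAI-compatible chat endpoint (llama.cpp, Ollama, and vLLM all expose one) and a JSON answer schema of our own invention:

```python
"""Sketch of a local-model audit call. The endpoint path and the JSON
finding schema are illustrative conventions, not a standard."""

import json

AUDIT_SYSTEM_PROMPT = (
    "You are a security auditor. Report each vulnerability as a JSON "
    'object: {"line": <int>, "class": <str>, "why": <str>}. '
    "Answer with a JSON array only."
)

def build_request(code: str, model: str = "llama3:70b") -> dict:
    """Payload for POST /v1/chat/completions on a local server."""
    return {
        "model": model,
        "temperature": 0,  # deterministic audits are easier to diff
        "messages": [
            {"role": "system", "content": AUDIT_SYSTEM_PROMPT},
            {"role": "user", "content": f"Audit this code:\n```\n{code}\n```"},
        ],
    }

def parse_findings(reply: str) -> list[dict]:
    """Tolerate models that wrap the JSON array in a code fence."""
    body = reply.strip().removeprefix("```json").removeprefix("```").removesuffix("```")
    findings = json.loads(body)
    return [f for f in findings if {"line", "class", "why"} <= f.keys()]

# To actually send it:
# requests.post("http://localhost:11434/v1/chat/completions",
#               json=build_request(code)).json()
```

Swapping the `model` field between a local 70B and a frontier API is a one-line change, which is precisely why the moat lives in the orchestration around this call rather than inside it.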

The caveat is real, though: if your threat model includes sophisticated attackers and you need to find the kind of novel, multi-step vulnerability chains that require genuine reasoning about complex system interactions, cheaper models will likely miss what frontier models catch. Know which game you're playing before you optimize for cost.

Looking ahead

This finding fits a pattern that's been emerging across AI applications in 2026: the initial wave of "bigger model = better results" marketing is giving way to a more nuanced understanding of where scale actually matters. For security specifically, expect the market to bifurcate — commodity AI-assisted scanning that runs on smaller models (likely embedded directly into CI/CD pipelines and available in every major SAST tool within a year) and premium AI security research platforms that justify frontier model costs for genuine zero-day hunting. The teams that build the best orchestration layers, not the ones with the biggest models, will win the commodity tier. And that tier is where 95% of the market lives.

Hacker News 1221 pts 322 comments

Small models also found the vulnerabilities that Mythos found

→ read on Hacker News
johnfn · Hacker News

The Anthropic writeup addresses this explicitly: > This was the most critical vulnerability we discovered in OpenBSD with Mythos Preview. Across a thousand runs through our scaffold, the total cost was under $20,000, and we found several dozen more findings.

epistasis · Hacker News

> We took the specific vulnerabilities Anthropic showcases in their announcement, isolated the relevant code, and ran them through small, cheap, open-weights models. Those models recovered much of the same analysis. Eight out of eight models detected Mythos's flagship FreeBSD exploit, includ…

tptacek · Hacker News

If you cut out the vulnerable code from Heartbleed and just put it in front of a C programmer, they will immediately flag it. It's obvious. But it took Neel Mehta to discover it. What's difficult about finding vulnerabilities isn't properly identifying whether code is mishandling buff…

muyuu · Hacker News

I think the "Mythos" name is genius. The people at Anthropic make a bunch of claims and the public is expected to just believe them without any possibility of testing those claims or reproducing those results, and since so many people are invested in this saviour for the Global economy, or…

antirez · Hacker News

Congrats: completely broken methodology, with a big conflict of interest. Giving specific bug hints, with an isolated function that is suspected to have bugs, is not the same task, NOR (crucially) is a task you can decompose the bigger task into. It is basically impossible to segment code in pieces,…
