I just finished reading Anthropic's Project Glasswing: an initial update, and it sent me back to a question I keep circling. How does vulnerability management actually work in most organisations, and is it ready for what is coming?
Anthropic, working with around 50 partners including Cloudflare, Mozilla, Cisco, and Oracle, used an unreleased model, Mythos (preview), to find more than ten thousand high or critical severity vulnerabilities in a single month. Cloudflare alone found 2,000 bugs, 400 of them high or critical, at a false positive rate their team rated better than that of human testers. Mozilla found 271 vulnerabilities in Firefox 150, ten times what the prior model, Claude Opus 4.6, had found in Firefox 148.
Anthropic put the implication plainly:
Progress on software security used to be limited by how quickly we could find new vulnerabilities. Now it's limited by how quickly we can verify, disclose, and patch the large numbers of vulnerabilities found by AI.
I came away part excited and part uncomfortable. Excited because, for the first time, we have tools that could meaningfully help secure the internet at scale. Uncomfortable because the same tools sit in attackers' hands, and because the volume of code now being shipped by AI coding assistants, much of it vibe coded by people who barely read what the model produced, is growing far faster than the security infrastructure built to defend it. We are not ready.
The numbers were not a surprise
None of those figures surprised me. For weeks I have been handing Claude Opus 4.6 and 4.7 my own code. Real code, not scripts. Code that had already been through static analysis, a human software engineer, and a security engineer. Claude still finds bugs. Authorization issues. A chain that could be pushed into a denial of service. Code I would have shipped, picked apart by a model in minutes.
Attackers got the same compression
The attacker side is moving in the same direction. Mandiant's M-Trends 2026 puts the mean time to exploit at negative seven days. That figure is the gap between when a vulnerability is publicly disclosed and when it is first seen being exploited. A negative number means exploitation came first: a growing share of vulnerabilities are used as zero days, found by attackers and weaponised before the vendor or the public knew the bug existed.
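To make the sign convention concrete, here is a minimal sketch of how a time-to-exploit figure can be computed. The dates are invented for illustration and are not taken from M-Trends or any other report; the function name is my own.

```python
from datetime import date
from statistics import mean


def time_to_exploit_days(disclosed: date, first_exploited: date) -> int:
    """Days from public disclosure to first observed exploitation.

    A negative result means exploitation was seen before disclosure,
    i.e. the bug was used as a zero day.
    """
    return (first_exploited - disclosed).days


# Hypothetical (disclosed, first_exploited) pairs, for illustration only.
samples = [
    (date(2025, 3, 10), date(2025, 2, 20)),  # exploited before disclosure
    (date(2025, 5, 1), date(2025, 5, 1)),    # exploited on disclosure day
    (date(2025, 7, 15), date(2025, 7, 12)),  # exploited before disclosure
]

gaps = [time_to_exploit_days(d, e) for d, e in samples]
mean_tte = mean(gaps)
print(gaps, mean_tte)
```

A handful of zero days is enough to drag the mean below zero even when some vulnerabilities are exploited on or after disclosure day.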
For the vulnerabilities that do get disclosed before exploitation, the gap has nearly collapsed. Verizon's 2025 Data Breach Investigations Report found that for new critical vulnerabilities in edge devices and VPNs, the median time between disclosure and mass exploitation was zero days. Disclosed in the morning, exploited at scale by the afternoon.
XBOW, an autonomous AI pentester, reached number one on HackerOne's global leaderboard last summer with more than two thousand reputation points in about ninety days. Nat Friedman, an investor in the company, said it cleanly:
We are now in the era of machines hacking machines.
Now look at your patch cadence
Microsoft shipped roughly 1,130 CVEs across Patch Tuesdays in 2025, the second-largest year on record. October alone carried 172 fixes, four of them zero days. NIST reports that CVE submissions to the National Vulnerability Database grew 263 percent between 2020 and 2025. CISA's Known Exploited Vulnerabilities catalog grew another 20 percent last year. Edgescan's 2025 report puts the average time to remediate a high or critical application flaw at 74 days. Veracode's average fix time across customers is now 252 days, up 47 percent since 2020.
A monthly cycle was always a compromise. It used to be a respectable one, and most organisations were working hard to hit it. Set it against the numbers above, and the bar has moved.
The asset problem underneath
A second problem hides under the first. Most legacy organisations grew their technology organically: mergers, projects, departments buying their own tools. The asset list is long, the ownership is fuzzy, and the appetite for downtime is small. You cannot patch everything in fourteen days, and pretending you can wastes the team's time.
Prioritisation is the only honest answer. Work from the Cyentia Institute and the EPSS group has shown that fixing by CVSS severity alone is about as effective as picking CVEs at random. Use EPSS to predict exploitation likelihood, layer the CISA KEV catalog on top, and use SSVC, the decision model from Carnegie Mellon's Software Engineering Institute, to drive a clear action for each finding. CVSS becomes one input, not the verdict.
What belongs on next year's roadmap
So what should a 2026 programme actually invest in?
Build the DevOps and DevSecOps foundation first
Without it, nothing else works. Pipelines that build, test, and deploy reliably are the prerequisite for fast patching. If you cannot ship a patch on a Tuesday afternoon, the timeline conversation is academic.
Automate dependency patching
Dependabot or Renovate with auto-merge rules for low-risk updates removes thousands of items from the human queue. GitHub's own data shows that projects with auto-merge remediate more consistently than projects that rely on manual review.
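What counts as "low-risk" is a policy decision each team has to make. As a rough sketch, one possible rule is: auto-merge patch bumps when CI is green, allow minor bumps only for development dependencies, and never auto-merge majors. The function names and the rule itself are my assumptions, and this is not Dependabot's or Renovate's configuration syntax; both tools express such rules in their own config files.

```python
import re

# Matches versions such as "1.4.2" or "v1.4.2"; anything after the patch
# number (pre-release tags, build metadata) is ignored in this sketch.
SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)")


def parse(version: str) -> tuple:
    match = SEMVER.match(version)
    if match is None:
        raise ValueError(f"not a semantic version: {version!r}")
    return tuple(int(part) for part in match.groups())


def update_type(current: str, candidate: str) -> str:
    """Classify a version bump as 'major', 'minor', 'patch', or 'none'."""
    cur, new = parse(current), parse(candidate)
    if new <= cur:
        return "none"
    if new[0] != cur[0]:
        return "major"
    if new[1] != cur[1]:
        return "minor"
    return "patch"


def can_auto_merge(current: str, candidate: str, ci_passed: bool,
                   is_dev_dependency: bool = False) -> bool:
    """Hypothetical policy: green CI is mandatory; patch bumps always
    qualify; minor bumps qualify only for development dependencies."""
    if not ci_passed:
        return False
    kind = update_type(current, candidate)
    if kind == "patch":
        return True
    if kind == "minor":
        return is_dev_dependency
    return False  # 'major' and 'none' always go to a human


print(can_auto_merge("1.4.2", "1.4.3", ci_passed=True))
print(can_auto_merge("1.4.2", "2.0.0", ci_passed=True))
```

Everything the rule rejects still lands in the human queue, so a conservative policy costs review time rather than safety.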
Prioritise ruthlessly
EPSS, then KEV, then SSVC, in that order, with CVSS as a sanity check. Anything in CISA KEV with internet exposure gets patched on a clock measured in days. The long tail goes through SSVC and gets a defined action, even if that action is "track."
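A triage rule along these lines can be sketched in a few lines. This is a deliberate simplification, not the real SSVC decision tree, which weighs more factors than these. The field names, the 0.1 EPSS threshold, the 9.0 CVSS cut-off, and the action labels are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    cve: str
    epss: float             # EPSS probability of exploitation, 0.0 to 1.0
    in_kev: bool            # listed in the CISA KEV catalog
    internet_exposed: bool  # affected asset is reachable from the internet
    cvss: float             # CVSS base score, used only as a sanity check


# Lower rank means more urgent.
ACTION_RANK = {"act": 0, "attend": 1, "track": 2}


def triage(f: Finding, epss_threshold: float = 0.1) -> str:
    """Map a finding to one action. Thresholds are illustrative assumptions."""
    if f.in_kev and f.internet_exposed:
        return "act"      # patch on a clock measured in days
    if f.in_kev or f.epss >= epss_threshold:
        return "attend"   # schedule ahead of the normal cycle
    if f.cvss >= 9.0:
        return "attend"   # sanity check: do not bury a critical score
    return "track"        # defined action, revisited when the data changes


def build_queue(findings: list) -> list:
    """Order findings by action urgency, then by EPSS, highest first."""
    return sorted(findings, key=lambda f: (ACTION_RANK[triage(f)], -f.epss))


# Placeholder identifiers, not real CVEs.
findings = [
    Finding("CVE-D", epss=0.01, in_kev=False, internet_exposed=False, cvss=5.0),
    Finding("CVE-B", epss=0.45, in_kev=False, internet_exposed=False, cvss=6.1),
    Finding("CVE-A", epss=0.02, in_kev=True, internet_exposed=True, cvss=7.5),
    Finding("CVE-C", epss=0.01, in_kev=False, internet_exposed=True, cvss=9.8),
]

for f in build_queue(findings):
    print(f.cve, triage(f))
```

Every finding leaves with exactly one action, which is the point: nothing sits in the queue without a decision attached.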
Put LLMs into SAST and pentesting
Not to replace your existing tools. I have personally found LLMs better than traditional SAST for some classes of bug, especially logic flaws and authorization chains. They miss things SAST catches, and SAST misses things they catch. They are complementary, not competitive. A recent academic paper, SAST-Genius, pairs Semgrep with a fine-tuned LLM and cuts Semgrep's false positives by 91 percent on the authors' test set. That is the direction of travel.
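One simple way to pair the two is to let the SAST tool produce findings and ask a model for a second opinion on each. The sketch below is my own illustration of that pattern, not the method from the SAST-Genius paper. It assumes Semgrep's JSON output (`semgrep --json`), where each entry in `results` carries `check_id`, `path`, `start.line`, and `extra.message`. The `read_snippet` and `ask_llm` callables are hypothetical hooks you would supply.

```python
import json
from typing import Callable


def build_prompt(finding: dict, snippet: str) -> str:
    """Turn one Semgrep finding plus its source snippet into a review prompt."""
    return (
        "You are reviewing a static analysis finding.\n"
        f"Rule: {finding['check_id']}\n"
        f"File: {finding['path']}, line {finding['start']['line']}\n"
        f"Message: {finding['extra']['message']}\n"
        f"Code:\n{snippet}\n"
        "Answer with exactly one word: TRUE_POSITIVE or FALSE_POSITIVE."
    )


def triage_findings(
    semgrep_json: str,
    read_snippet: Callable[[dict], str],
    ask_llm: Callable[[str], str],
) -> list:
    """Return the findings the model did not dismiss.

    Anything other than a clear FALSE_POSITIVE verdict is kept, so an
    unclear or malformed model answer still reaches a human reviewer.
    """
    kept = []
    for finding in json.loads(semgrep_json)["results"]:
        verdict = ask_llm(build_prompt(finding, read_snippet(finding)))
        if "FALSE_POSITIVE" not in verdict.upper():
            kept.append(finding)
    return kept
```

Failing open is the design choice that matters here: the model can shrink the queue, but it cannot silently drop a finding by giving a confused answer.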
Keep it on your own hardware if you need to
If you do not want to send your source code or pentest data to a third party, you do not have to. Host the smaller open models yourself. Google's Gemma 4, released in April 2026 under Apache 2.0, runs on a single consumer GPU at its 26-billion-parameter mixture-of-experts size. Alibaba's Qwen3-Coder 30B-A3B, also Apache 2.0, runs in about 15 gigabytes of RAM. Wire them into Ollama or LM Studio, point Vulnhuntr or PentestGPT at them, and the data never leaves your network. You can prove the value on a single workstation in a week. Scaling it to production is a real enterprise project, but the path is well understood.
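For a sense of how little wiring the single-workstation proof needs, here is a minimal sketch of calling a locally hosted model through Ollama's HTTP API using only the standard library. It assumes Ollama is running on its default port, 11434. The model tag is an assumption; substitute whatever `ollama list` shows on your machine.

```python
import json
import urllib.request

# Ollama's default local endpoint; nothing here leaves the machine.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Assumed model tag, for illustration only.
DEFAULT_MODEL = "qwen3-coder:30b"


def build_request(prompt: str, model: str = DEFAULT_MODEL) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local_model(prompt: str, model: str = DEFAULT_MODEL,
                    timeout: float = 300.0) -> str:
    """Send the prompt and return the generated text.

    Requires a running Ollama server with the model already pulled.
    """
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, something like `ask_local_model("Review this function for authorization flaws: ...")` returns the model's answer as a string.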
The takeaway
AI has compressed the time from "vulnerability exists" to "vulnerability is exploited" to nearly zero, and in many cases below zero. Defenders need the same compression on the other side of the equation. Build the pipes, automate the patches, prioritise what matters, and put a model on your own hardware.
The teams that do this in 2026 will spend 2027 on harder problems. The teams that do not will spend it on incident response.