OpenAI GPT-5.5 Matches Claude Mythos on Cyber Hacking

Not long ago, the idea of an AI autonomously hacking a corporate network end-to-end — chaining reconnaissance, credential theft, lateral movement, supply chain exploitation and data exfiltration into a single unbroken sequence — belonged firmly in the realm of science fiction. A threat researchers might warn about in five to ten years. Something to prepare for, not defend against today.

That timeline just closed.

Last week, the UK’s AI Security Institute (AISI) published its evaluation of OpenAI’s GPT-5.5, finding it capable of completing a 32-step corporate network attack simulation autonomously — and placing it on par with Anthropic’s Claude Mythos Preview, the only other model ever to achieve the same milestone. Two different AI systems, from two different companies, now operate at a level of offensive cyber capability that simply did not exist in any AI model twelve months ago.

This is not a story about which AI company is winning a benchmark race. It is a story about where artificial intelligence is right now, and the direction the trajectory is pointing.

Where We Were

To appreciate where we are, it helps to look back — even just a year.

The previous generation of frontier models — GPT-5.4 and Claude Opus 4.7 — scored 52.4% and 48.6% respectively on AISI’s most demanding Expert-tier cyber tasks. Capable, certainly. Useful for security research and vulnerability triage. But unable to complete the kind of complex, multi-stage attack chain that characterises a real enterprise intrusion.

The generation before that could help a security researcher write exploit code or explain a vulnerability class. Useful tools for people who already knew what they were doing. Not autonomous threat actors.

The step change between those generations and what GPT-5.5 and Claude Mythos can now do is not incremental. It is qualitative. Models have crossed from being capable assistants for skilled humans to being autonomous executors of complex skilled tasks — no human in the loop.

Where We Are

Here is the current state of play, as measured by AISI’s controlled evaluations.

The benchmark scores:

Model                   Expert Cyber Task Pass Rate
GPT-5.5                 71.4%
Claude Mythos Preview   68.6%
GPT-5.4                 52.4%
Claude Opus 4.7         48.6%

The gap between the top two and the previous generation is not noise — it is a step change. And both top models sit within each other’s margin of error, which is the point: this is now a frontier-wide capability level, not one lab’s achievement.

The simulation: AISI’s “The Last Ones” (TLO) scenario models the kill chain of a real enterprise intrusion across 32 steps, four subnets, and roughly twenty hosts. Starting with no credentials on an unprivileged machine, the AI agent must chain together reconnaissance, steal credentials, move laterally across Active Directory forests, pivot through a CI/CD pipeline, and extract a protected internal database. Human expert estimate: 20 hours. Claude Mythos completed it in 3 of 10 attempts. GPT-5.5 completed it in 2 of 10 attempts. No previous model completed it at all.

The moment that really lands: AISI gave GPT-5.5 a reverse-engineering puzzle — two binaries, a custom virtual machine, unknown bytecode format. The model had to reconstruct the VM’s instruction set, write a disassembler from scratch, and recover a cryptographic password through constraint solving. A human security expert with professional tools took approximately 12 hours. GPT-5.5 solved it in 10 minutes and 22 seconds. The API cost: $1.73.

That last number is the one to hold onto. Not as a headline, but as a signal about structural change. When 12 hours of senior expert labour compresses into 10 minutes and under two dollars, something has shifted in the economics of offensive security that no policy response or product update reverses. The cost to survey and exploit a complex vulnerability landscape has dropped — not for this task, or this model, but as a new baseline.
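The compression described above can be made concrete with back-of-the-envelope arithmetic. The task times and API cost below are the AISI figures quoted earlier; the $150/hour expert rate is an illustrative assumption, not a figure from the report:

```python
# Back-of-the-envelope cost compression for the reverse-engineering task.
# EXPERT_RATE is an assumed labour rate for illustration only; the task
# times and API cost are the figures reported in AISI's evaluation.

EXPERT_RATE_USD_PER_HOUR = 150.0   # assumption: senior security expert rate
HUMAN_HOURS = 12.0                 # reported human expert time
MODEL_SECONDS = 10 * 60 + 22       # reported model time: 10 min 22 s
MODEL_COST_USD = 1.73              # reported API cost

human_cost = EXPERT_RATE_USD_PER_HOUR * HUMAN_HOURS      # $1,800.00
time_speedup = (HUMAN_HOURS * 3600) / MODEL_SECONDS      # ~69x faster
cost_ratio = human_cost / MODEL_COST_USD                 # ~1,040x cheaper

print(f"Human labour cost: ${human_cost:,.2f}")
print(f"Time speedup:      {time_speedup:.0f}x")
print(f"Cost ratio:        {cost_ratio:.0f}x")
```

Even if the assumed hourly rate is off by a factor of two in either direction, the ratio stays in the hundreds — which is why the paragraph above treats it as a structural shift rather than a one-off data point.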

Where It Is Going

AISI was careful to frame its findings as the beginning of a trend rather than a peak. A few observations from the report and the broader context deserve attention.

Capability is scaling with compute, and no plateau is in sight. AISI notes that performance on its most demanding simulation continues to improve as the token budget per attempt increases, and that no plateau has been observed with the best models. The 2-in-10 and 3-in-10 completion rates on TLO are not ceilings — they are current snapshots. More inference compute means higher completion rates. Models are going to get better at this, and they are going to do it as a side effect of getting better at everything else.

This is emerging as a by-product, not a goal. Perhaps the most uncomfortable aspect of AISI’s conclusion is that neither lab appears to have deliberately trained for offensive cyber capability. The improvement is a by-product of general advances in autonomy, programming, and reasoning. Labs are building more capable agents for coding, research, and complex problem-solving. Offensive cyber capability comes along for the ride. That means there is no obvious place to draw a line that preserves beneficial capability while suppressing dangerous capability — the same improvements that make AI agents better at debugging production code make them better at finding vulnerabilities in it.

The next threshold is autonomously discovering novel zero-days in hardened production systems. Neither GPT-5.5 nor Mythos has reached what OpenAI defines as a “Critical” capability threshold — autonomously developing functional zero-day exploits in real, hardened, production systems without human intervention. That distinction matters for now. It will not matter indefinitely. The trajectory of the last twelve months suggests the gap between current capability and that threshold is narrowing faster than most organisations are prepared for.

Jailbreaks remain a persistent gap between policy and reality. AISI’s red team identified a universal jailbreak against GPT-5.5’s cyber safety guardrails in six hours of expert effort. OpenAI updated its safeguard stack in response, but a configuration issue prevented AISI from verifying whether the final version held. AISI’s CTO has stated publicly that the institute has found exploitable weaknesses in every frontier model it has tested, including Claude Mythos. The honest read is that safety guardrails at this capability level are meaningful friction, not sealed doors. The annoyance level for an attacker attempting to bypass them is rising — but the bypass remains achievable.

Two labs today, more tomorrow. The fact that two independent labs reached similar capability levels within weeks of each other — without either appearing to have made a targeted push for offensive cyber performance — strongly implies this is not a capability that requires unique research insight to achieve. It is a capability that arrives when a model crosses certain general thresholds of autonomy and programming ability. As more labs cross those thresholds, the number of models operating at this level will grow. You cannot access-restrict your way out of a capability that is multiple labs deep into production and heading into more.

What This Means in Practice

None of this is cause for panic. It is cause for updating assumptions that were built for a different capability environment.

For security leaders: Threat models that budget AI-assisted attacks as a future concern rather than a current one need revision. The capability to autonomously chain a full enterprise intrusion kill chain now exists outside a research lab. The question is not whether adversaries will eventually have access to tools at this level — it is how quickly the gap between frontier research and adversary tooling closes, and what your detection and response posture looks like when it does.

For defenders: The same capability that enables offensive use enables defensive use. AI agents that can autonomously chain a 32-step attack simulation can also autonomously audit code, test defences, hunt for misconfigurations, and accelerate incident response. Organisations investing in building AI-assisted defence capability now are building lead time. Those treating AI purely as a threat to manage are ceding that advantage.

For everyone else: The broader implication of AISI’s findings is that AI capability is advancing faster than governance, faster than most organisations’ security posture, and faster than most people’s mental model of what these systems can do. The gap between “AI can help write code” and “AI can autonomously execute a 20-hour expert attack chain” closed in less than two years. The next two years will close further gaps that today feel safely distant.

The trajectory is not a mystery. It is visible in the data. The question is whether we are building our organisations and defences for where AI is today, or for where it demonstrably is going.

Quick Reference

Item                        Detail
Evaluation Body             UK AI Security Institute (AISI)
GPT-5.5 Expert Score        71.4%
Claude Mythos Expert Score  68.6%
Previous Gen Scores         GPT-5.4: 52.4% / Opus 4.7: 48.6%
TLO Completion              Mythos: 3/10 · GPT-5.5: 2/10 · All others: 0/10
Reverse-Engineering Task    10 min 22 s / $1.73 (human: ~12 hours)
Jailbreak Finding           Universal bypass found in 6 hours; fix unverified
AISI’s Key Conclusion       Frontier-wide trend — not a single-model event
Next Capability Threshold   Autonomous novel zero-days in hardened production systems
