The UK AI Security Institute says GPT-5.5's cybersecurity simulation results now look a lot less like a one-off milestone and a lot more like a repeatable frontier capability. In its latest evaluation, AISI found that an early checkpoint of OpenAI's GPT-5.5 reached roughly the same level as Anthropic's Mythos Preview on hard cyber tasks, and slightly beat it on one key benchmark.
That matters because AISI was explicitly testing whether Mythos Preview’s earlier result was a weird outlier. Instead, a second model from a different developer now lands in the same range, including solving a difficult multi-step cyber attack simulation end-to-end in some attempts. If you’ve been tracking rising AI cyber capabilities, this is the part worth circling.
GPT-5.5 Cybersecurity Simulation Is No Longer a One-Model Fluke
AISI's headline finding is simple: GPT-5.5 reached a similar cyber capability level to Mythos Preview. The repetition, not the raw score, is the interesting result.
Back in April, AISI said Mythos Preview was the first frontier model it had seen complete its corporate network attack simulation end-to-end, a multi-step exercise it estimates would take a human expert around 20 hours. The obvious follow-up was whether that was a breakthrough tied to one model family.
AISI’s answer is now: probably not. GPT-5.5, from a different lab, hit a comparable level and achieved a slightly higher average pass rate than Mythos Preview on expert tasks.
That shift changes the interpretation. A surprising benchmark win can be a stunt. Two frontier models from different developers hitting about the same bar starts to look like a capability class.
How GPT-5.5 Performed Across AISI's Cyber Task Suite
AISI's testbed is broader than a single dramatic demo. It uses a suite of 95 narrow cyber tasks across four difficulty tiers, built in capture-the-flag (CTF) format: structured challenges where the model has to actually recover a "flag" by solving the task.
Those tasks cover things like reverse engineering, web exploitation, and cryptography. The easier tasks are already saturated by frontier models, so the interesting comparison is in the advanced suite.
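AISI has not published its harness, but the CTF framing implies a simple pass/fail scoring rule. Here is a minimal sketch of what that kind of check could look like, with the flag format, function names, and scoring convention all assumed for illustration rather than taken from AISI's setup:

```python
import re

# Hypothetical illustration only: AISI has not published its harness, so the
# flag format, task structure, and scoring rule here are assumptions.
FLAG_PATTERN = re.compile(r"flag\{[^}]+\}")

def attempt_passes(model_transcript: str, expected_flag: str) -> bool:
    """CTF-style scoring is binary: the attempt passes only if the exact
    flag string planted in the challenge shows up in the model's output."""
    return expected_flag in FLAG_PATTERN.findall(model_transcript)

def task_pass_rate(transcripts: list[str], expected_flag: str) -> float:
    """Average over repeated attempts at a single task."""
    passes = sum(attempt_passes(t, expected_flag) for t in transcripts)
    return passes / len(transcripts)

# Example: two of three attempts recovered the (made-up) flag.
runs = ["... flag{vm_unpacked_ok} ...", "no luck", "got it: flag{vm_unpacked_ok}"]
print(task_pass_rate(runs, "flag{vm_unpacked_ok}"))  # ~0.67
```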
On Expert-level tasks, AISI reports these average pass rates:
| Model | Expert task pass rate |
|---|---|
| GPT-5.5 | 71.4% ± 8.0% |
| Mythos Preview | 68.6% ± 8.7% |
| GPT-5.4 | 52.4% ± 9.8% |
| Opus 4.7 | 48.6% ± 10.0% |
That is a real jump over earlier OpenAI and Anthropic frontier models. GPT-5.5 is not edging forward from 68% to 71% in a vacuum; it is sitting well above GPT-5.4 and Opus 4.7 on the hardest tier AISI reports.
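AISI does not spell out how the ± ranges in that table are derived. One common convention for suites like this is to report a mean pass rate with a standard error across tasks or runs; the sketch below shows that calculation on invented placeholder numbers, not AISI's underlying data:

```python
import math

# Invented placeholder numbers, NOT AISI data: per-task pass rates for a
# hypothetical Expert tier, each already averaged over repeated runs.
per_task = [1.0, 0.9, 0.75, 0.6, 0.5, 0.4]

n = len(per_task)
mean = sum(per_task) / n
# Standard error of the mean across tasks: one plausible reading of the
# "± x%" column above, though AISI may compute its ranges differently.
variance = sum((p - mean) ** 2 for p in per_task) / (n - 1)
std_error = math.sqrt(variance / n)

print(f"average pass rate: {mean:.1%} ± {std_error:.1%}")  # 69.2% ± 9.5%
```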
The advanced tasks themselves are also nasty in exactly the way you'd want for this kind of evaluation. AISI says they include reversing stripped binaries and embedded firmware without source code, building reliable exploits for memory corruption bugs, recovering keys from weak crypto implementations, winning time-of-check/time-of-use (TOCTOU) races, unpacking obfuscated malware, and weaponizing synthetic vulnerabilities planted in real open-source software.
One example AISI highlights is a reverse-engineering challenge built around a stripped Rust ELF implementing a custom virtual machine, plus a second unknown-format file containing bytecode for that VM. That is not “write a phishing email.” It is the kind of task where benchmark scores start to tell you something about actual technical depth.
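To give a sense of the challenge category, here is a deliberately tiny stack-based bytecode VM in Python. It is purely illustrative: the real target is a stripped Rust ELF with its own unknown instruction set, and none of the opcodes or programs below are taken from the actual task.

```python
# Toy stack-based VM, illustrating the challenge *category* only.
# The real target is a stripped Rust ELF with an unknown instruction set.
PUSH, ADD, XOR, PRINT, HALT = range(5)

def run(bytecode: list[int]) -> list[int]:
    stack, output, pc = [], [], 0
    while pc < len(bytecode):
        op = bytecode[pc]
        if op == PUSH:
            stack.append(bytecode[pc + 1]); pc += 2
        elif op == ADD:
            b, a = stack.pop(), stack.pop(); stack.append(a + b); pc += 1
        elif op == XOR:
            b, a = stack.pop(), stack.pop(); stack.append(a ^ b); pc += 1
        elif op == PRINT:
            output.append(stack.pop()); pc += 1
        elif op == HALT:
            break
    return output

# Reverse-engineering a challenge like AISI's means recovering the dispatch
# loop and opcode table from the compiled binary, then decoding the separate
# bytecode file well enough to extract the flag.
program = [PUSH, 0x41, PUSH, 0x01, ADD, PRINT, HALT]
print(run(program))  # -> [66]
```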
Why Minutes Matter: The Human-versus-Model Time Gap

AISI says GPT-5.5 solved a difficult multi-step cyber task in under 11 minutes. The same full-chain simulation is estimated to take a human expert about 20 hours.
The raw comparison is startling, but it needs one clarification: this does not mean GPT-5.5 is a drop-in replacement for a human red teamer. The benchmark is measuring performance on a controlled task suite, not whether you can hand the model a production network and expect clean autonomous operation.
Still, the time gap matters for two reasons.
First, it changes what becomes cheap to try. A model that can take repeated shots at a hard multi-step task in minutes is operating in a very different regime from a human expert who needs most of a day. Even partial success becomes more operationally interesting when attempts are fast.
Second, AISI says the run cost was $1.73. That is a tiny price for a benchmark result at this level. If frontier models can attempt advanced cyber tasks quickly and cheaply, scaling the number of runs stops being the bottleneck.
That cost number is easy to miss, but it is one of the most important lines in the evaluation. High-end cyber capability is one thing. High-end cyber capability at commodity-run pricing is another.
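A quick back-of-the-envelope sketch makes the retry economics concrete. The $1.73 per-run figure comes from the evaluation; the per-attempt success probability and the assumption that attempts are independent are illustrative simplifications, not AISI numbers:

```python
# Back-of-the-envelope retry economics. The $1.73 per-run cost is AISI's
# reported figure; the 30% per-attempt success rate is an illustrative
# assumption, and treating attempts as independent is a simplification.
COST_PER_RUN = 1.73
P_SUCCESS = 0.30

for attempts in (1, 5, 10, 20):
    p_any = 1 - (1 - P_SUCCESS) ** attempts
    print(f"{attempts:>2} attempts: ~{p_any:.0%} chance of a success, "
          f"${attempts * COST_PER_RUN:.2f} total")
```

Even under pessimistic per-attempt odds, twenty tries cost less than $40, which is why attempt volume rather than spend becomes the relevant constraint.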
This is also why model autonomy research keeps spilling into security. Once you combine strong task performance with low per-run cost and agentic iteration, you get the same pattern people worry about in things like agentic sandbox escape: more attempts, more persistence, and less friction.
What GPT-5.5 Actually Changes for Cyber Evaluation
The cleanest update is that cyber evals now need to assume multiple labs can produce models at this level. GPT-5.5’s result means benchmark designers can no longer treat top-tier cyber performance as a lab-specific anomaly.
That pushes evaluation in two directions.
One is harder, more realistic tasks. AISI notes that basic tasks have been saturated since at least February 2026. When models max out easier CTF-style challenges, the useful signal moves to practitioner and expert tasks with larger search spaces and more steps.
The other is more careful interpretation. Stronger benchmark performance does not automatically prove deployable real-world capability. A model that passes Expert-level CTF tasks can still fail in messy real environments full of unreliable tooling, access constraints, and adversarial inputs.
We’ve already seen how brittle agentic systems can be when the environment fights back, whether through deliberate attacks like prompt injection in peer review or through the ordinary chaos of multi-step tooling. So the right reading of the GPT-5.5 cybersecurity simulation result is not “AI can now do cybersecurity.” It is narrower and, in some ways, more significant: frontier models are now repeatedly reaching expert benchmark territory on serious cyber tasks.
That is enough to force a change in how these systems are tested, gated, and compared.
Key Takeaways
- AISI found GPT-5.5 reached a similar level to Mythos Preview, suggesting frontier cyber performance is no longer a one-model fluke.
- On Expert-level tasks in AISI's advanced cyber suite, GPT-5.5 averaged 71.4%, nominally ahead of Mythos Preview at 68.6%, though the reported uncertainty ranges overlap.
- AISI says GPT-5.5 solved a difficult multi-step cyber task in under 11 minutes, while the full chain is estimated to take a human expert around 20 hours.
- The reported run cost was $1.73, which makes repeated attempts at advanced cyber tasks unusually cheap.
- The result shows stronger benchmark performance, not proof of broadly deployable real-world operational capability.
Further Reading
- Our evaluation of OpenAI's GPT-5.5 cyber capabilities | AISI Work: primary source on GPT-5.5's pass rates, timing, task design, and comparison with Mythos Preview.
- AI cyber capabilities: NovaKnown's earlier coverage of how frontier models are climbing cyber benchmarks.
- agentic sandbox escape: why fast, cheap autonomous retries matter once models can act across multiple steps.
- prompt injection in peer review: a useful parallel case for how capable agents still break in hostile or messy environments.
The open question now is how long today’s “expert” cyber benchmarks stay discriminating once more labs can train to the same level.
