Cursor ranks third on workflow coding tasks

Cursor Composer 2.5 is not the top AI coding agent across all workflow-style evals, but it is now a near-frontier contender: on Cursor’s own CursorBench 3.1 leaderboard, it scores 63.2%, just 1.6 points behind the leading model there, while averaging about $0.55 per task versus $11.02 for Claude Opus 4.7 Max. The cleanest independent check points the same way: Artificial Analysis ranks Composer 2.5 third on its Coding Agent Index, behind Claude Opus 4.7 Max and GPT-5.5 High, but at far lower cost.

That makes Cursor’s position pretty clear. If you mean “best overall on every coding-agent leaderboard,” the answer is no. If you mean “one of the best on realistic developer workflow tasks, with unusually good cost efficiency,” the answer is yes.

For readers trying to sort out benchmark scores versus workflow-style performance, that distinction is the whole story here. Workflow benchmarks try to measure whether an agent can move through an actual coding job (reading a codebase, editing files, using tools, iterating, and finishing the task) rather than just solving isolated code questions.

CursorBench 3.1 puts Composer 2.5 within 1.6 points of the top score at far lower cost

On CursorBench 3.1, Cursor lists Claude Opus 4.7 Max at 64.8%, GPT-5.5 Extra High at 63.8%, and Composer 2.5 at 63.2%. That puts Composer 2.5 third, not first, on Cursor’s own workflow benchmark, but only 1.6 points behind the leader and 0.6 points behind GPT-5.5 Extra High.

A quick head-to-head makes the tradeoff easier to see:

Model	CursorBench 3.1 score	Avg. cost per task
Claude Opus 4.7 Max	64.8%	$11.02
GPT-5.5 Extra High	63.8%	$3.70
Composer 2.5	63.2%	$0.55

Using Cursor’s posted numbers, Opus 4.7 Max costs about 20 times as much per task as Composer 2.5, and GPT-5.5 Extra High costs about 6.7 times as much. That is the important practical result: Composer 2.5 gives up little on score while cutting cost sharply.
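The score gaps and cost ratios can be rechecked from the three rows in the table. The following is a minimal sketch, not anything Cursor publishes: the `results` dictionary and the `compare` helper are illustrative names, and the only inputs are the scores and per-task costs quoted above.

```python
# Scores (%) and average cost per task (USD) as quoted from CursorBench 3.1 above.
results = {
    "Claude Opus 4.7 Max": (64.8, 11.02),
    "GPT-5.5 Extra High": (63.8, 3.70),
    "Composer 2.5": (63.2, 0.55),
}


def compare(baseline: str, other: str) -> tuple[float, float]:
    """Return (score gap in points, cost ratio) of `other` relative to `baseline`."""
    base_score, base_cost = results[baseline]
    other_score, other_cost = results[other]
    return round(other_score - base_score, 1), round(other_cost / base_cost, 1)


# Compare each higher-ranked model against Composer 2.5.
for rival in ("Claude Opus 4.7 Max", "GPT-5.5 Extra High"):
    gap, ratio = compare("Composer 2.5", rival)
    print(f"{rival}: +{gap} points, {ratio}x the cost per task")
```

Rounded to one decimal place, this reproduces the 1.6-point and 0.6-point gaps and the roughly 20x and 6.7x cost multiples.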

Cursor describes CursorBench in the Composer 2 technical report as a benchmark for real-world software engineering rather than toy prompts. The setup is meant to look more like a developer’s day job: multi-step work in a repository, tool use, and completion of an end-to-end task. That is also the framing behind NovaKnown’s earlier look at Cursor workflow benchmark results.

Cursor’s own Composer 2.5 launch post makes a broader claim: benchmarks still miss behavior traits that matter in real use, including communication style and effort calibration. That is plausible: anyone who has used coding agents has felt the difference between a model that barges ahead and one that asks the right clarifying question. But it is still a company claim, not an independently scored metric.

Artificial Analysis ranks Composer 2.5 third on its Coding Agent Index

The independent ranking points in the same direction, just with less room for brand self-grading. In its Coding Agent Index write-up, Artificial Analysis ranks Composer 2.5 third overall, behind Claude Opus 4.7 Max and GPT-5.5 High.

Artificial Analysis also reports the same basic economic pattern. Its article says Composer 2.5 is roughly 10x to 60x lower cost than higher-ranked rivals, depending on which agent you compare it with, while still landing near the top of the table. That matters because coding agents are not judged only by “can it solve this task once?” but also by “can a team afford to run this all day?”

That third-place result is also the cleanest answer to whether Cursor leads workflow-style coding tasks. No, not outright. Artificial Analysis does not put it first, and neither does CursorBench 3.1. But both place it close enough to the frontier that the cost delta starts to matter more than a point or two of score.

One useful way to think about this: on developer workflow tests, Composer 2.5 is not the fastest runner across the line, but it is close enough that its price tag changes the buying decision. A model that scores a bit lower while costing a small fraction as much is often the one teams actually deploy.
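To make the "can a team afford to run this all day?" question concrete, here is a rough sketch of daily spend. It assumes the CursorBench 3.1 per-task averages quoted earlier in this article (Artificial Analysis's own cost figures are not reproduced here) and a purely hypothetical volume of 200 tasks per day; `TASKS_PER_DAY` and `daily_cost` are illustrative names, and real per-task costs will vary with task size.

```python
# Average cost per task (USD) from the CursorBench 3.1 figures quoted earlier.
COST_PER_TASK = {
    "Claude Opus 4.7 Max": 11.02,
    "GPT-5.5 Extra High": 3.70,
    "Composer 2.5": 0.55,
}

# Hypothetical workload: not from any benchmark, chosen only for illustration.
TASKS_PER_DAY = 200


def daily_cost(model: str, tasks_per_day: int = TASKS_PER_DAY) -> float:
    """Estimated daily spend if every task cost the benchmark average."""
    return round(COST_PER_TASK[model] * tasks_per_day, 2)


for model in COST_PER_TASK:
    print(f"{model}: ${daily_cost(model):,.2f} per day")
```

Under those assumptions the gap is about $110 per day for Composer 2.5 versus about $2,204 for Opus 4.7 Max, which is the kind of difference that can outweigh a point or two of score.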

Terminal-Bench 2.0 and SWE-bench Multilingual show earlier Composer gains before the 2.5 release

Before Composer 2.5, Cursor was already showing strong numbers for Composer 2 on several workflow-adjacent benchmarks. In its “Introducing Composer 2” post, the company reported 47% on CursorBench, 43.8% on Terminal-Bench 2.0, and 53.8% on SWE-bench Multilingual.

Those earlier results matter because they show a trend, not a one-off jump. Cursor had already been pushing toward stronger agentic coding performance on benchmarks that ask models to do more than emit a code snippet.

But the versions are not directly comparable. Cursor’s current headline number for Composer 2.5 comes from CursorBench 3.1, while Composer 2’s published score came from an earlier version of CursorBench reported in the Composer 2 materials. That means you should not read the change as a neat before-and-after gain on the same test.

The bigger point is what these benchmarks are trying to measure. Workflow-style coding evals usually emphasize a few parallel abilities:

navigating a real repository,
choosing and sequencing tool use,
editing multiple files coherently,
recovering from failed attempts,
and finishing an end-to-end task instead of a one-shot prompt.

That is different from benchmark-only leaderboards built around narrower coding questions or static problem sets. Those still tell you something, but they are less like watching someone ship a patch and more like watching them solve a whiteboard exercise.

Cursor is therefore best described as a top-tier, cost-efficient workflow coding agent, not the undisputed leader. Its best current evidence is a strong result on its own benchmark, and the best outside validation still puts it third, not first.

The next useful datapoint will be whether more independent workflow evaluations keep Composer 2.5 in that same band near Opus and GPT-5.5. For now, the evidence supports a narrower, stronger claim: Cursor is very close to the top on realistic coding-agent tasks, and it gets there much more cheaply.

Key Takeaways

Cursor Composer 2.5 scores 63.2% on CursorBench 3.1, which is 1.6 points behind the top score of 64.8%.
CursorBench 3.1 lists Composer 2.5 at about $0.55 per task, versus $11.02 for Claude Opus 4.7 Max and $3.70 for GPT-5.5 Extra High.
Artificial Analysis ranks Composer 2.5 third overall on its Coding Agent Index rather than first.
Workflow-style coding benchmarks try to test repository work, tool use, iteration, and task completion, which is broader than isolated coding-question leaderboards.
Composer 2’s earlier Terminal-Bench 2.0 and SWE-bench Multilingual results suggest Cursor’s workflow push predates the 2.5 release.

Frequently Asked Questions

Does Cursor lead AI coding agents on real developer workflow tasks?

Cursor does not lead outright. On CursorBench 3.1, Composer 2.5 is third at 63.2%, behind Claude Opus 4.7 Max at 64.8% and GPT-5.5 Extra High at 63.8%. On the independent Artificial Analysis Coding Agent Index, it is also third.

Why does Cursor still matter if it is not No. 1?

Cost is the reason. CursorBench 3.1 lists Composer 2.5 at about $0.55 per task, versus $11.02 for Opus 4.7 Max and $3.70 for GPT-5.5 Extra High. A small score gap paired with a much larger cost gap is often a good trade in production.

What is CursorBench measuring?

According to the Composer 2 technical report, CursorBench is meant to measure real-world software engineering performance. In practice, that means multi-step work in repositories, tool use, edits across files, and whether the agent completes the job, not just whether it answers a coding question correctly.

Are Composer 2 and Composer 2.5 scores directly comparable?

Not cleanly. Cursor’s earlier Composer 2 post reports results from older benchmark versions, while Composer 2.5’s headline result is on CursorBench 3.1. That makes the broad trend informative, but not a strict apples-to-apples score progression.

References

Last reviewed: 2026-06
