ARC Prize didn’t just tweak a benchmark. The ARC-AGI-3 human baseline now uses the median human run per level instead of the second-best human run, and the per-level score cap rose from 100% to 115%. That sounds minor. It isn’t.
The obvious reading is that ARC-AGI-3 got easier. The better reading is that ARC Prize is redefining what “human-level” is supposed to mean: not an unusually efficient run, but typical first-pass human problem solving under the same conditions AI gets. On this benchmark, the AI-human gap is now partly a measurement question, not just a capability question.
That distinction matters because ARC-AGI-3 is explicitly trying to test novel problem-solving, not memorized competence. If you use the wrong human reference point, you end up benchmarking against a speed-run.
Why ARC-AGI-3’s human baseline changed
ARC Prize’s explanation, verified in its announcement and methodology docs, is that the old baseline over-weighted outlier human runs. A single unusually efficient solve on one level could become the reference point for everyone else, including AI. That is a bad fit for a benchmark meant to capture how humans learn a new task on first encounter.
The new human dataset is also much larger than the launch framing implied. ARC Prize says it ran a controlled study with 458 participants in 90-minute in-person sessions, under “first-run” conditions, with no prior exposure and no hint that this was an AI benchmark. That is verified in the human dataset announcement.
What changed, specifically, is the per-level baseline. Instead of anchoring scores to the second-best human action count, ARC Prize now uses the median human action count on each level. Their stated reason is simple: it “reflects typical proficient human performance rather than outlier runs,” reduces luck, and keeps the benchmark grounded in real play rather than theoretical optimal play. That rationale is verified in the docs.
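To see why the anchor choice matters, here is a minimal sketch with made-up action counts. The numbers are illustrative, not from the ARC Prize dataset:

```python
import statistics

# Hypothetical action counts for one level, one entry per human run.
# One speed-runner solved it in 9 actions; most people needed 14-22.
runs = [9, 14, 15, 16, 17, 18, 22]

second_best = sorted(runs)[1]         # old anchor: 2nd-fewest actions -> 14
median_run = statistics.median(runs)  # new anchor: median action count -> 16

print(f"2nd-best baseline: {second_best} actions")
print(f"median baseline:   {median_run} actions")
```

An AI run of 16 actions matches the new median baseline exactly, but under the old anchor it would have been scored against 14 actions, penalized for not keeping up with a near-outlier.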
Read as a fairness fix, this makes sense. Read as a capability update, it changes much less. ARC Prize says the scoring revision lifts both human and AI scores by only about +0.5 percentage points. That claim is verified by ARC Prize, but it is still the organization describing the effect of its own redesign, not an independent audit.
What the new scoring rule actually measures
ARC-AGI-3’s official scoring method is Relative Human Action Efficiency, or RHAE. In plain English: how efficiently did the AI solve a level relative to the human baseline for that level?
The docs give a concrete example. If the human baseline is 10 actions and the AI takes 10, the level score is 1.0, or 100%. If the AI takes 20, the score is 0.25, or 25%. If the AI takes 100, the score drops to 0.01, or 1%. Those examples are verified in the methodology docs.
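Those three examples are consistent with a squared efficiency ratio. The sketch below assumes that form; the squared formula is our inference from the worked examples, not something quoted from the docs:

```python
def rhae_level_score(human_baseline: int, ai_actions: int) -> float:
    """Per-level score as a squared efficiency ratio.

    Assumption: the docs' examples (10 vs 10 -> 1.0, 10 vs 20 -> 0.25,
    10 vs 100 -> 0.01) match (baseline / actions) ** 2. The official
    formula may differ in its exact shape.
    """
    return (human_baseline / ai_actions) ** 2

print(rhae_level_score(10, 10))   # 1.0   -> 100%
print(rhae_level_score(10, 20))   # 0.25  -> 25%
print(rhae_level_score(10, 100))  # 0.01  -> 1%
```

Whatever the exact curve, the design intent is clear: inefficiency is punished steeply, not linearly.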
The key update is the cap. Previously, a level contribution topped out at 100%. Now it can go to 115%. ARC Prize’s stated reason is that the old cap could unfairly punish a generally strong run because one weak level dragged down the aggregate; the 115% cap gives some room for above-baseline efficiency elsewhere. That rationale is verified by ARC Prize’s announcement.
Here is the strategic point: the 115% cap does not mean a system can brute-force one level and magically become human-level overall. It means the benchmark now allows modest overperformance on some levels to offset underperformance on others. That is closer to how we usually think about people. Human performance is uneven.
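To make that concrete, here is a sketch of aggregation under the stated 115% cap. The plain-mean aggregation is our assumption for illustration; the docs describe the cap itself, not the exact aggregate:

```python
def aggregate_score(level_scores: list[float], cap: float = 1.15) -> float:
    """Mean of per-level scores after applying the per-level cap.

    Assumption: levels contribute equally via a plain mean; ARC Prize's
    actual aggregation may weight levels differently.
    """
    return sum(min(s, cap) for s in level_scores) / len(level_scores)

# A strong run with one weak level: above-baseline efficiency elsewhere
# can now partially offset it.
scores = [1.4, 1.1, 1.0, 0.4]             # raw per-level efficiencies
print(aggregate_score(scores, cap=1.00))  # old cap: (1.0+1.0+1.0+0.4)/4 = 0.85
print(aggregate_score(scores, cap=1.15))  # new cap: (1.15+1.1+1.0+0.4)/4 ≈ 0.91
```

The offset is modest by design: the cap lifts the aggregate a few points, not the brute-force leap critics imagined.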
| Scoring element | Old interpretation | New interpretation | What it changes |
|---|---|---|---|
| Human reference point | 2nd-best human run | Median human per level | Reduces outlier influence |
| Per-level cap | 100% | 115% | Allows modest overperformance credit |
| Intended meaning | Near-best human efficiency | Typical proficient first-run play | More realistic human comparator |
The internet argument that this was “carefully crafted” to help AI is unverified speculation. The evidence we do have points the other way: ARC Prize says both human and AI scores rose by roughly the same small amount. Without an independent re-scoring analysis across major model submissions, there is no basis to claim the change was secretly targeted at model performance.
Why the update matters for the AI-human capability gap
At launch, ARC Prize framed ARC-AGI-3 as a benchmark where humans solve 100% of environments and frontier AI scores are below 1%. That is verified in the launch post and technical report. The benchmark’s central claim was never that humans are perfect in the speed-run sense. It was that humans can figure out every environment under first-run conditions.
That is the part many readers missed. The benchmark measures solvability plus efficiency, not just binary completion. Changing the ARC-AGI-3 human baseline changes what 100% means: matching or exceeding median human action efficiency across all levels, not matching an unusually efficient outlier run.
That makes the benchmark look less like a contest against elite puzzle speed and more like a test of normal human adaptation. In that sense, the update probably makes ARC-AGI-3 fairer for humans, not meaningfully easier for AI.
It also sharpens a point that gets lost in broader AGI debates: benchmarks are not neutral windows into intelligence. They are measurement systems with design choices. When ARC Prize changes the baseline, it changes what the numbers mean. That does not make the benchmark useless. It makes it honest.
This is also why raw fluency remains a distraction. A model that sounds confident can still fail badly on novel interactive tasks, a gap we’ve covered before in AI Misconceptions: Why Fluency Isn’t Competence Today. ARC-AGI-3 is trying, imperfectly, to isolate that difference.
What remains unresolved about ARC-AGI-3
Some important questions are still open.
First, the benchmark’s human study is much stronger than a casual demo, but it is still ARC Prize’s own study. The participant count of 458 is verified. The claim that this captures a robust human reference point is plausible, but external replication would help.
Second, the benchmark still blends two ideas: can you solve it at all? and how efficiently did you solve it? That is defensible, but it means the AI-human gap can move because models got better, because humans were re-measured, or because the scoring rule changed. Those are not the same story.
Third, there is still a conceptual dispute about what “human-level” should mean here. Median human? Best human? Every human? ARC Prize has now taken a side: typical proficient first-run human performance. That is a coherent choice. It is not the only possible one.
Fourth, ARC-AGI-3 still says almost nothing on its own about job loss, deployment readiness, or enterprise automation. A model closing the benchmark gap would be evidence of stronger adaptation to novel tasks. It would not automatically answer whether the system is reliable enough in production, or whether its errors can be controlled, a separate issue we’ve looked at in Reduce LLM Hallucinations?.
My prediction is straightforward: within the next 12 months, model labs will cite improved ARC-AGI-3 numbers as evidence of progress, but the more important fight will be over which human reference point counts. The benchmark battle is moving from “what did the model score?” to “what exactly does that score normalize against?” That is a healthier argument. It is also a much harder one to spin.
Key Takeaways
- The ARC-AGI-3 human baseline now uses the median human run per level, not the second-best human run.
- The new rule is meant to measure typical first-pass human problem solving, not near-outlier efficiency.
- The 115% cap lets strong performance on some levels offset weaker performance on others.
- ARC Prize says the change raises both human and AI scores by only about +0.5 percentage points.
- The AI-human gap on ARC-AGI-3 is now more explicitly about measurement design as well as model capability.
Further Reading
- Measuring Human Performance on ARC-AGI-3 | ARC Prize, ARC Prize’s announcement of the updated baseline, new dataset, and rationale for the scoring change.
- ARC-AGI-3 Scoring Methodology, Official docs for Relative Human Action Efficiency and the median-based scoring baseline.
- Announcing ARC-AGI-3 | ARC Prize, Launch framing, initial benchmark results, and what ARC-AGI-3 is trying to measure.
- ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence, Technical report with benchmark design, human-testing details, and early AI-vs-human results.
ARC-AGI-3 did not suddenly become a softer test. It became a clearer one. The next argument won’t be whether models are improving; it will be whether we’re finally measuring the right version of “human” in the first place.
