ARC Prize didn’t just tweak a benchmark. The ARC-AGI-3 human baseline now uses the median human run per level instead of the second-best human run, and the per-level score cap rose from 100% to 115%. That sounds minor. It isn’t.
The obvious reading is that ARC-AGI-3 got easier. The better reading is that ARC Prize is redefining what “human-level” is supposed to mean: not an unusually efficient run, but typical first-pass human problem solving under the same conditions AI gets. On this benchmark, the AI-human gap is now partly a measurement question, not just a capability question.
That distinction matters because ARC-AGI-3 is explicitly trying to test novel problem-solving, not memorized competence. If you use the wrong human reference point, you end up benchmarking against a speed-run.
Why ARC-AGI-3’s human baseline changed
ARC Prize’s explanation, verified in its announcement and methodology docs, is that the old baseline over-weighted outlier human runs. A single unusually efficient solve on one level could become the reference point for everyone else, including AI. That is a bad fit for a benchmark meant to capture how humans learn a new task on first encounter.
The new human dataset is also much larger than the launch framing implied. ARC Prize says it ran a controlled study with 458 participants in 90-minute in-person sessions, under “first-run” conditions, with no prior exposure and no hint that this was an AI benchmark. That is verified in the human dataset announcement.
What changed, specifically, is the per-level baseline. Instead of anchoring scores to the second-best human action count, ARC Prize now uses the median human action count on each level. Their stated reason is simple: it “reflects typical proficient human performance rather than outlier runs,” reduces luck, and keeps the benchmark grounded in real play rather than theoretical optimal play. That rationale is verified in the docs.
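To see why the anchor choice matters, here is a minimal sketch with made-up action counts. The numbers are illustrative, not from the ARC Prize dataset:

```python
import statistics

# Hypothetical action counts for one level, one entry per human run.
# One speed-runner solved it in 9 actions; most people needed 14-22.
runs = [9, 14, 15, 16, 17, 18, 22]

second_best = sorted(runs)[1]         # old anchor: 2nd-fewest actions -> 14
median_run = statistics.median(runs)  # new anchor: median action count -> 16

print(f"2nd-best baseline: {second_best} actions")
print(f"median baseline:   {median_run} actions")
```

An AI run of 16 actions matches the new median baseline exactly, but under the old anchor it would have been scored against 14 actions, penalized for not keeping up with a near-outlier.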
Read as a fairness fix, this makes sense. Read as a capability update, it changes much less. ARC Prize says the scoring revision lifts both human and AI scores by only about +0.5 percentage points. That claim is verified by ARC Prize, but it is still the organization describing the effect of its own redesign, not an independent audit.
What the new scoring rule actually measures
ARC-AGI-3’s official scoring method is Relative Human Action Efficiency, or RHAE. In plain English: how efficiently did the AI solve a level relative to the human baseline for that level?
The docs give a concrete example. If the human baseline is 10 actions and the AI takes 10, the level score is 1.0, or 100%. If the AI takes 20, the score is 0.25, or 25%. If the AI takes 100, the score drops to 0.01, or 1%. Those examples are verified in the methodology docs.
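Those three examples are consistent with a squared efficiency ratio. The sketch below assumes that form; the squared formula is our inference from the worked examples, not something quoted from the docs:

```python
def rhae_level_score(human_baseline: int, ai_actions: int) -> float:
    """Per-level score as a squared efficiency ratio.

    Assumption: the docs' examples (10 vs 10 -> 1.0, 10 vs 20 -> 0.25,
    10 vs 100 -> 0.01) match (baseline / actions) ** 2. The official
    formula may differ in its exact shape.
    """
    return (human_baseline / ai_actions) ** 2

print(rhae_level_score(10, 10))   # 1.0   -> 100%
print(rhae_level_score(10, 20))   # 0.25  -> 25%
print(rhae_level_score(10, 100))  # 0.01  -> 1%
```

Whatever the exact curve, the design intent is clear: inefficiency is punished steeply, not linearly.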
The key update is the cap. Previously, a level contribution topped out at 100%. Now it can go to 115%. ARC Prize’s stated reason is that the old cap could unfairly punish a generally strong run because one weak level dragged down the aggregate; the 115% cap gives some room for above-baseline efficiency elsewhere. That rationale is verified by ARC Prize’s announcement.
Here is the strategic point: the 115% cap does not mean a system can brute-force one level and magically become human-level overall. It means the benchmark now allows modest overperformance on some levels to offset underperformance on others. That is closer to how we usually think about people. Human performance is uneven.
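To make that concrete, here is a sketch of aggregation under the stated 115% cap. The plain-mean aggregation is our assumption for illustration; the docs describe the cap itself, not the exact aggregate:

```python
def aggregate_score(level_scores: list[float], cap: float = 1.15) -> float:
    """Mean of per-level scores after applying the per-level cap.

    Assumption: levels contribute equally via a plain mean; ARC Prize's
    actual aggregation may weight levels differently.
    """
    return sum(min(s, cap) for s in level_scores) / len(level_scores)

# A strong run with one weak level: above-baseline efficiency elsewhere
# can now partially offset it.
scores = [1.4, 1.1, 1.0, 0.4]             # raw per-level efficiencies
print(aggregate_score(scores, cap=1.00))  # old cap: (1.0+1.0+1.0+0.4)/4 = 0.85
print(aggregate_score(scores, cap=1.15))  # new cap: (1.15+1.1+1.0+0.4)/4 ≈ 0.91
```

The offset is modest by design: the cap lifts the aggregate a few points, not the brute-force leap critics imagined.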
| Scoring element | Old interpretation | New interpretation | What it changes |
|---|---|---|---|
| Human reference point | 2nd-best human run | Median human per level | Reduces outlier influence |
| Per-level cap | 100% | 115% | Allows modest overperformance credit |
| Intended meaning | Near-best human efficiency | Typical proficient first-run play | More realistic human comparator |
The internet argument that this was “carefully crafted” to help AI is unverified speculation. The evidence we do have points the other way: ARC Prize says both human and AI scores rose by roughly the same small amount. Without an independent re-scoring analysis across major model submissions, there is no basis to claim the change was secretly targeted at model performance.
Why the update matters for the AI-human capability gap
At launch, ARC Prize framed ARC-AGI-3 as a benchmark where humans solve 100% of environments and frontier AI scores are below 1%. That is verified in the launch post and technical report. The benchmark’s central claim was never that humans are perfect in the speed-run sense. It was that humans can figure out every environment under first-run conditions.
That is the part many readers missed. The benchmark measures solvability plus efficiency, not just binary completion. Changing the ARC-AGI-3 human baseline changes what 100% means: matching or exceeding median human action efficiency across all levels, not matching an unusually efficient outlier run.
That makes the benchmark look less like a contest against elite puzzle speed and more like a test of normal human adaptation. In that sense, the update probably makes ARC-AGI-3 fairer for humans, not meaningfully easier for AI.
It also sharpens a point that gets lost in broader AGI debates: benchmarks are not neutral windows into intelligence. They are measurement systems with design choices. When ARC Prize changes the baseline, it changes what the numbers mean. That does not make the benchmark useless. It makes it honest.
This is also why raw fluency remains a distraction. A model that sounds confident can still fail badly on novel interactive tasks, a gap we’ve covered before in AI Misconceptions: Why Fluency Isn’t Competence Today. ARC-AGI-3 is trying, imperfectly, to isolate that difference.
What remains unresolved about ARC-AGI-3
Some important questions are still open.
First, the benchmark’s human study is much stronger than a casual demo, but it is still ARC Prize’s own study. The participant count of 458 is verified. The claim that this captures a robust human reference point is plausible, but external replication would help.
Second, the benchmark still blends two ideas: can you solve it at all? and how efficiently did you solve it? That is defensible, but it means the AI-human gap can move because models got better, because humans were re-measured, or because the scoring rule changed. Those are not the same story.
Third, there is still a conceptual dispute about what “human-level” should mean here. Median human? Best human? Every human? ARC Prize has now taken a side: typical proficient first-run human performance. That is a coherent choice. It is not the only possible one.
Fourth, ARC-AGI-3 still says almost nothing on its own about job loss, deployment readiness, or enterprise automation. A model closing the benchmark gap would be evidence of stronger adaptation to novel tasks. It would not automatically answer whether the system is reliable enough in production, or whether its errors can be controlled, a separate issue we’ve looked at in Reduce LLM Hallucinations?.
My prediction is straightforward: within the next 12 months, model labs will cite improved ARC-AGI-3 numbers as evidence of progress, but the more important fight will be over which human reference point counts. The benchmark battle is moving from “what did the model score?” to “what exactly does that score normalize against?” That is a healthier argument. It is also a much harder one to spin.
Key Takeaways
- The ARC-AGI-3 human baseline now uses the median human run per level, not the second-best human run.
- The new rule is meant to measure typical first-pass human problem solving, not near-outlier efficiency.
- The 115% cap lets strong performance on some levels offset weaker performance on others.
- ARC Prize says the change raises both human and AI scores by only about +0.5 percentage points.
- The AI-human gap on ARC-AGI-3 is now more explicitly about measurement design as well as model capability.
Further Reading
- Measuring Human Performance on ARC-AGI-3 | ARC Prize, ARC Prize’s announcement of the updated baseline, new dataset, and rationale for the scoring change.
- ARC-AGI-3 Scoring Methodology, Official docs for Relative Human Action Efficiency and the median-based scoring baseline.
- Announcing ARC-AGI-3 | ARC Prize, Launch framing, initial benchmark results, and what ARC-AGI-3 is trying to measure.
- ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence, Technical report with benchmark design, human-testing details, and early AI-vs-human results.
ARC-AGI-3 did not suddenly become a softer test. It became a clearer one. The next argument won’t be whether models are improving; it will be whether we’re finally measuring the right version of “human” in the first place.
