YC‑Bench just produced the sort of result that usually launches a thousand hot takes: GLM‑5 vs Claude Opus on a year‑long startup simulation, within ~5% of each other in final funds, but GLM‑5 runs at roughly 11× lower inference cost.
The instinctive read is “frontier models are overpriced.” That’s the wrong lesson.
The interesting part isn’t that a “smaller” model hung with a “bigger” one.
It’s that once you look at YC‑Bench closely, almost everything that matters for long‑horizon agents is outside the base model: unit economics, scratchpad design, and how you treat randomness.
TL;DR
- YC‑Bench shows GLM‑5 nearly matching Claude Opus on long‑horizon performance at ~11× lower run cost; the cost-per-decision math is what matters, not raw IQ.
- The strongest predictor of success wasn’t leaderboard rank but persistent scratchpad use as working memory: models that rewrote their notes ~34× per run won; those with 0-2 entries mostly died.
- For product teams, the right optimization axis is: state + seeds + cost-per-decision, not “buy the fanciest API key.”
- YC‑Bench’s failures (adversarial clients, over‑parallelization) map directly onto today’s production agents, and they are mostly engineering failures, not model failures.
Why GLM-5 vs Claude Opus matters for real agentic systems
In YC‑Bench, an LLM plays CEO of a simulated AI startup for a year: hundreds of turns, $200K starting cash, payroll, clients, some of whom are secretly adversarial and inflate work after you accept their contract.
Across 12 models × 3 seeds, only three consistently end with more than the starting $200K. Claude Opus 4.6 tops the leaderboard at ~$1.27M average final funds; GLM‑5 lands at ~$1.21M, but Opus costs about $86 per run vs $7.62 for GLM‑5 in API calls.
So the surface takeaway is obvious: “GLM‑5 vs Claude Opus: they’re basically tied, but GLM‑5 is way cheaper; just use that.”
Except that’s not actually the YC‑Bench story.
The paper and leaderboard make two more interesting points:
- Success correlates much more with how the agent uses a persistent scratchpad (notes updated ~34× per run) than with whose logo is on the API.
- Frontier models still fail in very human ways: over‑parallelizing work, repeatedly trusting adversarial clients, and forgetting their own strategy once context truncates.
YC‑Bench is not telling you “GLM‑5 is secretly as smart as Opus.”
It’s telling you “for long‑horizon agents, the delta between them is mostly noise once you get the engineering right.”
If you’re still buying the most expensive model by default, you’re optimizing for the wrong variable.
Unit economics beat leaderboard rank: the 11× cost story
YC‑Bench computes cost in the most boring, useful way possible: total tokens sent to and from the model times public list prices, summed over a full one‑year rollout.
- Claude Opus: ~$86 API spend per simulation run, average final funds ≈ $1.27M
- GLM‑5: ~$7.62 per run, average final funds ≈ $1.21M
That’s your 11× factor. But the instructive ratio isn’t “cost per run,” it’s cost per marginal dollar of performance.
Rough back‑of‑the‑envelope:
- Extra capital Opus earns vs GLM‑5: about $60K per run.
- Extra cost: about $78.38 per run.
You’re paying roughly $1.30 of API cost for each extra $1,000 of simulated capital Opus generates over GLM‑5 in this benchmark: an 11× price premium for a ~5% performance gain.
In a consumer product, you’d never accept that markup by default.
In AI infra, a surprising number of teams do.
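For concreteness, the back‑of‑the‑envelope above can be reproduced in a few lines (figures are the approximate YC‑Bench numbers quoted in this post):

```python
# Approximate YC-Bench figures quoted above; all dollar values are per run.
opus = {"api_cost": 86.00, "final_funds": 1_270_000}
glm5 = {"api_cost": 7.62, "final_funds": 1_210_000}

cost_ratio = opus["api_cost"] / glm5["api_cost"]            # ~11.3x
extra_capital = opus["final_funds"] - glm5["final_funds"]   # ~$60K per run
extra_cost = opus["api_cost"] - glm5["api_cost"]            # ~$78.38 per run

# API dollars spent per extra $1,000 of simulated capital earned
marginal = extra_cost / extra_capital * 1000
print(f"{cost_ratio:.1f}x cost ratio, ${marginal:.2f} per extra $1K of capital")
```

Swap in your own model prices and outcome metric; the shape of the calculation is the point, not these specific numbers.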
YC‑Bench makes the same point we saw in local‑first coding setups like “Local LLM Coding: $500 GPU Beats Claude”: once a model crosses a competence threshold, unit economics and architecture dominate.
If you care about:
- support cost,
- margin,
- how many concurrent users you can afford,
your optimization variable is not “best mean score on a static eval.” It’s cost‑per‑decision at an acceptable quality level.
And on YC‑Bench, GLM‑5 (and even cheaper models like Kimi‑K2.5) wins that game on “revenue per API dollar.”
The practical question isn’t “Is Opus better than GLM‑5?”
It’s “Is Opus 11× better for your task?”
YC‑Bench quietly suggests the answer is usually no.
Scratchpad as working memory: the single strongest predictor
YC‑Bench’s killer design choice is brutally simple: the agent’s conversation history is truncated to the last 20 turns.
If you want to remember anything over longer horizons (which clients were adversarial, which employees are good at research, what payroll is doing to cash flow), you have one tool: a persistent scratchpad injected back into the system prompt each turn.
In the paper and project write‑up, scratchpad use ends up as the strongest predictor of success:
- Top models rewrote their notes ~34 times per run.
- Bottom models: 0-2 entries on average.
That’s not a subtle signal.
It’s a working‑memory test disguised as a startup sim.
Does scratchpad use cause better performance, or does it just correlate with “smarter models follow instructions”? The answer is “both,” but the causality isn’t mystical:
- Long‑horizon tasks are bottlenecked by state, not raw IQ.
- If the only cross‑turn state is a text file, then “writes faithful, structured notes and uses them later” is equivalent to “has external working memory.”
- Any model that does this will outperform an otherwise similar model that doesn’t, because it stops re‑learning the same painful lessons.
You can see the same pattern in real systems. Teams building multi‑step agents, from “AI employees” to internal tools, report that the single biggest reliability boost is adding a simple, structured notebook between turns. The Reddit comments on YC‑Bench basically read like field reports: no state → agents loop, forget strategy, re‑accept bad clients.
YC‑Bench just makes it painfully measurable.
So the right practical move is not to wait for bigger context windows or more “frontier” IQ. It is to:
- Give your agents a first‑class scratchpad.
- Teach them to maintain explicit, structured state (lists of rules, blacklists, strategy bullet points).
- Reward them for updating that state on each decision, not just reading it.
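As a sketch of that pattern, here is a minimal scratchpad‑round‑trip loop. It assumes `call_model` is whatever LLM client you use, and the schema fields are illustrative, not taken from the YC‑Bench harness:

```python
import json

# Illustrative schema: explicit, structured state the agent must maintain.
EMPTY_SCRATCHPAD = {"strategy": [], "client_blacklist": [], "lessons": []}

def run_turn(call_model, scratchpad, observation):
    """One agent turn: inject the notes into the system prompt, demand them back.

    `call_model(system=..., user=...)` stands in for your LLM client and is
    expected to return a JSON string with "action" and "scratchpad" keys.
    """
    system = (
        "Your persistent notes (the only memory that survives truncation):\n"
        + json.dumps(scratchpad, indent=2)
        + '\nReply as JSON: {"action": "...", "scratchpad": {...updated notes...}}'
    )
    reply = json.loads(call_model(system=system, user=observation))
    # Fall back to the old notes only if the model fails to return an update.
    return reply["action"], reply.get("scratchpad", scratchpad)
```

The important property is that the scratchpad round‑trips through every call: the model is pushed to rewrite it each turn, not just read it.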
If that sounds very close to the patterns in “Gemma 4 Native Thinking Is a Real Developer Shift” (pushing more logic and memory into your own stack), that’s because it is.
What YC‑Bench exposes about long-horizon failure modes and product design
YC‑Bench’s failures look uncomfortably familiar to anyone who has shipped an agent to users.
Three patterns stand out:
- Adversarial clients and mispriced risk
About 47% of bankruptcies in the paper come from failing to handle adversarial clients, the ones who secretly inflate work after you commit. Most models:
- keep accepting high‑reward but impossible tasks,
- never explicitly blacklist bad clients in their notes,
- don’t adjust strategy after repeated failures.
This maps directly to real agents that:
- keep calling flaky tools,
- re‑try failing actions forever,
- never update their own “don’t do this again” list.
Product implication: treat risk models and blacklists as first‑class state, not vibes inside the LLM’s head. Hard‑code rules like “never accept a task from a previously adversarial client” and store that in a shared data structure, not just in prompt text.
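A minimal sketch of that idea, with illustrative names and a made‑up strike threshold (nothing here is from the YC‑Bench harness):

```python
STRIKES_TO_BLACKLIST = 2  # assumption: two inflated contracts and you're out

def record_outcome(strikes: dict, blacklist: set, client: str, inflated: bool):
    """Update risk state in a shared data structure, outside the LLM's head."""
    if inflated:
        strikes[client] = strikes.get(client, 0) + 1
        if strikes[client] >= STRIKES_TO_BLACKLIST:
            blacklist.add(client)

def may_accept(task: dict, blacklist: set) -> bool:
    """Hard rule the agent cannot talk itself out of in-context."""
    return task["client"] not in blacklist
```

The key design choice: `may_accept` runs outside the model, so no amount of persuasive context from an adversarial client can un‑blacklist them.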
- Over‑parallelization
The paper calls out “frontier models” that over‑parallelize: they spawn too many concurrent tasks, burn payroll, and run out of cash. It’s the LLM equivalent of your agent spawning 50 sub‑agents and DDoS‑ing your own backend.
Here, higher “intelligence” makes things worse, not better: the big model finds more things to do, faster.
Product implication: rate‑limit your own agent. Constrain the number of concurrent tasks. Add budget checks and backpressure; don’t trust the model to self‑regulate.
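In Python’s asyncio, that backpressure is a few lines. The limits below are illustrative defaults, not recommendations from the paper:

```python
import asyncio

class BoundedPool:
    """Hard concurrency and spend limits the model cannot override."""

    def __init__(self, max_concurrent: int = 4, budget: float = 50.0):
        self._sem = asyncio.Semaphore(max_concurrent)
        self.budget = budget
        self.spent = 0.0

    async def run(self, cost: float, work, *args):
        # Commit spend before queueing, so a burst of submissions can't
        # overshoot the budget while tasks wait on the semaphore.
        if self.spent + cost > self.budget:
            raise RuntimeError("budget exhausted: refuse new tasks")
        self.spent += cost
        async with self._sem:  # at most max_concurrent tasks in flight
            return await work(*args)
```

The agent reasons freely inside this envelope; the envelope itself is code, not a prompt suggestion.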
If you liked “AI Agent Hack: Prompt‑Layer Security Is the Real Threat”, this is the same story from the other side: the danger isn’t one bad prompt, it’s an agent that can freely make a thousand expensive decisions before anyone notices.
- Seeds and reproducibility
YC‑Bench runs 3 random seeds per model, and individual trajectories vary wildly: in some seeds Opus and GLM‑5 achieve runaway success; in others, everyone clusters closer together or collapses.
For real products, that’s a hint: if you don’t control the random seed and capture traces, you have no idea whether tomorrow’s agent behavior is the same strategy or just a different dice roll.
Product implication: in any serious agentic pipeline, you should:
- fix or at least log seeds,
- store full traces (state + actions + scratchpad),
- replay runs when debugging.
Deterministic simulation isn’t a benchmark nicety; it’s an operational requirement.
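A toy version of a seeded, replayable rollout (the event list and the `agent_step` policy interface are invented for illustration):

```python
import random

def run_episode(agent_step, seed: int, n_turns: int = 10) -> list:
    """Seeded rollout that logs a full trace: same seed, same trace."""
    rng = random.Random(seed)  # local RNG; never share the global one with sims
    state = {"cash": 200_000}
    trace = []
    for turn in range(n_turns):
        event = rng.choice(["good_client", "adversarial_client", "quiet_week"])
        action = agent_step(state, event)
        trace.append({"turn": turn, "event": event,
                      "cash": state["cash"], "action": action})
    return trace
```

Replaying a buggy run is then just `run_episode(agent_step, seed)` with the seed you logged, instead of guessing which dice roll you saw.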
What product teams should actually change
If you take YC‑Bench seriously, “GLM‑5 vs Claude Opus” becomes mostly a procurement question. The design questions are more interesting.
A concrete checklist:
- Stop reflexively buying frontier APIs. Run your workload through at least one cheaper model and compute dollars of outcome per dollar of API. YC‑Bench’s numbers strongly suggest non‑frontier models will be “good enough” for many agentic tasks at a tiny fraction of the spend.
- Design scratchpad‑first, not prompt‑first. Give agents a durable, structured store, e.g., JSON with fields like `client_blacklist`, `hiring_rules`, `cashflow_policy`. Force them to read and update this every turn. Measure success by scratchpad edit rate over time.
- Externalize rules, don’t rely on vibes. Any heuristic you’d yell at a human (“stop accepting jobs from that scammy client”) should exist as machine‑readable state the agent can’t ignore, not just as a paragraph in the system prompt.
- Cap parallelism with hard constraints. Don’t let models decide how many tasks or tools to run at once. Set explicit resource budgets and queues; let the agent reason within that envelope.
- Invest in reproducibility before intelligence. Seed control, logging, and replayable simulations (the boring pieces YC‑Bench had to build to run this study) are exactly what you need to debug agents in production.
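One way to make “measure scratchpad edit rate” concrete: a JSON‑backed store that counts its own edits. This is a sketch; the field names are illustrative:

```python
import json
from pathlib import Path

class DurableScratchpad:
    """JSON-backed agent state that also tracks how often it changes."""

    def __init__(self, path: Path):
        self.path = path
        self.edits = 0
        if path.exists():
            self.state = json.loads(path.read_text())
        else:
            self.state = {"client_blacklist": [], "hiring_rules": [],
                          "cashflow_policy": []}

    def update(self, new_state: dict):
        if new_state != self.state:   # only genuine rewrites count as edits
            self.state = new_state
            self.edits += 1
            self.path.write_text(json.dumps(self.state, indent=2))
```

An agent whose `edits` counter sits near zero after dozens of turns is exhibiting exactly the failure mode YC‑Bench’s bottom models show.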
In other words: optimize the pipeline, not the IQ score.
Key Takeaways
- YC‑Bench’s GLM‑5 vs Claude Opus result shows that once a model clears a competence bar, unit economics and architecture dominate long‑horizon agent performance.
- GLM‑5 achieves ~95% of Claude Opus’s simulated startup performance at roughly 11× lower inference cost, making cost‑per‑decision the meaningful metric, not raw leaderboard rank.
- Persistent scratchpad memory, updated tens of times per run, is the single strongest predictor of success; agents without working memory mostly loop and die.
- The main failure modes (adversarial counterparties, over‑parallelization, forgetting strategy) mirror real‑world agent problems and are fixable with stateful design and hard constraints, not more frontier IQ.
- Product teams should pivot from “model worship” to pipeline engineering: scratchpads, seeds, budgets, and deterministic traces are where the real leverage sits.
Further Reading
- YC‑Bench: Benchmarking AI Agents for Long‑Term Planning and Consistent Execution. The original paper, with the experimental setup, leaderboard, and analysis of scratchpad use and failure modes.
- YC‑Bench: A Long‑Horizon Agent Benchmark. Project page with visualizations, the public leaderboard, and links to configs.
- collinear-ai/yc-bench (GitHub). Open‑source code and simulation harness to reproduce runs and try alternate models.
- YC‑Bench: Can Your AI Agent Run a Startup Without Going Bankrupt? Author walkthrough and practitioner‑oriented discussion of the benchmark’s design and implications.
- Local LLM Coding: $500 GPU Beats Claude. How modest models plus good engineering can outperform frontier APIs in real coding workflows.
In that light, GLM‑5 vs Claude Opus is less a rivalry and more a pricing exercise: the real moat isn’t whose model is “smartest,” it’s who can afford to make the most good decisions per dollar.
