A guy on Reddit squints at an NVIDIA slide, sees “2×” at the edge of a curve, and declares that NVIDIA Rubin performance is disappointing, “only 2× at max throughput.”
In the comments, someone quietly points out that the y‑axis is TPS per megawatt. Not raw tokens per second. Efficiency.
TL;DR
- The “Rubin is only 2×” meme is a chart‑reading error sitting on top of a real confusion: which operating point are we talking about?
- Rubin’s interesting win isn’t peak FLOPS; it’s tokens per megawatt for interactive, long‑context workloads once you design the system and software around it.
- If you buy GPUs on headline “up to 5×” numbers, you’re gambling; if you buy on token/MW at your latency target, you’re doing hardware economics like an adult.
NVIDIA Rubin performance: 2× isn’t the whole story
Let’s compress the facts into one paragraph.
NVIDIA’s own developer blog shows Vera Rubin NVL72 delivering “up to 10× higher token factory throughput per megawatt” than Blackwell NVL72 on a Kimi‑K2 reasoning workload at comparable interactivity, while other parts of the same chart look more like ~2× gains at the far right “max throughput” end. The Reddit post grabbed that “2×” corner, ignored the efficiency axis, and announced that Rubin secretly under‑delivers. Meanwhile, NVIDIA’s marketing headlines still say “up to 5× inference performance” and “10× lower cost per token” for some scenarios, especially massive‑context, agentic workloads. All of these statements can be true at once; it just depends on where on the curve you’re operating.
Here’s the uncomfortable bit: most teams don’t actually know where on that curve they live.
They think they’re “max throughput.” They’re usually not.
They’re somewhere in the messy middle, constrained by latency SLOs, context windows, and networking, not by raw math.
Why “up to” claims and charts diverge (operating points matter)
Picture two clusters in adjacent racks.
Same model, same number of users, same GPU count.
In rack A, the infra team chases a benchmark: giant batches, loose latency targets, everything tuned to push tokens per second to the ceiling.
In rack B, product demanded snappy chat and agentic workflows. Batches are smaller. Requests bounce through tools. Contexts are huge. The system never gets near that clean “max throughput” corner.
Rubin vs Blackwell looks very different in those two rooms.
NVIDIA’s charts quietly admit this. They don’t plot a single number; they sweep along an axis that trades latency per user against throughput and cost per token. At interactive operating points (the part of the curve where you’re not allowed to make users wait seconds for the first token), Rubin NVL72 jumps to those “up to 10× tokens/MW” gains.
It’s when you slide all the way out to “I will batch everything, users be damned” that the ratio sinks toward ~2×.
So the Reddit post accidentally asks a good question in the wrong way:
“Is Rubin really ‘only 2×’ faster at max throughput?”
The better question is: Why were you planning to operate at that point in the first place?
Most real products don’t live there. Their operating point is set by UX, SLAs, and traffic patterns: knobs marketing slides can’t see.
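If “operating point” sounds abstract, here’s a minimal sketch of how a latency SLO picks it for you. The batch/latency/throughput numbers are invented for a single hypothetical system; the selection logic is the point.

```python
# Illustrative sketch only: the numbers below are made up to show how an SLO
# pins your operating point. They are not measured Rubin or Blackwell data.

# (batch_size, p95_first_token_latency_s, cluster_tokens_per_s) for one hypothetical system
measured_points = [
    (1,   0.15,  4_000),
    (8,   0.40, 22_000),
    (32,  0.90, 55_000),
    (128, 2.50, 90_000),   # "max throughput" corner: great TPS, terrible first-token latency
]

SLO_P95_FIRST_TOKEN_S = 0.5  # what product/UX will actually tolerate

# Your real operating point: the highest-throughput setting that still meets the SLO.
feasible = [p for p in measured_points if p[1] <= SLO_P95_FIRST_TOKEN_S]
batch, latency, throughput = max(feasible, key=lambda p: p[2])

print(f"operating point: batch={batch}, p95 first token={latency}s, {throughput} tok/s")
# Note how far this sits from the 90k tok/s "max throughput" corner the memes argue about.
```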
Tokens per megawatt vs peak FLOPS: the efficiency argument

Rubin’s architecture is weird on purpose.
You get Rubin GPUs with swollen HBM4 bandwidth and capacity. You get Rubin CPX, a prefilling monster explicitly “purpose‑built to handle million‑token coding and generative video applications,” as NVIDIA puts it. You get NVLink‑6, fatter nodes, and a system like Vera Rubin NVL72 that they happily sell as a “token factory.”
If you’re still evaluating that using peak FLOPS, you’re using a stopwatch to measure the ocean.
For LLM inference at scale, three numbers matter more:
- Tokens/sec at your latency target (per user, not just per cluster)
- Tokens per megawatt at that same target
- Cost per token once you price in the whole rack, networking, memory, power, cooling, and the GPUs themselves
Peak PFLOPS tells you the ceiling if everything else disappears.
Tokens/MW tells you how much work you can actually keep doing once the power company, the interconnect, and your bill of materials show up.
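To make that concrete, here’s a back‑of‑the‑envelope sketch of the two ratios that actually drive the purchase decision. Every number is hypothetical; the arithmetic is the point.

```python
# Back-of-the-envelope sketch with invented numbers (not vendor figures).

cluster_tokens_per_s = 60_000   # sustained throughput at your latency target
rack_power_mw        = 0.12     # GPUs + CPUs + switches + cooling, in megawatts
rack_cost_per_hour   = 350.0    # amortized hardware + power + facility, in dollars

tokens_per_mw = cluster_tokens_per_s / rack_power_mw           # 500,000 tok/s per MW
tokens_per_hour = cluster_tokens_per_s * 3600                  # 216 million tokens/hour
cost_per_million_tokens = rack_cost_per_hour / (tokens_per_hour / 1e6)

print(f"{tokens_per_mw:,.0f} tok/s per MW")
print(f"${cost_per_million_tokens:.2f} per million tokens at this operating point")
```

Peak FLOPS never appears in that calculation; the power envelope and the bill for the whole rack do.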
NVIDIA’s own claim, “up to 10× higher token factory throughput per megawatt” for a reasoning model, is them finally saying the quiet part out loud: the scarce resource is no longer FLOPS, it’s watts and rack slots.
NextPlatform’s analysis pushes this further. In their simulations, swapping Blackwell for Rubin in 64‑GPU systems wired with copper doesn’t magically multiply throughput. The interconnect and power envelope blunt a lot of the theoretical math. You only get the big wins when system design (optics, topology, software) shifts alongside the silicon.
So “Rubin throughput 2×” isn’t a gotcha.
It’s a hint that you’re staring at the wrong axis.
What buyers and engineers should benchmark before upgrading

If you’re an engineer or procurement lead, this all collapses to a painful, practical question:
“How do we decide whether Rubin is worth buying for us?”
Not for the Kimi‑K2 reasoning benchmark. Not for NVIDIA’s slide deck.
For your workload.
Here’s a testing recipe that respects reality rather than marketing:
- Pin your latency SLOs. First‑token latency, time‑to‑answer, tail behavior (p95/p99). Decide what’s non‑negotiable. Your real operating point lives inside those constraints.
- Recreate your traffic shape. Don’t benchmark with a single steady stream. Use your actual mix: chat, long‑context retrieval, quirky agent chains, spikes. Rubin CPX’s big win only appears when those million‑token sequences show up.
- Measure tokens/sec and tokens/MW at that SLO. For each candidate system (Blackwell rack, Rubin rack, Rubin+CPX, maybe even “one size smaller with optics”), collect:
  - tokens/sec per user
  - total system power (GPUs, switches, CPUs, cooling if you can get it)
  - derived tokens/MW and cost per token
  The hardware economics curve you care about isn’t “FLOPS vs dollars”; it’s “tokens at SLA vs dollars and megawatts.” Link this back to how you already think about hardware economics: effective cost per unit of useful work, not per unit of theoretical capacity. (A sketch of this bookkeeping follows the list.)
- Probe the edges. Run one set of tests closer to “max throughput” (heavier batching, more relaxed latency) and another closer to “snappier UX,” and watch how the Rubin‑vs‑Blackwell gap changes. You will likely see:
  - modest gains near the max‑throughput extreme
  - much larger gains in the “feels fast” region your users actually notice
- Factor software maturity into the decision. NVIDIA’s own engineers hint that some of Rubin’s advantage is still “software‑locked”: kernels, TensorRT‑LLM, scheduler tricks that mature over time. If your upgrade horizon is 12-24 months, assume Rubin CPX efficiency improves as those layers catch up.
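Here’s the promised sketch of the comparison bookkeeping, assuming you already have per‑run measurements from your own load generator. Every field name and number below is hypothetical; substitute whatever your harness actually records.

```python
# Minimal sketch: compare candidate systems only at runs that meet your latency SLO.
# All configurations and numbers are invented for illustration.
from dataclasses import dataclass

@dataclass
class Run:
    system: str               # e.g. "Blackwell NVL72", "Rubin NVL72", "Rubin + CPX"
    p95_first_token_s: float  # measured tail latency for this configuration
    tokens_per_s: float       # sustained cluster throughput during the run
    users: int                # concurrent users served
    power_mw: float           # whole-rack power: GPUs, CPUs, switches, cooling
    cost_per_hour: float      # amortized hardware + power + facility, in dollars

SLO_P95_FIRST_TOKEN_S = 0.5

runs = [
    Run("Blackwell NVL72", 0.45,  40_000, 800, 0.12, 300.0),  # made-up numbers
    Run("Rubin NVL72",     0.30,  95_000, 800, 0.13, 420.0),  # made-up numbers
    Run("Rubin NVL72",     2.10, 180_000, 800, 0.13, 420.0),  # "max throughput" corner
]

for r in runs:
    if r.p95_first_token_s > SLO_P95_FIRST_TOKEN_S:
        print(f"{r.system}: misses SLO ({r.p95_first_token_s}s p95), excluded")
        continue
    tok_per_user = r.tokens_per_s / r.users
    tok_per_mw = r.tokens_per_s / r.power_mw
    cost_per_m_tok = r.cost_per_hour / (r.tokens_per_s * 3600 / 1e6)
    print(f"{r.system}: {tok_per_user:.1f} tok/s/user, "
          f"{tok_per_mw:,.0f} tok/s/MW, ${cost_per_m_tok:.2f}/M tok")
```

The interesting output isn’t any single row; it’s how the gap between rows moves when you rerun the whole table at a looser or tighter SLO.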
After that exercise, you may very well conclude that Rubin “only” gives you 2× at your self‑inflicted benchmark corner.
But it might give you 5-10× cheaper tokens where your product actually lives.
That’s the curve you’re buying.
And it’s why engineers should stop treating NVIDIA Rubin performance as a single scalar and start treating it as a map of operating points.
Key Takeaways
- The “NVIDIA admits to only 2×” meme comes from misreading a TPS/MW chart and ignoring that Rubin’s gains vary hugely across operating points.
- Rubin’s real edge shows up in tokens per megawatt for interactive, long‑context workloads, not in a single “Rubin vs Blackwell” FLOPS ratio.
- Marketing “up to 5×” numbers and Reddit “only 2×” screenshots are both incomplete; what matters is tokens/sec and tokens/MW at your latency SLOs.
- Before upgrading, teams should benchmark with real traffic shapes, fixed latency targets, and full‑rack power to derive cost per token, not just peak throughput.
Further Reading
- “Inside the NVIDIA Rubin Platform, Six New Chips, One AI Supercomputer”: NVIDIA’s technical breakdown with Rubin vs Blackwell charts and tokens/MW curves.
- “NVIDIA Unveils Rubin CPX, A New Class of GPU Designed for Massive-Context Inference”: official press release outlining Rubin CPX’s role in million‑token and generative video workloads.
- “Copper Wires Have Already Failed Clustered AI Systems”: NextPlatform’s analysis of how interconnect and power shape real throughput gains.
- “NVIDIA Launches Vera Rubin NVL72 AI Supercomputer at CES”: launch coverage summarizing NVIDIA’s headline “up to 5×” and “10× lower cost per token” claims.
- “NVIDIA Rubin performance and open-weight models”: how Rubin‑class hardware interacts with open‑weight LLM deployment economics.
In a year or two, when the first Rubin racks start showing up on eBay and the memes have moved on, this is what will still matter: not who hit the biggest “up to” number, but who quietly optimized for tokens per megawatt at the latency their users could feel.
