A lot of people in AI quietly agree on one thing about rebuttal experiments: they make their papers better. More checks, more baselines, more datasets. What’s not to like?
Except a growing number of authors are saying the opposite: rebuttal experiments are making their papers worse.
TL;DR
- Rebuttal experiments mostly satisfy reviewer psychology, not scientific necessity; randomized trials show they only weakly affect decisions.
- The real risk isn’t “wasted compute,” it’s warping otherwise clean papers into rushed, messy Franken-experiments that age badly.
- Authors need a playbook: treat rebuttal asks as negotiation, not commandments, and reframe or refuse experiments that are risky, off‑scope, or structurally impossible.
Why rebuttal experiments ballooned (and why that matters)
The consensus story is simple: journals and conferences want rigor, so reviewers push for more evidence during rebuttal, and papers get stronger.
The policies themselves encourage this. Nature’s editorial criteria explicitly say that when referees raise technical concerns, “further experiments or technical work are usually required to address some or all of the referees’ concerns.” That “usually required” quietly migrated from covering obvious gaps (“your main claim isn’t supported”) to covering curiosities (“this would be nice to know”).
In machine learning, this collided with two trends:
- A cultural crackdown on “low‑effort” reviews: nobody wants to be the person writing “no major concerns” anymore.
- Cheap-ish compute and large models that can, in theory, run “one more ablation” on ten more backbones and three more datasets.
Put those together and you get what the Reddit thread described: even on solid accepts, authors are routinely asked for 5-10 extra numbers or plots in a one‑week rebuttal window. Often for “what if” scenarios reviewers are simply curious about.
The key question isn’t whether this is annoying.
It’s whether this torrent of rebuttal experiments actually changes decisions, or just distorts the science.
Empirically, the effect is modest at best. A randomized controlled trial on reviewer anchoring and author responses found mixed evidence that rebuttals move reviewer scores in any meaningful way. Authors burn themselves out on extra runs; reviewers mostly drift from “weak accept” to “weak accept, but with a warm feeling.”
So we’ve built a system where:
- Experiments added under maximum time pressure
- With minimal opportunity to debug, replicate, or interpret
- That rarely change acceptance outcomes
…become part of the permanent scientific record.
That asymmetry (huge pressure ex ante, almost no correction ex post) is where the real harm starts.
When rebuttal experiments help, and when they make a paper worse
There are exactly two kinds of rebuttal experiments:
- Decision‑relevant: If the result goes one way, you’d reject; another way, you’d accept.
- Curiosity‑driven: The result is interesting, but it doesn’t change your recommendation.
Only the first category belongs in a rebuttal. Most of what authors are being asked for is the second.
You can test this quickly:
If this experiment totally flops, would a reasonable reviewer still say the core claim is supported?
- If yes, it’s curiosity.
- If no, you probably had a real hole in the paper.
The Reddit comments describe what happens when this distinction is ignored. Reviewers ask for a “two‑week compute” experiment with “five days for rebuttal.” Authors rush a half‑baked run, and if the performance dips or is noisy, that one shaky number gets weaponized as a “gotcha.”
You’ve just taken a clean, well‑bounded claim and stapled on a random failure mode you don’t understand.
In the worst cases, it’s not just messy, it’s dangerous. Retraction Watch documented cases where authors, under pressure, doctored images for a rebuttal letter. That’s the pathological extreme of the same dynamic: when rebuttal becomes a high‑stakes, low‑time, must‑impress event, the incentives drift from “be accurate” to “get this over the line.”
And if you think, “Well, we can always correct later,” the literature disagrees.
A study summarized by Ars Technica found that rebuttals almost never change how a widely cited paper is treated: “once a paper has been published and widely cited, it’s almost impossible to contradict it with new evidence.” Our correction machinery is slow and often ineffective; errors can sit in the record for years.
So a rushed rebuttal experiment that:
- Is underpowered or misconfigured
- Lives in an appendix nobody revisits
- Never gets formally corrected
…can steer the field more than the months‑long, careful experiments in the original submission.
The thing you added to “strengthen the paper” becomes the part future work cites as evidence of a limitation.
How authors should respond to rebuttal asks (practical rules)

The obvious (and bad) strategy is: “Do everything the reviewers ask, as fast as possible.”
The better strategy is: negotiate scope.
Think of rebuttal experiments as contract terms. Some you accept, some you modify, some you reject, but you always explain why in professional, non‑defensive language.
Concrete rules:
- Run only decision‑relevant experiments
In your rebuttal, make the decision logic explicit:
“We agree this check is important and have run it; the result is X, which supports our main claim.”
vs.
“We believe this request is orthogonal to our core contribution. Even if performance were lower under this setting, the central claim, Y, would still hold, so we defer this to future work.”
You’re not refusing at random; you’re classifying.
- Be honest about time bounds
If an experiment is plausible but exceeds the rebuttal window:
“The suggested experiment requires ~2 weeks of compute; the rebuttal window is 5 days. We therefore cannot provide a reliable result without compromising quality. We plan to include this in the camera‑ready version or a follow‑up and have clarified this in the text.”
Notice the move: you invoke quality risk, not inconvenience.
- Downgrade curiosity to text edits
When a reviewer’s request is “explore every nearby axis,” offer narrative instead of numbers:
“We agree that extension to dataset D is interesting. We have added a discussion (Sec. 5) explaining why we expect similar trends and outlining limitations if distributional shift is larger than in our current benchmarks.”
You answer the intellectual question without turning the paper into a grab‑bag of half‑run experiments.
- Never add barely‑checked plots that undermine your own story
If you do run something and it looks off, the right move is often:
“Preliminary runs under setting Z produced unstable results; we believe investigating this properly requires more careful tuning than fits the rebuttal window. We therefore chose not to include these exploratory numbers to avoid over‑interpreting noisy data.”
Half‑truthful numbers age worse than admitted uncertainty.
- Use alignment with other reviewers as leverage
If one reviewer wants a big deviation and others don’t:
“We note that Reviewers 2 and 3 did not identify this as critical for acceptance. We agree it is an interesting direction, but given the above constraints, we propose to treat it as future work rather than a requirement for the current submission.”
Area chairs are risk‑averse; framing a request as idiosyncratic makes it easier for them to side with restraint.
None of this guarantees acceptance. What it does is shift the frame from “authors dodged work” to “authors made principled trade‑offs about what is scientifically necessary.”
Which is exactly what reviewers should be doing, but often aren’t.
What rebuttal experiments reveal about peer‑review incentives
The temptation is to blame individual reviewers. Some of them are absolutely optimizing for their own paper’s odds, as one top Reddit comment joked: “If they accept this paper they’re more likely to reject mine… I absolutely need to maximize my odds.”
But the deeper problem is institutional.
- Editors and program chairs reward visible thoroughness (“asked for 10 checks”) over invisible judgment (“paper was already sufficient”).
- Policies like Nature’s “further experiments are usually required” encode “more is safer” into the process.
- The correction system is broken enough that everyone over‑weights pre‑publication and under‑weights post‑publication critique.
The result is a weird equilibrium:
- We pretend peer review is the one chance to “get the science right,” so we cram curiosity, edge cases, and wish‑lists into a week‑long rebuttal window.
- We then publish all that rushed work as if it were equally considered.
- And when later work shows issues, the correction channels are so clunky that almost nobody pays attention.
Rebuttal experiments aren’t just an annoyance; they’re a symptom of a system that trusts performative exhaustiveness more than calibrated sufficiency.
One interesting counter‑pressure is transparency. eLife’s model of publishing the full review reports and author responses means that readers can see which experiments were added under pressure, and which requests authors successfully resisted. That doesn’t fix incentives overnight, but it does change the optics: reviewers have to own their “what if you also did…” impulses in public.
In AI, where model‑driven “auto‑research” is starting to appear alongside human‑run rebuttal experiments, this tension will get sharper. As more of the grind is automated, the temptation to stuff rebuttals with low‑value checks will rise, unless we’re explicit about what those checks are actually for.
Key Takeaways
- Most rebuttal experiments are curiosity‑driven, not decision‑relevant, and empirical work shows they only weakly move reviewer scores.
- Rushed experiments can permanently distort the literature, because post‑publication rebuttals and corrections rarely change how papers are treated.
- Authors should negotiate, not obey: classify requests, refuse those that are off‑scope or time‑infeasible, and offer text edits instead of risky runs.
- Reviewer incentives favor visible thoroughness over judgment, leading to a culture where “more plots” is safer than “this is already enough.”
- Transparent peer review models, where reviews and rebuttals are published, create social pressure against weaponized wish‑lists of extra experiments.
Further Reading
- “Editorial criteria and processes” (Nature): how Nature expects referees and editors to use extra experiments during revision.
- “It’s almost impossible to correct scientific papers once they are published” (Ars Technica): evidence that rebuttals and corrections rarely change how the literature treats a claim.
- “Researchers say authors doctored images for rebuttal letter” (Retraction Watch): a striking case of rebuttal‑driven additions leading to serious integrity problems.
- Randomized trial on reviewer anchoring and author responses: mixed evidence on whether rebuttals and added work move reviewer judgments.
- “Reforming peer review: publishing review reports and responses” (eLife): the argument for transparent peer review to reduce harmful rush‑to‑add behavior.
In the long run, the healthiest thing we could do for peer review might be the simplest: teach reviewers to write “this paper passes the bar; my extra questions are for your curiosity, not your acceptance” and mean it.
