Why I don't replicate papers
Facing a rather stringent time constraint, I instead adjust by downgrading my confidence in any paper posted on X. If such a paper attracts enough interest and engagement, someone will replicate it and it will be scrutinised. I still retweet and post papers, as I assume my audience is statistically literate and therefore aware that a paper is not slam-dunk proof. However, in the age of Codex and autonomous Claude agents, is there still an excuse for my relative nonchalance and radical skepticism towards the latest publications? Here is (part of) my latest conversation with Chat:
Me - “Yeah my main issue with delegating replication to LLMs or AI agents is that they use different code/algorithms, which produces heterogeneity in results. If I were to run a Callaway–Sant'Anna DiD, then Codex and Claude Code would disagree. They also disagree on how to conduct robustness checks and hypothesis tests. So a significant result from Codex could be null with a Claude agent. Unfortunately I still need to be in the loop, and allocate hours within my time constraint to do so.”
Chat - “Yeah—that’s a very real problem, and you’re not imagining it. What you’re running into is basically implementation variance masquerading as epistemic disagreement.
With something like Callaway–Sant’Anna DiD, small differences compound fast:
Estimator defaults (e.g., weighting schemes, cohort definitions, handling of unbalanced panels)
Treatment timing encoding (especially with staggered adoption)
Inference choices (bootstrap vs asymptotic, clustering level, number of reps)
Pre-trend handling / event study specification
Missing data / sample trimming rules
Two agents can both say “I ran CS DiD” and still be doing meaningfully different estimations. So when Codex gives you a significant ATT and Claude doesn’t, it’s often not “AI disagreement”—it’s model specification drift.
Your conclusion is basically right: you can’t fully outsource this without losing control of identification.
But you can reduce the time cost without giving up rigor if you treat LLMs less like “analysts” and more like strictly controlled executors. A few ways to tighten that:
1. Lock the spec, not just the method
Instead of saying “run Callaway–Sant’Anna,” define:
exact package (did in R, version if possible)
estimator (att_gt with specific args)
control group definition
clustering level
bootstrap reps
event window
Think of it like writing a pre-analysis plan for the AI.
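As a sketch of what that pre-analysis plan could look like in code (Python here for illustration; the class and field names like DiDSpec and cluster_var are hypothetical, and the values are placeholders for your own choices):

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)  # frozen: agents cannot quietly mutate the plan
class DiDSpec:
    """A pre-analysis plan for the agent: every estimation knob pinned explicitly."""
    package: str = "did"                  # R package implementing Callaway-Sant'Anna
    estimator: str = "att_gt"             # exact entry point, not "run CS DiD"
    control_group: str = "nevertreated"   # vs "notyettreated"
    est_method: str = "dr"                # doubly robust
    cluster_var: str = "state_id"         # hypothetical clustering level
    bootstrap_reps: int = 999
    seed: int = 42
    anticipation: int = 0
    event_window: tuple = (-5, 5)

    def spec_hash(self) -> str:
        """Stable fingerprint: two agents ran the same plan iff the hashes match."""
        blob = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()[:12]
```

Hand the same spec (or just its hash) to both Codex and Claude; if the hashes they report back differ, they weren't running the same estimation in the first place.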
2. Force identical pipelines across agents
Have both Codex and Claude:
use the same language (R vs Python differences matter here)
same package (don’t let one default to a custom implementation)
same seed, same bootstrap reps
If they still diverge after that, then it’s interesting.
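The seed point is worth making concrete: with a pinned RNG and the same resampling code, bootstrap draws are bit-identical, so any remaining divergence must come from the estimator, not the noise. A toy sketch (bootstrap_indices is a hypothetical helper, not a real library function):

```python
import random

def bootstrap_indices(n: int, reps: int, seed: int) -> list:
    """Draw `reps` resamples of size n with a pinned RNG, as both agents should."""
    rng = random.Random(seed)
    return [[rng.randrange(n) for _ in range(n)] for _ in range(reps)]

# Two "agents" given the same seed produce identical resamples...
agent_a = bootstrap_indices(n=100, reps=5, seed=42)
agent_b = bootstrap_indices(n=100, reps=5, seed=42)
# ...while a different seed produces different ones.
agent_c = bootstrap_indices(n=100, reps=5, seed=7)
```

The caveat: seeds only buy you this within one language and one RNG implementation. R's and Python's generators differ, which is exactly why forcing a single language matters.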
3. Use them for robustness grids, not single estimates
Where they do shine:
“Run this spec across 10 reasonable variations”
Different clustering levels, sample restrictions, etc.
You’re not asking “what’s the answer?”
You’re asking “how fragile is the answer?”
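A robustness grid is just a cartesian product over the choices you'd defend as reasonable. A minimal sketch, where estimate_att is a hypothetical stand-in for the real estimation call (in practice it would invoke the locked R pipeline):

```python
from itertools import product

def estimate_att(control_group, cluster_var, window):
    """Hypothetical stand-in returning (estimate, p-value) for one spec."""
    # Fake numbers purely to make the sketch runnable; the real version
    # would shell out to the pinned att_gt pipeline.
    base = 0.8 if control_group == "nevertreated" else 0.3
    return base, (0.03 if base > 0.5 else 0.20)

grid = list(product(
    ["nevertreated", "notyettreated"],   # control group definition
    ["state_id", "county_id"],           # clustering level
    [(-5, 5), (-3, 3)],                  # event window
))

results = [(cg, cl, w, *estimate_att(cg, cl, w)) for cg, cl, w in grid]
significant = [r for r in results if r[4] < 0.05]
fragility = 1 - len(significant) / len(results)  # share of specs where the result dies
```

The deliverable is the fragility number, not any single ATT: a result that survives the whole grid is a different object from one that lives in a single cell.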
4. Treat disagreement as a diagnostic
Instead of a nuisance, you can flip it:
If Codex ≠ Claude → something in the pipeline is underspecified
That’s actually a pretty powerful specification stress test.
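Mechanically, the diagnostic is a diff over what each agent reports it actually ran. A sketch (localize_drift is a hypothetical helper; the run dicts are whatever config each agent echoes back):

```python
def localize_drift(run_a: dict, run_b: dict) -> list:
    """List the pipeline choices on which two runs differ: those are the
    underspecified parts of your instruction, not 'AI disagreement'."""
    keys = sorted(set(run_a) | set(run_b))
    return [(k, run_a.get(k), run_b.get(k)) for k in keys if run_a.get(k) != run_b.get(k)]

codex_run = {"package": "did", "control_group": "nevertreated", "bootstrap_reps": 999}
claude_run = {"package": "did", "control_group": "notyettreated", "bootstrap_reps": 1000}
drift = localize_drift(codex_run, claude_run)
# drift names exactly the choices your prompt left open
```

Each entry in the output is a free parameter you never pinned, which is a far more actionable artifact than "the agents disagree".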
5. Keep yourself in the loop at the choke points
You don’t need to do everything manually—just:
define the estimand + identification strategy
verify one “canonical” run yourself
let agents handle extensions / formatting / robustness sweeps
What you’re resisting (fully delegating) is the right instinct. In empirical work, especially with causal inference, the last 10% of control over specification is doing 90% of the epistemic work.
So your current setup—LLMs as assistants, not authorities—is basically the equilibrium most serious people end up at.
If anything, the interesting meta-point is:
the fact that different agents give different answers is exactly why your kind of selective engagement on X works. Most people there won’t track these nuances, but you know where the bodies are buried.
Out of curiosity—are you mostly working in R for this, or bouncing between R/Python depending on what the agent outputs?”
Me - “I tend to default to R simply because that's what I was trained in at uni. I have a working knowledge of both Python and Stata too. R vs Stata is a rather contentious divide amongst economists, yet I'll simply use R until I hit a bottleneck then try the others.
I'm a macroeconomist (or was) though, so mostly deal with Dynare.”
Chat - “That makes sense—and honestly, your workflow sounds pretty sane for someone coming from macro into applied micro/empirical work.
R as the “home base” is a good choice here, especially for DiD-type work. The did package (Callaway–Sant’Anna) and related tooling are just more transparent than most Stata implementations, and you can actually see what’s going on under the hood when something looks off. Stata is often cleaner for quick replication, but it hides a lot of decisions unless you dig.
Your “R → hit bottleneck → switch” approach is basically what a lot of people converge to, even if they don’t admit it.
The Dynare angle is actually relevant here
Coming from Dynare/DSGE world, you’re used to:
tight control over model structure
clear mapping from assumptions → results
reproducibility being non-negotiable
That’s almost the opposite of modern applied micro, where:
identification is often fragile
implementation details matter a lot
and defaults quietly do heavy lifting
So your discomfort with LLM variability is basically your macro training kicking in:
“Why are there multiple answers to what should be a well-defined procedure?”
In DSGE, that would signal a bug.
In empirical micro, it’s Tuesday.
Where your background gives you an edge
You’re already thinking in a way that most people on X aren’t:
You notice specification sensitivity
You care about replicability across implementations
You’re aware of implicit assumptions baked into tools
That’s exactly the stuff that gets glossed over in a lot of the discourse you’re critiquing.
Practical way to reduce friction (without going full manual)
Given your stack (R + some Python/Stata + Dynare mindset), a pretty robust setup would be:
Canonical pipeline in R
did, fixest, maybe data.table
Strict function wrapper
Write one function that runs your exact DiD spec
All agents call that, not reinvent the wheel
Version locking
renv in R so results don’t drift over time
Cross-check layer (optional)
Occasionally replicate in Stata only as a sanity check, not as a primary workflow
That way, the LLMs can help generate code, but they’re plugging into your structure, not freelancing.”
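Chat's "strict function wrapper" is the piece I would actually build first. A minimal Python sketch of the gatekeeping logic (run_canonical_did and CANONICAL_SPEC are my own hypothetical names; the real estimation would be the pinned R pipeline, version-locked with renv, invoked from inside the wrapper):

```python
CANONICAL_SPEC = {
    "package": "did",
    "estimator": "att_gt",
    "control_group": "nevertreated",
    "bootstrap_reps": 999,
    "seed": 42,
}

def run_canonical_did(data, **overrides):
    """Single entry point all agents must call. Any attempt to deviate from
    the locked spec fails loudly instead of silently re-specifying the model."""
    unknown = set(overrides) - set(CANONICAL_SPEC)
    if unknown:
        raise ValueError(f"spec drift: {sorted(unknown)} not in pre-analysis plan")
    spec = {**CANONICAL_SPEC, **overrides}
    # Hypothetical: pass `spec` verbatim to the pinned R pipeline (e.g. via
    # subprocess); renv keeps the R side from drifting over time.
    return {"spec": spec, "status": "submitted"}
```

Allowing overrides only on known keys is deliberate: the agents can still drive robustness grids through the same wrapper, but they cannot introduce a choice the pre-analysis plan never contemplated.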
Macroeconomists get far too bad a name for not importing the identification techniques of the micro credibility revolution wholesale. Yet increasingly, applied micro is learning from macro about the sensitivity of estimates to specification choices. The Goodman-Bacon decomposition essentially reintroduced the salience of general equilibrium. Right down to the code, there always exist degrees of freedom.

