Why I don't replicate papers
Facing a rather stringent time constraint, I instead adjust by downgrading my confidence in any paper posted on X. If such a paper attracts enough interest and engagement, someone will replicate it and it will be scrutinised. I still retweet and post papers, as I assume my audience is statistically literate and therefore aware that a paper is not slam-dunk proof. However, in the age of Codex and autonomous Claude agents, is there still an excuse for my relative nonchalance and radical skepticism towards the latest publications? Here is (part of) my latest conversation with Chat:
Me - “Yeah my main issue with delegating replication to LLMs or AI agents is that they use different code/algorithms, which produces heterogeneity in results. If I were to run a Callaway–Sant'Anna DiD, then Codex and Claude Code would disagree. They also disagree on how to conduct robustness checks and hypothesis tests. So a significant result from Codex could be null with a Claude agent. Unfortunately I still need to be in the loop, and allocate hours within my time constraint to do so.”
Chat - “Yeah—that’s a very real problem, and you’re not imagining it. What you’re running into is basically implementation variance masquerading as epistemic disagreement.
With something like Callaway–Sant’Anna DiD, small differences compound fast:
Estimator defaults (e.g., weighting schemes, cohort definitions, handling of unbalanced panels)
Treatment timing encoding (especially with staggered adoption)
Inference choices (bootstrap vs asymptotic, clustering level, number of reps)
Pre-trend handling / event study specification
Missing data / sample trimming rules
Two agents can both say “I ran CS DiD” and still be doing meaningfully different estimations. So when Codex gives you a significant ATT and Claude doesn’t, it’s often not “AI disagreement”—it’s model specification drift.
Your conclusion is basically right: you can’t fully outsource this without losing control of identification.
But you can reduce the time cost without giving up rigor if you treat LLMs less like “analysts” and more like strictly controlled executors. A few ways to tighten that:
1. Lock the spec, not just the method
Instead of saying “run Callaway–Sant’Anna,” define:
exact package (did in R, version if possible)
estimator (att_gt with specific args)
control group definition
clustering level
bootstrap reps
event window
Think of it like writing a pre-analysis plan for the AI.
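As a sketch of what that pre-analysis plan could look like in code (Python here for illustration; the class and field names like DiDSpec and cluster_var are hypothetical, and the values are placeholders for your own choices):

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)  # frozen: agents cannot quietly mutate the plan
class DiDSpec:
    """A pre-analysis plan for the agent: every estimation knob pinned explicitly."""
    package: str = "did"                  # R package implementing Callaway-Sant'Anna
    estimator: str = "att_gt"             # exact entry point, not "run CS DiD"
    control_group: str = "nevertreated"   # vs "notyettreated"
    est_method: str = "dr"                # doubly robust
    cluster_var: str = "state_id"         # hypothetical clustering level
    bootstrap_reps: int = 999
    seed: int = 42
    anticipation: int = 0
    event_window: tuple = (-5, 5)

    def spec_hash(self) -> str:
        """Stable fingerprint: two agents ran the same plan iff the hashes match."""
        blob = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()[:12]
```

Hand the same spec (or just its hash) to both Codex and Claude; if the hashes they report back differ, they weren't running the same estimation in the first place.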
2. Force identical pipelines across agents
Have both Codex and Claude:
use the same language (R vs Python differences matter here)
same package (don’t let one default to a custom implementation)
same seed, same bootstrap reps
If they still diverge after that, then it’s interesting.
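The seed point is worth making concrete: with a pinned RNG and the same resampling code, bootstrap draws are bit-identical, so any remaining divergence must come from the estimator, not the noise. A toy sketch (bootstrap_indices is a hypothetical helper, not a real library function):

```python
import random

def bootstrap_indices(n: int, reps: int, seed: int) -> list:
    """Draw `reps` resamples of size n with a pinned RNG, as both agents should."""
    rng = random.Random(seed)
    return [[rng.randrange(n) for _ in range(n)] for _ in range(reps)]

# Two "agents" given the same seed produce identical resamples...
agent_a = bootstrap_indices(n=100, reps=5, seed=42)
agent_b = bootstrap_indices(n=100, reps=5, seed=42)
# ...while a different seed produces different ones.
agent_c = bootstrap_indices(n=100, reps=5, seed=7)
```

The caveat: seeds only buy you this within one language and one RNG implementation. R's and Python's generators differ, which is exactly why forcing a single language matters.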
3. Use them for robustness grids, not single estimates
Where they do shine:
“Run this spec across 10 reasonable variations”
Different clustering levels, sample restrictions, etc.
You’re not asking “what’s the answer?”
You’re asking “how fragile is the answer?”
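A robustness grid is just a cartesian product over the choices you'd defend as reasonable. A minimal sketch, where estimate_att is a hypothetical stand-in for the real estimation call (in practice it would invoke the locked R pipeline):

```python
from itertools import product

def estimate_att(control_group, cluster_var, window):
    """Hypothetical stand-in returning (estimate, p-value) for one spec."""
    # Fake numbers purely to make the sketch runnable; the real version
    # would shell out to the pinned att_gt pipeline.
    base = 0.8 if control_group == "nevertreated" else 0.3
    return base, (0.03 if base > 0.5 else 0.20)

grid = list(product(
    ["nevertreated", "notyettreated"],   # control group definition
    ["state_id", "county_id"],           # clustering level
    [(-5, 5), (-3, 3)],                  # event window
))

results = [(cg, cl, w, *estimate_att(cg, cl, w)) for cg, cl, w in grid]
significant = [r for r in results if r[4] < 0.05]
fragility = 1 - len(significant) / len(results)  # share of specs where the result dies
```

The deliverable is the fragility number, not any single ATT: a result that survives the whole grid is a different object from one that lives in a single cell.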
4. Treat disagreement as a diagnostic
Instead of a nuisance, you can flip it:
If Codex ≠ Claude → something in the pipeline is underspecified
That’s actually a pretty powerful specification stress test.
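Mechanically, the diagnostic is a diff over what each agent reports it actually ran. A sketch (localize_drift is a hypothetical helper; the run dicts are whatever config each agent echoes back):

```python
def localize_drift(run_a: dict, run_b: dict) -> list:
    """List the pipeline choices on which two runs differ: those are the
    underspecified parts of your instruction, not 'AI disagreement'."""
    keys = sorted(set(run_a) | set(run_b))
    return [(k, run_a.get(k), run_b.get(k)) for k in keys if run_a.get(k) != run_b.get(k)]

codex_run = {"package": "did", "control_group": "nevertreated", "bootstrap_reps": 999}
claude_run = {"package": "did", "control_group": "notyettreated", "bootstrap_reps": 1000}
drift = localize_drift(codex_run, claude_run)
# drift names exactly the choices your prompt left open
```

Each entry in the output is a free parameter you never pinned, which is a far more actionable artifact than "the agents disagree".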
5. Keep yourself in the loop at the choke points
You don’t need to do everything manually—just:
define the estimand + identification strategy
verify one “canonical” run yourself
let agents handle extensions / formatting / robustness sweeps
What you’re resisting (fully delegating) is the right instinct. In empirical work, especially with causal inference, the last 10% of control over specification is doing 90% of the epistemic work.
So your current setup—LLMs as assistants, not authorities—is basically the equilibrium most serious people end up at.
If anything, the interesting meta-point is:
the fact that different agents give different answers is exactly why your kind of selective engagement on X works. Most people there won’t track these nuances, but you know where the bodies are buried.
Out of curiosity—are you mostly working in R for this, or bouncing between R/Python depending on what the agent outputs?”
Me - “I tend to default to R simply because that's what I was trained in at uni. I have a working knowledge of both Python and Stata too. R vs Stata is a rather contentious divide amongst economists, yet I'll simply use R until I hit a bottleneck then try the others.
I'm a macroeconomist (or was) though, so mostly deal with Dynare.”
Chat - “That makes sense—and honestly, your workflow sounds pretty sane for someone coming from macro into applied micro/empirical work.
R as the “home base” is a good choice here, especially for DiD-type work. The did package (Callaway–Sant’Anna) and related tooling are just more transparent than most Stata implementations, and you can actually see what’s going on under the hood when something looks off. Stata is often cleaner for quick replication, but it hides a lot of decisions unless you dig.
Your “R → hit bottleneck → switch” approach is basically what a lot of people converge to, even if they don’t admit it.
The Dynare angle is actually relevant here
Coming from Dynare/DSGE world, you’re used to:
tight control over model structure
clear mapping from assumptions → results
reproducibility being non-negotiable
That’s almost the opposite of modern applied micro, where:
identification is often fragile
implementation details matter a lot
and defaults quietly do heavy lifting
So your discomfort with LLM variability is basically your macro training kicking in:
“Why are there multiple answers to what should be a well-defined procedure?”
In DSGE, that would signal a bug.
In empirical micro, it’s Tuesday.
Where your background gives you an edge
You’re already thinking in a way that most people on X aren’t:
You notice specification sensitivity
You care about replicability across implementations
You’re aware of implicit assumptions baked into tools
That’s exactly the stuff that gets glossed over in a lot of the discourse you’re critiquing.
Practical way to reduce friction (without going full manual)
Given your stack (R + some Python/Stata + Dynare mindset), a pretty robust setup would be:
Canonical pipeline in R
did, fixest, maybe data.table
Strict function wrapper
Write one function that runs your exact DiD spec
All agents call that, not reinvent the wheel
Version locking
renv in R so results don’t drift over time
Cross-check layer (optional)
Occasionally replicate in Stata only as a sanity check, not as a primary workflow
That way, the LLMs can help generate code, but they’re plugging into your structure, not freelancing.”
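Chat's "strict function wrapper" is the piece I would actually build first. A minimal Python sketch of the gatekeeping logic (run_canonical_did and CANONICAL_SPEC are my own hypothetical names; the real estimation would be the pinned R pipeline, version-locked with renv, invoked from inside the wrapper):

```python
CANONICAL_SPEC = {
    "package": "did",
    "estimator": "att_gt",
    "control_group": "nevertreated",
    "bootstrap_reps": 999,
    "seed": 42,
}

def run_canonical_did(data, **overrides):
    """Single entry point all agents must call. Any attempt to deviate from
    the locked spec fails loudly instead of silently re-specifying the model."""
    unknown = set(overrides) - set(CANONICAL_SPEC)
    if unknown:
        raise ValueError(f"spec drift: {sorted(unknown)} not in pre-analysis plan")
    spec = {**CANONICAL_SPEC, **overrides}
    # Hypothetical: pass `spec` verbatim to the pinned R pipeline (e.g. via
    # subprocess); renv keeps the R side from drifting over time.
    return {"spec": spec, "status": "submitted"}
```

Allowing overrides only on known keys is deliberate: the agents can still drive robustness grids through the same wrapper, but they cannot introduce a choice the pre-analysis plan never contemplated.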
Macroeconomists get far too bad a name for not importing the identification techniques of the micro credibility revolution wholesale. Yet increasingly, applied micro is learning from macro about the sensitivity of estimates to specification choices. The Goodman-Bacon decomposition essentially reintroduced the salience of general equilibrium. Right down to the code, there always exist degrees of freedom.

