Half the Building Is Still Standing (Or, On Nature's SocSci Replication Piece)
A love letter to incentive structures, confidence intervals, and some intellectual humility
Three papers dropped in Nature yesterday (April Fools’ Day, which feels a little on the nose) reporting results from a seven-year, DARPA-funded project called SCORE. The project asked 865 researchers to poke at 3,900 social-science papers published between 2009 and 2018 and check whether the results hold up.
The headline: only about half the studies they tried to replicate actually replicated.
If you’ve been in the social sciences for any amount of time, your reaction to this is probably one of two things. Either “oh my God, half?” or “honestly, I’m surprised it’s that high.” I’m somewhere in between, but my read as a quantitative political scientist is more complicated than either reaction, and more interesting. So bear with me while I get a little nerdy about what these papers actually found, because the headline number is doing some heavy lifting that it probably shouldn’t be.
Three Tests, Not One
First, the setup. SCORE didn’t just run replications. It ran three separate audits, each published as its own paper.
Reproducibility (Miske et al.): Can you take the same data and the same code and get the same answer? This should be the easy one. It’s literally just: does the math check out? They looked at 600 papers. Only 24% of authors made their data available. In political science that number was 54%. In education it was 2.9%. Of the papers where reproduction was even possible, about 54% matched precisely. Three-quarters matched approximately.
Robustness (Aczel et al.): If you take the same data but run a different reasonable analysis, do you get the same conclusion? Here they had at least five independent teams reanalyze each of 100 papers. About 75% reached the same conclusion. Economics and political science led again here. But in 2% of cases, a fresh analysis reached the opposite conclusion of the original paper. Two percent doesn’t sound like much until you think about what it means at scale — for every fifty papers you read, one of them might be actively pointing you the wrong way.
Replicability (Tyner et al.): Can you run the whole study from scratch (new data, new participants, the works) and find the same thing? This is the hardest test, and it’s where the headline number comes from. Of 274 claims from 164 papers, 49% replicated with statistical significance. But honestly, the significance number isn’t even the most alarming part. The median effect size dropped from a correlation of about 0.25 in the originals to 0.10 in the replications. That’s not shrinkage. That’s collapse. If you’re building policy on a finding with an effect size of 0.25 and the real number is 0.10, you’re not just slightly overestimating — you’re in a different ballpark.
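To put numbers on "different ballpark," here's a rough power calculation of my own (not something from the SCORE papers), using the standard Fisher z approximation for detecting a correlation at 80% power and a two-sided alpha of .05. The helper function and the specific sample sizes are purely illustrative.

```python
# Illustrative sketch, not from the SCORE papers: approximate sample size
# needed to detect a correlation r with 80% power at alpha = .05 (two-sided),
# via the Fisher z approximation.
import math
from scipy.stats import norm

def n_for_correlation(r, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96
    z_beta = norm.ppf(power)            # ~0.84
    fisher_z = math.atanh(r)            # Fisher z-transform of r
    return math.ceil(((z_alpha + z_beta) / fisher_z) ** 2 + 3)

print(n_for_correlation(0.25))  # ~124 participants
print(n_for_correlation(0.10))  # ~783 participants
```

A study powered for r = 0.25 needs on the order of 125 participants; the same design aimed at the replicated r = 0.10 needs closer to 800. Anything planned around the original number, whether a follow-up study or a policy evaluation, ends up badly underpowered for the effect that actually exists.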
Who Looks Good and Who Doesn’t
Here’s where it gets interesting if you’re me. Look at the reproducibility numbers by field. Political science: 84% of papers with data available were approximately or precisely reproduced, 66% precisely. Economics: 77% approximately or precisely, 72% precisely. Psychology and education? Roughly half or less were precisely reproduced. That’s not a gap. That’s a canyon.
[Correction 4/12: The original version of this chart understated economics' precise reproducibility rate. The text was correct (72%, per Miske et al.); the chart has been updated to match.]
Sánchez-Tójar’s commentary says it plainly: reproducibility was substantially higher in economics and political science, which have more-established norms for sharing data and code.
This is not because economists and political scientists are inherently more rigorous. Part of it is methodological culture — these fields rely heavily on standardized regression frameworks and widely used software packages, which makes code easier to share and results easier to check. An OLS regression in Stata is more portable than a bespoke mixed-methods pipeline. But the bigger story is that the journals in those fields started requiring you to show your work. Miske et al. report that 77.8% of economics and political science journals had data-sharing, code-sharing, or reproducibility-check policies. For journals in other fields? 6.8%. The American Economic Review has required authors to deposit their data and code since 2004, and the American Economic Association added a dedicated data editor in 2018. The American Journal of Political Science hired a replication analyst in 2015 and started checking code before publication. Papers in journals with these policies were precisely reproduced 70.5% of the time; papers in journals without them, 40.7%.
My read: this is a story about incentives, not about virtue.
Now, the replicability numbers tell a different story. And this is the part that genuinely surprised me.
When you move from “can we check the math” to “can we reproduce the actual finding with new data,” the field-level ordering doesn’t just shuffle. It inverts. Economics — the field that looks best on reproducibility — has the lowest replication rate: 42.5%. Education, which looked worst on reproducibility, comes in at 63.1%. Political science sits at 52%. Psychology at 49%.
Now, the sample sizes are small (some fields have only 13-15 replicated papers), so I wouldn’t carve these specific rankings into stone. But the broad pattern is striking: being good at showing your work doesn’t automatically make your findings more likely to hold up with fresh data. Transparency and truth are related but they’re not the same thing.
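To see just how soft those rankings are, here's a quick interval calculation with hypothetical counts (8 replications out of 13 attempts, roughly education's rate); the counts are mine for illustration, not SCORE's.

```python
# Illustrative sketch: the uncertainty around a field-level replication rate
# estimated from only ~13 replication attempts (hypothetical counts).
from statsmodels.stats.proportion import proportion_confint

successes, attempts = 8, 13   # e.g., 8 of 13 claims replicate (~62%)
low, high = proportion_confint(successes, attempts, alpha=0.05, method="wilson")
print(f"point estimate: {successes / attempts:.0%}")   # 62%
print(f"95% CI: {low:.0%} to {high:.0%}")              # roughly 36% to 82%
```

An interval that runs from the mid-30s to the low 80s is compatible with a field being either the best or the worst performer in the study, which is why I'd read the cross-field ordering as suggestive rather than settled.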
And the effect size story is arguably worse than the headline replication rate. The median original effect in polisci was r = 0.16; the replication median was r = 0.05. In sociology, the originals averaged 0.10; replications came in at 0.03. Economics went from 0.28 to 0.13. Across every field, the replicated effects were a fraction of the originals.
Robb Willer’s commentary is blunt: past replication debates focused on psychology, but these results show every social science has a replication problem. Political science doesn’t get a free pass here, and I’m not going to pretend otherwise.
But the variation in reproducibility — the thing that should be easiest to fix — tells you everything about what actually works and what’s just performative hand-wringing.
The Design Choices That Matter
OK, methods nerd time. A few things about SCORE’s design that should shape how you interpret the results.
The sampling window is 2009 to 2018. This matters enormously. The open-science movement didn’t really hit its stride until around 2015, and many of the strongest transparency mandates came after 2018. Abel Brodeur published a companion paper (also in Nature 652) looking at economics and political science papers from 2022-23 and found 85% computational reproducibility (same-data, same-code verification, not full replication with new data). That’s a huge jump from SCORE’s 53.6% on the equivalent measure. We don’t yet know whether replication rates have similarly improved, but on the reproducibility side at least, the trajectory is real. Whether you read this as “things are getting better” or “SCORE is measuring an era that’s already over” depends on your temperament, but both readings are defensible.
In most cases, each paper's reproducibility was assessed by a single analyst team. Sánchez-Tójar flags this: if the reproduction fails, you can't tell whether it failed because the original was wrong or because the analyst team made a different judgment call. This is a real limitation. It means the 54% precise reproducibility number is probably a floor, not a ceiling.
The sample was also stratified by journal rather than weighted by publication volume. So a niche journal that publishes forty papers a year gets the same representation as a flagship that publishes four hundred. This highlights journal-level trends (useful!) but doesn’t give you a representative snapshot of what the average social scientist is reading (less useful for the headline).
And then there’s the replication design. Tyner et al. acknowledge something important: in social science, there’s no such thing as an exact replication. Different participants, different contexts, different moments in time. Some regression in effect size is expected even if the original finding is real. The fact that 49% hit conventional significance thresholds doesn’t mean the other 51% found nothing. Many of them found effects in the same direction, just smaller. The median effect size getting cut in half is concerning, but “concerning” and “the original was wrong” aren’t the same claim.
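To see why some shrinkage is baked in, here's a toy simulation of my own (not from Tyner et al.): assume the true correlation is 0.10, original studies run with n = 100, and only the statistically significant originals get published. Every parameter here is an assumption chosen for illustration.

```python
# Toy simulation (my assumptions, not from the SCORE papers): when only
# significant results get published, published effect sizes overstate the
# true effect, so honest replications come back smaller even for real effects.
import numpy as np

rng = np.random.default_rng(0)
true_r, n, sims = 0.10, 100, 200_000

# Sampling distribution of the Fisher-z-transformed correlation estimate.
se = 1 / np.sqrt(n - 3)
z_hat = rng.normal(np.arctanh(true_r), se, size=sims)
r_hat = np.tanh(z_hat)

significant = np.abs(z_hat) / se > 1.96       # the "original" clears p < .05
published = r_hat[significant]

print(f"true effect:                  r = {true_r:.2f}")
print(f"mean published (significant): r = {published.mean():.2f}")  # ~0.24, inflated
print(f"mean unselected replication:  r = {r_hat.mean():.2f}")      # ~0.10
```

Under those made-up settings, selection on significance alone turns a true r of 0.10 into a published r of about 0.24, uncomfortably close to the 0.25-to-0.10 gap SCORE reports. That doesn't prove selection is the whole story, but it's a reason to read part of the effect-size drop as mechanical rather than as proof the originals were simply wrong.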
So What’s Actually Broken?
The SCORE results are mostly a story about infrastructure. Tim Errington — who coordinated parts of the project — said it himself: the problem in many cases is simply that papers don't provide enough information to check. The raw materials aren't there. The methods descriptions are vague. The data are locked away on someone's hard drive.
This is fixable. We know how to fix it because some fields already have. Require data and code sharing. Fund data editors. Preregister analyses. Run robustness checks before publication, not after. Brodeur’s reproducibility numbers from 2022-23 suggest that these reforms are working where they’ve been implemented.
But here’s the uncomfortable part: the fields with the worst reproducibility numbers are the ones with the weakest infrastructure for transparency. Sociology. Education. Parts of psychology. Now, I want to be fair here — some of these fields face genuine structural barriers that econ and polisci don’t. Education researchers work with student-level data that’s locked down by FERPA. Sociologists studying vulnerable populations run into IRB constraints that make open data sharing genuinely difficult, not just inconvenient. These aren’t excuses; they’re real obstacles that require creative solutions (synthetic data, secure enclaves, tiered access) rather than just “open everything up.”
But those barriers don’t explain the full gap. Building transparency infrastructure also costs money, requires institutional buy-in, and changes who has power in the publication process. Data editors cost money. Requiring code sharing means someone has to check the code. Preregistration means you can’t run fifteen analyses and report the one that worked.
The incentives that produce non-reproducible science are the same incentives that resist reform. Publish or perish rewards volume over transparency. Journals want novel findings, not careful replications. Tenure committees count publications, not data deposits.
So we've arrived where these conversations always arrive. Everyone agrees on the diagnosis. Everyone knows the prescription. The question is whether the institutions that created the problem are capable of fixing it, and honestly, I go back and forth on that one.
Brodeur’s data gives me some optimism. Things genuinely are better than they were in 2009. But “better than 2009” is a low bar, and we’re nowhere near where we need to be. DARPA’s original hope was to build automated tools that could flag unreliable papers. That part is still barely functional. The best AI model in the latest competition hit 68.5% accuracy, which is better than chance but worse than a group of informed humans just talking it through.
The open-science reforms are real and they’re working. But they’re unevenly distributed, and the fields that need them most are the ones adopting them slowest. If you’re in a field that still treats data sharing as optional, the SCORE results are a mirror.
One more thing. SCORE looked exclusively at quantitative studies: papers with data files, code, statistical tests. This is the work where the tools for independent verification actually exist. Rerunning a regression is a tractable problem. Verifying an interpretation of six months of fieldwork is a fundamentally different kind of challenge.
So when we say that quantitative social science replicates about half the time, that’s the number for the research tradition with the most developed infrastructure for catching problems. The equivalent audit for qualitative or interpretivist work doesn’t exist, and honestly might not be possible in the same way. I’m not saying qualitative research is worse — it has its own error-correction mechanisms (triangulation, prolonged engagement, thick description) that don’t map neatly onto a replication framework. But the underlying pressures are the same: publication incentives, researcher degrees of freedom in interpretation, confirmation bias. Which means the SCORE numbers are the lower bound of what’s measurable, not a comprehensive audit of the whole enterprise.
And if you’re in political science or economics, feeling a little smug about the reproducibility numbers — remember that economics has the lowest replication rate in the study at 42.5%. Political science lands at 52%. Our grad students are reading papers where the original effect sizes were around r = 0.16 and the replicated effects came back at r = 0.05. We’re better at showing our work. We’re not better at being right. That’s not a victory. That’s a head start.
If you're a social scientist who found this useful, share it with your methods person. They probably have opinions.


In its own small way, the existence of the SCORE report indicates a degree of health in the scientific enterprise, which is only as good as its error metabolism. The kind of critical work SCORE represents identifies error in the body scientific, and the kind of publishing venue Nature represents makes these identifications visible.
I'm a sociologist and am obviously unhappy with how sociology fared here. My response is to discuss this with my first-year grad students and say: let's do better. (Transparency is taking hold in sociology, though, to be sure, more slowly than elsewhere.) Thanks for highlighting this important work.