
2026-03-18 · Adam Murphy
WSU Study Confirms What We Built TOE-Share to Fix
A Washington State University study found ChatGPT identifies false scientific claims correctly only 16.4% of the time. Here's why multi-agent architecture changes that equation.
A study from Washington State University — right here in Pullman, just down the road from where TOE-Share is built — found something that won't surprise anyone who has tried to use AI for serious scientific work: a single AI model is unreliable when evaluating scientific claims.
Professor Mesut Cicek and his colleagues fed more than 700 hypotheses from published scientific papers into ChatGPT and asked a simple question: is this true or false? They repeated each query 10 times.
The results were sobering.
ChatGPT answered correctly about 80% of the time. That sounds reasonable until you remember that a coin flip gets you 50% on a true-or-false question. After correcting for chance, the model scored only about 0.6 on a scale where 0 is random guessing and 1 is perfect. And when the correct answer was "false," meaning the hypothesis had been disproven by research, ChatGPT identified it correctly only 16.4% of the time.
Let that sink in. Five out of six times, when a scientific claim was wrong, the AI said it was right.
On top of the accuracy problem, there was a consistency problem. Asked the identical question 10 times in a row, ChatGPT gave the same answer only 73% of the time. True, false, true, false, true: the same prompt, the same hypothesis, different verdicts.
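The chance correction is simple arithmetic. Here is a minimal sketch using the standard kappa-style adjustment; whether the study used exactly this statistic is my assumption, but the arithmetic matches the headline numbers:

```python
# Kappa-style chance correction for a binary (true/false) task.
# An assumption about the study's method, shown for illustration only.

def chance_adjusted(accuracy: float, chance: float = 0.5) -> float:
    """Rescale raw accuracy so 0 = random guessing and 1 = perfect."""
    return (accuracy - chance) / (1 - chance)

# Raw accuracy of 80% on true/false questions:
score = chance_adjusted(0.80)
print(round(score, 2))  # 0.6 — only 60% of the way from a coin flip to perfect
```

In other words, a number that looks like a solid B grade collapses once you account for the fact that half the answers are free.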
Why This Matters
This study quantifies something that every serious AI user already felt: a single model, in a single conversation, is not reliable enough for scientific evaluation. It produces fluent, convincing language — but fluency isn't accuracy. A well-written wrong answer is still wrong.
This is the core problem TOE-Share was designed to solve.
How Multi-Agent Architecture Changes the Math
When you ask a single AI model to evaluate a scientific claim, you get one opinion with a roughly 80% accuracy rate and significant inconsistency. That's not a review — that's a coin flip with better branding.
When you run the same claim through multiple independent models — each evaluating from a different angle, none of them seeing each other's work — and then have a coordinator synthesize the findings, something different happens. Disagreements surface. When one model says "this equation is valid" and another says "this equation has a sign error," that conflict gets escalated. The coordinator sends it back. The agents re-evaluate with the new information.
The inconsistency that plagues a single model becomes signal in a multi-agent system. One model flip-flopping between true and false is noise. Four models from different providers reaching different conclusions is data that triggers deeper analysis.
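The pattern described above can be sketched in a few lines. This is an illustrative skeleton, not TOE-Share's actual implementation: the model calls are stubbed and the agreement threshold is an assumption chosen for the example.

```python
# Hedged sketch of independent-review-plus-escalation. Real agents would
# call different model providers; here they are stubbed functions.
from collections import Counter
from typing import Callable

def review(claim: str, agents: list[Callable[[str], str]],
           min_agreement: float = 0.75) -> dict:
    """Collect independent verdicts; flag disagreement for escalation."""
    verdicts = [agent(claim) for agent in agents]  # no agent sees another's work
    tally = Counter(verdicts)
    top_verdict, top_count = tally.most_common(1)[0]
    agreement = top_count / len(verdicts)
    return {
        "verdict": top_verdict if agreement >= min_agreement else "escalate",
        "agreement": agreement,
        "tally": dict(tally),
    }

# Four stubbed agents: three accept an equation, one flags a sign error.
agents = [lambda c: "valid", lambda c: "valid",
          lambda c: "valid", lambda c: "sign error"]
result = review("the field equation in section 3", agents)
```

With the threshold at 75%, three-of-four agreement passes; raise the bar and the same split escalates for re-evaluation. The design choice is the point: disagreement is routed somewhere instead of being averaged away.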
We've already documented cases where this architecture caught errors that individual AI sessions missed — mathematical inconsistencies that survived review by ChatGPT, Claude, Grok, and Gemini separately, but were identified when those same models ran independently through our structured specialist pipeline with coordinator synthesis.
The WSU study tells us why: a single model is bad at identifying when things are wrong. Our architecture forces the question by making models check each other's work.
The Paradigm Neutrality Connection
There's another layer to the WSU findings that's directly relevant. The study found that ChatGPT was worst at identifying false hypotheses — claims that had been disproven. This aligns with a known tendency in large language models: they default to the consensus in their training data.
If the training data says dark matter explains galaxy rotation curves, the model will lean toward "true" when asked about dark matter — even though dark matter has never been directly detected. The model treats the dominant interpretation as fact.
This is exactly why we built paradigm neutrality into TOE-Share's review prompts. Our specialist agents are explicitly instructed to distinguish between observational data (verified measurements, not debatable) and theoretical interpretations (models proposed to explain those measurements, debatable). A paper that contradicts a theoretical interpretation is not marked as wrong — it's evaluated on its own internal logic, mathematical rigor, and falsifiability.
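To make the idea concrete, an instruction of this kind might look like the following. The wording is entirely illustrative; it is not TOE-Share's actual prompt text.

```python
# Illustrative paradigm-neutrality instruction, of the kind described above.
# Not TOE-Share's actual prompt.
PARADIGM_NEUTRAL_INSTRUCTION = """\
Before judging any claim, separate two categories:
1. Observational data: verified measurements. Not debatable.
2. Theoretical interpretations: models proposed to explain those
   measurements. Debatable.
Do not mark a paper wrong merely for contradicting a theoretical
interpretation. Evaluate it on its internal logic, mathematical rigor,
and falsifiability.
"""
```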
The WSU study shows that without this kind of explicit instruction, AI defaults to agreeing with whatever the training data says is true. For scientific review — where the whole point is to evaluate new ideas that might challenge existing understanding — that default is dangerous.
What This Means for Researchers
If you're an independent researcher using AI to evaluate your work, the WSU study is both a warning and a roadmap.
The warning: don't trust a single AI model's assessment of your scientific claims. An 80% accuracy rate with only 16.4% detection of false hypotheses means that when your work is wrong, the model will usually tell you it's fine. That's not helpful — it's harmful.
The roadmap: structured, multi-model evaluation with explicit protocols for handling disagreement produces fundamentally different results than conversational review. The architecture matters more than the model.
We're running a preregistered calibration study right now to demonstrate this quantitatively. The methodology, success criteria, and paper pool are published at theoryofeverything.ai/calibration — because if we're going to claim our system is better than a single-model conversation, we should show our work.
Professor Cicek said it best: "Always be skeptical. I'm not against AI. I'm using it. But you need to be very careful."
We agree. That's why we built a system that's careful by design.
TOE-Share is an AI-powered scientific peer review platform at theoryofeverything.ai. The platform uses independent specialist AI agents from multiple providers with coordinator synthesis to produce structured, paradigm-neutral review of scientific work.