Why We're Calibrating TOE-Share — And Why It Matters

2026-03-17 · Adam Murphy

Every measurement instrument needs calibration. We're applying that same discipline to AI-powered peer review — running papers of known quality through the system and publishing the results.

When you step on a scale, you trust the number. When a thermometer reads 101°F, you act on it. But why? Because somewhere along the way, someone calibrated those instruments against a known standard. They put a precise 1-kilogram weight on the scale and made sure it read 1 kilogram. They dipped the thermometer in boiling water and confirmed it read 212°F.

Without calibration, a measurement is just a number. With calibration, it's information you can trust.

TOE-Share is a measurement instrument. Instead of measuring weight or temperature, it measures the rigor of scientific work: mathematical validity, internal consistency, falsifiability, clarity, novelty, completeness, and evidence strength. Seven dimensions, each scored by an independent AI specialist agent. The specialists never see one another's assessments; a coordinator reads their reports and synthesizes the findings.
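
For readers who like to see things concretely, here is a rough sketch of what a scored review could look like as a data structure. The field names and the 0-to-10 scale are a blog-level illustration, not the production schema.

```python
from dataclasses import dataclass

# Illustrative only: one score per review dimension, on an assumed 0-10 scale.
@dataclass
class ReviewScores:
    mathematical_validity: float
    internal_consistency: float
    falsifiability: float
    clarity: float
    novelty: float
    completeness: float
    evidence_strength: float

# A hypothetical result: strong clarity and novelty, weak falsifiability.
scores = ReviewScores(7.5, 8.0, 4.0, 8.5, 9.0, 6.0, 5.5)
print(scores.falsifiability)  # 4.0
```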

But here's the question every honest person should ask: how do we know those scores mean anything?

That's what calibration answers.

What We're Actually Doing

We're running papers of known quality through the system and checking whether the scores match what we'd expect.

A paper published in Nature Physics — peer-reviewed by human experts, backed by massive research teams, validated by the scientific establishment — should score well. Not perfectly, because no paper is perfect. But well.

A paper with a known mathematical error should get flagged for that error. Not punished for being unconventional — flagged for the specific mistake.

A paper that makes grand claims with no supporting math and no testable predictions should score low. Not because the idea is bad, but because it hasn't done the work yet to earn a higher score.

If our system produces those results — high-quality work scores high, weak work scores low, and the middle falls in the middle — then the instrument is calibrated. The scores mean something.

If it doesn't? Then we have work to do. And we'll tell you that honestly, because transparency is the whole point.

Why This Matters for Independent Researchers

If you're working outside of a university or a national lab, you already know the problem. You write a paper. You think it's good. Maybe it is. But who's going to tell you?

You can paste it into ChatGPT and get an encouraging response. You can feed it to multiple AI models and get slightly different encouraging responses. But encouragement isn't review. A system that tells you everything is great isn't helping you — it's flattering you.

What you need is a system that says: "Your core idea is novel and your structure is clear, but equation 14 has a sign error, your falsifiability score is low because you haven't specified a testable prediction, and your evidence base needs supporting papers." That's not discouragement. That's a roadmap.

But you'll only trust that roadmap if you trust the instrument producing it. And you'll only trust the instrument if it's been calibrated against papers where you already know the answer.

That's why we're publishing our calibration methodology before we publish the results. We're telling you what we're testing, how we're testing it, and what "success" looks like — before we run a single paper. This is called preregistration, and it's what separates a real study from a marketing demo.

What We're Looking For

Our primary question is simple: do the scores discriminate meaningfully across quality tiers?

We've organized our test papers into four tiers:

Tier 1 — papers published in the world's top journals. These have already survived human peer review at the highest level. We expect them to score well in our system too.

Tier 2 — solid preprints with real merit but known limitations. Competent work that hasn't been through the full journal gauntlet. We expect mid-range scores with specific, actionable feedback.

Tier 3 — framework papers from independent researchers. Ambitious, often brilliant, but sometimes with gaps in mathematical rigor or evidence. These are the papers TOE-Share was built for. We expect the system to recognize the ambition while honestly identifying the gaps.

Tier 4 — papers we wrote ourselves, specifically designed to contain common failure modes. Circular reasoning. Unfalsifiable claims. Equations that look impressive but don't actually derive anything. We expect the system to catch every one of these.

If Tier 1 scores higher than Tier 2, which scores higher than Tier 3, which scores higher than Tier 4 — the instrument is working.
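
In code, that headline check is nearly a one-liner. The sketch below uses made-up numbers purely to show the shape of the test; the real analysis runs on the preregistered reference papers, not invented scores.

```python
from statistics import mean

def tiers_discriminate(scores_by_tier: dict[int, list[float]]) -> bool:
    """Return True if mean scores strictly decrease from Tier 1 to Tier 4."""
    means = [mean(scores_by_tier[tier]) for tier in sorted(scores_by_tier)]
    return all(higher > lower for higher, lower in zip(means, means[1:]))

# Placeholder numbers, not study results.
example = {
    1: [8.4, 8.9, 8.1],  # top-journal papers
    2: [6.8, 7.2, 6.5],  # solid preprints
    3: [5.1, 5.9, 4.8],  # independent framework papers
    4: [2.3, 3.0, 2.7],  # deliberately flawed papers
}
print(tiers_discriminate(example))  # True
```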

What Makes This Different From Just Asking AI

You might wonder: can't you just ask ChatGPT to review a paper? Yes, you can. And the response will probably be thoughtful and articulate. But there's a fundamental difference between a conversational AI review and a structured multi-agent review.

In a conversation, the AI is your partner. It builds context with you over time. It responds to your framing. If you push back, it often accommodates. If you feed it four or five papers sequentially, each one building on the last, by the end it has absorbed your entire worldview and is evaluating from inside your framework. That can be valuable, but it's advocacy, not review, and that built-in bias is the opposite of what scientific evaluation requires.

In our system, each specialist agent sees the paper cold. No prior context. No relationship with the author. No memory of previous papers. It evaluates against a structured rubric, independently, and then a coordinator reads all the specialist reports and synthesizes them — checking for disagreements, catching things that individual specialists missed, and producing a unified assessment.
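
Stripped of the model calls, that flow looks roughly like the sketch below. The parameters run_specialist and run_coordinator are placeholders standing in for the actual agent calls; this is not TOE-Share's API.

```python
# The seven rubric dimensions, each evaluated independently.
DIMENSIONS = [
    "mathematical_validity", "internal_consistency", "falsifiability",
    "clarity", "novelty", "completeness", "evidence_strength",
]

def review_paper(paper_text, run_specialist, run_coordinator):
    # Each specialist sees only the paper and its own rubric: no shared
    # context, no author history, no other specialists' reports.
    reports = {
        dim: run_specialist(dimension=dim, paper=paper_text)
        for dim in DIMENSIONS
    }
    # The coordinator sees every report at once, reconciles disagreements,
    # and returns one unified assessment.
    return run_coordinator(reports=reports, paper=paper_text)
```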

We've already seen this architecture catch errors that individual AI sessions missed. Mathematical inconsistencies that survived conversational review by multiple AI models were identified when the same models were run independently through our specialist pipeline and their reports synthesized by the coordinator.

That's the value of calibration — not just showing that the system scores papers, but showing that the architecture itself produces something you can't get from a single conversation.

How This Looks Going Forward

Calibration isn't a one-time event. It's a continuous process.

Every time we update the AI models in our specialist panel, the instrument changes. Every time we refine a prompt or adjust a scoring threshold, the instrument changes. Each change gets a new calibration version, tested against reference papers, with results published publicly.

Every review a researcher receives will be stamped with the calibration version that was active when the review was produced. If you got a review under version 1.0 and later the system was recalibrated to version 1.1, both scores are preserved. You can see exactly how the instrument evolved and what that means for your results.
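
One way to picture that, sketched in code rather than taken from the actual data model: each review record permanently carries the calibration version it was produced under, and recalibration creates new records instead of rewriting old ones.

```python
from dataclasses import dataclass
from datetime import date

# Illustrative only: a review stamped with its calibration version.
@dataclass(frozen=True)
class ReviewRecord:
    paper_id: str
    overall_score: float
    calibration_version: str  # e.g. "1.0", "1.1"
    reviewed_on: date

v10 = ReviewRecord("paper-042", 6.8, "1.0", date(2026, 3, 1))
# A later recalibration produces a new record under the new version;
# the original is preserved unchanged.
v11 = ReviewRecord("paper-042", 6.5, "1.1", date(2026, 6, 1))
```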

This is how scientific instruments work in the real world. A laboratory doesn't calibrate its mass spectrometer once and forget about it. It recalibrates on a schedule, documents every change, and traces every measurement back to a calibration certificate.

We're applying that same discipline to AI-powered peer review. Because if we're asking researchers to trust this system with their life's work, the least we can do is show our own work.

Follow Along

Our calibration methodology is published and our study is in progress. You can follow the process, see the preregistered criteria, and eventually judge the results yourself at theoryofeverything.ai/calibration.

Science was meant to be shared. We think the tools that evaluate it should be held to the same standard.


Adam Murphy is the founder of TOE-Share and an independent researcher. He built the platform because he needed honest review of his own work and couldn't find it anywhere else.