📊 Study in Progress

This page tracks the preregistered calibration protocol, run log, and versioned release history.

TOE-Share Calibration Study

Public preregistration + transparent run-by-run reporting.

Active calibration version: v0.1-draft (2026-03-16)

Methodology (Preregistration)

TOEShare Calibration Study — Preregistration v1

Purpose

Demonstrate that TOEShare's multi-agent AI review system produces meaningful, discriminating assessments across a spectrum of scientific quality.

Primary Hypothesis

Tier median overall scores will be monotonically decreasing: Tier 1 (Gold Standard) > Tier 2 (Mid-Range) > Tier 3 (Framework) > Tier 4 (Weak/Synthetic)

Secondary Hypothesis (Exploratory)

AI-generated papers submitted blind will not receive systematically different scores from human-written papers of comparable quality.

Methodology

Review Protocol

  • All papers submitted metadata-blind (no venue, credential, or publication status information provided to the review system)
  • Full multi-agent pipeline: three specialist roles (Math/Logic, Sources/Evidence, Science/Novelty), each backed by multiple models, plus coordinator synthesis
  • Consensus rounds triggered when specialists disagree (sketched after this list)
  • Current production configuration used for all runs
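
As a rough illustration of that structure, the sketch below models the specialist roster and the consensus trigger in Python. The role names come from the preregistration; the model names, the dict layout, and the one-point disagreement threshold are illustrative assumptions, not the production configuration.

  # Illustrative panel roster and consensus trigger (not the production config).
  SPECIALIST_ROSTER = {
      "math_logic":       ["model-a", "model-b"],
      "sources_evidence": ["model-c", "model-d"],
      "science_novelty":  ["model-e", "model-f"],
  }

  def needs_consensus_round(scores: list[float], spread_threshold: float = 1.0) -> bool:
      """Assumed rule: trigger a consensus round when specialist scores spread too far apart."""
      return max(scores) - min(scores) >= spread_threshold

Under the assumed one-point threshold, needs_consensus_round([4.5, 3.0, 4.0]) would trigger a consensus round, while needs_consensus_round([3.5, 3.0, 3.4]) would not.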

Configuration Snapshot (locked per run)

Each run records (a minimal record sketch follows this list):

  • Git commit hash
  • Prompt versions for all specialists and coordinator
  • Model roster (which models in which specialist slots)
  • Timestamp
  • Run treated as immutable for auditability
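
A minimal sketch of that snapshot as an immutable Python record; RunConfigSnapshot, its field names, and the capture helper are assumptions based on the list above, not the actual schema.

  # Hypothetical run-configuration snapshot. Frozen so a recorded run cannot be
  # mutated after the fact, mirroring the "immutable for auditability" rule.
  from dataclasses import dataclass
  from datetime import datetime, timezone

  @dataclass(frozen=True)
  class RunConfigSnapshot:
      git_commit: str                            # commit hash of the pipeline code
      prompt_versions: tuple[str, ...]           # one version tag per specialist + coordinator
      model_roster: tuple[tuple[str, str], ...]  # (specialist slot, model) pairs
      timestamp: str                             # ISO timestamp when the run started

      @staticmethod
      def capture(git_commit, prompt_versions, model_roster):
          return RunConfigSnapshot(
              git_commit=git_commit,
              prompt_versions=tuple(prompt_versions),
              model_roster=tuple(model_roster),
              timestamp=datetime.now(timezone.utc).isoformat(),
          )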

Terminology

  • "Metadata-blind": the system receives paper content but no external quality signals (venue, author credentials, citation counts, publication status)
  • NOT fully blind: the system can infer institutional context from content

Failure Handling

  • 1 automatic retry per failed specialist agent
  • 60-second timeout per agent call
  • If specialist fails after retry, run marked as "partial" with failure documented
  • Partial runs are valid data — they demonstrate system resilience (a retry/timeout sketch follows this list)
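
A minimal sketch of these rules in Python, assuming a synchronous call_specialist function (hypothetical; the real agent interface is not shown in this document):

  # One automatic retry, 60-second timeout per attempt, partial runs kept.
  import concurrent.futures

  TIMEOUT_S = 60
  MAX_ATTEMPTS = 2  # initial call + 1 automatic retry

  def run_specialist(call_specialist, paper, role):
      failures = []
      for attempt in range(1, MAX_ATTEMPTS + 1):
          pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
          future = pool.submit(call_specialist, paper, role)
          try:
              return {"role": role, "status": "ok",
                      "result": future.result(timeout=TIMEOUT_S)}
          except Exception as exc:  # agent error or timeout
              failures.append(f"attempt {attempt}: {exc!r}")
          finally:
              pool.shutdown(wait=False)
      # Failed after its retry: the run is marked partial, not discarded.
      return {"role": role, "status": "partial", "failures": failures}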

Repeatability

  • Minimum 1 re-run per tier (4 total)
  • Report score spread, not just point scores
  • Top scorer per tier selected for re-run

Analysis Method

  • Tier medians + interquartile range (IQR)
  • Spearman rank correlation between tier assignment and overall score
  • Kruskal-Wallis test if the sample supports it (see the analysis sketch after this list)
  • Effect sizes reported alongside averages
  • Two claims kept separate:
    1. Tier discrimination (primary, must be supported by data)
    2. AI-vs-human comparison (exploratory, labeled as such)
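
A sketch of this analysis in Python, assuming overall scores are grouped by tier; the SciPy/NumPy calls match the listed statistics, while the data layout and variable names are illustrative.

  # Tier medians, IQR, Spearman correlation, and Kruskal-Wallis over per-tier scores.
  import numpy as np
  from scipy import stats

  def analyze(scores_by_tier: dict[int, list[float]]):
      summary = {}
      for tier, vals in sorted(scores_by_tier.items()):
          q1, med, q3 = np.percentile(vals, [25, 50, 75])
          summary[tier] = {"median": med, "iqr": q3 - q1, "n": len(vals)}

      # Spearman rank correlation between tier assignment and overall score
      # (negative rho expected: higher tier number -> lower score).
      tiers = [t for t, vals in scores_by_tier.items() for _ in vals]
      flat = [s for vals in scores_by_tier.values() for s in vals]
      rho, p_spearman = stats.spearmanr(tiers, flat)

      # Kruskal-Wallis across tiers, if per-tier sample sizes support it.
      h_stat, p_kw = stats.kruskal(*scores_by_tier.values())

      return summary, (rho, p_spearman), (h_stat, p_kw)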

Paper Pool

Pool A: Human-Written (12 papers)

Tier 1 — Gold Standard (4 papers) Published in top-tier peer-reviewed venues. Expected score range: 3.8-4.8/5

  • Paper 1: Google Quantum AI — Dynamic Surface Codes (Nature Physics, Oct 2025)
  • Paper 2: NOvA + T2K Joint Neutrino Oscillation Analysis (Nature, Oct 2025)
  • Paper 3: Emergent Photons in Quantum Spin Ice (Nature Physics, 2025)
  • Paper 4: Superconducting Qubit Material Improvements (Nature, Nov 2025)

Tier 2 — Mid-Range (2 papers) Competent work with notable limitations. Expected score range: 2.5-3.6/5

  • Paper 5: arXiv XOR framework paper (preprint, known math error)
  • Paper 6: Independent researcher paper (TBD — sourcing in progress)

Tier 3 — Framework Papers (3 papers) Ambitious unified theories from independent researchers. Expected score range: 2.0-3.8/5

  • Paper 7: GETT Foundation Stone — John Holland (pending permission)
  • Paper 8: GETT Hypothesis 3 — John Holland (pending permission)
  • Paper 9: QH Ghost Rank paper — Adam Murphy

Tier 4 — Synthetic Weak Papers (3 papers) Original papers written specifically for calibration, designed to contain specific failure modes. Expected score range: 1.0-2.5/5

  • Paper 10: Internal consistency failure (definitional drift)
  • Paper 11: Unfalsifiable framework (vague claims, no testable predictions)
  • Paper 12: Circular reasoning (dimensional analysis masquerading as derivation)

Pool B: AI-Generated Blind Papers (4 papers)

Each written by a different AI model, submitted without any indication of AI authorship. Expected score range: 2.5-4.0/5

  • Paper A1: Written by Claude — Hubble tension resolution
  • Paper A2: Written by ChatGPT — Black hole area theorem
  • Paper A3: Written by Gemini — Neutrino mass / dark energy connection
  • Paper A4: Written by Grok — Double-slit scalar field interpretation

AI Recusal Protocol

Default: authoring model remains in the review panel (tests architecture robustness). One paper also run with authoring model removed for bias comparison.
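
A minimal sketch of the recusal variant in Python, reusing the illustrative roster layout from the earlier sketches; the real pipeline's roster handling may differ.

  # Build the recused roster for the bias-comparison run: same paper, same
  # configuration, but the authoring model is dropped from every specialist slot.
  def recuse_authoring_model(roster: dict[str, list[str]], authoring_model: str):
      return {role: [m for m in models if m != authoring_model]
              for role, models in roster.items()}

The bias comparison then reviews the same paper once with each roster and reports the score difference.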

Repeatability Runs

  • 4 re-runs: top scorer from each tier
  • 1 bias comparison: same AI paper run with and without authoring model
  • Total: 16 primary + 4 AI + 4 repeatability + 1 bias = 25 runs

Success Criteria

  • PRIMARY: Tier medians are monotonically ordered (T1 > T2 > T3 > T4); see the check sketched after this list
  • SECONDARY: Spearman correlation between tier and score is significant
  • EXPLORATORY: AI papers do not cluster at extremes relative to their quality tier
  • TRANSPARENCY: Any result that fails these criteria is reported honestly with analysis
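
The primary criterion reduces to a strict-ordering check over the tier medians; a minimal sketch, assuming the per-tier medians computed in the analysis sketch above:

  def tier_medians_monotonic(medians_by_tier: dict[int, float]) -> bool:
      """True when Tier 1 > Tier 2 > Tier 3 > Tier 4 on median overall score."""
      ordered = [medians_by_tier[tier] for tier in sorted(medians_by_tier)]  # Tier 1 first
      return all(higher > lower for higher, lower in zip(ordered, ordered[1:]))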

Timeline

  • Week 1: Paper sourcing, synthetic paper creation, AI paper generation
  • Week 2: All primary runs + repeatability runs
  • Week 3: Analysis, report writing, publication

Cost Estimate

~25 runs × $3/run = ~$75 total

Suggested Calibration Pool (with links)

paper-1 · Tier 1 · Google Quantum AI — Dynamic Surface Codes

Tests scoring on high-rigor institutional quantum error-correction work.

https://www.nature.com/

paper-2 · Tier 1 · NOvA + T2K Joint Neutrino Oscillation Analysis

Tests large collaboration evidence handling and statistical claims.

https://www.nature.com/

paper-3 · Tier 1 · Emergent Photons in Quantum Spin Ice

Strong condensed-matter benchmark with experimental grounding.

https://www.nature.com/

paper-4 · Tier 1 · Superconducting Qubit Material Improvements

Tests materials-centric quantum claims with practical constraints.

https://www.nature.com/

paper-5 · Tier 2 · XOR framework preprint (known math error)

Mid-range benchmark where rhetoric exceeds mathematical correctness.

https://arxiv.org/

paper-6 · Tier 2 · Independent Researcher Mid-Range Candidate

Controls for non-institutional but technically competent work.

https://zenodo.org/

paper-7 · Tier 3 · GETT Foundation Stone — John Holland

Framework-style ambitious unification with explicit assumptions.

internal://pending-permission/gett-foundation-stone

paper-8 · Tier 3 · GETT Hypothesis 3 — John Holland

Second framework from same author to test consistency across related claims.

internal://pending-permission/gett-hypothesis-3

paper-9 · Tier 3 · QH Ghost Rank — Adam Murphy

Represents ambitious independent framework with mixed strengths.

internal://author-submission/qh-ghost-rank

paper-10 · Tier 4 · Synthetic Weak Paper — Internal consistency failure

Baseline failure case for definitional drift detection.

internal://synthetic/paper-10

paper-11 · Tier 4 · Synthetic Weak Paper — Unfalsifiable framework

Tests whether system penalizes vague claims lacking testability.

internal://synthetic/paper-11

paper-12 · Tier 4 · Synthetic Weak Paper — Circular reasoning

Checks resistance to superficial formalism masking invalid derivation.

internal://synthetic/paper-12

paper-a1 · Tier 2 · AI (Claude) — Hubble Tension Resolution

AI-authored blind sample for exploratory human-vs-AI comparison.

internal://ai-generated/paper-a1-claude-hubble-tension

paper-a2 · Tier 2 · AI (ChatGPT) — Black Hole Area Theorem

AI blind sample with solid structure but limited novelty.

internal://ai-generated/paper-a2-chatgpt-black-hole-area-theorem.md

paper-a3 · Tier 2 · AI (Gemini) — Quantum-Centric Supercomputing: QPU-GPU Architectures

AI blind sample (Gemini-generated). Scored ~2.14 avg on its first run. Reviewers caught math inconsistencies, temporal confusion, and missing derivations; Novelty was its highest dimension (3/5).

internal://ai-generated/paper-a3-gemini-quantum-supercomputing

paper-a4 · Tier 2 · AI (Grok) — Double-Slit Scalar Field Interpretation

AI blind sample with explicit caveats and mixed novelty signal.

internal://ai-generated/paper-a4-grok-double-slit-scalar-field.md

paper-a5-gpt · Tier 2 · AI (GPT) — Quantum-Centric Supercomputing: Architectures, Tensor-Network Surrogates, and Hybrid Paths to Utility

GPT-generated blind sample on the same QCSC topic as Gemini paper-a3. Scored 3.57 avg vs Gemini's 2.14, demonstrating quality discrimination within the same topic. Its perspective/synthesis framing earned higher Clarity and Completeness scores.

internal://ai-generated/paper-a5-gpt-quantum-supercomputing

paper-a6-claude-sonnet · Tier 2 · AI (Claude Sonnet 4.6) — QCSC: Architectural Convergence QPU-GPU, Tensor Network Co-Processing, AI Error Mitigation in Post-NISQ Era

Correction: authored by Claude Sonnet 4.6, not Opus. Highest scorer so far at 4.0 avg and the only paper to receive 5/5 on Clarity. Reviewers flagged four foundational departures (hardware claims, timeline projections). Falsifiability was the weakest dimension at 3/5.

internal://ai-generated/paper-a6-claude-sonnet-qcsc-post-nisq

paper-a7-grok-qcsc · Tier 2 · AI (Grok) — QCSC: From Classical Emulation to Integrated QPU-GPU Architectures Enabling Utility-Scale Quantum Simulation

Grok (xAI) blind sample on the same QCSC prompt. Scored 2.71 avg, second lowest after Gemini. Like Gemini, its authoring model has zero overlap with the reviewer panel. The Math score was penalized for malformed tensor-train expressions and boundary-condition errors. Novelty was weakest at 2/5; reviewers characterized it as a technology survey. Strengthens the bias correlation finding.

internal://ai-generated/paper-a7-grok-qcsc-utility-scale

paper-a7-claude-opus · Tier 2 · AI (Claude Opus 4) — Quantum-Classical Advantage Boundaries: An Analytical Framework for Hybrid QPU-GPU Computational Utility

Claude Opus 4 on a different topic (QCAB) than the shared QCSC prompt, testing whether Opus produces higher quality than Sonnet. Scored 3.33 avg, lower than Sonnet's 4.0 on QCSC. Math errors caught: algebraic errors in the critical qubit-count derivation, an arithmetic mistake (56,000 s → 56 s), and overlapping regime definitions. Falsifiability was strong at 4/5. One specialist (gpt-5-nano) failed with invalid JSON; a consensus round was triggered.

internal://ai-generated/paper-a7-claude-opus-qcab

Run Log

run-001 · complete · 3/18/2026, 6:36:05 PM

Suite: manual-testing

Avg score: 2.14 · Recommendation: revise

run-002 · complete · 3/18/2026, 7:23:34 PM

Suite: manual-testing

Avg score: 3.57 · Recommendation: publish

run-003 · complete · 3/18/2026, 10:27:58 PM

Suite: manual-testing

Avg score: 4.00 · Recommendation: revisions_suggested

run-004 · complete · 3/18/2026, 6:56:34 PM

Suite: manual-testing

Avg score: 2.71 · Recommendation: revisions_suggested

run-005 · complete · 3/19/2026, 5:56:00 PM

Suite: manual-testing

Avg score: 3.33 · Recommendation: revisions_suggested

Results

Results will be published here as the study progresses.

Version History

v0.1-draft · active · 2026-03-16

Initial calibration scaffolding and preregistration publication.

Suite: Calibration Preregistration Baseline · Papers used: 0