Research progression 8 of 12

ERT: A Framework for Evaluating AI Reasoning Stability

Plain-language summary

The Epistemic Reliability Test, or ERT, is an experimental evaluation framework for studying whether AI reasoning remains stable, consistent, and uncertainty-aware under controlled variation. Most evaluations ask whether the answer matched; ERT asks whether the reasoning remained stable, calibrated, and accountable.

Current Status

Framework page. It describes how ERT evaluates reasoning stability without disclosing protected scoring internals.

How to Read This Page

  • This is page 8 of 12 in the public ERT / Project Aletheia progression.
  • Read it as a public research note: it explains the concept and what changed without exposing protected implementation details.
  • Redaction markers mean the public boundary is intentional, not that the section is missing by accident.
  • Use this page to understand how ERT evaluates reasoning stability at a public-safe level.

Research Log

1. Problem Being Addressed

Current AI evaluations often focus on static benchmark accuracy. These benchmarks are useful, but they do not fully show whether a system is reasoning in a stable and well-calibrated way.

A system may answer correctly once while still showing deeper reliability problems, such as:

  • unstable answers after small wording changes,
  • overconfidence under uncertainty,
  • hidden contradiction across related responses,
  • failure to preserve context,
  • or inconsistent reasoning under pressure.

ERT was created to explore these failure modes directly.

2. Why Epistemic Reliability Matters

A dependable AI system should not only produce correct answers. It should also show reliability in how it reaches and maintains those answers.

ERT focuses on properties such as:

  • stability,
  • consistency,
  • calibrated uncertainty,
  • contradiction handling,
  • and cross-response coherence.

Plain-language framing:

Correctness matters, but correctness alone is not enough if the reasoning becomes fragile under small changes.

3. What ERT Provides

ERT is model-agnostic. It is designed to evaluate behavior across systems without depending on one specific model provider or architecture.

At a public-safe level, ERT includes:

  • individual response review,
  • cross-response comparison,
  • structured prompt variation,
  • behavioral signal measurement,
  • diagnostic failure signatures,
  • and optional ERC tier interpretation.

[REDACTED — protected evaluator workflow and internal scoring implementation]

4. Two-Pass Evaluation Structure

ERT uses a two-part evaluation concept:

  1. Individual Evaluation — each response is reviewed for reasoning quality, evidence handling, and uncertainty behavior.
  2. Cross-Response Evaluation — related responses are compared to test stability, agreement, divergence, and contradiction.

This allows ERT to evaluate not just one output, but the behavior of reasoning across related conditions.
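
The two-part structure described above can be pictured with a short Python sketch. It illustrates the shape of the idea only and is not ERT's evaluator: the names Response, individual_pass, and cross_response_pass are hypothetical, and the checks used here (a hedge-word lookup and a case-insensitive answer comparison) are deliberately naive stand-ins for the protected review and comparison logic.

```python
from dataclasses import dataclass
from itertools import combinations

# Hypothetical hedge words: a naive stand-in for reviewing uncertainty behavior.
HEDGE_WORDS = {"maybe", "possibly", "likely", "uncertain", "unsure"}


@dataclass(frozen=True)
class Response:
    prompt_id: str  # which prompt variant produced this response
    answer: str     # short final answer extracted from the response
    text: str       # full response text


def individual_pass(response: Response) -> dict:
    """Pass 1: review a single response on its own."""
    words = {w.strip(".,;:!?").lower() for w in response.text.split()}
    return {
        "prompt_id": response.prompt_id,
        "has_answer": bool(response.answer.strip()),
        "expresses_uncertainty": bool(words & HEDGE_WORDS),
    }


def cross_response_pass(responses: list[Response]) -> dict:
    """Pass 2: compare related responses pairwise for agreement."""
    pairs = list(combinations(responses, 2))
    disagreements = [
        (a.prompt_id, b.prompt_id)
        for a, b in pairs
        if a.answer.strip().lower() != b.answer.strip().lower()
    ]
    agreement = 1.0 if not pairs else 1 - len(disagreements) / len(pairs)
    return {"agreement_rate": agreement, "disagreements": disagreements}


# Illustrative inputs, not real evaluation data.
responses = [
    Response("v1", "Paris", "The capital of France is Paris."),
    Response("v2", "paris", "It is likely Paris."),
    Response("v3", "Lyon", "The capital is Lyon."),
]
individual_results = [individual_pass(r) for r in responses]
cross_result = cross_response_pass(responses)
```

In this toy run, the first pass flags only the second response as hedged, while the second pass reports that the third response disagrees with the other two. Neither pass alone would surface both observations.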

[REDACTED — private comparison logic and internal weighting method]

5. Structured Test Packs

ERT uses structured test packs made from controlled variations. These may include semantic, structural, adversarial, or ambiguity-based variations.

The public goal is simple:

Ask related questions in controlled ways and observe whether the reasoning changes for legitimate reasons.

A change is not automatically bad. Some changes are appropriate when the prompt meaning changes. ERT is interested in whether the system can distinguish legitimate change from unstable drift.
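
The distinction between legitimate change and unstable drift can be sketched as follows. The sketch assumes each variation is labeled in advance with whether it preserves the meaning of the base prompt. The Variation type, the meaning_preserved flag, and the classify_change function are hypothetical names chosen for illustration, and none of this reflects ERT's actual test-pack construction rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variation:
    kind: str                # e.g. "semantic", "structural", "adversarial", "ambiguity"
    prompt: str
    meaning_preserved: bool  # assumed label: does this variant ask the same question?


def classify_change(variation: Variation, base_answer: str, variant_answer: str) -> str:
    """Label how the answer to a variant relates to the answer to the base prompt."""
    changed = base_answer.strip().lower() != variant_answer.strip().lower()
    if not changed:
        # An identical answer to a different question is not necessarily wrong,
        # so it is marked for review rather than counted as a failure.
        return "stable" if variation.meaning_preserved else "unchanged (review)"
    return "unstable drift" if variation.meaning_preserved else "legitimate change"


# Illustrative pack: one base prompt plus three variants and their answers.
base_prompt = "Is 17 a prime number?"
base_answer = "yes"
pack = [
    (Variation("structural", "17: prime or not?", True), "yes"),
    (Variation("structural", "Would you say 17 is prime?", True), "no"),
    (Variation("semantic", "Is 18 a prime number?", False), "no"),
]
labels = [classify_change(v, base_answer, ans) for v, ans in pack]
# labels == ["stable", "unstable drift", "legitimate change"]
```

The second and third variants both produce a different answer from the base prompt, but only the second counts as drift, because only there did the question stay the same.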

[REDACTED — protected test-pack construction rules and variation design method]

6. Behavioral Signals

ERT evaluates behavior across signals such as:

  • consistency,
  • uncertainty expression,
  • divergence across responses,
  • contradiction detection,
  • and confidence calibration.

Rather than producing only a single number, ERT is intended to produce a reliability profile.

This helps identify not only whether a system failed, but what kind of failure may have occurred.
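
A reliability profile, as opposed to a single score, might be represented as in the sketch below. The field names follow the public signal list above, but everything else is an assumption made for illustration: the ReliabilityProfile type, the 0-to-1 value range, and the use of the Brier score (a standard public measure of calibration) for the confidence signal. The source does not state how ERT computes any of these signals.

```python
from dataclasses import dataclass, asdict


def brier_score(confidences: list[float], outcomes: list[bool]) -> float:
    """Mean squared gap between stated confidence and correctness (0 is best)."""
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("need equal-length, non-empty inputs")
    total = sum((c - float(o)) ** 2 for c, o in zip(confidences, outcomes))
    return total / len(confidences)


@dataclass(frozen=True)
class ReliabilityProfile:
    # Each field is a hypothetical score in [0, 1]; higher means more reliable.
    consistency: float
    uncertainty_expression: float
    low_divergence: float          # stored so that less divergence scores higher
    contradiction_detection: float
    confidence_calibration: float

    def weakest_signal(self) -> str:
        """Name the lowest-scoring signal, hinting at the kind of failure."""
        scores = asdict(self)
        return min(scores, key=scores.get)


# Illustrative values only: a system that states 0.9 confidence on four
# answers but is correct on just two of them.
score = brier_score([0.9, 0.9, 0.9, 0.9], [True, True, False, False])
profile = ReliabilityProfile(
    consistency=0.85,
    uncertainty_expression=0.70,
    low_divergence=0.80,
    contradiction_detection=0.75,
    confidence_calibration=1 - score,
)
weakest = profile.weakest_signal()
```

Keeping the signals separate makes it possible to say that this hypothetical system is mostly consistent but overconfident, which a single averaged number would hide.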

[REDACTED — protected diagnostic taxonomy detail]

7. Certification Layer: ERC

The Epistemic Reliability Certification, or ERC, translates ERT results into a tiered reliability interpretation.

Public-safe tier framing:

  • ERC-C — basic reliability under limited variation.
  • ERC-B — consistent reasoning under standard conditions.
  • ERC-A — strong stability with calibrated uncertainty.
  • ERC-S — strong observed reliability under structured adversarial and ambiguous evaluation conditions.

These tiers should be interpreted carefully. They are not a universal guarantee of safety or truth. They are intended as an evaluation signal under defined test conditions.
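
For readers who want to handle the tiers programmatically, the public tier labels can be held in an ordered enumeration, as sketched below. The ordering from C up to S is inferred from the descriptions above, and the ErcTier type and meets helper are hypothetical. No score thresholds appear because threshold calibration is not public.

```python
from enum import IntEnum


class ErcTier(IntEnum):
    # Ordering C < B < A < S is inferred from the public tier descriptions.
    C = 1
    B = 2
    A = 3
    S = 4

    @property
    def label(self) -> str:
        """Public label for the tier, e.g. 'ERC-A'."""
        return f"ERC-{self.name}"


# Descriptions copied from the public tier framing.
PUBLIC_DESCRIPTIONS = {
    ErcTier.C: "basic reliability under limited variation",
    ErcTier.B: "consistent reasoning under standard conditions",
    ErcTier.A: "strong stability with calibrated uncertainty",
    ErcTier.S: (
        "strong observed reliability under structured adversarial "
        "and ambiguous evaluation conditions"
    ),
}


def meets(tier: ErcTier, required: ErcTier) -> bool:
    """Return True if an assigned tier is at least the required tier."""
    return tier >= required


summary = [f"{t.label}: {PUBLIC_DESCRIPTIONS[t]}" for t in ErcTier]
```

A consumer of ERC results could then write a check such as meets(ErcTier.A, ErcTier.B) while still treating the tier as an evaluation signal under defined test conditions, not as a guarantee.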

8. How ERT Differs From Standard Benchmarks

Standard benchmarks often evaluate whether a model reaches expected outputs on fixed tasks.

ERT evaluates whether reasoning behavior remains reliable under controlled change.

Common Benchmark Focus    ERT Focus
Static evaluation         Controlled variation
Accuracy-focused          Reliability-focused
Single outputs            Cross-response analysis
Limited diagnostics       Interpretable failure patterns

9. Why This Work Is Timely

AI systems are increasingly used in settings where uncertainty, ambiguity, and decision support matter. In those settings, failures may not appear as obvious wrong answers. They may appear as unstable reasoning, misplaced confidence, or contradiction across related responses.

ERT addresses this gap by studying how reasoning behaves under variation.

10. What Changed in Direction

This stage of the work moved ERT from a simple, correctness-adjacent evaluation idea toward a behavioral reliability framework.

The core shift is:

from “Did the answer match?”
to “Did the reasoning remain stable, calibrated, and accountable?”

Public Boundary

The overall framework can be described publicly. Protected details remain private, including exact evaluator internals, scoring methods, test-pack generation, threshold calibration, and full diagnostic taxonomy.

[REDACTED — internal implementation notes and calibration details reserved for controlled disclosure]