Governed Continual Learning · Evidence

Benchmarks and evidence registry.

This page reports measured evidence for the governed-memory layer: automatic promotion, stateful lifecycle persistence, memory pressure, auditability controls, and supporting workflow diagnostics. The primary evidence is presented first, followed by method limitations and diagnostic measurements. The registry compiles measured scaffolding improvements, not gains from substituting one downstream provider for another.

Memory pressure: 2026-05-23 · Automatic promotion: 2026-05-27 · Lifecycle: 2026-05-30 · Audit controls: 2026-05-27 · DOI evidence package
Core Evidence

Governed-memory evidence spine.

These four core studies serve as the evidence spine for the FieldHash governed-memory architecture. They measure whether the pre-answer control plane governs what reaches the model before generation: promoted authoritative memory, state that survives lifecycle changes, seeded reviewed state under conflict, and audit controls that catch broken governance.

Automatic memory promotion

90/90

The authoritative memory is selected, promoted, and governed before generation.

Internal synthetic adversarial benchmark, not external validation. The strongest independence check is the Claude-authored disjoint n=30 corpus; the n=100 provider replications use a same-family Gemini-authored corpus and frozen semantic-label artifact. A later same-budget Gemini two-pass smart diagnostic on the n=30 corpus selected the current record 30/30 and answered 28/30 with zero stale substitutions, so the public claim should not be framed as beating every same-budget selector. The benchmark measures governed answer-path control under singleton-current memory conflict, not broad reasoning superiority, universal memory safety, model-weight learning, provider-invariant fact extraction, or perfect source-span extraction. Claude Opus 4.7 n=100 misses were empty provider responses rather than stale substitutions.

Read case study

Governed learning lifecycle

2400/2400

Governed state survives update, rollback, repair, compaction, and repeated reads.

Internal diagnostic on a FieldHash-authored semantic-authority lifecycle corpus, not external validation, not base-model weight learning, and not a claim that FieldHash out-reasons frontier LLMs. Authority metadata is written by governed operations, so this does not prove authority inference. The claim is bounded by two controls: visible rollback is recoverable when present as retrievable text, and a full-operation-log baseline can replay the lifecycle. The supported claim is durable materialized governed state versus bounded read-time reconstruction.
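
The contrast between durable materialized state and read-time reconstruction from a full operation log can be illustrated with a toy model. The sketch below is hypothetical: the `GovernedStore` class, the `replay` helper, and the semantics chosen for `update`, `rollback`, and `compact` are assumptions made for illustration, not the FieldHash implementation.

```python
from dataclasses import dataclass, field


def _apply(history, op, value):
    """Apply one lifecycle operation to a list of committed values (oldest first)."""
    if op == "update":
        history.append(value)
    elif op == "rollback":
        if len(history) > 1:      # assumption: rollback never empties the store
            history.pop()
    elif op == "compact":
        del history[:-1]          # assumption: compaction keeps only the current value
    else:
        raise ValueError(f"unknown operation: {op}")


@dataclass
class GovernedStore:
    """Toy store that keeps materialized state and a full operation log."""
    history: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def apply(self, op, value=None):
        self.log.append((op, value))
        _apply(self.history, op, value)

    @property
    def current(self):
        return self.history[-1] if self.history else None


def replay(log):
    """Full-operation-log baseline: rebuild the current value at read time."""
    history = []
    for op, value in log:
        _apply(history, op, value)
    return history[-1] if history else None


store = GovernedStore()
for op, value in [("update", "v1"), ("update", "v2"), ("rollback", None), ("compact", None)]:
    store.apply(op, value)

reads = [store.current for _ in range(3)]   # repeated reads do not change state
print(reads, replay(store.log))
```

In this toy model the materialized read and the full-log replay agree, which mirrors the bounding control described above: a baseline that sees the whole log can reconstruct the lifecycle, so the supported claim concerns durable materialized state rather than an ability the replay baseline lacks.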

Read case study

Governed memory pressure

600/600

Seeded reviewed/current state is enforced under plausible stale context.

Internal seeded-authority memory-pressure benchmark; the scenario was generated by an external base model, then executed through the FieldHash harness with deterministic exact-substring scoring (binary match against the approved codeword, same matching code across all complete comparison runs). The current record carries pre-written FieldHash governance metadata such as reviewed/current authority state; retrieval-only and prompt-only baselines receive retrieved records as ordinary context without that structured metadata, approved-current precedence, supersession handling, Bayesian arbitration, or hub compression. This is a supported governed-stack enforcement claim, not a pure single-variable ablation and not proof that FieldHash infers authority from raw text better than a same-model selector. It measures approved-context preservation and what memory was allowed to shape the answer under seeded conflict, not broad reasoning superiority, universal memory safety, independent external validation, billing-token reduction, or database storage compression. The original n=200 run predates strict provider metadata; the refreshed Gemini, Claude Opus, and GPT provider comparisons include row-level provider/model guards and direct answer-path exposure telemetry.
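
Deterministic exact-substring scoring of the kind described above can be sketched in a few lines. The function names, the sample answers, and the codewords below are hypothetical, and case-sensitive matching is an assumption; the page does not state how case is handled.

```python
def score_answer(answer: str, approved_codeword: str) -> int:
    """Return 1 if the approved codeword appears verbatim in the answer, else 0."""
    return int(approved_codeword in answer)


def pass_count(rows):
    """Count passing rows; each row is an (answer, approved_codeword) pair."""
    return sum(score_answer(answer, codeword) for answer, codeword in rows)


# Hypothetical rows: a correct answer, a stale substitution, and an empty response.
rows = [
    ("The current release codeword is AMBER-17.", "AMBER-17"),
    ("The release codeword is AMBER-12.", "AMBER-17"),
    ("", "AMBER-17"),
]
print(f"{pass_count(rows)}/{len(rows)}")
```

Because the score is a binary substring match, the same function can be applied unchanged to every comparison run, which is what makes the scoring deterministic across baselines.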

Read case study

Auditability diagnostic

437/437

Deliberate broken-governance controls are detected.

Internal deterministic code-path diagnostic with zero LLM calls. It verifies governed memory state, falsifiability checks, and audit telemetry behavior; it is not external validation, not a named memory-system comparison, and not evidence of base-model weight learning or general reasoning improvement.

Read case study

How these fit together

Automatic promotion is upstream of memory pressure: first identify the authoritative memory, then govern what survives stale-context pressure. The memory-pressure benchmark supports the second half after reviewed/current state exists; it should not be read as an authority-inference superiority claim. The lifecycle pillar tests whether governed state remains durable through rollback, repair, compaction, and repeated reads. The auditability diagnostic checks whether broken governance is caught instead of silently steering answers. The same-budget two-pass diagnostic narrows the promotion claim to governed answer-path control rather than basic semantic selection. Read the loop case study.

Evidence package

The original public-safe methods reports, the governed-learning loop synthesis report, figures, source-artifact hashes, and the newer lifecycle diagnostics are retained as the FieldHash governed-context evidence set. Row-level prompts, answers, and implementation-sensitive traces are retained for qualified private review. View DOI 10.5281/zenodo.20401670.

How to read the rest of this registry

Core governed-memory studies

Public-safe methods reports, source-hash manifests, aggregate artifacts, and case-study audits for the four governed-memory proof points.

Workflow diagnostics

Same-provider reasoning and selection checks that measure scaffolding quality, not provider substitution.

Mechanism checks

Deterministic controls for memory gates, audience scope, compression, telemetry, and routing behavior.

Research evidence

Deep Synthesis and Research Lab outputs that show hypothesis generation, validation plans, symbolic regression, and pipeline constraints.

Governed Continual Learning

The category claim must demonstrate both adaptation and restraint.

FieldHash uses “governed continual learning” narrowly: validated outcomes can change future retrieval, routing, and memory influence, while scope, confidence, compression, and telemetry gates decide what is allowed to carry forward.

It is not base-model fine-tuning, preference-data training, plain retrieval, or context compaction. The benchmarks below separate the two halves of the claim: what gets reused, and what is blocked from becoming a future prior.

Reviewed context persists

98.96%

In a 32-case live governed-memory benchmark, FieldHash recovered current approved project context with a 98.96% mean recall score, a 100% seeded memory-retrieval rate, and 0% control-user leakage across continuity, rejected-noise, superseding-update, and topic-isolation tasks.

Scoped memory retrieval plus correction precedence.

Bad priors are blocked

5/5

A deterministic governed-learning controls benchmark passed 5/5 mechanism checks covering memory arbitration, organization-scoped collective insight and heuristic write gates, audience-scoped retrieval within an organization, hub compression, and audit telemetry.

Write gates, audience scope, hub compression, and telemetry checks.

Audit trail is falsifiable

437/437

The governed memory auditability diagnostic passed 437/437 deterministic checks across 36 lifecycle scenarios. That includes 257 positive governance invariants and 180/180 negative controls that deliberately disabled governance or corrupted state, confirming the suite catches stale exposure, rejected-context promotion, missing superseded_by links, scope leakage, and stale re-promotion.

State invariants plus deliberate broken-governance controls.

Workflow quality improves

+48.6%

Against the same-provider direct baseline, FieldHash improved average composite reasoning quality by 48.6% on a 56-prompt live paired benchmark while improving grounding fit from 94.64% to 100%.

The same-provider baseline isolates the FieldHash layer.

Rejected-learning control example

The governed-memory suite includes later turns that introduce discarded brainstorming, superseding corrections, and unrelated project context. Passing behavior means the system can preserve reviewed work while preventing stale, noisy, or out-of-scope material from becoming the answer’s hidden prior. The newer auditability diagnostic adds deliberate breakages so the suite must also catch disabled governance, missing supersession links, rejected-context promotion, scope leakage, and stale re-promotion.

Supporting Workflow Diagnostics

Workflow-quality suite.

These measurements test open-form reasoning quality, exactness, task selection, and semantic grounding against a same-provider direct baseline. They support the product story, but the flagship governed-context studies above carry the strongest public evidence.

Reasoning quality

+48.6%

0.3740 to 0.5556

Grounding fit

100%

0.9464 to 1.0000

Task selection

+0.40

0.00 to 0.40

Exact correctness

100%

100% to 100%

Reviewer Stress Test

The broader diagnostic suite tests whether the reasoning lift holds at a larger sample size.

In the broader diagnostic suite (v4), FieldHash measured a 52.2% relative reasoning lift with 155 wins, 5 losses, and 43 ties; exact correctness held at 100% on 24 deterministic tasks, and semantic grounding held at 100% across 32 ambiguity-control cases.

Broader diagnostic suite, not yet the promoted public headline; it preserves the reasoning lift at larger sample size while showing task-selection remains an active reliability target.

Reasoning lift

+52.2%

0.3509 to 0.5340

Paired wins

155/5/43

wins / losses / ties

Mean-delta CI95

[0.1668, 0.1976]

Exact correctness

100%

24 deterministic tasks

Semantic grounding

100%

32 ambiguity-control cases

Task selection

38.98%

0.0000 to 0.3898

What this tells us

The diagnostic suite is useful precisely because it is broader: the paired reasoning lift persisted at larger sample size, exactness held, and semantic grounding held after repair. The remaining softness is narrow and visible: task-selection accuracy remains an active optimization target.

Mechanism Ablation

Workload lift concentrates in the quick-lite reasoning path.

A router-conditioned ablation of the broader diagnostic suite (v4) shows the lift is concentrated in quick-lite-routed rows: 164 quick-lite prompts improved from 0.3507 to 0.5777 with 155 wins, 2 losses, and 7 ties, while 39 direct-high-retained rows were essentially flat.

Router-conditioned ablation, not a randomized forced-routing experiment; prompt difficulty may differ between quick-lite-eligible and direct-high-retained rows.

Quick-lite rows

164

0.3507 to 0.5777

Quick-lite wins

155/2/7

wins / losses / ties

Quick-lite mean delta

+0.2270

Direct-high rows

39

0.3516 to 0.3500
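
As a consistency check, the two router-conditioned groups can be recombined into the suite-level figures reported for the broader diagnostic suite (203 prompts, 0.3509 to 0.5340, 155/5/43). The sketch assumes the quick-lite and direct-high-retained rows partition the suite; the direct-high wins/losses/ties are not published and are only implied by subtraction.

```python
# Published group figures: row count, baseline mean, FieldHash mean.
quick_lite = {"rows": 164, "before": 0.3507, "after": 0.5777}
direct_high = {"rows": 39, "before": 0.3516, "after": 0.3500}
groups = [quick_lite, direct_high]

suite_wlt = (155, 5, 43)        # suite-level wins / losses / ties
quick_lite_wlt = (155, 2, 7)    # quick-lite wins / losses / ties


def weighted_mean(groups, key):
    """Row-weighted mean of a per-group score."""
    total = sum(g["rows"] for g in groups)
    return sum(g["rows"] * g[key] for g in groups) / total


total_rows = sum(g["rows"] for g in groups)
before = weighted_mean(groups, "before")
after = weighted_mean(groups, "after")

# Assumption: whatever is not a quick-lite row is a direct-high-retained row.
implied_direct_wlt = tuple(s - q for s, q in zip(suite_wlt, quick_lite_wlt))

print(total_rows, round(before, 4), round(after, 4), implied_direct_wlt)
```

Under that assumption the row-weighted means reproduce the suite-level scores to within rounding, and the implied direct-high split is 0 wins, 3 losses, and 36 ties, consistent with those rows being essentially flat.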

Governed Memory

Memory governance, not just recall.

In a 32-case live governed-memory benchmark, FieldHash recovered current approved project context with a 98.96% mean recall score, a 100% seeded memory-retrieval rate, and 0% control-user leakage across continuity, rejected-noise, superseding-update, and topic-isolation tasks.

The diagnostic scenario is strictly bounded: it validates whether the governed context stays aligned with verified project facts when subsequent interactions introduce discarded concepts, explicit overrides, or cross-project context. This design tests contextual stability under interference, which a static recall evaluation does not exercise.

Focused internal live benchmark of governed project-memory behavior; it measures update precedence, noise suppression, topic isolation, and user isolation, not general model quality or universal memory performance.

Mean governed recall

98.96%

32 live seeded cases

Memory retrieval

100%

seeded recall turns

Control leakage

0%

32 unseeded controls

Task families

4

continuity, noise suppression, updates, isolation

Procedure

Each diagnostic iteration utilizes an isolated state session. The evaluation sequence seeds a verified baseline context, then introduces systematic interventions during subsequent turns: baseline retrieval, rejected/noisy exploratory content, explicit superseding corrections, or cross-project data. State recall is quantified directly against the generated response, with unseeded control runs executed in parallel to establish the baseline guessing boundary and rule out leakage.

Governed Learning Controls

The gates are measured separately from the model.

A deterministic governed-learning controls benchmark passed 5/5 mechanism checks covering memory arbitration, organization-scoped collective insight and heuristic write gates, audience-scoped retrieval within an organization, hub compression, and audit telemetry.

This benchmark is fully deterministic, verifying that logic-layer constraints and control-plane gates operate exactly as specified prior to and independent of model generation.

Deterministic mechanism benchmark; it verifies code-path controls and telemetry, not open-ended model answer quality.

Control checks

5/5

deterministic code-path benchmark

Pass rate

100%

all mechanism checks passed

Mechanisms

5

arbitration, gates, scope, compression, telemetry

Procedure

The suite runs direct code-path checks for personal-memory supersession, collective organizational insight and heuristic write gates, audience-scoped retrieval within an organization, hub-compression representative selection, and organization-scoped collective-ingestion telemetry.
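
Two of these checks, supersession and audience-scoped retrieval within an organization, can be sketched as a single filter. The `MemoryRecord` type, its fields other than `superseded_by`, and the sample records are hypothetical; this is an illustration of the kind of code-path control being tested, not the FieldHash gate itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemoryRecord:
    record_id: str
    org: str
    audience: str
    text: str
    superseded_by: Optional[str] = None   # set when a later record replaces this one


def retrieve(records, org, audience):
    """Return only current records visible to the given organization and audience."""
    return [
        r for r in records
        if r.org == org and r.audience == audience and r.superseded_by is None
    ]


records = [
    MemoryRecord("r1", "acme", "engineering", "Deploy window is Friday", superseded_by="r2"),
    MemoryRecord("r2", "acme", "engineering", "Deploy window is Tuesday"),
    MemoryRecord("r3", "acme", "finance", "Budget review is in March"),
    MemoryRecord("r4", "globex", "engineering", "Deploy window is Monday"),
]

visible = retrieve(records, org="acme", audience="engineering")
print([r.record_id for r in visible])
```

A deterministic check of this shape needs no model call: the superseded record, the other audience, and the other organization are all excluded before anything reaches generation.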

Governed Memory Auditability

The audit state has a falsifiability check.

The governed memory auditability diagnostic passed 437/437 deterministic checks across 36 lifecycle scenarios. That includes 257 positive governance invariants and 180/180 negative controls that deliberately disabled governance or corrupted state, confirming the suite catches stale exposure, rejected-context promotion, missing superseded_by links, scope leakage, and stale re-promotion.

Internal deterministic code-path diagnostic with zero LLM calls. It verifies governed memory state, falsifiability checks, and audit telemetry behavior; it is not external validation, not a named memory-system comparison, and not evidence of base-model weight learning or general reasoning improvement.

Total checks

437/437

deterministic memory-state and audit controls

Negative controls

180/180

broken governance states detected

Lifecycle scenarios

36

partial supersession and replacement updates

LLM calls

0

code-path diagnostic, not answer-quality benchmark
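
The idea of a falsifiable audit can be sketched as an invariant checker that must stay silent on healthy state and must fire on deliberately broken state. The record layout, status values, and `audit` function below are hypothetical and cover only three of the failure types named above.

```python
def audit(records, exposed_ids):
    """Return violation messages for a toy governed-memory state."""
    by_id = {r["id"]: r for r in records}
    violations = []
    for r in records:
        # A superseded record must point at a record that actually exists.
        if r["status"] == "superseded" and r.get("superseded_by") not in by_id:
            violations.append(f"missing superseded_by link: {r['id']}")
    for rid in exposed_ids:
        status = by_id[rid]["status"]
        if status == "superseded":
            violations.append(f"stale exposure: {rid}")
        elif status == "rejected":
            violations.append(f"rejected-context promotion: {rid}")
    return violations


healthy = [
    {"id": "a", "status": "superseded", "superseded_by": "b"},
    {"id": "b", "status": "current"},
    {"id": "c", "status": "rejected"},
]

# Positive invariant: healthy state with only the current record exposed.
positive = audit(healthy, exposed_ids=["b"])

# Negative control 1: corrupt the supersession link.
broken_link = [dict(r) for r in healthy]
broken_link[0]["superseded_by"] = None
negative_link = audit(broken_link, exposed_ids=["b"])

# Negative control 2: disable governance so stale and rejected records are exposed.
negative_exposure = audit(healthy, exposed_ids=["a", "b", "c"])

print(positive, negative_link, negative_exposure)
```

A negative control passes only when the audit reports a violation, so a checker that always returned an empty list would fail the negative half of the suite.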

Core Workflow Measurements

How we measured it

Live reasoning benchmark

Against the same-provider direct baseline, FieldHash improved average composite reasoning quality by 48.6% on a 56-prompt live paired benchmark while improving grounding fit from 94.64% to 100%.

Reasoning quality

+48.6%

0.3740 to 0.5556

Steering usefulness

0.3857

0.0125 to 0.3857

Grounding fit

100%

0.9464 to 1.0000

Sample: 56 live prompts

Measures whether the companion chooses a more useful line of thought under live conditions while staying grounded.

Average uplift across a 56-prompt suite; not a claim that every prompt improves equally.
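
The headline percentage is a relative change in the mean composite score, not a percentage-point difference. A minimal sketch of the arithmetic, using the published before and after means (the `relative_lift` helper name is illustrative):

```python
def relative_lift(baseline: float, treated: float) -> float:
    """Relative improvement of the treated mean over the baseline mean."""
    return (treated - baseline) / baseline


lift_56 = relative_lift(0.3740, 0.5556)    # 56-prompt live paired benchmark
lift_203 = relative_lift(0.3509, 0.5340)   # broader diagnostic suite (v4)

print(f"{lift_56:.1%}")    # 48.6%
print(f"{lift_203:.1%}")   # 52.2%

# The absolute change in the 56-prompt composite score is much smaller than 48.6.
absolute_delta = 0.5556 - 0.3740
print(f"{absolute_delta:.4f}")   # 0.1816
```

Reading the figure this way keeps the two scales apart: the composite score moved by about 0.18 on its own scale, which is a 48.6% increase relative to the 0.3740 baseline.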

Evidence package

Website Benchmark Suite v2

Paired same-provider direct-baseline benchmark covering 56 live reasoning prompts, with methodology and component reports retained for technical review.

live benchmark · Updated 2026-03-29 · technical review

Exact correctness floor

Exact correctness held at 100% on 24 callable-backed deterministic tasks, used as a deterministic safety-floor alongside the broader reasoning benchmark.

Exact correctness

100%

100% to 100%

Sample: 24 deterministic tasks

Checks that the system preserves or improves closed-form correctness while adding reasoning scaffolding.

This is a deterministic callable-backed multiple-choice safety-floor metric, not an open-form reasoning benchmark.

Evidence package

Website Benchmark Suite v2 exactness slice

Callable-backed deterministic task slice used as a correctness safety-floor check alongside the broader reasoning suite.

offline benchmark · Updated 2026-03-29 · technical review

Task selection

On the approved gold lens-family set, task-selection accuracy improved from 0.00 to 0.40 across 30 prompts.

Task selection

+0.40

0.00 to 0.40

Sample: 30 approved-gold prompts

Measures whether the system chooses a more useful reasoning family before answering.

Curated approved-gold lens-family benchmark; not an open-world classifier claim.

Evidence package

Website Benchmark Suite v2 task-selection slice

Approved-gold lens-family selection benchmark measuring whether the system chooses a useful reasoning family before answering.

offline benchmark · Updated 2026-03-29 · technical review

Semantic grounding proxy

On a 16-case ambiguity-control proxy benchmark, semantic grounding landed at 1.00 artifact-class accuracy and 1.00 prompt-family accuracy.

Artifact class accuracy

1.00

Prompt-family accuracy

1.00

Sample: 16 ambiguity-control cases

Checks that ambiguous prompts stay in the right semantic universe instead of collapsing into generic or literalized readings.

Narrow ambiguity-control proxy benchmark; supporting evidence, not the headline reasoning claim.

Evidence package

Website Benchmark Suite v2 semantic-grounding slice

Focused ambiguity-control proxy benchmark measuring artifact-class and prompt-family grounding under ambiguous prompts.

offline benchmark · Updated 2026-03-29 · technical review

Where it wins

Ambiguous Named Concept

0.3411 → 0.5872

When a prompt uses a poetic or metaphorical name, the system keeps it in the right conceptual frame instead of interpreting it literally.

Mathematical Strategy

0.3793 → 0.5536

On math problems, the system picks stronger proof strategies and gives clearer next-step guidance.

Operational Tradeoff

0.4115 → 0.5754

For real-world trade-off decisions, the system identifies the variable that actually matters instead of listing generic pros and cons.

UI System Design

0.3689 → 0.5782

On design problems, the system finds the real implementation decision point instead of writing a generic architecture overview.

High-Stakes Advisory

0.3370 → 0.4843

Under high-stakes or ambiguous pressure, the system provides grounded, practical strategies instead of vague reassurance.

Scientific Mechanism

0.3799 → 0.5809

On science questions, the system frames mechanisms more precisely and distinguishes between competing experimental approaches.

Ambiguous Abstract

0.4003 → 0.5293

For abstract or philosophical prompts, the system gives substantive framing instead of decorative language.

Example judgments

Clear win

Ambiguous concept framing

Under highly figurative prompts (such as "Glass Field"), unguided baselines frequently literalized metaphorical inputs. FieldHash successfully isolated the latent functional constraints, prioritizing structural implementation variables in its response.

Typical win

Operational tradeoff

Rather than presenting balanced lists of generic advantages and disadvantages, the FieldHash control layer identified the governing trade-off variable: the factor that dictates real-world viability.

Clear miss

Scientific mechanism prompt

A diagnostic stress-suite miss revealed instances where the response drifted into narrative-conversational language rather than maintaining strict, quantitative mechanism analysis. This boundary illustrates why task-routing precision and semantic grounding are tracked as explicit architectural optimization targets.

Infrastructure & Governance

Tools manifold routing

Tools manifold routing improved top-1 selection by 3.77 percentage points on real paired events and by 5.34 percentage points on the broader combined benchmark.

Real paired events

+3.77 pp

50.94% to 54.72%

Combined benchmark

+5.34 pp

Broad benchmark-scale evaluation

Sample: 53 real paired events; combined benchmark-scale evaluation

Measures whether learned routing improves tool choice compared with a fixed baseline policy.

Real-event significance remains underpowered at n=53; strongest support comes from the broader combined benchmark.
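
A rough calculation shows why n=53 is underpowered. The hit counts 27 and 29 are inferred from the published 50.94% and 54.72% at n=53, not taken from the evidence package, and the standard error below is an unpaired approximation used only to convey scale. A paired test such as McNemar's would need the discordant-pair counts, which are not published here.

```python
import math

n = 53
baseline_hits, routed_hits = 27, 29   # inferred from 50.94% and 54.72% of 53 events

baseline_acc = baseline_hits / n
routed_acc = routed_hits / n
lift_pp = (routed_acc - baseline_acc) * 100

# Approximate sampling noise of a single top-1 accuracy estimate at n=53.
se_pp = math.sqrt(baseline_acc * (1 - baseline_acc) / n) * 100

print(f"{baseline_acc:.2%} -> {routed_acc:.2%}, lift {lift_pp:.2f} pp, SE about {se_pp:.1f} pp")
```

On these inferred counts the lift corresponds to two additional correct selections out of 53, and the sampling noise of one accuracy estimate is larger than the lift itself, which is why the broader combined benchmark carries the stronger support.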

Evidence package

Tools manifold routing significance package

Paired real-event and broader combined routing benchmark measuring learned tool selection against a fixed baseline policy.

offline benchmark · Updated 2026-02-25 · technical review

Manifold stability

Production manifolds validated above 91% accuracy while monitored training drift stayed within a bounded L2 range of 0.014 to 0.121.

Validation accuracy

>91%

L2 drift band

0.014–0.121

Convergence

1–13 epochs

Sample: Nine trained manifolds

Supports the claim that learning components remain stable enough to deploy under governance.

Training and validation stability evidence for manifolds, not a live companion benchmark.

Evidence package

Manifold validation and drift report

Training and validation stability evidence for production manifolds, including validation accuracy and bounded L2 drift ranges.

offline benchmark · Updated 2025-12-01 · under NDA

Mesh sharding speed

On a sharded synthesis workload, mesh-parallel execution achieved a 2.74x mean speedup over local execution across 10 queries.

Mean speedup

2.74x

CI95

2.66x–2.83x

Queries

10

Sample: 10 benchmark queries

Shows that the mesh can materially reduce wall-clock time for sharded synthesis workloads.

Measures orchestration and distributed execution speed for a specific sharded synthesis workload, not model quality.

Evidence package

Mesh synthesis sharding benchmark

Sharded synthesis workload comparison measuring wall-clock speedup for mesh-parallel execution versus local execution.

offline benchmark · Updated 2026-01-25 · under NDA

Deep Synthesis & Research Lab

PIMA Diabetes

On the PIMA Diabetes benchmark, the research pipeline reached 85.3% AUC on 768 rows while retaining the safe local-data path when a borrowed configuration would hurt performance.

AUC

85.3%

Rows

768

Sample: 768 rows

Shows parity-level performance on a clean medical classification benchmark with governance preventing a harmful borrowed configuration.

Dataset-task benchmark for the research platform, not a live companion benchmark.

Evidence package

QARIN Research Lab tabular benchmark report

Dataset-task benchmark evidence for the research pipeline on PIMA Diabetes, with governance retaining the safe local-data path when a borrowed configuration would hurt performance.

offline benchmark · Updated 2025-11-30 · under NDA

Non-linear stress test

On the non-linear stress benchmark, the research pipeline reached 90.8% AUC, outperformed the linear baseline by 10.5%, and filtered 87% of noise columns.

AUC

90.8%

Lift vs linear baseline

+10.5%

Noise filtered

87%

Sample: 1,000 rows, 23 features

Shows autonomous signal detection and noise filtering on a deliberately difficult synthetic benchmark.

Synthetic signal-vs-noise benchmark; illustrates autonomous feature selection, not a production customer metric.

Evidence package

QARIN Research Lab non-linear stress benchmark

Synthetic signal-versus-noise benchmark measuring autonomous signal detection and noise filtering under controlled conditions.

offline benchmark · Updated 2025-11-30 · under NDA

Adult Census

On Adult Census, the research pipeline reached 91.1% AUC on 30,162 rows and 96 features while degrading gracefully when dynamic grouping timed out.

AUC

91.1%

Rows

30,162

Features

96

Sample: 30,162 rows, 96 features

Shows robustness on high-dimensional, messy, real-world tabular data.

Dataset-task benchmark for robustness and fallback behavior, not a live companion benchmark.

Evidence package

QARIN Research Lab Adult Census benchmark

High-dimensional tabular benchmark measuring robustness and graceful degradation under dynamic grouping timeouts.

offline benchmark · Updated 2025-11-30 · under NDA

Symbolic regression

The symbolic-regression stack recovered Kepler’s Third Law and the Rydberg Formula with perfect fit on standard benchmark tasks.

Kepler fit

R² = 1.0

Kepler complexity

4 nodes

Rydberg fit

R² = 1.0

Sample: Standard physics benchmark tasks

Shows interpretable equation discovery rather than black-box prediction alone.

Physics symbolic-regression benchmark; demonstrates the research pipeline, not the consumer companion.
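
For readers unfamiliar with the Kepler task, the target relationship is that orbital period scales as the semi-major axis to the power 3/2. The sketch below is an ordinary least-squares fit in log-log space on rounded textbook values for six planets. It is not symbolic regression and not the QARIN stack; it only illustrates the law a symbolic-regression system is expected to recover.

```python
import math

# Semi-major axis (AU) and orbital period (years), rounded textbook values.
planets = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000),
    "Mars": (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
    "Saturn": (9.537, 29.457),
}

xs = [math.log(a) for a, _ in planets.values()]
ys = [math.log(t) for _, t in planets.values()]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Closed-form simple linear regression of log(period) on log(axis).
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r2 = 1 - ss_res / ss_tot

print(f"exponent ~ {slope:.3f}, R^2 ~ {r2:.5f}")
```

The fitted exponent lands close to 1.5 with R² near 1, so a near-perfect fit on this task is expected for any method that finds the right functional form; the interpretability claim rests on recovering the expression itself.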

Evidence package

QARIN symbolic-regression benchmark report

Physics-law recovery benchmark demonstrating interpretable equation discovery on standard symbolic-regression tasks.

offline benchmark · Updated 2026-01-15 · under NDA

Alzheimer’s biomarker discovery

On the GSE84422 Alzheimer’s candidate-marker task, the research pipeline produced AUC 0.855 on an internally processed evaluation matrix across 19 brain regions.

Validation AUC

0.855

Data matrix

processed

Brain regions

19

Sample: Internally processed GSE84422 evaluation matrix, 19 regions

Shows structured hypothesis generation on a real biological dataset with literature-grounded marker interpretation.

Scientific discovery benchmark on curated transcriptomics data; not a live companion eval.

Evidence package

QARIN biomedical discovery benchmark report

Curated transcriptomics benchmark and literature-grounded marker interpretation for the GSE84422 Alzheimer’s task.

offline benchmark · Updated 2026-01-27 · under NDA

FieldHash & Provenance

These benchmarks validate the integrity and auditability of the provenance layer governing verified artifacts and policy actions. The accompanying FieldHash documentation outlines the underlying cryptographic registry and state-verification sequence.

FieldHash hardening closure

On the measured adversarial synthesis benchmark, a standard-profile uniform-blend attack passed in 15 of 800 trials while the hardened profile closed that gap to 0 of 800.

Standard profile

15/800

1.875%

Hardened profile

0/800

Sample: 800 trials per profile

Shows that hardening materially closed a measured attack family rather than relying on a generic security narrative.

Attack-family measurement on a specific adversarial synthesis benchmark; not a universal security guarantee.

Evidence package

FieldHash adversarial hardening package

Measured adversarial synthesis benchmark comparing standard and hardened profiles against a uniform-blend attack family.

adversarial validation · Updated 2026-02-17 · public summary

FieldHash production-gated adaptive campaign

In the calibration-conditioned adaptive ML campaign, production-gated verification measured 0 of 5,000 successful forgeries per tested model, with a Wilson 95% upper bound of 0.0768%.

Production-gated acceptance

0/5000

Wilson 95% upper bound

0.0768%

Sample: 5,000 trials per tested model

Shows that the production-gated path held under stronger adaptive attacks than the policy-only path.

Per-tested-model result under the documented production-gated verifier and no-signing-key assumption; not an absolute impossibility claim.
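
The reported upper bound can be reproduced with the standard Wilson score interval for a binomial proportion. The `wilson_interval` helper below is an illustrative implementation, assuming a two-sided 95% interval with z ≈ 1.96; it is not taken from the evidence package.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.959964):
    """Wilson score interval for a binomial proportion (two-sided, default 95%)."""
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(0, 5000)
print(f"{high:.4%}")   # 0.0768%

# With zero successes the upper bound reduces to z^2 / (n + z^2).
closed_form = 1.959964 ** 2 / (5000 + 1.959964 ** 2)
print(abs(high - closed_form) < 1e-12)   # True
```

The bound is what keeps a 0 of 5,000 result from reading as an impossibility claim: zero observed forgeries is compatible with a true per-trial rate of up to about 0.077% at this sample size.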

Evidence package

FieldHash adaptive spoofing campaign

Calibration-conditioned adaptive ML spoofing campaign under the documented production-gated verifier and no-signing-key assumption.

adversarial validation · Updated 2026-02-17 · public summary

Scientific Boundaries & Caveats

Defined reasoning boundaries: These metrics isolate cognitive alignment, instruction adherence, and state-governed retrieval. They do not simulate general consciousness or unconstrained artificial general intelligence.

Evaluation footprint: The test sets reflect distinct governed-context scenarios; they are not exhaustive of multi-modal enterprise operations.

Workflow diagnostics: The measured +48.6% reasoning lift represents a statistical mean across the 56-prompt diagnostic set. This diagnostic metric serves as secondary validation rather than the primary governed-context proof.

Broader diagnostic suite (v4): The May 2026 diagnostic run demonstrated a +52.2% reasoning lift across 203 live test prompts. Exactness and semantic grounding held at a 100% correctness floor, though dynamic task selection remains an active focus for optimization.

Scope of metrics: The exact correctness metrics establish a deterministic safety-floor, while semantic grounding functions as a high-fidelity control proxy rather than the primary architectural benchmark.

Review the evidence.

This registry documents verified benchmark milestones and active verification runs. Institutional partners may request secure access to row-level traces, the core architectural whitepaper, and advanced evaluation datasets.

Request access

Ready to build?

These quantitative results establish the empirical foundation. The architectural whitepaper details the underlying mechanics, and our case studies demonstrate these governed behaviors in active environments.