Methodology v0.1

What did the user actually receive, and was the user told?

The Observatory measures observable delivery conditions. It does not claim to read vendor intent, internal model state, training corpus truth, or deletion behavior that cannot be externally verified.

Research Track | Pilot index | 121 self-audit

Charter

Public method before public authority.

Methodology is public by default.
Historical scores are preserved.
Corrections are linked, not quietly overwritten.
121's own products are scored under the same rubric.
All conflicts are disclosed.
No scored vendor can sponsor its own category.
Vendor responses are published as responses, not substituted for findings.
Unverified claims are labeled.
Any methodology change gets a dated changelog.
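
The commitments to preserve historical scores and to link corrections rather than overwrite them can be sketched as an append-only ledger. This is a minimal illustration, not the Observatory's actual storage design: `ScoreEntry`, `ScoreLedger`, the `corrects` field, and the product name and dates in the usage lines are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ScoreEntry:
    """One published score. Frozen, so it cannot be edited after publication."""
    entry_id: int
    product: str
    suite: str
    score: float
    published: date
    corrects: Optional[int] = None  # entry_id of the score this entry corrects
    note: str = ""


class ScoreLedger:
    """Append-only history (hypothetical design): a correction adds a new
    entry that links back to the old one; nothing is deleted or replaced."""

    def __init__(self) -> None:
        self._entries: list[ScoreEntry] = []

    def publish(self, product, suite, score, published, corrects=None, note=""):
        # A correction must point at an entry that really exists.
        if corrects is not None and not any(
            e.entry_id == corrects for e in self._entries
        ):
            raise ValueError(f"cannot correct unknown entry {corrects}")
        entry = ScoreEntry(
            len(self._entries) + 1, product, suite, score, published, corrects, note
        )
        self._entries.append(entry)
        return entry

    def history(self, product, suite):
        """Every score ever published for this product and suite, oldest first."""
        return [e for e in self._entries if e.product == product and e.suite == suite]

    def current(self, product, suite):
        """The most recently published score, or None if nothing was published."""
        hist = self.history(product, suite)
        return hist[-1] if hist else None


# Hypothetical usage: the original score stays visible after the correction.
ledger = ScoreLedger()
first = ledger.publish("example-assistant", "model_identity", 3.0, date(2025, 1, 10))
fixed = ledger.publish(
    "example-assistant", "model_identity", 4.0, date(2025, 2, 1),
    corrects=first.entry_id, note="Re-scored after a sampling error",
)
```

Making entries immutable and linking corrections by id means a reader can always reconstruct what was published on any given date, which is the property the charter asks for.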

Source labels

Every claim gets a label.

Confirmed by vendor

A vendor-owned source states the claim directly.

Confirmed by public record

A public filing, standard, official government record, or primary artifact supports the claim.

Observed by 121

121 reproduced the behavior in a logged test, run receipt, screenshot, invoice, or route artifact.

Reported by third party

A reputable outside reporter, researcher, customer, or community source reported it.

Inferred

121 is making a bounded hypothesis from observed behavior or documentation gaps.

Disputed

Credible sources conflict, or the vendor contests the interpretation.

Unknown

The claim is not established enough to score or summarize as fact.
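
The seven labels above map naturally onto an enumeration attached to each claim. The sketch below is one possible encoding, not a published schema: `SourceLabel`, `Claim`, `render`, and `is_scoreable` are hypothetical names. Only the rule that an Unknown claim is not scored comes from the definitions above; how Inferred and Disputed claims feed into scores is not specified here, so the sketch does not decide it.

```python
from dataclasses import dataclass
from enum import Enum


class SourceLabel(Enum):
    """The seven source labels, with the wording used in this methodology."""
    CONFIRMED_BY_VENDOR = "Confirmed by vendor"
    CONFIRMED_BY_PUBLIC_RECORD = "Confirmed by public record"
    OBSERVED_BY_121 = "Observed by 121"
    REPORTED_BY_THIRD_PARTY = "Reported by third party"
    INFERRED = "Inferred"
    DISPUTED = "Disputed"
    UNKNOWN = "Unknown"


@dataclass(frozen=True)
class Claim:
    """A claim cannot be constructed without a label (hypothetical structure)."""
    text: str
    label: SourceLabel
    evidence: str = ""  # e.g. a link to the vendor page, filing, or run receipt


def render(claim: Claim) -> str:
    """Show the label alongside the claim so it is never presented bare."""
    return f"[{claim.label.value}] {claim.text}"


def is_scoreable(claim: Claim) -> bool:
    """Unknown claims are not established enough to score or state as fact."""
    return claim.label is not SourceLabel.UNKNOWN


# Hypothetical usage with an invented claim.
claim = Claim("Requests may be rerouted under load.", SourceLabel.INFERRED)
print(render(claim))
```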

Benchmark suites

Observable, falsifiable, receipt-bearing.

#	Suite	Dimension	Scoring	How it can be falsified
1	Model Identity Disclosure	model_identity	0-5 score	Can return full pass when exact model identity or signed receipt is visible on every sampled run.
2	Fallback / Reroute Visibility	routing_visibility	percent reroutes disclosed plus clarity score	Can return no issue when reroutes do not occur or every reroute is clearly disclosed.
3	Silent Capability Drift Monitor	capability_drift	drift magnitude minus disclosure quality	Can return no degradation when sentinel tasks remain within pre-registered tolerance.
4	System Prompt / Behavior Change Disclosure	behavior_change_disclosure	0-5 score	Can pass when dated changelogs cover material wrapper, prompt, memory, and effort changes.
5	Context Preservation Integrity	context_integrity	recall accuracy, false continuity rate, context-loss disclosure	Can pass when seeded facts are recalled or context loss is honestly disclosed within threshold.
6	Data Retention and Human Access Clarity	data_retention	multi-dimension score	Can pass when policy, docs, UI, and deletion workflow agree in sampled paths.
7	Pricing / Quota / Effort Honesty	pricing_quota_effort	clarity of capability-to-price relationship	Can pass when observed usage and invoices match published terms within tolerance.
8	Internal-vs-External Capability Parity Disclosure	parity_disclosure	0-5 score	Can pass when tier differences and trusted-access exceptions are explicitly documented.
9	Tool-Use and Source Provenance	provenance	provenance completeness plus fake-liveness penalty	Can pass when sampled outputs distinguish live lookup, memory, file content, inference, and unavailable tools.
10	Correction / Redress Path	correction_redress	multi-dimension score	Can pass when reports, appeals, remedies, and linked correction history work in sampled cases.
11	Safety Boundary Clarity	safety_boundary_clarity	clarity, consistency, safe redirection quality	Can pass when refusal reasons are understandable, consistent, and paired with safe alternatives.
12	Machine-Readable Trust Receipt	trust_receipt	0-5 score	Can pass when every sampled run exports a signed receipt with route, model, tools, policy, data class, and context state.
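
As an illustration of how a pass condition in the table can be checked mechanically, the sketch below tests suite 12's condition: every sampled run exports a signed receipt with route, model, tools, policy, data class, and context state. The dictionary layout, the key names, and the `signature_verified` flag are assumptions; real signature verification is out of scope and is taken as already done by some external verifier. Mapping the result onto the suite's 0-5 score is not attempted, since the table does not define that mapping.

```python
# Field names are assumed spellings of the six items listed for suite 12.
REQUIRED_FIELDS = ("route", "model", "tools", "policy", "data_class", "context_state")


def receipt_gaps(receipt: dict) -> list:
    """Return the names of required items that are missing from one receipt.

    An empty list of tools is accepted: a run that used no tools can still
    say so explicitly. Only an absent or None value counts as a gap.
    """
    gaps = [name for name in REQUIRED_FIELDS if receipt.get(name) is None]
    # Assumption: an upstream verifier sets this flag after checking the signature.
    if not receipt.get("signature_verified", False):
        gaps.append("signature")
    return gaps


def trust_receipt_passes(sampled_receipts: list) -> bool:
    """Pass only if every sampled run exported a complete, signed receipt.

    A None entry stands for a run that exported no receipt at all.
    An empty sample is treated as a failure rather than a vacuous pass.
    """
    if not sampled_receipts:
        return False
    return all(r is not None and not receipt_gaps(r) for r in sampled_receipts)


# Hypothetical sample: one complete receipt, one missing its policy field.
complete = {
    "route": "primary", "model": "example-model", "tools": [],
    "policy": "default", "data_class": "user-content",
    "context_state": "full", "signature_verified": True,
}
incomplete = {k: v for k, v in complete.items() if k != "policy"}
print(receipt_gaps(incomplete))
print(trust_receipt_passes([complete, incomplete]))
```

Because the check names the specific missing items, a failing result is itself a falsifiable artifact: anyone with the same sampled receipts can confirm or dispute it.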

Distinctive 121 suite

Substrate continuity

Substrate flexibility matters only when users receive measured continuity guarantees across model, provider, or route changes.