121 Trust Observatory
Capability is not enough. AI users need operational trust.
What did the user actually receive, and was the user told?
v0.2 - additive expansion - methodology under iteration. Not certification. External scores remain TBD until measured. Public-read content is not age gated.
Category phrase
Operational Trust
Operational Trust is not a general AI ethics score, a model-trustworthiness claim, or a safety benchmark. It measures the delivery conditions around AI systems: routing, fallback, product wrappers, data handling, pricing, correction paths, substrate continuity, and receipts.
Strategic anchor
The model is no longer the product. The delivery conditions are the product.
v0.2 state
More coverage, still no fake authority.
12 external profiles
TBD scoresOpenAI, Anthropic, Google, Cursor, Perplexity, xAI, Mistral, Cohere, Apple ML, Meta, Microsoft, and AWS.
13 benchmark suites
FalsifiableEach suite includes run steps, evidence requirements, confidence labels, and a no-problem-found path.
1 computed run
Observed by 121Current measured lane is a local static fixture for Switchboard and Hermes. External vendor scores remain TBD.
Public surfaces
The v0.2 map
Vendor Coverage
Research TrackVendor/system profiles with docs links, confidence labels, and TBD score discipline.
Score Index
Research TrackEvery score cell stays TBD until a measured run receipt exists.
Methodology
Research TrackActual run lifecycle, blinding controls, source labels, corrections, and scoring limits.
Run Receipt
Research TrackThe first v0.2 local fixture run for Switchboard and Hermes disclosure gaps.
Self-Audit
Research Track121 products graded under the same rubric with explicit gaps and N/A labels.
Vendor Response
Research TrackVendor pushback, correction status, and account-gated response intake placeholder.
Changelog
Research TrackEvent log for methodology updates, score changes, corrections, and self-audit changes.
Trust Wire
Research TrackCurated operational-trust intelligence, not a raw scraper feed.
Governance first
Trust has to be inspectable.
Charter commitments
- Methodology is public by default.
- Historical scores are preserved.
- Corrections are linked, not quietly overwritten.
- 121 products are scored under the same rubric.
- All conflicts are disclosed.
- No scored vendor can sponsor its own category.
- Vendor responses are published as responses, not substituted for findings.
- Unverified claims are labeled.
- Any methodology change gets a dated changelog.
- The benchmark has to be able to come back negative.
v0.2 limits
- No external vendor score is published until a benchmark run has a receipt.
- The v0.2 local runner scores only 121 static source evidence; it is not full benchmark automation.
- No certification or trust badge exists in v0.2.
- The Trust Wire is curated and review-gated; raw scraper output is not a product.
- External vendor data retention, internal model parity, and training corpus truth remain only partially observable.
- Private canaries may exist to reduce benchmark gaming, but public scores need public methodology and preserved history.
- Interactive vendor-response submission is scaffolded as account-required until the account system lands.
- The Anthropic Fable 5 incident is a trigger for routing-disclosure testing, not the target or the whole story.