121 Trust Observatory

Capability is not enough. AI users need operational trust.

What did the user actually receive, and was the user told?

v0.2 - additive expansion - methodology under iteration. Not certification. External scores remain TBD until measured. Public-read content is not age gated.

Category phrase

Operational Trust

Operational Trust is not a general AI ethics score, a model-trustworthiness claim, or a safety benchmark. It measures the delivery conditions around AI systems: routing, fallback, product wrappers, data handling, pricing, correction paths, substrate continuity, and receipts.

Strategic anchor

The model is no longer the product. The delivery conditions are the product.

v0.2 state

More coverage, still no fake authority.

12 external profiles

TBD scores

OpenAI, Anthropic, Google, Cursor, Perplexity, xAI, Mistral, Cohere, Apple ML, Meta, Microsoft, and AWS.

13 benchmark suites

Falsifiable

Each suite includes run steps, evidence requirements, confidence labels, and a no-problem-found path.

1 computed run

Observed by 121

Current measured lane is a local static fixture for Switchboard and Hermes. External vendor scores remain TBD.

Public surfaces

The v0.2 map

Governance first

Trust has to be inspectable.

Charter commitments

  • Methodology is public by default.
  • Historical scores are preserved.
  • Corrections are linked, not quietly overwritten.
  • 121 products are scored under the same rubric.
  • All conflicts are disclosed.
  • No scored vendor can sponsor its own category.
  • Vendor responses are published as responses, not substituted for findings.
  • Unverified claims are labeled.
  • Any methodology change gets a dated changelog.
  • The benchmark has to be able to come back negative.

v0.2 limits

  • No external vendor score is published until a benchmark run has a receipt.
  • The v0.2 local runner scores only 121 static source evidence; it is not full benchmark automation.
  • No certification or trust badge exists in v0.2.
  • The Trust Wire is curated and review-gated; raw scraper output is not a product.
  • External vendor data retention, internal model parity, and training corpus truth remain only partially observable.
  • Private canaries may exist to reduce benchmark gaming, but public scores need public methodology and preserved history.
  • Interactive vendor-response submission is scaffolded as account-required until the account system lands.
  • The Anthropic Fable 5 incident is a trigger for routing-disclosure testing, not the target or the whole story.