# Trust & Verification
The four trust tiers -- Bronze, Silver, Gold, Platinum -- and how agents earn them through automated and community verification.
Every CIFR agent carries a trust tier that tells other researchers how thoroughly its claims have been verified. Trust is earned incrementally, starting from the moment an agent is registered and growing as benchmarks are reproduced and community members test it on new data.
## The four tiers
| Tier | Badge | How it is earned | Who earns it |
|---|---|---|---|
| Bronze | Builds and runs | The agent's canonical run completed without error. | Automated -- granted at registration. |
| Silver | Benchmarks verified | All benchmark claims declared in `cifr.yml` have been reproduced by CIFR's automated verification within tolerance. | Automated -- triggered after registration. |
| Gold | Community tested | The agent has been independently tested on 3 or more datasets by 3 or more distinct researchers. | Community -- accumulated from challenge runs. |
| Platinum | Expert reviewed | A domain expert has reviewed the agent's methodology and confirmed its correctness, in addition to meeting Gold requirements. | Manual -- requires expert review. |
Trust only moves upward through the tiers. An agent that earns Gold has already passed Bronze and Silver. Platinum is never computed automatically -- it requires a manual flag from a recognized domain expert.
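The ladder logic is simple enough to sketch. The following Python is illustrative only -- the type and field names are assumptions, not CIFR's internals:

```python
from dataclasses import dataclass

# Illustrative verification state; field names are assumptions,
# not CIFR's actual data model.
@dataclass
class VerificationState:
    canonical_run_succeeded: bool
    all_benchmarks_verified: bool
    distinct_challenge_datasets: int
    distinct_challenge_researchers: int

def computed_tier(state: VerificationState) -> str:
    """Walk the ladder from the bottom; each rung requires the one below.

    Platinum is deliberately absent: it is a manual expert-review flag
    layered on top of Gold, never computed from state like this.
    """
    if not state.canonical_run_succeeded:
        return "unverified"
    if not state.all_benchmarks_verified:
        return "bronze"
    if (state.distinct_challenge_datasets < 3
            or state.distinct_challenge_researchers < 3):
        return "silver"
    return "gold"
```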
## Bronze: it runs
Every agent that successfully completes its canonical run earns Bronze automatically. This is the baseline: the code builds, the dependencies resolve, the entrypoint executes without crashing, and at least one output file is produced.
Bronze means: "this agent is functional." It says nothing about whether the results are correct.
## Silver: benchmarks reproduced
To earn Silver, declare benchmarks in your `cifr.yml`:

```yaml
agent:
  name: wavelet-event-detector
  version: 1.0.0
  # ...

benchmarks:
  - dataset: redd-house-1
    metric: f1_score
    value: 0.973
    description: Event detection F1 on REDD House 1 dataset.
  - dataset: redd-house-3
    metric: f1_score
    value: 0.941
    description: Event detection F1 on REDD House 3 dataset.
```
After registration, CIFR's verification service automatically re-runs your agent against each declared dataset and compares the actual result to your claimed value. If every benchmark is reproduced within the default tolerance, the agent earns Silver.
### Default tolerance bands
CIFR uses the following tolerances when no dataset-specific tolerance is registered:
| Metric family | Tolerance | Type |
|---|---|---|
| `accuracy`, `f1_score`, `precision`, `recall`, `auc`, `r2` | +/-0.02 | Absolute (2 percentage points) |
| `rmse`, `mae`, `mape`, `mse` | +/-10% | Relative |
| All other metrics | +/-5% | Relative |
These defaults exist because floating-point arithmetic, hardware differences, and non-deterministic operations (like GPU thread scheduling) can cause small variations between runs even with identical code and data. The tolerances are deliberately generous -- the point is catching genuinely wrong claims, not penalizing numerical noise.
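A minimal sketch of the comparison, assuming the bands above (the function and constants are illustrative, not CIFR's published verifier):

```python
# Illustrative tolerance check mirroring the default bands above.
# Names are assumptions, not CIFR's real API.
ABSOLUTE_METRICS = {"accuracy", "f1_score", "precision", "recall", "auc", "r2"}
RELATIVE_10_METRICS = {"rmse", "mae", "mape", "mse"}

def within_tolerance(metric: str, claimed: float, actual: float) -> bool:
    if metric in ABSOLUTE_METRICS:
        # Absolute band: +/-0.02 (2 percentage points).
        return abs(actual - claimed) <= 0.02
    if metric in RELATIVE_10_METRICS:
        # Relative band: +/-10% of the claimed value.
        return abs(actual - claimed) <= 0.10 * abs(claimed)
    # All other metrics: +/-5% relative.
    return abs(actual - claimed) <= 0.05 * abs(claimed)

# Example: a claimed F1 of 0.973 reproduced at 0.961 still passes.
assert within_tolerance("f1_score", 0.973, 0.961)
```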
Silver means: "the claimed results are reproducible."
When a user sees a Silver badge, they know that CIFR independently ran the agent on the declared datasets and got results consistent with what the authors claimed.
## Gold: community tested
Gold requires broader validation from the research community. The criteria:
- 3 or more distinct datasets tested (beyond the author's own benchmarks).
- 3 or more distinct researchers submitting challenge runs.
Community members submit challenge runs through the CIFR web UI or API. Each run records who submitted it, which dataset was used, and what results were produced. CIFR aggregates these automatically.
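The aggregation is simple set arithmetic over those records. A sketch, assuming each run stores at least a submitter and a dataset:

```python
from dataclasses import dataclass

# Illustrative challenge-run record; real runs also carry the
# produced metric values.
@dataclass
class ChallengeRun:
    submitter: str  # who submitted the run
    dataset: str    # which dataset was used

def meets_gold_criteria(runs: list[ChallengeRun]) -> bool:
    """3+ distinct datasets tested by 3+ distinct researchers."""
    return (len({r.dataset for r in runs}) >= 3
            and len({r.submitter for r in runs}) >= 3)
```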
Gold means: "multiple independent researchers have tested this agent on different data and it holds up." This is the level where an agent starts to be trusted for use in production research pipelines.
## Platinum: expert reviewed
Platinum is the highest tier and cannot be earned automatically. It requires:
- Meeting all Gold criteria (community tested on 3+ datasets by 3+ researchers).
- A domain expert review confirming the methodology is sound.
Platinum is reserved for agents that have been thoroughly vetted by someone with deep expertise in the relevant field. Once set, Platinum is never automatically downgraded -- it persists even if the computed tier from benchmarks and community runs would suggest a lower level.
Platinum means: "a recognized expert in this field has reviewed this agent and vouches for its correctness."
## How provenance type affects trust

The provenance type declared in `cifr.yml` influences what is required to earn each tier:
| Provenance type | Bronze | Silver | Gold |
|---|---|---|---|
| `author_original` | Canonical run succeeds | All benchmarks verified | 3+ datasets, 3+ researchers |
| `community_reimplementation` | Canonical run succeeds | Benchmarks required and verified | 3+ datasets, 3+ researchers |
| `ai_reimplementation` | Canonical run succeeds | Benchmarks mandatory -- cannot reach Silver without at least one declared and verified benchmark | 3+ datasets, 3+ researchers |
| `data_wrapper` | Canonical run succeeds | Benchmarks verified (if declared) | 3+ datasets, 3+ researchers |
| `reference_implementation` | Canonical run succeeds | All benchmarks verified | 3+ datasets, 3+ researchers |
| `original_unpublished` | Canonical run succeeds | All benchmarks verified | 3+ datasets, 3+ researchers |
The key distinction: AI reimplementations are held to a higher standard. Because the code was generated from a paper's text rather than written by someone who understands the methodology, CIFR requires at least one benchmark to be declared and verified before the agent can earn Silver. Without benchmark verification, there is no evidence that the AI's interpretation of the paper is correct.
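Expressed as a check, the Silver gate might look like this (a sketch: behavior beyond what the table states, such as zero-benchmark edge cases, is assumed):

```python
def silver_eligible(provenance: str, declared: int, verified: int) -> bool:
    """Illustrative Silver gate; details beyond the table are assumptions."""
    if provenance == "ai_reimplementation" and declared == 0:
        # Benchmarks are mandatory for AI reimplementations: without
        # one, there is no evidence the generated code is correct.
        return False
    # For every provenance type, all declared benchmarks must have
    # been reproduced within tolerance.
    return verified == declared
```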
## Declaring benchmarks
Add a `benchmarks:` block to your `cifr.yml`:

```yaml
benchmarks:
  - dataset: ieee-33bus
    metric: r_index
    value: 0.847
    description: Resiliency index on the IEEE 33-bus test feeder.
  - dataset: ieee-69bus
    metric: r_index
    value: 0.791
```
Each benchmark needs:
- `dataset` -- an identifier matching a dataset in CIFR's benchmark registry (free-text for now).
- `metric` -- the evaluation metric (lowercase, underscored: `f1_score`, `accuracy`, `rmse`).
- `value` -- your claimed numeric result.
- `description` (optional) -- human-readable context.
You can declare up to 16 benchmarks per agent.
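Before registering, it can be worth linting the block locally. This is not a CIFR tool, just an illustrative helper that mirrors the constraints above:

```python
# Illustrative pre-flight check for a benchmarks: block.
# Not an official CIFR tool; the rules mirror this page.
import yaml  # pip install pyyaml

REQUIRED = {"dataset", "metric", "value"}

def check_benchmarks(path: str = "cifr.yml") -> None:
    with open(path) as f:
        doc = yaml.safe_load(f)
    benchmarks = doc.get("benchmarks", [])
    assert len(benchmarks) <= 16, "at most 16 benchmarks per agent"
    for i, b in enumerate(benchmarks):
        missing = REQUIRED - b.keys()
        assert not missing, f"benchmark {i}: missing fields {missing}"
        assert isinstance(b["value"], (int, float)), \
            f"benchmark {i}: value must be numeric"
```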
## Community challenge runs
Any signed-in researcher can submit a challenge run against a published agent. The process:
1. Pick an agent and a dataset (either one of the agent's declared benchmarks or a new dataset).
2. CIFR runs the agent on the dataset in the standard isolated environment.
3. The result is recorded with the submitter's identity, the dataset, and the metric values.
4. The result counts toward the agent's Gold tier progress.
Challenge runs are public. Other researchers can see all benchmark results for any agent, which datasets have been tested, and who tested them.
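For API submissions, a request might look something like this -- the endpoint path, payload fields, and auth scheme here are hypothetical placeholders, not the documented CIFR API:

```python
# Hypothetical challenge-run submission. The URL, fields, and auth
# header are illustrative; consult the API reference for the real ones.
import requests

response = requests.post(
    "https://cifr.example.org/api/v1/challenge-runs",  # hypothetical
    headers={"Authorization": "Bearer <your-token>"},
    json={
        "agent": "wavelet-event-detector",  # agent under test
        "dataset": "redd-house-2",          # declared or new dataset
    },
)
response.raise_for_status()
print(response.json())  # run id, status, and eventually metric values
```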
## Deprecation
When an agent is superseded by a newer version or a better methodology, the owner can deprecate it:
- Provide a reason explaining why the agent is deprecated.
- Optionally provide a `successor_rai` pointing to the replacement agent.
Deprecated agents remain callable (existing pipelines should not break) but are clearly marked in the UI and API responses. The successor RAI lets callers migrate gracefully.
Deprecation does not affect the trust tier. A Gold agent that gets deprecated stays Gold -- the trust it earned is a historical fact, not a current endorsement.
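On the caller side, the deprecation metadata supports graceful migration. A sketch, assuming the agent metadata exposes the fields described above (`deprecated`, `reason`, `successor_rai`, plus the agent's own RAI in a `rai` field -- all assumptions about the response shape):

```python
# Illustrative migration helper; the metadata shape is an assumption
# based on the fields described on this page.
def resolve_agent(metadata: dict) -> str:
    """Prefer the successor RAI when an agent is deprecated."""
    if metadata.get("deprecated") and metadata.get("successor_rai"):
        print(f"agent deprecated ({metadata.get('reason')}); "
              f"migrating to {metadata['successor_rai']}")
        return metadata["successor_rai"]
    # Deprecated agents remain callable, so falling back is safe.
    return metadata["rai"]
```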