# Trust & Verification
The four trust tiers -- Bronze, Silver, Gold, Platinum -- and how agents earn them through automated and community verification.
Every CIFR agent carries a trust tier that tells other researchers how thoroughly its claims have been verified. Trust is earned incrementally, starting from the moment an agent is registered and growing as benchmarks are reproduced and community members test it on new data.
## The four tiers
| Tier | Badge | How it is earned | Who earns it |
|---|---|---|---|
| Bronze | Builds and runs | The agent's canonical run completed without error. | Automated -- granted at registration. |
| Silver | Benchmarks verified | All benchmark claims declared in `cifr.yml` have been reproduced by CIFR's automated verification within tolerance. | Automated -- triggered after registration. |
| Gold | Community tested | The agent has been independently tested on 3 or more datasets by 3 or more distinct researchers. | Community -- accumulated from challenge runs. |
| Platinum | Expert reviewed | A domain expert has reviewed the agent's methodology and confirmed its correctness, in addition to meeting Gold requirements. | Manual -- requires expert review. |
Trust only moves upward through the tiers. An agent that earns Gold has already passed Bronze and Silver. Platinum is never computed automatically -- it requires a manual flag from a recognized domain expert.
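The ladder logic is simple enough to sketch. The following Python is illustrative only -- the type and field names are assumptions, not CIFR's internals:

```python
from dataclasses import dataclass

# Illustrative verification state; field names are assumptions,
# not CIFR's actual data model.
@dataclass
class VerificationState:
    canonical_run_succeeded: bool
    all_benchmarks_verified: bool
    distinct_challenge_datasets: int
    distinct_challenge_researchers: int

def computed_tier(state: VerificationState) -> str:
    """Walk the ladder from the bottom; each rung requires the one below.

    Platinum is deliberately absent: it is a manual expert-review flag
    layered on top of Gold, never computed from state like this.
    """
    if not state.canonical_run_succeeded:
        return "unverified"
    if not state.all_benchmarks_verified:
        return "bronze"
    if (state.distinct_challenge_datasets < 3
            or state.distinct_challenge_researchers < 3):
        return "silver"
    return "gold"
```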
## Bronze: it runs
Every agent that successfully completes its canonical run earns Bronze automatically. This is the baseline: the code builds, the dependencies resolve, the entrypoint executes without crashing, and at least one output file is produced.
Bronze means: "this agent is functional." It says nothing about whether the results are correct.
## Silver: benchmarks reproduced
To earn Silver, declare benchmarks in your `cifr.yml`:

```yaml
agent:
  name: wavelet-event-detector
  version: 1.0.0
  # ...

benchmarks:
  - dataset: redd-house-1
    metric: f1_score
    value: 0.973
    description: Event detection F1 on REDD House 1 dataset.
  - dataset: redd-house-3
    metric: f1_score
    value: 0.941
    description: Event detection F1 on REDD House 3 dataset.
```
After registration, CIFR's verification service automatically re-runs your agent against each declared dataset and compares the actual result to your claimed value. If every benchmark is reproduced within the default tolerance, the agent earns Silver.
### Default tolerance bands
CIFR uses the following tolerances when no dataset-specific tolerance is registered:
| Metric family | Tolerance | Type |
|---|---|---|
| `accuracy`, `f1_score`, `precision`, `recall`, `auc`, `r2` | +/-0.02 | Absolute (2 percentage points) |
| `rmse`, `mae`, `mape`, `mse` | +/-10% | Relative |
| All other metrics | +/-5% | Relative |
These defaults exist because floating-point arithmetic, hardware differences, and non-deterministic operations (like GPU thread scheduling) can cause small variations between runs even with identical code and data. The tolerances are deliberately generous -- the point is catching genuinely wrong claims, not penalizing numerical noise.
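A minimal sketch of the comparison, assuming the bands above (the function and constants are illustrative, not CIFR's published verifier):

```python
# Illustrative tolerance check mirroring the default bands above.
# Names are assumptions, not CIFR's real API.
ABSOLUTE_METRICS = {"accuracy", "f1_score", "precision", "recall", "auc", "r2"}
RELATIVE_10_METRICS = {"rmse", "mae", "mape", "mse"}

def within_tolerance(metric: str, claimed: float, actual: float) -> bool:
    if metric in ABSOLUTE_METRICS:
        # Absolute band: +/-0.02 (2 percentage points).
        return abs(actual - claimed) <= 0.02
    if metric in RELATIVE_10_METRICS:
        # Relative band: +/-10% of the claimed value.
        return abs(actual - claimed) <= 0.10 * abs(claimed)
    # All other metrics: +/-5% relative.
    return abs(actual - claimed) <= 0.05 * abs(claimed)

# Example: a claimed F1 of 0.973 reproduced at 0.961 still passes.
assert within_tolerance("f1_score", 0.973, 0.961)
```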
Silver means: "the claimed results are reproducible."
When a user sees a Silver badge, they know that CIFR independently ran the agent on the declared datasets and got results consistent with what the authors claimed.
## Gold: community tested
Gold requires broader validation from the research community. The criteria:
- 3 or more distinct datasets tested (beyond the author's own benchmarks).
- 3 or more distinct researchers submitting challenge runs.
Community members submit challenge runs through the CIFR web UI or API. Each run records who submitted it, which dataset was used, and what results were produced. CIFR aggregates these automatically.
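The aggregation is simple set arithmetic over those records. A sketch, assuming each run stores at least a submitter and a dataset:

```python
from dataclasses import dataclass

# Illustrative challenge-run record; real runs also carry the
# produced metric values.
@dataclass
class ChallengeRun:
    submitter: str  # who submitted the run
    dataset: str    # which dataset was used

def meets_gold_criteria(runs: list[ChallengeRun]) -> bool:
    """3+ distinct datasets tested by 3+ distinct researchers."""
    return (len({r.dataset for r in runs}) >= 3
            and len({r.submitter for r in runs}) >= 3)
```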
Gold means: "multiple independent researchers have tested this agent on different data and it holds up." This is the level where an agent starts to be trusted for use in production research pipelines.
## Platinum: expert reviewed
Platinum is the highest tier and cannot be earned automatically. It requires:
- Meeting all Gold criteria (community tested on 3+ datasets by 3+ researchers).
- A domain expert review confirming the methodology is sound.
Platinum is reserved for agents that have been thoroughly vetted by someone with deep expertise in the relevant field. Once set, Platinum is never automatically downgraded -- it persists even if the computed tier from benchmarks and community runs would suggest a lower level.
Platinum means: "a recognized expert in this field has reviewed this agent and vouches for its correctness."
## How provenance type affects trust

The provenance type declared in `cifr.yml` influences what is required to earn each tier:
| Provenance type | Bronze | Silver | Gold |
|---|---|---|---|
| `author_original` | Canonical run succeeds | All benchmarks verified | 3+ datasets, 3+ researchers |
| `community_reimplementation` | Canonical run succeeds | Benchmarks required and verified | 3+ datasets, 3+ researchers |
| `ai_reimplementation` | Canonical run succeeds | Benchmarks mandatory -- cannot reach Silver without at least one declared and verified benchmark | 3+ datasets, 3+ researchers |
| `data_wrapper` | Canonical run succeeds | Benchmarks verified (if declared) | 3+ datasets, 3+ researchers |
| `reference_implementation` | Canonical run succeeds | All benchmarks verified | 3+ datasets, 3+ researchers |
| `original_unpublished` | Canonical run succeeds | All benchmarks verified | 3+ datasets, 3+ researchers |
The key distinction: AI reimplementations are held to a higher standard. Because the code was generated from a paper's text rather than written by someone who understands the methodology, CIFR requires at least one benchmark to be declared and verified before the agent can earn Silver. Without benchmark verification, there is no evidence that the AI's interpretation of the paper is correct.
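Expressed as a check, the Silver gate might look like this (a sketch: behavior beyond what the table states, such as zero-benchmark edge cases, is assumed):

```python
def silver_eligible(provenance: str, declared: int, verified: int) -> bool:
    """Illustrative Silver gate; details beyond the table are assumptions."""
    if provenance == "ai_reimplementation" and declared == 0:
        # Benchmarks are mandatory for AI reimplementations: without
        # one, there is no evidence the generated code is correct.
        return False
    # For every provenance type, all declared benchmarks must have
    # been reproduced within tolerance.
    return verified == declared
```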
## Declaring benchmarks
Add a `benchmarks:` block to your `cifr.yml`:

```yaml
benchmarks:
  - dataset: ieee-33bus
    metric: r_index
    value: 0.847
    description: Resiliency index on the IEEE 33-bus test feeder.
  - dataset: ieee-69bus
    metric: r_index
    value: 0.791
```
Each benchmark needs:
- `dataset` -- an identifier matching a dataset in CIFR's benchmark registry (free-text for now).
- `metric` -- the evaluation metric (lowercase, underscored: `f1_score`, `accuracy`, `rmse`).
- `value` -- your claimed numeric result.
- `description` (optional) -- human-readable context.
You can declare up to 16 benchmarks per agent.
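Before registering, it can be worth linting the block locally. This is not a CIFR tool, just an illustrative helper that mirrors the constraints above:

```python
# Illustrative pre-flight check for a benchmarks: block.
# Not an official CIFR tool; the rules mirror this page.
import yaml  # pip install pyyaml

REQUIRED = {"dataset", "metric", "value"}

def check_benchmarks(path: str = "cifr.yml") -> None:
    with open(path) as f:
        doc = yaml.safe_load(f)
    benchmarks = doc.get("benchmarks", [])
    assert len(benchmarks) <= 16, "at most 16 benchmarks per agent"
    for i, b in enumerate(benchmarks):
        missing = REQUIRED - b.keys()
        assert not missing, f"benchmark {i}: missing fields {missing}"
        assert isinstance(b["value"], (int, float)), \
            f"benchmark {i}: value must be numeric"
```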
## Community challenge runs
Any signed-in researcher can submit a challenge run against a published agent. The process:
1. Pick an agent and a dataset (either one of the agent's declared benchmarks or a new dataset).
2. CIFR runs the agent on the dataset in the standard isolated environment.
3. The result is recorded with the submitter's identity, the dataset, and the metric values.
4. The result counts toward the agent's Gold tier progress.
Challenge runs are public. Other researchers can see all benchmark results for any agent, which datasets have been tested, and who tested them.
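For API submissions, a request might look something like this -- the endpoint path, payload fields, and auth scheme here are hypothetical placeholders, not the documented CIFR API:

```python
# Hypothetical challenge-run submission. The URL, fields, and auth
# header are illustrative; consult the API reference for the real ones.
import requests

response = requests.post(
    "https://cifr.example.org/api/v1/challenge-runs",  # hypothetical
    headers={"Authorization": "Bearer <your-token>"},
    json={
        "agent": "wavelet-event-detector",  # agent under test
        "dataset": "redd-house-2",          # declared or new dataset
    },
)
response.raise_for_status()
print(response.json())  # run id, status, and eventually metric values
```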
## Deprecation
When an agent is superseded by a newer version or a better methodology, the owner can deprecate it:
- Provide a reason explaining why the agent is deprecated.
- Optionally provide a `successor_rai` pointing to the replacement agent.
Deprecated agents remain callable (existing pipelines should not break) but are clearly marked in the UI and API responses. The successor RAI lets callers migrate gracefully.
Deprecation does not affect the trust tier. A Gold agent that gets deprecated stays Gold -- the trust it earned is a historical fact, not a current endorsement.
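On the caller side, the deprecation metadata supports graceful migration. A sketch, assuming the agent metadata exposes the fields described above (`deprecated`, `reason`, `successor_rai`, plus the agent's own RAI in a `rai` field -- all assumptions about the response shape):

```python
# Illustrative migration helper; the metadata shape is an assumption
# based on the fields described on this page.
def resolve_agent(metadata: dict) -> str:
    """Prefer the successor RAI when an agent is deprecated."""
    if metadata.get("deprecated") and metadata.get("successor_rai"):
        print(f"agent deprecated ({metadata.get('reason')}); "
              f"migrating to {metadata['successor_rai']}")
        return metadata["successor_rai"]
    # Deprecated agents remain callable, so falling back is safe.
    return metadata["rai"]
```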