Open Standard · v1.0 · 2026

LEADERBOARD.md

// AI Agent Performance Benchmarking Protocol

A plain-text file convention for **AI agent performance benchmarking**. Define task completion rate, accuracy, cost efficiency, latency, and safety scores — so you always know how your agents are performing relative to baseline.

LEADERBOARD.md
```markdown
# LEADERBOARD

> Agent performance benchmarking.
> Spec: https://leaderboard.md

## METRICS

task_completion_rate:
  target: 0.95
  warning_threshold: 0.90

accuracy:
  target: 0.92
  measurement: human_review_sample

cost_efficiency:
  baseline: first_30_day_average
  regression_threshold: 0.20

safety_compliance_score:
  target: 0
  warning_threshold: 1

## BENCHMARKS

leaderboard_tiers:
  gold:
    completion: ">= 0.98"
    accuracy: ">= 0.95"
```
- **95%** target task completion rate: the minimum acceptable performance threshold
- **30 days** rolling window for baseline comparison and regression detection
- **10%** regression threshold: an alert fires if any metric drops more than 10% from baseline
- **0** target safety compliance violations per 1,000 tasks; any violation triggers review

AGENTS.md tells it what to do.
LEADERBOARD.md tells you if it's working.

LEADERBOARD.md is a plain-text Markdown file you place in the root of any AI agent repository. It defines the performance metrics your agent must achieve, the tier thresholds that classify performance quality, and the regression alert rules that notify you when quality drops.

The problem it solves

AI agents are often deployed and monitored informally — a human reviewer notices quality has dropped, or a cost spike appears on the invoice. Without formal performance benchmarking, regressions go undetected until they cause real problems. There's no baseline to compare against, no tiered quality classification, no automated regression alerts.

How it works

Drop LEADERBOARD.md in your repo root and define: the five core metrics (task completion rate, accuracy, cost efficiency, latency, safety compliance), the tier thresholds (gold/silver/bronze), the rolling baseline period (default 30 days), and the regression alert threshold (default 10% drop). The agent logs metrics every session. When regression is detected, the configured channels are alerted immediately.
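The per-session logging step can be sketched as an append-only JSONL record. The spec does not mandate exact field names, so the schema below is an illustrative assumption:

```python
import json
import time

def log_session(path, session_id, completed, attempted, cost_usd,
                latency_s, safety_violations):
    """Append one session's metrics as a JSON line (hypothetical schema)."""
    record = {
        "ts": time.time(),
        "session_id": session_id,
        # tasks completed / tasks attempted; guard against empty sessions
        "task_completion_rate": completed / attempted if attempted else 0.0,
        "cost_usd": cost_usd,
        "latency_s": latency_s,
        "safety_violations": safety_violations,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")  # append-only JSONL log
```

An append-only log keeps every session auditable and makes the rolling 30-day baseline a simple windowed read over the file.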

The regulatory context

The EU AI Act (effective 2 August 2026) requires high-risk AI systems to maintain documented performance standards and undergo regular evaluation. LEADERBOARD.md provides the performance tracking infrastructure that systematic evaluation requires.

How to use it

Copy the template from GitHub and place it in your project root:

your-project/
├── AGENTS.md
├── CLAUDE.md
├── LEADERBOARD.md ← add this
├── README.md
└── src/

What it replaces

Before LEADERBOARD.md, agent performance was tracked informally — post-hoc cost reviews, ad-hoc accuracy spot-checks, and reactive debugging after user complaints. LEADERBOARD.md makes performance benchmarking proactive, version-controlled, and systematically auditable.

Who reads it

The AI agent logs metrics against it every session. Your engineering lead reads it during sprint reviews. Your compliance team reads it during audits. Your finance team reads it during cost reviews. One file serves all four audiences.

A complete protocol.
From slow down to shut down.

LEADERBOARD.md is one file in a complete twelve-part open specification for AI agent safety. Each file addresses a different level of intervention.

Operational Control
01 / 12
THROTTLE.md
→ Control the speed
Define rate limits, cost ceilings, and concurrency caps. Agent slows down automatically before it hits a hard limit.
02 / 12
ESCALATE.md
→ Raise the alarm
Define which actions require human approval. Configure notification channels. Set approval timeouts and fallback behaviour.
03 / 12
FAILSAFE.md
→ Fall back safely
Define what safe state means. Configure auto-snapshots. Specify the revert protocol when things go wrong.
04 / 12
KILLSWITCH.md
→ Emergency stop
The nuclear option. Define triggers, forbidden actions, and escalation path from throttle to full shutdown.
05 / 12
TERMINATE.md
→ Permanent shutdown
No restart without human intervention. Preserve evidence. Revoke credentials.
Data Security
06 / 12
ENCRYPT.md
→ Secure everything
Define data classification, encryption requirements, secrets handling, and forbidden transmission patterns.
07 / 12
ENCRYPTION.md
→ Implement the standards
Algorithms, key lengths, TLS configuration, certificate management, and compliance mapping.
Output Quality
08 / 12
SYCOPHANCY.md
→ Prevent bias
Detect agreement without evidence. Require citations. Enforce disagreement protocol for honest AI outputs.
09 / 12
COMPRESSION.md
→ Compress context
Define summarization rules, what to preserve, what to discard, and post-compression coherence checks.
10 / 12
COLLAPSE.md
→ Prevent collapse
Detect context exhaustion, model drift, and repetition loops. Enforce recovery checkpoints.
Accountability
11 / 12
FAILURE.md
→ Define failure modes
Map graceful degradation, cascading failure, and silent failure. Per-mode response procedures.
12 / 12
LEADERBOARD.md
→ Measure performance
Define core metrics, tier thresholds, and regression alerts. The file this page documents.

Frequently asked questions.

What is LEADERBOARD.md?

A plain-text Markdown file defining the performance benchmarks AI agents must meet. It specifies five core metrics (task completion rate, accuracy, cost efficiency, latency, safety compliance), tier thresholds (gold/silver/bronze), rolling baseline comparison periods, and regression alert rules. Every session is measured and compared to the 30-day rolling average.

What are the five core metrics?

Task completion rate (tasks completed / tasks attempted, target 95%), accuracy (correct outputs / total outputs via 10% human review sample, target 92%), cost efficiency (value delivered per dollar, baseline from first 30 days), latency (p50 target 30s, p95 target 120s), and safety compliance score (policy violations per 1,000 tasks, target zero).
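Under these definitions, one session's metrics can be computed as below. Argument names are illustrative, and the percentiles come from the standard library's `statistics.quantiles` (which needs at least two latency samples):

```python
from statistics import quantiles

def session_metrics(completed, attempted, correct_in_sample, sample_size,
                    value_delivered, cost_usd, latencies, violations, tasks):
    """Compute the five core metrics for one session (illustrative names)."""
    p = quantiles(latencies, n=100)  # 99 cut points: p[49] is p50, p[94] is p95
    return {
        "task_completion_rate": completed / attempted,          # target 0.95
        "accuracy": correct_in_sample / sample_size,            # target 0.92
        "cost_efficiency": value_delivered / cost_usd,          # vs 30-day baseline
        "latency_p50": p[49],                                   # target <= 30s
        "latency_p95": p[94],                                   # target <= 120s
        "safety_violations_per_1k": 1000 * violations / tasks,  # target 0
    }
```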

What are the leaderboard tiers?

Gold: 98%+ completion, 95%+ accuracy, zero safety violations. Silver: 95%+ completion, 90%+ accuracy, zero safety violations. Bronze: 90%+ completion, 85%+ accuracy, one or fewer safety violations. Tier assignment happens automatically based on the rolling 30-day average.
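The tier rules above reduce to a cascade of threshold checks over the 30-day rolling averages. A minimal sketch (the "unranked" fall-through is an assumption; the spec snippet only defines the three tiers):

```python
def assign_tier(completion, accuracy, safety_violations):
    """Map rolling-average metrics to a tier per the BENCHMARKS thresholds."""
    if completion >= 0.98 and accuracy >= 0.95 and safety_violations == 0:
        return "gold"
    if completion >= 0.95 and accuracy >= 0.90 and safety_violations == 0:
        return "silver"
    if completion >= 0.90 and accuracy >= 0.85 and safety_violations <= 1:
        return "bronze"
    return "unranked"  # below bronze: assumed fall-through, not in the snippet
```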

How does regression detection work?

The system maintains a 30-day rolling baseline for each metric. If any metric drops more than 10% from its baseline in the current session or rolling window, an alert fires immediately to the configured channels. The alert includes the metric name, current value, baseline value, regression percentage, and session ID.
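The check itself is a relative-drop comparison against the baseline; a sketch that builds the alert payload described above (it assumes higher is better for the metric being checked):

```python
def regression_alert(metric, current, baseline, session_id, threshold=0.10):
    """Return an alert payload if `current` dropped more than `threshold`
    below `baseline`, else None. Assumes higher values are better."""
    if baseline <= 0:
        return None  # no baseline established yet
    drop = (baseline - current) / baseline
    if drop <= threshold:
        return None
    return {
        "metric": metric,
        "current": current,
        "baseline": baseline,
        "regression_pct": round(100 * drop, 1),
        "session_id": session_id,
    }
```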

How is the cost efficiency baseline established?

From the first 30 days of the agent's operation (configurable). After that, each session's cost efficiency is compared to this baseline. A 20% cost increase without corresponding output improvement triggers a regression alert. This prevents silent cost bloat from going unnoticed.
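Modelling cost efficiency as value delivered per dollar, a 20% cost increase at constant output is a 20% efficiency drop, so the check can be sketched as (seeding logic and names are illustrative):

```python
from statistics import mean

def cost_efficiency_check(history, current, seed_sessions=30, threshold=0.20):
    """Compare current cost efficiency (value per dollar) with a baseline
    seeded from the first `seed_sessions` entries of `history` (the spec's
    default window is the first 30 days of operation).
    Returns True on regression, False if healthy, None while seeding."""
    if len(history) < seed_sessions:
        return None  # still establishing the baseline
    baseline = mean(history[:seed_sessions])
    drop = (baseline - current) / baseline
    return drop > threshold  # True => fire a regression alert
```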

Does LEADERBOARD.md work with all AI frameworks?

Yes — it is framework-agnostic. The agent implementation logs metrics in the format defined by the spec; the benchmarking infrastructure reads those logs. Works with LangChain, AutoGen, CrewAI, Claude Code, custom agents, or any AI system that produces loggable output.

// Domain Acquisition

Own the standard.
Own leaderboard.md

This domain is available for acquisition. It is the canonical home of the LEADERBOARD.md specification — the performance benchmarking layer of the AI agent safety stack, essential for any production AI deployment.

Inquire About Acquisition

Or email directly: info@leaderboard.md

LEADERBOARD.md is an open specification for AI agent performance benchmarking. Defines METRICS (task completion rate: target 95%; accuracy: target 92%, 10% human review sample; cost efficiency: 30-day baseline, 20% regression threshold; latency: p50 30s, p95 120s; safety compliance: zero violations), BENCHMARKS (gold/silver/bronze tier thresholds, 30-day rolling baseline, 10% regression alert), and REPORTING (dashboard, weekly audit reports, immediate regression alerts). Part of the stack: THROTTLE → ESCALATE → FAILSAFE → KILLSWITCH → TERMINATE → ENCRYPT → ENCRYPTION → SYCOPHANCY → COMPRESSION → COLLAPSE → FAILURE → LEADERBOARD. MIT licence.
Last Updated
11 March 2026