// AI Agent Performance Benchmarking Protocol
A plain-text file convention for **AI agent performance benchmarking**. Define task completion rate, accuracy, cost efficiency, latency, and safety compliance scores — so you always know how your agents are performing relative to baseline.
LEADERBOARD.md is a plain-text Markdown file you place in the root of any AI agent repository. It defines the performance metrics your agent must achieve, the tier thresholds that classify performance quality, and the regression alert rules that notify you when quality drops.
AI agents are often deployed and monitored informally — a human reviewer notices quality has dropped, or a cost spike appears on the invoice. Without formal performance benchmarking, regressions go undetected until they cause real problems. There's no baseline to compare against, no tiered quality classification, no automated regression alerts.
Drop LEADERBOARD.md in your repo root and define: the five core metrics (task completion rate, accuracy, cost efficiency, latency, safety compliance), the tier thresholds (gold/silver/bronze), the rolling baseline period (default 30 days), and the regression alert threshold (default 10% drop). The agent logs metrics every session. When regression is detected, the configured channels are alerted immediately.
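The configuration the paragraph above describes can be sketched as a Python dict. This is an illustrative mirror of the spec's defaults, not the file format itself (which is Markdown); all key names and the alert channel are assumptions for the example.

```python
# Hypothetical in-memory representation of a LEADERBOARD.md configuration.
# Values mirror the spec's stated defaults; key names are illustrative.
LEADERBOARD_CONFIG = {
    "metrics": [
        "task_completion_rate",
        "accuracy",
        "cost_efficiency",
        "latency",
        "safety_compliance",
    ],
    "tiers": ["gold", "silver", "bronze"],
    "baseline_days": 30,            # rolling baseline period (spec default)
    "regression_threshold": 0.10,   # alert on a 10% drop from baseline
    "alert_channels": ["slack://#agent-alerts"],  # illustrative channel
}
```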
The EU AI Act (effective 2 August 2026) requires high-risk AI systems to maintain documented performance standards and undergo regular evaluation. LEADERBOARD.md provides the performance tracking infrastructure that systematic evaluation requires.
Copy the template from GitHub and place it in your project root.
Before LEADERBOARD.md, agent performance was tracked informally — post-hoc cost reviews, ad-hoc accuracy spot-checks, and reactive debugging after user complaints. LEADERBOARD.md makes performance benchmarking proactive, version-controlled, and systematically auditable.
The AI agent logs metrics against it every session. Your engineering lead reads it during sprint reviews. Your compliance team reads it during audits. Your finance team reads it during cost reviews. One file serves all four audiences.
LEADERBOARD.md is one file in a complete twelve-part open specification for AI agent safety. Each file addresses a different level of intervention.
A plain-text Markdown file defining the performance benchmarks AI agents must meet. It specifies five core metrics (task completion rate, accuracy, cost efficiency, latency, safety compliance), tier thresholds (gold/silver/bronze), rolling baseline comparison periods, and regression alert rules. Every session is measured and compared to the 30-day rolling average.
Task completion rate (tasks completed / tasks attempted, target 95%), accuracy (correct outputs / total outputs via 10% human review sample, target 92%), cost efficiency (value delivered per dollar, baseline from first 30 days), latency (p50 target 30s, p95 target 120s), and safety compliance score (policy violations per 1,000 tasks, target zero).
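The metric formulas above are simple ratios; a minimal sketch, assuming per-session counters are already collected (function names are illustrative, not part of the spec):

```python
def task_completion_rate(completed: int, attempted: int) -> float:
    """Tasks completed / tasks attempted (spec target: 0.95)."""
    return completed / attempted

def accuracy(correct: int, reviewed: int) -> float:
    """Correct outputs / reviewed outputs, measured on the 10%
    human-review sample (spec target: 0.92)."""
    return correct / reviewed

def safety_violation_rate(violations: int, tasks: int) -> float:
    """Policy violations per 1,000 tasks (spec target: zero)."""
    return violations / tasks * 1000
```

For example, 96 completions out of 100 attempts gives a completion rate of 0.96, and 2 violations across 4,000 tasks gives 0.5 violations per 1,000 tasks.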
Gold: 98%+ completion, 95%+ accuracy, zero safety violations. Silver: 95%+ completion, 90%+ accuracy, zero safety violations. Bronze: 90%+ completion, 85%+ accuracy, one or fewer safety violations. Tier assignment happens automatically based on the rolling 30-day average.
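Automatic tier assignment over those thresholds reduces to a cascade of comparisons. A sketch under the thresholds stated above; the fallthrough value `"unranked"` is an assumption, since the spec does not name a below-bronze tier:

```python
def assign_tier(completion: float, accuracy: float, violations: int) -> str:
    """Classify a 30-day rolling average into a performance tier.
    Thresholds: gold 98%/95%/0, silver 95%/90%/0, bronze 90%/85%/<=1."""
    if completion >= 0.98 and accuracy >= 0.95 and violations == 0:
        return "gold"
    if completion >= 0.95 and accuracy >= 0.90 and violations == 0:
        return "silver"
    if completion >= 0.90 and accuracy >= 0.85 and violations <= 1:
        return "bronze"
    return "unranked"  # assumption: spec does not define a below-bronze label
```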
The system maintains a 30-day rolling baseline for each metric. If any metric drops more than 10% from its baseline in the current session or rolling window, an alert fires immediately to the configured channels. The alert includes the metric name, current value, baseline value, regression percentage, and session ID.
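The regression check itself is a single percentage comparison. A minimal sketch that returns the alert payload fields listed above (the function name and payload shape are illustrative):

```python
def check_regression(metric: str, current: float, baseline: float,
                     session_id: str, threshold: float = 0.10):
    """Return an alert payload if `current` has dropped more than
    `threshold` (default 10%) below `baseline`, else None."""
    if baseline <= 0:
        return None  # no meaningful baseline yet
    drop = (baseline - current) / baseline
    if drop > threshold:
        return {
            "metric": metric,
            "current": current,
            "baseline": baseline,
            "regression_pct": round(drop * 100, 1),
            "session_id": session_id,
        }
    return None
```

A drop from a 0.95 baseline to 0.80 is a 15.8% regression and fires an alert; a drop to 0.92 (about 3.2%) does not.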
From the first 30 days of the agent's operation (configurable). After that, each session's cost efficiency is compared to this baseline. A 20% cost increase without corresponding output improvement triggers a regression alert. This prevents silent cost bloat from going unnoticed.
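One way to encode the cost rule above: flag a session when cost grows more than 20% over baseline and output value has not grown at least as much. The interpretation of "corresponding output improvement" as output growth matching cost growth is an assumption; the spec does not define it precisely.

```python
def cost_regression(current_cost: float, baseline_cost: float,
                    current_value: float, baseline_value: float) -> bool:
    """True if cost rose >20% over baseline without output value
    rising by at least the same proportion (interpretation assumed)."""
    cost_growth = (current_cost - baseline_cost) / baseline_cost
    value_growth = (current_value - baseline_value) / baseline_value
    return cost_growth > 0.20 and value_growth < cost_growth
```

For example, a 30% cost increase with flat output triggers the alert, while a 30% cost increase paired with a 40% output improvement does not.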
Yes — it is framework-agnostic. The agent implementation logs metrics in the format defined by the spec; the benchmarking infrastructure reads those logs. Works with LangChain, AutoGen, CrewAI, Claude Code, custom agents, or any AI system that produces loggable output.
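Framework-agnostic logging can be as simple as one structured record per session that the benchmarking side parses. A sketch using JSON lines; the field names are illustrative, not mandated by the spec:

```python
import json
import time

def log_session_metrics(session_id: str, metrics: dict) -> str:
    """Serialize one session's metrics as a single JSON line.
    Any framework (LangChain, AutoGen, CrewAI, custom) can emit this."""
    record = {"session_id": session_id, "ts": int(time.time()), **metrics}
    return json.dumps(record, sort_keys=True)

line = log_session_metrics(
    "sess-42",
    {"task_completion_rate": 0.96, "accuracy": 0.93, "latency_p50_s": 28},
)
```

The benchmarking infrastructure then reads these lines back with `json.loads` to update the rolling baseline, regardless of which framework produced them.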
This domain is available for acquisition. It is the canonical home of the LEADERBOARD.md specification — the performance benchmarking layer of the AI agent safety stack, essential for any production AI deployment.
Inquire about acquisition, or email directly: info@leaderboard.md