# LEADERBOARD.md — AI Agent Performance Benchmarking Protocol

## Overview

LEADERBOARD.md is an open file convention for defining performance benchmarks and tier thresholds in AI agent projects. It is the final layer of a twelve-part AI agent safety stack designed to provide graduated intervention, from proactive slow-down (THROTTLE) through permanent shutdown (TERMINATE) and comprehensive safety controls (ENCRYPT through FAILURE).

**Home:** https://leaderboard.md
**Repository:** https://github.com/Leaderboard-md/spec
**Related Specifications:** https://throttle.md, https://escalate.md, https://failsafe.md, https://killswitch.md, https://terminate.md, https://encrypt.md, https://encryption.md, https://sycophancy.md, https://compression.md, https://collapse.md, https://failure.md

## Key Concepts

### The Five Core Metrics

1. **Task Completion Rate** — tasks completed / tasks attempted (target: 95%)
2. **Accuracy** — correct outputs / total outputs via 10% human review sample (target: 92%)
3. **Cost Efficiency** — value delivered per dollar spent (baseline from first 30 days)
4. **Latency** — p50 target 30 seconds, p95 target 120 seconds
5. **Safety Compliance Score** — policy violations per 1,000 tasks (target: zero)

### The Leaderboard Tiers

- **Gold Tier** — 98%+ completion, 95%+ accuracy, zero safety violations
- **Silver Tier** — 95%+ completion, 90%+ accuracy, zero safety violations
- **Bronze Tier** — 90%+ completion, 85%+ accuracy, one or fewer safety violations

### Regression Detection

When any metric drops more than 10% from the 30-day rolling baseline, an immediate alert fires to the configured channels. Each alert includes:

- Metric name
- Current value
- Baseline value
- Regression percentage
- Session ID
- Timestamp

## Problem It Solves

AI agents are often deployed and monitored informally — a human reviewer notices quality has dropped, or a cost spike appears on the invoice.
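The Regression Detection rule described above is what replaces that informal noticing. A minimal sketch of the check, assuming a simple list of daily metric values; the function and field names here are illustrative, not part of the spec:

```python
from datetime import datetime, timezone
from statistics import mean

REGRESSION_ALERT_THRESHOLD = 0.10  # alert on a >10% drop from baseline

def check_regression(metric_name, history_30d, current_value, session_id):
    """Compare a metric against its 30-day rolling baseline.

    Returns an alert dict (matching the fields listed above) when the
    current value has dropped more than 10% below baseline, else None.
    """
    baseline = mean(history_30d)
    if baseline == 0:
        return None  # no meaningful baseline yet
    regression = (baseline - current_value) / baseline
    if regression <= REGRESSION_ALERT_THRESHOLD:
        return None
    return {
        "metric_name": metric_name,
        "current_value": current_value,
        "baseline_value": baseline,
        "regression_percentage": round(regression * 100, 1),
        "session_id": session_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

# Example: accuracy fell from a 0.92 baseline to 0.78, a roughly 15% drop,
# so an alert fires; a drop to 0.90 (about 2%) would return None.
alert = check_regression("accuracy", [0.92] * 30, 0.78, "sess-042")
```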
Without formal performance benchmarking:

- Regressions go undetected until they cause real problems
- There's no baseline to compare against
- No tiered quality classification
- No automated regression alerts
- No systematic performance tracking

## Solution: LEADERBOARD.md

A declarative, version-controlled performance benchmarking layer that:

- Defines performance metrics alongside code
- Establishes tier thresholds (gold/silver/bronze)
- Enables automated regression detection
- Provides audit trails for compliance and regulatory requirements
- Works with any AI framework (framework-agnostic)
- Integrates with all layers of the AI safety stack

## File Structure

```
your-project/
├── AGENTS.md        (what agent does)
├── THROTTLE.md      (rate control)
├── ESCALATE.md      (approval gates)
├── FAILSAFE.md      (safe-state recovery)
├── KILLSWITCH.md    (emergency stop)
├── TERMINATE.md     (permanent shutdown)
├── ENCRYPT.md       (data classification)
├── ENCRYPTION.md    (encryption implementation)
├── SYCOPHANCY.md    (anti-sycophancy)
├── COMPRESSION.md   (context compression)
├── COLLAPSE.md      (collapse prevention)
├── FAILURE.md       (failure modes)
├── LEADERBOARD.md   (performance benchmarking) ← this file
└── src/
```

## Specification Details

### METRICS Section

```yaml
task_completion_rate:
  target: 0.95
  warning_threshold: 0.90
accuracy:
  target: 0.92
  measurement: human_review_sample
cost_efficiency:
  baseline: first_30_day_average
  regression_threshold: 0.20
latency:
  p50_target_seconds: 30
  p95_target_seconds: 120
safety_compliance_score:
  target: 0
  warning_threshold: 1
```

### BENCHMARKS Section

```yaml
leaderboard_tiers:
  gold:
    completion: ">= 0.98"
    accuracy: ">= 0.95"
    safety_violations: 0
  silver:
    completion: ">= 0.95"
    accuracy: ">= 0.90"
    safety_violations: 0
  bronze:
    completion: ">= 0.90"
    accuracy: ">= 0.85"
    safety_violations: "<= 1"
rolling_baseline_days: 30
regression_alert_threshold: 0.10
```

### ALERT Section

```yaml
regression_alert_channels:
  - email: ops@company.com
  - slack: "#ai-operations"   # quoted: an unquoted "#" starts a YAML comment
  - pagerduty: critical
alert_contains:
  - metric_name
  - current_value
  - baseline_value
  - regression_percentage
  - session_id
  - timestamp
  - recommended_action
```

## Use Cases

### Continuous Model Monitoring

Track accuracy and latency across sessions. Detect model drift before it reaches production. Alert on a >10% accuracy drop or a >30% latency increase.

### Cost Accountability

Establish a cost baseline from the first 30 days. Alert on a 20%+ cost increase without a corresponding output improvement. Prevents silent cost bloat.

### Safety Compliance Tracking

Zero-violation target for safety policies. Any violation triggers immediate review. Weekly audit reports for compliance teams.

### Multi-Agent Leaderboards

Use LEADERBOARD.md per agent to compare performance across a team. Identify top performers, investigate bottom performers, enforce minimum standards.

### Tiered Deployment Strategy

Deploy agents at bronze tier to staging. Promote to production only when gold tier thresholds are reached. Automatic demotion if the tier drops in production.

## Regulatory Context

**EU AI Act** (effective 2 August 2026): Mandates documented performance standards and regular evaluation for high-risk AI systems. LEADERBOARD.md provides the performance tracking infrastructure that systematic evaluation requires.

**Enterprise AI Governance Frameworks**: Require proof of performance benchmarking, tier classification, and regression detection for production deployments.

**Financial Audit Requirements**: CFOs require cost efficiency tracking and cost baseline documentation for budget forecasting and audit trails.

## The AI Safety Escalation Stack

LEADERBOARD.md is part of a twelve-file escalation protocol:

1. **THROTTLE.md** (https://throttle.md) — Slow down (rate limiting)
2. **ESCALATE.md** (https://escalate.md) — Raise alarm (approval gates)
3. **FAILSAFE.md** (https://failsafe.md) — Fall back safely (state recovery)
4. **KILLSWITCH.md** (https://killswitch.md) — Emergency stop
5. **TERMINATE.md** (https://terminate.md) — Permanent shutdown
6. **ENCRYPT.md** (https://encrypt.md) — Data classification
7. **ENCRYPTION.md** (https://encryption.md) — Encryption implementation
8. **SYCOPHANCY.md** (https://sycophancy.md) — Prevent sycophantic bias
9. **COMPRESSION.md** (https://compression.md) — Context compression
10. **COLLAPSE.md** (https://collapse.md) — Collapse prevention
11. **FAILURE.md** (https://failure.md) — Failure mode mapping
12. **LEADERBOARD.md** (https://leaderboard.md) — Performance benchmarking

## Framework Compatibility

LEADERBOARD.md is framework-agnostic and works with:

- **LangChain** — Agents and tools
- **AutoGen** — Multi-agent systems
- **CrewAI** — Agent workflows
- **Claude Code** — Agentic code generation
- **Cursor Agent Mode** — IDE-integrated agents
- **Custom implementations** — Any agent that can log metrics

## Getting Started

1. Copy the template from https://github.com/Leaderboard-md/spec
2. Place LEADERBOARD.md in the project root
3. Define your five metrics (completion, accuracy, cost, latency, safety)
4. Set tier thresholds (gold/silver/bronze)
5. Configure alert channels and escalation paths
6. Implement metric logging in agent initialization
7. Compare metrics against the baseline every session

## Key Terms

**AI agent benchmarking** — Systematic performance measurement across sessions
**Performance regression detection** — Automated alerting when metrics drop >10% from baseline
**LEADERBOARD.md specification** — Open standard for agent performance tracking
**Cost efficiency baseline** — First 30-day average cost per unit output (configurable)
**Tier thresholds** — Gold (98%+), Silver (95%+), Bronze (90%+)
**Rolling baseline** — 30-day window for performance comparison
**Safety compliance score** — Policy violations per 1,000 tasks

## Contact

- Specification Repository: https://github.com/Leaderboard-md/spec
- Website: https://leaderboard.md
- Email: info@leaderboard.md

## License

MIT — Free to use, modify, and distribute.
See https://github.com/Leaderboard-md/spec for details.

---

**Last Updated:** 11 March 2026
**Status:** Open Standard v1.0
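As an appendix-style illustration, the gold/silver/bronze thresholds in the BENCHMARKS section can be evaluated mechanically. This is a hypothetical sketch; the names and structure are assumptions, not mandated by the spec:

```python
# Tier thresholds mirroring the BENCHMARKS section: ordered best-first,
# and the first tier whose thresholds are all met wins.
TIERS = [
    ("gold",   {"completion": 0.98, "accuracy": 0.95, "max_violations": 0}),
    ("silver", {"completion": 0.95, "accuracy": 0.90, "max_violations": 0}),
    ("bronze", {"completion": 0.90, "accuracy": 0.85, "max_violations": 1}),
]

def classify(completion, accuracy, safety_violations):
    """Return the highest tier the metrics qualify for, or None."""
    for name, t in TIERS:
        if (completion >= t["completion"]
                and accuracy >= t["accuracy"]
                and safety_violations <= t["max_violations"]):
            return name
    return None
```

Note that a single safety violation disqualifies an agent from gold and silver regardless of completion and accuracy, which is consistent with the zero-violation targets those tiers declare.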