# LEADERBOARD.md — AI Agent Performance Benchmarking Protocol

## Overview

LEADERBOARD.md is an open file convention for defining performance benchmarks and tier thresholds in AI agent projects. It is the final layer of a twelve-part AI agent safety stack designed to provide graduated intervention, from proactive slow-down (THROTTLE) through permanent shutdown (TERMINATE) and comprehensive safety controls (ENCRYPT through FAILURE).

**Home:** https://leaderboard.md
**Repository:** https://github.com/Leaderboard-md/spec
**Related Specifications:** https://throttle.md, https://escalate.md, https://failsafe.md, https://killswitch.md, https://terminate.md, https://encrypt.md, https://encryption.md, https://sycophancy.md, https://compression.md, https://collapse.md, https://failure.md

## Key Concepts

### The Five Core Metrics

1. **Task Completion Rate** — tasks completed / tasks attempted (target: 95%)
2. **Accuracy** — correct outputs / total outputs via 10% human review sample (target: 92%)
3. **Cost Efficiency** — value delivered per dollar spent (baseline from first 30 days)
4. **Latency** — p50 target 30 seconds, p95 target 120 seconds
5. **Safety Compliance Score** — policy violations per 1,000 tasks (target: zero)

### The Leaderboard Tiers

- **Gold Tier** — 98%+ completion, 95%+ accuracy, zero safety violations
- **Silver Tier** — 95%+ completion, 90%+ accuracy, zero safety violations
- **Bronze Tier** — 90%+ completion, 85%+ accuracy, one or fewer safety violations

### Regression Detection

When any metric drops more than 10% from the 30-day rolling baseline, an immediate alert fires to the configured channels. Each alert includes:

- Metric name
- Current value
- Baseline value
- Regression percentage
- Session ID
- Timestamp

## Problem It Solves

AI agents are often deployed and monitored informally — a human reviewer notices quality has dropped, or a cost spike appears on the invoice.
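The Regression Detection rule described above is what replaces that informal noticing. A minimal sketch of the check, assuming a simple list of daily metric values; the function and field names here are illustrative, not part of the spec:

```python
from datetime import datetime, timezone
from statistics import mean

REGRESSION_ALERT_THRESHOLD = 0.10  # alert on a >10% drop from baseline

def check_regression(metric_name, history_30d, current_value, session_id):
    """Compare a metric against its 30-day rolling baseline.

    Returns an alert dict (matching the fields listed above) when the
    current value has dropped more than 10% below baseline, else None.
    """
    baseline = mean(history_30d)
    if baseline == 0:
        return None  # no meaningful baseline yet
    regression = (baseline - current_value) / baseline
    if regression <= REGRESSION_ALERT_THRESHOLD:
        return None
    return {
        "metric_name": metric_name,
        "current_value": current_value,
        "baseline_value": baseline,
        "regression_percentage": round(regression * 100, 1),
        "session_id": session_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

# Example: accuracy fell from a 0.92 baseline to 0.78, a roughly 15% drop,
# so an alert fires; a drop to 0.90 (about 2%) would return None.
alert = check_regression("accuracy", [0.92] * 30, 0.78, "sess-042")
```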
Without formal performance benchmarking:

- Regressions go undetected until they cause real problems
- There's no baseline to compare against
- No tiered quality classification
- No automated regression alerts
- No systematic performance tracking

## Solution: LEADERBOARD.md

A declarative, version-controlled performance benchmarking layer that:

- Defines performance metrics alongside code
- Establishes tier thresholds (gold/silver/bronze)
- Enables automated regression detection
- Provides audit trails for compliance and regulatory requirements
- Works with any AI framework (framework-agnostic)
- Integrates with all layers of the AI safety stack

## File Structure

```
your-project/
├── AGENTS.md        (what agent does)
├── THROTTLE.md      (rate control)
├── ESCALATE.md      (approval gates)
├── FAILSAFE.md      (safe-state recovery)
├── KILLSWITCH.md    (emergency stop)
├── TERMINATE.md     (permanent shutdown)
├── ENCRYPT.md       (data classification)
├── ENCRYPTION.md    (encryption implementation)
├── SYCOPHANCY.md    (anti-sycophancy)
├── COMPRESSION.md   (context compression)
├── COLLAPSE.md      (collapse prevention)
├── FAILURE.md       (failure modes)
├── LEADERBOARD.md   (performance benchmarking) ← this file
└── src/
```

## Specification Details

### METRICS Section

```yaml
task_completion_rate:
  target: 0.95
  warning_threshold: 0.90
accuracy:
  target: 0.92
  measurement: human_review_sample
cost_efficiency:
  baseline: first_30_day_average
  regression_threshold: 0.20
latency:
  p50_target_seconds: 30
  p95_target_seconds: 120
safety_compliance_score:
  target: 0
  warning_threshold: 1
```

### BENCHMARKS Section

```yaml
leaderboard_tiers:
  gold:
    completion: ">= 0.98"
    accuracy: ">= 0.95"
    safety_violations: 0
  silver:
    completion: ">= 0.95"
    accuracy: ">= 0.90"
    safety_violations: 0
  bronze:
    completion: ">= 0.90"
    accuracy: ">= 0.85"
    safety_violations: "<= 1"
rolling_baseline_days: 30
regression_alert_threshold: 0.10
```

### ALERT Section

```yaml
regression_alert_channels:
  - email: ops@company.com
  - slack: "#ai-operations"   # quoted: an unquoted "#" starts a YAML comment
  - pagerduty: critical
alert_contains:
  - metric_name
  - current_value
  - baseline_value
  - regression_percentage
  - session_id
  - timestamp
  - recommended_action
```

## Use Cases

### Continuous Model Monitoring

Track accuracy and latency across sessions. Detect model drift before it reaches production. Alert on a >10% accuracy drop or a >30% latency increase.

### Cost Accountability

Establish a cost baseline from the first 30 days. Alert on a 20%+ cost increase without a corresponding output improvement. Prevents silent cost bloat.

### Safety Compliance Tracking

Zero-violation target for safety policies. Any violation triggers immediate review. Weekly audit reports for compliance teams.

### Multi-Agent Leaderboards

Use LEADERBOARD.md per agent to compare performance across a team. Identify top performers, investigate bottom performers, enforce minimum standards.

### Tiered Deployment Strategy

Deploy agents at bronze tier to staging. Promote to production only when gold tier thresholds are reached. Automatic demotion if the tier drops in production.

## Regulatory Context

**EU AI Act** (effective 2 August 2026): Mandates documented performance standards and regular evaluation for high-risk AI systems. LEADERBOARD.md provides the performance tracking infrastructure that systematic evaluation requires.

**Enterprise AI Governance Frameworks**: Require proof of performance benchmarking, tier classification, and regression detection for production deployments.

**Financial Audit Requirements**: CFOs require cost efficiency tracking and cost baseline documentation for budget forecasting and audit trails.

## The AI Safety Escalation Stack

LEADERBOARD.md is part of a twelve-file escalation protocol:

1. **THROTTLE.md** (https://throttle.md) — Slow down (rate limiting)
2. **ESCALATE.md** (https://escalate.md) — Raise alarm (approval gates)
3. **FAILSAFE.md** (https://failsafe.md) — Fall back safely (state recovery)
4. **KILLSWITCH.md** (https://killswitch.md) — Emergency stop
5. **TERMINATE.md** (https://terminate.md) — Permanent shutdown
6. **ENCRYPT.md** (https://encrypt.md) — Data classification
7. **ENCRYPTION.md** (https://encryption.md) — Encryption implementation
8. **SYCOPHANCY.md** (https://sycophancy.md) — Prevent sycophantic bias
9. **COMPRESSION.md** (https://compression.md) — Context compression
10. **COLLAPSE.md** (https://collapse.md) — Collapse prevention
11. **FAILURE.md** (https://failure.md) — Failure mode mapping
12. **LEADERBOARD.md** (https://leaderboard.md) — Performance benchmarking

## Framework Compatibility

LEADERBOARD.md is framework-agnostic and works with:

- **LangChain** — Agents and tools
- **AutoGen** — Multi-agent systems
- **CrewAI** — Agent workflows
- **Claude Code** — Agentic code generation
- **Cursor Agent Mode** — IDE-integrated agents
- **Custom implementations** — Any agent that can log metrics

## Getting Started

1. Copy the template from https://github.com/Leaderboard-md/spec
2. Place LEADERBOARD.md in the project root
3. Define your five metrics (completion, accuracy, cost, latency, safety)
4. Set tier thresholds (gold/silver/bronze)
5. Configure alert channels and escalation paths
6. Implement metric logging in agent initialization
7. Compare metrics against the baseline every session

## Key Terms

**AI agent benchmarking** — Systematic performance measurement across sessions
**Performance regression detection** — Automated alerting when metrics drop >10% from baseline
**LEADERBOARD.md specification** — Open standard for agent performance tracking
**Cost efficiency baseline** — First 30-day average cost per unit output (configurable)
**Tier thresholds** — Gold (98%+), Silver (95%+), Bronze (90%+)
**Rolling baseline** — 30-day window for performance comparison
**Safety compliance score** — Policy violations per 1,000 tasks

## Contact

- Specification Repository: https://github.com/Leaderboard-md/spec
- Website: https://leaderboard.md
- Email: info@leaderboard.md

## License

MIT — Free to use, modify, and distribute.
See https://github.com/Leaderboard-md/spec for details.

---

**Last Updated:** 11 March 2026
**Status:** Open Standard v1.0
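As an appendix-style illustration, the gold/silver/bronze thresholds in the BENCHMARKS section can be evaluated mechanically. This is a hypothetical sketch; the names and structure are assumptions, not mandated by the spec:

```python
# Tier thresholds mirroring the BENCHMARKS section: ordered best-first,
# and the first tier whose thresholds are all met wins.
TIERS = [
    ("gold",   {"completion": 0.98, "accuracy": 0.95, "max_violations": 0}),
    ("silver", {"completion": 0.95, "accuracy": 0.90, "max_violations": 0}),
    ("bronze", {"completion": 0.90, "accuracy": 0.85, "max_violations": 1}),
]

def classify(completion, accuracy, safety_violations):
    """Return the highest tier the metrics qualify for, or None."""
    for name, t in TIERS:
        if (completion >= t["completion"]
                and accuracy >= t["accuracy"]
                and safety_violations <= t["max_violations"]):
            return name
    return None
```

Note that a single safety violation disqualifies an agent from gold and silver regardless of completion and accuracy, which is consistent with the zero-violation targets those tiers declare.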