Public Benchmarks

Judgment Benchmarks

An anonymized view of how DelegateZero workspaces improve over time: where overrides cluster, how fast escalation rates fall, and how much context tends to be in place before a full autonomous week is possible.

At A Glance

The signals that matter

1     Decision categories with enough recent sample volume to benchmark override rate reliably.

2     Plan cohorts currently included in the network escalation benchmark.

135   Average context entries loaded when a workspace first completes a full benchmark-qualified autonomous week.

1     Workspaces currently contributing to the autonomous-week context-depth benchmark.

Override Rate By Category

Higher override rates usually indicate a category where the governing policy is underspecified, the edge cases are unusual, or the confidence threshold is still too conservative for the context available.

Category    Override Rate    Recent Decisions
Other       2%               53

Escalation Rate By Plan Cohort

This view uses recent completed decisions and groups them by plan tier. It is a proxy for maturity and decision load, not a substitute for ARR or stage segmentation.

Plan     Escalation Rate    Avg Confidence    Recent Decisions
Team     0%                 0.88              71
Scale    20%                0.85              10
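As a rough illustration of the grouping described above, the sketch below computes escalation rate and average confidence per plan tier. The `Decision` record and its field names are hypothetical, chosen for this example only; they are not DelegateZero's actual schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Decision:
    # Hypothetical record shape, assumed for illustration only.
    plan: str          # plan tier, e.g. "Team" or "Scale"
    escalated: bool    # True if the decision was escalated
    confidence: float  # confidence score between 0 and 1


def escalation_by_plan(decisions):
    """Return {plan: (escalation_rate, avg_confidence, decision_count)}."""
    groups = defaultdict(list)
    for d in decisions:
        groups[d.plan].append(d)

    summary = {}
    for plan, rows in groups.items():
        n = len(rows)
        rate = sum(r.escalated for r in rows) / n
        avg_conf = sum(r.confidence for r in rows) / n
        summary[plan] = (rate, avg_conf, n)
    return summary
```

With only 10 recent decisions in a cohort, a single escalation moves the rate by 10 percentage points, so small cohorts are best read as directional.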

Average Confidence By Tenure

Confidence is grouped by how old a workspace was when a decision occurred, which gives a cleaner view of how judgment systems mature than a simple calendar chart.

Tenure Band      Avg Confidence    Decision Count
First 30 days    0.88              81
Days 31-60       -                 0
Days 61-90       -                 0
Day 91+          -                 0
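One way to bucket decisions by workspace age is sketched below. It assumes the day a workspace is created counts as day 1, so "First 30 days" covers days 1 through 30; the page does not state the exact boundaries, and the function names and input shape are hypothetical.

```python
from datetime import datetime

BANDS = ["First 30 days", "Days 31-60", "Days 61-90", "Day 91+"]


def tenure_band(workspace_created, decided_at):
    """Map a decision to a tenure band by workspace age at decision time."""
    # Assumption: the creation day itself is day 1.
    day_number = (decided_at - workspace_created).days + 1
    if day_number < 1:
        raise ValueError("decision predates workspace creation")
    if day_number <= 30:
        return BANDS[0]
    if day_number <= 60:
        return BANDS[1]
    if day_number <= 90:
        return BANDS[2]
    return BANDS[3]


def confidence_by_tenure(decisions):
    """decisions: iterable of (workspace_created, decided_at, confidence).

    Returns {band: (avg_confidence or None, decision_count)} for every band,
    so empty bands appear with a count of 0 rather than being dropped.
    """
    totals = {band: [0.0, 0] for band in BANDS}
    for created, decided, conf in decisions:
        bucket = totals[tenure_band(created, decided)]
        bucket[0] += conf
        bucket[1] += 1
    return {band: ((s / n) if n else None, n) for band, (s, n) in totals.items()}
```

Returning `None` for an empty band mirrors the "-" shown in the table when a band has no decisions yet.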

Context Depth At First Autonomous Week

Current benchmark definition: the first calendar week with at least 5 completed decisions and zero escalations.

Across the 1 workspace that has already hit that milestone, the average context depth at that moment is 135 entries.
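The benchmark definition can be sketched as a small check over a workspace's decision log. This is a minimal sketch under two assumptions the page does not confirm: that "calendar week" means an ISO week (Monday through Sunday), and that escalated decisions still count toward the completed total. The function name and input shape are hypothetical.

```python
from collections import defaultdict
from datetime import date

MIN_DECISIONS = 5  # threshold taken from the benchmark definition


def first_autonomous_week(decisions):
    """decisions: iterable of (completed_on: date, escalated: bool).

    Returns (iso_year, iso_week) for the first qualifying week, or None.
    """
    # week key -> [completed count, escalated count]
    weeks = defaultdict(lambda: [0, 0])
    for completed_on, escalated in decisions:
        iso = completed_on.isocalendar()
        key = (iso[0], iso[1])  # assumption: ISO year and ISO week number
        weeks[key][0] += 1
        weeks[key][1] += int(escalated)

    # Walk the weeks in chronological order and stop at the first match.
    for key in sorted(weeks):
        completed, escalated_count = weeks[key]
        if completed >= MIN_DECISIONS and escalated_count == 0:
            return key
    return None
```

Context depth at the milestone would then be the number of context entries loaded as of that week, averaged across all workspaces that have reached it.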

This page reports anonymized aggregate metrics only. It does not expose workspace names, requests, policies, or customer data.

Judgment Health

Benchmark your own workspace

Signed-in workspaces can compare their own override rate, escalation rate, confidence, and context depth against these benchmarks in the Judgment Health dashboard.