Public Benchmarks
Judgment Benchmarks
An anonymized view of how DelegateZero workspaces improve over time: where overrides cluster, how fast escalation rates fall, and how much context tends to be in place before a full autonomous week is possible.
At A Glance
Decision categories with enough recent sample volume to benchmark override rate reliably.
Plan cohorts currently included in the network escalation benchmark.
Average context entries loaded when a workspace first completes a full benchmark-qualified autonomous week.
Workspaces currently contributing to the autonomous-week context-depth benchmark.
Higher override rates usually indicate a category where the governing policy is underspecified, the edge cases are unusual, or the confidence threshold is still too conservative for the context available.
Category    Override Rate    Recent Decisions
Other       2%               53
This view uses recent completed decisions and groups them by plan tier. It is a proxy for maturity and decision load, not a substitute for ARR or stage segmentation.
Plan     Escalation Rate    Avg Confidence    Recent Decisions
Team     0%                 0.88              71
Scale    20%                0.85              10
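As a rough illustration of the per-plan figures above, the grouping could be computed along these lines. This is a minimal sketch, not DelegateZero's actual pipeline: the `Decision` record, its field names, and the assumption that escalation rate means escalated decisions divided by all recent completed decisions in the tier are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Decision:
    # Hypothetical record shape; field names are assumptions.
    plan: str          # plan tier, e.g. "Team" or "Scale"
    confidence: float  # confidence score attached to the decision
    escalated: bool    # True if the decision was escalated to a human


def summarize_by_plan(decisions):
    """Group completed decisions by plan tier and compute per-tier metrics."""
    groups = defaultdict(list)
    for d in decisions:
        groups[d.plan].append(d)

    summary = {}
    for plan, items in groups.items():
        n = len(items)
        summary[plan] = {
            # Assumed definition: escalated decisions / all decisions in the tier.
            "escalation_rate": sum(d.escalated for d in items) / n,
            "avg_confidence": sum(d.confidence for d in items) / n,
            "recent_decisions": n,
        }
    return summary
```

Under that assumed definition, a tier with 10 recent decisions of which 2 were escalated would report a 20% escalation rate.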
Confidence is grouped by how old a workspace was when a decision occurred, which gives a cleaner view of how judgment systems mature than a simple calendar chart.
Tenure Band      Avg Confidence    Decision Count
First 30 days    0.88              81
Days 31-60       -                 0
Days 61-90       -                 0
Day 91+          -                 0
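One way to picture the tenure grouping is a function that maps each decision to a band based on how old the workspace was when the decision occurred. The sketch below is illustrative only: the function name, its parameters, and the choice to count the workspace's creation date as day 1 are assumptions, not documented behavior.

```python
from datetime import date


def tenure_band(workspace_created: date, decided_on: date) -> str:
    """Return the tenure band for a decision.

    Assumption: the workspace's creation date counts as day 1.
    """
    day_number = (decided_on - workspace_created).days + 1
    if day_number < 1:
        raise ValueError("decision predates workspace creation")
    if day_number <= 30:
        return "First 30 days"
    if day_number <= 60:
        return "Days 31-60"
    if day_number <= 90:
        return "Days 61-90"
    return "Day 91+"
```

Bucketing by workspace age rather than calendar date means a workspace created last month and one created last year are compared at the same point in their own history.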
Current benchmark definition: First calendar week with at least 5 completed decisions and zero escalations.
Across the 1 workspace that has already hit that milestone, the average context depth at that moment is 135 entries.
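The benchmark definition above can be sketched as a search for the earliest week that meets both conditions. This is a hedged example, not the production logic: the function name and input shape are hypothetical, and it assumes "calendar week" means an ISO week (Monday through Sunday), which the page does not specify.

```python
from collections import defaultdict
from datetime import date


def first_autonomous_week(decisions, min_decisions=5):
    """Find the first qualifying week, or None if there is none.

    decisions: iterable of (completed_on: date, escalated: bool) pairs.
    Returns an (iso_year, iso_week) tuple.
    Assumption: weeks are ISO calendar weeks (Monday start).
    """
    # Per-week tallies: [completed decisions, escalations]
    weeks = defaultdict(lambda: [0, 0])
    for completed_on, escalated in decisions:
        iso = completed_on.isocalendar()
        key = (iso[0], iso[1])
        weeks[key][0] += 1
        weeks[key][1] += int(escalated)

    # Walk weeks in chronological order and return the first that qualifies.
    for key in sorted(weeks):
        total, escalations = weeks[key]
        if total >= min_decisions and escalations == 0:
            return key
    return None
```

A week with five or more decisions but even one escalation is skipped, as is an escalation-free week with fewer than five completed decisions.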
This page reports anonymized aggregate metrics only. It does not expose workspace names, requests, policies, or customer data.
Judgment Health
Signed-in workspaces can compare their own override rate, escalation rate, confidence, and context depth against these benchmarks in the Judgment Health dashboard.