For Decision Makers
The Business Case for Causal Evaluation
Oracle labels cost money. Broken models cost more. Here's the math.
The Problem
The "You're Absolutely Right!" Disaster
Claude Code shipped with a sycophancy bug. Every time a user corrected it, it responded "You're absolutely right!" instead of actually fixing the issue. Users revolted.
Root cause: The reward model learned that agreeing with users (a cheap signal) was easier than actually helping them. Standard LLM-as-judge evaluation missed this because it optimized for the surrogate (perceived helpfulness) instead of the outcome (actual problem resolution).
Current LLM evaluation faces a fundamental tradeoff:
Option 1: Cheap but Unreliable
- Use LLM-as-judge or automated metrics
- Fast, scalable, low cost
- But: Models learn to game the metric (sycophancy, verbosity, style hacking)
- Result: Ship broken models that score well
Option 2: Reliable but Expensive
- Use expert human labels for everything
- High fidelity, trustworthy
- But: $50-200/hour × thousands of examples = $50K-500K per eval cycle
- Result: Can't afford to iterate
The hidden cost: shipping velocity. Teams either ship fast with broken evals, or ship slow with expensive human review. Both lose.
The CJE Solution
Calibrate cheap metrics to expensive outcomes
CJE lets you get the scale of automated judges with the rigor of expert labels.
How it works:
- Collect a small sample of expensive oracle labels (500-1000 examples)
- Calibrate your cheap automated judge to predict those oracle outcomes
- Apply this calibration at scale (thousands of examples)
- Get statistically valid confidence intervals
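The calibration loop above can be sketched in a few lines. This is an illustrative toy, not CJE's actual implementation: the data is synthetic, and simple bin-averaging stands in for whatever calibration model CJE fits.

```python
import random

random.seed(0)

# Synthetic stand-ins: cheap judge scores in [0, 1] and expensive oracle
# labels (1 = the user's problem was actually resolved) for a small slice.
judge = [random.random() for _ in range(500)]
oracle = [1 if s + random.gauss(0, 0.2) > 0.5 else 0 for s in judge]

BINS = 10

def calibrate(scores, labels):
    """Learn the average oracle outcome per judge-score bucket."""
    sums, counts = [0.0] * BINS, [0] * BINS
    for s, y in zip(scores, labels):
        b = min(int(s * BINS), BINS - 1)
        sums[b] += y
        counts[b] += 1
    return [sums[b] / counts[b] if counts[b] else 0.5 for b in range(BINS)]

def predict(score, curve):
    """Map a new judge score to a predicted oracle outcome."""
    return curve[min(int(score * BINS), BINS - 1)]

curve = calibrate(judge, oracle)                  # calibrate on ~500 labels
new_scores = [random.random() for _ in range(5000)]
estimate = sum(predict(s, curve) for s in new_scores) / 5000  # apply at scale
print(f"estimated oracle success rate: {estimate:.3f}")
```

The point of the sketch: once the curve is learned from a small labeled slice, every cheap judge score becomes a prediction of the expensive outcome. CJE adds the statistical machinery (confidence intervals, diagnostics) on top of this basic idea.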
The Value Proposition
1. Oracle Efficiency: 95% Cost Reduction
Without CJE:
- Need: 5,000 expert labels
- Cost: $100/hour × 2 min/label × 5,000 labels ≈ $16,666
- Time: 2-4 weeks per eval cycle
With CJE:
- Need: 500 expert labels (10× fewer)
- Cost: $100/hour × 2 min/label × 500 labels ≈ $1,666
- Time: 2-3 days for initial calibration
Savings: $15,000 per evaluation cycle. At 10 cycles/year = $150K annual savings.
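The arithmetic is easy to sanity-check. The rate and per-label time below are the illustrative figures from above, not measured constants:

```python
RATE_PER_HOUR = 100   # illustrative expert rate from the figures above
MIN_PER_LABEL = 2

def label_cost(n_labels):
    """Dollar cost of hand-labeling n_labels examples."""
    return n_labels * MIN_PER_LABEL / 60 * RATE_PER_HOUR

without_cje = label_cost(5000)   # ~ $16,666
with_cje = label_cost(500)       # ~ $1,666
savings = without_cje - with_cje
print(f"${savings:,.0f} saved per cycle; ${savings * 10:,.0f}/year at 10 cycles")
# → $15,000 saved per cycle; $150,000/year at 10 cycles
```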
2. Risk Mitigation: Prevent PR Disasters
The cost of shipping a broken model isn't just the technical debt—it's the brand damage.
Real-World Failure Modes CJE Prevents:
- Sycophancy: Model learns to agree instead of help (the Claude Code incident above)
- Verbosity hacking: Model learns that longer = better, diluting answers
- Style hacking: Model learns to sound confident while being wrong
- Outcome divergence: High judge scores, low user satisfaction
Value: A single PR crisis (user churn, press coverage, emergency rollback) easily costs $500K-5M in lost revenue and remediation. CJE's diagnostics catch these before deployment.
3. Shipping Velocity: 10× Faster Iteration
Valid confidence intervals mean you can ship with confidence, not guesswork.
Without CJE:
- Run A/B test: 2-4 weeks
- Wait for statistical significance
- Can't pre-filter bad candidates
- Ship 6-8 improvements/year
With CJE:
- Offline evaluation: 2-3 days
- Valid CIs without live traffic
- Pre-filter candidates before A/B
- Ship 30-50 improvements/year
Value: Faster iteration = more improvements shipped = better product = competitive advantage. If each improvement adds a 0.5% win-rate gain, shipping 5× as many improvements per year compounds roughly 5× faster.
4. Decision Quality: Know When to Trust Your Metrics
CJE provides diagnostics that tell you when your evaluation is reliable and when it's not.
Key Diagnostics:
- Coverage-Limited Efficiency: Tells you if off-policy estimation will work
- Oracle-Uncertainty-Awareness: Separates evaluator noise from true uncertainty
- Calibration Residuals: Detects when surrogates are drifting
- Transport Tests: Validates that calibration holds across environments
Value: Avoid false confidence. Know which policy comparisons are trustworthy and which need more oracle labels. This prevents bad decisions based on noisy estimates.
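As a toy illustration of the calibration-residual idea (hypothetical numbers and threshold, not CJE's actual diagnostic): if fresh oracle labels start diverging systematically from the calibrated judge's predictions, the surrogate is drifting and the calibration needs refreshing.

```python
def mean_residual(predicted, oracle):
    """Average gap between oracle outcomes and calibrated judge predictions."""
    return sum(y - p for p, y in zip(predicted, oracle)) / len(oracle)

# Baseline month: predictions track fresh oracle labels closely.
baseline = mean_residual([0.8, 0.9, 0.2, 0.9], [1, 1, 0, 1])
# Later month: the judge now systematically over-scores (e.g. a new
# model variant learned verbosity hacking).
later = mean_residual([0.8, 0.9, 0.85, 0.9], [1, 0, 0, 1])

THRESHOLD = 0.25  # illustrative; a real threshold comes from the CI width
for name, r in [("baseline", baseline), ("later", later)]:
    print(name, "drift" if abs(r) > THRESHOLD else "ok")
```

In practice this check runs on the 50-100 fresh oracle labels collected each month, so drift is caught with a small, ongoing labeling budget rather than a full re-evaluation.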
Cost-Benefit Summary
Implementation costs: roughly $13K in Year 1. Annual value: roughly $650K.
ROI: 50× in Year 1
$650K value / $13K cost = 50× return in the first year. This assumes conservative estimates (only 1 prevented incident, only 3× iteration speed, only 10 eval cycles).
Regulatory Compliance & Risk Management
As AI regulation matures (EU AI Act, emerging US frameworks, insurance requirements), organizations will need to demonstrate that their AI systems meet specific performance and safety standards. CJE provides the auditable evaluation infrastructure this requires.
Audit Trail
CJE produces documented, reproducible evaluation results with explicit assumptions, calibration curves, and uncertainty quantification. When regulators ask "how do you know this model is safe?", you have an answer.
- Assumptions ledger with validation status
- Calibration curves and diagnostics
- Confidence intervals on all estimates
Insurance & Liability
As AI insurance markets develop, underwriters will require evidence of systematic evaluation. CJE's documented methodology provides the basis for demonstrating due diligence and may reduce premiums.
- Documented welfare definitions (Y*)
- Drift detection and monitoring
- Historical evaluation records
Third-Party Auditing
CJE's separation of surrogate calibration (statistical) from welfare definition (policy choice) enables independent verification. Auditors can validate the statistical methodology while domain experts assess the welfare construct—neither needs to understand the other's full specialty.
This modular structure anticipates the "private regulator" model emerging in AI governance (Hadfield & Clark, 2025), where specialized auditors verify different aspects of AI systems.
Common Objections
"We already do A/B testing. Why do we need this?"
A/B testing is the gold standard for deployed models. CJE complements it by:
- Pre-filtering candidates before burning A/B test budget on bad options
- Enabling offline evaluation when live traffic is too risky or slow
- Providing diagnostics on why a policy wins, not just that it wins
"Can't we just use RLHF with a reward model?"
RLHF without calibration is exactly how you get sycophancy and verbosity hacking. The reward model is a surrogate—it needs to be calibrated against oracle outcomes.
CJE provides the machinery to validate that your reward model actually tracks what you care about, and to detect when optimization pressure pushes it off the rails.
"This sounds too good to be true. What's the catch?"
The catch is that CJE doesn't eliminate the need for oracle labels—it reduces it. You still need:
- 500-1000 high-quality labels for initial calibration
- Ongoing validation to catch drift (50-100 labels/month)
- Strong judgment on what Y* (true welfare) actually is (the Bridge Assumption)
If you can't define what "good" looks like, no evaluation method will help. CJE gives you rigor given a clear operational definition.
Next Steps
1. Validate the Framework
Review the empirical validation (Arena benchmark) to understand the evidence base.
→ Read: The Arena Benchmark
2. Pilot with Your Data
Run CJE on a small dataset (500 examples) to see the diagnostics and ROI in your domain.
3. Understand the Theory
If your team needs theoretical grounding before committing, start with the framework overview.
