Stop shipping based on eval scores that don't predict production
Your eval scores aren't calibrated to your actual KPIs. CJE fixes this—giving you audit-ready estimates with confidence intervals you can trust for deployment decisions.
Real Failure Mode
The judge preferred verbose, polite responses. Users wanted speed. Extra tokens = higher latency = bounces. The judge couldn't see what mattered.
What most teams do (and why it fails)
The standard heuristic:
1. Run two models/prompts on your eval set
2. Score outputs with LLM-as-judge (0-10 scale)
3. Compare average scores (8.2 vs 7.8)
4. Ship the higher-scoring one
Simple. Fast. Feels data-driven. Completely unreliable.
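In code, the whole heuristic is two sample means and a comparison; a minimal sketch with made-up judge scores:

```python
import numpy as np

# Made-up judge scores (0-10) for two prompt variants on the same eval set.
scores_a = np.array([8.5, 7.9, 8.3, 8.1, 8.2])
scores_b = np.array([7.6, 8.0, 7.9, 7.7, 7.8])

# The standard heuristic: compare raw means and ship the winner.
# No KPI calibration, no confidence interval, no check that both variants
# were scored on comparably hard queries.
print(f"A: {scores_a.mean():.2f}  B: {scores_b.mean():.2f}")
if scores_a.mean() > scores_b.mean():
    print("Ship A")  # a 0.4-point gap that may be pure noise
```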
Wrong scale
A score of 8/10 might mean 40% conversion, or 8%, or 73%. You have no idea. Judge scores aren't on your KPI scale.
No uncertainty
Is +0.4 points real or noise? Without confidence intervals, you're guessing. No statistical rigor = no launch confidence.
Not causal
Comparing different prompt sets or logged data? Your "8.2" came from easy queries, "7.8" from hard ones. Selection bias ruins everything.
Large-scale studies confirm that LLM judges show high variance across tasks, can differ from human scores by 5+ points even when raw agreement looks high, and perform worse on model-generated text (Bavaresco et al., 2024; Thakur et al., 2025).
What CJE gives you instead
Turn unreliable judge scores into statistically valid deployment decisions
Calibration
Judge scores → KPI units
Label 200-1000 examples with ground-truth outcomes. Learn the mapping: "Score 8 = 23% conversion." Now estimates are interpretable.
AutoCal-R: mean-preserving, monotone
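AutoCal-R is CJE's own calibrator. As a rough sketch of the underlying idea only (monotone regression from judge score to KPI on made-up numbers, not the actual AutoCal-R algorithm, which also preserves the mean):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical oracle slice: judge scores paired with ground-truth outcomes
# (here, observed conversion rates). A few hundred labels is typically enough.
judge_scores = np.array([3.0, 5.0, 6.0, 6.5, 7.0, 8.0, 8.5, 9.0, 9.5])
outcomes     = np.array([0.02, 0.05, 0.08, 0.10, 0.14, 0.23, 0.27, 0.34, 0.41])

# Monotone map: a higher judge score never gets a lower KPI estimate.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(judge_scores, outcomes)

# "Score 8" now has a KPI interpretation instead of an arbitrary unit.
print(calibrator.predict([8.0, 9.0]))  # ~[0.23, 0.34]
```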
Honest Uncertainty
95% CIs you can trust
Accounts for both sampling noise AND calibration uncertainty. Know when a difference is real enough to ship.
OUA: oracle-uncertainty aware intervals
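A back-of-envelope sketch of why both sources matter (illustrative numbers, not CJE's OUA procedure): the standard error should combine sampling variance across prompts with the extra variance from fitting the calibrator on a finite oracle-labeled set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical calibrated (KPI-scale) scores for one policy on 500 eval prompts.
calibrated = rng.uniform(0.1, 0.4, size=500)

# Source 1: sampling variance of the mean across eval prompts.
sampling_var = calibrated.var(ddof=1) / len(calibrated)

# Source 2: uncertainty in the calibration map itself, e.g. estimated by
# refitting the calibrator on bootstrap resamples of the oracle labels.
# (Illustrative fixed value here.)
calibration_var = 1e-4

# Ignoring calibration_var gives intervals that are too narrow; an
# oracle-uncertainty-aware interval combines both sources.
se = np.sqrt(sampling_var + calibration_var)
v_hat = calibrated.mean()
print(f"{v_hat:.3f} [{v_hat - 1.96 * se:.3f}, {v_hat + 1.96 * se:.3f}] (95% CI)")
```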
Off-Policy Correction
Reuse logged data safely
Comparing different models/prompts? Stabilized importance weights fix distribution shift without exploding variance.
SIMCal: score-indexed, mean-one stabilization
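SIMCal is CJE's stabilization scheme; the starting point it builds on, mean-one (self-normalized) importance weights over teacher-forced log-probabilities, looks roughly like this sketch with synthetic numbers:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000

# Per-example log-probabilities of the LOGGED responses under the logging
# policy and under the candidate policy (computed via teacher forcing).
logp_logging   = rng.normal(-40.0, 3.0, size=n)
logp_candidate = logp_logging + rng.normal(0.0, 1.0, size=n)

# Raw importance weights correct the distribution shift but can explode.
raw_w = np.exp(logp_candidate - logp_logging)

# Normalizing to mean one keeps the estimate on the right scale; SIMCal goes
# further and projects the weights onto a monotone function of the judge
# score to control variance (not shown here).
w = raw_w / raw_w.mean()

calibrated_rewards = rng.uniform(0.0, 1.0, size=n)  # KPI-scale rewards (made up)
v_ips = np.mean(w * calibrated_rewards)             # off-policy value estimate
print(v_ips)
```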
Example output:
Baseline: 0.23 [0.21, 0.25] purchase probability (95% CI)
New prompt: 0.26 [0.24, 0.28] purchase probability (95% CI)
Difference: +3pp [+0.5pp, +5.5pp]
Decision: Ship (interval excludes zero, meaningful lift)
Tested on 4,989 real Arena evaluations
On realistic policy comparisons, standard off-policy weights collapse to an effective sample size below 1%. CJE recovers up to 158× more usable signal.
ESS recovery by policy comparison
| Policy | ESS before CJE | ESS after CJE | Improvement |
|---|---|---|---|
| Prompt Variant | 0.6% | 94.6% | 158× |
| Premium Model | 0.7% | 80.8% | 115× |
| Clone (A/A test) | 26.2% | 98.8% | 3.8× |
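ESS here is the effective sample size of the importance weights, expressed as a fraction of the logged data; the sketch below assumes the standard definition ESS = (Σw)² / Σw².

```python
import numpy as np

def ess_fraction(weights) -> float:
    """Effective sample size of importance weights, as a fraction of n.

    ESS = (sum w)^2 / sum(w^2). Near 100% means nearly every logged example
    contributes; near 0% means a few extreme weights dominate the estimate.
    """
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / ((w ** 2).sum() * len(w)))

# A handful of extreme weights crushes the ESS, which is what the
# "before" column reflects.
raw = np.array([0.1] * 98 + [40.0, 60.0])
print(f"{ess_fraction(raw):.1%}")  # ~2% of the logged data effectively used
```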
Three modes for different situations
Pick the right method for your data and constraints
Direct Method (DM)
Generate fresh outputs from each policy on the same prompts. Simplest, most reliable.
When: You can generate for all candidates
Output: V̂(π) ± OUA confidence intervals
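A sketch of the arithmetic (made-up calibrated scores, not the CJE API): generate with both policies on the same prompts, calibrate the judge scores, then compare paired means.

```python
import numpy as np

# Made-up calibrated judge scores (already in KPI units) for fresh outputs
# generated by each policy on the SAME five eval prompts.
kpi_baseline = np.array([0.21, 0.25, 0.19, 0.24, 0.23])
kpi_new      = np.array([0.24, 0.28, 0.23, 0.27, 0.26])

# Direct Method estimate: mean calibrated score per policy, with a paired
# difference because both policies saw identical prompts.
diff = kpi_new - kpi_baseline
se = diff.std(ddof=1) / np.sqrt(len(diff))
print(f"V(baseline)={kpi_baseline.mean():.3f}  V(new)={kpi_new.mean():.3f}")
print(f"diff={diff.mean():+.3f} ± {1.96 * se:.3f}")
```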
IPS (Off-Policy)
Reweight logged data using likelihood ratios, stabilized with SIMCal. Reuse historical data.
When: You have logged outputs with teacher-forced log-probabilities under each policy
Output: V̂IPS ± OUA + ESS/tails diagnostics
DR (Doubly Robust)
Combine IPS with outcome models. Tighter CIs when overlap is imperfect.
When: Weak overlap but you can train a critic
Output: V̂DR ± OUA + orthogonality check
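A sketch of the doubly-robust combination on synthetic arrays (not CJE's estimator): an outcome-model term plus a weighted residual correction, consistent if either the weights or the critic is well specified.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000

# Mean-one importance weights for the logged responses (as in the IPS mode).
w = rng.lognormal(sigma=0.5, size=n)
w /= w.mean()

# Calibrated rewards of the logged responses, plus a critic (outcome model)
# that predicts reward for both the logged response and the target policy's
# response to the same prompt. All synthetic here.
r_logged = rng.uniform(0.0, 1.0, size=n)
q_logged = np.clip(r_logged + rng.normal(0.0, 0.1, size=n), 0.0, 1.0)
q_target = rng.uniform(0.0, 1.0, size=n)

# Doubly-robust value: outcome-model term plus a weighted residual correction.
# The correction tames the variance IPS suffers when overlap is weak.
v_dr = np.mean(q_target + w * (r_logged - q_logged))
print(v_dr)
```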
When to use CJE vs A/B testing
Use CJE When
- A/B tests take weeks and you need answers now
- You're evaluating 10+ model/prompt variants
- Limited production traffic (can't split safely)
- Can't A/B test (compliance, safety, low-traffic segments)
- You have historical logs you want to reuse
Still A/B Test When
- Only 1-2 candidates and plenty of traffic
- KPI is easy to measure online (clicks, conversions)
- High-stakes launch needing absolute certainty
- Judge can't observe critical features (latency, UI)
CJE isn't a replacement for A/B tests—it's a complement that lets you iterate faster offline before committing to expensive online experiments.
Get started in 5 minutes
Installation
What you get
- Point estimates in KPI units with 95% CIs
- Calibration reliability plots
- Coverage and overlap diagnostics
- ESS and tail heaviness checks (for IPS/DR)
- Orthogonality tests (for DR)
- Refuse-to-estimate flags when unreliable