
Causal Judge Evaluation
Evaluate LLM policies using judge scores with the same rigor as A/B tests—without running expensive online experiments.
The goal in practical evals
You need to rank policy candidates and—when possible—estimate KPI levels you can ship with a confidence interval. Is the new prompt actually better? By how much? Can you trust the difference enough to deploy?
What We're Actually Trying to Estimate
V(π) = 𝔼[Y(π)]
V(π) — The value of deploying policy π to production
𝔼[Y(π)] — The expected business outcome (conversion rate, revenue, user satisfaction, etc.) if we actually shipped π to real users
This is not "What does the judge score as 8.2/10?" It's "What conversion rate do we get if we deploy this?" The first is an arbitrary number. The second is what stakeholders care about and what determines business impact.
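Written side by side (notation here is ours: Sᵢ is the judge score for sample i, Y(π) the KPI outcome under deployment of π), the number most dashboards report and the quantity we actually want are:

```latex
% Naive judge average vs. the deployment value.
% S_i = judge score for sample i (judge's arbitrary scale); Y(\pi) = KPI outcome under \pi.
\underbrace{\frac{1}{n}\sum_{i=1}^{n} S_i}_{\text{judge units, arbitrary scale}}
\qquad \text{vs.} \qquad
\underbrace{V(\pi) = \mathbb{E}\!\left[\,Y(\pi)\,\right]}_{\text{KPI units, value of shipping } \pi}
```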
The heuristic most teams run
Average raw judge scores per policy and compare deltas. Your Slack channel looks like:
@sarah: New prompt scored 8.2, baseline is 7.8
@mike: +0.4 improvement, nice! Ship it?
@sarah: Looks good to me 🚢
Simple, fast, feels data-driven. But it's a heuristic, not science.
Concrete Failure Mode
Your e-commerce chatbot gets an LLM-as-judge score of 8.2/10 vs the baseline's 7.8/10. The team ships it. Two weeks later: purchase conversion dropped 3%. What happened?
The judge preferred longer, more polite responses. But customers wanted speed—extra tokens meant higher latency, and users bounced. The judge couldn't see latency, so it optimized the wrong thing.
Why the heuristic fails
- Wrong scale. Judge scores aren't on your KPI scale. A score of 8/10 might mean 40% conversion, or 8%, or 73%—you have no idea.
- Hidden drift and slice bias. Behavior varies across time, prompt types, and user segments. Averages paper over the fact that your "8.2" came from easy queries and "7.8" from hard ones.
- No uncertainty. You can't tell if a 0.4-point difference is noise or a real uplift. No confidence interval means no statistical rigor.
- Not causal. Results reflect the data you happened to collect, not what would happen if you actually deployed the new policy to production traffic.
Recent research confirms these issues: LLM judges exhibit high variance across tasks and perform worse when evaluating model-generated text (Bavaresco et al., 2024). Even judges with high agreement can differ by 5+ points from human scores, and show systematic leniency bias (Thakur et al., 2025).
CJE in one paragraph
CJE wraps LLM-as-judge in a causal inference framework. On a small labeled slice (200-1000 examples), learn a mean-preserving mapping from judge scores to KPI outcomes (AutoCal-R). Apply this calibration to your evaluation data to get estimates in KPI units. Add oracle-uncertainty-aware (OUA) confidence intervals that account for both sampling noise and calibration uncertainty. For off-policy logs, compute importance weights via teacher forcing, then stabilize them with score-indexed, mean-one normalization (SIMCal). Optionally hedge with an outcome model for doubly robust (DR) estimation.
The result: policy value estimates with 95% CIs, plus diagnostic checks that tell you when and how to improve reliability.
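As a concrete sketch of the calibrate-then-average step (DM mode), the snippet below uses scikit-learn's isotonic regression as a stand-in for AutoCal-R's mean-preserving map and a bootstrap over the labeled slice as a stand-in for OUA. It is illustrative only, not the CJE library's API; every name in it is ours.

```python
# Sketch of "calibrate judge scores on a labeled slice, then average on eval data".
# Isotonic regression stands in for AutoCal-R's mean-preserving map; the bootstrap
# stands in for OUA so that calibration noise enters the interval. Not the CJE API.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def calibrated_value(labeled_scores, labeled_kpis, eval_scores, n_boot=500, seed=0):
    """Return a KPI-scale point estimate and a 95% interval for one policy."""
    rng = np.random.default_rng(seed)
    labeled_scores = np.asarray(labeled_scores, dtype=float)
    labeled_kpis = np.asarray(labeled_kpis, dtype=float)
    eval_scores = np.asarray(eval_scores, dtype=float)

    # Monotone score -> KPI map learned on the small labeled slice (200-1000 examples).
    cal = IsotonicRegression(out_of_bounds="clip")
    cal.fit(labeled_scores, labeled_kpis)

    # Point estimate: mean calibrated score over the (unlabeled) evaluation set.
    point = cal.predict(eval_scores).mean()

    # Bootstrap both the labeled slice (calibration uncertainty) and the eval set
    # (sampling uncertainty), refitting the calibrator each time.
    boots = []
    n_lab, n_eval = len(labeled_scores), len(eval_scores)
    for _ in range(n_boot):
        lab_idx = rng.integers(0, n_lab, n_lab)
        eval_idx = rng.integers(0, n_eval, n_eval)
        b = IsotonicRegression(out_of_bounds="clip")
        b.fit(labeled_scores[lab_idx], labeled_kpis[lab_idx])
        boots.append(b.predict(eval_scores[eval_idx]).mean())
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return point, (lo, hi)
```

Refitting the calibrator inside the bootstrap is the point of the exercise: the interval comes out wider than a naive CI over eval scores because calibration uncertainty is part of the error budget.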
Three modes at a glance
DM (Direct)
Generate fresh outputs from each policy on the same prompts → calibrate → average
When: You can generate for all candidates
Output: V̂(π) ± OUA
IPS (Off-Policy)
Reweight logged data using likelihood ratios → stabilize with SIMCal
When: You have logs with teacher-forcing
Output: V̂IPS ± OUA + ESS/tails
DR (Doubly Robust)
IPS + outcome model → tighter CIs when overlap is imperfect
When: Overlap is weak but you can train a critic
Output: V̂DR ± OUA + orthogonality check
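In stylized form (standard off-policy evaluation notation rather than the library's exact estimators: f is the calibrated score-to-KPI map, wᵢ the stabilized importance weight, q̂ᵢ the critic's prediction), the three modes reduce to:

```latex
% Stylized estimator forms for the three modes.
\hat{V}_{\mathrm{DM}}(\pi)  = \frac{1}{n}\sum_{i=1}^{n} f\!\left(S_i^{\pi}\right)
\qquad
\hat{V}_{\mathrm{IPS}}(\pi) = \frac{1}{n}\sum_{i=1}^{n} w_i\, f\!\left(S_i\right)
\qquad
\hat{V}_{\mathrm{DR}}(\pi)  = \frac{1}{n}\sum_{i=1}^{n} \Bigl[\hat{q}_i + w_i\bigl(f(S_i) - \hat{q}_i\bigr)\Bigr]
```

DM averages calibrated scores on fresh generations, IPS reweights logged scores toward the target policy, and DR adds a weighted correction to the critic so that either accurate weights or an accurate critic is enough for consistency.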
Operator contract: what you always get
Standard Output
- Point estimate in KPI units (e.g., 0.23 purchase probability) with 95% CI
- OUA share — fraction of variance from calibration vs sampling (tells you where to invest next)
- AutoCal-R reliability plot — predicted vs observed outcomes across score bins
- S-coverage — fraction of eval scores within labeled range (detects extrapolation risk)
Additional for IPS/DR
- ESS fraction — effective sample size after SIMCal (healthy: >30%)
- Max-weight share — concentration of weight mass (alert if a single sample carries >10%)
- Hill tail index — weight distribution heaviness (healthy: >2.0; see the code sketch below)
- Overlap heatmap — weights vs judge scores to diagnose support issues
Additional for DR Only
- Orthogonality test — weighted mean of (R - q̂) with CI (should cover zero)
- Critic R² on held-out data (healthy: >0.5)
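The weight diagnostics and the orthogonality test reduce to a few lines of arithmetic on the stabilized weights. A minimal sketch using textbook formulas (not the CJE library's implementation; function names are ours):

```python
# Textbook versions of the weight/critic diagnostics listed above. Illustrative only.
import numpy as np

def ess_fraction(w):
    """Effective sample size as a fraction of n: (sum w)^2 / (n * sum w^2)."""
    w = np.asarray(w, dtype=float)
    return (w.sum() ** 2) / (len(w) * (w ** 2).sum())

def max_weight_share(w):
    """Share of total weight mass carried by the single largest weight."""
    w = np.asarray(w, dtype=float)
    return w.max() / w.sum()

def hill_tail_index(w, k=None):
    """Hill estimator of the tail index on the k largest (positive) weights."""
    w = np.sort(np.asarray(w, dtype=float))
    if k is None:
        k = max(10, len(w) // 20)      # heuristic tail fraction
    k = min(k, len(w) - 1)
    tail = w[-k:]
    return 1.0 / np.mean(np.log(tail / w[-k - 1]))

def orthogonality_check(w, rewards, q_hat, z=1.96):
    """Weighted mean of residuals (R - q_hat) with a normal-approximation 95% CI."""
    w, r, q = (np.asarray(a, dtype=float) for a in (w, rewards, q_hat))
    terms = w * (r - q)
    mean = terms.mean()
    se = terms.std(ddof=1) / np.sqrt(len(terms))
    return mean, (mean - z * se, mean + z * se)   # healthy: the CI covers zero
```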
Diagnostic thresholds (reference card)
| Diagnostic | Healthy | Alert |
|---|---|---|
| S-coverage | >95% within range | <90% or flat boundary slopes |
| AutoCal-R reliability | R² > 0.5, tight bands | R² < 0.3, wide CIs in bins |
| ESS fraction (IPS/DR) | >30% after SIMCal | <20% (poor overlap) |
| Hill tail index (IPS/DR) | >2.0 (light tails) | <1.5 (heavy tails) |
| Orthogonality (DR) | 95% CI covers zero | CI excludes zero |
→ See Diagnostics & Fixes for detailed fix strategies
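If you want the reference card as an automated gate in an eval pipeline, the thresholds translate directly into code. A hedged sketch (the dictionary keys and messages are ours, not a CJE API):

```python
# Turn the reference card above into automated alerts. Keys and thresholds mirror
# the table; this is an illustrative helper, not part of the CJE library.
def diagnostic_alerts(d: dict) -> list[str]:
    alerts = []
    if d.get("s_coverage", 1.0) < 0.90:
        alerts.append("S-coverage <90%: extrapolation risk; add labels in missing score bins")
    if d.get("reliability_r2", 1.0) < 0.3:
        alerts.append("AutoCal-R R² <0.3: calibration unreliable; collect more labels")
    if d.get("ess_fraction", 1.0) < 0.20:
        alerts.append("ESS fraction <20%: poor overlap; restrict cohort or switch to DR")
    if d.get("hill_tail_index", float("inf")) < 1.5:
        alerts.append("Hill index <1.5: heavy-tailed weights; check likelihood ratios")
    if d.get("orthogonality_ci_covers_zero", True) is False:
        alerts.append("Orthogonality CI excludes zero: improve the critic")
    return alerts

# Example: diagnostic_alerts({"s_coverage": 0.97, "ess_fraction": 0.15})
# -> ["ESS fraction <20%: poor overlap; restrict cohort or switch to DR"]
```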
Philosophy: estimate + diagnose → fix → re-run
CJE provides estimates with diagnostics, not hard failures. Run your estimation, check the diagnostic dashboard, apply targeted fixes if alerts fire, then re-run. Each diagnostic links to specific remedies:
- Poor S-coverage? Add labels in missing score bins or narrow the eval distribution
- Low ESS? Restrict to a better-overlap cohort or switch from IPS to DR
- Heavy tails? SIMCal should fix this; if not, check for likelihood-ratio bugs or model drift
- Orthogonality fails? Improve the critic with more features or a better architecture
This iterative workflow mirrors A/B testing: you get a result and health metrics, then improve as needed.
When to use CJE (and when not to)
Use CJE When
- A/B tests are too slow (weeks of user exposure)
- A/B tests are too expensive (limited traffic)
- You need to evaluate many candidates quickly (10+ variants)
- You can't A/B test (safety issues, compliance, low-traffic segments)
- You have historical logs and want to reuse them
Consider A/B Testing Instead
- You have only 1-2 candidates and plenty of traffic
- Your KPI is easy to measure online (clicks, conversions)
- You need absolute certainty for a high-stakes launch
- Your judge can't observe critical features (latency, UI)
- Labeling cost exceeds A/B test cost
CJE is for rapid iteration and cases where online testing is infeasible. It's not a replacement for A/B tests—it's a complement that lets you move faster and test more hypotheses offline before committing to expensive online experiments.
Selected References
- Bavaresco et al. (2024). "LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks." arXiv:2406.18403.
Finds that LLMs exhibit large variance in correlation to human judgments across datasets and perform worse when evaluating model-generated text.
- Thakur et al. (2025). "Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges." Proceedings of GEM 2025.
Shows that even high-agreement judge models can differ by 5+ points from human scores and exhibit systematic leniency bias.
Deep Dives
For detailed explanations of each mode, diagnostic interpretation, and theoretical foundations:
Quick-Start Recipes
→ 7-step workflow, mode selection (DM/IPS/DR), and copy-and-use recipe cards.
Diagnostics & Fixes
→ The 5 highest-leverage diagnostics with alert thresholds and fix strategies.
Direct Method (DM)
→ AutoCal-R calibration, OUA for honest uncertainty, and what to report.
Off-Policy Re-use (IPS & DR)
→ Calibrated IPS with SIMCal stabilization and doubly robust estimation for log re-use.
Assumptions (Plain English)
→ When your estimates are causally interpretable: shared assumptions and mode-specific requirements.