
Causal Judge Evaluation
Evaluate LLM policies using judge scores with the same rigor as A/B tests—without running expensive online experiments.
The goal in practical evals
You need to rank policy candidates and—when possible—estimate KPI levels you can ship with a confidence interval. Is the new prompt actually better? By how much? Can you trust the difference enough to deploy?
What We're Actually Trying to Estimate
V(π) = 𝔼[Y(π)]
V(π) — The value of deploying policy π to production
𝔼[Y(π)] — The expected business outcome (conversion rate, revenue, user satisfaction, etc.) if we actually shipped π to real users
This is not "What does the judge score as 8.2/10?" It's "What conversion rate do we get if we deploy this?" The first is an arbitrary number. The second is what stakeholders care about and what determines business impact.
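Written side by side (notation here is ours: Sᵢ is the judge score for sample i, Y(π) the KPI outcome under deployment of π), the number most dashboards report and the quantity we actually want are:

```latex
% Naive judge average vs. the deployment value.
% S_i = judge score for sample i (judge's arbitrary scale); Y(\pi) = KPI outcome under \pi.
\underbrace{\frac{1}{n}\sum_{i=1}^{n} S_i}_{\text{judge units, arbitrary scale}}
\qquad \text{vs.} \qquad
\underbrace{V(\pi) = \mathbb{E}\!\left[\,Y(\pi)\,\right]}_{\text{KPI units, value of shipping } \pi}
```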
The heuristic most teams run
Average raw judge scores per policy and compare deltas. Your Slack channel looks like:
@sarah: New prompt scored 8.2, baseline is 7.8
@mike: +0.4 improvement, nice! Ship it?
@sarah: Looks good to me 🚢
Simple, fast, feels data-driven. But it's a heuristic, not science.
Concrete Failure Mode
Your e-commerce chatbot gets an LLM-as-judge score of 8.2/10 vs the baseline's 7.8/10. The team ships it. Two weeks later: purchase conversion dropped 3%. What happened?
The judge preferred longer, more polite responses. But customers wanted speed—extra tokens meant higher latency, and users bounced. The judge couldn't see latency, so it optimized the wrong thing.
Why the heuristic fails
- Wrong scale. Judge scores aren't on your KPI scale. A score of 8/10 might mean 40% conversion, or 8%, or 73%—you have no idea.
- Hidden drift and slice bias. Behavior varies across time, prompt types, and user segments. Averages paper over the fact that your "8.2" came from easy queries and "7.8" from hard ones.
- No uncertainty. You can't tell if a 0.4-point difference is noise or a real uplift. No confidence interval means no statistical rigor.
- Not causal. Results reflect the data you happened to collect, not what would happen if you actually deployed the new policy to production traffic.
Recent research confirms these issues: LLM judges exhibit high variance across tasks and perform worse when evaluating model-generated text (Bavaresco et al., 2024). Even judges with high agreement can differ by 5+ points from human scores, and show systematic leniency bias (Thakur et al., 2025).
CJE in one paragraph
CJE wraps LLM-as-judge in a causal inference framework. On a small labeled slice (200-1000 examples), learn a mean-preserving mapping from judge scores to KPI outcomes (AutoCal-R). Apply this calibration to your evaluation data to get estimates in KPI units. Add oracle-uncertainty-aware (OUA) confidence intervals that account for both sampling noise and calibration uncertainty. For off-policy logs, compute importance weights via teacher forcing, then stabilize them with score-indexed, mean-one normalization (SIMCal). Optionally hedge with an outcome model for doubly robust (DR) estimation.
The result: policy value estimates with 95% CIs, plus diagnostic checks that tell you when and how to improve reliability.
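As a concrete sketch of the calibrate-then-average step (DM mode), the snippet below uses scikit-learn's isotonic regression as a stand-in for AutoCal-R's mean-preserving map and a bootstrap over the labeled slice as a stand-in for OUA. It is illustrative only, not the CJE library's API; every name in it is ours.

```python
# Sketch of "calibrate judge scores on a labeled slice, then average on eval data".
# Isotonic regression stands in for AutoCal-R's mean-preserving map; the bootstrap
# stands in for OUA so that calibration noise enters the interval. Not the CJE API.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def calibrated_value(labeled_scores, labeled_kpis, eval_scores, n_boot=500, seed=0):
    """Return a KPI-scale point estimate and a 95% interval for one policy."""
    rng = np.random.default_rng(seed)
    labeled_scores = np.asarray(labeled_scores, dtype=float)
    labeled_kpis = np.asarray(labeled_kpis, dtype=float)
    eval_scores = np.asarray(eval_scores, dtype=float)

    # Monotone score -> KPI map learned on the small labeled slice (200-1000 examples).
    cal = IsotonicRegression(out_of_bounds="clip")
    cal.fit(labeled_scores, labeled_kpis)

    # Point estimate: mean calibrated score over the (unlabeled) evaluation set.
    point = cal.predict(eval_scores).mean()

    # Bootstrap both the labeled slice (calibration uncertainty) and the eval set
    # (sampling uncertainty), refitting the calibrator each time.
    boots = []
    n_lab, n_eval = len(labeled_scores), len(eval_scores)
    for _ in range(n_boot):
        lab_idx = rng.integers(0, n_lab, n_lab)
        eval_idx = rng.integers(0, n_eval, n_eval)
        b = IsotonicRegression(out_of_bounds="clip")
        b.fit(labeled_scores[lab_idx], labeled_kpis[lab_idx])
        boots.append(b.predict(eval_scores[eval_idx]).mean())
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return point, (lo, hi)
```

Refitting the calibrator inside the bootstrap is the point of the exercise: the interval comes out wider than a naive CI over eval scores because calibration uncertainty is part of the error budget.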
Three modes at a glance
DM (Direct)
Generate fresh outputs from each policy on the same prompts → calibrate → average
When: You can generate for all candidates
Output: V̂(π) ± OUA
IPS (Off-Policy)
Reweight logged data using likelihood ratios → stabilize with SIMCal
When: You have logs with teacher-forcing
Output: V̂IPS ± OUA + ESS/tails
DR (Doubly Robust)
IPS + outcome model → tighter CIs when overlap is imperfect
When: Overlap is weak but you can train a critic
Output: V̂DR ± OUA + orthogonality check
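In stylized form (standard off-policy evaluation notation rather than the library's exact estimators: f is the calibrated score-to-KPI map, wᵢ the stabilized importance weight, q̂ᵢ the critic's prediction), the three modes reduce to:

```latex
% Stylized estimator forms for the three modes.
\hat{V}_{\mathrm{DM}}(\pi)  = \frac{1}{n}\sum_{i=1}^{n} f\!\left(S_i^{\pi}\right)
\qquad
\hat{V}_{\mathrm{IPS}}(\pi) = \frac{1}{n}\sum_{i=1}^{n} w_i\, f\!\left(S_i\right)
\qquad
\hat{V}_{\mathrm{DR}}(\pi)  = \frac{1}{n}\sum_{i=1}^{n} \Bigl[\hat{q}_i + w_i\bigl(f(S_i) - \hat{q}_i\bigr)\Bigr]
```

DM averages calibrated scores on fresh generations, IPS reweights logged scores toward the target policy, and DR adds a weighted correction to the critic so that either accurate weights or an accurate critic is enough for consistency.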
Operator contract: what you always get
Standard Output
- Point estimate in KPI units (e.g., 0.23 purchase probability) with 95% CI
- OUA share — fraction of variance from calibration vs sampling (tells you where to invest next)
- AutoCal-R reliability plot — predicted vs observed outcomes across score bins
- S-coverage — fraction of eval scores within labeled range (detects extrapolation risk)
Additional for IPS/DR
- ESS fraction — effective sample size after SIMCal (healthy: >30%)
- Max-weight share — concentration of weight mass (alert if a single sample carries >10%)
- Hill tail index — weight distribution heaviness (healthy: >2.0; see the code sketch below)
- Overlap heatmap — weights vs judge scores to diagnose support issues
Additional for DR Only
- Orthogonality test — weighted mean of (R - q̂) with CI (should cover zero)
- Critic R² on held-out data (healthy: >0.5)
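The weight diagnostics and the orthogonality test reduce to a few lines of arithmetic on the stabilized weights. A minimal sketch using textbook formulas (not the CJE library's implementation; function names are ours):

```python
# Textbook versions of the weight/critic diagnostics listed above. Illustrative only.
import numpy as np

def ess_fraction(w):
    """Effective sample size as a fraction of n: (sum w)^2 / (n * sum w^2)."""
    w = np.asarray(w, dtype=float)
    return (w.sum() ** 2) / (len(w) * (w ** 2).sum())

def max_weight_share(w):
    """Share of total weight mass carried by the single largest weight."""
    w = np.asarray(w, dtype=float)
    return w.max() / w.sum()

def hill_tail_index(w, k=None):
    """Hill estimator of the tail index on the k largest (positive) weights."""
    w = np.sort(np.asarray(w, dtype=float))
    if k is None:
        k = max(10, len(w) // 20)      # heuristic tail fraction
    k = min(k, len(w) - 1)
    tail = w[-k:]
    return 1.0 / np.mean(np.log(tail / w[-k - 1]))

def orthogonality_check(w, rewards, q_hat, z=1.96):
    """Weighted mean of residuals (R - q_hat) with a normal-approximation 95% CI."""
    w, r, q = (np.asarray(a, dtype=float) for a in (w, rewards, q_hat))
    terms = w * (r - q)
    mean = terms.mean()
    se = terms.std(ddof=1) / np.sqrt(len(terms))
    return mean, (mean - z * se, mean + z * se)   # healthy: the CI covers zero
```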
Diagnostic thresholds (reference card)
| Diagnostic | Healthy | Alert |
|---|---|---|
| S-coverage | >95% within range | <90% or flat boundary slopes |
| AutoCal-R reliability | R² > 0.5, tight bands | R² < 0.3, wide CIs in bins |
| ESS fraction (IPS/DR) | >30% after SIMCal | <20% (poor overlap) |
| Hill tail index (IPS/DR) | >2.0 (light tails) | <1.5 (heavy tails) |
| Orthogonality (DR) | 95% CI covers zero | CI excludes zero |
→ See Diagnostics & Fixes for detailed fix strategies
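If you want the reference card as an automated gate in an eval pipeline, the thresholds translate directly into code. A hedged sketch (the dictionary keys and messages are ours, not a CJE API):

```python
# Turn the reference card above into automated alerts. Keys and thresholds mirror
# the table; this is an illustrative helper, not part of the CJE library.
def diagnostic_alerts(d: dict) -> list[str]:
    alerts = []
    if d.get("s_coverage", 1.0) < 0.90:
        alerts.append("S-coverage <90%: extrapolation risk; add labels in missing score bins")
    if d.get("reliability_r2", 1.0) < 0.3:
        alerts.append("AutoCal-R R² <0.3: calibration unreliable; collect more labels")
    if d.get("ess_fraction", 1.0) < 0.20:
        alerts.append("ESS fraction <20%: poor overlap; restrict cohort or switch to DR")
    if d.get("hill_tail_index", float("inf")) < 1.5:
        alerts.append("Hill index <1.5: heavy-tailed weights; check likelihood ratios")
    if d.get("orthogonality_ci_covers_zero", True) is False:
        alerts.append("Orthogonality CI excludes zero: improve the critic")
    return alerts

# Example: diagnostic_alerts({"s_coverage": 0.97, "ess_fraction": 0.15})
# -> ["ESS fraction <20%: poor overlap; restrict cohort or switch to DR"]
```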
Philosophy: estimate + diagnose → fix → re-run
CJE provides estimates with diagnostics, not hard failures. Run your estimation, check the diagnostic dashboard, apply targeted fixes if alerts fire, then re-run. Each diagnostic links to specific remedies:
- Poor S-coverage? Add labels in missing score bins or narrow the eval distribution
- Low ESS? Restrict to a better-overlap cohort or switch from IPS to DR
- Heavy tails? SIMCal should fix this; if not, check for likelihood-ratio bugs or model drift
- Orthogonality fails? Improve the critic with more features or a better architecture
This iterative workflow mirrors A/B testing: you get a result and health metrics, then improve as needed.
When to use CJE (and when not to)
Use CJE When
- A/B tests are too slow (weeks of user exposure)
- A/B tests are too expensive (limited traffic)
- You need to evaluate many candidates quickly (10+ variants)
- You can't A/B test (safety issues, compliance, low-traffic segments)
- You have historical logs and want to reuse them
Consider A/B Testing Instead
- You have only 1-2 candidates and plenty of traffic
- Your KPI is easy to measure online (clicks, conversions)
- You need absolute certainty for a high-stakes launch
- Your judge can't observe critical features (latency, UI)
- Labeling cost exceeds A/B test cost
CJE is for rapid iteration and cases where online testing is infeasible. It's not a replacement for A/B tests—it's a complement that lets you move faster and test more hypotheses offline before committing to expensive online experiments.
Selected References
- Bavaresco et al. (2024). "LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks." arXiv:2406.18403.
Finds that LLMs exhibit large variance in correlation to human judgments across datasets and perform worse when evaluating model-generated text.
- Thakur et al. (2025). "Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges." Proceedings of GEM 2025.
Shows that even high-agreement judge models can differ by 5+ points from human scores and exhibit systematic leniency bias.
Deep Dives
For detailed explanations of each mode, diagnostic interpretation, and theoretical foundations:
Quick-Start Recipes
→ 7-step workflow, mode selection (DM/IPS/DR), and copy-and-use recipe cards.
Diagnostics & Fixes
→ The 5 highest-leverage diagnostics with alert thresholds and fix strategies.
Direct Method (DM)
→ AutoCal-R calibration, OUA for honest uncertainty, and what to report.
Off-Policy Re-use (IPS & DR)
→ Calibrated IPS with SIMCal stabilization and doubly robust estimation for log re-use.
Assumptions (Plain English)
→ When your estimates are causally interpretable: shared assumptions and mode-specific requirements.