
Why off-policy evaluation for LLMs fails (and how CJE fixes it)

You've collected thousands of conversations from your deployed LLM. Now you want to evaluate a new model or prompt on this historical data. Should be straightforward, right?

Not quite. When we ran standard off-policy evaluation methods on 5,000 Arena conversations, the effective sample size dropped to 0.6% — essentially 30 samples. The evaluation was worse than random guessing.

The importance weight explosion

Off-policy evaluation relies on importance weights to correct for distribution shift:

w(x, y) = π_new(y | x) / π_old(y | x)

Here x is the prompt and y is the logged response. For LLMs these weights explode: the weight is a product of per-token probability ratios, so even modest token-level disagreement compounds across a 100-token response into weights of 10^50 or higher. Your evaluation becomes dominated by a handful of lucky samples.
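The mechanics are easy to reproduce with synthetic numbers. In the sketch below (illustrative only, not CJE code), the sequence-level weight is the product of per-token probability ratios, so its log is a sum of 100 terms and the spread compounds:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative per-token log-probability ratios, log pi_new - log pi_old,
# for two policies that disagree modestly on each token (synthetic numbers).
n_samples, n_tokens = 5000, 100
per_token_log_ratio = rng.normal(loc=0.0, scale=1.0, size=(n_samples, n_tokens))

# The sequence-level importance weight multiplies the per-token ratios,
# so its log is the sum across tokens and the spread compounds.
log_w = per_token_log_ratio.sum(axis=1)
print(f"largest weight is about 10^{log_w.max() / np.log(10):.0f}")
print(f"largest-to-smallest spread: {(log_w.max() - log_w.min()) / np.log(10):.0f} orders of magnitude")

Even with zero average disagreement, the largest weights sit tens of orders of magnitude above the typical one, and they decide the estimate.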

Here's what happens in practice:

  • Standard SNIPS: ESS = 0.6%, pairwise accuracy = 38.3%
  • With CJE: ESS = 94.6%, pairwise accuracy = 91.9%

Three problems compound

1. Weight concentration

When comparing different LLM policies (models, prompts, temperatures), the importance weights become extremely concentrated. The top 1% of weights often carry more than 99% of the total mass, so your "5,000-sample" evaluation becomes effectively 50 samples.
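The standard diagnostic is the (Kish) effective sample size, ESS = (Σ w)² / Σ w². A quick sketch with synthetic heavy-tailed weights (not the Arena data) shows how fast it collapses:

import numpy as np

def effective_sample_size(w):
    """Kish ESS: (sum of weights)^2 / (sum of squared weights)."""
    return w.sum() ** 2 / np.square(w).sum()

rng = np.random.default_rng(0)
# Synthetic heavy-tailed weights standing in for LLM importance ratios.
w = rng.lognormal(mean=0.0, sigma=5.0, size=5000)

ess = effective_sample_size(w)
print(f"nominal n = {w.size}, ESS = {ess:.1f} ({100 * ess / w.size:.2f}%)")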

2. Judge miscalibration

LLM judges (GPT-4, Claude, etc.) aren't calibrated to real outcomes. A judge score of 0.8 doesn't mean 80% win probability. Without calibration, you're optimizing for the wrong target.
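You can check this directly if you have a small slice of oracle outcomes (for example, human preference labels) next to the judge scores: bin the scores and compare each bin's average score with its observed win rate. The data and function below are synthetic, purely for illustration:

import numpy as np

def calibration_table(judge_scores, oracle_wins, n_bins=10):
    """Per score bin: (mean judge score, observed win rate, count)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(judge_scores, edges) - 1, 0, n_bins - 1)
    return [
        (judge_scores[idx == b].mean(), oracle_wins[idx == b].mean(), int((idx == b).sum()))
        for b in range(n_bins)
        if (idx == b).any()
    ]

# Synthetic judge that systematically overstates wins: the true win
# probability is 0.6 * score + 0.1, well below the raw score at the top.
rng = np.random.default_rng(0)
scores = rng.uniform(size=2000)
wins = rng.binomial(1, 0.6 * scores + 0.1)
for s, p, n in calibration_table(scores, wins):
    print(f"judge ~{s:.2f} -> observed win rate {p:.2f} (n={n})")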

3. No failure detection

Standard methods silently fail. They'll give you a number even when the evaluation is meaningless. No warnings when ESS drops below usable thresholds.

How CJE fixes it

Causal Judge Evaluation uses Design-by-Projection (DbP) to solve all three problems:

AutoCal-R: Calibrated rewards

Instead of using raw judge scores, CJE projects them onto calibrated rewards using isotonic regression against a small oracle set. This ensures judge scores map to actual win probabilities.
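A minimal sketch of the idea using scikit-learn's isotonic regression. To be clear, this is not the AutoCal-R implementation; the synthetic oracle slice and variable names are assumptions for illustration:

import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Small oracle slice: judge scores paired with ground-truth outcomes
# (e.g. human-labelled wins). Synthetic here.
oracle_scores = rng.uniform(size=500)
oracle_wins = rng.binomial(1, 0.6 * oracle_scores + 0.1)

# Fit a monotone map from raw judge score to calibrated win probability.
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(oracle_scores, oracle_wins)

# Apply the map to the full, unlabelled log of judge scores.
judge_scores = rng.uniform(size=5000)
calibrated_rewards = calibrator.predict(judge_scores)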

SIMCal-W: Stable weights

CJE projects importance weights onto the S-monotone cone — the largest convex set that provably improves ESS. This isn't smoothing or clipping; it's finding the optimal weights that preserve your estimand while maximizing stability.
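For intuition only, here is what a monotone projection of weights can look like. This is not CJE's SIMCal-W; using the judge score as the ordering index and rescaling to preserve the mean weight are our assumptions for the sketch:

import numpy as np
from sklearn.isotonic import IsotonicRegression

def ess(w):
    return w.sum() ** 2 / np.square(w).sum()

def monotone_projected_weights(raw_w, score):
    """L2-project raw weights onto weights monotone in `score`, keeping the mean.

    Projection onto the monotone cone can only shrink the weights' variance,
    so the effective sample size can only go up.
    """
    iso = IsotonicRegression(increasing=True)
    w_proj = iso.fit_transform(score, raw_w)
    return w_proj * raw_w.mean() / w_proj.mean()

rng = np.random.default_rng(0)
score = rng.uniform(size=5000)                                  # ordering index, e.g. judge score
raw_w = np.exp(2.0 * score + rng.normal(scale=3.0, size=5000))  # heavy-tailed raw weights

w_cal = monotone_projected_weights(raw_w, score)
print(f"ESS fraction: {100 * ess(raw_w) / raw_w.size:.1f}% -> {100 * ess(w_cal) / raw_w.size:.1f}%")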

DR-CPO: Honest uncertainty

Doubly robust estimation with honest confidence intervals. When evaluation isn't reliable, CJE returns REFUSE-LEVEL rather than a misleading point estimate.
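Schematically, the doubly robust estimate combines an outcome model with a weight-corrected residual. The sketch below is not CJE's DR-CPO; the argument names and the crude ESS gate standing in for the refusal logic are made up for illustration:

import numpy as np

def dr_value_estimate(w, rewards, g_logged, g_target, ess_floor=0.10):
    """Doubly robust value of the new policy, or None if the weights are unusable.

    w         importance weights for the logged responses
    rewards   calibrated rewards observed on the logged responses
    g_logged  outcome-model prediction at the logged response
    g_target  outcome-model prediction at the new policy's response
    """
    n = len(w)
    if w.sum() ** 2 / np.square(w).sum() < ess_floor * n:
        return None  # refuse rather than report a meaningless number

    # Direct-method term plus importance-weighted correction of its error.
    psi = g_target + w * (rewards - g_logged)
    return psi.mean(), psi.std(ddof=1) / np.sqrt(n)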

The math that matters

The key insight is the Knowledge-Riesz representation theorem. By encoding what we know (monotonicity, boundedness, oracle labels) as convex constraints, we can project our estimator to minimize variance while preserving unbiasedness:

φ* = argmin_{φ ∈ I(P) ∩ C} E[φ²]

where I(P) is the set of influence functions that preserve unbiasedness for the estimand and C is the convex set encoding the knowledge constraints.

This isn't a heuristic — it's the mathematically optimal estimator given our knowledge constraints.

Real Arena results

On 4,989 Arena conversations comparing 5 policies:

Metric               Standard (SNIPS)   CJE (Stacked-DR)   Improvement
ESS                  0.6%               94.6%              158×
RMSE                 25.3%              3.6%               7× lower
Pairwise accuracy    38.3%              91.9%              2.4×
Kendall τ            -0.235             0.837              Correct ranking

Try it yourself

pip install causal-judge-evaluation
cje evaluate --data your_logs.parquet --oracle labels.csv

CJE handles the complexity — fold generation, teacher forcing, calibration, projection — and outputs reliable estimates with honest confidence intervals.

The bottom line

Off-policy evaluation for LLMs isn't just hard; with standard methods, it's essentially broken. The importance weights are too unstable, the judges too miscalibrated, and the failure modes too silent.

CJE makes offline evaluation actually work. Not through approximations or heuristics, but through principled projections that maximize the information in your data.


Want to dive deeper? Check out our technical overview or read the implementation standards that ensure reproducible evaluation.