CIMO Labs

From A/B to Offline: Causal Evaluation for LLMs

Offline evaluation on LLM logs fails spectacularly. On 5,000 Arena conversations, standard methods achieve 0.6% effective sample size—essentially 30 usable samples. Here's how to apply causal inference to make offline evaluation actually work.

The core problem: distribution shift between policies. When you evaluate a new model or prompt on historical data generated by a different policy, naive averaging gives you correlation, not the counterfactual effect you need for decision-making.

Arena data empirical results:

  • Standard offline evaluation: 38.3% accuracy (worse than random)
  • Effective sample size: 0.6% (30 samples out of 5,000)
  • Confidence intervals: ±180% (completely useless)

The A/B Test Gold Standard

Let's start with what actually works: online A/B tests. When you randomly assign users to control or treatment, you break the correlation between assignment and outcomes. A simple difference in means gives you the causal effect.

Why A/B tests work:

  1. Randomization → assignment is independent of user features
  2. Same population → control and treatment see identical traffic
  3. Clean statistics → difference-of-means = causal effect with valid CIs
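
To make that concrete, here is a minimal sketch (assuming NumPy; the data and effect size are made up, not from any experiment) of the difference-in-means estimate with a normal-approximation 95% confidence interval:

import numpy as np

def ab_effect(control, treatment, z=1.96):
    control, treatment = np.asarray(control, float), np.asarray(treatment, float)
    diff = treatment.mean() - control.mean()
    # Standard error of a difference of two independent means.
    se = np.sqrt(control.var(ddof=1) / len(control) + treatment.var(ddof=1) / len(treatment))
    return diff, (diff - z * se, diff + z * se)

# Example: binary win/loss outcomes from a randomized experiment.
rng = np.random.default_rng(0)
control = rng.binomial(1, 0.52, size=2000)
treatment = rng.binomial(1, 0.55, size=2000)
print(ab_effect(control, treatment))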

But A/B tests are expensive. You need to risk real traffic. You wait weeks for results. Can't we just test on historical data?

Why Offline Evaluation Fails

When you evaluate a new policy (model, prompt, temperature) on logs from an old policy, you face a fundamental problem: the data was generated by different behavior.

Technical insight: The importance weight explosion

Off-policy evaluation uses importance sampling to reweight historical data:

weight = P(response | prompt, new policy) / P(response | prompt, old policy)

For LLMs, this ratio explodes: small per-token differences compound across a 100-token response and can produce weights on the order of 10^50. Your "5,000 sample evaluation" collapses to a few dozen effective samples, with nearly all the weight concentrated on a handful of lucky responses.
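
Here is a rough numerical sketch of that collapse (illustrative numbers, not the Arena data), using the standard effective-sample-size formula ESS = (Σw)² / Σw²:

import numpy as np

def effective_sample_size(weights):
    # ESS = (sum w)^2 / sum w^2; equals n when all weights are equal.
    w = np.asarray(weights, float)
    return float(w.sum() ** 2 / (w ** 2).sum())

rng = np.random.default_rng(0)

# The sequence-level log weight is the sum of per-token log-prob gaps, so
# modest per-token differences compound: a mean gap of +1.2 nats over
# 100 tokens gives a weight of exp(120), roughly 1e52.
per_token_gaps = rng.normal(1.2, 0.5, size=100)
print(np.exp(per_token_gaps.sum()))

# With heavy-tailed weights, ESS collapses far below the nominal sample size.
weights = np.exp(rng.normal(0.0, 3.0, size=5000))
print(f"{effective_sample_size(weights):.0f} effective samples out of 5000")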

Imagine evaluating a creative writing prompt on logs from a factual Q&A prompt. The responses are fundamentally different—any evaluation is meaningless. But even subtle changes (temperature 0.7 → 0.8) cause enough distribution shift to break standard methods.

The CJE Solution: A/B Testing for Historical Data

Causal Judge Evaluation (CJE) makes offline evaluation work by solving three critical problems:

1. Calibrated Judges

LLM judges (GPT-4, Claude) give scores, but a 0.8 doesn't mean 80% win rate. AutoCal-R maps judge scores to actual KPIs using a small labeled sample.
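
Here is a minimal sketch of the calibration idea: fit a monotone map from raw judge scores to observed outcomes on a small labeled slice, then apply it everywhere. It uses scikit-learn's isotonic regression on synthetic data to illustrate the concept only; it is not the AutoCal-R implementation.

import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# ~200 labeled examples: raw judge scores plus a ground-truth KPI (e.g. win=1).
judge_scores = rng.uniform(0, 1, size=200)
labels = rng.binomial(1, np.clip(judge_scores * 0.7 + 0.1, 0, 1))  # synthetic

calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(judge_scores, labels)

# A raw judge score of 0.8 maps to the KPI scale learned from the labels.
print(calibrator.predict([0.8]))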

2. Stable Weights

Raw importance weights explode to infinity. SIMCal-W finds optimal stable weights that preserve your estimates while maximizing effective sample size.
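
A simplified sketch of the stabilization idea: project the raw weights onto a monotone function of the calibrated judge score, then renormalize to unit mean. This illustrates the mechanics only; it is not the actual SIMCal-W algorithm, and the data is synthetic.

import numpy as np
from sklearn.isotonic import IsotonicRegression

def ess(w):
    w = np.asarray(w, float)
    return float(w.sum() ** 2 / (w ** 2).sum())

rng = np.random.default_rng(0)
scores = rng.uniform(0, 1, size=5000)                  # calibrated judge scores
raw_weights = np.exp(rng.normal(0.0, 3.0, size=5000))  # heavy-tailed raw weights

iso = IsotonicRegression()                              # monotone fit in the score
stabilized = iso.fit(scores, raw_weights).predict(scores)
stabilized = stabilized / stabilized.mean()             # enforce unit mean

print(ess(raw_weights), ess(stabilized))                # ESS rises sharply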

3. Honest Uncertainty

Standard methods give you a number even when it's meaningless. CJE returns a REFUSE decision when evaluation isn't reliable, with guidance on how to fix it.

The CJE Pipeline in 5 Steps

  1. Validate propensities

    Use teacher forcing to get exact log-probabilities for historical responses under both old and new policies. Run determinism checks; bad propensities mean no causal claims. (A rough sketch of the log-probability computation follows this list.)

  2. Calibrate the judge

    Map judge scores to your KPI scale (win rate, revenue, user satisfaction) using ~200 labeled examples. Check coverage—if the new policy generates responses outside your calibration range, you'll get warnings.

  3. Stabilize weights

    Transform raw importance weights into stable, unit-mean weights that preserve monotonicity. This deterministically improves ESS while maintaining unbiasedness.

  4. Estimate with doubly robust methods

    Use Calibrated-IPS for speed or Stacked-DR for accuracy. Add oracle uncertainty to confidence intervals. Cross-fit to ensure √n convergence. (The doubly robust estimate is sketched after this list.)

  5. Apply gates and decide

    Check OVERLAP (ESS ≥ 30%), JUDGE (calibration quality), IDENTIFICATION (coverage). Get a clear SHIP or REFUSE decision with honest confidence intervals. (The gate logic is sketched below.)
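
Step 1 in code, roughly: a teacher-forcing sketch using Hugging Face transformers, not CJE's own propensity utilities. It glosses over tokenization edge cases (it assumes the prompt tokenizes identically inside the concatenation).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def response_logprob(model, tokenizer, prompt, response):
    # Teacher forcing: score the logged response token by token under this model.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # The logit at position t predicts the token at position t+1.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_logps = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Keep only the response tokens.
    return token_logps[:, prompt_ids.shape[1] - 1:].sum().item()

# model = AutoModelForCausalLM.from_pretrained("your-policy-model")
# tokenizer = AutoTokenizer.from_pretrained("your-policy-model")
# logp_new = response_logprob(model, tokenizer, prompt, logged_response)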
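
Step 4's doubly robust estimate, in its textbook form (simplified notation, not the library's estimator classes):

import numpy as np

def dr_value(g_hat_new, q_hat_logged, weights, rewards):
    # Doubly robust value of the new policy:
    #   V_DR = mean_i [ g_hat(x_i) + w_i * (r_i - q_hat(x_i, a_i)) ]
    # g_hat(x_i): outcome model's predicted reward for responses the NEW policy
    #             would produce on prompt x_i
    # q_hat(x_i, a_i): its predicted reward for the logged response
    # w_i: stabilized importance weight, r_i: calibrated reward
    correction = np.asarray(weights, float) * (
        np.asarray(rewards, float) - np.asarray(q_hat_logged, float)
    )
    return float(np.mean(np.asarray(g_hat_new, float) + correction))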
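
And step 5's gate logic, sketched (thresholds and argument names here are illustrative, not the library's diagnostics API):

def decide(ess_fraction, judge_calibration_ok, coverage_ok, ess_threshold=0.30):
    reasons = []
    if ess_fraction < ess_threshold:
        reasons.append(f"OVERLAP: ESS {ess_fraction:.1%} < {ess_threshold:.0%}")
    if not judge_calibration_ok:
        reasons.append("JUDGE: calibration quality check failed")
    if not coverage_ok:
        reasons.append("IDENTIFICATION: new policy outside calibration coverage")
    return ("SHIP", []) if not reasons else ("REFUSE", reasons)

print(decide(0.946, True, True))   # -> ('SHIP', [])
print(decide(0.006, True, True))   # -> ('REFUSE', [...])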

Real Results on Arena Data

We tested CJE on 4,989 real conversations from Chatbot Arena, comparing 5 different policies:

Method            | Effective Sample Size | Pairwise Accuracy | Usable?
Standard (SNIPS)  | 0.6%                  | 38.3%             | ❌
Calibrated-IPS    | 88.2%                 | 87.3%             | ⚠️
CJE (Stacked-DR)  | 94.6%                 | 91.9%             | ✅

The improvement isn't marginal—it's the difference between random noise and actionable insights. CJE turns 30 effective samples into 4,700+.

What You Need to Run CJE

Data requirements:

  • Logs: Prompts, responses, and base policy log-probs
  • Judge scores: Any LLM judge (GPT-4, Claude, etc.)
  • Target log-probs: Teacher-forced probabilities under new policy
  • Labels: ~200 human labels (5-10% of data)
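
A single log record might look like this (the field names are hypothetical; check the CJE docs for the exact schema the validator expects):

import json

record = {
    "prompt": "Explain what an importance weight is.",
    "response": "An importance weight reweights logged data so that ...",
    "base_policy_logprob": -142.7,                      # log-prob under the logging policy
    "target_policy_logprobs": {"new_policy": -150.3},   # teacher-forced, per target policy
    "judge_score": 0.81,                                 # raw LLM-judge score
    "oracle_label": 1,                                   # present on the ~200 labeled rows only
}

with open("logs.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")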

Quick start:

# Validate your data
python -m cje validate logs.jsonl

# Run evaluation
python -m cje analyze logs.jsonl --estimator stacked-dr -o results.json

The Bottom Line

Stop lying to yourself with broken offline evaluations. Standard methods on LLM logs are worse than useless—they're actively misleading. CJE gives you the A/B test guarantees you need, using the historical data you already have.

Want to see it in action? Check out the complete CJE product tour with interactive estimator comparisons and visual diagnostics.


Ready to fix your evaluations?

CJE is open source and ready to use. Check out the Quick Start guide or dive into the technical details.