
Arena Experiment, in 15 Minutes: What Actually Works for LLM Evals

Eddie Landesberg · 15 min read

This is the plain‑English summary of our full technical post benchmarking 14 estimators on 5k Chatbot Arena prompts. If you want the math, proofs, and all ablations, read the long version; this page is for quickly deciding how to evaluate your models in practice.

TL;DR

  • Cheap judges can replace most expensive labels—if you calibrate them against a small "oracle" slice.
  • On shared prompts, generate fresh responses and use Direct + two‑stage calibration. It hits 94% pairwise ranking accuracy (averaged across regimes) and avoids fragile propensity tricks.
  • Off‑policy IPS fails with teacher forcing (ranking ≈ random); Doubly Robust + stabilized weights recovers strong performance but costs more.
  • Always calibrate judge scores (don't trust raw judge numbers). Add response length as a covariate—ranking improves across the board.
  • Show honest error bars. Our uncertainty accounting includes the fact that the calibrator itself is learned (not fixed).

The three measurement tasks

Most AI evaluation focuses narrowly on ranking—identifying which model is best. But production deployment requires answering three related questions. All three estimate the same thing: V(π) = expected oracle outcome under policy π. What differs is how we evaluate success.

What's a "policy"? A policy π is a complete generation configuration: model (e.g., Llama 3.3 70B), system prompt, decoding parameters (temperature, top-p), and any post-processing. The goal is to estimate V(π)—the expected oracle outcome under that policy.

1. Ranking

Which policy is best? Can we reliably order variants by quality? This is the traditional focus of model evaluation and leaderboards.

Evaluation: correct ordering. Metrics: pairwise accuracy, top-1 accuracy, Kendall's τ.

2. Magnitude

What's the actual policy value? If we deploy this variant, what conversion rate / user satisfaction / revenue can we expect?

Evaluation: accurate point estimates of V(π). Metric: RMSE_d (oracle-noise-debiased RMSE).

3. Uncertainty

How confident are we? How many samples do we need? Uncertainty quantification enables efficient resource allocation.

Evaluation: valid statistical inference on V(π). Metric: 95% CI coverage rate (should be ≈95%).
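
For readers who like to see the scoring spelled out, here is a small sketch of how each task is graded against ground truth. The arrays are made up, and the RMSE shown is the plain version (the experiment uses the oracle-noise-debiased variant).

# Hypothetical numbers, just to show how each of the three tasks is scored.
import numpy as np
from itertools import combinations
from scipy.stats import kendalltau

truth = np.array([0.62, 0.60, 0.71, 0.55])    # true V(pi) for four policies
est   = np.array([0.63, 0.58, 0.70, 0.57])    # our estimates
lo, hi = est - 0.03, est + 0.03               # illustrative 95% CIs

# 1. Ranking: fraction of policy pairs ordered correctly, plus Kendall's tau.
pairs = list(combinations(range(len(truth)), 2))
pairwise_acc = np.mean([(est[i] > est[j]) == (truth[i] > truth[j]) for i, j in pairs])
tau, _ = kendalltau(est, truth)

# 2. Magnitude: closeness of point estimates to the truth (plain RMSE here;
#    the experiment debiases for oracle label noise).
rmse = np.sqrt(np.mean((est - truth) ** 2))

# 3. Uncertainty: do the 95% CIs contain the truth ~95% of the time?
coverage = np.mean((lo <= truth) & (truth <= hi))
print(pairwise_acc, tau, rmse, coverage)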

Why we ran this

Evaluating LLMs on outcomes that actually matter—human preference, conversion, revenue—gets expensive fast. Most teams therefore use "LLM‑as‑a‑judge" as a proxy. The practical question is: with a small amount of expensive oracle truth, which estimators let us trust proxy‑based evaluations across all three measurement tasks?

What we did (the short version)

  • Data: 5,000 real prompts from Chatbot Arena (first‑turn English), used for all policies (paired design).
  • Policies: Five Llama configurations: base, clone (A/A), parallel_universe_prompt, premium (bigger model), and an adversarial unhelpful.
  • Labels: Each response gets a cheap judge score (GPT‑4.1‑nano) and an oracle label (GPT‑5). We also study the realistic case where only 5–50% have oracle labels.
  • Estimators: 14 methods across three families—Direct (fresh responses), Importance Sampling (reuse logs), and Doubly Robust (combine both).
  • Costs (Oct 2025 ballpark): Judge calls are ~16× cheaper than oracle calls; calibration lets you use mostly judge scores while staying on the oracle scale.

Setup: Judges, Oracles, and Score Types

Judge scores: CJE works with direct scores (0–100 ratings) or pairwise comparisons (A vs. B wins—convert to continuous scores via Bradley‑Terry). This experiment used direct GPT-4.1-nano ratings, calibrated to GPT-5 oracle scores.

Oracle (Y) vs. Idealized Oracle (Y*): We use GPT-5 as the oracle Y (expensive, high-quality labels)—the highest practical rung we can reach at scale. The true target—Y* (Idealized Deliberation Oracle)—would be what infinite deliberation concludes about quality.

GPT-5 is closer to Y* than GPT-4.1-nano, but it's still a surrogate (model biases, finite reasoning). For production: use the highest practical rung available as Y—human judgments, A/B test outcomes, expert audits, or whatever represents your best signal. This experiment demonstrates the methodology; you'd swap GPT-5 for your actual oracle.

Key idea—CJE in one paragraph: Learn a mapping from cheap judge scores to the oracle scale using a small labeled slice, then apply it to lots of judge scores. Estimate policy values and produce uncertainty that includes both sampling noise and "we learned the calibrator" noise. If that mapping transports across policies/time, you can evaluate many systems cheaply and reliably.

The core assumption: Your judge (surrogate) must be predictive of the oracle outcome, even if it's biased. Example: a judge that prefers verbosity can still work—calibration learns "score 8 with 200 tokens ≈ oracle 0.65" vs "score 8 with 50 tokens ≈ oracle 0.75." But a judge that's uncorrelated with the oracle (random noise) can't be calibrated. Check correlation > 0.3 as a floor; > 0.6 is comfortable.
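
Here is a toy version of that loop, with isotonic regression standing in for CJE's two-stage calibrator. The simulated data and the bare IsotonicRegression fit are for illustration only; the real package also handles covariates (like length), cross-fitting, and calibrator uncertainty.

# Toy version: simulate judge/oracle scores, calibrate on a 25% slice,
# apply everywhere. IsotonicRegression is a stand-in for the two-stage calibrator.
import numpy as np
from scipy.stats import spearmanr
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
judge = rng.uniform(0, 100, size=5000)                               # cheap judge scores
oracle = np.clip(judge / 100 + rng.normal(0, 0.1, size=5000), 0, 1)  # simulated oracle labels

labeled = rng.choice(5000, size=1250, replace=False)                 # 25% oracle slice

# Sanity check: the judge must carry signal about the oracle at all.
rho, _ = spearmanr(judge[labeled], oracle[labeled])
assert rho > 0.3, "judge too weakly correlated with oracle to calibrate"

# Learn the judge -> oracle mapping on the slice, then apply it everywhere.
cal = IsotonicRegression(out_of_bounds="clip").fit(judge[labeled], oracle[labeled])
v_hat = cal.predict(judge).mean()    # calibrated estimate of E[oracle] for this policy
print(round(v_hat, 3))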

The CJE Workflow

  1. 📝 Generate: fresh responses on shared prompts.
  2. ⚖️ Score: cheap judge scores + a small oracle slice (5–25%).
  3. 🎯 Calibrate: learn the judge→oracle mapping, with response length as a covariate.
  4. 📊 Estimate: policy values + honest uncertainty.

The punchline results (no equations)

Method (alias)                                | Pairwise % | Top-1 % | Kendall's τ | Verdict
Direct + two‑stage + cov (aka direct+cov)     | 94.3%      | 89.4%   | 0.887       | ✅ Best default
DR + SIMCal‑W + cov (aka calibrated‑dr‑cpo+cov) | 94.1%    | 87.5%   | 0.881       | ✅ For log reuse
Stacked‑DR + cov (aka stacked‑dr+cov)         | 91.9%      | 84.7%   | 0.842       | Strong alternative
SNIPS (raw IPS)                               | 38.3%      | 8.7%    | -0.235      | ❌ Avoid
Calibrated‑IPS (aka calibrated‑ips)           | 55.6%      | 20.0%   | -0.061      | ⚠️ Still fails

Results averaged across all experimental regimes: 5 oracle coverages (5%, 10%, 25%, 50%, 100%), 5 sample sizes (250, 500, 1K, 2.5K, 5K), and 50 seeds. "Random" baseline ≈ 50% pairwise, 20% top‑1, τ ≈ 0.

Ranking metrics exclude the adversarial unhelpful policy (by design, it breaks transportability).

What CJE returns: estimates with honest uncertainty

[Figure: forest plot of CJE policy estimates with 95% confidence intervals against oracle ground truth.]

Example output: CJE estimates (blue circles with error bars) closely match oracle ground truth (red diamonds). With n=1,000 evaluations and only 25% oracle labels, the 95% confidence intervals successfully cover the true values. This is what you get when you run the Direct method—not just point estimates, but calibrated uncertainties that reflect both sampling noise and calibration uncertainty.

  • 94%: pairwise ranking accuracy with Direct + covariates (averaged across regimes).
  • 100–200×: ESS improvement with SIMCal-W vs. raw SNIPS (but IPS still fails at ranking despite this gain).
  • ~7×: runtime cost of DR vs. Direct (still only ~20s total).

Key takeaways:

  • Direct + two-stage calibration is the best default: Simple, fast, no teacher forcing, 94% pairwise accuracy.
  • Plain IPS fails catastrophically: Worse than random (38% pairwise vs 50% baseline) due to weight degeneracy. With poor overlap, effective n drops from 5000 to ~30. Rankings invert.
  • DR + stabilized weights recovers: Matches Direct performance but costs 7× more runtime.
  • Always calibrate: Raw judge scores give 0% CI coverage (error bars lie).
  • Add response length: Typically adds +5–7 pp top‑1 for Direct and DR methods (smaller gains for stacked estimators).

What to use, when

If you can generate fresh responses

Use: Direct + two‑stage calibration

Simpler, faster, and more reliable than logs‑only methods. Covers most offline benchmarking and model selection tasks.

Why it works: No teacher forcing, no propensity estimation, no overlap requirements.
Typical use cases: Model selection, prompt A/B testing, hyperparameter tuning, benchmarking new releases.
⚠️ If you must reuse logs

Use: Doubly Robust + stabilized weights

Only if there's decent overlap (ESS > 0.1×n) and you can compute propensities reliably via teacher forcing.

Avoid: Plain IPS—fails with poor overlap and noisy propensities (teacher forcing).
When necessary: Post-hoc analysis of deployed policies, auditing past launches when fresh generation is infeasible.

Other scenarios

📝 Multi-turn conversations or agent traces

Approach: Assign judge + oracle scores at the trajectory level (full conversation or task completion). Use Direct method on paired trajectories across policies. Teacher forcing propensities become intractable for multi-turn; stick with fresh generation.

⚡ Evaluating many policies quickly (>10 candidates)

Approach: Generate fresh responses on a shared prompt set (1k–2.5k). Judge all responses, collect oracle labels on base policy only (25% coverage). Calibrate once, apply to all candidates. Check transportability diagnostics; if any policy fails, add 50–100 oracle labels from that policy and recalibrate with pooled data.

🚫 Weak overlap in logs (ESS < 0.05×n)

Reality check: No off-policy estimator will save you. Even DR + stabilized weights will have massive variance. Either (1) generate a small fresh draw (n=500–1k) and use Direct, or (2) accept that you can't reliably evaluate this policy from historical logs—treat it as exploratory only.
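
The ESS thresholds above (0.1×n, 0.05×n) refer to the standard effective-sample-size formula for importance weights, which you can compute yourself before trusting any off-policy number. A minimal sketch (not the cje diagnostic API):

# Effective sample size of importance weights: (sum w)^2 / sum(w^2).
import numpy as np

def effective_sample_size(weights):
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

n = 5000
flat = np.ones(n)                          # good overlap: weights are flat
print(effective_sample_size(flat))         # 5000.0, i.e. ESS ~ n

rng = np.random.default_rng(0)
degenerate = np.exp(rng.normal(0, 4, n))   # poor overlap: heavy-tailed weights
print(effective_sample_size(degenerate))   # collapses to a tiny fraction of n
# This is the "n=5000 but effective n ~ 30" failure mode described above.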

How much data do you need?

[Figure: sample-size and oracle-coverage analysis showing RMSE, standard error, and MDE contours for the Direct+cov estimator.]

Power analysis for Direct+cov: Panel A shows RMSE decreasing with sample size and oracle coverage. Panel B shows standard errors. Panel C shows MDE contour lines—with n=1,000 and 25% oracle coverage, you can detect ~2% effects.
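
If you want a rough version of Panel C for your own numbers, the usual power approximation works: minimum detectable effect ≈ (z for the confidence level + z for the power) × SE of the paired difference. A sketch, with an illustrative (not measured) SE:

# Back-of-envelope MDE from a standard error, using the generic
# (z_{1-alpha/2} + z_power) * SE approximation -- not the exact computation
# behind Panel C.
from scipy.stats import norm

def mde(se_of_difference, alpha=0.05, power=0.80):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)   # ~2.8 for 95% / 80%
    return z * se_of_difference

# Example: if the paired-difference SE were ~0.007 on the oracle scale
# (illustrative), the detectable effect would be about 2 points.
print(round(mde(0.007), 3))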

For ranking policies

~1K prompts with 10–25% oracle labels: usually sufficient for solid policy ordering.

For detecting small effects (<1%)

2.5K+ prompts with ≥25% oracle labels: needed for tight confidence intervals.

Oracle coverage regime guide

  • Start with 25% unless cost is prohibitive. This gives clean diagnostics and comfortable calibration uncertainty. You can dial down later.
  • 10–15% is the practical sweet spot for production: calibration uncertainty is still reasonable, and you save substantial oracle cost. Check that CI widths meet your decision thresholds.
  • 5–10% is expert territory: calibration uncertainty dominates. Only go here if (1) oracle cost is extreme, (2) judge-oracle correlation is very strong (>0.7), and (3) you can tolerate wider CIs. Run diagnostics carefully.
  • Below 5%: usually not recommended. Calibration becomes unstable; transportability checks have low power; small calibration errors get amplified. If oracle budget is this tight, consider whether evaluation is feasible at all.
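
A quick way to see what coverage buys you: with judge calls roughly 16× cheaper than oracle calls (the ballpark from earlier), labeling cost scales roughly as below. Unit costs are illustrative, not real prices.

# Relative labeling cost as a function of oracle coverage, assuming a ~16x
# oracle-vs-judge price ratio. Illustrative only.
def relative_cost(n_prompts, oracle_coverage, judge_cost=1.0, oracle_cost=16.0):
    return n_prompts * (judge_cost + oracle_coverage * oracle_cost)

full   = relative_cost(1000, 1.00)   # oracle-label everything
lean   = relative_cost(1000, 0.25)   # recommended starting point
budget = relative_cost(1000, 0.10)   # production sweet spot
print(full, lean, budget)            # 17000.0 5000.0 2600.0
print(full / lean, full / budget)    # ~3.4x and ~6.5x cheaper than full labeling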

⚠️ Why honest uncertainty (OUA) matters

Standard methods treat the calibrator as fixed—as if the judge→oracle mapping were known with certainty. That's wrong: you learned it from a finite sample, so there's uncertainty in the mapping itself.

Concrete impact: Without OUA, nominal 95% confidence intervals only achieve ~65% actual coverage in our experiment (they're too narrow). OUA widens the intervals to be honest. If oracle coverage is only 5–10%, calibration uncertainty can dominate—you'll see wider CIs, but they'll actually cover 95% of the time. Better wide and truthful than tight and wrong.
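
One simple way to picture OUA: resample the oracle slice, refit the calibrator each time, and add the resulting spread to the usual sampling variance. The sketch below does exactly that with a bootstrap; it is a stand-in for CJE's actual accounting, not its API.

# Stand-in for OUA: bootstrap the oracle slice, refit the calibrator,
# and add the spread of the refit estimates to the sampling variance.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def oua_standard_error(judge_all, judge_lab, oracle_lab, n_boot=200, seed=0):
    judge_all, judge_lab, oracle_lab = map(np.asarray, (judge_all, judge_lab, oracle_lab))
    rng = np.random.default_rng(seed)

    cal = IsotonicRegression(out_of_bounds="clip").fit(judge_lab, oracle_lab)
    preds = cal.predict(judge_all)
    sampling_var = preds.var(ddof=1) / len(judge_all)       # usual sampling noise

    boot_means = []
    for _ in range(n_boot):                                  # "we learned the calibrator" noise
        idx = rng.integers(0, len(judge_lab), size=len(judge_lab))
        b = IsotonicRegression(out_of_bounds="clip").fit(judge_lab[idx], oracle_lab[idx])
        boot_means.append(b.predict(judge_all).mean())
    calibrator_var = np.var(boot_means, ddof=1)

    return float(np.sqrt(sampling_var + calibrator_var))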

When this breaks: Concrete failure modes

Transportability Failure: The "Unhelpful" Policy

What happens: You calibrate the judge→oracle mapping using typical policies (base, premium, etc.). Then you evaluate an adversarial "unhelpful" policy that produces terrible responses. The transportability test fails: mean residual = -0.312, p < 0.001.

Why: The calibrator learned the S→Y relationship in the "good response" range. The adversarial policy produces responses so bad they're outside the calibration range—extrapolation failure. Rankings still work (it's correctly identified as worst), but point estimates are systematically biased.

Fix: Add 50–100 oracle labels from the adversarial policy, recalibrate with pooled data. The transportability diagnostic (residual plot by policy) catches this automatically.

[Figure: transportability test residuals by policy. Clone, Parallel Universe, and Premium pass (residuals centered at zero); Unhelpful fails with mean residual -0.312.]

From the experiment: Testing H₀: E[Y - f(S, X)] = 0 for each policy (Bonferroni-corrected α=0.0125 per policy). Clone, Parallel Universe, and Premium pass—calibration transports cleanly. Unhelpful fails—the mapping learned from base systematically underestimates oracle scores for adversarial responses. This is why we exclude it from ranking metrics (by design, it breaks transportability).
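
If you want to run the same check by hand, the test is just a per-policy t-test on calibration residuals with a Bonferroni correction. A minimal sketch (a stand-in for the built-in diagnostic, not its API):

# Per-policy residual test: H0: E[Y - f(S, X)] = 0 on each policy's
# oracle-labeled rows, Bonferroni-corrected across policies.
import numpy as np
from scipy.stats import ttest_1samp

def transportability_check(residuals_by_policy, alpha=0.05):
    corrected = alpha / len(residuals_by_policy)        # e.g., 0.05 / 4 = 0.0125
    report = {}
    for policy, resid in residuals_by_policy.items():   # resid = Y - f(S, X)
        resid = np.asarray(resid, dtype=float)
        t = ttest_1samp(resid, popmean=0.0)
        report[policy] = {
            "mean_residual": float(resid.mean()),
            "p_value": float(t.pvalue),
            "passes": bool(t.pvalue >= corrected),       # fail => mapping doesn't transport
        }
    return report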

Temporal Drift

What happens: You calibrate in January. By April, user preferences have shifted (more emphasis on conciseness). Your judge→oracle mapping is stale; estimates drift off.

Why: The oracle (user preferences) changed, but your calibration assumed they were fixed.

Fix: Run transportability diagnostics monthly on recent data. If residuals trend non-zero, refresh calibration with 200–500 new labels.

IPS Weight Degeneracy

What happens: You use raw SNIPS on logged data. Rankings are worse than random (38% pairwise accuracy; random = 50%). One policy gets all the weight; the rest are ignored.

Why: Poor overlap + teacher forcing noise → a few samples dominate the effective sample size. With n=5000, ESS drops to ~30.

Fix: Use DR + stabilized weights (SIMCal-W) or switch to Direct method if you can generate fresh responses. Never use raw IPS for ranking.

Practical, step‑by‑step (do this)

  1. Pick a shared prompt set (1–5k typical). Generate responses from each policy you want to compare.
  2. Score everything with a cheap judge (fast LLM or heuristic).
  3. Collect oracle labels for a slice (start with 25%; go lower later if diagnostics look clean).
  4. Run Direct + two‑stage calibration (judge→oracle with response length as a covariate).
  5. Report point estimates + 95% CIs that include calibration uncertainty. Add pairwise accuracy / Kendall's τ for ranking.
  6. Check transportability diagnostics (residuals by policy). If they fail, recalibrate or expand covariates.
# Install
pip install cje-eval

# Minimal use (Direct; auto two‑stage with response_length)
from cje import analyze_dataset

results = analyze_dataset(
    fresh_draws_dir="examples/arena_sample/fresh_draws",
    include_response_length=True,
    verbose=True,
)
print("estimates:", results.estimates)
print("95% CIs:", results.confidence_interval(0.05))

Where Doubly Robust fits

DR is your "belt‑and‑suspenders" option for log reuse: even if propensities are imperfect, a good outcome model (the "critic") corrects most of the bias, and stabilized weights clean up the rest. Use it when:

  • You really need to reuse a large log from a single policy ("logger").
  • You can compute propensities with teacher forcing (and you trust the implementation).
  • You can afford some fresh draws to help the critic generalize.

If overlap is weak (weights explode; effective sample size tiny), no estimator will save you—collect fresh responses and use Direct.
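
For intuition, the generic doubly robust estimator combines the critic's value on fresh target draws with an importance-weighted residual correction on the logs; if either piece is accurate, the estimate is. A bare-bones sketch of that form (not CJE's calibrated/stacked implementation):

# Generic doubly robust form: critic value on fresh target-policy draws, plus
# a self-normalized importance-weighted correction on the logged data.
import numpy as np

def dr_estimate(critic_fresh, critic_logged, y_logged, logp_target, logp_logger):
    critic_fresh, critic_logged, y_logged = map(np.asarray, (critic_fresh, critic_logged, y_logged))
    w = np.exp(np.asarray(logp_target) - np.asarray(logp_logger))   # propensity ratios
    w = w / w.mean()                                                # self-normalize the weights
    direct_term = critic_fresh.mean()                  # critic scored on fresh target draws
    correction = np.mean(w * (y_logged - critic_logged))  # bias correction from the logs
    return float(direct_term + correction)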

Humans vs models as the oracle

We used GPT‑5 as the oracle to stress‑test methods cheaply. The workflow is the same with human labels: calibrate judge→human, include rater noise in the error bars, and periodically re‑check transportability. If your real goal is human impact, calibrate against a small human slice and keep model‑judged scores for scale.

Recommended defaults

🎯 Estimator: Direct + two‑stage calibration (length‑aware); DR + stabilized weights only if you must reuse logs.

📊 Data: start with 1–2.5k prompts and 25% oracle coverage; lower the oracle % later if diagnostics stay clean.

📈 Reporting: point estimates + 95% CIs with calibration uncertainty, plus ranking metrics and transportability checks.

Limitations to keep in mind

  • GPT-5 as oracle ≠ human preferences. We used GPT-5 to stress-test methods cheaply and at scale. The workflow is identical with human labels, but if your goal is predicting human impact, calibrate against a human slice (not just model judgments). Model-as-judge and model-as-oracle can share systematic biases (verbosity, style, safety theater) that humans don't.
  • Domain specificity. Sample size and oracle coverage recommendations are specific to this prompt distribution, judge-oracle correlation (~0.73), and effect size regime. Your domain may differ—run your first analysis with more oracle coverage than you think you need, then use diagnostics to find your optimal design.
  • Teacher forcing brittleness. Off-policy propensities (for IPS/DR) depend on accurate teacher forcing implementation. Small bugs (wrong tokenization, incorrect logprob extraction) can silently destroy overlap. Always validate ESS > 0.1×n before trusting off-policy estimates.
  • Transportability is an assumption, not a guarantee. The judge→oracle mapping may shift across policies or time. Run residual diagnostics on every new policy; refresh calibration monthly for deployed systems. When diagnostics fail, recalibrate—don't blindly trust old mappings.

Fast checklist

  • Shared prompts?
  • Fresh responses per policy?
  • Judge scores everywhere?
  • Oracle slice 10–25% (start at 25%)?
  • Two‑stage calibration with response length?
  • Transportability residuals flat by policy? If not, recalibrate or add covariates.
  • CIs reported with calibration uncertainty (OUA)?

Want more?

Conceptual foundations

This experiment instantiates the surrogacy framework—treating GPT-4.1-nano as a surrogate S, GPT-5 as the operational oracle Y, and using AutoCal-R to learn the calibration mapping.

For the theory of why this works and how to extend it to human outcomes, start with AI Quality as Surrogacy.

Full technical details

The complete Arena Experiment post includes all ablations, estimator definitions, uncertainty math, diagnostics, and reproducible code.


Software: pip install cje-eval · CLI: python -m cje.interface.cli --help · Examples: GitHub examples