CIMO Labs

Empirical Results

Performance benchmarks on 4,989 Arena conversations comparing 5 policies across 11 methods.

Performance Summary

  • 158× ESS improvement (0.6% → 94.6%)
  • 7.1× lower RMSE vs SNIPS
  • 91.9% pairwise accuracy (vs 38.3% baseline)
  • 0.837 Kendall τ (vs −0.235 baseline)

Method Comparison

Figure 1: Estimator ranking stability across evaluation scenarios (RMSE heatmap by policy and oracle size). Lower RMSE (darker blue) indicates better performance. Stacked-DR consistently outperforms baseline methods across all policy comparisons.

Figure 2: Performance improvement with increasing oracle coverage. Stacked-DR (with OC) maintains stable performance even with minimal oracle labels, while standard methods require substantially more labeled data.

Detailed Metrics

| Method | ESS (%) | RMSE | Bias | Coverage | Pairwise Acc | Kendall τ |
|---|---|---|---|---|---|---|
| Stacked-DR (OC) | 94.6 | 0.036 | 0.002 | 95.5 | 91.9 | 0.837 |
| Calibrated IPS | 88.2 | 0.041 | 0.005 | 94.1 | 87.3 | 0.754 |
| DR-CPO | 91.8 | 0.038 | 0.003 | 95.0 | 89.5 | 0.796 |
| SIMCal-W | 82.5 | 0.047 | 0.008 | 93.2 | 83.1 | 0.692 |
| SNIPS (baseline) | 0.6 | 0.253 | 0.087 | 41.2 | 38.3 | −0.235 |

Table 1: Complete benchmark results on Arena data. ESS: Effective Sample Size (%). RMSE: Root Mean Squared Error. Coverage: 95% CI coverage (%). Pairwise Acc: Pairwise ranking accuracy (%).
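The two ranking metrics in Table 1 can be computed from estimated versus oracle policy values. A minimal sketch (the policy values below are illustrative placeholders, not the benchmark estimates):

```python
from itertools import combinations
from scipy.stats import kendalltau

# Hypothetical estimated and oracle values for the 5 policies.
est    = {"base": 0.61, "prompt": 0.64, "premium": 0.72, "temp": 0.59, "clone": 0.61}
oracle = {"base": 0.60, "prompt": 0.65, "premium": 0.74, "temp": 0.58, "clone": 0.60}

policies = list(est)

# Rank correlation between estimated and oracle orderings (tau-b handles ties).
tau, _ = kendalltau([est[p] for p in policies], [oracle[p] for p in policies])

# Pairwise accuracy: fraction of the 10 policy pairs ordered consistently.
pairs = list(combinations(policies, 2))
agree = sum((est[a] > est[b]) == (oracle[a] > oracle[b]) for a, b in pairs)
pairwise_acc = 100.0 * agree / len(pairs)
```

With 5 policies there are exactly 10 unordered pairs, matching the "all 10 policy pairs" protocol below.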

Methodology

Dataset

4,989 conversations from Chatbot Arena with 5 distinct policies: base model, prompt variant, premium model, temperature adjustment, and clone (A/A test).

Oracle Labels

Human preference labels on 200 randomly sampled conversations (4% coverage) used for judge calibration via isotonic regression.
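A calibration of this form can be sketched with scikit-learn's `IsotonicRegression`, which fits a monotone map from raw judge scores to the oracle-label scale. The data here is synthetic; the actual judge scores and oracle labels come from the Arena sample:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the 200-conversation oracle slice.
judge = rng.uniform(0.0, 1.0, 200)                                  # raw judge scores
labels = np.clip(judge ** 2 + rng.normal(0.0, 0.05, 200), 0.0, 1.0)  # oracle labels

# Monotone (isotonic) map: judge score -> oracle scale.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(judge, labels)

calibrated = iso.predict(judge)  # calibrated scores, bounded in [0, 1]
```

Isotonic regression imposes only monotonicity, so it corrects systematic judge miscalibration without assuming a parametric link.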

Evaluation Protocol

5-fold cross-validation with deterministic fold assignment based on conversation ID hash. Teacher forcing additivity verified to ±1e-7 tolerance. All methods evaluated on identical folds.
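Deterministic fold assignment from an ID hash can be sketched as below; `fold_for` is a hypothetical helper, and a stable digest (rather than Python's per-process salted `hash()`) is what makes the assignment reproducible across runs:

```python
import hashlib

def fold_for(conversation_id: str, n_folds: int = 5) -> int:
    """Map a conversation ID to a fold in [0, n_folds) deterministically."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds
```

Because the fold depends only on the ID, every method sees identical train/test splits, as the protocol requires.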

Metrics

  • ESS computed via squared normalized weights
  • RMSE against held-out oracle labels
  • Coverage measured on 1000 bootstrap samples
  • Pairwise accuracy on all 10 policy pairs
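The ESS computation above follows the standard squared-normalized-weights formula, ESS = (Σw)² / Σw², reported as a percentage of n. A minimal sketch:

```python
import numpy as np

def effective_sample_size(weights) -> float:
    """ESS as a percentage of n, from squared normalized importance weights.

    ESS = (sum w)^2 / sum(w^2); uniform weights give 100%, while a single
    dominant weight drives ESS toward 1/n.
    """
    w = np.asarray(weights, dtype=float)
    return 100.0 * (w.sum() ** 2) / ((w ** 2).sum() * len(w))
```

Under this formula the SNIPS baseline's 0.6% ESS means its importance weights are so concentrated that the ~5,000 conversations behave like roughly 30 effective samples.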


Citation

@article{cje2025,
  title={Causal Judge Evaluation: Design-by-Projection for Off-Policy LLM Evaluation},
  author={CIMO Labs},
  year={2025},
  url={https://github.com/cimo-labs/cje}
}