CIMO Labs

Stop shipping based on eval scores that don't predict production

Your eval scores aren't calibrated to your actual KPIs. CJE fixes this—giving you audit-ready estimates with confidence intervals you can trust for deployment decisions.

Real Failure Mode

8.2/10 vs 7.8/10
→ Team ships it
Two weeks later: conversion drops 3%

The judge preferred verbose, polite responses. Users wanted speed. Extra tokens = higher latency = bounces. The judge couldn't see what mattered.

What most teams do (and why it fails)

The standard heuristic:

  1. Run two models/prompts on your eval set
  2. Score outputs with LLM-as-judge (0-10 scale)
  3. Compare average scores (8.2 vs 7.8)
  4. Ship the higher-scoring one

Simple. Fast. Feels data-driven. Completely unreliable.

Wrong scale

A score of 8/10 might mean 40% conversion, or 8%, or 73%. You have no idea. Judge scores aren't on your KPI scale.

No uncertainty

Is +0.4 points real or noise? Without confidence intervals, you're guessing. No statistical rigor = no launch confidence.

Not causal

Comparing different prompt sets or logged data? Your "8.2" came from easy queries, "7.8" from hard ones. Selection bias ruins everything.

Large-scale studies confirm that LLM judges show high variance across tasks, can differ from human scores by 5+ points even with high agreement, and perform worse on model-generated text (Bavaresco et al., 2024; Thakur et al., 2025).

What CJE gives you instead

Turn unreliable judge scores into statistically valid deployment decisions

Calibration

Judge scores → KPI units

Label 200-1000 examples with ground-truth outcomes. Learn the mapping: "Score 8 = 23% conversion." Now estimates are interpretable.

AutoCal-R: mean-preserving, monotone
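To make the idea concrete, here's a minimal sketch of a monotone judge-score-to-KPI calibration using isotonic regression on synthetic data. It illustrates the shape of the mapping only; AutoCal-R's actual procedure (including how it preserves the mean) may differ, and every name and number below is made up.

# Sketch only: monotone judge-score -> KPI calibration via isotonic regression.
# Not the AutoCal-R implementation; synthetic data throughout.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
judge_scores = rng.uniform(0, 10, size=500)        # 0-10 judge scores on the labeled set
converted = rng.binomial(1, 0.03 * judge_scores)   # ground-truth outcomes (synthetic)

calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(judge_scores, converted)

print(calibrator.predict([8.0]))   # e.g. "a score of 8 maps to ~0.23 conversion probability"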

Honest Uncertainty

95% CIs you can trust

Accounts for both sampling noise AND calibration uncertainty. Know when a difference is real enough to ship.

OUA: oracle-uncertainty aware intervals
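Roughly, the principle looks like this: add the variance that comes from having learned the calibration map on a finite labeled set to the usual sampling variance before forming the interval. The sketch below is illustrative with synthetic numbers, not CJE's exact construction.

# Sketch only: widen the CI to account for calibration (oracle) uncertainty.
# The "refit" spread below stands in for however CJE actually quantifies it.
import numpy as np

rng = np.random.default_rng(1)
calibrated_rewards = rng.beta(2, 6, size=2000)       # per-example KPI-scale estimates
refit_estimates = rng.normal(0.25, 0.004, size=50)   # point estimates under resampled calibrators

point = calibrated_rewards.mean()
var_sampling = calibrated_rewards.var(ddof=1) / calibrated_rewards.size
var_oracle = refit_estimates.var(ddof=1)             # extra variance from the learned calibrator

half = 1.96 * np.sqrt(var_sampling + var_oracle)
print(f"{point:.3f}  [{point - half:.3f}, {point + half:.3f}]  (95% CI, oracle-uncertainty aware)")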

Off-Policy Correction

Reuse logged data safely

Comparing different models/prompts? Stabilized importance weights correct for distribution shift without letting variance explode.

SIMCal: score-indexed, mean-one stabilization
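For context, this is what plain self-normalized importance weighting and the effective sample size (ESS) diagnostic quoted in the results below look like. SIMCal's score-indexed, mean-one stabilization is an additional step on top that this synthetic sketch does not implement.

# Sketch only: vanilla self-normalized importance weighting + ESS diagnostic.
# SIMCal adds a score-indexed, mean-one stabilization step not shown here.
import numpy as np

rng = np.random.default_rng(2)
logp_logged = rng.normal(-20.0, 3.0, size=5000)              # log P(response | prompt) under the logging policy
logp_target = logp_logged + rng.normal(0.0, 1.0, size=5000)  # same responses scored under the candidate policy
rewards = rng.beta(2, 6, size=5000)                          # calibrated, KPI-scale rewards

weights = np.exp(logp_target - logp_logged)   # likelihood ratios
weights /= weights.mean()                     # normalize to mean one (SNIPS-style)

v_hat = np.mean(weights * rewards)                     # off-policy value estimate
ess = weights.sum() ** 2 / np.square(weights).sum()    # effective sample size
print(f"V_hat = {v_hat:.3f}, ESS = {100 * ess / weights.size:.1f}% of n")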

Example output:

Baseline: 0.23 [0.21, 0.25] purchase probability (95% CI)

New prompt: 0.26 [0.24, 0.28] purchase probability (95% CI)

Difference: +3pp [+0.5pp, +5.5pp]

Decision: Ship (interval excludes zero, meaningful lift)

Tested on 4,989 real Arena evaluations

Standard off-policy methods collapse. CJE recovers 158× more signal.

  • Standard IPS (SNIPS): 0.6% ESS (basically noise)
  • CJE (calibrated IPS): 94.6% ESS (actual inference), a 158× improvement
  • Pairwise accuracy: 91.9% vs 38.3% baseline
  • RMSE: 7.1× lower than SNIPS
  • Kendall τ: 0.837 vs -0.235 baseline
  • CI coverage: 95.5% (honest uncertainty)

ESS recovery by policy comparison

Policy              Before CJE   After CJE   Improvement
Prompt Variant      0.6%         94.6%       158×
Premium Model       0.7%         80.8%       115×
Clone (A/A test)    26.2%        98.8%       3.8×

Three modes for different situations

Pick the right method for your data and constraints

Direct Method (DM)

Generate fresh outputs from each policy on the same prompts. Simplest, most reliable.

When: You can generate for all candidates

Output: V̂(π) ± OUA confidence intervals

IPS (Off-Policy)

Reweight logged data using likelihood ratios, stabilized with SIMCal. Reuse historical data.

When: You have logs with teacher-forced log-probabilities

Output: IPS estimate ± OUA CI, plus ESS and tail diagnostics

DR (Doubly Robust)

Combine IPS with outcome models. Tighter CIs when overlap is imperfect.

When: Weak overlap but you can train a critic

Output: DR estimate ± OUA CI, plus orthogonality check
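For reference, here's a minimal sketch of the textbook doubly-robust (augmented IPS) estimator this mode builds on; the weight stabilization, OUA intervals, and orthogonality diagnostics CJE adds are not shown. The arrays and the dr_estimate helper are illustrative, not part of the CJE API.

# Sketch only: the standard doubly-robust estimator (outcome model + weighted residuals).
import numpy as np

def dr_estimate(weights, rewards, q_logged, q_target):
    # q_target: critic's predicted reward for what the target policy would produce
    # q_logged: critic's predicted reward for the logged response
    return np.mean(q_target + weights * (rewards - q_logged))

# Toy usage with made-up arrays:
rng = np.random.default_rng(3)
n = 1000
w = rng.lognormal(0.0, 0.5, n); w /= w.mean()       # mean-one importance weights
r = rng.beta(2, 6, n)                               # calibrated rewards
q_log = r + rng.normal(0, 0.05, n)                  # critic predictions (logged)
q_tgt = q_log + 0.02                                # critic predictions (target policy)
print(round(dr_estimate(w, r, q_log, q_tgt), 3))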

When to use CJE vs A/B testing

Use CJE When

  • A/B tests take weeks and you need answers now
  • You're evaluating 10+ model/prompt variants
  • Limited production traffic (can't split safely)
  • Can't A/B test (compliance, safety, low-traffic segments)
  • You have historical logs you want to reuse

Still A/B Test When

  • Only 1-2 candidates and plenty of traffic
  • KPI is easy to measure online (clicks, conversions)
  • High-stakes launch needing absolute certainty
  • Judge can't observe critical features (latency, UI)

CJE isn't a replacement for A/B tests—it's a complement that lets you iterate faster offline before committing to expensive online experiments.

Get started in 5 minutes

Installation

# Install
pip install causal-judge-evaluation

# Run evaluation
cje evaluate --data logs.parquet \
    --oracle labels.csv \
    --output results/

What you get

  • Point estimates in KPI units with 95% CIs
  • Calibration reliability plots
  • Coverage and overlap diagnostics
  • ESS and tail heaviness checks (for IPS/DR)
  • Orthogonality tests (for DR)
  • Refuse-to-estimate flags when unreliable