
The Visual Guide to Offline LLM Evaluation

Numbers lie. Plots don't. Here are the 9 visualizations that reveal whether your offline LLM evaluation is trustworthy or garbage—and exactly what to do about it.

In Part 1, we showed why standard offline evaluation fails catastrophically for LLMs. Now let's see what reliable evaluation actually looks like through the lens of CJE's diagnostic plots.

🎯 Each plot answers a critical question:

  • Is my reweighting working? (Plot 1)
  • Can I trust my judge? (Plots 2-3)
  • Are my weights stable? (Plot 4)
  • Do I have enough overlap? (Plot 5)
  • Which estimator should I trust? (Plots 6-7)
  • How uncertain am I really? (Plot 8)
  • Do I need more data or labels? (Plot 9)

1. The Reweighting Reality Check

Plot: Distribution Shift Visualization

[Histogram showing π₀ (light bars) and reweighted π′ (dark bars)]

What you're seeing: How importance weighting transforms your historical data distribution to match what the new policy would generate.

Good sign: Dark bars (reweighted) roughly match the expected distribution of your new policy. Some samples get weight arrows ↑ (more likely under new policy) or ↓ (less likely).

Red flag: Extreme concentration—if 90% of weight goes to 1% of samples, your evaluation is basically fiction.

This is your first sanity check. If reweighting creates a wildly different distribution, you're extrapolating beyond your data's support. Time to collect new logs.
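
If you want a quick version of this check outside CJE, the sketch below reweights the logged judge scores by the importance ratio π′/π₀ and overlays the two histograms. Variable names are illustrative; this is a minimal approximation, not CJE's plotting code.

# Sketch: reweight the logged judge-score distribution by importance ratios
# and compare it against the raw pi_0 distribution.
import numpy as np
import matplotlib.pyplot as plt

def importance_weights(logp_new, logp_old):
    """w_i = pi'(a_i | x_i) / pi_0(a_i | x_i), computed in log space for stability."""
    w = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    return w / w.mean()  # self-normalize so the weights average to 1

def plot_reweighting(judge_scores, weights, bins=30):
    """Light bars: raw pi_0 histogram. Dark bars: the same samples reweighted toward pi'."""
    plt.hist(judge_scores, bins=bins, density=True, alpha=0.4, label="π₀ (raw)")
    plt.hist(judge_scores, bins=bins, density=True, weights=weights, alpha=0.8, label="π′ (reweighted)")
    plt.xlabel("judge score")
    plt.legend()
    plt.show()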


2. Judge Calibration: From Scores to Reality

Plot: Reliability Diagram

[X-axis: predicted win rate, Y-axis: actual win rate, with diagonal reference]

What you're seeing: Whether your judge's scores actually predict real outcomes. Perfect calibration follows the diagonal.

Good sign: Calibrated line (blue) hugs the diagonal. Raw judge scores (red) are typically overconfident at extremes and compressed in the middle.

Red flag: Flat regions in calibration curve = judge can't distinguish quality in that range. Wide confidence intervals = need more labeled data.

Real example: a GPT-4 judge scoring a response 0.9 often corresponds to only a 65% actual win rate. AutoCal-R fixes this systematic bias by mapping scores to true KPIs.
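
One way to build a reliability diagram like this yourself is to fit a monotone map from judge scores to labeled outcomes and compare binned predictions with binned reality. The sketch below assumes judge scores already live on a 0-1 scale and uses scikit-learn's isotonic regression as a generic stand-in calibrator; it is not claimed to be CJE's AutoCal-R.

# Sketch: reliability diagram from judge scores plus a small labeled subset.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.isotonic import IsotonicRegression

def reliability_diagram(judge_scores, outcomes, n_bins=10):
    judge_scores = np.asarray(judge_scores, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)  # e.g. 0/1 win labels

    # Generic monotone calibrator (stand-in for AutoCal-R).
    calibrated = IsotonicRegression(out_of_bounds="clip").fit_transform(judge_scores, outcomes)

    # Bin by raw judge score; compare mean prediction vs. mean outcome per bin.
    edges = np.quantile(judge_scores, np.linspace(0, 1, n_bins + 1))
    idx = np.digitize(judge_scores, edges[1:-1])
    pred_raw, pred_cal, actual = [], [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            pred_raw.append(judge_scores[mask].mean())
            pred_cal.append(calibrated[mask].mean())
            actual.append(outcomes[mask].mean())

    plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")
    plt.plot(pred_raw, actual, "o-", label="raw judge")
    plt.plot(pred_cal, actual, "o-", label="calibrated")
    plt.xlabel("predicted win rate")
    plt.ylabel("actual win rate")
    plt.legend()
    plt.show()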


3. Coverage: Are You Extrapolating?

Plot: Coverage Badge

[Coverage badges: Policy A 92% in-range · Policy B 78% in-range · Policy C 45% in-range ⚠️]

What you're seeing: What fraction of the new policy's responses fall within the score range where you have calibration labels.

Good sign: > 80% in-range. Small out-of-range is fine if calibration is approximately linear at boundaries.

Red flag: > 30% out-of-range + flat calibration at boundaries = REFUSE-LEVEL. You're making up numbers for a large fraction of data.

This is why creative writing models can't be evaluated on factual Q&A labels. The judge score distributions don't overlap—you're extrapolating blindly.
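
A rough version of the coverage badge takes only a few lines: compute the fraction of the new policy's judge scores that fall inside the score range spanned by your calibration labels. Names below are illustrative, and this is a simplification of whatever range logic CJE uses internally.

# Sketch: fraction of new-policy judge scores inside the labeled score range.
import numpy as np

def coverage_in_range(new_policy_scores, labeled_scores):
    lo, hi = np.min(labeled_scores), np.max(labeled_scores)
    new_policy_scores = np.asarray(new_policy_scores)
    in_range = (new_policy_scores >= lo) & (new_policy_scores <= hi)
    return in_range.mean()

# Example: flag policies below an 80% threshold as extrapolation risks.
# for name, scores in policy_scores.items():
#     frac = coverage_in_range(scores, oracle_label_scores)
#     print(f"{name}: {frac:.0%} in-range" + ("" if frac > 0.8 else " ⚠️"))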


4. Weight Stability: Taming the Explosion

Plot: Weight Tail Distribution & ESS Improvement

Log-log CCDF of weights

[Shows raw weights with heavy tail vs calibrated weights shifted left]

ESS (Raw)0.6%
ESS (SIMCal-W)94.6%
Max weight: 52% → 1.8% ✓

What you're seeing: How SIMCal-W transforms explosive weights into stable ones while preserving unbiasedness.

Good sign: ESS > 30%, max weight < 5%, calibrated curve shifted left from raw.

Red flag: Even after calibration, ESS < 10% or max weight > 20%. The policies are too different for reliable evaluation.

This single improvement takes unusable evaluations (0.6% ESS) and makes them actionable (94.6% ESS). It's the difference between 30 effective samples and 4,700.
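
The two headline numbers in this plot, ESS and the maximum weight share, are easy to compute yourself. A minimal sketch:

# Sketch: effective sample size (Kish ESS) and largest single-sample weight share.
import numpy as np

def weight_diagnostics(weights):
    w = np.asarray(weights, dtype=float)
    ess = w.sum() ** 2 / (w ** 2).sum()   # Kish effective sample size
    ess_frac = ess / len(w)               # as a fraction of n
    max_share = w.max() / w.sum()         # share of total weight on one sample
    return ess_frac, max_share

# Rules of thumb from the text: worry if ess_frac < 0.30 or max_share > 0.05,
# and treat ess_frac < 0.10 or max_share > 0.20 *after* calibration as a refusal signal.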


5. Structural Overlap in Judge Space

Plot: Score Distribution Overlap

[Overlapping histograms of judge scores for π₀ and π′]

Bhattacharyya coefficient: 0.73

What you're seeing: How much the judge score distributions overlap between old and new policies. More overlap = more reliable evaluation.

Good sign: Bhattacharyya coefficient > 0.6, visible overlap in histograms.

Red flag: Coefficient < 0.3 or completely separated distributions. You're comparing apples to oranges.

Low overlap doesn't always mean failure—but it means you need more sophisticated methods (DR/TMLE) and larger sample sizes.
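
The Bhattacharyya coefficient can be estimated directly from the two score histograms. A minimal sketch using shared bins (this is the standard discrete estimator, not necessarily CJE's exact computation):

# Sketch: Bhattacharyya coefficient between two judge-score distributions.
# 1.0 = identical distributions, 0.0 = completely disjoint.
import numpy as np

def bhattacharyya_coefficient(scores_pi0, scores_pi_new, bins=30):
    lo = min(np.min(scores_pi0), np.min(scores_pi_new))
    hi = max(np.max(scores_pi0), np.max(scores_pi_new))
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(scores_pi0, bins=edges)
    q, _ = np.histogram(scores_pi_new, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(np.sqrt(p * q)))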


6. The Forest of Truth: Which Estimator Wins?

Plot: Estimator Comparison Forest

Naive (π₀)
0.42 ± 0.18 ❌
SNIPS
0.31 ± 0.62 ❌
Calibrated-IPS
0.49 ± 0.08 ⚠️
Stacked-DR
0.51 ± 0.04 ✅
Ground Truth
0.52

What you're seeing: Point estimates and 95% CIs for different methods. Badges show which gates passed.

Good sign: Calibrated methods near truth, tight CIs, green badges.

Red flag: All methods have red badges or huge CIs. No amount of clever statistics can save bad data.

Notice how naive evaluation (0.42) is biased—it measures the old policy, not the new one. SNIPS has the right idea but explodes (±0.62!). Only Stacked-DR gives you the truth with honest uncertainty.
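
A forest plot like this is just point estimates with 95% intervals on a shared axis. Here is a minimal matplotlib sketch using the numbers from the figure above:

# Sketch: estimator comparison forest plot (values copied from the figure).
import numpy as np
import matplotlib.pyplot as plt

estimators = ["Naive (π₀)", "SNIPS", "Calibrated-IPS", "Stacked-DR"]
estimates = [0.42, 0.31, 0.49, 0.51]
half_widths = [0.18, 0.62, 0.08, 0.04]   # 95% CI half-widths
ground_truth = 0.52

ys = np.arange(len(estimators))
plt.errorbar(estimates, ys, xerr=half_widths, fmt="o", capsize=4)
plt.axvline(ground_truth, linestyle="--", label="ground truth")
plt.yticks(ys, estimators)
plt.xlabel("estimated KPI")
plt.legend()
plt.tight_layout()
plt.show()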


7. Orthogonality: Is Your Model Cheating?

Plot: DR Orthogonality Check

[Estimate relative to the zero reference line: −0.003 ± 0.012 ✓]

What you're seeing: Whether your nuisance model (outcome predictions) is orthogonal to your causal estimate. Should be centered at zero.

Good sign: CI includes zero, tight interval.

Red flag: CI excludes zero = your model learning is leaking into the estimate, biasing results.

This technical check ensures your fancy ML models aren't accidentally learning the treatment effect instead of the outcome function—a subtle but critical form of overfitting.
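
One simple version of such a diagnostic, not necessarily the exact statistic CJE reports, tests whether the weighted outcome-model residuals are centered at zero:

# Sketch: orthogonality-style check for a DR estimator. If the outcome model
# is (cross-fit and) correct, w * (r - m_hat) should be mean-zero.
import numpy as np

def orthogonality_check(weights, rewards, outcome_preds):
    w = np.asarray(weights, dtype=float)
    resid = np.asarray(rewards, dtype=float) - np.asarray(outcome_preds, dtype=float)
    term = w * resid
    mean = term.mean()
    se = term.std(ddof=1) / np.sqrt(len(term))
    return mean, (mean - 1.96 * se, mean + 1.96 * se)

# A CI that clearly excludes zero suggests nuisance-model error is leaking
# into the causal estimate rather than averaging out.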


8. Uncertainty Decomposition: What Don't You Know?

Plot: Oracle Uncertainty Attribution

200 labels (5% coverage): 70% data / 30% oracle
800 labels (20% coverage): 92% data / 8% oracle

What you're seeing: How much of your confidence interval comes from limited data (blue) vs uncertainty in calibration (orange).

Good sign: Oracle uncertainty < 20% of total. Shrinks with more labels.

Red flag: Oracle uncertainty > 50%. Your bottleneck is labels, not logs.

This decomposition tells you exactly where to invest: more conversation logs or more human labels. Usually, 200-500 labels is enough unless you have extreme distribution shift.
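
A hedged way to approximate this split without CJE's internals is a pair of bootstraps: resample logs with the calibrator fixed (data uncertainty), then refit the calibrator on resampled labels with the logs fixed (oracle uncertainty). In the sketch below, estimate_fn and fit_calibrator_fn are hypothetical stand-ins for your estimator and calibrator fit, and logs/labels are numpy arrays.

# Sketch: attribute variance to "data" vs. "oracle" (calibration) components.
# Illustrative only; not CJE's exact decomposition.
import numpy as np

def uncertainty_split(estimate_fn, fit_calibrator_fn, logs, labels, n_boot=200, seed=0):
    rng = np.random.default_rng(seed)
    resample = lambda arr: arr[rng.integers(0, len(arr), len(arr))]
    full_cal = fit_calibrator_fn(labels)

    # Data component: resample the logs, keep the calibrator fixed.
    data_draws = [estimate_fn(resample(logs), full_cal) for _ in range(n_boot)]
    # Oracle component: refit the calibrator on resampled labels, keep the logs fixed.
    oracle_draws = [estimate_fn(logs, fit_calibrator_fn(resample(labels))) for _ in range(n_boot)]

    var_data = np.var(data_draws, ddof=1)
    var_oracle = np.var(oracle_draws, ddof=1)
    total = var_data + var_oracle
    return {"data": var_data / total, "oracle": var_oracle / total}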


9. The Planning Curve: Data vs Labels

Plot: MDE Power Analysis

[Curves showing MDE vs sample size for different label coverage levels]

5% labels: MDE = 8.2% at n=5000
10% labels: MDE = 5.1% at n=5000
20% labels: MDE = 3.8% at n=5000

What you're seeing: The minimum detectable effect (MDE) for 80% power as a function of sample size and label coverage.

Good sign: Your expected effect size is above the MDE curve for your data.

Red flag: MDE > 10% even with maximum data. You need a bigger effect or a better experiment design.

This plot turns statistics into decisions. If you expect a 5% improvement, you can see exactly whether your data can detect it—before running the analysis.
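
A back-of-the-envelope MDE calculation looks like the sketch below. It only accounts for sampling uncertainty through the effective sample size and ignores the oracle (label) component, which is why CJE's curves also depend on label coverage; treat it as a rough lower bound, not CJE's power analysis.

# Sketch: MDE at 80% power and 5% two-sided alpha for a two-policy comparison.
import numpy as np
from scipy.stats import norm

def mde(n, ess_fraction, sigma, alpha=0.05, power=0.80):
    n_eff = n * ess_fraction                  # effective samples after weighting
    se = sigma * np.sqrt(2.0 / n_eff)         # rough SE for comparing two policies
    return (norm.ppf(1 - alpha / 2) + norm.ppf(power)) * se

# Example sweep: is the effect you expect above the curve?
# for n in (1000, 2000, 5000, 10000):
#     print(n, round(mde(n, ess_fraction=0.3, sigma=0.5), 3))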


Putting It All Together: The Decision Flow

Your evaluation checklist:

  1. Weight stability: ESS > 30% after SIMCal-W?
  2. Judge reliability: Calibration R² > 0.7?
  3. Coverage: > 70% of new policy in calibration range?
  4. Overlap: Bhattacharyya > 0.3?
  5. Orthogonality: CI includes zero?
  6. Power: MDE less than expected effect?

All checks pass → SHIP IT

Trust your estimates. Use Stacked-DR for final numbers.

Any check fails → REFUSE-LEVEL

Get rankings only. Plots tell you what data to collect.
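
As a sketch, the whole checklist collapses into a handful of threshold comparisons. Field names below are hypothetical; the thresholds come from the list above.

# Sketch: the six gates from the checklist as explicit threshold checks.
# `diag` is assumed to be a dict of the diagnostics discussed above.
GATES = {
    "weight_stability":  lambda d: d["ess_fraction"] > 0.30,
    "judge_reliability": lambda d: d["calibration_r2"] > 0.70,
    "coverage":          lambda d: d["in_range_fraction"] > 0.70,
    "overlap":           lambda d: d["bhattacharyya"] > 0.30,
    "orthogonality":     lambda d: d["ortho_ci"][0] <= 0.0 <= d["ortho_ci"][1],
    "power":             lambda d: d["mde"] < d["expected_effect"],
}

def evaluate_gates(diag):
    failed = [name for name, check in GATES.items() if not check(diag)]
    verdict = "SHIP IT" if not failed else "REFUSE-LEVEL (rankings only)"
    return verdict, failed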

The Bottom Line

These 9 plots transform offline evaluation from gambling to engineering. Each visualization targets a specific failure mode, and together they give you the complete picture of your evaluation's reliability.

Stop trusting single numbers. Start demanding these plots. They're the difference between shipping improvements and shipping disasters.

Earlier in the series: Part 1 dives deep into calibration and stabilization, the mathematical machinery that makes these plots possible. Read Part 1 →


Generate these plots with CJE:

# Generate all diagnostic plots
python -m cje analyze logs.jsonl \
    --plots all \
    --output-dir results/plots/

# Or in Python
from cje import analyze_dataset
results = analyze_dataset("logs.jsonl", generate_plots=True)
results.save_plots("results/plots/")