
Direct Method (DM)

The default, simplest path for LLM evaluation

What DM is

The Direct Method generates outputs from each candidate policy on the same prompt set, scores them with your judge, converts scores to calibrated rewards via AutoCal-R, averages the rewards, and attaches a confidence interval widened by OUA (Outcome Uncertainty Adjustment).
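A minimal sketch of that pipeline in Python, assuming hypothetical `generate`, `judge`, and `calibrate` callables for your policy, your judge, and a fitted AutoCal-R map (the names are illustrative, not a specific library's API):

```python
import numpy as np

def dm_point_estimate(prompts, generate, judge, calibrate):
    """Direct Method sketch: generate -> judge -> calibrate -> average.

    generate(prompt) -> response text         (your candidate policy)
    judge(prompt, response) -> judge score S  (your judge)
    calibrate(scores) -> calibrated rewards R (a fitted AutoCal-R map)
    All three are hypothetical callables you supply.
    """
    S = np.array([judge(p, generate(p)) for p in prompts])
    R = np.asarray(calibrate(S))              # S -> R, in KPI units
    est = R.mean()                            # V-hat(pi)
    se = R.std(ddof=1) / np.sqrt(len(R))      # sampling-only SE; OUA widens this later
    return est, (est - 1.96 * se, est + 1.96 * se)
```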

When to use DM

DM is the recommended starting point if you can afford to generate fresh outputs from each candidate policy. It's simpler than off-policy methods, requires fewer diagnostics, and typically has tighter confidence intervals.

AutoCal-R: just what you need

AutoCal-R learns a smooth mapping from judge scores (S) to calibrated rewards (R) in KPI units. This ensures:

  • Interpretability: R is on your KPI scale (e.g., purchase conversion probability 0-1), not an arbitrary 0-10 scale
  • Ranking preservation: Higher judge scores always map to higher (or equal) calibrated rewards
  • Unbiasedness: Average calibrated reward matches average true outcome on your labeled data

How it works

  1. Collect labels. Sample 1-5% of your data randomly (typically 200-1000 examples) and label it with ground-truth KPI outcomes Y.
  2. Fit smooth calibration curve. Learn a function f: S → R that best predicts outcomes while preserving score order and matching the average outcome.
  3. Auto fallback (optional). If a single curve doesn't fit well (e.g., short prompts behave differently from long ones), fit separate curves for different groups and combine them.
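As a rough picture of step 2, isotonic regression is one standard way to learn a fit that preserves score order and matches the average outcome on the fitting data; the actual AutoCal-R estimator may differ, so read this as a sketch:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_calibration(judge_scores, outcomes):
    """Fit a monotone map f: S -> R on the labeled sample.

    Isotonic regression keeps higher scores at (weakly) higher rewards, and
    its fitted values average to the mean outcome on the fitting data --
    the ranking-preservation and mean-matching properties listed above.
    Bounds assume a 0-1 KPI such as purchase probability.
    """
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(np.asarray(judge_scores), np.asarray(outcomes))
    return iso

# Usage with the labeled slice (S_lab, Y_lab):
#   f = fit_calibration(S_lab, Y_lab)
#   R_eval = f.predict(S_eval)   # calibrated rewards for evaluation scores
```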

Running example: E-commerce chatbot

Scenario: You run an e-commerce chatbot. Your baseline uses GPT-4, and you're testing a new prompt that encourages more concise responses (hoping to reduce latency and improve conversion).

KPI: Purchase within 24 hours of conversation (binary: 0 or 1)

Calibration: You randomly sample 500 conversations, label them with actual purchase outcomes, and get judge scores 0-10 for helpfulness. AutoCal-R learns:

  • Judge score 4 → R = 0.08 (8% purchase probability)
  • Judge score 7 → R = 0.21 (21% purchase probability)
  • Judge score 9 → R = 0.34 (34% purchase probability)

Result: Baseline scores average 7.8 → 0.23 purchase probability. New prompt scores average 8.2 → 0.26 purchase probability. Estimate: +3 percentage points (a 13% relative lift), 95% CI: [+0.5, +5.5] percentage points. You ship it.
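Because both policies answer the same prompt set, the contrast can be computed from per-prompt differences in calibrated reward. A sketch, assuming aligned reward arrays and sampling uncertainty only (the OUA widening is described below):

```python
import numpy as np

def paired_difference(R_new, R_baseline):
    """Contrast two policies evaluated on the same prompts.

    R_new, R_baseline: calibrated rewards f(S) for each policy, aligned by
    prompt. Pairing lets per-prompt noise cancel in the difference, which
    is what tightens the CI on the contrast.
    """
    d = np.asarray(R_new) - np.asarray(R_baseline)
    est = d.mean()
    se = d.std(ddof=1) / np.sqrt(len(d))
    return est, (est - 1.96 * se, est + 1.96 * se)
```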

Operator artifacts

Reliability plot

Predicted R vs observed Y across S bins, with confidence bands. Check for systematic over/under-prediction.
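A hypothetical helper for building the data behind that plot, assuming out-of-fold predictions `R_pred` for the labeled examples:

```python
import numpy as np

def reliability_table(S, Y, R_pred, n_bins=10):
    """Per-bin comparison of predicted R and observed Y, binned by judge score S."""
    S, Y, R_pred = map(np.asarray, (S, Y, R_pred))
    edges = np.linspace(S.min(), S.max(), n_bins + 1)
    bins = np.clip(np.digitize(S, edges[1:-1]), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        m = bins == b
        if m.any():
            rows.append((S[m].mean(), R_pred[m].mean(), Y[m].mean(), int(m.sum())))
    return rows  # (mean S, mean predicted R, mean observed Y, count) per bin
```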

Mean-preservation check

Verify 𝔼[R] ≈ 𝔼[Y] on out-of-fold data. Large discrepancies indicate calibration issues.
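A minimal version of the check, assuming out-of-fold predictions `R_oof` and labels `Y`:

```python
import numpy as np

def mean_preservation_gap(R_oof, Y):
    """E[R] - E[Y] on out-of-fold labeled data; a value near zero is what you want."""
    return float(np.mean(R_oof) - np.mean(Y))
```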

S-coverage

Fraction of evaluation S falling within labeled S-range. Low coverage (<95%) or flat boundary slopes indicate extrapolation risk.
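A sketch of the coverage computation, assuming arrays of evaluation and labeled judge scores:

```python
import numpy as np

def s_coverage(S_eval, S_labeled):
    """Fraction of evaluation judge scores inside the labeled score range."""
    lo, hi = np.min(S_labeled), np.max(S_labeled)
    S_eval = np.asarray(S_eval)
    return float(((S_eval >= lo) & (S_eval <= hi)).mean())  # < 0.95 flags extrapolation risk
```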

Honest uncertainty (OUA)

Standard confidence intervals only account for sampling noise—the randomness from having finite evaluation data. But there's another source of uncertainty: calibration uncertainty from learning the score-to-outcome mapping on a limited number of labels.

OUA (Outcome Uncertainty Adjustment) captures this by measuring how much your estimates would change if you had gotten slightly different labels, then widening your confidence interval accordingly.

How it works (simplified)

  1. Split your labeled data into 5 groups
  2. Fit 5 different calibration curves, each leaving out one group
  3. Apply each curve to your evaluation data → 5 slightly different estimates
  4. Measure how much these estimates vary
  5. Add this extra uncertainty to your confidence interval
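A sketch of those five steps, using isotonic regression to stand in for the AutoCal-R refit and a delete-one-group jackknife as one way to turn the spread of fold estimates into extra variance (hypothetical helper; the production estimator may differ):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import KFold

def oua_interval(S_lab, Y_lab, S_eval, n_folds=5):
    """Point estimate, OUA-widened 95% CI, and OUA share for one policy."""
    S_lab, Y_lab, S_eval = map(np.asarray, (S_lab, Y_lab, S_eval))

    # Steps 1-3: refit the calibration curve with each fold left out,
    # apply it to the evaluation scores, and collect the estimates.
    fold_estimates = []
    for keep_idx, _ in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(S_lab):
        f = IsotonicRegression(out_of_bounds="clip").fit(S_lab[keep_idx], Y_lab[keep_idx])
        fold_estimates.append(f.predict(S_eval).mean())
    fold_estimates = np.array(fold_estimates)

    # Point estimate from the full-data calibration curve.
    f_full = IsotonicRegression(out_of_bounds="clip").fit(S_lab, Y_lab)
    R_eval = f_full.predict(S_eval)
    est = R_eval.mean()

    # Steps 4-5: delete-one-group jackknife variance for calibration,
    # added to the usual sampling variance before forming the CI.
    calib_var = (n_folds - 1) / n_folds * np.sum((fold_estimates - fold_estimates.mean()) ** 2)
    sampling_var = R_eval.var(ddof=1) / len(R_eval)
    se = np.sqrt(sampling_var + calib_var)
    oua_share = calib_var / (sampling_var + calib_var)
    return est, (est - 1.96 * se, est + 1.96 * se), oua_share
```

The returned `oua_share` is the quantity interpreted next: the fraction of total variance attributable to the learned calibration rather than to evaluation sampling.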

Interpreting OUA share

OUA share tells you what fraction of your total uncertainty comes from calibration vs sampling. If OUA share is high (>40%), adding more labels helps more than running more evaluations. If it's low (<20%), you're bottlenecked by evaluation sample size, not label quality.

What to report (DM)

Standard reporting template

Point estimate & CI

Policy A: 0.45 [0.41, 0.49] (95% CI)

Policy B: 0.38 [0.34, 0.42] (95% CI)

Difference: +0.07 [+0.01, +0.13]

OUA share

28% of total variance from calibration uncertainty

→ Sampling variance dominates; more prompts would tighten CIs faster than more labels

Reliability

Mean absolute error: 0.03 (out-of-fold)

→ Reliability plot shows good fit across S bins

S-coverage

98% of evaluation S within labeled range [2.1, 9.7]

→ Minimal extrapolation risk

Complete DM workflow

  1. Define prompt set and candidate policies
  2. Collect random 1-5% labeled sample (typically 200-1000 examples)
  3. Fit AutoCal-R: S → R (with K-fold for OUA)
  4. Check reliability and S-coverage diagnostics
  5. Generate fresh outputs for each policy on same prompts
  6. Score with judge → S, map to R = f(S)
  7. Estimate V̂(π) = mean(R), add OUA to CI
  8. Report point estimate, 95% CI, OUA share, reliability, coverage

When DM works well

  • ✓ You can afford fresh generations from π′
  • ✓ Your eval set is representative of production distribution
  • ✓ Policies share the same prompt distribution
  • ✓ Judge scoring is stable and consistent

When to consider off-policy instead

  • Fresh generation is too expensive or API-limited
  • You need to evaluate on historical data only
  • Prompt distributions differ across policies (can't pair)
→ See Off-Policy methods (IPS & DR)