CIMO Labs

Quick-Start Recipes

Copy-and-use workflows for running CJE evaluations

Quickstart (7 steps)

  1. Fix your prompt set & candidate policies. Define what you're comparing—baseline vs new model, old vs new prompt, etc.
  2. Collect a small labeled slice. Sample 1-5% of your data randomly, label with ground-truth KPI outcomes (Y).
  3. Fit AutoCal-R and audit reliability. Learn the judge score → KPI mapping, check the reliability plot and coverage.
  4. Pick your mode. Can you generate fresh outputs for all policies on the same prompts? → DM. Otherwise → IPS (or DR if you can train a critic).
  5. If IPS/DR: run SIMCal. Stabilize importance weights using monotone-in-judge-score projection.
  6. Estimate and add OUA. Compute your point estimate and add outcome uncertainty adjustment for honest CIs.
  7. Check diagnostics. Review coverage, reliability, ESS, tails, and orthogonality (DR only). Apply fixes if alerts fire.

Mode selection

Can you generate for all policies on the same prompts?

Yes → Use Direct Method (DM)

This is the simplest, most reliable path: it pairs each prompt across policies for a clean comparison.

Do you have a judged log with teacher-forcing ratios?

Yes → Use Calibrated IPS (with SIMCal)

Reweight logged data using likelihood ratios. Check ESS and tail diagnostics after SIMCal.

Is overlap imperfect and can you train a critic?

Yes → Use Calibrated DR

Hedge IPS with an outcome model for doubly robust estimation. Verify orthogonality test passes.
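The decision flow above can be captured as a small helper. This is only an illustrative sketch (the function name and flags are hypothetical, not part of CJE):

```python
def choose_mode(can_generate_fresh: bool,
                has_tf_ratios: bool,
                overlap_imperfect: bool,
                can_train_critic: bool) -> str:
    """Pick an estimation mode following the decision flow above."""
    if can_generate_fresh:
        return "DM"  # simplest, most reliable path
    if not has_tf_ratios:
        return "no estimator: need a judged log with teacher-forcing ratios"
    if overlap_imperfect and can_train_critic:
        return "Calibrated DR"  # hedge IPS with an outcome model
    return "Calibrated IPS"     # reweight the judged log, stabilize with SIMCal

print(choose_mode(False, True, True, True))  # → Calibrated DR
```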

Recipe cards

Direct Method (DM)

  1. Generate fresh outputs for each policy on the same prompt set
  2. Score all outputs with your judge → judge scores S
  3. Map S to calibrated rewards R using AutoCal-R: R = f(S)
  4. Compute the mean: V̂(π) = (1/n) ∑ Rᵢ
  5. Add OUA to your confidence interval
  6. Report: point estimate, 95% CI, OUA share, reliability plot, S-coverage
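The DM steps above can be sketched in a few lines of numpy. A binned, monotone calibration stands in for AutoCal-R (an assumption; the real mapping differs), the data is synthetic, and the CI shown is a plain normal interval before the OUA widening in step 5:

```python
import numpy as np

rng = np.random.default_rng(0)

# Labeled slice: judge scores S and binary KPI outcomes Y
S_lab = rng.uniform(0, 1, 400)
Y_lab = (rng.uniform(0, 1, 400) < 0.2 + 0.5 * S_lab).astype(float)

# Step 3 stand-in for AutoCal-R (assumption): quantile-binned label means,
# forced monotone in the judge score
bins = np.quantile(S_lab, np.linspace(0, 1, 11))
which = np.clip(np.digitize(S_lab, bins[1:-1]), 0, 9)
bin_mean = np.array([Y_lab[which == b].mean() for b in range(10)])
bin_mean = np.maximum.accumulate(bin_mean)

def f(s):  # map judge scores -> calibrated rewards R
    return bin_mean[np.clip(np.digitize(s, bins[1:-1]), 0, 9)]

# Steps 1-2 produced judge scores for one policy's fresh outputs
S_eval = rng.uniform(0, 1, 2000)
R = f(S_eval)

# Step 4: point estimate; step 5's OUA term would widen this CI
v_hat = R.mean()
se = R.std(ddof=1) / np.sqrt(len(R))
print(f"V_hat = {v_hat:.3f}  95% CI [{v_hat - 1.96*se:.3f}, {v_hat + 1.96*se:.3f}]")
```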

Example: E-commerce chatbot

You test a new "concise response" prompt vs your GPT-4 baseline:

  • Sample 500 conversations → label with actual purchases (KPI)
  • Calibrate judge scores to purchase probability
  • Run both policies on 2000 eval prompts
  • Baseline: 23% purchase rate [21%, 25%]
  • New prompt: 26% purchase rate [24%, 28%]
  • Decision: Ship (+3 pp lift is significant)
→ Full DM guide with detailed example

Calibrated IPS

  1. Fit AutoCal-R on your labeled slice: S → R
  2. Compute raw importance weights via teacher forcing: W = pπ'(A|X) / pπ₀(A|X)
  3. Run SIMCal to stabilize weights (monotone in judge score, mean-preserving)
  4. Compute the weighted mean: V̂IPS = (1/n) ∑ Wcalᵢ Rᵢ
  5. Add OUA to your CI
  6. Report: point estimate, 95% CI, OUA share, ESS fraction, max-weight share, tail index, overlap heatmap
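The weighting steps can be sketched on synthetic data. The binned monotone projection below is only a crude stand-in for SIMCal (an assumption about its shape: monotone in the judge score, rescaled to preserve the mean weight); it also shows the ESS-fraction diagnostic from step 6:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

S = rng.uniform(0, 1, n)                                    # judge scores on the log
R = np.clip(0.2 + 0.5 * S + rng.normal(0, 0.05, n), 0, 1)   # calibrated rewards
W = rng.lognormal(-0.125, 0.5, n) * (0.5 + S)               # raw ratios pπ'/pπ₀

# SIMCal stand-in (assumption): pool weights in judge-score bins, force the
# bin means to be monotone in S, then rescale to preserve the overall mean
bins = np.quantile(S, np.linspace(0, 1, 21))
which = np.clip(np.digitize(S, bins[1:-1]), 0, 19)
w_bin = np.maximum.accumulate(
    np.array([W[which == b].mean() for b in range(20)]))
W_cal = w_bin[which] * (W.mean() / w_bin[which].mean())

# Diagnostics and the weighted-mean estimate (steps 4 and 6)
ess_frac = W_cal.sum() ** 2 / (n * (W_cal ** 2).sum())      # ESS / n
v_ips = np.mean(W_cal * R)
print(f"V_IPS = {v_ips:.3f}  ESS fraction = {ess_frac:.2f}")
```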
→ Full IPS guide

Calibrated DR

  1. Fit AutoCal-R on the labeled slice: S → R
  2. Train outcome models (cross-fit): μπ'(X) for the target policy, q(X,A) for the logger
  3. Compute raw weights via teacher forcing, then stabilize with SIMCal
  4. Compute the DR estimate: V̂DR = (1/n) ∑ [μπ'(Xᵢ) + Wcalᵢ (Rᵢ − q(Xᵢ,Aᵢ))]
  5. Add OUA to your CI
  6. Run the orthogonality test: check the weighted mean of (R − q̂) with its CI
  7. Report: point estimate, 95% CI, OUA share, ESS, orthogonality test result
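The DR combination in step 4 and the orthogonality check in step 6 look like this on synthetic inputs. Everything upstream (calibrated weights, cross-fit models μπ' and q) is assumed already fitted and is simulated here:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000

S = rng.uniform(0, 1, n)                                    # judge scores
R = np.clip(0.2 + 0.5 * S + rng.normal(0, 0.05, n), 0, 1)   # calibrated rewards

# Assumed produced upstream: SIMCal-stabilized weights and cross-fit models
W_cal = rng.lognormal(-0.02, 0.2, n)          # stand-in calibrated weights
mu_pi = np.clip(0.2 + 0.5 * S, 0, 1)          # stand-in fit of μπ'(X)
q_hat = np.clip(0.2 + 0.5 * S, 0, 1)          # stand-in fit of q(X, A)

# Step 4: outcome-model mean plus weighted residual correction
resid = W_cal * (R - q_hat)
v_dr = np.mean(mu_pi + resid)

# Step 6: the weighted residual mean should be ~0 within its CI
se = resid.std(ddof=1) / np.sqrt(n)
print(f"V_DR = {v_dr:.3f}  weighted residual = {resid.mean():.4f} ± {1.96*se:.4f}")
```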
→ Full DR guide

Sample-size & label planning

Rules of thumb

  • When CIs are dominated by OUA: Adding more labels helps most. Doubling your labeled sample halves the OUA variance, shrinking that CI component by ~√2.
  • When CIs are dominated by sampling variance: Adding more eval prompts (or samples per prompt) helps most. This is typical for DM once you have ~500-1000 labels.
  • For IPS/DR: If ESS is low (<30% after SIMCal), prioritize improving overlap (restrict cohort, better policies) before adding more data.

Quick calculator

For a binary KPI with base rate p ≈ 0.2 and a desired 95% CI half-width of ±0.03:

  • Labels needed: ~200-400 for stable AutoCal-R
  • Eval prompts needed (DM): ~1000-2000 per policy
  • Log size needed (IPS): ~2000-5000 samples (assuming moderate ESS ~50%)

These scale roughly as σ²/δ², where σ² is the outcome variance and δ is your target half-width.
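The σ²/δ² scaling is just the normal-approximation sample-size formula. A minimal sketch (the helper name is ours, and real requirements run higher once calibration noise and OUA are added on top):

```python
import math

def n_required(sigma2: float, delta: float, z: float = 1.96) -> int:
    """Samples needed so a 95% CI has half-width ~delta: n ≈ z² σ² / δ²."""
    return math.ceil(z ** 2 * sigma2 / delta ** 2)

p = 0.2                   # binary KPI base rate
sigma2 = p * (1 - p)      # Bernoulli variance = 0.16
print(n_required(sigma2, 0.03))  # → 683
```

This lower bound of ~700 prompts per policy is consistent with the ~1000-2000 rule of thumb above once the extra noise sources are budgeted for.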