Quickstart (7 steps)
- Fix your prompt set & candidate policies. Define what you're comparing—baseline vs new model, old vs new prompt, etc.
- Collect a small labeled slice. Sample 1-5% of your data randomly, label with ground-truth KPI outcomes (Y).
- Fit AutoCal-R and audit reliability. Learn the judge score → KPI mapping, then check the reliability plot and coverage (a code sketch follows these steps).
- Pick your mode. Can you generate fresh outputs for all policies on the same prompts? → DM. Otherwise → IPS (or DR if you can train a critic).
- If IPS/DR: run SIMCal. Stabilize importance weights using monotone-in-judge-score projection.
- Estimate and add OUA. Compute your point estimate and add outcome uncertainty adjustment for honest CIs.
- Check diagnostics. Review coverage, reliability, ESS, tails, and orthogonality (DR only). Apply fixes if alerts fire.
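For quickstart steps 2-3, here is a minimal sketch of fitting the judge-score → KPI mapping on a labeled slice. It uses scikit-learn's isotonic regression as a stand-in for AutoCal-R (whose actual fitting and reliability audit are richer), and the array names and data are purely illustrative.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Labeled slice: judge scores S and ground-truth KPI outcomes Y (illustrative data).
judge_scores = np.array([0.12, 0.35, 0.40, 0.55, 0.61, 0.72, 0.80, 0.91])
kpi_labels   = np.array([0,    0,    1,    0,    1,    1,    1,    1])

# Monotone map S -> calibrated reward R (stand-in for AutoCal-R's f).
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(judge_scores, kpi_labels)

# Crude reliability audit: predicted vs. observed KPI rate in two score bins.
median = np.median(judge_scores)
for name, mask in [("low S", judge_scores <= median), ("high S", judge_scores > median)]:
    pred = calibrator.predict(judge_scores[mask]).mean()
    obs = kpi_labels[mask].mean()
    print(f"{name}: predicted={pred:.2f}, observed={obs:.2f}")
```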
Mode selection
Can you generate fresh outputs for all policies on the same prompts?
✓ Yes → Use Direct Method (DM)
This is the simplest, most reliable path: it pairs each prompt across policies for a clean comparison.
Do you have a judged log with teacher-forcing ratios?
✓ Yes → Use Calibrated IPS (with SIMCal)
Reweight logged data using likelihood ratios. Check ESS and tail diagnostics after SIMCal.
Is overlap imperfect and can you train a critic?
✓ Yes → Use Calibrated DR
Hedge IPS with an outcome model for doubly robust estimation. Verify that the orthogonality test passes.
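If it helps to encode this decision tree, here is a small illustrative helper (not part of any library; the function and flag names are made up):

```python
def choose_mode(can_generate_fresh: bool,
                has_teacher_forcing_ratios: bool,
                can_train_critic: bool) -> str:
    """Map the three questions above to an estimation mode."""
    if can_generate_fresh:
        return "DM"    # simplest path: fresh outputs, prompts paired across policies
    if has_teacher_forcing_ratios and can_train_critic:
        return "DR"    # imperfect overlap, but a critic/outcome model is available
    if has_teacher_forcing_ratios:
        return "IPS"   # judged log + likelihood ratios, stabilized with SIMCal
    raise ValueError("Need either fresh generation or a judged log with teacher-forcing ratios.")
```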
Recipe cards
Direct Method (DM)
1. Generate fresh outputs for each policy on the same prompt set
2. Score all outputs with your judge → judge scores S
3. Map S to calibrated rewards R using AutoCal-R: R = f(S)
4. Compute the mean: V̂(π) = (1/n) ∑ Rᵢ
5. Add OUA to your confidence interval (steps 4-5 are sketched in code below)
6. Report: point estimate, 95% CI, OUA share, reliability plot, S-coverage
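A sketch of steps 4-5, reusing the `calibrator` from the quickstart sketch. The OUA term is approximated here by bootstrapping the calibrator fit over the labeled slice; the exact OUA computation in AutoCal-R may differ.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def dm_estimate(eval_scores, calibrator):
    """Direct Method point estimate and sampling variance for one policy."""
    rewards = calibrator.predict(eval_scores)
    return rewards.mean(), rewards.var(ddof=1) / len(rewards)

def oua_variance(eval_scores, lab_scores, lab_kpi, n_boot=200, seed=0):
    """Outcome-uncertainty term: bootstrap the calibrator fit on the labeled slice."""
    rng = np.random.default_rng(seed)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(lab_scores), size=len(lab_scores))
        cal_b = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        cal_b.fit(lab_scores[idx], lab_kpi[idx])
        boots.append(cal_b.predict(eval_scores).mean())
    return np.var(boots, ddof=1)

# Usage (eval_scores = judge scores on fresh outputs of one policy):
# v_hat, s_var = dm_estimate(eval_scores, calibrator)
# half_width = 1.96 * np.sqrt(s_var + oua_variance(eval_scores, judge_scores, kpi_labels))
# print(f"V_hat = {v_hat:.3f} ± {half_width:.3f}")
```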
Example: E-commerce chatbot
You test a new "concise response" prompt vs your GPT-4 baseline:
- Sample 500 conversations → label with actual purchases (KPI)
- Calibrate judge scores to purchase probability
- Run both policies on 2000 eval prompts
- Baseline: 23% purchase rate [21%, 25%]
- New prompt: 26% purchase rate [24%, 28%]
- Decision: Ship (+3 pp lift is significant)
Calibrated IPS
1. Fit AutoCal-R on your labeled slice: S → R
2. Compute raw importance weights via teacher forcing: W = pπ'(A|X) / pπ₀(A|X)
3. Run SIMCal to stabilize weights (monotone in judge score, mean-preserving)
4. Compute the weighted mean: V̂IPS = (1/n) ∑ Wᶜᵃˡᵢ Rᵢ (steps 3-4 are sketched in code below)
5. Add OUA to your CI
6. Report: point estimate, 95% CI, OUA share, ESS fraction, max-weight share, tail index, overlap heatmap
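A sketch of steps 3-4. The SIMCal step is approximated by an isotonic (monotone-in-judge-score) projection of the raw weights followed by a mean-preserving rescale; the real SIMCal procedure (direction selection, variance control, stacking) is more involved, so treat this as the core idea only.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def simcal_like_weights(raw_weights, judge_scores):
    """Project raw importance weights onto a monotone function of the judge score,
    then rescale so the mean weight is preserved (rough stand-in for SIMCal)."""
    iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
    iso.fit(judge_scores, raw_weights)
    w = iso.predict(judge_scores)
    return w * raw_weights.mean() / w.mean()

def calibrated_ips(rewards, weights):
    """V_hat_IPS = (1/n) * sum(W_cal_i * R_i), plus weight diagnostics."""
    n = len(rewards)
    v_hat = np.mean(weights * rewards)
    ess_fraction = weights.sum() ** 2 / (n * np.sum(weights ** 2))
    max_weight_share = weights.max() / weights.sum()
    return v_hat, ess_fraction, max_weight_share
```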
Calibrated DR
1. Fit AutoCal-R on the labeled slice: S → R
2. Train outcome models (cross-fit): μπ'(X) for the target policy, q(X,A) for the logging policy
3. Compute raw weights via teacher forcing, then stabilize with SIMCal
4. Compute the DR estimate: V̂DR = (1/n) ∑ [μπ'(Xᵢ) + Wᶜᵃˡᵢ (Rᵢ - q(Xᵢ,Aᵢ))]
5. Add OUA to your CI
6. Run the orthogonality test: check that the importance-weighted mean of the residuals (R - q̂) has a CI that covers zero (steps 4 and 6 are sketched in code below)
7. Report: point estimate, 95% CI, OUA share, ESS, orthogonality test result
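A sketch of steps 4 and 6, assuming the cross-fit outcome-model predictions and the SIMCal-stabilized weights are already computed (array names are illustrative). OUA still needs to be added on top of the sampling-only CI shown here.

```python
import numpy as np

def calibrated_dr(mu_target, q_logged, rewards, weights):
    """V_hat_DR = (1/n) * sum( mu_pi'(X_i) + W_cal_i * (R_i - q(X_i, A_i)) ).
    mu_target: cross-fit predictions of the outcome under the target policy,
    q_logged:  cross-fit predictions for the logged (X, A) pairs."""
    contrib = mu_target + weights * (rewards - q_logged)
    n = len(contrib)
    v_hat = contrib.mean()
    se = contrib.std(ddof=1) / np.sqrt(n)   # sampling term only; add OUA separately
    return v_hat, (v_hat - 1.96 * se, v_hat + 1.96 * se)

def orthogonality_check(rewards, q_logged, weights):
    """Weighted mean of the residuals (R - q_hat) with a 95% CI;
    the test passes when the CI covers zero."""
    terms = weights * (rewards - q_logged)
    mean = terms.mean()
    se = terms.std(ddof=1) / np.sqrt(len(terms))
    return mean, (mean - 1.96 * se, mean + 1.96 * se)
```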
Sample-size & label planning
Rules of thumb
- When CIs are dominated by OUA: Adding more labels helps most. Doubling your labeled sample halves the OUA variance, shrinking that CI component by ~√2.
- When CIs are dominated by sampling variance: Adding more eval prompts (or more samples per prompt) helps most. This is typical for DM once you have ~500-1000 labels.
- For IPS/DR: If ESS is low (<30% after SIMCal), prioritize improving overlap (restrict the cohort, choose policies with better overlap) before adding more data.
Quick calculator
For a binary KPI with base rate p ≈ 0.2 and desired CI width of ±0.03:
- Labels needed: ~200-400 for stable AutoCal-R
- Eval prompts needed (DM): ~1000-2000 per policy
- Log size needed (IPS): ~2000-5000 samples (assuming moderate ESS ~50%)
These scale roughly as σ²/δ², where σ² is the outcome variance and δ is your target half-width.
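A back-of-the-envelope planner matching that scaling, for a binary KPI (so σ² = p(1−p)). Treat the output as a lower bound: it ignores OUA, calibration error, and any gains from pairing prompts across policies.

```python
import math

def prompts_needed(p=0.2, half_width=0.03, z=1.96, ess_fraction=1.0):
    """n ≈ z² · σ² / δ² with σ² = p(1-p); divide by the ESS fraction for IPS/DR logs."""
    sigma_sq = p * (1 - p)
    return math.ceil(z ** 2 * sigma_sq / half_width ** 2 / ess_fraction)

print(prompts_needed())                  # DM: fresh eval prompts per policy (~680)
print(prompts_needed(ess_fraction=0.5))  # IPS log size at ~50% ESS (~1370)
```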