Diagnostics & Fixes
The 5 highest-leverage diagnostics with actionable remediation strategies
The goal is detection and remediation—not refusal. When an alert triggers, use the corresponding fix and re-run. These diagnostics map directly to the causal assumptions that make your estimates interpretable.
1. S-coverage
Applies to: DM, IPS, DR
Why it matters
You only calibrated the judge where you have labels. Extrapolating beyond this range is unreliable.
Quick check
Fraction of evaluation judge scores (S) falling outside your labeled S-range. Also check boundary slopes (flat edges indicate poor coverage).
Alert threshold
Out-of-range mass > ~5% or near-flat slope at edges of calibration curve
What to try next
- Add labels targeted to uncovered S bins (especially the high/low score regions)
- Keep the ranking view and run a sensitivity panel (report rankings across different extrapolation assumptions) until coverage improves
- Consider restricting the evaluation set to the well-covered S range
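The out-of-range mass can be computed directly from the two score arrays. A minimal sketch (function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def out_of_range_mass(eval_scores, labeled_scores):
    """Fraction of evaluation judge scores S that fall outside the
    [min, max] range of the labeled scores used for calibration."""
    lo, hi = np.min(labeled_scores), np.max(labeled_scores)
    eval_scores = np.asarray(eval_scores, dtype=float)
    return float(((eval_scores < lo) | (eval_scores > hi)).mean())

# Toy data: labels cover S in [0.2, 0.8]; 3 of 5 eval scores fall outside.
labeled = np.array([0.2, 0.3, 0.5, 0.7, 0.8])
evals = np.array([0.10, 0.25, 0.60, 0.85, 0.95])
mass = out_of_range_mass(evals, labeled)
print(f"out-of-range mass: {mass:.0%}")  # 60% -- far above the ~5% alert level
```

The boundary-slope check is separate: inspect the fitted calibration curve near min/max of the labeled range and flag near-zero derivatives.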
2. Reliability of AutoCal-R
Applies to: DM, IPS, DR
Why it matters
Ensures the S → R mapping accurately reflects ground-truth labels. Poor calibration means your estimates are systematically biased.
Quick check
Out-of-fold reliability curve and regional errors (low/mid/high S bins). Look for systematic over/under-prediction.
Alert threshold
Systematic miscalibration by region (e.g., predicted R consistently 0.1 higher than observed Y in low-S range)
What to try next
- Use two-stage AutoCal-R: fit separate calibration curves for different groups (e.g., short vs long prompts, or product queries vs support queries), then aggregate
- Gather a few more labels in regions with high error
- Check for judge drift or scoring inconsistencies
- Review the judge prompt to ensure it targets your KPI
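The regional-error check above can be sketched as a per-bin comparison of predicted R against observed oracle labels Y. The function name and bin edges below are illustrative; in practice, compute this out-of-fold:

```python
import numpy as np

def regional_calibration_error(s, y, r_hat, edges=(0.0, 1/3, 2/3, 1.0001)):
    """Mean signed error (predicted R minus observed Y) per S-region.
    Positive values mean the calibrated judge over-predicts in that bin."""
    s, y, r_hat = (np.asarray(a, dtype=float) for a in (s, y, r_hat))
    errors = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (s >= lo) & (s < hi)
        if mask.any():
            errors[f"[{lo:.2f}, {hi:.2f})"] = float(r_hat[mask].mean() - y[mask].mean())
    return errors

# Toy example: over-prediction in the low-S bin, under-prediction above it.
s = [0.10, 0.20, 0.50, 0.60, 0.90]
y = [0, 0, 1, 1, 1]                       # oracle labels
r_hat = [0.20, 0.20, 0.60, 0.60, 0.90]    # calibrated predictions
print(regional_calibration_error(s, y, r_hat))
```

A consistent offset of ~0.1 in one bin (as in the alert threshold above) shows up directly as that bin's entry.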
3. ESS fraction
Applies to: IPS, DR
Why it matters
Low effective sample size (ESS) means your estimates are dominated by a few high-weight samples, making them unstable and high-variance.
Quick check
Compare ESS/n for raw weights vs SIMCal-stabilized weights, where ESS = (∑w)² / ∑w².
Alert threshold
ESS/n < ~30% after SIMCal stabilization
What to try next
- Restrict cohort to better-overlapped subsets (e.g., filter by prompt length or type)
- Improve policy overlap: choose more similar policies or collect logs from target policy
- Switch to DR and strengthen the critic (better features, more training data)
- Check teacher-forcing coherence (tokenization, decode settings)
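The ESS/n formula is a one-liner; a minimal sketch (run it on both the raw and the SIMCal-stabilized weight arrays and compare; the function name is illustrative):

```python
import numpy as np

def ess_fraction(w):
    """ESS/n with ESS = (sum w)^2 / sum(w^2). Equals 1.0 for uniform
    weights and approaches 1/n when a single weight dominates."""
    w = np.asarray(w, dtype=float)
    return float(w.sum() ** 2 / ((w ** 2).sum() * len(w)))

print(ess_fraction(np.ones(100)))               # 1.0: perfectly uniform weights
print(ess_fraction(np.r_[np.ones(99), 100.0]))  # ~0.04: one sample dominates
```

Both example values are well clear of and well under the ~30% alert line, respectively, which is the contrast the diagnostic is designed to surface.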
4. Tail heaviness
Applies to: IPS, DR
Why it matters
Heavy-tailed weight distributions mean extreme weights dominate variance. Even a few very large weights can make estimates unreliable.
Quick check
Hill tail index on top 1-5% of importance weights. Also check max-weight share (fraction of total weight from top 1% of samples).
Alert threshold
Hill tail index < ~2.0 (heavy tails), or max-weight share > 20%
What to try next
- Same fixes as ESS: restrict cohort, improve overlap, or strengthen critic
- Review tokenization & teacher-forcing coherence (mismatches create spurious tails)
- Run trimming sensitivity analysis (e.g., clip top 1% weights and compare)
- Consider switching to DM if generating fresh outputs is feasible
5. Orthogonality test (DR only)
Applies to: DR
Why it matters
Confirms the first-order bias protection that makes DR doubly robust. If this test fails, DR loses that protection and is effectively relying on the weights or the critic alone.
Quick check
Weighted moment of (R - q̂) with confidence interval. Should be centered near zero.
Alert threshold
95% CI excludes zero, or consistently large deviation (>0.05 for binary outcomes)
What to try next
- Improve the critic: add features, tune regularization, increase training data
- Re-check SIMCal and overlap diagnostics (bad weights can break orthogonality)
- Verify cross-fitting is properly implemented (no data leakage)
- Fall back to IPS while you fix the critic, or switch to DM if feasible
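The weighted moment and its CI can be sketched as below. This is a minimal illustration (function name and toy data are hypothetical); use your cross-fitted q̂ and stabilized weights in practice:

```python
import numpy as np

def orthogonality_moment(w, r, q_hat, z=1.96):
    """Weighted residual moment m = mean(w * (R - q_hat)) with an
    approximate 95% normal CI. For a healthy DR setup the CI covers zero."""
    terms = np.asarray(w, float) * (np.asarray(r, float) - np.asarray(q_hat, float))
    m = float(terms.mean())
    se = float(terms.std(ddof=1) / np.sqrt(len(terms)))
    return m, m - z * se, m + z * se

# Toy check: small residuals uncorrelated with the weights.
rng = np.random.default_rng(1)
w = np.exp(rng.normal(0.0, 0.5, 1_000))
r = rng.normal(0.5, 0.1, 1_000)
q_hat = r - rng.normal(0.0, 0.05, 1_000)
m, lo, hi = orthogonality_moment(w, r, q_hat)
print(f"moment={m:+.4f}, 95% CI=({lo:+.4f}, {hi:+.4f})")
```

The alert fires when the CI excludes zero, or (for binary outcomes) when |m| stays above ~0.05 across re-fits.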
Diagnostic workflow
1. Run all applicable diagnostics after your initial estimate
2. Prioritize fixes: coverage and reliability first (they affect all modes), then ESS/tails (IPS/DR), then orthogonality (DR)
3. Apply the "what to try next" strategies for any alerts that fire
4. Re-run estimation with the fixes applied
5. If diagnostics still alert, consider switching modes or using sensitivity panels for communication
Optional visual helpers
- Overlap heatmap: Plot weights vs judge score and/or prompt length to visualize where overlap is weak
- Variance decomposition: Show CI width split between sampling variance and OUA
- Reliability plot: Predicted R vs observed Y across S bins with confidence bands
- Weight distribution: Histogram of stabilized weights to spot outliers