Diagnostics & Fixes
The 5 highest-leverage diagnostics with actionable remediation strategies
The goal is detection and remediation—not refusal. When an alert triggers, use the corresponding fix and re-run. These diagnostics map directly to the causal assumptions that make your estimates interpretable.
1. S-coverage
Applies to: DM, IPS, DR
Why it matters
You only calibrated the judge where you have labels. Extrapolating beyond this range is unreliable.
Quick check
Fraction of evaluation judge scores (S) falling outside your labeled S-range. Also check boundary slopes (flat edges indicate poor coverage).
Alert threshold
Out-of-range mass > ~5% or near-flat slope at edges of calibration curve
What to try next
- Add labels targeted to uncovered S bins (especially the high/low score regions)
- Keep the ranking view and run a sensitivity panel (report rankings across different extrapolation assumptions) until coverage improves
- Consider restricting the evaluation set to the well-covered S range
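The out-of-range mass can be computed directly from the two score arrays. A minimal sketch (function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def out_of_range_mass(eval_scores, labeled_scores):
    """Fraction of evaluation judge scores S that fall outside the
    [min, max] range of the labeled scores used for calibration."""
    lo, hi = np.min(labeled_scores), np.max(labeled_scores)
    eval_scores = np.asarray(eval_scores, dtype=float)
    return float(((eval_scores < lo) | (eval_scores > hi)).mean())

# Toy data: labels cover S in [0.2, 0.8]; 3 of 5 eval scores fall outside.
labeled = np.array([0.2, 0.3, 0.5, 0.7, 0.8])
evals = np.array([0.10, 0.25, 0.60, 0.85, 0.95])
mass = out_of_range_mass(evals, labeled)
print(f"out-of-range mass: {mass:.0%}")  # 60% -- far above the ~5% alert level
```

The boundary-slope check is separate: inspect the fitted calibration curve near min/max of the labeled range and flag near-zero derivatives.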
2. Reliability of AutoCal-R
Applies to: DM, IPS, DR
Why it matters
Ensures the S → R mapping accurately reflects ground-truth labels. Poor calibration means your estimates are systematically biased.
Quick check
Out-of-fold reliability curve and regional errors (low/mid/high S bins). Look for systematic over/under-prediction.
Alert threshold
Systematic miscalibration by region (e.g., predicted R consistently 0.1 higher than observed Y in low-S range)
What to try next
- Use two-stage AutoCal-R: fit separate calibration curves for different groups (e.g., short vs long prompts, or product queries vs support queries), then aggregate
- Gather a few more labels in regions with high error
- Check for judge drift or scoring inconsistencies
- Review the judge prompt to ensure it targets your KPI
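The regional-error check above can be sketched as a per-bin comparison of predicted R against observed oracle labels Y. The function name and bin edges below are illustrative; in practice, compute this out-of-fold:

```python
import numpy as np

def regional_calibration_error(s, y, r_hat, edges=(0.0, 1/3, 2/3, 1.0001)):
    """Mean signed error (predicted R minus observed Y) per S-region.
    Positive values mean the calibrated judge over-predicts in that bin."""
    s, y, r_hat = (np.asarray(a, dtype=float) for a in (s, y, r_hat))
    errors = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (s >= lo) & (s < hi)
        if mask.any():
            errors[f"[{lo:.2f}, {hi:.2f})"] = float(r_hat[mask].mean() - y[mask].mean())
    return errors

# Toy example: over-prediction in the low-S bin, under-prediction above it.
s = [0.10, 0.20, 0.50, 0.60, 0.90]
y = [0, 0, 1, 1, 1]                       # oracle labels
r_hat = [0.20, 0.20, 0.60, 0.60, 0.90]    # calibrated predictions
print(regional_calibration_error(s, y, r_hat))
```

A consistent offset of ~0.1 in one bin (as in the alert threshold above) shows up directly as that bin's entry.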
3. ESS fraction
Applies to: IPS, DR
Why it matters
Low effective sample size (ESS) means your estimates are dominated by a few high-weight samples, making them unstable and high-variance.
Quick check
Compare ESS/n for raw weights vs SIMCal-stabilized weights, where ESS = (∑w)² / ∑w².
Alert threshold
ESS/n < ~30% after SIMCal stabilization
What to try next
- Restrict cohort to better-overlapped subsets (e.g., filter by prompt length or type)
- Improve policy overlap: choose more similar policies or collect logs from target policy
- Switch to DR and strengthen the critic (better features, more training data)
- Check teacher-forcing coherence (tokenization, decode settings)
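The ESS/n formula is a one-liner; a minimal sketch (run it on both the raw and the SIMCal-stabilized weight arrays and compare; the function name is illustrative):

```python
import numpy as np

def ess_fraction(w):
    """ESS/n with ESS = (sum w)^2 / sum(w^2). Equals 1.0 for uniform
    weights and approaches 1/n when a single weight dominates."""
    w = np.asarray(w, dtype=float)
    return float(w.sum() ** 2 / ((w ** 2).sum() * len(w)))

print(ess_fraction(np.ones(100)))               # 1.0: perfectly uniform weights
print(ess_fraction(np.r_[np.ones(99), 100.0]))  # ~0.04: one sample dominates
```

Both example values are well clear of and well under the ~30% alert line, respectively, which is the contrast the diagnostic is designed to surface.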
4. Tail heaviness
Applies to: IPS, DR
Why it matters
Heavy-tailed weight distributions mean extreme weights dominate variance. Even a few very large weights can make estimates unreliable.
Quick check
Hill tail index on top 1-5% of importance weights. Also check max-weight share (fraction of total weight from top 1% of samples).
Alert threshold
Hill tail index < ~2.0 (heavy tails), or max-weight share > 20%
What to try next
- Same fixes as ESS: restrict cohort, improve overlap, or strengthen critic
- Review tokenization & teacher-forcing coherence (mismatches create spurious tails)
- Run trimming sensitivity analysis (e.g., clip top 1% weights and compare)
- Consider switching to DM if generating fresh outputs is feasible
5. Orthogonality test (DR only)
Applies to: DR
Why it matters
Confirms the first-order bias protection that makes DR doubly robust. If this test fails, DR loses that protection and is effectively relying on the weights or the critic alone.
Quick check
Weighted moment of (R - q̂) with confidence interval. Should be centered near zero.
Alert threshold
95% CI excludes zero, or consistently large deviation (>0.05 for binary outcomes)
What to try next
- Improve the critic: add features, tune regularization, increase training data
- Re-check SIMCal and overlap diagnostics (bad weights can break orthogonality)
- Verify cross-fitting is properly implemented (no data leakage)
- Fall back to IPS while you fix the critic, or switch to DM if feasible
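The weighted moment and its CI can be sketched as below. This is a minimal illustration (function name and toy data are hypothetical); use your cross-fitted q̂ and stabilized weights in practice:

```python
import numpy as np

def orthogonality_moment(w, r, q_hat, z=1.96):
    """Weighted residual moment m = mean(w * (R - q_hat)) with an
    approximate 95% normal CI. For a healthy DR setup the CI covers zero."""
    terms = np.asarray(w, float) * (np.asarray(r, float) - np.asarray(q_hat, float))
    m = float(terms.mean())
    se = float(terms.std(ddof=1) / np.sqrt(len(terms)))
    return m, m - z * se, m + z * se

# Toy check: small residuals uncorrelated with the weights.
rng = np.random.default_rng(1)
w = np.exp(rng.normal(0.0, 0.5, 1_000))
r = rng.normal(0.5, 0.1, 1_000)
q_hat = r - rng.normal(0.0, 0.05, 1_000)
m, lo, hi = orthogonality_moment(w, r, q_hat)
print(f"moment={m:+.4f}, 95% CI=({lo:+.4f}, {hi:+.4f})")
```

The alert fires when the CI excludes zero, or (for binary outcomes) when |m| stays above ~0.05 across re-fits.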
Diagnostic workflow
1. Run all applicable diagnostics after your initial estimate
2. Prioritize fixes: coverage and reliability first (they affect all modes), then ESS/tails (IPS/DR), then orthogonality (DR)
3. Apply the "what to try next" strategies for any alerts that fire
4. Re-run estimation with the fixes applied
5. If diagnostics still alert, consider switching modes or using sensitivity panels for communication
Optional visual helpers
- Overlap heatmap: Plot weights vs judge score and/or prompt length to visualize where overlap is weak
- Variance decomposition: Show CI width split between sampling variance and OUA
- Reliability plot: Predicted R vs observed Y across S bins with confidence bands
- Weight distribution: Histogram of stabilized weights to spot outliers