Assumptions for Causal Interpretation
When your estimates can be read as "what would happen if we shipped policy π"
These are the conditions under which your CJE estimates have a causal interpretation—meaning they predict what would actually happen if you deployed the candidate policy. Each assumption maps to specific diagnostics you can check.
Assumptions shared by all modes
These apply to DM, IPS, and DR
1. Surrogate adequacy
In plain English: Once you know the judge score (S), the average outcome depends only on S. The score is a sufficient summary for the mean: you don't need additional features to predict outcomes.
Why it matters: If the judge misses important drivers of your KPI (e.g., latency, formatting), your calibrated estimates will be systematically biased.
What you do:
- Fit AutoCal-R and inspect the reliability plot
- If regional errors persist, include a small index (e.g., prompt family or length)
- Gather a few more labels in regions with poor fit
→ Checked by: Reliability diagnostic
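A rough, illustrative version of this check, using scikit-learn's isotonic regression as a stand-in for the calibrator (function and variable names here are assumptions, not the CJE API): fit a monotone map from judge scores to oracle labels, then look at the mean residual in each score bin. Residuals that are systematically non-zero in some region suggest S alone is not an adequate summary there.

```python
# Illustrative reliability check; isotonic regression stands in for the calibrator.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def reliability_by_bin(judge_scores, oracle_labels, n_bins=10):
    """Mean calibration residual per score bin; a large |mean residual|
    flags a region where the score alone does not predict the outcome."""
    s = np.asarray(judge_scores, dtype=float)
    y = np.asarray(oracle_labels, dtype=float)
    iso = IsotonicRegression(out_of_bounds="clip").fit(s, y)
    residuals = y - iso.predict(s)
    edges = np.quantile(s, np.linspace(0, 1, n_bins + 1))
    bin_idx = np.clip(np.digitize(s, edges[1:-1]), 0, n_bins - 1)
    return [(edges[b], edges[b + 1], residuals[bin_idx == b].mean())
            for b in range(n_bins) if (bin_idx == b).any()]
```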
2. Transport at fixed S
In plain English: The mapping from S to outcome learned on the oracle slice still holds where you evaluate. A score of 8 means the same thing on your eval set as it did on your labeled data.
Why it matters: If the judge's meaning drifts (different time period, different prompts), your calibration won't transfer correctly.
What you do:
- Check S-coverage: is your eval S-range within your labeled S-range?
- Check boundary slopes (flat edges indicate poor coverage)
- If coverage is thin, add labels targeted at missing S bins
→ Checked by: S-coverage diagnostic
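The coverage check above can be approximated in a few lines (names are illustrative, not the CJE API): report how much of the evaluation score range falls outside the labeled range, plus label counts per evaluation score bin so you know where to add labels.

```python
# Illustrative S-coverage check.
import numpy as np

def s_coverage(labeled_scores, eval_scores, n_bins=10):
    s_lab = np.asarray(labeled_scores, dtype=float)
    s_eval = np.asarray(eval_scores, dtype=float)
    frac_outside = np.mean((s_eval < s_lab.min()) | (s_eval > s_lab.max()))
    # Label counts over the eval score range: near-empty bins are where to add labels.
    edges = np.linspace(s_eval.min(), s_eval.max(), n_bins + 1)
    labels_per_bin, _ = np.histogram(s_lab, bins=edges)
    return {"frac_eval_outside_labeled_range": float(frac_outside),
            "labels_per_eval_bin": labels_per_bin.tolist()}
```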
3. Score stability
In plain English: The judge means the same thing across time, versions, and slices. You're not comparing apples (old judge) to oranges (new judge).
Why it matters: Judge drift invalidates calibration. If the judge's scoring function changes, your S → R mapping breaks.
What you do:
- Run simple drift checks: rank stability on a small anchor set over time
- Use consistent judge configs (same model, temperature, prompt)
- Freeze judge version during an evaluation window
- Re-calibrate if you change the judge
→ Checked by: Rank correlation on anchor set, version pinning
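A minimal version of the anchor-set drift check (the threshold and names are assumptions): score the same frozen anchor prompts at two points in time and compare the rankings with Spearman's rank correlation.

```python
# Illustrative drift check on a fixed anchor set.
from scipy.stats import spearmanr

def rank_stability(anchor_scores_before, anchor_scores_after, min_rho=0.95):
    rho, _ = spearmanr(anchor_scores_before, anchor_scores_after)
    # A large drop in rank correlation suggests the judge's meaning has drifted.
    return {"spearman_rho": float(rho), "stable": rho >= min_rho}
```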
DM-specific assumptions
Shared prompts
In plain English: Policies are evaluated on the same prompt set. Paired design is preferred—each prompt gets scored under both π₀ and π′.
Why it matters: Comparing different prompt sets introduces selection bias. DM needs apples-to-apples comparison.
→ Enforced by: Experimental design (use same prompts for all policies)
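A small sketch of why the shared-prompt, paired design helps (names are hypothetical): differencing per prompt removes prompt-level variation before the confidence interval is formed.

```python
# Illustrative paired comparison on a shared prompt set.
import numpy as np

def paired_difference(reward_candidate, reward_baseline, z=1.96):
    """reward_candidate[i] and reward_baseline[i] are calibrated rewards
    for the same prompt i under the candidate and baseline policies."""
    d = np.asarray(reward_candidate, dtype=float) - np.asarray(reward_baseline, dtype=float)
    mean = d.mean()
    se = d.std(ddof=1) / np.sqrt(len(d))
    return mean, (mean - z * se, mean + z * se)
```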
Fresh draws and judge availability
In plain English: You can generate outputs under each policy and score them with your judge.
Why it matters: DM requires generating new data—can't use pre-existing logs alone.
→ Enforced by: Workflow (if you can't generate, use IPS/DR instead)
What certifies DM validity
Good AutoCal-R reliability + adequate S-coverage
If not: add labels where missing; consider narrowing the prompt set to the target use case.
→ See full DM guide
IPS-specific assumptions
Overlap / positivity
In plain English: The logger (π₀) produced outputs that have non-negligible probability under the candidate policy (π′). There's enough overlap to reweight.
Why it matters: Without overlap, importance weights explode. You can't learn about π′ from data that π′ would never produce.
Check:
- ESS fraction: effective sample size after SIMCal
- Tail index: heavy tails indicate poor overlap
- Overlap heatmaps (weights vs S or prompt features)
→ Checked by: ESS and Tail diagnostics
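The ESS fraction is easy to compute directly from the stabilized importance weights; this is an illustrative sketch rather than the CJE implementation.

```python
# Illustrative ESS-fraction check from importance weights.
import numpy as np

def ess_fraction(weights):
    w = np.asarray(weights, dtype=float)
    ess = w.sum() ** 2 / np.sum(w ** 2)
    # 1.0 means all samples contribute equally; values near 0 mean a few dominate.
    return float(ess / len(w))
```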
Teacher-forcing coherence
In plain English: The per-token log-probabilities used for weights match the model used for generation. No hidden decode tricks or tokenization mismatches.
Why it matters: Incoherent likelihoods create spurious weights and invalidate IPS.
Check:
- Provider additivity checks (sum of token logprobs = sequence logprob)
- Stable teacher-forcing configs across runs
- Verify tokenization is deterministic
→ Checked by: Additivity tests, reproducibility checks
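A minimal additivity check, assuming your provider returns both per-token log-probabilities and a sequence-level log-probability (the tolerance is an assumption):

```python
# Illustrative additivity check for teacher-forced log-probabilities.
import math

def check_additivity(token_logprobs, sequence_logprob, tol=1e-3):
    total = sum(token_logprobs)
    # A mismatch suggests a tokenization or decoding inconsistency.
    return math.isclose(total, sequence_logprob, abs_tol=tol), total - sequence_logprob
```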
Tail control
In plain English: Weights should not be dominated by a few extreme samples. Even after SIMCal, you want many samples contributing to the estimate.
Why it matters: A few outlier weights can make variance explode and estimates unreliable.
→ Checked by: Tail heaviness diagnostic
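One common way to quantify tail heaviness is a Hill estimate of the weight distribution's tail index; this is a sketch of that idea, not necessarily the exact diagnostic CJE computes. As a rule of thumb, an index below about 2 corresponds to infinite weight variance.

```python
# Illustrative Hill estimator for the tail index of (strictly positive) importance weights.
import numpy as np

def hill_tail_index(weights, k=None):
    w = np.sort(np.asarray(weights, dtype=float))[::-1]  # descending order
    k = k or max(10, len(w) // 20)                        # use the top ~5% of weights by default
    top = w[: k + 1]
    return float(1.0 / np.mean(np.log(top[:-1] / top[k])))
```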
What certifies IPS validity
Healthy ESS/tails after SIMCal + reliable AutoCal-R
If not: restrict the cohort to a region with better overlap; gather more diverse logs; or switch to DR.
→ See full IPS guide
DR-specific assumptions
Either a decent critic OR healthy weights
In plain English: DR is valid if either the critic (outcome model) approximates E[R|X,A] or the stabilized weights satisfy IPS assumptions. You don't need both—one is enough.
Why it matters: This is the "doubly robust" guarantee. Even if your critic is mediocre, good weights can save you (and vice versa).
Check:
- Orthogonality test: the weighted mean of (R - q̂) should be near zero, with a confidence interval that covers zero
- Plus all IPS checks (ESS, tails)
→ Checked by: Orthogonality diagnostic
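The orthogonality test above is simple to compute (names are hypothetical, not the CJE API): the weighted mean of the critic residuals should have a confidence interval that covers zero.

```python
# Illustrative orthogonality test for DR.
import numpy as np

def orthogonality_test(weights, rewards, q_hat, z=1.96):
    w = np.asarray(weights, dtype=float)
    resid = np.asarray(rewards, dtype=float) - np.asarray(q_hat, dtype=float)
    terms = w * resid
    mean = terms.mean()
    se = terms.std(ddof=1) / np.sqrt(len(terms))
    return {"estimate": float(mean),
            "ci": (float(mean - z * se), float(mean + z * se)),
            "covers_zero": abs(mean) <= z * se}
```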
Cross-fitting / product-rate
In plain English: Use fold-honest training so nuisance learning (fitting the critic) doesn't bias inference. Train on one split, evaluate on another.
Why it matters: Overfitting the critic to the same data you evaluate on creates bias that breaks DR's guarantees.
→ Enforced by: Workflow (use cross-fitting in your implementation)
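A bare-bones cross-fitting loop, assuming a scikit-learn style critic (the regressor and feature matrix are placeholders): each sample's q̂ is predicted by a model that never saw that sample during training.

```python
# Illustrative cross-fitting for the critic (outcome model).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingRegressor

def cross_fit_qhat(X, R, n_splits=5, seed=0):
    """X: feature array (n x d); R: rewards (n,). Returns out-of-fold predictions."""
    q_hat = np.empty(len(R), dtype=float)
    for train_idx, eval_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        critic = GradientBoostingRegressor().fit(X[train_idx], R[train_idx])
        q_hat[eval_idx] = critic.predict(X[eval_idx])
    return q_hat
```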
What certifies DR validity
Orthogonality CI covering zero (after SIMCal) + routine reliability/coverage on AutoCal-R
If not: improve critic and/or weight quality; re-run orthogonality test.
→ See full DR guide
Summary: Assumptions → Diagnostics → Fixes
Every assumption has a corresponding diagnostic you can check, and every diagnostic has actionable fixes. The workflow is:
1. Understand which assumptions apply to your chosen mode (DM/IPS/DR)
2. Run the corresponding diagnostics after estimation
3. If alerts fire, apply the fixes from the diagnostics table
4. Re-run estimation and verify diagnostics pass
5. Report estimates with confidence that they're causally interpretable