
Assumptions for Causal Interpretation

When your estimates can be read as "what would happen if we shipped policy π"

These are the conditions under which your CJE estimates have a causal interpretation—meaning they predict what would actually happen if you deployed the candidate policy. Each assumption maps to specific diagnostics you can check.

Assumptions shared by all modes

These apply to DM, IPS, and DR

1. Surrogate adequacy

In plain English: Once you know the judge score (S), the average outcome depends primarily on S. The score is a sufficient summary for the mean—you don't need additional features to predict outcomes.

Why it matters: If the judge misses important drivers of your KPI (e.g., latency, formatting), your calibrated estimates will be systematically biased.

What you do:

  • Fit AutoCal-R and inspect the reliability plot
  • If regional errors persist, include a small index (e.g., prompt family or length)
  • Gather a few more labels in regions with poor fit

→ Checked by: Reliability diagnostic
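A minimal sketch of this kind of reliability check, using isotonic regression as a stand-in for AutoCal-R (the function and its names are illustrative, not the CJE API):

```python
# Minimal reliability-check sketch (illustrative; not the CJE library API).
# Assumes you have judge scores S and oracle outcomes Y for the labeled slice.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def reliability_table(S, Y, n_bins=10):
    """Fit a monotone S -> outcome map and compare it to binned outcome means."""
    S, Y = np.asarray(S, float), np.asarray(Y, float)
    iso = IsotonicRegression(out_of_bounds="clip").fit(S, Y)  # stand-in calibrator
    edges = np.unique(np.quantile(S, np.linspace(0, 1, n_bins + 1)))
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (S >= lo) & (S <= hi)
        if mask.sum() == 0:
            continue
        rows.append({
            "bin": (round(float(lo), 3), round(float(hi), 3)),
            "n": int(mask.sum()),
            "mean_outcome": float(Y[mask].mean()),                   # what actually happened
            "mean_calibrated": float(iso.predict(S[mask]).mean()),   # what the fit predicts
        })
    # Large gaps between the two means in a bin are the "regional errors" above.
    return rows
```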

2. Transport at fixed S

In plain English: The mapping from S to outcome learned on the oracle slice still holds where you evaluate. A score of 8 means the same thing on your eval set as it did on your labeled data.

Why it matters: If the judge's meaning drifts (different time period, different prompts), your calibration won't transfer correctly.

What you do:

  • Check S-coverage: is your eval S-range within your labeled S-range?
  • Check boundary slopes (flat edges indicate poor coverage)
  • If coverage is thin, add labels targeted at missing S bins

→ Checked by: S-coverage diagnostic
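A minimal S-coverage sketch (illustrative, not the CJE diagnostic): compare the eval set's S values against the labeled slice's range and per-bin counts:

```python
# Illustrative S-coverage check: is the eval set's S-range covered by the
# labeled (oracle) slice, and are any labeled S bins too thin to trust?
import numpy as np

def s_coverage(S_labeled, S_eval, n_bins=10, min_labels_per_bin=5):
    S_labeled, S_eval = np.asarray(S_labeled, float), np.asarray(S_eval, float)
    in_range = (S_eval >= S_labeled.min()) & (S_eval <= S_labeled.max())
    edges = np.unique(np.quantile(S_labeled, np.linspace(0, 1, n_bins + 1)))
    labeled_counts, _ = np.histogram(S_labeled, bins=edges)
    eval_counts, _ = np.histogram(S_eval, bins=edges)
    thin_bins = [i for i, (c_lab, c_ev) in enumerate(zip(labeled_counts, eval_counts))
                 if c_lab < min_labels_per_bin and c_ev > 0]  # threshold is a judgment call
    return {
        "frac_eval_in_labeled_range": float(in_range.mean()),
        "thin_bins": thin_bins,  # add labels targeted at these S regions
    }
```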

3. Score stability

In plain English: The judge means the same thing across time, versions, and slices. You're not comparing apples (old judge) to oranges (new judge).

Why it matters: Judge drift invalidates calibration. If the judge's scoring function changes, your S → R mapping breaks.

What you do:

  • Run simple drift checks: rank stability on a small anchor set over time
  • Use consistent judge configs (same model, temperature, prompt)
  • Freeze judge version during an evaluation window
  • Re-calibrate if you change the judge

→ Checked by: Rank correlation on anchor set, version pinning
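A minimal drift check, assuming you keep a small fixed anchor set and re-score it periodically (the threshold below is a judgment call, not a CJE default):

```python
# Illustrative drift check: rank stability of judge scores on a fixed anchor
# set, scored at two points in time (or under two judge configs).
from scipy.stats import spearmanr

def judge_drift_check(scores_then, scores_now, threshold=0.9):
    rho, _ = spearmanr(scores_then, scores_now)
    return {"spearman_rho": float(rho), "drift_suspected": rho < threshold}
```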

DM-specific assumptions

Shared prompts

In plain English: Policies are evaluated on the same prompt set. Paired design is preferred—each prompt gets scored under both π₀ and π′.

Why it matters: Comparing different prompt sets introduces selection bias. DM needs apples-to-apples comparison.

→ Enforced by: Experimental design (use same prompts for all policies)
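As a sketch of why the paired design helps (illustrative, not the CJE estimator): differencing calibrated rewards per prompt before averaging removes prompt-level variance from the comparison:

```python
# Illustrative paired comparison on a shared prompt set. R_base and R_cand are
# calibrated rewards for pi0 and pi' on the same prompts, in the same order.
import numpy as np

def paired_dm_difference(R_base, R_cand):
    d = np.asarray(R_cand, float) - np.asarray(R_base, float)  # per-prompt difference
    se = d.std(ddof=1) / np.sqrt(len(d))                        # paired standard error
    est = d.mean()
    return {"estimate": float(est), "ci95": (float(est - 1.96 * se), float(est + 1.96 * se))}
```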

Fresh draws and judge availability

In plain English: You can generate outputs under each policy and score them with your judge.

Why it matters: DM requires generating new data—can't use pre-existing logs alone.

→ Enforced by: Workflow (if you can't generate, use IPS/DR instead)

What certifies DM validity

Good AutoCal-R reliability + adequate S-coverage

If not: add labels where missing; consider narrowing the prompt set to the target use case.

→ See full DM guide

IPS-specific assumptions

Overlap / positivity

In plain English: The logger (π₀) produced outputs that have non-negligible probability under the candidate policy (π′). There's enough overlap to reweight.

Why it matters: Without overlap, importance weights explode. You can't learn about π′ from data that π′ would never produce.

Check:

  • ESS fraction: effective sample size (as a fraction of n) after SIMCal
  • Tail index: heavy tails indicate poor overlap
  • Overlap heatmaps (weights vs S or prompt features)

→ Checked by: ESS and Tail diagnostics
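As a rough illustration of the first check (not the CJE API), the ESS fraction of the calibrated weights can be computed directly:

```python
# Illustrative overlap diagnostic: ESS fraction of the importance weights
# (w should be the weights after SIMCal). 1.0 means all samples contribute
# equally; values near 0 mean a few weights dominate the estimate.
import numpy as np

def ess_fraction(w):
    w = np.asarray(w, float)
    return float((w.sum() ** 2) / (len(w) * (w ** 2).sum()))
```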

Teacher-forcing coherence

In plain English: The per-token log-probabilities used for weights match the model used for generation. No hidden decode tricks or tokenization mismatches.

Why it matters: Incoherent likelihoods create spurious weights and invalidate IPS.

Check:

  • Provider additivity checks (sum of token logprobs = sequence logprob)
  • Stable teacher-forcing configs across runs
  • Verify tokenization is deterministic

→ Checked by: Additivity tests, reproducibility checks
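A minimal version of the additivity check, assuming the provider returns both per-token and sequence log-probabilities (field names vary by provider):

```python
# Illustrative additivity check: the sequence log-probability should equal the
# sum of its per-token log-probabilities, up to numerical tolerance.
import math

def check_additivity(token_logprobs, sequence_logprob, tol=1e-3):
    total = sum(token_logprobs)
    return {
        "sum_token_logprobs": total,
        "sequence_logprob": sequence_logprob,
        "coherent": math.isclose(total, sequence_logprob, abs_tol=tol),
    }
```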

Tail control

In plain English: Weights should not be dominated by a few extreme samples. Even after SIMCal, you want many samples contributing to the estimate.

Why it matters: A few outlier weights can make variance explode and estimates unreliable.

→ Checked by: Tail heaviness diagnostic
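One common way to quantify tail heaviness, shown here as an illustrative sketch rather than the CJE diagnostic itself, is a Hill estimator on the largest weights:

```python
# Illustrative tail-heaviness check: Hill estimator on the top importance
# weights. A smaller index means a heavier tail; an index below 2 suggests
# the weight distribution effectively has infinite variance.
import numpy as np

def hill_tail_index(w, k=None):
    w = np.asarray(w, float)
    w = np.sort(w[w > 0])[::-1]              # positive weights, largest first
    k = k or max(10, int(0.05 * len(w)))     # number of top order statistics (judgment call)
    top = w[:k + 1]
    return float(k / np.log(top[:-1] / top[-1]).sum())
```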

What certifies IPS validity

Healthy ESS/tails after SIMCal + reliable AutoCal-R

If not: restrict cohort to better overlap; gather more diverse logs; or switch to DR.

→ See full IPS guide

DR-specific assumptions

Either a decent critic OR healthy weights

In plain English: DR is valid if either the critic (outcome model) approximates E[R|X,A] or the stabilized weights satisfy IPS assumptions. You don't need both—one is enough.

Why it matters: This is the "doubly robust" guarantee. Even if your critic is mediocre, good weights can save you (and vice versa).

Check:

  • Orthogonality test: the weighted mean of (R - q̂) should be near zero, with a CI covering zero
  • Plus all IPS checks (ESS, tails)

→ Checked by: Orthogonality diagnostic
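A minimal sketch of the orthogonality test (illustrative names, not the CJE API): compute the weight-weighted mean of the critic residuals and a normal-approximation CI:

```python
# Illustrative orthogonality test: the mean of w * (R - q_hat) should be near
# zero, with a confidence interval that covers zero.
import numpy as np

def orthogonality_test(w, R, q_hat):
    w, R, q_hat = (np.asarray(x, float) for x in (w, R, q_hat))
    contrib = w * (R - q_hat)                 # per-sample contribution
    est = contrib.mean()
    se = contrib.std(ddof=1) / np.sqrt(len(contrib))
    return {"estimate": float(est), "ci95": (float(est - 1.96 * se), float(est + 1.96 * se))}
# If the CI excludes zero, improve the critic and/or the weights before trusting DR.
```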

Cross-fitting / product-rate

In plain English: Use fold-honest training so nuisance learning (fitting the critic) doesn't bias inference. Train on one split, evaluate on another.

Why it matters: Overfitting the critic to the same data you evaluate on creates bias that breaks DR's guarantees.

→ Enforced by: Workflow (use cross-fitting in your implementation)
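A minimal cross-fitting sketch, assuming scikit-learn and a gradient-boosting critic as stand-ins (the actual implementation and critic may differ):

```python
# Illustrative cross-fitting: fit the critic on one fold and predict on the
# held-out fold, so no sample is scored by a model trained on it.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingRegressor  # stand-in critic

def cross_fit_critic(X, R, n_splits=5, seed=0):
    X, R = np.asarray(X, float), np.asarray(R, float)
    q_hat = np.empty(len(R), dtype=float)
    for train_idx, test_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        critic = GradientBoostingRegressor().fit(X[train_idx], R[train_idx])
        q_hat[test_idx] = critic.predict(X[test_idx])    # out-of-fold predictions only
    return q_hat  # feed these into the DR estimator and the orthogonality test
```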

What certifies DR validity

Orthogonality CI covering zero (after SIMCal) + routine reliability/coverage on AutoCal-R

If not: improve critic and/or weight quality; re-run orthogonality test.

→ See full DR guide

Summary: Assumptions → Diagnostics → Fixes

Every assumption has a corresponding diagnostic you can check, and every diagnostic has actionable fixes. The workflow is:

  1. Understand which assumptions apply to your chosen mode (DM/IPS/DR)
  2. Run the corresponding diagnostics after estimation
  3. If alerts fire, apply the fixes from the diagnostics table
  4. Re-run estimation and verify diagnostics pass
  5. Report estimates with confidence that they're causally interpretable