Quickstart (7 steps)
- Fix your prompt set & candidate policies. Define what you're comparing—baseline vs new model, old vs new prompt, etc.
- Collect a small labeled slice. Sample 1-5% of your data randomly, label with ground-truth KPI outcomes (Y).
- Fit AutoCal-R and audit reliability. Learn the judge score → KPI mapping, then check the reliability plot and coverage (a code sketch follows these steps).
- Pick your mode. Can you generate fresh outputs for all policies on the same prompts? → DM. Otherwise → IPS (or DR if you can train a critic).
- If IPS/DR: run SIMCal. Stabilize importance weights using monotone-in-judge-score projection.
- Estimate and add OUA. Compute your point estimate and add outcome uncertainty adjustment for honest CIs.
- Check diagnostics. Review coverage, reliability, ESS, tails, and orthogonality (DR only). Apply fixes if alerts fire.
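For quickstart steps 2-3, here is a minimal sketch of fitting the judge-score → KPI mapping on a labeled slice. It uses scikit-learn's isotonic regression as a stand-in for AutoCal-R (whose actual fitting and reliability audit are richer), and the array names and data are purely illustrative.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Labeled slice: judge scores S and ground-truth KPI outcomes Y (illustrative data).
judge_scores = np.array([0.12, 0.35, 0.40, 0.55, 0.61, 0.72, 0.80, 0.91])
kpi_labels   = np.array([0,    0,    1,    0,    1,    1,    1,    1])

# Monotone map S -> calibrated reward R (stand-in for AutoCal-R's f).
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(judge_scores, kpi_labels)

# Crude reliability audit: predicted vs. observed KPI rate in two score bins.
median = np.median(judge_scores)
for name, mask in [("low S", judge_scores <= median), ("high S", judge_scores > median)]:
    pred = calibrator.predict(judge_scores[mask]).mean()
    obs = kpi_labels[mask].mean()
    print(f"{name}: predicted={pred:.2f}, observed={obs:.2f}")
```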
Mode selection
Can you generate fresh outputs for all policies on the same prompts?
✓ Yes → Use Direct Method (DM)
This is the simplest, most reliable path: it pairs each prompt across policies for a clean comparison.
Do you have a judged log with teacher-forcing ratios?
✓ Yes → Use Calibrated IPS (with SIMCal)
Reweight logged data using likelihood ratios. Check ESS and tail diagnostics after SIMCal.
Is overlap imperfect and can you train a critic?
✓ Yes → Use Calibrated DR
Hedge IPS with an outcome model for doubly robust estimation. Verify that the orthogonality test passes.
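If it helps to encode this decision tree, here is a small illustrative helper (not part of any library; the function and flag names are made up):

```python
def choose_mode(can_generate_fresh: bool,
                has_teacher_forcing_ratios: bool,
                can_train_critic: bool) -> str:
    """Map the three questions above to an estimation mode."""
    if can_generate_fresh:
        return "DM"    # simplest path: fresh outputs, prompts paired across policies
    if has_teacher_forcing_ratios and can_train_critic:
        return "DR"    # imperfect overlap, but a critic/outcome model is available
    if has_teacher_forcing_ratios:
        return "IPS"   # judged log + likelihood ratios, stabilized with SIMCal
    raise ValueError("Need either fresh generation or a judged log with teacher-forcing ratios.")
```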
Recipe cards
Direct Method (DM)
1. Generate fresh outputs for each policy on the same prompt set
2. Score all outputs with your judge → judge scores S
3. Map S to calibrated rewards R using AutoCal-R: R = f(S)
4. Compute the mean: V̂(π) = (1/n) ∑ Rᵢ
5. Add OUA to your confidence interval (steps 4-5 are sketched in code below)
6. Report: point estimate, 95% CI, OUA share, reliability plot, S-coverage
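A sketch of steps 4-5, reusing the `calibrator` from the quickstart sketch. The OUA term is approximated here by bootstrapping the calibrator fit over the labeled slice; the exact OUA computation in AutoCal-R may differ.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def dm_estimate(eval_scores, calibrator):
    """Direct Method point estimate and sampling variance for one policy."""
    rewards = calibrator.predict(eval_scores)
    return rewards.mean(), rewards.var(ddof=1) / len(rewards)

def oua_variance(eval_scores, lab_scores, lab_kpi, n_boot=200, seed=0):
    """Outcome-uncertainty term: bootstrap the calibrator fit on the labeled slice."""
    rng = np.random.default_rng(seed)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(lab_scores), size=len(lab_scores))
        cal_b = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        cal_b.fit(lab_scores[idx], lab_kpi[idx])
        boots.append(cal_b.predict(eval_scores).mean())
    return np.var(boots, ddof=1)

# Usage (eval_scores = judge scores on fresh outputs of one policy):
# v_hat, s_var = dm_estimate(eval_scores, calibrator)
# half_width = 1.96 * np.sqrt(s_var + oua_variance(eval_scores, judge_scores, kpi_labels))
# print(f"V_hat = {v_hat:.3f} ± {half_width:.3f}")
```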
Example: E-commerce chatbot
You test a new "concise response" prompt vs your GPT-4 baseline:
- Sample 500 conversations → label with actual purchases (KPI)
- Calibrate judge scores to purchase probability
- Run both policies on 2000 eval prompts
- Baseline: 23% purchase rate [21%, 25%]
- New prompt: 26% purchase rate [24%, 28%]
- Decision: Ship (+3 pp lift is significant)
Calibrated IPS
1. Fit AutoCal-R on your labeled slice: S → R
2. Compute raw importance weights via teacher forcing: W = pπ'(A|X) / pπ₀(A|X)
3. Run SIMCal to stabilize weights (monotone in judge score, mean-preserving)
4. Compute the weighted mean: V̂IPS = (1/n) ∑ Wᶜᵃˡᵢ Rᵢ (steps 3-4 are sketched in code below)
5. Add OUA to your CI
6. Report: point estimate, 95% CI, OUA share, ESS fraction, max-weight share, tail index, overlap heatmap
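A sketch of steps 3-4. The SIMCal step is approximated by an isotonic (monotone-in-judge-score) projection of the raw weights followed by a mean-preserving rescale; the real SIMCal procedure (direction selection, variance control, stacking) is more involved, so treat this as the core idea only.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def simcal_like_weights(raw_weights, judge_scores):
    """Project raw importance weights onto a monotone function of the judge score,
    then rescale so the mean weight is preserved (rough stand-in for SIMCal)."""
    iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
    iso.fit(judge_scores, raw_weights)
    w = iso.predict(judge_scores)
    return w * raw_weights.mean() / w.mean()

def calibrated_ips(rewards, weights):
    """V_hat_IPS = (1/n) * sum(W_cal_i * R_i), plus weight diagnostics."""
    n = len(rewards)
    v_hat = np.mean(weights * rewards)
    ess_fraction = weights.sum() ** 2 / (n * np.sum(weights ** 2))
    max_weight_share = weights.max() / weights.sum()
    return v_hat, ess_fraction, max_weight_share
```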
Calibrated DR
1. Fit AutoCal-R on the labeled slice: S → R
2. Train outcome models (cross-fit): μπ'(X) for the target policy, q(X,A) for the logging policy
3. Compute raw weights via teacher forcing, then stabilize with SIMCal
4. Compute the DR estimate: V̂DR = (1/n) ∑ [μπ'(Xᵢ) + Wᶜᵃˡᵢ (Rᵢ - q(Xᵢ,Aᵢ))]
5. Add OUA to your CI
6. Run the orthogonality test: check that the importance-weighted mean of the residuals (R - q̂) has a CI that covers zero (steps 4 and 6 are sketched in code below)
7. Report: point estimate, 95% CI, OUA share, ESS, orthogonality test result
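A sketch of steps 4 and 6, assuming the cross-fit outcome-model predictions and the SIMCal-stabilized weights are already computed (array names are illustrative). OUA still needs to be added on top of the sampling-only CI shown here.

```python
import numpy as np

def calibrated_dr(mu_target, q_logged, rewards, weights):
    """V_hat_DR = (1/n) * sum( mu_pi'(X_i) + W_cal_i * (R_i - q(X_i, A_i)) ).
    mu_target: cross-fit predictions of the outcome under the target policy,
    q_logged:  cross-fit predictions for the logged (X, A) pairs."""
    contrib = mu_target + weights * (rewards - q_logged)
    n = len(contrib)
    v_hat = contrib.mean()
    se = contrib.std(ddof=1) / np.sqrt(n)   # sampling term only; add OUA separately
    return v_hat, (v_hat - 1.96 * se, v_hat + 1.96 * se)

def orthogonality_check(rewards, q_logged, weights):
    """Weighted mean of the residuals (R - q_hat) with a 95% CI;
    the test passes when the CI covers zero."""
    terms = weights * (rewards - q_logged)
    mean = terms.mean()
    se = terms.std(ddof=1) / np.sqrt(len(terms))
    return mean, (mean - 1.96 * se, mean + 1.96 * se)
```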
Sample-size & label planning
Rules of thumb
- When CIs are dominated by OUA: Adding more labels helps most. Doubling your labeled sample halves the OUA variance, shrinking that CI component by ~√2.
- When CIs are dominated by sampling variance: Adding more eval prompts (or more samples per prompt) helps most. This is typical for DM once you have ~500-1000 labels.
- For IPS/DR: If ESS is low (<30% after SIMCal), prioritize improving overlap (restrict the cohort, choose policies with better overlap) before adding more data.
Quick calculator
For a binary KPI with base rate p ≈ 0.2 and desired CI width of ±0.03:
- Labels needed: ~200-400 for stable AutoCal-R
- Eval prompts needed (DM): ~1000-2000 per policy
- Log size needed (IPS): ~2000-5000 samples (assuming moderate ESS ~50%)
These scale roughly as σ²/δ², where σ² is the outcome variance and δ is your target half-width.
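A back-of-the-envelope planner matching that scaling, for a binary KPI (so σ² = p(1−p)). Treat the output as a lower bound: it ignores OUA, calibration error, and any gains from pairing prompts across policies.

```python
import math

def prompts_needed(p=0.2, half_width=0.03, z=1.96, ess_fraction=1.0):
    """n ≈ z² · σ² / δ² with σ² = p(1-p); divide by the ESS fraction for IPS/DR logs."""
    sigma_sq = p * (1 - p)
    return math.ceil(z ** 2 * sigma_sq / half_width ** 2 / ess_fraction)

print(prompts_needed())                  # DM: fresh eval prompts per policy (~680)
print(prompts_needed(ess_fraction=0.5))  # IPS log size at ~50% ESS (~1370)
```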