CIMO Labs

CJE Standards

Reference implementation standards for reproducible off-policy evaluation.

Fold Construction

We use K=5 folds by default with deterministic assignment for reproducibility:

F(x_id) = hash(x_id) mod 5

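All modules (AutoCal-R, SIMCal-W, DR nuisances) share identical out-of-fold (OOF) boundaries. Oracle folds are derived by restricting the fold assignment F to the oracle-labeled samples (L_i = 1), so an oracle sample keeps the fold it has everywhere else.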
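The fold rule above can be sketched in a few lines. This is an illustration only: the page does not name the hash function, so SHA-256 is an assumption here (Python's built-in hash() is salted per process for strings and would not be reproducible), and fold_id and oracle_folds are hypothetical helper names.

```python
import hashlib

def fold_id(x_id: str, k: int = 5) -> int:
    """Deterministic fold for a sample id: stable digest mod k.

    Assumption: SHA-256 stands in for the unspecified hash in F(x_id).
    """
    digest = hashlib.sha256(x_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % k

def oracle_folds(ids, has_oracle_label, k: int = 5):
    """Restrict the shared fold assignment to oracle-labeled samples (L_i = 1)."""
    folds = {f: [] for f in range(k)}
    for x_id, labeled in zip(ids, has_oracle_label):
        if labeled:
            folds[fold_id(x_id, k)].append(x_id)
    return folds
```

Because every module calls the same function on the same id, the fold boundaries agree without any shared state or random seed.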

Teacher Forcing Contract

Requirements

  • Single-call, chat-native TF API
  • Returns per-token and summed log p(A|X)
  • Fixed template, tokenizer, and snapshot
  • Deterministic (bit-identical for same inputs)

Additivity Check

Verify TF consistency with this test:

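# TF must satisfy the additivity invariant:
#   log p(X+A) = log p(X) + log p(A|X)
# lp is the TF result for the pair (X, A); lp.sum is its summed log p(A|X)
lpX, lpXA = TF_logp(X), TF_logp(X+A)
assert abs(lpXA - (lpX + lp.sum)) < 1e-7
# A log-probability can never be positive
assert lp.sum <= 0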

Weight Normalization

Sample-mean-one (SNIPS) normalization:

W_π' = exp{log W - logsumexp(log W) + log n}

Single global denominator enforces ∑W_i = n exactly.
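A minimal NumPy sketch of this normalization follows, assuming the input is a vector of raw log-weights. The function name mean_one_weights is hypothetical, and logsumexp is written out by hand to keep the example self-contained.

```python
import numpy as np

def mean_one_weights(log_w):
    """Return exp(log_w - logsumexp(log_w) + log n), i.e. weights with sample mean one."""
    log_w = np.asarray(log_w, dtype=float)
    m = log_w.max()                                   # shift by the max for numerical stability
    lse = m + np.log(np.sum(np.exp(log_w - m)))       # logsumexp(log_w)
    return np.exp(log_w - lse + np.log(log_w.size))
```

Working in log space means that very large or very small raw log-weights do not overflow or underflow before the normalization is applied.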

SIMCal-W Projection

Mean-one isotonic projection onto the S-monotone cone M↑(S), the set of vectors that are nondecreasing in S:

IsoMeanOneS(w) = argmin ∑(u_i - w_i)²
                     s.t. u ∈ M↑(S), (1/n)∑u_i = 1

Guarantee: ESS(Ŵ) ≥ ESS(W) deterministically via majorization.
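The projection can be sketched with a hand-written pool-adjacent-violators pass. This is not the reference SIMCal-W code. It relies on two facts: unit-weight least-squares isotonic regression preserves the sample mean, and adding a constant keeps a vector monotone, so the constrained solution is the isotonic fit shifted to mean one. The shift is zero when the input weights are already mean-one. Tie handling in S and the function names are assumptions made for illustration.

```python
import numpy as np

def pava_increasing(y):
    """Least-squares nondecreasing fit to y (unit weights) via pool-adjacent-violators."""
    blocks = []  # each block is [sum, count]
    for v in y:
        blocks.append([float(v), 1])
        # Merge while the previous block's mean exceeds the last block's mean
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return np.concatenate([np.full(c, s / c) for s, c in blocks])

def iso_mean_one(w, s):
    """Project w onto vectors nondecreasing in s, subject to mean one."""
    w = np.asarray(w, dtype=float)
    order = np.argsort(s, kind="stable")   # assumption: ties in s keep input order
    fit = pava_increasing(w[order])
    fit += 1.0 - fit.mean()                # enforce (1/n) * sum(u) = 1
    u = np.empty_like(w)
    u[order] = fit
    return u

def ess(w):
    """Effective sample size: (sum w)^2 / sum w^2."""
    w = np.asarray(w, dtype=float)
    return float(w.sum() ** 2 / np.sum(w ** 2))
```

For mean-one input the projection keeps the sum of the weights fixed and cannot increase the sum of squares, which is why ESS does not decrease in this sketch.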

Diagnostic Thresholds

Gate            Metric      Threshold  Action if Failed
OVERLAP         ESS/n       ≥ 0.30     Use overlap weights
                Hill α      ≥ 2        Cohort restriction
JUDGE           Coverage    ≥ 95%      Extend oracle slice
IDENTIFICATION  OutOfRange  ≤ 5%       REFUSE-LEVEL
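As a rough illustration of the two OVERLAP checks, the sketch below computes ESS/n and a Hill tail-index estimate from a weight vector and returns the actions for any failed check. The standard Hill estimator over the k largest weights is used; the choice of k is an assumption, since the page does not say how the tail fraction is chosen, and all function names are hypothetical.

```python
import numpy as np

def ess_fraction(w):
    """ESS / n = (sum w)^2 / (n * sum w^2)."""
    w = np.asarray(w, dtype=float)
    return float(w.sum() ** 2 / (w.size * np.sum(w ** 2)))

def hill_alpha(w, k):
    """Hill tail-index estimate from the k largest weights (requires 1 <= k < n)."""
    x = np.sort(np.asarray(w, dtype=float))[::-1]     # descending order
    return float(1.0 / np.mean(np.log(x[:k] / x[k])))  # x[k] is the (k+1)-th largest

def overlap_gate(w, k, min_ess_frac=0.30, min_alpha=2.0):
    """Return the actions for failed OVERLAP checks; an empty list means the gate passes."""
    actions = []
    if ess_fraction(w) < min_ess_frac:
        actions.append("Use overlap weights")
    if hill_alpha(w, k) < min_alpha:
        actions.append("Cohort restriction")
    return actions
```

A Hill estimate below 2 suggests the weight distribution may not have a finite variance, which is the usual motivation for this kind of threshold.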

Oracle Uncertainty (OUA)

Delete-one-oracle-fold jackknife:

Var_total = Var_main + Var_oracle

where:
  Var_oracle = (K-1)/K ∑_k (V̂^(-k) - V̄)²
  V̂^(-k) = estimate recomputed with oracle fold k deleted
  V̄ = (1/K) ∑_k V̂^(-k)

Propagates calibration uncertainty for honest confidence intervals.
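The jackknife term is simple to compute once the K leave-one-oracle-fold-out estimates are available. The sketch below assumes they are passed in as a plain list and that V̄ is their mean. The function names are hypothetical, and how each V̂^(-k) is produced (refitting the calibrator without fold k) is outside this example.

```python
def oracle_jackknife_variance(loo_estimates):
    """Var_oracle = (K-1)/K * sum_k (V_hat^(-k) - V_bar)^2."""
    k = len(loo_estimates)
    v_bar = sum(loo_estimates) / k
    return (k - 1) / k * sum((v - v_bar) ** 2 for v in loo_estimates)

def total_variance(var_main, loo_estimates):
    """Var_total = Var_main + Var_oracle."""
    return var_main + oracle_jackknife_variance(loo_estimates)
```

If all K estimates agree, Var_oracle is zero and the total variance reduces to Var_main.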

Implementation Resources