CJE Standards
Reference implementation standards for reproducible off-policy evaluation.
Fold Construction
We use K=5 folds by default with deterministic assignment for reproducibility:
F(x_id) = hash(x_id) mod 5
All modules (AutoCal-R, SIMCal-W, DR nuisances) share identical out-of-fold (OOF) boundaries. Oracle folds are derived by restricting each fold F(i) to the labeled samples (L_i = 1).
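The fold rule above can be sketched as follows. The source does not name the hash function, so SHA-256 is an assumption here, chosen because Python's built-in `hash` is salted per process for strings and would not be reproducible. The helper names `fold_of` and `oracle_folds` are hypothetical.

```python
import hashlib

K = 5  # default number of folds, per the text


def fold_of(x_id: str, k: int = K) -> int:
    """Deterministic fold index in [0, k) from a stable hash of the ID.

    Assumption: SHA-256 stands in for the unspecified hash function.
    """
    digest = hashlib.sha256(x_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce mod k.
    return int.from_bytes(digest[:8], "big") % k


def oracle_folds(ids, has_label, k: int = K):
    """Group labeled IDs (L_i = 1) by the same fold index every module uses."""
    folds = {f: [] for f in range(k)}
    for x_id, labeled in zip(ids, has_label):
        if labeled:
            folds[fold_of(x_id, k)].append(x_id)
    return folds


ids = [f"prompt-{i}" for i in range(20)]
has_label = [i % 4 == 0 for i in range(20)]  # toy oracle slice
print(oracle_folds(ids, has_label))
```

Because the fold depends only on the ID, adding or removing other samples never moves an existing sample to a different fold.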
Teacher Forcing Contract
Requirements
- Single-call, chat-native TF API
- Returns per-token and summed log p(A|X)
- Fixed template, tokenizer, and snapshot
- Deterministic (bit-identical for same inputs)
Additivity Check
Verify TF consistency with this test:
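# TF must satisfy the additivity invariant:
#   log p(X+A) = log p(X) + sum of per-token log p(A|X)
# lp holds the per-token log p(A|X) values returned by the TF call.
lpX, lpXA = TF_logp(X), TF_logp(X+A)
assert abs(lpXA - (lpX + lp.sum())) < 1e-7
assert lp.sum() <= 0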
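For testing the check itself without a model, a minimal self-contained sketch is shown here. The function name `check_additivity` and the toy numbers are hypothetical; real values would come from the TF API.

```python
import math


def check_additivity(lp_x, lp_xa, lp_tokens, tol=1e-7):
    """True if log p(X+A) matches log p(X) + sum of per-token log p(A|X)
    within tol, and the summed response log-probability is non-positive."""
    lp_a = math.fsum(lp_tokens)
    return abs(lp_xa - (lp_x + lp_a)) < tol and lp_a <= 0.0


# Toy numbers standing in for real TF outputs.
lp_x = -12.5
lp_tokens = [-0.25, -1.5, -0.75]  # per-token log p(A|X)
lp_xa = lp_x + sum(lp_tokens)

print(check_additivity(lp_x, lp_xa, lp_tokens))         # True
print(check_additivity(lp_x, lp_xa - 0.01, lp_tokens))  # False: off by 0.01
```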
Weight Normalization
Sample-mean-one (SNIPS) normalization:
W_π' = exp{log W - logsumexp(log W) + log n}
A single global denominator enforces ∑W_i = n (sample mean one), exact up to floating-point rounding.
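A minimal pure-Python sketch of this log-space normalization follows. The helper names are hypothetical, and the max-shift `logsumexp` is the standard stable formulation rather than anything specified by the source. It assumes at least one finite log weight.

```python
import math


def logsumexp(xs):
    """Stable log(sum(exp(x))) using the usual max-shift trick."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def mean_one_weights(log_w):
    """W_i = exp(log W_i - logsumexp(log W) + log n), so the weights sum to n."""
    n = len(log_w)
    shift = math.log(n) - logsumexp(log_w)
    return [math.exp(lw + shift) for lw in log_w]


# Raw log weights this negative would underflow to 0.0 under a naive exp().
log_w = [-1050.0, -1049.0, -1052.0, -1051.0]
w = mean_one_weights(log_w)
print(w, sum(w))
```

Working in log space keeps the ratio well defined even when every raw weight underflows, which is common when log p(A|X) is summed over long responses.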
SIMCal-W Projection
Mean-one isotonic projection onto S-monotone cone:
IsoMeanOneS(w) = argmin ∑(u_i - w_i)²
s.t. u ∈ M↑(S), (1/n)∑u_i = 1
Guarantee: ESS(Ŵ) ≥ ESS(W) deterministically via majorization.
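The projection can be illustrated with a small pool-adjacent-violators sketch. This is a simplified stand-in, not the reference implementation: it handles only the non-decreasing direction, breaks ties in S by input order, and assumes the weights were already normalized to mean one, so the final shift only corrects rounding. Equal-weight isotonic regression preserves the sample mean, which is why no separate constrained solver is needed. All function names are hypothetical.

```python
def isotonic_increasing(y):
    """Least-squares non-decreasing fit via pool-adjacent-violators."""
    blocks = []  # each block is [sum, count]
    for v in y:
        blocks.append([v, 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out


def iso_mean_one(w, s):
    """Project w onto weights that are non-decreasing in s with mean one."""
    n = len(w)
    order = sorted(range(n), key=lambda i: s[i])
    fitted = isotonic_increasing([w[i] for i in order])
    shift = 1.0 - sum(fitted) / n  # ~0 when w already has mean one
    u = [0.0] * n
    for rank, i in enumerate(order):
        u[i] = fitted[rank] + shift
    return u


def ess(w):
    """Effective sample size (sum w)^2 / sum w^2."""
    return sum(w) ** 2 / sum(x * x for x in w)


w = [0.2, 2.6, 0.4, 0.8]   # mean one
s = [0.1, 0.2, 0.3, 0.4]   # judge scores (toy values)
u = iso_mean_one(w, s)
print(u, ess(w), ess(u))
```

Pooling replaces a run of weights by their average, which keeps the sum fixed while lowering the sum of squares, so ESS cannot decrease.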
Diagnostic Thresholds
Gate | Metric | Threshold | Action if Failed |
---|---|---|---|
OVERLAP | ESS/n | ≥ 0.30 | Use overlap weights |
OVERLAP | Hill α | ≥ 2 | Cohort restriction |
JUDGE | Coverage | ≥ 95% | Extend oracle slice |
IDENTIFICATION | OutOfRange | ≤ 5% | REFUSE-LEVEL |
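As an illustration of how the ESS/n and Hill α rows might be evaluated, here is a hedged sketch. The Hill estimator is the standard one over the k largest weights. The choice of k, the function names, and the returned dictionary layout are assumptions not given in the source. Weights must be strictly positive and 0 < k < n.

```python
import math


def ess_fraction(w):
    """ESS/n, where ESS = (sum w)^2 / sum w^2."""
    return (sum(w) ** 2 / sum(x * x for x in w)) / len(w)


def hill_alpha(w, k):
    """Hill tail-index estimate from the k largest weights."""
    top = sorted(w, reverse=True)[: k + 1]
    threshold = top[k]  # the (k+1)-th largest weight
    mean_log_excess = sum(math.log(x / threshold) for x in top[:k]) / k
    # No spread in the tail means no evidence of heavy tails.
    return math.inf if mean_log_excess == 0 else 1.0 / mean_log_excess


def overlap_gate(w, k):
    """Apply the table's thresholds and collect the actions for failed checks."""
    frac, alpha = ess_fraction(w), hill_alpha(w, k)
    actions = []
    if frac < 0.30:
        actions.append("Use overlap weights")
    if alpha < 2.0:
        actions.append("Cohort restriction")
    return {"ess_frac": frac, "hill_alpha": alpha, "actions": actions}


print(overlap_gate([1.0] * 10, k=3))            # passes both checks
print(overlap_gate([100.0] + [1.0] * 9, k=3))   # one dominant weight fails both
```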
Oracle Uncertainty (OUA)
Delete-one-oracle-fold jackknife:
Var_total = Var_main + Var_oracle
where:
Var_oracle = (K-1)/K ∑(V̂^(-k) - V̄)²
Propagates calibration uncertainty for honest confidence intervals.
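The variance decomposition above can be sketched in a few lines. This assumes V̄ is the mean of the K delete-one-fold estimates V̂^(-k), which the text does not state explicitly. The function names are hypothetical.

```python
def oracle_variance(v_loo):
    """Var_oracle = (K-1)/K * sum_k (V^(-k) - V_bar)^2.

    v_loo holds the K estimates, each recomputed with one oracle fold deleted.
    Assumption: V_bar is the mean of those K estimates.
    """
    k = len(v_loo)
    v_bar = sum(v_loo) / k
    return (k - 1) / k * sum((v - v_bar) ** 2 for v in v_loo)


def total_variance(var_main, v_loo):
    """Var_total = Var_main + Var_oracle."""
    return var_main + oracle_variance(v_loo)


# Toy values: five delete-one-fold estimates and a main-sample variance.
v_loo = [0.61, 0.64, 0.59, 0.63, 0.62]
print(total_variance(var_main=4e-4, v_loo=v_loo))
```

If the leave-one-fold-out estimates barely move, Var_oracle is close to zero and the interval is driven by Var_main alone.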