CIMO Labs

AI Quality as Surrogacy for Idealized Deliberation: Technical Appendix

Eddie Landesberg · 25 min read

Formal framework with precise definitions, identification results, influence functions, and asymptotic theory.

Abstract. We develop a formal framework—precise definitions, identification results, influence functions, and asymptotic theory—for treating AI quality measurement as a surrogate endpoint problem. We establish three regimes of surrogacy (no surrogacy, local, and global transport), provide estimators for Direct, IPS, and DR modes with oracle-uncertainty-aware inference, and give testable diagnostics for transportability.

Canonical Definitions

For canonical definitions of $Y$ vs. $Y^*$, assumptions (A0, J1, S1–S3, L1–L2), and core concepts, see the CIMO Glossary.

Prerequisites: This appendix assumes familiarity with semiparametric efficiency theory, influence functions, and causal inference. For the conceptual introduction, see the main post.

0. Notation and spaces

  • Context space $\mathcal{X}$, action space $\mathcal{A}$ (text, code, plans), score space $\mathcal{S} \subset \mathbb{R}^d$.
  • A policy $\pi$ maps $x \in \mathcal{X}$ to a distribution $\pi(\cdot \mid x)$ on $\mathcal{A}$. Let $\Pi$ be a class of admissible policies.
  • $X \sim P_X$ denotes the population distribution of contexts; we treat the single-turn case first and extend to trajectories in §10.
  • An Operational Oracle $Y: \mathcal{X} \times \mathcal{A} \to [0,1]$ is the measurable, expensive evaluation label we can collect (e.g., human preference, GPT-5 judgment, expert audit).
  • An Idealized Deliberation Oracle (IDO) is a functional $Y^*: \mathcal{X} \times \mathcal{A} \to [0,1]$ representing the normalized evaluation under idealized deliberation. See the utility semantics box below for a precise definition.
  • A judge (or surrogate measurement process) $J$ maps $(x,a)$ to a random score $S \in \mathcal{S}$. We allow a ladder of rungs

$$S^{(0)}, S^{(1)}, \ldots, S^{(K)} \qquad (\text{increasing effort } 0 < 1 < \cdots < K)$$

induced by a filtration $\mathcal{F}_0 \subset \cdots \subset \mathcal{F}_K$ with $S^{(k)}$ measurable w.r.t. $\mathcal{F}_k$.

Target quality

For any $\pi \in \Pi$,

$$V(\pi) := \mathbb{E}[Y^*(X, A_\pi(X))], \qquad A_\pi(X) \sim \pi(\cdot \mid X)$$

IDO semantics (utility view)

Fix an outcome space $\Omega$, a kernel $P(d\omega \mid x,a)$, a utility $U: \Omega \to \mathbb{R}^m$, an optional social aggregator $W: \mathbb{R}^m \to \mathbb{R}$, a risk/aggregation functional $F: \mathcal{P}(\mathbb{R}) \to \mathbb{R}$ (e.g., mean, CVaR), and a strictly increasing normalization $N: \mathbb{R} \to [0,1]$.

$$Y^*(x,a) \;=\; N\!\Big(F\big(\mathsf{Law}[\,W(U(\omega)) \mid X{=}x, A{=}a]\big)\Big)$$

Defaults: if single-stakeholder ($m=1$), $W$ is the identity; otherwise pick $W$ (e.g., weighted sum, max–min). The default $F$ is the expectation, and the default $N$ is reference-policy anchoring: $N(u) = \bigl(u - F(P_{U \mid \pi_{\text{low}}})\bigr) / \bigl(F(P_{U \mid \pi_{\text{high}}}) - F(P_{U \mid \pi_{\text{low}}})\bigr)$. Record $(U,F,N,W)$ in the assumptions ledger.

1. Axioms for the IDO (normative)

Let $Y^*(x,a)$ be the limiting value of a deliberation procedure.

  • A1 (Deliberative stability). There exists a sequence of increasing-effort labels $Y^{(k)}(x,a)$ such that $Y^{(k)}(x,a) \to Y^*(x,a)$ in $L^2$ as $k \to \infty$.
  • A2 (Evidence monotonicity). If $\mathcal{F}_k \subseteq \mathcal{F}_{k'}$, then
    $$\mathbb{E}[(Y^* - \mathbb{E}[Y^* \mid \mathcal{F}_{k'}])^2] \le \mathbb{E}[(Y^* - \mathbb{E}[Y^* \mid \mathcal{F}_k])^2]$$
  • A3 (Instrumental invariance). If two procedures yield the same world-state relevant to the objective, they have equal $Y^*$.

A1–A3 make $Y^*$ a well-defined limit of a "deliberation ladder."

1.1. The Bridge Assumption: Connecting Y to Y*

In practice, we cannot directly measure the idealized oracle $Y^*$. Instead, we collect operational oracle labels $Y$—expensive but measurable evaluations such as human preferences, expert audits, or high-quality model judgments (e.g., GPT-5).

The Bridge Assumption (A0) formalizes the alignment between the operational oracle and the idealized target:

A0 (Bridge Assumption)

$$\mathbb{E}[Y^* \mid X, A] = \mathbb{E}[Y \mid X, A]$$

The operational oracle $Y$ is sufficiently aligned with the idealized deliberation oracle $Y^*$ that optimizing for $Y$ approximates optimizing for $Y^*$.

Validation: A0 is validated via the Bridge Validation Protocol (BVP), which consists of three pillars:

  • Pillar 1: Predictive Transportability Experiment (PTE) — Empirical test that $Y$ predicts held-out outcomes on $Y^*$-relevant metrics (e.g., user satisfaction, task success).
  • Pillar 2: Construct Validity Audits — Expert review and stakeholder feedback confirming that $Y$ captures the intended welfare construct.
  • Pillar 3: Stability Monitoring — Continuous tracking of the $Y \to Y^*$ relationship to detect drift (see the CLOVER governance framework in §6).

Implication for Calibration: The statistical calibration machinery (Assumptions S1–S2 below, Sections 3–5) operates on the measurable label $Y$. The Bridge Assumption (A0) ensures that optimizing the policy value $V(\pi) = \mathbb{E}[Y(X, A_\pi(X))]$ serves the idealized target $Y^*$. This separation keeps the statistical framework (Layers 2–6) operating on observables while making the alignment to $Y^*$ a governance question (Layer 0).

Note. If $(U,F,N,W)$ change across environments, selection enters $Y^*$ and the calibration $f_k$ will not transport (§3.5 table, row "$S \to Y^*$"). Record $(U,F,N,W)$ in the assumptions ledger (§12).

2. Surrogacy (structural) and transport (stability) assumptions

  • S1 (Oracle-surrogate sufficiency at rung $k$). There exists a measurable $f_k: \mathcal{S} \times \mathcal{X} \to [0,1]$ s.t.
    $$\mathbb{E}[Y \mid X, A, S^{(k)}] = f_k(S^{(k)}, X) \qquad \text{a.s.}$$
    on $\mathrm{supp}(\pi_0 \cup \Pi_{\mathrm{eval}})$ (the joint support of logging and evaluated policies). Optionally add monotonicity in a one-dimensional risk index $T = g_k(S^{(k)}, X)$.

    Scope: S1 is required only on the overlap region, not for arbitrary actions outside $\mathrm{supp}(\pi_0 \cup \Pi_{\mathrm{eval}})$. This is the same support condition needed for standard overlap (S3).

    Note: S1 targets the operational oracle $Y$, not the idealized $Y^*$. Under the Bridge Assumption (A0), calibrating to $Y$ serves $Y^*$.

  • S2 (Transportability across policies/time). For a collection $\mathcal{G}$ of environments (policies, cohorts, time), the same $f_k$ works: for all $g \in \mathcal{G}$,
    $$\mathbb{E}_g[Y \mid X, A, S^{(k)}] = f_k(S^{(k)}, X)$$
    on $\mathrm{supp}(\pi_0 \cup \Pi_{\mathrm{eval}})$ in each environment. Graphical test (Pearl & Bareinboim, 2014 [7]): in a selection diagram modeling environment differences via selection nodes, S2 holds if $S^{(k)}$ is S-admissible: $Y \perp\!\!\!\perp \mathrm{Sel} \mid X, A, S^{(k)}$ in the diagram with incoming arrows to $A$ removed, where $\mathrm{Sel}$ denotes the selection nodes. Intuitively: calibration transports if no selection node points into $Y$ given the surrogate. See §2.5 for the ladder of surrogacy regimes.
  • S3 (Positivity/overlap for off-policy re-use). If estimating $V(\pi)$ from logs of $\pi_0$, then $\pi(a \mid x) > 0 \Rightarrow \pi_0(a \mid x) > 0$ a.s.
  • S4 (Judge availability). For any $\pi$ used in Direct mode, we can obtain $S^{(k)}(X, A_\pi(X))$ at scale; for OPE/DR, we have $S^{(k)}(X, A_{\pi_0}(X))$ in logs.
  • L1 (Oracle MAR). Let $L \in \{0,1\}$ indicate whether an example received an oracle label $Y$. Then $L \perp Y \mid (X, A, S^{(k)})$: oracle labeling is ignorable conditional on observed surrogates and covariates.
  • L2 (Oracle positivity). $P(L=1 \mid X, A, S^{(k)}) > 0$ on the support where $f_k$ will be applied. This ensures the calibration function is identifiable and transportable.

2.5. A Ladder of Surrogacy Regimes

We organize identification and estimation strategies into three regimes, from weakest to strongest. Throughout, all assumptions are required only on the relevant support $\mathrm{supp}(\pi_0 \cup \Pi_{\mathrm{eval}})$ (the states/actions seen under logging and candidate policies). This is the same support condition required for standard overlap and does not strengthen any results.

Regime 1: No Surrogacy (K&M-style fallback)

Assumptions. We make no sufficiency assumption about $S^{(k)}$. We assume the standard conditions needed to learn from partially labeled $Y^*$: (i) missingness of $Y^*$ is conditionally random given $(X, A, S^{(k)})$ and the logging data, and (ii) overlap for actions taken by $\pi_0$ vs. $\pi$.

Identification. The value $V(\pi) = \mathbb{E}[Y^*(X, A_\pi(X))]$ is identified via standard IPW/DR machinery using the available $Y^*$ labels in each evaluation context.

Estimator. Use your DR estimator with $Y^*$ on labeled rows; use $S^{(k)}$ only as features for the outcome/propensity models (efficiency only).

Label burden. Requires $Y^*$ labels whenever you change environment $g$ or substantially shift $(X, A)$.

When to use. Diagnostics indicate surrogacy is unreliable (or the transport diagrams fail), but you still need a valid evaluation. See §4.6 for the K&M drop-in estimator.

Note: If your primary target is the IDO outcome, take $Y \equiv Y^*$ and use standard IPW/DR with labeled $Y^*$ in each context; when comparing two policies as treatments ($T \in \{0,1\}$), the K&M estimator in §4.6 applies directly. For multi-policy evaluations, K&M can be applied pairwise by encoding each comparison as a binary $T$, but the original theory is for a 2-arm ATE.

Literature: This corresponds to the Kallus–Mao (2020) setting: surrogates aid efficiency but do not replace $Y$, so each evaluation context requires $Y$ labels. K&M is framed around the binary-treatment ATE ($T \in \{0,1\}$). See §4.6 for the estimator.

Regime 2: Local Surrogacy (single-environment amortization)

Assumption (Local S1). In a fixed environment $g^\star$,

$$\mathbb{E}[Y \mid X, A, S^{(k)}, g^\star] = f_k(S^{(k)}, X) \quad \text{on } \mathrm{supp}(\pi_0 \cup \Pi_{\mathrm{eval}})$$

No cross-environment invariance is assumed.

Identification. Once $f_k$ is calibrated using $Y$ labels in $g^\star$, writing $R^{(k)} := f_k(S^{(k)}, X)$ for the calibrated reward (as in §3),

$$V(\pi; g^\star) = \mathbb{E}\!\left[R^{(k)}(X, S^{(k)}) \,\middle|\, A = \pi(X),\, g^\star\right]$$

Thus you may calibrate once and evaluate many policies that all run in $g^\star$, using only judge scores $S^{(k)}$ at evaluation time.

Estimator. Replace $Y$ by $R^{(k)}$ in the value estimator; use standard OPE (e.g., DR) within $g^\star$.

Label burden. Labels are needed once per environment you care about (re-calibrate if you move to $g \neq g^\star$).

Diagnostics. (i) Held-out calibration of $R^{(k)}$ vs. $Y$ inside $g^\star$; (ii) sensitivity to action mix and covariate shift within $g^\star$.

When to use. You do not need cross-environment transport (single deployment context), or your invariance tests are inconclusive.

Literature: Closest in spirit to Athey–Chetty–Imbens–Kang when the target is a binary-treatment ATE and surrogacy is assumed only within an environment (Prentice-style).

Regime 3: Global Surrogacy with Transport (flagship CJE, "Causal Judge Evaluation")

Assumptions (S1 + S2).

  • S1 (Surrogacy sufficiency): $\mathbb{E}[Y^* \mid X, A, S^{(k)}, g] = f_k(S^{(k)}, X)$ for all $g$ in the set of admissible environments, on $\mathrm{supp}(\pi_0 \cup \Pi_{\mathrm{eval}})$.
  • S2 (Invariance/transport): The same $f_k$ is valid across those environments; i.e., $R^{(k)}$ transports under the selection-diagram conditions (S-admissibility).

Identification. Calibrate $f_k$ once (in any admissible environment with labels), then for any admissible $g$:

$$V(\pi; g) = \mathbb{E}\!\left[R^{(k)}(X, S^{(k)}) \,\middle|\, A = \pi(X),\, g\right]$$

Estimator. Your CJE estimator as written: calibrate once, evaluate across policies and environments using judge scores only, subject to the transport diagnostics.

Label burden. One-time (per surrogate family $k$), provided the diagnostics keep passing as you move across $g$.

Diagnostics. Your existing selection diagrams plus invariance tests on $R^{(k)}$ across $g$; backstop to Regime 2 or 1 if they fail. See §3.5 and §6.

Literature: Flagship CJE: S1+S2 yield "calibrate once, evaluate many" across admissible environments.

| Regime | Surrogacy assumption | Re-labels needed? | Where you can evaluate without new $Y^*$ | Typical use | Literature |
| --- | --- | --- | --- | --- | --- |
| 1. No surrogacy | None | Yes, every environment | Nowhere beyond the labeled context | Strict validity when surrogacy fails | Kallus–Mao (2020) |
| 2. Local | S1 in one $g^\star$ | Once per environment | Any policy within $g^\star$ | Single deployment context | Athey et al. (2019) for binary ATE |
| 3. Global (CJE) | S1 + S2 across admissible $g$ | Once total | Policies and environments in the admissible set | "Calibrate once, evaluate many" | CJE (this work) |

Practical decision rule

  1. Try Regime 3. Run transport/invariance diagnostics for $R^{(k)}$. If they pass → use CJE as the mainline.
  2. If transport is shaky: Drop to Regime 2; re-calibrate $f_k$ in the target environment, then evaluate policies there using $S^{(k)}$.
  3. If even local surrogacy is weak: Use Regime 1 (DR with $Y^*$ labels) until you can improve the surrogate set $S^{(k)}$ or collect more labels.
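The decision rule above can be sketched as a tiny helper. The function and argument names here are ours, purely illustrative of how the fallback chain might be encoded:

```python
def choose_regime(transport_ok: bool, local_calibration_ok: bool) -> int:
    """Map diagnostic outcomes to a surrogacy regime.

    transport_ok: invariance/transport diagnostics for R^(k) pass (S2).
    local_calibration_ok: held-out calibration of f_k holds in the
        target environment (local S1)."""
    if transport_ok:
        return 3  # global: calibrate once, evaluate many
    if local_calibration_ok:
        return 2  # local: re-calibrate per environment
    return 1      # no surrogacy: DR with fresh oracle labels
```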

Where Athey–Chetty–Imbens–Kang (2019) fits

ACIK estimate binary-treatment ATEs using short-run surrogates under Prentice surrogacy, $W \perp Y \mid (S, X)$ (with $W$ the treatment). This is weaker than our S1 and targets a different estimand: effects of a binary treatment on $Y$, not policy value over rich action spaces. They also allow and bound surrogacy violations. Use ACIK-style methods when your target is a binary ATE and you can collect $Y$ in each evaluation context; otherwise prefer Regime 2 or 3.

See [8] for details.

2.6. The Causal Requirement: Mediation vs. Correlation under Optimization

The surrogacy regimes (1–3) address evaluation—estimating $V(\pi)$ for a fixed policy $\pi$. They rely on the Prentice criteria (Prentice, 1989), which define surrogacy via statistical sufficiency: $Y^* \perp \pi \mid S$. This ensures the surrogate $S$ is predictive of the outcome $Y^*$, enabling unbiased estimation.

However, Prentice sufficiency is insufficient for optimization (e.g., RLHF, Best-of-N sampling), where the surrogate $S$ becomes the target and the policy $\pi$ is actively modified to maximize it.

| Use Case | Goal | Surrogate Role | Requirement |
| --- | --- | --- | --- |
| Evaluation (Regimes 1–3) | Estimate $V(\pi)$ for fixed $\pi$ | $S$ predicts $Y^*$ (passive measurement) | Prentice sufficiency (correlation / predictive validity) |
| Optimization (RLHF, BoN) | Improve $\pi$ to maximize $Y^*$ via $S$ | $S$ guides optimization (active control) | Causal mediation (optimization flows through welfare) |

2.6.1. Regime 4: Optimization

We now formally introduce Regime 4: Optimization, where the surrogate is no longer a passive measurement instrument but an active control signal for policy improvement.

Definition (Optimization Regime)

Given a surrogate $S$, a policy family $\{\pi_\theta\}_{\theta \in \Theta}$, and a welfare outcome $Y^*$, the optimization problem is:

$$\max_{\theta \in \Theta} \mathbb{E}[Y^*_{\pi_\theta}] \quad \text{subject to} \quad \theta \in \arg\max_{\theta'} \mathbb{E}[S_{\pi_{\theta'}}]$$

That is, we seek to maximize $Y^*$ by optimizing $\pi_\theta$ against the surrogate $S$.

Requirement: Causal Mediation. For this optimization to be safe (i.e., for increases in $S$ to reliably correspond to increases in $Y^*$), the surrogate must satisfy causal mediation (Frangakis & Rubin, 2002). Formally, this requires that the causal effect of $\pi$ on $Y^*$ flows through $S$:

$$\pi \to S \to Y^*$$

This is a stronger condition than Prentice sufficiency ($Y^* \perp \pi \mid S$). Causal mediation requires blocking side channels—alternative causal paths from $\pi$ to $S$ that do not pass through $Y^*$ (e.g., $\pi \to \text{Length} \to S$, $\pi \to \text{Sycophancy} \to S$).

Failure Mode: Dissociative Effects. When causal mediation is violated, optimization exploits dissociative effects (in F&R's terminology)—changes to $S$ that are not mediated by $Y^*$. This is precisely the mechanism underlying the Surrogate Paradox and reward hacking.

The Surrogate Paradox

When optimization pressure is applied, models exploit any correlation that increases the surrogate, even if it harms the outcome. This is the Surrogate Paradox (illustrated by the CAST study in medicine: anti-arrhythmic drugs suppressed irregular heartbeats but increased mortality). In AI, this manifests as reward hacking—verbosity bias, sycophancy, or confident hallucination (Gao et al., 2022). Gao et al. further demonstrate that this is a scaling phenomenon: the divergence between $S$ and $Y^*$ follows a predictable parabolic curve as optimization pressure increases.

Metrics for Optimization Robustness. To quantify the safety of optimization in Regime 4, CIMO will introduce two new metrics (formalized in an upcoming technical post):

  • Goodhart Point (GHP): The level of optimization pressure (e.g., KL divergence, Best-of-N sample count) at which the gold reward $Y^*$ peaks and begins to crash. A higher GHP indicates greater optimization robustness.
  • Optimization Gap (OG): The divergence $\mathbb{E}[S_{\text{optimized}}] - \mathbb{E}[Y^*_{\text{optimized}}]$ under optimization pressure. A smaller gap indicates the surrogate remains aligned with welfare even when actively optimized.

These metrics extend CJE's static validation framework to dynamic stress testing, enabling practitioners to measure whether a judge remains valid when used as an optimization target.

Topology Enforcement in Practice. In the CIMO stack, Y*-Alignment and the Standard Deliberation Protocol (SDP) are the mechanisms that enforce this causal topology. By requiring the judge ($S$) to evaluate the process of welfare generation (via SDP), we block the "side channels" (e.g., length, tone) that allow the model to increase $S$ without increasing $Y^*$.

Summary

CJE uses Prentice sufficiency for estimation (Regimes 1-3). Regime 4: Optimization requires Causal Mediation, which the broader CIMO stack strengthens through Y*-Alignment and SDP. Robustness is quantified by the Goodhart Point (GHP) and Optimization Gap (OG).

For a detailed explanation of how SDP strengthens mediation through side-channel cost elevation, see The Surrogate Paradox.

3. Identification

Let $R^{(k)} = f_k(S^{(k)}, X)$ be the calibrated reward on the IDO scale.

Proposition 1 (Direct identification)

Under S1 (and S2 + L1–L2 if $f_k$ is learned out-of-domain),

$$V(\pi) = \mathbb{E}[R^{(k)}_\pi], \qquad R^{(k)}_\pi := f_k\bigl(S^{(k)}(X, A_\pi(X)), X\bigr)$$

Proof sketch. $\mathbb{E}[Y^* \mid X, A, S^{(k)}] = R^{(k)} \Rightarrow \mathbb{E}[Y^* \mid X, A_\pi(X)] = \mathbb{E}[R^{(k)}_\pi \mid X]$. Take expectations over $X$. See §2.5 for local vs. global surrogacy regimes.

Proposition 2 (IPS identification)

Under S1, S3 (and S2 + L1–L2 if $f_k$ is learned out-of-domain), from logs $(X, A, S^{(k)}) \sim \pi_0$,

$$V(\pi) = \mathbb{E}[w_\pi(X, A)\, R^{(k)}], \qquad w_\pi(X, A) := \frac{\pi(A \mid X)}{\pi_0(A \mid X)}$$

Proposition 3 (DR identification)

Under S1, S3 (and S2 + L1–L2 if $f_k$ is learned out-of-domain), let $Q_\eta(x,a) := \mathbb{E}[R^{(k)} \mid X=x, A=a]$ be any outcome model ("critic"). Then

$$V(\pi) = \mathbb{E}\bigl[w_\pi(X,A)\bigl(R^{(k)} - Q_\eta(X,A)\bigr) + Q_\eta^\pi(X)\bigr]$$

where $Q_\eta^\pi(X) := \mathbb{E}_{a \sim \pi(\cdot \mid X)}[Q_\eta(X,a)]$.

The identity survives misspecification of either $w_\pi$ or $Q_\eta$, so long as the other is correct (double robustness).

3.5. Transport formulas (cross-environment evaluation)

When evaluating π\pi in a target environment that differs from the calibration source, Pearl & Bareinboim's transport framework [7] tells us exactly which target quantities to measure. Below are the three common deployment scenarios:

Case A: Covariate shift only (selection into $X$)

Scenario: The prompt distribution changes (new user population, different time period), but the judge mechanism $P(S^{(k)} \mid X, A)$ and the oracle's meaning are invariant.

Transport formula:

$$V_*(\pi) = \mathbb{E}_{X \sim P_*}\left[\mathbb{E}_{a \sim \pi(\cdot \mid X)}\left[Q(X, a)\right]\right], \quad Q(X, a) := \mathbb{E}_P[f_k(S^{(k)}, X) \mid X, a]$$

What you need in target: $P_*(X)$ (the ability to draw prompts from the target population). You can keep $f_k$ and $Q(X, a)$ trained on source data.

Case B: Judge/measurement shift (selection into $S^{(k)}$)

Scenario: The judge model changes (GPT-4.1-nano → GPT-4.5-nano), instrumentation updates, or deliberation depth increases, but the prompt distribution and oracle meaning are invariant.

Transport formula:

$$V_*(\pi) = \mathbb{E}_{X \sim P}\left[\mathbb{E}_{a \sim \pi(\cdot \mid X)}\left[\mathbb{E}_{S^{(k)} \sim P_*(\cdot \mid X, a)}\big[f_k(S^{(k)}, X)\big]\right]\right]$$

What you need in target: $P_*(S^{(k)} \mid X, a)$ (the new judge channel). You can keep $f_k$ if S-admissibility holds (no selection into $Y^*$). If prompts also shift, replace $P$ by $P_*$ in the outer expectation (i.e., use Case C).

Case C: Covariate + judge shift (selection into $X$ and $S^{(k)}$)

Scenario: Both the prompt distribution and the judge mechanism change (e.g., deploying to a new geography with a different user base and an updated judge model).

Transport formula:

$$V_*(\pi) = \mathbb{E}_{X \sim P_*}\left[\mathbb{E}_{a \sim \pi(\cdot \mid X)}\left[\mathbb{E}_{S^{(k)} \sim P_*(\cdot \mid X, a)}\big[f_k(S^{(k)}, X)\big]\right]\right]$$

What you need in target: Both $P_*(X)$ and $P_*(S^{(k)} \mid X, a)$.

When transport fails: selection into $Y^*$

If selection points into $Y^*$ (the oracle's meaning changed—e.g., safety standards shifted, evaluation criteria evolved), S-admissibility is violated and $f_k$ does not transport. You must recalibrate $f_k$ with new oracle labels in the target environment, or adopt the Kallus & Mao estimator (§4.6), which targets $Y$ directly per context without assuming transport.

| Selection node location | $f_k$ transports? | Required target measurements | Source pieces you keep |
| --- | --- | --- | --- |
| $S \to X$ only | Yes | $P_*(X)$ | $f_k$, $Q(X,a)$ |
| $S \to S^{(k)}$ only | Yes | $P_*(S^{(k)} \mid X,a)$ | $f_k$ |
| $S \to X, S^{(k)}$ | Yes | $P_*(X)$, $P_*(S^{(k)} \mid X,a)$ | $f_k$ |
| $S \to Y^*$ | No | New oracle labels to recalibrate | — |

4. Estimators

Let $\mathcal{I}_{\text{oracle}} \subset \{1, \ldots, n\}$ index examples with expensive IDO labels $Y^*$ (at the top rung one can afford); the others have only $S^{(k)}$.

4.1. Calibrator

Estimate $f_k$ on $\mathcal{I}_{\text{oracle}}$ by:

  • Monotone (isotonic): $f_k(s,x) \equiv f_k(s)$, nondecreasing in $s$ and mean-preserving on the oracle slice.
  • Two-stage: Fit $T = g_k(S^{(k)}, X)$ (e.g., a spline in $(S^{(k)}, \text{length})$), then isotonic $T \mapsto Y^*$ with mean preservation.

Note on mean preservation: Mean preservation holds on the calibration slice; after transport to new domains/policies, the mean can differ unless S2 (transport) and L1–L2 (oracle MAR/positivity) hold. Use the transport test (§6) to validate.

Use $K$-fold cross-fitting: train $f_k$ on folds $\neq j$, predict on fold $j$, to obtain out-of-fold $\widehat{R}^{(k)}$.
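To make the recipe concrete, here is a minimal pure-Python sketch of the monotone (isotonic) calibrator with $K$-fold cross-fitting. The helper names `pava`, `predict_isotonic`, and `crossfit_calibrate` are ours, not CJE library code; mean preservation holds on each training fold by construction of PAVA, but out-of-fold predictions need not preserve it exactly.

```python
import random

def pava(xs, ys):
    """Pool-adjacent-violators: fit a nondecreasing step function to (x, y).
    Returns the sorted x's and the fitted value at each of them."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    x_sorted = [xs[i] for i in order]
    merged = []  # list of [block mean, block size]
    for i in order:
        merged.append([ys[i], 1])
        # merge backwards while monotonicity is violated
        while len(merged) > 1 and merged[-2][0] > merged[-1][0]:
            m2, w2 = merged.pop()
            m1, w1 = merged.pop()
            merged.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    fitted = []
    for m, w in merged:  # expand block means back to per-point values
        fitted.extend([m] * w)
    return x_sorted, fitted

def predict_isotonic(x_sorted, fitted, s):
    """Step-function prediction: fitted value at the largest knot <= s."""
    val = fitted[0]
    for x, f in zip(x_sorted, fitted):
        if x > s:
            break
        val = f
    return val

def crossfit_calibrate(scores, labels, n_folds=5, seed=0):
    """Out-of-fold calibrated rewards R_hat[i] = f_k(S_i): f_k is trained on
    the other folds, then applied to the held-out fold's scores."""
    idx = list(range(len(scores)))
    random.Random(seed).shuffle(idx)
    folds = [idx[j::n_folds] for j in range(n_folds)]
    r_hat = [0.0] * len(scores)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        xs, fit = pava([scores[i] for i in train], [labels[i] for i in train])
        for i in fold:
            r_hat[i] = predict_isotonic(xs, fit, scores[i])
    return r_hat
```

In practice the first stage would operate on the risk index $T = g_k(S^{(k)}, X)$ rather than the raw score; the cross-fitting skeleton is the same.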

4.2. Direct (fresh draws)

With $m$ prompts scored under $\pi$,

$$\widehat{V}_{\text{dir}}(\pi) = \frac{1}{m} \sum_{i=1}^m \widehat{R}^{(k)}_{\pi,i}$$

4.3. IPS (logs only)

$$\widehat{V}_{\text{IPS}}(\pi) = \frac{\sum_{i=1}^n w_{\pi,i}\, \widehat{R}^{(k)}_i}{\sum_{i=1}^n w_{\pi,i}} \quad \text{(self-normalized)}$$
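A minimal sketch of the self-normalized estimator (names illustrative; the weights $w_{\pi,i} = \pi(A_i \mid X_i)/\pi_0(A_i \mid X_i)$ are assumed computed upstream):

```python
def snips(weights, rewards):
    """Self-normalized IPS: sum(w_i * R_i) / sum(w_i).

    weights: importance ratios pi(A|X) / pi0(A|X)
    rewards: calibrated rewards R_hat from the cross-fitted calibrator."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("importance weights must have positive sum")
    return sum(w * r for w, r in zip(weights, rewards)) / total
```

Self-normalization trades a small finite-sample bias for invariance to the overall scale of the weights, which is why it is the default here.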

4.4. DR (logs + critic ± fresh draws)

Fit $Q_\eta(x,a) \approx \mathbb{E}[\widehat{R}^{(k)} \mid x,a]$ via cross-fitting. If fresh draws from $\pi$ are available, approximate $Q_\eta^\pi(x)$ by Monte Carlo. Then

$$\widehat{V}_{\text{DR}}(\pi) = \frac{1}{n} \sum_{i=1}^n \left[ w_{\pi,i}\bigl(\widehat{R}^{(k)}_i - Q_\eta(X_i, A_i)\bigr) + Q_\eta^\pi(X_i) \right]$$
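A sketch of the DR estimator plus an exact one-context bandit check of double robustness: with correct logging propensities, the expected DR value recovers the truth even when the critic $Q$ is badly misspecified. All names are illustrative.

```python
def dr_value(weights, rewards, q_logged, q_pi):
    """Sample DR estimate: mean of w_i * (R_i - Q(X_i, A_i)) + Q^pi(X_i)."""
    n = len(rewards)
    return sum(w * (r - ql) + qp
               for w, r, ql, qp in zip(weights, rewards, q_logged, q_pi)) / n

def expected_dr(pi0, pi, true_r, q):
    """Exact expectation of the DR estimator in a one-context bandit:
    logged actions a ~ pi0, weights w(a) = pi(a)/pi0(a), critic q."""
    q_pi = sum(p * qa for p, qa in zip(pi, q))  # Q^pi for the one context
    return sum(p0 * ((p / p0) * (r - qa) + q_pi)
               for p0, p, r, qa in zip(pi0, pi, true_r, q))
```

For example, with $\pi_0 = (0.5, 0.5)$, $\pi = (0.2, 0.8)$, and true rewards $(0.2, 0.9)$, the target value is $0.76$; `expected_dr` returns it both for the true critic and for the useless critic $q \equiv 0$.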

4.5. Weight stabilization (optional, off-policy)

Project raw weights to a mean-one, score-indexed monotone cone (SIM-style calibration) to boost ESS. This is a bias–variance tradeoff: stabilized weights $\tilde{w}_{\pi,i}$ can introduce small bias unless they converge to the true importance ratio. Use weight stabilization inside DR estimators (where outcome models guard against modest weight misspecification), and report diagnostics (ESS, tails).
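A sketch of the associated diagnostics. The SIM-style monotone projection itself is more than a few lines, so as a stand-in `stabilize_mean_one` clips heavy tails and rescales to mean one—a deliberately crude simplification of the projection described above, not the actual method:

```python
def ess(weights):
    """Effective sample size: (sum w)^2 / sum w^2."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return s1 * s1 / s2

def stabilize_mean_one(weights, clip=10.0):
    """Clip tails, then rescale so the stabilized weights have mean one.
    Buys ESS at the cost of a small bias (the tradeoff noted in the text)."""
    clipped = [min(w, clip) for w in weights]
    mean = sum(clipped) / len(clipped)
    return [w / mean for w in clipped]
```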

4.6. Regime 1: Kallus–Mao estimator (no S1)

If diagnostics suggest surrogacy is unreliable, estimate effects on $Y$ directly using a doubly robust Kallus–Mao estimator that treats $S$ as auxiliary information (no sufficiency assumed). You'll need a MAR-sampled set of $Y$ labels in the evaluation context; cross-fit the nuisances $(e, r, \tilde{\mu}, \mu)$; then compute:

$$\widehat{\delta} = \frac{1}{n}\sum_{i=1}^n \Bigl[ \mu(1,X_i) - \mu(0,X_i) + \frac{T_i - e(X_i)}{e(X_i)(1-e(X_i))}\bigl(\tilde{\mu}(T_i,X_i,S_i) - \mu(T_i,X_i)\bigr) + \frac{T_i R_i}{e(X_i)\,r(1,X_i,S_i)}\bigl(Y_i - \tilde{\mu}(1,X_i,S_i)\bigr) - \frac{(1-T_i)R_i}{(1-e(X_i))\,r(0,X_i,S_i)}\bigl(Y_i - \tilde{\mu}(0,X_i,S_i)\bigr) \Bigr]$$

Report IF-based SEs with cross-fitting. [1]
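The point estimate above can be transcribed directly. In this sketch each row carries its pre-computed, cross-fitted nuisances ($e$, $r$, $\mu$, $\tilde\mu$ evaluated at the row's own $X_i$, $S_i$); the dict-based interface and key names are purely illustrative, and the cross-fitting itself is omitted:

```python
def km_delta(rows):
    """Kallus-Mao doubly robust ATE estimate.

    Each row: T (treatment), R (1 if Y labeled), Y (label, or None),
    e = e(X); r1, r0 = r(t, X, S); mu1, mu0 = mu(t, X);
    mt1, mt0 = mu_tilde(t, X, S)."""
    total = 0.0
    for z in rows:
        e, T, R = z["e"], z["T"], z["R"]
        mu_t = z["mu1"] if T == 1 else z["mu0"]
        mt_t = z["mt1"] if T == 1 else z["mt0"]
        term = z["mu1"] - z["mu0"]
        term += (T - e) / (e * (1.0 - e)) * (mt_t - mu_t)
        if R and T == 1:
            term += (z["Y"] - z["mt1"]) / (e * z["r1"])
        elif R and T == 0:
            term -= (z["Y"] - z["mt0"]) / ((1.0 - e) * z["r0"])
        total += term
    return total / len(rows)
```

When every nuisance is exact and all labels observed, the correction terms vanish and the estimate reduces to the mean of $\mu(1,X_i) - \mu(0,X_i)$, as expected.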

5. Influence functions and inference

Assume pathwise differentiability and regularity (bounded moments, entropy conditions satisfied via cross-fitting).

5.1. Efficient influence function (EIF) for V(π)

Under S1 and known $f_k$,

$$\phi_\pi(Z) = R^{(k)}_\pi - V(\pi), \qquad Z = (X, S^{(k)}, A \text{ if needed})$$

With DR structure and nuisances $\eta = (Q_\eta, w_\pi)$,

$$\phi_\pi(Z) = w_\pi(X,A)\bigl(R^{(k)} - Q_\eta(X,A)\bigr) + Q_\eta^\pi(X) - V(\pi)$$

which is Neyman-orthogonal to first-order perturbations of $(Q_\eta, w_\pi)$ holding $f_k$ fixed. Uncertainty from learning $f_k$ on the oracle slice is added separately via OUA (§5.3). If desired, one can treat $f_k$ as a nuisance and cross-fit it jointly to achieve formal orthogonality; we separate it and account for its uncertainty via OUA for transparency and modularity.

5.2. Asymptotics and SEs

With $K$-fold cross-fitting,

$$\sqrt{n}\,\bigl(\widehat{V}(\pi) - V(\pi)\bigr) \rightsquigarrow \mathcal{N}\bigl(0, \mathbb{V}[\phi_\pi(Z)]\bigr)$$

Estimate the variance with the empirical variance of $\phi_\pi$ (cluster-robust if needed).

5.3. Oracle-uncertainty aware (OUA) variance

If $f_k$ is learned from a finite oracle slice, add a delete-one-fold jackknife over oracle folds:

$$\widehat{\mathrm{Var}}_{\text{OUA}} = \frac{K-1}{K} \sum_{j=1}^K \bigl(\widehat{V}^{(-j)}(\pi) - \bar{V}\bigr)^2, \qquad \bar{V} = \frac{1}{K} \sum_j \widehat{V}^{(-j)}(\pi)$$

Total variance: $\widehat{\mathrm{Var}}_{\text{main}} + \widehat{\mathrm{Var}}_{\text{OUA}}$. Use a Satterthwaite df for small-sample $t$-intervals if desired.
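The jackknife is nearly a one-liner; a sketch with illustrative names, where `v_minus_j` holds the $K$ re-estimates $\widehat{V}^{(-j)}(\pi)$, each computed with fold $j$ of the oracle slice deleted from calibration:

```python
def oua_variance(v_minus_j):
    """Delete-one-fold jackknife: (K-1)/K * sum_j (V^(-j) - V_bar)^2."""
    k = len(v_minus_j)
    v_bar = sum(v_minus_j) / k
    return (k - 1) / k * sum((v - v_bar) ** 2 for v in v_minus_j)

def total_se(var_main, v_minus_j):
    """Standard error combining sampling variance and oracle uncertainty."""
    return (var_main + oua_variance(v_minus_j)) ** 0.5
```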

5.4. Relationship to Conformal Prediction

Conformal Prediction (CP) (Vovk et al., 2005; Angelopoulos & Bates, 2021) provides distribution-free, finite-sample coverage guarantees for prediction intervals (uncertainty about a future observation $Y^*_{\text{new}}$). OUA addresses a different problem: inference on a population parameter (the policy value $V(\pi)$).

CP guarantees coverage for $Y^*_{\text{new}}$ assuming the calibration function $f$ is fixed.

OUA quantifies the epistemic uncertainty of having learned $f$ from a finite oracle slice.

CJE requires OUA because we need valid confidence intervals on $V(\pi)$, which necessitates propagating the uncertainty of the learned calibrator itself, not just the prediction uncertainty of individual outcomes.

When to use each: Use CP when you need coverage for individual predictions (e.g., "What is the range of $Y^*$ for this specific user?"). Use OUA when you need inference on aggregate quantities (e.g., "What is the expected policy value across all users, with honest uncertainty?").

6. Testable diagnostics (falsifiable implications)

  • Transport test (policy/time). Per-group residual mean test:
    $$H_0: \mathbb{E}[Y^* - f_k(S^{(k)}, X) \mid G=g] = 0 \quad \forall g \in \mathcal{G}$$
    where $G$ indexes groups (policies, time periods, domains). Use the labeled subset; apply a multiple-testing correction (e.g., Bonferroni). This is a weaker, testable implication of S-admissibility—if you lack labels in multiple domains, you can only partially test S2.
  • Coverage of surrogate support. Compare histograms of $S^{(k)}$ on oracle-labeled vs. full sets; flag extrapolation if tails are unlabeled.
  • Overlap diagnostics (off-policy). Effective sample size $\mathrm{ESS} = (\sum w)^2 / \sum w^2$, weight CV, max/median ratio, Hill tail index.
  • OUA share. Report $\widehat{\mathrm{Var}}_{\text{OUA}} / (\widehat{\mathrm{Var}}_{\text{main}} + \widehat{\mathrm{Var}}_{\text{OUA}})$ to guide budget (more labels vs. more prompts).
  • Prentice test (surrogacy sufficiency / S1). On oracle-labeled subsets, regress $Y^*$ on $(X, A, S^{(k)})$ and test whether adding $A$ (and $A \times S^{(k)}$) improves fit. Failing to reject supports S1 (surrogacy sufficiency). For S-admissibility (S2, cross-domain), use a domain indicator $G$ and test $Y^* \perp\!\!\!\perp G \mid X, A, S^{(k)}$ on pooled labeled data across domains: does $G$ (and $G \times S^{(k)}$) improve prediction? If yes, $f_k$ does not transport—recalibrate or use the K&M estimator (§4.6).
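The first diagnostic above is simple enough to sketch end-to-end: a per-group residual-mean z-test with Bonferroni correction. Helper names are ours, and a $t$-test would be the more careful choice at small per-group $n$:

```python
from math import erf, sqrt

def transport_test(residuals_by_group, alpha=0.05):
    """Flag groups g where H0: E[Y* - f_k(S, X) | G = g] = 0 is rejected.

    residuals_by_group: dict mapping group id -> list of labeled residuals
    Returns the list of failing groups (Bonferroni-corrected z-tests)."""
    n_groups = len(residuals_by_group)
    failures = []
    for g, res in residuals_by_group.items():
        n = len(res)
        mean = sum(res) / n
        var = sum((r - mean) ** 2 for r in res) / (n - 1)
        z = mean / sqrt(var / n) if var > 0 else 0.0
        # two-sided normal p-value via the error function
        p = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
        if p < alpha / n_groups:  # Bonferroni
            failures.append(g)
    return failures
```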

7. Learning with the IDO objective

For parametric $\pi_\theta$, the policy learning problem is

$$\max_{\theta \in \Theta} V(\pi_\theta)$$

A plug-in gradient follows from the policy gradient identity with calibrated rewards:

$$\nabla_\theta V(\pi_\theta) = \mathbb{E}\left[\mathbb{E}_{a \sim \pi_\theta(\cdot \mid X)}\bigl[\nabla_\theta \log \pi_\theta(a \mid X)\, R^{(k)}(X,a)\bigr]\right]$$

optionally replacing $R^{(k)}$ by an advantage $R^{(k)} - b(X)$. This "RL with calibrated reward" aligns training with the IDO.

For safe deployment, maximize a lower confidence bound $\widehat{V}(\pi_\theta) - z_{1-\alpha} \cdot \mathrm{SE}(\widehat{V}(\pi_\theta))$.
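The lower-confidence-bound rule in code (illustrative names; $z_{1-\alpha} \approx 1.645$ for a one-sided 95% bound):

```python
def lcb(v_hat, se, z=1.645):
    """One-sided lower confidence bound on the estimated policy value."""
    return v_hat - z * se

def pick_policy(candidates, z=1.645):
    """Deploy the candidate with the highest LCB rather than the highest
    point estimate. candidates: dict name -> (v_hat, se)."""
    return max(candidates, key=lambda name: lcb(*candidates[name], z=z))
```

A noisy estimate of 0.80 ± 0.10 loses to a precise 0.75 ± 0.01 under this rule, since 0.6355 < 0.7336: uncertainty, not just the point estimate, drives deployment.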

8. Multiple stakeholders and social choice

Let $\mathcal{U} = \{1, \ldots, m\}$ index stakeholders with oracles $Y^*_u(x,a) \in [0,1]$. A social aggregator $W: [0,1]^m \to [0,1]$ defines

$$V(\pi) = \mathbb{E}\bigl[W\bigl(Y^*_1(X, A_\pi(X)), \ldots, Y^*_m(X, A_\pi(X))\bigr)\bigr]$$

Common choices: weighted utilitarian ($W(y) = \sum_u w_u y_u$), max–min ($W(y) = \min_u y_u$), or constrained variants. Surrogacy extends with $f_{k,u}(S^{(k)}, X) \approx \mathbb{E}[Y^*_u \mid X, A, S^{(k)}]$; calibrate each and plug into $W$.

9. The deliberation ladder as information order

Model rungs by a filtration F0FKF\mathcal{F}_0 \subset \cdots \subset \mathcal{F}_K \subseteq \mathcal{F}_\infty. Define

Y^{(k)}(x,a) := \mathbb{E}[Y^*(x,a) \mid \mathcal{F}_k], \qquad S^{(k)} = \text{any statistic measurable w.r.t. } \mathcal{F}_k

Then by the Blackwell/Doob ordering, k' \ge k implies \mathbb{E}[(Y^* - Y^{(k')})^2] \le \mathbb{E}[(Y^* - Y^{(k)})^2]. If S^{(k)} is Blackwell more informative than S^{(k-1)}, a calibrated estimator at rung k is (weakly) more efficient than at rung k-1.
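The MSE ordering can be seen in a small Monte Carlo where a finer surrogate partition refines a coarser one. The latent variable, the sigmoid label, and the quantile binnings are all assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
z = rng.normal(size=n)
y_star = 1.0 / (1.0 + np.exp(-2.0 * z))   # idealized label in [0, 1]

def rung_mse(bins):
    """In-sample MSE of the conditional-mean predictor Y^(k) = E[Y* | S^(k)],
    where S^(k) is a binning of z (finer bins = higher rung)."""
    idx = np.digitize(z, bins)
    pred = np.zeros(n)
    for g in np.unique(idx):
        mask = idx == g
        pred[mask] = y_star[mask].mean()
    return float(np.mean((y_star - pred) ** 2))

coarse = rung_mse(np.quantile(z, [0.5]))              # 2-cell S^(k-1)
fine = rung_mse(np.quantile(z, [0.25, 0.5, 0.75]))    # 4-cell refinement S^(k)
# fine <= coarse, since the finer partition refines the coarser sigma-field
```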

10. Extension to trajectories (agents)

Let \tau = (s_0, a_0, \ldots, s_T) be a trajectory generated by policy \pi in environment P. Define an IDO trajectory value

Y^*(\tau) \in [0,1] \quad \text{or} \quad Y^*(\pi; X) = \mathbb{E}_{P,\pi}\left[\sum_{t=0}^T \gamma^t u^*(s_t, a_t) \mid X\right]

Surrogates may be terminal (S_T^{(k)}) or stepwise (S_t^{(k)}). Direct/IPS/DR estimators extend with clustering by trajectory; sequential IPS is typically ill-conditioned, so prefer Direct or DR with trajectory-level critics.
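The clustering-by-trajectory point can be sketched with a cluster-robust standard error for a mean of per-step calibrated rewards (toy values assumed):

```python
import numpy as np

def clustered_mean_se(values, traj_ids):
    """Mean of per-step calibrated rewards with a cluster-robust SE,
    clustering by trajectory so correlated steps are not treated as iid."""
    v = np.asarray(values, float)
    ids = np.asarray(traj_ids)
    mu = v.mean()
    resid = v - mu
    # sum residuals within each trajectory; use between-cluster variation
    sums = np.array([resid[ids == c].sum() for c in np.unique(ids)])
    G = len(sums)
    se = np.sqrt((sums ** 2).sum() * G / (G - 1)) / len(v)
    return mu, float(se)

# Perfect within-trajectory correlation: the clustered SE is twice the
# naive iid SE here, reflecting the smaller effective sample.
mu, se = clustered_mean_se([1.0, 1.0, 0.0, 0.0], [0, 0, 1, 1])
naive = np.std([1.0, 1.0, 0.0, 0.0]) / 2.0   # 0.25
```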

11. Limits (scope conditions)

  • Non-regular targets. If Y^* or W induces non-differentiable functionals (e.g., maxima, boundary problems), first-order theory fails; use selective inference, subsampling, or shape-constrained methods.
  • Severe non-transport. If S2 fails (e.g., adversarial policy styles), drop to Regime 2 or 1 (§2.5): recalibrate f_k locally per environment, or use K&M estimation with new oracle labels.
  • Overlap failures. If S3 fails, IPS/DR is unreliable even with stabilized weights; collect fresh draws and use Direct.

12. Minimal "assumptions ledger" (for every deployment)

| Code | Statement | Used by | Test / Diagnostic | Mitigation |
| --- | --- | --- | --- | --- |
| A0 | \mathbb{E}[Y^* \mid X,A] = \mathbb{E}[Y \mid X,A] (Bridge Assumption) | All layers | BVP (Pillar 1: PTE; Pillar 2: Audits; Pillar 3: Stability) | SDP-Gov: SDP Patching and Governance |
| S1 | \exists f_k: \mathbb{E}[Y \mid X,A,S^{(k)}] = f_k(S^{(k)},X) | All | Incremental signal; residual vs. f_k | Add covariates; richer judge; higher rung |
| S2 | Y \perp\!\!\!\perp \mathrm{Sel} \mid X,A,S^{(k)} (S-admissibility); f_k transports when no selection nodes (Sel) point into Y | All (cross-environment) | Per-group residual test (§6); cross-domain Prentice test with G indicator; diagram review (§3.5) | If selection into X or S^{(k)}: measure target distributions (§3.5 table). If selection into Y: recalibrate with target oracle labels |
| S3 | \pi \ll \pi_0 (overlap) | IPS/DR | ESS, tail index, max/median | Weight stabilization; collect draws |
| A1–A3 | IDO well-posed | All | Rung stability checks | Clarify oracle definition; adjust W |
| L1 | L \perp\!\!\!\perp Y^* \mid (X,A,S^{(k)}) (Oracle MAR) | All (calibration) | Oracle selection independent of residuals | Randomize oracle sampling; stratify by S, X |
| L2 | P(L=1 \mid X,A,S^{(k)}) > 0 (Oracle positivity) | All (calibration) | Coverage plots; extrapolation warnings | Label tail regions; flag OOD predictions |
| OUA | Finite oracle labels | Inference | OUA share | Add labels if OUA dominates |
| N | Strictly increasing normalization to [0,1]; anchored to (\pi_{\text{low}}, \pi_{\text{high}}) (or specified benchmarks) | All (comparability & reporting) | Anchor stability check across releases; report raw F and anchored Y^* when anchors change | Re-anchor or freeze anchors; append change log when re-anchoring |
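The S3 diagnostics in the ledger (ESS and the max/median weight ratio) are cheap to compute from the importance weights; a sketch:

```python
import numpy as np

def weight_diagnostics(w):
    """Overlap (S3) diagnostics for importance weights w = pi(a|x) / pi_0(a|x):
    effective sample size, ESS fraction, and the max/median weight ratio."""
    w = np.asarray(w, float)
    ess = w.sum() ** 2 / (w ** 2).sum()
    return {
        "ess": float(ess),
        "ess_frac": float(ess / len(w)),
        "max_over_median": float(w.max() / np.median(w)),
    }

# Heavy-tailed weights (poor overlap) crater the ESS: one extreme weight
# among 100 draws leaves an effective sample of about 4.
d = weight_diagnostics([1.0] * 99 + [100.0])
```

Low ESS fraction or a large max/median ratio signals that IPS/DR estimates will be dominated by a handful of draws, pointing to the mitigations in the S3 row.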

13. What you report (template)

For each \pi:

  • \widehat{V}(\pi) on the IDO scale with 95% CI (main + OUA), and the DF rule.
  • Diagnostics: transport test p-values, ESS (if OPE/DR), OUA share, oracle coverage plots.
  • If choosing a policy: a decision with one-sided CI (safety margin).
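A minimal serializable shape for such a report (field names and values are illustrative, not a fixed schema):

```python
# One report entry per evaluated policy; every number below is a placeholder.
report = {
    "policy": "candidate-v2",
    "V_hat": 0.71,
    "ci_95": (0.69, 0.73),         # main + OUA variance, DF rule applied
    "diagnostics": {
        "transport_test_p": 0.41,  # cross-domain Prentice test (§6)
        "ess_frac": 0.62,          # only reported if IPS/DR was used
        "oua_share": 0.18,         # oracle-uncertainty share of total variance
    },
    "decision_lcb_one_sided": 0.694,
}
```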

Summary

  • Definition: V(\pi) = \mathbb{E}[Y^*(X, A_\pi(X))]
  • Mechanism: use surrogates S^{(k)} and a calibration f_k so that \mathbb{E}[Y^* \mid X,A,S^{(k)}] = f_k(S^{(k)},X)
  • Identification: Direct (fresh draws), IPS (reweight logs), DR (two chances)
  • Uncertainty: influence-function variance + oracle-learning variance (OUA)
  • Governance: multi-party W encodes whose IDO matters and how

This turns "AI should do what you'd do with unlimited time" into a measurable target, with estimators, CIs, and failure tests you can run.

Citation

If you use this work, please cite:

BibTeX

@misc{landesberg2025surrogacy,
  author = {Landesberg, Eddie},
  title = {AI Quality and Surrogacy: Technical Appendix},
  year = {2025},
  month = {November},
  url = {https://cimolabs.com/research/ai-quality-surrogacy-technical},
  note = {CIMO Labs Technical Report}
}

Plain Text

Landesberg, E. (2025). AI Quality and Surrogacy: Technical Appendix. CIMO Labs Technical Report. https://cimolabs.com/research/ai-quality-surrogacy-technical

References

[1] Kallus, N., & Mao, X. (2020). On the role of surrogates in the efficient estimation of treatment effects with limited outcome data. arXiv:2003.12408. arXiv — Semiparametric efficiency theory for surrogate-assisted treatment effect estimation under MAR assumptions.
[2] Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68. DOI — Foundational DML framework for valid inference with cross-fitting and Neyman orthogonality.
[3] van der Laan, M. J., & Rose, S. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer. DOI — TMLE and targeted estimation framework for causal parameters.
[4] Dudík, M., Langford, J., & Li, L. (2011). Doubly Robust Policy Evaluation and Learning. arXiv:1103.4601. arXiv — Doubly robust methods for off-policy evaluation.
[5] Blackwell, D. (1953). Equivalent Comparisons of Experiments. Annals of Mathematical Statistics, 24(2), 265–272. DOI — Foundational work on information ordering and sufficiency.
[6] Prentice, R. L. (1989). Surrogate endpoints in clinical trials: Definition and operational criteria. Statistics in Medicine, 8(4), 431–440. DOI — Original formulation of surrogate endpoint criteria in biostatistics.
[7] Pearl, J., & Bareinboim, E. (2014). External validity: From do-calculus to transportability across populations. Statistical Science, 29(4), 579–595. DOI — Formal causal framework for transportability and external validity.
[8] Athey, S., Chetty, R., Imbens, G. W., & Kang, H. (2019). The Surrogate Index: Combining Short-Term Proxies to Estimate Long-Term Treatment Effects More Rapidly and Precisely. NBER Working Paper 25863. NBER — Prentice surrogacy for binary treatment ATE estimation using short-term surrogate indices.
[9] Frangakis, C. E., & Rubin, D. B. (2002). Principal stratification in causal inference. Biometrics, 58(1), 21–29. DOI — Foundational work on principal stratification and causal mediation analysis.
[10] Gao, L., Schulman, J., & Hilton, J. (2022). Scaling Laws for Reward Model Overoptimization. arXiv preprint arXiv:2210.10760. arXiv — Empirical study showing that reward model overoptimization follows predictable scaling laws (Goodhart's Law).
[11] Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic Learning in a Random World. Springer. DOI — Foundational book on conformal prediction and distribution-free uncertainty quantification.
[12] Angelopoulos, A. N., & Bates, S. (2021). A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification. arXiv preprint arXiv:2107.07511. arXiv — Modern tutorial on conformal prediction with practical applications.