Validating the Bridge Assumption (A0): The Bridge Validation Protocol and SDP-Gov
Technical Appendix v1.0
Abstract
The CIMO Framework relies on the Bridge Assumption (A0): that the operational welfare label Y (measured via the Standard Deliberation Protocol, SDP) aligns with the idealized target Y*. While CIMO rigorously calibrates surrogates S to Y (Layers 2-3), the Y → Y* link requires explicit validation. We introduce the Validation Layer ("Layer 0") to address this. It comprises the Bridge Validation Protocol (BVP), a suite of tests for A0, and SDP-Gov, a governance framework for the SDP. The BVP uses the Proportion of Treatment Effect Explained (PTE) as the core metric for empirical alignment against Long-Run Outcomes (LROs). SDP-Gov provides a CI/CD system for the SDP itself. Together, they keep the operational target aligned with true welfare.
Prerequisites: This appendix assumes familiarity with the CIMO Framework architecture, particularly Y*-Aligned Systems (SDP, Y* definition) and AI Quality as Surrogacy (calibration, S-admissibility). For the conceptual introduction, see The CIMO Framework.
0. The A0 Problem and Notation
A0 (Bridge Assumption)
The operational label Y, produced by the Standard Deliberation Protocol (SDP), is conditionally unbiased for the Idealized Deliberation Oracle outcome Y*.
The Problem: If A0 fails due to construct drift, protocol gaps, or rater bias, Y diverges from Y*. The CIMO stack may then rigorously optimize systems toward the wrong objective.
The Irreducibility of A0: A Philosophical Caveat
A0 is the foundational assumption of the CIMO Framework. It cannot be "proven" in an absolute sense—it rests on the philosophical claim that the operational measurement process (Y via SDP) captures what we care about (Y*).
The LRO Validation Problem: BVP tests Y→LRO (do policies that score high on Y produce better long-run outcomes?). But this assumes LRO→Y* (that long-run outcomes are themselves aligned with true welfare). If LROs are poor proxies for Y* (e.g., optimizing for engagement rather than long-term value, or user retention rather than genuine satisfaction), a high PTE provides false confidence.
What BVP provides: Not proof of A0, but evidence and vigilance. BVP establishes that Y predicts improvements in measurable real-world outcomes, provides structured adversarial auditing (Pillar 2), and offers a governance framework to detect drift. This is the best we can do. All evaluation frameworks rest on similar irreducible assumptions about what constitutes "good." CIMO's contribution is making this assumption explicit (A0), providing validation machinery (BVP), and maintaining governance (SDP-Gov) rather than leaving it implicit and unexamined.
Notation & The Measurement Hierarchy
- Y*: Idealized Deliberation Oracle outcome (True Welfare). The theoretical construct representing welfare under perfect deliberation—complete information, reflective consistency, impartial aggregation. Unobservable in practice.
- Y: Operational welfare label (measured via SDP). The practical measurement produced by the Standard Deliberation Protocol. Approximates Y* but is measurable at scale. This is what we use for calibration (Layer 2: S → Y).
- LROs (Long-Run Outcomes): Real-world metrics that approximate Y*. Observable metrics measured weeks/months after interaction (e.g., 90-day retention, revenue, task success rate). These are delayed proxies for Y*, not Y* itself. We test whether optimizing Y predicts improvements in LROs.
- S: Cheap surrogates (e.g., LLM-judge scores). Fast, scalable signals calibrated to Y (Layer 2).
The validation hierarchy:
- Layer 0 (A0): Y → Y* (Bridge: operational → idealized)
- Layer 2 (S1): S → Y (Surrogacy: cheap signal → operational)
We cannot directly test Y → Y* because Y* is unobservable. Instead, we test Y → LRO and assume LROs approximate Y* (see §3.1.2 for the LRO validation problem).
1. The Validation Layer Framework (Layer 0)
The Validation Layer acts as Layer 0, ensuring the foundation of the CIMO stack (Layer 1: Y* Definition) remains sound. It has two components:
- The Bridge Validation Protocol (BVP) for testing A0 via empirical alignment, construct validity, and stability checks.
- SDP-Gov for governing the SDP—a CI/CD system that allows safe evolution of the operational welfare definition.
Relationship to other CIMO components
- CLOVER (Layer 6) governs judge calibration (S → Y).
- SDP-Gov (Layer 0) governs the SDP itself (Y → Y*).
- Y*-Aligned Systems (Layer 4) ensures prompts and judges target the same construct.
- Layer 0 validates that the shared construct actually aligns with true welfare Y*.
2. The Bridge Validation Protocol (BVP)
The BVP is a suite of tests executed periodically (e.g., quarterly) or when distributional shifts are detected. It comprises three pillars:
| Pillar | Focus | Core Metric | Frequency |
|---|---|---|---|
| 1. Empirical Alignment | Does optimizing Y predict LRO gains? | PTE (Proportion of Treatment Effect Explained) | Annual |
| 2. Construct Validity | Does Y capture essential welfare elements? | Expert audits, red team findings | Quarterly |
| 3. Stability & Invariance | Is the SDP well-specified and stable? | Inter-pool reliability, anchor drift | Quarterly |
3. Pillar 1: Empirical Alignment (PTE and the A/B Bridge)
This pillar tests the predictive validity of Y: does optimizing Y translate to real-world value (LROs)?
3.1. Estimand: Proportion of Treatment Effect Explained (PTE)
We use the standard surrogacy metric PTE (sometimes reported as a surrogate R² or "surrogate index validity") to quantify alignment. Let π_A and π_B be two policies compared in an A/B test.
- Let ΔLRO = E[LRO | π_B] − E[LRO | π_A] (online effect).
- Let ΔY = E[Y | π_B] − E[Y | π_A] (offline effect, estimated via CJE).
PTE is defined as:
PTE = 1 − E[(ΔLRO − β · ΔY)²] / E[ΔLRO²]
where β is a scaling factor (calibration slope), and the expectation is taken over the distribution of A/B tests.
Interpretation
- PTE ≥ 0.7: Strong alignment. A0 holds. Y is a reliable predictor of Y* (via LROs).
- PTE 0.3 - 0.7: Moderate alignment. Y is useful but misses key components of Y*. Consider SDP patches (SDP-Gov).
- PTE < 0.3: Weak alignment. A0 fails. Y does not track true welfare. Requires SDP redesign or fallback to direct measurement.
Note: These thresholds are proposed operational defaults, not empirically validated standards. They are inspired by the surrogate endpoint literature [1], but actual thresholds should reflect your domain's cost tradeoff between false positives and false negatives.
3.1.1. LRO Selection Criteria
LROs must satisfy three criteria to serve as valid proxies for Y*:
- Y*-relevance: The metric should be causally downstream of the welfare construct that Y aims to capture. For "developer productivity," use 90-day task completion rates or code reuse, not session clicks or response length.
- Delayed observation: Measured weeks or months after interaction. If measured immediately, it's likely a proxy for Y, not Y*. The delay allows true welfare consequences to manifest.
- Robustness to gaming: Hard to manipulate via shallow response features (verbosity, formatting, confidence tone). Should reflect genuine user outcomes.
Practical guidance
- Minimum: 3 diverse LROs. If PTE varies widely across LROs (e.g., PTE = 0.8 for retention but 0.2 for revenue), this indicates Y captures only a subset of Y* → investigate which dimensions are missing.
- Diversify: Include metrics from different stakeholder perspectives (user, business, ethical). E.g., user satisfaction, long-term engagement, safety incidents.
- Document: Record LRO selection rationale in the A0 Ledger (§7). This makes the validation transparent and auditable.
3.1.2. The LRO Validation Problem
The circularity
LROs are not Y* itself—they're observable proxies. This creates a validation hierarchy:
- Y* (unobservable ideal)
- LROs (delayed, noisy, but "more Y*-like" than Y)
- Y (SDP output, faster but potentially biased)
- S (cheap surrogates)
The BVP tests "Y → LRO", assuming LRO ≈ Y*. This assumption cannot be fully tested, but can be triangulated via:
- Multiple diverse LROs: If PTE is high for all LROs, confidence increases that they jointly approximate Y*.
- Expert review: "Do these LROs capture what truly matters for welfare?" Qualitative validation that LRO selection is reasonable.
- Sensitivity analysis: Vary LRO definitions (e.g., 60-day vs. 90-day retention), check PTE stability. If PTE is robust, confidence increases.
Document your LRO selection rationale in the A0 Ledger (§7). This makes the implicit assumption "LRO ≈ Y*" explicit and auditable.
3.2. Estimation Procedure (Meta-Analysis)
- Define LROs: Specify key long-run metrics (e.g., 90-day retention, revenue per user, task success rate). See §3.1.1 for selection criteria.
- Aggregate A/B Tests: Collect a suite of recent A/B tests (M ≥ 10 recommended). Each test compares two policies (π_A, π_B).
- Measure ΔLRO and ΔY: For each test:
  - ΔLRO: Online impact (from production A/B test)
  - ΔY: Offline CJE estimate (using SDP labels)
- Estimate PTE: Use cross-validation over the suite of A/B tests to estimate PTE and its confidence interval. Fit β via OLS: ΔLRO,i = β · ΔY,i + ε_i, then compute the residual variance.
Critical: The Novelty Mandate
LLMs memorize public benchmarks. If your validation data overlaps with the model's training corpus, the model may be reciting correct answers rather than reasoning to them (Srivastava et al., 2025). This invalidates the bridge validation.
Requirements:
- Post-cutoff data: Validation datasets must be collected after the model's training cutoff date. Use private, recent data (e.g., last quarter's user interactions).
- Memorization check: Include a diagnostic that tests for verbatim recall of validation examples. If the model can complete partial prompts from memory, the data is contaminated.
- Forbidden: Public benchmarks (MMLU, HumanEval, etc.) unless you can verify they were excluded from training.
PTE Estimation Formula
Given M A/B tests with paired (ΔLRO,i, ΔY,i):
β̂ = (Σ ΔLRO,i · ΔY,i) / (Σ ΔY,i²)
PTE = 1 − (Σ (ΔLRO,i − β̂ · ΔY,i)²) / (Σ ΔLRO,i²)
Use bootstrap or jackknife to estimate the 95% CI for PTE. Report in the A0 Ledger.
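The estimator above can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the BVP specification; the function names and the jackknife CI construction are our own choices.

```python
import numpy as np

def estimate_pte(d_lro, d_y):
    """Calibration slope and PTE from paired per-test effects.

    d_lro, d_y: length-M arrays of (ΔLRO,i, ΔY,i) from the A/B suite.
    Implements β̂ = Σ ΔLRO·ΔY / Σ ΔY² and PTE = 1 − Σ e² / Σ ΔLRO².
    """
    d_lro = np.asarray(d_lro, dtype=float)
    d_y = np.asarray(d_y, dtype=float)
    beta = np.sum(d_lro * d_y) / np.sum(d_y ** 2)   # no-intercept OLS slope
    resid = d_lro - beta * d_y                      # per-test residuals e_i
    pte = 1.0 - np.sum(resid ** 2) / np.sum(d_lro ** 2)
    return beta, pte

def jackknife_ci(d_lro, d_y, z=1.96):
    """Leave-one-test-out jackknife ~95% CI for PTE (illustrative)."""
    d_lro = np.asarray(d_lro, dtype=float)
    d_y = np.asarray(d_y, dtype=float)
    m = len(d_lro)
    loo = np.array([
        estimate_pte(np.delete(d_lro, i), np.delete(d_y, i))[1]
        for i in range(m)
    ])
    pte = estimate_pte(d_lro, d_y)[1]
    se = np.sqrt((m - 1) / m * np.sum((loo - loo.mean()) ** 2))
    return pte - z * se, pte + z * se
```

When ΔLRO is exactly proportional to ΔY, PTE equals 1; independent noise drives it toward 0 (or below, since this definition is not clipped).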
3.3. Secondary Metrics
- Directional Consistency (Sign Test): The frequency with which sign(ΔY,i) = sign(ΔLRO,i). Minimum acceptable: 80%.
- Calibration Slope (β): The scaling factor β quantifies "value translation." β ≈ 1 indicates Y and LROs are on similar scales. β < 1 suggests Y overestimates welfare; β > 1 suggests underestimation.
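The sign test is a one-liner; a hedged sketch (the function name is ours, not from the BVP spec):

```python
import numpy as np

def directional_consistency(d_lro, d_y):
    """Fraction of A/B tests where sign(ΔY) agrees with sign(ΔLRO)."""
    d_lro = np.asarray(d_lro, dtype=float)
    d_y = np.asarray(d_y, dtype=float)
    return float(np.mean(np.sign(d_lro) == np.sign(d_y)))
```

Values below the 80% minimum are a warning sign even when PTE looks healthy, since PTE can be dominated by a few large-effect tests.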
4. Pillar 2: Construct Validity (Audits and Adversarial Testing)
This pillar ensures the SDP captures the essential elements of idealized deliberation through qualitative and adversarial checks.
Test 4.1: Expert Consensus Audit
On a small, high-stakes data slice (n ≈ 50-100 examples), compare Y (SDP output) against labels from "Unbounded Expert Deliberation" (no time/cost constraints). Test for systematic divergence.
Procedure
- Sample high-stakes examples (e.g., safety-critical, high-value users, edge cases).
- Collect Y labels via the standard SDP (time budget T, evidence sources E).
- Collect Y_unbounded labels via the unbounded protocol: expert panel, unlimited time, access to all evidence.
- Test H₀: E[Y − Y_unbounded] = 0. If rejected at significance level α, investigate bias patterns.
Frequency: Annual or when major SDP changes are proposed.
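The divergence test can be implemented as a paired t-test comparing SDP labels against unbounded-deliberation labels on the same examples. A sketch assuming SciPy; the default significance level is a placeholder, not a BVP-mandated value.

```python
import numpy as np
from scipy.stats import ttest_rel  # paired-samples t-test

def audit_bias(y_sdp, y_unbounded, alpha=0.05):
    """Test H0: E[Y − Y_unbounded] = 0 on the audit slice.

    Returns (mean bias, p-value, reject flag). Positive bias means the
    SDP systematically scores higher than unbounded deliberation.
    """
    y_sdp = np.asarray(y_sdp, dtype=float)
    y_unbounded = np.asarray(y_unbounded, dtype=float)
    bias = float(np.mean(y_sdp - y_unbounded))
    _, p = ttest_rel(y_sdp, y_unbounded)
    return bias, float(p), bool(p < alpha)
```

In practice n ≈ 50-100 paired labels; a rejected null triggers a Residual Card rather than an automatic patch.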
Test 4.2: Red Teaming the SDP
Adversarially search for scenarios where following the SDP yields a high score (high Y), but expert consensus or known facts indicate low welfare (low Y*). This identifies protocol gaps.
Example Red Team Finding
Scenario: Code generation task. Model produces syntactically correct but poorly documented code with hardcoded assumptions.
SDP output: High Y (code runs, tests pass, follows style guide).
Expert review: Low Y* (code is unmaintainable, will cause bugs in 6 months).
Protocol gap: SDP rewards "correctness at t=0" but doesn't assess long-term maintainability. Proposed patch: Add obligation "Assess 6-month maintainability" (see §6.2.1).
Test 4.3: Stakeholder and Ethical Review
Periodic review by diverse stakeholders and ethicists to ensure the SDP incorporates perspectives relevant to Y* that may be missing from empirical data (e.g., fairness considerations, long-tail user needs).
- Assemble review panel: domain experts, affected users, ethics researchers.
- Present SDP specification and sample (Y, context, response) triples.
- Elicit: "What welfare-relevant factors does this SDP miss?"
- Document findings; propose SDP patches via SDP-Gov (§6).
5. Pillar 3: Stability and Invariance
This pillar uses statistical checks to ensure the SDP is well-specified and stable across judge pools and time.
Test 5.1: Inter-Pool Reliability
As formalized in the Y*-Aligned Systems appendix (Proposition 2), calibrated scores from different qualified judge pools executing the same SDP must converge. High divergence suggests the SDP is underspecified.
Procedure
- Select two independent judge pools (P1, P2) meeting SDP qualification criteria.
- Both pools label the same sample (n ≈ 200) following the SDP. Obtain Y_P1 and Y_P2.
- Compute intraclass correlation (ICC) or mean absolute difference. Target: ICC > 0.8.
- If ICC < 0.7, the SDP is underspecified → clarify obligations, add worked examples, tighten rubric.
Frequency: Quarterly, or when onboarding new judge pools.
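One way to compute the inter-pool statistic is ICC(2,1): two-way random effects, absolute agreement, single rater. A minimal NumPy sketch, assuming scores are arranged as an (examples × pools) matrix; for production use, a vetted library implementation is preferable.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1) for an (n_targets, k_raters) matrix of scores.

    Columns are judge pools (e.g., P1, P2); rows are labeled examples.
    Computed from the standard two-way ANOVA mean squares.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row = x.mean(axis=1, keepdims=True)   # per-example means
    col = x.mean(axis=0, keepdims=True)   # per-pool means
    msr = k * np.sum((row - grand) ** 2) / (n - 1)          # between-targets
    msc = n * np.sum((col - grand) ** 2) / (k - 1)          # between-raters
    mse = np.sum((x - row - col + grand) ** 2) / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Identical pools give ICC = 1; disagreement that is large relative to the between-example spread pulls it toward 0.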
Test 5.2: Temporal Stability (Anchor Drift)
Re-evaluate reference policies quarterly. If drift exceeds a threshold (e.g., 0.05), the operational definition of welfare has changed, requiring re-anchoring.
Procedure
- At project start, define anchor policies and measure their Y scores on a fixed holdout set.
- Every quarter, re-measure on the same holdout. Compute drift: |Y_t(π_anchor) − Y_0(π_anchor)|.
- If drift > 0.05 for either anchor, flag it. Investigate the cause (SDP changed? judge pool changed? context distribution changed?).
- If drift persists, create Anchor v2.0 and re-normalize historical data for comparability.
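The quarterly drift check is mechanical enough to automate. A sketch under assumed conventions (anchor names and the dict layout are illustrative; the 0.05 default mirrors the example threshold above):

```python
def check_anchor_drift(baseline, current, threshold=0.05):
    """Return anchors whose |Y_t − Y_0| exceeds the drift threshold.

    baseline/current: dicts mapping anchor policy name -> mean Y on the
    fixed holdout at project start and at the current quarter.
    """
    return {
        name: abs(current[name] - y0)
        for name, y0 in baseline.items()
        if abs(current[name] - y0) > threshold
    }
```

An empty return means no re-anchoring is needed this quarter; any flagged anchor triggers the investigation step above.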
6. SDP-Gov: Governance for the SDP
When the BVP detects an A0 failure, SDP-Gov (CLOVER for the SDP) provides the governance framework for improving the protocol. It treats the SDP as a versioned artifact subject to CI/CD principles.
6.1. The SDP-Gov Loop
- Audit (BVP Execution): Execute the BVP.
- Diagnose (SDP Residual Cards): If failures are detected (e.g., low PTE), analyze the "A0 residuals" to identify systematic gaps where Y diverges from the LROs. Summarize findings in SDP Residual Cards.
- Synthesize Patch (δ): Propose a structured change δ to the current SDP version. Patches should prioritize "obligation-first" changes (add missing welfare dimensions) over "constraint" changes (tighten existing criteria).
- Validate Patch (Acceptance Predicate): Test the patch on a holdout validation set.
6.2. The Validation Slice
Validating a patch requires a specialized Validation Slice: a dataset collected on a time-separated holdout, where Y (old SDP), Y′ (new SDP), AND LRO data (or a faster proxy) are all measured.
Why time-separated?
To prevent adaptive overfitting. If the validation set is the same data used to identify the problem, any patch will appear to "work." Time separation ensures the patch generalizes to new data.
6.2.1. Example SDP Residual Card
Residual Card #2025-Q3-01
- Slice: Complex coding tasks (n=500, data science domain)
- Failure mode: Y overestimated welfare (PTE = 0.35 in this slice)
- Evidence: SDP-high responses (Y > 0.8) had verbose explanations + correct syntax, but 90-day follow-up showed low code reuse (15%) and high debugging time (+40% vs. baseline).
- Root cause: SDP rewards "looks correct at t=0" (syntax, tests pass, follows style guide) but misses long-term maintainability and documentation quality (which impact LROs).
- Proposed patch δ: Add obligation: "Assess 6-month maintainability: (a) Is the code well-documented for future readers? (b) Are edge cases and assumptions made explicit? (c) Would a new team member understand this code in 6 months without asking the author?"
- Expected impact: Increase PTE from 0.35 to >0.6 on coding tasks by penalizing "quick-but-fragile" code.
6.3. The Acceptance Predicate
A patch is accepted ONLY IF it passes all guardrails on the Validation Slice:
Accept(δ) =
- [ΔPTE ≥ η] (Empirical Improvement): Must significantly improve PTE against LROs. (Primary Criterion). Default: η = 0.05.
- ∧ [Green-Slice Non-Inferiority] (Do No Harm): Must not degrade alignment in domains where the current SDP performs well. Test on slices with PTE > 0.7.
- ∧ [Fairness Non-Regression] (Equity): Must not increase bias or disparate impact across protected groups (requires fairness metrics in the validation set).
- ∧ [Expert Alignment Improvement] (Validity): Must improve alignment with expert audits on high-stakes examples (Pillar 2).
- ∧ [Anchor Stability] (Stability): Must not significantly drift the [0,1] scale vs. the current anchors. Test: anchor re-evaluation on holdout.
- ∧ [Cost/Latency Constraints] (Operational): Must remain within budgeted cost/latency for collecting Y.
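Because the predicate is a pure conjunction, it can be encoded directly. A sketch under stated assumptions: the field names, the non-inferiority margin, and every threshold other than η are our own placeholders, not values fixed by SDP-Gov.

```python
from dataclasses import dataclass

@dataclass
class PatchEvidence:
    """Validation-slice measurements for a proposed patch (names illustrative)."""
    delta_pte: float                # PTE improvement vs. current SDP
    green_slice_pte_drop: float     # worst-case PTE drop on healthy slices
    fairness_regression: bool       # any increase in disparate impact?
    expert_alignment_improved: bool # Pillar 2 audit result
    anchor_drift: float             # scale drift vs. current anchors
    cost_ratio: float               # new labeling cost / budget

def accept(e, eta=0.05, noninf_margin=0.02, drift_max=0.05):
    """All six guardrails must pass; thresholds are proposed defaults."""
    return (
        e.delta_pte >= eta
        and e.green_slice_pte_drop <= noninf_margin
        and not e.fairness_regression
        and e.expert_alignment_improved
        and e.anchor_drift <= drift_max
        and e.cost_ratio <= 1.0
    )
```

Encoding the predicate as code forces every guardrail to be measured before deployment; a missing field is a loud failure rather than a silent skip.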
6.4. Versioning and Deployment
If accepted, deploy the new SDP version. This action mandates:
- Re-calibration of all associated judges (CIMO Layer 2), as the measurement scale has changed.
- Changelog entry: Document what changed, why, and the validation evidence.
- Backward compatibility: If historical comparisons are needed, maintain the ability to score with the previous SDP version for a deprecation period.
7. Reporting: The A0 Ledger
Every evaluation report must include the A0 Ledger, summarizing the validation status.
A0 Ledger Template
| SDP Version: | v1.1 |
| Empirical Alignment (PTE): | 0.78 [0.72, 0.84] on LRO suite v3.0 (M=15 A/B tests, 2024-Q4) |
| LROs Used: | 90-day retention, revenue per user, task success rate |
| Construct Validity Status: | Passed Expert Audit Q4 (n=50, bias < 0.02); Red Team findings: 2 minor gaps documented |
| Stability Status: | Inter-Pool Reliability ICC = 0.83; Anchor Drift < 0.02 |
| SDP-Gov Status: | Last patch accepted 2025-11-10 (Residual Card #2025-Q3-01) |
| A0 Status: | PASS (PTE > 0.7) |
Status Thresholds (Proposed Defaults)
- PASS: PTE ≥ 0.7, all Pillar 2/3 tests pass
- WARN: PTE 0.5-0.7, or minor Pillar 2/3 issues → Monitor closely, consider patches
- FAIL: PTE < 0.5, or major Pillar 2/3 failures → SDP redesign required
Calibrate thresholds to your risk tolerance.
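The status mapping can be encoded as a small helper. Note this sketch collapses "minor Pillar 2/3 issues" into a single boolean per pillar; a real implementation would grade issue severity rather than treat any pillar failure as FAIL.

```python
def a0_status(pte, pillar2_ok, pillar3_ok):
    """Map BVP results to an A0 Ledger status using the proposed defaults."""
    if pte >= 0.7 and pillar2_ok and pillar3_ok:
        return "PASS"
    if pte >= 0.5 and pillar2_ok and pillar3_ok:
        return "WARN"
    return "FAIL"
```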
8. BVP Cadence and Computational Cost
The BVP requires A/B tests, LRO data collection, and expert time. Run it at appropriate intervals:
Quarterly (Lightweight)
- Anchor drift check (Pillar 3, Test 5.2): 1-2 hours, automated
- Inter-pool reliability on new data (Pillar 3, Test 5.1): ~40 labeling hours (2 pools × 200 examples × 6 min/example)
Annually (Full Audit)
- PTE meta-analysis on year's A/B tests (Pillar 1): ~1 week analysis + LRO data pipeline
- Expert consensus audit (Pillar 2, Test 4.1): ~20 expert hours for n=50 examples
Triggered (As Needed)
- Major SDP change → Full BVP before deployment
- PTE drops below 0.5 → Red team (Pillar 2, Test 4.2) + SDP-Gov patch cycle
- Anchor drift > 0.05 → Re-anchor and re-run PTE
Cost-Benefit
The BVP is expensive (A/B testing, LRO tracking), but the cost of optimizing for the wrong objective is much higher. A single quarter of shipping a model optimized to a misaligned Y can waste millions in compute and harm user trust. The BVP makes that risk quantifiable and manageable.
9. Worked Examples
Example 1: PTE Calculation
Scenario
You've run M=12 A/B tests over the past year. For each test i, you have:
- ΔLRO,i: 90-day retention lift (percentage points)
- ΔY,i: Offline CJE estimate of welfare difference
Data:
| Test | ΔLRO | ΔY |
|---|---|---|
| 1 | 2.1 | 0.08 |
| 2 | -0.5 | -0.02 |
| 3 | 3.2 | 0.12 |
| ... | ... | ... |
| 12 | 1.8 | 0.07 |
Calculation:
- Compute β: β̂ = (Σ ΔLRO,i · ΔY,i) / (Σ ΔY,i²) = 24.5 (retention points per Y unit)
- Compute residuals: e_i = ΔLRO,i − β̂ · ΔY,i
- PTE = 1 − (Σ e_i²) / (Σ ΔLRO,i²) = 0.81
Result: PTE = 0.81 [0.74, 0.88] → A0 PASS. Y strongly predicts LROs.
Example 2: SDP-Gov Patch Cycle
Timeline
- 2025-Q3: BVP detects PTE = 0.35 on coding tasks (Residual Card #2025-Q3-01)
- 2025-Q3: Propose patch δ: Add "6-month maintainability" obligation
- 2025-Q4: Collect Validation Slice (n=300, time-separated, with Y, Y', and LRO data)
- 2025-Q4: Test acceptance predicate:
- ✓ ΔPTE = +0.28 (0.35 → 0.63)
- ✓ Green-slice non-inferiority: PTE unchanged on non-coding tasks
- ✓ Fairness: No disparate impact across user segments
- ✓ Expert alignment: Improved on 45/50 high-stakes examples
- ✓ Anchor stability: Drift < 0.01
- ✓ Cost: +15% labeling time (acceptable)
- 2025-11-10: Deploy SDP v1.2 with patch δ
- 2025-11-11: Re-calibrate all judges to SDP v1.2 scale
Assumptions Ledger
This table extends the AI Quality as Surrogacy assumptions ledger with A0.
| Code | Statement | Used by | Test / Diagnostic | Mitigation |
|---|---|---|---|---|
| A0 | Y ≈ Y* (Bridge Assumption) | All Layers | BVP (Pillar 1: PTE; Pillar 2: Audits; Pillar 3: Stability) | SDP-Gov: SDP Patching and Governance |
| S1 | S → Y (Surrogacy) | Layers 2-6 | Calibration residuals; Prentice test | Add covariates; richer judge; higher rung |
| S2 | S-admissibility | Layers 2-6 (cross-environment) | Transport test; diagram review | If selection into Y: recalibrate with target oracle labels |
Citation
If you use this work, please cite:
BibTeX
@misc{landesberg2025bridgea0,
author = {Landesberg, Eddie},
title = {Validating the Bridge Assumption (A0): The Bridge Validation Protocol and SDP-Gov},
year = {2025},
month = {November},
url = {https://cimolabs.com/research/bridge-validation-a0},
note = {CIMO Labs Technical Report}
}
Plain Text
Landesberg, E. (2025). Validating the Bridge Assumption (A0): The Bridge Validation Protocol and SDP-Gov. CIMO Labs Technical Report. https://cimolabs.com/research/bridge-validation-a0
