CIMO Labs

Validating the Bridge Assumption (A0): The Bridge Validation Protocol and SDP-Gov

Technical Appendix v1.0

Eddie Landesberg, CIMO Labs

Abstract

The CIMO Framework relies on the Bridge Assumption (A0): that the operational welfare label Y (measured via the Standard Deliberation Protocol, SDP) aligns with the idealized target Y^*. While CIMO rigorously calibrates surrogates S to Y (Layers 2-3), the link Y \to Y^* requires explicit validation. We introduce the Validation Layer ("Layer 0") to address this. It comprises the Bridge Validation Protocol (BVP), a suite of tests for A0, and SDP-Gov, a governance framework for the SDP. The BVP uses the Proportion of Treatment Effect Explained (PTE) as the core metric for empirical alignment against Long-Run Outcomes (LROs). SDP-Gov provides a CI/CD system for the SDP itself. Together, they ensure the operational target remains aligned with true welfare.

Prerequisites: This appendix assumes familiarity with the CIMO Framework architecture, particularly Y*-Aligned Systems (SDP, Y* definition) and AI Quality as Surrogacy (calibration, S-admissibility). For the conceptual introduction, see The CIMO Framework.

0. The A0 Problem and Notation

A0 (Bridge Assumption)

\mathbb{E}[Y^* \mid X, A] = \mathbb{E}[Y \mid X, A]

The operational label Y, produced by the Standard Deliberation Protocol (SDP), is conditionally unbiased for the Idealized Deliberation Oracle outcome Y^*.

The Problem: If A0 fails due to construct drift, protocol gaps, or rater bias, Y diverges from Y^*. The CIMO stack may then rigorously optimize systems toward the wrong objective.

The Irreducibility of A0: A Philosophical Caveat

A0 is the foundational assumption of the CIMO Framework. It cannot be "proven" in an absolute sense—it rests on the philosophical claim that the operational measurement process (Y via SDP) captures what we care about (Y*).

The LRO Validation Problem: BVP tests Y→LRO (do policies that score high on Y produce better long-run outcomes?). But this assumes LRO→Y* (that long-run outcomes are themselves aligned with true welfare). If LROs are poor proxies for Y* (e.g., optimizing for engagement rather than long-term value, or user retention rather than genuine satisfaction), a high PTE provides false confidence.

What BVP provides: Not proof of A0, but evidence and vigilance. BVP establishes that Y predicts improvements in measurable real-world outcomes, provides structured adversarial auditing (Pillar 2), and offers a governance framework to detect drift. This is the best we can do. All evaluation frameworks rest on similar irreducible assumptions about what constitutes "good." CIMO's contribution is making this assumption explicit (A0), providing validation machinery (BVP), and maintaining governance (SDP-Gov) rather than leaving it implicit and unexamined.

Notation & The Measurement Hierarchy

  • Y^*: Idealized Deliberation Oracle outcome (True Welfare).
    The theoretical construct representing welfare under perfect deliberation—complete information, reflective consistency, impartial aggregation. Unobservable in practice.
  • Y: Operational welfare label (measured via SDP).
    The practical measurement produced by the Standard Deliberation Protocol. Approximates Y^* but is measurable at scale. This is what we use for calibration (Layer 2: S \to Y).
  • LROs (Long-Run Outcomes): Real-world metrics that approximate Y^*.
    Observable metrics measured weeks/months after interaction (e.g., 90-day retention, revenue, task success rate). These are delayed proxies for Y^*, not Y^* itself. We test whether optimizing Y predicts improvements in LROs.
  • S: Cheap surrogates (e.g., LLM-judge scores).
    Fast, scalable signals calibrated to Y (Layer 2).

The validation hierarchy:

Y^* \quad \xleftarrow{\text{A0}} \quad Y \quad \xleftarrow{\text{S1}} \quad S
  • Layer 0 (A0): Y \to Y^* (Bridge: operational → idealized)
  • Layer 2 (S1): S \to Y (Surrogacy: cheap signal → operational)

We cannot directly test Y \to Y^* because Y^* is unobservable. Instead, we test Y \to \text{LROs} and assume LROs approximate Y^* (see §3.1.2 for the LRO validation problem).

1. The Validation Layer Framework (Layer 0)

The Validation Layer acts as Layer 0, ensuring the foundation of the CIMO stack (Layer 1: Y* Definition) remains sound. It has two components:

  • The Bridge Validation Protocol (BVP) for testing A0 via empirical alignment, construct validity, and stability checks.
  • SDP-Gov for governing the SDP—a CI/CD system that allows safe evolution of the operational welfare definition.

Relationship to other CIMO components

  • CLOVER (Layer 6) governs judge calibration (S \to Y).
  • SDP-Gov (Layer 0) governs the SDP itself (Y \to Y^*).
  • Y*-Aligned Systems (Layer 4) ensures prompts and judges target the same construct.
  • Layer 0 validates that the shared construct Y actually aligns with true welfare Y^*.

2. The Bridge Validation Protocol (BVP)

The BVP is a suite of tests executed periodically (e.g., quarterly) or when distributional shifts are detected. It comprises three pillars:

| Pillar | Focus | Core Metric | Frequency |
| --- | --- | --- | --- |
| 1. Empirical Alignment | Does optimizing Y predict LRO gains? | PTE (Proportion of Treatment Effect Explained) | Annual |
| 2. Construct Validity | Does Y capture essential welfare elements? | Expert audits, red team findings | Quarterly |
| 3. Stability & Invariance | Is the SDP well-specified and stable? | Inter-pool reliability, anchor drift | Quarterly |

3. Pillar 1: Empirical Alignment (PTE and the A/B Bridge)

This pillar tests the predictive validity of Y: does optimizing Y translate to real-world value (LROs)?

3.1. Estimand: Proportion of Treatment Effect Explained (PTE)

We use the standard surrogacy metric PTE (closely related to the trial-level R² of the surrogate endpoint literature) to quantify alignment. Let \pi_A, \pi_B be two policies compared in an A/B test.

  • Let \Delta_{\text{LRO}} = \mathbb{E}[\text{LRO} \mid \pi_A] - \mathbb{E}[\text{LRO} \mid \pi_B] (online effect).
  • Let \Delta_Y = \mathbb{E}[Y \mid \pi_A] - \mathbb{E}[Y \mid \pi_B] (offline effect, estimated via CJE).

PTE is defined as:

\text{PTE} = 1 - \frac{\mathbb{E}[(\Delta_{\text{LRO}} - \beta \cdot \Delta_Y)^2]}{\mathbb{E}[\Delta_{\text{LRO}}^2]}

where \beta is a scaling factor (calibration slope), and the expectation is taken over the distribution of A/B tests.

Interpretation

  • PTE ≥ 0.7: Strong alignment. A0 holds. Y is a reliable predictor of Y^* (via LROs).
  • PTE 0.3-0.7: Moderate alignment. Y is useful but misses key components of Y^*. Consider SDP patches (SDP-Gov).
  • PTE < 0.3: Weak alignment. A0 fails. Y does not track true welfare. Requires SDP redesign or fallback to direct Y^* measurement.

Note: These thresholds are proposed operational defaults, not empirically validated standards. They are inspired by the surrogate endpoint literature [1], but actual thresholds should reflect your domain's cost tradeoffs between false positives and false negatives.

3.1.1. LRO Selection Criteria

LROs must satisfy three criteria to serve as valid proxies for Y^*:

  1. Y^*-relevance: The metric should be causally downstream of the welfare construct Y^* aims to capture. For "developer productivity," use 90-day task completion rates or code reuse, not session clicks or response length.
  2. Delayed observation: Measured weeks or months after interaction. If measured immediately, it's likely a proxy for Y, not Y^*. The delay allows true welfare consequences to manifest.
  3. Robustness to gaming: Hard to manipulate via shallow response features (verbosity, formatting, confidence tone). Should reflect genuine user outcomes.

Practical guidance

  • Minimum: 3 diverse LROs. If PTE varies widely across LROs (e.g., PTE = 0.8 for retention but 0.2 for revenue), this indicates Y captures only a subset of Y^* → investigate which dimensions are missing.
  • Diversify: Include metrics from different stakeholder perspectives (user, business, ethical). E.g., user satisfaction, long-term engagement, safety incidents.
  • Document: Record LRO selection rationale in the A0 Ledger (§7). This makes the validation transparent and auditable.

3.1.2. The LRO Validation Problem

The circularity

LROs are not Y^* itself—they're observable proxies. This creates a validation hierarchy:

  • Y^* (unobservable ideal)
  • LROs (delayed, noisy, but "more Y^*-like" than Y)
  • Y (SDP output, faster but potentially biased)
  • S (cheap surrogates)

The BVP tests Y \to \text{LROs}, assuming \text{LROs} \approx Y^*. This assumption cannot be fully tested, but can be triangulated via:

  1. Multiple diverse LROs: If PTE is high for all LROs, confidence increases that they jointly approximate Y^*.
  2. Expert review: "Do these LROs capture what truly matters for welfare?" Qualitative validation that LRO selection is reasonable.
  3. Sensitivity analysis: Vary LRO definitions (e.g., 60-day vs. 90-day retention), check PTE stability. If PTE is robust, confidence increases.

Document your LRO selection rationale in the A0 Ledger (§7). This makes the implicit assumption "\text{LROs} \approx Y^*" explicit and auditable.

3.2. Estimation Procedure (Meta-Analysis)

  1. Define LROs: Specify key long-run metrics (e.g., 90-day retention, revenue per user, task success rate). See §3.1.1 for selection criteria.
  2. Aggregate A/B Tests: Collect a suite of recent A/B tests (M ≥ 10 recommended). Each test compares two policies (\pi_A, \pi_B).
  3. Measure \Delta_{\text{LRO}} and \Delta_Y: For each test:
    • \Delta_{\text{LRO}}: Online impact (from production A/B test)
    • \Delta_Y: Offline CJE estimate (using SDP labels)
  4. Estimate PTE: Use cross-validation over the suite of A/B tests to estimate PTE and its confidence interval. Fit \beta via OLS: \Delta_{\text{LRO}} \sim \beta \cdot \Delta_Y, then compute residual variance.

Critical: The Novelty Mandate

LLMs memorize public benchmarks. If your validation data overlaps with the model's training corpus, the model may be reciting correct answers rather than reasoning to them (Srivastava et al., 2025). This invalidates the bridge validation.

Requirements:

  • Post-cutoff data: Validation datasets must be collected after the model's training cutoff date. Use private, recent data (e.g., last quarter's user interactions).
  • Memorization check: Include a diagnostic that tests for verbatim recall of validation examples. If the model can complete partial prompts from memory, the data is contaminated.
  • Forbidden: Public benchmarks (MMLU, HumanEval, etc.) unless you can verify they were excluded from training.

PTE Estimation Formula

Given M A/B tests with paired (\Delta_{\text{LRO},i}, \Delta_{Y,i}):

\widehat{\beta} = \frac{\sum_{i=1}^M \Delta_{\text{LRO},i} \cdot \Delta_{Y,i}}{\sum_{i=1}^M \Delta_{Y,i}^2}

\widehat{\text{PTE}} = 1 - \frac{\sum_{i=1}^M (\Delta_{\text{LRO},i} - \widehat{\beta} \cdot \Delta_{Y,i})^2}{\sum_{i=1}^M \Delta_{\text{LRO},i}^2}

Use bootstrap or jackknife to estimate a 95% CI for PTE. Report \widehat{\text{PTE}} \in [L, U] in the A0 Ledger.
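The estimator above can be sketched in a few lines. This is an illustrative stdlib sketch, not CIMO library code; the data and names are made up, and resampling is over A/B tests (the meta-analytic unit), not over users:

```python
import random

def fit_beta(pairs):
    """OLS slope through the origin: beta = sum(lro * y) / sum(y^2)."""
    return sum(l * y for l, y in pairs) / sum(y * y for _, y in pairs)

def pte(pairs):
    """PTE-hat = 1 - residual SS / total SS of the LRO effects."""
    beta = fit_beta(pairs)
    rss = sum((l - beta * y) ** 2 for l, y in pairs)
    tss = sum(l * l for l, _ in pairs)
    return 1 - rss / tss

def bootstrap_ci(pairs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling whole A/B tests."""
    rng = random.Random(seed)
    draws = sorted(
        pte([rng.choice(pairs) for _ in pairs]) for _ in range(n_boot)
    )
    return draws[int(alpha / 2 * n_boot)], draws[int((1 - alpha / 2) * n_boot) - 1]

# Illustrative suite of M=6 A/B tests: (delta_LRO, delta_Y) per test.
suite = [(2.1, 0.08), (-0.5, -0.02), (3.2, 0.12),
         (1.1, 0.05), (-1.2, -0.06), (1.8, 0.07)]
lo, hi = bootstrap_ci(suite)
print(round(pte(suite), 2), round(lo, 2), round(hi, 2))
```

With M ≈ 10-15 tests the bootstrap CI will be wide; report it honestly rather than the point estimate alone.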

3.3. Secondary Metrics

  • Directional Consistency (Sign Test): The frequency with which \text{sign}(\Delta_{\text{LRO}}) = \text{sign}(\Delta_Y). Minimum acceptable: 80%.
  • Calibration Slope (\beta): The scaling factor quantifies "value translation." \beta \approx 1 indicates Y and LROs are on similar scales. \beta \ll 1 suggests Y overestimates welfare; \beta \gg 1 suggests underestimation.
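The sign test is a one-liner over the same paired per-test effects (data here is illustrative):

```python
def sign_agreement(pairs):
    """Fraction of A/B tests where offline and online effects agree in sign."""
    return sum(1 for l, y in pairs if (l > 0) == (y > 0)) / len(pairs)

suite = [(2.1, 0.08), (-0.5, -0.02), (3.2, 0.12), (1.1, -0.01)]
print(sign_agreement(suite))  # 3 of 4 tests agree -> 0.75, below the 80% bar
```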

4. Pillar 2: Construct Validity (Audits and Adversarial Testing)

This pillar ensures the SDP captures the essential elements of idealized deliberation through qualitative and adversarial checks.

Test 4.1: Expert Consensus Audit

On a small, high-stakes data slice (n ≈ 50-100 examples), compare Y (SDP output) against "Unbounded Expert Deliberation" (no time/cost constraints). Test for systematic divergence.

Procedure

  1. Sample high-stakes examples (e.g., safety-critical, high-value users, edge cases).
  2. Collect Y labels via standard SDP (time budget T, evidence sources E).
  3. Collect Y_{\text{expert}} labels via unbounded protocol: expert panel, unlimited time, access to all evidence.
  4. Test H_0: \mathbb{E}[Y - Y_{\text{expert}}] = 0. If rejected at \alpha = 0.05, investigate bias patterns.

Frequency: Annual or when major SDP changes are proposed.
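Step 4 can be sketched as a paired t-statistic using only the stdlib. The audit data below is illustrative (the protocol calls for n ≈ 50-100); a production audit would use the exact t distribution or a nonparametric alternative rather than a rough |t| > 2 rule:

```python
import math
import statistics

def paired_bias_test(y_sdp, y_expert):
    """Mean bias and t-statistic for H0: E[Y - Y_expert] = 0."""
    diffs = [a - b for a, b in zip(y_sdp, y_expert)]
    mean = statistics.fmean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean, mean / se

# Illustrative audit slice: SDP scores sit consistently above expert scores.
y_sdp    = [0.82, 0.75, 0.90, 0.66, 0.71, 0.88]
y_expert = [0.80, 0.70, 0.85, 0.60, 0.68, 0.81]
bias, t = paired_bias_test(y_sdp, y_expert)
print(round(bias, 3), round(t, 1))
```

A positive bias with a large |t| is the "SDP systematically overestimates welfare" pattern that feeds a Residual Card.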

Test 4.2: Red Teaming the SDP

Adversarially search for scenarios where following the SDP yields a high Y score, but expert consensus or known facts indicate low welfare (low Y^*). This identifies protocol gaps.

Example Red Team Finding

Scenario: Code generation task. Model produces syntactically correct but poorly documented code with hardcoded assumptions.

SDP output: Y = 0.85 (code runs, tests pass, follows style guide).

Expert review: Low Y^* (code is unmaintainable, will cause bugs in 6 months).

Protocol gap: SDP rewards "correctness at t=0" but doesn't assess long-term maintainability. Proposed patch: Add obligation "Assess 6-month maintainability" (see §6.2.1).

Test 4.3: Stakeholder and Ethical Review

Periodic review by diverse stakeholders and ethicists to ensure the SDP incorporates perspectives relevant to Y^* that may be missing from empirical data (e.g., fairness considerations, long-tail user needs).

  • Assemble review panel: domain experts, affected users, ethics researchers.
  • Present SDP specification and sample (Y, context, response) triples.
  • Elicit: "What welfare-relevant factors does this SDP miss?"
  • Document findings; propose SDP patches via SDP-Gov (§6).

5. Pillar 3: Stability and Invariance

This pillar uses statistical checks to ensure the SDP is well-specified and stable across judge pools and time.

Test 5.1: Inter-Pool Reliability

As formalized in the Y*-Aligned Systems appendix (Proposition 2), calibrated scores from different qualified judge pools executing the same SDP must converge. High divergence suggests the SDP is underspecified.

Procedure

  1. Select two independent judge pools (P1, P2) meeting SDP qualification criteria.
  2. Both pools label the same sample (n ≈ 200) following the SDP. Obtain Y_{P1} and Y_{P2}.
  3. Compute intraclass correlation (ICC) or mean absolute difference. Target: ICC > 0.8.
  4. If ICC < 0.7, the SDP is underspecified → clarify obligations, add worked examples, tighten rubric.

Frequency: Quarterly, or when onboarding new judge pools.
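A one-way random-effects ICC for two pools can be computed from the ANOVA mean squares. This is a hedged sketch with illustrative labels; for production use a vetted implementation (e.g., `pingouin.intraclass_corr`) and the ICC form appropriate to your rater design:

```python
import statistics

def icc_oneway(pool1, pool2):
    """One-way random-effects ICC(1,1) for two ratings per item."""
    n, k = len(pool1), 2
    grand = statistics.fmean(pool1 + pool2)
    item_means = [(a + b) / 2 for a, b in zip(pool1, pool2)]
    # Between-item and within-item mean squares from one-way ANOVA.
    msb = k * sum((m - grand) ** 2 for m in item_means) / (n - 1)
    msw = sum((a - m) ** 2 + (b - m) ** 2
              for a, b, m in zip(pool1, pool2, item_means)) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Illustrative labels from two pools on the same n=6 items.
y_p1 = [0.90, 0.20, 0.70, 0.40, 0.80, 0.30]
y_p2 = [0.85, 0.25, 0.65, 0.45, 0.75, 0.35]
print(round(icc_oneway(y_p1, y_p2), 2))
```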

Test 5.2: Temporal Stability (Anchor Drift)

Re-evaluate reference policies (\pi_{\text{low}}, \pi_{\text{high}}) quarterly. If drift exceeds a threshold (e.g., 0.05), the operational definition of welfare has changed, requiring re-anchoring.

Procedure

  1. At project start, define anchor policies and measure V(\pi_{\text{low}}), V(\pi_{\text{high}}) on a fixed holdout set.
  2. Every quarter, re-measure on the same holdout. Compute drift: \delta = |V_{\text{current}} - V_{\text{baseline}}|.
  3. If \delta > 0.05 for either anchor, flag drift. Investigate the cause (SDP changed? judge pool changed? context distribution changed?).
  4. If drift persists, create Anchor v2.0 and re-normalize historical data for comparability.
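Steps 2-3 reduce to a threshold comparison; a minimal sketch with illustrative anchor values (the 0.05 default is the one proposed above):

```python
def anchor_drift(v_baseline, v_current, threshold=0.05):
    """Return (drift, flagged) for one anchor policy on the fixed holdout."""
    drift = abs(v_current - v_baseline)
    return drift, drift > threshold

drift, flagged = anchor_drift(v_baseline=0.20, v_current=0.27)
print(round(drift, 2), flagged)  # drift of 0.07 exceeds 0.05 -> flagged
```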

6. SDP-Gov: Governance for the SDP

When the BVP detects an A0 failure, SDP-Gov (CLOVER for the SDP) provides the governance framework for improving the protocol. It treats the SDP as a versioned artifact subject to CI/CD principles.

6.1. The SDP-Gov Loop

  1. Audit (BVP Execution): Execute the BVP.
  2. Diagnose (SDP Residual Cards): If failures are detected (e.g., low PTE), analyze the "A0 residuals" (\Delta_{\text{LRO}} - \beta \cdot \Delta_Y) to identify systematic gaps where Y diverges from Y^*. Summarize findings in SDP Residual Cards.
  3. Synthesize Patch (\delta): Propose a structured change \delta to the current SDP \theta. Patches should prioritize "obligation-first" changes (add missing welfare dimensions) over "constraint" changes (tighten existing criteria).
  4. Validate Patch (Acceptance Predicate): Test the patch \theta' = \theta \oplus \delta on a holdout validation set.

6.2. The Validation Slice

Validating a patch requires a specialized Validation Slice: a dataset collected on a time-separated holdout, where Y (old SDP), Y' (new SDP), AND LRO data (or a faster proxy) are all measured.

Why time-separated?

To prevent adaptive overfitting. If the validation set is the same data used to identify the problem, any patch will appear to "work." Time separation ensures the patch generalizes to new data.

6.2.1. Example SDP Residual Card

Residual Card #2025-Q3-01

  • Slice: Complex coding tasks (n=500, data science domain)
  • Failure mode: Y overestimated welfare (PTE = 0.35 in this slice)
  • Evidence: SDP-high responses (Y > 0.8) had verbose explanations + correct syntax, but 90-day follow-up showed low code reuse (15%) and high debugging time (+40% vs. baseline).
  • Root cause: SDP rewards "looks correct at t=0" (syntax, tests pass, follows style guide) but misses long-term maintainability and documentation quality (which impact LROs).
  • Proposed patch δ: Add obligation: "Assess 6-month maintainability: (a) Is the code well-documented for future readers? (b) Are edge cases and assumptions made explicit? (c) Would a new team member understand this code in 6 months without asking the author?"
  • Expected impact: Increase PTE from 0.35 to >0.6 on coding tasks by penalizing "quick-but-fragile" code.

6.3. The Acceptance Predicate

A patch is accepted ONLY IF it passes all guardrails on the Validation Slice:

Accept(\delta) =

  • [ΔPTE ≥ η] (Empirical Improvement): Must significantly improve PTE against LROs. (Primary Criterion). Default: η = 0.05.
  • ∧ [Green-Slice Non-Inferiority] (Do No Harm): Must not degrade alignment in domains where the current SDP performs well. Test on slices with PTE > 0.7.
  • ∧ [Fairness Non-Regression] (Equity): Must not increase bias or disparate impact across protected groups (requires fairness metrics in the validation set).
  • ∧ [Expert Alignment Improvement] (Validity): Must improve alignment with expert audits on high-stakes examples (Pillar 2).
  • ∧ [Anchor Stability] (Stability): Must not significantly drift the [0,1] scale vs. \pi_{\text{low}}/\pi_{\text{high}}. Test: anchor re-evaluation on holdout.
  • ∧ [Cost/Latency Constraints] (Operational): Must remain within budgeted cost/latency for collecting Y.
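The predicate is a plain conjunction and can be encoded directly. Field names and example values below are illustrative; η = 0.05 and the drift bound are the defaults stated in the text:

```python
from dataclasses import dataclass

@dataclass
class PatchEval:
    delta_pte: float           # PTE improvement on the Validation Slice
    green_slice_ok: bool       # non-inferiority on slices where PTE > 0.7
    fairness_ok: bool          # no disparate-impact regression
    expert_alignment_ok: bool  # improved vs. Pillar 2 expert audits
    anchor_drift: float        # drift of the [0,1] scale vs. anchors
    within_budget: bool        # cost/latency constraints met

def accept(e: PatchEval, eta: float = 0.05, max_drift: float = 0.05) -> bool:
    """A patch is accepted ONLY IF every guardrail passes."""
    return (e.delta_pte >= eta
            and e.green_slice_ok
            and e.fairness_ok
            and e.expert_alignment_ok
            and e.anchor_drift <= max_drift
            and e.within_budget)

# Illustrative evaluation matching the numbers in Example 2 (section 9).
patch = PatchEval(0.28, True, True, True, 0.01, True)
print(accept(patch))  # True
```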

6.4. Versioning and Deployment

If accepted, deploy SDP v_{t+1}. This action mandates:

  • Re-calibration of all associated judges (CIMO Layer 2), as the measurement scale Y has changed.
  • Changelog entry: Document what changed, why, and the validation evidence.
  • Backward compatibility: If historical comparisons are needed, maintain the ability to score with SDP v_t for a deprecation period.

7. Reporting: The A0 Ledger

Every evaluation report must include the A0 Ledger, summarizing the validation status.

A0 Ledger Template

| Field | Value |
| --- | --- |
| SDP Version | v1.1 |
| Empirical Alignment (PTE) | 0.78 [0.72, 0.84] on LRO suite v3.0 (M=15 A/B tests, 2024-Q4) |
| LROs Used | 90-day retention, revenue per user, task success rate |
| Construct Validity Status | Passed Expert Audit Q4 (n=50, bias < 0.02); Red Team findings: 2 minor gaps documented |
| Stability Status | Inter-Pool Reliability ICC = 0.83; Anchor Drift < 0.02 |
| SDP-Gov Status | Last patch accepted 2025-11-10 (Residual Card #2025-Q3-01) |
| A0 Status | PASS (PTE > 0.7) |

Status Thresholds (Proposed Defaults)

  • PASS: PTE ≥ 0.7, all Pillar 2/3 tests pass
  • WARN: PTE 0.5-0.7, or minor Pillar 2/3 issues → Monitor closely, consider patches
  • FAIL: PTE < 0.5, or major Pillar 2/3 failures → SDP redesign required

Calibrate thresholds to your risk tolerance.
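The proposed defaults can be encoded as a small status function (a sketch; the thresholds and flag names are illustrative and should be calibrated as noted above):

```python
def a0_status(pte, pillar23_ok=True, pillar23_major_failure=False):
    """Map PTE and Pillar 2/3 results to an A0 Ledger status."""
    if pte >= 0.7 and pillar23_ok:
        return "PASS"
    if pte >= 0.5 and not pillar23_major_failure:
        return "WARN"   # monitor closely, consider patches
    return "FAIL"       # SDP redesign required

print(a0_status(0.78))  # PASS
print(a0_status(0.60))  # WARN
print(a0_status(0.40))  # FAIL
```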

8. BVP Cadence and Computational Cost

The BVP requires A/B tests, LRO data collection, and expert time. Run it at appropriate intervals:

Quarterly (Lightweight)

  • Anchor drift check (Pillar 3, Test 5.2): 1-2 hours, automated
  • Inter-pool reliability on new data (Pillar 3, Test 5.1): ~40 labeling hours (2 pools × 200 examples × 6 min/example)

Annually (Full Audit)

  • PTE meta-analysis on year's A/B tests (Pillar 1): ~1 week analysis + LRO data pipeline
  • Expert consensus audit (Pillar 2, Test 4.1): ~20 expert hours for n=50 examples

Triggered (As Needed)

  • Major SDP change → Full BVP before deployment
  • PTE drops below 0.5 → Red team (Pillar 2, Test 4.2) + SDP-Gov patch cycle
  • Anchor drift > 0.05 → Re-anchor and re-run PTE

Cost-Benefit

The BVP is expensive (A/B testing, LRO tracking), but the cost of optimizing for the wrong objective is much higher. A single quarter of shipping a model optimized to a misaligned YY can waste millions in compute and harm user trust. The BVP makes that risk quantifiable and manageable.

9. Worked Examples

Example 1: PTE Calculation

Scenario

You've run M=12 A/B tests over the past year. For each test i, you have:

  • \Delta_{\text{LRO},i}: 90-day retention lift (percentage points)
  • \Delta_{Y,i}: Offline CJE estimate of welfare difference

Data:

| Test | Δ_LRO | Δ_Y |
| --- | --- | --- |
| 1 | 2.1 | 0.08 |
| 2 | -0.5 | -0.02 |
| 3 | 3.2 | 0.12 |
| ... | ... | ... |
| 12 | 1.8 | 0.07 |

Calculation:

  1. Compute β: β̂ = (Σ Δ_LRO,i · Δ_Y,i) / (Σ Δ_Y,i²) = 24.5 (retention points per Y unit)
  2. Compute residuals: e_i = Δ_LRO,i - β̂ · Δ_Y,i
  3. PTE = 1 - (Σ e_i²) / (Σ Δ_LRO,i²) = 0.81

Result: PTE = 0.81 [0.74, 0.88] → A0 PASS. Y strongly predicts LROs.

Example 2: SDP-Gov Patch Cycle

Timeline

  • 2025-Q3: BVP detects PTE = 0.35 on coding tasks (Residual Card #2025-Q3-01)
  • 2025-Q3: Propose patch δ: Add "6-month maintainability" obligation
  • 2025-Q4: Collect Validation Slice (n=300, time-separated, with Y, Y', and LRO data)
  • 2025-Q4: Test acceptance predicate:
    • ✓ ΔPTE = +0.28 (0.35 → 0.63)
    • ✓ Green-slice non-inferiority: PTE unchanged on non-coding tasks
    • ✓ Fairness: No disparate impact across user segments
    • ✓ Expert alignment: Improved on 45/50 high-stakes examples
    • ✓ Anchor stability: Drift < 0.01
    • ✓ Cost: +15% labeling time (acceptable)
  • 2025-11-10: Deploy SDP v1.2 with patch δ
  • 2025-11-11: Re-calibrate all judges to SDP v1.2 scale

Assumptions Ledger

This table extends the AI Quality as Surrogacy assumptions ledger with A0.

| Code | Statement | Used by | Test / Diagnostic | Mitigation |
| --- | --- | --- | --- | --- |
| A0 | \mathbb{E}[Y^* \mid X,A] = \mathbb{E}[Y \mid X,A] (Bridge Assumption) | All Layers | BVP (Pillar 1: PTE; Pillar 2: Audits; Pillar 3: Stability) | SDP-Gov: SDP Patching and Governance |
| S1 | \exists f_k: \mathbb{E}[Y \mid X,A,S^{(k)}] = f_k(S^{(k)},X) | Layers 2-6 | Calibration residuals; Prentice test | Add covariates; richer judge; higher rung |
| S2 | Y \perp\!\!\!\perp \text{Sel} \mid X,A,S^{(k)} (S-admissibility) | Layers 2-6 (cross-environment) | Transport test; diagram review | If selection into Y: recalibrate with target oracle labels |

Citation

If you use this work, please cite:

BibTeX

@misc{landesberg2025bridgea0,
  author = {Landesberg, Eddie},
  title = {Validating the Bridge Assumption (A0): The Bridge Validation Protocol and SDP-Gov},
  year = {2025},
  month = {November},
  url = {https://cimolabs.com/research/bridge-validation-a0},
  note = {CIMO Labs Technical Report}
}

Plain Text

Landesberg, E. (2025). Validating the Bridge Assumption (A0): The Bridge Validation Protocol and SDP-Gov. CIMO Labs Technical Report. https://cimolabs.com/research/bridge-validation-a0

References

[1] Buyse, M., Molenberghs, G., Burzykowski, T., Renard, D., & Geys, H. (2000). The validation of surrogate endpoints in meta-analyses of randomized experiments. Biostatistics, 1(1), 49-67. Foundational work on PTE and surrogate validation in clinical trials.
[2] Prentice, R. L. (1989). Surrogate endpoints in clinical trials: Definition and operational criteria. Statistics in Medicine, 8(4), 431-440. Original formulation of surrogate endpoint criteria.
[3] Pearl, J., & Bareinboim, E. (2014). External validity: From do-calculus to transportability across populations. Statistical Science, 29(4), 579-595. Formal causal framework for transportability and external validity.