Validating the Bridge Assumption (A0): The Bridge Validation Protocol and SDP-Gov
Technical Appendix v1.0
Abstract
The CIMO Framework relies on the Bridge Assumption (A0): that the operational welfare label Y (measured via the Standard Deliberation Protocol, SDP) aligns with the idealized target Y*. While CIMO rigorously calibrates surrogates S to Y (Layers 2-3), the Y → Y* link requires explicit validation. We introduce the Validation Layer ("Layer 0") to address this. It comprises the Bridge Validation Protocol (BVP), a suite of tests for A0, and SDP-Gov, a governance framework for the SDP. The BVP uses the Proportion of Treatment Effect Explained (PTE) as the core metric for empirical alignment against Long-Run Outcomes (LROs). SDP-Gov provides a CI/CD system for the SDP itself. Together, they keep the operational target aligned with true welfare.
Prerequisites: This appendix assumes familiarity with the CIMO Framework architecture, particularly Y*-Aligned Systems (SDP, Y* definition) and AI Quality as Surrogacy (calibration, S-admissibility). For the conceptual introduction, see The CIMO Framework.
0. The A0 Problem and Notation
A0 (Bridge Assumption)
The operational label Y, produced by the Standard Deliberation Protocol (SDP), is conditionally unbiased for the Idealized Deliberation Oracle outcome Y*.
The Problem: If A0 fails due to construct drift, protocol gaps, or rater bias, Y diverges from Y*. The CIMO stack may then rigorously optimize systems toward the wrong objective.
The Irreducibility of A0: A Philosophical Caveat
A0 is the foundational assumption of the CIMO Framework. It cannot be "proven" in an absolute sense—it rests on the philosophical claim that the operational measurement process (Y via SDP) captures what we care about (Y*).
The LRO Validation Problem: BVP tests Y→LRO (do policies that score high on Y produce better long-run outcomes?). But this assumes LRO→Y* (that long-run outcomes are themselves aligned with true welfare). If LROs are poor proxies for Y* (e.g., optimizing for engagement rather than long-term value, or user retention rather than genuine satisfaction), a high PTE provides false confidence.
What BVP provides: Not proof of A0, but evidence and vigilance. BVP establishes that Y predicts improvements in measurable real-world outcomes, provides structured adversarial auditing (Pillar 2), and offers a governance framework to detect drift. This is the best we can do. All evaluation frameworks rest on similar irreducible assumptions about what constitutes "good." CIMO's contribution is making this assumption explicit (A0), providing validation machinery (BVP), and maintaining governance (SDP-Gov) rather than leaving it implicit and unexamined.
Notation & The Measurement Hierarchy
- Y*: Idealized Deliberation Oracle outcome (True Welfare). The theoretical construct representing welfare under perfect deliberation—complete information, reflective consistency, impartial aggregation. Unobservable in practice.
- Y: Operational welfare label (measured via SDP). The practical measurement produced by the Standard Deliberation Protocol. Approximates Y* but is measurable at scale. This is what we use for calibration (Layer 2: S → Y).
- LROs (Long-Run Outcomes): Real-world metrics that approximate Y*. Observable metrics measured weeks/months after interaction (e.g., 90-day retention, revenue, task success rate). These are delayed proxies for Y*, not Y* itself. We test whether optimizing Y predicts improvements in LROs.
- S: Cheap surrogates (e.g., LLM-judge scores). Fast, scalable signals calibrated to Y (Layer 2).
The validation hierarchy:
- Layer 0 (A0): Y → Y* (Bridge: operational → idealized)
- Layer 2 (S1): S → Y (Surrogacy: cheap signal → operational)
We cannot directly test Y → Y* because Y* is unobservable. Instead, we test Y → LRO and assume LROs approximate Y* (see §3.1.2 for the LRO validation problem).
1. The Validation Layer Framework (Layer 0)
The Validation Layer acts as Layer 0, ensuring the foundation of the CIMO stack (Layer 1: Y* Definition) remains sound. It has two components:
- The Bridge Validation Protocol (BVP) for testing A0 via empirical alignment, construct validity, and stability checks.
- SDP-Gov for governing the SDP—a CI/CD system that allows safe evolution of the operational welfare definition.
Relationship to other CIMO components
- CLOVER (Layer 6) governs judge calibration (S → Y).
- SDP-Gov (Layer 0) governs the SDP itself (Y → Y*).
- Y*-Aligned Systems (Layer 4) ensures prompts and judges target the same construct.
- Layer 0 validates that the shared construct actually aligns with true welfare Y*.
2. The Bridge Validation Protocol (BVP)
The BVP is a suite of tests executed periodically (e.g., quarterly) or when distributional shifts are detected. It comprises three pillars:
| Pillar | Focus | Core Metric | Frequency |
|---|---|---|---|
| 1. Empirical Alignment | Does optimizing Y predict LRO gains? | PTE (Proportion of Treatment Effect Explained) | Annual |
| 2. Construct Validity | Does Y capture essential welfare elements? | Expert audits, red team findings | Quarterly |
| 3. Stability & Invariance | Is the SDP well-specified and stable? | Inter-pool reliability, anchor drift | Quarterly |
3. Pillar 1: Empirical Alignment (PTE and the A/B Bridge)
This pillar tests the predictive validity of Y: does optimizing Y translate to real-world value (LROs)?
3.1. Estimand: Proportion of Treatment Effect Explained (PTE)
We use the standard surrogacy metric PTE (sometimes reported as a surrogate R² or "surrogate index validity") to quantify alignment. Let π_A and π_B be two policies compared in an A/B test.
- Let ΔLRO = E[LRO | π_B] − E[LRO | π_A] (online effect).
- Let ΔY = E[Y | π_B] − E[Y | π_A] (offline effect, estimated via CJE).
PTE is defined as:
PTE = 1 − E[(ΔLRO − β · ΔY)²] / E[ΔLRO²]
where β is a scaling factor (calibration slope), and the expectation is taken over the distribution of A/B tests.
Interpretation
- PTE ≥ 0.7: Strong alignment. A0 holds. Y is a reliable predictor of Y* (via LROs).
- PTE 0.3 - 0.7: Moderate alignment. Y is useful but misses key components of Y*. Consider SDP patches (SDP-Gov).
- PTE < 0.3: Weak alignment. A0 fails. Y does not track true welfare. Requires SDP redesign or fallback to direct measurement.
Note: These thresholds are proposed operational defaults, not empirically validated standards. They are inspired by the surrogate endpoint literature [1], but actual thresholds should reflect your domain's cost tradeoff between false positives and false negatives.
3.1.1. LRO Selection Criteria
LROs must satisfy three criteria to serve as valid proxies for Y*:
- Y*-relevance: The metric should be causally downstream of the welfare construct that Y aims to capture. For "developer productivity," use 90-day task completion rates or code reuse, not session clicks or response length.
- Delayed observation: Measured weeks or months after interaction. If measured immediately, it's likely a proxy for Y, not Y*. The delay allows true welfare consequences to manifest.
- Robustness to gaming: Hard to manipulate via shallow response features (verbosity, formatting, confidence tone). Should reflect genuine user outcomes.
Practical guidance
- Minimum: 3 diverse LROs. If PTE varies widely across LROs (e.g., PTE = 0.8 for retention but 0.2 for revenue), this indicates Y captures only a subset of Y* → investigate which dimensions are missing.
- Diversify: Include metrics from different stakeholder perspectives (user, business, ethical). E.g., user satisfaction, long-term engagement, safety incidents.
- Document: Record LRO selection rationale in the A0 Ledger (§7). This makes the validation transparent and auditable.
3.1.2. The LRO Validation Problem
The circularity
LROs are not Y* itself—they're observable proxies. This creates a validation hierarchy:
- Y* (unobservable ideal)
- LROs (delayed, noisy, but "more Y*-like" than Y)
- Y (SDP output, faster but potentially biased)
- S (cheap surrogates)
The BVP tests "Y → LRO", assuming LRO ≈ Y*. This assumption cannot be fully tested, but can be triangulated via:
- Multiple diverse LROs: If PTE is high for all LROs, confidence increases that they jointly approximate Y*.
- Expert review: "Do these LROs capture what truly matters for welfare?" Qualitative validation that LRO selection is reasonable.
- Sensitivity analysis: Vary LRO definitions (e.g., 60-day vs. 90-day retention), check PTE stability. If PTE is robust, confidence increases.
Document your LRO selection rationale in the A0 Ledger (§7). This makes the implicit assumption "LRO ≈ Y*" explicit and auditable.
3.2. Estimation Procedure (Meta-Analysis)
- Define LROs: Specify key long-run metrics (e.g., 90-day retention, revenue per user, task success rate). See §3.1.1 for selection criteria.
- Aggregate A/B Tests: Collect a suite of recent A/B tests (M ≥ 10 recommended). Each test compares two policies (π_A, π_B).
- Measure ΔLRO and ΔY: For each test:
  - ΔLRO: Online impact (from production A/B test)
  - ΔY: Offline CJE estimate (using SDP labels)
- Estimate PTE: Use cross-validation over the suite of A/B tests to estimate PTE and its confidence interval. Fit β via OLS: ΔLRO,i = β · ΔY,i + ε_i, then compute the residual variance.
Critical: The Novelty Mandate
LLMs memorize public benchmarks. If your validation data overlaps with the model's training corpus, the model may be reciting correct answers rather than reasoning to them (Srivastava et al., 2025). This invalidates the bridge validation.
Requirements:
- Post-cutoff data: Validation datasets must be collected after the model's training cutoff date. Use private, recent data (e.g., last quarter's user interactions).
- Memorization check: Include a diagnostic that tests for verbatim recall of validation examples. If the model can complete partial prompts from memory, the data is contaminated.
- Forbidden: Public benchmarks (MMLU, HumanEval, etc.) unless you can verify they were excluded from training.
PTE Estimation Formula
Given M A/B tests with paired (ΔLRO,i, ΔY,i):
β̂ = (Σ ΔLRO,i · ΔY,i) / (Σ ΔY,i²)
PTE = 1 − (Σ (ΔLRO,i − β̂ · ΔY,i)²) / (Σ ΔLRO,i²)
Use bootstrap or jackknife to estimate the 95% CI for PTE. Report in the A0 Ledger.
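The estimator above can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the BVP specification; the function names and the jackknife CI construction are our own choices.

```python
import numpy as np

def estimate_pte(d_lro, d_y):
    """Calibration slope and PTE from paired per-test effects.

    d_lro, d_y: length-M arrays of (ΔLRO,i, ΔY,i) from the A/B suite.
    Implements β̂ = Σ ΔLRO·ΔY / Σ ΔY² and PTE = 1 − Σ e² / Σ ΔLRO².
    """
    d_lro = np.asarray(d_lro, dtype=float)
    d_y = np.asarray(d_y, dtype=float)
    beta = np.sum(d_lro * d_y) / np.sum(d_y ** 2)   # no-intercept OLS slope
    resid = d_lro - beta * d_y                      # per-test residuals e_i
    pte = 1.0 - np.sum(resid ** 2) / np.sum(d_lro ** 2)
    return beta, pte

def jackknife_ci(d_lro, d_y, z=1.96):
    """Leave-one-test-out jackknife ~95% CI for PTE (illustrative)."""
    d_lro = np.asarray(d_lro, dtype=float)
    d_y = np.asarray(d_y, dtype=float)
    m = len(d_lro)
    loo = np.array([
        estimate_pte(np.delete(d_lro, i), np.delete(d_y, i))[1]
        for i in range(m)
    ])
    pte = estimate_pte(d_lro, d_y)[1]
    se = np.sqrt((m - 1) / m * np.sum((loo - loo.mean()) ** 2))
    return pte - z * se, pte + z * se
```

When ΔLRO is exactly proportional to ΔY, PTE equals 1; independent noise drives it toward 0 (or below, since this definition is not clipped).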
3.3. Secondary Metrics
- Directional Consistency (Sign Test): The frequency with which sign(ΔY,i) = sign(ΔLRO,i). Minimum acceptable: 80%.
- Calibration Slope (β): The scaling factor β quantifies "value translation." β ≈ 1 indicates Y and LROs are on similar scales. β < 1 suggests Y overestimates welfare; β > 1 suggests underestimation.
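The sign test is a one-liner; a hedged sketch (the function name is ours, not from the BVP spec):

```python
import numpy as np

def directional_consistency(d_lro, d_y):
    """Fraction of A/B tests where sign(ΔY) agrees with sign(ΔLRO)."""
    d_lro = np.asarray(d_lro, dtype=float)
    d_y = np.asarray(d_y, dtype=float)
    return float(np.mean(np.sign(d_lro) == np.sign(d_y)))
```

Values below the 80% minimum are a warning sign even when PTE looks healthy, since PTE can be dominated by a few large-effect tests.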
4. Pillar 2: Construct Validity (Audits and Adversarial Testing)
This pillar ensures the SDP captures the essential elements of idealized deliberation through qualitative and adversarial checks.
Test 4.1: Expert Consensus Audit
On a small, high-stakes data slice (n ≈ 50-100 examples), compare Y (SDP output) against labels from "Unbounded Expert Deliberation" (no time/cost constraints). Test for systematic divergence.
Procedure
- Sample high-stakes examples (e.g., safety-critical, high-value users, edge cases).
- Collect Y labels via the standard SDP (time budget T, evidence sources E).
- Collect Y_unbounded labels via the unbounded protocol: expert panel, unlimited time, access to all evidence.
- Test H₀: E[Y − Y_unbounded] = 0. If rejected at significance level α, investigate bias patterns.
Frequency: Annual or when major SDP changes are proposed.
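The divergence test can be implemented as a paired t-test comparing SDP labels against unbounded-deliberation labels on the same examples. A sketch assuming SciPy; the default significance level is a placeholder, not a BVP-mandated value.

```python
import numpy as np
from scipy.stats import ttest_rel  # paired-samples t-test

def audit_bias(y_sdp, y_unbounded, alpha=0.05):
    """Test H0: E[Y − Y_unbounded] = 0 on the audit slice.

    Returns (mean bias, p-value, reject flag). Positive bias means the
    SDP systematically scores higher than unbounded deliberation.
    """
    y_sdp = np.asarray(y_sdp, dtype=float)
    y_unbounded = np.asarray(y_unbounded, dtype=float)
    bias = float(np.mean(y_sdp - y_unbounded))
    _, p = ttest_rel(y_sdp, y_unbounded)
    return bias, float(p), bool(p < alpha)
```

In practice n ≈ 50-100 paired labels; a rejected null triggers a Residual Card rather than an automatic patch.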
Test 4.2: Red Teaming the SDP
Adversarially search for scenarios where following the SDP yields a high score (high Y), but expert consensus or known facts indicate low welfare (low Y*). This identifies protocol gaps.
Example Red Team Finding
Scenario: Code generation task. Model produces syntactically correct but poorly documented code with hardcoded assumptions.
SDP output: High Y (code runs, tests pass, follows style guide).
Expert review: Low Y* (code is unmaintainable, will cause bugs in 6 months).
Protocol gap: SDP rewards "correctness at t=0" but doesn't assess long-term maintainability. Proposed patch: Add obligation "Assess 6-month maintainability" (see §6.2.1).
Test 4.3: Stakeholder and Ethical Review
Periodic review by diverse stakeholders and ethicists to ensure the SDP incorporates perspectives relevant to Y* that may be missing from empirical data (e.g., fairness considerations, long-tail user needs).
- Assemble review panel: domain experts, affected users, ethics researchers.
- Present SDP specification and sample (Y, context, response) triples.
- Elicit: "What welfare-relevant factors does this SDP miss?"
- Document findings; propose SDP patches via SDP-Gov (§6).
5. Pillar 3: Stability and Invariance
This pillar uses statistical checks to ensure the SDP is well-specified and stable across judge pools and time.
Test 5.1: Inter-Pool Reliability
As formalized in the Y*-Aligned Systems appendix (Proposition 2), calibrated scores from different qualified judge pools executing the same SDP must converge. High divergence suggests the SDP is underspecified.
Procedure
- Select two independent judge pools (P1, P2) meeting SDP qualification criteria.
- Both pools label the same sample (n ≈ 200) following the SDP. Obtain Y_P1 and Y_P2.
- Compute intraclass correlation (ICC) or mean absolute difference. Target: ICC > 0.8.
- If ICC < 0.7, the SDP is underspecified → clarify obligations, add worked examples, tighten rubric.
Frequency: Quarterly, or when onboarding new judge pools.
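One way to compute the inter-pool statistic is ICC(2,1): two-way random effects, absolute agreement, single rater. A minimal NumPy sketch, assuming scores are arranged as an (examples × pools) matrix; for production use, a vetted library implementation is preferable.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1) for an (n_targets, k_raters) matrix of scores.

    Columns are judge pools (e.g., P1, P2); rows are labeled examples.
    Computed from the standard two-way ANOVA mean squares.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row = x.mean(axis=1, keepdims=True)   # per-example means
    col = x.mean(axis=0, keepdims=True)   # per-pool means
    msr = k * np.sum((row - grand) ** 2) / (n - 1)          # between-targets
    msc = n * np.sum((col - grand) ** 2) / (k - 1)          # between-raters
    mse = np.sum((x - row - col + grand) ** 2) / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Identical pools give ICC = 1; disagreement that is large relative to the between-example spread pulls it toward 0.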
Test 5.2: Temporal Stability (Anchor Drift)
Re-evaluate reference policies quarterly. If drift exceeds a threshold (e.g., 0.05), the operational definition of welfare has changed, requiring re-anchoring.
Procedure
- At project start, define anchor policies and measure their Y scores on a fixed holdout set.
- Every quarter, re-measure on the same holdout. Compute drift: |Y_t(π_anchor) − Y_0(π_anchor)|.
- If drift > 0.05 for either anchor, flag it. Investigate the cause (SDP changed? judge pool changed? context distribution changed?).
- If drift persists, create Anchor v2.0 and re-normalize historical data for comparability.
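The quarterly drift check is mechanical enough to automate. A sketch under assumed conventions (anchor names and the dict layout are illustrative; the 0.05 default mirrors the example threshold above):

```python
def check_anchor_drift(baseline, current, threshold=0.05):
    """Return anchors whose |Y_t − Y_0| exceeds the drift threshold.

    baseline/current: dicts mapping anchor policy name -> mean Y on the
    fixed holdout at project start and at the current quarter.
    """
    return {
        name: abs(current[name] - y0)
        for name, y0 in baseline.items()
        if abs(current[name] - y0) > threshold
    }
```

An empty return means no re-anchoring is needed this quarter; any flagged anchor triggers the investigation step above.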
6. SDP-Gov: Governance for the SDP
When the BVP detects an A0 failure, SDP-Gov (CLOVER for the SDP) provides the governance framework for improving the protocol. It treats the SDP as a versioned artifact subject to CI/CD principles.
6.1. The SDP-Gov Loop
- Audit (BVP Execution): Execute the BVP.
- Diagnose (SDP Residual Cards): If failures are detected (e.g., low PTE), analyze the "A0 residuals" to identify systematic gaps where Y diverges from the LROs. Summarize findings in SDP Residual Cards.
- Synthesize Patch (δ): Propose a structured change δ to the current SDP version. Patches should prioritize "obligation-first" changes (add missing welfare dimensions) over "constraint" changes (tighten existing criteria).
- Validate Patch (Acceptance Predicate): Test the patch on a holdout validation set.
6.2. The Validation Slice
Validating a patch requires a specialized Validation Slice: a dataset collected on a time-separated holdout, where Y (old SDP), Y′ (new SDP), AND LRO data (or a faster proxy) are all measured.
Why time-separated?
To prevent adaptive overfitting. If the validation set is the same data used to identify the problem, any patch will appear to "work." Time separation ensures the patch generalizes to new data.
6.2.1. Example SDP Residual Card
Residual Card #2025-Q3-01
- Slice: Complex coding tasks (n=500, data science domain)
- Failure mode: Y overestimated welfare (PTE = 0.35 in this slice)
- Evidence: SDP-high responses (Y > 0.8) had verbose explanations + correct syntax, but 90-day follow-up showed low code reuse (15%) and high debugging time (+40% vs. baseline).
- Root cause: SDP rewards "looks correct at t=0" (syntax, tests pass, follows style guide) but misses long-term maintainability and documentation quality (which impact LROs).
- Proposed patch δ: Add obligation: "Assess 6-month maintainability: (a) Is the code well-documented for future readers? (b) Are edge cases and assumptions made explicit? (c) Would a new team member understand this code in 6 months without asking the author?"
- Expected impact: Increase PTE from 0.35 to >0.6 on coding tasks by penalizing "quick-but-fragile" code.
6.3. The Acceptance Predicate
A patch is accepted ONLY IF it passes all guardrails on the Validation Slice:
Accept(δ) =
- [ΔPTE ≥ η] (Empirical Improvement): Must significantly improve PTE against LROs. (Primary Criterion). Default: η = 0.05.
- ∧ [Green-Slice Non-Inferiority] (Do No Harm): Must not degrade alignment in domains where the current SDP performs well. Test on slices with PTE > 0.7.
- ∧ [Fairness Non-Regression] (Equity): Must not increase bias or disparate impact across protected groups (requires fairness metrics in the validation set).
- ∧ [Expert Alignment Improvement] (Validity): Must improve alignment with expert audits on high-stakes examples (Pillar 2).
- ∧ [Anchor Stability] (Stability): Must not significantly drift the [0,1] scale vs. the current anchors. Test: anchor re-evaluation on holdout.
- ∧ [Cost/Latency Constraints] (Operational): Must remain within budgeted cost/latency for collecting Y.
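Because the predicate is a pure conjunction, it can be encoded directly. A sketch under stated assumptions: the field names, the non-inferiority margin, and every threshold other than η are our own placeholders, not values fixed by SDP-Gov.

```python
from dataclasses import dataclass

@dataclass
class PatchEvidence:
    """Validation-slice measurements for a proposed patch (names illustrative)."""
    delta_pte: float                # PTE improvement vs. current SDP
    green_slice_pte_drop: float     # worst-case PTE drop on healthy slices
    fairness_regression: bool       # any increase in disparate impact?
    expert_alignment_improved: bool # Pillar 2 audit result
    anchor_drift: float             # scale drift vs. current anchors
    cost_ratio: float               # new labeling cost / budget

def accept(e, eta=0.05, noninf_margin=0.02, drift_max=0.05):
    """All six guardrails must pass; thresholds are proposed defaults."""
    return (
        e.delta_pte >= eta
        and e.green_slice_pte_drop <= noninf_margin
        and not e.fairness_regression
        and e.expert_alignment_improved
        and e.anchor_drift <= drift_max
        and e.cost_ratio <= 1.0
    )
```

Encoding the predicate as code forces every guardrail to be measured before deployment; a missing field is a loud failure rather than a silent skip.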
6.4. Versioning and Deployment
If accepted, deploy the new SDP version. This action mandates:
- Re-calibration of all associated judges (CIMO Layer 2), as the measurement scale has changed.
- Changelog entry: Document what changed, why, and the validation evidence.
- Backward compatibility: If historical comparisons are needed, maintain the ability to score with the previous SDP version for a deprecation period.
7. Reporting: The A0 Ledger
Every evaluation report must include the A0 Ledger, summarizing the validation status.
A0 Ledger Template
| SDP Version: | v1.1 |
| Empirical Alignment (PTE): | 0.78 [0.72, 0.84] on LRO suite v3.0 (M=15 A/B tests, 2024-Q4) |
| LROs Used: | 90-day retention, revenue per user, task success rate |
| Construct Validity Status: | Passed Expert Audit Q4 (n=50, bias < 0.02); Red Team findings: 2 minor gaps documented |
| Stability Status: | Inter-Pool Reliability ICC = 0.83; Anchor Drift < 0.02 |
| SDP-Gov Status: | Last patch accepted 2025-11-10 (Residual Card #2025-Q3-01) |
| A0 Status: | PASS (PTE > 0.7) |
Status Thresholds (Proposed Defaults)
- PASS: PTE ≥ 0.7, all Pillar 2/3 tests pass
- WARN: PTE 0.5-0.7, or minor Pillar 2/3 issues → Monitor closely, consider patches
- FAIL: PTE < 0.5, or major Pillar 2/3 failures → SDP redesign required
Calibrate thresholds to your risk tolerance.
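The status mapping can be encoded as a small helper. Note this sketch collapses "minor Pillar 2/3 issues" into a single boolean per pillar; a real implementation would grade issue severity rather than treat any pillar failure as FAIL.

```python
def a0_status(pte, pillar2_ok, pillar3_ok):
    """Map BVP results to an A0 Ledger status using the proposed defaults."""
    if pte >= 0.7 and pillar2_ok and pillar3_ok:
        return "PASS"
    if pte >= 0.5 and pillar2_ok and pillar3_ok:
        return "WARN"
    return "FAIL"
```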
8. BVP Cadence and Computational Cost
The BVP requires A/B tests, LRO data collection, and expert time. Run it at appropriate intervals:
Quarterly (Lightweight)
- Anchor drift check (Pillar 3, Test 5.2): 1-2 hours, automated
- Inter-pool reliability on new data (Pillar 3, Test 5.1): ~40 labeling hours (2 pools × 200 examples × 6 min/example)
Annually (Full Audit)
- PTE meta-analysis on year's A/B tests (Pillar 1): ~1 week analysis + LRO data pipeline
- Expert consensus audit (Pillar 2, Test 4.1): ~20 expert hours for n=50 examples
Triggered (As Needed)
- Major SDP change → Full BVP before deployment
- PTE drops below 0.5 → Red team (Pillar 2, Test 4.2) + SDP-Gov patch cycle
- Anchor drift > 0.05 → Re-anchor and re-run PTE
Cost-Benefit
The BVP is expensive (A/B testing, LRO tracking), but the cost of optimizing for the wrong objective is much higher. A single quarter of shipping a model optimized to a misaligned Y can waste millions in compute and harm user trust. The BVP makes that risk quantifiable and manageable.
9. Worked Examples
Example 1: PTE Calculation
Scenario
You've run M=12 A/B tests over the past year. For each test i, you have:
- ΔLRO,i: 90-day retention lift (percentage points)
- ΔY,i: Offline CJE estimate of welfare difference
Data:
| Test | ΔLRO | ΔY |
|---|---|---|
| 1 | 2.1 | 0.08 |
| 2 | -0.5 | -0.02 |
| 3 | 3.2 | 0.12 |
| ... | ... | ... |
| 12 | 1.8 | 0.07 |
Calculation:
- Compute β: β̂ = (Σ ΔLRO,i · ΔY,i) / (Σ ΔY,i²) = 24.5 (retention points per Y unit)
- Compute residuals: e_i = ΔLRO,i − β̂ · ΔY,i
- PTE = 1 − (Σ e_i²) / (Σ ΔLRO,i²) = 0.81
Result: PTE = 0.81 [0.74, 0.88] → A0 PASS. Y strongly predicts LROs.
Example 2: SDP-Gov Patch Cycle
Timeline
- 2025-Q3: BVP detects PTE = 0.35 on coding tasks (Residual Card #2025-Q3-01)
- 2025-Q3: Propose patch δ: Add "6-month maintainability" obligation
- 2025-Q4: Collect Validation Slice (n=300, time-separated, with Y, Y', and LRO data)
- 2025-Q4: Test acceptance predicate:
- ✓ ΔPTE = +0.28 (0.35 → 0.63)
- ✓ Green-slice non-inferiority: PTE unchanged on non-coding tasks
- ✓ Fairness: No disparate impact across user segments
- ✓ Expert alignment: Improved on 45/50 high-stakes examples
- ✓ Anchor stability: Drift < 0.01
- ✓ Cost: +15% labeling time (acceptable)
- 2025-11-10: Deploy SDP v1.2 with patch δ
- 2025-11-11: Re-calibrate all judges to SDP v1.2 scale
Assumptions Ledger
This table extends the AI Quality as Surrogacy assumptions ledger with A0.
| Code | Statement | Used by | Test / Diagnostic | Mitigation |
|---|---|---|---|---|
| A0 | Y ≈ Y* (Bridge Assumption) | All Layers | BVP (Pillar 1: PTE; Pillar 2: Audits; Pillar 3: Stability) | SDP-Gov: SDP Patching and Governance |
| S1 | S → Y (Surrogacy) | Layers 2-6 | Calibration residuals; Prentice test | Add covariates; richer judge; higher rung |
| S2 | S-admissibility | Layers 2-6 (cross-environment) | Transport test; diagram review | If selection into Y: recalibrate with target oracle labels |
Citation
If you use this work, please cite:
BibTeX
@misc{landesberg2025bridgea0,
author = {Landesberg, Eddie},
title = {Validating the Bridge Assumption (A0): The Bridge Validation Protocol and SDP-Gov},
year = {2025},
month = {November},
url = {https://cimolabs.com/research/bridge-validation-a0},
note = {CIMO Labs Technical Report}
}
Plain Text
Landesberg, E. (2025). Validating the Bridge Assumption (A0): The Bridge Validation Protocol and SDP-Gov. CIMO Labs Technical Report. https://cimolabs.com/research/bridge-validation-a0
