Pillar C
Y*-Aligned Systems
Define what you value (Y*). Operationalize it (SDP). Validate the bridge (A0). Govern the protocol.
Research
Theoretical, not yet implemented
Core Insight: Optimization is a fluid; it flows through the path of least resistance. If you don't control the topology, your model flows into reward hacking. Pillar C uses causal mediation and deliberation protocols to reshape the optimization surface so the easiest path aligns with welfare.
What is Y*-Alignment?
Y*-Alignment is the process of ensuring that optimizing measured outcomes (Y) produces improvements in idealized welfare (Y*). This is distinct from Pillar A (CJE), which calibrates surrogates (S) to measured outcomes (Y).
The Measurement Hierarchy
- S (Surrogate)
- Cheap signals: LLM judge scores, BLEU, perplexity
- Y (Operational Welfare)
- Measured via Standard Deliberation Protocol (SDP)
- Y* (Idealized Welfare)
- What you actually care about (the North Star)
The Hard Truth
All the calibration in the world won't help if Y doesn't align with Y*. This is the Bridge Assumption (A0). If A0 fails, you're optimizing a bureaucracy, not welfare.
Pillar C provides the machinery to:
- Define Y*: What does "good" actually mean? (Idealized Deliberation Oracle)
- Operationalize Y: How do we measure it? (Standard Deliberation Protocol)
- Validate the Bridge: Does Y predict Y*? (Bridge Validation Protocol)
- Govern the Protocol: Keep Y aligned as models evolve (SDP-Gov, CLOVER)
Standard Deliberation Protocols (SDPs)
An SDP is the recipe for generating Y. It defines the operational measurement procedure: what information the evaluator sees, what steps they follow, what rubric they apply.
SDP as Process Reward Model (PRM)
In RL training, Process Reward Models (PRMs) score each step in a reasoning chain, not just the final answer. This makes reward hacking exponentially harder.
SDPs are the evaluation analog: Score each deliberation step (evidence gathering, counter-position, impact assessment), not just response quality. Both strengthen causal mediation by raising the cost of side channels.
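The step-gated scoring idea above can be sketched in a few lines. This is an illustrative toy, not the framework's implementation: the `Step` type, the min-gating rule, and the scores are all assumptions chosen to show why gaming only the final answer fails.

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str     # e.g. "evidence", "counter_position", "impact"
    score: float  # judge score for this deliberation step, in [0, 1]

def prm_style_score(steps: list[Step], final_score: float) -> float:
    """Aggregate per-step scores with the final-answer score.

    The weakest step gates the final score, so a response only scores
    well if *every* deliberation step holds up. Polishing the final
    answer while skipping a step buys almost nothing.
    """
    if not steps:
        return 0.0
    weakest_step = min(s.score for s in steps)
    return weakest_step * final_score

# A fluent answer that skipped the counter-position step scores low:
steps = [Step("evidence", 0.9), Step("counter_position", 0.2)]
prm_style_score(steps, final_score=0.95)  # gated down to 0.2 * 0.95
```

Min-gating is one of several plausible aggregation rules; a weighted product or sum would also raise the cost of neglecting individual steps.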
Why SDPs Work
The Problem: Models exploit side channels (verbosity, sycophancy, style) because they're easier than improving actual welfare. This is Causal Goodhart.
The Solution: SDPs enforce causal mediation by making side channels costly. To score high on an SDP, you must:
- Gather relevant evidence (not just assert)
- Consider counter-positions (not just confirm bias)
- Assess realistic impacts (not just hypothesize)
- Acknowledge gaps (not just confabulate)
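The four requirements above amount to a checklist that a deliberation trace must satisfy. A minimal sketch, assuming a trace is represented as the set of step names it actually contains (the names here are hypothetical):

```python
# Hypothetical SDP requirements: each must appear in the deliberation
# trace itself, not merely be implied by a fluent final answer.
REQUIRED_STEPS = {
    "evidence_gathered",
    "counter_position_considered",
    "impacts_assessed",
    "gaps_acknowledged",
}

def sdp_pass(trace: set[str]) -> bool:
    """A trace passes only if every required deliberation step is present."""
    return REQUIRED_STEPS <= trace

sdp_pass({"evidence_gathered", "impacts_assessed"})  # False: two steps skipped
```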
Result: The Goodhart Limit extends from 8-16 optimization steps (naive RM) to 64-128 steps (SDP). You get 8-10× more safe optimization budget.
Engineering SDPs
Learn how to design robust deliberation protocols, what makes a good rubric, and how to test for execution fidelity.
→ Read: Y*-Aligned Systems (Technical)
The Adaptive Gradient Perspective
Understand SDPs through the lens of gradient alignment: maximizing α = (∇S · ∇Y*) / (‖∇S‖ ‖∇Y*‖).
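The alignment coefficient α is the cosine between the surrogate gradient ∇S and the idealized-welfare gradient ∇Y*. A minimal NumPy sketch (the function name and toy vectors are illustrative):

```python
import numpy as np

def gradient_alignment(grad_s: np.ndarray, grad_ystar: np.ndarray) -> float:
    """Cosine alignment: alpha = (grad_S . grad_Y*) / (||grad_S|| ||grad_Y*||).

    alpha near 1:  optimizing S moves Y* in the same direction.
    alpha near 0:  S-improvements are orthogonal to welfare (side channels).
    alpha below 0: optimizing S actively harms Y*.
    """
    num = float(grad_s @ grad_ystar)
    denom = float(np.linalg.norm(grad_s) * np.linalg.norm(grad_ystar))
    return num / denom if denom > 0 else 0.0

# Toy example: the surrogate gradient is only partially aligned with welfare,
# so a large share of optimization pressure leaks into a side channel.
alpha = gradient_alignment(np.array([1.0, 1.0]), np.array([1.0, 0.0]))
```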
→ Read: The Geometry of Goodhart's Law
Governance: CLOVER & SDP-Gov
Alignment isn't static. As models and environments evolve, the calibration and the protocol must be maintained. CIMO provides two governance frameworks, operating at different layers.
The Governance Distinction
CLOVER (Layer 6): Governs the Judge (S→Y). Improves the rubric so surrogates better predict operational labels.
SDP-Gov (Layer 0): Governs the Protocol (Y→Y*). Improves the SDP so operational labels better align with the idealized target.
CLOVER: Continuous LLM-Oracle Validation & Evolution Regime
CLOVER is the governance framework for judges. It detects drift (when S→Y calibration degrades) and provides structured rubric improvement.
Use cases: Judge pools change, rubrics need updating, adversarial prompts emerge.
→ Read: CLOVER (Technical)
SDP-Gov: Protocol Governance
SDP-Gov is the governance framework for the protocol itself. It validates that Y aligns with Y* and adapts the SDP as the target evolves.
Use cases: Business priorities shift, user expectations change, new failure modes emerge.
Note: SDP-Gov is the least developed component of the CIMO Framework; it is defined but not yet implemented.
Bridge Validation Protocol (BVP)
Empirically validate that Y predicts Y* via Predictive Treatment Effects (PTE) against long-run outcomes.
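A minimal sketch of the PTE idea: across historical interventions, correlate each intervention's estimated effect on Y with its long-run effect on Y*. The function name and the four-intervention example are hypothetical illustrations, not the BVP's actual interface.

```python
import numpy as np

def bridge_validity(effects_y: np.ndarray, effects_ystar: np.ndarray) -> float:
    """Correlate estimated treatment effects on Y with effects on Y*.

    Each paired entry is one historical intervention. High correlation
    supports the Bridge Assumption (A0); low correlation warns that
    optimizing Y may leave idealized welfare untouched.
    """
    return float(np.corrcoef(effects_y, effects_ystar)[0, 1])

# Four hypothetical interventions whose Y-effects track their Y*-effects:
bridge_validity(np.array([0.1, 0.4, 0.2, 0.5]),
                np.array([0.2, 0.7, 0.3, 0.9]))  # close to 1.0
```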
→ Read: Validating the Bridge Assumption
The Geometry of Optimization
Why does optimization fail? What is the structure of reward hacking? Pillar C provides a geometric perspective on Goodhart's Law and the manifold topology of safe optimization.
The Surrogate Paradox
Why correlation breaks under optimization. A surrogate can be 95% correlated with welfare in passive observation, but catastrophically diverge when you optimize it.
The solution requires causal mediation: Z → Y → S topology. Every path from treatment to outcome must flow through the surrogate.
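A toy simulation of the paradox, with illustrative numbers (the noise scale, channel costs, and budget are all assumptions): under passive observation the surrogate correlates strongly with welfare, yet a budget-constrained optimizer spends everything on the cheap style channel, so S rises while Y* does not move at all.

```python
import numpy as np

rng = np.random.default_rng(0)

# Passive observation: surrogate = welfare + a little style noise,
# so the two are highly correlated when nobody is optimizing.
welfare = rng.normal(size=1000)
style = 0.3 * rng.normal(size=1000)
surrogate = welfare + style
passive_corr = float(np.corrcoef(surrogate, welfare)[0, 1])  # roughly 0.95

# Under optimization: both channels raise S equally, but style is far
# cheaper (cost 1 vs cost 10 per unit of S), so all effort flows there.
budget = 100.0
style_gain = budget / 1.0   # the whole budget goes to the side channel
welfare_gain = 0.0          # nothing is left for actual welfare
s_gain = style_gain + welfare_gain   # S improves by 100
ystar_gain = welfare_gain            # Y* improves by 0
```

The point of the sketch: passive correlation says nothing about what happens once S becomes the optimization target, because the optimizer chooses the cheapest path to S, not the path through welfare.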
→ Read: The Surrogate Paradox
The Geometry of Goodhart's Law
The manifold perspective on reward hacking. The Causal Information Manifold is the geometric structure where improvements in S correspond to improvements in Y*. Optimization "off the manifold" leads to Goodhart crashes.
DbP (Pillar A) and SDPs (Pillar C) work together to keep optimization on the manifold.
→ Read: The Geometry of Goodhart's Law
Robustness Under Pressure
How to test for fragility before deployment. Stress-test your evaluation under distributional shift, adversarial prompts, and extreme optimization.
Uses sensitivity analysis to identify the Goodhart Limit and failure modes.
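One minimal way to locate a Goodhart Limit from a stress-test run, assuming you can score Y* (or a proxy for it) at each optimization step; the trace shape below is a made-up illustration:

```python
import numpy as np

def goodhart_limit(ystar_trace: np.ndarray) -> int:
    """Step at which idealized welfare peaks along an optimization run.

    Past this step, further optimization of the surrogate degrades Y*;
    the returned step count is the safe optimization budget.
    """
    return int(np.argmax(ystar_trace))

# Toy trace: Y* improves early, then the optimizer finds a side channel
# and welfare collapses even as the surrogate (not shown) keeps rising.
steps = np.arange(100)
ystar = np.where(steps < 16, 0.1 * steps, 1.6 - 0.2 * (steps - 16))
goodhart_limit(ystar)  # 16: the Goodhart Limit for this trace
```

In practice the trace is noisy, so a smoothed or tolerance-banded peak detector would be more robust than a raw argmax.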
→ Read: Robustness Under Pressure
Related Reading
Conceptual
How metrics become manipulable under optimization
The crisis of trust in current evaluation
Technical
Formal treatment of the alignment problem
The final scorecard: Gains - Losses in Y* units
