Pillar C
Y*-Aligned Systems
Define what you value (Y*). Operationalize it (SDP). Validate the bridge (A0). Govern the protocol.
Research
Theoretical, not yet implemented
Core Insight: Optimization is a fluid; it flows through the path of least resistance. If you don't control the topology, your model flows into reward hacking. Pillar C uses causal mediation and deliberation protocols to reshape the optimization surface so the easiest path aligns with welfare.
What is Y*-Alignment?
Y*-Alignment is the process of ensuring that optimizing measured outcomes (Y) produces improvements in idealized welfare (Y*). This is distinct from Pillar A (CJE), which calibrates surrogates (S) to measured outcomes (Y).
The Measurement Hierarchy
- S (Surrogate)
- Cheap signals: LLM judge scores, BLEU, perplexity
- Y (Operational Welfare)
- Measured via Standard Deliberation Protocol (SDP)
- Y* (Idealized Welfare)
- What you actually care about (the North Star)
The Hard Truth
All the calibration in the world won't help if Y doesn't align with Y*. This is the Bridge Assumption (A0). If A0 fails, you're optimizing a bureaucracy, not welfare.
Pillar C provides the machinery to:
- Define Y*: What does "good" actually mean? (Idealized Deliberation Oracle)
- Operationalize Y: How do we measure it? (Standard Deliberation Protocol)
- Validate the Bridge: Does Y predict Y*? (Bridge Validation Protocol)
- Govern the Protocol: Keep Y aligned as models evolve (SDP-Gov, CLOVER)
Standard Deliberation Protocols (SDPs)
An SDP is the recipe for generating Y. It defines the operational measurement procedure: what information the evaluator sees, what steps they follow, what rubric they apply.
SDP as Process Reward Model (PRM)
In RL training, Process Reward Models (PRMs) score each step in a reasoning chain, not just the final answer. This makes reward hacking exponentially harder.
SDPs are the evaluation analog: Score each deliberation step (evidence gathering, counter-position, impact assessment), not just response quality. Both strengthen causal mediation by raising the cost of side channels.
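The step-gated scoring idea above can be sketched in a few lines. This is an illustrative toy, not the framework's implementation: the `Step` type, the min-gating rule, and the scores are all assumptions chosen to show why gaming only the final answer fails.

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str     # e.g. "evidence", "counter_position", "impact"
    score: float  # judge score for this deliberation step, in [0, 1]

def prm_style_score(steps: list[Step], final_score: float) -> float:
    """Aggregate per-step scores with the final-answer score.

    The weakest step gates the final score, so a response only scores
    well if *every* deliberation step holds up. Polishing the final
    answer while skipping a step buys almost nothing.
    """
    if not steps:
        return 0.0
    weakest_step = min(s.score for s in steps)
    return weakest_step * final_score

# A fluent answer that skipped the counter-position step scores low:
steps = [Step("evidence", 0.9), Step("counter_position", 0.2)]
prm_style_score(steps, final_score=0.95)  # gated down to 0.2 * 0.95
```

Min-gating is one of several plausible aggregation rules; a weighted product or sum would also raise the cost of neglecting individual steps.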
Why SDPs Work
The Problem: Models exploit side channels (verbosity, sycophancy, style) because they're easier than improving actual welfare. This is Causal Goodhart.
The Solution: SDPs enforce causal mediation by making side channels costly. To score high on an SDP, you must:
- Gather relevant evidence (not just assert)
- Consider counter-positions (not just confirm bias)
- Assess realistic impacts (not just hypothesize)
- Acknowledge gaps (not just confabulate)
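The four requirements above amount to a checklist that a deliberation trace must satisfy. A minimal sketch, assuming a trace is represented as the set of step names it actually contains (the names here are hypothetical):

```python
# Hypothetical SDP requirements: each must appear in the deliberation
# trace itself, not merely be implied by a fluent final answer.
REQUIRED_STEPS = {
    "evidence_gathered",
    "counter_position_considered",
    "impacts_assessed",
    "gaps_acknowledged",
}

def sdp_pass(trace: set[str]) -> bool:
    """A trace passes only if every required deliberation step is present."""
    return REQUIRED_STEPS <= trace

sdp_pass({"evidence_gathered", "impacts_assessed"})  # False: two steps skipped
```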
Result: The Goodhart Limit extends from 8-16 optimization steps (naive RM) to 64-128 steps (SDP). You get 8-10× more safe optimization budget.
Engineering SDPs
Learn how to design robust deliberation protocols, what makes a good rubric, and how to test for execution fidelity.
→ Read: Y*-Aligned Systems (Technical)
The Adaptive Gradient Perspective
Understand SDPs through the lens of gradient alignment: maximizing α = (∇S · ∇Y*) / (‖∇S‖ ‖∇Y*‖).
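The alignment coefficient α is the cosine between the surrogate gradient ∇S and the idealized-welfare gradient ∇Y*. A minimal NumPy sketch (the function name and toy vectors are illustrative):

```python
import numpy as np

def gradient_alignment(grad_s: np.ndarray, grad_ystar: np.ndarray) -> float:
    """Cosine alignment: alpha = (grad_S . grad_Y*) / (||grad_S|| ||grad_Y*||).

    alpha near 1:  optimizing S moves Y* in the same direction.
    alpha near 0:  S-improvements are orthogonal to welfare (side channels).
    alpha below 0: optimizing S actively harms Y*.
    """
    num = float(grad_s @ grad_ystar)
    denom = float(np.linalg.norm(grad_s) * np.linalg.norm(grad_ystar))
    return num / denom if denom > 0 else 0.0

# Toy example: the surrogate gradient is only partially aligned with welfare,
# so a large share of optimization pressure leaks into a side channel.
alpha = gradient_alignment(np.array([1.0, 1.0]), np.array([1.0, 0.0]))
```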
→ Read: The Geometry of Goodhart's Law
Governance: CLOVER & SDP-Gov
Alignment isn't static. As models and environments evolve, the calibration and the protocol must be maintained. CIMO provides two governance frameworks, operating at different layers.
The Governance Distinction
CLOVER (Layer 6): Governs the Judge (S→Y). Improves the rubric so surrogates better predict operational labels.
SDP-Gov (Layer 0): Governs the Protocol (Y→Y*). Improves the SDP so operational labels better align with the idealized target.
CLOVER: Continuous LLM-Oracle Validation & Evolution Regime
CLOVER is the governance framework for judges. It detects drift (when S→Y calibration degrades) and provides structured rubric improvement.
Use cases: Judge pools change, rubrics need updating, adversarial prompts emerge.
→ Read: CLOVER (Technical)
SDP-Gov: Protocol Governance
SDP-Gov is the governance framework for the protocol itself. It validates that Y aligns with Y* and adapts the SDP as the target evolves.
Use cases: Business priorities shift, user expectations change, new failure modes emerge.
Note: SDP-Gov is the least developed component of the CIMO Framework; it is defined but not yet implemented.
Bridge Validation Protocol (BVP)
Empirically validate that Y predicts Y* via Predictive Treatment Effects (PTE) against long-run outcomes.
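A minimal sketch of the PTE idea: across historical interventions, correlate each intervention's estimated effect on Y with its long-run effect on Y*. The function name and the four-intervention example are hypothetical illustrations, not the BVP's actual interface.

```python
import numpy as np

def bridge_validity(effects_y: np.ndarray, effects_ystar: np.ndarray) -> float:
    """Correlate estimated treatment effects on Y with effects on Y*.

    Each paired entry is one historical intervention. High correlation
    supports the Bridge Assumption (A0); low correlation warns that
    optimizing Y may leave idealized welfare untouched.
    """
    return float(np.corrcoef(effects_y, effects_ystar)[0, 1])

# Four hypothetical interventions whose Y-effects track their Y*-effects:
bridge_validity(np.array([0.1, 0.4, 0.2, 0.5]),
                np.array([0.2, 0.7, 0.3, 0.9]))  # close to 1.0
```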
→ Read: Validating the Bridge Assumption
The Geometry of Optimization
Why does optimization fail? What is the structure of reward hacking? Pillar C provides a geometric perspective on Goodhart's Law and the manifold topology of safe optimization.
The Surrogate Paradox
Why correlation breaks under optimization. A surrogate can be 95% correlated with welfare in passive observation, but catastrophically diverge when you optimize it.
The solution requires causal mediation: Z → Y → S topology. Every path from treatment to outcome must flow through the surrogate.
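A toy simulation of the paradox, with illustrative numbers (the noise scale, channel costs, and budget are all assumptions): under passive observation the surrogate correlates strongly with welfare, yet a budget-constrained optimizer spends everything on the cheap style channel, so S rises while Y* does not move at all.

```python
import numpy as np

rng = np.random.default_rng(0)

# Passive observation: surrogate = welfare + a little style noise,
# so the two are highly correlated when nobody is optimizing.
welfare = rng.normal(size=1000)
style = 0.3 * rng.normal(size=1000)
surrogate = welfare + style
passive_corr = float(np.corrcoef(surrogate, welfare)[0, 1])  # roughly 0.95

# Under optimization: both channels raise S equally, but style is far
# cheaper (cost 1 vs cost 10 per unit of S), so all effort flows there.
budget = 100.0
style_gain = budget / 1.0   # the whole budget goes to the side channel
welfare_gain = 0.0          # nothing is left for actual welfare
s_gain = style_gain + welfare_gain   # S improves by 100
ystar_gain = welfare_gain            # Y* improves by 0
```

The point of the sketch: passive correlation says nothing about what happens once S becomes the optimization target, because the optimizer chooses the cheapest path to S, not the path through welfare.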
→ Read: The Surrogate Paradox
The Geometry of Goodhart's Law
The manifold perspective on reward hacking. The Causal Information Manifold is the geometric structure where improvements in S correspond to improvements in Y*. Optimization "off the manifold" leads to Goodhart crashes.
DbP (Pillar A) and SDPs (Pillar C) work together to keep optimization on the manifold.
→ Read: The Geometry of Goodhart's Law
Robustness Under Pressure
How to test for fragility before deployment. Stress-test your evaluation under distributional shift, adversarial prompts, and extreme optimization.
Uses sensitivity analysis to identify the Goodhart Limit and failure modes.
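One minimal way to locate a Goodhart Limit from a stress-test run, assuming you can score Y* (or a proxy for it) at each optimization step; the trace shape below is a made-up illustration:

```python
import numpy as np

def goodhart_limit(ystar_trace: np.ndarray) -> int:
    """Step at which idealized welfare peaks along an optimization run.

    Past this step, further optimization of the surrogate degrades Y*;
    the returned step count is the safe optimization budget.
    """
    return int(np.argmax(ystar_trace))

# Toy trace: Y* improves early, then the optimizer finds a side channel
# and welfare collapses even as the surrogate (not shown) keeps rising.
steps = np.arange(100)
ystar = np.where(steps < 16, 0.1 * steps, 1.6 - 0.2 * (steps - 16))
goodhart_limit(ystar)  # 16: the Goodhart Limit for this trace
```

In practice the trace is noisy, so a smoothed or tolerance-banded peak detector would be more robust than a raw argmax.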
→ Read: Robustness Under Pressure
Related Reading
Conceptual
How metrics become manipulable under optimization
The crisis of trust in current evaluation
Technical
Formal treatment of the alignment problem
The final scorecard: Gains - Losses in Y* units
