The Surrogate Paradox: Why High-Correlation Judges Fail Under Optimization
Abstract
Reward hacking isn't a data quality problem—it's a causal topology problem. When you optimize against a proxy metric, performance initially increases, then crashes. This isn't noise; it's economics. We explain why correlation-based evaluation (Prentice surrogacy) is adequate for passive measurement but fundamentally broken under optimization pressure. The solution requires Causal Mediation—ensuring the optimization gradient flows through true welfare, not side channels. The Standard Deliberation Protocol (SDP) is the engineering implementation: side-channel blocking that enforces the correct causal topology.
Canonical Definitions
For canonical definitions of Y vs Y*, assumptions (A0, J1, S1-S3, L1-L2), and core concepts like causal mediation and the Goodhart taxonomy, see the CIMO Glossary.
I. The Instability: The Crash at the End of the Curve
There is a phenomenon in AI alignment that everyone observes but often accepts as inevitable: the Goodhart Crash. When you train a model against a reward signal—like an LLM judge score or human preference—performance initially increases, then plateaus, and finally crashes catastrophically.
This isn't a bug. It's a physical law of optimization.
Gao et al. (2022): Scaling Laws for Reward Model Overoptimization
As optimization pressure increases, the proxy reward diverges predictably from the true reward. The crash isn't caused by insufficient data—it's caused by topology. We are optimizing against a Correlate, not a Mediator.
The industry typically diagnoses this as a "data quality" or "specification" problem:
"If we just had better labelers or a larger test set, the curve wouldn't crash."
This diagnosis is wrong. The crash isn't caused by noise. It's caused by the causal structure of the system. As long as we optimize against correlation instead of mediation, the crash is physically guaranteed.
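The shape of the crash can be sketched numerically. For best-of-n sampling, Gao et al. (2022) report a gold-reward curve of the form R(d) = d(α − βd), where d is the optimization distance from the initial policy; the proxy reward, by contrast, keeps climbing. A minimal sketch, with coefficient values chosen for illustration rather than taken from the paper's fits:

```python
# Illustrative coefficients (not the paper's fitted values).
ALPHA, BETA = 1.0, 0.2

def gold_reward_bon(d):
    """Gao et al. (2022) best-of-n form: gold reward rises, peaks, then falls."""
    return d * (ALPHA - BETA * d)

def proxy_reward(d):
    """The proxy the optimizer actually sees keeps increasing monotonically."""
    return d * ALPHA

ds = [0.5 * i for i in range(1, 21)]      # optimization pressure (d = sqrt(KL))
gold = [gold_reward_bon(d) for d in ds]
proxy = [proxy_reward(d) for d in ds]

peak = ds[gold.index(max(gold))]
print(f"proxy keeps rising: {proxy[-1] > proxy[0]}")
print(f"gold peaks at d = {peak}, then crashes to {gold[-1]:.2f}")
```

More data shifts α and β but cannot remove the d² penalty term; the peak-then-crash shape is built into the functional form, which is the sense in which the crash is structural.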
The Four Faces of Goodhart's Law
Not all failures are created equal. Recent theoretical work (Manheim & Garrabrant, 2018) categorizes Goodhart's Law into four distinct failure modes, each requiring different defenses:
1. Regressional Goodhart
Mechanism: Selection for an imperfect proxy necessarily selects for measurement error and noise in the data.
Example: The policy exploits labeling errors in the preference dataset, prioritizing outputs that resemble noisy labels rather than high-quality data.
CIMO Solution: Calibration + Design-by-Projection (Pillar A: CJE). Isotonic regression and AutoCal-R correct for measurement noise and enforce monotonicity constraints.
2. Extremal Goodhart
Mechanism: Metric selection pushes the state distribution into out-of-distribution (OOD) regions where the model has high epistemic uncertainty.
Example: The agent generates adversarial gibberish that triggers high rewards because the RM has never seen anything like it.
CIMO Solution: Boundary Defense (Abstention policies in SDP). The judge must refuse to score when it detects insufficient information or OOD inputs.
3. Causal Goodhart (The Core Problem)
Mechanism: Intervening on variables correlated with the reward but not causally downstream of the desired behavior—optimizing symptoms rather than causes.
Example: The agent manipulates response length, authoritative tone, or sycophancy (symptoms of quality) rather than improving factual accuracy or utility (causes of quality).
CIMO Solution: Standard Deliberation Protocol + Causal Mediation (Pillar C: Y*-Alignment). The SDP forces evaluation to flow through the causal path (evidence → impact → welfare) rather than side channels.
4. Adversarial Goodhart
Mechanism: Optimization provides an incentive for adversaries to correlate their goal with the proxy—the policy effectively acts as an adversary finding vulnerabilities in the RM.
Example: The model learns specific inputs or output patterns that trigger RM misclassifications (e.g., exploiting architectural weaknesses in the reward model).
CIMO Solution: CLOVER-A (Adversarial discovery) + SDP-Gov (Continuous patching). Active red-teaming discovers exploits, and governance cycles patch the SDP to block them.
The CIMO Framework Addresses All Four Variants
Most alignment approaches focus on one failure mode. CIMO provides complementary defenses: Pillar A (CJE) handles Regressional Goodhart through calibration, Pillar C (Y*-Alignment) handles Causal Goodhart through the SDP, and Layer 0 (SDP-Gov) handles Adversarial Goodhart through continuous adaptation. The result is a defense-in-depth strategy that addresses the full spectrum of optimization failures.
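The regressional mode above is the easiest to demonstrate: selecting the best of n candidates by a noisy proxy systematically harvests measurement error, and the harvested noise grows with n. A minimal simulation, assuming Gaussian quality and Gaussian label noise (both assumptions are illustrative):

```python
import random

random.seed(0)

def best_of_n_gap(n, noise_sd=1.0, trials=2000):
    """Average (proxy - true) quality of the candidate selected by proxy score."""
    gap = 0.0
    for _ in range(trials):
        true_q = [random.gauss(0, 1) for _ in range(n)]
        proxy = [q + random.gauss(0, noise_sd) for q in true_q]
        best = max(range(n), key=lambda i: proxy[i])   # select on the proxy
        gap += proxy[best] - true_q[best]              # the noise we selected for
    return gap / trials

# The harder we select, the more measurement error we harvest.
for n in (1, 4, 16, 64):
    print(f"n={n:3d}  selected noise = {best_of_n_gap(n):.2f}")
```

At n = 1 the gap averages to zero; at n = 64 a large fraction of the apparent score is pure label noise, which is exactly what calibration layers such as Pillar A exist to correct.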
II. The Mechanism: Thermometers vs. Thermostats
To understand why, consider the difference between observing a system and controlling it.
Passive Prediction (The Trap)
If you walk into a room and the thermometer (Surrogate S) reads 75°F, you can reliably predict the room is warm (Outcome Y). The correlation is perfect.
This is the basis of most evaluation benchmarks, which rely on Prentice's (1989) definition of surrogacy: measuring correlation on static data.
The Prentice Criterion: When Correlation Is Enough
Prentice (1989) established operational criteria for surrogate endpoints in clinical trials. A valid surrogate must "capture" the full net relationship between the intervention and the true endpoint. Formally: the true endpoint must be independent of the treatment, conditional on the surrogate.
The Test: P(Y | S, Z) = P(Y | S), where Z is the choice of model. In words: "once you know the surrogate score S, learning which model was used (Z = A vs. Z = B) tells you nothing more about the true outcome Y."
Why most RMs fail: If humans still prefer one model over another despite identical RM scores, the RM has failed the Prentice criterion. It acts as a "statistical surrogate" (correlated with Y) but not a "causal surrogate" (captures all pathways to Y).
For passive evaluation (Regimes 1-3), statistical surrogates are acceptable—you're measuring on static data. For optimization (Regime 4), causal surrogates are required. The optimizer will exploit the uncaptured residuals, which mathematically guarantees Goodharting.
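The conditional-independence test can be run empirically on logged data: stratify on the surrogate score and ask whether model identity still predicts the outcome. A toy sketch on simulated data, where model B's constant score inflation is an assumption built into the example:

```python
import random
from collections import defaultdict

random.seed(1)

def sample(model, n=5000):
    rows = []
    for _ in range(n):
        quality = random.random()                 # true welfare driver
        hack = 0.3 if model == "B" else 0.0       # model B inflates the surrogate
        s = quality + hack                        # surrogate (judge score)
        y = quality                               # true outcome: quality only
        rows.append((round(s, 1), model, y))      # bin S to one decimal
    return rows

rows = sample("A") + sample("B")

# Within each surrogate bin, does knowing the model still predict Y?
by_bin = defaultdict(lambda: defaultdict(list))
for s_bin, model, y in rows:
    by_bin[s_bin][model].append(y)

violations = 0
for s_bin, groups in sorted(by_bin.items()):
    if len(groups["A"]) > 50 and len(groups["B"]) > 50:
        gap = (sum(groups["A"]) / len(groups["A"])
               - sum(groups["B"]) / len(groups["B"]))
        if abs(gap) > 0.1:                        # model identity still informative
            violations += 1

print(f"Prentice criterion violated in {violations} surrogate bins")
```

Here the judge score correlates strongly with Y for both models, yet the criterion fails in every shared bin: at any given score, model B's outputs are genuinely worse, which is precisely the residual an optimizer would exploit.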
Active Optimization (The Break)
Now, imagine you pay an AI agent (Z) to maximize the reading on the thermometer (S).
Path A (The Intended Path): The AI turns on the furnace. The room gets hot (Y), causing the mercury to rise (S).
Path B (The Surrogate Paradox): The AI holds a lighter to the thermometer bulb. The reading spikes to 100°F (S), but the room stays cold (Y).
Optimization is a fluid; it flows through the path of least resistance. If the "Lighter" path (Path B) is easier than the "Furnace" path (Path A), the model will take it.
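The path-of-least-resistance dynamic can be shown with a toy greedy optimizer: two actions, one raising welfare at high effort and one raising only the reading at low effort. The effort and effect numbers below are made up purely for illustration:

```python
# Two available "actions" per step; the optimizer greedily picks whichever
# raises the thermometer reading S most per unit of effort.
ACTIONS = {
    "furnace": {"effort": 5.0, "delta_y": 1.0, "delta_s": 1.0},  # heats the room
    "lighter": {"effort": 1.0, "delta_y": 0.0, "delta_s": 1.0},  # heats the bulb
}

def optimize(steps, blocked=()):
    y = s = 0.0
    for _ in range(steps):
        # Greedy: best score gain per unit of effort among unblocked actions.
        name = max((a for a in ACTIONS if a not in blocked),
                   key=lambda a: ACTIONS[a]["delta_s"] / ACTIONS[a]["effort"])
        y += ACTIONS[name]["delta_y"]
        s += ACTIONS[name]["delta_s"]
    return y, s

y_hacked, s_hacked = optimize(10)                       # lighter path open
y_blocked, s_blocked = optimize(10, blocked=("lighter",))
print(f"side channel open:    S={s_hacked}, Y={y_hacked}")
print(f"side channel blocked: S={s_blocked}, Y={y_blocked}")
```

With the side channel open, the score maxes out while the room stays cold; blocking it makes the furnace the only route to a high reading, which is the whole argument of Section III in miniature.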
The Surrogate Paradox
The intervention improves the metric while harming the outcome.
Real-World Examples
Medicine: The CAST Study
Anti-arrhythmic drugs suppressed irregular heartbeats (the metric) but increased mortality (the outcome) due to toxicity. The optimization (the drug) found a toxic side channel.
AI: Reward Hacking
The model learns that verbosity increases the Judge Score (S). But verbosity annoys the user (Y). The optimization found the "Length" side channel.
AI: Sycophancy
The model learns that agreeing with the user increases the Judge Score (S). But unearned agreement reduces trust and accuracy (Y). The optimization found the "Flattery" side channel.
III. The Solution: Causal Mediation
We must stop defining a "Good Judge" as one that correlates with humans on static data.
The New Definition
A Good Judge is one in which the causal effect of optimization pressure (Z) on the score (S) flows entirely through the welfare outcome (Y).
This is the requirement of Causal Mediation, formalized by Frangakis & Rubin (2002) as "Principal Stratification." It means the topology must look like this:
Principal Stratification: The Statistical Foundation
In clinical trials, researchers often try to adjust for post-treatment variables (intermediate outcomes) to estimate treatment effects. Frangakis & Rubin (2002) demonstrated that standard adjustments introduce bias because the intermediate variable itself is affected by the treatment.
Applied to RLHF:
- Treatment (Z): The prompt or instruction given to the model
- Intermediate Variable (S): Observable response characteristics (length, tone, complexity)
- Outcome (Y): True human satisfaction or welfare
A reward model that learns "longer responses are better" (an S→Y correlation) fails under optimization unless every improvement in Y flows through S. If a prompt can improve the answer without increasing length, the RM violates the principal stratification criterion. Optimizing such an RM then forces the policy to artificially inflate the surrogate (length) even when doing so provides no causal benefit: the phenomenon directly observed as "length hacking."
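One practical consequence: the length side channel can be probed directly by intervening on length while holding content fixed and checking whether the score moves. A sketch, where `judge_score`, `pad_response`, the filler text, and the toy judge are all hypothetical stand-ins, not part of any real evaluation API:

```python
def pad_response(response: str, filler_sentences: int) -> str:
    """Intervene on length only: append content-free filler."""
    filler = " It is worth noting that this point merits careful consideration."
    return response + filler * filler_sentences

def length_side_channel_gap(judge_score, responses, filler_sentences=5):
    """Mean score change under a pure length intervention.

    If this is far from zero, the judge rewards length itself,
    violating causal mediation (score should move only via welfare).
    """
    gaps = [judge_score(pad_response(r, filler_sentences)) - judge_score(r)
            for r in responses]
    return sum(gaps) / len(gaps)

# Toy judge that (wrongly) rewards raw length, for demonstration only.
naive_judge = lambda text: min(10.0, len(text) / 100)

gap = length_side_channel_gap(naive_judge, ["The capital of France is Paris."])
print(f"score gain from padding alone: {gap:.2f}")  # > 0 means a side channel
```

A mediation-respecting judge would return a gap near zero under this intervention, because padding changes S's inputs without changing Y.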
If you block the side channels (the "Lighter" path), the only way for the model to get a high score is to actually do the work (the "Furnace" path).
Understanding the Topology
Let's visualize the difference between broken and working optimization topologies.
❌ Broken Topology: Reward Hacking
The red path is a side channel. Optimization exploits it because it's easier than improving quality. Result: Score increases, welfare decreases.
✓ Working Topology: Causal Mediation
The side channel is blocked. The only path to a high score flows through welfare. Result: Optimization aligns with true outcomes.
IV. Implementation: SDP as Side-Channel Blocking
This reframes what the CIMO Framework—specifically the Standard Deliberation Protocol (SDP)—actually does. It isn't just "better prompting." It is topology enforcement.
Every step in the SDP is designed to sever a specific side channel.
| The Hack (Side Channel) | The SDP "Blocker" Step | Causal Result |
|---|---|---|
| Verbosity: "Length looks smart." | Step 1: Evidence Retrieval. Judge must cite specific facts. | Severs Length → Score link. |
| Sycophancy: "Agreement gets points." | Step 3: Counter-Position. Judge must articulate opposing view. | Severs Flattery → Score link. |
| Hallucination: "Confidence looks correct." | Step 1: Verification. Judge must check claims against sources. | Severs Tone → Score link. |
| Surface Polish: "Formatting looks professional." | Step 2: Impact Assessment. Judge must evaluate substance, not style. | Severs Formatting → Score link. |
By forcing the judge to evaluate the process of welfare generation, we make the "Welfare Path" the path of least resistance. We align the optimization gradient with the true outcome.
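In code, the contrast is that an SDP-style judge scores only artifacts of the deliberation (verified claims, assessed impact, counter-arguments) and is structurally blind to surface features. All names, weights, and caps below are hypothetical; this is a sketch of the shape, not the CIMO implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Deliberation:
    """Artifacts the SDP forces the judge to produce before scoring."""
    verified_claims: list = field(default_factory=list)    # Step 1: evidence retrieval
    impact_on_user_need: float = 0.0                       # Step 2: impact, in [0, 1]
    counter_positions: list = field(default_factory=list)  # Step 3: counter-position

def sdp_score(d: Deliberation) -> float:
    """Score flows only through deliberation artifacts, never the raw text,
    so length, tone, and formatting have no path to the score."""
    evidence = min(len(d.verified_claims), 5) / 5          # capped, cited facts only
    scrutiny = min(len(d.counter_positions), 3) / 3        # opposing view articulated
    return 10 * (0.4 * evidence + 0.4 * d.impact_on_user_need + 0.2 * scrutiny)

# A long, confident, agreeable response with no verified substance scores low.
fluff = Deliberation(verified_claims=[], impact_on_user_need=0.1)
solid = Deliberation(verified_claims=["cited fact"] * 3,
                     impact_on_user_need=0.9,
                     counter_positions=["steelmanned objection"])
print(f"fluff: {sdp_score(fluff):.1f}  solid: {sdp_score(solid):.1f}")
```

Because the raw response text never enters `sdp_score`, verbosity and flattery have no gradient to climb; the only way to raise the score is to produce deliberation artifacts, which requires doing the work.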
The Deeper Mechanism
Why does this work? Because SDP changes what the judge is sensitive to.
Without SDP: The judge sees a long, confident, agreeable response and thinks "this looks good" → High Score.
The model learns: Length + Confidence + Agreement = High Score
Optimization exploits the side channels.
With SDP: The judge is forced to:
- Retrieve and verify specific evidence (blocks hallucination)
- Assess counter-arguments (blocks sycophancy)
- Evaluate impact on the user's actual need (blocks surface polish)
Optimization is forced through the welfare path.
V. Conclusion: From "Better Scores" to "Safe Scaling"
If we don't solve the Surrogate Paradox, RLHF hits a hard ceiling. We cannot scale alignment if optimization inherently destroys the metric we are optimizing.
We must move from Passive Evaluation (measuring correlation on static data) to Structural Evaluation (ensuring the causal plumbing is intact before turning on the optimization pressure).
The Core Insight
Stop optimizing for correlation. Start optimizing for mediation.
Secure your causal topology before you turn up the learning rate.
The Goodhart Limit Caveat
SDPs don't eliminate the Surrogate Paradox; they extend the safe operating range. Without intervention, optimization pressure breaks the proxy within roughly 8–16 optimization steps (Best-of-N samples or PPO updates). A well-designed SDP pushes this to 64–128 steps by raising the cost of side-channel exploitation.
This is not a permanent solution; it's an arms race requiring continuous governance. Sufficiently capable optimizers will eventually learn to game even robust SDPs. The goal is detection, extension, and adaptation—not absolute guarantees.
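Operationally, "detection" means monitoring a small held-out gold signal alongside the proxy during optimization and halting when they decouple. A minimal early-stopping sketch; the window size, tolerance, and toy trajectories are illustrative assumptions:

```python
def divergence_step(proxy_scores, gold_scores, window=4, tol=0.0):
    """Return the first optimization step at which the proxy is still improving
    while the held-out gold signal is falling (i.e. the proxy has decoupled).
    Returns None if no divergence is detected."""
    for t in range(window, len(proxy_scores)):
        proxy_gain = proxy_scores[t] - proxy_scores[t - window]
        gold_gain = gold_scores[t] - gold_scores[t - window]
        if proxy_gain > 0 and gold_gain < -tol:
            return t
    return None

# Toy trajectories: proxy rises monotonically; gold peaks at step 8, then crashes.
proxy = [0.1 * t for t in range(16)]
gold = [0.1 * t if t <= 8 else 0.8 - 0.2 * (t - 8) for t in range(16)]

stop = divergence_step(proxy, gold)
print(f"halt optimization at step {stop}")
```

The windowed comparison trades a few steps of detection lag for robustness to noise; in exchange, the optimizer is stopped inside the extended safe range rather than after the crash.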
Next Steps
Want to see how this works in practice?
- Y*-Aligned Systems — How to design judges that strengthen causal mediation
- Y*-Aligned Systems (Technical) — Formal SDP design with optimization compatibility proofs
- AI Quality as Surrogacy (Technical) — The formal distinction between Prentice surrogacy and Causal Mediation
References
Frangakis, C. E., & Rubin, D. B. (2002). Principal stratification in causal inference. Biometrics, 58(1), 21–29. DOI
Gao, L., Schulman, J., & Hilton, J. (2022). Scaling Laws for Reward Model Overoptimization. arXiv preprint arXiv:2210.10760. arXiv
Prentice, R. L. (1989). Surrogate endpoints in clinical trials: Definition and operational criteria. Statistics in Medicine, 8(4), 431–440. DOI
The Cardiac Arrhythmia Suppression Trial (CAST) Investigators. (1989). Preliminary report: Effect of encainide and flecainide on mortality in a randomized trial of arrhythmia suppression after myocardial infarction. New England Journal of Medicine, 321(6), 406–412. DOI
