The Welfare Compiler: One Estimator, Three Regimes
Your reward model and your evaluator are measuring different things. Not "slightly different"—fundamentally different. When the policy moves to maximize the first, you measure it against the second, and then you're surprised when the numbers don't match.
TL;DR by role
- RLHF engineer: Your reward model and eval suite are measuring different things. That's why your evals don't reflect training gains.
- Eval engineer: If you're not using the same calibrator as training, you're measuring a different functional. Your numbers are real; they're just answering a different question.
- Product owner: The "alignment tax" you're paying is partly a consistency tax. Unifying the pipeline can recover some of it.
Core Thesis
Pretraining, RLHF, and evaluation are three different regimes of estimating and optimizing the same underlying welfare functional. They just use different surrogates, under different constraints, with different control variables. When those surrogates point at different targets, you get systematic misalignment—not as a philosophical puzzle, but as a decomposable statistical error.
One Functional, Three Regimes
Start with a simple object: the welfare functional for a policy π:
V(π) = E_{X ~ P, A ~ π(·|X)}[Y*(X, A)]
Where:
- X: contexts (prompts, tasks, user states)
- A: actions (model outputs, responses)
- Y*: true welfare—what you actually care about (truth, helpfulness, alignment with user intent, long-term value)
- π: your policy (LLM)
This is the thing you want to maximize. But you never observe Y* directly. It's too expensive (human deliberation, expert audits, long-term outcomes) or too abstract (idealized truth, counterfactual user satisfaction).
Notation (used consistently throughout)
- Y*: ideal welfare — the thing you actually care about (unobserved)
- Y: operational welfare — expensive labels approximating Y* (expert audits, SDP-produced ratings)
- S: cheap surrogates — LLM judge scores, clicks, thumbs up, log-probs
The calibrator maps S → E[Y | S, X] using (S, Y) pairs. Y is your best observable proxy for Y*.
Every stage of your pipeline is trying to estimate or optimize V(π) using these surrogates. The difference is what you can observe and what you can control:
| Stage | Observable | Control | Goal |
|---|---|---|---|
| Pretraining | (X, text) from internet | Model parameters θ | Learn representations, base policy π₀ |
| RLHF | (X, S, Y) for sampled actions | Policy π via gradient updates | Maximize E[Y* \| π] using a learned reward model |
| Evaluation | (X, S, Y) from logs or fresh draws | Policy selection only | Estimate V(π) for multiple π, choose best |
They're three faces of the same statistical decision problem. And when they disagree about which surrogate approximates Y*, you get multi-stage Goodhart's Law.
The Compiler Analogy
Think of your AI pipeline as a compiler that transforms specifications into behavior. The current stack is a broken compiler: it optimizes for side-effects (proxy metrics) rather than semantics (actual welfare).
- Pretraining → Standard library / untyped substrate (defines what behaviors are expressible)
- SDP → Type system for welfare traces (makes certain misalignments structurally inexpressible)
- Calibrator → Compiler passes over the same IR (reward model = eval model = same semantic target)
- CJE / OPE → Linting, tests, and runtime contracts (checking the same semantics at different stages)
The goal: type-safe AI development where the "compiled" policy provably respects the welfare specification.
The Error Decomposition
Here's where the misalignment actually comes from. Your evaluation estimate of V(π) has three sources of error:
What V⁽¹⁾ and V⁽²⁾ actually are
- V(π): True welfare under Y*. The thing you actually care about. Typically unobserved.
- V⁽¹⁾(π): The welfare functional your RLHF reward model is optimizing. This is E[r_θ(X, A) | π] where r_θ was trained on some (S, Y) dataset with some calibration procedure. It's what the RL algorithm thinks it's maximizing.
- V⁽²⁾(π): The welfare functional your evaluation suite measures. This might use a different judge, different calibrator, or different Y labels than RLHF used.
- V̂(π): Your finite-sample estimate of V⁽²⁾(π), subject to variance, bias, and coverage limitations.
Key insight: If V⁽¹⁾ ≠ V⁽²⁾ (different calibrators, different Y labels, different judges), then RLHF and eval are measuring different things by construction. This isn't a bug—it's a design choice that most teams make accidentally.
1. Spec Error: V(π) - V⁽¹⁾(π)
This is the classic alignment problem: your reward model doesn't actually measure what you care about.
Example: Sycophancy
You care about truthfulness (Y*), but your reward model is trained on "user thought this answer was helpful" (Y⁽¹⁾). Sycophancy—telling users what they want to hear—scores high on Y⁽¹⁾ but low on Y*.
This is the Surrogate Paradox: high correlation between S and Y* at evaluation time doesn't mean optimizing S will increase Y*. You need causal mediation—the action must affect Y* through S, not just correlate with it.
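The Surrogate Paradox is easy to reproduce in a toy simulation (every quantity below is made up for illustration): a judge score S that over-rewards "style" can be almost perfectly correlated with Y* on evaluation data, yet optimizing S still degrades Y*.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: each response has a "substance" feature that
# causally drives true welfare Y*, and a "style" feature that only
# impresses the judge.
def judge_score(a):                  # cheap surrogate S
    return a[..., 0] + 2.0 * a[..., 1]   # the judge over-rewards style

def true_welfare(a):                 # Y*: only substance matters
    return a[..., 0]

# At evaluation time, style happens to track substance, so S and Y*
# look almost perfectly correlated -- S looks like a great proxy.
substance = rng.normal(size=10_000)
style = 0.9 * substance + 0.1 * rng.normal(size=10_000)
eval_actions = np.stack([substance, style], axis=-1)
corr = np.corrcoef(judge_score(eval_actions), true_welfare(eval_actions))[0, 1]

# Now optimize S under a unit "effort" budget ||a|| = 1.
theta = np.linspace(0, np.pi / 2, 1001)
candidates = np.stack([np.cos(theta), np.sin(theta)], axis=-1)
a_opt_S = candidates[np.argmax(judge_score(candidates))]
a_opt_Y = candidates[np.argmax(true_welfare(candidates))]

print(corr)                   # high correlation at evaluation time
print(true_welfare(a_opt_S))  # yet the S-optimum sacrifices true welfare
print(true_welfare(a_opt_Y))
```

The correlation at evaluation time is near 1, but the S-maximizing action pours the effort budget into style and recovers less than half the welfare of the Y*-maximizing action.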
2. Cross-Stage Mismatch: V⁽¹⁾(π) - V⁽²⁾(π)
Even if your reward model and your evaluator are both decent proxies for Y*, if they measure different aspects of welfare, RLHF will optimize for one while you measure the other.
Example: Helpfulness vs Business KPIs
RLHF reward model trained on "helpfulness" (detailed answers, citations, formatting). Evaluation suite measures "business KPI" (user retention, task completion, cost per query).
You spend compute optimizing helpfulness. Your evals show retention didn't improve—or got worse because responses are too verbose. The policy is doing exactly what you told it to; you just told it the wrong thing.
3. Estimation Error: V⁽²⁾(π) - V̂(π)
Even if RLHF and eval agree on the target, your empirical estimate V̂(π) has variance and bias from:
- Finite labeled data
- Off-policy correction (if you're evaluating π using logs from a different policy μ)
- Calibration error in your judge/surrogate
- Coverage gaps (your logging policy never saw the regions where π operates after RLHF)
This is the domain of Causal Judge Evaluation (CJE): turning cheap surrogates into unbiased, low-variance estimators of V(π) with honest uncertainty quantification.
What Each Stage Actually Does
Pretraining: Learning the Support and Representation
Pretraining doesn't optimize Y*. It doesn't even know what Y* is. What it does:
1. Defines the support of behaviors
The pretrained model π₀ determines which regions of (X, A) space are reachable. It defines the "standard library" of behaviors your later stages can compose. If pretraining is on web text, certain reasoning patterns, linguistic styles, and knowledge domains have high density. Others are nearly unreachable without massive RL compute.
2. Learns representations that reduce variance
Pretraining learns features φ(X, A) that later stages use to approximate Y*:
Y*(X, A) ≈ g(φ(X, A)), for some simple head g fit downstream

Good pretraining → lower sample complexity for learning the Y* mapping during reward modeling and calibration.
3. Provides a prior over policies
The KL penalty in RLHF literally uses π₀ as a prior:
π* = argmax_π E_{X, A ~ π(·|X)}[r_θ(X, A)] - β · KL(π ‖ π₀)
This is Bayesian policy optimization: start with π₀, update toward policies with higher expected reward, but don't move too far.
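This view can be sketched numerically (toy rewards and prior, not any production objective): the KL-regularized objective has a well-known closed-form optimum, the prior tilted by exp(r/β).

```python
import numpy as np

# Hypothetical discrete setting: 3 candidate responses.
r = np.array([1.0, 0.2, -0.5])        # reward-model scores r_theta
pi0 = np.array([0.5, 0.3, 0.2])       # pretrained prior pi_0
beta = 0.5                            # KL penalty strength

def objective(pi):
    """E_pi[r] - beta * KL(pi || pi_0): the KL-regularized RLHF objective."""
    return float(pi @ r - beta * np.sum(pi * np.log(pi / pi0)))

# Closed-form optimum: tilt the prior by exp(r / beta), then normalize.
pi_star = pi0 * np.exp(r / beta)
pi_star = pi_star / pi_star.sum()

# Sanity check: pi_star beats the prior and a batch of random policies.
rng = np.random.default_rng(0)
best_random = max(objective(p) for p in rng.dirichlet(np.ones(3), size=2000))
print(objective(pi_star) >= max(objective(pi0), best_random))   # True
```

With β large, π* stays close to π₀; with β small, it concentrates on the highest-reward response. That is the Bayesian-update reading made concrete.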
Bottom line: Pretraining sets the geometry, support, and inductive biases for everything downstream. It's not optimizing welfare, but it constrains how well you can optimize welfare later.
RLHF: On-Policy Optimization of an Estimator
RLHF is direct method estimation of V(π) used inside a policy gradient loop.
1. Reward model training = CJE calibration
You collect a slice of (X, A, S, Y) data where S is cheap judge scores and Y is expensive oracle labels. Then you fit:
r_θ(X, A) = f(S(X, A)), where f(s) ≈ E[Y | S = s] is fit on the (S, Y) slice
This is AutoCal-R (Automatic Calibration for Rewards): isotonic regression or parametric calibration mapping surrogates to outcomes.
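AutoCal-R's exact procedure is not reproduced here; as a sketch of the isotonic option only, here is a minimal pool-adjacent-violators fit of a monotone map from judge score S to E[Y | S] (in practice you would likely use a library implementation such as scikit-learn's `IsotonicRegression`):

```python
import numpy as np

def pav_isotonic(s, y):
    """Pool-Adjacent-Violators: fit a monotone map s -> E[Y | S = s].
    Returns fitted values aligned with the input order."""
    s, y = np.asarray(s, float), np.asarray(y, float)
    order = np.argsort(s)
    y_sorted = y[order]
    means, weights, starts = [], [], []   # current monotone blocks
    for i, v in enumerate(y_sorted):
        means.append(v); weights.append(1.0); starts.append(i)
        # Merge adjacent blocks while monotonicity is violated.
        while len(means) > 1 and means[-2] > means[-1]:
            m, w = means.pop(), weights.pop()
            starts.pop()
            means[-1] = (means[-1] * weights[-1] + m * w) / (weights[-1] + w)
            weights[-1] += w
    fitted = np.empty_like(y_sorted)
    starts.append(len(y_sorted))
    for b in range(len(means)):
        fitted[starts[b]:starts[b + 1]] = means[b]
    out = np.empty_like(fitted)
    out[order] = fitted
    return out

# Noisy monotone relationship between judge score S and oracle label Y.
rng = np.random.default_rng(0)
s = rng.uniform(0, 10, 2000)
y = (rng.random(2000) < 1 / (1 + np.exp(-(s - 5)))).astype(float)
r = pav_isotonic(s, y)   # calibrated reward, a step-function estimate of E[Y | S]
```

The fitted values are monotone in S and preserve the overall label mean, which is exactly the property that lets the same object serve as both reward model and evaluation baseline.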
2. RLHF = maximizing the calibrated surrogate
Once you have r_θ, RLHF does:
This is the Direct Method for policy evaluation: generate fresh samples from π, score them with r_θ, average. Except here, you're using it to optimize, not just estimate.
The pathology
RLHF treats r_θ as if it's the ground truth, not an estimator with error bars. As π changes, the distribution of (X, A) changes, and r_θ's calibration may break. You have no runtime feedback loop to detect this until you evaluate with Y and see the divergence.
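A runtime feedback loop along these lines can be sketched as a binned calibration check (the function name and numbers are illustrative, not part of any stated API): bucket the reward model's scores and compare each bucket against fresh oracle labels drawn from the current policy's outputs.

```python
import numpy as np

def binned_calibration_gap(r_pred, y_fresh, n_bins=10):
    """ECE-style drift check: bucket reward-model scores by quantile and
    compare each bucket's mean score to the mean of fresh oracle labels."""
    r_pred, y_fresh = np.asarray(r_pred, float), np.asarray(y_fresh, float)
    edges = np.quantile(r_pred, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.digitize(r_pred, edges[1:-1]), 0, n_bins - 1)
    gap = 0.0
    for b in range(n_bins):
        m = idx == b
        if m.any():
            gap += m.mean() * abs(r_pred[m].mean() - y_fresh[m].mean())
    return gap

rng = np.random.default_rng(0)
p = rng.uniform(0.1, 0.9, 5000)                 # reward-model scores
# Well calibrated on the old distribution...
y_old = (rng.random(5000) < p).astype(float)
# ...but after the policy moves, outcomes no longer match the scores.
y_shifted = (rng.random(5000) < np.clip(p - 0.3, 0, 1)).astype(float)
print(binned_calibration_gap(p, y_old) < binned_calibration_gap(p, y_shifted))  # True
```

Scheduling such a check on periodic fresh Y labels is one cheap way to detect that π has walked out of the calibrator's valid region before the big eval run does.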
Evaluation: Off-Policy Estimation of V(π)
Evaluation is the same statistical problem as RLHF's objective—estimate V(π)—but with π frozen and using logged data instead of fresh samples.
You have a target policy π, a logging policy μ, and surrogates S with optional oracle labels Y. And you want to estimate:
V(π) = E_{X ~ P, A ~ π(·|X)}[Y*(X, A)]
using data from μ, not π. Three regimes:
Direct Method (DM)
Generate fresh samples from π, score them with a calibrated judge:
V̂_DM(π) = (1/n) Σᵢ r_θ(Xᵢ, Aᵢ), with Aᵢ ~ π(·|Xᵢ)
Inverse Propensity Scoring (IPS)
Reweight logged data by importance weights:
V̂_IPS(π) = (1/n) Σᵢ wᵢ · r_θ(Xᵢ, Aᵢ), with wᵢ = π(Aᵢ|Xᵢ) / μ(Aᵢ|Xᵢ) and (Xᵢ, Aᵢ) logged under μ
Doubly Robust (DR)
Combine a baseline model with importance-weighted corrections:
V̂_DR(π) = (1/n) Σᵢ [ E_{A ~ π(·|Xᵢ)}[r_θ(Xᵢ, A)] + wᵢ · (Yᵢ - r_θ(Xᵢ, Aᵢ)) ], with wᵢ = π(Aᵢ|Xᵢ) / μ(Aᵢ|Xᵢ)
All three use the same calibrated surrogate machinery as RLHF's reward model. The difference is control: you're estimating, not optimizing.
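The three estimators can be sketched in a toy single-context bandit (all numbers hypothetical; `q_hat` plays the role of the calibrated surrogate):

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([0.4, 0.3, 0.2, 0.1])   # logging policy over 4 responses
pi = np.array([0.1, 0.1, 0.3, 0.5])   # target policy to evaluate
q = np.array([0.2, 0.4, 0.6, 0.8])    # true E[Y | action] (unknown in practice)
true_V = float(pi @ q)                # the estimand V(pi) = 0.64

n = 50_000
a = rng.choice(4, size=n, p=mu)                  # logged actions from mu
y = (rng.random(n) < q[a]).astype(float)         # logged outcomes
q_hat = q + rng.normal(0, 0.05, 4)               # imperfect calibrated model

V_dm = float(pi @ q_hat)                         # Direct Method: model only
w = pi[a] / mu[a]                                # importance weights
V_ips = float(np.mean(w * y))                    # IPS: reweighted outcomes
V_dr = float(pi @ q_hat + np.mean(w * (y - q_hat[a])))   # Doubly Robust

print(V_dm, V_ips, V_dr)   # all near true_V; DM inherits the model's bias
```

DM is low-variance but inherits whatever bias `q_hat` has; IPS is unbiased but pays in variance where the weights are large; DR keeps the model as a baseline and corrects it with the reweighted residuals, staying unbiased as long as either the weights or the model is right.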
The failure mode
If your evaluator uses a different judge, different calibration, or different Y than RLHF's reward model, you're measuring V⁽²⁾(π) while RLHF optimized V⁽¹⁾(π). Even if both are good proxies for Y*, the cross-stage mismatch means your eval doesn't measure what RL did.
Three Concrete Failure Modes
Let's make this practical. Here are three ways the pipeline breaks when stages disagree:
Failure Mode 1
Sycophancy (Spec Error)
True welfare Y*: truthful, factually accurate answers
RLHF reward model: "user rated this answer as helpful"
Evaluator: fact-checking against ground truth
Users often rate confident, detailed answers as "helpful" even when they're wrong. The reward model learns to optimize for user approval. The model learns to sound confident (high r_θ, low Y*), agree with user priors, and provide detailed but unverified claims.
Your evaluator runs fact-checking and finds: accuracy went down after RLHF.
Error term: V(π) - V⁽¹⁾(π) is large. The reward model was never measuring truthfulness; it was measuring user approval.
Failure Mode 2
Helpfulness vs. Business KPIs (Cross-Stage Mismatch)
RLHF reward model: "helpfulness" (detailed explanations, citations, formatting)
Evaluator: task success rate and user retention
RLHF optimizes for detailed, well-formatted answers. The model learns to always provide long explanations and never give short answers.
Your evaluator measures task completion and finds:
- Task success down: users wanted quick answers for simple queries, got walls of text
- Retention down: slow responses, cognitive load increased
- Cost up: longer responses = more tokens = higher inference cost
Error term: V⁽¹⁾(π) - V⁽²⁾(π) is large. Both are proxies for "good responses," but they point in different directions.
Failure Mode 3
Coverage-Limited Efficiency (Estimation Error)
Setup: RLHF pushes π away from the pretrained prior π₀
Problem: Evaluator uses logged data from π₀ with off-policy correction
After RLHF, π operates in regions where μ (the logging policy) had low density. Importance weights explode. A few samples dominate the estimate. Effective sample size (ESS) collapses.
Your evaluator reports:
- High variance: ±30% confidence intervals on policy ranking
- Unstable rankings: π₁ > π₂ in one seed, π₂ > π₁ in another
- Broken diagnostics: coverage warnings, tail-heavy weight distributions
Error term: V⁽²⁾(π) - V̂(π) is large due to poor overlap and distribution shift.
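The collapse is easy to see in a one-dimensional sketch: take μ = N(0, 1) as the logging policy and let RLHF shift the target to π = N(shift, 1) (a standard Gaussian toy, not any real logging setup).

```python
import numpy as np

def ess_fraction(w):
    """Kish effective sample size as a fraction of n: (Σw)² / (n · Σw²)."""
    w = np.asarray(w, float)
    return float(w.sum() ** 2 / (len(w) * (w ** 2).sum()))

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(0.0, 1.0, n)        # logged samples from mu = N(0, 1)

# As RLHF pushes pi = N(shift, 1) away from mu, weights blow up.
ess = {}
for shift in [0.0, 1.0, 2.0, 3.0]:
    log_w = (x ** 2 - (x - shift) ** 2) / 2      # log pi(x) - log mu(x)
    ess[shift] = ess_fraction(np.exp(log_w))
    print(shift, ess[shift])       # decays roughly like exp(-shift²)
```

A two-sigma shift already leaves only a few percent of the nominal sample size; a three-sigma shift leaves effectively a handful of samples, which is exactly when rankings start flipping between seeds.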
The Fix: Shared Calibrators and Explicit Bounds
The solution isn't "build a better reward model" or "run more evals." It's to treat the reward model, calibrator, and evaluator as the same statistical object and enforce consistency across stages.
1. Use the Same Calibrated Surrogate Across Stages
Your reward model r_θ in RLHF should be the same calibrator you use in evaluation.
Why this works: If both RL and eval use the same r_θ, they're optimizing and measuring the same estimator of Y*. The cross-stage mismatch V⁽¹⁾ - V⁽²⁾ vanishes by construction.
What this doesn't fix: Sharing the calibrator collapses V⁽¹⁾ and V⁽²⁾ into the same object. It removes cross-stage mismatch but does nothing about spec error (V - V⁽¹⁾). If your shared calibrator is a biased proxy for Y* (sycophancy, gaming, misspecification), you've just baked that bias into both training and eval. That's what SDP and spec-level audits are for.
2. Define One Y* Specification (SDP)
Stop using vague target specs like "helpful" or "aligned." Define a Standard Deliberation Protocol (SDP): a structured schema for what counts as welfare.
An SDP specifies:
- Evidence requirements: what kinds of support must be present (citations, reasoning steps, uncertainty quantification)
- Trace structure: required reasoning steps, decision points
- Forbidden behaviors: outputs that are zero-reward by definition
Why this works: SDP acts as a "type system" for welfare. It makes certain classes of misalignment structurally inexpressible as high-welfare traces.
3. Explicit Bounds on Calibration Domain
Your calibrator r_θ is only valid in the region where it was trained. Outside that region, you're in undefined territory.
Track and enforce:
- Target-Typicality Coverage (TTC): fraction of logged samples that fall in target-typical regions
- Effective Sample Size (ESS): how many effective samples you have after importance weighting
- Covariate Shift: how far π's induced distribution has moved from the calibration set
Why this works: You stop pretending your evaluator works everywhere. You quantify where it's valid and either constrain RLHF to stay there or explicitly recalibrate when π moves.
Summary: One Estimator, Three Regimes
Current practice treats pretraining, RLHF, and evaluation as separate systems with separate objectives: different reward models, different eval metrics, no shared notion of welfare. The result is predictable, measurable misalignment.
The fix is to recognize they're three regimes of the same statistical problem:
estimating and optimizing V(π) = E_{X, A ~ π(·|X)}[Y*(X, A)], each regime under its own observability and control constraints.
And to enforce consistency:
- Same calibrator (reward model = evaluator)
- Same Y* spec (SDP across stages)
- Explicit domain bounds (TTC, ESS, coverage diagnostics)
When you do this, the error decomposes cleanly:
V(π) - V̂(π) = [V(π) - V⁽¹⁾(π)] + [V⁽¹⁾(π) - V⁽²⁾(π)] + [V⁽²⁾(π) - V̂(π)]
(spec error) + (cross-stage mismatch) + (estimation error)
Each term is measurable, fixable, and orthogonal to the others.
That's the Welfare Compiler: a pipeline where pretraining provides the substrate, SDP defines the type system, CJE builds the calibrator, RLHF optimizes it, and evaluation measures the same objective—under explicit, enforced consistency constraints.
Not a grand unified theory. Just one welfare functional, estimated consistently across three regimes.
Practical Playbook: What to Do This Quarter
If you run a training stack today, here's a concrete checklist for reducing cross-stage misalignment:
Unify reward and eval calibrators
Make the reward model an explicit, versioned artifact. Use that same calibrator (or a trivially transformed version) as the baseline model in DR/DM evals. Track which calibrator version was used for which training run.
Adopt a minimal SDP for one domain
Start with a first-pass SDP for one domain (e.g., factual QA) that spells out evidence requirements and forbidden behaviors. Use it in both labeling instructions and judge prompts. Expand coverage incrementally.
Instrument coverage diagnostics
Add TTC (Target-Typicality Coverage) and ESS (Effective Sample Size) metrics to your eval pipeline. Refuse to report point estimates without them. Set thresholds for "this estimate is unreliable."
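One way to enforce "refuse to report point estimates without diagnostics" is a gate in the eval pipeline; the function name and floor values below are illustrative, not a prescribed API.

```python
import numpy as np

# Hypothetical thresholds; tune per pipeline.
ESS_FLOOR = 0.10
TTC_FLOOR = 0.80

def gated_estimate(scores, weights, ttc):
    """Report a point estimate only when coverage diagnostics pass;
    otherwise return the diagnostics with no number to quote."""
    w = np.asarray(weights, float)
    ess = float(w.sum() ** 2 / (len(w) * (w ** 2).sum()))   # ESS fraction
    if ess < ESS_FLOOR or ttc < TTC_FLOOR:
        return {"estimate": None, "ess": ess, "ttc": ttc,
                "status": "unreliable"}
    return {"estimate": float(np.average(scores, weights=w)),
            "ess": ess, "ttc": ttc, "status": "ok"}

scores = np.full(100, 0.7)
ok = gated_estimate(scores, np.ones(100), ttc=0.95)          # healthy weights
bad = gated_estimate(scores, np.r_[100.0, np.full(99, 0.01)], ttc=0.95)
low_cov = gated_estimate(scores, np.ones(100), ttc=0.5)      # TTC too low
print(ok["status"], bad["status"], low_cov["status"])   # ok unreliable unreliable
```

The point is organizational as much as statistical: a dashboard that can render "unreliable" instead of a number changes what gets quoted in launch reviews.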
Close the organizational loop
The RM team (owning V⁽¹⁾) and Eval team (owning V⁽²⁾) are often separate orgs with different tools and definitions. The "shared calibrator" fix isn't just a code change—it requires coordination. Put someone in charge of the handoff.
Cite this work
Landesberg, Eddie and CIMO Labs (2025). The Welfare Compiler: One Estimator, Three Regimes. CIMO Labs. https://cimolabs.com/research/welfare-compiler
