CIMO Labs

The Welfare Compiler: One Estimator, Three Regimes

Eddie Landesberg · 32 min read

Your reward model and your evaluator are measuring different things. Not "slightly different"—fundamentally different. When the policy moves to maximize the first, you measure it against the second, and then you're surprised when the numbers don't match.

TL;DR by role

  • RLHF engineer: Your reward model and eval suite are measuring different things. That's why your evals don't reflect training gains.
  • Eval engineer: If you're not using the same calibrator as training, you're measuring a different functional. Your numbers are real; they're just answering a different question.
  • Product owner: The "alignment tax" you're paying is partly a consistency tax. Unifying the pipeline can recover some of it.

Core Thesis

Pretraining, RLHF, and evaluation are three different regimes of estimating and optimizing the same underlying welfare functional. They just use different surrogates, under different constraints, with different control variables. When those surrogates point at different targets, you get systematic misalignment—not as a philosophical puzzle, but as a decomposable statistical error.

One Functional, Three Regimes

Start with a simple object: the welfare functional for a policy π:

V(\pi) = \mathbb{E}_{X \sim P(X)} \left[ Y^*(X, A) \right], \quad A \sim \pi(\cdot | X)

Where:

  • X: contexts (prompts, tasks, user states)
  • A: actions (model outputs, responses)
  • Y*: true welfare—what you actually care about (truth, helpfulness, alignment with user intent, long-term value)
  • π: your policy (LLM)

This is the thing you want to maximize. But you never observe Y* directly. It's too expensive (human deliberation, expert audits, long-term outcomes) or too abstract (idealized truth, counterfactual user satisfaction).

Notation (used consistently throughout)

  • Y*: ideal welfare — the thing you actually care about (unobserved)
  • Y: operational welfare — expensive labels approximating Y* (expert audits, SDP-produced ratings)
  • S: cheap surrogates — LLM judge scores, clicks, thumbs up, log-probs

The calibrator maps S → E[Y | S, X] using (S, Y) pairs. Y is your best observable proxy for Y*.

Every stage of your pipeline is trying to estimate or optimize V(π) using these surrogates. The difference is what you can observe and what you can control:

| Stage | Observable | Control | Goal |
| --- | --- | --- | --- |
| Pretraining | (X, text) from internet | Model parameters θ | Learn representations, base policy π₀ |
| RLHF | (X, S, Y) for sampled actions | Policy π via gradient updates | Maximize E[Y* ∣ π] using learned reward model |
| Evaluation | (X, S, Y) from logs or fresh draws | Policy selection only | Estimate V(π) for multiple π, choose best |

They're three faces of the same statistical decision problem. And when they disagree about which surrogate approximates Y*, you get multi-stage Goodhart's Law.

The Compiler Analogy

Think of your AI pipeline as a compiler that transforms specifications into behavior. The current stack is a broken compiler: it optimizes for side-effects (proxy metrics) rather than semantics (actual welfare).

  • Pretraining → Standard library / untyped substrate (defines what behaviors are expressible)
  • SDP → Type system for welfare traces (makes certain misalignments structurally inexpressible)
  • Calibrator → Compiler passes over the same IR (reward model = eval model = same semantic target)
  • CJE / OPE → Linting, tests, and runtime contracts (checking the same semantics at different stages)

The goal: type-safe AI development where the "compiled" policy provably respects the welfare specification.

The Error Decomposition

Here's where the misalignment actually comes from. Your evaluation estimate of V(π) has three sources of error:

V(\pi) - \hat{V}_{\text{eval}}(\pi) = \underbrace{[V(\pi) - V^{(1)}(\pi)]}_{\text{spec error}} + \underbrace{[V^{(1)}(\pi) - V^{(2)}(\pi)]}_{\text{cross-stage mismatch}} + \underbrace{[V^{(2)}(\pi) - \hat{V}(\pi)]}_{\text{estimation error}}

What V⁽¹⁾ and V⁽²⁾ actually are

  • V(π): True welfare under Y*. The thing you actually care about. Typically unobserved.
  • V⁽¹⁾(π): The welfare functional your RLHF reward model is optimizing. This is E[r_θ(X, A) | π] where r_θ was trained on some (S, Y) dataset with some calibration procedure. It's what the RL algorithm thinks it's maximizing.
  • V⁽²⁾(π): The welfare functional your evaluation suite measures. This might use a different judge, different calibrator, or different Y labels than RLHF used.
  • V̂(π): Your finite-sample estimate of V⁽²⁾(π), subject to variance, bias, and coverage limitations.

Key insight: If V⁽¹⁾ ≠ V⁽²⁾ (different calibrators, different Y labels, different judges), then RLHF and eval are measuring different things by construction. This isn't a bug—it's a design choice that most teams make accidentally.
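To see how the three error terms compose, here is a toy numeric check. Every value below is invented for illustration; the point is only that the terms telescope back to the total gap.

```python
# Illustrative (made-up) values for one policy pi, on a 0-1 welfare scale.
V_true = 0.80   # V(pi): true welfare under Y* (normally unobserved)
V1     = 0.90   # V^(1)(pi): the functional the RLHF reward model optimized
V2     = 0.70   # V^(2)(pi): the functional the eval suite measures
V_hat  = 0.65   # finite-sample estimate of V^(2)(pi)

spec_error       = V_true - V1   # reward model overestimates welfare here
cross_stage_gap  = V1 - V2       # training and eval target different things
estimation_error = V2 - V_hat    # finite-sample / off-policy error

# The three terms telescope: their sum is exactly V(pi) - V_hat(pi).
total_gap = V_true - V_hat
assert abs((spec_error + cross_stage_gap + estimation_error) - total_gap) < 1e-12
```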

1. Spec Error: V(π) - V⁽¹⁾(π)

This is the classic alignment problem: your reward model doesn't actually measure what you care about.

Example: Sycophancy

You care about truthfulness (Y*), but your reward model is trained on "user thought this answer was helpful" (Y⁽¹⁾). Sycophancy—telling users what they want to hear—scores high on Y⁽¹⁾ but low on Y*.

This is the Surrogate Paradox: high correlation between S and Y* at evaluation time doesn't mean optimizing S will increase Y*. You need causal mediation—the action must affect Y* through S, not just correlate with it.

2. Cross-Stage Mismatch: V⁽¹⁾(π) - V⁽²⁾(π)

Even if your reward model and your evaluator are both decent proxies for Y*, if they measure different aspects of welfare, RLHF will optimize for one while you measure the other.

Example: Helpfulness vs Business KPIs

RLHF reward model trained on "helpfulness" (detailed answers, citations, formatting). Evaluation suite measures "business KPI" (user retention, task completion, cost per query).

You spend compute optimizing helpfulness. Your evals show retention didn't improve—or got worse because responses are too verbose. The policy is doing exactly what you told it to; you just told it the wrong thing.

3. Estimation Error: V⁽²⁾(π) - V̂(π)

Even if RLHF and eval agree on the target, your empirical estimate V̂(π) has variance and bias from:

  • Finite labeled data
  • Off-policy correction (if you're evaluating π using logs from a different policy μ)
  • Calibration error in your judge/surrogate
  • Coverage gaps (your logging policy never saw the regions where π operates after RLHF)

This is the domain of Causal Judge Evaluation (CJE): turning cheap surrogates into unbiased, low-variance estimators of V(π) with honest uncertainty quantification.

What Each Stage Actually Does

Pretraining: Learning the Support and Representation

Pretraining doesn't optimize Y*. It doesn't even know what Y* is. What it does:

1. Defines the support of behaviors

The pretrained model π₀ determines which regions of (X, A) space are reachable. It defines the "standard library" of behaviors your later stages can compose. If pretraining is on web text, certain reasoning patterns, linguistic styles, and knowledge domains have high density. Others are nearly unreachable without massive RL compute.

2. Learns representations that reduce variance

Pretraining learns features φ(X, A) that later stages use to approximate Y*:

r_\theta(X, A) = g_\theta(\phi(X, A)) \approx \mathbb{E}[Y^* | X, A]

Good pretraining → lower sample complexity for learning the Y* mapping during reward modeling and calibration.

3. Provides a prior over policies

The KL penalty in RLHF literally uses π₀ as a prior:

\max_\pi \ \mathbb{E}[r_\theta(X, A)] - \lambda \, D_{\text{KL}}(\pi \| \pi_0)

This is Bayesian policy optimization: start with π₀, update toward policies with higher expected reward, but don't move too far.
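As a sanity check on this objective, here is a minimal numerical sketch (toy 5-action distributions, all names illustrative) showing that shaping each sampled reward by the log-ratio recovers the KL-penalized objective in expectation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 5 actions, with log-probs under the current policy pi
# and the pretrained prior pi_0.
logits_pi = rng.normal(size=5)
logits_pi0 = rng.normal(size=5)
log_pi = logits_pi - np.logaddexp.reduce(logits_pi)
log_pi0 = logits_pi0 - np.logaddexp.reduce(logits_pi0)
pi = np.exp(log_pi)

r_theta = rng.uniform(size=5)  # calibrated reward per action (illustrative)
lam = 0.1                      # KL penalty strength

# Exact objective for this context: E_pi[r_theta] - lam * KL(pi || pi_0)
objective = pi @ r_theta - lam * np.sum(pi * (log_pi - log_pi0))

# Per-sample form: shape each sampled reward by the log-ratio. Averaging
# over draws from pi gives a Monte Carlo estimate of the same objective.
a = rng.choice(5, size=10_000, p=pi)
shaped = r_theta[a] - lam * (log_pi[a] - log_pi0[a])
assert abs(shaped.mean() - objective) < 0.05
```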

Bottom line: Pretraining sets the geometry, support, and inductive biases for everything downstream. It's not optimizing welfare, but it constrains how well you can optimize welfare later.

RLHF: On-Policy Optimization of an Estimator

RLHF is direct method estimation of V(π) used inside a policy gradient loop.

1. Reward model training = CJE calibration

You collect a slice of (X, A, S, Y) data where S is cheap judge scores and Y is expensive oracle labels. Then you fit:

r_\theta(X, A) = f_\theta(S, X) \approx \mathbb{E}[Y^* | S, X]

This is AutoCal-R (Automatic Calibration for Rewards): isotonic regression or parametric calibration mapping surrogates to outcomes.
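AutoCal-R itself is not reproduced here, but the isotonic step can be sketched with a plain pool-adjacent-violators fit. This is a simplified stand-in for the calibration step, not the full procedure:

```python
def isotonic_fit(s, y):
    """Pool-adjacent-violators: monotone least-squares fit of y on s.

    Returns a calibrated value for each training point. A stand-in for
    the isotonic step of a surrogate calibrator, nothing more.
    """
    order = sorted(range(len(s)), key=lambda i: s[i])
    # Maintain blocks of (mean, weight); merge adjacent blocks whenever
    # they violate monotonicity.
    blocks = []
    for i in order:
        blocks.append([y[i], 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    fitted = []
    for mean, weight in blocks:
        fitted.extend([mean] * weight)
    # Map fitted values back to the original ordering of the inputs.
    out = [0.0] * len(s)
    for rank, i in enumerate(order):
        out[i] = fitted[rank]
    return out

# Noisy-but-roughly-monotone judge scores S vs oracle labels Y
S = [0.1, 0.4, 0.35, 0.8, 0.9]
Y = [0.0, 0.3, 0.5, 0.7, 1.0]
cal = isotonic_fit(S, Y)

# Calibrated values are non-decreasing in S by construction.
pairs = sorted(zip(S, cal))
assert all(pairs[i][1] <= pairs[i + 1][1] for i in range(len(pairs) - 1))
```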

2. RLHF = maximizing the calibrated surrogate

Once you have r_θ, RLHF does:

\max_\pi \ \mathbb{E}_{X \sim P, A \sim \pi}[r_\theta(X, A)]

This is the Direct Method for policy evaluation: generate fresh samples from π, score them with r_θ, average. Except here, you're using it to optimize, not just estimate.

The pathology

RLHF treats r_θ as if it's the ground truth, not an estimator with error bars. As π changes, the distribution of (X, A) changes, and r_θ's calibration may break. You have no runtime feedback loop to detect this until you evaluate with Y and see the divergence.

Evaluation: Off-Policy Estimation of V(π)

Evaluation is the same statistical problem as RLHF's objective—estimate V(π)—but with π frozen and using logged data instead of fresh samples.

You have a target policy π, a logging policy μ, and surrogates S with optional oracle labels Y, and you want to estimate:

V(\pi) = \mathbb{E}_{X, A \sim \pi}[Y^*(X, A)]

using data from μ, not π. Three regimes:

Direct Method (DM)

Generate fresh samples from π, score them with a calibrated judge:

\hat{V}_{\text{DM}}(\pi) = \frac{1}{n} \sum_{i=1}^n r_\theta(X_i, A_i), \quad A_i \sim \pi(\cdot | X_i)

Inverse Propensity Scoring (IPS)

Reweight logged data by importance weights:

\hat{V}_{\text{IPS}}(\pi) = \frac{1}{n} \sum_{i=1}^n w_i \cdot Y_i, \quad w_i = \frac{\pi(A_i | X_i)}{\mu(A_i | X_i)}

Doubly Robust (DR)

Combine a baseline model with importance-weighted corrections:

\hat{V}_{\text{DR}}(\pi) = \frac{1}{n} \sum_{i=1}^n \left[ \hat{Q}(X_i, \pi) + w_i (Y_i - \hat{Q}(X_i, A_i)) \right]

All three use the same calibrated surrogate machinery as RLHF's reward model. The difference is control: you're estimating, not optimizing.
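On synthetic data with known propensities, the three estimators can be sketched in a few lines. This is a toy setup (discrete actions, a deliberately noisy baseline model), not production OPE code:

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 20_000, 4  # logged samples, discrete action space

# True outcome model (hidden from the estimators except via Q_hat)
q_true = np.array([0.2, 0.5, 0.6, 0.9])

mu = np.array([0.4, 0.3, 0.2, 0.1])   # logging policy
pi = np.array([0.1, 0.2, 0.3, 0.4])   # target policy

A = rng.choice(K, size=n, p=mu)        # logged actions drawn from mu
Y = q_true[A] + rng.normal(0, 0.1, n)  # noisy logged outcomes

Q_hat = q_true + rng.normal(0, 0.03, K)  # slightly misspecified baseline
w = pi[A] / mu[A]                        # importance weights

V_true = pi @ q_true  # ground truth, = 0.66 for these numbers

V_dm  = pi @ Q_hat                                 # Direct Method
V_ips = np.mean(w * Y)                             # IPS
V_dr  = pi @ Q_hat + np.mean(w * (Y - Q_hat[A]))   # Doubly Robust

for est in (V_dm, V_ips, V_dr):
    assert abs(est - V_true) < 0.1
```

Note the pattern: DM trusts the baseline model entirely, IPS trusts only the weights, and DR uses the weights to correct the baseline model's residual error.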

The failure mode

If your evaluator uses a different judge, different calibration, or different Y than RLHF's reward model, you're measuring V⁽²⁾(π) while RLHF optimized V⁽¹⁾(π). Even if both are good proxies for Y*, the cross-stage mismatch means your eval doesn't measure what RL did.

Three Concrete Failure Modes

Let's make this practical. Here are three ways the pipeline breaks when stages disagree:

Failure Mode 1: Sycophancy (Spec Error)

True welfare Y*: truthful, factually accurate answers

RLHF reward model: "user rated this answer as helpful"

Evaluator: fact-checking against ground truth

Users often rate confident, detailed answers as "helpful" even when they're wrong. The reward model learns to optimize for user approval. The model learns to sound confident (high r_θ, low Y*), agree with user priors, and provide detailed but unverified claims.

Your evaluator runs fact-checking and finds: accuracy went down after RLHF.

Error term: V(π) - V⁽¹⁾(π) is large. The reward model was never measuring truthfulness; it was measuring user approval.

Failure Mode 2: Cross-Stage Mismatch

RLHF reward model: "helpfulness" (detailed explanations, citations, formatting)

Evaluator: task success rate and user retention

RLHF optimizes for detailed, well-formatted answers. The model learns to always provide long explanations and never give short answers.

Your evaluator measures task completion and finds:

  • Task success down: users wanted quick answers for simple queries, got walls of text
  • Retention down: slow responses, cognitive load increased
  • Cost up: longer responses = more tokens = higher inference cost

Error term: V⁽¹⁾(π) - V⁽²⁾(π) is large. Both are proxies for "good responses," but they point in different directions.

Failure Mode 3: Coverage-Limited Efficiency (Estimation Error)

Setup: RLHF pushes π away from the pretrained prior π₀

Problem: Evaluator uses logged data from π₀ with off-policy correction

After RLHF, π operates in regions where μ (the logging policy) had low density. Importance weights explode. A few samples dominate the estimate. Effective sample size (ESS) collapses.

Your evaluator reports:

  • High variance: ±30% confidence intervals on policy ranking
  • Unstable rankings: π₁ > π₂ in one seed, π₂ > π₁ in another
  • Broken diagnostics: coverage warnings, tail-heavy weight distributions

Error term: V⁽²⁾(π) - V̂(π) is large due to poor overlap and distribution shift.

The Fix: Shared Calibrators and Explicit Bounds

The solution isn't "build a better reward model" or "run more evals." It's to treat the reward model, calibrator, and evaluator as the same statistical object and enforce consistency across stages.

1. Use the Same Calibrated Surrogate Across Stages

Your reward model r_θ in RLHF should be the same calibrator you use in evaluation.

```python
# Stage 1: Calibration (CJE)
calibrator = fit_autocal_r(judge_scores=S, oracle_labels=Y)

# Stage 2: RLHF
reward_model = calibrator  # Same object!
policy = run_rlhf(policy=pi_0, reward_fn=reward_model)

# Stage 3: Evaluation
estimate = doubly_robust_estimator(
    policy=policy,
    baseline_model=calibrator,  # Same object again!
    logged_data=data,
)
```

Why this works: If both RL and eval use the same r_θ, they're optimizing and measuring the same estimator of Y*. The cross-stage mismatch V⁽¹⁾ - V⁽²⁾ vanishes by construction.

What this doesn't fix: Sharing the calibrator collapses V⁽¹⁾ and V⁽²⁾ into the same object. It removes cross-stage mismatch but does nothing about spec error (V - V⁽¹⁾). If your shared calibrator is a biased proxy for Y* (sycophancy, gaming, misspecification), you've just baked that bias into both training and eval. That's what SDP and spec-level audits are for.

2. Define One Y* Specification (SDP)

Stop using vague target specs like "helpful" or "aligned." Define a Standard Deliberation Protocol (SDP): a structured schema for what counts as welfare.

An SDP specifies:

  • Evidence requirements: what kinds of support must be present (citations, reasoning steps, uncertainty quantification)
  • Trace structure: required reasoning steps, decision points
  • Forbidden behaviors: outputs that are zero-reward by definition

Why this works: SDP acts as a "type system" for welfare. It makes certain classes of misalignment structurally inexpressible as high-welfare traces.
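One way to make this concrete is to express the SDP as plain data that traces are checked against. Every field and function name below is invented for illustration; a real SDP would be richer:

```python
# A hypothetical SDP for factual QA, expressed as plain data.
# All names here are illustrative, not a published spec.
SDP_FACTUAL_QA = {
    "evidence_requirements": {
        "citations_required": True,          # claims must cite a source
        "uncertainty_quantification": True,  # claims must carry confidence
    },
    "trace_structure": [
        "restate_question",
        "gather_evidence",
        "reason_over_evidence",
        "answer_with_confidence",
    ],
    "forbidden_behaviors": [
        "unsupported_factual_claim",  # zero reward by definition
        "fabricated_citation",
    ],
}

def welfare_is_expressible(trace):
    """A trace can receive nonzero welfare only if it type-checks against the SDP."""
    steps_ok = all(step in trace["steps"]
                   for step in SDP_FACTUAL_QA["trace_structure"])
    no_forbidden = not (set(trace["flags"])
                        & set(SDP_FACTUAL_QA["forbidden_behaviors"]))
    return steps_ok and no_forbidden

good_trace = {
    "steps": ["restate_question", "gather_evidence",
              "reason_over_evidence", "answer_with_confidence"],
    "flags": [],
}
bad_trace = {"steps": good_trace["steps"], "flags": ["fabricated_citation"]}

assert welfare_is_expressible(good_trace)
assert not welfare_is_expressible(bad_trace)
```

The "type system" framing is literal here: a flagged forbidden behavior cannot be a high-welfare trace no matter what the judge score says.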

3. Explicit Bounds on Calibration Domain

Your calibrator r_θ is only valid in the region where it was trained. Outside that region, you're in undefined territory.

Track and enforce:

  • Target-Typicality Coverage (TTC): fraction of logged samples that fall in target-typical regions
  • Effective Sample Size (ESS): how many effective samples you have after importance weighting
  • Covariate Shift: how far π's induced distribution has moved from the calibration set

Why this works: You stop pretending your evaluator works everywhere. You quantify where it's valid and either constrain RLHF to stay there or explicitly recalibrate when π moves.
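ESS, at least, is cheap to instrument. A sketch using the Kish effective sample size (the TTC metric is not reproduced here; the weight distributions are synthetic):

```python
import numpy as np

def effective_sample_size(w):
    """Kish ESS: (sum w)^2 / sum w^2. Equals n iff all weights are equal."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

rng = np.random.default_rng(2)
n = 1000

# Good overlap: importance weights near 1, so ESS stays near n.
w_good = rng.uniform(0.8, 1.2, n)

# Poor overlap: a handful of huge weights dominate, and ESS collapses.
w_bad = np.ones(n)
w_bad[:5] = 200.0

ess_good = effective_sample_size(w_good)
ess_bad = effective_sample_size(w_bad)
assert ess_good > 0.9 * n   # healthy: most samples contribute
assert ess_bad < 0.1 * n    # broken: refuse to report a point estimate
```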

Summary: One Estimator, Three Regimes

Current practice treats pretraining, RLHF, and evaluation as separate systems with separate objectives: different reward models, different eval metrics, no shared notion of welfare, and predictable, measurable misalignment.

The fix is to recognize they're three regimes of the same statistical problem:

V(\pi) = \mathbb{E}[Y^* | \pi]

And to enforce consistency:

  • Same calibrator (reward model = evaluator)
  • Same Y* spec (SDP across stages)
  • Explicit domain bounds (TTC, ESS, coverage diagnostics)

When you do this, the error decomposes cleanly:

V(\pi) - \hat{V}(\pi) = \underbrace{[V - V^{(1)}]}_{\text{fix: better SDP}} + \underbrace{[V^{(1)} - V^{(2)}]}_{\text{fix: shared calibrator}} + \underbrace{[V^{(2)} - \hat{V}]}_{\text{fix: CJE + diagnostics}}

Each term is measurable, fixable, and orthogonal to the others.

That's the Welfare Compiler: a pipeline where pretraining provides the substrate, SDP defines the type system, CJE builds the calibrator, RLHF optimizes it, and evaluation measures the same objective—under explicit, enforced consistency constraints.

Not a grand unified theory. Just one welfare functional, estimated consistently across three regimes.

Practical Playbook: What to Do This Quarter

If you run a training stack today, here's a concrete checklist for reducing cross-stage misalignment:

1. Unify reward and eval calibrators

Make the reward model an explicit, versioned artifact. Use that same calibrator (or a trivially transformed version) as the baseline model in DR/DM evals. Track which calibrator version was used for which training run.

2. Adopt a minimal SDP for one domain

Start with a first-pass SDP for one domain (e.g., factual QA) that spells out evidence requirements and forbidden behaviors. Use it in both labeling instructions and judge prompts. Expand coverage incrementally.

3. Instrument coverage diagnostics

Add TTC (Target-Typicality Coverage) and ESS (Effective Sample Size) metrics to your eval pipeline. Refuse to report point estimates without them. Set thresholds for "this estimate is unreliable."

4. Close the organizational loop

The RM team (owning V⁽¹⁾) and Eval team (owning V⁽²⁾) are often separate orgs with different tools and definitions. The "shared calibrator" fix isn't just a code change—it requires coordination. Put someone in charge of the handoff.

Cite this work

Landesberg, Eddie and CIMO Labs (2025). The Welfare Compiler: One Estimator, Three Regimes. CIMO Labs. https://cimolabs.com/research/welfare-compiler
