CIMO Labs

Programmable Proxies: Why LLM-Judges Beat Clicks—and How to Use Them Responsibly

Eddie Landesberg · 16 min read

We optimized for clicks; customers still churned. We needed a metric we could shape, not just watch. Here's why LLM-judges are different—and how to use them without getting fooled.

TL;DR. LLM judges are programmable proxies you can steer via rubric, calibrate to outcomes that matter, and update when wrong. Unlike clicks or engagement (which you can't reprogram), judges give you a measurement channel you can continuously refine—turning static metrics into adaptive measurement systems.

The key difference:

Unlike clicks or engagement metrics, LLM judges are repairable instruments. When they drift or fail, you update the rubric, recalibrate to ground truth, and verify the fix with residuals. Legacy proxies give you no such lever—when clicks stop predicting value, you're stuck finding a new metric entirely.

One picture: S → Y → Y*

Think of quality as a ladder:

  • Y*: what you'd decide with time, context, and care (the ideal—but unobservable).
  • Y: a high-rung outcome you can observe on a small slice (expert audit, task success, 30–90-day retention).
  • S: cheap, abundant signals (LLM-judge scores, short human labels, engagement).

Your goal is to aim abundant S at Y* without paying Y*'s cost. You do it by learning how S predicts Y on a small gold slice, then using that mapping at scale.

Programmable proxies let you improve S itself—you can teach the metric to see what matters by editing the rubric, then re-testing and re-calibrating.

Why clicks can't be fixed

Cheap metrics fail in three ways:

  1. Wrong or no calibration. You maximized S (politeness, clicks, raw judge score) without checking whether high S raises Y (productivity, safety, retention).
  2. Wrong population. You calibrated S→Y on the wrong slice (easy tasks, different users), so it doesn't transport to your real distribution.
  3. Temporal drift. The S→Y relationship changed. What once predicted value no longer does.

Traditional proxies—clicks, sign-ups, dwell time—are exogenous. A click is a click. When these signals drift or get gamed, you can't change what they measure. You're stuck swapping to a different metric or directly measuring expensive outcomes.

Viewer-satisfaction surveys—the kind YouTube uses—are slow, expensive, and not available in every domain. LLM-judges are available wherever you have text, and they offer the same programmability at scale and in real time.

Real-world examples: YouTube and Netflix

YouTube originally optimized for watch time, then learned that minutes alone missed whether viewers actually valued what they watched. They introduced viewer-satisfaction surveys and trained models to predict "valued watch time" at scale—moving from a fixed signal (minutes) to a programmable measurement channel they could rewrite and re-calibrate.

Netflix optimizes for long-term retention, but it's delayed and noisy. Netflix openly describes designing and re-learning proxies from historical experiments so day-to-day decisions still point at long-term member satisfaction. Same loop: author → calibrate → monitor → update.

Programmable proxies give you levers for all three failure modes: calibrate S→Y on a gold slice to stop optimizing vibes, transport-test residuals on new cohorts, and monitor drift with small monthly slices, re-calibrating when residuals shift.

Defining programmable proxies

Quick definition

  • Is: An LLM-judge + rubric you control. You write what to reward (accuracy, concision, evidence) and what to penalize (hallucinations, verbosity).
  • Isn't: Ground truth. It's a metric you can engineer—calibrated to Y, audited with residuals, and updated when you find failure modes.

Because the judge is programmable, you can add missing signal: require evidence checks, penalize gratuitous length, reward correct abstentions. Then you measure whether the error shrinks.

| Aspect | Exogenous (clicks, watch time) | Programmable (LLM-judge) |
| --- | --- | --- |
| Who defines "good"? | Implicit behavior | Your rubric |
| Can you change it? | No | Yes—edit & re-calibrate |
| Gaming risk | High, unfixable | Detectable & fixable |

Why cheaper models can judge expensive ones

One counterintuitive property: a smaller, cheaper model can often reliably judge a larger model's output. The reason is a complexity gap between generation and evaluation.

Producing a good answer from scratch requires searching a massive space of possibilities, managing context, and synthesizing information. Evaluating whether a candidate answer meets a rubric is a simpler task: you're scoring a fixed input against explicit criteria (accuracy, concision, citation quality). That asymmetry means you can get reliable judgment from a smaller model or fewer tokens, then spend heavy compute only where the judge is uncertain.

This is why GPT-4 can judge GPT-5 outputs, or why a task-specific fine-tuned model can evaluate frontier-model responses—judging is cheaper than doing.

Judges are programs, not just prompts

The real power isn't only that judges are endogenous (you can rewrite the measurement). It's that they're composable instruments. A well-designed judge can:

  • Normalize style: Strip formatting, reorder content, or canonicalize before scoring to prevent style gaming
  • Call tools: Verify citations exist, run unit tests on generated code, check schemas for structured data
  • Produce structured evidence: Return not just a score, but the specific claims checked, sources verified, or tests run
  • Version and audit: Log rubric version, model config, and reasoning for every verdict

This makes judges closer to measurement instruments than metrics. You're not just prompting a model for a number—you're building a reusable, auditable evaluation program that can evolve with your needs.

The loop: steer → calibrate → inspect → update → monitor

Using programmable proxies responsibly is a feedback loop where you progressively tighten the judge's alignment with real outcomes.

The judge improvement loop: steer, calibrate, inspect residuals, update rubric, re-calibrate, deploy with guardrails
  1. Steer: Write the rubric. Spell out "good" in your domain: reward accuracy and minimal necessary length; penalize unsupported claims; prefer cite/quote over vague paraphrase; allow abstain when key info is missing.
  2. Calibrate S→Y on a gold slice. Collect ~100–500 Y labels (expert audit, task success). Score everything with the judge (S). Fit a lightweight calibrator (isotonic or two-stage with covariates like response length) so S predicts Y on the gold set.
  3. Inspect residuals: Y − Ŷ. Slice by response length, domain, difficulty, model family. Large negative residuals = the judge over-predicts. That's where gaming hides (verbosity, confident-but-wrong, style mimicry).
  4. Update the judge. Patch the rubric where residuals are worst. Examples: "penalize boilerplate padding," "verify citations exist," "prefer minimal necessary complexity," "abstain if missing key inputs."
  5. Re-calibrate & verify. Re-score the gold slice, re-fit the calibrator, and check that residuals shrink—especially in the slices that failed. Repeat steps 3–5 until residuals stabilize.
  6. Monitor drift & transport. Monthly, pull a small batch from production, compute residuals, and alert if the mean residual leaves its confidence band. For new cohorts/policies, run a transport test: if residual means differ materially, add the cohort as a feature or fit a local calibrator.

What is "calibration"?

Learning how your cheap metric (S, the judge score) predicts your expensive ground truth (Y, expert labels or real outcomes). You collect (S, Y) pairs, fit a function Ŷ = f(S), then use that function to map abundant S data into predictions of Y. This gives you less bias (Y is closer to what you care about than raw S) and less variance (you leverage thousands of S labels instead of just hundreds of Y labels).
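This fitting step can be sketched in a few lines. A minimal example, assuming scikit-learn (CJE's AutoCal-R wraps a similar isotonic step with more diagnostics); the (S, Y) pairs below are toy data:

```python
# Minimal sketch: calibrate judge scores (S) to gold labels (Y)
# with isotonic regression, then map new scores to predictions of Y.
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Toy gold slice: judge scores S and expert labels Y, both on a 1-5 scale
S = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
Y = np.array([1.2, 1.8, 2.2, 2.9, 3.1, 3.4, 3.9, 4.2])

# Fit a monotone map Y_hat = f(S); clip scores outside the training range
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(S, Y)

# Use the mapping to turn abundant judge scores into predictions of Y
Y_hat = calibrator.predict(np.array([2.0, 4.8]))
```

In practice you would fit on your ~100–500-label gold slice and apply the mapping to the full scored dataset.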

What are "residuals"?

Prediction errors: Y − Ŷ. If your calibrated judge predicts Ŷ = 4 but the true label is Y = 3, the residual is -1 (you over-predicted). Large residuals tell you where the judge is systematically wrong. By slicing residuals (e.g., by response length or domain), you find patterns in the judge's mistakes. Maybe long answers always have negative residuals (judge over-rates them). You then update the rubric to fix that pattern and verify the residuals shrink.
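That slicing step is straightforward to do by hand. A minimal sketch, assuming pandas; the column names and values are illustrative:

```python
# Slice residuals (Y - Y_hat) by response length to find where the
# judge is systematically wrong.
import pandas as pd

df = pd.DataFrame({
    "residual": [-0.1, 0.2, -0.8, -0.6, 0.1, -0.7],  # Y - Y_hat per item
    "n_tokens": [120, 150, 900, 1100, 200, 950],     # response length
})

# Bucket responses into short vs. long, then compare mean residuals
df["length_bucket"] = pd.cut(
    df["n_tokens"], bins=[0, 400, 2000], labels=["short", "long"]
)
by_bucket = df.groupby("length_bucket", observed=True)["residual"].mean()
print(by_bucket)
```

A strongly negative mean in the "long" bucket is exactly the verbosity pattern described above: the judge over-rates long answers, so you add a length penalty to the rubric and re-check.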

The endogenous advantage

With clicks or LTV, you're stuck with the signal you get. With programmable proxies, you can add signal by making the rubric more specific—evidence, concision, risk handling, abstention policy—then test if residuals shrink on your gold set. If they do, you've improved the proxy. If they don't, you've learned the rubric change didn't help, and you can try something else.

Two cheap guardrails that save you from memes

Must-have monitoring

1) Drift monitor (monthly)

Sample 100–200 new items. Apply your old mapping Ŷ = f(S). Alert if mean residual's CI excludes 0 twice in a row → re-calibrate.

This catches temporal drift before it tanks your product.
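The check itself is a few lines of arithmetic. A stdlib-only sketch using a normal-approximation CI (CJE's drift diagnostics are more thorough); the residual batches are toy data:

```python
# Monthly drift check: flag the batch when the mean residual's
# 95% CI (normal approximation) excludes zero.
import math

def drift_alert(residuals, z=1.96):
    n = len(residuals)
    mean = sum(residuals) / n
    var = sum((r - mean) ** 2 for r in residuals) / (n - 1)
    se = math.sqrt(var / n)
    lo, hi = mean - z * se, mean + z * se
    return not (lo <= 0.0 <= hi)  # True -> CI excludes 0: re-calibrate

# Toy residual batches: one stable near zero, one shifted down
stable = [0.05, -0.1, 0.02, 0.08, -0.04, -0.06] * 20
drifted = [r - 0.3 for r in stable]
```

Per the rule above, you'd only alert after the CI excludes zero two months in a row, which keeps one noisy batch from triggering a re-calibration.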

2) Transport test (new policy/cohort)

Sample 100–200 items. Compare residual mean vs baseline. If CIs don't overlap → don't transport; re-prompt, add cohort as a feature, or fit a local calibrator.

This prevents your metric from lying when you ship to a new population.
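The CI-overlap rule can be sketched the same way. A stdlib-only illustration under a normal-approximation CI; cohort data is toy:

```python
# Transport test: compare residual-mean CIs on the baseline cohort
# vs. a new cohort; disjoint CIs -> don't reuse the calibrator.
import math

def residual_ci(residuals, z=1.96):
    n = len(residuals)
    mean = sum(residuals) / n
    var = sum((r - mean) ** 2 for r in residuals) / (n - 1)
    half = z * math.sqrt(var / n)
    return mean - half, mean + half

def transports(baseline, new_cohort):
    lo_b, hi_b = residual_ci(baseline)
    lo_n, hi_n = residual_ci(new_cohort)
    # Overlapping CIs -> reuse the calibrator; disjoint -> fit a local one
    return not (hi_b < lo_n or hi_n < lo_b)

# Toy cohorts: the shifted cohort's residuals sit ~0.5 lower
baseline = [0.05, -0.05, 0.1, -0.1, 0.02, -0.02] * 25
shifted = [r - 0.5 for r in baseline]
```

When the test fails, the options from above apply: re-prompt, add the cohort as a calibration feature, or fit a cohort-specific calibrator.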

These two checks cost ~$50–200/month in labeling and catch the failure modes that turn "4.8/5 judge score" into "support tickets spiking."

Quickstart: A 7-day rollout

You don't need months to start using LLM-judges responsibly. Here's a minimal viable rollout with approximate time and labeling budgets, using the CJE (Causal Judge Evaluation) package to automate calibration, residual analysis, and diagnostics.

Using CJE

The steps below use the cje-eval package, which implements all the calibration, residual analysis, and diagnostics described in this post.

Week 1: From zero to deployed judge

Budget: 200–300 labeled examples, ~3–4 hours reviewer time, ~$10–20 in API costs

Day 1–2
Draft judge rubric & prepare data.

Sample 200 items from your domain (e.g., support answers, product descriptions, code reviews). Label 100 as your gold set (Y). Keep 100 as adversarial/holdout.

# Install CJE
pip install cje-eval

Format your data as JSONL with prompt, response, judge_score, and oracle_label fields. See data format guide.
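For illustration, one record in that format might look like the following. Field names are the ones listed above; the text and score values are made up, the score scales are assumptions (check the data format guide), and oracle_label would be present only on the gold slice:

```python
# Build and round-trip one illustrative JSONL record.
import json

record = {
    "prompt": "How do I reset my password?",
    "response": "Go to Settings > Security > Reset Password...",
    "judge_score": 0.82,   # S: cheap judge signal (scale assumed)
    "oracle_label": 0.75,  # Y: expert label, gold slice only
}

line = json.dumps(record)   # one line of the .jsonl file
parsed = json.loads(line)
```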

Day 3
Score with judge; fit calibrator; analyze residuals.

Run your judge prompt on all 200 items (S). CJE automatically fits AutoCal-R (isotonic calibration): Ŷ = f(S), computes residuals (Y − Ŷ), and generates diagnostic plots.

# Run CJE analysis (CLI)
python -m cje analyze gold_data.jsonl --estimator direct

# Or in Python:
from cje import analyze_dataset

result = analyze_dataset(
    "gold_data.jsonl",
    estimator="direct",
)

# Inspect residuals by slice
print(result.diagnostics.residuals_by_length)

CJE outputs: calibrated estimates, residual plots sliced by length/domain, reliability curves, and coverage diagnostics.

Day 4
Adversarial tests + bias checks.

Include verbose-but-empty responses, confident-but-wrong answers, style-matched nonsense in your holdout set. Check if the judge fails on known failure modes.

Run CJE on the adversarial holdout and inspect residuals. Large negative residuals = judge over-predicts (gaming vulnerability). Add length/order controls to your rubric.

Day 5
Update prompt; re-score; re-calibrate.

Based on residual patterns from Day 3–4, update the rubric (e.g., penalize verbosity, require evidence). Re-score all items with the new judge prompt. Re-run CJE and compare residuals to baseline.

# After updating judge prompt, re-score and re-analyze
result_v2 = analyze_dataset(
    "gold_data_v2.jsonl",
    estimator="direct",
)

# Compare residual improvement
print(f"Baseline RMSE: {result.diagnostics.rmse:.3f}")
print(f"Updated RMSE: {result_v2.diagnostics.rmse:.3f}")

Goal: Residuals shrink, especially in slices that failed (e.g., long responses). If not, iterate on rubric again.

Day 6
Add drift monitors; define abstain policy.

Set up a monthly schedule to collect new (S, Y) batches. Run CJE on each batch and alert if mean residual drifts outside confidence bands.

CJE's diagnostics include drift detection. Write down when the judge should abstain (e.g., missing key info, ambiguous task) and log abstention rates.

Day 7
Ship as evaluation gate.

Use the calibrated judge for new candidates. Log scores, justifications, and CJE's calibrated estimates. Schedule your first drift check for 30 days out.

# Apply calibrated judge at scale
result = analyze_dataset(
    "production_eval.jsonl",
    estimator="direct",
)
print(f"Policy value: {result.estimate:.3f} [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")

This is a minimal viable loop. You'll refine the rubric multiple times in production. The key is to start with a small, well-labeled gold set and build the monitoring infrastructure from day one.

CJE automates the hard parts

The package handles:

  • AutoCal-R: Isotonic or two-stage calibration (S → Y) with covariate controls
  • Residual analysis: Automatic slicing by length, domain, difficulty with plots
  • Diagnostics: Reliability curves, coverage checks, ESS, tail diagnostics
  • Uncertainty: Oracle-uncertainty-aware (OUA) inference for honest confidence intervals
  • Off-policy modes: IPS and DR with SIMCal weight stabilization

Always log: Judge versioning

Any change to your judge setup forces a re-calibration check on a small gold slice. Log these fields with every score:

judge_model=gpt-4.5-mini
judge_prompt_hash=sha256:ab12…
rubric_version=3.2
calibrator_version=isotonic-v5
abstain_rate=7%

Re-label triggers

Collect new (S, Y) pairs and re-calibrate when:

  • Change in judge model family or temperature
  • Rubric edit touching any hard rule
  • Domain mix shift > 15%
  • Monthly drift check fails (mean residual shift > 0.1 or slice p<0.01 after correction)

Anti-gaming battery (use as a holdout)

LLM-judges inherit known failure modes. The difference is you can test for these patterns and build guardrails.

Holdout test suite

  • Length padding: Correct answer + 30–200% boilerplate → judge shouldn't reward it
  • Confident-but-wrong: Authoritative tone, wrong facts → should be penalized
  • Style mimicry: High-scoring style, wrong content → score should stay low
  • Citation hallucination: Fake URLs/papers → force verify or cap score

Pass condition: Under each attack, the calibrated score should either decrease appropriately or stay stable; as a rule of thumb, keep score movement ≤ 0.05 unless the content truly improved.
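That rule of thumb can be wired into the holdout suite directly. A minimal sketch, where `score` stands in for your calibrated judge and `toy_score` is a hypothetical scorer included only to make the example runnable:

```python
# Flag (original, attacked) pairs where the attack *raised* the
# calibrated score by more than the tolerance above.
TOLERANCE = 0.05

def gaming_failures(pairs, score, tol=TOLERANCE):
    """Return the pairs whose attacked variant gained more than tol."""
    return [
        (orig, attacked)
        for orig, attacked in pairs
        if score(attacked) - score(orig) > tol
    ]

# Hypothetical scorer for illustration: penalizes padding past 50 words
def toy_score(text):
    n = len(text.split())
    return max(0.0, 1.0 - 0.01 * max(0, n - 50))

padded = "fix the config " * 40  # length-padding attack
pairs = [("fix the config", padded)]
failures = gaming_failures(pairs, toy_score)
```

Run this over the whole anti-gaming battery; any non-empty `failures` list points at the rubric patch to write next.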

Calibrated judges won't eliminate reward hacking—but they make it measurable and fixable. When you see a pattern, patch the rubric and re-test.

Testing hygiene

  • Randomize order and hide model identities when comparing responses
  • Control for length: Include length as covariate or penalize verbosity in rubric
  • Cross-model validation: Use multiple judge families or tie-breaker rubric for high-variance items
  • Require rationale (≤300 chars) but don't reward verbosity in the rationale itself
  • Keep a holdout gold slice untouched during calibration to verify generalization
  • Log "abstain" as first-class outcome when evidence is ambiguous

Bias, drift, and transport: be honest, show fixes

LLM-judges have known issues. The key is to acknowledge them upfront, measure them, and show how the loop mitigates them.

Known issue: Verbosity bias

Problem: Judges often reward longer responses even when brevity would be better.

Mitigation: Add response length as a covariate in calibration (two-stage calibrator), or explicitly add to rubric: "Penalize gratuitous length. Reward minimal necessary complexity."
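A minimal sketch of the covariate idea, assuming scikit-learn: residualize the gold labels on length first, then fit an isotonic map from judge score to the length-adjusted label. This illustrates the two-stage pattern, not CJE's exact AutoCal-R internals; all data below is synthetic:

```python
# Two-stage calibration sketch: stage 1 removes the linear length
# effect; stage 2 fits a monotone map from judge score to the
# length-adjusted gold label.
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
length = rng.uniform(50, 1000, n)            # response length (tokens)
S = rng.uniform(1, 5, n)                     # raw judge score
Y = 0.8 * S + 0.0005 * length + rng.normal(0, 0.1, n)  # toy gold labels

# Stage 1: model and remove the length effect on Y
stage1 = LinearRegression().fit(length.reshape(-1, 1), Y)
Y_resid = Y - stage1.predict(length.reshape(-1, 1))

# Stage 2: isotonic map from judge score to the length-adjusted label
stage2 = IsotonicRegression(out_of_bounds="clip").fit(S, Y_resid)

def predict(s, tokens):
    """Predict Y from a judge score and response length."""
    return stage2.predict(np.atleast_1d(s)) + stage1.predict([[tokens]])
```

With length held fixed, the prediction depends only on the (monotone) score map, so a verbose answer can no longer buy a higher calibrated estimate through length alone.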

Known issue: Family favoritism

Problem: A judge may favor responses from its own model family (e.g., GPT-4 rating GPT-4 outputs higher than Claude).

Mitigation: Use cross-family ensemble (average scores from judges in different families), or use a style-blind formatting step before judging.

Known issue: Temporal drift

Problem: The S→Y relationship can change over time as user preferences shift, model behavior changes, or new edge cases emerge.

Mitigation: Automate a monthly audit (e.g., 30 items per domain): collect a small batch of new (S, Y) pairs and check whether mean residuals stay near zero. If they drift beyond error bars or shift by more than 0.1, re-calibrate or update the rubric.

Known issue: Domain shift

Problem: Calibration on one domain (e.g., customer support) may not transport to another (e.g., medical advice).

Mitigation: When launching in a new domain, collect a small (S, Y) slice from that domain. Check if residuals match your baseline. If not, either add domain as a covariate or fit a domain-specific calibrator.

Visual example: Drift detection over time

Drift monitoring showing residuals over time with error bars and re-prompt event

Months 1–3: Residuals stable near zero. Month 4: Drift detected (mean residual < 0). Re-prompted to penalize new gaming pattern. Month 5: Residuals return to baseline.

Case study: Support answers judged for resolution & safety

Context

A customer support team wanted to evaluate AI-generated answers for "task resolution" and "safety" (no hallucinated steps, no unsupported promises). They started with a basic judge prompt: "Rate 1–5 how well this answer solves the user's problem."

Before: Common residuals

  • Large negative residuals in long answers: The judge gave high scores to verbose responses that included irrelevant detail. Human raters penalized these as "not concise."
  • False positives on confident-but-wrong answers: Responses that hallucinated troubleshooting steps but sounded authoritative got high judge scores but low human ratings.

Intervention: Updated rubric

# Added to judge prompt:
- Reward: minimal necessary complexity. Prefer 3 clear steps over 10 verbose ones.
- Penalize: unsupported claims. If the answer references a feature/setting, it must exist.
- Prefer: cite/quote over paraphrase when referencing docs.

After: Residuals shrink

Re-scored the gold set with the updated rubric. Re-fit the calibrator. Residuals in the "long answer" bucket dropped by 40%. False positives on hallucinated steps cut in half.

Residual scatter plot showing large negative residuals clustered in long responses before rubric update

Before rubric update: Large negative residuals (judge over-predicting) cluster in long responses. This pattern flagged verbosity bias.

Residual improvement (illustrative)

Baseline (v1 prompt): Mean residual = -0.31, SD = 0.68
After rubric update (v2): Mean residual = -0.09, SD = 0.52

These numbers are illustrative. Your domain will differ. The pattern—inspect residuals, update rubric, verify improvement—is what generalizes.

Aside — Peer review is a programmable proxy

In science, Y* ("advances the field toward truth") is unobservable at submission time, so communities use peer review as a surrogate. The NeurIPS consistency experiments showed substantial disagreement near the boundary, i.e., a noisy surrogate. The field responded by updating the rubric (reproducibility checklists, artifacts, rigor criteria) and retraining reviewers—exactly the steer → calibrate → inspect → update loop.

Treat metrics as engineered objects, not oracles.

Citation

If you use this work, please cite:

BibTeX

@misc{landesberg2025programmableproxies,
  author = {Landesberg, Eddie},
  title = {Programmable Proxies: Why LLM-Judges Beat Clicks—and How to Use Them Responsibly},
  year = {2025},
  month = {November},
  url = {https://cimolabs.com/blog/programmable-proxies},
  note = {CIMO Labs}
}

Plain Text

Landesberg, E. (2025). Programmable Proxies: Why LLM-Judges Beat Clicks—and How to Use Them Responsibly. CIMO Labs. https://cimolabs.com/blog/programmable-proxies

FAQ for skeptics

"Aren't we just training to the judge?"

Short answer: Only if you skip the adversarial tests and drift monitoring.

Longer answer: Yes, you're optimizing for the judge's score—but the judge is calibrated to real outcomes (Y) on a gold set, and you're checking residuals to ensure the mapping stays tight. Include adversarial tests (verbose-but-empty, confident-but-wrong) and periodically validate with A/B tests or cohort studies. Treat the judge as a measurement upgrade, not the final objective.

"Models judging models will be biased."

Yes—and you can measure and reduce that bias. Use cross-family judges (e.g., average scores from GPT-4 and Claude), add length/style controls, and inspect residuals by model family. When you see a residual pattern (e.g., GPT-4 judge favoring GPT-4 outputs), you can switch to a cross-family ensemble or add style-blind formatting before judging. The key is you're not stuck with the bias—you can test and fix it.

"Why not just track click-through or engagement?"

Because clicks are easy to game and impossible to fix when they drift. Dark patterns, clickbait, and vanity metrics can all inflate engagement without improving value. When clicks stop predicting revenue, you have no way to "fix" what a click measures—you're stuck finding a new proxy. With LLM-judges, you can update the rubric, re-calibrate, and verify the fix with residual analysis. You're playing offense, not defense.

Drop-in assets: Templates you can use today

Judge prompt template (copy-paste ready)

# Judge Prompt Template

**Goal:** Score how well the answer resolves the user's task.

**Scale:** 1 (fails to resolve) to 5 (fully resolves)
- 1: Irrelevant, incorrect, or harmful
- 2: Partially on-topic but missing key info
- 3: Addresses the task but lacks depth or has minor errors
- 4: Good answer with minor room for improvement
- 5: Excellent, complete, accurate

**Reward:**
- Accuracy and specificity
- Minimal necessary length (prefer 3 clear steps over 10 verbose ones)
- Cited evidence when making claims (verify sources exist)
- Clear abstention when key info is missing

**Penalize:**
- Unsupported claims or hallucinated details
- Irrelevant detail or boilerplate padding
- Confident-but-wrong responses
- Overly verbose or repetitive explanations
- Bare URLs or unverified citations

**Policy:**
- If the task is ambiguous or lacks key context, prefer "ask for clarification" over guessing
- If sources are cited, verify they're real (no hallucinated URLs)
- Require verification when sources are cited (e.g., no bare URLs without context)

**Output format:**
{
  "score": <1-5>,
  "rationale": "<=300 chars explaining the score",
  "citations": ["<any sources referenced>"]
}

Residuals playbook

  1. Plot residuals (Y − Ŷ) by slices: response length, domain, difficulty, model family.
  2. Identify the biggest residual pattern: e.g., "long answers have mean residual = -0.4" (judge over-predicts).
  3. Add pattern to rubric: "Penalize gratuitous length. Reward minimal necessary complexity."
  4. Re-score + re-calibrate: Apply updated prompt. Re-fit calibrator. Confirm residual shrinks in that slice.
  5. Lock a holdout: Keep a small gold set untouched. Verify the final mapping generalizes.
  6. Watch for re-emergence (drift): Monthly checks on new (S, Y) batches. If residual pattern returns, repeat loop.

Anti-gaming tests to include

  • Verbosity: Copy a correct answer, pad with boilerplate. Judge should penalize.
  • Confident-but-wrong: Authoritative tone, hallucinated facts. Judge should catch.
  • Style mimics: Copy writing style of high-scoring answers but with wrong content.
  • List-spam: Generate a long bulleted list with only 1–2 relevant items.
  • Citation hallucination: Include fake URLs or paper titles. Judge should flag or penalize.

Try it this week

Start small: 100–200 gold labels (one expert-day if you batch them) + rubric iteration + jackknife uncertainty → honest CI and a programmable lever you can tighten each month as drift appears. That's orders of magnitude cheaper than standing up full outcome studies, and you get a measurement system you can steer.

Ready to start?

Explore the CJE framework, see implementation examples, or dive into the technical documentation for formal definitions and theory.

Acknowledgements

We are grateful to the CIMO Labs team and community for feedback on this work, and to the researchers whose prior work on surrogate endpoints, off-policy evaluation, and causal inference made this framework possible.

We welcome your feedback

This framework is actively evolving. We invite constructive criticism from practitioners and researchers.

If you spot errors, have suggestions, or have used LLM-judges in production and want to share lessons learned, please let us know or email eddie@cimolabs.com.

References

[1] Soliman, P., & Tomasev, N. (2021). On YouTube's recommendation system. YouTube Official Blog. Explains how YouTube moved from optimizing raw watch time to "valued watch time," using viewer satisfaction surveys to make the proxy programmable.
[2] Covington, P., Adams, J., & Sargin, E. (2016). Deep neural networks for YouTube recommendations. Proceedings of the 10th ACM Conference on Recommender Systems (RecSys '16). Describes YouTube's historical use of expected watch time as the optimization objective before the shift to survey-based valued watch time.
[3] Schrage, A., & Kabiljo, M. (2024). Recommending for long-term member satisfaction at Netflix. Netflix Tech Blog. Discusses how Netflix optimizes for long-term retention rather than short-horizon proxies and the challenges of delayed feedback.
[4] Zhang, J., et al. (2024). Improve your next experiment by learning better proxy metrics from past experiments. Netflix Tech Blog. Shows how Netflix treats proxy metrics as learnable objects you can engineer and update based on historical experiment data—exactly the programmable proxy pattern.
[5] Cortes, C., & Lawrence, N. (2021). The NeurIPS 2021 consistency experiment. NeurIPS Blog. Reports substantial disagreement between independent program committees on borderline papers, demonstrating that peer review is a noisy surrogate.
[6] Price, E. (2014). The NIPS experiment. Communications of the ACM. Overview of the 2014 NIPS experiment showing 57% disagreement on accept/reject decisions for borderline submissions, motivating rubric improvements.