Surrogate Measurement Is Everywhere
Every serious team already uses surrogates: click-through for value, watch time for satisfaction, interview scores for job performance. The novelty with LLM-as-judge is not that it is a surrogate; it is that you can change what it measures and verify that change against oracle outcomes.
TL;DR.
- Surrogate measurement is ubiquitous; nobody can run everything on true long-term outcomes alone.
- Most surrogates are fixed signals. LLM judges are programmable signals.
- Residuals are the programming feedback: calibrate → inspect residuals → update rubric → re-calibrate.
- CTR can drift away from LTV and you cannot rewrite what a click means. You can rewrite a judge rubric.
By the end of this post, you will have a practical loop for programming a judge toward oracle alignment, plus rollout guardrails and API-level implementation examples.
The key claim:
Surrogate measurement is old. Programmable surrogates are new. With LLM judges, you can update the rubric, re-calibrate to oracle labels, and check whether residuals shrink. With fixed proxies like CTR, if correlation with LTV breaks, you cannot patch what a click means.
A 30-second model: S → Y → Y*
Think of quality as the Ladder of Value that every measurement system already uses:
- Y*: what you'd decide with time, context, and care (the ideal, but unobservable).
- Y: a high-rung outcome you can observe on a small slice (expert audit, task success, 30–90-day retention).
- S: cheap, abundant signals (judge scores, clicks, short human labels).
Your goal is to aim abundant S at Y* without paying Y*'s full cost. You do it by learning how S predicts Y on a small gold slice, then using that mapping at scale.
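The gold-slice calibration step can be sketched in a few lines of plain Python. This is a toy binned-mean calibrator on synthetic data (the post's loop uses isotonic calibration via CJE; the data, bin count, and helper names here are illustrative):

```python
# Sketch: learn the S -> Y mapping on a small gold slice with a
# binned-mean calibrator. Synthetic data stands in for real labels.
import random
import statistics

random.seed(0)
n = 300
S = [random.random() for _ in range(n)]  # cheap judge scores in [0, 1]
Y = [min(1.0, max(0.0, 0.2 + 0.6 * s + random.gauss(0, 0.1))) for s in S]

def fit_binned_calibrator(scores, outcomes, n_bins=10):
    """Map a score to the mean oracle outcome in its bin."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, outcomes):
        bins[min(int(s * n_bins), n_bins - 1)].append(y)
    bin_means = [statistics.mean(b) if b else None for b in bins]
    def predict(s):
        i = min(int(s * n_bins), n_bins - 1)
        return bin_means[i] if bin_means[i] is not None else statistics.mean(outcomes)
    return predict

f = fit_binned_calibrator(S, Y)
Y_hat = [f(s) for s in S]                      # calibrated predictions of Y
residuals = [y - yh for y, yh in zip(Y, Y_hat)]  # where the judge disagrees
print(round(statistics.mean(map(abs, residuals)), 3))
```

The same pattern holds with isotonic regression: fit f on the gold slice, predict Ŷ everywhere, and keep the residuals Y − Ŷ for inspection.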
The opportunity with LLM judges is that S itself is editable. You can change the rubric, then test if residuals got better. That is not possible with exogenous proxies.
Why static proxies eventually betray you
Cheap metrics fail in three ways:
- Wrong or no calibration. You maximized S (politeness, clicks, raw judge score) without checking whether high S raises Y (productivity, safety, retention).
- Wrong population. You calibrated S→Y on the wrong slice (easy tasks, different users), so it doesn't transport to your real distribution.
- Temporal drift. The S→Y relationship changed. What once predicted value no longer does.
Traditional proxies (clicks, sign-ups, dwell time) are exogenous. A click is a click. If CTR stops approximating LTV, you cannot rewrite the semantics of a click. You can only replace the proxy or pay to measure outcomes directly.
That is why programmable judges matter operationally: they let you run the same “improve-the-surrogate” loop quickly and repeatedly inside product workflows.
Why this is a continuation, not a new religion
YouTube moved beyond raw watch time toward valued watch time because minutes alone were not enough. Same pattern: refine surrogate definitions to track what actually matters.
Netflix uses learned proxies for long-term satisfaction because true outcomes are delayed and noisy. The core idea is longstanding; LLM judges just make iteration faster and more programmable.
Programmable proxies give you levers for all three failure modes: calibrate S→Y on a gold slice to stop optimizing vibes, transport-test residuals on new cohorts, and monitor drift with small monthly slices, re-calibrating when residuals shift.
Defining programmable proxies
Quick definition
- Is: An LLM-judge + rubric you control. You write what to reward (accuracy, concision, evidence) and what to penalize (hallucinations, verbosity).
- Isn't: Ground truth. It's a metric you can engineer, calibrated to Y, audited with residuals, and updated when you find failure modes.
Because the judge is programmable, you can add missing signal: require evidence checks, penalize gratuitous length, reward correct abstentions. Then you measure whether the error shrinks.
The practical framing is simple: residuals are your surrogate-programming signal. Large residuals tell you where the rubric is wrong. Rubric edits are hypotheses. Re-calibration plus residual shrinkage is your test.
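As a concrete sketch, slicing residuals by a covariate takes only a group-by. The records and cutoff below are made up to show the pattern:

```python
# Sketch: slice residuals by response length to locate where the rubric
# is wrong. Records and the length cutoff are illustrative.
from statistics import mean

records = [
    # (response_length_tokens, residual = Y - Y_hat)
    (40, 0.02), (55, -0.01), (60, 0.03), (90, 0.00),
    (400, -0.22), (450, -0.18), (520, -0.25), (610, -0.30),
]

def slice_residuals(rows, cutoff=300):
    short = [r for length, r in rows if length <= cutoff]
    long_ = [r for length, r in rows if length > cutoff]
    return {"short": mean(short), "long": mean(long_)}

by_slice = slice_residuals(records)
print(by_slice)
```

A large negative mean residual in the long slice says the judge over-predicts quality on padded answers, which points directly at the next rubric edit.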
| Aspect | Exogenous (clicks, watch time) | Programmable (LLM-judge) |
|---|---|---|
| Who defines "good"? | Implicit behavior | Your rubric |
| Can you change it? | No | Yes: edit & re-calibrate |
| Gaming risk | High, hard to remediate directly | Detectable & often fixable |
Why cheaper models can judge expensive ones
One counterintuitive property: a smaller, cheaper model can often reliably judge a larger model's output. The reason is a complexity gap between generation and evaluation.
Producing a good answer from scratch requires searching a massive space of possibilities, managing context, and synthesizing information. Evaluating whether a candidate answer meets a rubric is a simpler task: you're scoring a fixed input against explicit criteria (accuracy, concision, citation quality). That asymmetry means you can get reliable judgment from a smaller model or fewer tokens, then spend heavy compute only where the judge is uncertain.
The same asymmetry explains why a task-specific fine-tuned model can evaluate frontier-model responses. Judging is cheaper than doing.
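One way to exploit that asymmetry is uncertainty-based routing: score everything with the cheap judge and escalate only the ambiguous middle band to expensive review. A toy sketch (the judge, thresholds, and items are illustrative stand-ins):

```python
# Sketch: route heavy compute only to items the cheap judge is unsure about.
def cheap_judge(item: str) -> float:
    # stand-in for a small-model judge returning a score in [0, 1]
    return 0.9 if "cited" in item else 0.5 if "maybe" in item else 0.1

def needs_escalation(score: float, low=0.35, high=0.75) -> bool:
    """Escalate only scores in the uncertain middle band."""
    return low <= score <= high

items = ["answer with cited sources", "maybe-correct answer", "unsupported answer"]
scores = [cheap_judge(it) for it in items]
escalated = [it for it, s in zip(items, scores) if needs_escalation(s)]
print(escalated)
```

Clear passes and clear failures cost one cheap call each; only the ambiguous middle pays for expensive judgment.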
Judges are programs, not just prompts
The real power isn't only that judges are endogenous (you can rewrite the measurement). It's that they're composable instruments. A well-designed judge can:
- Normalize style: Strip formatting, reorder content, or canonicalize before scoring to prevent style gaming
- Call tools: Verify citations exist, run unit tests on generated code, check schemas for structured data
- Produce structured evidence: Return not just a score, but the specific claims checked, sources verified, or tests run
- Version and audit: Log rubric version, model config, and reasoning for every verdict
This makes judges closer to measurement instruments than metrics. You're not just prompting a model for a number. You're building a reusable, auditable evaluation program that can evolve with your needs.
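A minimal sketch of such a verdict record, assuming a plain dataclass (these field names are illustrative, not the CJE schema):

```python
# Sketch: a structured, versioned verdict the judge returns instead of a
# bare score. Field names and values are illustrative.
from dataclasses import dataclass, asdict

@dataclass
class Verdict:
    score: float                 # calibrated or raw judge score
    claims_checked: list[str]    # specific claims the judge verified
    citations_verified: list[str]  # sources confirmed to exist
    rubric_version: str          # which rubric produced this verdict
    judge_model: str             # model + config used to judge
    reasoning: str               # short justification, logged for audit

v = Verdict(
    score=0.82,
    claims_checked=["refund window is 30 days"],
    citations_verified=["policy.md#refunds"],
    rubric_version="rubric-v3",
    judge_model="small-judge-2024",
    reasoning="Accurate and concise; citation resolves.",
)
record = asdict(v)  # ready to log alongside the response it scored
print(record["rubric_version"])
```

Logging the full record, not just the score, is what makes later residual audits and rubric-version comparisons possible.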
The operating system: steer → calibrate → inspect → update → monitor
Using programmable proxies responsibly is a feedback loop where you progressively tighten the judge's alignment with real outcomes.
- Steer: Write the rubric. Spell out "good" in your domain: reward accuracy and minimal necessary length; penalize unsupported claims; prefer cite/quote over vague paraphrase; allow abstain when key info is missing.
- Calibrate S→Y on a gold slice. Collect ~100–500 Y labels (expert audit, task success). Score everything with the judge (S). Fit a lightweight calibrator (isotonic or two-stage with covariates like response length) so S predicts Y on the gold set.
- Inspect residuals: Y − Ŷ. Slice by response length, domain, difficulty, model family. Large negative residuals = the judge over-predicts. Treat this as a programming signal for the next rubric revision.
- Update the judge. Patch the rubric where residuals are worst. Examples: "penalize boilerplate padding," "verify citations exist," "prefer minimal necessary complexity," "abstain if missing key inputs." This is programming the surrogate, not just observing it.
- Re-calibrate & verify. Re-score the gold slice, re-fit the calibrator, and check that residuals shrink, especially in the slices that failed. Repeat steps 3–5 until residuals stabilize.
- Monitor drift & transport. Monthly, pull a small batch from production, compute residuals, and alert if the mean residual leaves its confidence band. For new cohorts/policies, run a transport test: if residual means differ materially, add the cohort as a feature or fit a local calibrator.
Two definitions you need
- Calibration: learn a mapping from judge score S to oracle outcome Y on a gold set.
- Residuals: the error Y - Ŷ. Slice residuals by domain/length/model family to find where the judge lies.
The endogenous advantage
If CTR stops approximating LTV, you cannot patch CTR itself. With programmable proxies, you can change what gets rewarded and immediately test whether alignment improved via residuals. That turns surrogate quality into an engineering problem with feedback loops.
Two non-negotiable guardrails
Must-have monitoring
1) Drift monitor (monthly)
Sample 100–200 new items. Apply your old mapping Ŷ = f(S). Alert if mean residual's CI excludes 0 twice in a row → re-calibrate.
This catches temporal drift before it tanks your product.
2) Transport test (new policy/cohort)
Sample 100–200 items. Compare residual mean vs baseline. If CIs don't overlap → don't transport; re-prompt, add cohort as a feature, or fit a local calibrator.
This prevents your metric from lying when you ship to a new population.
These two checks cost ~$50–200/month in labeling and catch the failure modes that turn "4.8/5 judge score" into "support tickets spiking."
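Both checks reduce to comparing residual means against confidence bands. Here is a sketch of the transport test with synthetic residuals and a normal-approximation 95% CI:

```python
# Sketch: transport test via non-overlapping 95% CIs on mean residuals.
# Residual values here are synthetic.
from statistics import mean, stdev
from math import sqrt

def ci95(xs):
    m, se = mean(xs), stdev(xs) / sqrt(len(xs))
    return (m - 1.96 * se, m + 1.96 * se)

baseline = [0.01, -0.02, 0.00, 0.03, -0.01, 0.02, 0.00, -0.03, 0.01, 0.02]
new_cohort = [-0.20, -0.15, -0.25, -0.18, -0.22, -0.30, -0.17, -0.21, -0.19, -0.24]

lo_b, hi_b = ci95(baseline)
lo_n, hi_n = ci95(new_cohort)
overlap = not (hi_n < lo_b or hi_b < lo_n)
print("transport ok" if overlap else "do not transport: fit a local calibrator")
```

In real audits you would use your 100–200 sampled items per cohort in place of the synthetic lists.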
Ship this in 7 days
You don't need months to start using LLM-judges responsibly. Here's a minimal viable rollout with approximate time and labeling budgets, using the CJE (Causal Judge Evaluation) package to automate calibration, residual analysis, and diagnostics.
Using CJE
The steps below use the cje-eval package, which implements all the calibration, residual analysis, and diagnostics described in this post.
Week 1: From zero to deployed judge
Budget: 200–300 labeled examples, ~3–4 hours reviewer time, ~$10–20 in API costs
Sample 200 items from your domain (e.g., support answers, product descriptions, code reviews). Label 100 as your gold set (Y). Keep 100 as adversarial/holdout.
Format your data as JSONL with prompt, response, judge_score, and oracle_label fields. See CJE in Action.
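A minimal sketch of that JSONL format (field names from this post; check the cje-eval docs for its exact schema):

```python
# Sketch: write and round-trip the gold-slice JSONL. Example rows are
# illustrative; an in-memory buffer stands in for a file.
import io
import json

rows = [
    {"prompt": "How do I reset my password?",
     "response": "Go to Settings > Security and click Reset.",
     "judge_score": 0.9,
     "oracle_label": 1.0},
    {"prompt": "What's your refund policy?",
     "response": "We offer refunds.",  # vague: judge over-scores, oracle disagrees
     "judge_score": 0.8,
     "oracle_label": 0.0},
]

buf = io.StringIO()
for row in rows:
    buf.write(json.dumps(row) + "\n")  # one JSON object per line

buf.seek(0)
parsed = [json.loads(line) for line in buf]  # verify the format round-trips
print(len(parsed))
```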
Run your judge prompt on all 200 items (S). CJE automatically fits AutoCal-R (isotonic calibration): Ŷ = f(S), computes residuals (Y − Ŷ), and generates diagnostic plots.
CJE outputs calibrated estimates, uncertainty intervals, and diagnostics you can compare across rubric versions.
Include verbose-but-empty responses, confident-but-wrong answers, and style-matched nonsense in your holdout set. Check whether the judge fails on known failure modes.
Run CJE on the adversarial holdout and inspect residuals. Large negative residuals = judge over-predicts (gaming vulnerability). Add length/order controls to your rubric.
Based on the residual patterns from the adversarial holdout, update the rubric (e.g., penalize verbosity, require evidence). Re-score all items with the new judge prompt. Re-run CJE and compare residuals to baseline.
Goal: Residuals shrink, especially in slices that failed (e.g., long responses). If not, iterate on rubric again.
Set up a monthly schedule to collect new (S, Y) batches. Run CJE on each batch and alert if mean residual drifts outside confidence bands.
CJE's diagnostics include drift detection. Write down when the judge should abstain (e.g., missing key info, ambiguous task) and log abstention rates.
Use the calibrated judge for new candidates. Log scores, justifications, and CJE's calibrated estimates. Schedule your first drift check for 30 days out.
This is a minimal viable loop. You'll refine the rubric multiple times in production. The key is to start with a small, well-labeled gold set and build the monitoring infrastructure from day one.
If sample-size planning is your main blocker, read Label Budgeting: How many expensive labels do you actually need? for practical allocation rules.
CJE covers the heavy lifting
- Calibration: AutoCal-R for mapping judge score S to oracle outcome Y.
- Residual analysis: Slice failures by domain, length, and model family.
- Uncertainty: Oracle-aware intervals so your CI does not over-claim.
- Off-policy estimators: IPS/DR variants for policy evaluation settings.
Versioning and re-label triggers
Treat your judge like production code. Log version metadata and schedule re-calibration when assumptions change.
- Judge model change or temperature shift
- Rubric change affecting scoring rules
- Cohort/domain mix shift over 15%
- Drift monitor failure in two consecutive audits
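These triggers can be encoded as an explicit check against logged judge metadata. A sketch, with illustrative field names and the 15% mix-shift threshold expressed as total variation distance:

```python
# Sketch: decide whether to re-calibrate by diffing current judge metadata
# against the metadata logged at last calibration. Names are illustrative.
current = {"judge_model": "small-judge-2024", "temperature": 0.0,
           "rubric_version": "rubric-v3",
           "cohort_mix": {"support": 0.6, "sales": 0.4}}
calibrated_on = {"judge_model": "small-judge-2024", "temperature": 0.0,
                 "rubric_version": "rubric-v2",
                 "cohort_mix": {"support": 0.7, "sales": 0.3}}

def needs_recalibration(now, baseline, mix_shift_threshold=0.15):
    if now["judge_model"] != baseline["judge_model"]:
        return True
    if now["temperature"] != baseline["temperature"]:
        return True
    if now["rubric_version"] != baseline["rubric_version"]:
        return True
    # total variation distance between cohort mixes
    keys = set(now["cohort_mix"]) | set(baseline["cohort_mix"])
    shift = 0.5 * sum(abs(now["cohort_mix"].get(k, 0.0)
                          - baseline["cohort_mix"].get(k, 0.0)) for k in keys)
    return shift > mix_shift_threshold

print(needs_recalibration(current, calibrated_on))
```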
Pressure-test your judge before launch
Good scores are cheap. Trustworthy scores are earned. Keep a holdout battery that the rubric never trains on.
Holdout attacks
- Length padding: same answer, 2x boilerplate. Score should not rise.
- Confident-but-wrong: polished tone, bad facts. Score should drop hard.
- Style mimicry: high-quality style, wrong substance. Score should stay low.
- Citation hallucination: fake sources. Score should be capped or fail.
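The length-padding attack is easy to automate. A toy sketch with a stand-in judge (your real judge call replaces `judge`; the padding text and scoring rule are illustrative):

```python
# Sketch: length-padding holdout check. Same answer plus boilerplate must
# not score higher than the original.
def judge(response: str) -> float:
    """Toy judge: rewards substantive sentences, ignores boilerplate."""
    substance = [s for s in response.split(". ") if "as an AI" not in s]
    return min(1.0, 0.3 + 0.1 * len(substance))

answer = "The refund window is 30 days. Contact support to start a claim"
padded = answer + ". as an AI, I should note. as an AI, I should add"

base_score = judge(answer)
padded_score = judge(padded)

# Attack passes only if padding does not raise the score.
attack_passed = padded_score <= base_score
print("length-padding check:", "pass" if attack_passed else "FAIL")
```

The same harness generalizes: generate the perturbed variant, score both, and assert the direction the rubric promises.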
Bias and drift checks
- Verbosity bias: add length controls or explicit brevity rewards.
- Family favoritism: use cross-family judge ensembles.
- Temporal drift: run monthly residual audits (100–200 items).
- Domain shift: re-check residual mean before transporting to a new cohort.
Drift monitor example
Trigger re-calibration when residual mean exits its control band and the shift repeats.
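A sketch of that trigger, assuming monthly residual batches and a normal-approximation 95% CI (the residual data is synthetic):

```python
# Sketch: re-calibrate when the mean residual's 95% CI excludes 0 in two
# consecutive monthly audits.
from statistics import mean, stdev
from math import sqrt

def ci_excludes_zero(residuals):
    m = mean(residuals)
    se = stdev(residuals) / sqrt(len(residuals))
    return abs(m) > 1.96 * se

monthly_audits = [
    [0.01, -0.02, 0.03, 0.00, -0.01, 0.02, -0.03, 0.01],       # in band
    [-0.12, -0.08, -0.15, -0.10, -0.09, -0.14, -0.11, -0.13],  # out of band
    [-0.14, -0.10, -0.16, -0.12, -0.11, -0.15, -0.13, -0.17],  # out again
]

flags = [ci_excludes_zero(batch) for batch in monthly_audits]
recalibrate = any(a and b for a, b in zip(flags, flags[1:]))
print("re-calibrate" if recalibrate else "within control band")
```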
Run this next week
Start with 100–300 oracle labels, ship one rubric iteration, and stand up residual monitoring before you trust leaderboard deltas.
Build the loop, not just the score
Use CJE to calibrate, inspect, and monitor. Use the budgeting guide to size your oracle slice.
Citation
If you use this work, please cite:
BibTeX
@misc{landesberg2026surrogatemeasurement,
  author = {Landesberg, Eddie},
  title  = {Surrogate Measurement Is Everywhere: Why LLM Judges Are Different},
  year   = {2026},
  month  = {February},
  url    = {https://cimolabs.com/blog/programmable-proxies},
  note   = {CIMO Labs}
}
Plain Text
Landesberg, E. (2026). Surrogate Measurement Is Everywhere: Why LLM Judges Are Different. CIMO Labs. https://cimolabs.com/blog/programmable-proxies
Acknowledgements
We are grateful to the CIMO Labs team and community for feedback on this work, and to the researchers whose prior work on surrogate endpoints, off-policy evaluation, and causal inference made this framework possible.
We welcome your feedback
This framework is actively evolving. We invite constructive criticism from practitioners and researchers.
If you spot errors, have suggestions, or have used LLM-judges in production and want to share lessons learned, please let us know or email eddie@cimolabs.com.
Related Reading
Programmable Proxies: Technical Appendix
Formal framework for designing, calibrating, and auditing LLM judges as programmable surrogate outcomes. Identification results, closed-loop algorithms, panel-of-judges, verdict cards, and oracle-uncertainty-aware inference.
Your AI Metrics Are Lying to You
Companion piece exploring why traditional metrics fail (proxy inversion, Goodhart's Law, specification gaming) and how to build honest measurement systems with concrete examples and red flags.
AI Quality as Surrogacy: Technical Appendix
Formal framework with precise definitions, identification results, influence functions, and asymptotic theory for treating AI quality measurement as a surrogate endpoint problem.
Arena Experiment: Benchmarking 14 Estimators on 5k LLM Evaluations
Full technical post with math, proofs, and ablations. Deep dive into off-policy evaluation, importance sampling failures, and doubly robust methods with two-stage calibration.
