CIMO Labs

Programmable Proxies: Why LLM-Judges Beat Clicks—and How to Use Them Responsibly

Eddie Landesberg · 16 min read

We optimized for clicks; customers still churned. We needed a metric we could shape, not just watch. Here's why LLM-judges are different—and how to use them without getting fooled.

TL;DR. LLM judges are programmable proxies you can steer via rubric, calibrate to outcomes that matter, and update when wrong. Unlike clicks or engagement (which you can't reprogram), judges give you a measurement channel you can continuously refine—turning static metrics into adaptive measurement systems.

The key difference:

Unlike clicks or engagement metrics, LLM judges are repairable instruments. When they drift or fail, you update the rubric, recalibrate to ground truth, and verify the fix with residuals. Legacy proxies give you no such lever—when clicks stop predicting value, you're stuck finding a new metric entirely.

One picture: S → Y → Y*

Think of quality as a ladder:

  • Y*: what you'd decide with time, context, and care (the ideal—but unobservable).
  • Y: a high-rung outcome you can observe on a small slice (expert audit, task success, 30–90-day retention).
  • S: cheap, abundant signals (LLM-judge scores, short human labels, engagement).

Your goal is to aim abundant S at Y* without paying Y*'s cost. You do it by learning how S predicts Y on a small gold slice, then using that mapping at scale.

Programmable proxies let you improve S itself—you can teach the metric to see what matters by editing the rubric, then re-testing and re-calibrating.

Why clicks can't be fixed

Cheap metrics fail in three ways:

  1. Wrong or no calibration. You maximized S (politeness, clicks, raw judge score) without checking whether high S raises Y (productivity, safety, retention).
  2. Wrong population. You calibrated S→Y on the wrong slice (easy tasks, different users), so it doesn't transport to your real distribution.
  3. Temporal drift. The S→Y relationship changed. What once predicted value no longer does.

Traditional proxies—clicks, sign-ups, dwell time—are exogenous. A click is a click. When these signals drift or get gamed, you can't change what they measure. You're stuck swapping to a different metric or directly measuring expensive outcomes.

Viewer-satisfaction surveys—the kind YouTube uses—are slow, expensive, and not available in every domain. LLM-judges are available wherever you have text, and they offer the same programmability at scale and in real time.

Real-world examples: YouTube and Netflix

YouTube originally optimized for watch time, then learned that minutes alone missed whether viewers actually valued what they watched. They introduced viewer-satisfaction surveys and trained models to predict "valued watch time" at scale—moving from a fixed signal (minutes) to a programmable measurement channel they could rewrite and re-calibrate.

Netflix optimizes for long-term retention, but it's delayed and noisy. Netflix openly describes designing and re-learning proxies from historical experiments so day-to-day decisions still point at long-term member satisfaction. Same loop: author → calibrate → monitor → update.

Programmable proxies give you levers for all three failure modes: calibrate S→Y on a gold slice to stop optimizing vibes, transport-test residuals on new cohorts, and monitor drift with small monthly slices, re-calibrating when residuals shift.

Defining programmable proxies

Quick definition

  • Is: An LLM-judge + rubric you control. You write what to reward (accuracy, concision, evidence) and what to penalize (hallucinations, verbosity).
  • Isn't: Ground truth. It's a metric you can engineer—calibrated to Y, audited with residuals, and updated when you find failure modes.

Because the judge is programmable, you can add missing signal: require evidence checks, penalize gratuitous length, reward correct abstentions. Then you measure whether the error shrinks.

| Aspect | Exogenous (clicks, watch time) | Programmable (LLM-judge) |
| --- | --- | --- |
| Who defines "good"? | Implicit behavior | Your rubric |
| Can you change it? | No | Yes—edit & re-calibrate |
| Gaming risk | High, unfixable | Detectable & fixable |

Why cheaper models can judge expensive ones

One counterintuitive property: a smaller, cheaper model can often reliably judge a larger model's output. The reason is a complexity gap between generation and evaluation.

Producing a good answer from scratch requires searching a massive space of possibilities, managing context, and synthesizing information. Evaluating whether a candidate answer meets a rubric is a simpler task: you're scoring a fixed input against explicit criteria (accuracy, concision, citation quality). That asymmetry means you can get reliable judgment from a smaller model or fewer tokens, then spend heavy compute only where the judge is uncertain.

This is why GPT-4 can judge GPT-5 outputs, or why a task-specific fine-tuned model can evaluate frontier-model responses—judging is cheaper than doing.

Judges are programs, not just prompts

The real power isn't only that judges are endogenous (you can rewrite the measurement). It's that they're composable instruments. A well-designed judge can:

  • Normalize style: Strip formatting, reorder content, or canonicalize before scoring to prevent style gaming
  • Call tools: Verify citations exist, run unit tests on generated code, check schemas for structured data
  • Produce structured evidence: Return not just a score, but the specific claims checked, sources verified, or tests run
  • Version and audit: Log rubric version, model config, and reasoning for every verdict

This makes judges closer to measurement instruments than metrics. You're not just prompting a model for a number—you're building a reusable, auditable evaluation program that can evolve with your needs.

The loop: steer → calibrate → inspect → update → monitor

Using programmable proxies responsibly is a feedback loop where you progressively tighten the judge's alignment with real outcomes.

The judge improvement loop: steer, calibrate, inspect residuals, update rubric, re-calibrate, deploy with guardrails
  1. Steer: Write the rubric. Spell out "good" in your domain: reward accuracy and minimal necessary length; penalize unsupported claims; prefer cite/quote over vague paraphrase; allow abstain when key info is missing.
  2. Calibrate S→Y on a gold slice. Collect ~100–500 Y labels (expert audit, task success). Score everything with the judge (S). Fit a lightweight calibrator (isotonic or two-stage with covariates like response length) so S predicts Y on the gold set.
  3. Inspect residuals: Y − Ŷ. Slice by response length, domain, difficulty, model family. Large negative residuals = the judge over-predicts. That's where gaming hides (verbosity, confident-but-wrong, style mimicry).
  4. Update the judge. Patch the rubric where residuals are worst. Examples: "penalize boilerplate padding," "verify citations exist," "prefer minimal necessary complexity," "abstain if missing key inputs."
  5. Re-calibrate & verify. Re-score the gold slice, re-fit the calibrator, and check that residuals shrink—especially in the slices that failed. Repeat steps 3–5 until residuals stabilize.
  6. Monitor drift & transport. Monthly, pull a small batch from production, compute residuals, and alert if the mean residual leaves its confidence band. For new cohorts/policies, run a transport test: if residual means differ materially, add the cohort as a feature or fit a local calibrator.

What is "calibration"?

Learning how your cheap metric (S, the judge score) predicts your expensive ground truth (Y, expert labels or real outcomes). You collect (S, Y) pairs, fit a function Ŷ = f(S), then use that function to map abundant S data into predictions of Y. This gives you less bias (Y is closer to what you care about than raw S) and less variance (you leverage thousands of S labels instead of just hundreds of Y labels).
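This fitting step can be sketched in a few lines. A minimal example, assuming scikit-learn (CJE's AutoCal-R wraps a similar isotonic step with more diagnostics); the (S, Y) pairs below are toy data:

```python
# Minimal sketch: calibrate judge scores (S) to gold labels (Y)
# with isotonic regression, then map new scores to predictions of Y.
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Toy gold slice: judge scores S and expert labels Y, both on a 1-5 scale
S = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
Y = np.array([1.2, 1.8, 2.2, 2.9, 3.1, 3.4, 3.9, 4.2])

# Fit a monotone map Y_hat = f(S); clip scores outside the training range
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(S, Y)

# Use the mapping to turn abundant judge scores into predictions of Y
Y_hat = calibrator.predict(np.array([2.0, 4.8]))
```

In practice you would fit on your ~100–500-label gold slice and apply the mapping to the full scored dataset.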

What are "residuals"?

Prediction errors: Y − Ŷ. If your calibrated judge predicts Ŷ = 4 but the true label is Y = 3, the residual is -1 (you over-predicted). Large residuals tell you where the judge is systematically wrong. By slicing residuals (e.g., by response length or domain), you find patterns in the judge's mistakes. Maybe long answers always have negative residuals (judge over-rates them). You then update the rubric to fix that pattern and verify the residuals shrink.
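That slicing step is straightforward to do by hand. A minimal sketch, assuming pandas; the column names and values are illustrative:

```python
# Slice residuals (Y - Y_hat) by response length to find where the
# judge is systematically wrong.
import pandas as pd

df = pd.DataFrame({
    "residual": [-0.1, 0.2, -0.8, -0.6, 0.1, -0.7],  # Y - Y_hat per item
    "n_tokens": [120, 150, 900, 1100, 200, 950],     # response length
})

# Bucket responses into short vs. long, then compare mean residuals
df["length_bucket"] = pd.cut(
    df["n_tokens"], bins=[0, 400, 2000], labels=["short", "long"]
)
by_bucket = df.groupby("length_bucket", observed=True)["residual"].mean()
print(by_bucket)
```

A strongly negative mean in the "long" bucket is exactly the verbosity pattern described above: the judge over-rates long answers, so you add a length penalty to the rubric and re-check.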

The endogenous advantage

With clicks or LTV, you're stuck with the signal you get. With programmable proxies, you can add signal by making the rubric more specific—evidence, concision, risk handling, abstention policy—then test if residuals shrink on your gold set. If they do, you've improved the proxy. If they don't, you've learned the rubric change didn't help, and you can try something else.

Two cheap guardrails that save you from memes

Must-have monitoring

1) Drift monitor (monthly)

Sample 100–200 new items. Apply your old mapping Ŷ = f(S). Alert if mean residual's CI excludes 0 twice in a row → re-calibrate.

This catches temporal drift before it tanks your product.
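The check itself is a few lines of arithmetic. A stdlib-only sketch using a normal-approximation CI (CJE's drift diagnostics are more thorough); the residual batches are toy data:

```python
# Monthly drift check: flag the batch when the mean residual's
# 95% CI (normal approximation) excludes zero.
import math

def drift_alert(residuals, z=1.96):
    n = len(residuals)
    mean = sum(residuals) / n
    var = sum((r - mean) ** 2 for r in residuals) / (n - 1)
    se = math.sqrt(var / n)
    lo, hi = mean - z * se, mean + z * se
    return not (lo <= 0.0 <= hi)  # True -> CI excludes 0: re-calibrate

# Toy residual batches: one stable near zero, one shifted down
stable = [0.05, -0.1, 0.02, 0.08, -0.04, -0.06] * 20
drifted = [r - 0.3 for r in stable]
```

Per the rule above, you'd only alert after the CI excludes zero two months in a row, which keeps one noisy batch from triggering a re-calibration.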

2) Transport test (new policy/cohort)

Sample 100–200 items. Compare residual mean vs baseline. If CIs don't overlap → don't transport; re-prompt, add cohort as a feature, or fit a local calibrator.

This prevents your metric from lying when you ship to a new population.
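The CI-overlap rule can be sketched the same way. A stdlib-only illustration under a normal-approximation CI; cohort data is toy:

```python
# Transport test: compare residual-mean CIs on the baseline cohort
# vs. a new cohort; disjoint CIs -> don't reuse the calibrator.
import math

def residual_ci(residuals, z=1.96):
    n = len(residuals)
    mean = sum(residuals) / n
    var = sum((r - mean) ** 2 for r in residuals) / (n - 1)
    half = z * math.sqrt(var / n)
    return mean - half, mean + half

def transports(baseline, new_cohort):
    lo_b, hi_b = residual_ci(baseline)
    lo_n, hi_n = residual_ci(new_cohort)
    # Overlapping CIs -> reuse the calibrator; disjoint -> fit a local one
    return not (hi_b < lo_n or hi_n < lo_b)

# Toy cohorts: the shifted cohort's residuals sit ~0.5 lower
baseline = [0.05, -0.05, 0.1, -0.1, 0.02, -0.02] * 25
shifted = [r - 0.5 for r in baseline]
```

When the test fails, the options from above apply: re-prompt, add the cohort as a calibration feature, or fit a cohort-specific calibrator.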

These two checks cost ~$50–200/month in labeling and catch the failure modes that turn "4.8/5 judge score" into "support tickets spiking."

Quickstart: A 7-day rollout

You don't need months to start using LLM-judges responsibly. Here's a minimal viable rollout with approximate time and labeling budgets, using the CJE (Causal Judge Evaluation) package to automate calibration, residual analysis, and diagnostics.

Using CJE

The steps below use the cje-eval package, which implements all the calibration, residual analysis, and diagnostics described in this post.

Week 1: From zero to deployed judge

Budget: 200–300 labeled examples, ~3–4 hours reviewer time, ~$10–20 in API costs

Day 1–2
Draft judge rubric & prepare data.

Sample 200 items from your domain (e.g., support answers, product descriptions, code reviews). Label 100 as your gold set (Y). Keep 100 as adversarial/holdout.

# Install CJE
pip install cje-eval

Format your data as JSONL with prompt, response, judge_score, and oracle_label fields. See data format guide.
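For illustration, one record in that format might look like the following. Field names are the ones listed above; the text and score values are made up, the score scales are assumptions (check the data format guide), and oracle_label would be present only on the gold slice:

```python
# Build and round-trip one illustrative JSONL record.
import json

record = {
    "prompt": "How do I reset my password?",
    "response": "Go to Settings > Security > Reset Password...",
    "judge_score": 0.82,   # S: cheap judge signal (scale assumed)
    "oracle_label": 0.75,  # Y: expert label, gold slice only
}

line = json.dumps(record)   # one line of the .jsonl file
parsed = json.loads(line)
```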

Day 3
Score with judge; fit calibrator; analyze residuals.

Run your judge prompt on all 200 items (S). CJE automatically fits AutoCal-R (isotonic calibration): Ŷ = f(S), computes residuals (Y − Ŷ), and generates diagnostic plots.

# Run CJE analysis (CLI)
python -m cje analyze gold_data.jsonl --estimator direct

# Or in Python:
from cje import analyze_dataset

result = analyze_dataset(
    "gold_data.jsonl",
    estimator="direct",
)

# Inspect residuals by slice
print(result.diagnostics.residuals_by_length)

CJE outputs: calibrated estimates, residual plots sliced by length/domain, reliability curves, and coverage diagnostics.

Day 4
Adversarial tests + bias checks.

Include verbose-but-empty responses, confident-but-wrong answers, style-matched nonsense in your holdout set. Check if the judge fails on known failure modes.

Run CJE on the adversarial holdout and inspect residuals. Large negative residuals = judge over-predicts (gaming vulnerability). Add length/order controls to your rubric.

Day 5
Update prompt; re-score; re-calibrate.

Based on residual patterns from Day 3–4, update the rubric (e.g., penalize verbosity, require evidence). Re-score all items with the new judge prompt. Re-run CJE and compare residuals to baseline.

# After updating judge prompt, re-score and re-analyze
result_v2 = analyze_dataset(
    "gold_data_v2.jsonl",
    estimator="direct",
)

# Compare residual improvement
print(f"Baseline RMSE: {result.diagnostics.rmse:.3f}")
print(f"Updated RMSE: {result_v2.diagnostics.rmse:.3f}")

Goal: Residuals shrink, especially in slices that failed (e.g., long responses). If not, iterate on rubric again.

Day 6
Add drift monitors; define abstain policy.

Set up a monthly schedule to collect new (S, Y) batches. Run CJE on each batch and alert if mean residual drifts outside confidence bands.

CJE's diagnostics include drift detection. Write down when the judge should abstain (e.g., missing key info, ambiguous task) and log abstention rates.

Day 7
Ship as evaluation gate.

Use the calibrated judge for new candidates. Log scores, justifications, and CJE's calibrated estimates. Schedule your first drift check for 30 days out.

# Apply calibrated judge at scale
result = analyze_dataset(
    "production_eval.jsonl",
    estimator="direct",
)
print(f"Policy value: {result.estimate:.3f} [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")

This is a minimal viable loop. You'll refine the rubric multiple times in production. The key is to start with a small, well-labeled gold set and build the monitoring infrastructure from day one.

CJE automates the hard parts

The package handles:

  • AutoCal-R: Isotonic or two-stage calibration (S → Y) with covariate controls
  • Residual analysis: Automatic slicing by length, domain, difficulty with plots
  • Diagnostics: Reliability curves, coverage checks, ESS, tail diagnostics
  • Uncertainty: Oracle-uncertainty-aware (OUA) inference for honest confidence intervals
  • Off-policy modes: IPS and DR with SIMCal weight stabilization

Always log: Judge versioning

Any change to your judge setup forces a re-calibration check on a small gold slice. Log these fields with every score:

judge_model=gpt-4.5-mini
judge_prompt_hash=sha256:ab12…
rubric_version=3.2
calibrator_version=isotonic-v5
abstain_rate=7%

Re-label triggers

Collect new (S, Y) pairs and re-calibrate when:

  • Change in judge model family or temperature
  • Rubric edit touching any hard rule
  • Domain mix shift > 15%
  • Monthly drift check fails (mean residual shift > 0.1 or slice p<0.01 after correction)

Anti-gaming battery (use as a holdout)

LLM-judges inherit known failure modes. The difference is you can test for these patterns and build guardrails.

Holdout test suite

  • Length padding: Correct answer + 30–200% boilerplate → judge shouldn't reward it
  • Confident-but-wrong: Authoritative tone, wrong facts → should be penalized
  • Style mimicry: High-scoring style, wrong content → score should stay low
  • Citation hallucination: Fake URLs/papers → force verify or cap score

Pass condition: Under each attack, the calibrated score should either decrease appropriately or stay stable; as a rule of thumb, keep score movement ≤ 0.05 unless the content truly improved.
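That rule of thumb can be wired into the holdout suite directly. A minimal sketch, where `score` stands in for your calibrated judge and `toy_score` is a hypothetical scorer included only to make the example runnable:

```python
# Flag (original, attacked) pairs where the attack *raised* the
# calibrated score by more than the tolerance above.
TOLERANCE = 0.05

def gaming_failures(pairs, score, tol=TOLERANCE):
    """Return the pairs whose attacked variant gained more than tol."""
    return [
        (orig, attacked)
        for orig, attacked in pairs
        if score(attacked) - score(orig) > tol
    ]

# Hypothetical scorer for illustration: penalizes padding past 50 words
def toy_score(text):
    n = len(text.split())
    return max(0.0, 1.0 - 0.01 * max(0, n - 50))

padded = "fix the config " * 40  # length-padding attack
pairs = [("fix the config", padded)]
failures = gaming_failures(pairs, toy_score)
```

Run this over the whole anti-gaming battery; any non-empty `failures` list points at the rubric patch to write next.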

Calibrated judges won't eliminate reward hacking—but they make it measurable and fixable. When you see a pattern, patch the rubric and re-test.

Testing hygiene

  • Randomize order and hide model identities when comparing responses
  • Control for length: Include length as covariate or penalize verbosity in rubric
  • Cross-model validation: Use multiple judge families or tie-breaker rubric for high-variance items
  • Require rationale (≤300 chars) but don't reward verbosity in the rationale itself
  • Keep a holdout gold slice untouched during calibration to verify generalization
  • Log "abstain" as first-class outcome when evidence is ambiguous

Bias, drift, and transport: be honest, show fixes

LLM-judges have known issues. The key is to acknowledge them upfront, measure them, and show how the loop mitigates them.

Known issue: Verbosity bias

Problem: Judges often reward longer responses even when brevity would be better.

Mitigation: Add response length as a covariate in calibration (two-stage calibrator), or explicitly add to rubric: "Penalize gratuitous length. Reward minimal necessary complexity."
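A minimal sketch of the covariate idea, assuming scikit-learn: residualize the gold labels on length first, then fit an isotonic map from judge score to the length-adjusted label. This illustrates the two-stage pattern, not CJE's exact AutoCal-R internals; all data below is synthetic:

```python
# Two-stage calibration sketch: stage 1 removes the linear length
# effect; stage 2 fits a monotone map from judge score to the
# length-adjusted gold label.
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
length = rng.uniform(50, 1000, n)            # response length (tokens)
S = rng.uniform(1, 5, n)                     # raw judge score
Y = 0.8 * S + 0.0005 * length + rng.normal(0, 0.1, n)  # toy gold labels

# Stage 1: model and remove the length effect on Y
stage1 = LinearRegression().fit(length.reshape(-1, 1), Y)
Y_resid = Y - stage1.predict(length.reshape(-1, 1))

# Stage 2: isotonic map from judge score to the length-adjusted label
stage2 = IsotonicRegression(out_of_bounds="clip").fit(S, Y_resid)

def predict(s, tokens):
    """Predict Y from a judge score and response length."""
    return stage2.predict(np.atleast_1d(s)) + stage1.predict([[tokens]])
```

With length held fixed, the prediction depends only on the (monotone) score map, so a verbose answer can no longer buy a higher calibrated estimate through length alone.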

Known issue: Family favoritism

Problem: A judge may favor responses from its own model family (e.g., GPT-4 rating GPT-4 outputs higher than Claude).

Mitigation: Use cross-family ensemble (average scores from judges in different families), or use a style-blind formatting step before judging.

Known issue: Temporal drift

Problem: The S→Y relationship can change over time as user preferences shift, model behavior changes, or new edge cases emerge.

Mitigation: Automate a monthly audit (e.g., 30 items per domain): collect a small batch of new (S, Y) pairs and check whether mean residuals stay near zero. If they drift beyond error bars or shift by more than 0.1, re-calibrate or update the rubric.

Known issue: Domain shift

Problem: Calibration on one domain (e.g., customer support) may not transport to another (e.g., medical advice).

Mitigation: When launching in a new domain, collect a small (S, Y) slice from that domain. Check if residuals match your baseline. If not, either add domain as a covariate or fit a domain-specific calibrator.

Visual example: Drift detection over time

Drift monitoring showing residuals over time with error bars and re-prompt event

Months 1–3: Residuals stable near zero. Month 4: Drift detected (mean residual < 0). Re-prompted to penalize new gaming pattern. Month 5: Residuals return to baseline.

Case study: Support answers judged for resolution & safety

Context

A customer support team wanted to evaluate AI-generated answers for "task resolution" and "safety" (no hallucinated steps, no unsupported promises). They started with a basic judge prompt: "Rate 1–5 how well this answer solves the user's problem."

Before: Common residuals

  • Large negative residuals in long answers: The judge gave high scores to verbose responses that included irrelevant detail. Human raters penalized these as "not concise."
  • False positives on confident-but-wrong answers: Responses that hallucinated troubleshooting steps but sounded authoritative got high judge scores but low human ratings.

Intervention: Updated rubric

# Added to judge prompt:
- Reward: minimal necessary complexity. Prefer 3 clear steps over 10 verbose ones.
- Penalize: unsupported claims. If the answer references a feature/setting, it must exist.
- Prefer: cite/quote over paraphrase when referencing docs.

After: Residuals shrink

Re-scored the gold set with the updated rubric. Re-fit the calibrator. Residuals in the "long answer" bucket dropped by 40%. False positives on hallucinated steps cut in half.

Residual scatter plot showing large negative residuals clustered in long responses before rubric update

Before rubric update: Large negative residuals (judge over-predicting) cluster in long responses. This pattern flagged verbosity bias.

Residual improvement (illustrative)

Baseline (v1 prompt): Mean residual = -0.31, SD = 0.68
After rubric update (v2): Mean residual = -0.09, SD = 0.52

These numbers are illustrative. Your domain will differ. The pattern—inspect residuals, update rubric, verify improvement—is what generalizes.

Aside — Peer review is a programmable proxy

In science, Y* ("advances the field toward truth") is unobservable at submission time, so communities use peer review as a surrogate. The NeurIPS consistency experiments showed substantial disagreement near the boundary, i.e., a noisy surrogate. The field responded by updating the rubric (reproducibility checklists, artifacts, rigor criteria) and retraining reviewers—exactly the steer → calibrate → inspect → update loop.

Treat metrics as engineered objects, not oracles.

Citation

If you use this work, please cite:

BibTeX

@misc{landesberg2025programmableproxies,
  author = {Landesberg, Eddie},
  title = {Programmable Proxies: Why LLM-Judges Beat Clicks—and How to Use Them Responsibly},
  year = {2025},
  month = {November},
  url = {https://cimolabs.com/blog/programmable-proxies},
  note = {CIMO Labs}
}

Plain Text

Landesberg, E. (2025). Programmable Proxies: Why LLM-Judges Beat Clicks—and How to Use Them Responsibly. CIMO Labs. https://cimolabs.com/blog/programmable-proxies

FAQ for skeptics

"Aren't we just training to the judge?"

Short answer: Only if you skip the adversarial tests and drift monitoring.

Longer answer: Yes, you're optimizing for the judge's score—but the judge is calibrated to real outcomes (Y) on a gold set, and you're checking residuals to ensure the mapping stays tight. Include adversarial tests (verbose-but-empty, confident-but-wrong) and periodically validate with A/B tests or cohort studies. Treat the judge as a measurement upgrade, not the final objective.

"Models judging models will be biased."

Yes—and you can measure and reduce that bias. Use cross-family judges (e.g., average scores from GPT-4 and Claude), add length/style controls, and inspect residuals by model family. When you see a residual pattern (e.g., GPT-4 judge favoring GPT-4 outputs), you can switch to a cross-family ensemble or add style-blind formatting before judging. The key is you're not stuck with the bias—you can test and fix it.

"Why not just track click-through or engagement?"

Because clicks are easy to game and impossible to fix when they drift. Dark patterns, clickbait, and vanity metrics can all inflate engagement without improving value. When clicks stop predicting revenue, you have no way to "fix" what a click measures—you're stuck finding a new proxy. With LLM-judges, you can update the rubric, re-calibrate, and verify the fix with residual analysis. You're playing offense, not defense.

Drop-in assets: Templates you can use today

Judge prompt template (copy-paste ready)

# Judge Prompt Template

**Goal:** Score how well the answer resolves the user's task.

**Scale:** 1 (fails to resolve) to 5 (fully resolves)
- 1: Irrelevant, incorrect, or harmful
- 2: Partially on-topic but missing key info
- 3: Addresses the task but lacks depth or has minor errors
- 4: Good answer with minor room for improvement
- 5: Excellent, complete, accurate

**Reward:**
- Accuracy and specificity
- Minimal necessary length (prefer 3 clear steps over 10 verbose ones)
- Cited evidence when making claims (verify sources exist)
- Clear abstention when key info is missing

**Penalize:**
- Unsupported claims or hallucinated details
- Irrelevant detail or boilerplate padding
- Confident-but-wrong responses
- Overly verbose or repetitive explanations
- Bare URLs or unverified citations

**Policy:**
- If the task is ambiguous or lacks key context, prefer "ask for clarification" over guessing
- If sources are cited, verify they're real (no hallucinated URLs)
- Require verification when sources are cited (e.g., no bare URLs without context)

**Output format:**
{
  "score": <1-5>,
  "rationale": "<=300 chars explaining the score",
  "citations": ["<any sources referenced>"]
}

Residuals playbook

  1. Plot residuals (Y − Ŷ) by slices: response length, domain, difficulty, model family.
  2. Identify the biggest residual pattern: e.g., "long answers have mean residual = -0.4" (judge over-predicts).
  3. Add pattern to rubric: "Penalize gratuitous length. Reward minimal necessary complexity."
  4. Re-score + re-calibrate: Apply updated prompt. Re-fit calibrator. Confirm residual shrinks in that slice.
  5. Lock a holdout: Keep a small gold set untouched. Verify the final mapping generalizes.
  6. Watch for re-emergence (drift): Monthly checks on new (S, Y) batches. If residual pattern returns, repeat loop.

Anti-gaming tests to include

  • Verbosity: Copy a correct answer, pad with boilerplate. Judge should penalize.
  • Confident-but-wrong: Authoritative tone, hallucinated facts. Judge should catch.
  • Style mimics: Copy writing style of high-scoring answers but with wrong content.
  • List-spam: Generate a long bulleted list with only 1–2 relevant items.
  • Citation hallucination: Include fake URLs or paper titles. Judge should flag or penalize.

Try it this week

Start small: 100–200 gold labels (one expert-day if you batch them) + rubric iteration + jackknife uncertainty → honest CI and a programmable lever you can tighten each month as drift appears. That's orders of magnitude cheaper than standing up full outcome studies, and you get a measurement system you can steer.

Ready to start?

Explore the CJE framework, see implementation examples, or dive into the technical documentation for formal definitions and theory.

Acknowledgements

We are grateful to the CIMO Labs team and community for feedback on this work, and to the researchers whose prior work on surrogate endpoints, off-policy evaluation, and causal inference made this framework possible.

We welcome your feedback

This framework is actively evolving. We invite constructive criticism from practitioners and researchers.

If you spot errors, have suggestions, or have used LLM-judges in production and want to share lessons learned, please let us know or email eddie@cimolabs.com.

References

[1] Soliman, P., & Tomasev, N. (2021). On YouTube's recommendation system. YouTube Official Blog. Explains how YouTube moved from optimizing raw watch time to "valued watch time," using viewer satisfaction surveys to make the proxy programmable.
[2] Covington, P., Adams, J., & Sargin, E. (2016). Deep neural networks for YouTube recommendations. Proceedings of the 10th ACM Conference on Recommender Systems (RecSys '16). Describes YouTube's historical use of expected watch time as the optimization objective before the shift to survey-based valued watch time.
[3] Schrage, A., & Kabiljo, M. (2024). Recommending for long-term member satisfaction at Netflix. Netflix Tech Blog. Discusses how Netflix optimizes for long-term retention rather than short-horizon proxies and the challenges of delayed feedback.
[4] Zhang, J., et al. (2024). Improve your next experiment by learning better proxy metrics from past experiments. Netflix Tech Blog. Shows how Netflix treats proxy metrics as learnable objects you can engineer and update based on historical experiment data—exactly the programmable proxy pattern.
[5] Cortes, C., & Lawrence, N. (2021). The NeurIPS 2021 consistency experiment. NeurIPS Blog. Reports substantial disagreement between independent program committees on borderline papers, demonstrating that peer review is a noisy surrogate.
[6] Price, E. (2014). The NIPS experiment. Communications of the ACM. Overview of the 2014 NIPS experiment showing 57% disagreement on accept/reject decisions for borderline submissions, motivating rubric improvements.