TL;DR: Use CJE when you have cheap metrics (judge scores) that correlate with your KPI but aren't calibrated to it, and you can afford to label 5-25% of your data with expensive ground truth. CJE gives you the speed of offline evaluation with statistical rigor closer to A/B tests.
Step 0: Before using this decision guide, verify your data meets the statistical assumptions. CJE requires monotonicity (higher S → higher expected Y) and sufficient overlap.
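One quick way to sanity-check monotonicity before committing: bin your labeled judge scores and verify that the mean outcome rises across bins. The sketch below is plain numpy on synthetic data (it is not part of the CJE API); substitute your own oracle-labeled slice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labeled slice: judge scores S with expensive outcomes Y.
# (Synthetic; substitute your oracle-labeled subset.)
n = 1000
S = rng.uniform(0, 10, n)
Y = (rng.uniform(size=n) < S / 12).astype(float)  # P(Y=1) rises with S

# Bin S into quintiles and check that E[Y | S-bin] is (roughly) non-decreasing.
edges = np.quantile(S, [0.2, 0.4, 0.6, 0.8])
bins = np.digitize(S, edges)
bin_means = np.array([Y[bins == b].mean() for b in range(5)])
violations = int(np.sum(np.diff(bin_means) < -0.02))  # small tolerance for noise

print("E[Y | S quintile]:", np.round(bin_means, 2))
print("monotonicity violations:", violations)
```

If violations persist on your real data, higher S does not reliably mean higher expected Y, and calibration will struggle no matter how many labels you buy.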
Decision Tree
Q1: Do you have a cheap metric (S) and an expensive outcome (Y)?
Examples: LLM judge scores (S) + retention/revenue (Y), thumbs up (S) + expert audit (Y), style scores (S) + task success (Y)
Yes → Continue to Q2
No → CJE may not help. Consider direct A/B testing or pure LLM-as-judge.
Q2: Can you afford to label 5-25% of your data with ground truth (Y)?
This could be expert annotation, A/B test outcomes, or business KPIs. Typically 200-1000 labels for stable calibration.
Yes → Continue to Q3
No → Use uncalibrated LLM-as-judge (accept bias risk) or invest in labeling infrastructure.
Q3: Do you need faster feedback than A/B tests allow?
A/B tests can take weeks and risk exposing users to bad experiences. CJE runs offline in minutes.
Yes → Continue to Q4
No → A/B testing remains the gold standard if you can afford the time and risk.
Q4: Do you need confidence intervals, not just point estimates?
For high-stakes decisions, you need to know "Policy A: 72% ± 3%" not just "Policy A scored 7.2/10".
Yes → CJE is a good fit.
No → CJE still helps (calibration alone is valuable), but the full value comes from uncertainty quantification.
If you answered "Yes" to Q1-Q4:
CJE can give you reliable policy comparisons with 94% pairwise accuracy (vs. 38% for raw importance sampling) at a fraction of the cost and time of A/B testing.
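To make the calibrate-then-average idea concrete, here is a minimal self-contained sketch in plain numpy on synthetic data (this is an illustration of the idea, not the CJE library's API): fit an isotonic map from judge scores to oracle outcomes on a labeled slice, apply it to cheap eval scores, and bootstrap the calibration set to reflect oracle uncertainty.

```python
import numpy as np

rng = np.random.default_rng(1)

def pava(y):
    """Pool-adjacent-violators: isotonic (non-decreasing) fit to y."""
    blocks = []  # each block: [mean, weight, count]
    for v in np.asarray(y, float):
        blocks.append([v, 1.0, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, c2 = blocks.pop()
            v1, w1, c1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    return np.concatenate([np.full(c, v) for v, _, c in blocks])

# --- Toy data (synthetic; replace with your own) ---
# Labeled calibration slice: judge scores S_cal with oracle outcomes Y_cal.
n_cal = 300
S_cal = rng.uniform(0, 10, n_cal)
Y_cal = (rng.uniform(size=n_cal) < 0.05 + 0.07 * S_cal).astype(float)

# Cheap eval slice: judge scores under the candidate policy (no labels needed).
S_eval = rng.uniform(2, 10, 2000)

# --- Calibrate: isotonic fit of Y on S, applied to the eval scores ---
order = np.argsort(S_cal)
f_hat = pava(Y_cal[order])                        # calibrated E[Y | S] on the grid
calibrated = np.interp(S_eval, S_cal[order], f_hat)
point = calibrated.mean()

# --- Bootstrap the calibration set to reflect oracle uncertainty ---
boots = []
for _ in range(200):
    idx = rng.integers(0, n_cal, n_cal)
    o = np.argsort(S_cal[idx])
    fb = pava(Y_cal[idx][o])
    boots.append(np.interp(S_eval, S_cal[idx][o], fb).mean())
lo, hi = np.quantile(boots, [0.025, 0.975])

print(f"policy value = {point:.3f}  (95% CI [{lo:.3f}, {hi:.3f}])")
```

The output is on the outcome scale ("estimated Y, with a CI"), not on the judge's arbitrary 1-10 scale, which is exactly the "72% ± 3%" style of answer Q4 asks for.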
Check if your data meets the assumptions →

How CJE Compares to Alternatives
| Method | Speed | Cost | Accuracy | Confidence Intervals |
|---|---|---|---|---|
| A/B Testing | Weeks | High (user exposure) | Gold standard | Yes, exact |
| LLM-as-Judge (raw) | Minutes | Low | Uncalibrated, drifts | No |
| Human Annotation | Days-weeks | High (labor) | High (if experts) | Possible but often ignored |
| CJE | Minutes | Low-Medium | 94% pairwise (calibrated) | Yes, with OUA |
CJE vs. A/B Testing
- + No user exposure risk
- + Results in minutes, not weeks
- + Can compare many policies at once
- − Requires calibration assumptions to hold
- − Still need some ground truth labels
CJE vs. Raw LLM-as-Judge
- + Calibrated to actual outcomes
- + Honest confidence intervals
- + Detects when calibration drifts
- − Requires a labeled calibration set
- − More complex setup
Sample Size Requirements
How much data do you need? It depends on how well your surrogate (S) correlates with your outcome (Y)—which is domain-specific. The table below gives rough starting points, but your mileage will vary.
| Tier | Oracle Labels | Eval Samples | CI Width (typical) | Use Case |
|---|---|---|---|---|
| Minimal | 100-200 | 500-1,000 | ±5-8% | Quick directional check |
| Recommended | 300-500 | 1,000-2,000 | ±3-4% | Production decisions |
| Gold | 1,000+ | 5,000+ | ±1-2% | High-stakes, regulatory |
These numbers are domain-specific
If your judge scores correlate strongly with your outcome (r > 0.8), you may need fewer labels. If the correlation is weak (r < 0.5), you'll need more—or a better surrogate. Start with the "Minimal" tier and use OUA diagnostics to guide whether to invest in more labels.
Oracle labels are expensive ground truth (expert audits, A/B outcomes, business KPIs). These train the calibration function.
Eval samples are cheap judge scores applied to all policies. More samples = tighter confidence intervals.
Rule of thumb: When CIs are dominated by OUA (calibration uncertainty), add more labels. When CIs are dominated by sampling variance, add more eval samples.
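This rule of thumb can be mechanized. Assuming you can estimate the two variance components for a policy estimate (e.g., via the delta method or a bootstrap; the numbers below are purely illustrative), their ratio tells you where to spend:

```python
import numpy as np

# Hypothetical variance components for one policy estimate (illustrative values):
# var_sampling -> from averaging calibrated scores over the eval samples
# var_oracle   -> from refitting the calibration on resampled oracle labels
var_sampling = 0.0004
var_oracle = 0.0016

total_se = np.sqrt(var_sampling + var_oracle)
oracle_share = var_oracle / (var_sampling + var_oracle)

print(f"total SE = {total_se:.3f}, oracle share = {oracle_share:.0%}")
if oracle_share > 0.5:
    print("CI dominated by calibration uncertainty: buy more oracle labels")
else:
    print("CI dominated by sampling variance: add more eval samples")
```

The 50% threshold is a heuristic; the point is that the two components shrink with different budgets (labels vs. eval samples), so decomposing the CI tells you which budget to grow.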
LLM Judges Are Programmable Proxies
Unlike fixed metrics, you can improve your surrogate to correlate better with Y:
- Adjust judge prompts: Review large residuals (where S predicted poorly) and refine your rubric
- Change judge models: Try different models or ensembles to improve monotonicity with Y
- Add covariates: Include response length, topic, or other features in two-stage calibration
Warning: Don't Fool Yourself
Separate optimization from measurement. If you tune your judge on the same data you use to estimate policy value, you'll overfit and get biased estimates. Use a train/test split or k-fold cross-validation: tune the judge on one slice, measure policies on a held-out slice. ML practitioners will recognize this from hyperparameter tuning—the principle is identical.
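The split can be sketched as k-fold cross-fitting: tune on k−1 folds, measure on the held-out fold, so tuning never touches measurement data. The "tuning" step below is a hypothetical placeholder (a linear fit stands in for prompt or model adjustments); the point is the index hygiene, not the model.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy labeled dataset: judge scores S and oracle outcomes Y (synthetic).
n, k = 600, 3
S = rng.uniform(0, 10, n)
Y = (rng.uniform(size=n) < S / 12).astype(float)

# k-fold cross-fitting: "tune" on the other folds, evaluate on the held-out fold.
folds = np.array_split(rng.permutation(n), k)
held_out_estimates = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # Placeholder tuning: fit a linear map Y ~ S on the training folds only.
    a, b = np.polyfit(S[train_idx], Y[train_idx], 1)
    # Measure only on indices the tuning step never saw.
    held_out_estimates.append((a * S[test_idx] + b).mean())

print("per-fold held-out estimates:", np.round(held_out_estimates, 3))
```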
Mode-Specific Guidance
Direct Method (DM)
Most efficient. Start with 300-500 labels (Recommended tier) + 1-2k eval samples per policy.
IPS
ESS matters more than raw N. If ESS < 30% after SIMCal, improve overlap before adding data.
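ESS can be computed directly from the importance weights. A sketch with synthetic log-normal weights (the 30% threshold mirrors the guidance above; nothing here is the CJE API):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy importance weights w_i = pi_target(a_i|x_i) / pi_logging(a_i|x_i).
# A log-normal with sigma=2 simulates poor overlap (heavy-tailed weights).
w = rng.lognormal(mean=0.0, sigma=2.0, size=5000)

# Effective sample size: how many equivalent i.i.d. samples the weights give you.
ess = w.sum() ** 2 / (w ** 2).sum()
ess_frac = ess / len(w)

print(f"ESS = {ess:.0f} of {len(w)} ({ess_frac:.0%})")
if ess_frac < 0.30:
    print("Poor overlap: improve it (or switch to DR) before adding data")
```

This is why raw N can mislead: a few huge weights dominate the sum of squares, so thousands of logged samples can carry the information of only a few dozen.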
DR
Helps when outcome model is strong. Worth the extra complexity if overlap is imperfect.
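A stylized sketch of why DR helps: it adds an importance-weighted residual correction to the outcome model, so it stays accurate when either the model or the weights are good. Everything below is synthetic and illustrative, not the CJE implementation.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000

# Toy logged data: contexts x, logged rewards y, self-normalized importance
# weights w, and an outcome model g(x) for the target policy. True value = 0.6.
x = rng.uniform(0, 1, n)
y = 0.6 + 0.4 * (x - 0.5) + rng.normal(0, 0.2, n)
w = rng.lognormal(0.0, 1.0, n)
w = w / w.mean()
g = 0.6 + 0.4 * (x - 0.5)  # a strong outcome model

dm = g.mean()                    # Direct Method: outcome model only
ips = (w * y).mean()             # IPS: weights only
dr = dm + (w * (y - g)).mean()   # DR: model plus weighted residual correction

print(f"DM={dm:.3f}  IPS={ips:.3f}  DR={dr:.3f}")
```

In this well-specified toy all three land near 0.6; DR's value shows up when one of g or w is misspecified, since the other component corrects the bias.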
When NOT to Use CJE
No cheap-expensive metric pair
If you only have one type of metric (all expensive or all cheap), CJE's calibration approach doesn't apply.
Zero labeling budget
CJE requires some ground truth labels for calibration. If you can't afford any, use raw judge scores (but accept the bias).
Cheap metric doesn't correlate with expensive outcome
If your judge scores have no predictive relationship with your KPI, calibration won't help. You need a better surrogate.
A/B test is fast and low-risk for you
If you can run A/B tests quickly without significant user risk, that remains the gold standard. CJE is most valuable when A/B testing is slow or risky.
