TL;DR: Use CJE when you have cheap metrics (judge scores) that correlate with your KPI but aren't calibrated to it, and you can afford to label 5-25% of your data with expensive ground truth. CJE gives you the speed of offline evaluation with statistical rigor closer to A/B tests.
Step 0: Before using this decision guide, verify your data meets the statistical assumptions. CJE requires monotonicity (higher S → higher expected Y) and sufficient overlap.
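One quick way to sanity-check monotonicity before committing: bin your labeled judge scores and verify that the mean outcome rises across bins. The sketch below is plain numpy on synthetic data (it is not part of the CJE API); substitute your own oracle-labeled slice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labeled slice: judge scores S with expensive outcomes Y.
# (Synthetic; substitute your oracle-labeled subset.)
n = 1000
S = rng.uniform(0, 10, n)
Y = (rng.uniform(size=n) < S / 12).astype(float)  # P(Y=1) rises with S

# Bin S into quintiles and check that E[Y | S-bin] is (roughly) non-decreasing.
edges = np.quantile(S, [0.2, 0.4, 0.6, 0.8])
bins = np.digitize(S, edges)
bin_means = np.array([Y[bins == b].mean() for b in range(5)])
violations = int(np.sum(np.diff(bin_means) < -0.02))  # small tolerance for noise

print("E[Y | S quintile]:", np.round(bin_means, 2))
print("monotonicity violations:", violations)
```

If violations persist on your real data, higher S does not reliably mean higher expected Y, and calibration will struggle no matter how many labels you buy.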
Decision Tree
Q1: Do you have a cheap metric (S) and an expensive outcome (Y)?
Examples: LLM judge scores (S) + retention/revenue (Y), thumbs up (S) + expert audit (Y), style scores (S) + task success (Y)
Yes → Continue to Q2
No → CJE may not help. Consider direct A/B testing or pure LLM-as-judge.
Q2: Can you afford to label 5-25% of your data with ground truth (Y)?
This could be expert annotation, A/B test outcomes, or business KPIs. Typically 200-1000 labels for stable calibration.
Yes → Continue to Q3
No → Use uncalibrated LLM-as-judge (accept bias risk) or invest in labeling infrastructure.
Q3: Do you need faster feedback than A/B tests allow?
A/B tests can take weeks and risk exposing users to bad experiences. CJE runs offline in minutes.
Yes → Continue to Q4
No → A/B testing remains the gold standard if you can afford the time and risk.
Q4: Do you need confidence intervals, not just point estimates?
For high-stakes decisions, you need to know "Policy A: 72% ± 3%" not just "Policy A scored 7.2/10".
Yes → CJE is a good fit.
No → CJE still helps (calibration alone is valuable), but the full value comes from uncertainty quantification.
If you answered "Yes" to Q1-Q4:
CJE can give you reliable policy comparisons with 94% pairwise accuracy (vs. 38% for raw importance sampling) at a fraction of the cost and time of A/B testing.
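To make the calibrate-then-average idea concrete, here is a minimal self-contained sketch in plain numpy on synthetic data (this is an illustration of the idea, not the CJE library's API): fit an isotonic map from judge scores to oracle outcomes on a labeled slice, apply it to cheap eval scores, and bootstrap the calibration set to reflect oracle uncertainty.

```python
import numpy as np

rng = np.random.default_rng(1)

def pava(y):
    """Pool-adjacent-violators: isotonic (non-decreasing) fit to y."""
    blocks = []  # each block: [mean, weight, count]
    for v in np.asarray(y, float):
        blocks.append([v, 1.0, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, c2 = blocks.pop()
            v1, w1, c1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    return np.concatenate([np.full(c, v) for v, _, c in blocks])

# --- Toy data (synthetic; replace with your own) ---
# Labeled calibration slice: judge scores S_cal with oracle outcomes Y_cal.
n_cal = 300
S_cal = rng.uniform(0, 10, n_cal)
Y_cal = (rng.uniform(size=n_cal) < 0.05 + 0.07 * S_cal).astype(float)

# Cheap eval slice: judge scores under the candidate policy (no labels needed).
S_eval = rng.uniform(2, 10, 2000)

# --- Calibrate: isotonic fit of Y on S, applied to the eval scores ---
order = np.argsort(S_cal)
f_hat = pava(Y_cal[order])                        # calibrated E[Y | S] on the grid
calibrated = np.interp(S_eval, S_cal[order], f_hat)
point = calibrated.mean()

# --- Bootstrap the calibration set to reflect oracle uncertainty ---
boots = []
for _ in range(200):
    idx = rng.integers(0, n_cal, n_cal)
    o = np.argsort(S_cal[idx])
    fb = pava(Y_cal[idx][o])
    boots.append(np.interp(S_eval, S_cal[idx][o], fb).mean())
lo, hi = np.quantile(boots, [0.025, 0.975])

print(f"policy value = {point:.3f}  (95% CI [{lo:.3f}, {hi:.3f}])")
```

The output is on the outcome scale ("estimated Y, with a CI"), not on the judge's arbitrary 1-10 scale, which is exactly the "72% ± 3%" style of answer Q4 asks for.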
Check if your data meets the assumptions →

How CJE Compares to Alternatives
| Method | Speed | Cost | Accuracy | Confidence Intervals |
|---|---|---|---|---|
| A/B Testing | Weeks | High (user exposure) | Gold standard | Yes, exact |
| LLM-as-Judge (raw) | Minutes | Low | Uncalibrated, drifts | No |
| Human Annotation | Days-weeks | High (labor) | High (if experts) | Possible but often ignored |
| CJE | Minutes | Low-Medium | 94% pairwise (calibrated) | Yes, with OUA |
CJE vs. A/B Testing
- + No user exposure risk
- + Results in minutes, not weeks
- + Can compare many policies at once
- − Requires calibration assumptions to hold
- − Still need some ground truth labels
CJE vs. Raw LLM-as-Judge
- + Calibrated to actual outcomes
- + Honest confidence intervals
- + Detects when calibration drifts
- − Requires a labeled calibration set
- − More complex setup
Sample Size Requirements
How much data do you need? It depends on how well your surrogate (S) correlates with your outcome (Y)—which is domain-specific. The table below gives rough starting points, but your mileage will vary.
| Tier | Oracle Labels | Eval Samples | CI Width (typical) | Use Case |
|---|---|---|---|---|
| Minimal | 100-200 | 500-1,000 | ±5-8% | Quick directional check |
| Recommended | 300-500 | 1,000-2,000 | ±3-4% | Production decisions |
| Gold | 1,000+ | 5,000+ | ±1-2% | High-stakes, regulatory |
These numbers are domain-specific
If your judge scores correlate strongly with your outcome (r > 0.8), you may need fewer labels. If the correlation is weak (r < 0.5), you'll need more—or a better surrogate. Start with the "Minimal" tier and use OUA diagnostics to guide whether to invest in more labels.
Oracle labels are expensive ground truth (expert audits, A/B outcomes, business KPIs). These train the calibration function.
Eval samples are cheap judge scores applied to all policies. More samples = tighter confidence intervals.
Rule of thumb: When CIs are dominated by OUA (calibration uncertainty), add more labels. When CIs are dominated by sampling variance, add more eval samples.
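This rule of thumb can be mechanized. Assuming you can estimate the two variance components for a policy estimate (e.g., via the delta method or a bootstrap; the numbers below are purely illustrative), their ratio tells you where to spend:

```python
import numpy as np

# Hypothetical variance components for one policy estimate (illustrative values):
# var_sampling -> from averaging calibrated scores over the eval samples
# var_oracle   -> from refitting the calibration on resampled oracle labels
var_sampling = 0.0004
var_oracle = 0.0016

total_se = np.sqrt(var_sampling + var_oracle)
oracle_share = var_oracle / (var_sampling + var_oracle)

print(f"total SE = {total_se:.3f}, oracle share = {oracle_share:.0%}")
if oracle_share > 0.5:
    print("CI dominated by calibration uncertainty: buy more oracle labels")
else:
    print("CI dominated by sampling variance: add more eval samples")
```

The 50% threshold is a heuristic; the point is that the two components shrink with different budgets (labels vs. eval samples), so decomposing the CI tells you which budget to grow.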
LLM Judges Are Programmable Proxies
Unlike fixed metrics, you can improve your surrogate to correlate better with Y:
- Adjust judge prompts: Review large residuals (where S predicted poorly) and refine your rubric
- Change judge models: Try different models or ensembles to improve monotonicity with Y
- Add covariates: Include response length, topic, or other features in two-stage calibration
Warning: Don't Fool Yourself
Separate optimization from measurement. If you tune your judge on the same data you use to estimate policy value, you'll overfit and get biased estimates. Use a train/test split or k-fold cross-validation: tune the judge on one slice, measure policies on a held-out slice. ML practitioners will recognize this from hyperparameter tuning—the principle is identical.
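The split can be sketched as k-fold cross-fitting: tune on k−1 folds, measure on the held-out fold, so tuning never touches measurement data. The "tuning" step below is a hypothetical placeholder (a linear fit stands in for prompt or model adjustments); the point is the index hygiene, not the model.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy labeled dataset: judge scores S and oracle outcomes Y (synthetic).
n, k = 600, 3
S = rng.uniform(0, 10, n)
Y = (rng.uniform(size=n) < S / 12).astype(float)

# k-fold cross-fitting: "tune" on the other folds, evaluate on the held-out fold.
folds = np.array_split(rng.permutation(n), k)
held_out_estimates = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # Placeholder tuning: fit a linear map Y ~ S on the training folds only.
    a, b = np.polyfit(S[train_idx], Y[train_idx], 1)
    # Measure only on indices the tuning step never saw.
    held_out_estimates.append((a * S[test_idx] + b).mean())

print("per-fold held-out estimates:", np.round(held_out_estimates, 3))
```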
Mode-Specific Guidance
Direct Method (DM)
Most efficient. Start with 300-500 labels (Recommended tier) + 1-2k eval samples per policy.
IPS
ESS matters more than raw N. If ESS < 30% after SIMCal, improve overlap before adding data.
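ESS can be computed directly from the importance weights. A sketch with synthetic log-normal weights (the 30% threshold mirrors the guidance above; nothing here is the CJE API):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy importance weights w_i = pi_target(a_i|x_i) / pi_logging(a_i|x_i).
# A log-normal with sigma=2 simulates poor overlap (heavy-tailed weights).
w = rng.lognormal(mean=0.0, sigma=2.0, size=5000)

# Effective sample size: how many equivalent i.i.d. samples the weights give you.
ess = w.sum() ** 2 / (w ** 2).sum()
ess_frac = ess / len(w)

print(f"ESS = {ess:.0f} of {len(w)} ({ess_frac:.0%})")
if ess_frac < 0.30:
    print("Poor overlap: improve it (or switch to DR) before adding data")
```

This is why raw N can mislead: a few huge weights dominate the sum of squares, so thousands of logged samples can carry the information of only a few dozen.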
DR
Helps when outcome model is strong. Worth the extra complexity if overlap is imperfect.
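A stylized sketch of why DR helps: it adds an importance-weighted residual correction to the outcome model, so it stays accurate when either the model or the weights are good. Everything below is synthetic and illustrative, not the CJE implementation.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000

# Toy logged data: contexts x, logged rewards y, self-normalized importance
# weights w, and an outcome model g(x) for the target policy. True value = 0.6.
x = rng.uniform(0, 1, n)
y = 0.6 + 0.4 * (x - 0.5) + rng.normal(0, 0.2, n)
w = rng.lognormal(0.0, 1.0, n)
w = w / w.mean()
g = 0.6 + 0.4 * (x - 0.5)  # a strong outcome model

dm = g.mean()                    # Direct Method: outcome model only
ips = (w * y).mean()             # IPS: weights only
dr = dm + (w * (y - g)).mean()   # DR: model plus weighted residual correction

print(f"DM={dm:.3f}  IPS={ips:.3f}  DR={dr:.3f}")
```

In this well-specified toy all three land near 0.6; DR's value shows up when one of g or w is misspecified, since the other component corrects the bias.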
When NOT to Use CJE
No cheap-expensive metric pair
If you only have one type of metric (all expensive or all cheap), CJE's calibration approach doesn't apply.
Zero labeling budget
CJE requires some ground truth labels for calibration. If you can't afford any, use raw judge scores (but accept the bias).
Cheap metric doesn't correlate with expensive outcome
If your judge scores have no predictive relationship with your KPI, calibration won't help. You need a better surrogate.
A/B test is fast and low-risk for you
If you can run A/B tests quickly without significant user risk, that remains the gold standard. CJE is most valuable when A/B testing is slow or risky.
