CJE in 5 Minutes
The short version: why your LLM metrics are broken and how calibration fixes them.
The Problem
You're evaluating LLM outputs. You have two options:
- Cheap metrics: LLM-as-judge, thumbs up/down, automated scores. Fast and scalable, but gameable and often wrong.
- Expensive labels: expert audits, A/B test outcomes, user retention. Accurate, but slow and costly at scale.
Most teams pick one: either they scale cheap metrics and hope for the best, or they burn budget on expensive labels and can only evaluate a fraction of what they ship.
Both approaches fail on their own: cheap metrics can be wrong on 40% of ranking decisions, and expensive labels alone don't scale.
The Solution
Causal Judge Evaluation (CJE) uses both. The core idea:
1. Collect expensive labels on a small sample (5-10% of your data).
2. Learn the mapping from cheap metrics → expensive labels.
3. Apply that mapping to all your data.
4. Get confidence intervals that account for the calibration uncertainty.
The result: estimates that track what you actually care about, at scale, with honest error bars.
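To make the four steps concrete, here is a minimal from-scratch sketch. It is not the cje-eval API: it uses isotonic regression from scikit-learn on synthetic data, every variable name is invented for the example, and the bootstrap interval reflects only the uncertainty from the small calibration sample.

```python
# A from-scratch sketch of the four steps (not the CJE library API).
# All data here is synthetic; variable names are illustrative.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Step 1: cheap judge scores for every item, oracle labels for a ~5% sample.
n_total = 5_000
judge_scores = rng.uniform(0, 1, n_total)
cal_idx = rng.choice(n_total, size=250, replace=False)
x_cal = judge_scores[cal_idx]
y_cal = (x_cal ** 2 + rng.normal(0, 0.05, 250)).clip(0, 1)  # pretend oracle labels

# Step 2: learn a monotone map from cheap scores to oracle labels.
calibrator = IsotonicRegression(out_of_bounds="clip").fit(x_cal, y_cal)

# Step 3: apply the map to all 5,000 items.
estimate = calibrator.predict(judge_scores).mean()

# Step 4: bootstrap the calibration sample to reflect calibration uncertainty.
boot = []
for _ in range(1_000):
    b = rng.integers(0, len(x_cal), len(x_cal))
    cal_b = IsotonicRegression(out_of_bounds="clip").fit(x_cal[b], y_cal[b])
    boot.append(cal_b.predict(judge_scores).mean())
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"calibrated estimate {estimate:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Isotonic regression is used in this sketch because it is a simple monotone map: calibration can rescale the cheap judge's scores, but it never reverses their ordering.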
Concrete Example
You're comparing 5 prompt variants. Running GPT-5 as your "oracle" judge on all 5,000 test cases would cost $500 and take hours.
Instead:
- Run a cheap judge (GPT-4.1 Nano) on all 5,000 cases: fast and cheap.
- Run the expensive oracle on 250 cases (5%): this is your calibration sample.
- CJE learns how the cheap judge's scores map to oracle scores.
- Apply the calibration to get accurate estimates for all 5 variants.
Result: 99% pairwise ranking accuracy at 14× lower cost than full oracle labeling. You correctly identify which prompt variant is best, with confidence intervals you can defend.
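If you want to see that workflow in code, the sketch below shows one hedged way to reproduce its shape (again, not the cje-eval API): fit a judge-to-oracle calibration map on the small oracle sample, then use it to score and rank synthetic prompt variants.

```python
# Illustrative sketch (not the cje-eval API): rank prompt variants by
# calibrated judge score. Variant names and data are synthetic.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)

# 250-case calibration sample: cheap judge score paired with an oracle label.
judge_cal = rng.uniform(0, 1, 250)
oracle_cal = (judge_cal ** 2 + rng.normal(0, 0.05, 250)).clip(0, 1)
calibrator = IsotonicRegression(out_of_bounds="clip").fit(judge_cal, oracle_cal)

# Cheap judge scores for 5 prompt variants, 1,000 cases each (synthetic).
variants = {f"prompt_v{i}": rng.uniform(0.3 + 0.05 * i, 0.9, 1_000) for i in range(5)}

# Calibrated estimate per variant, ranked best to worst.
estimates = {name: calibrator.predict(s).mean() for name, s in variants.items()}
for name, est in sorted(estimates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: calibrated score {est:.3f}")
```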
Why Use CJE?
- Cut evaluation costs: 14× cheaper than labeling everything with your expensive oracle. Calibrate on 5% of samples, apply at scale.
- Produce auditable results: valid confidence intervals you can defend to stakeholders. Know when your numbers are trustworthy, and when they're not.
Next Steps
Ready to try it?
```bash
pip install cje-eval
```