CJE in 5 Minutes
The short version: why your LLM metrics are broken and how calibration fixes them.
The Problem
You're evaluating LLM outputs. You have two options:
- Cheap metrics: LLM-as-judge, thumbs up/down, automated scores. Fast and scalable, but gameable and often wrong.
- Expensive labels: expert audits, A/B test outcomes, user retention. Accurate, but slow and costly at scale.
Most teams pick one: either they scale cheap metrics and hope for the best, or they burn budget on expensive labels and can only evaluate a fraction of what they ship.
Both approaches fail on their own: cheap metrics can be wrong on 40% of ranking decisions, and expensive labels alone don't scale.
The Solution
Causal Judge Evaluation (CJE) uses both. The core idea:
1. Collect expensive labels on a small sample (5-10% of your data).
2. Learn the mapping from cheap metrics → expensive labels.
3. Apply that mapping to all your data.
4. Get confidence intervals that account for the calibration uncertainty.
The result: estimates that track what you actually care about, at scale, with honest error bars.
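To make the four steps concrete, here is a minimal from-scratch sketch. It is not the cje-eval API: it uses isotonic regression from scikit-learn on synthetic data, every variable name is invented for the example, and the bootstrap interval reflects only the uncertainty from the small calibration sample.

```python
# A from-scratch sketch of the four steps (not the CJE library API).
# All data here is synthetic; variable names are illustrative.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Step 1: cheap judge scores for every item, oracle labels for a ~5% sample.
n_total = 5_000
judge_scores = rng.uniform(0, 1, n_total)
cal_idx = rng.choice(n_total, size=250, replace=False)
x_cal = judge_scores[cal_idx]
y_cal = (x_cal ** 2 + rng.normal(0, 0.05, 250)).clip(0, 1)  # pretend oracle labels

# Step 2: learn a monotone map from cheap scores to oracle labels.
calibrator = IsotonicRegression(out_of_bounds="clip").fit(x_cal, y_cal)

# Step 3: apply the map to all 5,000 items.
estimate = calibrator.predict(judge_scores).mean()

# Step 4: bootstrap the calibration sample to reflect calibration uncertainty.
boot = []
for _ in range(1_000):
    b = rng.integers(0, len(x_cal), len(x_cal))
    cal_b = IsotonicRegression(out_of_bounds="clip").fit(x_cal[b], y_cal[b])
    boot.append(cal_b.predict(judge_scores).mean())
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"calibrated estimate {estimate:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Isotonic regression is used in this sketch because it is a simple monotone map: calibration can rescale the cheap judge's scores, but it never reverses their ordering.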
Concrete Example
You're comparing 5 prompt variants. Running GPT-5 as your "oracle" judge on all 5,000 test cases would cost $500 and take hours.
Instead:
- Run a cheap judge (GPT-4.1 Nano) on all 5,000 cases: fast and cheap.
- Run the expensive oracle on 250 cases (5%): this is your calibration sample.
- CJE learns how the cheap judge's scores map to oracle scores.
- Apply the calibration to get accurate estimates for all 5 variants.
Result: 99% pairwise ranking accuracy at 14× lower cost than full oracle labeling. You correctly identify which prompt variant is best, with confidence intervals you can defend.
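If you want to see that workflow in code, the sketch below shows one hedged way to reproduce its shape (again, not the cje-eval API): fit a judge-to-oracle calibration map on the small oracle sample, then use it to score and rank synthetic prompt variants.

```python
# Illustrative sketch (not the cje-eval API): rank prompt variants by
# calibrated judge score. Variant names and data are synthetic.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)

# 250-case calibration sample: cheap judge score paired with an oracle label.
judge_cal = rng.uniform(0, 1, 250)
oracle_cal = (judge_cal ** 2 + rng.normal(0, 0.05, 250)).clip(0, 1)
calibrator = IsotonicRegression(out_of_bounds="clip").fit(judge_cal, oracle_cal)

# Cheap judge scores for 5 prompt variants, 1,000 cases each (synthetic).
variants = {f"prompt_v{i}": rng.uniform(0.3 + 0.05 * i, 0.9, 1_000) for i in range(5)}

# Calibrated estimate per variant, ranked best to worst.
estimates = {name: calibrator.predict(s).mean() for name, s in variants.items()}
for name, est in sorted(estimates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: calibrated score {est:.3f}")
```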
Why Use CJE?
- Cut evaluation costs: 14× cheaper than labeling everything with your expensive oracle. Calibrate on 5% of samples, apply at scale.
- Produce auditable results: valid confidence intervals you can defend to stakeholders. Know when your numbers are trustworthy, and when they're not.
Next Steps
Ready to try it?
```bash
pip install cje-eval
```