Stop shipping based on eval scores that don't predict production
Your eval scores aren't calibrated to your actual KPIs. CJE fixes this—giving you audit-ready estimates with confidence intervals you can trust for deployment decisions.
Real Failure Mode
The judge preferred verbose, polite responses. Users wanted speed. Extra tokens = higher latency = bounces. The judge couldn't see what mattered.
What most teams do (and why it fails)
The standard heuristic:
1. Run two models/prompts on your eval set
2. Score outputs with LLM-as-judge (0-10 scale)
3. Compare average scores (8.2 vs 7.8)
4. Ship the higher-scoring one
Simple. Fast. Feels data-driven. Completely unreliable.
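In code, the whole heuristic is two sample means and a comparison; a minimal sketch with made-up judge scores:

```python
import numpy as np

# Made-up judge scores (0-10) for two prompt variants on the same eval set.
scores_a = np.array([8.5, 7.9, 8.3, 8.1, 8.2])
scores_b = np.array([7.6, 8.0, 7.9, 7.7, 7.8])

# The standard heuristic: compare raw means and ship the winner.
# No KPI calibration, no confidence interval, no check that both variants
# were scored on comparably hard queries.
print(f"A: {scores_a.mean():.2f}  B: {scores_b.mean():.2f}")
if scores_a.mean() > scores_b.mean():
    print("Ship A")  # a 0.4-point gap that may be pure noise
```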
Wrong scale
A score of 8/10 might mean 40% conversion, or 8%, or 73%. You have no idea. Judge scores aren't on your KPI scale.
No uncertainty
Is +0.4 points real or noise? Without confidence intervals, you're guessing. No statistical rigor = no launch confidence.
Not causal
Comparing different prompt sets or logged data? Your "8.2" came from easy queries, "7.8" from hard ones. Selection bias ruins everything.
Large-scale studies confirm that LLM judges show high variance across tasks, can differ from human scores by 5+ points even when raw agreement looks high, and perform worse on model-generated text (Bavaresco et al., 2024; Thakur et al., 2025).
What CJE gives you instead
Turn unreliable judge scores into statistically valid deployment decisions
Calibration
Judge scores → KPI units
Label 200-1000 examples with ground-truth outcomes. Learn the mapping: "Score 8 = 23% conversion." Now estimates are interpretable.
AutoCal-R: mean-preserving, monotone
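AutoCal-R is CJE's own calibrator. As a rough sketch of the underlying idea only (monotone regression from judge score to KPI on made-up numbers, not the actual AutoCal-R algorithm, which also preserves the mean):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical oracle slice: judge scores paired with ground-truth outcomes
# (here, observed conversion rates). A few hundred labels is typically enough.
judge_scores = np.array([3.0, 5.0, 6.0, 6.5, 7.0, 8.0, 8.5, 9.0, 9.5])
outcomes     = np.array([0.02, 0.05, 0.08, 0.10, 0.14, 0.23, 0.27, 0.34, 0.41])

# Monotone map: a higher judge score never gets a lower KPI estimate.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(judge_scores, outcomes)

# "Score 8" now has a KPI interpretation instead of an arbitrary unit.
print(calibrator.predict([8.0, 9.0]))  # ~[0.23, 0.34]
```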
Honest Uncertainty
95% CIs you can trust
Accounts for both sampling noise AND calibration uncertainty. Know when a difference is real enough to ship.
OUA: oracle-uncertainty aware intervals
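A back-of-envelope sketch of why both sources matter (illustrative numbers, not CJE's OUA procedure): the standard error should combine sampling variance across prompts with the extra variance from fitting the calibrator on a finite oracle-labeled set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical calibrated (KPI-scale) scores for one policy on 500 eval prompts.
calibrated = rng.uniform(0.1, 0.4, size=500)

# Source 1: sampling variance of the mean across eval prompts.
sampling_var = calibrated.var(ddof=1) / len(calibrated)

# Source 2: uncertainty in the calibration map itself, e.g. estimated by
# refitting the calibrator on bootstrap resamples of the oracle labels.
# (Illustrative fixed value here.)
calibration_var = 1e-4

# Ignoring calibration_var gives intervals that are too narrow; an
# oracle-uncertainty-aware interval combines both sources.
se = np.sqrt(sampling_var + calibration_var)
v_hat = calibrated.mean()
print(f"{v_hat:.3f} [{v_hat - 1.96 * se:.3f}, {v_hat + 1.96 * se:.3f}] (95% CI)")
```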
Off-Policy Correction
Reuse logged data safely
Comparing different models/prompts? Stabilized importance weights fix distribution shift without exploding variance.
SIMCal: score-indexed, mean-one stabilization
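SIMCal is CJE's stabilization scheme; the starting point it builds on, mean-one (self-normalized) importance weights over teacher-forced log-probabilities, looks roughly like this sketch with synthetic numbers:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000

# Per-example log-probabilities of the LOGGED responses under the logging
# policy and under the candidate policy (computed via teacher forcing).
logp_logging   = rng.normal(-40.0, 3.0, size=n)
logp_candidate = logp_logging + rng.normal(0.0, 1.0, size=n)

# Raw importance weights correct the distribution shift but can explode.
raw_w = np.exp(logp_candidate - logp_logging)

# Normalizing to mean one keeps the estimate on the right scale; SIMCal goes
# further and projects the weights onto a monotone function of the judge
# score to control variance (not shown here).
w = raw_w / raw_w.mean()

calibrated_rewards = rng.uniform(0.0, 1.0, size=n)  # KPI-scale rewards (made up)
v_ips = np.mean(w * calibrated_rewards)             # off-policy value estimate
print(v_ips)
```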
Example output:
Baseline: 0.23 [0.21, 0.25] purchase probability (95% CI)
New prompt: 0.26 [0.24, 0.28] purchase probability (95% CI)
Difference: +3pp [+0.5pp, +5.5pp]
Decision: Ship (interval excludes zero, meaningful lift)
Tested on 4,989 real Arena evaluations
On realistic policy comparisons, standard off-policy weights collapse to an effective sample size below 1%. CJE recovers up to 158× more usable signal.
ESS recovery by policy comparison
| Policy | ESS before CJE | ESS after CJE | Improvement |
|---|---|---|---|
| Prompt Variant | 0.6% | 94.6% | 158× |
| Premium Model | 0.7% | 80.8% | 115× |
| Clone (A/A test) | 26.2% | 98.8% | 3.8× |
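ESS here is the effective sample size of the importance weights, expressed as a fraction of the logged data; the sketch below assumes the standard definition ESS = (Σw)² / Σw².

```python
import numpy as np

def ess_fraction(weights) -> float:
    """Effective sample size of importance weights, as a fraction of n.

    ESS = (sum w)^2 / sum(w^2). Near 100% means nearly every logged example
    contributes; near 0% means a few extreme weights dominate the estimate.
    """
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / ((w ** 2).sum() * len(w)))

# A handful of extreme weights crushes the ESS, which is what the
# "before" column reflects.
raw = np.array([0.1] * 98 + [40.0, 60.0])
print(f"{ess_fraction(raw):.1%}")  # ~2% of the logged data effectively used
```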
Three modes for different situations
Pick the right method for your data and constraints
Direct Method (DM)
Generate fresh outputs from each policy on the same prompts. Simplest, most reliable.
When: You can generate for all candidates
Output: V̂(π) ± OUA confidence intervals
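A sketch of the arithmetic (made-up calibrated scores, not the CJE API): generate with both policies on the same prompts, calibrate the judge scores, then compare paired means.

```python
import numpy as np

# Made-up calibrated judge scores (already in KPI units) for fresh outputs
# generated by each policy on the SAME five eval prompts.
kpi_baseline = np.array([0.21, 0.25, 0.19, 0.24, 0.23])
kpi_new      = np.array([0.24, 0.28, 0.23, 0.27, 0.26])

# Direct Method estimate: mean calibrated score per policy, with a paired
# difference because both policies saw identical prompts.
diff = kpi_new - kpi_baseline
se = diff.std(ddof=1) / np.sqrt(len(diff))
print(f"V(baseline)={kpi_baseline.mean():.3f}  V(new)={kpi_new.mean():.3f}")
print(f"diff={diff.mean():+.3f} ± {1.96 * se:.3f}")
```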
IPS (Off-Policy)
Reweight logged data using likelihood ratios, stabilized with SIMCal. Reuse historical data.
When: You have logged outputs with teacher-forced log-probabilities under each policy
Output: V̂IPS ± OUA + ESS/tails diagnostics
DR (Doubly Robust)
Combine IPS with outcome models. Tighter CIs when overlap is imperfect.
When: Weak overlap but you can train a critic
Output: V̂DR ± OUA + orthogonality check
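A sketch of the doubly-robust combination on synthetic arrays (not CJE's estimator): an outcome-model term plus a weighted residual correction, consistent if either the weights or the critic is well specified.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000

# Mean-one importance weights for the logged responses (as in the IPS mode).
w = rng.lognormal(sigma=0.5, size=n)
w /= w.mean()

# Calibrated rewards of the logged responses, plus a critic (outcome model)
# that predicts reward for both the logged response and the target policy's
# response to the same prompt. All synthetic here.
r_logged = rng.uniform(0.0, 1.0, size=n)
q_logged = np.clip(r_logged + rng.normal(0.0, 0.1, size=n), 0.0, 1.0)
q_target = rng.uniform(0.0, 1.0, size=n)

# Doubly-robust value: outcome-model term plus a weighted residual correction.
# The correction tames the variance IPS suffers when overlap is weak.
v_dr = np.mean(q_target + w * (r_logged - q_logged))
print(v_dr)
```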
When to use CJE vs A/B testing
Use CJE When
- A/B tests take weeks and you need answers now
- You're evaluating 10+ model/prompt variants
- Limited production traffic (can't split safely)
- Can't A/B test (compliance, safety, low-traffic segments)
- You have historical logs you want to reuse
Still A/B Test When
- Only 1-2 candidates and plenty of traffic
- KPI is easy to measure online (clicks, conversions)
- High-stakes launch needing absolute certainty
- Judge can't observe critical features (latency, UI)
CJE isn't a replacement for A/B tests—it's a complement that lets you iterate faster offline before committing to expensive online experiments.
Get started in 5 minutes
Installation
What you get
- Point estimates in KPI units with 95% CIs
- Calibration reliability plots
- Coverage and overlap diagnostics
- ESS and tail heaviness checks (for IPS/DR)
- Orthogonality tests (for DR)
- Refuse-to-estimate flags when unreliable