Learning Path
Master CJE at your own pace. Start with the basics and go as deep as you need.
How to use this guide: Follow the path sequentially. Each step builds on the previous one. Stop when you have what you need—whether that's a working implementation or a deep understanding of the theory.
Start: Why Your Metrics Are Lying
Zero math. Just concrete failure stories and the core insight: treat cheap metrics as a currency that must be exchanged into real value. Understand the "Judge-Dollars to Value-Dollars" analogy and the deliberation ladder.
Read: Your AI Metrics Are Lying to You →
⏱️ 7 minutes • Zero equations • Accessible to executives and PMs
Align Generation & Evaluation
Before measuring, align your prompts and judges to the same welfare target (Y*). Copy-paste templates, two-week rollout plan, and diagnostics checklist. Zero math.
Read: Your AI Is Optimizing the Wrong Thing →
⏱️ 18 minutes • Get: Generation prompt template, judge rubric, rollout plan
Implementation Guide
Practical guide with no equations: which estimators work, when to use them, and how much data you need. Actionable defaults and concrete failure modes to avoid.
Read: Arena Experiment in 15 Minutes →
⏱️ 15 minutes • Key takeaways: Use Direct+cov for fresh responses, avoid raw IPS, aim for 1k prompts + 25% oracle coverage
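The warning against raw IPS comes down to effective sample size: when importance weights are heavy-tailed, a few prompts dominate the estimate. A minimal sketch of the standard Kish ESS diagnostic (function name and example weights are illustrative, not part of the CJE package):

```python
import numpy as np

def effective_sample_size(weights: np.ndarray) -> float:
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

# Heavy-tailed importance weights: one logged response dominates.
weights = np.array([100.0] + [1.0] * 999)  # 1,000 logged prompts
ess = effective_sample_size(weights)
print(f"ESS = {ess:.0f} of {len(weights)}")  # ESS = 110 of 1000
```

If ESS collapses to a small fraction of the nominal sample size, raw IPS estimates are effectively averaging over a handful of prompts, which is why calibrated or direct estimators are recommended instead.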
Get CJE Running
Install the package, run your first evaluation, interpret diagnostics. Most practitioners stop here—you'll have a working implementation and know when to trust it.
from cje import analyze_dataset  # import path per the package docs; verify for your install

result = analyze_dataset(fresh_draws_dir="fresh_draws/")  # placeholder directory
⏱️ 30 minutes hands-on
✓ Checkpoint: You can stop here
At this point, you understand the problem, have aligned prompts and judges to Y*, know which estimators to use, and can run CJE in production. The remaining steps are for understanding why the methods work and when they break.
Continue if you need to extend the methods, explain them to skeptical stakeholders, or understand the theoretical foundations.
Alignment Theory: Formal Proofs
Formal framework for Y*-alignment: propositions (optimization compatibility, judge-pool invariance, OPE transportability), assumptions ledger, and integration with OpenAI's deliberative alignment work.
Read: Your AI Is Optimizing the Wrong Thing — Technical Appendix →
⏱️ 25 minutes • Covers: calibration theory, judge-pool invariance theorem, OPE transportability
Empirical Deep Dive
Complete evaluation on 5k real Arena prompts: all 14 estimators, ablations, diagnostics, and uncertainty decomposition. Understand the edge cases and when the methods fail.
Read: Full Arena Experiment →
⏱️ 45 minutes • Focus: estimator comparisons, OUA decomposition, transportability tests
Why the Methods Work
Understand the unifying principle: projection onto convex constraint sets. Why AutoCal-R and SIMCal-W reduce variance while preserving unbiasedness. When off-policy evaluation hits fundamental limits.
⏱️ 25 minutes combined • Covers: isotonic regression, mean preservation, ESS limits
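The mean-preservation claim can be checked directly: the pool-adjacent-violators solution underlying isotonic regression averages labels within blocks, so the calibrated scores keep the oracle labels' mean exactly. A quick illustration with scikit-learn's generic `IsotonicRegression` (a sketch of the principle, not CJE's AutoCal-R implementation; the data here is synthetic):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
scores = rng.uniform(0, 1, 500)                            # raw judge scores
labels = (scores > rng.uniform(0, 1, 500)).astype(float)   # noisy oracle labels

iso = IsotonicRegression(out_of_bounds="clip")
calibrated = iso.fit_transform(scores, labels)

# PAVA averages labels within violating blocks, so the overall mean survives.
print(np.isclose(calibrated.mean(), labels.mean()))  # True
```

The fit projects the labels onto the convex set of monotone functions of the score, which is the "projection onto convex constraint sets" view: variance shrinks because block averaging smooths noise, while the mean (the quantity the estimator targets) is untouched.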
Evaluation Theory: Formal Framework
Complete theoretical treatment: identification assumptions (S1-S4, L1-L2), influence functions, semiparametric efficiency, Pearl-Bareinboim transport theory.
Read: Technical Appendix →
⏱️ 45 minutes • Covers: DM/IPS/DR estimators, cross-fitting, DML, transportability proofs, literature connections
