Learning Path
Master CJE at your own pace. Start with the basics and go as deep as you need.
How to use this guide: Follow the path sequentially. Each step builds on the previous one. Stop when you have what you need—whether that's a working implementation or a deep understanding of the theory.
Start: Why Your Metrics Are Lying
Zero math. Just concrete failure stories and the core insight: treat cheap metrics as a currency that must be exchanged into real value. Understand the "Judge-Dollars to Value-Dollars" analogy and the deliberation ladder.
Read: Your AI Metrics Are Lying to You →
⏱️ 7 minutes • Zero equations • Accessible to executives and PMs
Align Generation & Evaluation
Before measuring, align your prompts and judges to the same welfare target (Y*). Copy-paste templates, two-week rollout plan, and diagnostics checklist. Zero math.
Read: Your AI Is Optimizing the Wrong Thing →
⏱️ 18 minutes • Get: Generation prompt template, judge rubric, rollout plan
Implementation Guide
Practical guide with no equations: which estimators work, when to use them, and how much data you need. Actionable defaults and concrete failure modes to avoid.
Read: Arena Experiment in 15 Minutes →
⏱️ 15 minutes • Key takeaways: Use Direct+cov for fresh responses, avoid raw IPS, aim for 1k prompts + 25% oracle coverage
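The warning against raw IPS comes down to effective sample size: when importance weights are heavy-tailed, a few prompts dominate the estimate. A minimal sketch of the standard Kish ESS diagnostic (function name and example weights are illustrative, not part of the CJE package):

```python
import numpy as np

def effective_sample_size(weights: np.ndarray) -> float:
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

# Heavy-tailed importance weights: one logged response dominates.
weights = np.array([100.0] + [1.0] * 999)  # 1,000 logged prompts
ess = effective_sample_size(weights)
print(f"ESS = {ess:.0f} of {len(weights)}")  # ESS = 110 of 1000
```

If ESS collapses to a small fraction of the nominal sample size, raw IPS estimates are effectively averaging over a handful of prompts, which is why calibrated or direct estimators are recommended instead.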
Get CJE Running
Install the package, run your first evaluation, interpret diagnostics. Most practitioners stop here—you'll have a working implementation and know when to trust it.
from cje import analyze_dataset  # import path per the package docs; verify for your install

result = analyze_dataset(fresh_draws_dir="fresh_draws/")  # placeholder directory
⏱️ 30 minutes hands-on
✓ Checkpoint: You can stop here
At this point, you understand the problem, have aligned prompts and judges to Y*, know which estimators to use, and can run CJE in production. The remaining steps are for understanding why the methods work and when they break.
Continue if you need to extend the methods, explain them to skeptical stakeholders, or understand the theoretical foundations.
Alignment Theory: Formal Proofs
Formal framework for Y*-alignment: propositions (optimization compatibility, judge-pool invariance, OPE transportability), assumptions ledger, and integration with OpenAI's deliberative alignment work.
Read: Your AI Is Optimizing the Wrong Thing — Technical Appendix →
⏱️ 25 minutes • Covers: calibration theory, judge-pool invariance theorem, OPE transportability
Empirical Deep Dive
Complete evaluation on 5k real Arena prompts: all 14 estimators, ablations, diagnostics, and uncertainty decomposition. Understand the edge cases and when the methods fail.
Read: Full Arena Experiment →
⏱️ 45 minutes • Focus: estimator comparisons, OUA decomposition, transportability tests
Why the Methods Work
Understand the unifying principle: projection onto convex constraint sets. Why AutoCal-R and SIMCal-W reduce variance while preserving unbiasedness. When off-policy evaluation hits fundamental limits.
⏱️ 25 minutes combined • Covers: isotonic regression, mean preservation, ESS limits
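The mean-preservation claim can be checked directly: the pool-adjacent-violators solution underlying isotonic regression averages labels within blocks, so the calibrated scores keep the oracle labels' mean exactly. A quick illustration with scikit-learn's generic `IsotonicRegression` (a sketch of the principle, not CJE's AutoCal-R implementation; the data here is synthetic):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
scores = rng.uniform(0, 1, 500)                            # raw judge scores
labels = (scores > rng.uniform(0, 1, 500)).astype(float)   # noisy oracle labels

iso = IsotonicRegression(out_of_bounds="clip")
calibrated = iso.fit_transform(scores, labels)

# PAVA averages labels within violating blocks, so the overall mean survives.
print(np.isclose(calibrated.mean(), labels.mean()))  # True
```

The fit projects the labels onto the convex set of monotone functions of the score, which is the "projection onto convex constraint sets" view: variance shrinks because block averaging smooths noise, while the mean (the quantity the estimator targets) is untouched.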
Evaluation Theory: Formal Framework
Complete theoretical treatment: identification assumptions (S1-S4, L1-L2), influence functions, semiparametric efficiency, Pearl-Bareinboim transport theory.
Read: Technical Appendix →
⏱️ 45 minutes • Covers: DM/IPS/DR estimators, cross-fitting, DML, transportability proofs, literature connections
