Main Function: analyze_dataset()
Parameters
| Parameter | Type | Description |
|---|---|---|
| logged_data_path | str \| None | Path to a JSONL file of logged data. Required for IPS/DR modes; optional for Direct mode (used for calibration only, if provided). |
| fresh_draws_dir | str \| None | Directory containing fresh draw response files. Required for DR mode and Direct mode. |
| calibration_data_path | str \| None | Path to a dedicated calibration dataset with oracle labels. Use it to learn the judge→oracle mapping from a curated set kept separate from the evaluation data. |
| combine_oracle_sources | bool | Pool oracle labels from all sources (calibration_data + logged_data + fresh_draws). Default: True, for data efficiency. Set False to use only calibration_data_path. |
| estimator | str | "auto" (default), or a specific estimator: "direct", "calibrated-ips", "stacked-dr", etc. |
| judge_field | str | Metadata field containing judge scores (default: "judge_score"). |
| oracle_field | str | Metadata field containing oracle labels (default: "oracle_label"). |
| calibration_covariates | List[str] \| None | Metadata field names to use as covariates in two-stage calibration (e.g., ["response_length", "domain"]). Helps handle confounding where the same judge score maps to different oracle outcomes depending on observable features. |
| include_response_length | bool | Automatically add response length (word count) as a calibration covariate. Convenient for handling length bias. Default: False. |
| estimator_config | Dict \| None | Optional estimator-specific configuration (n_folds, clip_weight, etc.). |
| verbose | bool | Print detailed progress messages (default: False). |
Automatic Mode Selection
Use estimator="auto" (default) and CJE will:
- Detect the mode based on your data (Direct/IPS/DR)
- Select the best estimator for that mode
- Check logprob coverage (≥50% is needed for IPS/DR)
Return Type: EstimationResult
Result Fields
estimates
Policy value estimates as numpy array. One estimate per target policy, in [0, 1] range.
standard_errors
Complete standard errors including all uncertainty sources: influence function variance, Monte Carlo variance (for DR), and oracle uncertainty (when oracle coverage < 100%).
n_samples_used
Dictionary mapping policy names to number of valid samples used in estimation.
method
String identifying the estimation method used (e.g., "calibrated-ips", "stacked-dr", "direct").
influence_functions
Dictionary mapping policy names to their influence function arrays. Used for proper variance estimation in policy comparisons.
diagnostics
Health metrics including:
- weight_ess - Effective sample size (0-1, higher is better)
- ess_per_policy - ESS for each policy
- overall_status - GOOD/WARNING/CRITICAL
- calibration_rmse - Judge calibration quality
metadata
Run information including target_policies list, mode selected, estimator used, data sources, oracle_sources breakdown, and degrees_of_freedom for t-based CIs.
Result Methods
ci(alpha=0.05)
Returns confidence intervals as a list of (lower, upper) tuples, one per policy.
confidence_interval(alpha=0.05)
Returns confidence intervals as numpy arrays. Uses t-based CIs when degrees of freedom are available in metadata, falls back to z-based CIs for large samples.
best_policy()
Returns the index of the best policy by point estimate.
compare_policies(idx1, idx2, alpha=0.05)
Compares two policies using influence functions for proper variance estimation of the difference. Returns a dictionary with difference, SE, z-score, p-value, and significance.
plot_estimates(base_policy_stats, oracle_values, save_path, **kwargs)
Plots policy estimates with confidence intervals. Optionally includes base policy comparison and oracle ground truth values.
to_dict()
Serializes the result to a dictionary for JSON export. Includes estimates, SEs, CIs, per-policy results, and diagnostics.
Jupyter Notebook Display
EstimationResult has rich HTML display (_repr_html_) for Jupyter notebooks. Simply evaluate the result in a cell to see a formatted table with estimates, standard errors, confidence intervals, and diagnostic status.
Usage Examples
Basic Usage
Check Diagnostics
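A sketch of reading the diagnostics fields listed under Result Fields. The SimpleNamespace stand-in (with illustrative values) lets the snippet run without data; with a real run you would read result.diagnostics directly, and the attribute-access style shown is an assumption:

```python
from types import SimpleNamespace

# Stand-in for result.diagnostics so the snippet runs without data;
# with a real run, use: diag = result.diagnostics
diag = SimpleNamespace(
    weight_ess=0.42,
    ess_per_policy={"policy_a": 0.40, "policy_b": 0.45},
    overall_status="GOOD",
    calibration_rmse=0.08,
)

print("Status:", diag.overall_status)          # GOOD / WARNING / CRITICAL
if diag.weight_ess < 0.10:                     # threshold is illustrative
    print("Warning: low effective sample size; estimates may be unstable")
print("Calibration RMSE:", diag.calibration_rmse)
```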
Custom Configuration
Dedicated Calibration Set
Handle Length Bias with Covariates
CLI Commands
Basic Analysis
Data Validation
Common Patterns
Find and Compare Best Policies
Export Results
Visualize Results
Reliability Gating
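One way to act on the overall_status values documented under diagnostics. The ship/review/block actions and the decision rule are illustrative, not part of CJE:

```python
def gate(overall_status: str) -> str:
    """Map CJE's overall_status to a downstream action (illustrative policy)."""
    if overall_status == "GOOD":
        return "ship"
    if overall_status == "WARNING":
        return "review"
    return "block"  # CRITICAL (or anything unexpected): don't act on estimates

for status in ("GOOD", "WARNING", "CRITICAL"):
    print(status, "->", gate(status))
```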
Calibration System (AutoCal-R)
CJE automatically calibrates judge scores to oracle labels using AutoCal-R with automatic mode selection.
Monotone Calibration (Default)
Uses isotonic regression to learn f̂(S) = E[Y|S]. Enforces that higher judge scores never predict lower oracle outcomes. Works well when the judge-oracle relationship is monotonic.
- When used: Default when no covariates are specified
- Strengths: Stable with few labels (5-10% oracle coverage), never inverts rankings
- Mean-preserving: Automatically matches the oracle KPI scale
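The idea can be sketched from scratch with scikit-learn's isotonic regression on synthetic data. This illustrates monotone calibration in general, not CJE's implementation:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
judge = rng.uniform(0, 1, 200)                           # judge scores S
oracle = np.clip(judge + rng.normal(0, 0.1, 200), 0, 1)  # oracle labels Y

# Isotonic regression learns f_hat(S) ~= E[Y | S], constrained monotone
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(judge, oracle)

grid = np.linspace(0, 1, 11)
fhat = iso.predict(grid)
print(np.round(fhat, 2))  # non-decreasing in S by construction
```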
Two-Stage Calibration
When judge scores have slice heterogeneity (same score means different things for different contexts), two-stage calibration learns an intermediate risk index g(S, X_cov) before applying isotonic regression.
- When used: Automatically, when calibration_covariates or include_response_length=True is set
- Common use: Handling length bias, where longer responses get higher scores at the same quality
- Process: Learn g(S, covariates) → ECDF → isotonic h(U)
Weight Stabilization (SIMCal-W)
Key Result
SIMCal-W achieves 158× ESS improvement vs raw SNIPS on Arena benchmarks.
For IPS and DR modes, CJE automatically stabilizes importance weights using SIMCal-W (Surrogate-Indexed Monotone Calibration for Weights). This is separate from reward calibration (AutoCal-R) and runs automatically.
How SIMCal-W Works
- Problem: Raw importance weights have extreme variance (some weights 1000×+ larger than others)
- Solution: Project the weights to be monotone in the judge-score ordering
- Key property: Maintains mean-1 for unbiasedness while reducing variance dramatically
- Automatic: Always enabled for calibrated-ips and the DR estimators
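The core mechanism, monotone projection plus mean-1 renormalization, can be sketched from scratch with isotonic regression. This illustrates why ESS improves; it is not CJE's actual SIMCal-W implementation:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)
judge = rng.uniform(0, 1, 500)                        # judge scores
raw_w = rng.lognormal(mean=0.0, sigma=2.0, size=500)  # heavy-tailed raw weights
raw_w /= raw_w.mean()                                 # start at mean-1

# Project the weights onto functions monotone in the judge score,
# then renormalize so the mean-1 (unbiasedness) property is preserved
iso = IsotonicRegression(increasing="auto")
w_stab = iso.fit(judge, raw_w).predict(judge)
w_stab /= w_stab.mean()

def ess_frac(w):
    """Effective sample size as a fraction of n (1.0 = uniform weights)."""
    return (w.sum() ** 2) / (len(w) * (w ** 2).sum())

print(f"ESS fraction: raw={ess_frac(raw_w):.3f}, stabilized={ess_frac(w_stab):.3f}")
```

The projection replaces weights with monotone block averages, which lowers their variance while keeping their mean, so the ESS fraction can only go up.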
Checking Weight Diagnostics
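A sketch of inspecting the per-policy ESS fields documented under diagnostics. The SimpleNamespace stand-in (with illustrative values) and the 0.10 threshold are placeholders; with a real run you would read result.diagnostics directly:

```python
from types import SimpleNamespace

# Stand-in for result.diagnostics; with a real run, use result.diagnostics
diag = SimpleNamespace(
    weight_ess=0.35,
    ess_per_policy={"policy_a": 0.55, "policy_b": 0.04},  # illustrative values
)

print(f"Overall weight ESS: {diag.weight_ess:.2f}")
for policy, ess in diag.ess_per_policy.items():
    flag = "LOW" if ess < 0.10 else "ok"  # threshold is illustrative
    print(f"{policy}: ESS={ess:.2f} [{flag}]")
```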
Understanding Uncertainty (OUA)
CJE standard errors include Oracle-Uncertainty Adjustment (OUA), which accounts for the fact that the calibration function f̂(S) itself has uncertainty from being learned on a finite oracle sample.
Two Sources of Variance
- Sampling variance: From finite evaluation samples
- Calibration variance (OUA): From learning f̂ on finite oracle labels
OUA uses delete-one-fold jackknife to measure calibrator variance and adds it to standard errors.
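The delete-one-fold idea can be sketched from scratch on synthetic data; a toy linear calibrator stands in for the learned f̂, and this is an illustration of the jackknife, not CJE's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
k = 5                                                    # calibration folds
judge = rng.uniform(0, 1, 400)                           # scores with oracle labels
oracle = np.clip(judge + rng.normal(0, 0.1, 400), 0, 1)
fold = rng.integers(0, k, 400)

def fit_calibrator(mask):
    """Toy linear calibrator (stand-in for the learned f_hat)."""
    slope, intercept = np.polyfit(judge[mask], oracle[mask], 1)
    return lambda s: slope * s + intercept

eval_scores = rng.uniform(0, 1, 200)  # judge scores on the evaluation data

full_est = fit_calibrator(np.ones(400, dtype=bool))(eval_scores).mean()

# Delete-one-fold jackknife: refit without each fold, re-estimate, take variance
loo = np.array([fit_calibrator(fold != j)(eval_scores).mean() for j in range(k)])
var_oua = (k - 1) / k * ((loo - loo.mean()) ** 2).sum()

print(f"estimate={full_est:.3f}, calibrator (OUA) variance={var_oua:.2e}")
```

The OUA variance is then added to the sampling variance before taking the square root to form the reported standard error.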
Interpreting OUA Share
Transportability Audit
Before reusing a calibrator on a new policy or time period, test if it still works using the transportability audit.
When to Run This
- Policy shift: Deploying a calibrator trained on GPT-4 to GPT-4-mini
- Temporal drift: Using a Q1 calibrator on Q2 data
- Domain shift: A different user cohort or use case
How It Works
1. Compute the mean residual δ̂ = E[Y - f̂(S)] on probe data (40-60 samples)
2. Construct a 95% CI: δ̂ ± 1.96 × SE
3. Test whether 0 ∈ CI (unbiased) or not (drifted)
Status Classification
- PASS: 0 ∈ CI → calibrator is unbiased, safe to reuse
- WARN: 0 ∉ CI but |δ̂| < 0.05 → small bias, monitor closely
- FAIL: |δ̂| ≥ 0.05 → systematic bias, refit the calibrator
Example Usage
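The three steps above can be sketched from scratch; this illustrates the audit math with the 0.05 tolerance from the status rules, not CJE's API:

```python
import numpy as np

def transport_audit(y_probe, fhat_probe, bias_tol=0.05):
    """Classify calibrator reuse per the PASS/WARN/FAIL rules above.

    y_probe: oracle labels on a small probe set (40-60 samples)
    fhat_probe: calibrated predictions f_hat(S) on the same samples
    """
    resid = np.asarray(y_probe) - np.asarray(fhat_probe)
    delta = resid.mean()                              # mean residual, delta-hat
    se = resid.std(ddof=1) / np.sqrt(len(resid))
    lo, hi = delta - 1.96 * se, delta + 1.96 * se     # 95% CI
    if lo <= 0.0 <= hi:
        return "PASS", delta, (lo, hi)
    return ("WARN" if abs(delta) < bias_tol else "FAIL"), delta, (lo, hi)

y = np.linspace(0, 1, 50)
drift = np.tile([0.1, -0.1], 25)            # mean-zero residuals
print(transport_audit(y, y - drift)[0])     # PASS: no systematic bias
print(transport_audit(y, y - 0.2)[0])       # FAIL: predictions 0.2 too low
```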
Error Handling
ValueError: No data provided
At least one of logged_data_path or fresh_draws_dir must be provided.
Solution: Provide logged_data_path, fresh_draws_dir, or both
ValueError: Estimator requires fresh draws
DR estimators like stacked-dr require fresh draws.
Solution: Provide fresh_draws_dir or use calibrated-ips
ValueError: Insufficient logprob coverage
Need ≥50% of samples with complete logprobs for IPS/DR modes.
Solution: Compute missing logprobs or use Direct mode with fresh draws
Developer Documentation
For module-level documentation, implementation details, and extending CJE, see the README files in each module directory on GitHub.
