API Reference

Complete reference for analyze_dataset() and results

Main Function: analyze_dataset()

    def analyze_dataset(
        logged_data_path: Optional[str] = None,             # Path to JSONL (optional for Direct mode)
        fresh_draws_dir: Optional[str] = None,              # Directory with fresh draws
        calibration_data_path: Optional[str] = None,        # Dedicated calibration dataset
        combine_oracle_sources: bool = True,                # Pool oracle labels from all sources
        estimator: str = "auto",                            # Estimator or "auto" for mode selection
        judge_field: str = "judge_score",                   # Metadata field with judge scores
        oracle_field: str = "oracle_label",                 # Metadata field with oracle labels
        calibration_covariates: Optional[List[str]] = None, # Covariates for two-stage calibration
        include_response_length: bool = False,              # Auto-add response length as covariate
        estimator_config: Optional[Dict] = None,            # Estimator-specific config
        verbose: bool = False                               # Print detailed progress
    ) -> EstimationResult

Parameters

Parameter | Type | Description
--- | --- | ---
logged_data_path | Optional[str] | Path to JSONL file with logged data. Required for IPS/DR modes. Optional for Direct mode (used for calibration only, if provided).
fresh_draws_dir | Optional[str] | Directory containing fresh draw response files. Required for DR mode and Direct mode.
calibration_data_path | Optional[str] | Path to a dedicated calibration dataset with oracle labels. Use it to learn the judge→oracle mapping from a curated set kept separate from the evaluation data.
combine_oracle_sources | bool | Pool oracle labels from all sources (calibration_data + logged_data + fresh_draws). Default: True, for data efficiency. Set to False to use only calibration_data_path.
estimator | str | "auto" (default), or a specific estimator such as "direct", "calibrated-ips", or "stacked-dr".
judge_field | str | Metadata field containing judge scores (default: "judge_score").
oracle_field | str | Metadata field containing oracle labels (default: "oracle_label").
calibration_covariates | Optional[List[str]] | Metadata field names to use as covariates in two-stage calibration (e.g., ["response_length", "domain"]). Helps with confounding, where the same judge score corresponds to different oracle outcomes depending on observable features.
include_response_length | bool | Auto-add response length (word count) as a calibration covariate. Convenient for handling length bias. Default: False.
estimator_config | Optional[Dict] | Optional estimator-specific configuration (n_folds, clip_weight, etc.).
verbose | bool | Print detailed progress messages (default: False).

Automatic Mode Selection

Use estimator="auto" (default) and CJE will:

  • Detect the mode based on your data (Direct/IPS/DR)
  • Select the best estimator for that mode
  • Check logprob coverage (need ≥50% for IPS/DR)
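The selection steps above can be sketched as a small decision rule. This is a hypothetical illustration assembled from the requirements stated in this reference (fresh draws for DR and Direct, logged data plus at least 50% logprob coverage for IPS/DR). The function name `select_mode`, its arguments, and the mode strings are assumptions, not CJE's internal API, and the library's real rule may differ.

```python
from typing import Optional

MIN_LOGPROB_COVERAGE = 0.5  # the ≥50% threshold quoted in this reference


def select_mode(
    has_logged_data: bool,
    has_fresh_draws: bool,
    logprob_coverage: Optional[float] = None,
) -> str:
    """Hypothetical mode selection; an assumption, not CJE's actual code."""
    if not has_logged_data and not has_fresh_draws:
        raise ValueError("No data provided")

    # IPS and DR both need logged data with enough complete logprobs.
    coverage_ok = (
        has_logged_data
        and logprob_coverage is not None
        and logprob_coverage >= MIN_LOGPROB_COVERAGE
    )
    if coverage_ok and has_fresh_draws:
        return "dr"
    if coverage_ok:
        return "ips"
    if has_fresh_draws:
        # Direct mode only needs fresh draws; logged data is optional.
        return "direct"
    raise ValueError("Insufficient logprob coverage")
```

Under these assumptions, logged data with poor logprob coverage still yields a usable Direct-mode run as long as fresh draws are present.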

Return Type: EstimationResult

    class EstimationResult:
        # Core results
        estimates: np.ndarray                # Shape: [n_policies], values in [0,1]
        standard_errors: np.ndarray          # Complete SEs (IF + MC + oracle)
        n_samples_used: Dict[str, int]       # Valid samples per policy
        method: str                          # Estimation method used

        # Statistical artifact
        influence_functions: Optional[Dict]  # Per-sample contributions

        # Quality metrics
        diagnostics: IPSDiagnostics | DRDiagnostics  # Health metrics

        # Configuration
        metadata: Dict                       # Run metadata, target_policies, etc.

        # Methods
        def ci(alpha=0.05) -> List[Tuple[float, float]]
        def confidence_interval(alpha=0.05) -> Tuple[np.ndarray, np.ndarray]
        def best_policy() -> int
        def compare_policies(idx1, idx2, alpha=0.05) -> Dict
        def plot_estimates(...) -> Figure
        def to_dict() -> Dict

Result Fields

estimates

Policy value estimates as numpy array. One estimate per target policy, in [0, 1] range.

standard_errors

Complete standard errors including all uncertainty sources: influence function variance, Monte Carlo variance (for DR), and oracle uncertainty (when oracle coverage < 100%).

n_samples_used

Dictionary mapping policy names to number of valid samples used in estimation.

method

String identifying the estimation method used (e.g., "calibrated-ips", "stacked-dr", "direct").

influence_functions

Dictionary mapping policy names to their influence function arrays. Used for proper variance estimation in policy comparisons.

diagnostics

Health metrics including:

  • weight_ess - Effective sample size (0-1, higher is better)
  • ess_per_policy - ESS for each policy
  • overall_status - GOOD/WARNING/CRITICAL
  • calibration_rmse - Judge calibration quality

metadata

Run information including target_policies list, mode selected, estimator used, data sources, oracle_sources breakdown, and degrees_of_freedom for t-based CIs.

Result Methods

ci(alpha=0.05)

Returns confidence intervals as a list of (lower, upper) tuples, one per policy.

    cis = result.ci()            # 95% CIs
    cis = result.ci(alpha=0.10)  # 90% CIs
    # Returns: [(0.701, 0.745), (0.680, 0.720)]

confidence_interval(alpha=0.05)

Returns confidence intervals as numpy arrays. Uses t-based CIs when degrees of freedom are available in metadata, and falls back to z-based CIs for large samples.

    lower, upper = result.confidence_interval()
    # lower, upper are numpy arrays

best_policy()

Returns the index of the best policy by point estimate.

    best_idx = result.best_policy()
    best_name = result.metadata["target_policies"][best_idx]

compare_policies(idx1, idx2, alpha=0.05)

Compares two policies using influence functions for proper variance estimation of the difference. Returns a dictionary with difference, SE, z-score, p-value, and significance.

    comparison = result.compare_policies(0, 1)
    # Returns: {
    #     "difference": 0.05,
    #     "se_difference": 0.02,
    #     "z_score": 2.5,
    #     "p_value": 0.012,
    #     "significant": True,
    #     "used_influence": True
    # }
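To illustrate why influence functions matter for a comparison, here is a minimal sketch of a paired difference test. It assumes the two influence arrays are aligned on the same samples and centered, and that the SE of the difference is the sample standard deviation of the per-sample differences divided by √n. The helper `compare_from_influence` is hypothetical and not part of CJE; the library's own computation may differ in detail.

```python
import math
from statistics import NormalDist, stdev


def compare_from_influence(est1, est2, if1, if2, alpha=0.05):
    """Paired z-test on two policy estimates (a sketch, not CJE's code)."""
    if len(if1) != len(if2):
        raise ValueError("influence arrays must cover the same samples")
    n = len(if1)

    # Differencing sample by sample keeps the covariance between the two
    # estimates, which an independent-samples SE would ignore.
    diffs = [a - b for a, b in zip(if1, if2)]
    se_difference = stdev(diffs) / math.sqrt(n)

    difference = est1 - est2
    z_score = difference / se_difference
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z_score)))
    return {
        "difference": difference,
        "se_difference": se_difference,
        "z_score": z_score,
        "p_value": p_value,
        "significant": p_value < alpha,
    }


# Toy influence values for two policies scored on the same four samples.
if_a = [0.10, -0.10, 0.20, -0.20]
if_b = [0.05, -0.05, 0.10, -0.10]
toy_comparison = compare_from_influence(0.75, 0.70, if_a, if_b)
```

When the two policies' influence values are positively correlated, as in the toy data, the paired SE is smaller than the one obtained by treating the estimates as independent.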

plot_estimates(base_policy_stats, oracle_values, save_path, **kwargs)

Plots policy estimates with confidence intervals. Optionally includes base policy comparison and oracle ground truth values.

    # Basic plot
    fig = result.plot_estimates()

    # With base policy and oracle values
    fig = result.plot_estimates(
        base_policy_stats={"mean": 0.72, "se": 0.01},
        oracle_values={"policy_a": 0.75, "policy_b": 0.68},
        save_path="results/estimates.png",
        figsize=(10, 6)
    )

to_dict()

Serializes the result to a dictionary for JSON export. Includes estimates, SEs, CIs, per-policy results, and diagnostics.

    import json

    data = result.to_dict()
    with open("results.json", "w") as f:
        json.dump(data, f, indent=2)

Jupyter Notebook Display

EstimationResult has a rich HTML display (_repr_html_) for Jupyter notebooks. Evaluate the result in a cell to see a formatted table with estimates, standard errors, confidence intervals, and diagnostic status.

Usage Examples

Basic Usage

    from cje import analyze_dataset

    # Automatic mode selection
    result = analyze_dataset("data.jsonl")

    # Access results
    for i, policy in enumerate(result.metadata["target_policies"]):
        est = result.estimates[i]
        se = result.standard_errors[i]
        print(f"{policy}: {est:.3f} ± {1.96*se:.3f}")

    # Get confidence intervals
    cis = result.ci()  # Returns [(lower, upper), ...]
    for i, (lower, upper) in enumerate(cis):
        policy = result.metadata["target_policies"][i]
        print(f"{policy}: [{lower:.3f}, {upper:.3f}]")

Check Diagnostics

    # Check overall health
    if result.diagnostics.overall_status.value == "CRITICAL":
        print("⚠️ Critical issues detected")
        print(result.diagnostics.summary())

    # Check ESS for each policy
    for policy, ess in result.diagnostics.ess_per_policy.items():
        if ess < 0.30:
            print(f"⚠️ Low ESS for {policy}: {ess:.1%}")

Custom Configuration

    # Use specific estimator with custom config
    result = analyze_dataset(
        "logs.jsonl",
        fresh_draws_dir="responses/",
        estimator="stacked-dr",
        estimator_config={
            "n_folds": 10,  # More folds for stability
            "use_calibrated_weights": True
        }
    )

Dedicated Calibration Set

    # Learn calibration from curated oracle set
    result = analyze_dataset(
        logged_data_path="production_logs.jsonl",    # Evaluation data
        calibration_data_path="human_labels.jsonl",  # Curated oracle labels
        combine_oracle_sources=True,                 # Pool labels from all sources
        verbose=True
    )
    print(f"Oracle sources: {result.metadata['oracle_sources']}")

Handle Length Bias with Covariates

    # Use two-stage calibration to handle confounders
    result = analyze_dataset(
        "logs.jsonl",
        include_response_length=True,                     # Auto-add response length
        calibration_covariates=["domain", "difficulty"],  # Additional covariates
        verbose=True
    )

CLI Commands

Basic Analysis

    # Automatic mode selection
    python -m cje analyze data.jsonl

    # With fresh draws (for DR)
    python -m cje analyze logs.jsonl --fresh-draws-dir responses/

    # Specify estimator
    python -m cje analyze logs.jsonl --estimator calibrated-ips

    # Save results to JSON
    python -m cje analyze data.jsonl -o results.json

Data Validation

    # Check data format before running
    python -m cje validate data.jsonl --verbose

Common Patterns

Find and Compare Best Policies

    import numpy as np
    from cje import analyze_dataset

    result = analyze_dataset("data.jsonl")
    policies = result.metadata["target_policies"]

    # Find best policy
    best_idx = result.best_policy()
    print(f"Best policy: {policies[best_idx]}")

    # Compare the best policy to the second best
    sorted_idx = np.argsort(result.estimates)[::-1]
    comparison = result.compare_policies(sorted_idx[0], sorted_idx[1])
    if comparison["significant"]:
        print(f"Best is significantly better (p={comparison['p_value']:.3f})")

Export Results

    import json

    result = analyze_dataset("data.jsonl")

    # Use built-in serialization
    data = result.to_dict()
    with open("results.json", "w") as f:
        json.dump(data, f, indent=2)

Visualize Results

    result = analyze_dataset("data.jsonl")

    # Plot with optional base policy comparison
    fig = result.plot_estimates(
        base_policy_stats={"mean": 0.72, "se": 0.01},
        save_path="results/policy_comparison.png"
    )

Reliability Gating

    result = analyze_dataset("data.jsonl")

    # Refuse to report an estimate when overlap is too poor
    if result.diagnostics.weight_ess < 0.1:
        raise ValueError("Insufficient overlap for reliable estimation")

    best_estimate = result.estimates[result.best_policy()]

Calibration System (AutoCal-R)

CJE calibrates judge scores to oracle labels with AutoCal-R, which selects the calibration mode automatically.

Monotone Calibration (Default)

Uses isotonic regression to learn f̂(S) = E[Y|S]. Enforces that higher judge scores never predict lower oracle outcomes. Works well when the judge-oracle relationship is monotonic.

  • When used: Default when no covariates specified
  • Strengths: Stable with few labels (5-10% coverage), never inverts rankings
  • Mean-preserving: Automatically matches oracle KPI scale
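The monotone fit described above can be illustrated with a small pool-adjacent-violators routine. This is a minimal sketch of isotonic regression in general, not AutoCal-R itself; `fit_isotonic` and `predict` are hypothetical helpers, and the sketch assumes distinct judge scores for simplicity.

```python
def fit_isotonic(scores, labels):
    """Fit a nondecreasing step function from judge score to oracle label."""
    blocks = []  # each block: [sum_of_labels, count, highest_score_in_block]
    for score, label in sorted(zip(scores, labels)):
        blocks.append([label, 1, score])
        # Merge backwards while a block's mean exceeds its successor's mean.
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(top, total / count) for total, count, top in blocks]


def predict(steps, score):
    """Return the fitted value of the first block covering `score`."""
    for top, value in steps:
        if score <= top:
            return value
    return steps[-1][1]  # scores above the training range use the last block


judge_scores = [0.1, 0.3, 0.5, 0.7, 0.9]
oracle_labels = [0.0, 0.4, 0.2, 0.8, 0.6]  # two local inversions
steps = fit_isotonic(judge_scores, oracle_labels)
calibrated = [predict(steps, s) for s in judge_scores]
```

Each merged block is replaced by its own mean, so the calibrated values average to the same number as the oracle labels on the training data. That is the sense in which the fit is mean-preserving.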

Two-Stage Calibration

When judge scores have slice heterogeneity (the same score means different things in different contexts), two-stage calibration learns an intermediate risk index g(S, X_cov) before applying isotonic regression.

  • When used: Automatically when calibration_covariates is set or include_response_length=True
  • Common use: Handle length bias where longer responses get higher scores at same quality
  • Process: Learn g(S, covariates) → ECDF → isotonic h(U)

    # Handle length bias with two-stage calibration
    result = analyze_dataset(
        "logs.jsonl",
        include_response_length=True,  # Triggers two-stage
        verbose=True
    )

    # Check which mode was selected
    print(result.metadata.get("calibration_mode"))  # "two_stage"

Weight Stabilization (SIMCal-W)

Key Result

SIMCal-W achieves a 158× ESS improvement over raw SNIPS on Arena benchmarks.

For IPS and DR modes, CJE automatically stabilizes importance weights using SIMCal-W (Surrogate-Indexed Monotone Calibration for Weights). This is separate from reward calibration (AutoCal-R) and runs automatically.

How SIMCal-W Works

  • Problem: Raw importance weights have extreme variance (some weights 1000×+ larger than others)
  • Solution: Project weights to be monotone with judge score ordering
  • Key property: Maintains mean-1 for unbiasedness while reducing variance dramatically
  • Automatic: Always enabled for calibrated-ips and DR estimators
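To make the variance problem concrete, the sketch below computes the effective sample size as a fraction of n using the Kish formula, (Σw)² / (n · Σw²). It assumes this is the quantity behind the 0-1 ESS values reported in the diagnostics; the helper names are hypothetical and the code does not implement SIMCal-W itself.

```python
def ess_fraction(weights):
    """Kish effective sample size divided by n, a value in (0, 1]."""
    n = len(weights)
    total = sum(weights)
    sum_of_squares = sum(w * w for w in weights)
    return (total * total) / (n * sum_of_squares)


def normalize_mean_one(weights):
    """Rescale weights so that their mean is exactly 1."""
    mean_weight = sum(weights) / len(weights)
    return [w / mean_weight for w in weights]


uniform_weights = [1.0] * 100
skewed_weights = normalize_mean_one([0.01] * 99 + [1000.0])

uniform_ess = ess_fraction(uniform_weights)  # every sample counts equally
skewed_ess = ess_fraction(skewed_weights)    # one sample dominates the estimate
```

The ESS fraction is invariant to rescaling, so mean-one normalization alone cannot raise it. Only reshaping the weights, as the monotone projection does, can.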

Checking Weight Diagnostics

    result = analyze_dataset("logs.jsonl", estimator="calibrated-ips")

    # Check effective sample size (higher is better)
    print(f"Weight ESS: {result.diagnostics.weight_ess:.1%}")  # e.g., 45%

    # ESS per policy
    for policy, ess in result.diagnostics.ess_per_policy.items():
        status = "✓" if ess > 0.30 else "⚠️"
        print(f"{status} {policy}: {ess:.1%} ESS")

Understanding Uncertainty (OUA)

CJE standard errors include Oracle-Uncertainty Adjustment (OUA), which accounts for the fact that the calibration function f̂(S) itself has uncertainty from being learned on a finite oracle sample.

Two Sources of Variance

  • Sampling variance: From finite evaluation samples
  • Calibration variance (OUA): From learning f̂ on finite oracle labels

OUA uses a delete-one-fold jackknife to measure the calibrator's variance and adds it to the standard errors.
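The jackknife step can be sketched as follows. The sketch assumes the standard delete-one-group formula, (K-1)/K · Σ(θ̂ₖ - θ̄)², where θ̂ₖ is the policy estimate obtained after refitting the calibrator without oracle fold k. It also assumes the calibration variance is added to the sampling variance before taking the square root, and that the OUA share is the ratio of calibration variance to total variance. The function names are hypothetical and the library's computation may differ.

```python
import math


def jackknife_variance(fold_estimates):
    """Delete-one-fold jackknife variance of an estimate."""
    k = len(fold_estimates)
    if k < 2:
        raise ValueError("need at least two folds")
    center = sum(fold_estimates) / k
    return (k - 1) / k * sum((theta - center) ** 2 for theta in fold_estimates)


def combine_with_oua(sampling_se, fold_estimates):
    """Return (total SE, OUA share), assuming the two variances add."""
    var_sampling = sampling_se ** 2
    var_oua = jackknife_variance(fold_estimates)
    total_var = var_sampling + var_oua
    return math.sqrt(total_var), var_oua / total_var


# Estimates recomputed with each of five oracle folds left out (toy numbers).
leave_one_fold_out = [0.70, 0.72, 0.71, 0.73, 0.69]
total_se, oua_share_sketch = combine_with_oua(0.02, leave_one_fold_out)
```

In this toy case the calibration term is larger than the sampling term, so more oracle labels would shrink the interval faster than more evaluation prompts.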

Interpreting OUA Share

    # Check what's driving your uncertainty
    oua_share = result.metadata.get("oua_share", 0)
    print(f"OUA share: {oua_share:.0%}")

    if oua_share > 0.5:
        print("→ Calibration dominates: add more oracle labels")
    else:
        print("→ Sampling dominates: add more eval prompts")

Transportability Audit

Before reusing a calibrator on a new policy or time period, run the transportability audit to test whether it is still unbiased.

When to Run This

  • Policy shift: Reusing a calibrator fit on GPT-4 for GPT-4-mini
  • Temporal drift: Using a Q1 calibrator on Q2 data
  • Domain shift: Different user cohort or use case

How It Works

  1. Compute mean residual δ̂ = E[Y - f̂(S)] on probe data (40-60 samples)
  2. Construct 95% CI: δ̂ ± 1.96 × SE
  3. Test if 0 ∈ CI (unbiased) or not (drifted)

Status Classification

  • PASS: 0 ∈ CI → calibrator is unbiased, safe to reuse
  • WARN: 0 ∉ CI but |δ̂| < 0.05 → small bias, monitor closely
  • FAIL: |δ̂| ≥ 0.05 → systematic bias, refit calibrator
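The three steps and the status rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the code behind audit_transportability: the function `audit_residuals` and its return keys are hypothetical, the SE is taken as the sample standard deviation of the residuals divided by √n, and because the rules do not say which status wins when 0 ∈ CI but |δ̂| ≥ 0.05, the sketch checks PASS first.

```python
import math
from statistics import mean, stdev


def audit_residuals(oracle_labels, calibrated_scores, tolerance=0.05, z=1.96):
    """Classify calibrator drift from probe residuals Y - f(S)."""
    residuals = [y - f for y, f in zip(oracle_labels, calibrated_scores)]
    n = len(residuals)

    # Step 1: mean residual on the probe set.
    delta_hat = mean(residuals)

    # Step 2: normal-approximation 95% confidence interval.
    se = stdev(residuals) / math.sqrt(n)
    lower, upper = delta_hat - z * se, delta_hat + z * se

    # Step 3: classify. PASS is checked first (an assumption, see lead-in).
    if lower <= 0.0 <= upper:
        status = "PASS"
    elif abs(delta_hat) < tolerance:
        status = "WARN"
    else:
        status = "FAIL"
    return {"status": status, "delta_hat": delta_hat, "ci": (lower, upper), "n": n}


# Toy probe set: 50 samples where the calibrator under-predicts by about 0.10.
predictions = [0.5] * 50
labels = [0.5 + (0.09 if i % 2 == 0 else 0.11) for i in range(50)]
drift_report = audit_residuals(labels, predictions)
```

With 40-60 probe samples the interval is fairly wide, so a PASS says the data are consistent with no bias. It does not prove the bias is zero.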

Example Usage

    from cje.diagnostics import audit_transportability

    # Test if Q1 calibrator works on Q2 data
    diag = audit_transportability(
        calibrator,      # From Q1
        probe_data,      # 50 Q2 samples with oracle labels
        group_label="Q2"
    )
    print(diag.summary())
    # "Transport: PASS | Group: Q2 | N=50 | δ̂: +0.012 (CI: [-0.008, +0.032])"

    if diag.status == "FAIL":
        print(f"Drift detected: δ̂={diag.delta_hat:+.3f}")
        print(f"Action: {diag.recommended_action}")  # "refit_two_stage"

Error Handling

ValueError: No data provided

At least one of logged_data_path or fresh_draws_dir must be provided.

ValueError: Estimator requires fresh draws

DR estimators like stacked-dr require fresh draws.

Solution: Provide fresh_draws_dir or use calibrated-ips

ValueError: Insufficient logprob coverage

Need ≥50% of samples with complete logprobs for IPS/DR modes.

Solution: Compute missing logprobs or use Direct mode with fresh draws
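One way to act on these errors in calling code is a small retry wrapper. The sketch below is hypothetical: `analyze_with_fallback` is not part of CJE, and it assumes the raised ValueError message contains the phrase "fresh draws", as the error title above suggests. Matching on message text is fragile, so treat this as an illustration of the suggested fix (falling back to calibrated-ips) and not as a recommended pattern.

```python
def analyze_with_fallback(analyze, **kwargs):
    """Call `analyze`; if it complains about fresh draws, retry with IPS.

    `analyze` is any callable with analyze_dataset's keyword interface.
    """
    try:
        return analyze(**kwargs)
    except ValueError as err:
        # Assumption: the message mentions "fresh draws" for this error.
        if "fresh draws" not in str(err).lower():
            raise
        retry_kwargs = {**kwargs, "estimator": "calibrated-ips"}
        return analyze(**retry_kwargs)
```

A call would look like analyze_with_fallback(analyze_dataset, logged_data_path="logs.jsonl", estimator="stacked-dr"). Any other ValueError, such as insufficient logprob coverage, is re-raised unchanged.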

Developer Documentation

For module-level documentation, implementation details, and extending CJE, see the README files in each module directory on GitHub.