Quick Start
Get CJE running on your data in under 5 minutes.
1. Install
```bash
# Using pip
pip install causal-judge-evaluation

# Or clone from GitHub
git clone https://github.com/cimo-labs/cje.git
cd cje && pip install -e .
```
2. Prepare your data
CJE needs three inputs:
📊 Evaluation logs (required)
Your LLM conversation logs with:
- Prompts and responses
- Policy identifiers (model/prompt version)
- Judge scores (GPT-4, Claude, etc.)

`logs.parquet` or `logs.jsonl`
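For a sense of what one log record might contain, here is a minimal sketch that writes a single line of `logs.jsonl`. The field names (`prompt`, `response`, `policy`, `judge_score`) are illustrative assumptions, not CJE's required schema:

```python
import json

# One hypothetical evaluation-log record. Field names are illustrative,
# not CJE's required schema -- match them to your own logs.
record = {
    "prompt": "Summarize the attached report in two sentences.",
    "response": "The report finds that ...",
    "policy": "model-v2",   # policy identifier (model/prompt version)
    "judge_score": 0.81,    # score from GPT-4, Claude, etc.
}

with open("logs.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
```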
🎯 Oracle labels (small sample)
Ground-truth labels for ~100-200 examples, used to calibrate the judge. Labels can be human preferences, task success, or downstream metrics.
`oracle.csv` with columns: `[id, label]`
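A small script is enough to assemble the oracle file with the two required columns; the ids and label values below are made up for illustration:

```python
import csv

# Write ground-truth labels to oracle.csv with the two columns CJE
# expects: [id, label]. The rows here are made-up examples; in practice
# you would have ~100-200 labeled rows.
rows = [
    {"id": "ex-001", "label": 1},
    {"id": "ex-002", "label": 0},
    {"id": "ex-003", "label": 1},
]

with open("oracle.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "label"])
    writer.writeheader()
    writer.writerows(rows)
```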
⚙️ Config (optional)
Customize estimators, folds, and diagnostics.
`config.yaml` (uses sensible defaults if omitted)
3. Run evaluation
Command line
```bash
cje evaluate \
  --data logs.parquet \
  --oracle oracle.csv \
  --output results/
```
Python API
```python
from cje import Pipeline

pipeline = Pipeline()
results = pipeline.evaluate(
    data="logs.parquet",
    oracle="oracle.csv",
)
results.summary()
```
4. Interpret results
CJE outputs three key files:
📈 Point estimates with confidence intervals
```
# results/estimates.json
{
  "policy_A": 0.523 ± 0.021,
  "policy_B": 0.487 ± 0.019,
  "difference": 0.036 ± 0.028,
  "p_value": 0.023
}
```
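The ± values are confidence-interval half-widths, so the "interval excludes zero" rule from step 5 is simple arithmetic: the difference of 0.036 ± 0.028 gives the interval (0.008, 0.064), which sits entirely above zero. A quick sketch of that check (helper name is illustrative):

```python
# Check whether a confidence interval (estimate ± half-width) excludes zero,
# using the example numbers from estimates.json above.
def ci_excludes_zero(estimate: float, half_width: float) -> bool:
    lower = estimate - half_width
    upper = estimate + half_width
    return lower > 0 or upper < 0

print(ci_excludes_zero(0.036, 0.028))  # difference: (0.008, 0.064) -> True
```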
🔍 Diagnostics
```
# results/diagnostics.json
{
  "ESS": 0.946,               // 94.6% effective sample size
  "judge_calibration": 0.89,  // R² of calibration
  "overlap": 0.72,            // policy overlap
  "status": "PASS"            // or "REFUSE-LEVEL"
}
```
📊 Visualizations
Auto-generated plots for calibration curves, weight distributions, and PIT diagnostics.
5. Make decisions
✅ Ship when:
- Status = "PASS" (all diagnostics passed)
- ESS > 0.5 (50%+ effective samples)
- Confidence interval excludes zero
- p-value < your significance threshold
⚠️ Don't ship when:
- Status = "REFUSE-LEVEL" (diagnostics failed)
- ESS < 0.3 (low effective samples)
- Judge calibration R² < 0.7
- Extreme weight concentration (top 1% > 50% mass)
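The rules above can be combined into a single gate. A minimal sketch, with thresholds taken from the lists above (the function name is illustrative, and the weight-concentration check is omitted since it needs the full weight distribution, not a summary value):

```python
# Combine the ship / don't-ship rules into one check.
# Thresholds come from the Quick Start lists above.
def should_ship(status, ess, calibration_r2,
                ci_lower, ci_upper, p_value, alpha=0.05):
    return (
        status == "PASS"
        and ess > 0.5
        and calibration_r2 >= 0.7
        and (ci_lower > 0 or ci_upper < 0)  # CI excludes zero
        and p_value < alpha
    )

# Using the example diagnostics and estimates from step 4:
print(should_ship("PASS", 0.946, 0.89, 0.008, 0.064, 0.023))  # -> True
```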
Next steps
Need help?
Check the GitHub issues or reach out at eddie@cimolabs.com.