CJE requires different data depending on which mode you're using. All data should be in JSONL format (one JSON object per line).
Direct Mode (Fresh Draws Only)
For Direct mode (on-policy comparison), you only need fresh responses from your target policies scored by a judge. No log probabilities required.
Required Fields
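As a minimal sketch, a fresh-draw record can be built from the fields marked required in the Field Details table below (prompt, response, judge_score). The values are invented for illustration, and prompt_id is included only to show the optional identifier.

```python
import json

# Illustrative fresh-draw record; all values are made up for this example.
fresh_draw = {
    "prompt_id": "example_0",   # optional: generated from the prompt hash if missing
    "prompt": "What is 2 + 2?",
    "response": "2 + 2 equals 4.",
    "judge_score": 0.9,         # must be in [0, 1]
}

# JSONL means one JSON object per line, so serialize without pretty-printing.
line = json.dumps(fresh_draw)
```

Writing one such line per response to a per-policy file gives the layout described under File Organization.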
Optional: Add Oracle Labels for Calibration
If 50% or more of fresh draws have oracle_label, Direct mode automatically learns judge→oracle calibration and applies calibrated rewards.
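To see whether a set of fresh draws reaches the 50% threshold, you can count the labeled records before running the analysis. The helper below is a hypothetical sketch, not part of CJE, and it assumes a null oracle_label counts as missing.

```python
def oracle_coverage(records):
    """Return the fraction of records that carry a non-null oracle_label."""
    if not records:
        return 0.0
    labeled = sum(1 for r in records if r.get("oracle_label") is not None)
    return labeled / len(records)


# Four illustrative draws: two labeled, one unlabeled, one with a null label.
draws = [
    {"judge_score": 0.8, "oracle_label": 1.0},
    {"judge_score": 0.4, "oracle_label": 0.0},
    {"judge_score": 0.6},
    {"judge_score": 0.7, "oracle_label": None},
]

coverage = oracle_coverage(draws)
meets_threshold = coverage >= 0.5  # the 50% rule described above
```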
File Organization
Store fresh draws in separate files per policy:
responses/clone_responses.jsonl
responses/parallel_universe_prompt_responses.jsonl
responses/unhelpful_responses.jsonl
IPS/DR Modes (Logged Data)
For IPS and DR modes (counterfactual estimation), you need logged data with importance weights computed from log probabilities.
Required Fields
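A logged record adds the log-probability fields to the Direct mode fields. The sketch below uses the field names from the Field Details table and two of the policy names from the Arena sample; the numeric values are invented for illustration.

```python
import json

# Illustrative logged record; all values are made up for this example.
logged = {
    "prompt": "What is 2 + 2?",
    "response": "2 + 2 equals 4.",
    "base_policy_logprob": -12.5,   # log probability under the logging policy, must be <= 0
    "target_policy_logprobs": {     # dict of policy name -> log probability
        "clone": -12.1,
        "unhelpful": -30.4,
    },
    "judge_score": 0.85,            # must be in [0, 1]
    "oracle_label": 1.0,            # optional ground truth in [0, 1]
}

# One JSON object per line in the JSONL file.
line = json.dumps(logged)
```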
Important: Log Probabilities
All log probabilities must be ≤ 0 (negative or zero). These are log probabilities, not raw probabilities. A probability of 0.5 becomes log(0.5) ≈ -0.693.
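The conversion and the resulting importance weight can be sketched in a few lines. The weight shown is the textbook importance ratio, exp(target logprob minus base logprob); importance_weight is a hypothetical helper, and CJE may apply further stabilization that is not shown here.

```python
import math

# A raw probability of 0.5 becomes a negative log probability.
logp = math.log(0.5)  # approximately -0.693


def importance_weight(base_logprob, target_logprob):
    """Textbook importance ratio p_target / p_base, computed in log space."""
    return math.exp(target_logprob - base_logprob)


# The target policy finds this response slightly more likely than the base policy,
# so the weight is a little above 1.
w = importance_weight(base_logprob=-12.5, target_logprob=-12.1)
```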
Getting Log Probabilities
CJE includes built-in Fireworks API integration for computing teacher-forced log probabilities. It handles chat templates, tokenization, and API calls automatically. See cje/teacher_forcing/ for details.
Field Details
Field | Type | Required | Notes |
---|---|---|---|
prompt_id | string | Auto-generated | Unique identifier, generated from prompt hash if missing |
prompt | string | Yes | Input text/question |
response | string | Yes | Generated output |
base_policy_logprob | number | IPS/DR only | Must be ≤ 0 |
target_policy_logprobs | object | IPS/DR only | Dict of policy → logprob |
judge_score | number | Yes | Must be in [0, 1] |
oracle_label | number | Optional | Ground truth in [0, 1], 5-50% coverage recommended |
Data Validation Rules
- Log probabilities must be ≤ 0
  Common error: Using raw probabilities instead of log probabilities
- Judge scores must be in [0, 1]
  Normalize 0-10 scales to 0-1 by dividing by 10
- Oracle labels must be in [0, 1]
  For binary outcomes, use 0 or 1
- Missing log probs → sample skipped
  At least 50% of samples need complete logprobs for IPS/DR modes
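The range rules above can be checked on your own data before handing it to CJE. The function below is a hypothetical pre-flight check written for this guide, not CJE's own validator; it only mirrors the rules as stated.

```python
def validate_record(record):
    """Return a list of problems found in one record; an empty list means it passes."""
    problems = []

    score = record.get("judge_score")
    if score is None:
        problems.append("judge_score is missing")
    elif not 0.0 <= score <= 1.0:
        problems.append(f"judge_score {score} is outside [0, 1]")

    label = record.get("oracle_label")
    if label is not None and not 0.0 <= label <= 1.0:
        problems.append(f"oracle_label {label} is outside [0, 1]")

    # Collect every log probability present, then apply the <= 0 rule.
    logprobs = {}
    if record.get("base_policy_logprob") is not None:
        logprobs["base_policy_logprob"] = record["base_policy_logprob"]
    for policy, lp in (record.get("target_policy_logprobs") or {}).items():
        if lp is not None:
            logprobs[f"target_policy_logprobs[{policy}]"] = lp
    for name, lp in logprobs.items():
        if lp > 0:
            problems.append(f"{name} = {lp} is positive (raw probability?)")

    return problems


# A judge score on a 0-10 scale fails until it is divided by 10.
raw = {"prompt": "q", "response": "a", "judge_score": 8}
normalized = {**raw, "judge_score": raw["judge_score"] / 10}

raw_problems = validate_record(raw)
normalized_problems = validate_record(normalized)
```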
Working Example
Arena Sample Dataset
See examples/arena_sample/ in the CJE repository for a complete working example with:
- Logged data with judge scores and oracle labels
- Fresh draws for each policy (for DR estimation)
- 100 samples from real Arena 10K evaluation
- 4 target policies: clone, premium, parallel_universe_prompt, unhelpful
Common Issues
Error: "Log probability must be ≤ 0"
Cause: Using raw probabilities instead of log probabilities
Solution: Use math.log(probability) or ensure your API returns logprobs
Error: "Insufficient data" or low logprob coverage
Cause: Not enough samples have complete logprobs (need both base and all target policies)
Solution: Compute missing logprobs using cje.teacher_forcing or use Direct mode with fresh draws
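To diagnose low coverage before running IPS or DR, you can measure how many samples have a base logprob and a logprob for every target policy. Both helpers below are hypothetical and only approximate the completeness rule described in the cause above.

```python
def has_complete_logprobs(record, target_policies):
    """True if the record has a base logprob and a logprob for every target policy."""
    if record.get("base_policy_logprob") is None:
        return False
    targets = record.get("target_policy_logprobs") or {}
    return all(targets.get(policy) is not None for policy in target_policies)


def logprob_coverage(records, target_policies):
    """Fraction of records with complete logprobs."""
    if not records:
        return 0.0
    complete = sum(has_complete_logprobs(r, target_policies) for r in records)
    return complete / len(records)


policies = ["clone", "unhelpful"]
samples = [
    {"base_policy_logprob": -5.0, "target_policy_logprobs": {"clone": -4.8, "unhelpful": -9.1}},
    {"base_policy_logprob": -6.2, "target_policy_logprobs": {"clone": -6.0, "unhelpful": -11.3}},
    {"base_policy_logprob": -4.1, "target_policy_logprobs": {"clone": -4.0}},  # missing one target
    {"target_policy_logprobs": {"clone": -3.3, "unhelpful": -7.7}},            # missing base
]

coverage = logprob_coverage(samples, policies)
enough_for_ips_dr = coverage >= 0.5  # the 50% rule from Data Validation Rules
```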
Error: "Judge field 'judge_score' not found"
Cause: Missing judge scores in data
Solution: Ensure data has judge_score field at top level or in metadata