
Data Format Guide

Required fields for each analysis mode

CJE requires different data depending on which mode you're using. All data should be in JSONL format (one JSON object per line).

Direct Mode (Fresh Draws Only)

For Direct mode (on-policy comparison), you only need fresh responses from your target policies scored by a judge. No log probabilities required.

Required Fields

{ "prompt_id": "arena_0", "prompt": "What is 2+2?", "response": "4", "policy": "clone", "judge_score": 0.85 }

Optional: Add Oracle Labels for Calibration

{ "prompt_id": "arena_0", "prompt": "What is 2+2?", "response": "4", "policy": "clone", "judge_score": 0.85, "oracle_label": 0.86 // Ground truth (50% coverage enables calibration) }

If 50% or more of fresh draws have oracle_label, Direct mode automatically learns a judge→oracle calibration and applies calibrated rewards.
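To check whether a file clears that threshold before running an analysis, oracle coverage can be computed directly. A small sketch:

import json

with open("responses/clone_responses.jsonl") as f:
    records = [json.loads(line) for line in f]

coverage = sum("oracle_label" in r for r in records) / len(records)
print(f"Oracle coverage: {coverage:.0%}")  # >= 50% enables automatic calibration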

File Organization

Store fresh draws in separate files per policy:

  • responses/clone_responses.jsonl
  • responses/parallel_universe_prompt_responses.jsonl
  • responses/unhelpful_responses.jsonl
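If responses for all policies are first collected in a single file, they can be split on the policy field. A sketch (the combined input file all_responses.jsonl is hypothetical):

import json
from collections import defaultdict

by_policy = defaultdict(list)
with open("all_responses.jsonl") as f:  # hypothetical combined file
    for line in f:
        record = json.loads(line)
        by_policy[record["policy"]].append(record)

for policy, records in by_policy.items():
    with open(f"responses/{policy}_responses.jsonl", "w") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")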

IPS/DR Modes (Logged Data)

For IPS and DR modes (counterfactual estimation), you need logged data with importance weights computed from log probabilities.

Required Fields

{ "prompt": "What is 2+2?", "response": "4", "base_policy_logprob": -14.7, // Log P(response|prompt) for logging policy "target_policy_logprobs": { // Same for policies to evaluate "clone": -14.7, "parallel_universe_prompt": -18.3, "unhelpful": -42.1 }, "judge_score": 0.85, // Required: judge evaluation "oracle_label": 0.86 // Optional: ground truth (5-10% is enough) }

Important: Log Probabilities

All log probabilities must be ≤ 0 (negative or zero). These are log probabilities, not raw probabilities. A probability of 0.5 becomes log(0.5) ≈ -0.693.
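A quick pre-flight check along these lines can catch raw probabilities before analysis (check_logprob is a hypothetical helper, not part of CJE):

import math

def check_logprob(value: float) -> float:
    """Reject values that look like raw probabilities rather than logprobs."""
    if value > 0:
        raise ValueError(
            f"Log probability must be <= 0, got {value}. "
            "Did you forget to take math.log()?"
        )
    return value

check_logprob(math.log(0.5))  # ≈ -0.693, OK
check_logprob(-14.7)          # OK
# check_logprob(0.5)          # raises: 0.5 looks like a raw probability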

Getting Log Probabilities

CJE includes built-in Fireworks API integration for computing teacher-forced log probabilities:

from cje.teacher_forcing import compute_teacher_forced_logprob

# Compute log P(response|prompt) for any model on Fireworks
result = compute_teacher_forced_logprob(
    prompt="What is 2+2?",
    response="4",
    model="accounts/fireworks/models/llama-v3p2-3b-instruct",
)

if result.status == "success":
    logprob = result.value  # e.g., -2.3

This handles chat templates, tokenization, and API calls automatically. See cje/teacher_forcing/ for details.
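Building on that helper, a loop like the following could fill in target_policy_logprobs for a logged record; the POLICY_MODELS mapping is a placeholder you would adapt to your own policies:

from cje.teacher_forcing import compute_teacher_forced_logprob

# Hypothetical mapping of policy names to Fireworks model IDs
POLICY_MODELS = {
    "clone": "accounts/fireworks/models/llama-v3p2-3b-instruct",
    "unhelpful": "accounts/fireworks/models/llama-v3p2-3b-instruct",
}

def add_target_logprobs(record: dict) -> dict:
    """Compute log P(response|prompt) under each target policy."""
    logprobs = {}
    for policy, model in POLICY_MODELS.items():
        result = compute_teacher_forced_logprob(
            prompt=record["prompt"],
            response=record["response"],
            model=model,
        )
        if result.status == "success":
            logprobs[policy] = result.value
    record["target_policy_logprobs"] = logprobs
    return record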

Field Details

Field                  | Type   | Required       | Notes
-----------------------|--------|----------------|----------------------------------------------------------
prompt_id              | string | Auto-generated | Unique identifier, generated from prompt hash if missing
prompt                 | string | Yes            | Input text/question
response               | string | Yes            | Generated output
base_policy_logprob    | number | IPS/DR only    | Must be ≤ 0
target_policy_logprobs | object | IPS/DR only    | Dict of policy → logprob
judge_score            | number | Yes            | Must be in [0, 1]
oracle_label           | number | Optional       | Ground truth in [0, 1], 5-50% coverage recommended

Data Validation Rules

  • Log probabilities must be ≤ 0
    Common error: Using raw probabilities instead of log probabilities
  • Judge scores must be in [0, 1]
    Normalize 0-10 scales to 0-1 by dividing by 10
  • Oracle labels must be in [0, 1]
    For binary outcomes, use 0 or 1
  • Missing log probs → sample skipped
    At least 50% of samples need complete logprobs for IPS/DR modes
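These rules translate into a straightforward pre-flight check. A sketch (validate_record is illustrative, not a CJE API; field names follow the table above):

def validate_record(record: dict, policies: list[str]) -> list[str]:
    """Return problems found in one logged record (empty list means valid)."""
    problems = []
    base_lp = record.get("base_policy_logprob")
    if base_lp is None:
        problems.append("missing base_policy_logprob")
    elif base_lp > 0:
        problems.append("base_policy_logprob must be <= 0 (raw probability?)")
    target_lps = record.get("target_policy_logprobs", {})
    for policy in policies:
        lp = target_lps.get(policy)
        if lp is None:
            problems.append(f"missing target logprob for {policy!r}")
        elif lp > 0:
            problems.append(f"target logprob for {policy!r} must be <= 0")
    score = record.get("judge_score")
    if score is None or not 0 <= score <= 1:
        problems.append("judge_score must be present and in [0, 1]")
    oracle = record.get("oracle_label")
    if oracle is not None and not 0 <= oracle <= 1:
        problems.append("oracle_label must be in [0, 1]")
    return problems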

Working Example

Arena Sample Dataset

See examples/arena_sample/ in the CJE repository for a complete working example with:

  • Logged data with judge scores and oracle labels
  • Fresh draws for each policy (for DR estimation)
  • 100 samples from the real Arena 10K evaluation
  • 4 target policies: clone, premium, parallel_universe_prompt, unhelpful

Common Issues

Error: "Log probability must be ≤ 0"

Cause: Using raw probabilities instead of log probabilities

Solution: Use math.log(probability) or ensure your API returns logprobs

Error: "Insufficient data" or low logprob coverage

Cause: Not enough samples have complete logprobs (need both base and all target policies)

Solution: Compute missing logprobs using cje.teacher_forcing or use Direct mode with fresh draws

Error: "Judge field 'judge_score' not found"

Cause: Missing judge scores in data

Solution: Ensure data has judge_score field at top level or in metadata
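If your pipeline logged scores under a metadata key, a one-pass fix-up can hoist them to the top level. A sketch (the nested layout and file names are assumptions about your data):

import json

with open("logged_data.jsonl") as f, open("logged_data_fixed.jsonl", "w") as out:
    for line in f:
        record = json.loads(line)
        if "judge_score" not in record and "judge_score" in record.get("metadata", {}):
            record["judge_score"] = record["metadata"]["judge_score"]
        out.write(json.dumps(record) + "\n")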