
Data Format Guide

Required fields for each analysis mode

CJE requires different data depending on which mode you're using. All data should be in JSONL format (one JSON object per line).

Direct Mode (Fresh Draws Only)

For Direct mode (on-policy comparison), you only need fresh responses from your target policies scored by a judge. No log probabilities required.

Required Fields

{ "prompt_id": "arena_0", "prompt": "What is 2+2?", "response": "4", "policy": "clone", "judge_score": 0.85 }

Optional: Add Oracle Labels for Calibration

{ "prompt_id": "arena_0", "prompt": "What is 2+2?", "response": "4", "policy": "clone", "judge_score": 0.85, "oracle_label": 0.86 // Ground truth (50% coverage enables calibration) }

If 50% or more of fresh draws have oracle_label, Direct mode automatically learns a judge→oracle calibration and applies calibrated rewards.
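To check whether a file clears that threshold before running an analysis, oracle coverage can be computed directly. A small sketch:

import json

with open("responses/clone_responses.jsonl") as f:
    records = [json.loads(line) for line in f]

coverage = sum("oracle_label" in r for r in records) / len(records)
print(f"Oracle coverage: {coverage:.0%}")  # >= 50% enables automatic calibration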

File Organization

Store fresh draws in separate files per policy:

  • responses/clone_responses.jsonl
  • responses/parallel_universe_prompt_responses.jsonl
  • responses/unhelpful_responses.jsonl
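If responses for all policies are first collected in a single file, they can be split on the policy field. A sketch (the combined input file all_responses.jsonl is hypothetical):

import json
from collections import defaultdict

by_policy = defaultdict(list)
with open("all_responses.jsonl") as f:  # hypothetical combined file
    for line in f:
        record = json.loads(line)
        by_policy[record["policy"]].append(record)

for policy, records in by_policy.items():
    with open(f"responses/{policy}_responses.jsonl", "w") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")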

IPS/DR Modes (Logged Data)

For IPS and DR modes (counterfactual estimation), you need logged data with importance weights computed from log probabilities.

Required Fields

{ "prompt": "What is 2+2?", "response": "4", "base_policy_logprob": -14.7, // Log P(response|prompt) for logging policy "target_policy_logprobs": { // Same for policies to evaluate "clone": -14.7, "parallel_universe_prompt": -18.3, "unhelpful": -42.1 }, "judge_score": 0.85, // Required: judge evaluation "oracle_label": 0.86 // Optional: ground truth (5-10% is enough) }

Important: Log Probabilities

All log probabilities must be ≤ 0 (negative or zero). These are log probabilities, not raw probabilities. A probability of 0.5 becomes log(0.5) ≈ -0.693.
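A quick pre-flight check along these lines can catch raw probabilities before analysis (check_logprob is a hypothetical helper, not part of CJE):

import math

def check_logprob(value: float) -> float:
    """Reject values that look like raw probabilities rather than logprobs."""
    if value > 0:
        raise ValueError(
            f"Log probability must be <= 0, got {value}. "
            "Did you forget to take math.log()?"
        )
    return value

check_logprob(math.log(0.5))  # ≈ -0.693, OK
check_logprob(-14.7)          # OK
# check_logprob(0.5)          # raises: 0.5 looks like a raw probability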

Getting Log Probabilities

CJE includes built-in Fireworks API integration for computing teacher-forced log probabilities:

from cje.teacher_forcing import compute_teacher_forced_logprob

# Compute log P(response|prompt) for any model on Fireworks
result = compute_teacher_forced_logprob(
    prompt="What is 2+2?",
    response="4",
    model="accounts/fireworks/models/llama-v3p2-3b-instruct",
)

if result.status == "success":
    logprob = result.value  # e.g., -2.3

This handles chat templates, tokenization, and API calls automatically. See cje/teacher_forcing/ for details.
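Building on that helper, a loop like the following could fill in target_policy_logprobs for a logged record; the POLICY_MODELS mapping is a placeholder you would adapt to your own policies:

from cje.teacher_forcing import compute_teacher_forced_logprob

# Hypothetical mapping of policy names to Fireworks model IDs
POLICY_MODELS = {
    "clone": "accounts/fireworks/models/llama-v3p2-3b-instruct",
    "unhelpful": "accounts/fireworks/models/llama-v3p2-3b-instruct",
}

def add_target_logprobs(record: dict) -> dict:
    """Compute log P(response|prompt) under each target policy."""
    logprobs = {}
    for policy, model in POLICY_MODELS.items():
        result = compute_teacher_forced_logprob(
            prompt=record["prompt"],
            response=record["response"],
            model=model,
        )
        if result.status == "success":
            logprobs[policy] = result.value
    record["target_policy_logprobs"] = logprobs
    return record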

Field Details

Field                  | Type   | Required       | Notes
-----------------------|--------|----------------|----------------------------------------------------------
prompt_id              | string | Auto-generated | Unique identifier, generated from prompt hash if missing
prompt                 | string | Yes            | Input text/question
response               | string | Yes            | Generated output
base_policy_logprob    | number | IPS/DR only    | Must be ≤ 0
target_policy_logprobs | object | IPS/DR only    | Dict of policy → logprob
judge_score            | number | Yes            | Must be in [0, 1]
oracle_label           | number | Optional       | Ground truth in [0, 1], 5-50% coverage recommended

Data Validation Rules

  • Log probabilities must be ≤ 0
    Common error: Using raw probabilities instead of log probabilities
  • Judge scores must be in [0, 1]
    Normalize 0-10 scales to 0-1 by dividing by 10
  • Oracle labels must be in [0, 1]
    For binary outcomes, use 0 or 1
  • Missing log probs → sample skipped
    At least 50% of samples need complete logprobs for IPS/DR modes
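These rules translate into a straightforward pre-flight check. A sketch (validate_record is illustrative, not a CJE API; field names follow the table above):

def validate_record(record: dict, policies: list[str]) -> list[str]:
    """Return problems found in one logged record (empty list means valid)."""
    problems = []
    base_lp = record.get("base_policy_logprob")
    if base_lp is None:
        problems.append("missing base_policy_logprob")
    elif base_lp > 0:
        problems.append("base_policy_logprob must be <= 0 (raw probability?)")
    target_lps = record.get("target_policy_logprobs", {})
    for policy in policies:
        lp = target_lps.get(policy)
        if lp is None:
            problems.append(f"missing target logprob for {policy!r}")
        elif lp > 0:
            problems.append(f"target logprob for {policy!r} must be <= 0")
    score = record.get("judge_score")
    if score is None or not 0 <= score <= 1:
        problems.append("judge_score must be present and in [0, 1]")
    oracle = record.get("oracle_label")
    if oracle is not None and not 0 <= oracle <= 1:
        problems.append("oracle_label must be in [0, 1]")
    return problems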

Working Example

Arena Sample Dataset

See examples/arena_sample/ in the CJE repository for a complete working example with:

  • Logged data with judge scores and oracle labels
  • Fresh draws for each policy (for DR estimation)
  • 100 samples from the real Arena 10K evaluation
  • 4 target policies: clone, premium, parallel_universe_prompt, unhelpful

Common Issues

Error: "Log probability must be ≤ 0"

Cause: Using raw probabilities instead of log probabilities

Solution: Use math.log(probability) or ensure your API returns logprobs

Error: "Insufficient data" or low logprob coverage

Cause: Not enough samples have complete logprobs (need both base and all target policies)

Solution: Compute missing logprobs using cje.teacher_forcing or use Direct mode with fresh draws

Error: "Judge field 'judge_score' not found"

Cause: Missing judge scores in data

Solution: Ensure data has judge_score field at top level or in metadata
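If your pipeline logged scores under a metadata key, a one-pass fix-up can hoist them to the top level. A sketch (the nested layout and file names are assumptions about your data):

import json

with open("logged_data.jsonl") as f, open("logged_data_fixed.jsonl", "w") as out:
    for line in f:
        record = json.loads(line)
        if "judge_score" not in record and "judge_score" in record.get("metadata", {}):
            record["judge_score"] = record["metadata"]["judge_score"]
        out.write(json.dumps(record) + "\n")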