
Off-Policy Re-use

Calibrated IPS and DR for evaluating policies from logged data

When you can't generate fresh outputs—maybe it's too expensive, API-limited, or you need to evaluate on historical data only—off-policy evaluation lets you reuse logged data to estimate how a new policy would perform.

Example scenario

You logged 10,000 chatbot conversations using GPT-4 (your "logging policy"). Now you want to know if Claude 3.5 would have performed better—without actually running Claude on all those conversations. Off-policy methods let you answer this using your existing logs.

Calibrated IPS

Reweighting logged data to estimate new policy performance

The core idea

IPS (inverse propensity scoring) reweights your logged data based on how likely each response would be under the new policy vs the old policy. The intuition:

  • Responses that your new policy would likely produce → count them more
  • Responses that your new policy would rarely produce → count them less
  • This reweighting simulates what you'd see if the new policy had generated the data

For each logged example, compute a weight:

W = π′(response | context) / π₀(response | context)

Weight = (how likely under new policy) / (how likely under old policy)

Then compute your estimate as a weighted average: multiply each calibrated reward by its weight, then average.
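
A minimal sketch of that computation, assuming you already have sequence-level log-probabilities for each logged response under both policies (the helper name is illustrative, not the CJE API):

```python
import numpy as np

def calibrated_ips(logp_new, logp_old, rewards):
    """Illustrative calibrated-IPS estimate (hypothetical helper, not the CJE API).

    logp_new / logp_old: sequence-level log-probabilities of each logged
    response under the new and old policies (via teacher forcing).
    rewards: calibrated rewards R for the same logged examples.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    rewards = np.asarray(rewards, dtype=float)

    # W_i = pi_new(response | context) / pi_old(response | context),
    # computed in log space to avoid underflow on long responses.
    weights = np.exp(logp_new - logp_old)

    # Weighted average of calibrated rewards.
    return np.mean(weights * rewards)
```

In practice a self-normalized variant (dividing by the mean weight rather than by 1) is often used for additional stability.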

Strengths

  • Gives correct answer on average (statistically unbiased)
  • No need to generate fresh responses
  • Works with any logged data that has scores

Challenges

  • Estimates can be noisy when policies differ a lot
  • Requires probability scores from your LLM
  • A few extreme weights can dominate the average

SIMCal: taming extreme weights

Raw weights are often unstable: a few examples get massive weights while most get tiny weights. This makes estimates noisy and unreliable. SIMCal fixes this by smoothing the weights.

How SIMCal works

  1. Key insight: Examples with similar judge scores should get similar weights. If two responses both scored 8/10, they shouldn't have wildly different weights.
  2. Smoothing: Fit a smooth curve that predicts weights from judge scores. This replaces extreme outlier weights with more reasonable values.
  3. Preserve the average: Make sure the smoothed weights still add up to the same total, so you're not changing what you're estimating—just making it more stable.
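
The sketch below illustrates the core idea with isotonic regression. It is a simplification, not the actual SIMCal algorithm, but it shows the two ingredients: a monotone-in-S smoothing of the weights and a mean-preserving rescale.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def smooth_weights(judge_scores, raw_weights):
    """Simplified illustration of the SIMCal idea (not the actual algorithm):
    replace raw importance weights with a monotone function of the judge
    score S, then rescale so the mean weight is unchanged."""
    judge_scores = np.asarray(judge_scores, dtype=float)
    raw_weights = np.asarray(raw_weights, dtype=float)

    # 1. Examples with similar judge scores get similar weights: fit a
    #    monotone (isotonic) curve mapping S to weight.
    iso = IsotonicRegression(increasing="auto", out_of_bounds="clip")
    smoothed = iso.fit_transform(judge_scores, raw_weights)

    # 2. Mean-preserving rescale, so the estimand doesn't change;
    #    only the variance of the weights does.
    smoothed *= raw_weights.mean() / smoothed.mean()
    return smoothed
```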

Typical improvements

On LMSYS Arena data comparing GPT-4 to Claude:

  • Raw IPS: ESS = 0.6% (12 effective samples from 2000), max weight = 47%
  • After SIMCal: ESS = 94.6% (1892 effective samples), max weight = 0.8%
  • Result: 158× improvement in effective sample size, CI width reduced by 12×

What to report (IPS)

Point estimate & CI

IPS(π′) = 0.42 [0.37, 0.47] (95% CI with OUA)

ESS fraction

Raw: 12.3% → After SIMCal: 68.5%

→ Stabilization gave 5.6× more effective samples

Max-weight share

Top 1% of samples carry 8.2% of total weight (after SIMCal)

→ Reasonable concentration; no outlier dominance

Tail index

Hill estimator: α = 3.2 (moderately light tails)

→ Variance is well-behaved

Overlap heatmap

Weights well-distributed across S ∈ [3, 9] and prompt lengths 50-200 tokens
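
The ESS fraction, max-weight share, and Hill tail index above can all be computed directly from the calibrated weights. A small illustrative sketch follows; the 5% tail cutoff for the Hill estimator is a common heuristic, not a CJE-prescribed value.

```python
import numpy as np

def weight_diagnostics(weights, tail_frac=0.05):
    """Overlap diagnostics for importance weights (illustrative sketch).

    Returns the ESS fraction, the share of total weight carried by the
    single largest weight, and a Hill estimate of the right-tail index
    based on the top `tail_frac` of weights. Assumes strictly positive,
    not-all-equal weights.
    """
    w = np.sort(np.asarray(weights, dtype=float))
    n = len(w)

    # Effective sample size as a fraction of n: (sum w)^2 / (n * sum w^2).
    ess_frac = w.sum() ** 2 / (n * np.sum(w ** 2))

    # Max-weight share: how much of the total weight one sample carries.
    max_share = w[-1] / w.sum()

    # Hill tail-index estimator on the k largest weights; alpha > 2
    # suggests the weight distribution has finite variance.
    k = max(int(tail_frac * n), 2)
    top, threshold = w[-k:], w[-k - 1]
    hill_alpha = 1.0 / np.mean(np.log(top / threshold))

    return ess_frac, max_share, hill_alpha
```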

Calibrated DR

Hedging IPS with an outcome model

What DR is

Doubly Robust (DR) estimation combines importance weighting with outcome modeling. You fit two models:

  • μπ'(X): Outcome model for target policy π′ (trained on one fresh draw per prompt)
  • q(X, A): Outcome model for logger π₀ (trained on logged data)

The DR estimator is:

DR = (1/n) ∑ [μπ'(Xᵢ) + Wcalᵢ (Rᵢ - q(Xᵢ, Aᵢ))]
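
As a sketch, the estimator is a one-liner once the outcome-model predictions and calibrated weights are in hand (argument names are hypothetical):

```python
import numpy as np

def dr_estimate(mu_new, q_logged, rewards, cal_weights):
    """Doubly-robust estimate (sketch, hypothetical argument names).

    mu_new:      mu_pi'(X_i), outcome model evaluated for the target policy
    q_logged:    q(X_i, A_i), outcome model for the logged responses
    rewards:     calibrated rewards R_i
    cal_weights: calibrated importance weights W^cal_i (e.g. after SIMCal)
    """
    mu_new = np.asarray(mu_new, dtype=float)
    q_logged = np.asarray(q_logged, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    cal_weights = np.asarray(cal_weights, dtype=float)

    # Per-example DR term: outcome-model prediction for the target policy,
    # plus a weighted correction using the logged reward.
    terms = mu_new + cal_weights * (rewards - q_logged)
    return terms.mean()
```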

Why you care: doubly robust protection

DR is consistent if either the weights are correct or the critic q is decent. You don't need both—one suffices. This is the "doubly robust" guarantee:

  • ✓ If weights are perfect: DR = IPS (unbiased even if q is garbage)
  • ✓ If q is perfect: DR = DM-like estimator (unbiased even if weights are noisy)
  • ✓ If both are decent: DR often has lower variance than either alone

Practical takeaway

Use DR when overlap is imperfect (low ESS) and you can train a reasonable critic. Just one fresh draw per prompt is often enough to fit μπ' and get substantial variance reduction over pure IPS.

Orthogonality test

The orthogonality test verifies DR's first-order protection. Compute the weighted moment:

θ = (1/n) ∑ Wcalᵢ (Rᵢ - q(Xᵢ, Aᵢ))

This should be near zero with a CI that covers zero. If not, either the weights or the critic (or both) need improvement.
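
A sketch of the test, using a normal-approximation confidence interval for the weighted residual moment (the function name is illustrative):

```python
import numpy as np
from scipy import stats

def orthogonality_test(cal_weights, rewards, q_logged, alpha=0.05):
    """Orthogonality check (sketch): theta should be near zero with a CI
    that covers zero; otherwise the weights or the critic need work."""
    terms = (np.asarray(cal_weights, dtype=float)
             * (np.asarray(rewards, dtype=float) - np.asarray(q_logged, dtype=float)))

    theta = terms.mean()
    se = terms.std(ddof=1) / np.sqrt(len(terms))
    z = stats.norm.ppf(1 - alpha / 2)            # 1.96 for a 95% CI
    ci = (theta - z * se, theta + z * se)

    return theta, ci, ci[0] <= 0.0 <= ci[1]      # True means the test passes
```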

Test passes

θ = 0.012 [-0.008, 0.032]
→ CI covers zero, DR protection verified

Test fails

θ = 0.087 [0.061, 0.113]
→ CI excludes zero, improve critic or weights

What to report (DR)

Point estimate & CI

DR(π′) = 0.43 [0.39, 0.47] (95% CI with OUA)

Orthogonality test

θ = 0.008 [-0.012, 0.028]

→ CI covers zero ✓ First-order protection verified

ESS fraction (after SIMCal)

71.2% effective sample size

OUA share

18% of total variance from calibration uncertainty

Variance reduction vs IPS

CI width: IPS = 0.14, DR = 0.08 (43% tighter)

Implementation notes

Getting probability scores from your LLM

To compute weights, you need your LLM to tell you "how likely was this response?" Most APIs support this via "log probabilities" or "logprobs." Key requirements:

  • Consistent tokenization: The same text should split into the same tokens every time (most modern LLMs handle this automatically)
  • Complete scores: You need probability scores for the entire response that was logged, not just parts of it
  • Match generation settings: If you logged responses with temperature=0.7, get probabilities using temperature=0.7 (not 0.0 or 1.0)
  • API compatibility: OpenAI, Anthropic, and most providers expose logprobs—check your provider's docs for "log probabilities" or "logprobs" endpoints
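
Fetching per-token logprobs is provider-specific, so the sketch below covers only the provider-agnostic step: summing token logprobs into a sequence-level score and forming the weight ratio in log space (function names are illustrative):

```python
import math

def sequence_logprob(token_logprobs):
    """Sum per-token log-probabilities into one sequence-level score.
    `token_logprobs` is whatever your provider's logprobs endpoint returns
    for the full logged response (teacher forcing) under a given policy."""
    return sum(token_logprobs)

def importance_weight(token_logprobs_new, token_logprobs_old):
    """W = p_new(response | context) / p_old(response | context),
    computed in log space so long responses don't underflow."""
    return math.exp(sequence_logprob(token_logprobs_new)
                    - sequence_logprob(token_logprobs_old))
```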

Practical note

If your LLM provider doesn't expose probability scores, you can't use IPS/DR—fall back to the Direct Method instead.

Cross-fitting for DR (avoiding overfitting)

When training the outcome models for DR, use cross-fitting to avoid overfitting:

  1. Split your data into 5 groups
  2. Train your model on groups 1-4, apply it to group 5
  3. Repeat for each group (train on the other 4, apply to the held-out one)
  4. This ensures your model isn't "cheating" by memorizing training examples
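
A minimal cross-fitting sketch with scikit-learn; the regressor choice is arbitrary, and any reasonable critic model works:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

def cross_fit_predictions(X, y, n_splits=5, seed=0):
    """Cross-fitted out-of-fold predictions for an outcome model (sketch).

    X: 2D feature matrix, y: observed rewards. Every example is scored by
    a model that never saw it during training, which is what DR needs to
    avoid overfitting bias.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    preds = np.empty(len(y))

    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, held_out_idx in kf.split(X):
        model = GradientBoostingRegressor()
        model.fit(X[train_idx], y[train_idx])                  # train on the other folds
        preds[held_out_idx] = model.predict(X[held_out_idx])   # score the held-out fold

    return preds
```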

Complete off-policy workflow

IPS workflow

  1. Fit AutoCal-R on labeled slice: S → R
  2. Compute raw importance weights via teacher forcing
  3. Run SIMCal to stabilize weights (monotone in S, mean-preserving)
  4. Estimate V̂IPS = (1/n) ∑ Wcalᵢ Rᵢ
  5. Add OUA to CI
  6. Check ESS, tails, and overlap diagnostics
  7. Report: point estimate, CI, ESS, max-weight, tail index, heatmap

DR workflow

  1. Fit AutoCal-R on labeled slice: S → R
  2. Train outcome models (cross-fit): μπ'(X) and q(X,A)
  3. Compute raw weights via teacher forcing, stabilize with SIMCal
  4. Estimate V̂DR = (1/n) ∑ [μπ'(Xᵢ) + Wcalᵢ(Rᵢ - q(Xᵢ,Aᵢ))]
  5. Add OUA to CI
  6. Run orthogonality test + ESS/tail checks
  7. Report: point estimate, CI, orthogonality, ESS, OUA share

When to use IPS vs DR

  • Use IPS when:
    • Overlap is decent (ESS > 40% after SIMCal)
    • You can't afford even one fresh draw per prompt
    • You want the simplest off-policy method
  • Use DR when:
    • Overlap is imperfect but not terrible (ESS 20-60%)
    • You can generate one fresh draw per prompt for μπ'
    • You can train a reasonable critic q on logged data
    • You want tighter CIs via variance reduction
  • Switch to DM when:
    • ESS is very low (<20%) even after SIMCal
    • You can afford fresh generation for all policies
    • Simpler is better and you have the generation budget