Off-Policy Re-use
Calibrated IPS and DR for evaluating policies from logged data
When you can't generate fresh outputs—because generation is too expensive, your API access is limited, or you need to evaluate on historical data only—off-policy evaluation lets you reuse logged data to estimate how a new policy would perform.
Example scenario
You logged 10,000 chatbot conversations using GPT-4 (your "logging policy"). Now you want to know whether Claude 3.5 would have performed better—without generating fresh Claude responses for all those conversations. Off-policy methods let you answer this using your existing logs.
Calibrated IPS
Reweighting logged data to estimate new policy performance
The core idea
IPS (Inverse Propensity Scoring) reweights your logged data based on how likely each response would be under the new policy versus the old policy. The intuition:
- • Responses that your new policy would likely produce → count them more
- • Responses that your new policy would rarely produce → count them less
- • This reweighting simulates what you'd see if the new policy had generated the data
For each logged example, compute a weight:
Weight = (probability of the logged response under the new policy) / (probability of the logged response under the old policy)
Then compute your estimate as a weighted average: multiply each calibrated reward by its weight, then average.
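Here's a minimal sketch of that computation in Python, assuming you already have per-example log probabilities under both policies and calibrated rewards (the function and argument names are illustrative):

```python
import numpy as np

def ips_estimate(logp_new, logp_old, rewards):
    """Plain importance-weighted estimate of the new policy's value.

    logp_new: log P(logged response | prompt) under the new (target) policy
    logp_old: log P(logged response | prompt) under the old (logging) policy
    rewards:  calibrated reward R for each logged example
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    rewards = np.asarray(rewards, dtype=float)

    # Weight = likelihood ratio; work in log space to avoid under/overflow.
    weights = np.exp(logp_new - logp_old)

    # Weighted average of calibrated rewards = the IPS value estimate.
    return float(np.mean(weights * rewards))
```

A common variant (self-normalized IPS) divides by the mean weight instead of n; it trades a small amount of bias for noticeably lower variance when weights are skewed.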
Strengths
- • Gives correct answer on average (statistically unbiased)
- • No need to generate fresh responses
- • Works with any logged data that has scores
Challenges
- • Estimates can be noisy when policies differ a lot
- • Requires probability scores from your LLM
- • A few extreme weights can dominate the average
SIMCal: taming extreme weights
Raw weights are often unstable: a few examples get massive weights while most get tiny weights. This makes estimates noisy and unreliable. SIMCal fixes this by smoothing the weights.
How SIMCal works
- 1. Key insight: Examples with similar judge scores should get similar weights. If two responses both scored 8/10, they shouldn't have wildly different weights.
- 2. Smoothing: Fit a smooth curve that predicts weights from judge scores. This replaces extreme outlier weights with more reasonable values.
- 3. Preserve the average: Make sure the smoothed weights still add up to the same total, so you're not changing what you're estimating—just making it more stable.
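The sketch below shows the general idea, not the actual SIMCal algorithm: isotonic regression is used here as a convenient stand-in for fitting a curve that is monotone in the judge score, followed by a rescaling that preserves the mean weight.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def smooth_weights(judge_scores, raw_weights):
    """Illustrative stand-in for SIMCal-style weight stabilization."""
    s = np.asarray(judge_scores, dtype=float)
    w = np.asarray(raw_weights, dtype=float)

    # Fit a monotone curve predicting weight from judge score, so examples
    # with similar scores get similar (non-extreme) weights.
    iso = IsotonicRegression(increasing="auto", out_of_bounds="clip")
    w_smooth = iso.fit_transform(s, w)

    # Rescale so the average weight is unchanged (mean-preserving): the
    # target of estimation stays the same, only the variance drops.
    return w_smooth * (w.mean() / w_smooth.mean())
```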
Typical improvements
On LMSYS Arena data comparing GPT-4 to Claude:
- • Raw IPS: ESS = 0.6% (12 effective samples from 2000), max weight = 47%
- • After SIMCal: ESS = 94.6% (1892 effective samples), max weight = 0.8%
- • Result: 158× improvement in effective sample size, CI width reduced by 12×
What to report (IPS)
Point estimate & CI
V̂IPS(π′) = 0.42 [0.37, 0.47] (95% CI with OUA)
ESS fraction
Raw: 12.3% → After SIMCal: 68.5%
→ Stabilization gave 5.6× more effective samples
Max-weight share
Top 1% of samples carry 8.2% of total weight (after SIMCal)
→ Reasonable concentration; no outlier dominance
Tail index
Hill estimator: α = 3.2 (moderately light tails)
→ Variance is well-behaved
Overlap heatmap
Weights well-distributed across S ∈ [3, 9] and prompt lengths 50-200 tokens
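If you want to compute these diagnostics yourself, here is a rough sketch covering the effective-sample-size formula, the top-1% weight share, and a basic Hill tail-index estimate (the `tail_frac` parameter is an arbitrary choice):

```python
import numpy as np

def weight_diagnostics(weights, tail_frac=0.05):
    """Diagnostics for an array of (stabilized) importance weights."""
    w = np.sort(np.asarray(weights, dtype=float))
    n = len(w)

    # Effective sample size: (sum w)^2 / sum w^2, reported as a fraction of n.
    ess_fraction = w.sum() ** 2 / (np.sum(w ** 2) * n)

    # Share of total weight carried by the top 1% of samples.
    k_top = max(1, n // 100)
    top_share = w[-k_top:].sum() / w.sum()

    # Hill estimator of the tail index alpha, using the largest tail_frac of weights.
    k = max(2, int(n * tail_frac))
    tail = w[-(k + 1):]                       # k+1 largest weights, ascending
    alpha = 1.0 / np.mean(np.log(tail[1:] / tail[0]))

    return {"ess_fraction": ess_fraction,
            "top_1pct_weight_share": top_share,
            "hill_alpha": alpha}
```

As a rule of thumb, a tail index above 2 suggests the weight distribution has finite variance, which is what "variance is well-behaved" refers to above.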
Calibrated DR
Hedging IPS with an outcome model
What DR is
Doubly Robust (DR) estimation combines importance weighting with outcome modeling. You fit two models:
- μπ'(X): Outcome model for target policy π′ (trained on one fresh draw per prompt)
- q(X, A): Outcome model for logger π₀ (trained on logged data)
The DR estimator is:
V̂DR(π′) = (1/n) ∑ [μπ′(Xᵢ) + Wcalᵢ (Rᵢ - q(Xᵢ, Aᵢ))]
Why you care: doubly robust protection
DR is consistent if either the weights are correct or the critic q is correct. You don't need both—one suffices. This is the "doubly robust" guarantee:
- ✓ If weights are perfect: DR = IPS (unbiased even if q is garbage)
- ✓ If q is perfect: DR = DM-like estimator (unbiased even if weights are noisy)
- ✓ If both are decent: DR often has lower variance than either alone
Practical takeaway
Use DR when overlap is imperfect (low ESS) and you can train a reasonable critic. Just one fresh draw per prompt is often enough to fit μπ' and get substantial variance reduction over pure IPS.
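Given the outcome-model predictions and stabilized weights, the estimate itself is one line. The sketch below mirrors the formula in the DR workflow later on this page (argument names are illustrative):

```python
import numpy as np

def dr_estimate(mu_target, q_logged, weights, rewards):
    """Doubly robust value estimate for the target policy.

    mu_target: outcome-model prediction for the target policy, mu_pi'(X_i)
    q_logged:  critic prediction for the logged response, q(X_i, A_i)
    weights:   calibrated importance weights W_i
    rewards:   calibrated rewards R_i
    """
    mu_target = np.asarray(mu_target, dtype=float)
    q_logged = np.asarray(q_logged, dtype=float)
    weights = np.asarray(weights, dtype=float)
    rewards = np.asarray(rewards, dtype=float)

    # Direct-method term plus an importance-weighted residual correction.
    return float(np.mean(mu_target + weights * (rewards - q_logged)))
```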
Orthogonality test
The orthogonality test verifies DR's first-order protection. Compute the weighted moment of the critic residuals:
θ = (1/n) ∑ Wcalᵢ (Rᵢ - q(Xᵢ, Aᵢ))
This should be near zero with a CI that covers zero. If not, either the weights or the critic (or both) need improvement.
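A rough way to run this check, assuming the moment is the average weighted critic residual W * (R - q) with a normal-approximation confidence interval (treat this as illustrative rather than the exact test used in practice):

```python
import numpy as np

def orthogonality_test(weights, rewards, q_logged, z=1.96):
    """theta = mean(W * (R - q)) with an approximate 95% confidence interval."""
    w = np.asarray(weights, dtype=float)
    r = np.asarray(rewards, dtype=float)
    q = np.asarray(q_logged, dtype=float)

    terms = w * (r - q)
    theta = terms.mean()
    se = terms.std(ddof=1) / np.sqrt(len(terms))  # standard error of the mean
    return theta, (theta - z * se, theta + z * se)
```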
Test passes
θ = 0.012 [-0.008, 0.032]
→ CI covers zero, DR protection verified
Test fails
θ = 0.087 [0.061, 0.113]
→ CI excludes zero, improve critic or weights
What to report (DR)
Point estimate & CI
V̂DR(π′) = 0.43 [0.39, 0.47] (95% CI with OUA)
Orthogonality test
θ = 0.008 [-0.012, 0.028]
→ CI covers zero ✓ First-order protection verified
ESS fraction (after SIMCal)
71.2% effective sample size
OUA share
18% of total variance from calibration uncertainty
Variance reduction vs IPS
CI width: IPS = 0.14, DR = 0.08 (43% tighter)
Implementation notes
Getting probability scores from your LLM
To compute weights, you need your LLM to tell you "how likely was this response?" Most APIs support this via "log probabilities" or "logprobs." Key requirements:
- • Consistent tokenization: The same text should split into the same tokens every time (most modern LLMs handle this automatically)
- • Complete scores: You need probability scores for the entire response that was logged, not just parts of it
- • Match generation settings: If you logged responses with temperature=0.7, get probabilities using temperature=0.7 (not 0.0 or 1.0)
- • API compatibility: Many providers expose logprobs (OpenAI does, for example), but not all do—check your provider's docs for "log probabilities" or "logprobs" support
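A minimal sketch of how per-token logprobs turn into the sequence-level score you need (the field names are hypothetical; every provider structures its logprobs response differently):

```python
def sequence_logprob(tokens, token_logprobs, logged_text):
    """Sum per-token log probabilities over the entire logged response.

    `tokens` and `token_logprobs` stand in for whatever your provider returns
    when you score the logged response under a given policy.
    """
    # Complete scores: the scored tokens should cover the exact logged text.
    # (How tokens concatenate back into text depends on your tokenizer.)
    if "".join(tokens) != logged_text:
        raise ValueError("logprobs do not cover the full logged response")
    return sum(token_logprobs)
```

The importance weight for one example is then the exponential of the difference between the two sequence log probabilities (new policy minus old policy), as in the IPS sketch earlier.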
Practical note
If your LLM provider doesn't expose probability scores, you can't use IPS/DR—fall back to the Direct Method instead.
Cross-fitting for DR (avoiding overfitting)
When training the outcome models for DR, use cross-fitting to avoid overfitting:
- 1. Split your data into 5 groups
- 2. Train your model on groups 1-4, apply it to group 5
- 3. Repeat for each group (train on the other 4, apply to the held-out one)
- 4. This ensures your model isn't "cheating" by memorizing training examples
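A compact version of that recipe using scikit-learn, assuming you have a feature representation for each (prompt, response) pair and calibrated rewards as targets (the choice of regressor is arbitrary):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

def cross_fit_predictions(features, targets, n_splits=5, seed=0):
    """Out-of-fold critic predictions: each example is scored by a model
    that never saw it during training."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(targets, dtype=float)
    preds = np.empty_like(y)

    for train_idx, test_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        model = GradientBoostingRegressor()           # any regressor works here
        model.fit(X[train_idx], y[train_idx])         # train on the other folds
        preds[test_idx] = model.predict(X[test_idx])  # score the held-out fold
    return preds
```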
Complete off-policy workflow
IPS workflow
- 1. Fit AutoCal-R on labeled slice: S → R
- 2. Compute raw importance weights via teacher forcing
- 3. Run SIMCal to stabilize weights (monotone in S, mean-preserving)
- 4. Estimate V̂IPS = (1/n) ∑ Wcalᵢ Rᵢ
- 5. Add OUA to CI
- 6. Check ESS, tails, and overlap diagnostics
- 7. Report: point estimate, CI, ESS, max-weight, tail index, heatmap
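Step 1 appears in both workflows but hasn't been shown in code yet. Here is a minimal stand-in for the S → R calibration fit, using isotonic regression on the labeled slice (a convenient monotone calibrator for illustration; not necessarily what AutoCal-R does internally):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_reward_calibrator(judge_scores_labeled, oracle_rewards_labeled):
    """Learn a monotone map from judge score S to reward R on the labeled slice."""
    cal = IsotonicRegression(increasing=True, out_of_bounds="clip")
    cal.fit(np.asarray(judge_scores_labeled, dtype=float),
            np.asarray(oracle_rewards_labeled, dtype=float))
    return cal

# Apply it to every logged judge score to get calibrated rewards:
# rewards = fit_reward_calibrator(s_labeled, r_labeled).predict(s_logged)
```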
DR workflow
- 1. Fit AutoCal-R on labeled slice: S → R
- 2. Train outcome models (cross-fit): μπ'(X) and q(X,A)
- 3. Compute raw weights via teacher forcing, stabilize with SIMCal
- 4. Estimate V̂DR = (1/n) ∑ [μπ'(Xᵢ) + Wcalᵢ (Rᵢ - q(Xᵢ, Aᵢ))]
- 5. Add OUA to CI
- 6. Run orthogonality test + ESS/tail checks
- 7. Report: point estimate, CI, orthogonality, ESS, OUA share
When to use IPS vs DR
- Use IPS when:
- • Overlap is decent (ESS > 40% after SIMCal)
- • You can't afford even one fresh draw per prompt
- • You want the simplest off-policy method
- Use DR when:
- • Overlap is imperfect but not terrible (ESS 20-60%)
- • You can generate one fresh draw per prompt for μπ'
- • You can train a reasonable critic q on logged data
- • You want tighter CIs via variance reduction
- Switch to DM when:
- • ESS is very low (<20%) even after SIMCal
- • You can afford fresh generation for all policies
- • Simpler is better and you have the generation budget