CIMO LabsCIMO Labs
← Research

Research Note

Post-Audit Drift Correction: Offset vs EIF Correction vs Refit

Eddie Landesberg12 min read

Transport audits tell you when calibration has drifted. This note studies the harder follow-up question: once drift is detected, what correction should be the default?

Abstract

We compare post-audit correction strategies for calibrated LLM judges under controlled transport drift. Methods include stale plug-in, global offset, policy-specific residual correction (AIPW/EIF-style first-moment update), and recent/pooled refits (monotone and two-stage). Global offset reduces bias relative to the stale plug-in when drift is material, but is not a robust default under structural drift. Policy-specific correction is a stronger default for first-moment drift. Refit remains necessary when residual structure is score- or covariate-conditional.

TL;DR

  • Global offset reduces bias when drift is material — a cheap first step that corrects common mean drift with zero model fitting. Does not help pairwise comparisons (cancels in differences).
  • Policy-specific EIF correction is the default — it captures per-policy drift that a single global offset misses, and changes relative rankings.
  • Escalate to refit only when residuals show structural drift — i.e. when the correction needed is score-conditional or covariate-conditional.

1. Problem

Let fold(S)f_{\text{old}}(S) be a calibrator learned in a prior era and let the target estimand for policy pp be the policy value θp=Ep[Y]\theta_p = \mathbb{E}_p[Y]. After an audit in the new era, we need to update estimates without blindly refitting on every cycle.

The key decomposition is:

Key decomposition

θp  =  Ep ⁣[fold(S)]old plug-in  +  Ep ⁣[Yfold(S)]residual correction\theta_p \;=\; \underbrace{\mathbb{E}_p\!\bigl[f_{\text{old}}(S)\bigr]}_{\text{old plug-in}} \;+\; \underbrace{\mathbb{E}_p\!\bigl[Y - f_{\text{old}}(S)\bigr]}_{\text{residual correction}}

This identity separates the old plug-in term from the residual correction term. The correction question becomes: how rich does that residual update need to be?

All post-audit corrections can be seen as choosing a function class for the residual model r(p,S,X)=E[Yfold(S,X)p,S,X]r(p, S, X) = \mathbb{E}[Y - f_{\text{old}}(S, X) \mid p, S, X]:

  • Global offset: r=cr = c (constant)
  • Policy offset: r=r(p)r = r(p) (per-policy constant)
  • Monotone refit: r=r(S)r = r(S) (score-conditional, shape-constrained)
  • Two-stage refit: r=r(S,X)r = r(S, X) (covariate-conditional)

2. Methods Compared

We evaluate seven methods, grouped into three tiers:

Baseline

old_plugin

Use the stale calibrator as-is, with no correction. Lower bound for comparison. Fixes policy means only; does not repair conditional calibration or thresholds.

Corrections (no refit)

old_plus_global_offset

Add one shared residual mean across all policies. Reduces bias when common mean drift is material and audit size is nontrivial. Note: because the same δ^\hat\delta is added to all policies, it cancels in pairwise differences and does not affect relative rankings.

old_plus_policy_offset

AIPW/EIF-style first-moment correction estimated per policy. Captures policy-level drift and changes relative rankings. Assumes audit labels are approximately representative within each policy (or inclusion-weighted).

Refits (new model)

recent_refit_monotone

Monotone (isotonic) calibrator re-trained on new-era audit data only.

pooled_refit_monotone

Monotone calibrator re-trained on old + new data pooled. Mean-preserving on the pooled mixture, not necessarily on the new-era distribution alone.

recent_refit_two_stage

Two-stage (monotone + linear) calibrator on new-era data only. Handles covariate-conditional drift.

pooled_refit_two_stage

Two-stage calibrator on pooled data. Most expressive but most data-hungry.

In practice, the main operational tradeoff is between global offset, policy-specific correction, and recent refits.

3. EIF-Style Correction Intuition

Global offset applies one shared correction to all policies. Define δ^\hat\delta as the mean residual under a chosen reference mixture over policies (e.g., audit sampling distribution or production traffic mix):

Global offset

θ^p=E^p ⁣[fold(S)]+δ^\hat\theta_p = \widehat{\mathbb{E}}_p\!\bigl[f_{\text{old}}(S)\bigr] + \hat\delta

Policy-specific correction estimates the residual first moment per policy:

Policy-specific correction

θ^p=E^p ⁣[fold(S)]+E^audit,p ⁣[Yfold(S)]\hat\theta_p = \widehat{\mathbb{E}}_p\!\bigl[f_{\text{old}}(S)\bigr] + \widehat{\mathbb{E}}_{\text{audit},p}\!\bigl[Y - f_{\text{old}}(S)\bigr]

This is the standard AIPW/EIF one-step estimator for Ep[Y]\mathbb{E}_p[Y], using fold(S)f_{\text{old}}(S) as the outcome regression. The old calibrator acts as a control variate to reduce variance, while the residual term restores unbiasedness for the mean as long as the audit is representative within each policy.

Key Insight

Policy-specific correction captures drift that a single global offset misses, without the cost of full recalibration. Because it estimates a per-policy δ^p\hat\delta_p, it changes pairwise differences and relative rankings — unlike a global offset, which cancels in comparisons.

4. Experiment Design

We simulate two time periods (old calibration era, new evaluation era), four policy families, and controlled drift in the judge-to-oracle relationship.

Drift scenarios

intercept_shift

Uniform additive bias — all scores shift by a constant. Easiest for offset to fix.

slope_shift

Score-dependent drift — high scores drift more than low. Global offset undercorrects.

nonlinear_shift

Non-monotone distortion of the score–outcome relationship. Requires richer correction.

covariate_interaction_shift

Drift depends on covariates, not just scores. Only two-stage refit captures this.

Audit profiles

base_heavy

Non-representative audit — oversamples the baseline policy (75%). Stress-tests mismatch between the audit mixture and the evaluation mixture.

balanced

Representative audit — equal coverage across policies (25% each). Favors all methods equally.

Main metrics are policy-mean MAE/RMSE, ranking accuracy, and transport status from audit diagnostics.

5. Empirical Pattern

  1. Global offset reduces bias relative to stale plug-in when mean drift is material.
  2. Under slope and nonlinear drift, global offset leaves substantial residual error.
  3. Policy-specific correction closes much of that gap and often approaches monotone refit.
  4. Under covariate-interaction drift, two-stage refit remains best.

When to escalate

When residuals are score-conditional or covariate-conditional, first-moment correction is insufficient — refit is necessary. Check the diagnostic plots below before deciding.

Method comparison at max audit size across drift scenarios and audit profiles
Figure 1: At larger audit sizes, policy-specific correction and refits dominate global offset under structural drift.
MAE versus audit size comparing global offset, policy offset, and recent monotone refit
Figure 2: As audit size increases, policy-specific residual correction remains consistently stronger than a single global offset.

6. What to Plot After an Audit

The recommendation “escalate to refit when residuals show structural drift” requires knowing what to look at. After collecting audit labels, plot the following:

Residual vs score (binned)

Bin Yfold(S)Y - f_{\text{old}}(S) by score decile. A flat residual curve means offset suffices. A trend means score-conditional refit is needed.

Residual vs key covariates

Check response length, refusal rate, tool-use indicators, or any feature you suspect interacts with drift. Systematic patterns signal covariate-conditional drift.

Policy-by-score interaction

Overlay binned residual curves per policy. If curves diverge, a single global or even per-policy offset is insufficient — the drift structure differs across policies.

7. Recommended Post-Audit Protocol

Recommended protocol

  1. 1

    Always report the corrected estimate

    Use policy-specific EIF correction (bootstrap + per-policy residual update) as the default. It is unbiased for the mean under representative auditing and handles most first-moment drift.

  2. 2

    Run global offset as a sanity check

    Compare global and policy-specific corrections. If they diverge substantially, drift is heterogeneous across policies.

  3. 3

    Use residual diagnostics to decide on refit

    Plot residuals vs score and covariates (Section 6). If residuals are flat, first-moment correction suffices. If score- or covariate-conditional, escalate to monotone or two-stage refit.

  4. 4

    Use two-stage refit for covariate-conditional drift

    When drift depends on observables beyond the score itself. Pooled refits should be recency-weighted if the goal is accuracy on the new era.

Selection bias caveat

Do not use the transport test as a switch that toggles correction on/off. If you only correct when the test rejects, you introduce selection bias: near δ=0\delta = 0, you correct only when the estimate is large by chance. Instead, always report the EIF-corrected estimate, and use residual diagnostics to decide whether the old calibrator should be trusted going forward or replaced.

Uncertainty

For offsets and EIF corrections, uncertainty is dominated by the audit residual term and scales like O(1/m)O(1/\sqrt{m}) where mm is the number of oracle labels. For refits, bootstrap with calibration refits is the safest default. This links directly to the CJE variance decomposition and explains why naive plug-in CIs fail at low oracle fractions.

8. Assumptions

Audit representativeness

The EIF correction is unbiased when audit labels are an approximately random subsample within each policy's evaluation distribution. If you do targeted auditing (oversample uncertain prompts, extremes of SS), you need inclusion-probability weighting in the residual term.

Goal: policy means vs conditional calibration

Offsets (global or per-policy) fix the policy mean only. They do not repair conditional calibration, subgroup validity, or threshold-based decisions (“ship if score > 0.75”). For those, refit is necessary.

Bootstrap for learned corrections

Any method that learns from audit data (including EIF correction) should use bootstrap or cross-fitting for honest uncertainty quantification.

9. Reproducibility

All experiments are in the repo-level workspace (outside PyPI runtime surface):

python experiments/offset_vs_refit/offset_vs_refit_simulation.py --n-reps 60 --audit-sizes 20,50,100,200

Source: github.com/cimo-labs/cje/experiments/offset_vs_refit