
Coverage-Limited Efficiency: Why High ESS Isn't Enough

Eddie Landesberg · 10 min read

The problem: Your off-policy estimator reports 95% effective sample size (ESS), stabilized importance weights, and no extreme values. Everything looks healthy. But the estimate is still uninformative—standard errors remain huge and confidence intervals span the entire plausible range.

The insight: High ESS only tells you that weights aren't dominated by a few extreme observations. It doesn't tell you whether your logging policy has meaningful coverage in the regions where your target policy concentrates. If the logger rarely visits target-typical regions, no amount of weight calibration can make your logs-only estimate precise.

We formalize this intuition with Coverage-Limited Efficiency (CLE)—a sharp, local lower bound on standard errors for any logs-only off-policy estimator. The bound separates two failure modes: (1) insufficient logger coverage in target-relevant regions, and (2) shape mismatch between logger and target distributions within those regions. When judged fresh draws from the target policy are available, the CLE floor becomes diagnostic rather than limiting—the binding constraint shifts to the Monte Carlo term from the target sample.

Note: CLE has not been extensively tested numerically or empirically. We present it as our leading hypothesis for why calibrated importance sampling broke down in the Arena experiment, but computing TTC, β, and the actual CLE floor for that data remains future work. The theoretical framework is sound, but empirical validation across multiple domains is an active research direction.

The Coverage Gap

Suppose your logging policy π₀ generates responses to prompts, and you want to evaluate a new target policy π′ using only logged data. Standard off-policy evaluation (OPE) wisdom says: compute importance weights $w_i = \pi'(A_i \mid X_i) / \pi_0(A_i \mid X_i)$, check that ESS is high, stabilize the weights if needed, and you're good to go.

But ESS measures weight concentration—whether the estimate is dominated by a few outliers. It doesn't measure coverage—whether the logger visits the regions where the target policy typically operates.

Context vs. action coverage: Coverage can fail in contexts (P(X)) or in actions given context (P(A|X)). CLE here targets action-space overlap given a shared prompt set; if P(X) shifts, diagnose and correct that first (reweight contexts) before applying CLE.

Concrete failure mode

Your logger π₀ is a base LLM that produces terse, factual responses. Your target π′ is a fine-tuned variant that produces detailed, explanatory responses 3× longer. The logger almost never generates the kind of responses π′ prefers—the policies have poor overlap in output space.

Even after weight stabilization (SIMCal-W) pushes ESS to 95%, you only have a handful of logged responses that look anything like what π′ would generate. Those few observations carry all the information about π′'s performance. Your effective sample size for estimating π′ is tiny, even though global ESS looks great.

A Coverage-Limited Efficiency Bound

Notation ledger

π₀: logger policy; π′: target policy; Y: oracle-scale reward; R: calibrated reward;
T ⊂ 𝒳×𝒜: target-typical region; α = Pπ′(T); β = Pπ₀(T);
σT²: Var(Y | (X,A)∈T); χ²(p‖q): chi-square divergence;
TTC = α̂ (estimated target mass on T).

Coverage-Limited Efficiency formalizes this intuition. Let $T \subset \mathcal{X} \times \mathcal{A}$ be any target-relevant region—a subset of context-action space where the target policy concentrates its probability mass. For example, $T$ could be the set of (context, response) pairs that are "typical" under π′ (defined via surprisal thresholds from teacher forcing).

Define:

  • $\alpha = P_{\pi'}(T)$: Target policy's mass on $T$ ("How often does π′ visit this region?")
  • $\beta = P_{\pi_0}(T)$: Logging policy's mass on $T$ ("How often does the logger visit this region?")
  • $\sigma_T^2 = \operatorname{Var}(Y \mid (X,A) \in T)$: Outcome variance within $T$
  • $\chi^2(\pi'_T \,\|\, \pi_{0,T})$: Chi-square divergence between π′ and π₀ restricted to $T$ (measures shape mismatch inside the region)

Bound 1 (Coverage-Limited Efficiency)

For any regular logs-only estimator of $\Psi = \mathbb{E}_{\pi'}[Y]$ with influence function φ,

$$\text{SE}(\hat{\Psi}) \;\ge\; \frac{\sigma_T\,\alpha}{\sqrt{\beta\,n}}\,\sqrt{1 + \chi^2\!\big(\pi'_T \,\|\, \pi_{0,T}\big)}.$$

No weight calibration, no projection method, no clever estimator can beat this floor using only logged data from π₀.

Assumptions (informal): (A1) Regular estimator with an influence function under π₀; (A2) Positivity on T: β = Pπ₀(T) > 0; (A3) Local approximation: variance lower bounds derived from second-moment properties of the restricted likelihood ratio on T; (A4) Outcome variance on T finite (σT² < ∞). Remark: This is a local lower bound; it binds when target mass α on T and logger mass β on T drive difficulty. We treat it as a diagnostic bound in practice.

Connection to Rényi divergence: Because $D_2(p\|q) = \log(1 + \chi^2(p\|q))$ [3], our earlier factor $e^{D_2/2}$ equals $\sqrt{1+\chi^2}$. This connects the floor directly to second-moment diagnostics of restricted weights.

Interpreting the components

The CLE bound has three multiplicative factors:

1. Coverage penalty: $\alpha/\sqrt{\beta}$

If the target concentrates mass α on region $T$ but the logger only puts mass β there, you pay $\sqrt{\alpha^2/\beta} = \alpha/\sqrt{\beta}$. When β is tiny (logger rarely visits target-typical regions), this dominates. Example: α = 0.7, β = 0.01 → penalty = 7×.

2. Shape mismatch: $\sqrt{1 + \chi^2}$

Even when the logger does visit $T$, if the distributions have different shapes inside that region, you pay an additional penalty. Chi-square divergence measures this: $\chi^2 = 0$ means identical shapes (no penalty), $\chi^2 > 0$ inflates the floor. Example: χ² = 3 → penalty = 2×.

3. Noise & sample size: σT/n\sigma_T / \sqrt{n}

Standard Monte Carlo term: intrinsic outcome variability divided by $\sqrt{n}$. This is the unavoidable statistical uncertainty you'd face even with perfect overlap and no shape mismatch. The floor (correctly) vanishes when $\sigma_T = 0$.

Key insight: The coverage penalty $\alpha/\sqrt{\beta}$ and shape mismatch $\sqrt{1 + \chi^2}$ are multiplicative. Small β or large χ² can make the floor prohibitively large, rendering logs-only estimation uninformative no matter how you calibrate the weights.

Estimating χ² on T

Let wi = π′(Ai|Xi)/π₀(Ai|Xi) and restrict to i∈T. Normalize restricted weights: ẇi = wi / ((1/|T|)∑j∈T wj).

Then $\widehat{1+\chi^2} = \frac{1}{|T|}\sum_{i\in T} \dot{w}_i^2$.

Practical diagnostic: Plot the distribution of normalized restricted weights ẇi on T; (1/|T|)∑ ẇi² = 1+χ̂T². Heavy tails → large mismatch → inflated floor.
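
A minimal NumPy sketch of this estimator (array names are illustrative, not the CJE API):

import numpy as np

def one_plus_chi2_on_T(weights, in_T):
    """Estimate 1 + chi^2(pi'_T || pi_{0,T}) from normalized restricted weights."""
    # weights: raw importance ratios w_i = pi'(A_i|X_i) / pi_0(A_i|X_i)
    # in_T:    boolean mask for logged points that fall in the region T
    w_T = np.asarray(weights, dtype=float)[np.asarray(in_T, dtype=bool)]
    if w_T.size == 0:
        raise ValueError("No logged points in T (beta_hat = 0); the floor is infinite.")
    w_dot = w_T / w_T.mean()            # normalize so mean(w_dot) = 1
    return float(np.mean(w_dot ** 2))   # heavy right tail => value >> 1 => inflated floor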

Two Regimes: Logs-Only vs. With Fresh Draws

The CLE floor only binds for logs-only estimators (pure IPS, calibrated IPS). When you have judged fresh draws from the target policy π′, the picture changes fundamentally.

| Regime | Binding floor | Role of logs |
| --- | --- | --- |
| Logs-only (IPS, Cal-IPS) | CLE: $\frac{\sigma_T\,\alpha}{\sqrt{\beta n}}\sqrt{1+\chi^2}$ | Essential. Refuse if floor exceeds SE budget. |
| With judged fresh draws (Direct, DR) | MC: $\sqrt{\operatorname{Var}_{\pi'}(Y)/m}$ | Optional control variates (can improve, but not essential). |

Why fresh draws change everything

Suppose you collect $m$ judged fresh draws from π′: the target policy generates a response to each prompt, you score it with a cheap judge, and calibrate those scores to the oracle scale. Now your estimator is a two-sample design: $n$ logged observations from π₀ and $m$ fresh observations from π′.

Proposition: Two-sample decomposition

For any regular estimator using $n$ logs and $m$ judged fresh draws,

$$\operatorname{Var}(\hat{\Psi}) \;\ge\; B_{\text{logs}} + B_{\text{target}}, \quad \text{where } B_{\text{target}} = \frac{\operatorname{Var}_{\pi'}(R)}{m}, \quad B_{\text{logs}} \;\ge\; \frac{\sigma_T^2\,\alpha^2}{\beta\,n}\,(1+\chi^2).$$

The binding floor is the target Monte Carlo term $B_{\text{target}}$. The logs-only CLE becomes diagnostic—it tells you whether the logged data can meaningfully improve the target-only estimate, but you're not stuck with it as the binding constraint.

Oracle uncertainty (OUA): With partial oracle coverage, add an OUA component from calibrator learning: $\text{Var}_{\text{total}} \ge B_{\text{logs}} + B_{\text{target}} + \text{Var}_{\text{OUA}}$. When m is large and OUA shrinks, $B_{\text{target}}$ is the binding term.

Implication for Direct Model (DM) and Doubly Robust (DR): These methods generate fresh responses from each policy on the same prompts. The binding uncertainty is the Monte Carlo term from scoring those fresh responses with a calibrated judge. Logged data from π₀ can serve as control variates (DR uses them for bias correction), but poor logger coverage doesn't prevent you from getting a precise estimate—it just means the log-based corrections contribute little. Influence-function stacking (stacked-DR) automatically down-weights log corrections when the CLE floor is high.
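
To make the regime comparison concrete, here is a toy sketch computing both variance terms; the numbers and the helper name are illustrative, not outputs from the Arena data:

def variance_terms(var_target_R, m, sigma_T, alpha, beta, n, one_plus_chi2):
    """Target Monte Carlo term vs. the logs-only CLE variance floor."""
    b_target = var_target_R / m                                    # Var_pi'(R) / m
    b_logs = sigma_T ** 2 * alpha ** 2 / (beta * n) * one_plus_chi2
    return b_target, b_logs

# Toy numbers: 2,000 judged fresh draws vs. 5,000 logs with poor coverage.
b_t, b_l = variance_terms(var_target_R=0.04, m=2000,
                          sigma_T=0.20, alpha=0.6, beta=0.01,
                          n=5000, one_plus_chi2=3.0)
print(f"B_target = {b_t:.1e}, B_logs floor = {b_l:.1e}")
# Here the logs floor dwarfs B_target: log-based corrections contribute little,
# and the achievable SE is governed by the fresh-draw Monte Carlo term.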

Target-Typicality Coverage (TTC) Diagnostic

To operationalize CLE, we need to compute β: what fraction of logged data lives in target-relevant regions? The key challenge: defining $T$ (the target-typical region) without already having extensive data from π′.

Defining typicality via surprisal

Use teacher forcing to compute the per-token surprisal of each logged response under the target policy π′. Responses with low surprisal are "typical" for π′; high surprisal means π′ would rarely generate them. Set a threshold τ (e.g., 75th percentile of surprisal on a small validation set of fresh draws from π′, or a fixed percentile on logged data) and define:

$$T = \{(X_i, A_i) : \text{surprisal}_{\pi'}(A_i \mid X_i) \leq \tau\}$$

Alternative (no teacher forcing): risk-index typicality. Use the stage-1 index Ť = g(S, X) from AutoCal-R and define T as the top-k percentile of Ť under π′ (estimated via a small set of fresh draws). This yields a coverage diagnostic that does not depend on propensity scoring.

Then compute:

  • $\hat{\alpha} = \frac{1}{n}\sum_{i=1}^n w_i\,\mathbb{1}[i \in T]$ (target mass on $T$, estimated via importance weights)
  • $\hat{\beta} = \frac{1}{n}\sum_{i=1}^n \mathbb{1}[i \in T]$ (logger mass on $T$, directly observed)

Target-Typicality Coverage (TTC) is α̂. If TTC is low, the target policy concentrates most of its mass on a region where you have few logged observations—a red flag for logs-only estimation.
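
A minimal sketch of these two estimates under the surprisal definition of T (function and argument names are illustrative, not the CJE API):

import numpy as np

def ttc_diagnostics(weights, surprisal_target, tau):
    """Compute TTC (alpha_hat) and logger coverage (beta_hat) for threshold tau."""
    # weights:          raw importance ratios on the logged sample
    # surprisal_target: surprisal of each logged (X_i, A_i) under pi' (teacher forcing)
    # tau:              typicality threshold, e.g. a percentile of fresh-draw surprisal
    w = np.asarray(weights, dtype=float)
    in_T = np.asarray(surprisal_target, dtype=float) <= tau  # T = {surprisal <= tau}
    beta_hat = float(in_T.mean())         # logger mass on T, directly observed
    alpha_hat = float(np.mean(w * in_T))  # target mass on T, via importance weights
    return alpha_hat, beta_hat, in_T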

IF-ESS on T

Using out-of-fold influence contributions ψi, define $\text{IF-ESS}_T = \frac{(\sum_{i\in T} \psi_i)^2}{\sum_{i\in T} \psi_i^2}$. This is the effective number of informative samples inside $T$ for your estimator.
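
Assuming your estimator exposes out-of-fold influence contributions, this is a one-line computation (a sketch; names are illustrative):

import numpy as np

def if_ess_on_T(psi, in_T):
    """IF-ESS_T = (sum_{i in T} psi_i)^2 / sum_{i in T} psi_i^2."""
    psi_T = np.asarray(psi, dtype=float)[np.asarray(in_T, dtype=bool)]
    return float(psi_T.sum() ** 2 / np.sum(psi_T ** 2))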

Computing the SE floor

Once you have α̂, β̂, and an estimate of the shape mismatch $\widehat{1+\chi^2_T}$ (computed from the restricted importance weights on $T$ as shown above), the CLE floor is:

$$\text{SE}_{\min}(\tau) = \frac{\hat{\sigma}_T\,\hat{\alpha}}{\sqrt{\hat{\beta}\,n}}\,\sqrt{\widehat{1+\chi^2_T}}$$

where $\hat{\sigma}_T$ is the out-of-fold (OOF) residual standard deviation on $T$.
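
The floor itself is then one line; this sketch reuses the quantities defined above (illustrative, not the CJE API):

import numpy as np

def cle_floor(sigma_T, alpha_hat, beta_hat, n, one_plus_chi2):
    """CLE lower bound on the SE of any logs-only estimator of E_pi'[Y]."""
    return sigma_T * alpha_hat / np.sqrt(beta_hat * n) * np.sqrt(one_plus_chi2)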

REFUSE-LEVEL gates (logs-only)

Refuse to report logs-only estimates if any of the following hold (a sketch of these gates follows the list):

  • Floor exceeds precision budget: SEmin(τ) > SEtarget
  • Precision-to-cost dominated: the required sample size $n \geq \frac{\sigma_T^2\,\hat{\alpha}^2}{\hat{\beta}\,\varepsilon^2}\,\widehat{(1+\chi^2_T)}$ exceeds the feasible budget
  • Fragile evidence in-target: IF-ESST < Nmin (default 20)
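
A minimal sketch of these gates, with the Nmin default from above (names and thresholds illustrative):

def refuse_logs_only(se_min, se_target, n_required, n_feasible, if_ess_T, n_min=20):
    """Return (refuse?, which gates fired) for logs-only estimation."""
    gates = {
        "floor_exceeds_budget": se_min > se_target,
        "precision_to_cost_dominated": n_required > n_feasible,
        "fragile_evidence_in_target": if_ess_T < n_min,
    }
    return any(gates.values()), gates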

Coverage-mismatch profile

Since the definition of TT depends on the surprisal threshold τ, plot SEmin(τ) across a range of thresholds. If the floor exceeds your SE budget for all plausible τ, logs-only estimation is infeasible.

Worked example (toy)

n=5,000 logs; β̂=0.01; α̂=0.6; σT≈0.20; 1+χ̂T²≈3.

SEmin = (0.20 × 0.6) / √(0.01 × 5000) × √3 ≈ 0.12 / √50 × 1.732 ≈ 0.12 / 7.071 × 1.732 ≈ 0.029.

If your SEtarget is 0.01, logs-only is infeasible (floor is 3× too high).
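
Plugging the toy numbers into the cle_floor sketch from earlier reproduces this arithmetic:

se_min = cle_floor(sigma_T=0.20, alpha_hat=0.6, beta_hat=0.01,
                   n=5000, one_plus_chi2=3.0)
print(f"SE_min ≈ {se_min:.3f}")  # ≈ 0.029, about 3x a 0.01 SE budget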

How to compute CLE in practice

  1. Choose T: Define via surprisal or risk-index typicality; profile over τ
  2. Compute β̂: β̂ = |T|/n
  3. Estimate α̂: Small fresh draws + simple classifier density ratio, or importance weights if reliable
  4. Estimate σT: Via out-of-fold residuals on T
  5. Estimate 1+χ̂T²: Via normalized restricted weights on T (shown above)
  6. Compute SEmin(τ): Profile over τ; apply gates (a consolidated sketch follows this list)
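
Putting the steps together, a consolidated sketch that profiles the floor over τ, reusing the illustrative helpers defined earlier (none of this is the CJE API):

import numpy as np

def cle_profile(weights, surprisal_target, sigma_T_oof, taus):
    """Profile SE_min(tau) over typicality thresholds (steps 1-6)."""
    n = len(weights)
    floors = {}
    for tau in taus:
        alpha_hat, beta_hat, in_T = ttc_diagnostics(weights, surprisal_target, tau)
        if beta_hat == 0.0:
            floors[tau] = np.inf                         # no logged points in T
            continue
        one_plus_chi2 = one_plus_chi2_on_T(weights, in_T)
        floors[tau] = cle_floor(sigma_T_oof, alpha_hat, beta_hat, n, one_plus_chi2)
    return floors  # plot these; if every value exceeds the SE budget, refuse logs-only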

Connection to Arena Experiment Results

In the Arena experiment, we saw that SNIPS (self-normalized importance sampling)[1] and calibrated-ips (SIMCal-W stabilized weights) both failed catastrophically for ranking, despite SIMCal-W boosting ESS from 0.4–26% up to 82–99%. CLE provides our leading hypothesis for why this occurred:

  • Poor logger coverage in target-typical regions. The base policy (Llama 3.3 70B with standard system prompt) generates very different responses than the target policies (parallel_universe, premium, unhelpful). Even though global ESS is high after SIMCal-W, β̂ (logger coverage in target-typical regions) is tiny for policies that differ substantially from base.
  • Teacher forcing noise amplifies shape mismatch. Computing propensities π′(A|X) via teacher forcing is noisy and non-deterministic[2], inflating the local divergence D₂ even for the clone policy (identical to base with different seed).
  • Why Direct Model (DM) succeeds. Direct methods generate fresh responses from each policy on the same prompts. The binding floor is the Monte Carlo term $\sqrt{\operatorname{Var}_{\pi'}(R)/m}$, not the CLE floor. Poor logger coverage is irrelevant—each policy gets its own independent sample.
  • Why Doubly Robust (DR) recovers performance. DR uses fresh draws to train an outcome model, then applies importance-weighted corrections using logged data. The outcome model provides a baseline; logs are control variates. When the CLE floor is high (poor logger coverage), the log-based corrections contribute little, and the estimate is governed by the target Monte Carlo term. SIMCal-W stabilization ensures the corrections don't hurt, even if they don't help much.

Illustrative vignette from Arena experiment

For the parallel_universe policy (hypothetical coverage estimates):

  • Global ESS after SIMCal-W: 95.4% (observed)
  • Logger coverage in parallel_universe-typical regions: β̂ ≈ 0.006 (inferred, not computed)
  • Target mass: α̂ ≈ 0.7 (inferred, not computed)
  • Coverage penalty: $\alpha/\sqrt{\beta} \approx 0.7/\sqrt{0.006} \approx 9$ (9× inflation)

Even with perfect shape match (D₂ = 0), the CLE floor would be 9× higher than the naïve $1/\sqrt{n}$ rate. With teacher forcing noise pushing D₂ > 0, the floor becomes prohibitive. Result: SNIPS achieves 8.7% top-1 accuracy (random guessing = 20%), and calibrated-ips only reaches 19.1% despite 95% ESS. Note: The β and α values are illustrative inferences based on policy similarity, not computed from the data. Computing actual TTC and CLE floors for the Arena experiment is future work.

Practical Guidance

When to refuse logs-only estimation

  1. Before running the experiment: If you know the target and logging policies differ substantially (different model sizes, system prompts, temperatures), expect poor coverage. Plan to collect judged fresh draws rather than relying on logs-only OPE.
  2. After collecting data: Compute TTC (α̂), logger coverage (β̂), and the CLE floor SEmin. If SEmin exceeds your precision target (e.g., you need SE ≤ 0.01 but the floor is 0.05), refuse logs-only estimation.
  3. Profile over typicality thresholds: Plot SEmin(τ) to check robustness. If the floor is prohibitive across all plausible definitions of "target-typical," the result is not sensitive to your choice of τ.

Sample size planning

To achieve SE ≤ ε with logs-only data,

$$n \geq \frac{\sigma_T^2\,\alpha^2}{\beta\,\varepsilon^2}\,(1+\chi^2).$$

If β is tiny (poor logger coverage), the required $n$ explodes. In such cases, collecting fresh target draws is far more efficient than increasing the size of the logged dataset.

Cost guidance: With β small, required n scales like α²/(β ε²). A handful of judged fresh draws (m in the low thousands) often beats adding millions of logs. We surface this trade-off directly via the floor.
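
A sketch of this planning calculation with the worked-example numbers (illustrative helper, not the CJE API):

def required_n_logs_only(sigma_T, alpha, beta, epsilon, one_plus_chi2):
    """Logged sample size needed so the CLE floor is <= epsilon."""
    return sigma_T ** 2 * alpha ** 2 / (beta * epsilon ** 2) * one_plus_chi2

n_req = required_n_logs_only(0.20, 0.6, 0.01, 0.01, 3.0)
print(f"{n_req:,.0f} logs needed")  # ≈ 43,200 for the toy numbers above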

When logs still help

Even with judged fresh draws (Direct/DR), logged data can improve efficiency if the CLE floor is reasonable. Doubly robust estimators and influence-function stacking (stacked-DR) automatically weight the log-based corrections by their precision. When β is moderate (logger has decent coverage), DR can achieve lower variance than pure Direct estimation. When β is tiny, stacking mutes the log corrections and you recover essentially the Direct estimate.

Comparison to ESS

| Metric | What it measures | Failure mode it detects |
| --- | --- | --- |
| ESS | Weight concentration: $(\sum w_i)^2 / \sum w_i^2$ | A few extreme weights dominate the estimate |
| TTC (α̂) | Logger coverage in target-typical regions | Logger rarely visits where target concentrates |
| CLE floor | Minimum achievable SE given coverage and shape mismatch | Logs-only estimation is fundamentally uninformative |

Key insight: You can have high ESS (no weight concentration) but low TTC (poor coverage), leading to a prohibitive CLE floor. Both diagnostics are necessary for honest off-policy evaluation.

Limitations

  • Defining $T$ without fresh draws is heuristic. The surprisal-based typicality definition requires teacher forcing under π′, which may be noisy. Profile over thresholds and disclose sensitivity. The risk-index alternative avoids propensity scoring.
  • Estimating σT with sparse oracle coverage. Use out-of-fold residuals and conservative bounds when oracle labels are limited.
  • Context shift. If the distribution over contexts $X$ differs between logger and target, the analysis requires separate coverage checks over $X$. The CLE bound assumes overlap failures are in the action space $A \mid X$, not in $X$ itself.
  • Tightness: The bound is local in T. If you partition 𝒳×𝒜 into bins, convexity yields a weighted combination lower bound; the worst (lowest β/highest χ²) bin often dominates. This explains why global ESS can look fine while a single high-mass target bin sets the floor.

The Relationship: Design-by-Projection and CLE

Design-by-Projection (DbP) methods (AutoCal-R, SIMCal-W) project empirical data onto convex sets encoding structural knowledge (monotonicity, mean preservation). This reduces standard errors from an uncalibrated baseline. But no projection can beat the coverage-limited efficiency floor—it's a hard barrier determined by logger coverage and local mismatch.

Together, DbP and CLE provide complementary perspectives:

  • DbP methods: Practical tools to reduce variance through calibration and stabilization
  • CLE floor: Theoretical minimum that tells you when logs-only estimation is fundamentally limited

CJE currently implements Design-by-Projection methods (AutoCal-R, SIMCal-W). CLE diagnostics (TTC, IF-ESS in T, SE floor) are planned for future releases.

Planned Implementation in CJE

CLE diagnostics are planned for integration into the CJE package in a future release. The implementation will include:

  • TTC (Target-Typicality Coverage): α̂, the target policy's estimated mass on the logger-observed region
  • Logger coverage: β̂, the fraction of logged data in target-typical regions
  • Mismatch multiplier: $\hat{M}_T = \exp(D_2)$, measuring shape divergence inside the region
  • CLE floor: SEmin, the theoretical minimum standard error
  • IF-ESS restricted to $T$: Effective sample size of the influence function within the target-typical region

When implemented, if logs-only estimation is attempted and any REFUSE-LEVEL gate triggers, CJE will issue a warning and suggest collecting judged fresh draws.

# Planned API for CLE diagnostics
from cje import OffPolicyEstimator
results = OffPolicyEstimator(method='calibrated-ips').fit(data)
results.cle_diagnostics() # Coming soon
# Output will include TTC, SE_min, refuse gates

If you're interested in contributing to the CLE implementation or have use cases that would benefit from these diagnostics, please open an issue or discussion on the CJE GitHub repository.

Conclusion

High ESS is necessary but not sufficient for informative off-policy evaluation. Coverage-Limited Efficiency provides a sharp, local lower bound on standard errors that separates logger coverage from shape mismatch and makes refusal decisions rigorous.

Key takeaways:

  • Logs-only OPE has a hard precision floor determined by logger coverage in target-relevant regions
  • With judged fresh draws (Direct/DR), the binding floor is the Monte Carlo term—logs become optional control variates
  • TTC (Target-Typicality Coverage) operationalizes the coverage diagnostic
  • Design-by-Projection and CLE form complementary perspectives: projections reduce variance, CLE sets the theoretical limit

For practitioners: compute TTC and the CLE floor before committing to logs-only estimation. When the floor is prohibitive, invest in fresh target draws rather than scaling up logged data collection.

Research Direction & Feedback

CLE is an active area of research. Empirical validation across diverse domains, numerical studies of the tightness of the bound, and practical implementation of TTC diagnostics are ongoing work. We invite feedback, critical discussion, and collaboration. If you're interested in testing CLE on your data or have insights about coverage diagnostics for off-policy evaluation, please reach out via our contact page or GitHub.

Appendix: Proofs

Notation & assumptions (for this appendix)

Let $Z=(X,A,Y)$ be drawn i.i.d. under the logging policy distribution $\pi_0$ with density $q$, and let the target policy distribution be $\pi'$ with density $p$ (both w.r.t. a common base measure). Define the importance ratio $w(Z)=p(Z)/q(Z)$ and a measurable $T\subset\mathcal{X}\times\mathcal{A}$. Write $\alpha = P_{\pi'}(T)$, $\beta = P_{\pi_0}(T)$, and the restricted densities $p_T = p/\alpha$, $q_T = q/\beta$ on $T$. Let $\sigma_T^2$ denote a lower bound on the conditional outcome variance on $T$, e.g. $\sigma_T^2 \le \operatorname*{ess\,inf}_{(x,a)\in T}\operatorname{Var}(Y\mid X{=}x, A{=}a)$ (any fixed lower bound suffices for the inequality below). We consider regular (asymptotically linear) estimators based only on logs (for Bound 1) and on independent logs plus target draws (for the two-sample bound).


Lemma A (weight identity on $T$)

On $T$, the importance ratio satisfies $w = \frac{p}{q} = \frac{\alpha}{\beta}\cdot\frac{p_T}{q_T}$. Consequently,

$$\mathbb{E}_{\pi_0}\!\left[w^2\,\mathbf{1}_T\right] \;=\; \frac{\alpha^2}{\beta}\int_T \frac{p_T^2}{q_T}\,d\mu \;=\; \frac{\alpha^2}{\beta}\big(1+\chi^2(p_T\|q_T)\big).$$

Proof. On $T$, write $p=\alpha p_T$, $q=\beta q_T$ and expand. The last equality uses $\chi^2(p_T\|q_T)=\int_T \frac{(p_T-q_T)^2}{q_T}\,d\mu = \int_T \frac{p_T^2}{q_T}\,d\mu - 1$.


Bound 1 (Coverage‑Limited Efficiency, χ² form)

For any regular logs-only estimator of $\Psi=\mathbb{E}_{\pi'}[Y]$, the asymptotic standard error obeys

$$\mathrm{SE}(\hat\Psi) \;\ge\; \frac{\sigma_T\,\alpha}{\sqrt{\beta\,n}}\,\sqrt{1+\chi^2\!\big(p_T\|q_T\big)}.$$

Intuition: How the pieces fit together

The floor has four multiplicative components, each capturing a distinct source of difficulty:

  • $\sigma_T$: Irreducible outcome noise on $T$—even with infinite data, you cannot estimate $\mathbb{E}[Y\mid X,A]$ more precisely than the conditional variance allows.
  • $\alpha$: Target mass on $T$—the more the target policy concentrates on $T$, the more this region dominates the overall mean, amplifying any estimation error here.
  • $1/\sqrt{\beta n}$: Effective logger sample size in $T$—you only have about $\beta n$ logged samples in the relevant region; sparse coverage directly inflates variance.
  • $\sqrt{1+\chi^2(p_T\|q_T)}$: Shape mismatch within $T$—even when the logger visits $T$, if it explores the region differently than the target (high χ²), importance weights become extreme and variance explodes. This is the "effective sample size tax" from reweighting[1].

Together: outcome noise × target importance × logger scarcity × reweighting penalty. Each factor is unavoidable; the bound is tight when all four sources align to make the problem fundamentally hard.

Proof.

  1. Decompose $\Psi = \mathbb{E}_{\pi'}[Y] = \Psi_T + \Psi_{T^c}$ with $\Psi_T=\mathbb{E}_{\pi'}[Y\,\mathbf{1}_T]=\mathbb{E}_{\pi_0}[w\,Y\,\mathbf{1}_T]$. Since variances add for independent components and dropping $T^c$ can only reduce difficulty, a lower bound for estimating $\Psi_T$ is a valid lower bound for estimating $\Psi$.
  2. Consider the model where $w$ is treated as known (oracle propensities). Then $\Psi_T = \mathbb{E}_{\pi_0}[f(Z)]$ with $f(Z)=w\,Y\,\mathbf{1}_T$ is a linear functional of the unknown logging distribution. In this model the efficient influence function (EIF)[4] is $\phi(Z)=f(Z)-\Psi_T$, so the semiparametric efficiency bound is $\operatorname{AVAR}(\hat\Psi_T)\ge \frac{1}{n}\operatorname{Var}_{\pi_0}\!\big(w\,Y\,\mathbf{1}_T\big)$.
  3. Center on the target-restricted mean $\mu_T=\mathbb{E}_{p_T}[Y]$ and note $\mathbb{E}_{\pi_0}[w\,(Y-\mu_T)\,\mathbf{1}_T] = \alpha\,\mathbb{E}_{p_T}[Y-\mu_T]=0$. Then $\operatorname{Var}_{\pi_0}\!\big(w\,Y\,\mathbf{1}_T\big) = \mathbb{E}_{\pi_0}\!\big[w^2 (Y-\mu_T)^2\,\mathbf{1}_T\big]$.
  4. Take the conditional expectation given $(X,A)$ and use $\mathbb{E}\!\big[(Y-\mu_T)^2\mid X,A\big] \ge \operatorname{Var}(Y\mid X,A)$. By the definition of $\sigma_T^2$, we have $\mathbb{E}_{\pi_0}\!\big[w^2 (Y-\mu_T)^2\,\mathbf{1}_T\big] \ge \sigma_T^2\,\mathbb{E}_{\pi_0}\!\big[w^2\,\mathbf{1}_T\big]$.
  5. Apply Lemma A to conclude $\operatorname{Var}_{\pi_0}\!\big(w\,Y\,\mathbf{1}_T\big) \ge \sigma_T^2\,\frac{\alpha^2}{\beta}\,\big(1+\chi^2(p_T\|q_T)\big)$. Taking square roots yields the stated bound.

Remarks.

  • Treating $w$ as known makes the problem easier; hence this is a valid lower bound for any logs-only estimator in the real problem (where $w$ is estimated).
  • The bound is local in $T$: poor logger coverage ($\beta\ll 1$) or shape mismatch ($\chi^2\gg 1$) on any high-mass $T$ inflates the floor multiplicatively.

Corollary (ESS form on $T$)

Define normalized restricted weights $\tilde w_i = w_i / \mathbb{E}_{\pi_0}[w \mid i\in T]$ and the usual weight ESS on $T$:

$$\text{ESS}_T \;=\; \frac{\big(\sum_{i\in T} \tilde w_i\big)^2}{\sum_{i\in T} \tilde w_i^{\,2}}.$$

Then $\mathbb{E}[\tilde w_i]=1$ and $\mathbb{E}[\tilde w_i^{\,2}] = 1+\chi^2(p_T\|q_T)$, so $\mathbb{E}[\text{ESS}_T] \approx \frac{|T|}{1+\chi^2(p_T\|q_T)}$. Since $\mathbb{E}[|T|]=\beta n$, the Bound 1 floor can be rewritten as

$$\mathrm{SE}_{\min} \;\gtrsim\; \frac{\sigma_T\,\alpha}{\sqrt{\text{ESS}_T}} \quad \text{(up to concentration of } |T| \text{ around } \beta n\text{)}.$$

Proof sketch. With $\tilde w$ normalized, the classical identity $\frac{1}{|T|}\sum_{i\in T} \tilde w_i^{\,2} \to 1+\chi^2(p_T\|q_T)$ implies $\text{ESS}_T \approx \frac{|T|}{1+\chi^2}$. Replace $\sqrt{1+\chi^2}$ in Bound 1 by $\sqrt{|T|/\text{ESS}_T}$ and use $|T|\approx \beta n$.


Proposition (Two‑sample variance lower bound)

Suppose that in addition to $n$ logs from $\pi_0$ you have $m$ independent judged fresh draws $U_1,\ldots,U_m \sim \pi'$, each scored into an oracle-scale reward $R$ (e.g., by a calibrated judge). For any regular estimator using both samples (with sample-splitting/cross-fitting so that nuisance fits are independent of the target half),

$$\operatorname{Var}(\hat\Psi) \;\ge\; B_{\text{logs}} + B_{\text{target}}, \qquad B_{\text{target}} = \frac{\operatorname{Var}_{\pi'}(R)}{m}, \quad B_{\text{logs}} \;\ge\; \frac{\sigma_T^2\,\alpha^2}{\beta\,n}\,\big(1+\chi^2(p_T\|q_T)\big).$$

Proof sketch. The efficient influence function for $\Psi=\mathbb{E}_{\pi'}[R]$ in the two-sample (independent) experiment decomposes into orthogonal components $\phi = \phi_{\text{target}} + \phi_{\text{logs}}$, where $\phi_{\text{target}} = R - \mathbb{E}_{\pi'}[R]$ (the nonparametric EIF for a mean under $\pi'$) and $\phi_{\text{logs}}$ lies in the tangent space of the logs model. Independence implies $\operatorname{Var}(\phi) = \operatorname{Var}(\phi_{\text{target}}) + \operatorname{Var}(\phi_{\text{logs}})$. The first term yields $\operatorname{Var}_{\pi'}(R)/m$; the second is bounded below by Bound 1 (applied to $R$ in place of $Y$) when we use logs as control variates. This gives the stated inequality.

Remark. The target term is the binding floor when judged fresh draws are available in sufficient quantity; logs can help as control variates only when the CLE floor is not prohibitive.


Lemma B (the Rényi-2 factor equals $\sqrt{1+\chi^2}$)

By definition, $D_2(p_T\|q_T)=\log\int_T \frac{p_T^2}{q_T}\,d\mu=\log\big(1+\chi^2(p_T\|q_T)\big)$ [3]. Therefore $\exp(D_2/2)=\sqrt{1+\chi^2(p_T\|q_T)}$, and the version of the CLE floor in the main text, $\frac{\sigma_T\,\alpha}{\sqrt{\beta n}}\exp(D_2/2)$, is identical to the χ² form here.

References

[1] Swaminathan, A., & Joachims, T. (2015). The Self-Normalized Estimator for Counterfactual Learning. NeurIPS. https://papers.nips.cc/paper/5748-the-self-normalized-estimator-for-counterfactual-learning
[2] Bachmann, G., & Nagarajan, V. (2024). The Pitfalls of Next-Token Prediction. ICML. arXiv:2403.06963. https://arxiv.org/abs/2403.06963
[3] van Erven, T., & Harremoës, P. (2014). Rényi Divergence and Kullback–Leibler Divergence. IEEE Transactions on Information Theory.
[4] Schnitzer, M. E., et al. (2013). Targeted maximum likelihood estimation for marginal time-dependent treatment effects under density misspecification. Biostatistics.

Cite this work

APA

Eddie Landesberg. (2025, October 15). Coverage-Limited Efficiency: Why High ESS Isn't Enough. CIMO Labs Blog. https://cimolabs.com/blog/coverage-limited-efficiency

BibTeX

@misc{landesberg2025coverage-limited,
  author = {Eddie Landesberg},
  title = {Coverage-Limited Efficiency: Why High ESS Isn't Enough},
  howpublished = {\url{https://cimolabs.com/blog/coverage-limited-efficiency}},
  year = {2025},
  note = {CIMO Labs Blog}
}