
Coverage-Limited Efficiency: Why High ESS Isn't Enough

Eddie Landesberg · 10 min read

The problem: Your off-policy estimator reports 95% effective sample size (ESS), stabilized importance weights, and no extreme values. Everything looks healthy. But the estimate is still uninformative—standard errors remain huge and confidence intervals span the entire plausible range.

The insight: High ESS only tells you that weights aren't dominated by a few extreme observations. It doesn't tell you whether your logging policy has meaningful coverage in the regions where your target policy concentrates. If the logger rarely visits target-typical regions, no amount of weight calibration can make your logs-only estimate precise.

We formalize this intuition with Coverage-Limited Efficiency (CLE)—a sharp, local lower bound on standard errors for any logs-only off-policy estimator. The bound separates two failure modes: (1) insufficient logger coverage in target-relevant regions, and (2) shape mismatch between logger and target distributions within those regions. When judged fresh draws from the target policy are available, the CLE floor becomes diagnostic rather than limiting—the binding constraint shifts to the Monte Carlo term from the target sample.

Note: CLE has not been extensively tested numerically or empirically. We present it as our leading hypothesis for why calibrated importance sampling broke down in the Arena experiment, but computing TTC, β, and the actual CLE floor for that data remains future work. The theoretical framework is sound, but empirical validation across multiple domains is an active research direction.

The Coverage Gap

Suppose your logging policy π₀ generates responses to prompts, and you want to evaluate a new target policy π′ using only logged data. Standard off-policy evaluation (OPE) wisdom says: compute importance weights $w_i = \pi'(A_i \mid X_i) / \pi_0(A_i \mid X_i)$, check that ESS is high, stabilize the weights if needed, and you're good to go.

But ESS measures weight concentration—whether the estimate is dominated by a few outliers. It doesn't measure coverage—whether the logger visits the regions where the target policy typically operates.

Context vs. action coverage: Coverage can fail in contexts (P(X)) or in actions given context (P(A|X)). CLE here targets action-space overlap given a shared prompt set; if P(X) shifts, diagnose and correct that first (reweight contexts) before applying CLE.

Concrete failure mode

Your logger π₀ is a base LLM that produces terse, factual responses. Your target π′ is a fine-tuned variant that produces detailed, explanatory responses 3× longer. The logger almost never generates the kind of responses π′ prefers—the policies have poor overlap in output space.

Even after weight stabilization (SIMCal-W) pushes ESS to 95%, you only have a handful of logged responses that look anything like what π′ would generate. Those few observations carry all the information about π′'s performance. Your effective sample size for estimating π′ is tiny, even though global ESS looks great.

A Coverage-Limited Efficiency Bound

Notation ledger

π₀: logger policy; π′: target policy; Y: oracle-scale reward; R: calibrated reward;
T ⊂ 𝒳×𝒜: target-typical region; α = Pπ′(T); β = Pπ₀(T);
σT²: Var(Y | (X,A)∈T); χ²(p‖q): chi-square divergence;
TTC = α̂ (estimated target mass on T).

Coverage-Limited Efficiency formalizes this intuition. Let $T \subset \mathcal{X} \times \mathcal{A}$ be any target-relevant region—a subset of context-action space where the target policy concentrates its probability mass. For example, $T$ could be the set of (context, response) pairs that are "typical" under π′ (defined via surprisal thresholds from teacher forcing).

Define:

  • $\alpha = P_{\pi'}(T)$: Target policy's mass on $T$ ("How often does π′ visit this region?")
  • $\beta = P_{\pi_0}(T)$: Logging policy's mass on $T$ ("How often does the logger visit this region?")
  • $\sigma_T^2 = \operatorname{Var}(Y \mid (X,A) \in T)$: Outcome variance within $T$
  • $\chi^2(\pi'_T \,\|\, \pi_{0,T})$: Chi-square divergence between π′ and π₀ restricted to $T$ (measures shape mismatch inside the region)

Bound 1 (Coverage-Limited Efficiency)

For any regular logs-only estimator of $\Psi = \mathbb{E}_{\pi'}[Y]$ with influence function φ,

$$\text{SE}(\hat{\Psi}) \;\ge\; \frac{\sigma_T\,\alpha}{\sqrt{\beta\,n}}\,\sqrt{1 + \chi^2\!\big(\pi'_T \,\|\, \pi_{0,T}\big)}.$$

No weight calibration, no projection method, no clever estimator can beat this floor using only logged data from π₀.

Assumptions (informal): (A1) Regular estimator with an influence function under π₀; (A2) Positivity on T: β = Pπ₀(T) > 0; (A3) Local approximation: variance lower bounds derived from second-moment properties of the restricted likelihood ratio on T; (A4) Outcome variance on T finite (σT² < ∞). Remark: This is a local lower bound; it binds when target mass α on T and logger mass β on T drive difficulty. We treat it as a diagnostic bound in practice.

Connection to Rényi divergence: Because $D_2(p\|q) = \log(1 + \chi^2(p\|q))$ [3], our earlier factor $e^{D_2/2}$ equals $\sqrt{1+\chi^2}$. This connects the floor directly to second-moment diagnostics of restricted weights.

Interpreting the components

The CLE bound has three multiplicative factors:

1. Coverage penalty: $\alpha/\sqrt{\beta}$

If the target concentrates mass α on region $T$ but the logger only puts mass β there, you pay $\sqrt{\alpha^2/\beta} = \alpha/\sqrt{\beta}$. When β is tiny (logger rarely visits target-typical regions), this dominates. Example: α = 0.7, β = 0.01 → penalty = 7×.

2. Shape mismatch: $\sqrt{1 + \chi^2}$

Even when the logger does visit $T$, if the distributions have different shapes inside that region, you pay an additional penalty. Chi-square divergence measures this: $\chi^2 = 0$ means identical shapes (no penalty), $\chi^2 > 0$ inflates the floor. Example: χ² = 3 → penalty = 2×.

3. Noise & sample size: σT/n\sigma_T / \sqrt{n}

Standard Monte Carlo term: intrinsic outcome variability divided by $\sqrt{n}$. This is the unavoidable statistical uncertainty you'd face even with perfect overlap and no shape mismatch. The floor (correctly) vanishes when $\sigma_T = 0$.

Key insight: The coverage penalty $\alpha/\sqrt{\beta}$ and shape mismatch $\sqrt{1 + \chi^2}$ are multiplicative. Small β or large χ² can make the floor prohibitively large, rendering logs-only estimation uninformative no matter how you calibrate the weights.

Estimating χ² on T

Let wi = π′(Ai|Xi)/π₀(Ai|Xi) and restrict to i∈T. Normalize restricted weights: ẇi = wi / ((1/|T|)∑j∈T wj).

Then $\widehat{1+\chi^2} = \frac{1}{|T|}\sum_{i\in T} \dot{w}_i^2$.

Practical diagnostic: Plot the distribution of normalized restricted weights ẇi on T; (1/|T|)∑ ẇi² = 1+χ̂T². Heavy tails → large mismatch → inflated floor.
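
A minimal NumPy sketch of this estimator (array names are illustrative, not the CJE API):

import numpy as np

def one_plus_chi2_on_T(weights, in_T):
    """Estimate 1 + chi^2(pi'_T || pi_{0,T}) from normalized restricted weights."""
    # weights: raw importance ratios w_i = pi'(A_i|X_i) / pi_0(A_i|X_i)
    # in_T:    boolean mask for logged points that fall in the region T
    w_T = np.asarray(weights, dtype=float)[np.asarray(in_T, dtype=bool)]
    if w_T.size == 0:
        raise ValueError("No logged points in T (beta_hat = 0); the floor is infinite.")
    w_dot = w_T / w_T.mean()            # normalize so mean(w_dot) = 1
    return float(np.mean(w_dot ** 2))   # heavy right tail => value >> 1 => inflated floor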

Two Regimes: Logs-Only vs. With Fresh Draws

The CLE floor only binds for logs-only estimators (pure IPS, calibrated IPS). When you have judged fresh draws from the target policy π′, the picture changes fundamentally.

| Regime | Binding floor | Role of logs |
| --- | --- | --- |
| Logs-only (IPS, Cal-IPS) | CLE: $\frac{\sigma_T\,\alpha}{\sqrt{\beta n}}\sqrt{1+\chi^2}$ | Essential. Refuse if floor exceeds SE budget. |
| With judged fresh draws (Direct, DR) | MC: $\sqrt{\operatorname{Var}_{\pi'}(Y)/m}$ | Optional control variates (can improve, but not essential). |

Why fresh draws change everything

Suppose you collect $m$ judged fresh draws from π′: the target policy generates a response to each prompt, you score it with a cheap judge, and calibrate those scores to the oracle scale. Now your estimator is a two-sample design: $n$ logged observations from π₀ and $m$ fresh observations from π′.

Proposition: Two-sample decomposition

For any regular estimator using $n$ logs and $m$ judged fresh draws,

$$\operatorname{Var}(\hat{\Psi}) \;\ge\; B_{\text{logs}} + B_{\text{target}}, \quad \text{where } B_{\text{target}} = \frac{\operatorname{Var}_{\pi'}(R)}{m}, \quad B_{\text{logs}} \;\ge\; \frac{\sigma_T^2\,\alpha^2}{\beta\,n}\,(1+\chi^2).$$

The binding floor is the target Monte Carlo term $B_{\text{target}}$. The logs-only CLE becomes diagnostic—it tells you whether the logged data can meaningfully improve the target-only estimate, but you're not stuck with it as the binding constraint.

Oracle uncertainty (OUA): With partial oracle coverage, add an OUA component from calibrator learning: $\text{Var}_{\text{total}} \ge B_{\text{logs}} + B_{\text{target}} + \text{Var}_{\text{OUA}}$. When m is large and OUA shrinks, $B_{\text{target}}$ is the binding term.

Implication for Direct Model (DM) and Doubly Robust (DR): These methods generate fresh responses from each policy on the same prompts. The binding uncertainty is the Monte Carlo term from scoring those fresh responses with a calibrated judge. Logged data from π₀ can serve as control variates (DR uses them for bias correction), but poor logger coverage doesn't prevent you from getting a precise estimate—it just means the log-based corrections contribute little. Influence-function stacking (stacked-DR) automatically down-weights log corrections when the CLE floor is high.
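
To make the regime comparison concrete, here is a toy sketch computing both variance terms; the numbers and the helper name are illustrative, not outputs from the Arena data:

def variance_terms(var_target_R, m, sigma_T, alpha, beta, n, one_plus_chi2):
    """Target Monte Carlo term vs. the logs-only CLE variance floor."""
    b_target = var_target_R / m                                    # Var_pi'(R) / m
    b_logs = sigma_T ** 2 * alpha ** 2 / (beta * n) * one_plus_chi2
    return b_target, b_logs

# Toy numbers: 2,000 judged fresh draws vs. 5,000 logs with poor coverage.
b_t, b_l = variance_terms(var_target_R=0.04, m=2000,
                          sigma_T=0.20, alpha=0.6, beta=0.01,
                          n=5000, one_plus_chi2=3.0)
print(f"B_target = {b_t:.1e}, B_logs floor = {b_l:.1e}")
# Here the logs floor dwarfs B_target: log-based corrections contribute little,
# and the achievable SE is governed by the fresh-draw Monte Carlo term.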

Target-Typicality Coverage (TTC) Diagnostic

To operationalize CLE, we need to compute β: what fraction of logged data lives in target-relevant regions? The key challenge: defining $T$ (the target-typical region) without already having extensive data from π′.

Defining typicality via surprisal

Use teacher forcing to compute the per-token surprisal of each logged response under the target policy π′. Responses with low surprisal are "typical" for π′; high surprisal means π′ would rarely generate them. Set a threshold τ (e.g., 75th percentile of surprisal on a small validation set of fresh draws from π′, or a fixed percentile on logged data) and define:

$$T = \{(X_i, A_i) : \text{surprisal}_{\pi'}(A_i \mid X_i) \leq \tau\}$$

Alternative (no teacher forcing): risk-index typicality. Use the stage-1 index Ť = g(S, X) from AutoCal-R and define T as the top-k percentile of Ť under π′ (estimated via a small set of fresh draws). This yields a coverage diagnostic that does not depend on propensity scoring.

Then compute:

  • $\hat{\alpha} = \frac{1}{n}\sum_{i=1}^n w_i\,\mathbb{1}[i \in T]$ (target mass on $T$, estimated via importance weights)
  • $\hat{\beta} = \frac{1}{n}\sum_{i=1}^n \mathbb{1}[i \in T]$ (logger mass on $T$, directly observed)

Target-Typicality Coverage (TTC) is α̂. If TTC is low, the target policy concentrates most of its mass on a region where you have few logged observations—a red flag for logs-only estimation.
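
A minimal sketch of these two estimates under the surprisal definition of T (function and argument names are illustrative, not the CJE API):

import numpy as np

def ttc_diagnostics(weights, surprisal_target, tau):
    """Compute TTC (alpha_hat) and logger coverage (beta_hat) for threshold tau."""
    # weights:          raw importance ratios on the logged sample
    # surprisal_target: surprisal of each logged (X_i, A_i) under pi' (teacher forcing)
    # tau:              typicality threshold, e.g. a percentile of fresh-draw surprisal
    w = np.asarray(weights, dtype=float)
    in_T = np.asarray(surprisal_target, dtype=float) <= tau  # T = {surprisal <= tau}
    beta_hat = float(in_T.mean())         # logger mass on T, directly observed
    alpha_hat = float(np.mean(w * in_T))  # target mass on T, via importance weights
    return alpha_hat, beta_hat, in_T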

IF-ESS on T

Using out-of-fold influence contributions ψi, define $\text{IF-ESS}_T = \frac{(\sum_{i\in T} \psi_i)^2}{\sum_{i\in T} \psi_i^2}$. This is the effective number of informative samples inside $T$ for your estimator.
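
Assuming your estimator exposes out-of-fold influence contributions, this is a one-line computation (a sketch; names are illustrative):

import numpy as np

def if_ess_on_T(psi, in_T):
    """IF-ESS_T = (sum_{i in T} psi_i)^2 / sum_{i in T} psi_i^2."""
    psi_T = np.asarray(psi, dtype=float)[np.asarray(in_T, dtype=bool)]
    return float(psi_T.sum() ** 2 / np.sum(psi_T ** 2))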

Computing the SE floor

Once you have α̂, β̂, and an estimate of the shape mismatch $\widehat{1+\chi^2_T}$ (computed from the restricted importance weights on $T$ as shown above), the CLE floor is:

$$\text{SE}_{\min}(\tau) = \frac{\hat{\sigma}_T\,\hat{\alpha}}{\sqrt{\hat{\beta}\,n}}\,\sqrt{\widehat{1+\chi^2_T}}$$

where $\hat{\sigma}_T$ is the out-of-fold (OOF) residual standard deviation on $T$.
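
The floor itself is then one line; this sketch reuses the quantities defined above (illustrative, not the CJE API):

import numpy as np

def cle_floor(sigma_T, alpha_hat, beta_hat, n, one_plus_chi2):
    """CLE lower bound on the SE of any logs-only estimator of E_pi'[Y]."""
    return sigma_T * alpha_hat / np.sqrt(beta_hat * n) * np.sqrt(one_plus_chi2)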

REFUSE-LEVEL gates (logs-only)

Refuse to report logs-only estimates if any of the following hold (a sketch of these gates follows the list):

  • Floor exceeds precision budget: SEmin(τ) > SEtarget
  • Precision-to-cost dominated: the required sample size $n \geq \frac{\sigma_T^2\,\hat{\alpha}^2}{\hat{\beta}\,\varepsilon^2}\,\widehat{(1+\chi^2_T)}$ exceeds the feasible budget
  • Fragile evidence in-target: IF-ESST < Nmin (default 20)
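
A minimal sketch of these gates, with the Nmin default from above (names and thresholds illustrative):

def refuse_logs_only(se_min, se_target, n_required, n_feasible, if_ess_T, n_min=20):
    """Return (refuse?, which gates fired) for logs-only estimation."""
    gates = {
        "floor_exceeds_budget": se_min > se_target,
        "precision_to_cost_dominated": n_required > n_feasible,
        "fragile_evidence_in_target": if_ess_T < n_min,
    }
    return any(gates.values()), gates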

Coverage-mismatch profile

Since the definition of TT depends on the surprisal threshold τ, plot SEmin(τ) across a range of thresholds. If the floor exceeds your SE budget for all plausible τ, logs-only estimation is infeasible.

Worked example (toy)

n=5,000 logs; β̂=0.01; α̂=0.6; σT≈0.20; 1+χ̂T²≈3.

SEmin = (0.20 × 0.6) / √(0.01 × 5000) × √3 ≈ 0.12 / √50 × 1.732 ≈ 0.12 / 7.071 × 1.732 ≈ 0.029.

If your SEtarget is 0.01, logs-only is infeasible (floor is 3× too high).
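
Plugging the toy numbers into the cle_floor sketch from earlier reproduces this arithmetic:

se_min = cle_floor(sigma_T=0.20, alpha_hat=0.6, beta_hat=0.01,
                   n=5000, one_plus_chi2=3.0)
print(f"SE_min ≈ {se_min:.3f}")  # ≈ 0.029, about 3x a 0.01 SE budget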

How to compute CLE in practice

  1. Choose T: Define via surprisal or risk-index typicality; profile over τ
  2. Compute β̂: β̂ = |T|/n
  3. Estimate α̂: Small fresh draws + simple classifier density ratio, or importance weights if reliable
  4. Estimate σT: Via out-of-fold residuals on T
  5. Estimate 1+χ̂T²: Via normalized restricted weights on T (shown above)
  6. Compute SEmin(τ): Profile over τ; apply gates (a consolidated sketch follows this list)
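
Putting the steps together, a consolidated sketch that profiles the floor over τ, reusing the illustrative helpers defined earlier (none of this is the CJE API):

import numpy as np

def cle_profile(weights, surprisal_target, sigma_T_oof, taus):
    """Profile SE_min(tau) over typicality thresholds (steps 1-6)."""
    n = len(weights)
    floors = {}
    for tau in taus:
        alpha_hat, beta_hat, in_T = ttc_diagnostics(weights, surprisal_target, tau)
        if beta_hat == 0.0:
            floors[tau] = np.inf                         # no logged points in T
            continue
        one_plus_chi2 = one_plus_chi2_on_T(weights, in_T)
        floors[tau] = cle_floor(sigma_T_oof, alpha_hat, beta_hat, n, one_plus_chi2)
    return floors  # plot these; if every value exceeds the SE budget, refuse logs-only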

Connection to Arena Experiment Results

In the Arena experiment, we saw that SNIPS (self-normalized importance sampling)[1] and calibrated-ips (SIMCal-W stabilized weights) both failed catastrophically for ranking, despite SIMCal-W boosting ESS from 0.4–26% up to 82–99%. CLE provides our leading hypothesis for why this occurred:

  • Poor logger coverage in target-typical regions. The base policy (Llama 3.3 70B with standard system prompt) generates very different responses than the target policies (parallel_universe, premium, unhelpful). Even though global ESS is high after SIMCal-W, β̂ (logger coverage in target-typical regions) is tiny for policies that differ substantially from base.
  • Teacher forcing noise amplifies shape mismatch. Computing propensities π′(A|X) via teacher forcing is noisy and non-deterministic[2], inflating the local divergence D₂ even for the clone policy (identical to base with different seed).
  • Why Direct Model (DM) succeeds. Direct methods generate fresh responses from each policy on the same prompts. The binding floor is the Monte Carlo term $\sqrt{\operatorname{Var}_{\pi'}(R)/m}$, not the CLE floor. Poor logger coverage is irrelevant—each policy gets its own independent sample.
  • Why Doubly Robust (DR) recovers performance. DR uses fresh draws to train an outcome model, then applies importance-weighted corrections using logged data. The outcome model provides a baseline; logs are control variates. When the CLE floor is high (poor logger coverage), the log-based corrections contribute little, and the estimate is governed by the target Monte Carlo term. SIMCal-W stabilization ensures the corrections don't hurt, even if they don't help much.

Illustrative vignette from Arena experiment

For the parallel_universe policy (hypothetical coverage estimates):

  • Global ESS after SIMCal-W: 95.4% (observed)
  • Logger coverage in parallel_universe-typical regions: β̂ ≈ 0.006 (inferred, not computed)
  • Target mass: α̂ ≈ 0.7 (inferred, not computed)
  • Coverage penalty: $\alpha/\sqrt{\beta} \approx 0.7/\sqrt{0.006} \approx 9$ (9× inflation)

Even with perfect shape match (D₂ = 0), the CLE floor would be 9× higher than the naïve $1/\sqrt{n}$ rate. With teacher forcing noise pushing D₂ > 0, the floor becomes prohibitive. Result: SNIPS achieves 8.7% top-1 accuracy (random guessing = 20%), and calibrated-ips only reaches 19.1% despite 95% ESS. Note: The β and α values are illustrative inferences based on policy similarity, not computed from the data. Computing actual TTC and CLE floors for the Arena experiment is future work.

Practical Guidance

When to refuse logs-only estimation

  1. Before running the experiment: If you know the target and logging policies differ substantially (different model sizes, system prompts, temperatures), expect poor coverage. Plan to collect judged fresh draws rather than relying on logs-only OPE.
  2. After collecting data: Compute TTC (α̂), logger coverage (β̂), and the CLE floor SEmin. If SEmin exceeds your precision target (e.g., you need SE ≤ 0.01 but the floor is 0.05), refuse logs-only estimation.
  3. Profile over typicality thresholds: Plot SEmin(τ) to check robustness. If the floor is prohibitive across all plausible definitions of "target-typical," the result is not sensitive to your choice of τ.

Sample size planning

To achieve SE ≤ ε with logs-only data,

$$n \geq \frac{\sigma_T^2\,\alpha^2}{\beta\,\varepsilon^2}\,(1+\chi^2).$$

If β is tiny (poor logger coverage), the required $n$ explodes. In such cases, collecting fresh target draws is far more efficient than increasing the size of the logged dataset.

Cost guidance: With β small, required n scales like α²/(β ε²). A handful of judged fresh draws (m in the low thousands) often beats adding millions of logs. We surface this trade-off directly via the floor.
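
A sketch of this planning calculation with the worked-example numbers (illustrative helper, not the CJE API):

def required_n_logs_only(sigma_T, alpha, beta, epsilon, one_plus_chi2):
    """Logged sample size needed so the CLE floor is <= epsilon."""
    return sigma_T ** 2 * alpha ** 2 / (beta * epsilon ** 2) * one_plus_chi2

n_req = required_n_logs_only(0.20, 0.6, 0.01, 0.01, 3.0)
print(f"{n_req:,.0f} logs needed")  # ≈ 43,200 for the toy numbers above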

When logs still help

Even with judged fresh draws (Direct/DR), logged data can improve efficiency if the CLE floor is reasonable. Doubly robust estimators and influence-function stacking (stacked-DR) automatically weight the log-based corrections by their precision. When β is moderate (logger has decent coverage), DR can achieve lower variance than pure Direct estimation. When β is tiny, stacking mutes the log corrections and you recover essentially the Direct estimate.

Comparison to ESS

| Metric | What it measures | Failure mode it detects |
| --- | --- | --- |
| ESS | Weight concentration: $(\sum w_i)^2 / \sum w_i^2$ | A few extreme weights dominate the estimate |
| TTC (α̂) | Logger coverage in target-typical regions | Logger rarely visits where target concentrates |
| CLE floor | Minimum achievable SE given coverage and shape mismatch | Logs-only estimation is fundamentally uninformative |

Key insight: You can have high ESS (no weight concentration) but low TTC (poor coverage), leading to a prohibitive CLE floor. Both diagnostics are necessary for honest off-policy evaluation.

Limitations

  • Defining $T$ without fresh draws is heuristic. The surprisal-based typicality definition requires teacher forcing under π′, which may be noisy. Profile over thresholds and disclose sensitivity. The risk-index alternative avoids propensity scoring.
  • Estimating σT with sparse oracle coverage. Use out-of-fold residuals and conservative bounds when oracle labels are limited.
  • Context shift. If the distribution over contexts $X$ differs between logger and target, the analysis requires separate coverage checks over $X$. The CLE bound assumes overlap failures are in the action space $A \mid X$, not in $X$ itself.
  • Tightness: The bound is local in T. If you partition 𝒳×𝒜 into bins, convexity yields a weighted combination lower bound; the worst (lowest β/highest χ²) bin often dominates. This explains why global ESS can look fine while a single high-mass target bin sets the floor.

The Relationship: Design-by-Projection and CLE

Design-by-Projection (DbP) methods (AutoCal-R, SIMCal-W) project empirical data onto convex sets encoding structural knowledge (monotonicity, mean preservation). This reduces standard errors from an uncalibrated baseline. But no projection can beat the coverage-limited efficiency floor—it's a hard barrier determined by logger coverage and local mismatch.

Together, DbP and CLE provide complementary perspectives:

  • DbP methods: Practical tools to reduce variance through calibration and stabilization
  • CLE floor: Theoretical minimum that tells you when logs-only estimation is fundamentally limited

CJE currently implements Design-by-Projection methods (AutoCal-R, SIMCal-W). CLE diagnostics (TTC, IF-ESS in T, SE floor) are planned for future releases.

Planned Implementation in CJE

CLE diagnostics are planned for integration into the CJE package in a future release. The implementation will include:

  • TTC (Target-Typicality Coverage): α̂, the target policy's estimated mass on the logger-observed region
  • Logger coverage: β̂, the fraction of logged data in target-typical regions
  • Mismatch multiplier: $\hat{M}_T = \exp(D_2)$, measuring shape divergence inside the region
  • CLE floor: SEmin, the theoretical minimum standard error
  • IF-ESS restricted to $T$: Effective sample size of the influence function within the target-typical region

When implemented, if logs-only estimation is attempted and any REFUSE-LEVEL gate triggers, CJE will issue a warning and suggest collecting judged fresh draws.

# Planned API for CLE diagnostics
from cje import OffPolicyEstimator
results = OffPolicyEstimator(method='calibrated-ips').fit(data)
results.cle_diagnostics() # Coming soon
# Output will include TTC, SE_min, refuse gates

If you're interested in contributing to the CLE implementation or have use cases that would benefit from these diagnostics, please open an issue or discussion on the CJE GitHub repository.

Conclusion

High ESS is necessary but not sufficient for informative off-policy evaluation. Coverage-Limited Efficiency provides a sharp, local lower bound on standard errors that separates logger coverage from shape mismatch and makes refusal decisions rigorous.

Key takeaways:

  • Logs-only OPE has a hard precision floor determined by logger coverage in target-relevant regions
  • With judged fresh draws (Direct/DR), the binding floor is the Monte Carlo term—logs become optional control variates
  • TTC (Target-Typicality Coverage) operationalizes the coverage diagnostic
  • Design-by-Projection and CLE form complementary perspectives: projections reduce variance, CLE sets the theoretical limit

For practitioners: compute TTC and the CLE floor before committing to logs-only estimation. When the floor is prohibitive, invest in fresh target draws rather than scaling up logged data collection.

Research Direction & Feedback

CLE is an active area of research. Empirical validation across diverse domains, numerical studies of the tightness of the bound, and practical implementation of TTC diagnostics are ongoing work. We invite feedback, critical discussion, and collaboration. If you're interested in testing CLE on your data or have insights about coverage diagnostics for off-policy evaluation, please reach out via our contact page or GitHub.

Appendix: Proofs

Notation & assumptions (for this appendix)

Let $Z=(X,A,Y)$ be drawn i.i.d. under the logging policy distribution $\pi_0$ with density $q$, and let the target policy distribution be $\pi'$ with density $p$ (both w.r.t. a common base measure). Define the importance ratio $w(Z)=p(Z)/q(Z)$ and a measurable $T\subset\mathcal{X}\times\mathcal{A}$. Write $\alpha = P_{\pi'}(T)$, $\beta = P_{\pi_0}(T)$, and the restricted densities $p_T = p/\alpha$, $q_T = q/\beta$ on $T$. Let $\sigma_T^2$ denote a lower bound on the conditional outcome variance on $T$, e.g. $\sigma_T^2 \le \operatorname*{ess\,inf}_{(x,a)\in T}\operatorname{Var}(Y\mid X{=}x, A{=}a)$ (any fixed lower bound suffices for the inequality below). We consider regular (asymptotically linear) estimators based only on logs (for Bound 1) and on independent logs plus target draws (for the two-sample bound).


Lemma A (weight identity on $T$)

On $T$, the importance ratio satisfies $w = \frac{p}{q} = \frac{\alpha}{\beta}\cdot\frac{p_T}{q_T}$. Consequently,

$$\mathbb{E}_{\pi_0}\!\left[w^2\,\mathbf{1}_T\right] \;=\; \frac{\alpha^2}{\beta}\int_T \frac{p_T^2}{q_T}\,d\mu \;=\; \frac{\alpha^2}{\beta}\big(1+\chi^2(p_T\|q_T)\big).$$

Proof. On $T$, write $p=\alpha p_T$, $q=\beta q_T$ and expand. The last equality uses $\chi^2(p_T\|q_T)=\int_T \frac{(p_T-q_T)^2}{q_T}\,d\mu = \int_T \frac{p_T^2}{q_T}\,d\mu - 1$.


Bound 1 (Coverage‑Limited Efficiency, χ² form)

For any regular logs-only estimator of $\Psi=\mathbb{E}_{\pi'}[Y]$, the asymptotic standard error obeys

$$\mathrm{SE}(\hat\Psi) \;\ge\; \frac{\sigma_T\,\alpha}{\sqrt{\beta\,n}}\,\sqrt{1+\chi^2\!\big(p_T\|q_T\big)}.$$

Intuition: How the pieces fit together

The floor has four multiplicative components, each capturing a distinct source of difficulty:

  • $\sigma_T$: Irreducible outcome noise on $T$—even with infinite data, you cannot estimate $\mathbb{E}[Y\mid X,A]$ more precisely than the conditional variance allows.
  • $\alpha$: Target mass on $T$—the more the target policy concentrates on $T$, the more this region dominates the overall mean, amplifying any estimation error here.
  • $1/\sqrt{\beta n}$: Effective logger sample size in $T$—you only have about $\beta n$ logged samples in the relevant region; sparse coverage directly inflates variance.
  • $\sqrt{1+\chi^2(p_T\|q_T)}$: Shape mismatch within $T$—even when the logger visits $T$, if it explores the region differently than the target (high χ²), importance weights become extreme and variance explodes. This is the "effective sample size tax" from reweighting[1].

Together: outcome noise × target importance × logger scarcity × reweighting penalty. Each factor is unavoidable; the bound is tight when all four sources align to make the problem fundamentally hard.

Proof.

  1. Decompose $\Psi = \mathbb{E}_{\pi'}[Y] = \Psi_T + \Psi_{T^c}$ with $\Psi_T=\mathbb{E}_{\pi'}[Y\,\mathbf{1}_T]=\mathbb{E}_{\pi_0}[w\,Y\,\mathbf{1}_T]$. Since variances add for independent components and dropping $T^c$ can only reduce difficulty, a lower bound for estimating $\Psi_T$ is a valid lower bound for estimating $\Psi$.
  2. Consider the model where $w$ is treated as known (oracle propensities). Then $\Psi_T = \mathbb{E}_{\pi_0}[f(Z)]$ with $f(Z)=w\,Y\,\mathbf{1}_T$ is a linear functional of the unknown logging distribution. In this model the efficient influence function (EIF)[4] is $\phi(Z)=f(Z)-\Psi_T$, so the semiparametric efficiency bound is $\operatorname{AVAR}(\hat\Psi_T)\ge \frac{1}{n}\operatorname{Var}_{\pi_0}\!\big(w\,Y\,\mathbf{1}_T\big)$.
  3. Center on the target-restricted mean $\mu_T=\mathbb{E}_{p_T}[Y]$ and note $\mathbb{E}_{\pi_0}[w\,(Y-\mu_T)\,\mathbf{1}_T] = \alpha\,\mathbb{E}_{p_T}[Y-\mu_T]=0$. Then $\operatorname{Var}_{\pi_0}\!\big(w\,Y\,\mathbf{1}_T\big) = \mathbb{E}_{\pi_0}\!\big[w^2 (Y-\mu_T)^2\,\mathbf{1}_T\big]$.
  4. Take the conditional expectation given $(X,A)$ and use $\mathbb{E}\!\big[(Y-\mu_T)^2\mid X,A\big] \ge \operatorname{Var}(Y\mid X,A)$. By the definition of $\sigma_T^2$, we have $\mathbb{E}_{\pi_0}\!\big[w^2 (Y-\mu_T)^2\,\mathbf{1}_T\big] \ge \sigma_T^2\,\mathbb{E}_{\pi_0}\!\big[w^2\,\mathbf{1}_T\big]$.
  5. Apply Lemma A to conclude $\operatorname{Var}_{\pi_0}\!\big(w\,Y\,\mathbf{1}_T\big) \ge \sigma_T^2\,\frac{\alpha^2}{\beta}\,\big(1+\chi^2(p_T\|q_T)\big)$. Taking square roots yields the stated bound.

Remarks.

  • Treating $w$ as known makes the problem easier; hence this is a valid lower bound for any logs-only estimator in the real problem (where $w$ is estimated).
  • The bound is local in $T$: poor logger coverage ($\beta\ll 1$) or shape mismatch ($\chi^2\gg 1$) on any high-mass $T$ inflates the floor multiplicatively.

Corollary (ESS form on $T$)

Define normalized restricted weights $\tilde w_i = w_i / \mathbb{E}_{\pi_0}[w \mid i\in T]$ and the usual weight ESS on $T$:

$$\text{ESS}_T \;=\; \frac{\big(\sum_{i\in T} \tilde w_i\big)^2}{\sum_{i\in T} \tilde w_i^{\,2}}.$$

Then $\mathbb{E}[\tilde w_i]=1$ and $\mathbb{E}[\tilde w_i^{\,2}] = 1+\chi^2(p_T\|q_T)$, so $\mathbb{E}[\text{ESS}_T] \approx \frac{|T|}{1+\chi^2(p_T\|q_T)}$. Since $\mathbb{E}[|T|]=\beta n$, the Bound 1 floor can be rewritten as

$$\mathrm{SE}_{\min} \;\gtrsim\; \frac{\sigma_T\,\alpha}{\sqrt{\text{ESS}_T}} \quad \text{(up to concentration of } |T| \text{ around } \beta n\text{)}.$$

Proof sketch. With $\tilde w$ normalized, the classical identity $\frac{1}{|T|}\sum_{i\in T} \tilde w_i^{\,2} \to 1+\chi^2(p_T\|q_T)$ implies $\text{ESS}_T \approx \frac{|T|}{1+\chi^2}$. Replace $\sqrt{1+\chi^2}$ in Bound 1 by $\sqrt{|T|/\text{ESS}_T}$ and use $|T|\approx \beta n$.


Proposition (Two‑sample variance lower bound)

Suppose that in addition to $n$ logs from $\pi_0$ you have $m$ independent judged fresh draws $U_1,\ldots,U_m \sim \pi'$, each scored into an oracle-scale reward $R$ (e.g., by a calibrated judge). For any regular estimator using both samples (with sample-splitting/cross-fitting so that nuisance fits are independent of the target half),

$$\operatorname{Var}(\hat\Psi) \;\ge\; B_{\text{logs}} + B_{\text{target}}, \qquad B_{\text{target}} = \frac{\operatorname{Var}_{\pi'}(R)}{m}, \quad B_{\text{logs}} \;\ge\; \frac{\sigma_T^2\,\alpha^2}{\beta\,n}\,\big(1+\chi^2(p_T\|q_T)\big).$$

Proof sketch. The efficient influence function for $\Psi=\mathbb{E}_{\pi'}[R]$ in the two-sample (independent) experiment decomposes into orthogonal components $\phi = \phi_{\text{target}} + \phi_{\text{logs}}$, where $\phi_{\text{target}} = R - \mathbb{E}_{\pi'}[R]$ (the nonparametric EIF for a mean under $\pi'$) and $\phi_{\text{logs}}$ lies in the tangent space of the logs model. Independence implies $\operatorname{Var}(\phi) = \operatorname{Var}(\phi_{\text{target}}) + \operatorname{Var}(\phi_{\text{logs}})$. The first term yields $\operatorname{Var}_{\pi'}(R)/m$; the second is bounded below by Bound 1 (applied to $R$ in place of $Y$) when we use logs as control variates. This gives the stated inequality.

Remark. The target term is the binding floor when judged fresh draws are available in sufficient quantity; logs can help as control variates only when the CLE floor is not prohibitive.


Lemma B (the Rényi-2 factor equals $\sqrt{1+\chi^2}$)

By definition, $D_2(p_T\|q_T)=\log\int_T \frac{p_T^2}{q_T}\,d\mu=\log\big(1+\chi^2(p_T\|q_T)\big)$ [3]. Therefore $\exp(D_2/2)=\sqrt{1+\chi^2(p_T\|q_T)}$, and the version of the CLE floor in the main text, $\frac{\sigma_T\,\alpha}{\sqrt{\beta n}}\exp(D_2/2)$, is identical to the χ² form here.

References

[1] Swaminathan, A., & Joachims, T. (2015). The Self-Normalized Estimator for Counterfactual Learning. NeurIPS. https://papers.nips.cc/paper/5748-the-self-normalized-estimator-for-counterfactual-learning
[2] Bachmann, G., & Nagarajan, V. (2024). The Pitfalls of Next-Token Prediction. ICML. arXiv:2403.06963. https://arxiv.org/abs/2403.06963
[3] van Erven, T., & Harremoës, P. (2014). Rényi Divergence and Kullback–Leibler Divergence. IEEE Transactions on Information Theory.
[4] Schnitzer, M. E., et al. (2013). Targeted maximum likelihood estimation for marginal time-dependent treatment effects under density misspecification. Biostatistics.

Cite this work

APA

Eddie Landesberg. (2025, October 15). Coverage-Limited Efficiency: Why High ESS Isn't Enough. CIMO Labs Blog. https://cimolabs.com/blog/coverage-limited-efficiency

BibTeX

@misc{landesberg2025coverage-limited,
  author = {Eddie Landesberg},
  title = {Coverage-Limited Efficiency: Why High ESS Isn't Enough},
  howpublished = {\url{https://cimolabs.com/blog/coverage-limited-efficiency}},
  year = {2025},
  note = {CIMO Labs Blog}
}