
Design-by-Projection: A General Principle for Structure-Aware Estimation

Eddie Landesberg · 12 min read

The principle: When you know structural properties of your problem (outcomes increase with judge scores, importance weights should average to 1), but don't know the exact functional form, project your empirical data onto the set of functions satisfying those constraints. This preserves unbiasedness while reducing variance.

The framework: Design-by-Projection (DbP) unifies AutoCal-R (reward calibration via isotonic regression) and SIMCal-W (weight stabilization via isotonic projection) under a single principle. Both methods project onto convex constraint sets, but AutoCal-R operates on judge scores → oracle outcomes, while SIMCal-W operates on importance weights.

Rather than estimating unconstrained functions that overfit small samples or imposing rigid parametric forms (linear, logistic) that misspecify, Design-by-Projection finds the closest function (in least-squares sense) that satisfies what you know must be true. The result: automatic variance reduction, bias-variance trade-offs that favor small oracle samples, and interpretable output.

In Arena: DbP instantiates as AutoCal-R (reward calibration with a mean constraint and covariates) and SIMCal-W (mean-one monotone weight calibration). Together they explain why Direct and DR work well—and why IPS alone fails under overlap scarcity.

DbP Assumptions at a glance

AutoCal-R: Monotonicity of $Y$ in a judge-based risk index $T = g(S, X)$; mean-preservation enforced on the oracle slice.

SIMCal-W: Stabilized weights are a monotone function of $S$ (or $T$), nonnegative, unit-mean; projection cannot create overlap, so diagnose ESS and tails.

The Problem: Balancing Flexibility and Structure

Suppose you're calibrating an LLM judge to predict oracle outcomes. You have:

  • $n$ observations: judge scores $S_i$ and oracle labels $Y_i$
  • Goal: Learn $f: S \to Y$ to predict oracle outcomes from judge scores
  • Known constraint: Higher judge scores should predict no worse oracle outcomes (monotonicity)

The Goldilocks problem

Too flexible (unconstrained regression): Learns arbitrary non-monotone wiggles, overfits noise in small samples, produces calibrated predictions that invert judge rankings.

Too rigid (linear regression): Forces $f(S) = \alpha + \beta S$, which misspecifies the relationship when it is actually nonlinear (e.g., saturation at high scores, floor effects at low scores).

Just right (isotonic regression): Flexible enough to capture nonlinearity, constrained enough to avoid overfitting. Learns a piecewise-constant monotone function that fits the data while respecting known structure.
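
As a toy illustration of the contrast, here is a sketch using scikit-learn; the saturating "true" relationship and noise level are made up for illustration:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
S = rng.uniform(0, 10, n)                                 # judge scores
Y = 1 / (1 + np.exp(-(S - 5))) + rng.normal(0, 0.1, n)    # saturating truth + noise

# Too rigid: a straight line forced through a saturating relationship
linear = LinearRegression().fit(S.reshape(-1, 1), Y)

# Just right: piecewise-constant, monotone, data-driven shape
iso = IsotonicRegression(out_of_bounds="clip").fit(S, Y)

grid = np.linspace(0, 10, 6)
print("linear  :", linear.predict(grid.reshape(-1, 1)).round(2))
print("isotonic:", iso.predict(grid).round(2))
```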

The Design-by-Projection Principle

Core idea: Encode what you know (or assume) as a convex constraint set $\mathcal{C}$, then project your empirical estimate onto $\mathcal{C}$.

Projection formula

$$\hat{f}_{\text{DbP}} = \text{argmin}_{g \in \mathcal{C}} \| \hat{f}_{\text{unc}} - g \|^2$$

Find the function $g$ in the constraint set $\mathcal{C}$ that is closest (in $L^2$ norm) to the unconstrained empirical estimate $\hat{f}_{\text{unc}}$.

Why does this work?

When $\mathcal{C}$ is convex and contains the true function, projection has three key properties:

  1. Bias–variance trade-off: Projection onto a correct constraint set is a contraction that typically reduces variance and can reduce MSE; it does not generally preserve finite-sample unbiasedness. For reward calibration we recover the right mean by explicit mean-preservation; for weight calibration we enforce unit-mean weights.
  2. Variance reduction: Projection is a smoothing operation. By ruling out functions that violate known constraints, you reduce the effective degrees of freedom, lowering variance. For cones that contain the origin (e.g., the monotone cone), projection weakly reduces the norm. For general convex sets, projection minimizes distance to $\mathcal{C}$, not necessarily the norm.
  3. Interpretability: The output respects structural knowledge (monotonicity, mean preservation, boundedness), making results easier to validate and debug. You can't get perverse predictions that violate domain knowledge.

Projection in Hilbert spaces (intuition)

For a closed convex set $\mathcal{C}$, the metric projection $P_{\mathcal{C}}(f)$ is unique and is characterized by the variational inequality $\langle f - P_{\mathcal{C}}(f),\, g - P_{\mathcal{C}}(f) \rangle \leq 0$ for all $g \in \mathcal{C}$. For cones containing 0, projection weakly reduces norm; in general it minimizes distance to $\mathcal{C}$. This geometric condition explains why imposing structure reduces variance without overfitting.[1,2]
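
A quick numerical check of this characterization; the constraint set here is the box $[0,1]^d$, an arbitrary convex set chosen because its projection is simply coordinate-wise clipping:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
f = rng.normal(0, 2, d)            # a point, generally outside the set
Pf = np.clip(f, 0.0, 1.0)          # metric projection onto C = [0, 1]^d

# Variational inequality: <f - Pf, g - Pf> <= 0 for every g in C
worst = max(np.dot(f - Pf, rng.uniform(0, 1, d) - Pf) for _ in range(1000))
print(f"largest inner product over 1000 random g in C: {worst:.3e} (should be <= 0)")
```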

Application 1: AutoCal-R (Reward Calibration)

Problem: LLM judge scores $S$ are on an arbitrary scale. You need to map them to oracle outcomes $Y$ for downstream estimation.

Constraint set: $\mathcal{C}_{\text{mono}} = \{f: S \to \mathbb{R} \mid f \text{ is non-decreasing}\}$. Monotonicity is the minimal assumption: better judge scores shouldn't predict worse outcomes.

Monotone mode

Directly project judge scores to oracle outcomes via isotonic regression:

$$\hat{f}(S) = \text{argmin}_{g \in \mathcal{C}_{\text{mono}}} \sum_{i=1}^n (Y_i - g(S_i))^2$$

This is isotonic regression: least-squares fit subject to monotonicity.[3,4] The solution is a piecewise-constant function computed efficiently via the Pool Adjacent Violators (PAV) algorithm in O(n) time on sorted scores (or O(n log n) including the initial sort).[5,6]
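
For intuition, here is a minimal PAV sketch with unit weights, assuming the labels have already been sorted by judge score; in practice one would use an optimized implementation such as sklearn.isotonic.IsotonicRegression:

```python
import numpy as np

def pav(y):
    """Pool Adjacent Violators: non-decreasing least-squares fit to y
    (y assumed already ordered by judge score). Returns the fitted values."""
    levels, sizes = [], []                       # block means and block sizes
    for v in y:
        levels.append(float(v)); sizes.append(1)
        # merge adjacent blocks while monotonicity is violated
        while len(levels) > 1 and levels[-2] > levels[-1]:
            total = sizes[-2] + sizes[-1]
            merged = (levels[-2] * sizes[-2] + levels[-1] * sizes[-1]) / total
            levels[-2:], sizes[-2:] = [merged], [total]
    return np.repeat(levels, sizes)

# Labels sorted by judge score; adjacent violators get pooled into blocks
print(pav([0.2, 0.1, 0.6, 0.5, 0.9]))   # -> [0.15 0.15 0.55 0.55 0.9 ]
```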

Two-stage mode (with covariates)

When judge scores have systematic bias (e.g., response length affects scores independent of quality):

  1. Stage 1 (risk index): Learn $T = g(S, X)$ (e.g., a spline on $(S, \text{response\_length})$)
  2. Stage 2 (isotonic): Fit $\hat{f}(T)$ by isotonic regression of $Y$ on $T$, then apply a constant shift so that the oracle-slice mean of $\hat{f}$ matches that of $Y$

This corrects systematic judge bias (e.g., verbosity preference) while retaining monotonicity in the risk index $T$.
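
A sketch of the two-stage mode under illustrative assumptions: the risk index is a boosted-tree regression of the oracle label on $(S, X)$ fit on the labeled slice (the CJE implementation may differ), and outcomes are assumed to live in $[0, 1]$:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.isotonic import IsotonicRegression

def fit_two_stage(S, X, Y):
    """Stage 1: risk index T = g(S, X). Stage 2: isotonic regression of Y on T,
    plus a constant shift so the calibrated mean matches the oracle-slice mean."""
    Z = np.column_stack([S, X])
    g = GradientBoostingRegressor(random_state=0).fit(Z, Y)      # stage 1
    T = g.predict(Z)
    iso = IsotonicRegression(out_of_bounds="clip").fit(T, Y)     # stage 2
    shift = Y.mean() - iso.predict(T).mean()                     # mean preservation

    def f_hat(S_new, X_new):
        T_new = g.predict(np.column_stack([S_new, X_new]))
        return np.clip(iso.predict(T_new) + shift, 0.0, 1.0)     # clip to [0, 1]

    return f_hat
```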

Why isotonic regression?

  • Mean preservation (how we enforce it): Vanilla isotonic is an $L^2$ projection onto the monotone cone and does not by itself match the oracle mean.[3,4] In AutoCal-R we enforce mean preservation via a constant shift: $\hat{f}_{\text{mp}}(s) = \hat{f}_{\text{iso}}(s) + \big(\overline{Y} - \overline{\hat{f}_{\text{iso}}(S)}\big)$, which preserves monotonicity and puts the calibrator on the oracle scale. (Clip to [0,1] if needed.)
  • Minimal assumptions: Only requires monotonicity, not linearity or parametric form
  • Small-sample efficiency: Works with 5-25% oracle coverage (50-1250 labels)
  • Adaptive complexity: For isotonic regression, the degrees of freedom equals the number of constant blocks in the fit and adapts to signal complexity;[4] in practice it is far smaller than $n$, yielding substantial variance reduction.

Application 2: SIMCal-W (Weight Stabilization)

Problem: Off-policy importance weights $w_i = \pi'(A_i|X_i) / \pi_0(A_i|X_i)$ are often extreme, leading to high variance and poor effective sample size (ESS).[11,12]

Constraint set: $\mathcal{C}_{\text{cal}} = \{h: S \to \mathbb{R}_+ \mid h \text{ is monotone in } T = g(S,X),\, \mathbb{E}_{\pi_0}[h(T)] = 1\}$, i.e., unit mean under the logger.[9,10] Calibrated weights should be nonnegative, monotone in a risk index, and preserve unbiasedness.

The stacked isotonic projection

SIMCal-W builds two candidate weight functions:

  1. Increasing candidate: Isotonic regression of $w$ on $S$ (higher scores → higher weights)
  2. Decreasing candidate: Antitonic regression of $w$ on $S$ (higher scores → lower weights; isotonic under the reversed order)

After smoothing, we enforce nonnegativity and unit mean: rescale $\tilde{w} = \hat{w}_{\text{iso}} / \overline{\hat{w}_{\text{iso}}}$.[9,13] Stacking uses cross-fitted out-of-fold influence functions to tune $\lambda$ by minimizing estimated variance:[14,15]

$$\hat{w}_{\text{SIMCal}} = \lambda \cdot \hat{w}_{\text{inc}} + (1-\lambda) \cdot \hat{w}_{\text{dec}}, \quad \lambda^* = \text{argmin}_{\lambda \in [0,1]} \text{Var}_{\text{IF}}(\lambda)$$
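
A simplified sketch of the stacking step, assuming plain IPS influence functions evaluated in-sample; the actual SIMCal-W uses cross-fitted, out-of-fold influence functions when choosing $\lambda$:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def stack_weights(S, w, R):
    """S: judge scores, w: raw importance weights, R: calibrated rewards.
    Fit increasing and decreasing isotonic candidates, rescale each to unit
    mean, then pick the mixture that minimizes the IPS influence-function
    variance (here approximated by the variance of w_cal * R)."""
    inc = IsotonicRegression(increasing=True).fit(S, w).predict(S)
    dec = IsotonicRegression(increasing=False).fit(S, w).predict(S)
    inc, dec = inc / inc.mean(), dec / dec.mean()      # nonnegative, unit mean

    lambdas = np.linspace(0.0, 1.0, 101)
    var_if = [np.var((lam * inc + (1 - lam) * dec) * R) for lam in lambdas]
    lam_star = lambdas[int(np.argmin(var_if))]
    return lam_star * inc + (1 - lam_star) * dec, lam_star
```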

Why stacking?

By considering both directions (increasing and decreasing), SIMCal-W avoids having to assert which direction the monotone relationship should go. The data tells you: if increasing weights better stabilize the estimate, λ → 1; if decreasing weights are better, λ → 0. This makes the method robust to misspecification of the monotone direction.

No new overlap: Stabilization prevents numerical degeneracy but cannot create support where the logger has none.[16] Always report ESS, max/median weight, and a tail index before/after smoothing.[11,12] In LLM OPE, raw $w$ often come from teacher-forced sequence likelihoods; these can be noisy or structurally misspecified.[21] Stabilization helps variance, but cannot fix overlap or propensity misspecification.
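
A sketch of these diagnostics; the Hill-style tail-index estimate below is one simple choice among several, and the function name is ours, not CJE's:

```python
import numpy as np

def weight_diagnostics(w, tail_frac=0.05):
    """ESS fraction, max/median weight, and a rough Hill tail-index estimate
    (assumes strictly positive weights)."""
    w = np.asarray(w, dtype=float)
    ess_frac = w.sum() ** 2 / (len(w) * (w ** 2).sum())
    max_over_median = w.max() / np.median(w)
    k = max(int(tail_frac * len(w)), 2)
    top = np.sort(w)[-k:]                        # k largest weights
    hill = np.mean(np.log(top[1:] / top[0]))     # mean excess log over threshold
    tail_index = 1.0 / hill if hill > 0 else np.inf
    return {"ess_frac": ess_frac, "max/median": max_over_median, "tail_index": tail_index}
```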

Theoretical Guarantees

1. Projection theorem (convex analysis)

For any closed convex set $\mathcal{C}$ in a Hilbert space, the projection $P_{\mathcal{C}}(f)$ exists, is unique, and satisfies:

$$\langle f - P_{\mathcal{C}}(f),\, g - P_{\mathcal{C}}(f) \rangle \leq 0 \quad \forall g \in \mathcal{C}$$

The residual $f - P_{\mathcal{C}}(f)$ makes a non-acute angle with every direction into the constraint set, and is exactly orthogonal when $\mathcal{C}$ is a closed subspace. This is the Pythagorean inequality in Hilbert space: for cones containing the origin, projection weakly reduces norm, while the explicit mean constraints keep the estimate on the right scale.

2. Monotone projection bounds variance

For isotonic regression on $n$ observations, the degrees of freedom $\text{df}$ satisfies $1 \leq \text{df} \leq n$, where $\text{df}$ is the number of constant blocks in the fitted function. For smooth monotone signals, the fitted isotonic has $O_p(n^{1/3})$ constant pieces and risk $O(n^{-2/3})$,[7,8] far fewer degrees of freedom than unconstrained fits, delivering substantial variance reduction.
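
A quick empirical check of the adaptive degrees of freedom, counting the constant blocks of an isotonic fit to a smooth monotone toy signal:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
for n in (100, 1_000, 10_000):
    S = np.sort(rng.uniform(0, 1, n))
    Y = S ** 2 + rng.normal(0, 0.1, n)               # smooth monotone signal + noise
    fit = IsotonicRegression().fit(S, Y).predict(S)
    blocks = 1 + int(np.sum(np.diff(fit) > 1e-12))   # number of constant blocks
    print(f"n={n:6d}  blocks={blocks:4d}  n^(1/3)≈{n ** (1/3):.0f}")
```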

3. Dispersion reduction (SIMCal-W)

The mean-one isotonic projection reduces $L^2$ dispersion (ESS↑) and typically improves tail metrics (max/median, tail index),[9,11,13] though strict Lorenz dominance is not guaranteed without additional conditions. For repeated refits (OUA jackknife), isotonic's O(n) complexity after sorting keeps total runtime modest.

Connection to Other Methods

| Method | Constraint Set | DbP Perspective |
|---|---|---|
| Isotonic regression | Monotone functions | Project onto monotone cone |
| Platt scaling | Logistic link functions | Parametric constrained fit; not a convex projection |
| Lasso | Sparse coefficients ($\lVert \beta \rVert_1 \leq t$) | Project onto $\ell^1$ ball |
| Ridge regression | Small coefficients ($\lVert \beta \rVert_2 \leq t$) | Project onto $\ell^2$ ball |
| Constrained MLE | Valid probability distributions | Bregman projection (KL) onto simplex |
| Survey calibration | Weights match moment constraints | Bregman projection minimizing divergence from design weights |

Many classical statistical methods can be viewed as projections onto constraint sets. DbP makes this perspective explicit and extensible: define your constraints (monotonicity, sparsity, smoothness, bounds), construct the convex set, and project.

Beyond Euclidean projection: Many calibration problems are more natural in a Bregman divergence (e.g., KL for probabilities). DbP extends beyond $L^2$: raking/calibration estimators in survey sampling (e.g., Deville–Särndal) are Bregman projections[17,18,19,20] that match moments while staying close to the starting weights, conceptually adjacent to SIMCal-W. DbP is just constrained empirical risk minimization viewed through the lens of projections onto convex sets; the lens is useful because it yields general variance-reduction and stability intuitions.
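
For concreteness, here is a sketch of a KL (raking) calibration in the Deville–Särndal spirit, solved by an undamped Newton iteration; the function name and example setup are illustrative, and a production solver would add damping and convergence checks:

```python
import numpy as np

def rake_weights(d, X, targets, iters=50):
    """Bregman (KL) projection of design weights d onto the moment constraints
    X.T @ w = targets: the solution has the exponential-tilt form
    w_i = d_i * exp(x_i @ lam), with lam found by Newton's method."""
    lam = np.zeros(X.shape[1])
    for _ in range(iters):
        w = d * np.exp(X @ lam)
        gap = X.T @ w - targets            # how far the weighted moments are off
        hess = (X * w[:, None]).T @ X      # Jacobian of the gap in lam
        lam -= np.linalg.solve(hess, gap)  # Newton step
    return d * np.exp(X @ lam)

# Example: calibrate uniform design weights to a total of 100 and a weighted
# sum of 60 for one covariate (the column of ones calibrates the total).
d = np.ones(50)
X = np.column_stack([np.ones(50), np.linspace(0, 1, 50)])
w = rake_weights(d, X, targets=np.array([100.0, 60.0]))
print(X.T @ w)   # ≈ [100., 60.]
```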

Implementation in CJE

Design-by-Projection is implemented in the CJE package via:

  • AutoCal-R: cje.calibration.AutoCal for reward calibration
  • SIMCal-W: cje.calibration.SIMCal for weight stabilization

# AutoCal-R: calibrate judge scores to oracle outcomes
from cje.calibration import AutoCal

calibrator = AutoCal(mode='monotone')
f_calibrated = calibrator.fit(judge_scores, oracle_labels)
calibrated_predictions = f_calibrated(new_judge_scores)

# SIMCal-W: stabilize importance weights
from cje.calibration import SIMCal
import numpy as np

def ess(w):
    # effective sample size as a fraction of n: (sum w)^2 / (n * sum w^2)
    w = np.asarray(w)
    return w.sum() ** 2 / (len(w) * (w ** 2).sum())

weights_raw = target_probs / logger_probs
weights_calibrated = SIMCal().fit(judge_scores, weights_raw)
print(f"ESS before: {ess(weights_raw):.1%}, after: {ess(weights_calibrated):.1%}")

Choosing calibration methods

  • AutoCal-R (monotone): Default for most cases. Minimal assumptions, works with small samples (5-25% oracle coverage).
  • AutoCal-R (two-stage): When you have covariates that create non-monotone bias (response length, prompt difficulty).
  • SIMCal-W: For off-policy estimators (IPS, DR) when raw importance weights have low ESS (< 10-20%).

Inference

When DbP is learned from a partial oracle slice, we include OUA (Oracle Uncertainty Accounting)—delete-one-fold jackknife over oracle folds[22,23,24]—to account for calibrator learning variance in standard errors. This ensures that confidence intervals reflect both sampling uncertainty and the uncertainty from estimating the calibration function ff.
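
A sketch of the delete-one-fold idea; `fit_calibrator` and `estimate` below are hypothetical stand-ins (refit the calibrator on the retained oracle folds, then re-run the policy-value estimate), not CJE's actual API:

```python
import numpy as np

def oua_jackknife_var(oracle_folds, fit_calibrator, estimate):
    """Delete one oracle fold at a time, refit the calibrator, re-estimate,
    and convert the spread of the K leave-one-fold-out estimates into an
    additive variance term for calibrator-learning uncertainty."""
    K = len(oracle_folds)
    theta = np.array([
        estimate(fit_calibrator([f for j, f in enumerate(oracle_folds) if j != k]))
        for k in range(K)
    ])
    return (K - 1) / K * np.sum((theta - theta.mean()) ** 2)

# Total SE then combines sampling and calibrator uncertainty, e.g.:
# se_total = np.sqrt(se_sampling**2 + oua_jackknife_var(folds, fit_cal, est))
```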

When Design-by-Projection Works Best

Ideal scenarios

  • You have strong structural knowledge (monotonicity, bounds, sparsity) that's unlikely to be violated
  • Sample size is moderate (100-10,000 observations) where unconstrained methods overfit but parametric methods misspecify
  • You need interpretable output (e.g., to audit or explain calibration to stakeholders)
  • Oracle labels are expensive, so you want maximum efficiency from 5-25% coverage

When to consider alternatives

  • Abundant data (n > 10,000) + known parametric form: Use parametric calibration (Platt scaling, Beta calibration) for lower variance
  • Structural assumptions violated: If monotonicity fails in reality, isotonic regression will impose it anyway. Test on holdout data.
  • Very small samples (n < 50): Consider Bayesian methods with informative priors instead of projection

Caution: Monotonicity violations

If monotonicity fails materially (e.g., adversarial judge artifacts), DbP will enforce it anyway. Use holdout residuals by policy/domain/length to detect such failures, and either expand $g(S, X)$ to include the violating covariates or switch to a richer judge.
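
One way to run that holdout check, assuming a pandas DataFrame of labeled holdout rows; the column names and grouping variables are illustrative:

```python
import pandas as pd

def residual_report(df, calibrator, group_cols=("policy", "domain", "length_bin")):
    """Mean calibration residual by subgroup on a labeled holdout slice.
    Large, systematic residuals within a subgroup suggest monotonicity in the
    risk index is being forced where it does not actually hold."""
    out = df.copy()
    out["residual"] = out["oracle_label"] - calibrator(out["judge_score"].to_numpy())
    return (out.groupby(list(group_cols))["residual"]
               .agg(["mean", "std", "count"])
               .sort_values("mean"))
```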

Related Work and Extensions

Design-by-Projection builds on several classical results:

  • Isotonic regression: Barlow et al. (1972), Ayer et al. (1955) - foundational work on monotone regression
  • Shape-constrained estimation: Groeneboom & Jongbloed (2014) - comprehensive treatment of convex, concave, and monotone constraints
  • Survey calibration: Deville & Särndal (1992) - calibration estimators that adjust weights to match constraints while minimizing divergence
  • Calibration for inverse propensity weighting: van der Laan et al. (2025) - isotonic calibration for stabilizing IPW estimators (CLeaR 2025)

Within Causal Learning and Reasoning (CLeaR) / causal-learning circles, DbP fits the broader program of shape-constrained, structure-aware learning that trades small bias for large variance reductions with explicit guarantees.

Extensions under development: Multi-dimensional monotonicity (partial orders), shape constraints beyond monotonicity (convexity, unimodality), adaptive constraint selection via cross-validation.

Conclusion

Design-by-Projection provides a principled framework for incorporating structural knowledge into estimation. By projecting onto convex constraint sets, you get:

  • Automatic variance reduction without sacrificing unbiasedness (when constraints are correct)
  • Interpretable output that respects domain knowledge
  • Unified treatment of reward calibration (AutoCal-R) and weight stabilization (SIMCal-W)
  • Computational efficiency via fast projection algorithms (O(n) after sorting for isotonic regression)

For LLM evaluation, where oracle labels are expensive and judge scores are plentiful, DbP's efficiency with small oracle samples (5-25% coverage) makes it particularly valuable. The framework scales from quick prototypes (monotone mode) to production systems (two-stage with covariates).

Practical takeaway: Before fitting unconstrained or rigidly parametric models, ask: "What do I know must be true?" Encode that knowledge as constraints, project onto them, and let the projection theorem do the work.

References

[1] Bauschke, H. H., & Combettes, P. L. (2011/2017). Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer.
[2] Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
[3] Barlow, R. E., Bartholomew, D. J., Bremner, J. M., & Brunk, H. D. (1972). Statistical Inference under Order Restrictions. Wiley.
[4] Robertson, T., Wright, F. T., & Dykstra, R. L. (1988). Order Restricted Statistical Inference. Wiley.
[5] Ayer, M., Brunk, H. D., Ewing, G. M., Reid, W. T., & Silverman, E. (1955). An empirical distribution function for sampling with incomplete information. Ann. Math. Stat.
[6] Best, M. J., & Chakravarti, N. (1990). Active set algorithms for isotonic regression: A unifying framework. Math. Programming.
[7] Zhang, C.-H. (2002). Risk bounds in isotonic regression. Ann. Stat.
[8] Groeneboom, P., & Jongbloed, G. (2014). Nonparametric Estimation under Shape Constraints. Cambridge.
[9] Hesterberg, T. (1995). Weighted average importance sampling and defensive mixtures. Technometrics.
[10] Cole, S. R., & Hernán, M. A. (2008). Constructing inverse probability weights for marginal structural models. AJE.
[11] Owen, A. B. (2013). Monte Carlo Theory, Methods and Examples.
[12] Kish, L. (1965). Survey Sampling. Wiley.
[13] van der Laan, L., Lin, Z., Carone, M., & Luedtke, A. (2024/2025). Stabilized Inverse Probability Weighting via Isotonic Calibration. CLeaR.
[14] van der Laan, M. J., Polley, E., & Hubbard, A. (2007). Super Learner. Stat. Appl. Genet. Mol. Biol.
[15] Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/Debiased ML. Econometrics J.
[16] Dudík, M., Langford, J., & Li, L. (2011). Doubly robust policy evaluation and learning. ICML.
[17] Csiszár, I. (1975). I‑divergence geometry of probability distributions and minimization problems. Ann. Prob.
[18] Banerjee, A., Merugu, S., Dhillon, I. S., & Ghosh, J. (2005). Clustering with Bregman divergences. JMLR.
[19] Deville, J.-C., & Särndal, C.-E. (1992). Calibration estimators in survey sampling. JASA.
[20] Deming, W. E., & Stephan, F. F. (1940). On the adjustment of a contingency table to given marginal totals. Ann. Math. Stat.
[21] Bachmann, G., & Nagarajan, V. (2024). The pitfalls of next‑token prediction. ICML / arXiv.
[22] Quenouille, M. H. (1949). Approximate tests of correlation in time‑series. J. Roy. Stat. Soc. (Series B).
[23] Tukey, J. W. (1958). Bias and confidence in not‑quite large samples. Ann. Math. Stat.
[24] Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans. SIAM.

Cite this work

APA

Eddie Landesberg. (2025, October 10). Design-by-Projection: A General Principle for Structure-Aware Estimation. CIMO Labs Blog. https://cimolabs.com/blog/design-by-projection

BibTeX

@misc{landesberg2025design-by-projection,
  author = {Eddie Landesberg},
  title = {Design-by-Projection: A General Principle for Structure-Aware Estimation},
  howpublished = {\url{https://cimolabs.com/blog/design-by-projection}},
  year = {2025},
  note = {CIMO Labs Blog}
}