
Design-by-Projection: A General Principle for Structure-Aware Estimation

Eddie Landesberg · 12 min read

The principle: When you know structural properties of your problem (outcomes increase with judge scores, importance weights should average to 1), but don't know the exact functional form, project your empirical data onto the set of functions satisfying those constraints. This preserves unbiasedness while reducing variance.

The framework: Design-by-Projection (DbP) unifies AutoCal-R (reward calibration via isotonic regression) and SIMCal-W (weight stabilization via isotonic projection) under a single principle. Both methods project onto convex constraint sets, but AutoCal-R operates on judge scores → oracle outcomes, while SIMCal-W operates on importance weights.

Rather than estimating unconstrained functions that overfit small samples or imposing rigid parametric forms (linear, logistic) that misspecify, Design-by-Projection finds the closest function (in least-squares sense) that satisfies what you know must be true. The result: automatic variance reduction, bias-variance trade-offs that favor small oracle samples, and interpretable output.

In Arena: DbP instantiates as AutoCal-R (reward calibration with a mean constraint and covariates) and SIMCal-W (mean-one monotone weight calibration). Together they explain why Direct and DR work well—and why IPS alone fails under overlap scarcity.

DbP Assumptions at a glance

AutoCal-R: Monotonicity of $Y$ in a judge-based risk index $T = g(S, X)$; mean-preservation enforced on the oracle slice.

SIMCal-W: Stabilized weights are a monotone function of $S$ (or $T$), nonnegative, unit-mean; projection cannot create overlap, so diagnose ESS and tails.

The Problem: Balancing Flexibility and Structure

Suppose you're calibrating an LLM judge to predict oracle outcomes. You have:

  • $n$ observations: judge scores $S_i$ and oracle labels $Y_i$
  • Goal: Learn $f: S \to Y$ to predict oracle outcomes from judge scores
  • Known constraint: Higher judge scores should predict no worse oracle outcomes (monotonicity)

The Goldilocks problem

Too flexible (unconstrained regression): Learns arbitrary non-monotone wiggles, overfits noise in small samples, produces calibrated predictions that invert judge rankings.

Too rigid (linear regression): Forces $f(S) = \alpha + \beta S$, which misspecifies the relationship when it is actually nonlinear (e.g., saturation at high scores, floor effects at low scores).

Just right (isotonic regression): Flexible enough to capture nonlinearity, constrained enough to avoid overfitting. Learns a piecewise-constant monotone function that fits the data while respecting known structure.
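
As a toy illustration of the contrast, here is a sketch using scikit-learn; the saturating "true" relationship and noise level are made up for illustration:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
S = rng.uniform(0, 10, n)                                 # judge scores
Y = 1 / (1 + np.exp(-(S - 5))) + rng.normal(0, 0.1, n)    # saturating truth + noise

# Too rigid: a straight line forced through a saturating relationship
linear = LinearRegression().fit(S.reshape(-1, 1), Y)

# Just right: piecewise-constant, monotone, data-driven shape
iso = IsotonicRegression(out_of_bounds="clip").fit(S, Y)

grid = np.linspace(0, 10, 6)
print("linear  :", linear.predict(grid.reshape(-1, 1)).round(2))
print("isotonic:", iso.predict(grid).round(2))
```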

The Design-by-Projection Principle

Core idea: Encode what you know (or assume) as a convex constraint set $\mathcal{C}$, then project your empirical estimate onto $\mathcal{C}$.

Projection formula

$$\hat{f}_{\text{DbP}} = \text{argmin}_{g \in \mathcal{C}} \| \hat{f}_{\text{unc}} - g \|^2$$

Find the function $g$ in the constraint set $\mathcal{C}$ that is closest (in $L^2$ norm) to the unconstrained empirical estimate $\hat{f}_{\text{unc}}$.

Why does this work?

When $\mathcal{C}$ is convex and contains the true function, projection has three key properties:

  1. Bias–variance trade-off: Projection onto a correct constraint set is a contraction that typically reduces variance and can reduce MSE; it does not generally preserve finite-sample unbiasedness. For reward calibration we recover the right mean by explicit mean-preservation; for weight calibration we enforce unit-mean weights.
  2. Variance reduction: Projection is a smoothing operation. By ruling out functions that violate known constraints, you reduce the effective degrees of freedom, lowering variance. For cones that contain the origin (e.g., the monotone cone), projection weakly reduces the norm. For general convex sets, projection minimizes distance to $\mathcal{C}$, not necessarily the norm.
  3. Interpretability: The output respects structural knowledge (monotonicity, mean preservation, boundedness), making results easier to validate and debug. You can't get perverse predictions that violate domain knowledge.

Projection in Hilbert spaces (intuition)

For a closed convex set $\mathcal{C}$, the metric projection $P_{\mathcal{C}}(f)$ is unique and is characterized by the variational inequality $\langle f - P_{\mathcal{C}}(f),\, g - P_{\mathcal{C}}(f) \rangle \leq 0$ for all $g \in \mathcal{C}$. For cones containing 0, projection weakly reduces norm; in general it minimizes distance to $\mathcal{C}$. This geometric condition explains why imposing structure reduces variance without overfitting.[1,2]
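
A quick numerical check of this characterization; the constraint set here is the box $[0,1]^d$, an arbitrary convex set chosen because its projection is simply coordinate-wise clipping:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
f = rng.normal(0, 2, d)            # a point, generally outside the set
Pf = np.clip(f, 0.0, 1.0)          # metric projection onto C = [0, 1]^d

# Variational inequality: <f - Pf, g - Pf> <= 0 for every g in C
worst = max(np.dot(f - Pf, rng.uniform(0, 1, d) - Pf) for _ in range(1000))
print(f"largest inner product over 1000 random g in C: {worst:.3e} (should be <= 0)")
```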

Application 1: AutoCal-R (Reward Calibration)

Problem: LLM judge scores $S$ are on an arbitrary scale. You need to map them to oracle outcomes $Y$ for downstream estimation.

Constraint set: $\mathcal{C}_{\text{mono}} = \{f: S \to \mathbb{R} \mid f \text{ is non-decreasing}\}$. Monotonicity is the minimal assumption: better judge scores shouldn't predict worse outcomes.

Monotone mode

Directly project judge scores to oracle outcomes via isotonic regression:

$$\hat{f}(S) = \text{argmin}_{g \in \mathcal{C}_{\text{mono}}} \sum_{i=1}^n (Y_i - g(S_i))^2$$

This is isotonic regression: least-squares fit subject to monotonicity.[3,4] The solution is a piecewise-constant function computed efficiently via the Pool Adjacent Violators (PAV) algorithm in O(n) time on sorted scores (or O(n log n) including the initial sort).[5,6]
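
For intuition, here is a minimal PAV sketch with unit weights, assuming the labels have already been sorted by judge score; in practice one would use an optimized implementation such as sklearn.isotonic.IsotonicRegression:

```python
import numpy as np

def pav(y):
    """Pool Adjacent Violators: non-decreasing least-squares fit to y
    (y assumed already ordered by judge score). Returns the fitted values."""
    levels, sizes = [], []                       # block means and block sizes
    for v in y:
        levels.append(float(v)); sizes.append(1)
        # merge adjacent blocks while monotonicity is violated
        while len(levels) > 1 and levels[-2] > levels[-1]:
            total = sizes[-2] + sizes[-1]
            merged = (levels[-2] * sizes[-2] + levels[-1] * sizes[-1]) / total
            levels[-2:], sizes[-2:] = [merged], [total]
    return np.repeat(levels, sizes)

# Labels sorted by judge score; adjacent violators get pooled into blocks
print(pav([0.2, 0.1, 0.6, 0.5, 0.9]))   # -> [0.15 0.15 0.55 0.55 0.9 ]
```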

Two-stage mode (with covariates)

When judge scores have systematic bias (e.g., response length affects scores independent of quality):

  1. Stage 1 (risk index): Learn $T = g(S, X)$ (e.g., a spline on $(S, \text{response\_length})$)
  2. Stage 2 (isotonic): Fit $\hat{f}(T)$ by isotonic regression of $Y$ on $T$, then apply a constant shift so that the oracle-slice mean of $\hat{f}$ matches that of $Y$

This corrects systematic judge bias (e.g., verbosity preference) while retaining monotonicity in the risk index $T$.
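
A sketch of the two-stage mode under illustrative assumptions: the risk index is a boosted-tree regression of the oracle label on $(S, X)$ fit on the labeled slice (the CJE implementation may differ), and outcomes are assumed to live in $[0, 1]$:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.isotonic import IsotonicRegression

def fit_two_stage(S, X, Y):
    """Stage 1: risk index T = g(S, X). Stage 2: isotonic regression of Y on T,
    plus a constant shift so the calibrated mean matches the oracle-slice mean."""
    Z = np.column_stack([S, X])
    g = GradientBoostingRegressor(random_state=0).fit(Z, Y)      # stage 1
    T = g.predict(Z)
    iso = IsotonicRegression(out_of_bounds="clip").fit(T, Y)     # stage 2
    shift = Y.mean() - iso.predict(T).mean()                     # mean preservation

    def f_hat(S_new, X_new):
        T_new = g.predict(np.column_stack([S_new, X_new]))
        return np.clip(iso.predict(T_new) + shift, 0.0, 1.0)     # clip to [0, 1]

    return f_hat
```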

Why isotonic regression?

  • Mean preservation (how we enforce it): Vanilla isotonic is an $L^2$ projection onto the monotone cone and does not by itself match the oracle mean.[3,4] In AutoCal-R we enforce mean preservation via a constant shift: $\hat{f}_{\text{mp}}(s) = \hat{f}_{\text{iso}}(s) + \big(\overline{Y} - \overline{\hat{f}_{\text{iso}}(S)}\big)$, which preserves monotonicity and puts the calibrator on the oracle scale. (Clip to [0,1] if needed.)
  • Minimal assumptions: Only requires monotonicity, not linearity or parametric form
  • Small-sample efficiency: Works with 5-25% oracle coverage (50-1250 labels)
  • Adaptive complexity: For isotonic regression, the degrees of freedom equals the number of constant blocks in the fit and adapts to signal complexity;[4] in practice it is far smaller than $n$, yielding substantial variance reduction.

Application 2: SIMCal-W (Weight Stabilization)

Problem: Off-policy importance weights $w_i = \pi'(A_i|X_i) / \pi_0(A_i|X_i)$ are often extreme, leading to high variance and poor effective sample size (ESS).[11,12]

Constraint set: $\mathcal{C}_{\text{cal}} = \{h: S \to \mathbb{R}_+ \mid h \text{ is monotone in } T = g(S,X),\, \mathbb{E}_{\pi_0}[h(T)] = 1\}$, i.e., unit mean under the logger.[9,10] Calibrated weights should be nonnegative, monotone in a risk index, and preserve unbiasedness.

The stacked isotonic projection

SIMCal-W builds two candidate weight functions:

  1. Increasing candidate: Isotonic regression of $w$ on $S$ (higher scores → higher weights)
  2. Decreasing candidate: Antitonic regression of $w$ on $S$ (higher scores → lower weights; isotonic under the reversed order)

After smoothing, we enforce nonnegativity and unit mean: rescale $\tilde{w} = \hat{w}_{\text{iso}} / \overline{\hat{w}_{\text{iso}}}$.[9,13] Stacking uses cross-fitted out-of-fold influence functions to tune $\lambda$ by minimizing estimated variance:[14,15]

$$\hat{w}_{\text{SIMCal}} = \lambda \cdot \hat{w}_{\text{inc}} + (1-\lambda) \cdot \hat{w}_{\text{dec}}, \quad \lambda^* = \text{argmin}_{\lambda \in [0,1]} \text{Var}_{\text{IF}}(\lambda)$$
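
A simplified sketch of the stacking step, assuming plain IPS influence functions evaluated in-sample; the actual SIMCal-W uses cross-fitted, out-of-fold influence functions when choosing $\lambda$:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def stack_weights(S, w, R):
    """S: judge scores, w: raw importance weights, R: calibrated rewards.
    Fit increasing and decreasing isotonic candidates, rescale each to unit
    mean, then pick the mixture that minimizes the IPS influence-function
    variance (here approximated by the variance of w_cal * R)."""
    inc = IsotonicRegression(increasing=True).fit(S, w).predict(S)
    dec = IsotonicRegression(increasing=False).fit(S, w).predict(S)
    inc, dec = inc / inc.mean(), dec / dec.mean()      # nonnegative, unit mean

    lambdas = np.linspace(0.0, 1.0, 101)
    var_if = [np.var((lam * inc + (1 - lam) * dec) * R) for lam in lambdas]
    lam_star = lambdas[int(np.argmin(var_if))]
    return lam_star * inc + (1 - lam_star) * dec, lam_star
```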

Why stacking?

By considering both directions (increasing and decreasing), SIMCal-W avoids having to assert which direction the monotone relationship should go. The data tells you: if increasing weights better stabilize the estimate, λ → 1; if decreasing weights are better, λ → 0. This makes the method robust to misspecification of the monotone direction.

No new overlap: Stabilization prevents numerical degeneracy but cannot create support where the logger has none.[16] Always report ESS, max/median weight, and a tail index before/after smoothing.[11,12] In LLM OPE, raw $w$ often come from teacher-forced sequence likelihoods; these can be noisy or structurally misspecified.[21] Stabilization helps variance, but cannot fix overlap or propensity misspecification.
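
A sketch of these diagnostics; the Hill-style tail-index estimate below is one simple choice among several, and the function name is ours, not CJE's:

```python
import numpy as np

def weight_diagnostics(w, tail_frac=0.05):
    """ESS fraction, max/median weight, and a rough Hill tail-index estimate
    (assumes strictly positive weights)."""
    w = np.asarray(w, dtype=float)
    ess_frac = w.sum() ** 2 / (len(w) * (w ** 2).sum())
    max_over_median = w.max() / np.median(w)
    k = max(int(tail_frac * len(w)), 2)
    top = np.sort(w)[-k:]                        # k largest weights
    hill = np.mean(np.log(top[1:] / top[0]))     # mean excess log over threshold
    tail_index = 1.0 / hill if hill > 0 else np.inf
    return {"ess_frac": ess_frac, "max/median": max_over_median, "tail_index": tail_index}
```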

Theoretical Guarantees

1. Projection theorem (convex analysis)

For any closed convex set $\mathcal{C}$ in a Hilbert space, the projection $P_{\mathcal{C}}(f)$ exists, is unique, and satisfies:

$$\langle f - P_{\mathcal{C}}(f),\, g - P_{\mathcal{C}}(f) \rangle \leq 0 \quad \forall g \in \mathcal{C}$$

The residual $f - P_{\mathcal{C}}(f)$ makes a non-acute angle with every direction into the constraint set, and is exactly orthogonal when $\mathcal{C}$ is a closed subspace. This is the Pythagorean inequality in Hilbert space: for cones containing the origin, projection weakly reduces norm, while the explicit mean constraints keep the estimate on the right scale.

2. Monotone projection bounds variance

For isotonic regression on $n$ observations, the degrees of freedom $\text{df}$ satisfies $1 \leq \text{df} \leq n$, where $\text{df}$ is the number of constant blocks in the fitted function. For smooth monotone signals, the fitted isotonic has $O_p(n^{1/3})$ constant pieces and risk $O(n^{-2/3})$,[7,8] far fewer degrees of freedom than unconstrained fits, delivering substantial variance reduction.
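
A quick empirical check of the adaptive degrees of freedom, counting the constant blocks of an isotonic fit to a smooth monotone toy signal:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
for n in (100, 1_000, 10_000):
    S = np.sort(rng.uniform(0, 1, n))
    Y = S ** 2 + rng.normal(0, 0.1, n)               # smooth monotone signal + noise
    fit = IsotonicRegression().fit(S, Y).predict(S)
    blocks = 1 + int(np.sum(np.diff(fit) > 1e-12))   # number of constant blocks
    print(f"n={n:6d}  blocks={blocks:4d}  n^(1/3)≈{n ** (1/3):.0f}")
```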

3. Dispersion reduction (SIMCal-W)

The mean-one isotonic projection reduces $L^2$ dispersion (ESS↑) and typically improves tail metrics (max/median, tail index),[9,11,13] though strict Lorenz dominance is not guaranteed without additional conditions. For repeated refits (OUA jackknife), isotonic's O(n) complexity after sorting keeps total runtime modest.

Connection to Other Methods

| Method | Constraint Set | DbP Perspective |
|---|---|---|
| Isotonic regression | Monotone functions | Project onto monotone cone |
| Platt scaling | Logistic link functions | Parametric constrained fit; not a convex projection |
| Lasso | Sparse coefficients ($\lVert \beta \rVert_1 \leq t$) | Project onto $\ell^1$ ball |
| Ridge regression | Small coefficients ($\lVert \beta \rVert_2 \leq t$) | Project onto $\ell^2$ ball |
| Constrained MLE | Valid probability distributions | Bregman projection (KL) onto simplex |
| Survey calibration | Weights match moment constraints | Bregman projection minimizing divergence from design weights |

Many classical statistical methods can be viewed as projections onto constraint sets. DbP makes this perspective explicit and extensible: define your constraints (monotonicity, sparsity, smoothness, bounds), construct the convex set, and project.

Beyond Euclidean projection: Many calibration problems are more natural in a Bregman divergence (e.g., KL for probabilities). DbP extends beyond $L^2$: raking/calibration estimators in survey sampling (e.g., Deville–Särndal) are Bregman projections[17,18,19,20] that match moments while staying close to the starting weights, conceptually adjacent to SIMCal-W. DbP is just constrained empirical risk minimization viewed through the lens of projections onto convex sets; the lens is useful because it yields general variance-reduction and stability intuitions.
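
For concreteness, here is a sketch of a KL (raking) calibration in the Deville–Särndal spirit, solved by an undamped Newton iteration; the function name and example setup are illustrative, and a production solver would add damping and convergence checks:

```python
import numpy as np

def rake_weights(d, X, targets, iters=50):
    """Bregman (KL) projection of design weights d onto the moment constraints
    X.T @ w = targets: the solution has the exponential-tilt form
    w_i = d_i * exp(x_i @ lam), with lam found by Newton's method."""
    lam = np.zeros(X.shape[1])
    for _ in range(iters):
        w = d * np.exp(X @ lam)
        gap = X.T @ w - targets            # how far the weighted moments are off
        hess = (X * w[:, None]).T @ X      # Jacobian of the gap in lam
        lam -= np.linalg.solve(hess, gap)  # Newton step
    return d * np.exp(X @ lam)

# Example: calibrate uniform design weights to a total of 100 and a weighted
# sum of 60 for one covariate (the column of ones calibrates the total).
d = np.ones(50)
X = np.column_stack([np.ones(50), np.linspace(0, 1, 50)])
w = rake_weights(d, X, targets=np.array([100.0, 60.0]))
print(X.T @ w)   # ≈ [100., 60.]
```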

Implementation in CJE

Design-by-Projection is implemented in the CJE package via:

  • AutoCal-R: cje.calibration.AutoCal for reward calibration
  • SIMCal-W: cje.calibration.SIMCal for weight stabilization

# AutoCal-R: calibrate judge scores to oracle outcomes
from cje.calibration import AutoCal

calibrator = AutoCal(mode='monotone')
f_calibrated = calibrator.fit(judge_scores, oracle_labels)
calibrated_predictions = f_calibrated(new_judge_scores)

# SIMCal-W: stabilize importance weights
from cje.calibration import SIMCal
import numpy as np

def ess(w):
    # effective sample size as a fraction of n: (sum w)^2 / (n * sum w^2)
    w = np.asarray(w)
    return w.sum() ** 2 / (len(w) * (w ** 2).sum())

weights_raw = target_probs / logger_probs
weights_calibrated = SIMCal().fit(judge_scores, weights_raw)
print(f"ESS before: {ess(weights_raw):.1%}, after: {ess(weights_calibrated):.1%}")

Choosing calibration methods

  • AutoCal-R (monotone): Default for most cases. Minimal assumptions, works with small samples (5-25% oracle coverage).
  • AutoCal-R (two-stage): When you have covariates that create non-monotone bias (response length, prompt difficulty).
  • SIMCal-W: For off-policy estimators (IPS, DR) when raw importance weights have low ESS (< 10-20%).

Inference

When DbP is learned from a partial oracle slice, we include OUA (Oracle Uncertainty Accounting)—delete-one-fold jackknife over oracle folds[22,23,24]—to account for calibrator learning variance in standard errors. This ensures that confidence intervals reflect both sampling uncertainty and the uncertainty from estimating the calibration function ff.
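
A sketch of the delete-one-fold idea; `fit_calibrator` and `estimate` below are hypothetical stand-ins (refit the calibrator on the retained oracle folds, then re-run the policy-value estimate), not CJE's actual API:

```python
import numpy as np

def oua_jackknife_var(oracle_folds, fit_calibrator, estimate):
    """Delete one oracle fold at a time, refit the calibrator, re-estimate,
    and convert the spread of the K leave-one-fold-out estimates into an
    additive variance term for calibrator-learning uncertainty."""
    K = len(oracle_folds)
    theta = np.array([
        estimate(fit_calibrator([f for j, f in enumerate(oracle_folds) if j != k]))
        for k in range(K)
    ])
    return (K - 1) / K * np.sum((theta - theta.mean()) ** 2)

# Total SE then combines sampling and calibrator uncertainty, e.g.:
# se_total = np.sqrt(se_sampling**2 + oua_jackknife_var(folds, fit_cal, est))
```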

When Design-by-Projection Works Best

Ideal scenarios

  • You have strong structural knowledge (monotonicity, bounds, sparsity) that's unlikely to be violated
  • Sample size is moderate (100-10,000 observations) where unconstrained methods overfit but parametric methods misspecify
  • You need interpretable output (e.g., to audit or explain calibration to stakeholders)
  • Oracle labels are expensive, so you want maximum efficiency from 5-25% coverage

When to consider alternatives

  • Abundant data (n > 10,000) + known parametric form: Use parametric calibration (Platt scaling, Beta calibration) for lower variance
  • Structural assumptions violated: If monotonicity fails in reality, isotonic regression will impose it anyway. Test on holdout data.
  • Very small samples (n < 50): Consider Bayesian methods with informative priors instead of projection

Caution: Monotonicity violations

If monotonicity fails materially (e.g., adversarial judge artifacts), DbP will enforce it anyway. Use holdout residuals by policy/domain/length to detect such failures, and either expand $g(S, X)$ to include the violating covariates or switch to a richer judge.
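
One way to run that holdout check, assuming a pandas DataFrame of labeled holdout rows; the column names and grouping variables are illustrative:

```python
import pandas as pd

def residual_report(df, calibrator, group_cols=("policy", "domain", "length_bin")):
    """Mean calibration residual by subgroup on a labeled holdout slice.
    Large, systematic residuals within a subgroup suggest monotonicity in the
    risk index is being forced where it does not actually hold."""
    out = df.copy()
    out["residual"] = out["oracle_label"] - calibrator(out["judge_score"].to_numpy())
    return (out.groupby(list(group_cols))["residual"]
               .agg(["mean", "std", "count"])
               .sort_values("mean"))
```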

Related Work and Extensions

Design-by-Projection builds on several classical results:

  • Isotonic regression: Barlow et al. (1972), Ayer et al. (1955) - foundational work on monotone regression
  • Shape-constrained estimation: Groeneboom & Jongbloed (2014) - comprehensive treatment of convex, concave, and monotone constraints
  • Survey calibration: Deville & Särndal (1992) - calibration estimators that adjust weights to match constraints while minimizing divergence
  • Calibration for inverse propensity weighting: van der Laan et al. (2025) - isotonic calibration for stabilizing IPW estimators (CLeaR 2025)

Within Causal Learning and Reasoning (CLeaR) / causal-learning circles, DbP fits the broader program of shape-constrained, structure-aware learning that trades small bias for large variance reductions with explicit guarantees.

Extensions under development: Multi-dimensional monotonicity (partial orders), shape constraints beyond monotonicity (convexity, unimodality), adaptive constraint selection via cross-validation.

Conclusion

Design-by-Projection provides a principled framework for incorporating structural knowledge into estimation. By projecting onto convex constraint sets, you get:

  • Automatic variance reduction without sacrificing unbiasedness (when constraints are correct)
  • Interpretable output that respects domain knowledge
  • Unified treatment of reward calibration (AutoCal-R) and weight stabilization (SIMCal-W)
  • Computational efficiency via fast projection algorithms (O(n) after sorting for isotonic regression)

For LLM evaluation, where oracle labels are expensive and judge scores are plentiful, DbP's efficiency with small oracle samples (5-25% coverage) makes it particularly valuable. The framework scales from quick prototypes (monotone mode) to production systems (two-stage with covariates).

Practical takeaway: Before fitting unconstrained or rigidly parametric models, ask: "What do I know must be true?" Encode that knowledge as constraints, project onto them, and let the projection theorem do the work.

References

[1] Bauschke, H. H., & Combettes, P. L. (2011/2017). Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer.
[2] Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
[3] Barlow, R. E., Bartholomew, D. J., Bremner, J. M., & Brunk, H. D. (1972). Statistical Inference under Order Restrictions. Wiley.
[4] Robertson, T., Wright, F. T., & Dykstra, R. L. (1988). Order Restricted Statistical Inference. Wiley.
[5] Ayer, M., Brunk, H. D., Ewing, G. M., Reid, W. T., & Silverman, E. (1955). An empirical distribution function for sampling with incomplete information. Ann. Math. Stat.
[6] Best, M. J., & Chakravarti, N. (1990). Active set algorithms for isotonic regression: A unifying framework. Math. Programming.
[7] Zhang, C.-H. (2002). Risk bounds in isotonic regression. Ann. Stat.
[8] Groeneboom, P., & Jongbloed, G. (2014). Nonparametric Estimation under Shape Constraints. Cambridge.
[9] Hesterberg, T. (1995). Weighted average importance sampling and defensive mixtures. Technometrics.
[10] Cole, S. R., & Hernán, M. A. (2008). Constructing inverse probability weights for marginal structural models. AJE.
[11] Owen, A. B. (2013). Monte Carlo Theory, Methods and Examples.
[12] Kish, L. (1965). Survey Sampling. Wiley.
[13] van der Laan, L., Lin, Z., Carone, M., & Luedtke, A. (2024/2025). Stabilized Inverse Probability Weighting via Isotonic Calibration. CLeaR.
[14] van der Laan, M. J., Polley, E., & Hubbard, A. (2007). Super Learner. Stat. Appl. Genet. Mol. Biol.
[15] Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/Debiased ML. Econometrics J.
[16] Dudík, M., Langford, J., & Li, L. (2011). Doubly robust policy evaluation and learning. ICML.
[17] Csiszár, I. (1975). I‑divergence geometry of probability distributions and minimization problems. Ann. Prob.
[18] Banerjee, A., Merugu, S., Dhillon, I. S., & Ghosh, J. (2005). Clustering with Bregman divergences. JMLR.
[19] Deville, J.-C., & Särndal, C.-E. (1992). Calibration estimators in survey sampling. JASA.
[20] Deming, W. E., & Stephan, F. F. (1940). On the adjustment of a contingency table to given marginal totals. Ann. Math. Stat.
[21] Bachmann, G., & Nagarajan, V. (2024). The pitfalls of next‑token prediction. ICML / arXiv.
[22] Quenouille, M. H. (1949). Approximate tests of correlation in time‑series. J. Roy. Stat. Soc. (Series B).
[23] Tukey, J. W. (1958). Bias and confidence in not‑quite large samples. Ann. Math. Stat.
[24] Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans. SIAM.

Cite this work

APA

Eddie Landesberg. (2025, October 10). Design-by-Projection: A General Principle for Structure-Aware Estimation. CIMO Labs Blog. https://cimolabs.com/blog/design-by-projection

BibTeX

@misc{landesberg2025design-by-projection,
  author = {Eddie Landesberg},
  title = {Design-by-Projection: A General Principle for Structure-Aware Estimation},
  howpublished = {\url{https://cimolabs.com/blog/design-by-projection}},
  year = {2025},
  note = {CIMO Labs Blog}
}