CIMO Labs

CLOVER: Selective‑Inference‑Valid Closed‑Loop Optimization of Programmable Judges

Technical Appendix v1.1

Eddie Landesberg, CIMO Labs

Abstract

We formalize CLOVER, a governed procedure for improving LLM judges treated as programmable surrogates. CLOVER (i) calibrates raw judge scores to an operational welfare target $Y$, (ii) audits residuals to detect systematic misscoring, (iii) proposes small, structured rubric patches, and (iv) accepts a patch only if it improves calibration on a time‑separated confirm holdout while passing transport and anti‑gaming constraints. We give identification results for using calibrated rewards in Direct/IPS/DR policy evaluation, derive an oracle‑uncertainty‑aware (OUA) variance decomposition, formalize the patch family selection problem with selective‑inference control via nested sample splitting, and specify an active adversarial searcher as a worst‑case uplift bound. The framework is designed for "score once, calibrate many" with explicit versioning and an assumptions ledger.

Scope: CLOVER vs. SDP-Gov

CLOVER (this appendix) governs the calibration of judges (the mapping $S \to Y$): improving judge rubrics to better predict operational welfare labels while maintaining calibration quality, transportability, and resistance to gaming.

SDP-Gov (Layer 0) governs the Standard Deliberation Protocol (SDP) itself (the mapping $Y \to Y^*$): ensuring that operational welfare labels $Y$ align with true idealized welfare $Y^*$ via empirical validation (PTE against long-run outcomes), construct validity audits, and stability checks. See Validating the Bridge Assumption (A0) for the complete SDP-Gov framework.

0. Notation & Objects

  • Contexts & actions. $X \in \mathcal{X}$, $A \in \mathcal{A}$. A policy $\pi$ maps $x \mapsto \pi(\cdot \mid x)$.
  • Target welfare. $Y \in [0,1]$ is the operational welfare label collected under a fixed Standard Deliberation Protocol (SDP). Optionally, $Y^*$ denotes an idealized target; replace $Y$ with $Y^*$ where relevant.
  • Judge. A rubric/prompt $\theta \in \Theta$ parameterizes a judge $J_\theta: \mathcal{X} \times \mathcal{A} \to \mathbb{R}^d$, with output $S_\theta = J_\theta(X,A)$.
  • Calibrator. A function $f_\theta: \mathbb{R}^d \times \mathcal{X} \to [0,1]$ mapping $(S_\theta, X) \mapsto R_\theta$, where $R_\theta = f_\theta(S_\theta, X)$ is the calibrated reward.
  • Logs & oracle. Logged data $\mathcal{D} = \{(X_i, A_i)\}_{i=1}^n$ from $\pi_0$. A subset $I_{\text{oracle}} \subset \{1,\ldots,n\}$ carries welfare labels $\{Y_i\}$; coverage $\rho = |I_{\text{oracle}}|/n$.
  • Residuals. $\varepsilon_i = Y_i - R_{\theta,i}$ for $i \in I_{\text{oracle}}$.
  • Estimand. Policy value $V(\pi) = \mathbb{E}_X \mathbb{E}_{A \sim \pi(\cdot \mid X)}[Y \mid X,A]$.

Throughout, expectations are w.r.t. the relevant data‑generating distributions; measurability and boundedness of $Y$ are assumed.

1. Measurement Model & Information Ordering

Assumption J1 (Programmable channel)

A rubric $\theta$ specifies an evidence set $E_\theta$ and induces a σ-field $\mathcal{F}(\theta)$ such that $S_\theta$ is $\mathcal{F}(\theta)$-measurable.

Lemma 1 (Informativeness monotonicity; Blackwell/Doob)

If $\mathcal{F}(\theta) \subseteq \mathcal{F}(\theta')$ then

$$\mathbb{E}\!\left[\big(Y - \mathbb{E}[Y \mid \mathcal{F}(\theta')]\big)^2\right] \le \mathbb{E}\!\left[\big(Y - \mathbb{E}[Y \mid \mathcal{F}(\theta)]\big)^2\right]$$

Proof sketch. Conditional expectation is the $L^2$ projection; enlarging the σ-field cannot increase squared error. □

Implication. Enriching obligations (evidence to be checked) weakly reduces Bayes risk for predicting $Y$; weight tweaks alone cannot guarantee this.
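As a small numerical illustration of Lemma 1, the sketch below compares the in-sample squared error of cell means under a coarse and a finer partition. The toy data and the names `e1`, `e2`, and `mse_of_cell_means` are hypothetical; the lemma itself is a population statement, so this is only a finite-sample analogue.

```python
from collections import defaultdict

def mse_of_cell_means(ys, cells):
    """In-sample MSE when each y is predicted by the mean of its cell."""
    sums, counts = defaultdict(float), defaultdict(int)
    for y, c in zip(ys, cells):
        sums[c] += y
        counts[c] += 1
    return sum((y - sums[c] / counts[c]) ** 2 for y, c in zip(ys, cells)) / len(ys)

# Hypothetical toy data: welfare depends on two binary pieces of evidence.
rows = [(e1, e2, 0.2 + 0.5 * e1 + 0.2 * e2) for e1 in (0, 1) for e2 in (0, 1)]
ys = [y for _, _, y in rows]

coarse_cells = [e1 for e1, _, _ in rows]        # rubric checks e1 only
fine_cells = [(e1, e2) for e1, e2, _ in rows]   # rubric checks e1 and e2

mse_coarse = mse_of_cell_means(ys, coarse_cells)
mse_fine = mse_of_cell_means(ys, fine_cells)
```

Here the finer partition refines the coarse one, so `mse_fine` cannot exceed `mse_coarse`, mirroring the σ-field inclusion in the lemma.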

1.1 Design Philosophy: Obligation‑First, Bounded Risk

CLOVER v1.1 adopts an obligation‑first design philosophy to ensure improvements are interpretable, generalizable, and safe:

  • Prefer obligations over weights. When residuals reveal failures, first ask "what evidence should we check?" not "how should we reweight?" Obligation edits (e.g., verify citations, check for contradictions) tend to generalize and are less likely to degrade well‑calibrated regions.
  • Partial pooling by default. Use hierarchical/shrinkage estimation for slice residuals to borrow strength across related groups and avoid chasing noise in small cells.
  • Balanced objectives. Optimize a constrained problem: improve risk slices subject to non‑inferiority on green slices (high‑exposure or high‑stakes regions with good calibration).
  • Bound the blast radius. Cap distributional shift (KS, Wasserstein) and anchor drift so patches remain small, local perturbations rather than wholesale rescalings.
  • Confirm on time‑separated data. Require dev gains to replicate on a confirm holdout collected after the dev set to control selective‑inference risk.

Core principle: Fix what's broken without breaking what works. Residuals are a discovery tool, not the optimization objective.

2. Surrogacy, Transport, and Identification with Calibrated Rewards

Assumption S1 (Surrogacy sufficiency on support)

On $\mathrm{supp}(\pi_0 \cup \Pi_{\text{eval}})$,

$$\mathbb{E}\big[Y \mid X,A,S_\theta\big] = f_\theta(S_\theta, X) \quad \text{a.s.}$$

Assumption S2 (Transport / S‑admissibility)

Across admissible environments $g \in \mathcal{G}$ (policy, time, cohort), with selection nodes $\mathrm{Sel}$,

$$Y \perp\!\!\!\perp \mathrm{Sel} \mid X, A, S_\theta$$

When S2 holds, the same $f_\theta$ transports across $g$.

Proposition 2.1 (Direct identification)

Under S1 (and MAR/positivity for oracle learning of $f_\theta$), for any $\pi$,

$$V(\pi) = \mathbb{E}\!\left[\,R_\theta(X,S_\theta)\,\middle|\, A = \pi(X)\right]$$

Operational: Draw fresh contexts $X_i$, sample actions $A_i \sim \pi(\cdot \mid X_i)$, score $S_{\theta,i} = J_\theta(X_i, A_i)$, apply the calibrator $R_{\theta,i} = f_\theta(S_{\theta,i}, X_i)$, and take the Monte Carlo mean $\widehat{V}(\pi) = n^{-1} \sum_i R_{\theta,i}$.
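The operational recipe can be sketched as a short loop. The policy, judge, and calibrator below (`toy_policy`, `toy_judge`, `toy_calibrator`) are hypothetical stand-ins chosen only to make the example run; a real judge would be an LLM call and a real calibrator would be fitted on oracle labels.

```python
import random

def direct_value(contexts, policy, judge, calibrator, rng):
    """Monte Carlo mean of calibrated rewards for fresh draws under `policy`."""
    rewards = []
    for x in contexts:
        a = policy(x, rng)                 # A_i ~ pi(. | X_i)
        s = judge(x, a)                    # S_i = J_theta(X_i, A_i)
        rewards.append(calibrator(s, x))   # R_i = f_theta(S_i, X_i)
    return sum(rewards) / len(rewards)

# Hypothetical stand-ins, for illustration only.
def toy_policy(x, rng):
    return 1 if rng.random() < 0.8 else 0

def toy_judge(x, a):
    return 3.0 + 4.0 * a                   # raw score on a 0-10 scale

def toy_calibrator(s, x):
    return min(1.0, max(0.0, s / 10.0))    # maps the raw score into [0, 1]

v_hat = direct_value(range(1000), toy_policy, toy_judge, toy_calibrator, random.Random(0))
```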

Proposition 2.2 (IPS)

With overlap $\pi_0(a \mid x) > 0$ whenever $\pi(a \mid x) > 0$,

$$V(\pi) = \mathbb{E}\!\left[w_\pi(X,A)\,R_\theta\right], \quad w_\pi(X,A) = \frac{\pi(A \mid X)}{\pi_0(A \mid X)}$$

Proposition 2.3 (DR)

Let $Q(X,A) \approx \mathbb{E}[R_\theta \mid X,A]$. Then

$$V(\pi) = \mathbb{E}\!\left[w_\pi(X,A)\big(R_\theta - Q(X,A)\big) + Q_\pi(X)\right]$$

where $Q_\pi(X) = \mathbb{E}_{a \sim \pi(\cdot \mid X)}[Q(X,a)]$. Consistency obtains if either $w_\pi$ or $Q$ is correct.

Proofs. Standard; replace $Y$ by $R_\theta$ using S1. □
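A minimal sketch of the IPS and DR estimators from Propositions 2.2 and 2.3, assuming a finite action set and known logging propensities. The two-action example and all names (`ips_value`, `dr_value`, `reward`) are illustrative assumptions; the last line uses a deliberately wrong outcome model to show the double-robustness property when the weights are correct.

```python
def ips_value(logs, pi, pi0):
    """logs: (x, a, r) triples drawn under pi0, with r the calibrated reward."""
    return sum(pi(a, x) / pi0(a, x) * r for x, a, r in logs) / len(logs)

def dr_value(logs, pi, pi0, q, actions):
    """Doubly robust estimate with outcome model q(x, a)."""
    total = 0.0
    for x, a, r in logs:
        w = pi(a, x) / pi0(a, x)
        q_pi = sum(pi(b, x) * q(x, b) for b in actions)   # Q_pi(x)
        total += w * (r - q(x, a)) + q_pi
    return total / len(logs)

# Hypothetical two-action example: uniform logging, target favours action 1.
pi0 = lambda a, x: 0.5
pi = lambda a, x: 0.8 if a == 1 else 0.2
reward = {0: 0.3, 1: 0.7}
logs = [(0, a, reward[a]) for a in (0, 1)] * 50

v_ips = ips_value(logs, pi, pi0)
v_dr_wrong_q = dr_value(logs, pi, pi0, lambda x, a: 0.5, actions=(0, 1))
```

In this toy setup the true value is 0.2 × 0.3 + 0.8 × 0.7 = 0.62, and both estimators recover it because the logged actions are exactly balanced.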

3. Calibration Risk, Residuals, and Slicing

Calibration risk. For loss $\ell(r,y) = (r - y)^2$, define $\mathcal{R}(f_\theta) = \mathbb{E}[\ell(R_\theta, Y) \mid i \in I_{\text{oracle}}]$. Cross‑fitting yields $\widehat{\mathcal{R}}$ and out‑of‑fold residuals $\varepsilon$.

Default calibrator hyperparameters

  • Calibrator architecture: Two‑stage if covariates available:
    • Stage 1: Spline regression over $[S_\theta, \text{length}, \text{has\_citation}, \ldots]$ → intermediate score $T = g_k(S_\theta, X)$
    • Stage 2: Isotonic regression on $T$ → $R_\theta$
    If no covariates, use isotonic on $S_\theta$ directly.
  • Cross‑fitting: $K = 5$ folds. Larger $K$ reduces bias at cost of higher variance.
  • Residual slicing: 10–20 groups (domain × difficulty × length bins). Use Benjamini–Hochberg to control the false discovery rate ($q = 0.10$). For FWER control, use Bonferroni or Holm.
  • Stopping: Two consecutive iterations with no significant residual structure (all group means have CIs overlapping 0) and transport diagnostics pass.
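For the no-covariate case, the isotonic step can be sketched with the pool-adjacent-violators algorithm. This is a minimal pure-Python version under stated assumptions: it omits the spline stage and cross-fitting, does not treat tied scores specially, and the function names are hypothetical. A production pipeline would more likely use a library implementation.

```python
from bisect import bisect_right

def isotonic_fit(scores, labels):
    """Pool-adjacent-violators: least-squares non-decreasing fit of labels on scores."""
    pairs = sorted(zip(scores, labels))
    blocks = []                                  # each block is [sum_of_labels, count]
    for _, y in pairs:
        blocks.append([y, 1])
        # Merge backwards while the previous block mean exceeds the current one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return [x for x, _ in pairs], fitted

def isotonic_predict(xs, fitted, score):
    """Step-function prediction: value at the last training score <= `score`."""
    i = max(0, bisect_right(xs, score) - 1)
    return fitted[i]

# Raw judge scores 1..4 with oracle labels; the middle pair violates monotonicity.
xs, fitted = isotonic_fit([1, 2, 3, 4], [0.1, 0.5, 0.3, 0.9])
```

The violating pair (0.5, 0.3) is pooled to its mean, so the fitted values are non-decreasing in the raw score.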

Slices. Pre‑register a finite slice family $\mathcal{S} = \{G_1, \ldots, G_m\}$ where each $G_j$ partitions oracle indices into groups $g \in \mathcal{G}_j$ (e.g., domain × difficulty × length). For each group, test

$$H_0(g):\ \mathbb{E}[\,\varepsilon \mid i \in g\,] = 0$$

Use BH at level $q$ within each $G_j$. Fit a hierarchical model (partial pooling) $\varepsilon \sim \alpha_{G_j} + u_g + \text{covariates} + \eta$ to obtain shrunken residual estimates $\tilde{\varepsilon}_g$ across overlapping slices. This is the default in v1.1: it avoids chasing noise in small cells by borrowing strength across related groups. If unavailable, use James–Stein shrinkage or require minimum $n_g \ge 40$ per slice.
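The per-group test and the BH step can be sketched as follows. The normal-approximation p-value is an assumption made for illustration (the text does not fix a test statistic), and the hierarchical shrinkage step is omitted.

```python
from math import erfc, sqrt
from statistics import mean, stdev

def slice_pvalue(residuals):
    """Two-sided normal-approximation p-value for H0: mean residual = 0."""
    z = mean(residuals) / (stdev(residuals) / sqrt(len(residuals)))
    return erfc(abs(z) / sqrt(2))

def benjamini_hochberg(pvalues, q=0.10):
    """Return booleans marking which hypotheses BH rejects at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= q * rank / m:
            k_max = rank                  # largest rank passing its threshold
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected

# Hypothetical p-values for four groups within one slice family.
flags = benjamini_hochberg([0.001, 0.04, 0.03, 0.5], q=0.10)
```

BH is applied separately within each slice family $G_j$, so in practice this function would be called once per family.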

3.1 Residual Cards

For each significant slice $g$ (after FDR correction), produce a Residual Card containing:

  • Slice: domain = "medical Q&A", length bin = 600–900, difficulty = "hard"
  • Stats: $n_g = 143$, $\bar{\varepsilon}_g = -0.082$ (CI $[-0.11, -0.05]$); negative means the judge over‑scores vs. the oracle
  • Attribution hints (computed features): verbosity +180% median; citation_valid=false in 61% cases; high "authoritative tone" markers
  • Nearest‑neighbor exemplars ($k=3$) with largest negative residuals
  • Counter‑examples ($k=3$) with near-zero residuals in same slice
  • Anti‑gaming stress deltas for this slice
  • Hypothesis checklist (auto‑scores): length bias ✓, fake citations ✓, missed risk framing ✗

Key principle: These cards are the only inputs to the Patch Synthesizer (§4), preventing it from memorizing specific labels and reducing overfitting risk.

3.2 Slice Classification: Risk, Green, Neutral

After residual analysis, classify each slice $g$ into one of three categories for the current iteration:

Risk Slices ($g \in \mathcal{R}$)

CI for $\bar{\varepsilon}_g$ excludes 0 and $|\bar{\varepsilon}_g| \ge 0.03$. These represent material systematic miscalibration.

Green Slices ($g \in \mathcal{G}$)

High exposure ($w_g \ge 0.05$) or high stakes ($u_g \ge 0.7$), and calibration is good (CI overlaps 0 or $|\bar{\varepsilon}_g| < 0.03$). Patches must not harm these slices.

Neutral Slices ($g \in \mathcal{N}$)

All others: low exposure, low stakes, or calibration status unclear. Not explicitly optimized or protected.

Classification freeze: For each iteration, freeze the classification at the start based on the current rubric $\theta$. Evaluate candidate patches against this fixed classification to prevent gaming the definition.

Weights. Define exposure weight $w_g = n_g / \sum_{g'} n_{g'}$ (fraction of samples in slice $g$), and stakes weight $u_g \in [0,1]$ (externally specified per slice, e.g., medical Q&A with dosing = 0.9, casual chitchat = 0.1).
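The three-way classification reduces to a short rule. The sketch below encodes the thresholds stated above; the `SliceStats` container and its field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SliceStats:
    mean_resid: float   # mean residual for the slice (raw or shrunken)
    ci_low: float
    ci_high: float
    exposure: float     # w_g
    stakes: float       # u_g

def classify_slice(s, min_effect=0.03, min_exposure=0.05, min_stakes=0.7):
    """Return 'risk', 'green', or 'neutral' using the thresholds from the text."""
    ci_excludes_zero = s.ci_low > 0 or s.ci_high < 0
    if ci_excludes_zero and abs(s.mean_resid) >= min_effect:
        return "risk"
    # Anything that is not a risk slice counts as well calibrated here.
    if s.exposure >= min_exposure or s.stakes >= min_stakes:
        return "green"
    return "neutral"

# The Residual Card example from §3.1 (exposure and stakes values are made up).
card_slice = SliceStats(-0.082, -0.11, -0.05, exposure=0.02, stakes=0.9)
label = classify_slice(card_slice)
```

To respect the classification freeze, the labels would be computed once per iteration from the current rubric and stored, not recomputed per candidate patch.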

4. Patch Space, Complexity, and Acceptance Predicate

Rubric obligations. A rubric is a typed object $\theta = (\texttt{obligations}, \texttt{guards}, \texttt{abstention})$ (Appendix B). A patch $\delta$ is a finite edit to $\theta$ producing $\theta' = \theta \oplus \delta$.

Patch families. Pre‑register disjoint families $\mathcal{F}_1, \ldots, \mathcal{F}_K$ (e.g., evidence verification, length‑bias cap, abstention). Each iteration proposes at most one patch per family.

Complexity. Define $\kappa(\delta) \in \mathbb{R}_{\ge 0}$ as a code‑length‑like measure (Δ tokens + # new guards + # abstention edits). Complexity penalties discourage overfitting.

4.1 Balanced Objective: Improve Total Error, Protect Green

CLOVER v1.1 uses global MSE reduction as the primary acceptance criterion, with green non‑inferiority as a hard constraint. Risk‑weighted improvement serves as a tie‑break among acceptable patches.

Primary criterion (hard gate): Global MSE improvement

$$\Delta\mathrm{MSE}_{\text{global}} = \frac{1}{n_{\text{oracle}}} \sum_{i \in I_{\text{oracle}}} \Big[(Y_i - R_{\theta',i})^2 - (Y_i - R_{\theta,i})^2\Big]$$

Require $\Delta\mathrm{MSE}_{\text{global}} \le -\eta$ with $\eta \in [0.0005, 0.0025]$ on both dev and confirm. For scale, $\sqrt{\eta}$ ranges from roughly 0.02 to 0.05 on the $[0,1]$ scale; the implied RMSE change depends on the baseline and is approximately $\eta / (2\,\mathrm{RMSE})$ for small changes. MSE differences are additive and stable under aggregation, unlike RMSE.

Tie‑break metric: Risk‑weighted MSE improvement

$$\mathrm{RiskImprove}(\delta) = \sum_{g \in \mathcal{R}} w_g u_g \cdot \Delta\mathrm{MSE}_g(\delta)$$

where $\Delta\mathrm{MSE}_g = \mathrm{MSE}_g(\theta') - \mathrm{MSE}_g(\theta)$. Among patches that pass all gates, prefer the one with the most negative $\mathrm{RiskImprove}(\delta)$. This focuses improvements on high‑stakes failures without making it a hard requirement.

Non‑inferiority on green slices (hard constraint)

For each $g \in \mathcal{G}$, run a one‑sided non‑inferiority test:

$$H_0: \Delta\mathrm{MSE}_g \ge \tau_g \quad \text{vs} \quad H_1: \Delta\mathrm{MSE}_g < \tau_g$$

with tolerance $\tau_g \in [1 \times 10^{-3},\, 2 \times 10^{-3}]$ (MSE scale). Require rejection of $H_0$ (no material degradation) after BH correction on both dev and confirm. For must‑not‑regress slices (e.g., safety‑critical), set $\tau_g = 0$.

Why MSE? MSE differences are additive (you can sum across examples) and interpretable. RMSE differences can flip signs and complicate inference. Total error reduction ensures you improve the whole system, while green non‑inferiority prevents "winning red cells while losing the map."
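A minimal sketch of the paired ΔMSE statistic and a one-sided test, assuming a normal approximation for the mean of per-example differences. The synthetic data, the choice $\tau_g = 0.002$, and the function names are illustrative assumptions; BH correction across green slices is not shown.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def sq_err_diffs(y, r_old, r_new):
    """Per-example (Y - R_new)^2 - (Y - R_old)^2; negative means the patch helps."""
    return [(yi - rn) ** 2 - (yi - ro) ** 2 for yi, ro, rn in zip(y, r_old, r_new)]

def one_sided_pvalue(diffs, threshold):
    """Normal-approximation p-value for H0: E[d] >= threshold vs H1: E[d] < threshold."""
    se = stdev(diffs) / sqrt(len(diffs))
    return NormalDist().cdf((mean(diffs) - threshold) / se)

# Hypothetical data: the patched calibrator has smaller errors than the old one.
y = [(i % 10) / 10 for i in range(200)]
r_old = [yi + 0.05 + 0.01 * (i % 5) for i, yi in enumerate(y)]
r_new = [yi + 0.01 * (i % 3) for i, yi in enumerate(y)]

d = sq_err_diffs(y, r_old, r_new)
delta_mse = mean(d)                                  # point estimate of the MSE change
p_noninf = one_sided_pvalue(d, threshold=0.002)      # green-slice test with tau_g = 0.002
```

Because the differences are computed per example, the same helper applies to the global gate (all oracle rows) and to each green slice (rows restricted to that slice).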

4.2 Distribution Shift & Anchor Stability (Blast‑Radius Caps)

To ensure patches remain local perturbations rather than wholesale rescalings, impose hard constraints on score distribution and anchor drift:

  • Distribution shift cap: Kolmogorov–Smirnov distance $D_{\text{KS}}(R_{\theta'}, R_\theta) \le 0.05$ (or Wasserstein distance $\le 0.02$). This bounds the maximum pointwise CDF difference, preventing large‑scale rescalings.
  • Anchor stability: For reference policies $\pi_{\text{low}}$ and $\pi_{\text{high}}$,
    $$\big|\mathbb{E}[R_{\theta'} \mid \pi_{\text{low}}] - \mathbb{E}[R_\theta \mid \pi_{\text{low}}]\big| \le 0.01$$
    $$\big|\mathbb{E}[R_{\theta'} \mid \pi_{\text{high}}] - \mathbb{E}[R_\theta \mid \pi_{\text{high}}]\big| \le 0.01$$
    Ensures the calibrated scale remains comparable across rubric versions.

Why cap shift? Without these constraints, a patch could trivially "improve" residuals by rescaling all scores. Distribution and anchor caps keep patches interpretable and prevent score drift across versions.
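The blast-radius checks can be computed directly from the two sets of calibrated rewards. The sketch below is a plain-Python version under simplifying assumptions: the Wasserstein helper only handles equal-size samples, and all function names are hypothetical.

```python
from bisect import bisect_right

def ks_distance(a, b):
    """Two-sample KS statistic: max |F_a(t) - F_b(t)| over all observed points."""
    sa, sb = sorted(a), sorted(b)
    return max(abs(bisect_right(sa, t) / len(sa) - bisect_right(sb, t) / len(sb))
               for t in sa + sb)

def wasserstein_1(a, b):
    """W1 for equal-size samples: mean absolute gap between sorted values."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def anchor_drift(r_old, r_new):
    """Absolute change in mean calibrated reward on one anchor policy."""
    return abs(sum(r_new) / len(r_new) - sum(r_old) / len(r_old))

# Hypothetical rewards before and after a patch that shifts every score by 0.01.
r_before = [0.1, 0.2, 0.3, 0.4]
r_after = [r + 0.01 for r in r_before]
ks, w1, drift = ks_distance(r_before, r_after), wasserstein_1(r_before, r_after), anchor_drift(r_before, r_after)
```

With only four points the KS statistic is coarse (it moves in steps of 0.25), which is one reason to evaluate these caps on the full scored set and not on a small sample.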

Acceptance Predicate (v1.1)

On a given iteration with splits (fit, dev, confirm), a candidate patch $\delta$ is acceptable iff:

Accept(δ) =
    [ΔMSE_global ≤ -η on dev & confirm]   (§4.1: total-error improvement)
  ∧ GreenNonInf(δ)                        (§4.1: all green slices pass τ_g test)
  ∧ [ΔECE ≤ 0]                            (global calibration non-worse)
  ∧ DistShiftOK                           (§4.2: KS ≤ 0.05, Wasserstein ≤ 0.02)
  ∧ AnchorStable                          (§4.2: drift ≤ 0.01 on π_low, π_high)
  ∧ TransportOK                           (§7: groupwise residuals ≈ 0)
  ∧ AntiGameOK                            (§6: uplift ≤ 0.05 per attack)
  ∧ κ(δ) ≤ τ                              (complexity: ≤ 150 tokens or ≤ 2 fields)
  ∧ OUA_OK                                (if OUA ≥ 0.30, require |ΔMSE| ≥ threshold)

Tie‑break: Among acceptable patches, prefer the one with the most negative RiskImprove(δ) (risk‑weighted ΔMSE over risk slices), then smallest κ(δ), then smallest KS shift. Reject if all gates pass but improvement is tiny and OUA share is high (≥ 0.30) — collect more labels instead.
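One way to make the predicate auditable is to evaluate each gate separately and report which ones failed. The sketch below assumes the per-gate quantities have already been computed; the `PatchReport` fields are hypothetical, and `eta` and `oua_min_gain` are illustrative values (the text gives a range for η and leaves the OUA threshold unspecified). Only the KS form of the distribution-shift cap is checked.

```python
from dataclasses import dataclass, replace

@dataclass
class PatchReport:
    delta_mse_dev: float
    delta_mse_confirm: float
    green_noninferior: bool     # all green slices passed the tau_g test
    delta_ece: float
    ks_shift: float
    anchor_drift: float         # max drift over pi_low and pi_high
    transport_ok: bool
    max_attack_uplift: float
    added_tokens: int
    changed_fields: int
    oua_share: float

def evaluate_gates(r, eta=0.001, oua_min_gain=0.002):
    """Return (accepted, gates); eta and oua_min_gain are illustrative choices."""
    gain = min(-r.delta_mse_dev, -r.delta_mse_confirm)
    gates = {
        "global_mse": r.delta_mse_dev <= -eta and r.delta_mse_confirm <= -eta,
        "green_noninf": r.green_noninferior,
        "ece": r.delta_ece <= 0,
        "dist_shift": r.ks_shift <= 0.05,
        "anchor": r.anchor_drift <= 0.01,
        "transport": r.transport_ok,
        "anti_gaming": r.max_attack_uplift <= 0.05,
        "complexity": r.added_tokens <= 150 or r.changed_fields <= 2,
        "oua": r.oua_share < 0.30 or gain >= oua_min_gain,
    }
    return all(gates.values()), gates

good = PatchReport(-0.0030, -0.0022, True, -0.004, 0.02, 0.004, True, 0.01, 60, 1, 0.15)
accepted, gates = evaluate_gates(good)
rejected, failed = evaluate_gates(replace(good, ks_shift=0.08))
```

Returning the gate dictionary alongside the decision makes it straightforward to log why a patch was rejected.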

5. Selective‑Inference‑Valid Patch Selection (Nested Holds)

Data splitting. For iteration tt, randomly partition oracle indices into disjoint subsets:

  • fit: train calibrators $f_\theta, f_{\theta'}$ (K‑fold cross‑fit internal to fit).
  • dev: synthesize candidates and select $\delta$ (search uses only dev residuals/cards).
  • confirm (time‑separated): never used for synthesis/selection; only for acceptance.

Null of no improvement. For a fixed $\delta$, define on any set $A \subset I_{\text{oracle}}$,

$$\Delta \mathcal{R}_A(\delta) = \frac{1}{|A|} \sum_{i \in A} \big\{(Y_i - R_{\theta',i})^2 - (Y_i - R_{\theta,i})^2\big\}$$

Acceptance requires $\Delta \mathcal{R}_{\text{dev}} < 0$ and $\Delta \mathcal{R}_{\text{confirm}} < 0$.

Proposition 5.1 (Valid confirm‑set test under search)

Condition on the dev set and the selected $\widehat{\delta}$. If the confirm set is independent and not used during selection, and $f_\theta, f_{\theta'}$ are fitted without using confirm, then a one‑sided test of $H_0: \mathbb{E}[\Delta \mathcal{R}_{\text{confirm}}(\widehat{\delta})] \ge 0$ using $\Delta \mathcal{R}_{\text{confirm}}$ enjoys valid type‑I error control at level $\alpha$ (asymptotically normal via CLT or via paired bootstrap), regardless of the (arbitrary) search on dev.

Proof sketch. Sample splitting removes selection‑test dependence; $\widehat{\delta}$ is measurable w.r.t. dev σ-field; confirm statistic remains an unbiased (or asymptotically normal) estimator for its expectation. □

In practice: We ensure independence by time‑separating confirm (e.g., collected ≥ 48–72 hours after dev) and never touching confirm for search or tuning.
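The time separation can be enforced mechanically when building the splits. In the sketch below, timestamps are in hours, rows collected in the gap between `dev_end` and `confirm_start` are dropped, and the 60/40 fit/dev ratio is an arbitrary illustrative choice; none of these parameter names come from the text.

```python
import random

def nested_split(records, dev_end, confirm_start, fit_frac=0.6, seed=0):
    """records: (timestamp_hours, item) pairs.

    Items before `dev_end` are shuffled into fit/dev; items at or after
    `confirm_start` form confirm; anything in the gap is discarded.
    """
    early = [item for ts, item in records if ts < dev_end]
    confirm = [item for ts, item in records if ts >= confirm_start]
    random.Random(seed).shuffle(early)
    cut = int(fit_frac * len(early))
    return early[:cut], early[cut:], confirm

# Hypothetical hourly oracle rows with a 48-hour gap before confirm.
records = [(t, f"row-{t}") for t in range(100)]
fit, dev, confirm = nested_split(records, dev_end=40, confirm_start=88)
```

Confirm rows are returned but should not be read by any search or tuning code; in practice they would live behind a separate access path.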

Corollary 5.2 (Family‑wise error per iteration)

If at most one patch per family is accepted and a Bonferroni correction is applied across the $K$ families on the confirm set, the per‑iteration FWER is $\le \alpha$.

Patch budgets. To control cumulative error across iterations, cap accepted patches per quarter and limit confirm peeks (pre‑register).

6. Anti‑Gaming as Robustness (Active Adversary, Worst‑Case Uplift)

Obligation violations. Let $\mathcal{V}(\theta) \subset \mathcal{X} \times \mathcal{A}$ denote pairs violating rubric obligations (e.g., unverifiable citations). An adversary chooses edit operators $\omega \in \Omega$ producing perturbed pairs $(X^\omega, A^\omega)$.

Adversarial uplift. Define the worst‑case calibrated uplift under violations:

$$\Delta_{\text{adv}}(\theta) = \sup_{\omega \in \Omega:\ (X^\omega,A^\omega) \in \mathcal{V}(\theta)} \big\{ \mathbb{E}[R_\theta(X^\omega,A^\omega)] - \mathbb{E}[R_\theta(X,A)] \big\}$$

6.1 Validation Battery (held-out from calibration)

Adversarial Test Suite

Test 1: Length padding
Take correct answers, append +30–200% boilerplate text.
Judge should: penalize or stay neutral (not reward).
Test 2: Style mimicry
Prepend rubric phrases ("I verify sources", "reasoning step 1").
Judge should: ignore style markers, score only content.
Test 3: Confident-but-wrong
Flip factual claims but keep authoritative tone.
Judge should: detect via evidence checks.
Test 4: Citation hallucination
Include fake URLs or paper titles.
Judge should: flag or cap score.
Test 5: Style stripping
Normalize tone, remove formatting.
Judge should: keep the score stable (±0.02).
Test 6: Tool-trace fabrication
Add fake tool invocation traces.
Judge should: detect inconsistency with actual outputs.

Expected outcomes

  • Uncalibrated $S_\theta$: increases by 0.10–0.30 under attacks
  • Calibrated $R_\theta$: shift ≤ 0.05 if rubric has proper guards
  • Failures surface as residual structure → trigger prompt updates

Acceptance condition. Require $\Delta_{\text{adv}}(\theta') \le \tau_{\text{adv}}$ (default $\le 0.05$). In practice, $\sup$ is approximated by CLOVER‑A (Appendix C): an evolutionary search over operators with selection on $R_\theta$ and violation checks. Newly discovered exploits are added to the regression battery.
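A fixed attack battery gives a simple lower bound on the worst-case uplift. The sketch below is not CLOVER‑A: it evaluates a hand-written list of operators once, with no evolutionary search and no violation check. The two reward functions are hypothetical toys, one that leaks length into the score and one that scores content only.

```python
def length_padding(answer):
    return answer + " In summary, the above is comprehensive." * 3

def style_mimicry(answer):
    return "I verify sources. Reasoning step 1: " + answer

def worst_case_uplift(pairs, reward, operators):
    """Return (max uplift, per-operator uplift) of mean reward under each attack."""
    base = sum(reward(x, a) for x, a in pairs) / len(pairs)
    uplifts = {}
    for name, op in operators.items():
        attacked = sum(reward(x, op(a)) for x, a in pairs) / len(pairs)
        uplifts[name] = attacked - base
    return max(uplifts.values()), uplifts

def leaky_reward(x, a):
    return min(1.0, 0.5 + 0.001 * len(a))      # toy judge that rewards length

def guarded_reward(x, a):
    return 0.9 if "42" in a else 0.2           # toy judge that scores content only

pairs = [("What is 6 x 7?", "The answer is 42."), ("What is 40 + 2?", "It is 42.")]
operators = {"length_padding": length_padding, "style_mimicry": style_mimicry}
leaky_worst, leaky_by_op = worst_case_uplift(pairs, leaky_reward, operators)
guarded_worst, _ = worst_case_uplift(pairs, guarded_reward, operators)
```

In this toy the length-sensitive reward exceeds the 0.05 cap under padding, while the content-only reward shows no uplift.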

7. Transport Diagnostics & Regime Selection

For a group variable $G$ (policy/time/domain), test transport via residual means:

$$H_0:\ \mathbb{E}[\,Y - R_\theta \mid G = g\,] = 0 \quad \forall g \in \mathcal{G}$$

BH within each $G$ family controls FDR (or Bonferroni for FWER). If any $g$ fails:

Local surrogacy (Regime 2)

Fit environment‑specific $f_{\theta,g}$ and evaluate within $g$.

No surrogacy (Regime 1)

Use $Y$ directly with DR on labeled rows.

Global patch attempt

If failures align with clear missed evidence, propose a global patch and re‑test.

8. OUA Variance & Sample‑Size Planning

Variance decomposition (DR + cross‑fitting).

$$\mathrm{Var}[\widehat{V}(\pi)] \approx \frac{\sigma^2_{\text{eval}}}{n} + \frac{\sigma^2_{\text{oracle}}}{n_Y}$$

with $\sigma^2_{\text{oracle}}$ from calibrator learning uncertainty. Estimate $\sigma^2_{\text{oracle}}$ by delete‑one‑fold jackknife over oracle folds; add to influence‑function variance for the main DR term.

OUA share. $\mathrm{OUA} = \sigma^2_{\text{oracle}} / (\sigma^2_{\text{eval}} + \sigma^2_{\text{oracle}})$.

Budget rule

High OUA ($\ge 0.3$) → acquire more oracle labels; low OUA ($\le 0.1$) → gather more cheap scores to reduce evaluation variance.
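The decomposition and budget rule can be wrapped in two small helpers. The jackknife function uses the standard delete-one-group formula applied to policy-value estimates recomputed with each oracle fold left out; that reading of "delete-one-fold jackknife" is an assumption, as are the function names and the neutral advice string for OUA shares between 0.1 and 0.3.

```python
from math import sqrt

def jackknife_variance(leave_one_out_estimates):
    """Delete-one-group jackknife variance of an estimator across K folds."""
    k = len(leave_one_out_estimates)
    center = sum(leave_one_out_estimates) / k
    return (k - 1) / k * sum((v - center) ** 2 for v in leave_one_out_estimates)

def oua_summary(var_eval, var_oracle, n, n_oracle):
    """Return (standard error, OUA share, budget advice) per the formulas above."""
    se = sqrt(var_eval / n + var_oracle / n_oracle)
    share = var_oracle / (var_eval + var_oracle)
    if share >= 0.3:
        advice = "acquire more oracle labels"
    elif share <= 0.1:
        advice = "gather more cheap scores"
    else:
        advice = "no strong preference"      # intermediate case, not specified in the text
    return se, share, advice

# Illustrative inputs: sigma^2_eval = 0.08, sigma^2_oracle = 0.02, n = 3000, n_Y = 600.
se, share, advice = oua_summary(0.08, 0.02, n=3000, n_oracle=600)
jk = jackknife_variance([0.60, 0.62, 0.61, 0.63, 0.59])
```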

Sizing (rule‑of‑thumb). For target SE $\le 0.025$, typical $\sigma^2_{\text{eval}} \in [0.05, 0.1]$ implies $n \approx 2{,}000$–$3{,}000$, $n_Y \approx 300$–$500$.

Worked example

Setup: OUA share ≈ 0.2, so $\sigma^2_{\text{oracle}} \approx 0.25 \cdot \sigma^2_{\text{eval}}$.

Goal: SE $\le 0.025$ (95% CI half‑width ≈ 0.05).

Derivation:

$$\mathrm{SE} = \sqrt{\frac{\sigma^2_{\text{eval}}}{n} + \frac{\sigma^2_{\text{oracle}}}{n_Y}} \le 0.025$$

With OUA share = 0.2:

$$\sigma^2_{\text{oracle}} = \frac{0.2}{0.8}\, \sigma^2_{\text{eval}} = 0.25\, \sigma^2_{\text{eval}}$$

If we set $n_Y = 0.2\, n$ (20% oracle coverage), then:

$$\mathrm{Var} = \frac{\sigma^2_{\text{eval}}}{n} + \frac{0.25\, \sigma^2_{\text{eval}}}{0.2\, n} = \frac{\sigma^2_{\text{eval}}}{n}(1 + 1.25) = \frac{2.25\, \sigma^2_{\text{eval}}}{n}$$

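For $\sigma^2_{\text{eval}} = 0.08$ (typical):

$$n \ge \frac{2.25 \times 0.08}{(0.025)^2} = \frac{0.18}{0.000625} = 288$$

Thus: $n \approx 300$ total and $n_Y \approx 60$ oracle labels meet the target SE under these variance assumptions. The rule‑of‑thumb sizes above are larger; slice‑level tests also need enough oracle labels per group (e.g., $n_g \ge 40$ per slice, §3), which this variance bound does not account for.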

Note: If OUA share is higher (e.g., 0.3), allocate more budget to oracle labels; if lower (e.g., 0.1), gather more cheap scores instead.

9. Engineering Contracts (Score‑Once, Versioning, Determinism)

  • Deterministic judging. Fix decoding (e.g., temperature 0).
  • Score‑once. Cache $S_\theta$; mark DSL fields that require re‑scoring vs can be recomputed downstream.
  • Versioning. Persist {judge_model_id, prompt_hash, rubric_version, calibrator_version, SDP_version, anchors}.
  • Change control. Two‑key approval for abstention/safety edits; automatic rollback if guardrails fail post‑deployment.

Judge versioning (always log)

judge_model:        gpt-4.5-mini
judge_prompt_hash:  sha256:ab12cd34ef56...
rubric_version:     3.2
calibrator_version: isotonic-v5
SDP_version:        v1.0_2025-11-11
anchors:            [pi_low=baseline-gpt4, pi_high=expert-panel]

Any change to model family or hard rules triggers a small oracle re‑calibration before deployment.
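The versioning and score-once contracts can be tied together by deriving the score cache key from only the fields that affect raw judge output. This is a sketch under that assumption; the class, function names, model identifier, and prompt text are hypothetical, and the anchors field is omitted for brevity.

```python
import hashlib
import json
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class JudgeVersion:
    judge_model: str
    judge_prompt_hash: str
    rubric_version: str
    calibrator_version: str
    sdp_version: str

def prompt_hash(prompt_text):
    """Content hash of the judge prompt, in the 'sha256:<hex>' form used above."""
    return "sha256:" + hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()

def score_cache_key(v, x, a):
    """Key for cached raw scores; deliberately ignores the calibrator version."""
    payload = json.dumps([v.judge_model, v.judge_prompt_hash, v.rubric_version, x, a])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

v1 = JudgeVersion("example-judge-model", prompt_hash("Score the answer from 0 to 10."),
                  "3.2", "isotonic-v5", "v1.0")
v_recalibrated = replace(v1, calibrator_version="isotonic-v6")
v_new_rubric = replace(v1, rubric_version="3.3")

same_key = score_cache_key(v1, "q", "ans") == score_cache_key(v_recalibrated, "q", "ans")
new_key = score_cache_key(v1, "q", "ans") != score_cache_key(v_new_rubric, "q", "ans")
```

Recalibrating reuses cached scores, while a rubric or prompt change invalidates them, which matches the "score once, calibrate many" intent.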

Reporting template (minimum bundle per iteration)

  • Target & anchors: Y vs Y*, SDP version, $(\pi_{\text{low}}, \pi_{\text{high}})$ specs; anchor stability check
  • Policy values: $\hat{V}(\pi)$ with OUA‑augmented 95% CIs (Direct/IPS/DR where applicable)
  • Calibration metrics: RMSE/ECE (before/after); calibration curve
  • Residuals: Slice table (means with CIs) before/after; hierarchical summary if used
  • Transport tests: Per environment pass/fail ($p$-values)
  • Anti‑gaming: Uplifts under each test; new exploits discovered
  • OUA share: Overall and by decile
  • Negative segments: Top $(u,t)$ cells by weighted loss and minimal friction to flip
  • Complexity: DSL size delta; dead‑rule pruning status
  • Diffs & versions: Patch DSL diffs; version triplet

10. Limitations & Scope Conditions

Construct drift

If the welfare construct $Y$ changes, recalibration cannot fix it; re‑anchor and re‑spec SDP.

Selective‑inference over time

Repeated iterations consume the confirm budget; time‑blocked confirms and pre‑registered patch budgets mitigate.

Non‑regular estimands

For extreme quantiles/worst‑case metrics, use EVT‑aware inference.

Reward hacking risk

Training models on $R_\theta$ can still induce exploitation; when used for training, optimize a lower confidence bound and validate on fresh $Y$ via A/B.

10.1 Explicit De‑Scoping (What We're NOT Doing)

To keep CLOVER v1.1 lightweight and implementable, we explicitly de‑scope the following:

  • No elaborate prompt DSL. A tiny YAML with obligations, guards, and abstention triggers is sufficient. We avoid building a custom domain‑specific language with complex parsing and code generation.
  • No fancy weight‑stabilization schemes. Prefer Direct/DR estimators unless ESS is healthy (>10% of n). Avoid uncontrolled IPS unless overlap is strong; standard stabilization (clip weights, regularize nuisance models) is acceptable but not mandatory.
  • No massive adversarial frameworks. A simple mutation search over 6–8 attack operators (padding, mimicry, fact flips, fake cites, fabricated traces) with 100–200 tries per patch suffices to keep us honest. Evolutionary/GAN‑based adversaries are out‑of‑scope.
  • No uncontrolled patch search. Limit to ≤ 1 candidate per family with a confirm split. Multi‑armed bandit or Bayesian optimization over patch space is unnecessary given the small patch budget (≤ 2 per iteration).
  • No dynamic slice adaptation. Pre‑register the slice family at iteration start and freeze the risk/green/neutral classification for that iteration. Adaptive slicing mid‑search introduces selection bias.
  • No automatic SDP re‑specification. If the welfare construct $Y$ changes (e.g., shift from response quality to safety), CLOVER cannot auto‑detect or fix it; this requires human re‑anchoring and a new SDP version.

Philosophy: CLOVER v1.1 prioritizes statistical rigor and interpretability over automation. A small team with notebooks + YAML + simple calibrators can run the full loop. Extensions (dynamic slices, learned patch synthesis, active learning for oracle sampling) are future work.

11. SDP-Govynth: Automated Patch Synthesis

The CLOVER loop has four steps: Audit (calculate residuals), Diagnose (generate Residual Cards), Synthesize Patch (δ), and Validate (test against the Acceptance Predicate). Step 3, Synthesis, is currently the primary bottleneck—it requires human analysts to interpret Residual Cards and manually draft precise YAML patches. This process is slow, requires expertise, and does not systematically explore the patch space.

SDP-Govynth introduces automated, governed patch synthesis by integrating an Optimizer LLM (potentially orchestrated by a framework like DSPy) to generate candidate patches from Residual Cards. Crucially, this automation lives strictly inside the CLOVER governance framework—the existing statistical guardrails (selective-inference validity, acceptance predicate, complexity constraints) ensure that automated improvements are real, interpretable, and safe.

11.1. The Bottleneck: Manual Patch Synthesis

In the base CLOVER loop, human experts must:

  1. Review Residual Cards (e.g., "Judge over-scores long responses with fake citations on medical tasks").
  2. Reason about which rubric component (obligation, guard, or abstention trigger) should change.
  3. Draft a structured YAML patch $\delta$ adhering to the Rubric Schema (Appendix B).
  4. Iterate if the patch fails the Acceptance Predicate on $\mathcal{D}_{\text{dev}}$.

This manual process limits iteration speed and systematic exploration. An automated system can search the patch space more efficiently while maintaining rigor through CLOVER's existing validation infrastructure.

11.2. The Opportunity: LLM-Driven Optimization

An Optimizer LLM can automate patch generation by reasoning over Residual Cards to propose candidate patches. Frameworks like DSPy (Khattab et al., 2024) excel at optimizing prompts against defined metrics—in this context, the task is to optimize the patch generation process against the CLOVER objective function.

Benefits

  • Scalability and speed: Automation enables rapid iteration and frequent judge improvements, reducing the time from failure detection to deployment.
  • Systematic search: An automated system can systematically explore the patch space (e.g., testing multiple obligation phrasings or guard thresholds) to find optimal improvements that humans might miss.
  • Reduced manual effort: Frees human experts to focus on higher-level tasks: defining the idealized target ($Y^*$, Layer 1), validating the Standard Deliberation Protocol (SDP, Layer 0), and conducting adversarial testing.

11.3. The Risks and CLOVER's Built-In Mitigation

The primary risks of automated prompt optimization are overfitting (exploiting noise in the development set) and loss of interpretability (opaque, brittle patches). Naive optimization can produce prompts that perform well on dev but fail in production.

Crucially, CLOVER was explicitly designed with the statistical guardrails necessary to make automated optimization safe and rigorous.

Risk 1: Overfitting to Dev Set

Threat: The optimizer aggressively finds patches that exploit noise in $\mathcal{D}_{\text{dev}}$, leading to false improvements that don't generalize.

CLOVER Mitigation: CLOVER enforces Selective-Inference-Valid Patch Selection (§5) using nested holdouts (Fit, Dev, Confirm). The optimizer only has access to $\mathcal{D}_{\text{dev}}$. A patch is accepted only if the improvement replicates on the time-separated $\mathcal{D}_{\text{confirm}}$ set. This provides valid Type-I error control regardless of the complexity of the search on Dev. The confirm split acts as an honest arbiter that has never been seen during optimization.

Risk 2: Loss of Interpretability and Stability

Threat: The optimizer produces opaque, large-scale prompt rewrites that are hard to understand, audit, or maintain.

CLOVER Mitigation: The optimization must be constrained:

  • Structured output schema: The Optimizer LLM must output YAML patches adhering to the Rubric Schema (Appendix B), not arbitrary text edits. This ensures patches are interpretable and versioned.
  • Complexity penalty $\kappa(\delta)$: CLOVER's complexity budget (e.g., max tokens added, max obligations changed) enforces interpretability and favors small, local perturbations over wholesale rewrites.
  • Obligation-first bias: The optimizer should prioritize adding missing welfare dimensions (new obligations) over tightening existing criteria (guard threshold changes), maintaining semantic clarity.
  • Hard stability constraints: The Acceptance Predicate includes blast-radius caps (KS distance, Anchor Stability §4.2) preventing large-scale rescalings, and Green-Slice Non-Inferiority ensuring patches don't degrade performance on well-calibrated, high-stakes slices.

Risk 3: Data Leakage and Memorization

Threat: The optimizer gains access to raw oracle labels $Y$ on $\mathcal{D}_{\text{dev}}$, enabling it to memorize specific examples rather than learn generalizable patterns.

CLOVER Mitigation: The optimizer must only access aggregated Residual Cards, not raw $(X_i, A_i, Y_i)$ tuples (§3.1). Residual Cards provide summary statistics (mean residuals, slice definitions, cardinal failure modes) without exposing individual labels, preventing overfitting to specific examples.

11.4. Proposed Architecture: The Two-Loop System

SDP-Govynth integrates automated optimization as an inner loop nested within the existing CLOVER governance framework (the outer loop):

Outer Loop (CLOVER): Statistical Arbiter

The outer loop manages data splits, calibration, and the final Acceptance Predicate. It has exclusive access to $\mathcal{D}_{\text{confirm}}$ and enforces selective-inference validity.

Responsibilities:

  • Partition data into $\mathcal{D}_{\text{fit}}$, $\mathcal{D}_{\text{dev}}$, $\mathcal{D}_{\text{confirm}}$
  • Calibrate baseline judge $\theta_0$ on $\mathcal{D}_{\text{fit}}$
  • Generate Residual Cards from $\mathcal{D}_{\text{dev}}$ residuals
  • Invoke inner loop (CLOVER-Synth) to generate candidate patch $\delta^*$
  • Test $\delta^*$ against full Acceptance Predicate on $\mathcal{D}_{\text{confirm}}$
  • Deploy $\theta' = \theta_0 \oplus \delta^*$ if accepted, reject otherwise
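
The first responsibility, partitioning, can be sketched as a split on timestamps so that the confirm set is strictly later than everything the optimizer sees. This is a minimal illustration: the `ts` field, the cutoff arguments, and the choice to separate fit and dev by time as well are assumptions, since the text above only requires the confirm set to be time-separated.

```python
from datetime import datetime, timedelta


def time_split(records, fit_end, dev_end):
    """Split records (dicts with a 'ts' datetime) into (fit, dev, confirm).

    Everything before `fit_end` is used for calibration, the window
    [fit_end, dev_end) for patch search, and everything from `dev_end`
    onward is held out for the confirm test.
    """
    fit = [r for r in records if r["ts"] < fit_end]
    dev = [r for r in records if fit_end <= r["ts"] < dev_end]
    confirm = [r for r in records if r["ts"] >= dev_end]
    return fit, dev, confirm


# Usage with hypothetical data: 90 daily records, 30 days per split.
start = datetime(2025, 1, 1)
records = [{"id": i, "ts": start + timedelta(days=i)} for i in range(90)]
fit, dev, confirm = time_split(
    records, start + timedelta(days=30), start + timedelta(days=60)
)
```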

Inner Loop (CLOVER-Synth): Optimizer

The inner loop uses an Optimizer LLM (potentially orchestrated by DSPy) to generate patches optimized against the dev-set objective. It has no access to $\mathcal{D}_{\text{confirm}}$.

Task signature:

Residual_Card → Patch_Delta

Optimization objective (on $\mathcal{D}_{\text{dev}}$ only):

$$
\begin{aligned}
\max_{\delta} \quad & -\text{RiskImprove}(\delta) \\
\text{s.t.} \quad & \text{Accept}_{\text{dev}}(\delta) = \text{True}
\end{aligned}
$$

Where $\text{RiskImprove}(\delta)$ is the risk-weighted MSE improvement (§4.1), signed so that negative values are gains (as in weighted_improve in Appendix C.4), and $\text{Accept}_{\text{dev}}$ enforces all guardrails on dev:

  • Complexity cap: $\kappa(\delta) \leq \kappa_{\max}$
  • Green-slice non-inferiority: $\Delta \text{MSE}_{\text{green}} \geq -\epsilon$
  • Blast-radius cap: $\text{KS}(R_{\theta_0}, R_{\theta'}) \leq \tau_{\text{KS}}$
  • Anchor stability: $|\Delta V(\pi_{\text{low}})|, |\Delta V(\pi_{\text{high}})| \leq \epsilon_{\text{anchor}}$
  • Anti-gaming: Pass adversarial robustness tests (§6)

Output: The best patch $\delta^*$ found on dev, subject to all constraints.
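
The dev-side guardrails can be read as a single boolean predicate. The sketch below is a hedged illustration rather than the reference implementation: `DevGates`, `accept_dev`, and the way inputs are passed in are hypothetical, the threshold defaults are borrowed from Appendix D, and the sign of `delta_mse_green` simply mirrors the inequality as written in the list above.

```python
from bisect import bisect_right
from dataclasses import dataclass


def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max over x of |F_a(x) - F_b(x)|."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:  # the supremum is attained at a sample point
        fa = bisect_right(a_sorted, x) / len(a_sorted)
        fb = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(fa - fb))
    return d


@dataclass(frozen=True)
class DevGates:
    kappa_max: float = 150     # complexity cap, in tokens added
    eps_green: float = 0.002   # green-slice tolerance (MSE scale)
    tau_ks: float = 0.05       # blast-radius cap
    eps_anchor: float = 0.01   # anchor drift cap


def accept_dev(kappa, delta_mse_green, r_old, r_new, anchor_drifts,
               passed_adversarial, gates=DevGates()):
    """Return True only if every dev-set guardrail holds."""
    return (
        kappa <= gates.kappa_max
        and delta_mse_green >= -gates.eps_green
        and ks_distance(r_old, r_new) <= gates.tau_ks
        and all(abs(d) <= gates.eps_anchor for d in anchor_drifts)
        and passed_adversarial
    )
```

Because the predicate is a conjunction, a patch that fails any single gate is excluded from the search regardless of how much it improves the objective.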

The Critical Guarantee

The inner loop can search arbitrarily aggressively over $\mathcal{D}_{\text{dev}}$ (trying thousands of candidate patches, using reinforcement learning, or leveraging multi-armed bandits) without compromising statistical validity. Because the confirm data play no part in choosing $\delta^*$, the outer loop's confirm-set validation provides an honest, selection-agnostic test that controls Type-I error regardless of the complexity of the dev-set search. This is the essence of selective inference: search freely, validate honestly.
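
A small simulation makes the guarantee concrete. It is a toy model under stated assumptions, not part of CLOVER: every candidate patch is null (true ΔMSE of zero), test statistics are independent standard normals, and acceptance uses a one-sided 5% critical value of -1.645. The function name and parameters are hypothetical.

```python
import random


def false_accept_rates(n_reps=2000, n_candidates=200, z_crit=-1.645, seed=0):
    """Estimate false-acceptance rates when all candidate patches are null.

    Returns (rate if the best dev statistic is trusted directly,
             rate if the selected patch is re-tested once on confirm).
    """
    rng = random.Random(seed)
    dev_accepts = 0
    confirm_accepts = 0
    for _ in range(n_reps):
        # Inner loop: search many null candidates and keep the best-looking one.
        best_dev_z = min(rng.gauss(0.0, 1.0) for _ in range(n_candidates))
        # Outer loop: the selected patch gets one fresh, independent confirm statistic.
        confirm_z = rng.gauss(0.0, 1.0)
        dev_accepts += best_dev_z < z_crit
        confirm_accepts += confirm_z < z_crit
    return dev_accepts / n_reps, confirm_accepts / n_reps


dev_rate, confirm_rate = false_accept_rates()
```

Under these assumptions the dev-only rate approaches 1 as the search widens, while the confirm rate stays near the nominal 5% no matter how many candidates the inner loop tried.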

11.5. Implementation Sketch: DSPy Integration

DSPy (Khattab et al., 2024) provides a natural framework for implementing CLOVER-Synth:

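# Pseudocode: CLOVER-Synth with DSPy
# (parse_yaml_patch, validate_patch_schema, clover_dev_objective,
#  accept_on_confirm, deploy, and reject are project-specific placeholders)
import dspy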

class PatchGenerator(dspy.Signature):
    """Generate a YAML patch to improve judge calibration."""
    residual_card = dspy.InputField(desc="Structured summary of systematic judge failures")
    current_rubric = dspy.InputField(desc="Current judge rubric (YAML)")
    patch_delta = dspy.OutputField(desc="Proposed YAML patch (must adhere to schema)")

class CLOVERSynth(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_patch = dspy.ChainOfThought(PatchGenerator)

    def forward(self, residual_card, current_rubric):
        # Generate candidate patch
        patch = self.generate_patch(
            residual_card=residual_card,
            current_rubric=current_rubric
        ).patch_delta

        # Parse and validate schema; raise instead of assert so the check
        # is not stripped when Python runs with -O
        patch_obj = parse_yaml_patch(patch)
        if not validate_patch_schema(patch_obj):
            raise ValueError("Patch violates schema")

        return patch_obj

# Optimization loop (inner loop, on D_dev only)
optimizer = dspy.BootstrapFewShot(metric=clover_dev_objective)
optimized_synth = optimizer.compile(
    CLOVERSynth(),
    trainset=dev_residual_cards
)

# Generate best patch on dev
best_patch = optimized_synth(
    residual_card=current_residual_card,
    current_rubric=theta_0
)

# Outer loop: Test on confirm (CLOVER's honest arbiter)
if accept_on_confirm(best_patch, D_confirm):
    # theta' = theta_0 ⊕ delta*: apply the patch rather than adding objects with "+"
    deploy(apply_patch(theta_0, best_patch))
else:
    reject(best_patch)

The clover_dev_objective metric evaluates patches on $\mathcal{D}_{\text{dev}}$ against the full set of dev-set constraints (complexity, green-slice non-inferiority, blast-radius, etc.). DSPy's optimizer searches the space of patch generators to maximize this objective, but the final deployment decision rests with the outer loop's confirm-set test.

11.6. When to Use CLOVER-Synth

| Scenario | Manual CLOVER | CLOVER-Synth |
|---|---|---|
| Initial rubric design (defining obligations from scratch) | ✓ Preferred (human semantic design) | ✗ Not recommended |
| Iterative refinement (5-10 patch cycles, systematic failures) | ○ Workable but slow | ✓ Ideal use case |
| High-stakes, novel domains (medical, legal, safety-critical) | ✓ Preferred (human oversight critical) | ○ Use with expert review of all patches |
| Rapid deployment cycles (weekly judge updates, mature rubrics) | ✗ Bottleneck | ✓ Enables fast iteration |

11.7. Summary: Why CLOVER Enables Safe Automation

CLOVER-Synth is a natural and safe extension because CLOVER was designed from the start to support automated optimization:

  • Selective-inference validity: The confirm split provides honest Type-I error control regardless of dev-set search complexity.
  • Input/output constraints: Structured Residual Cards (input) and YAML schema (output) ensure interpretability and prevent data leakage.
  • Multi-dimensional acceptance predicate: Complexity penalties, green-slice non-inferiority, blast-radius caps, and anti-gaming tests prevent the optimizer from finding shallow improvements.
  • Explicit versioning and rollback: Every patch is logged with its Residual Card, dev/confirm metrics, and timestamp, enabling audits and rollbacks if production performance degrades.

The Key Insight

Automated prompt optimization is risky when done naively (overfitting, gaming, brittleness). But when nested inside a governed statistical framework with honest holdouts, hard constraints, and explicit versioning, it becomes a powerful tool for scaling systematic improvement. CLOVER provides that framework. CLOVER-Synth is what happens when you take both the optimization and the statistics seriously.

Appendix A. Metrics & Test Statistics

  • MSE & RMSE: $\mathrm{MSE} = \mathbb{E}[(Y - R_\theta)^2]$, $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$. For acceptance gates, use ΔMSE (additive, stable under aggregation) rather than ΔRMSE.
  • ECE (Expected Calibration Error): Partition predictions $R_\theta$ into $B = 10$ equal‑frequency bins. For bin $b$, compute mean prediction $\bar{R}_b = \frac{1}{n_b} \sum_{i \in b} R_{\theta,i}$ and mean outcome $\bar{Y}_b = \frac{1}{n_b} \sum_{i \in b} Y_i$. Then:
    $$\mathrm{ECE} = \frac{1}{B} \sum_{b=1}^B \big|\bar{Y}_b - \bar{R}_b\big|$$
    Use the same bin boundaries (determined on the baseline) when computing ΔECE for patches. Alternatives: adaptive isotonic ECE or calibration slope.
  • Residual flatness: For pre‑registered slices $g$, report $\bar{\varepsilon}_g$ with CIs and BH‑adjusted $p$-values; accept only if all CIs overlap 0 post‑patch.
  • Confirm test: One‑sided $z$ or paired bootstrap on $\Delta \mathrm{MSE}_{\text{confirm}}$.
  • Transport: Per‑group residual tests; Prentice‑style sufficiency checks by regressing $Y$ on $(X, A, S_\theta, G)$ and testing $G$ terms.
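
The ECE definition above can be computed directly. The following is a minimal sketch of the equal-frequency version only: it re-bins on whatever predictions it is given, so reusing baseline bin boundaries for ΔECE would need an extra argument that is not shown here. The function name is hypothetical.

```python
def ece_equal_frequency(y, r, n_bins=10):
    """Expected calibration error with B equal-frequency bins over predictions r.

    ECE = (1/B) * sum_b |mean(Y in bin b) - mean(R in bin b)|
    """
    if len(y) != len(r):
        raise ValueError("y and r must have the same length")
    if len(r) < n_bins:
        raise ValueError("need at least one observation per bin")

    order = sorted(range(len(r)), key=lambda i: r[i])  # rank by prediction
    total = 0.0
    for b in range(n_bins):
        lo = b * len(order) // n_bins
        hi = (b + 1) * len(order) // n_bins
        idx = order[lo:hi]  # contiguous rank block, so bin sizes differ by at most 1
        mean_r = sum(r[i] for i in idx) / len(idx)
        mean_y = sum(y[i] for i in idx) / len(idx)
        total += abs(mean_y - mean_r)
    return total / n_bins
```

As a sanity check, predictions that equal the outcomes give an ECE of zero, and a constant offset of 0.1 between outcomes and predictions gives an ECE of 0.1.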

Appendix B. Minimal Rubric Obligation Schema & Patch Delta

Rubric (minimal)

rubric:
  version: 1.0
  target: "Y"           # or "Y_star"
  sdp_version: "v1.0_2025-11-11"
  obligations:
    factual_adequacy:
      required: true
      verification:
        allowed_domains: ["nih.gov","who.int","cochrane.org","nejm.org","thelancet.com","nature.com"]
        mismatch_penalty: 0.35
    reasoning_quality:
      requires_counter_position: true
    risk_accounting:
      requires_stakeholders: true
    usefulness:
      be_concise_by_default: true
  guards:
    length_bias_cap:
      if_tokens_gt: 500
      max_raw_score: 0.5
    tool_trace_consistency: true
    confident_but_wrong_max: 0.4
  abstention:
    triggers: ["medical dosing","legal advice"]
    rule: "if critical info missing -> abstain or escalate"

Patch delta (example)

patch:
  family: "evidence_verification"
  id: "len-cite-guard-001"
  rationale: "Long medical answers with unverifiable cites are over-scored."
  changes:
    guards.length_bias_cap:
      if_tokens_gt: 500
      max_raw_score: 0.5
    obligations.factual_adequacy.verification:
      allowed_domains+: ["bmj.com"]
      mismatch_penalty: 0.35
  constraints:
    max_tokens_added: 150
    affects_rescore: true
  expected_effects:
    slices:
      - domain: "medical"
        length_bin: "600-900"
        residual_mean_delta: -0.06
    anti_gaming:
      length_padding_uplift: "<=0.02"
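
To show how a delta like this might be merged into a rubric, here is a small sketch operating on plain dictionaries. The merge semantics are an assumption made for illustration: keys under `changes` are treated as dotted paths, a trailing `+` on a field name is read as "append to the existing list", and anything else overwrites. The schema itself does not spell these rules out, and `apply_patch_changes` is a hypothetical helper.

```python
import copy


def apply_patch_changes(rubric: dict, changes: dict) -> dict:
    """Return a patched deep copy of `rubric`; the input is not modified."""
    patched = copy.deepcopy(rubric)
    for dotted_path, fields in changes.items():
        node = patched
        for key in dotted_path.split("."):
            node = node.setdefault(key, {})  # create missing intermediate sections
        for field, value in fields.items():
            if field.endswith("+"):
                node.setdefault(field[:-1], []).extend(value)  # append semantics
            else:
                node[field] = value  # overwrite semantics
    return patched


# Usage with a trimmed-down rubric (values here are hypothetical).
rubric = {
    "guards": {},
    "obligations": {
        "factual_adequacy": {
            "verification": {"allowed_domains": ["nih.gov"], "mismatch_penalty": 0.30}
        }
    },
}
changes = {
    "guards.length_bias_cap": {"if_tokens_gt": 500, "max_raw_score": 0.5},
    "obligations.factual_adequacy.verification": {
        "allowed_domains+": ["bmj.com"],
        "mismatch_penalty": 0.35,
    },
}
patched = apply_patch_changes(rubric, changes)
```

Returning a copy keeps the pre-patch rubric available for rollback and for computing the baseline scores against which the patch is judged.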

Appendix C. Algorithms

C.1 CLOVER‑J — Judge Closed Loop

Inputs. θ₀, logs D, oracle I_oracle; K=5, BH‑q=0.10, patience=2; confirm set time‑separated.

  1. Score: S_θ ← J_θ(X,A)
  2. Calibrate: Cross‑fit f_θ (two‑stage→isotonic) → R_θ
  3. Residuals: ε on oracle folds; build Residual Cards for significant slices
  4. Synthesize patches (LLM, constrained): ≤1 per family
  5. Evaluate on dev: ΔRMSE/ECE; residual flatness; transport; anti‑gaming; OUA; complexity
  6. Confirm: Recompute ΔRMSE on confirm with calibrators fit on fit only; require ΔRMSE < 0
  7. Select: Accept if all gates pass; else increment patience
  8. Stop: If patience ≥ 2 or no significant residual structure remains
  9. Version & report
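
The confirm step can be made concrete with the one-sided paired z-test that Appendix A lists for ΔMSE on the confirm set. This is a sketch under assumptions: the function name, the `alpha=0.05` level, and the normal approximation to the mean of paired squared-error differences are choices made for the example.

```python
from math import erf, sqrt
from statistics import mean, stdev


def confirm_z_test(y, r_old, r_new, alpha=0.05):
    """One-sided paired z-test of H0: ΔMSE >= 0 against H1: ΔMSE < 0.

    ΔMSE is the mean of per-example differences (y - r_new)^2 - (y - r_old)^2,
    so negative values mean the patched judge is better calibrated.
    Returns (delta_mse, z, p_value, accepted).
    """
    d = [(yi - rn) ** 2 - (yi - ro) ** 2 for yi, ro, rn in zip(y, r_old, r_new)]
    delta_mse = mean(d)
    se = stdev(d) / sqrt(len(d))
    if se == 0.0:
        raise ValueError("paired differences have zero variance; z is undefined")
    z = delta_mse / se
    p_value = 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal CDF at z
    return delta_mse, z, p_value, p_value < alpha
```

Pairing on the same confirm examples removes between-example variance from the comparison, which is why the test works on differences rather than on two separate MSE estimates.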

C.2 CLOVER‑A — Active Adversary

  • Operator set Ω: padding, rubric mimicry, fact flips, fake citations, style stripping, fabricated tool traces, contradiction injections.
  • Search: evolutionary loop with selection on $R_\theta$ and constraint $(X^\omega, A^\omega) \in \mathcal{V}(\theta)$.
  • Outputs: adversarial exemplars and estimated $\Delta_{\text{adv}}$.
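
A toy version of the search loop is sketched below. Everything in it is a stand-in chosen for illustration: the judge is a made-up length-biased scorer, the two operators are simplified versions of padding and style stripping, and `is_valid` plays the role of the validity set $\mathcal{V}(\theta)$.

```python
import random


def adversarial_search(base, score, operators, is_valid,
                       generations=5, population=8, seed=0):
    """Evolutionary search for inputs that raise `score` while staying valid.

    Returns (best exemplar, estimated uplift over the base input).
    """
    rng = random.Random(seed)
    pool = [base]
    for _ in range(generations):
        # Mutate random survivors with random operators from the operator set.
        children = [rng.choice(operators)(rng.choice(pool)) for _ in range(population)]
        candidates = set(pool) | {c for c in children if is_valid(c)}
        # Keep the highest-scoring candidates; ties broken by text for determinism.
        pool = sorted(candidates, key=lambda c: (-score(c), c))[:population]
    best = pool[0]
    return best, score(best) - score(base)


# Hypothetical stand-ins for the judge, the operators, and the validity check.
def toy_judge(text):
    return min(1.0, 0.3 + 0.001 * len(text))  # rewards length: a gameable judge


def pad(text):
    return text + " Furthermore, it is worth noting this point."


def strip_style(text):
    return text.replace("!", ".")


def is_valid(text):
    return len(text) <= 400


best, uplift = adversarial_search("The dose is 5 mg.", toy_judge,
                                  [pad, strip_style], is_valid)
```

The returned uplift is a lower bound on what a stronger attacker could achieve, so in practice it would be compared against the per-attack cap rather than read as a guarantee.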

C.3 CLOVER‑G — Cautious Generator Optimization

  • Freeze $J_\theta, f_\theta$.
  • Optimize generator prompts/procedures for $k$ steps on $\mathrm{LCB}_\alpha(\mathbb{E}[R_\theta]) - \gamma \cdot \mathrm{cost}$.
  • Validate on fresh $Y$/$Y^*$ via small A/B; re‑run transport and anti‑gaming; rollback on regressions.
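
The cautious objective can be sketched with a normal-approximation lower confidence bound. The defaults `alpha=0.05` and `gamma=0.1`, the function name, and the use of a simple standard-error bound (which ignores oracle uncertainty) are assumptions for this example.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def lcb_objective(rewards, cost, alpha=0.05, gamma=0.1):
    """LCB_alpha(E[R]) - gamma * cost for one generator candidate.

    `rewards` are calibrated rewards R_theta on a sample of that candidate's
    outputs; the lower bound penalizes candidates whose mean is uncertain.
    """
    z = NormalDist().inv_cdf(1.0 - alpha)
    lcb = mean(rewards) - z * stdev(rewards) / sqrt(len(rewards))
    return lcb - gamma * cost


# Two hypothetical candidates with the same mean reward but different spread.
stable = [0.70, 0.71, 0.69, 0.70, 0.70]
noisy = [0.40, 1.00, 0.50, 0.90, 0.70]
```

With equal means and equal cost, the objective prefers `stable` over `noisy`, which is the intended behavior: the generator is not rewarded for gains that could be noise.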

C.4 Skeleton Pseudocode (Drop‑In for v1.1)

Below is a complete, balanced CLOVER iteration loop incorporating all v1.1 improvements (risk/green classification, balanced objectives, distribution shift caps, anchor stability, partial pooling).

def clover_iteration(theta, logs, oracle, anchors, slices, params):
    """
    Single iteration of CLOVER v1.1 with balanced residual improvement.

    Args:
        theta: current rubric (YAML/dict)
        logs: evaluation set (X, A pairs)
        oracle: oracle subset with Y labels (dev + confirm splits)
        anchors: reference policies (pi_low, pi_high) for stability checks
        slices: pre-registered slice family (domain × difficulty × length)
        params: hyperparameters (η, τ_g, KS_cap, etc.)

    Returns:
        accepted_patches: list of (δ, Improve, κ) tuples, or "Pause for labels"
    """

    # 1) Score once (deterministic, cached)
    S = score_deterministic(theta, logs)  # cache raw scores

    # 2) Calibrate (K-fold cross-fit, two-stage → isotonic)
    R, oua_share = crossfit_isotonic(S, oracle, k=params.K)

    # OUA gating: if OUA high and no clear gain, pause for labels
    if oua_share >= params.oua_threshold and not promising_gain_estimate():
        return "Pause for labels"

    # 3) Residuals & slice classification (partial pooling DEFAULT)
    resid = oracle.Y - R.oof
    resid_shrunk = partial_pool(resid, slices)  # hierarchical model or James-Stein

    # Classify slices (freeze for this iteration):
    #   Risk (R): CI excludes 0 AND |ε̄_g| ≥ 0.03
    #   Green (G): high exposure/stakes with good calibration
    #   Neutral (N): everything else
    risk, green, neutral = classify_slices(resid_shrunk, slices, params)

    # 4) Propose tiny obligation-first patches (≤1 per family)
    families = ["verify", "contradiction", "length", "abstain"]
    candidates = propose_patches(theta, families, complexity_cap=params.tau)

    accepted = []
    for δ in candidates:
        θp = apply_patch(theta, δ)
        Sp = maybe_rescore(θp, logs, cache=S)  # only if patch requires new evidence
        Rp, _ = crossfit_isotonic(Sp, oracle, k=params.K)

        # --- Hard gates (all must pass on DEV) ---

        # Global calibration non-worse
        if delta_mse_global(R.dev, Rp.dev, oracle.Y.dev) > 0:
            continue
        if delta_ece(R.dev, Rp.dev, oracle.Y.dev) > 0:
            continue

        # Distribution shift cap (KS ≤ 0.05 or Wasserstein ≤ 0.02)
        if ks_distance(Rp.all, R.all) > params.ks_cap:
            continue

        # Anchor stability (drift ≤ 0.01)
        if not anchors_stable(R, Rp, anchors, eps=params.anchor_eps):
            continue

        # Transport OK (groupwise residual means ≈ 0 in target environments)
        if not transport_ok(Rp, oracle, environments=params.envs):
            continue

        # Anti-gaming (uplift ≤ 0.05 per attack)
        if not anti_gaming_ok(θp, params.attack_suite, uplift_cap=0.05):
            continue

        # Green non-inferiority (one-sided test: Δ MSE_g < τ_g for all g ∈ G)
        if not non_inferiority_green(R.dev, Rp.dev, green, params.tau_g):
            continue

        # Risk-weighted improvement (Improve ≤ -η)
        improve_dev = weighted_improve(R.dev, Rp.dev, risk, params.w, params.u)
        if improve_dev > -params.eta:
            continue

        # --- CONFIRM replication (all checks on time-separated holdout) ---

        if not non_inferiority_green(R.conf, Rp.conf, green, params.tau_g):
            continue

        improve_conf = weighted_improve(R.conf, Rp.conf, risk, params.w, params.u)
        if improve_conf > -params.eta:
            continue

        # All gates passed!
        accepted.append((δ, improve_conf, complexity(δ)))

    # 5) Select top patches (≤ 2)
    # Tie-break: smallest complexity κ(δ), smallest KS shift
    accepted = sorted(accepted, key=lambda x: (x[1], x[2]))[:params.max_patches]

    return accepted


# --- Helper functions (minimal pseudocode) ---

def partial_pool(resid, slices):
    """Fit hierarchical model: ε ~ α_G + u_g + covariates + η.
    Returns shrunken residual estimates ε̃_g."""
    # Use lme4, brms, or James-Stein shrinkage
    pass

def classify_slices(resid, slices, params):
    """Returns (risk, green, neutral) slice sets."""
    risk = {g for g in slices if ci_excludes_zero(resid[g]) and abs(mean(resid[g])) >= 0.03}
    green = {g for g in slices if (exposure(g) >= 0.05 or stakes(g) >= 0.7)
             and abs(mean(resid[g])) < 0.03}
    neutral = set(slices) - risk - green
    return risk, green, neutral

def weighted_improve(R, Rp, risk, w, u):
    """Σ_{g∈R} w_g u_g · (MSE_g(Rp) - MSE_g(R))."""
    return sum(w[g] * u[g] * (mse(Rp[g]) - mse(R[g])) for g in risk)

def non_inferiority_green(R, Rp, green, tau_g):
    """One-sided test: Δ MSE_g < τ_g for all g ∈ G (BH-corrected)."""
    pvals = [one_sided_test(mse(Rp[g]) - mse(R[g]), tau_g[g]) for g in green]
    return all_pass_bh(pvals, q=0.10)

def anchors_stable(R, Rp, anchors, eps=0.01):
    """Check |E[Rp | π_low] - E[R | π_low]| ≤ eps (same for π_high)."""
    drift_low  = abs(mean(Rp[anchors.pi_low])  - mean(R[anchors.pi_low]))
    drift_high = abs(mean(Rp[anchors.pi_high]) - mean(R[anchors.pi_high]))
    return drift_low <= eps and drift_high <= eps

Usage: This skeleton is a direct drop‑in for v1.1. Replace crossfit_isotonic, partial_pool, and anti_gaming_ok with your calibrator, pooling estimator, and adversarial test suite, respectively. The logic preserves all v1.1 guarantees: risk improvement subject to green non‑inferiority, distribution shift caps, anchor stability, and confirm replication.
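
The skeleton calls `all_pass_bh` without defining it. One plausible reading, sketched here as an assumption, is a Benjamini-Hochberg step-up procedure in which each p-value tests the null "slice g is degraded by at least τ_g", and the gate passes only when every null is rejected.

```python
def bh_reject(pvals, q=0.10):
    """Benjamini-Hochberg step-up: return a rejection flag for each p-value."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k_max = rank  # largest rank whose p-value clears its threshold
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


def all_pass_bh(pvals, q=0.10):
    """True when every hypothesis is rejected (vacuously true with no slices)."""
    return all(bh_reject(pvals, q))
```

Requiring all rejections is equivalent to requiring the largest p-value to be at most q, so under this reading the gate is as strict as testing the worst green slice at level q.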

Appendix D. Default Parameters (v1.1)

| Parameter | Symbol / Name | Default Value (v1.1) |
|---|---|---|
| Cross‑fit folds | K | 5 |
| Slice count | Pre‑registered slices | ≤ 12 (domain × difficulty × length) |
| Risk threshold | $\lvert\bar{\varepsilon}_g\rvert$ | ≥ 0.03 (material miscalibration) |
| Green tolerance (MSE scale) | τ_g | 1–2 × 10⁻³ (0 for must‑not‑regress) |
| Required improvement (MSE scale) | η | 0.0005–0.0025 (≈ 0.02–0.05 RMSE on [0,1]) |
| KS shift cap | D_KS | ≤ 0.05 |
| Wasserstein shift cap | W | ≤ 0.02 |
| Anchor drift cap | $\Delta\,\mathbb{E}[R \mid \pi]$ | ≤ 0.01 (for π_low, π_high) |
| Anti‑gaming uplift | Per attack | ≤ 0.05 (calibrated scale) |
| Patch size | κ(δ) | ≤ 150 tokens or ≤ 2 new fields |
| Patch budget | Per iteration | ≤ 2 accepted patches |
| Confirm peek budget | Per quarter | ≤ 3 peeks |
| OUA gating threshold | OUA_share | ≥ 0.30 → pause for labels |
| FDR control | q | 0.10 (BH correction within slice families) |
| Partial pooling | Default estimator | Hierarchical model (or James–Stein if unavailable; min n_g ≥ 40 fallback) |

Tuning guidance: These defaults are conservative and suitable for most applications. Increase $\eta$ (required improvement) if you want to filter out tiny gains; decrease $\tau_g$ (green tolerance) for critical slices you cannot afford to degrade. Tighten KS/Wasserstein caps if interpretability across versions is paramount.
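
For implementers, the defaults can be carried in a small immutable config object so that tuned variants are explicit and versionable. The class and field names below are hypothetical; where the table gives a range (τ_g, η), the lower end is used as an arbitrary choice for the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CloverDefaults:
    k_folds: int = 5
    max_slices: int = 12
    risk_threshold: float = 0.03        # |mean residual| that marks a risk slice
    tau_g: float = 1e-3                 # green tolerance, lower end of 1-2e-3
    eta: float = 0.0005                 # required improvement, lower end of range
    ks_cap: float = 0.05
    wasserstein_cap: float = 0.02
    anchor_eps: float = 0.01
    uplift_cap: float = 0.05
    max_patch_tokens: int = 150
    max_patches: int = 2
    confirm_peeks_per_quarter: int = 3
    oua_threshold: float = 0.30
    bh_q: float = 0.10
    min_slice_n: int = 40


defaults = CloverDefaults()

# A stricter variant following the tuning guidance: no regression allowed on
# green slices and a tighter blast-radius cap. The specific numbers are examples.
strict = replace(defaults, tau_g=0.0, ks_cap=0.03)
```

Freezing the dataclass means a tuned configuration has to be created as a new object, which fits the framework's emphasis on explicit versioning.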

Assumptions Ledger

| Code | Statement | Used by | Test/Diagnostic | Mitigation |
|---|---|---|---|---|
| J1 | Programmable evidence $E_\theta$ (more info ↓ Bayes risk) | Design | Residual ↓ after adding evidence checks | Add obligations, not weights |
| J2 | Deterministic scoring (temp=0, fixed tools) | Score‑once | Repeat scoring → identical $S_\theta$ | Cache $S_\theta$; version model/prompt |
| S1 | $\exists f_\theta$: $\mathbb{E}[Y \mid X, A, S_\theta] = f_\theta$ | All | Prentice‑style sufficiency on oracle | Enrich surrogates; add covariates |
| S2 | $Y \perp \mathrm{Sel} \mid X, A, S_\theta$ | Cross‑env | Groupwise residual tests | Local $f_\theta$ or re‑prompt |
| L1 | Oracle MAR | Calibration | Label process audit | Randomize oracle sampling; stratify |
| L2 | Oracle positivity | Calibration | Coverage plots; tail check | Targeted labeling in tails |
| H1 | Nested holdout confirm | Patch selection | Dev vs confirm delta | Reject non‑replicated patches |
| OPE | Overlap (ESS) | IPS/DR | ESS, max/median w | Collect draws; Direct/DR; stabilize |
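
The two overlap diagnostics named in the OPE row are simple to compute from importance weights. The sketch below uses the standard Kish effective sample size, $(\sum_i w_i)^2 / \sum_i w_i^2$; the function names are hypothetical and no acceptance thresholds are implied.

```python
from statistics import median


def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    if s2 == 0.0:
        raise ValueError("all weights are zero")
    return s1 * s1 / s2


def max_over_median(weights):
    """Ratio of the largest weight to the median weight (inf if the median is 0)."""
    med = median(weights)
    return float("inf") if med == 0 else max(weights) / med
```

Uniform weights give an ESS equal to the sample size, while a single dominant weight drives the ESS toward 1, which is the signal to collect more draws or fall back to Direct/DR estimation.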

References

[1] Blackwell, D. (1951/1953). Comparison/Equivalence of experiments. Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability.
[2] Dudík, M., Langford, J., & Li, L. (2014). Doubly robust policy evaluation and learning. International Conference on Machine Learning (ICML).
[3] Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68.
[4] Efron, B., & Tibshirani, R. J. (1994). An Introduction to the Bootstrap. Chapman and Hall/CRC.
[5] Kallus, N., & Mao, X. (2020). On the role of surrogates in the efficient estimation of treatment effects with limited outcome data. arXiv:2003.12408.
[6] Pearl, J., & Bareinboim, E. (2014). External validity: From do-calculus to transportability across populations. Statistical Science, 29(4), 579–595.

Summary (for implementers)

  • Treat the judge as a programmable measurement channel; improve obligations before weights.
  • Use cross‑fit monotone calibration to get $R_\theta$, and nested holdouts to accept patches.
  • Require replicated calibration gains on a time‑separated confirm set, transport pass, and bounded adversarial uplift.
  • Report OUA‑augmented CIs, negative segments, and maintain strict versioning.
  • Only then consider optimizing generation against $R_\theta$, with LCB objectives and external $Y$ checks.

We welcome your feedback

CLOVER is an active research framework. We invite constructive criticism from practitioners and researchers.

If you spot errors, have theoretical extensions, or have applied CLOVER in production and want to share lessons, please let us know or email eddie@cimolabs.com.