CLOVER: Closed-Loop Optimization of Programmable Judges

Abstract

We formalize CLOVER, a governed procedure for improving LLM judges treated as programmable surrogates. CLOVER (i) calibrates raw judge scores to an operational welfare target $Y$ , (ii) audits residuals to detect systematic misscoring, (iii) proposes small, structured rubric patches, and (iv) accepts a patch only if it improves calibration on a time‑separated confirm holdout while passing transport and anti‑gaming constraints. We give identification results for using calibrated rewards in Direct/IPS/DR policy evaluation, derive an oracle‑uncertainty‑aware (OUA) variance decomposition, formalize the patch family selection problem with selective‑inference control via nested sample splitting, and specify an active adversarial searcher as a worst‑case uplift bound. The framework is designed for "score once, calibrate many" with explicit versioning and an assumptions ledger.

Scope: CLOVER vs. SDP-Gov

CLOVER (this appendix) governs the calibration of judges (the mapping $S \to Y$ ): improving judge rubrics to better predict operational welfare labels while maintaining calibration quality, transportability, and resistance to gaming.

SDP-Gov (Layer 0) governs the Standard Deliberation Protocol (SDP) itself (the mapping $Y \to Y^*$ ): ensuring that operational welfare labels $Y$ align with true idealized welfare $Y^*$ via empirical validation (PTE against long-run outcomes), construct validity audits, and stability checks. See Validating the Bridge Assumption (A0) for the complete SDP-Gov framework.

0. Notation & Objects

Contexts & actions. $X \in \mathcal{X}$ , $A \in \mathcal{A}$ . A policy $\pi$ maps $x \mapsto \pi(\cdot \mid x)$ .
Target welfare. $Y \in [0,1]$ is the operational welfare label collected under a fixed Standard Deliberation Protocol (SDP). Optionally, $Y^*$ denotes an idealized target; replace $Y$ with $Y^*$ where relevant.
Judge. A rubric/prompt $\theta \in \Theta$ parameterizes a judge $J_\theta: \mathcal{X} \times \mathcal{A} \to \mathbb{R}^d$ , with output $S_\theta = J_\theta(X,A)$ .
Calibrator. A function $f_\theta: \mathbb{R}^d \times \mathcal{X} \to [0,1]$ mapping $(S_\theta, X) \mapsto R_\theta$ , where $R_\theta = f_\theta(S_\theta, X)$ is the calibrated reward.
Logs & oracle. Logged data $\mathcal{D} = \{(X_i, A_i)\}_{i=1}^n$ from $\pi_0$ . A subset $I_{\text{oracle}} \subset \{1,\ldots,n\}$ carries welfare labels $\{Y_i\}$ ; coverage $\rho = |I_{\text{oracle}}|/n$ .
Residuals. $\varepsilon_i = Y_i - R_{\theta,i}$ for $i \in I_{\text{oracle}}$ .
Estimand. Policy value $V(\pi) = \mathbb{E}_X \mathbb{E}_{A \sim \pi(\cdot \mid X)}[Y \mid X,A]$ .

Throughout, expectations are w.r.t. the relevant data‑generating distributions; measurability and boundedness of $Y$ are assumed.

1. Measurement Model & Information Ordering

Assumption J1 (Programmable channel)

A rubric $\theta$ specifies an evidence set $E_\theta$ and induces a σ-field $\mathcal{F}(\theta)$ such that $S_\theta$ is $\mathcal{F}(\theta)$ -measurable.

Lemma 1 (Informativeness monotonicity; Blackwell/Doob)

If $\mathcal{F}(\theta) \subseteq \mathcal{F}(\theta')$ then

\mathbb{E}\!\left[\big(Y - \mathbb{E}[Y \mid \mathcal{F}(\theta')]\big)^2\right] \le \mathbb{E}\!\left[\big(Y - \mathbb{E}[Y \mid \mathcal{F}(\theta)]\big)^2\right]

Proof sketch. Conditional expectation is the $L^2$ projection; enlarging the σ-field cannot increase squared error. □

Implication. Enriching obligations (evidence to be checked) weakly reduces Bayes risk for predicting $Y$ ; weight tweaks alone cannot guarantee this.
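A small numerical check of Lemma 1, as a sketch under stated assumptions: the toy distribution below (two uniform binary evidence bits and a hand-picked $Y$) is purely illustrative and not from the source. The coarse rubric observes only the first bit; the finer rubric observes both, so its σ-field contains the coarse one.

```python
from itertools import product

# Toy model (illustrative assumption): evidence bits (e1, e2) are uniform on {0,1}^2
# and Y is a deterministic function of both bits, with values in [0, 1].
outcomes = [((e1, e2), 0.2 + 0.5 * e1 + 0.3 * e1 * e2)
            for e1, e2 in product([0, 1], repeat=2)]

def bayes_mse(observe):
    """MSE of the conditional mean E[Y | observe(e)] under the uniform toy distribution."""
    groups = {}
    for e, y in outcomes:
        groups.setdefault(observe(e), []).append(y)
    cond_mean = {key: sum(ys) / len(ys) for key, ys in groups.items()}
    return sum((y - cond_mean[observe(e)]) ** 2 for e, y in outcomes) / len(outcomes)

mse_coarse = bayes_mse(lambda e: e[0])  # rubric theta checks only e1
mse_fine = bayes_mse(lambda e: e)       # rubric theta' checks (e1, e2)
```

In this toy case the finer rubric cannot do worse than the coarse one, matching the inequality in Lemma 1.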

1.1 Design Philosophy: Obligation‑First, Bounded Risk

CLOVER v1.1 adopts an obligation‑first design philosophy to ensure improvements are interpretable, generalizable, and safe:

Prefer obligations over weights. When residuals reveal failures, first ask "what evidence should we check?" not "how should we reweight?" Obligation edits (e.g., verify citations, check for contradictions) tend to generalize and are less likely to degrade well‑calibrated regions.
Partial pooling by default. Use hierarchical/shrinkage estimation for slice residuals to borrow strength across related groups and avoid chasing noise in small cells.
Balanced objectives. Optimize a constrained problem: improve risk slices subject to non‑inferiority on green slices (high‑exposure or high‑stakes regions with good calibration).
Bound the blast radius. Cap distributional shift (KS, Wasserstein) and anchor drift so patches remain small, local perturbations rather than wholesale rescalings.
Confirm on time‑separated data. Require dev gains to replicate on a confirm holdout collected after the dev set to control selective‑inference risk.

Core principle: Fix what's broken without breaking what works. Residuals are a discovery tool, not the optimization objective.

2. Surrogacy, Transport, and Identification with Calibrated Rewards

Assumption S1 (Surrogacy sufficiency on support)

On $\mathrm{supp}(\pi_0 \cup \Pi_{\text{eval}})$ ,

\mathbb{E}\big[Y \mid X,A,S_\theta\big] = f_\theta(S_\theta, X) \quad \text{a.s.}

Assumption S2 (Transport / S‑admissibility)

Across admissible environments $g \in \mathcal{G}$ (policy, time, cohort), with selection nodes $\mathrm{Sel}$ ,

Y \perp\!\!\!\perp \mathrm{Sel} \mid X, A, S_\theta

When S2 holds, the same $f_\theta$ transports across $g$ .

Proposition 2.1 (Direct identification)

Under S1 (and MAR/positivity for oracle learning of $f_\theta$ ), for any $\pi$ ,

V(\pi) = \mathbb{E}_X\,\mathbb{E}_{A \sim \pi(\cdot \mid X)}\!\left[\,f_\theta(S_\theta, X)\,\right], \quad S_\theta = J_\theta(X, A)

Operational: Draw fresh contexts $X_i$ , sample actions $A_i \sim \pi(\cdot \mid X_i)$ , score $S_{\theta,i} = J_\theta(X_i, A_i)$ , apply the calibrator $R_{\theta,i} = f_\theta(S_{\theta,i}, X_i)$ , and take the Monte Carlo mean $\widehat{V}(\pi) = n^{-1} \sum_i R_{\theta,i}$ .

Proposition 2.2 (IPS)

With overlap $\pi_0(a \mid x) > 0$ whenever $\pi(a \mid x) > 0$ ,

V(\pi) = \mathbb{E}\!\left[w_\pi(X,A)\,R_\theta\right], \quad w_\pi(X,A) = \frac{\pi(A \mid X)}{\pi_0(A \mid X)}

Proposition 2.3 (DR)

Let $Q(X,A) \approx \mathbb{E}[R_\theta \mid X,A]$ . Then

V(\pi) = \mathbb{E}\!\left[w_\pi(X,A)\big(R_\theta - Q(X,A)\big) + Q_\pi(X)\right]

where $Q_\pi(X) = \mathbb{E}_{a \sim \pi(\cdot \mid X)}[Q(X,a)]$ . Consistency obtains if either $w_\pi$ or $Q$ is correct.

Proofs. Standard; replace $Y$ by $R_\theta$ using S1. □
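The three estimators in Propositions 2.1–2.3 can be sketched on a toy problem. Everything concrete here is an assumption for illustration: the binary context and action spaces, the `reward` table standing in for $R_\theta = f_\theta(J_\theta(x,a), x)$, and both policies. The outcome model `q_wrong` is deliberately misspecified to show that DR stays close to the truth when the weights are correct.

```python
import random

random.seed(0)

# Hypothetical toy problem: reward[x][a] stands in for the calibrated reward R_theta.
reward = {0: {0: 0.2, 1: 0.6}, 1: {0: 0.5, 1: 0.9}}
pi0 = {0: [0.5, 0.5], 1: [0.7, 0.3]}  # logging policy pi_0(a | x)
pi = {0: [0.1, 0.9], 1: [0.2, 0.8]}   # target policy pi(a | x)

def true_value():
    """Exact V(pi) for the toy problem (contexts are uniform on {0, 1})."""
    return sum(0.5 * pi[x][a] * reward[x][a] for x in (0, 1) for a in (0, 1))

def sample(policy, n):
    """Draw n rows (x, a, r) with actions sampled from the given policy."""
    rows = []
    for _ in range(n):
        x = random.randint(0, 1)
        a = 0 if random.random() < policy[x][0] else 1
        rows.append((x, a, reward[x][a]))
    return rows

def direct(n):
    """Direct: sample fresh actions from pi and average calibrated rewards."""
    return sum(r for _, _, r in sample(pi, n)) / n

def ips(logs):
    """IPS: reweight logged rewards by w = pi / pi_0."""
    return sum(pi[x][a] / pi0[x][a] * r for x, a, r in logs) / len(logs)

def dr(logs, q):
    """DR: outcome model Q_pi(x) plus importance-weighted residual correction."""
    total = 0.0
    for x, a, r in logs:
        q_pi = sum(pi[x][b] * q[x][b] for b in (0, 1))
        total += pi[x][a] / pi0[x][a] * (r - q[x][a]) + q_pi
    return total / len(logs)

logs = sample(pi0, 20000)
q_wrong = {0: {0: 0.4, 1: 0.4}, 1: {0: 0.4, 1: 0.4}}  # misspecified Q on purpose
v_true = true_value()
v_direct, v_ips, v_dr = direct(20000), ips(logs), dr(logs, q_wrong)
```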

3. Calibration Risk, Residuals, and Slicing

Calibration risk. For loss $\ell(r,y) = (r - y)^2$ , define $\mathcal{R}(f_\theta) = \mathbb{E}[\ell(R_\theta, Y) \mid i \in I_{\text{oracle}}]$ . Cross‑fitting yields $\widehat{\mathcal{R}}$ and out‑of‑fold residuals $\varepsilon$ .

Default calibrator hyperparameters

Calibrator architecture: Two‑stage if covariates available:
- Stage 1: Spline regression over $[S_\theta, \text{length}, \text{has\_citation}, \ldots]$ → intermediate score $T = g_k(S_\theta, X)$
- Stage 2: Isotonic regression on $T$ → $R_\theta$
If no covariates, use isotonic on $S_\theta$ directly.
Cross‑fitting: $K = 5$ folds. Larger $K$ reduces bias at cost of higher variance.
Residual slicing: 10–20 groups (domain × difficulty × length bins). Use Benjamini–Hochberg to control the false discovery rate ( $q = 0.10$ ). For FWER control, use Bonferroni or Holm.
Stopping: Two consecutive iterations with no significant residual structure (all group means have CIs overlapping 0) and transport diagnostics pass.
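A minimal sketch of the no-covariate default (isotonic regression on $S_\theta$ with $K$-fold cross-fitting), written in plain Python with the pool-adjacent-violators algorithm. The function names are hypothetical, scores are assumed to be distinct scalars, and folds are assigned by index modulo $K$ for simplicity; a production setup would more likely use a library calibrator and randomized folds.

```python
def isotonic_fit(scores, labels):
    """Pool-adjacent-violators: least-squares nondecreasing fit of labels on scores.

    Returns a list of (max_score_in_block, fitted_value) sorted by score.
    """
    blocks = []  # each block is [sum_of_labels, count, max_score]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([y, 1, s])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            sum_y, count, max_s = blocks.pop()
            blocks[-1][0] += sum_y
            blocks[-1][1] += count
            blocks[-1][2] = max_s
    return [(max_s, sum_y / count) for sum_y, count, max_s in blocks]

def isotonic_predict(model, s):
    """Step-function prediction; scores above the last block reuse the last value."""
    for max_s, value in model:
        if s <= max_s:
            return value
    return model[-1][1]

def cross_fit_residuals(scores, labels, k=5):
    """Out-of-fold residuals eps_i = Y_i - R_i, each predicted by a model fit without fold(i)."""
    n = len(scores)
    residuals = [0.0] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        model = isotonic_fit([scores[i] for i in train], [labels[i] for i in train])
        for i in range(fold, n, k):
            residuals[i] = labels[i] - isotonic_predict(model, scores[i])
    return residuals
```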

Slices. Pre‑register a finite slice family $\mathcal{S} = \{G_1, \ldots, G_m\}$ where each $G_j$ partitions oracle indices into groups $g \in \mathcal{G}_j$ (e.g., domain × difficulty × length). For each group, test

H_0(g):\ \mathbb{E}[\,\varepsilon \mid i \in g\,] = 0

Use BH at level $q$ within each $G_j$ . Fit a hierarchical model (partial pooling) $\varepsilon \sim \alpha_{G_j} + u_g + \text{covariates} + \eta$ to obtain shrunken residual estimates $\tilde{\varepsilon}_g$ across overlapping slices. This is the default in v1.1—it avoids chasing noise in small cells by borrowing strength across related groups. If unavailable, use James–Stein shrinkage or require minimum $n_g \ge 40$ per slice.
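The per-group tests and the BH step can be sketched as follows. This is a minimal illustration that assumes a normal approximation for the group mean and nonzero within-group variance; it does not implement the hierarchical model, and the function names are hypothetical.

```python
import math

def two_sided_p(residuals):
    """Normal-approximation p-value for H0: E[eps | group] = 0."""
    n = len(residuals)
    mean = sum(residuals) / n
    var = sum((e - mean) ** 2 for e in residuals) / (n - 1)
    z = mean / math.sqrt(var / n)
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

def benjamini_hochberg(p_values, q=0.10):
    """Return the set of indices rejected by Benjamini-Hochberg at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            k_max = rank  # largest rank whose p-value clears its threshold
    return set(order[:k_max])
```

Applied within one slice family $G_j$, the rejected indices are the groups whose mean residual is flagged as nonzero.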

3.1 Residual Cards

For each significant slice $g$ (after FDR correction), produce a Residual Card containing:

Slice: domain = "medical Q&A", length bin = 600–900, difficulty = "hard"
Stats: $n_g = 143$ , $\bar{\varepsilon}_g = -0.082$ (CI $[-0.11, -0.05]$ ) — negative = judge over‑scores vs. oracle
Attribution hints (computed features): verbosity +180% median; citation_valid=false in 61% cases; high "authoritative tone" markers
Nearest‑neighbor exemplars ( $k=3$ ) with largest negative residuals
Counter‑examples ( $k=3$ ) with near-zero residuals in same slice
Anti‑gaming stress deltas for this slice
Hypothesis checklist (auto‑scores): length bias ✓, fake citations ✓, missed risk framing ✗

Key principle: These cards are the only inputs to the Patch Synthesizer (§4), preventing it from memorizing specific labels and reducing overfitting risk.
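One way to enforce this principle in code is to make the card an aggregate-only record. The sketch below covers only the Stats line of a card; the class and field names are assumptions, and the 95% interval uses a normal approximation.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class ResidualCard:
    slice_name: str
    n: int
    mean_residual: float
    ci_low: float
    ci_high: float
    direction: str  # "over-scores" when the judge scores above the oracle

def build_card(slice_name, labels, rewards, z=1.96):
    """Aggregate eps_i = Y_i - R_i into a card; no per-example labels are retained."""
    eps = [y - r for y, r in zip(labels, rewards)]
    n = len(eps)
    mean = sum(eps) / n
    se = math.sqrt(sum((e - mean) ** 2 for e in eps) / (n - 1) / n)
    direction = "over-scores" if mean < 0 else "under-scores"
    return ResidualCard(slice_name, n, mean, mean - z * se, mean + z * se, direction)
```

Because the dataclass has no field for raw $(X_i, A_i, Y_i)$ tuples, anything downstream that consumes cards cannot see individual labels.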

3.2 Slice Classification: Risk, Green, Neutral

After residual analysis, classify each slice $g$ into one of three categories for the current iteration:

Risk Slices ( $g \in \mathcal{R}$ )

CI for $\bar{\varepsilon}_g$ excludes 0 and $|\bar{\varepsilon}_g| \ge 0.03$ . These represent material systematic miscalibration.

Green Slices ( $g \in \mathcal{G}$ )

High exposure $(w_g \ge 0.05)$ or high stakes $(u_g \ge 0.7)$ , and calibration is good (CI overlaps 0 or $|\bar{\varepsilon}_g| < 0.03$ ). Patches must not harm these slices.

Neutral Slices ( $g \in \mathcal{N}$ )

All others — low exposure, low stakes, or calibration status unclear. Not explicitly optimized or protected.

Classification freeze: For each iteration, freeze the classification at the start based on the current rubric $\theta$ . Evaluate candidate patches against this fixed classification to prevent gaming the definition.

Weights. Define exposure weight $w_g = n_g / \sum_{g'} n_{g'}$ (fraction of samples in slice $g$ ), and stakes weight $u_g \in [0,1]$ (externally specified per slice, e.g., medical Q&A with dosing = 0.9, casual chitchat = 0.1).
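The three-way classification reduces to a short decision rule. The sketch below uses the thresholds stated above as defaults; the function names and argument layout are assumptions.

```python
def exposure_weights(slice_counts):
    """w_g = n_g / sum of n_g' over all slices."""
    total = sum(slice_counts.values())
    return {g: n / total for g, n in slice_counts.items()}

def classify_slice(mean_eps, ci_low, ci_high, exposure, stakes,
                   min_effect=0.03, min_exposure=0.05, min_stakes=0.7):
    """Return 'risk', 'green', or 'neutral' for one slice."""
    ci_excludes_zero = ci_low > 0 or ci_high < 0
    if ci_excludes_zero and abs(mean_eps) >= min_effect:
        return "risk"
    # Not risk means the CI overlaps 0 or |mean| < min_effect, i.e. calibration is good.
    if exposure >= min_exposure or stakes >= min_stakes:
        return "green"
    return "neutral"
```

To honor the classification freeze, this would be run once per iteration on residuals from the current rubric $\theta$, with the result stored rather than recomputed per candidate patch.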

4. Patch Space, Complexity, and Acceptance Predicate

Rubric obligations. A rubric is a typed object $\theta = (\texttt{obligations}, \texttt{guards}, \texttt{abstention})$ (Appendix B). A patch $\delta$ is a finite edit to $\theta$ producing $\theta' = \theta \oplus \delta$ .

Patch families. Pre‑register disjoint families $\mathcal{F}_1, \ldots, \mathcal{F}_K$ (e.g., evidence verification, length‑bias cap, abstention). Each iteration proposes at most one patch per family.

Complexity. Define $\kappa(\delta) \in \mathbb{R}_{\ge 0}$ as a code‑length‑like measure (Δ tokens + # new guards + # abstention edits). Complexity penalties discourage overfitting.

4.1 Balanced Objective: Improve Total Error, Protect Green

CLOVER v1.1 uses global MSE reduction as the primary acceptance criterion, with green non‑inferiority as a hard constraint. Risk‑weighted improvement serves as a tie‑break among acceptable patches.

Primary criterion (hard gate): Global MSE improvement

\Delta\mathrm{MSE}_{\text{global}} = \frac{1}{n_{\text{oracle}}} \sum_{i \in I_{\text{oracle}}} \Big[(Y_i - R_{\theta',i})^2 - (Y_i - R_{\theta,i})^2\Big]

Require $\Delta\mathrm{MSE}_{\text{global}} \le -\eta$ with $\eta \in [0.0005, 0.0025]$ on both dev and confirm. For scale, $\sqrt{\eta}$ is roughly 0.02–0.05 on the $[0,1]$ scale; a reduction of exactly $\eta$ lowers RMSE by at most $\sqrt{\eta}$, with the actual change depending on the baseline MSE. MSE differences are additive and stable under aggregation, unlike RMSE.

Tie‑break metric: Risk‑weighted MSE improvement

\mathrm{RiskImprove}(\delta) = \sum_{g \in \mathcal{R}} w_g u_g \cdot \Delta\mathrm{MSE}_g(\delta)

where $\Delta\mathrm{MSE}_g = \mathrm{MSE}_g(\theta') - \mathrm{MSE}_g(\theta)$ . Among patches that pass all gates, prefer the one with the most negative $\mathrm{RiskImprove}(\delta)$ . This focuses improvements on high‑stakes failures without making it a hard requirement.

Non‑inferiority on green slices (hard constraint)

For each $g \in \mathcal{G}$ , run a one‑sided non‑inferiority test:

H_0: \Delta\mathrm{MSE}_g \ge \tau_g \quad \text{vs} \quad H_1: \Delta\mathrm{MSE}_g < \tau_g

with tolerance $\tau_g \in [1 \times 10^{-3},\, 2 \times 10^{-3}]$ (MSE scale). Require rejection of $H_0$ (no material degradation) after BH correction on both dev and confirm. For must‑not‑regress slices (e.g., safety‑critical), set $\tau_g = 0$ .

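Why MSE? MSE differences are additive (per‑example differences sum, and the pooled change is the exposure‑weighted average of slice‑level changes) and interpretable. RMSE differences do not aggregate this way: an average of slice‑level RMSE changes can have the opposite sign from the pooled change, which complicates inference. Total error reduction ensures you improve the whole system, while green non‑inferiority prevents "winning red cells while losing the map."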
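The quantities in this subsection can be sketched in a few lines. This is an illustration under assumptions: the non-inferiority test uses a normal approximation on paired per-example differences with nonzero variance, and all function names are hypothetical.

```python
import math

def delta_sq_errors(y, r_old, r_new):
    """Paired differences (Y - R')^2 - (Y - R)^2; negative means the patch helps."""
    return [(yi - rn) ** 2 - (yi - ro) ** 2 for yi, ro, rn in zip(y, r_old, r_new)]

def delta_mse(y, r_old, r_new):
    """Mean paired difference, i.e. MSE(theta') - MSE(theta) on the given rows."""
    d = delta_sq_errors(y, r_old, r_new)
    return sum(d) / len(d)

def risk_improve(risk_slices):
    """Sum of w_g * u_g * delta_mse_g over risk slices; more negative is better.

    risk_slices: iterable of (w_g, u_g, delta_mse_g) tuples.
    """
    return sum(w * u * d for w, u, d in risk_slices)

def noninferiority_p(d, tau):
    """One-sided p-value for H0: E[d] >= tau vs H1: E[d] < tau."""
    n = len(d)
    mean = sum(d) / n
    se = math.sqrt(sum((x - mean) ** 2 for x in d) / (n - 1) / n)
    z = (mean - tau) / se
    return 0.5 * math.erfc(-z / math.sqrt(2))  # Phi(z): small when mean is well below tau
```

A green slice passes when `noninferiority_p` falls below the BH-adjusted threshold on both dev and confirm.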

4.2 Distribution Shift & Anchor Stability (Blast‑Radius Caps)

To ensure patches remain local perturbations rather than wholesale rescalings, impose hard constraints on score distribution and anchor drift:

Distribution shift cap: Kolmogorov–Smirnov distance $D_{\text{KS}}(R_{\theta'}, R_\theta) \le 0.05$ (or Wasserstein distance $\le 0.02$ ). This bounds the maximum pointwise CDF difference, preventing large‑scale rescalings.
Anchor stability: For reference policies $\pi_{\text{low}}$ and $\pi_{\text{high}}$ ,
$\big|\mathbb{E}[R_{\theta'} \mid \pi_{\text{low}}] - \mathbb{E}[R_\theta \mid \pi_{\text{low}}]\big| \le 0.01$
$\big|\mathbb{E}[R_{\theta'} \mid \pi_{\text{high}}] - \mathbb{E}[R_\theta \mid \pi_{\text{high}}]\big| \le 0.01$
Ensures the calibrated scale remains comparable across rubric versions.

Why cap shift? Without these constraints, a patch could trivially "improve" residuals by rescaling all scores. Distribution and anchor caps keep patches interpretable and prevent score drift across versions.
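Both shift measures are simple to compute on empirical samples of calibrated rewards. The sketch below is stdlib-only; the Wasserstein routine assumes equal sample sizes (the same rows scored under both rubrics), and requiring both caps at once is an assumption, since the text above offers Wasserstein as an alternative to KS.

```python
import bisect

def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max over t of |F_a(t) - F_b(t)|."""
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    for t in sa + sb:  # the maximum is attained at an observed value
        fa = bisect.bisect_right(sa, t) / len(sa)
        fb = bisect.bisect_right(sb, t) / len(sb)
        d = max(d, abs(fa - fb))
    return d

def wasserstein_1(a, b):
    """W1 for equal-size samples: mean absolute gap between order statistics."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def blast_radius_ok(r_old, r_new, ks_cap=0.05, w1_cap=0.02):
    """True when both the KS and Wasserstein caps hold."""
    return ks_distance(r_old, r_new) <= ks_cap and wasserstein_1(r_old, r_new) <= w1_cap
```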

Acceptance Predicate (v1.1)

On a given iteration with splits (fit, dev, confirm), a candidate patch $\delta$ is acceptable iff:

Accept(δ) =
  [ΔMSE_global ≤ -η on dev & confirm]   (§4.1: total-error improvement)
  ∧ GreenNonInf(δ)                      (§4.1: all green slices pass τ_g test)
  ∧ [ΔECE ≤ 0]                          (global calibration non-worse)
  ∧ DistShiftOK                         (§4.2: KS ≤ 0.05, Wasserstein ≤ 0.02)
  ∧ AnchorStable                        (§4.2: drift ≤ 0.01 on π_low, π_high)
  ∧ TransportOK                         (§7: groupwise residuals ≈ 0)
  ∧ AntiGameOK                          (§6: uplift ≤ 0.05 per attack)
  ∧ κ(δ) ≤ τ                            (complexity: ≤ 150 tokens or ≤ 2 fields)
  ∧ OUA_OK                              (if OUA ≥ 0.30, require |ΔMSE| ≥ threshold)

Tie‑break: Among acceptable patches, prefer the one with the most negative RiskImprove(δ) (risk‑weighted ΔMSE over risk slices), then smallest κ(δ), then smallest KS shift. Reject if all gates pass but improvement is tiny and OUA share is high (≥ 0.30) — collect more labels instead.
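The predicate is a plain conjunction, so a direct encoding also reports which gates failed. In this sketch the `GateReport` fields, the default `eta` (chosen inside the stated range), and `oua_min_gain` (the source leaves the OUA threshold unspecified) are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class GateReport:
    delta_mse_dev: float       # global delta MSE on dev (negative = improvement)
    delta_mse_confirm: float   # global delta MSE on confirm
    green_noninferior: bool    # every green slice passed its tau_g test
    delta_ece: float
    ks_shift: float
    wasserstein_shift: float
    anchor_drift: float        # max absolute drift over pi_low and pi_high
    transport_ok: bool
    max_adv_uplift: float      # worst uplift over the attack battery
    complexity_tokens: int
    oua_share: float

def accept(rep, eta=0.001, oua_min_gain=0.002):
    """Evaluate every gate; return (accepted, names of failed gates)."""
    gates = {
        "mse_gain": rep.delta_mse_dev <= -eta and rep.delta_mse_confirm <= -eta,
        "green": rep.green_noninferior,
        "ece": rep.delta_ece <= 0,
        "dist_shift": rep.ks_shift <= 0.05 and rep.wasserstein_shift <= 0.02,
        "anchor": rep.anchor_drift <= 0.01,
        "transport": rep.transport_ok,
        "anti_gaming": rep.max_adv_uplift <= 0.05,
        "complexity": rep.complexity_tokens <= 150,
        # Hypothetical threshold: with high OUA share, demand a larger confirm gain.
        "oua": rep.oua_share < 0.30 or abs(rep.delta_mse_confirm) >= oua_min_gain,
    }
    failed = [name for name, ok in gates.items() if not ok]
    return not failed, failed
```

Returning the failed gate names makes rejections auditable in the per-iteration report.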

5. Selective‑Inference‑Valid Patch Selection (Nested Holds)

Data splitting. For iteration $t$ , randomly partition oracle indices into disjoint subsets:

fit: train calibrators $f_\theta, f_{\theta'}$ (K‑fold cross‑fit internal to fit).
dev: synthesize candidates and select $\delta$ (search uses only dev residuals/cards).
confirm (time‑separated): never used for synthesis/selection; only for acceptance.

Null of no improvement. For a fixed $\delta$ , define on any set $A \subset I_{\text{oracle}}$ ,

\Delta \mathcal{R}_A(\delta) = \frac{1}{|A|} \sum_{i \in A} \big\{(Y_i - R_{\theta',i})^2 - (Y_i - R_{\theta,i})^2\big\}

Acceptance requires $\Delta \mathcal{R}_{\text{dev}} < 0$ and $\Delta \mathcal{R}_{\text{confirm}} < 0$ .

Proposition 5.1 (Valid confirm‑set test under search)

Condition on the dev set and the selected $\widehat{\delta}$ . If the confirm set is independent and not used during selection, and $f_\theta, f_{\theta'}$ are fitted without using confirm, then a one‑sided test of $H_0: \mathbb{E}[\Delta \mathcal{R}_{\text{confirm}}(\widehat{\delta})] \ge 0$ using $\Delta \mathcal{R}_{\text{confirm}}$ enjoys valid type‑I error control at level $\alpha$ (asymptotically normal via CLT or via paired bootstrap), regardless of the (arbitrary) search on dev.

Proof sketch. Sample splitting removes selection‑test dependence; $\widehat{\delta}$ is measurable w.r.t. dev σ-field; confirm statistic remains an unbiased (or asymptotically normal) estimator for its expectation. □

In practice: We ensure independence by time‑separating confirm (e.g., collected ≥ 48–72 hours after dev) and never touching confirm for search or tuning.

Corollary 5.2 (Family‑wise error per iteration)

If at most one patch per family is accepted and a Bonferroni correction is applied across the $K$ families on the confirm set, the per‑iteration FWER is $\le \alpha$ .
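A sketch of the confirm-set decision with a Bonferroni split across families, using the paired bootstrap mentioned in Proposition 5.1. The percentile-bootstrap upper bound, the resample count, and the function names are assumptions made for illustration.

```python
import random

def paired_bootstrap_upper(d, level, n_boot=2000, seed=0):
    """Percentile-bootstrap upper bound (at the given level) for the mean of d."""
    rng = random.Random(seed)
    n = len(d)
    means = sorted(sum(rng.choices(d, k=n)) / n for _ in range(n_boot))
    idx = min(n_boot - 1, int(level * n_boot))
    return means[idx]

def confirm_families(deltas_by_family, alpha=0.05):
    """Accept a family's patch only if its upper bound on mean delta risk is below 0.

    deltas_by_family maps family name -> paired confirm-set differences
    (Y - R')^2 - (Y - R)^2 for that family's single selected patch.
    """
    k = len(deltas_by_family)
    level = 1 - alpha / k  # Bonferroni across the K pre-registered families
    return {name: paired_bootstrap_upper(d, level) < 0
            for name, d in deltas_by_family.items()}
```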

Patch budgets. To control cumulative error across iterations, cap accepted patches per quarter and limit confirm peeks (pre‑register).

6. Anti‑Gaming as Robustness (Active Adversary, Worst‑Case Uplift)

Obligation violations. Let $\mathcal{V}(\theta) \subset \mathcal{X} \times \mathcal{A}$ denote pairs violating rubric obligations (e.g., unverifiable citations). An adversary chooses edit operators $\omega \in \Omega$ producing perturbed pairs $(X^\omega, A^\omega)$ .

Adversarial uplift. Define the worst‑case calibrated uplift under violations:

\Delta_{\text{adv}}(\theta) = \sup_{\omega \in \Omega:\ (X^\omega,A^\omega) \in \mathcal{V}(\theta)} \big\{ \mathbb{E}[R_\theta(X^\omega,A^\omega)] - \mathbb{E}[R_\theta(X,A)] \big\}

6.1 Validation Battery (held-out from calibration)

Adversarial Test Suite

Test 1: Length padding

Take correct answers, append +30–200% boilerplate text.
Judge should: penalize or stay neutral (not reward).

Test 2: Style mimicry

Prepend rubric phrases ("I verify sources", "reasoning step 1").
Judge should: ignore style markers, score only content.

Test 3: Confident-but-wrong

Flip factual claims but keep authoritative tone.
Judge should: detect via evidence checks.

Test 4: Citation hallucination

Include fake URLs or paper titles.
Judge should: flag or cap score.

Test 5: Style stripping

Normalize tone, remove formatting.
Judge should: keep the score stable (±0.02).

Test 6: Tool-trace fabrication

Add fake tool invocation traces.
Judge should: detect inconsistency with actual outputs.

Expected outcomes

Uncalibrated $S_\theta$ : increases by 0.10–0.30 under attacks
Calibrated $R_\theta$ : shift ≤ 0.05 if rubric has proper guards
Failures surface as residual structure → trigger prompt updates

Acceptance condition. Require $\Delta_{\text{adv}}(\theta') \le \tau_{\text{adv}}$ (default $\le 0.05$ ). In practice, $\sup$ is approximated by CLOVER‑A (Appendix C): an evolutionary search over operators with selection on $R_\theta$ and violation checks. Newly discovered exploits are added to the regression battery.
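As a rough illustration of how a fixed battery approximates $\Delta_{\text{adv}}$, the sketch below takes the maximum mean uplift over a small operator set. It is not CLOVER‑A: the two operators, both toy reward functions, and the attack strings are hypothetical stand-ins, and the "guarded" reward simply strips the known attack text.

```python
PAD = "In summary, the above holds. "
PREFIX = "I verify sources. Reasoning step 1: "

def length_padding(answer):
    return answer + " " + PAD * 5

def style_mimicry(answer):
    return PREFIX + answer

OPERATORS = {"length_padding": length_padding, "style_mimicry": style_mimicry}

def adversarial_uplift(reward_fn, answers, operators=OPERATORS):
    """Empirical stand-in for Delta_adv: max over operators of mean reward uplift."""
    base = sum(reward_fn(a) for a in answers) / len(answers)
    uplifts = {}
    for name, op in operators.items():
        attacked = sum(reward_fn(op(a)) for a in answers) / len(answers)
        uplifts[name] = attacked - base
    return max(uplifts.values()), uplifts

# Toy stand-ins for the calibrated judge R_theta (assumptions, not real judges).
def naive_reward(answer):
    return min(1.0, len(answer) / 200)  # rewards length, so padding games it

def guarded_reward(answer):
    core = answer.replace(PAD, "").replace(PREFIX, "").strip()
    return min(1.0, len(core) / 200)    # scores only the unpadded content

answers = ["The dose is 5 mg twice daily.", "Paris is the capital of France."]
worst_naive, _ = adversarial_uplift(naive_reward, answers)
worst_guarded, _ = adversarial_uplift(guarded_reward, answers)
```

Comparing each worst-case value with $\tau_{\text{adv}}$ gives the AntiGameOK decision for that judge.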

7. Transport Diagnostics & Regime Selection

For a group variable $G$ (policy/time/domain), test transport via residual means:

H_0:\ \mathbb{E}[\,Y - R_\theta \mid G = g\,] = 0 \quad \forall g \in \mathcal{G}

BH within each $G$ family controls FDR (or Bonferroni for FWER). If any $g$ fails:

Local surrogacy (Regime 2)

Fit environment‑specific $f_{\theta,g}$ and evaluate within $g$ .

No surrogacy (Regime 1)

Use $Y$ directly with DR on labeled rows.

Global patch attempt

If failures align with clear missed evidence, propose a global patch and re‑test.

8. OUA Variance & Sample‑Size Planning

Variance decomposition (DR + cross‑fitting).

\mathrm{Var}[\widehat{V}(\pi)] \approx \frac{\sigma^2_{\text{eval}}}{n} + \frac{\sigma^2_{\text{oracle}}}{n_Y}

with $\sigma^2_{\text{oracle}}$ from calibrator learning uncertainty. Estimate $\sigma^2_{\text{oracle}}$ by delete‑one‑fold jackknife over oracle folds; add to influence‑function variance for the main DR term.

OUA share. $\mathrm{OUA} = \sigma^2_{\text{oracle}} / (\sigma^2_{\text{eval}} + \sigma^2_{\text{oracle}})$ .

Budget rule

High OUA ( $\ge 0.3$ ) → acquire more oracle labels; low OUA ( $\le 0.1$ ) → gather more cheap scores to reduce evaluation variance.
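The OUA share, the budget rule, and the implied sample size follow directly from the variance decomposition. In this sketch the function names are hypothetical, and the message for the middle band (between 0.1 and 0.3) is an assumption since the rule above does not cover it.

```python
def oua_share(var_eval, var_oracle):
    """OUA = sigma2_oracle / (sigma2_eval + sigma2_oracle)."""
    return var_oracle / (var_eval + var_oracle)

def budget_action(share, high=0.3, low=0.1):
    """Apply the budget rule to an OUA share."""
    if share >= high:
        return "acquire more oracle labels"
    if share <= low:
        return "gather more cheap scores"
    return "no reallocation indicated"  # assumption: middle band is not specified

def standard_error(var_eval, var_oracle, n, n_oracle):
    """SE from Var = sigma2_eval / n + sigma2_oracle / n_Y."""
    return (var_eval / n + var_oracle / n_oracle) ** 0.5

def required_n(var_eval, var_oracle, coverage, target_se):
    """Smallest n (as a float) with SE <= target_se when n_Y = coverage * n."""
    return (var_eval + var_oracle / coverage) / target_se ** 2
```

`required_n` solves the SE inequality for $n$ under a fixed oracle coverage $\rho = n_Y / n$.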

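Sizing (rule‑of‑thumb). For target SE $\le 0.025$ and typical $\sigma^2_{\text{eval}} \in [0.05, 0.1]$ , the evaluation term alone requires $n \ge \sigma^2_{\text{eval}} / 0.025^2 \approx 80$ – $160$ , and the oracle term raises this according to OUA share and coverage. Budgets of $n \approx 2{,}000$ – $3{,}000$ , $n_Y \approx 300$ – $500$ leave substantial headroom when OUA share is moderate.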

Worked example

Setup: OUA share ≈ 0.2, so $\sigma^2_{\text{oracle}} \approx 0.25 \cdot \sigma^2_{\text{eval}}$ .

Goal: SE $\le 0.025$ (95% CI half‑width ≈ 0.05).

Derivation:

\mathrm{SE} = \sqrt{\frac{\sigma^2_{\text{eval}}}{n} + \frac{\sigma^2_{\text{oracle}}}{n_Y}} \le 0.025

With OUA share = 0.2:

\sigma^2_{\text{oracle}} = \frac{0.2}{0.8} \sigma^2_{\text{eval}} = 0.25 \sigma^2_{\text{eval}}

If we set $n_Y = 0.2 n$ (20% oracle coverage), then:

\mathrm{Var} = \frac{\sigma^2_{\text{eval}}}{n} + \frac{0.25 \sigma^2_{\text{eval}}}{0.2 n} = \frac{\sigma^2_{\text{eval}}}{n}(1 + 1.25) = \frac{2.25 \sigma^2_{\text{eval}}}{n}

For $\sigma^2_{\text{eval}} = 0.08$ (typical):

n \ge \frac{2.25 \times 0.08}{(0.025)^2} = \frac{0.18}{0.000625} = 288

Thus: $n \approx 300$ total with $n_Y \approx 60$ oracle labels achieves the target SE under these assumptions.

Note: If OUA share is higher (e.g., 0.3), allocate more budget to oracle labels; if lower (e.g., 0.1), gather more cheap scores instead.

9. Engineering Contracts (Score‑Once, Versioning, Determinism)

Deterministic judging. Fix decoding (e.g., temperature 0).
Score‑once. Cache $S_\theta$ ; mark DSL fields that require re‑scoring vs can be recomputed downstream.
Versioning. Persist {judge_model_id, prompt_hash, rubric_version, calibrator_version, SDP_version, anchors}.
Change control. Two‑key approval for abstention/safety edits; automatic rollback if guardrails fail post‑deployment.

Judge versioning (always log)

judge_model:        gpt-4.5-mini
judge_prompt_hash:  sha256:ab12cd34ef56...
rubric_version:     3.2
calibrator_version: isotonic-v5
SDP_version:        v1.0_2025-11-11
anchors:            [pi_low=baseline-gpt4, pi_high=expert-panel]

Any change to model family or hard rules triggers a small oracle re‑calibration before deployment.
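A minimal sketch of the versioning record and a score-once cache key. The class, field, and function names are assumptions. The cache key deliberately omits the calibrator and SDP versions, so cached raw scores $S_\theta$ can be reused when only the calibrator changes.

```python
import hashlib
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class JudgeVersion:
    judge_model: str
    judge_prompt_hash: str
    rubric_version: str
    calibrator_version: str
    sdp_version: str
    anchors: tuple

def prompt_hash(prompt_text):
    """Content hash of the judge prompt, in the 'sha256:<hex>' form shown above."""
    return "sha256:" + hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()

def score_cache_key(version, x, a):
    """Key for cached raw scores: depends only on what affects S_theta."""
    payload = json.dumps(
        {"model": version.judge_model, "prompt": version.judge_prompt_hash,
         "rubric": version.rubric_version, "x": x, "a": a},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```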

Reporting template (minimum bundle per iteration)

Target & anchors: Y vs Y*, SDP version, $(\pi_{\text{low}}, \pi_{\text{high}})$ specs; anchor stability check
Policy values: $\hat{V}(\pi)$ with OUA‑augmented 95% CIs (Direct/IPS/DR where applicable)
Calibration metrics: RMSE/ECE (before/after); calibration curve
Residuals: Slice table (means with CIs) before/after; hierarchical summary if used
Transport tests: Per environment pass/fail ( $p$ -values)
Anti‑gaming: Uplifts under each test; new exploits discovered
OUA share: Overall and by decile
Negative segments: Top $(u,t)$ cells by weighted loss and minimal friction to flip
Complexity: DSL size delta; dead‑rule pruning status
Diffs & versions: Patch DSL diffs; version triplet

10. Limitations & Scope Conditions

Construct drift

If the welfare construct $Y$ changes, recalibration cannot fix it; re‑anchor and re‑spec SDP.

Selective‑inference over time

Repeated iterations consume the confirm budget; time‑blocked confirms and pre‑registered patch budgets mitigate.

Non‑regular estimands

For extreme quantiles/worst‑case metrics, use EVT‑aware inference.

Reward hacking risk

Training models on $R_\theta$ can still induce exploitation; when used for training, optimize a lower confidence bound and validate on fresh $Y$ via A/B.

10.1 Explicit De‑Scoping (What We're NOT Doing)

To keep CLOVER v1.1 lightweight and implementable, we explicitly de‑scope the following:

No elaborate prompt DSL. A tiny YAML with obligations, guards, and abstention triggers is sufficient. We avoid building a custom domain‑specific language with complex parsing and code generation.
No fancy weight‑stabilization schemes. Prefer Direct/DR estimators unless ESS is healthy (>10% of n). Avoid uncontrolled IPS unless overlap is strong; standard stabilization (clip weights, regularize nuisance models) is acceptable but not mandatory.
No massive adversarial frameworks. A simple mutation search over 6–8 attack operators (e.g., padding, mimicry, fact flips, fake cites, fabricated traces) with 100–200 tries per patch suffices to keep us honest. Large‑scale evolutionary or GAN‑based adversaries, beyond the small operator search of CLOVER‑A (§6), are out of scope.
No uncontrolled patch search. Limit to ≤ 1 candidate per family with a confirm split. Multi‑armed bandit or Bayesian optimization over patch space is unnecessary given the small patch budget (≤ 2 per iteration).
No dynamic slice adaptation. Pre‑register the slice family at iteration start and freeze the risk/green/neutral classification for that iteration. Adaptive slicing mid‑search introduces selection bias.
No automatic SDP re‑specification. If the welfare construct $Y$ changes (e.g., shift from response quality to safety), CLOVER cannot auto‑detect or fix it—this requires human re‑anchoring and a new SDP version.

Philosophy: CLOVER v1.1 prioritizes statistical rigor and interpretability over automation. A small team with notebooks + YAML + simple calibrators can run the full loop. Extensions (dynamic slices, learned patch synthesis, active learning for oracle sampling) are future work.

11. CLOVER-Synth: Automated Patch Synthesis

The CLOVER loop has four steps: Audit (calculate residuals), Diagnose (generate Residual Cards), Synthesize Patch (δ), and Validate (test against the Acceptance Predicate). Step 3, Synthesis, is currently the primary bottleneck—it requires human analysts to interpret Residual Cards and manually draft precise YAML patches. This process is slow, requires expertise, and does not systematically explore the patch space.

CLOVER-Synth introduces automated, governed patch synthesis by integrating an Optimizer LLM (potentially orchestrated by a framework like DSPy) to generate candidate patches from Residual Cards. Crucially, this automation lives strictly inside the CLOVER governance framework: the existing statistical guardrails (selective-inference validity, acceptance predicate, complexity constraints) ensure that accepted patches are validated improvements that remain interpretable and safe.

11.1. The Bottleneck: Manual Patch Synthesis

In the base CLOVER loop, human experts must:

Review Residual Cards (e.g., "Judge over-scores long responses with fake citations on medical tasks").
Reason about which rubric component (obligation, guard, or abstention trigger) should change.
Draft a structured YAML patch $\delta$ adhering to the Rubric Schema (Appendix B).
Iterate if the patch fails the Acceptance Predicate on $\mathcal{D}_{\text{dev}}$ .

This manual process limits iteration speed and systematic exploration. An automated system can search the patch space more efficiently while maintaining rigor through CLOVER's existing validation infrastructure.

11.2. The Opportunity: LLM-Driven Optimization

An Optimizer LLM can automate patch generation by reasoning over Residual Cards to propose candidate patches. Frameworks like DSPy (Khattab et al., 2024) excel at optimizing prompts against defined metrics—in this context, the task is to optimize the patch generation process against the CLOVER objective function.

Benefits

Scalability and speed: Automation enables rapid iteration and frequent judge improvements, reducing the time from failure detection to deployment.
Systematic search: An automated system can systematically explore the patch space (e.g., testing multiple obligation phrasings or guard thresholds) to find optimal improvements that humans might miss.
Reduced manual effort: Frees human experts to focus on higher-level tasks: defining the idealized target ( $Y^*$ , Layer 1), validating the Standard Deliberation Protocol (SDP, Layer 0), and conducting adversarial testing.

11.3. The Risks and CLOVER's Built-In Mitigation

The primary risks of automated prompt optimization are overfitting (exploiting noise in the development set) and loss of interpretability (opaque, brittle patches). Naive optimization can produce prompts that perform well on dev but fail in production.

Crucially, CLOVER was explicitly designed with the statistical guardrails necessary to make automated optimization safe and rigorous.

Risk 1: Overfitting to Dev Set

Threat: The optimizer aggressively finds patches that exploit noise in $\mathcal{D}_{\text{dev}}$ , leading to false improvements that don't generalize.

CLOVER Mitigation: CLOVER enforces Selective-Inference-Valid Patch Selection (§5) using nested holdouts (Fit, Dev, Confirm). The optimizer only has access to $\mathcal{D}_{\text{dev}}$ . A patch is accepted only if the improvement replicates on the time-separated $\mathcal{D}_{\text{confirm}}$ set. This provides valid Type-I error control regardless of the complexity of the search on Dev. The confirm split acts as an honest arbiter that has never been seen during optimization.

Risk 2: Loss of Interpretability and Stability

Threat: The optimizer produces opaque, large-scale prompt rewrites that are hard to understand, audit, or maintain.

CLOVER Mitigation: The optimization must be constrained:

Structured output schema: The Optimizer LLM must output YAML patches adhering to the Rubric Schema (Appendix B), not arbitrary text edits. This ensures patches are interpretable and versioned.
Complexity penalty $\kappa(\delta)$ : CLOVER's complexity budget (e.g., max tokens added, max obligations changed) enforces interpretability and favors small, local perturbations over wholesale rewrites.
Obligation-first bias: The optimizer should prioritize adding missing welfare dimensions (new obligations) over tightening existing criteria (guard threshold changes), maintaining semantic clarity.
Hard stability constraints: The Acceptance Predicate includes blast-radius caps (KS distance, Anchor Stability §4.2) preventing large-scale rescalings, and Green-Slice Non-Inferiority ensuring patches don't degrade performance on well-calibrated, high-stakes slices.

Risk 3: Data Leakage and Memorization

Threat: The optimizer gains access to raw oracle labels $Y$ on $\mathcal{D}_{\text{dev}}$ , enabling it to memorize specific examples rather than learn generalizable patterns.

CLOVER Mitigation: The optimizer must only access aggregated Residual Cards, not raw $(X_i, A_i, Y_i)$ tuples (§3.1). Residual Cards provide summary statistics (mean residuals, slice definitions, cardinal failure modes) without exposing individual labels, preventing overfitting to specific examples.

11.4. Proposed Architecture: The Two-Loop System

CLOVER-Synth integrates automated optimization as an inner loop nested within the existing CLOVER governance framework (the outer loop):

Outer Loop (CLOVER): Statistical Arbiter

The outer loop manages data splits, calibration, and the final Acceptance Predicate. It has exclusive access to $\mathcal{D}_{\text{confirm}}$ and enforces selective-inference validity.

Responsibilities:

Partition data into $\mathcal{D}_{\text{fit}}$ , $\mathcal{D}_{\text{dev}}$ , $\mathcal{D}_{\text{confirm}}$
Calibrate baseline judge $\theta_0$ on $\mathcal{D}_{\text{fit}}$
Generate Residual Cards from $\mathcal{D}_{\text{dev}}$ residuals
Invoke inner loop (CLOVER-Synth) to generate candidate patch $\delta^*$
Test $\delta^*$ against full Acceptance Predicate on $\mathcal{D}_{\text{confirm}}$
Deploy $\theta' = \theta_0 \oplus \delta^*$ if accepted, reject otherwise

Inner Loop (CLOVER-Synth): Optimizer

The inner loop uses an Optimizer LLM (potentially orchestrated by DSPy) to generate patches optimized against the dev-set objective. It has no access to $\mathcal{D}_{\text{confirm}}$ .

Task signature:

Residual_Card → Patch_Delta

Optimization objective (on $\mathcal{D}_{\text{dev}}$ only):

\begin{aligned} \max_{\delta} \quad & -\text{RiskImprove}(\delta) \\ \text{s.t.} \quad & \text{Accept}_{\text{dev}}(\delta) = \text{True} \end{aligned}

Where $\text{RiskImprove}(\delta)$ is the risk-weighted MSE improvement (§4.1) and $\text{Accept}_{\text{dev}}$ enforces all guardrails on dev:

Complexity cap: $\kappa(\delta) \leq \kappa_{\max}$
Green-slice non-inferiority: $\Delta \text{MSE}_g \leq \tau_g$ for every green slice $g$ (§4.1)
Blast-radius cap: $\text{KS}(R_{\theta_0}, R_{\theta'}) \leq \tau_{\text{KS}}$
Anchor stability: $|\Delta V(\pi_{\text{low}})|, |\Delta V(\pi_{\text{high}})| \leq \epsilon_{\text{anchor}}$
Anti-gaming: Pass adversarial robustness tests (§6)

Output: The best patch $\delta^*$ found on dev, subject to all constraints.

The Critical Guarantee

The inner loop can search arbitrarily aggressively over $\mathcal{D}_{\text{dev}}$ (trying thousands of candidate patches, using reinforcement learning, or leveraging multi-armed bandits) without compromising statistical validity, provided it never touches $\mathcal{D}_{\text{confirm}}$. The outer loop's confirm-set validation is an honest test that does not depend on how the candidate was selected, so it controls Type-I error regardless of the complexity of the dev-set search. This is the essence of selective inference via sample splitting: search freely, validate honestly.
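A small simulation illustrates the guarantee. It is a stylized sketch, not part of CLOVER: every candidate patch is assumed to have zero true effect, the dev and confirm estimates of $\Delta\text{MSE}$ are modeled as independent Gaussians with a common standard error `se`, and the function name and defaults are hypothetical.

```python
import random
from statistics import NormalDist, mean


def simulate_selection(n_candidates=500, n_trials=300, se=0.01,
                       alpha=0.05, seed=0):
    """Best-of-many selection on dev, followed by one confirm test.

    Returns (mean dev estimate of the selected patch,
             mean confirm estimate of the selected patch,
             fraction of trials in which confirm accepts at level alpha).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(alpha)  # negative: accept if z < z_crit
    dev_best, conf_best, false_accepts = [], [], 0
    for _ in range(n_trials):
        # Dev estimates for all candidates: pure noise around a true 0.
        dev = [rng.gauss(0.0, se) for _ in range(n_candidates)]
        dev_best.append(min(dev))          # the "winner" looks like a gain
        # Confirm estimate for the winner: fresh data, independent of dev.
        conf = rng.gauss(0.0, se)
        conf_best.append(conf)
        if conf / se < z_crit:
            false_accepts += 1
    return mean(dev_best), mean(conf_best), false_accepts / n_trials


dev_mean, conf_mean, false_accept_rate = simulate_selection()
```

Under these assumptions the selected patch's dev estimate is strongly negative (an apparent improvement created purely by selection), while its confirm estimate stays centered at zero and the false-accept rate stays near `alpha`.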

11.5. Implementation Sketch: DSPy Integration

DSPy (Khattab et al., 2024) provides a natural framework for implementing SDP-Govynth:

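# Pseudocode: SDP-Govynth with DSPy
# Helper functions that are called but not defined below (parse_yaml_patch,
# validate_patch_schema, clover_dev_objective, accept_on_confirm, deploy,
# reject) are assumed to exist elsewhere.

import dspy
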
class PatchGenerator(dspy.Signature):
    """Generate a YAML patch to improve judge calibration."""
    residual_card = dspy.InputField(desc="Structured summary of systematic judge failures")
    current_rubric = dspy.InputField(desc="Current judge rubric (YAML)")
    patch_delta = dspy.OutputField(desc="Proposed YAML patch (must adhere to schema)")

class CLOVERSynth(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_patch = dspy.ChainOfThought(PatchGenerator)

    def forward(self, residual_card, current_rubric):
        # Generate candidate patch
        patch = self.generate_patch(
            residual_card=residual_card,
            current_rubric=current_rubric
        ).patch_delta

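        # Parse and validate schema; raise explicitly because assert
        # statements are skipped when Python runs with -O
        patch_obj = parse_yaml_patch(patch)
        if not validate_patch_schema(patch_obj):
            raise ValueError("Patch violates schema")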

        return patch_obj

# Optimization loop (inner loop, on D_dev only)
optimizer = dspy.BootstrapFewShot(metric=clover_dev_objective)
optimized_synth = optimizer.compile(
    CLOVERSynth(),
    trainset=dev_residual_cards
)

# Generate best patch on dev
best_patch = optimized_synth(
    residual_card=current_residual_card,
    current_rubric=theta_0
)

# Outer loop: Test on confirm (CLOVER's honest arbiter)
if accept_on_confirm(best_patch, D_confirm):
    deploy(apply_patch(theta_0, best_patch))  # theta' = theta_0 ⊕ delta*; apply_patch as in C.4
else:
    reject(best_patch)

The clover_dev_objective metric evaluates patches on $\mathcal{D}_{\text{dev}}$ against the full set of dev-set constraints (complexity, green-slice non-inferiority, blast-radius, etc.). DSPy's optimizer tunes the patch generator to maximize this objective (BootstrapFewShot does so by selecting few-shot demonstrations), but the final deployment decision rests with the outer loop's confirm-set test.
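One way such a metric could be written is sketched below. DSPy metrics are conventionally callables of the form `metric(example, pred, trace=None)`; everything else here is an assumption for illustration, in particular the factory `make_dev_objective`, the injected `evaluate_on_dev` callable, and the report keys `accept_dev` and `risk_improve`.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical report format produced by a dev-set evaluator:
#   {"accept_dev": bool, "risk_improve": float}  (negative = improvement)
DevReport = Dict[str, Any]


def make_dev_objective(
    evaluate_on_dev: Callable[[Any], DevReport],
) -> Callable[..., float]:
    """Build a metric that scores a proposed patch using dev data only."""

    def clover_dev_objective(
        example: Any, pred: Any, trace: Optional[Any] = None
    ) -> float:
        report = evaluate_on_dev(pred)
        # Hard constraints first: any guardrail violation scores zero.
        if not report["accept_dev"]:
            return 0.0
        # Reward is the size of the risk-weighted MSE reduction, if any.
        return max(0.0, -report["risk_improve"])

    return clover_dev_objective
```

Injecting the evaluator keeps the metric free of any reference to the confirm split, so the optimizer cannot reach it even by accident.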

11.6. When to Use SDP-Govynth

Scenario	Manual CLOVER	SDP-Govynth
Initial rubric design (defining obligations from scratch)	✓ Preferred (human semantic design)	✗ Not recommended
Iterative refinement (5-10 patch cycles, systematic failures)	○ Workable but slow	✓ Ideal use case
High-stakes, novel domains (medical, legal, safety-critical)	✓ Preferred (human oversight critical)	○ Use with expert review of all patches
Rapid deployment cycles (weekly judge updates, mature rubrics)	✗ Bottleneck	✓ Enables fast iteration

11.7. Summary: Why CLOVER Enables Safe Automation

SDP-Govynth is a natural and safe extension because CLOVER was designed from the start to support automated optimization:

Selective-inference validity: The confirm split provides honest Type-I error control regardless of dev-set search complexity.
Input/output constraints: Structured Residual Cards (input) and YAML schema (output) ensure interpretability and prevent data leakage.
Multi-dimensional acceptance predicate: Complexity penalties, green-slice non-inferiority, blast-radius caps, and anti-gaming tests prevent the optimizer from finding shallow improvements.
Explicit versioning and rollback: Every patch is logged with its Residual Card, dev/confirm metrics, and timestamp, enabling audits and rollbacks if production performance degrades.

The Key Insight

Automated prompt optimization is risky when done naively (overfitting, gaming, brittleness). But when nested inside a governed statistical framework with honest holdouts, hard constraints, and explicit versioning, it becomes a powerful tool for scaling systematic improvement. CLOVER provides that framework. SDP-Govynth is what happens when you take the optimization seriously and the statistics seriously.

Appendix A. Metrics & Test Statistics

MSE & RMSE: $\mathrm{MSE} = \mathbb{E}[(Y - R_\theta)^2]$ , $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$ . For acceptance gates, use ΔMSE (additive, stable under aggregation) rather than ΔRMSE.
ECE (Expected Calibration Error): Partition predictions $R_\theta$ into $B = 10$ equal‑frequency bins. For bin $b$ , compute mean prediction $\bar{R}_b = \frac{1}{n_b} \sum_{i \in b} R_{\theta,i}$ and mean outcome $\bar{Y}_b = \frac{1}{n_b} \sum_{i \in b} Y_i$ . Then:
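$\mathrm{ECE} = \sum_{b=1}^B \frac{n_b}{n} \big|\bar{Y}_b - \bar{R}_b\big|$
With equal‑frequency bins on the baseline, $n_b/n = 1/B$ and this reduces to the unweighted mean over bins. Use the same bin boundaries (determined on the baseline) when computing ΔECE for patches; the patched judge's bins are then generally not equal‑frequency, so keep the count weights. Alternatives: adaptive isotonic ECE or calibration slope.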
Residual flatness: For pre‑registered slices $g$ , report $\bar{\varepsilon}_g$ with CIs and BH‑adjusted $p$ -values; accept only if every slice's CI contains 0 post‑patch.
Confirm test: One‑sided $z$ or paired bootstrap on $\Delta \mathrm{MSE}_{\text{confirm}}$ .
Transport: Per‑group residual tests; Prentice‑style sufficiency checks by regressing $Y$ on $(X, A, S_\theta, G)$ and testing $G$ terms.
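The ECE and confirm-test entries above can be sketched in a few lines. This is a minimal illustration under stated assumptions: bins are weighted by their counts $n_b/n$ (which coincides with $1/B$ for equal-frequency bins), $\Delta\text{MSE}$ is new minus old, and the bootstrap value is a simple percentile-style approximation to a one-sided p-value. All function names are hypothetical.

```python
import random
from bisect import bisect_right
from typing import List, Sequence, Tuple


def equal_frequency_edges(baseline_preds: Sequence[float],
                          n_bins: int = 10) -> List[float]:
    """Interior bin boundaries, fixed once on the baseline predictions."""
    s = sorted(baseline_preds)
    return [s[(len(s) * k) // n_bins] for k in range(1, n_bins)]


def ece(preds: Sequence[float], labels: Sequence[float],
        edges: Sequence[float]) -> float:
    """Count-weighted ECE on fixed bin boundaries."""
    n_bins = len(edges) + 1
    sum_r, sum_y, count = [0.0] * n_bins, [0.0] * n_bins, [0] * n_bins
    for r, y in zip(preds, labels):
        b = bisect_right(edges, r)
        sum_r[b] += r
        sum_y[b] += y
        count[b] += 1
    n = len(preds)
    return sum(
        (c / n) * abs(sy / c - sr / c)
        for sr, sy, c in zip(sum_r, sum_y, count)
        if c > 0
    )


def paired_bootstrap_delta_mse(
    y: Sequence[float], r_old: Sequence[float], r_new: Sequence[float],
    n_boot: int = 2000, seed: int = 0,
) -> Tuple[float, float]:
    """Return (delta_mse, share of bootstrap replicates with delta >= 0)."""
    rng = random.Random(seed)
    # Per-example paired differences in squared error (new minus old).
    d = [(yi - rn) ** 2 - (yi - ro) ** 2 for yi, ro, rn in zip(y, r_old, r_new)]
    n = len(d)
    point = sum(d) / n
    at_or_above_zero = 0
    for _ in range(n_boot):
        replicate = sum(d[rng.randrange(n)] for _ in range(n)) / n
        if replicate >= 0:
            at_or_above_zero += 1
    return point, at_or_above_zero / n_boot
```

Resampling the paired differences, rather than the two MSEs separately, keeps the comparison on the same examples, which is what makes the bootstrap "paired".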

Appendix B. Minimal Rubric Obligation Schema & Patch Delta

Rubric (minimal)

rubric:
  version: 1.0
  target: "Y"           # or "Y_star"
  sdp_version: "v1.0_2025-11-11"
  obligations:
    factual_adequacy:
      required: true
      verification:
        allowed_domains: ["nih.gov","who.int","cochrane.org","nejm.org","thelancet.com","nature.com"]
        mismatch_penalty: 0.35
    reasoning_quality:
      requires_counter_position: true
    risk_accounting:
      requires_stakeholders: true
    usefulness:
      be_concise_by_default: true
  guards:
    length_bias_cap:
      if_tokens_gt: 500
      max_raw_score: 0.5
    tool_trace_consistency: true
    confident_but_wrong_max: 0.4
  abstention:
    triggers: ["medical dosing","legal advice"]
    rule: "if critical info missing -> abstain or escalate"

Patch delta (example)

patch:
  family: "evidence_verification"
  id: "len-cite-guard-001"
  rationale: "Long medical answers with unverifiable cites are over-scored."
  changes:
    guards.length_bias_cap:
      if_tokens_gt: 500
      max_raw_score: 0.5
    obligations.factual_adequacy.verification:
      allowed_domains+: ["bmj.com"]
      mismatch_penalty: 0.35
  constraints:
    max_tokens_added: 150
    affects_rescore: true
  expected_effects:
    slices:
      - domain: "medical"
        length_bin: "600-900"
        residual_mean_delta: -0.06
    anti_gaming:
      length_padding_uplift: "<=0.02"
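To make the patch semantics concrete, the sketch below applies a `changes` block to a rubric held as nested dictionaries (for example, after YAML parsing, which is omitted here). The interpretation is an assumption: dotted keys are treated as paths from the rubric root, a field name ending in `+` is treated as "append to the existing list", and any other field overwrites the current value. The function name `apply_patch` matches the helper used in C.4, but this body is illustrative.

```python
import copy
from typing import Any, Dict


def apply_patch(rubric: Dict[str, Any], patch: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new rubric with patch["changes"] applied (input not mutated)."""
    out = copy.deepcopy(rubric)
    for dotted_path, fields in patch["changes"].items():
        # Walk (and create, if needed) the nested section named by the path.
        node = out
        for key in dotted_path.split("."):
            node = node.setdefault(key, {})
        for field, value in fields.items():
            if field.endswith("+"):
                # Assumed append semantics: extend the list, skip duplicates.
                target = node.setdefault(field[:-1], [])
                for item in value:
                    if item not in target:
                        target.append(item)
            else:
                node[field] = value
    return out


rubric = {
    "obligations": {
        "factual_adequacy": {
            "verification": {"allowed_domains": ["nih.gov", "who.int"]}
        }
    },
    "guards": {},
}
patch = {
    "changes": {
        "guards.length_bias_cap": {"if_tokens_gt": 500, "max_raw_score": 0.5},
        "obligations.factual_adequacy.verification": {
            "allowed_domains+": ["bmj.com"],
            "mismatch_penalty": 0.35,
        },
    }
}
patched = apply_patch(rubric, patch)
```

Returning a copy rather than mutating in place keeps the previous rubric version available for rollback.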

Appendix C. Algorithms

C.1 CLOVER‑J — Judge Closed Loop

Inputs. θ₀, logs D, oracle I_oracle; K=5, BH‑q=0.10, patience=2; confirm set time‑separated.

1. Score: S_θ ← J_θ(X,A)
2. Calibrate: Cross‑fit f_θ (two‑stage→isotonic) → R_θ
3. Residuals: ε on oracle folds; build Residual Cards for significant slices
4. Synthesize patches (LLM, constrained): ≤1 per family
5. Evaluate on dev: ΔRMSE/ECE; residual flatness; transport; anti‑gaming; OUA; complexity
6. Confirm: Recompute ΔRMSE on confirm with calibrators fit on fit only; require ΔRMSE < 0
7. Select: Accept if all gates pass; else increment patience
8. Stop: If patience ≥ 2 or no significant residual structure remains
9. Version & report

C.2 CLOVER‑A — Active Adversary

Operator set Ω: padding, rubric mimicry, fact flips, fake citations, style stripping, fabricated tool traces, contradiction injections.
Search: evolutionary loop with selection on $R_\theta$ and constraint $(X^\omega, A^\omega) \in \mathcal{V}(\theta)$ .
Outputs: adversarial exemplars and estimated $\Delta_{\text{adv}}$ .
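A toy version of the evolutionary loop is sketched below. Everything in it is a stand-in chosen for illustration: the three string operators, the length-biased `toy_judge` (playing the role of $R_\theta$), and the character-limit `toy_is_valid` check (playing the role of membership in $\mathcal{V}(\theta)$). The returned uplift is a crude proxy for $\Delta_{\text{adv}}$ on a single seed response.

```python
import random
from typing import Callable, Sequence, Tuple

Operator = Callable[[str], str]


def pad(text: str) -> str:
    return text + " In summary, this answer is thorough and well supported."


def fake_citation(text: str) -> str:
    # Deliberately fabricated placeholder citation (this is the attack).
    return text + " [Author et al., n.d.]"


def strip_style(text: str) -> str:
    return text.replace("!", ".")


def toy_judge(text: str) -> float:
    """Stand-in for R_theta: a judge that over-rewards length."""
    return min(1.0, len(text.split()) / 50)


def toy_is_valid(text: str) -> bool:
    """Stand-in for the validity set V(theta)."""
    return len(text) <= 400


def adversarial_search(
    seed: str,
    judge: Callable[[str], float],
    is_valid: Callable[[str], bool],
    operators: Sequence[Operator],
    generations: int = 5,
    population: int = 8,
    rng_seed: int = 0,
) -> Tuple[str, float]:
    """Evolve perturbations of `seed`; return (best exemplar, score uplift)."""
    rng = random.Random(rng_seed)
    pool = [seed]
    for _ in range(generations):
        children = [
            rng.choice(operators)(rng.choice(pool)) for _ in range(population)
        ]
        valid = [c for c in children if is_valid(c)]
        merged = list(dict.fromkeys(pool + valid))  # ordered de-duplication
        pool = sorted(merged, key=judge, reverse=True)[:population]
    best = pool[0]
    return best, judge(best) - judge(seed)


exemplar, uplift = adversarial_search(
    "The dose is 5 mg.", toy_judge, toy_is_valid,
    [pad, fake_citation, strip_style],
)
```

Because the perturbations add no real information, any positive uplift found this way indicates a gameable judge rather than a better response.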

C.3 CLOVER‑G — Cautious Generator Optimization

Freeze $J_\theta, f_\theta$ .
Optimize generator prompts/procedures for $k$ steps on $\mathrm{LCB}_\alpha(\mathbb{E}[R_\theta]) - \gamma \cdot \mathrm{cost}$ .
Validate on fresh $Y$ / $Y^*$ via small A/B; re‑run transport and anti‑gaming; rollback on regressions.
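The objective in the second step can be illustrated with a normal-approximation lower confidence bound. This is a sketch under simplifying assumptions: rewards are treated as i.i.d., only sampling variance is counted (the oracle-uncertainty component is ignored), and the function name and default values of `alpha` and `gamma` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev
from typing import Sequence


def lcb_objective(rewards: Sequence[float], cost: float,
                  alpha: float = 0.05, gamma: float = 0.1) -> float:
    """LCB_alpha(E[R]) - gamma * cost, using a one-sided normal bound."""
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two rewards to estimate a variance")
    z = NormalDist().inv_cdf(1 - alpha)
    lcb = mean(rewards) - z * stdev(rewards) / sqrt(n)
    return lcb - gamma * cost


# A noisy generator with a higher mean versus a steady one with a lower mean.
noisy = [0.9, 0.1, 0.9, 0.1]
steady = [0.44, 0.46] * 10
noisy_score = lcb_objective(noisy, cost=0.0)
steady_score = lcb_objective(steady, cost=0.0)
```

Optimizing the lower bound rather than the mean makes the generator search prefer gains that are well supported by data, which is the "cautious" part of CLOVER‑G.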

C.4 Skeleton Pseudocode (Drop‑In for v1.1)

Below is a complete, balanced CLOVER iteration loop incorporating all v1.1 improvements (risk/green classification, balanced objectives, distribution shift caps, anchor stability, partial pooling).

def clover_iteration(theta, logs, oracle, anchors, slices, params):
    """
    Single iteration of CLOVER v1.1 with balanced residual improvement.

    Args:
        theta: current rubric (YAML/dict)
        logs: evaluation set (X, A pairs)
        oracle: oracle subset with Y labels (dev + confirm splits)
        anchors: reference policies (pi_low, pi_high) for stability checks
        slices: pre-registered slice family (domain × difficulty × length)
        params: hyperparameters (η, τ_g, KS_cap, etc.)

    Returns:
        accepted_patches: list of (δ, Improve, κ) tuples, or "Pause for labels"
    """

    # 1) Score once (deterministic, cached)
    S = score_deterministic(theta, logs)  # cache raw scores

    # 2) Calibrate (K-fold cross-fit, two-stage → isotonic)
    R, oua_share = crossfit_isotonic(S, oracle, k=params.K)

    # OUA gating: if OUA high and no clear gain, pause for labels
    if oua_share >= params.oua_threshold and not promising_gain_estimate():
        return "Pause for labels"

    # 3) Residuals & slice classification (partial pooling DEFAULT)
    resid = oracle.Y - R.oof
    resid_shrunk = partial_pool(resid, slices)  # hierarchical model or James-Stein

    # Classify slices (freeze for this iteration):
    #   Risk (R): CI excludes 0 AND |ε̄_g| ≥ 0.03
    #   Green (G): high exposure/stakes with good calibration
    #   Neutral (N): everything else
    risk, green, neutral = classify_slices(resid_shrunk, slices, params)

    # 4) Propose tiny obligation-first patches (≤1 per family)
    families = ["verify", "contradiction", "length", "abstain"]
    candidates = propose_patches(theta, families, complexity_cap=params.tau)

    accepted = []
    for δ in candidates:
        θp = apply_patch(theta, δ)
        Sp = maybe_rescore(θp, logs, cache=S)  # only if patch requires new evidence
        Rp, _ = crossfit_isotonic(Sp, oracle, k=params.K)

        # --- Hard gates (all must pass on DEV) ---

        # Global calibration non-worse
        if delta_mse_global(R.dev, Rp.dev, oracle.Y.dev) > 0:
            continue
        if delta_ece(R.dev, Rp.dev, oracle.Y.dev) > 0:
            continue

        # Distribution shift cap (KS ≤ 0.05 or Wasserstein ≤ 0.02)
        if ks_distance(Rp.all, R.all) > params.ks_cap:
            continue

        # Anchor stability (drift ≤ 0.01)
        if not anchors_stable(R, Rp, anchors, eps=params.anchor_eps):
            continue

        # Transport OK (groupwise residual means ≈ 0 in target environments)
        if not transport_ok(Rp, oracle, environments=params.envs):
            continue

        # Anti-gaming (uplift ≤ 0.05 per attack)
        if not anti_gaming_ok(θp, params.attack_suite, uplift_cap=0.05):
            continue

        # Green non-inferiority (one-sided test: Δ MSE_g < τ_g for all g ∈ G)
        if not non_inferiority_green(R.dev, Rp.dev, green, params.tau_g):
            continue

        # Risk-weighted improvement (Improve ≤ -η)
        improve_dev = weighted_improve(R.dev, Rp.dev, risk, params.w, params.u)
        if improve_dev > -params.eta:
            continue

        # --- CONFIRM replication (all checks on time-separated holdout) ---

        if not non_inferiority_green(R.conf, Rp.conf, green, params.tau_g):
            continue

        improve_conf = weighted_improve(R.conf, Rp.conf, risk, params.w, params.u)
        if improve_conf > -params.eta:
            continue

        # All gates passed!
        accepted.append((δ, improve_conf, complexity(δ)))

    # 5) Select top patches (≤ 2)
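    # Sort by confirm improvement (most negative first), then smallest complexity κ(δ);
    # KS shift is not stored in the tuple, so it is not used as a tie-break here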
    accepted = sorted(accepted, key=lambda x: (x[1], x[2]))[:params.max_patches]

    return accepted


# --- Helper functions (minimal pseudocode) ---

def partial_pool(resid, slices):
    """Fit hierarchical model: ε ~ α_G + u_g + covariates + η.
    Returns shrunken residual estimates ε̃_g."""
    # Use lme4, brms, or James-Stein shrinkage
    pass

def classify_slices(resid, slices, params):
    """Returns (risk, green, neutral) slice sets."""
    risk = {g for g in slices if ci_excludes_zero(resid[g]) and abs(mean(resid[g])) >= 0.03}
    green = {g for g in slices if (exposure(g) >= 0.05 or stakes(g) >= 0.7)
             and abs(mean(resid[g])) < 0.03}
    neutral = set(slices) - risk - green
    return risk, green, neutral

def weighted_improve(R, Rp, risk, w, u):
    """Σ_{g∈R} w_g u_g · (MSE_g(Rp) - MSE_g(R))."""
    return sum(w[g] * u[g] * (mse(Rp[g]) - mse(R[g])) for g in risk)

def non_inferiority_green(R, Rp, green, tau_g):
    """One-sided test: Δ MSE_g < τ_g for all g ∈ G (BH-corrected)."""
    pvals = [one_sided_test(mse(Rp[g]) - mse(R[g]), tau_g[g]) for g in green]
    return all_pass_bh(pvals, q=0.10)

def anchors_stable(R, Rp, anchors, eps=0.01):
    """Check |E[Rp | π_low] - E[R | π_low]| ≤ eps (same for π_high)."""
    drift_low  = abs(mean(Rp[anchors.pi_low])  - mean(R[anchors.pi_low]))
    drift_high = abs(mean(Rp[anchors.pi_high]) - mean(R[anchors.pi_high]))
    return drift_low <= eps and drift_high <= eps

Usage: This skeleton is a direct drop‑in for v1.1. Replace crossfit_isotonic, partial_pool, and anti_gaming_ok with your calibrator, pooling estimator, and adversarial test suite, respectively. The logic preserves all v1.1 guarantees: risk improvement subject to green non‑inferiority, distribution shift caps, anchor stability, and confirm replication.
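The skeleton leaves `all_pass_bh` undefined. One possible stand-in is sketched below using the Benjamini–Hochberg step-up procedure; reading "all pass" as "every hypothesis is rejected at FDR level q" is an assumption about the intended semantics, and `bh_reject` is a hypothetical helper name.

```python
from typing import List, Sequence


def bh_reject(pvals: Sequence[float], q: float = 0.10) -> List[bool]:
    """Benjamini-Hochberg step-up: which hypotheses are rejected at FDR q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Largest rank k with p_(k) <= q * k / m; all ranks up to k are rejected.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k_max = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject


def all_pass_bh(pvals: Sequence[float], q: float = 0.10) -> bool:
    """True if every null (here: 'slice g is inferior') is rejected."""
    return all(bh_reject(pvals, q))
```

With an empty green set the function returns True, so the non-inferiority gate is vacuously satisfied when no slice is classified as green.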

Appendix D. Default Parameters (v1.1)

Parameter	Symbol / Name	Default Value (v1.1)
Cross‑fit folds	K	5
Slice count	Pre‑registered slices	≤ 12 (domain × difficulty × length)
Risk threshold	|ε̄_g|	≥ 0.03 (material miscalibration)
Green tolerance (MSE scale)	τ_g	1–2 × 10⁻³ (0 for must‑not‑regress)
Required improvement (MSE scale)	η	0.0005–0.0025 (≈ 0.02–0.05 RMSE on [0,1])
KS shift cap	D_KS	≤ 0.05
Wasserstein shift cap	W	≤ 0.02
Anchor drift cap	Δ E[R|π]	≤ 0.01 (for π_low, π_high)
Anti‑gaming uplift	Per attack	≤ 0.05 (calibrated scale)
Patch size	κ(δ)	≤ 150 tokens or ≤ 2 new fields
Patch budget	Per iteration	≤ 2 accepted patches
Confirm peek budget	Per quarter	≤ 3 peeks
OUA gating threshold	OUA_share	≥ 0.30 → pause for labels
FDR control	q	0.10 (BH correction within slice families)
Partial pooling	Default estimator	Hierarchical model (or James–Stein if unavailable; min n_g ≥ 40 fallback)

Tuning guidance: These defaults are conservative and suitable for most applications. Increase $\eta$ (required improvement) if you want to filter out tiny gains; decrease $\tau_g$ (green tolerance) for critical slices you cannot afford to degrade. Tighten KS/Wasserstein caps if interpretability across versions is paramount.
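For implementers, the defaults can be carried in a single immutable configuration object. The sketch below is one possible encoding: field names follow the `params.*` attributes used in C.4 where those exist and are otherwise hypothetical, and ranges in the table are represented by their lower end (for example, `eta = 0.0005` and a single scalar `tau_g`).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CloverParams:
    K: int = 5                         # cross-fit folds
    max_slices: int = 12               # pre-registered slice count
    risk_threshold: float = 0.03       # |mean residual| for a risk slice
    tau_g: float = 0.001               # green tolerance (MSE scale)
    eta: float = 0.0005                # required improvement (MSE scale)
    ks_cap: float = 0.05               # KS shift cap
    wasserstein_cap: float = 0.02      # Wasserstein shift cap
    anchor_eps: float = 0.01           # anchor drift cap
    uplift_cap: float = 0.05           # anti-gaming uplift per attack
    max_patch_tokens: int = 150        # patch size cap
    max_patches: int = 2               # accepted patches per iteration
    confirm_peeks_per_quarter: int = 3
    oua_threshold: float = 0.30        # pause for labels at or above this
    bh_q: float = 0.10                 # FDR level
    min_slice_n: int = 40              # fallback minimum slice size

    def __post_init__(self) -> None:
        if self.K < 2:
            raise ValueError("K must be at least 2 for cross-fitting")
        if not 0.0 < self.bh_q < 1.0:
            raise ValueError("bh_q must lie in (0, 1)")
        if self.eta < 0 or self.tau_g < 0:
            raise ValueError("eta and tau_g must be non-negative")


defaults = CloverParams()
# Stricter variant: demand larger gains and allow no green-slice regression.
strict = replace(defaults, eta=0.0025, tau_g=0.0)
```

Freezing the dataclass means a tuned configuration is always a new object, which fits the explicit-versioning requirement.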

Assumptions Ledger

Code	Statement	Used by	Test/Diagnostic	Mitigation
J1	Programmable evidence E_θ (more info ↓ Bayes risk)	Design	Residual ↓ after adding evidence checks	Add obligations, not weights
J2	Deterministic scoring (temp=0, fixed tools)	Score‑once	Repeat scoring → identical S_θ	Cache S_θ; version model/prompt
S1	∃f_θ: E[Y|X,A,S_θ] = f_θ	All	Prentice‑style sufficiency on oracle	Enrich surrogates; add covariates
S2	Y ⊥ Sel | X,A,S_θ	Cross‑env	Groupwise residual tests	Local f_θ or re‑prompt
L1	Oracle MAR	Calibration	Label process audit	Randomize oracle sampling; stratify
L2	Oracle positivity	Calibration	Coverage plots; tail check	Targeted labeling in tails
H1	Nested holdout confirm	Patch selection	Dev vs confirm delta	Reject non‑replicated patches
OPE	Overlap (ESS)	IPS/DR	ESS, max/median w	Collect draws; Direct/DR; stabilize
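The overlap diagnostics in the last row are straightforward to compute from importance weights. The sketch below uses the standard Kish effective sample size, $\mathrm{ESS} = (\sum_i w_i)^2 / \sum_i w_i^2$; the function names and the report layout are hypothetical.

```python
from statistics import median
from typing import Dict, Sequence


def effective_sample_size(weights: Sequence[float]) -> float:
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    if s2 == 0:
        raise ValueError("all weights are zero")
    return s1 * s1 / s2


def overlap_report(weights: Sequence[float]) -> Dict[str, float]:
    """ESS, ESS as a fraction of n, and the max/median weight ratio."""
    ess = effective_sample_size(weights)
    med = median(weights)
    return {
        "ess": ess,
        "ess_fraction": ess / len(weights),
        "max_over_median": max(weights) / med if med > 0 else float("inf"),
    }


report = overlap_report([10.0, 0.1, 0.1, 0.1])
```

A low ESS fraction or a large max/median ratio signals that a few samples dominate the IPS/DR estimate, which is when the ledger's mitigations (more draws, Direct/DR, weight stabilization) apply.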

References

[1] Blackwell, D. (1951/1953). Comparison/Equivalence of experiments. Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability.

[2] Dudík, M., Langford, J., & Li, L. (2011). Doubly robust policy evaluation and learning. International Conference on Machine Learning (ICML).

[3] Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68.

[4] Efron, B., & Tibshirani, R. J. (1994). An Introduction to the Bootstrap. Chapman and Hall/CRC.

[5] Kallus, N., & Mao, X. (2020). On the role of surrogates in the efficient estimation of treatment effects with limited outcome data. arXiv:2003.12408.

[6] Pearl, J., & Bareinboim, E. (2014). External validity: From do-calculus to transportability across populations. Statistical Science, 29(4), 579–595.

Summary (for implementers)

Treat the judge as a programmable measurement channel; improve obligations before weights.
Use cross‑fit monotone calibration to get $R_\theta$ , and nested holdouts to accept patches.
Require replicated calibration gains on a time‑separated confirm set, transport pass, and bounded adversarial uplift.
Report OUA‑augmented CIs, negative segments, and maintain strict versioning.
Only then consider optimizing generation against $R_\theta$ , with LCB objectives and external $Y$ checks.

We welcome your feedback

CLOVER is an active research framework. We invite constructive criticism from practitioners and researchers.

If you spot errors, have theoretical extensions, or have applied CLOVER in production and want to share lessons, please let us know or email eddie@cimolabs.com.