CLOVER: Selective‑Inference‑Valid Closed‑Loop Optimization of Programmable Judges
Technical Appendix v1.1
Abstract
We formalize CLOVER, a governed procedure for improving LLM judges treated as programmable surrogates. CLOVER (i) calibrates raw judge scores $S$ to an operational welfare target $Y$, (ii) audits residuals to detect systematic mis-scoring, (iii) proposes small, structured rubric patches, and (iv) accepts a patch only if it improves calibration on a time‑separated confirm holdout while passing transport and anti‑gaming constraints. We give identification results for using calibrated rewards in Direct/IPS/DR policy evaluation, derive an oracle‑uncertainty‑aware (OUA) variance decomposition, formalize the patch family selection problem with selective‑inference control via nested sample splitting, and specify an active adversarial searcher as a worst‑case uplift bound. The framework is designed for "score once, calibrate many" with explicit versioning and an assumptions ledger.
Scope: CLOVER vs. SDP-Gov
CLOVER (this appendix) governs the calibration of judges (the mapping $S \mapsto Y$): improving judge rubrics to better predict operational welfare labels while maintaining calibration quality, transportability, and resistance to gaming.
SDP-Gov (Layer 0) governs the Standard Deliberation Protocol (SDP) itself (the mapping $Y \mapsto Y^{\ast}$): ensuring that operational welfare labels align with true idealized welfare via empirical validation (PTE against long-run outcomes), construct validity audits, and stability checks. See Validating the Bridge Assumption (A0) for the complete SDP-Gov framework.
0. Notation & Objects
- Contexts & actions. $x \in \mathcal{X}$, $a \in \mathcal{A}$. A policy $\pi$ maps $x \mapsto \pi(\cdot \mid x)$.
- Target welfare. $Y \in [0,1]$ is the operational welfare label collected under a fixed Standard Deliberation Protocol (SDP). Optionally, $Y^{\ast}$ denotes an idealized target; replace $Y$ with $Y^{\ast}$ where relevant.
- Judge. A rubric/prompt $\theta$ parameterizes a judge $J_\theta$, with output $S = J_\theta(x, a)$.
- Calibrator. A function $f$ mapping $S \mapsto R = f(S)$, where $R$ is the calibrated reward.
- Logs & oracle. Logged data $D = \{(x_i, a_i, S_i)\}_{i=1}^{n}$ from $\pi_0$. A subset $I_{\mathrm{oracle}} \subseteq \{1, \dots, n\}$ carries welfare labels $Y_i$; coverage $\rho = |I_{\mathrm{oracle}}|/n$.
- Residuals. $e_i = Y_i - R_i$ for $i \in I_{\mathrm{oracle}}$.
- Estimand. Policy value $V(\pi) = \mathbb{E}_{x}\,\mathbb{E}_{a \sim \pi(\cdot \mid x)}[Y]$.
Throughout, expectations are w.r.t. the relevant data‑generating distributions; measurability and boundedness of $Y$, $S$, and $R$ are assumed.
1. Measurement Model & Information Ordering
Assumption J1 (Programmable channel)
A rubric $\theta$ specifies an evidence set $E_\theta$ and induces a σ-field $\mathcal{F}_\theta = \sigma(E_\theta)$ such that $S = J_\theta(x, a)$ is $\mathcal{F}_\theta$-measurable.
Lemma 1 (Informativeness monotonicity; Blackwell/Doob)
If $\mathcal{F}_\theta \subseteq \mathcal{F}_{\theta'}$, then
$\mathbb{E}\big[(Y - \mathbb{E}[Y \mid \mathcal{F}_{\theta'}])^2\big] \le \mathbb{E}\big[(Y - \mathbb{E}[Y \mid \mathcal{F}_\theta])^2\big].$
Proof sketch. Conditional expectation is the projection; enlarging the σ-field cannot increase squared error. □
Implication. Enriching obligations (evidence to be checked) weakly reduces Bayes risk for predicting $Y$; weight tweaks alone cannot guarantee this.
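As a sanity check on Lemma 1, a minimal numeric illustration (the two-bit welfare model and all numbers below are invented for this sketch, not part of CLOVER): conditioning on a richer evidence set cannot increase Bayes risk.

```python
from itertools import product

# Welfare label determined by two evidence bits (invented toy model)
def y(b1, b2):
    return 0.2 + 0.3 * b1 + 0.4 * b2

points = list(product([0, 1], repeat=2))
p = 1.0 / len(points)  # uniform over the four evidence configurations

def bayes_risk(partition_key):
    # E[(Y - E[Y | F])^2] when the sigma-field F is generated by partition_key
    cells = {}
    for b1, b2 in points:
        cells.setdefault(partition_key(b1, b2), []).append(y(b1, b2))
    risk = 0.0
    for vals in cells.values():
        mean = sum(vals) / len(vals)
        risk += sum(p * (v - mean) ** 2 for v in vals)
    return risk

risk_coarse = bayes_risk(lambda b1, b2: b1)       # rubric checks only bit 1
risk_fine = bayes_risk(lambda b1, b2: (b1, b2))   # enriched rubric checks both
print(risk_coarse, risk_fine)  # finer sigma-field, weakly lower risk
```

Here enlarging the partition from one bit to both bits drops the Bayes risk from 0.04 to 0, the discrete analogue of the projection argument in the proof sketch.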
1.1 Design Philosophy: Obligation‑First, Bounded Risk
CLOVER v1.1 adopts an obligation‑first design philosophy to ensure improvements are interpretable, generalizable, and safe:
- Prefer obligations over weights. When residuals reveal failures, first ask "what evidence should we check?" not "how should we reweight?" Obligation edits (e.g., verify citations, check for contradictions) tend to generalize and are less likely to degrade well‑calibrated regions.
- Partial pooling by default. Use hierarchical/shrinkage estimation for slice residuals to borrow strength across related groups and avoid chasing noise in small cells.
- Balanced objectives. Optimize a constrained problem: improve risk slices subject to non‑inferiority on green slices (high‑exposure or high‑stakes regions with good calibration).
- Bound the blast radius. Cap distributional shift (KS, Wasserstein) and anchor drift so patches remain small, local perturbations rather than wholesale rescalings.
- Confirm on time‑separated data. Require dev gains to replicate on a confirm holdout collected after the dev set to control selective‑inference risk.
Core principle: Fix what's broken without breaking what works. Residuals are a discovery tool, not the optimization objective.
2. Surrogacy, Transport, and Identification with Calibrated Rewards
Assumption S1 (Surrogacy sufficiency on support)
On $\operatorname{supp}(\pi_0) \cup \operatorname{supp}(\pi)$,
$\mathbb{E}[Y \mid x, a] = f(S(x, a)) \quad \text{for a fixed measurable } f.$
Assumption S2 (Transport / S‑admissibility)
Across admissible environments $e \in \mathcal{E}$ (policy, time, cohort), with selection nodes $\mathbf{s}$,
$\mathbb{E}[Y \mid S, e] = \mathbb{E}[Y \mid S].$
When S2 holds, the same $f$ transports across $\mathcal{E}$.
Proposition 2.1 (Direct identification)
Under S1 (and MAR/positivity for oracle learning of $f$), for any $\pi$,
$V(\pi) = \mathbb{E}_{x}\,\mathbb{E}_{a \sim \pi(\cdot \mid x)}\big[f(S(x, a))\big].$
Operational: Draw fresh contexts $x_j$, sample actions $a_j \sim \pi(\cdot \mid x_j)$, score $S_j = J_\theta(x_j, a_j)$, apply the calibrator $R_j = f(S_j)$, and take the Monte Carlo mean $\hat V = \frac{1}{m}\sum_{j=1}^{m} R_j$.
Proposition 2.2 (IPS)
With overlap $\pi_0(a \mid x) > 0$ whenever $\pi(a \mid x) > 0$,
$V(\pi) = \mathbb{E}_{(x, a) \sim \pi_0}\Big[\tfrac{\pi(a \mid x)}{\pi_0(a \mid x)}\, R\Big].$
Proposition 2.3 (DR)
Let $w(x, a) = \pi(a \mid x)/\pi_0(a \mid x)$ and let $\hat g(x, a)$ estimate $\mathbb{E}[R \mid x, a]$. Then
$\hat V_{\mathrm{DR}}(\pi) = \frac{1}{n}\sum_{i=1}^{n}\Big(\mathbb{E}_{a \sim \pi(\cdot \mid x_i)}\big[\hat g(x_i, a)\big] + w(x_i, a_i)\,\big(R_i - \hat g(x_i, a_i)\big)\Big),$
where $R_i = f(S_i)$. Consistency obtains if either the outcome model $\hat g$ or the logged propensities $\pi_0$ are correct.
Proofs. Standard; replace $Y$ by $R = f(S)$ using S1. □
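The three estimators can be checked on an exactly solvable toy problem. The two-action setup and the names `v_direct`, `v_ips`, `v_dr` are illustrative assumptions for this sketch; the calibrated reward is taken as deterministic, so all three estimators coincide with the true policy value.

```python
# One context, two actions; exact expectations instead of sampling.
pi0 = {"a0": 0.5, "a1": 0.5}   # logging policy
pi = {"a0": 0.8, "a1": 0.2}    # target policy
R = {"a0": 0.3, "a1": 0.7}     # calibrated reward f(S(x, a)), deterministic here

# Direct (Prop 2.1): E_{a~pi}[R]
v_direct = sum(pi[a] * R[a] for a in pi)

# IPS (Prop 2.2): E_{a~pi0}[(pi/pi0) * R], evaluated over the exact logging law
v_ips = sum(pi0[a] * (pi[a] / pi0[a]) * R[a] for a in pi0)

# DR (Prop 2.3): with a correct outcome model g the correction term vanishes
g = dict(R)
v_dr = sum(pi[a] * g[a] for a in pi) + sum(
    pi0[a] * (pi[a] / pi0[a]) * (R[a] - g[a]) for a in pi0
)

print(v_direct, v_ips, v_dr)  # all three equal the true value 0.38
```

With sampled logs rather than exact expectations, the three estimators differ in variance but share the same estimand under S1 and overlap.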
3. Calibration Risk, Residuals, and Slicing
Calibration risk. For squared-error loss $\ell(y, r) = (y - r)^2$, define $\mathcal{R}(f) = \mathbb{E}[\ell(Y, f(S))]$. Cross‑fitting yields fold-wise calibrators $\hat f^{(-k)}$ and out‑of‑fold residuals $e_i = Y_i - \hat f^{(-k(i))}(S_i)$.
Default calibrator hyperparameters
- Calibrator architecture: Two‑stage if covariates $x$ are available:
- Stage 1: Spline regression over $(S, x)$ → intermediate score $\tilde S$
- Stage 2: Isotonic regression on $\tilde S$ → $R$
- Cross‑fitting: $K = 5$ folds. Larger $K$ reduces bias at the cost of higher variance.
- Residual slicing: 10–20 groups (domain × difficulty × length bins). Use Benjamini–Hochberg to control the false discovery rate ($q = 0.10$). For FWER control, use Bonferroni or Holm.
- Stopping: Two consecutive iterations with no significant residual structure (all group means have CIs overlapping 0) and transport diagnostics pass.
Slices. Pre‑register a finite slice family $\mathcal{G}_1, \dots, \mathcal{G}_M$, where each $\mathcal{G}_m$ partitions oracle indices into groups (e.g., domain × difficulty × length). For each group $g$, test
$H_{0,g}: \mathbb{E}[e_i \mid i \in g] = 0.$
Use BH at level $q$ within each $\mathcal{G}_m$. Fit a hierarchical model (partial pooling) to obtain shrunken residual estimates across overlapping slices. This is the default in v1.1; it avoids chasing noise in small cells by borrowing strength across related groups. If unavailable, use James–Stein shrinkage or require a minimum number of oracle labels per slice.
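The cross-fitting and out-of-fold residual step can be sketched as follows. This is a minimal stand-in under assumptions: a one-dimensional least-squares calibrator replaces the two-stage spline+isotonic pipeline, the data are synthetic, and the helper `fit_linear` is invented for the sketch.

```python
import random

random.seed(0)
n, K = 200, 5
S = [random.random() for _ in range(n)]
# synthetic truth: Y = 0.2 + 0.6*S + small noise, clipped to [0, 1]
Y = [min(1.0, max(0.0, 0.2 + 0.6 * s + random.gauss(0, 0.05))) for s in S]

def fit_linear(xs, ys):
    # closed-form 1-D least squares standing in for the real calibrator
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x, a=my - b * mx, b=b: a + b * x

folds = [list(range(k, n, K)) for k in range(K)]
resid = [0.0] * n
for held in folds:
    held_set = set(held)
    train = [i for i in range(n) if i not in held_set]
    f = fit_linear([S[i] for i in train], [Y[i] for i in train])
    for i in held:
        resid[i] = Y[i] - f(S[i])  # out-of-fold residual e_i

mean_resid = sum(resid) / n
print(round(mean_resid, 3))  # near 0 when the calibrator is well specified
```

Slice-level tests then operate on these out-of-fold residuals, never on in-fold fits, so systematic structure is not masked by overfitting.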
3.1 Residual Cards
For each significant slice (after FDR correction), produce a Residual Card containing:
- Slice: domain = "medical Q&A", length bin = 600–900, difficulty = "hard"
- Stats: $n_g$, $\bar e_g$ with 95% CI; negative = judge over‑scores vs. oracle
- Attribution hints (computed features): verbosity +180% median; citation_valid=false in 61% cases; high "authoritative tone" markers
- Nearest‑neighbor exemplars with the largest negative residuals
- Counter‑examples with near-zero residuals in the same slice
- Anti‑gaming stress deltas for this slice
- Hypothesis checklist (auto‑scores): length bias ✓, fake citations ✓, missed risk framing ✗
Key principle: These cards are the only inputs to the Patch Synthesizer (§4), preventing it from memorizing specific labels and reducing overfitting risk.
3.2 Slice Classification: Risk, Green, Neutral
After residual analysis, classify each slice into one of three categories for the current iteration:
Risk Slices ($\mathcal{R}$)
CI for $\bar e_g$ excludes 0 and $|\bar e_g| \ge 0.03$. These represent material systematic miscalibration.
Green Slices ($\mathcal{G}$)
High exposure $w_g$ or high stakes $u_g$, and calibration is good (CI overlaps 0 or $|\bar e_g|$ is negligible). Patches must not harm these slices.
Neutral Slices ($\mathcal{N}$)
All others — low exposure, low stakes, or calibration status unclear. Not explicitly optimized or protected.
Classification freeze: For each iteration, freeze the classification at the start based on the current rubric $\theta$. Evaluate candidate patches against this fixed classification to prevent gaming the definition.
Weights. Define exposure weight $w_g$ (fraction of samples in slice $g$) and stakes weight $u_g$ (externally specified per slice, e.g., medical Q&A with dosing = 0.9, casual chitchat = 0.1).
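The classification rule above can be sketched directly; the thresholds come from this section (risk if the CI excludes 0 and $|\bar e_g| \ge 0.03$), while the exposure/stakes cutoffs and the slice data are illustrative assumptions.

```python
import math

def classify(slices, risk_abs=0.03, exp_min=0.1, stakes_min=0.8):
    # exp_min and stakes_min are assumed cutoffs, not from the appendix
    out = {}
    for name, s in slices.items():
        half = 1.96 * s["sd"] / math.sqrt(s["n"])  # 95% CI half-width
        lo, hi = s["mean"] - half, s["mean"] + half
        excludes_zero = lo > 0 or hi < 0
        if excludes_zero and abs(s["mean"]) >= risk_abs:
            out[name] = "risk"
        elif (s["exposure"] >= exp_min or s["stakes"] >= stakes_min) and not excludes_zero:
            out[name] = "green"
        else:
            out[name] = "neutral"
    return out

slices = {
    "medical/hard/600-900": dict(mean=-0.07, sd=0.10, n=80, exposure=0.05, stakes=0.9),
    "chitchat/easy/0-300":  dict(mean=0.002, sd=0.08, n=400, exposure=0.30, stakes=0.1),
    "legal/medium/300-600": dict(mean=-0.02, sd=0.15, n=25, exposure=0.02, stakes=0.7),
}
labels = classify(slices)
print(labels)
```

The small legal slice lands in neutral despite a negative point estimate: its CI overlaps 0, which is exactly the noise-chasing that partial pooling and the freeze are meant to prevent.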
4. Patch Space, Complexity, and Acceptance Predicate
Rubric obligations. A rubric $\theta$ is a typed object (Appendix B). A patch $\delta$ is a finite edit to $\theta$ producing $\theta' = \theta \oplus \delta$.
Patch families. Pre‑register disjoint families $\mathcal{F}_1, \dots, \mathcal{F}_m$ (e.g., evidence verification, length‑bias cap, abstention). Each iteration proposes at most one patch per family.
Complexity. Define $\kappa(\delta)$ as a code‑length‑like measure (Δ tokens + # new guards + # abstention edits). Complexity penalties discourage overfitting.
4.1 Balanced Objective: Improve Total Error, Protect Green
CLOVER v1.1 uses global MSE reduction as the primary acceptance criterion, with green non‑inferiority as a hard constraint. Risk‑weighted improvement serves as a tie‑break among acceptable patches.
Primary criterion (hard gate): Global MSE improvement
Require $\Delta\mathrm{MSE} = \mathrm{MSE}(\theta \oplus \delta) - \mathrm{MSE}(\theta) \le -\eta$ with a pre‑registered margin $\eta > 0$ on both dev and confirm. This corresponds to roughly 0.02–0.05 RMSE improvement on the $[0,1]$ scale. MSE differences are additive and stable under aggregation, unlike RMSE.
Tie‑break metric: Risk‑weighted MSE improvement
$\mathrm{RiskImprove}(\delta) = \sum_{g \in \mathcal{R}} w_g\, u_g\, \Delta\mathrm{MSE}_g,$
where $\Delta\mathrm{MSE}_g$ is the per‑slice MSE change. Among patches that pass all gates, prefer the one with the most negative $\mathrm{RiskImprove}(\delta)$. This focuses improvements on high‑stakes failures without making it a hard requirement.
Non‑inferiority on green slices (hard constraint)
For each $g \in \mathcal{G}$, run a one‑sided non‑inferiority test:
$H_0: \Delta\mathrm{MSE}_g \ge \tau_g \quad \text{vs.} \quad H_1: \Delta\mathrm{MSE}_g < \tau_g,$
with tolerance $\tau_g > 0$ (MSE scale). Require rejection of $H_0$ (no material degradation) after BH correction on both dev and confirm. For must‑not‑regress slices (e.g., safety‑critical), set $\tau_g = 0$.
Why MSE? MSE differences are additive (you can sum across examples) and interpretable. RMSE differences can flip signs and complicate inference. Total error reduction ensures you improve the whole system, while green non‑inferiority prevents "winning red cells while losing the map."
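The balanced objective above reduces to simple arithmetic; the sketch below uses invented per-slice numbers and an illustrative margin `eta` to show the hard gate plus the risk-weighted tie-break.

```python
eta = 0.001  # illustrative margin; the real eta is pre-registered

def delta_mse_global(per_slice, w):
    # exposure-weighted aggregate of per-slice MSE deltas (MSE is additive)
    return sum(w[g] * per_slice[g] for g in per_slice)

def risk_improve(per_slice, w, u, risk_slices):
    # tie-break from Section 4.1: sum of w_g * u_g * dMSE_g over risk slices
    return sum(w[g] * u[g] * per_slice[g] for g in risk_slices)

w = {"med": 0.2, "chat": 0.8}   # exposure weights
u = {"med": 0.9, "chat": 0.1}   # stakes weights
patch_a = {"med": -0.010, "chat": 0.000}   # concentrates gain on the risk slice
patch_b = {"med": -0.002, "chat": -0.002}  # small spread-out gain

gate_a = delta_mse_global(patch_a, w) <= -eta
gate_b = delta_mse_global(patch_b, w) <= -eta
tie_a = risk_improve(patch_a, w, u, ["med"])
tie_b = risk_improve(patch_b, w, u, ["med"])
print(gate_a, gate_b, tie_a < tie_b)  # both pass the gate; A wins the tie-break
```

Both patches clear the global gate with the same aggregate ΔMSE, but patch A is preferred because its improvement lands on the high-stakes risk slice.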
4.2 Distribution Shift & Anchor Stability (Blast‑Radius Caps)
To ensure patches remain local perturbations rather than wholesale rescalings, impose hard constraints on score distribution and anchor drift:
- Distribution shift cap: Kolmogorov–Smirnov distance $\mathrm{KS}(F_{R'}, F_R) \le 0.05$ (or Wasserstein distance $W_1(F_{R'}, F_R) \le 0.02$). This bounds the maximum pointwise CDF difference, preventing large‑scale rescalings.
- Anchor stability: For reference policies $\pi_{\mathrm{low}}$ and $\pi_{\mathrm{high}}$, require $|\hat V'(\pi) - \hat V(\pi)| \le 0.01$ for each anchor. Ensures the calibrated scale remains comparable across rubric versions.
Why cap shift? Without these constraints, a patch could trivially "improve" residuals by rescaling all scores. Distribution and anchor caps keep patches interpretable and prevent score drift across versions.
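The KS cap can be computed with a few lines; the score vectors below are invented, and `ks_statistic` is a plain empirical-CDF implementation rather than a library call.

```python
def ks_statistic(a, b):
    # two-sample Kolmogorov-Smirnov distance between empirical CDFs
    xs = sorted(set(a) | set(b))
    def cdf(sample, x):
        return sum(1 for v in sample if v <= x) / len(sample)
    return max(abs(cdf(a, x) - cdf(b, x)) for x in xs)

before = [0.1, 0.2, 0.4, 0.5, 0.7, 0.9]   # calibrated scores pre-patch
after_noop = list(before)                  # patch with no distributional effect
after_shift = [x + 0.3 for x in before]    # wholesale rescaling (should be caught)

ks_same = ks_statistic(before, after_noop)
ks_shift = ks_statistic(before, after_shift)
print(ks_same, ks_shift, ks_shift <= 0.05)  # the rescaled patch fails the cap
```

A patch that merely translates the score distribution produces a large KS distance and is rejected, even if its residual metrics look favorable.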
Acceptance Predicate (v1.1)
On a given iteration with splits (fit, dev, confirm), a candidate patch $\delta$ is acceptable iff, on both dev and confirm:
- Global calibration: $\Delta\mathrm{MSE} \le -\eta$ and $\Delta\mathrm{ECE} \le 0$;
- Green non‑inferiority: $\Delta\mathrm{MSE}_g < \tau_g$ for all $g \in \mathcal{G}$ (BH‑corrected one‑sided tests);
- Blast radius: $\mathrm{KS} \le 0.05$ (or $W_1 \le 0.02$) and anchor drift $\le 0.01$;
- Transport: groupwise residual means consistent with 0 in target environments;
- Anti‑gaming: worst‑case uplift $\le 0.05$ per attack;
- Complexity: $\kappa(\delta) \le \tau$.
Tie‑break: Among acceptable patches, prefer the one with the most negative RiskImprove(δ) (risk‑weighted ΔMSE over risk slices), then smallest κ(δ), then smallest KS shift. Reject if all gates pass but improvement is tiny and OUA share is high (≥ 0.30) — collect more labels instead.
5. Selective‑Inference‑Valid Patch Selection (Nested Holds)
Data splitting. For iteration , randomly partition oracle indices into disjoint subsets:
- fit: train calibrators (K‑fold cross‑fit internal to fit).
- dev: synthesize candidates and select $\hat\delta$ (search uses only dev residuals/cards).
- confirm (time‑separated): never used for synthesis/selection; only for acceptance.
Null of no improvement. For a fixed $\delta$, define on any set $D$
$\Delta(D) = \mathrm{MSE}_{\theta \oplus \delta}(D) - \mathrm{MSE}_{\theta}(D),$
with null $H_0: \mathbb{E}[\Delta] \ge 0$.
Acceptance requires $\Delta(\mathrm{dev}) \le -\eta$ and $\Delta(\mathrm{confirm}) \le -\eta$.
Proposition 5.1 (Valid confirm‑set test under search)
Condition on the dev set and the selected $\delta$. If the confirm set is independent and not used during selection, and the calibrators are fitted without using confirm, then a one‑sided test of $H_0$ using $\Delta(\mathrm{confirm})$ enjoys valid type‑I error control at level $\alpha$ (asymptotically normal via CLT or via paired bootstrap), regardless of the (arbitrary) search on dev.
Proof sketch. Sample splitting removes selection–test dependence; $\delta$ is measurable w.r.t. the dev σ-field; the confirm statistic remains an unbiased (or asymptotically normal) estimator of its expectation. □
In practice: We ensure independence by time‑separating confirm (e.g., collected ≥ 48–72 hours after dev) and never touching confirm for search or tuning.
Corollary 5.2 (Family‑wise error per iteration)
If at most one patch per family is accepted and a Bonferroni correction is applied across the $m$ families on the confirm set (each tested at level $\alpha/m$), the per‑iteration FWER is at most $\alpha$.
Patch budgets. To control cumulative error across iterations, cap accepted patches per quarter and limit confirm peeks (pre‑register).
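The confirm-set test of Proposition 5.1 can be sketched as a one-sided paired test on per-example squared errors; the numbers below are invented confirm-set values, and the t-statistic is computed by hand to keep the sketch self-contained.

```python
import math

# per-example squared errors on the time-separated confirm set (illustrative)
se_base  = [0.040, 0.055, 0.030, 0.070, 0.045, 0.060, 0.050, 0.035]
se_patch = [0.030, 0.045, 0.028, 0.050, 0.040, 0.048, 0.042, 0.030]

d = [p - b for p, b in zip(se_patch, se_base)]  # negative = improvement
n = len(d)
mean = sum(d) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in d) / (n - 1))
t = mean / (sd / math.sqrt(n))  # one-sided test of H0: E[delta] >= 0
print(round(mean, 4), round(t, 2))  # strongly negative t rejects H0
```

Because $\delta$ was fixed before this data was touched, the test is an ordinary one-sided paired comparison; no selection adjustment is needed, which is the content of Proposition 5.1. In practice a paired bootstrap on the same differences is a robust alternative.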
6. Anti‑Gaming as Robustness (Active Adversary, Worst‑Case Uplift)
Obligation violations. Let $\mathcal{V}$ denote pairs $(x, a)$ violating rubric obligations (e.g., unverifiable citations). An adversary chooses edit operators $\omega \in \Omega$ producing perturbed pairs $(x, \omega(a))$.
Adversarial uplift. Define the worst‑case calibrated uplift under violations:
$U(\theta) = \sup_{\omega \in \Omega}\ \mathbb{E}\big[f(J_\theta(x, \omega(a))) - f(J_\theta(x, a))\big] \quad \text{s.t. } (x, \omega(a)) \in \mathcal{V}.$
6.1 Validation Battery (held-out from calibration)
Adversarial Test Suite
- Length padding: judge should penalize or stay neutral (not reward).
- Rubric/style mimicry: judge should ignore style markers, score only content.
- Fake citations: judge should detect via evidence checks.
- Confident‑but‑wrong fact flips: judge should flag or cap the score.
- Style stripping: score should be stable (±0.02).
- Fabricated tool traces: judge should detect inconsistency with actual outputs.
Expected outcomes
- Uncalibrated $S$: increases by 0.10–0.30 under attacks
- Calibrated $R$: shift ≤ 0.05 if rubric has proper guards
- Failures surface as residual structure → trigger prompt updates
Acceptance condition. Require $U(\theta \oplus \delta) \le u_{\max}$ (default $u_{\max} = 0.05$). In practice, $U$ is approximated by CLOVER‑A (Appendix C): an evolutionary search over operators with selection on calibrated uplift and violation checks. Newly discovered exploits are added to the regression battery.
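A stripped-down uplift probe in the spirit of CLOVER-A looks like this; the judge, calibrator, attack operators, and the base response are all stand-ins invented for the sketch (the real search is evolutionary over a larger operator set).

```python
def toy_judge(text):
    # deliberately length-biased raw judge: longer looks "better"
    return min(1.0, 0.4 + 0.001 * len(text))

def calibrate(s):
    return s  # identity calibrator, for the sketch only

attacks = {
    "padding": lambda t: t + " filler" * 100,
    "fake_citation": lambda t: t + " [source: madeup.example]",
}

base = "Take 200mg with food."
s0 = calibrate(toy_judge(base))
uplift = {name: calibrate(toy_judge(op(base))) - s0 for name, op in attacks.items()}
worst = max(uplift.values())
print({k: round(v, 3) for k, v in uplift.items()}, worst <= 0.05)
```

The padding attack blows straight through the 0.05 cap, so this judge fails the acceptance condition until a length-bias guard (as in Appendix B) is added; the exploit then joins the regression battery.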
7. Transport Diagnostics & Regime Selection
For a group variable $G$ (policy/time/domain), test transport via residual means:
$H_{0,g}: \mathbb{E}[e \mid G = g] = 0 \quad \text{for each group } g.$
BH within each family controls FDR (or Bonferroni for FWER). If any group fails:
Local surrogacy (Regime 2)
Fit environment‑specific calibrators $f_e$ and evaluate within each environment $e$.
No surrogacy (Regime 1)
Use $Y$ directly with DR on labeled rows.
Global patch attempt
If failures align with clear missed evidence, propose a global patch and re‑test.
8. OUA Variance & Sample‑Size Planning
Variance decomposition (DR + cross‑fitting).
$\widehat{\mathrm{Var}}\big(\hat V_{\mathrm{DR}}\big) = V_{\mathrm{IF}} + V_{\mathrm{OUA}},$
with $V_{\mathrm{OUA}}$ from calibrator‑learning uncertainty. Estimate $V_{\mathrm{OUA}}$ by delete‑one‑fold jackknife over oracle folds; add to the influence‑function variance $V_{\mathrm{IF}}$ for the main DR term.
OUA share. $\mathrm{OUA\ share} = V_{\mathrm{OUA}} / (V_{\mathrm{IF}} + V_{\mathrm{OUA}})$.
Budget rule
High OUA (share ≥ 0.3) → acquire more oracle labels; low OUA (share ≤ 0.1) → gather more cheap scores to reduce evaluation variance.
Sizing (rule‑of‑thumb). For a target SE $\varepsilon$, solve $\widehat{\mathrm{Var}}(\hat V) \le \varepsilon^2$ jointly for the total sample size $n$ and oracle count $\rho\,n$; the worked example below gives one concrete instance.
Worked example
Setup: OUA share ≈ 0.2, so $V_{\mathrm{OUA}} = 0.25\,V_{\mathrm{IF}}$ and total variance $= V_{\mathrm{IF}}/0.8$.
Goal: SE $\approx 0.0125$ (95% CI width ≈ 0.05).
Derivation:
With OUA share = 0.2: the influence‑function term must satisfy $\hat\sigma^2/n = 0.8 \times \mathrm{SE}^2 = 0.8 \times 1.56 \times 10^{-4}$.
If we set $\rho = 0.2$ (20% oracle coverage), then: $n_{\mathrm{oracle}} = 0.2\,n$.
For $\hat\sigma^2 \approx 1$ (typical): $n = 1/(0.8 \times 1.56 \times 10^{-4}) \approx 8{,}000$.
Thus: $\approx 8{,}000$ total, $\approx 1{,}600$ oracle labels achieves the target SE.
Note: If OUA share is higher (e.g., 0.3), allocate more budget to oracle labels; if lower (e.g., 0.1), gather more cheap scores instead.
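The budget arithmetic can be packaged as a tiny calculator; the per-sample influence-function variance $\hat\sigma^2 = 1$ and the 20% oracle coverage are assumptions carried over from the worked example, not universal constants.

```python
target_se = 0.0125   # 95% CI width ~ 0.05
oua_share = 0.2
sigma2_if = 1.0      # assumed per-sample influence-function variance
coverage = 0.2       # oracle labels as a fraction of n

# total variance = (sigma2_if / n) / (1 - oua_share); solve for n
n_total = sigma2_if / ((1 - oua_share) * target_se ** 2)
n_oracle = coverage * n_total
print(round(n_total), round(n_oracle))  # 8000 1600
```

Raising `oua_share` to 0.3 with the same target SE pushes the totals up, matching the rule of allocating more budget to oracle labels when the OUA share is high.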
9. Engineering Contracts (Score‑Once, Versioning, Determinism)
- Deterministic judging. Fix decoding (e.g., temperature 0).
- Score‑once. Cache $S$; mark DSL fields that require re‑scoring vs. those that can be recomputed downstream.
- Versioning. Persist {judge_model_id, prompt_hash, rubric_version, calibrator_version, SDP_version, anchors}.
- Change control. Two‑key approval for abstention/safety edits; automatic rollback if guardrails fail post‑deployment.
Judge versioning (always log)
```yaml
judge_model: gpt-4.5-mini
judge_prompt_hash: sha256:ab12cd34ef56...
rubric_version: 3.2
calibrator_version: isotonic-v5
SDP_version: v1.0_2025-11-11
anchors: [pi_low=baseline-gpt4, pi_high=expert-panel]
```
Any change to model family or hard rules triggers a small oracle re‑calibration before deployment.
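The version bundle above can be assembled mechanically at deploy time; the prompt text and field values in this sketch are placeholders, and only the hashing pattern is the point.

```python
import hashlib
import json

# Placeholder prompt; in practice, hash the exact deployed judge prompt
prompt_text = "You are a careful judge. Verify citations against allowed domains."
bundle = {
    "judge_model_id": "gpt-4.5-mini",
    "prompt_hash": "sha256:" + hashlib.sha256(prompt_text.encode("utf-8")).hexdigest(),
    "rubric_version": "3.2",
    "calibrator_version": "isotonic-v5",
    "SDP_version": "v1.0_2025-11-11",
    "anchors": ["pi_low=baseline-gpt4", "pi_high=expert-panel"],
}
print(json.dumps(bundle, indent=2))
```

Persisting this record with every scored batch makes "score once, calibrate many" auditable: any cached score can be traced back to the exact judge, rubric, calibrator, and SDP that produced it.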
Reporting template (minimum bundle per iteration)
- Target & anchors: $Y$ vs. $Y^{\ast}$, SDP version, specs; anchor stability check
- Policy values: $\hat V(\pi)$ with OUA‑augmented 95% CIs (Direct/IPS/DR where applicable)
- Calibration metrics: RMSE/ECE (before/after); calibration curve
- Residuals: Slice table (means with CIs) before/after; hierarchical summary if used
- Transport tests: Per‑environment pass/fail ($p$-values)
- Anti‑gaming: Uplifts under each test; new exploits discovered
- OUA share: Overall and by decile
- Negative segments: Top cells by weighted loss and minimal friction to flip
- Complexity: DSL size delta; dead‑rule pruning status
- Diffs & versions: Patch DSL diffs; version triplet
10. Limitations & Scope Conditions
Construct drift
If the welfare construct changes, recalibration cannot fix it; re‑anchor and re‑spec SDP.
Selective‑inference over time
Repeated iterations consume the confirm budget; time‑blocked confirms and pre‑registered patch budgets mitigate.
Non‑regular estimands
For extreme quantiles/worst‑case metrics, use EVT‑aware inference.
Reward hacking risk
Training models on $R$ can still induce exploitation; when used for training, optimize a lower confidence bound and validate on fresh $Y$ via A/B.
10.1 Explicit De‑Scoping (What We're NOT Doing)
To keep CLOVER v1.1 lightweight and implementable, we explicitly de‑scope the following:
- No elaborate prompt DSL. A tiny YAML with obligations, guards, and abstention triggers is sufficient. We avoid building a custom domain‑specific language with complex parsing and code generation.
- No fancy weight‑stabilization schemes. Prefer Direct/DR estimators unless ESS is healthy (>10% of n). Avoid uncontrolled IPS unless overlap is strong; standard stabilization (clip weights, regularize nuisance models) is acceptable but not mandatory.
- No massive adversarial frameworks. A simple mutation search over 6–8 attack operators (padding, mimicry, fact flips, fake cites, fabricated traces) with 100–200 tries per patch suffices to keep us honest. Evolutionary/GAN‑based adversaries are out‑of‑scope.
- No uncontrolled patch search. Limit to ≤ 1 candidate per family with a confirm split. Multi‑armed bandit or Bayesian optimization over patch space is unnecessary given the small patch budget (≤ 2 per iteration).
- No dynamic slice adaptation. Pre‑register the slice family at iteration start and freeze the risk/green/neutral classification for that iteration. Adaptive slicing mid‑search introduces selection bias.
- No automatic SDP re‑specification. If the welfare construct changes (e.g., shift from response quality to safety), CLOVER cannot auto‑detect or fix it—this requires human re‑anchoring and a new SDP version.
Philosophy: CLOVER v1.1 prioritizes statistical rigor and interpretability over automation. A small team with notebooks + YAML + simple calibrators can run the full loop. Extensions (dynamic slices, learned patch synthesis, active learning for oracle sampling) are future work.
11. CLOVER‑Synth: Automated Patch Synthesis
The CLOVER loop has four steps: Audit (calculate residuals), Diagnose (generate Residual Cards), Synthesize Patch (δ), and Validate (test against the Acceptance Predicate). Step 3, Synthesis, is currently the primary bottleneck—it requires human analysts to interpret Residual Cards and manually draft precise YAML patches. This process is slow, requires expertise, and does not systematically explore the patch space.
CLOVER‑Synth introduces automated, governed patch synthesis by integrating an Optimizer LLM (potentially orchestrated by a framework like DSPy) to generate candidate patches from Residual Cards. Crucially, this automation lives strictly inside the CLOVER governance framework: the existing statistical guardrails (selective-inference validity, acceptance predicate, complexity constraints) ensure that automated improvements are real, interpretable, and safe.
11.1. The Bottleneck: Manual Patch Synthesis
In the base CLOVER loop, human experts must:
- Review Residual Cards (e.g., "Judge over-scores long responses with fake citations on medical tasks").
- Reason about which rubric component (obligation, guard, or abstention trigger) should change.
- Draft a structured YAML patch adhering to the Rubric Schema (Appendix B).
- Iterate if the patch fails the Acceptance Predicate on $D_{\mathrm{confirm}}$.
This manual process limits iteration speed and systematic exploration. An automated system can search the patch space more efficiently while maintaining rigor through CLOVER's existing validation infrastructure.
11.2. The Opportunity: LLM-Driven Optimization
An Optimizer LLM can automate patch generation by reasoning over Residual Cards to propose candidate patches. Frameworks like DSPy (Khattab et al., 2024) excel at optimizing prompts against defined metrics—in this context, the task is to optimize the patch generation process against the CLOVER objective function.
Benefits
- Scalability and speed: Automation enables rapid iteration and frequent judge improvements, reducing the time from failure detection to deployment.
- Systematic search: An automated system can systematically explore the patch space (e.g., testing multiple obligation phrasings or guard thresholds) to find optimal improvements that humans might miss.
- Reduced manual effort: Frees human experts to focus on higher-level tasks: defining the idealized target ($Y^{\ast}$, Layer 1), validating the Standard Deliberation Protocol (SDP, Layer 0), and conducting adversarial testing.
11.3. The Risks and CLOVER's Built-In Mitigation
The primary risks of automated prompt optimization are overfitting (exploiting noise in the development set) and loss of interpretability (opaque, brittle patches). Naive optimization can produce prompts that perform well on dev but fail in production.
Crucially, CLOVER was explicitly designed with the statistical guardrails necessary to make automated optimization safe and rigorous.
Risk 1: Overfitting to Dev Set
Threat: The optimizer aggressively finds patches that exploit noise in $D_{\mathrm{dev}}$, leading to false improvements that don't generalize.
CLOVER Mitigation: CLOVER enforces Selective-Inference-Valid Patch Selection (§5) using nested holdouts (Fit, Dev, Confirm). The optimizer only has access to $D_{\mathrm{dev}}$. A patch is accepted only if the improvement replicates on the time-separated $D_{\mathrm{confirm}}$ set. This provides valid Type-I error control regardless of the complexity of the search on Dev. The confirm split acts as an honest arbiter that has never been seen during optimization.
Risk 2: Loss of Interpretability and Stability
Threat: The optimizer produces opaque, large-scale prompt rewrites that are hard to understand, audit, or maintain.
CLOVER Mitigation: The optimization must be constrained:
- Structured output schema: The Optimizer LLM must output YAML patches adhering to the Rubric Schema (Appendix B), not arbitrary text edits. This ensures patches are interpretable and versioned.
- Complexity penalty $\kappa(\delta)$: CLOVER's complexity budget (e.g., max tokens added, max obligations changed) enforces interpretability and favors small, local perturbations over wholesale rewrites.
- Obligation-first bias: The optimizer should prioritize adding missing welfare dimensions (new obligations) over tightening existing criteria (guard threshold changes), maintaining semantic clarity.
- Hard stability constraints: The Acceptance Predicate includes blast-radius caps (KS distance, Anchor Stability §4.2) preventing large-scale rescalings, and Green-Slice Non-Inferiority ensuring patches don't degrade performance on well-calibrated, high-stakes slices.
Risk 3: Data Leakage and Memorization
Threat: The optimizer gains access to raw oracle labels on $D_{\mathrm{dev}}$, enabling it to memorize specific examples rather than learn generalizable patterns.
CLOVER Mitigation: The optimizer must only access aggregated Residual Cards, not raw $(x, a, Y)$ tuples (§3.1). Residual Cards provide summary statistics (mean residuals, slice definitions, cardinal failure modes) without exposing individual labels, preventing overfitting to specific examples.
11.4. Proposed Architecture: The Two-Loop System
CLOVER‑Synth integrates automated optimization as an inner loop nested within the existing CLOVER governance framework (the outer loop):
Outer Loop (CLOVER): Statistical Arbiter
The outer loop manages data splits, calibration, and the final Acceptance Predicate. It has exclusive access to $D_{\mathrm{confirm}}$ and enforces selective-inference validity.
Responsibilities:
- Partition data into $D_{\mathrm{fit}}$, $D_{\mathrm{dev}}$, $D_{\mathrm{confirm}}$
- Calibrate baseline judge on $D_{\mathrm{fit}}$
- Generate Residual Cards from $D_{\mathrm{dev}}$ residuals
- Invoke inner loop (CLOVER‑Synth) to generate candidate patch $\delta$
- Test $\delta$ against the full Acceptance Predicate on $D_{\mathrm{confirm}}$
- Deploy $\theta \oplus \delta$ if accepted, reject otherwise
Inner Loop (CLOVER‑Synth): Optimizer
The inner loop uses an Optimizer LLM (potentially orchestrated by DSPy) to generate patches optimized against the dev-set objective. It has no access to $D_{\mathrm{confirm}}$.
Task signature: Residual_Card → Patch_Delta
Optimization objective (on $D_{\mathrm{dev}}$ only):
$\hat\delta = \arg\min_{\delta}\ \mathrm{RiskImprove}(\delta) \quad \text{subject to } C(\delta),$
where $\mathrm{RiskImprove}(\delta)$ is the risk-weighted MSE improvement (§4.1) and $C(\delta)$ enforces all guardrails on dev:
- Complexity cap: $\kappa(\delta) \le \tau$
- Green-slice non-inferiority: $\Delta\mathrm{MSE}_g < \tau_g$ for all $g \in \mathcal{G}$
- Blast-radius cap: $\mathrm{KS} \le 0.05$ (or $W_1 \le 0.02$)
- Anchor stability: drift $\le 0.01$
- Anti-gaming: pass adversarial robustness tests (§6)
Output: The best patch found on dev, subject to all constraints.
The Critical Guarantee
The inner loop can search arbitrarily aggressively over $D_{\mathrm{dev}}$ (trying thousands of candidate patches, using reinforcement learning, or leveraging multi-armed bandits) without compromising statistical validity. The outer loop's confirm-set validation provides an honest, selection-agnostic test that controls Type-I error regardless of the complexity of the dev-set search. This is the essence of selective inference: search freely, validate honestly.
11.5. Implementation Sketch: DSPy Integration
DSPy (Khattab et al., 2024) provides a natural framework for implementing CLOVER‑Synth:

```python
# Pseudocode: CLOVER-Synth with DSPy
import dspy

class PatchGenerator(dspy.Signature):
    """Generate a YAML patch to improve judge calibration."""
    residual_card = dspy.InputField(desc="Structured summary of systematic judge failures")
    current_rubric = dspy.InputField(desc="Current judge rubric (YAML)")
    patch_delta = dspy.OutputField(desc="Proposed YAML patch (must adhere to schema)")

class CLOVERSynth(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_patch = dspy.ChainOfThought(PatchGenerator)

    def forward(self, residual_card, current_rubric):
        # Generate candidate patch
        patch = self.generate_patch(
            residual_card=residual_card,
            current_rubric=current_rubric,
        ).patch_delta
        # Parse and validate schema
        patch_obj = parse_yaml_patch(patch)
        assert validate_patch_schema(patch_obj), "Patch violates schema"
        return patch_obj

# Optimization loop (inner loop, on D_dev only)
optimizer = dspy.BootstrapFewShot(metric=clover_dev_objective)
optimized_synth = optimizer.compile(
    CLOVERSynth(),
    trainset=dev_residual_cards,
)

# Generate best patch on dev
best_patch = optimized_synth(
    residual_card=current_residual_card,
    current_rubric=theta_0,
)

# Outer loop: Test on confirm (CLOVER's honest arbiter)
if accept_on_confirm(best_patch, D_confirm):
    deploy(theta_0 + best_patch)
else:
    reject(best_patch)
```
The clover_dev_objective metric evaluates patches on $D_{\mathrm{dev}}$ against the full set of dev-set constraints (complexity, green-slice non-inferiority, blast-radius, etc.). DSPy's optimizer searches the space of patch generators to maximize this objective, but the final deployment decision rests with the outer loop's confirm-set test.
11.6. When to Use CLOVER‑Synth
| Scenario | Manual CLOVER | CLOVER‑Synth |
|---|---|---|
| Initial rubric design (defining obligations from scratch) | ✓ Preferred (human semantic design) | ✗ Not recommended |
| Iterative refinement (5–10 patch cycles, systematic failures) | ○ Workable but slow | ✓ Ideal use case |
| High-stakes, novel domains (medical, legal, safety-critical) | ✓ Preferred (human oversight critical) | ○ Use with expert review of all patches |
| Rapid deployment cycles (weekly judge updates, mature rubrics) | ✗ Bottleneck | ✓ Enables fast iteration |
11.7. Summary: Why CLOVER Enables Safe Automation
CLOVER‑Synth is a natural and safe extension because CLOVER was designed from the start to support automated optimization:
- Selective-inference validity: The confirm split provides honest Type-I error control regardless of dev-set search complexity.
- Input/output constraints: Structured Residual Cards (input) and YAML schema (output) ensure interpretability and prevent data leakage.
- Multi-dimensional acceptance predicate: Complexity penalties, green-slice non-inferiority, blast-radius caps, and anti-gaming tests prevent the optimizer from finding shallow improvements.
- Explicit versioning and rollback: Every patch is logged with its Residual Card, dev/confirm metrics, and timestamp, enabling audits and rollbacks if production performance degrades.
The Key Insight
Automated prompt optimization is risky when done naively (overfitting, gaming, brittleness). But when nested inside a governed statistical framework with honest holdouts, hard constraints, and explicit versioning, it becomes a powerful tool for scaling systematic improvement. CLOVER provides that framework. SDP-Govynth is what happens when you take the optimization seriously and the statistics seriously.
Appendix A. Metrics & Test Statistics
- MSE & RMSE: $\mathrm{MSE} = \frac{1}{n}\sum_{i}(Y_i - R_i)^2$, $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$. For acceptance gates, use ΔMSE (additive, stable under aggregation) rather than ΔRMSE.
- ECE (Expected Calibration Error): Partition predictions into $B$ equal‑frequency bins. For bin $b$, compute mean prediction $\bar R_b$ and mean outcome $\bar Y_b$. Then:
$\mathrm{ECE} = \sum_{b=1}^{B} \frac{n_b}{n}\,\big|\bar R_b - \bar Y_b\big|.$
Use the same bin boundaries (determined on the baseline) when computing ΔECE for patches. Alternatives: adaptive isotonic ECE or calibration slope.
- Residual flatness: For pre‑registered slices, report $\bar e_g$ with CIs and BH‑adjusted $p$-values; accept only if all CIs overlap 0 post‑patch.
- Confirm test: One‑sided $t$-test or paired bootstrap on $\Delta\mathrm{MSE}(\mathrm{confirm})$.
- Transport: Per‑group residual tests; Prentice‑style sufficiency checks by regressing $Y$ on $(S, x, a)$ and testing the non‑$S$ terms.
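The equal-frequency-bin ECE above can be computed in a few lines; the predictions, outcomes, and bin count in this sketch are illustrative.

```python
def ece(preds, outcomes, n_bins=4):
    # equal-frequency bins over predictions sorted ascending
    pairs = sorted(zip(preds, outcomes))
    n = len(pairs)
    total = 0.0
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins : (b + 1) * n // n_bins]
        if not chunk:
            continue
        mean_p = sum(p for p, _ in chunk) / len(chunk)   # bar R_b
        mean_y = sum(y for _, y in chunk) / len(chunk)   # bar Y_b
        total += (len(chunk) / n) * abs(mean_p - mean_y)
    return total

preds    = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
outcomes = [0.0, 0.0, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0]
val = ece(preds, outcomes)
print(round(val, 4))
```

For ΔECE between a baseline and a patched calibrator, the bin boundaries should be fixed on the baseline run, as noted above, so the two ECE values are comparable.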
Appendix B. Minimal Rubric Obligation Schema & Patch Delta
Rubric (minimal)
```yaml
rubric:
  version: 1.0
  target: "Y"  # or "Y_star"
  sdp_version: "v1.0_2025-11-11"
  obligations:
    factual_adequacy:
      required: true
      verification:
        allowed_domains: ["nih.gov","who.int","cochrane.org","nejm.org","thelancet.com","nature.com"]
        mismatch_penalty: 0.35
    reasoning_quality:
      requires_counter_position: true
    risk_accounting:
      requires_stakeholders: true
    usefulness:
      be_concise_by_default: true
  guards:
    length_bias_cap:
      if_tokens_gt: 500
      max_raw_score: 0.5
    tool_trace_consistency: true
    confident_but_wrong_max: 0.4
  abstention:
    triggers: ["medical dosing","legal advice"]
    rule: "if critical info missing -> abstain or escalate"
```
Patch delta (example)
```yaml
patch:
  family: "evidence_verification"
  id: "len-cite-guard-001"
  rationale: "Long medical answers with unverifiable cites are over-scored."
  changes:
    guards.length_bias_cap:
      if_tokens_gt: 500
      max_raw_score: 0.5
    obligations.factual_adequacy.verification:
      allowed_domains+: ["bmj.com"]
      mismatch_penalty: 0.35
  constraints:
    max_tokens_added: 150
    affects_rescore: true
  expected_effects:
    slices:
      - domain: "medical"
        length_bin: "600-900"
        residual_mean_delta: -0.06
    anti_gaming:
      length_padding_uplift: "<=0.02"
```
Appendix C. Algorithms
C.1 CLOVER‑J — Judge Closed Loop
Inputs. θ₀, logs D, oracle I_oracle; K=5, BH‑q=0.10, patience=2; confirm set time‑separated.
C.2 CLOVER‑A — Active Adversary
- Operator set Ω: padding, rubric mimicry, fact flips, fake citations, style stripping, fabricated tool traces, contradiction injections.
- Search: evolutionary loop with selection on calibrated uplift and obligation‑violation checks.
- Outputs: adversarial exemplars and estimated $\hat U(\theta)$.
C.3 CLOVER‑G — Cautious Generator Optimization
- Freeze the judge and calibrator $(\theta, f)$.
- Optimize generator prompts/procedures for a bounded number of steps against $R$.
- Validate on fresh $Y$ / $Y^{\ast}$ via small A/B; re‑run transport and anti‑gaming; rollback on regressions.
C.4 Skeleton Pseudocode (Drop‑In for v1.1)
Below is a complete, balanced CLOVER iteration loop incorporating all v1.1 improvements (risk/green classification, balanced objectives, distribution shift caps, anchor stability, partial pooling).
```python
def clover_iteration(theta, logs, oracle, anchors, slices, params):
    """
    Single iteration of CLOVER v1.1 with balanced residual improvement.

    Args:
        theta: current rubric (YAML/dict)
        logs: evaluation set (X, A pairs)
        oracle: oracle subset with Y labels (dev + confirm splits)
        anchors: reference policies (pi_low, pi_high) for stability checks
        slices: pre-registered slice family (domain x difficulty x length)
        params: hyperparameters (eta, tau_g, KS_cap, etc.)

    Returns:
        accepted_patches: list of (delta, Improve, kappa) tuples, or "Pause for labels"
    """
    # 1) Score once (deterministic, cached)
    S = score_deterministic(theta, logs)  # cache raw scores

    # 2) Calibrate (K-fold cross-fit, two-stage -> isotonic)
    R, oua_share = crossfit_isotonic(S, oracle, k=params.K)

    # OUA gating: if OUA high and no clear gain, pause for labels
    if oua_share >= params.oua_threshold and not promising_gain_estimate():
        return "Pause for labels"

    # 3) Residuals & slice classification (partial pooling DEFAULT)
    resid = oracle.Y - R.oof
    resid_shrunk = partial_pool(resid, slices)  # hierarchical model or James-Stein
    # Classify slices (freeze for this iteration):
    #   Risk (R): CI excludes 0 AND |mean resid| >= 0.03
    #   Green (G): high exposure/stakes with good calibration
    #   Neutral (N): everything else
    risk, green, neutral = classify_slices(resid_shrunk, slices, params)

    # 4) Propose tiny obligation-first patches (<= 1 per family)
    families = ["verify", "contradiction", "length", "abstain"]
    candidates = propose_patches(theta, families, complexity_cap=params.tau)

    accepted = []
    for delta in candidates:
        theta_p = apply_patch(theta, delta)
        S_p = maybe_rescore(theta_p, logs, cache=S)  # only if patch requires new evidence
        R_p, _ = crossfit_isotonic(S_p, oracle, k=params.K)

        # --- Hard gates (all must pass on DEV) ---
        # Global calibration non-worse
        if delta_mse_global(R.dev, R_p.dev, oracle.Y.dev) > 0:
            continue
        if delta_ece(R.dev, R_p.dev, oracle.Y.dev) > 0:
            continue
        # Distribution shift cap (KS <= 0.05 or Wasserstein <= 0.02)
        if ks_distance(R_p.all, R.all) > params.ks_cap:
            continue
        # Anchor stability (drift <= 0.01)
        if not anchors_stable(R, R_p, anchors, eps=params.anchor_eps):
            continue
        # Transport OK (groupwise residual means ~ 0 in target environments)
        if not transport_ok(R_p, oracle, environments=params.envs):
            continue
        # Anti-gaming (uplift <= 0.05 per attack)
        if not anti_gaming_ok(theta_p, params.attack_suite, uplift_cap=0.05):
            continue
        # Green non-inferiority (one-sided test: dMSE_g < tau_g for all g in G)
        if not non_inferiority_green(R.dev, R_p.dev, green, params.tau_g):
            continue
        # Risk-weighted improvement (Improve <= -eta)
        improve_dev = weighted_improve(R.dev, R_p.dev, risk, params.w, params.u)
        if improve_dev > -params.eta:
            continue

        # --- CONFIRM replication (all checks on time-separated holdout) ---
        if not non_inferiority_green(R.conf, R_p.conf, green, params.tau_g):
            continue
        improve_conf = weighted_improve(R.conf, R_p.conf, risk, params.w, params.u)
        if improve_conf > -params.eta:
            continue

        # All gates passed!
        accepted.append((delta, improve_conf, complexity(delta)))

    return accepted
```
# 5) Select top patches (≤ 2), ranked by confirmed Improve
# (most negative first); tie-break on smallest complexity κ(δ)
accepted = sorted(accepted, key=lambda x: (x[1], x[2]))[:params.max_patches]
return accepted
# --- Helper functions (minimal pseudocode) ---
def partial_pool(resid, slices):
"""Fit hierarchical model: ε ~ α_G + u_g + covariates + η.
Returns shrunken residual estimates ε̃_g."""
# Use lme4, brms, or James-Stein shrinkage
pass
def classify_slices(resid, slices, params):
"""Returns (risk, green, neutral) slice sets."""
risk = {g for g in slices if ci_excludes_zero(resid[g]) and abs(mean(resid[g])) >= 0.03}
green = {g for g in slices if (exposure(g) >= 0.05 or stakes(g) >= 0.7)
and abs(mean(resid[g])) < 0.03}
neutral = set(slices) - risk - green
return risk, green, neutral
def weighted_improve(R, Rp, risk, w, u):
"""Σ_{g∈R} w_g u_g · (MSE_g(Rp) - MSE_g(R))."""
return sum(w[g] * u[g] * (mse(Rp[g]) - mse(R[g])) for g in risk)
def non_inferiority_green(R, Rp, green, tau_g):
"""One-sided test: Δ MSE_g < τ_g for all g ∈ G (BH-corrected)."""
pvals = [one_sided_test(mse(Rp[g]) - mse(R[g]), tau_g[g]) for g in green]
return all_pass_bh(pvals, q=0.10)
def anchors_stable(R, Rp, anchors, eps=0.01):
"""Check |E[Rp | π_low] - E[R | π_low]| ≤ eps (same for π_high)."""
drift_low = abs(mean(Rp[anchors.pi_low]) - mean(R[anchors.pi_low]))
drift_high = abs(mean(Rp[anchors.pi_high]) - mean(R[anchors.pi_high]))
return drift_low <= eps and drift_high <= eps
Usage: This skeleton is a direct drop‑in for v1.1. Replace partial_pool, crossfit_isotonic, and anti_gaming_ok with your calibrator, pooling estimator, and adversarial test suite. The logic preserves all v1.1 guarantees: risk improvement subject to green non‑inferiority, distribution shift caps, anchor stability, and confirm replication.
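As one concrete replacement for the partial_pool stub, here is a minimal sketch of the James–Stein‑style fallback named in Appendix D: per‑slice mean residuals are shrunk toward the grand mean, with noisier or smaller slices pulled harder. The exact shrinkage form is my choice for illustration; a fitted hierarchical model is the stated default.

```python
def james_stein_pool(slice_resids):
    """James-Stein-style shrinkage of per-slice mean residuals toward the
    grand mean (fallback partial pooling for partial_pool, per Appendix D).

    slice_resids: dict g -> list of residuals Y - R for slice g.
    Returns dict g -> shrunken mean residual (epsilon-tilde_g).
    """
    means = {g: sum(r) / len(r) for g, r in slice_resids.items()}
    grand = sum(means.values()) / len(means)
    out = {}
    for g, resids in slice_resids.items():
        n = len(resids)
        m = means[g]
        # sampling variance of the slice mean
        var = sum((r - m) ** 2 for r in resids) / max(n - 1, 1) / n
        between = (m - grand) ** 2
        # shrinkage factor: more noise or weaker signal -> pull to grand mean
        b = var / (var + between) if (var + between) > 0 else 1.0
        out[g] = grand + (1 - b) * (m - grand)
    return out
```

Large, consistent slices keep essentially their raw mean residual, while a tiny noisy slice collapses toward the grand mean, which is exactly the behavior the risk/green classification in step 3 relies on.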
Appendix D. Default Parameters (v1.1)
| Parameter | Symbol / Name | Default Value (v1.1) |
|---|---|---|
| Cross‑fit folds | K | 5 |
| Slice count | Pre‑registered slices | ≤ 12 (domain × difficulty × length) |
| Risk threshold | |ε̄_g| | ≥ 0.03 (material miscalibration) |
| Green tolerance (MSE scale) | τ_g | 1–2 × 10⁻³ (0 for must‑not‑regress) |
| Required improvement (MSE scale) | η | 0.0005–0.0025 (≈ 0.02–0.05 RMSE on [0,1]) |
| KS shift cap | D_KS | ≤ 0.05 |
| Wasserstein shift cap | W | ≤ 0.02 |
| Anchor drift cap | Δ E[R \| π] | ≤ 0.01 (for π_low, π_high) |
| Anti‑gaming uplift | Per attack | ≤ 0.05 (calibrated scale) |
| Patch size | κ(δ) | ≤ 150 tokens or ≤ 2 new fields |
| Patch budget | Per iteration | ≤ 2 accepted patches |
| Confirm peek budget | Per quarter | ≤ 3 peeks |
| OUA gating threshold | OUA_share | ≥ 0.30 → pause for labels |
| FDR control | q | 0.10 (BH correction within slice families) |
| Partial pooling | Default estimator | Hierarchical model (or James–Stein if unavailable; min n_g ≥ 40 fallback) |
Tuning guidance: These defaults are conservative and suitable for most applications. Increase η (required improvement) if you want to filter out tiny gains; decrease τ_g (green tolerance) for critical slices you cannot afford to degrade. Tighten the KS/Wasserstein caps if score interpretability across versions is paramount.
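The table above can be collected into a single params object for the clover_iteration skeleton. The class and field names are my own (only `K`, `eta`, `tau`, `tau_g`, `ks_cap`, `anchor_eps`, `oua_threshold`, and `max_patches` are referenced in the pseudocode); where Appendix D gives a range, one value inside it is chosen.

```python
from dataclasses import dataclass

@dataclass
class CloverParams:
    """Appendix D defaults for the clover_iteration skeleton (v1.1)."""
    K: int = 5                    # cross-fit folds
    eta: float = 0.001            # required risk-weighted improvement (within 0.0005-0.0025)
    tau: int = 150                # patch complexity cap kappa(delta), tokens
    tau_g: float = 0.002          # green non-inferiority tolerance (MSE scale)
    ks_cap: float = 0.05          # KS distribution-shift cap
    wasserstein_cap: float = 0.02 # Wasserstein shift cap
    anchor_eps: float = 0.01      # anchor drift cap for pi_low, pi_high
    uplift_cap: float = 0.05      # anti-gaming uplift per attack
    max_patches: int = 2          # accepted patches per iteration
    peek_budget: int = 3          # confirm peeks per quarter
    oua_threshold: float = 0.30   # OUA share that triggers "pause for labels"
    bh_q: float = 0.10            # FDR level within slice families
    min_slice_n: int = 40         # James-Stein fallback minimum n_g
```

Using a dataclass keeps the pre‑registered defaults versionable alongside the rubric, so a patch that silently loosens a gate shows up in diff review.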
Assumptions Ledger
| Code | Statement | Used by | Test/Diagnostic | Mitigation |
|---|---|---|---|---|
| J1 | Programmable evidence Eθ (more info ↓ Bayes risk) | Design | Residual ↓ after adding evidence checks | Add obligations, not weights |
| J2 | Deterministic scoring (temp=0, fixed tools) | Score‑once | Repeat scoring → identical Sθ | Cache Sθ; version model/prompt |
| S1 | ∃ fθ: E[Y \| X, A, Sθ] = fθ(X, A, Sθ) | All | Prentice‑style sufficiency on oracle | Enrich surrogates; add covariates |
| S2 | Y ⊥ Sel \| X, A, Sθ across environments | Cross‑env transport | Groupwise residual tests | Local fθ or re‑prompt |
| L1 | Oracle MAR | Calibration | Label process audit | Randomize oracle sampling; stratify |
| L2 | Oracle positivity | Calibration | Coverage plots; tail check | Targeted labeling in tails |
| H1 | Nested holdout confirm | Patch selection | Dev vs confirm delta | Reject non‑replicated patches |
| OPE | Overlap (ESS) | IPS/DR | ESS, max/median w | Collect draws; Direct/DR; stabilize |
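The overlap diagnostics in the OPE row can be computed directly from the importance weights. A minimal sketch (the function name is mine; the ESS formula (Σw)²/Σw² and the max/median ratio are the diagnostics named in the ledger):

```python
import statistics

def ess_diagnostic(weights):
    """Overlap diagnostics for IPS/DR (assumptions-ledger row OPE):
    effective sample size ESS = (sum w)^2 / sum w^2, plus the
    max/median importance-weight ratio."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return s1 * s1 / s2, max(weights) / statistics.median(weights)

# One heavy weight collapses the ESS of 100 draws to about 4,
# signaling that IPS/DR estimates rest on a handful of samples.
ess, ratio = ess_diagnostic([1.0] * 99 + [100.0])
```

When ESS falls far below n or the max/median ratio explodes, the ledger's mitigations apply: collect more draws, fall back to Direct/DR, or stabilize the weights.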
Summary (for implementers)
- Treat the judge as a programmable measurement channel; improve obligations before weights.
- Use cross‑fit monotone calibration to obtain the calibrated reward R, and nested holdouts to accept patches.
- Require replicated calibration gains on a time‑separated confirm set, transport pass, and bounded adversarial uplift.
- Report OUA‑augmented CIs, negative segments, and maintain strict versioning.
- Only then consider optimizing generation against R, with LCB objectives and external checks.
We welcome your feedback
CLOVER is an active research framework. We invite constructive criticism from practitioners and researchers.
If you spot errors, have theoretical extensions, or have applied CLOVER in production and want to share lessons, please reach out by email at eddie@cimolabs.com.
