CIMO Labs

Blog

Technical insights on causal evaluation, LLM judges, and off-policy learning.

The Visual Guide to Offline LLM Evaluation

10 min read · CIMO Labs

Nine plots that show when to ship your LLM changes and when to collect more data. From weight explosions to confidence intervals, see what reliable evaluation actually looks like.


From A/B to Offline: Causal Evaluation for LLMs

8 min read · CIMO Labs

Standard offline evaluation fails catastrophically for LLMs: on Arena data, the effective sample size drops to 0.6%. Here's how importance weighting with calibration and stabilization enables reliable counterfactual evaluation.
