About CIMO Labs
We build tools that bring causal and statistical rigor to LLM evaluation, turning "vibes-based" judging into engineering discipline.
What is CIMO?
CIMO stands for Causal Information Manifold Optimization.
In plain terms: we help you measure the parallel universe—what would have happened if you'd chosen a different prompt or model—without waiting weeks for an A/B test.
Every time you evaluate an AI model, you pay a cost (time, compute, labeling) to reduce uncertainty. Most teams pay too much for low-quality information (random spot checks) or wait too long for high-quality information (A/B tests).
CIMO finds the sweet spot: the statistical confidence of an A/B test at the speed and cost of an offline evaluation. The key insight is that cheap metrics and expensive outcomes live on the same causal structure—learn the mapping on a small labeled sample, apply it at scale, and know when it breaks.
The Evaluation Gap
AI teams currently face a brutal tradeoff between speed and truth.
The "Slow & Right" Way
A/B Testing
- ✔ Statistically rigorous
- ✔ Measures real outcomes
- ✗ Takes weeks to run
- ✗ Risks bad user experiences
The "Fast & Wrong" Way
Standard Offline Evals
- ✔ Instant feedback
- ✔ Cheap to run
- ✗ Uncalibrated scores
- ✗ No confidence intervals
"A judge score improving from 7.2 to 7.8 is meaningless if you don't know how that score predicts user retention."
Our Solution
We bridge this gap with Surrogate Evaluation, using techniques from causal inference to calibrate cheap signals (like LLM judges) against expensive ground truth.
- ✔ Statistically Principled: Every estimate comes with an honest confidence interval.
- ✔ Causally Interpretable: We estimate what would happen if you deployed the policy, not just observational correlations.
- ✔ Diagnostic-First: Tools that tell you when they are unreliable (e.g., "Coverage is too low to estimate this") rather than failing silently.
Research Focus
Surrogate Metrics & Calibration
Treating fast metrics as calibrated surrogates for an Idealized Deliberation Oracle using mean-preserving transformations.
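The calibration idea can be sketched with isotonic regression: a monotone fit whose predictions preserve the mean of the labels on the fitted sample. Everything below (the synthetic judge scores, the toy outcome relationship, the sklearn estimator) is an illustrative assumption, not CJE's actual implementation.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Small labeled slice: raw judge scores paired with expensive ground truth.
judge = rng.uniform(0.0, 10.0, size=200)
outcome = np.clip((judge / 10.0) ** 2 + rng.normal(0.0, 0.05, size=200), 0.0, 1.0)

# Fit a monotone calibration map judge -> outcome on the labeled slice.
cal = IsotonicRegression(out_of_bounds="clip")
cal.fit(judge, outcome)

# Isotonic regression's block-average structure makes it mean-preserving:
# the calibrated scores have exactly the mean of the labels they were fit to.
calibrated = cal.predict(judge)
print(abs(calibrated.mean() - outcome.mean()))  # ~0 up to float error
```

In practice the fitted map would be applied to the large unlabeled pool of judge scores, so estimates inherit the scale of the real outcome instead of an arbitrary 1–10 rubric.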
Off-Policy Estimation
Adapting importance sampling and doubly robust methods to handle the specific quirks of LLMs (distributional shifts and heavy-tailed weights).
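A minimal self-normalized importance sampling (SNIPS) sketch shows the core mechanics, including why heavy-tailed weights force log-space arithmetic and diagnostics like effective sample size. The Gaussian "policies" and reward function are toy assumptions standing in for sequence log-probabilities; this is not CJE's API.

```python
import numpy as np

rng = np.random.default_rng(7)

# Logged responses sampled from a behavior policy pi0 = N(0, 1),
# each with an observed reward.
x = rng.normal(0.0, 1.0, size=5000)
reward = 1.0 / (1.0 + np.exp(-x))

def log_normal_pdf(v, mu):
    # Log-density of N(mu, 1) -- a toy stand-in for sequence log-probs.
    return -0.5 * (v - mu) ** 2 - 0.5 * np.log(2.0 * np.pi)

# Importance weights w = pi1(x) / pi0(x) for a target policy pi1 = N(0.5, 1).
# Compute in log space and subtract the max: heavy-tailed weights overflow otherwise.
log_w = log_normal_pdf(x, 0.5) - log_normal_pdf(x, 0.0)
w = np.exp(log_w - log_w.max())

# Self-normalized IPS: dividing by the weight sum tames weight variance
# at the cost of a small bias.
snips = float(np.sum(w * reward) / np.sum(w))

# Effective sample size -- the kind of diagnostic that flags unreliable estimates.
ess = float(np.sum(w) ** 2 / np.sum(w ** 2))
```

Doubly robust methods layer an outcome model on top of these weights so the estimate stays consistent if either the weights or the model is correct.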
What We Are
CIMO Labs is an open research effort—not a startup, not a company (yet). We're formalizing causal inference methods for AI evaluation and building the tools to make them practical.
- • CJE is open source and free to use under the MIT license—no vendor lock-in, no API keys, no usage limits
- • Research papers are pre-publication working documents—methods are validated on real data but not yet peer-reviewed
- • For enterprise needs (custom implementation, consulting, dedicated support), contact us to discuss options
CIMO Labs is:
- ✔ An open research project formalizing causal inference for AI evaluation
- ✔ A free, MIT-licensed library (CJE) you can use today
- ✔ Documentation and methods that belong to the community
- ✔ Open to research collaborations and enterprise partnerships
CIMO Labs is not:
- ✗ A SaaS product with subscription pricing
- ✗ A managed evaluation service (yet)
- ✗ A black-box proprietary system
- ✗ Vaporware—the library is live and validated
Our model: We believe the ecosystem needs shared standards for rigorous AI evaluation, not proprietary black boxes. The research, methods, and core library are open. We're exploring how to sustainably fund this work while keeping it accessible—enterprise partnerships and consulting help support ongoing research.
Open Source First
CJE (Causal Judge Evaluation) is our MIT-licensed library for rigorous evaluation. No vendor lock-in, no API keys, no usage limits. Run it on your infrastructure with your data.
Founder

Eddie Landesberg
Research Scientist & Engineer
Eddie has spent a decade applying causal inference to production systems.
Previously at Stitch Fix, he built an advertising optimization system managing $150M/year, generating ~$40M in efficiency gains via randomized experiments. He is the author of "Want to make good business decisions? Learn causality", a staple in data science curricula.
At Salesforce, he led the first end-to-end ML deployments for the marketing organization. As co-founder of Fondu (featured by a16z), he built consumer-facing memory systems for LLMs.
Facing evaluation challenges?
We work with teams dealing with high-stakes deployments, limited labeled data, and regulatory requirements.
