About CIMO Labs
We build tools that bring causal and statistical rigor to LLM evaluation, turning "vibes-based" judging into engineering discipline.
What is CIMO?
CIMO stands for Causal Information Manifold Optimization.
In plain terms: we help you measure the parallel universe—what would have happened if you'd chosen a different prompt or model—without waiting weeks for an A/B test.
Every time you evaluate an AI model, you pay a cost (time, compute, labeling) to reduce uncertainty. Most teams pay too much for low-quality information (random spot checks) or wait too long for high-quality information (A/B tests).
CIMO finds the sweet spot: the statistical confidence of an A/B test at the speed and cost of an offline evaluation. The key insight is that cheap metrics and expensive outcomes live on the same causal structure—learn the mapping on a small labeled sample, apply it at scale, and know when it breaks.
The Evaluation Gap
AI teams currently face a brutal tradeoff between speed and truth.
The "Slow & Right" Way
A/B Testing
- ✔ Statistically rigorous
- ✔ Measures real outcomes
- ✗ Takes weeks to run
- ✗ Risks bad user experiences
The "Fast & Wrong" Way
Standard Offline Evals
- ✔ Instant feedback
- ✔ Cheap to run
- ✗ Uncalibrated scores
- ✗ No confidence intervals
"A judge score improving from 7.2 to 7.8 is meaningless if you don't know how that score predicts user retention."
Our Solution
We bridge this gap with Surrogate Evaluation, using techniques from causal inference to calibrate cheap signals (like LLM judges) against expensive ground truth.
- ✔ Statistically Principled: Every estimate comes with an honest confidence interval.
- ✔ Causally Interpretable: We estimate what would happen if you deployed the policy, not just observational correlations.
- ✔ Diagnostic-First: Tools that tell you when they are unreliable (e.g., "Coverage is too low to estimate this") rather than failing silently.
Research Focus
Surrogate Metrics & Calibration
Treating fast metrics as calibrated surrogates for an Idealized Deliberation Oracle using mean-preserving transformations.
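The calibration idea can be sketched with isotonic regression: a monotone fit whose predictions preserve the mean of the labels on the fitted sample. Everything below (the synthetic judge scores, the toy outcome relationship, the sklearn estimator) is an illustrative assumption, not CJE's actual implementation.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Small labeled slice: raw judge scores paired with expensive ground truth.
judge = rng.uniform(0.0, 10.0, size=200)
outcome = np.clip((judge / 10.0) ** 2 + rng.normal(0.0, 0.05, size=200), 0.0, 1.0)

# Fit a monotone calibration map judge -> outcome on the labeled slice.
cal = IsotonicRegression(out_of_bounds="clip")
cal.fit(judge, outcome)

# Isotonic regression's block-average structure makes it mean-preserving:
# the calibrated scores have exactly the mean of the labels they were fit to.
calibrated = cal.predict(judge)
print(abs(calibrated.mean() - outcome.mean()))  # ~0 up to float error
```

In practice the fitted map would be applied to the large unlabeled pool of judge scores, so estimates inherit the scale of the real outcome instead of an arbitrary 1–10 rubric.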
Off-Policy Estimation
Adapting importance sampling and doubly robust methods to handle the specific quirks of LLMs (distributional shifts and heavy-tailed weights).
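A minimal self-normalized importance sampling (SNIPS) sketch shows the core mechanics, including why heavy-tailed weights force log-space arithmetic and diagnostics like effective sample size. The Gaussian "policies" and reward function are toy assumptions standing in for sequence log-probabilities; this is not CJE's API.

```python
import numpy as np

rng = np.random.default_rng(7)

# Logged responses sampled from a behavior policy pi0 = N(0, 1),
# each with an observed reward.
x = rng.normal(0.0, 1.0, size=5000)
reward = 1.0 / (1.0 + np.exp(-x))

def log_normal_pdf(v, mu):
    # Log-density of N(mu, 1) -- a toy stand-in for sequence log-probs.
    return -0.5 * (v - mu) ** 2 - 0.5 * np.log(2.0 * np.pi)

# Importance weights w = pi1(x) / pi0(x) for a target policy pi1 = N(0.5, 1).
# Compute in log space and subtract the max: heavy-tailed weights overflow otherwise.
log_w = log_normal_pdf(x, 0.5) - log_normal_pdf(x, 0.0)
w = np.exp(log_w - log_w.max())

# Self-normalized IPS: dividing by the weight sum tames weight variance
# at the cost of a small bias.
snips = float(np.sum(w * reward) / np.sum(w))

# Effective sample size -- the kind of diagnostic that flags unreliable estimates.
ess = float(np.sum(w) ** 2 / np.sum(w ** 2))
```

Doubly robust methods layer an outcome model on top of these weights so the estimate stays consistent if either the weights or the model is correct.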
What We Are
CIMO Labs is an open research effort—not a startup, not a company (yet). We're formalizing causal inference methods for AI evaluation and building the tools to make them practical.
- • CJE is open source and free to use under the MIT license—no vendor lock-in, no API keys, no usage limits
- • Research papers are pre-publication working documents—methods are validated on real data but not yet peer-reviewed
- • For enterprise needs (custom implementation, consulting, dedicated support), contact us to discuss options
CIMO Labs is:
- ✔ An open research project formalizing causal inference for AI evaluation
- ✔ A free, MIT-licensed library (CJE) you can use today
- ✔ Documentation and methods that belong to the community
- ✔ Open to research collaborations and enterprise partnerships
CIMO Labs is not:
- ✗ A SaaS product with subscription pricing
- ✗ A managed evaluation service (yet)
- ✗ A black-box proprietary system
- ✗ Vaporware—the library is live and validated
Our model: We believe the ecosystem needs shared standards for rigorous AI evaluation, not proprietary black boxes. The research, methods, and core library are open. We're exploring how to sustainably fund this work while keeping it accessible—enterprise partnerships and consulting help support ongoing research.
Open Source First
CJE (Causal Judge Evaluation) is our MIT-licensed library for rigorous evaluation. No vendor lock-in, no API keys, no usage limits. Run it on your infrastructure with your data.
Founder

Eddie Landesberg
Research Scientist & Engineer
Eddie has spent a decade applying causal inference to production systems.
Previously at Stitch Fix, he built an advertising optimization system managing $150M/year, generating ~$40M in efficiency gains via randomized experiments. He is the author of "Want to make good business decisions? Learn causality", a staple in data science curricula.
At Salesforce, he led the first end-to-end ML deployments for the marketing organization. As co-founder of Fondu (featured by a16z), he built consumer-facing memory systems for LLMs.
Facing evaluation challenges?
We work with teams dealing with high-stakes deployments, limited labeled data, and regulatory requirements.
