v0.5.0 — runtime functional

Stop guessing whether your agent got better.

selfevals is a CLI-first, self-improving evals framework. Point it at your agent, sweep the parameters you expose, and get a report that tells you which configuration to keep — with evidence, not intuition.


Agnostic to the agent framework underneath

OpenAI · Anthropic · Bedrock · Vertex · LangChain · CrewAI

Why selfevals

An evals harness that earns the configuration you ship.

Five nouns, one YAML spec, a closed feedback loop. selfevals never calls your provider — your agent does, and selfevals grades the result.

Adapters

Point it at any agent

Embedded callable, CLI subprocess, or HTTP endpoint. selfevals calls your agent, never the provider directly — so it stays framework-agnostic from day one.
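As a rough illustration of what "calling your agent, never the provider" can look like, here is a minimal sketch of two of the three adapter styles. The `Adapter` protocol, the class names, and the stdin/stdout convention are all assumptions made for this example, not the selfevals API. An HTTP variant would follow the same shape and is omitted to keep the sketch dependency-free.

```python
import subprocess
import sys
from typing import Callable, Protocol


class Adapter(Protocol):
    """Hypothetical interface: take a case input, return the agent's output."""

    def run(self, case_input: str) -> str: ...


class CallableAdapter:
    """Wraps an in-process Python callable (the 'embedded' style)."""

    def __init__(self, fn: Callable[[str], str]) -> None:
        self.fn = fn

    def run(self, case_input: str) -> str:
        return self.fn(case_input)


class SubprocessAdapter:
    """Runs the agent as a CLI: input goes to stdin, output is read from stdout."""

    def __init__(self, argv: list[str], timeout_s: float = 30.0) -> None:
        self.argv = argv
        self.timeout_s = timeout_s

    def run(self, case_input: str) -> str:
        proc = subprocess.run(
            self.argv,
            input=case_input,
            capture_output=True,
            text=True,
            timeout=self.timeout_s,
            check=True,  # a non-zero exit code raises CalledProcessError
        )
        return proc.stdout.strip()


def echo_agent(prompt: str) -> str:
    """Stand-in agent that returns its input unchanged."""
    return prompt


embedded = CallableAdapter(echo_agent)
cli = SubprocessAdapter(
    [sys.executable, "-c", "import sys; print(sys.stdin.read(), end='')"]
)
```

Because both classes expose the same `run` method, whatever grades the output does not need to know how the agent was invoked.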

Graders

Deterministic and LLM-judge

Score traces with rules (substring matches, tool calls, JSON schema) or with a rubric-driven LLM judge. Per-grader scoring reports each grader's own pass@1 instead of a single conjunctive score in which one failing grader fails the whole case.
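To see why per-grader scoring matters, consider a small worked example. The grader names and results below are made up, and pass@1 is taken here to mean the fraction of cases passed with one attempt per case. This is a sketch of the arithmetic, not selfevals' internal code.

```python
from collections import defaultdict

# Hypothetical results: one dict per case, mapping grader name -> passed?
results = [
    {"substring": True, "json_schema": True, "judge": False},
    {"substring": True, "json_schema": False, "judge": True},
    {"substring": True, "json_schema": True, "judge": True},
    {"substring": False, "json_schema": True, "judge": True},
]


def conjunctive_pass_rate(results: list[dict[str, bool]]) -> float:
    """A case counts only if every grader passed it (the worst-of view)."""
    return sum(all(r.values()) for r in results) / len(results)


def per_grader_pass_rate(results: list[dict[str, bool]]) -> dict[str, float]:
    """Each grader gets its own pass rate across all cases."""
    passed = defaultdict(int)
    for r in results:
        for name, ok in r.items():
            passed[name] += ok
    return {name: count / len(results) for name, count in passed.items()}


print(conjunctive_pass_rate(results))  # 0.25
print(per_grader_pass_rate(results))   # every grader scores 0.75
```

Each grader passes three of four cases, yet the conjunctive score is 0.25 because the failures land on different cases. The single number hides which grader is responsible for which failure.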

Proposers

Sweep the parameter space

Grid, random, or manual. The grid proposer enumerates its full cartesian product instead of early-stopping on a plateau — no combination left untried.
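Enumerating the full cartesian product is straightforward to sketch with the standard library. The parameter names and values below are hypothetical, and the `grid` helper is written for this example rather than taken from selfevals.

```python
from itertools import product


def grid(space: dict[str, list]) -> list[dict]:
    """Return one config dict per combination of the listed parameter values."""
    names = list(space)
    return [
        dict(zip(names, combo))
        for combo in product(*(space[name] for name in names))
    ]


# Hypothetical parameter space: 3 x 2 x 2 = 12 configurations.
space = {"top_k": [4, 8, 16], "alpha": [0.25, 0.5], "rerank": [True, False]}
configs = grid(space)

print(len(configs))  # 12
print(configs[0])    # {'top_k': 4, 'alpha': 0.25, 'rerank': True}
```

The cost grows multiplicatively with each added parameter, which is the tradeoff of never skipping a combination.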

Decision matrix

A verdict, not a number

Each iteration's metrics become a decision: keep, reject, investigate, spawn a sub-experiment, or require a tradeoff review.
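One way to picture metrics turning into a verdict is a small rule table. The five verdict names come from the list above; everything else here (the two deltas, the noise threshold, and the rules themselves) is an assumption made for illustration. The sketch never returns the sub-experiment verdict, since the text above does not say what triggers it.

```python
from enum import Enum


class Verdict(Enum):
    KEEP = "keep"
    REJECT = "reject"
    INVESTIGATE = "investigate"
    SPAWN_SUB_EXPERIMENT = "spawn_sub_experiment"  # listed for completeness, unused below
    TRADEOFF_REVIEW = "tradeoff_review"


def decide(quality_delta: float, cost_delta: float, noise: float = 0.01) -> Verdict:
    """Map an iteration's change versus baseline to a verdict (illustrative rules)."""
    improved = quality_delta > noise
    regressed = quality_delta < -noise
    costlier = cost_delta > noise
    if improved and costlier:
        return Verdict.TRADEOFF_REVIEW  # better but more expensive: a human decides
    if improved:
        return Verdict.KEEP
    if regressed:
        return Verdict.REJECT
    return Verdict.INVESTIGATE  # change is within noise: no clear signal


print(decide(0.05, 0.0).value)  # keep
print(decide(0.05, 0.2).value)  # tradeoff_review
```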

Error analysis

A taxonomy that grows itself

selfevals maintains a per-workspace failure-mode taxonomy and drives the next experiment from it — a closed loop, with a human in the promote step.

Reports

Markdown or JSON, ranked

Iterations ranked, the winner selected, a top failure-modes table — end-to-end in under a second against the bundled echo agent. No API key needed.

60-second quickstart

From install to a ranked report in one command.

No dashboard to configure, no provider to wire up first. The CLI orchestrates the whole run.

01 · Copy an example
Seed evals/ into your project with one command.

02 · Run the experiment
Cases flow through your adapter, traces get graded, iterations persist.

03 · Read the verdict
A ranked markdown report names the configuration to keep.

~/your-agent

$ selfevals examples copy pingpong
$ selfevals run evals/experiments/example_pingpong.yaml
$ selfevals report <ws> <exp>

→ markdown report · iterations ranked · winner selected (<1s, no API key)

Case study · brain_os

A framework improved by the agent it was grading.

brain_os is a memory OS for AI agents. It points selfevals at its own hybrid retriever and runs a parameter sweep over its retrieval config. On its golden set it measures MRR 0.896 / Recall@8 1.0 — with a CI regression gate at MRR ≥ 0.80.

Running the sweep surfaced two limitations in selfevals itself: a grid proposer that early-stopped on a plateau, and a conjunctive pass@1 that masked each grader's signal. Both became the headline features of v0.5.0. The experiment did its job — it relocated brain_os's bottleneck with evidence, not intuition.
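For readers unfamiliar with the two retrieval metrics quoted in this case study, here is a minimal sketch of how MRR, Recall@k, and a threshold gate are commonly computed. The toy golden set is invented for the example and is not brain_os data. The function names are hypothetical; only the 0.80 threshold comes from the text above.

```python
def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mrr(queries: list[tuple[list[str], set[str]]]) -> float:
    """Mean reciprocal rank over (ranked results, relevant set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant)


def passes_gate(score: float, threshold: float = 0.80) -> bool:
    """Regression gate: fail the build when the metric drops below threshold."""
    return score >= threshold


# Toy golden set: relevant item at rank 1, rank 2, and rank 3 respectively.
golden = [
    (["a", "b", "c"], {"a"}),
    (["x", "y", "z"], {"y"}),
    (["p", "q", "r"], {"r"}),
]

score = mrr(golden)  # (1 + 1/2 + 1/3) / 3
print(round(score, 3), passes_gate(score))
```

On this toy set the gate fails, since the mean of 1, 1/2, and 1/3 is well below 0.80. In CI, a failing gate would typically translate into a non-zero exit code.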

0.896 · MRR on golden set

1.0 · Recall@8

deterministic graders

≥0.80 · CI regression gate

Why CLI-first

Built for the loop you already work in.

Not a dashboard you log into — a tool that runs where your code runs.

selfevals vs. hosted eval dashboards

Runs in CI without a hosted service

Never calls your provider — your agent does

Framework-agnostic adapters

Closed error-analysis loop with a taxonomy

Per-grader scoring, not a blunt worst-of

Multi-tenant from day one

From the field

“It relocated our retrieval bottleneck to upstream task-shape classification — with evidence, not intuition. Then it improved itself off the back of our run.”

brain_os

memory OS · production integration

Grade your agent like you mean it.

Install the CLI, run the bundled example offline, and see a ranked report in under a second.
