v0.5.0 — runtime functional

Stop guessing whether your agent got better.

selfevals is a CLI-first, self-improving evals framework. Point it at your agent, sweep the parameters you expose, and get a report that tells you which configuration to keep — with evidence, not intuition.

Agnostic to the agent framework underneath

OpenAI, Anthropic, Bedrock, Vertex, LangChain, CrewAI
Why selfevals

An evals harness that earns the configuration you ship.

Five nouns, one YAML spec, a closed feedback loop. selfevals never calls your provider — your agent does, and selfevals grades the result.

Adapters

Point it at any agent

Embedded callable, CLI subprocess, or HTTP endpoint. selfevals calls your agent, never the provider directly — so it stays framework-agnostic from day one.
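As a rough illustration of the adapter idea, the sketch below wraps an in-process function and a CLI command behind one interface. The names `Adapter`, `CallableAdapter`, `SubprocessAdapter`, and `echo_agent` are hypothetical and are not taken from the selfevals codebase; the point is only that the harness talks to the agent, and the agent alone decides whether to call a provider.

```python
import subprocess
import sys
from typing import Callable, Protocol


class Adapter(Protocol):
    """Hypothetical interface: take a case input, return the agent's output."""

    def run(self, prompt: str) -> str: ...


class CallableAdapter:
    """Wraps an in-process Python function (the embedded-callable shape)."""

    def __init__(self, fn: Callable[[str], str]) -> None:
        self.fn = fn

    def run(self, prompt: str) -> str:
        return self.fn(prompt)


class SubprocessAdapter:
    """Runs the agent as a CLI command, passing the prompt on stdin."""

    def __init__(self, argv: list[str], timeout_s: float = 30.0) -> None:
        self.argv = argv
        self.timeout_s = timeout_s

    def run(self, prompt: str) -> str:
        result = subprocess.run(
            self.argv,
            input=prompt,
            capture_output=True,
            text=True,
            timeout=self.timeout_s,
            check=True,  # raise if the agent process exits non-zero
        )
        return result.stdout.strip()


def echo_agent(prompt: str) -> str:
    # Stand-in agent: makes no provider call, just returns its input.
    return prompt


embedded = CallableAdapter(echo_agent)
cli = SubprocessAdapter([sys.executable, "-c", "import sys; print(sys.stdin.read())"])
```

Both objects expose the same `run` method, so a harness written against that method would not need to know which transport sits underneath.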

Graders

Deterministic and LLM-judge

Score traces with rules — substrings, tools, JSON schema — or a rubric-driven judge. Per-grader scoring reports each grader's own pass@1 instead of a blunt worst-of.
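To see why per-grader reporting matters, here is a minimal sketch that compares per-grader pass@1 with a single conjunctive score. It assumes one sample per case, so pass@1 reduces to the fraction of cases passed, and it treats a trace as the agent's final text. The helpers `contains`, `max_length`, and both scoring functions are illustrative, not selfevals APIs.

```python
from typing import Callable

# Assumption: a trace is just the agent's final text output.
Grader = Callable[[str], bool]


def contains(substring: str) -> Grader:
    """Deterministic grader: passes when the output contains the substring."""
    return lambda output: substring in output


def max_length(limit: int) -> Grader:
    """Deterministic grader: passes when the output is at most `limit` characters."""
    return lambda output: len(output) <= limit


def per_grader_pass_at_1(outputs: list[str], graders: dict[str, Grader]) -> dict[str, float]:
    """Fraction of cases each grader passes, reported separately."""
    return {
        name: sum(grader(out) for out in outputs) / len(outputs)
        for name, grader in graders.items()
    }


def conjunctive_pass_at_1(outputs: list[str], graders: dict[str, Grader]) -> float:
    """Fraction of cases that pass every grader at once (the worst-of view)."""
    passed = sum(all(grader(out) for grader in graders.values()) for out in outputs)
    return passed / len(outputs)


outputs = ["pong", "pong pong pong", "ping", "pong"]
graders = {"has_pong": contains("pong"), "short": max_length(8)}

per_grader = per_grader_pass_at_1(outputs, graders)
combined = conjunctive_pass_at_1(outputs, graders)
print(per_grader)  # {'has_pong': 0.75, 'short': 0.75}
print(combined)    # 0.5
```

The conjunctive score of 0.5 hides the fact that each grader passes three cases out of four and that they fail on different cases, which is the signal a per-grader breakdown keeps visible.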

Proposers

Sweep the parameter space

Grid, random, or manual. The grid proposer enumerates its full cartesian product instead of early-stopping on a plateau — no combination left untried.
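Full enumeration of a grid can be sketched with `itertools.product`. The function `grid` and the parameter names `top_k` and `alpha` are made up for illustration; the actual proposer's interface is not shown in this page.

```python
from itertools import product


def grid(space: dict[str, list]) -> list[dict]:
    """Enumerate every combination in the parameter space, with no early stop."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*space.values())]


# Hypothetical two-parameter space: 2 x 3 = 6 candidate configurations.
space = {"top_k": [4, 8], "alpha": [0.3, 0.5, 0.7]}
candidates = grid(space)

for candidate in candidates:
    print(candidate)
```

The number of candidates is the product of the per-parameter list lengths, so the cost of an exhaustive sweep grows quickly as parameters are added.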

Decision matrix

A verdict, not a number

Each iteration's metrics become a decision: keep, reject, investigate, spawn a sub-experiment, or require a tradeoff review.

Error analysis

A taxonomy that grows itself

selfevals maintains a per-workspace failure-mode taxonomy and drives the next experiment from it — a closed loop, with a human in the promote step.

Reports

Markdown or JSON, ranked

Iterations ranked, the winner selected, a top failure-modes table — end-to-end in under a second against the bundled echo agent. No API key needed.

60-second quickstart

From install to a ranked report in one command.

No dashboard to configure, no provider to wire up first. The CLI orchestrates the whole run.

  1. Copy an example
     Seed evals/ into your project with one command.
  2. Run the experiment
     Cases flow through your adapter, traces get graded, iterations persist.
  3. Read the verdict
     A ranked markdown report names the configuration to keep.
~/your-agent
$ selfevals examples copy pingpong
$ selfevals run evals/experiments/example_pingpong.yaml
$ selfevals report <ws> <exp>
markdown report · iterations ranked · winner selected · <1s, no API key
Case study · brain_os

A framework improved by the agent it was grading.

brain_os is a memory OS for AI agents. It points selfevals at its own hybrid retriever and runs a parameter sweep over its retrieval config. On its golden set it measures MRR 0.896 / Recall@8 1.0 — with a CI regression gate at MRR ≥ 0.80.
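For readers unfamiliar with the two metrics, the sketch below computes MRR and Recall@k on a toy golden set and applies a threshold gate like the one described. The data is invented for illustration and is unrelated to brain_os's golden set; the function names are assumptions, not part of selfevals or brain_os.

```python
def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top k results."""
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


# Toy golden set: (retriever ranking, relevant document ids) per query.
golden = [
    (["a", "b", "c"], {"a"}),       # first hit at rank 1
    (["x", "y", "z"], {"y"}),       # first hit at rank 2
    (["m", "n", "o"], {"m", "o"}),  # first hit at rank 1
]

mrr = sum(reciprocal_rank(r, rel) for r, rel in golden) / len(golden)
recall_8 = sum(recall_at_k(r, rel, k=8) for r, rel in golden) / len(golden)

MRR_GATE = 0.80  # threshold taken from the text above
passes_gate = mrr >= MRR_GATE

print(f"MRR={mrr:.3f} Recall@8={recall_8:.1f} gate_passed={passes_gate}")
```

A CI job could exit non-zero when `passes_gate` is false, which is one simple way to turn the threshold into a regression gate.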

Running the sweep surfaced two limitations in selfevals itself: a grid proposer that early-stopped on a plateau, and a conjunctive pass@1 that masked each grader's signal. Both became the headline features of v0.5.0. The experiment did its job — it relocated brain_os's bottleneck with evidence, not intuition.

MRR on golden set: 0.896
Recall@8: 1.0
Deterministic graders: 5
CI regression gate: MRR ≥ 0.80
Why CLI-first

Built for the loop you already work in.

Not a dashboard you log into — a tool that runs where your code runs.

selfevals vs. hosted eval dashboards
Runs in CI without a hosted service
Never calls your provider — your agent does
Framework-agnostic adapters
Closed error-analysis loop with a taxonomy
Per-grader scoring, not a blunt worst-of
Multi-tenant from day one

From the field

It relocated our retrieval bottleneck to upstream task-shape classification — with evidence, not intuition. Then it improved itself off the back of our run.
brain_os
memory OS · production integration

Grade your agent like you mean it.

Install the CLI, run the bundled example offline, and see a ranked report in under a second.