Point it at any agent
Embedded callable, CLI subprocess, or HTTP endpoint. selfevals calls your agent, never the provider directly — so it stays framework-agnostic from day one.
selfevals is a CLI-first, self-improving evals framework. Point it at your agent, sweep the parameters you expose, and get a report that tells you which configuration to keep — with evidence, not intuition.
Agnostic to the agent framework underneath
Five nouns, one YAML spec, a closed feedback loop. selfevals never calls your provider — your agent does, and selfevals grades the result.
Score traces with rules — substrings, tools, JSON schema — or a rubric-driven judge. Per-grader scoring reports each grader's own pass@1 instead of a blunt worst-of.
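To make the difference between per-grader pass@1 and a single conjunctive score concrete, here is a minimal sketch. It is not selfevals's API: the record layout, the grader names, and both helper functions are hypothetical, and the data is invented purely for illustration.

```python
from collections import defaultdict

# Hypothetical grader results: one record per (task, grader) pair.
# The layout and names are assumptions, not selfevals's actual output format.
results = [
    {"task": 0, "grader": "substring", "passed": True},
    {"task": 0, "grader": "json_schema", "passed": False},
    {"task": 1, "grader": "substring", "passed": True},
    {"task": 1, "grader": "json_schema", "passed": True},
    {"task": 2, "grader": "substring", "passed": True},
    {"task": 2, "grader": "json_schema", "passed": False},
    {"task": 3, "grader": "substring", "passed": False},
    {"task": 3, "grader": "json_schema", "passed": True},
]


def per_grader_pass_at_1(records):
    """Fraction of tasks each grader passed, reported separately per grader."""
    by_grader = defaultdict(list)
    for record in records:
        by_grader[record["grader"]].append(record["passed"])
    return {name: sum(flags) / len(flags) for name, flags in by_grader.items()}


def conjunctive_pass_at_1(records):
    """Fraction of tasks on which every grader passed (the 'worst-of' view)."""
    by_task = defaultdict(list)
    for record in records:
        by_task[record["task"]].append(record["passed"])
    return sum(all(flags) for flags in by_task.values()) / len(by_task)


per_grader = per_grader_pass_at_1(results)
conjunctive = conjunctive_pass_at_1(results)
print(per_grader)   # {'substring': 0.75, 'json_schema': 0.5}
print(conjunctive)  # 0.25
```

In this toy data the conjunctive score of 0.25 hides the fact that the substring grader passes three tasks out of four; the per-grader view shows where the failures actually come from.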
Grid, random, or manual. The grid proposer enumerates its full cartesian product instead of early-stopping on a plateau — no combination left untried.
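As a rough illustration of what enumerating the full cartesian product means, the sketch below expands a small search space with the standard library. The `grid` helper and the parameter names (`top_k`, `alpha`, `rerank`) are hypothetical and do not reflect how selfevals implements its proposer.

```python
import itertools


def grid(space):
    """Yield every combination of the given parameter values, with no early stop."""
    names = list(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical search space; the parameter names are illustrative only.
space = {
    "top_k": [4, 8],
    "alpha": [0.25, 0.5, 0.75],
    "rerank": [True, False],
}

configs = list(grid(space))
print(len(configs))  # 12 = 2 * 3 * 2
print(configs[0])    # {'top_k': 4, 'alpha': 0.25, 'rerank': True}
```

The cost is that the number of runs grows multiplicatively with each added parameter, which is the tradeoff against a proposer that stops early on a plateau.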
Each iteration's metrics become a decision: keep, reject, investigate, spawn a sub-experiment, or require a tradeoff review.
selfevals maintains a per-workspace failure-mode taxonomy and drives the next experiment from it — a closed loop, with a human in the promote step.
Iterations ranked, the winner selected, a top failure-modes table — end-to-end in under a second against the bundled echo agent. No API key needed.
No dashboard to configure, no provider to wire up first. The CLI orchestrates the whole run.
brain_os is a memory OS for AI agents. It points selfevals at its own hybrid retriever and runs a parameter sweep over its retrieval config. On its golden set it measures MRR 0.896 / Recall@8 1.0 — with a CI regression gate at MRR ≥ 0.80.
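For readers unfamiliar with the two metrics, here is a minimal sketch of how MRR, Recall@k, and a threshold gate are commonly computed. The toy queries, document ids, and function names are assumptions for illustration; this is not brain_os's golden set or its actual CI gate.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked, relevant, k):
    """Share of the relevant documents that appear in the top k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def passes_gate(mrr_value, threshold=0.80):
    """Regression gate: succeed only if MRR stays at or above the threshold."""
    return mrr_value >= threshold


# Toy golden set (hypothetical): ranked retriever output and relevant ids per query.
golden = [
    (["a", "b", "c"], {"a"}),       # first hit at rank 1
    (["x", "y", "z"], {"y"}),       # first hit at rank 2
    (["m", "n", "o"], {"m", "o"}),  # first hit at rank 1
]

mrr_value = mean(reciprocal_rank(ranked, rel) for ranked, rel in golden)
recall_8 = mean(recall_at_k(ranked, rel, 8) for ranked, rel in golden)
gate_ok = passes_gate(mrr_value)
print(round(mrr_value, 3), recall_8, gate_ok)  # 0.833 1.0 True
```

A CI job built on this idea would typically exit with a non-zero status when `passes_gate` returns False, so a drop in retrieval quality fails the build.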
Running the sweep surfaced two limitations in selfevals itself: a grid proposer that early-stopped on a plateau, and a conjunctive pass@1 that masked each grader's signal. Both fixes became the headline features of v0.5.0. The experiment did its job: it relocated brain_os's bottleneck to upstream task-shape classification, with evidence, not intuition.
Not a dashboard you log into — a tool that runs where your code runs.
From the field
“It relocated our retrieval bottleneck to upstream task-shape classification — with evidence, not intuition. Then it improved itself off the back of our run.”
Install the CLI, run the bundled example offline, and see a ranked report in under a second.