Evaluations
Score candidate agent revisions against golden datasets before you ship
Manual testing breaks down once an agent has more than a handful of behaviors to preserve. Evaluations let you systematically score a candidate revision against a golden dataset of inputs, judged by an LLM against criteria you define, and compare the result to the deployed baseline. It's CI for agents.
This page covers offline evaluations against a dataset. For an LLM-as-judge that scores live step output during real executions — optionally with retry-on-low-score — see Inline Judge.
When to use evaluations
- Before promoting any revision to production.
- When you change the system prompt, swap models, or add tools.
- When you suspect a quality regression (support tickets pile up or users complain).
- Periodically, against a fixed dataset, to track quality drift over time.
Anatomy of an evaluation run
| Piece | What it is |
|---|---|
| Dataset | A list of test cases — inputs, optional expected outputs, optional metadata. |
| Candidate revision | The agent revision you want to score. |
| Baseline revision | What you compare against (usually the deployed one). |
| Judge | An LLM-as-judge that scores each output against your criteria. |
| Metrics | Custom or built-in measures (correctness, helpfulness, format adherence, latency, cost). |
| Run | One execution of the dataset against a candidate, producing per-case results and aggregate scores. |
Build a golden dataset
A good dataset has 20–200 cases that span:
- The most common inputs you see in production
- Known-hard or historically-broken cases (regression tests)
- Adversarial / edge inputs (empty strings, jailbreak attempts, malformed JSON, etc.)
You can seed datasets from real conversations — pick interesting traces from Conversations and add them as cases.
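As a rough illustration of what such a dataset might look like in code, here is a minimal sketch. The `EvalCase` class, the `kind` metadata key, and the sample inputs are all hypothetical and made up for this example; they are not Fruxon's dataset format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvalCase:
    """One test case: an input, an optional expected output, optional metadata."""
    input: str
    expected: Optional[str] = None
    metadata: dict = field(default_factory=dict)


# Illustrative cases only: one common input, one regression case, two edge inputs.
golden_dataset = [
    EvalCase("What is your refund window?", expected="30 days",
             metadata={"kind": "common"}),
    EvalCase("Cancel my order and refund it", metadata={"kind": "regression"}),
    EvalCase("", metadata={"kind": "edge"}),            # empty string
    EvalCase('{"order": ', metadata={"kind": "edge"}),  # malformed JSON
]


def coverage(cases):
    """Count cases per 'kind' so gaps in the dataset are easy to spot."""
    counts = {}
    for case in cases:
        kind = case.metadata.get("kind", "unlabeled")
        counts[kind] = counts.get(kind, 0) + 1
    return counts
```

Tagging each case with a category makes it easy to check that the dataset is not dominated by common inputs at the expense of edge cases.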
Define your metrics
Open Settings → Evaluation Metrics and define what "good" means for your agent. Examples:
- Correctness — does the answer match the expected output (exact, fuzzy, or judged-by-LLM)?
- Faithfulness — is the answer grounded in retrieved knowledge, or hallucinated?
- Format — does the output parse as the expected schema?
- Tone — does it match brand voice?
- Tool selection — did the agent call the right tool?
Each metric gets a name, a description, and (for LLM-judged metrics) a rubric.
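To make the non-LLM metrics concrete, the sketch below shows one way exact correctness, fuzzy correctness, and format checks could be computed. The function names, the 0.8 similarity threshold, and the 0.0/1.0 scoring scale are assumptions chosen for illustration, not Fruxon's built-in metrics.

```python
import json
from difflib import SequenceMatcher


def exact_match(output: str, expected: str) -> float:
    """Correctness (exact): 1.0 if the strings match after trimming whitespace."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def fuzzy_match(output: str, expected: str, threshold: float = 0.8) -> float:
    """Correctness (fuzzy): 1.0 if case-insensitive similarity meets the threshold."""
    ratio = SequenceMatcher(None, output.lower(), expected.lower()).ratio()
    return 1.0 if ratio >= threshold else 0.0


def format_ok(output: str, required_keys: set) -> float:
    """Format: 1.0 if the output parses as a JSON object with all required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    return 1.0 if set(required_keys).issubset(parsed) else 0.0
```

Deterministic checks like these are cheap and repeatable. Metrics such as tone or faithfulness usually need an LLM judge with a rubric instead.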
Run an evaluation
1. From the agent, open Evaluations → New Run.
2. Pick a candidate revision.
3. Pick the dataset.
4. Pick the baseline (usually deployed).
5. Pick which metrics to score.
6. Run.
Fruxon executes every case against both candidate and baseline, scores them with the judge, and produces a side-by-side report.
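Conceptually, a run is a loop over the dataset. The sketch below is an illustration of that idea only: the platform does this for you, and `run_evaluation`, the stub agents, and the stub judge are hypothetical stand-ins rather than Fruxon's API.

```python
from typing import Callable, Dict, List

Agent = Callable[[str], str]         # input -> output
Judge = Callable[[str, str], float]  # (input, output) -> score


def run_evaluation(cases: List[str], candidate: Agent, baseline: Agent,
                   judge: Judge) -> List[Dict]:
    """Run every case through both revisions and score both outputs."""
    results = []
    for case_input in cases:
        cand_out = candidate(case_input)
        base_out = baseline(case_input)
        results.append({
            "input": case_input,
            "candidate_output": cand_out,
            "baseline_output": base_out,
            "candidate_score": judge(case_input, cand_out),
            "baseline_score": judge(case_input, base_out),
        })
    return results


# Stub revisions and a stub judge, standing in for real agents and an LLM judge.
def candidate_stub(text: str) -> str:
    return text.upper()


def baseline_stub(text: str) -> str:
    return text


def judge_stub(case_input: str, output: str) -> float:
    # Toy criterion: reward fully upper-case outputs.
    return 1.0 if output.isupper() else 0.0


results = run_evaluation(["hi", "ok"], candidate_stub, baseline_stub, judge_stub)
```

Scoring both revisions with the same judge on the same inputs is what makes the side-by-side comparison meaningful.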
Reading results
Each run yields:
- Aggregate scores per metric (candidate vs baseline)
- Per-case results — input, both outputs, both scores, judge's reasoning
- Cost and latency summaries (real production cost matters as much as quality)
- Regressions — cases where candidate scored worse than baseline, surfaced first
If the candidate beats the baseline on the metrics that matter, it is ready to promote. If it loses on cases you care about, you have caught the regression before users did.
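The aggregate scores and the regression list can be derived from per-case results along these lines. This is a minimal sketch with made-up scores; the result fields and the `summarize` helper are assumptions for illustration, not the report's actual schema.

```python
from statistics import mean

# Hypothetical per-case scores for a single metric.
sample_results = [
    {"input": "a", "candidate_score": 1.0, "baseline_score": 0.5},
    {"input": "b", "candidate_score": 0.5, "baseline_score": 1.0},
    {"input": "c", "candidate_score": 0.75, "baseline_score": 1.0},
]


def summarize(results):
    """Aggregate scores and list regressions, largest score drop first."""
    regressions = [r for r in results
                   if r["candidate_score"] < r["baseline_score"]]
    # Most negative (candidate - baseline) delta sorts to the front.
    regressions.sort(key=lambda r: r["candidate_score"] - r["baseline_score"])
    return {
        "candidate_mean": mean(r["candidate_score"] for r in results),
        "baseline_mean": mean(r["baseline_score"] for r in results),
        "regressions": regressions,
    }


summary = summarize(sample_results)
```

In this sample the candidate's mean score is lower than the baseline's, and cases "b" and "c" are the ones to inspect first.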
Patterns
- Pre-deploy gate. Treat evaluations as the gate to deploy. Don't promote a revision that loses on critical metrics.
- Periodic drift checks. Re-run the dataset against the deployed revision on a regular cadence to catch drift from upstream model updates. For now, you trigger each run yourself.
- Per-feature datasets. One dataset per behavior class (Q&A, formatting, tool use) — easier to localize regressions.
- Replay real traffic. Periodically curate failed production cases into the dataset; the agent gets harder to break over time.
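The pre-deploy gate pattern can be expressed as a simple check over aggregate scores. The sketch below assumes you have per-metric scores for both revisions in hand; the `failing_metrics` helper, the metric names, and the `tolerance` parameter are hypothetical.

```python
def failing_metrics(candidate, baseline, critical, tolerance=0.0):
    """Return the critical metrics where the candidate falls below the baseline.

    `tolerance` allows for small score drops that you consider noise.
    """
    return [metric for metric in critical
            if candidate[metric] < baseline[metric] - tolerance]


# Made-up aggregate scores for illustration.
candidate_scores = {"correctness": 0.9, "tone": 0.7}
baseline_scores = {"correctness": 0.85, "tone": 0.8}

blocked_by = failing_metrics(candidate_scores, baseline_scores,
                             critical=["correctness", "tone"])
if blocked_by:
    print("Do not promote; regressed on:", ", ".join(blocked_by))
else:
    print("Gate passed.")
```

Limiting the gate to critical metrics keeps a small dip on a secondary metric from blocking a revision that improves what matters most.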
What evaluations don't do
- They don't replace observability. Use monitoring to catch production-only failures (timeouts, integration outages, surprise inputs).
- They don't catch issues your judge can't see. If your metrics don't measure the thing that matters, evaluations will pass while users complain. Curate metrics seriously.
Next steps
- Versioning — promote / roll back revisions
- Monitoring — watch the deployed revision live
- Cost & Budgets — keep evaluation runs from blowing the bill