Evaluations

Manual testing breaks down once an agent has more than a handful of behaviors to preserve. Evaluations let you systematically score a candidate revision against a golden dataset of inputs, judged by an LLM against criteria you define, and compare the result to the deployed baseline. It's CI for agents.

This page covers offline evaluations against a dataset. For an LLM-as-judge that scores live step output during real executions — optionally with retry-on-low-score — see Inline Judge.

When to use evaluations

Before promoting any revision to production.
When you change the system prompt, swap models, or add tools.
When you suspect a quality regression (tickets pile up, users complain).
Periodically, against a fixed dataset, to track quality drift over time.

Anatomy of an evaluation run

Piece	What it is
Dataset	A list of test cases — inputs, optional expected outputs, optional metadata.
Candidate revision	The agent revision you want to score.
Baseline revision	What you compare against (usually the deployed one).
Judge	An LLM-as-judge that scores each output against your criteria.
Metrics	Custom or built-in measures (correctness, helpfulness, format adherence, latency, cost).
Run	One execution of the dataset against a candidate, producing per-case results and aggregate scores.

Build a golden dataset

A good dataset has 20–200 cases that span:

The most common inputs you see in production
Known-hard or historically-broken cases (regression tests)
Adversarial / edge inputs (empty strings, jailbreak attempts, malformed JSON, etc.)

You can seed datasets from real conversations — pick interesting traces from Conversations and add them as cases.
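As a rough illustration of what such a dataset might contain, here is a minimal sketch in plain Python. The `EvalCase` class, its field names, the `category` metadata key, and the sample inputs are all hypothetical; they are not Fruxon's dataset format, only a way to picture the three kinds of cases listed above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvalCase:
    """Hypothetical test case: an input, an optional expected output, optional metadata."""
    input: str
    expected_output: Optional[str] = None
    metadata: dict = field(default_factory=dict)


# Illustrative cases only; the contents are made up for this sketch.
dataset = [
    EvalCase("What is your refund policy?",
             "Refunds are available within 30 days.",
             {"category": "common"}),
    EvalCase("Cancel my order and also reorder it.",
             None,
             {"category": "regression"}),
    EvalCase("", None, {"category": "adversarial"}),               # empty string
    EvalCase('{"order_id": 12', None, {"category": "adversarial"}),  # malformed JSON
]


def coverage(cases):
    """Count cases per category so gaps in the dataset are easy to spot."""
    counts = {}
    for case in cases:
        category = case.metadata.get("category", "uncategorized")
        counts[category] = counts.get(category, 0) + 1
    return counts
```

Calling `coverage(dataset)` gives a quick check that no category has been left empty before you rely on the dataset as a gate.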

Define your metrics

Open Settings → Evaluation Metrics and define what "good" means for your agent. Examples:

Correctness — does the answer match the expected output (exact, fuzzy, or judged-by-LLM)?
Faithfulness — is the answer grounded in retrieved knowledge, or hallucinated?
Format — does the output parse as the expected schema?
Tone — does it match brand voice?
Tool selection — did the agent call the right tool?

Each metric gets a name, a description, and (for LLM-judged metrics) a rubric.
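To make the non-LLM variants concrete, the sketch below shows one possible way to score exact correctness, fuzzy correctness, and format adherence using only the Python standard library. The function names and the choice of `difflib` for fuzzy matching are assumptions made for illustration; they do not describe how Fruxon's built-in metrics are implemented.

```python
import json
from difflib import SequenceMatcher


def exact_match(output: str, expected: str) -> float:
    """Correctness (exact): 1.0 if the strings match after trimming whitespace."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def fuzzy_match(output: str, expected: str) -> float:
    """Correctness (fuzzy): case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, output.strip().lower(), expected.strip().lower()).ratio()


def parses_as_json_object(output: str, required_keys) -> float:
    """Format: 1.0 if the output is a JSON object containing every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    return 1.0 if all(key in data for key in required_keys) else 0.0
```

Metrics such as faithfulness or tone have no deterministic check like these, which is why they need an LLM judge and a rubric.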

Run an evaluation

From the agent, open Evaluations → New Run.
Pick a candidate revision.
Pick the dataset.
Pick the baseline (usually deployed).
Pick which metrics to score.
Run.

Fruxon executes every case against both the candidate and the baseline, scores the outputs with the judge, and produces a side-by-side report.
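Conceptually, a run is a loop over the dataset. The sketch below is an assumption about the general shape of that loop, not Fruxon's implementation: `candidate`, `baseline`, and `judge` are hypothetical callables standing in for the two revisions and the LLM judge, and each case is a plain dictionary.

```python
from statistics import mean


def run_evaluation(cases, candidate, baseline, judge):
    """Run every case through both revisions and score each output with the judge."""
    results = []
    for case in cases:
        candidate_output = candidate(case["input"])
        baseline_output = baseline(case["input"])
        results.append({
            "input": case["input"],
            "candidate_output": candidate_output,
            "baseline_output": baseline_output,
            "candidate_score": judge(case, candidate_output),
            "baseline_score": judge(case, baseline_output),
        })
    return results


def aggregate(results):
    """Average the per-case scores for each revision."""
    return {
        "candidate": mean([r["candidate_score"] for r in results]),
        "baseline": mean([r["baseline_score"] for r in results]),
    }
```

Because both revisions see the same inputs and the same judge, differences in the aggregate are attributable to the revision rather than to the test set.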

Reading results

Each run yields:

Aggregate scores per metric (candidate vs baseline)
Per-case results — input, both outputs, both scores, judge's reasoning
Cost and latency summaries (real production cost matters as much as quality)
Regressions: cases where the candidate scored worse than the baseline, surfaced first

If the candidate beats the baseline across the metrics that matter, deploy with confidence. If it loses on cases you care about, you've caught the regression before users did.
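If you export per-case results and want to triage them yourself, a regression-first ordering can be sketched as follows. The dictionary keys (`candidate_score`, `baseline_score`) are hypothetical field names chosen for this example, not a documented export format.

```python
def regressions_first(results):
    """Sort per-case results so the largest drops relative to baseline come first."""
    return sorted(results, key=lambda r: r["candidate_score"] - r["baseline_score"])


def regressions(results):
    """Keep only the cases where the candidate scored strictly worse than the baseline."""
    return [
        r for r in regressions_first(results)
        if r["candidate_score"] < r["baseline_score"]
    ]
```

Reading the worst drops first tends to reveal whether a regression is a single broken behavior or a broad quality loss.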

Patterns

Pre-deploy gate. Treat evaluations as the gate to deploy. Don't promote a revision that loses on critical metrics.
Periodic drift checks. Re-run the dataset against the deployed revision on a regular cadence to catch drift from upstream model updates. For now, you trigger these runs yourself.
Per-feature datasets. Keep one dataset per behavior class (Q&A, formatting, tool use) so regressions are easier to localize.
Replay real traffic. Periodically curate failed production cases into the dataset; the agent gets harder to break over time.
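The pre-deploy gate pattern can be expressed as a small check over aggregate scores. This is a minimal sketch under stated assumptions: the `gate` function, its `tolerance` parameter, and the score dictionaries are hypothetical, and you would need to supply the scores from your own evaluation results.

```python
def gate(candidate_scores, baseline_scores, critical_metrics, tolerance=0.0):
    """Return (passed, failures): pass only if no critical metric drops below baseline.

    `tolerance` allows a small drop (for example, to absorb judge noise).
    """
    failures = [
        metric for metric in critical_metrics
        if candidate_scores[metric] < baseline_scores[metric] - tolerance
    ]
    return (len(failures) == 0, failures)
```

For example, `gate({"correctness": 0.9, "tone": 0.7}, {"correctness": 0.85, "tone": 0.8}, ["correctness"])` passes because tone is not treated as critical, while adding `"tone"` to the critical list makes the same candidate fail. Deciding which metrics are critical is the real design choice; the check itself is trivial.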

What evaluations don't do

They don't replace observability. Use monitoring to catch production-only failures (timeouts, integration outages, surprise inputs).
They don't catch issues your judge can't see. If your metrics don't measure the thing that matters, evaluations will pass while users complain. Curate metrics seriously.

Next steps

Versioning — promote / roll back revisions
Monitoring — watch the deployed revision live
Cost & Budgets — keep evaluation runs from blowing the bill
