Inline Judge

Score agent step outputs at runtime with LLM-as-judge — and optionally retry on low scores

Inline Judge runs an LLM-as-judge evaluation against the output of an agent step as the step executes in production. Each step can carry its own judge config: a set of metrics, weights, a failure mode, and an optional retry budget. The judge runs after the step produces output and either logs the verdict, blocks the run, or feeds the rationale back to the step and asks it to try again.

Inline Judge is distinct from the LLM-as-judge used in evaluations. Eval-run judges grade a step offline, against a dataset. Inline Judge grades the live step output during real executions. Same underlying mechanism; different mental model and different surface in the trace.

When to use it

Reach for Inline Judge when a step has a quality bar that can be defined as a rubric and you want either visibility or a guardrail at runtime:

Sensitive output that needs a verifier. Customer-facing summaries, generated emails, JSON destined for a downstream system.
Production drift detection. Surface a slow-moving quality regression on a high-traffic step without waiting for an evaluation run.
Self-correcting loops. Pair a low-score block with a retry budget so the agent revises before the user sees the output.

For one-off quality work or for benchmarking against a golden dataset, use Evaluations instead.

Configure on a step

Open the agent in Agent Studio.
Select the Agent Step you want to evaluate.
Expand the Quality check panel.
Add metrics from the tenant's Evaluation Metrics catalog. Each metric carries its own scoring rubric.
Set a weight for each metric. Weights are normalized at scoring time, so absolute values don't matter — only ratios.
(Optional) Provide a custom instruction prompt template if the judge needs context beyond the metric's default rubric. Placeholders like {{param.topic}} and {{step.<name>}} are resolved at judge time.
(Optional) Provide a reference-answer template when the metric scores against a known-good target.
(Optional) Enable Include tool transcript if the metric depends on what tools the agent called, not just the final output.

The step records a JudgeConfig. When the config is Empty, no judge runs — existing steps and revisions are unchanged unless you opt them in.

Failure modes

Pick one when configuring the judge:

Mode	Behavior
Log only	The verdict is recorded on the step trace; the run continues regardless. Use this for observability and dataset collection.
Block on low score	If the weighted score falls below the metric's pass threshold, the step is marked failed. The agent's failure path (or the deployment's error contract) handles the result.

Combine Block on low score with retries (below) to get a self-correcting loop. Combine Log only with Observability to track quality drift over time without changing customer-visible behavior.

Retry-with-feedback

Set Max retries to a value greater than zero and the judge becomes part of a feedback loop:

The step produces output.
The judge scores it. If the verdict is pass, the step returns the output as normal.
If the verdict is fail, the judge's per-metric rationale is appended to the step's session as a system message.
The step is re-invoked with the original input plus that rationale.
The loop repeats up to Max retries times. If retries are exhausted and the score is still failing, the step fails (or returns its last output, depending on failure mode).

This is more than a gate — the rationale gives the model concrete, metric-grounded feedback on what to fix, which is far more effective than a plain "try again."

Retries multiply latency and cost. A step with MaxRetries = 2 that always retries is 3× the calls. Start with MaxRetries = 1 and only raise it after you've watched real traces.

In the trace

Judge invocations appear as a distinct JUDGE step in the trace, with:

The per-metric scores and rationales.
The aggregate weighted score and pass/fail verdict.
The decision the runtime took (log, block, retry).
If retries fired, each cycle is visible in sequence so you can see the rationale → revision → next score.

This makes it straightforward to audit why a step blocked or retried in production.

Best practices

Start in Log-only mode. Watch traces for a few days before flipping to Block. False positives at the rubric level are common on the first iteration.
One or two metrics per step. A judge that scores against four rubrics simultaneously is hard to debug. Pick the dimensions that matter most.
Reuse tenant metrics. Define metrics once in Settings → Evaluation Metrics; reference them from every step and every evaluation that needs them.
Don't judge cheap steps. A classification step running on a fast model gains little from a judge that costs more than the step itself.
Cap retries early. MaxRetries = 1 resolves most fixable issues; higher values mostly add latency.

Next steps

Creating Agents — where Judge fits in the broader step config
Evaluations — offline LLM-as-judge against datasets
Observability — track quality drift over time
Agent Studio — visual workflow builder