Testing

Iterate on an agent fast — interactive runs, traces, and the inner-loop workflow

The Test panel in Studio is your inner loop. Make a change, run it, read the trace, fix, repeat. This page covers how to use it well — and how it fits with the heavier Evaluations workflow for systematic quality.

Test vs Evaluate

	Test panel	Evaluations
Use it for	Iterating while you build	Gating deploys, catching regressions
Inputs	A single case at a time	A dataset of N cases
Scoring	You read the output	LLM judge against your metrics
Speed	Seconds	Minutes (depending on dataset size)
When	Constantly, while editing	Before every deploy, on schedule

You'll spend most of your build time in the Test panel and most of your shipping time in Evaluations.

Running a test

Open the agent in Studio.
Click Test in the toolbar (or ⌘/Ctrl + Enter).
Fill in values for every entry parameter.
Click Run.

The test runs the current canvas state — saved or unsaved. You don't need to commit to a revision to test.

Reading the trace

Every test run produces a complete trace, just like a production run:

Per-step inputs and outputs — the resolved prompt, the model's response, any tool calls.
Tokens, latency, cost — both per step and total.
Errors — failing step highlighted, full message and upstream state preserved.
Tool calls — request/response pairs for every tool invocation, expandable inline.

Click any step to drill in. Click any tool call to see exactly what was sent and what came back.

What to test for

The same agent fails for different reasons in different cases. Cover at least:

The happy path. A typical, well-formed input.
The empty case. Empty string, empty array, missing optional field.
The malformed case. Invalid JSON, wrong type, way-too-long input.
The adversarial case. Prompt injection ("ignore previous instructions"), unusual unicode, jailbreak attempts.
The boring case. Inputs that look identical to ones the agent should answer differently. Distinguishes a working classifier from a lucky one.
The expensive case. Inputs that trigger many tool calls or long responses — useful for spotting cost outliers and runaway loops.

A pattern that works: keep a small file of "test cases I always run" and paste them in as you iterate.

Multi-turn testing

For agents with sessions enabled, the Test panel turns into a chat interface — submit follow-up turns and the conversation accumulates. Use this to check:

Whether the agent remembers what was said earlier.
Whether session overflow strategies (drop / summarize) behave the way you expect at the boundary.
Whether the session_search tool actually fires when needed.

Sessions →

Testing tool integrations

Tools execute against your real integrations during a test run. That means:

A test that calls Salesforce will create a real Salesforce record.
A test that runs a SQL query against PostgreSQL will hit your actual database.
A test that posts to Slack will post a real message.

For destructive integrations, use sandbox mode (Sandbox Mode) or a dedicated dev organization with separate credentials. Don't iterate against production.

Tests cost real money. Token spend on test runs counts against your provider bill exactly like production runs do. Watch the cost column in the trace.

Iterating fast

A few habits that compound:

Save frequently. Every save creates a revision, so your history is preserved.
Diff revisions. When something stops working, Compare to the last good revision (Revisions panel) — you'll often spot the change immediately.
Tag interesting traces. Hit the tag button on runs you want to come back to. Tag candidates for the evaluation dataset as you find them.
Test the failing case first. When debugging, start by reproducing the bad output, then tweak. Don't tweak in the dark.
Keep canvas state clean. Half-built nodes can throw confusing errors. Remove or comment them out (use the disable toggle).

When the test panel runs but production doesn't

Test runs hit the current canvas state (your unsaved work). Production hits the deployed revision. Common gotchas:

Tools, secrets, or knowledge attached during testing aren't yet saved to a revision.
The model you tested with isn't the one configured on the deployed revision.
The deployed revision is older than you remember.

Always look at the deployed revision's diff before assuming a production bug is reproducible from your canvas.

Promoting a test case to a regression test

Found a bug, fixed it, and want to make sure it stays fixed? Promote the failing input to your evaluation dataset:

From the trace, click Add to Dataset.
Pick the dataset, set expected output (if applicable), tag it.
Future evaluation runs include this case automatically.

This is how the evaluation dataset grows organically — every bug becomes a test.

Evaluations →

Test panel keyboard shortcuts

Action	Shortcut
Open / focus test panel	`⌘/Ctrl + Enter`
Run	`⌘/Ctrl + Enter` again
Cancel running test	`Esc`
Reset session	clear button in test panel

Next steps

Evaluations — datasets, judges, deploy gating
Observability — production traces
Troubleshooting — common failure modes
Versioning — revisions, comparing, rollback

Testing

On this page