Testing
Iterate on an agent fast — interactive runs, traces, and the inner-loop workflow
The Test panel in Studio is your inner loop. Make a change, run it, read the trace, fix, repeat. This page covers how to use it well — and how it fits with the heavier Evaluations workflow for systematic quality.
Test vs Evaluate
| Test panel | Evaluations | |
|---|---|---|
| Use it for | Iterating while you build | Gating deploys, catching regressions |
| Inputs | A single case at a time | A dataset of N cases |
| Scoring | You read the output | LLM judge against your metrics |
| Speed | Seconds | Minutes (depending on dataset size) |
| When | Constantly, while editing | Before every deploy, on schedule |
You'll spend most of your build time in the Test panel and most of your shipping time in Evaluations.
Running a test
- Open the agent in Studio.
- Click Test in the toolbar (or
⌘/Ctrl + Enter). - Fill in values for every entry parameter.
- Click Run.
The test runs the current canvas state — saved or unsaved. You don't need to commit to a revision to test.
Reading the trace
Every test run produces a complete trace, just like a production run:
- Per-step inputs and outputs — the resolved prompt, the model's response, any tool calls.
- Tokens, latency, cost — both per step and total.
- Errors — failing step highlighted, full message and upstream state preserved.
- Tool calls — request/response pairs for every tool invocation, expandable inline.
Click any step to drill in. Click any tool call to see exactly what was sent and what came back.
What to test for
The same agent fails for different reasons in different cases. Cover at least:
- The happy path. A typical, well-formed input.
- The empty case. Empty string, empty array, missing optional field.
- The malformed case. Invalid JSON, wrong type, way-too-long input.
- The adversarial case. Prompt injection ("ignore previous instructions"), unusual unicode, jailbreak attempts.
- The boring case. Inputs that look identical to ones the agent should answer differently. Distinguishes a working classifier from a lucky one.
- The expensive case. Inputs that trigger many tool calls or long responses — useful for spotting cost outliers and runaway loops.
A pattern that works: keep a small file of "test cases I always run" and paste them in as you iterate.
Multi-turn testing
For agents with sessions enabled, the Test panel turns into a chat interface — submit follow-up turns and the conversation accumulates. Use this to check:
- Whether the agent remembers what was said earlier.
- Whether session overflow strategies (drop / summarize) behave the way you expect at the boundary.
- Whether the
session_searchtool actually fires when needed.
Testing tool integrations
Tools execute against your real integrations during a test run. That means:
- A test that calls Salesforce will create a real Salesforce record.
- A test that runs a SQL query against PostgreSQL will hit your actual database.
- A test that posts to Slack will post a real message.
For destructive integrations, use sandbox mode (Sandbox Mode) or a dedicated dev organization with separate credentials. Don't iterate against production.
Tests cost real money. Token spend on test runs counts against your provider bill exactly like production runs do. Watch the cost column in the trace.
Iterating fast
A few habits that compound:
- Save frequently. Every save creates a revision, so your history is preserved.
- Diff revisions. When something stops working, Compare to the last good revision (Revisions panel) — you'll often spot the change immediately.
- Tag interesting traces. Hit the tag button on runs you want to come back to. Tag candidates for the evaluation dataset as you find them.
- Test the failing case first. When debugging, start by reproducing the bad output, then tweak. Don't tweak in the dark.
- Keep canvas state clean. Half-built nodes can throw confusing errors. Remove or comment them out (use the disable toggle).
When the test panel runs but production doesn't
Test runs hit the current canvas state (your unsaved work). Production hits the deployed revision. Common gotchas:
- Tools, secrets, or knowledge attached during testing aren't yet saved to a revision.
- The model you tested with isn't the one configured on the deployed revision.
- The deployed revision is older than you remember.
Always look at the deployed revision's diff before assuming a production bug is reproducible from your canvas.
Promoting a test case to a regression test
Found a bug, fixed it, and want to make sure it stays fixed? Promote the failing input to your evaluation dataset:
- From the trace, click Add to Dataset.
- Pick the dataset, set expected output (if applicable), tag it.
- Future evaluation runs include this case automatically.
This is how the evaluation dataset grows organically — every bug becomes a test.
Test panel keyboard shortcuts
| Action | Shortcut |
|---|---|
| Open / focus test panel | ⌘/Ctrl + Enter |
| Run | ⌘/Ctrl + Enter again |
| Cancel running test | Esc |
| Reset session | clear button in test panel |
Next steps
- Evaluations — datasets, judges, deploy gating
- Monitoring — production traces
- Troubleshooting — common failure modes
- Versioning — revisions, comparing, rollback