Deployment

Promoting an agent to production is a one-click operation, but doing it well is a discipline. This page is a checklist for shipping changes you can sleep through, and a tour of the rollout patterns Fruxon supports.

Before you deploy

Run through this list every time you promote a revision:

☐ Test interactively. The new revision works on the inputs you care about. (Testing)
☐ Evaluate against the golden dataset. Aggregate scores match or beat the deployed baseline; no regressions on critical cases. (Evaluations)
☐ Diff the revision. You know exactly what changed. (Versioning)
☐ Check tool dependencies. Any new integrations are connected and authorized in the production organization.
☐ Check secrets & API keys. Every secret referenced via {{secret.X}} exists in the production organization. See Secrets to create or rotate them.
☐ Check sub-agents. Every sub-agent you call has the intended revision deployed, since calls always hit the sub-agent's deployed revision.
☐ Check budgets. Per-agent budget alerts and caps are still appropriate. (Cost)
☐ Check the rollback target. You know which revision you'd roll back to if this one breaks.
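
The secrets check is easy to automate. The sketch below is illustrative and does not use any Fruxon API: it assumes you can obtain the agent's prompt text and the names of the secrets defined in the target organization by some other means. The function name `missing_secrets`, the sample secret names, and the assumption that secret names consist of letters, digits, and underscores are all hypothetical.

```python
import re

# Matches placeholders of the form {{secret.NAME}}. Tolerating whitespace
# inside the braces is an assumption about the template syntax.
SECRET_REF = re.compile(r"\{\{\s*secret\.([A-Za-z0-9_]+)\s*\}\}")


def missing_secrets(template_text: str, available: set[str]) -> list[str]:
    """Return referenced secret names that are not in `available`, sorted."""
    referenced = set(SECRET_REF.findall(template_text))
    return sorted(referenced - available)


# Hypothetical prompt and secret names, for illustration only.
prompt = "Call the CRM with {{secret.CRM_TOKEN}} and notify via {{secret.SLACK_WEBHOOK}}."
print(missing_secrets(prompt, {"CRM_TOKEN"}))  # prints ['SLACK_WEBHOOK']
```

An empty list means every referenced secret has a counterpart in the target organization. It does not prove the values are valid, so the manual check still applies after a rotation.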

Deploy

In Studio, go to Revisions → Deploy → Confirm. The switch is atomic; in-flight requests are not dropped.

After deploy

☐ Watch the first traffic. Open Observability and watch the first few real runs.
☐ Spot-check conversations. Skim the first batch in Conversations.
☐ Confirm cost is sane. Per-run cost matches what you saw in evaluation; no runaway loops.
☐ Confirm errors are at baseline. New error patterns warrant a rollback while you investigate.
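
The cost and error checks above come down to comparing the first runs against a baseline. A minimal sketch follows, assuming you have already collected run counts, error counts, and total cost from Observability by hand or by export. The `RunStats` type, the `should_roll_back` function, and both tolerance values are hypothetical and should be tuned to your own traffic.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    runs: int          # number of runs observed
    errors: int        # runs that ended in an error
    total_cost: float  # summed cost of those runs, in dollars

    @property
    def error_rate(self) -> float:
        return self.errors / self.runs if self.runs else 0.0

    @property
    def cost_per_run(self) -> float:
        return self.total_cost / self.runs if self.runs else 0.0


def should_roll_back(baseline: RunStats, current: RunStats,
                     max_error_increase: float = 0.02,
                     max_cost_ratio: float = 1.5) -> bool:
    """True if error rate or per-run cost moved past the illustrative tolerances."""
    error_regressed = current.error_rate > baseline.error_rate + max_error_increase
    cost_regressed = (baseline.cost_per_run > 0
                      and current.cost_per_run > baseline.cost_per_run * max_cost_ratio)
    return error_regressed or cost_regressed
```

With only a handful of runs the rates are noisy, so treat a `True` result as a prompt to look at the traces rather than as a verdict.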

Rollout patterns

Plain deploy

The default. You believe the change is safe, you deploy, you watch. Fine for low-risk changes (copy tweaks, prompt clarifications, additive tools).

Evaluation gate

Before promoting, score the candidate against a golden dataset and compare it to the deployed baseline (Evaluations). Treat a regression in aggregate score as a blocker. This is the closest thing to a pre-production canary: the execute API always serves the deployed revision, so you can't split live traffic across revisions — you gate on offline signal instead, then deploy.
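
One way to make the gate mechanical is sketched below. It assumes you have exported per-case scores for the baseline and the candidate as plain dictionaries keyed by case id, with higher scores being better. The `passes_gate` function, the score format, and the notion of a `critical` set of case ids are assumptions for illustration, not part of the Evaluations feature.

```python
from statistics import mean


def passes_gate(baseline: dict[str, float], candidate: dict[str, float],
                critical: set[str]) -> tuple[bool, list[str]]:
    """Compare per-case scores (higher is better) keyed by case id.

    Returns (ok, reasons). The gate fails if the candidate's mean score is
    below the baseline's, or if any critical case scores lower than before.
    """
    reasons = []
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        return False, ["no cases in common"]
    base_mean = mean(baseline[c] for c in shared)
    cand_mean = mean(candidate[c] for c in shared)
    if cand_mean < base_mean:
        reasons.append(f"aggregate dropped: {base_mean:.3f} -> {cand_mean:.3f}")
    for case in shared:
        if case in critical and candidate[case] < baseline[case]:
            reasons.append(f"critical case regressed: {case}")
    return (not reasons), reasons
```

Checking critical cases separately matters because a higher aggregate can hide a regression on the one case you cannot afford to get wrong.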

Shadow mode

Run the candidate revision on real production inputs without serving its responses to users. Save the candidate's output to a log; compare offline. Useful when you can't yet trust the revision but want production-shaped signal.

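Implement with a router agent that calls both the old and the new agent as sub-agents and returns only the old one's response. Because a sub-agent call always hits the deployed revision, the candidate has to be deployed in an agent of its own, separate from the one serving users.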
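
The routing logic itself is small. The sketch below shows it in plain Python rather than as a Fruxon agent: `call_deployed` and `call_candidate` are hypothetical stand-ins for the two sub-agent calls, and the logger name and record shape are assumptions.

```python
import json
import logging
from typing import Callable

# Configure a handler for this logger to persist the records somewhere durable.
log = logging.getLogger("shadow")


def shadow_route(user_input: str,
                 call_deployed: Callable[[str], str],
                 call_candidate: Callable[[str], str]) -> str:
    """Serve the deployed agent's answer; record the candidate's for offline comparison."""
    served = call_deployed(user_input)
    try:
        shadow = call_candidate(user_input)
        record = {"input": user_input, "served": served, "shadow": shadow}
    except Exception as exc:  # a broken candidate must never affect the user
        record = {"input": user_input, "served": served, "shadow_error": repr(exc)}
    log.info(json.dumps(record))
    return served
```

Two design points are worth keeping whatever the implementation: a candidate failure is logged and swallowed, and every request now pays for two runs, so budgets should account for the shadow traffic.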

Scheduled rollout

For risky changes, ship during low-traffic windows. Combine with budget caps so a runaway gets stopped automatically; alerts alone only notify you.
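
If deploys are triggered from a script or a runbook, a small guard can refuse to proceed outside the agreed window. This is a sketch under assumptions: the 02:00 to 05:00 default is a placeholder, the function name `in_deploy_window` is hypothetical, and `now` is expected in whatever timezone your traffic pattern is measured in.

```python
from datetime import datetime, time


def in_deploy_window(now: datetime,
                     start: time = time(2, 0),
                     end: time = time(5, 0)) -> bool:
    """True if `now` falls in [start, end); handles windows that wrap past midnight."""
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end
```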

Production hygiene

Coordinate sub-agent deploys. A sub-agent call always hits that agent's deployed revision, so a downstream refactor changes every caller at once. When a sub-agent's behavior is critical to yours, deploy it deliberately and re-test the parents that depend on it.
Set budget caps, not just alerts. An infinite-loop bug at 4am is much cheaper if the cap kicked in at $50 instead of $5,000.
Use Viewer access for stakeholders. Anyone who only needs to read traces or conversations should be a Viewer on the agent, not an Editor, because Editors can deploy. (Team & Roles)
Keep a "deployed" baseline. Run evaluations on the currently deployed revision against the golden dataset on a schedule. If aggregate scores drop with no deploy in between, your underlying model or knowledge base shifted under you.
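
The scheduled baseline run is only useful if something compares each result with the previous ones. A minimal sketch of that comparison follows, assuming you keep the aggregate score of each scheduled run in a list, oldest first. The function name `baseline_drifted`, the window size, and the tolerance are hypothetical and depend on how noisy your metrics are.

```python
from statistics import mean


def baseline_drifted(history: list[float], latest: float,
                     window: int = 5, tolerance: float = 0.03) -> bool:
    """True if `latest` is more than `tolerance` below the mean of the last `window` scores."""
    recent = history[-window:]
    if not recent:
        return False  # nothing to compare against yet
    return latest < mean(recent) - tolerance
```

Comparing against a short rolling mean rather than the single previous run keeps one noisy evaluation from raising a false alarm.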

Multi-environment

Most teams run separate organizations for dev, staging, and production. Organizations are fully isolated — different team members, different secrets, different integrations. Promote between them by rebuilding or copying the agent's configuration into the target organization.

Programmatic promotion across organizations is on the roadmap. For now, treat organization export/import as the cross-environment boundary.

Incident response

When production is broken:

Roll back first, debug after. Re-deploy the previous revision. Stop the bleeding.
Capture evidence. Pull failing traces from Observability before they age out of your retention window.
Add the failing cases to your golden dataset. Whatever broke today should be a regression test tomorrow.
Postmortem the deploy. Was the issue catchable in evaluation? Update the dataset / metrics so it would be next time.
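
Turning failing traces into regression cases can be as simple as the sketch below. It assumes traces and dataset cases are available as plain dictionaries with an `input` field. That schema, the `expected` and `tags` fields, and the `add_regression_cases` helper are hypothetical and not a Fruxon export format.

```python
def add_regression_cases(dataset: list[dict], failing_traces: list[dict],
                         incident_id: str) -> int:
    """Append one case per distinct failing input; return how many were added."""
    known = {case["input"] for case in dataset}
    added = 0
    for trace in failing_traces:
        text = trace["input"]
        if text in known:
            continue  # already covered by an existing case
        dataset.append({
            "input": text,
            "expected": None,  # to be filled in by a human reviewer
            "tags": ["regression", incident_id],
        })
        known.add(text)
        added += 1
    return added
```

Leaving `expected` empty is deliberate: the failing output is evidence of what went wrong, not a definition of the correct answer.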

Next steps

Versioning — revisions and rollback
Evaluations — automated quality gating
Observability — production observability
Security — production hardening
