
Deployment

Production checklist, rollout patterns, and how to ship changes safely

Promoting an agent to production is a one-click operation, but doing it well is a discipline. This page is a checklist for shipping changes you can sleep through, and a tour of the rollout patterns Fruxon supports.

Before you deploy

Run through this list every time you promote a revision:

  • Test interactively. The new revision works on the inputs you care about. (Testing)
  • Evaluate against the golden dataset. Aggregate scores match or beat the deployed baseline; no regressions on critical cases. (Evaluations)
  • Diff the revision. You know exactly what changed. (Versioning)
  • Check tool dependencies. Any new integrations are connected and authorized in the production organization.
  • Check secrets & API keys. Anything referenced via {{secret.X}} exists in this organization.
  • Check sub-agents. Any sub-agents you call are themselves deployed (or pinned to a revision).
  • Check budgets. Per-agent budget alerts and caps are still appropriate. (Cost)
  • Check the rollback target. You know which revision you'd roll back to if this one breaks.
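
The evaluation item in this checklist is mechanical enough to script. The sketch below is a minimal illustration in Python, assuming you have already exported per-case scores for the deployed baseline and the candidate as plain dictionaries. The function name, the score scale, and the critical-case set are hypothetical and not part of any Fruxon API.

```python
from statistics import mean


def evaluation_gate(baseline, candidate, critical, tolerance=0.0):
    """Return (ok, reasons) for promoting a candidate revision.

    baseline / candidate: dict mapping case id -> score (higher is better).
    critical: set of case ids that must not regress at all.
    tolerance: allowed drop in the aggregate (mean) score.
    All names here are hypothetical; adapt them to however you export scores.
    """
    reasons = []

    # Both runs must cover the same cases, or the aggregates are not comparable.
    missing = set(baseline) - set(candidate)
    if missing:
        reasons.append(f"candidate missing cases: {sorted(missing)}")

    shared = sorted(set(baseline) & set(candidate))
    if shared:
        base_avg = mean(baseline[c] for c in shared)
        cand_avg = mean(candidate[c] for c in shared)
        if cand_avg < base_avg - tolerance:
            reasons.append(f"aggregate dropped: {base_avg:.3f} -> {cand_avg:.3f}")

    # Critical cases are checked one by one; a good average cannot hide them.
    for case in sorted(critical):
        if case in baseline and case in candidate and candidate[case] < baseline[case]:
            reasons.append(f"critical case regressed: {case}")

    return (not reasons, reasons)


# Illustrative scores: the average improves, but a critical case gets worse.
baseline = {"refund": 0.9, "greeting": 0.8, "escalation": 1.0}
candidate = {"refund": 1.0, "greeting": 1.0, "escalation": 0.8}
ok, reasons = evaluation_gate(baseline, candidate, critical={"escalation"})
print(ok, reasons)
```

The example data is made up to show why the critical-case check is separate from the aggregate check: the candidate scores higher on average and is still blocked.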

Deploy

In Studio, choose Revisions → Deploy → Confirm. The switch is atomic, and in-flight requests are not dropped.

After deploy

  • Watch the first traffic. Open Monitoring and watch the first few real runs.
  • Spot-check conversations. Skim the first batch in Conversations.
  • Confirm cost is sane. Per-run cost matches what you saw in evaluation; no runaway loops.
  • Confirm errors are at baseline. New error patterns warrant a rollback while you investigate.
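
The cost and error checks above can be reduced to a small decision rule. This is a rough sketch, assuming you can pull the first runs after a deploy into a list of records with a per-run cost and an error flag. The record shape and the thresholds are illustrative assumptions, not values Fruxon prescribes.

```python
def post_deploy_verdict(runs, baseline_cost, baseline_error_rate,
                        cost_factor=1.5, error_margin=0.02):
    """Summarize the first runs after a deploy as 'wait', 'ok', or 'rollback'.

    runs: list of dicts with 'cost' (float, per run) and 'error' (bool).
    baseline_cost: per-run cost seen in evaluation.
    baseline_error_rate: error rate of the previously deployed revision.
    The thresholds are illustrative defaults, not recommendations.
    """
    if not runs:
        return "wait"  # no traffic yet, nothing to judge

    avg_cost = sum(r["cost"] for r in runs) / len(runs)
    error_rate = sum(1 for r in runs if r["error"]) / len(runs)

    # Errors above baseline: roll back first, investigate afterwards.
    if error_rate > baseline_error_rate + error_margin:
        return "rollback"
    # Cost far above what evaluation showed suggests a runaway loop.
    if avg_cost > baseline_cost * cost_factor:
        return "rollback"
    return "ok"


# Made-up first batch: nine clean runs and one failure.
first_runs = [{"cost": 0.04, "error": False}] * 9 + [{"cost": 0.05, "error": True}]
print(post_deploy_verdict(first_runs, baseline_cost=0.04, baseline_error_rate=0.01))
```

A rule like this does not replace reading the first conversations; it only makes the "is this at baseline?" question explicit.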

Rollout patterns

Plain deploy

The default. You believe the change is safe, you deploy, you watch. Fine for low-risk changes (copy tweaks, prompt clarifications, additive tools).

Canary by API caller

Have your API caller pin a subset of traffic to a specific revisionId. The rest of production keeps hitting the deployed revision; the canary subset hits the candidate. Compare metrics between the two groups, then promote.

POST /v1/{tenant}/agents/{agent}:execute
{ "revisionId": "rev_candidate", "input": { ... } }

This is the cleanest canary: you control which traffic sees the new behavior.
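
One way to choose the canary subset on the caller side is to hash a stable caller identifier into a bucket. The Python sketch below only builds the request body for the call shown above. The helper names and the percentage split are hypothetical, and it assumes that a request without revisionId is served by the deployed revision.

```python
import hashlib
import json


def in_canary(caller_id, percent):
    """Deterministically place a caller in the canary bucket.

    Hashing the caller id keeps each caller on the same side of the split
    across requests, which makes metrics easier to compare.
    """
    digest = hashlib.sha256(caller_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def build_execute_body(caller_id, agent_input, candidate_revision, percent):
    """Build the JSON body for the :execute request.

    Canary callers pin revisionId; everyone else omits it, on the assumption
    that an unpinned request is served by the deployed revision.
    """
    body = {"input": agent_input}
    if in_canary(caller_id, percent):
        body["revisionId"] = candidate_revision
    return body


# Hypothetical caller id and input, with a 10% canary.
body = build_execute_body("customer-42", {"message": "hi"}, "rev_candidate", 10)
print(json.dumps(body))
```

Bucketing by caller rather than by request means a given caller never flips between revisions mid-session. Promoting is then a deploy of the candidate followed by setting the percentage back to zero.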

Shadow mode

Run the candidate revision on real production inputs without serving its responses to users. Save the candidate's output to a log; compare offline. Useful when you can't yet trust the revision but want production-shaped signal.

Implement this with a router agent that calls both the old and the new revision as sub-agents and returns only the old revision's response.
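
The routing logic is small. The sketch below shows it in plain Python rather than as a Fruxon agent: the two callables stand in for the sub-agent calls, and the log is an in-memory list. All names are hypothetical. The point it illustrates is that a failure in the candidate must never reach the user.

```python
def shadow_route(agent_input, call_current, call_candidate, shadow_log):
    """Serve the current revision; record the candidate's output on the side.

    call_current / call_candidate: callables standing in for sub-agent calls.
    shadow_log: a list (or any object with .append) that collects comparisons.
    """
    served = call_current(agent_input)

    # The candidate must never affect the user: catch and record its failures.
    try:
        shadow = {"output": call_candidate(agent_input), "error": None}
    except Exception as exc:
        shadow = {"output": None, "error": repr(exc)}

    shadow_log.append({"input": agent_input, "served": served, "shadow": shadow})
    return served


# Stand-in revisions for illustration only.
log = []
reply = shadow_route("hello", lambda x: x.upper(), lambda x: x[::-1], log)
print(reply, log)
```

Because both revisions run on every request, expect roughly double the per-request cost while shadow mode is on.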

Scheduled rollout

For risky changes, ship during low-traffic windows. Combine this with budget caps, not just alerts, so a runaway is stopped automatically.

Production hygiene

  • Pin sub-agents when their behavior is critical. A refactor in a downstream agent shouldn't silently change yours.
  • Set budget caps, not just alerts. An infinite-loop bug at 4am is much cheaper if the cap kicked in at $50 instead of $5,000.
  • Use Viewer access for stakeholders. Anyone who only needs to read traces or conversations should be a Viewer on the agent, not an Editor: the Editor role can deploy. (Team & Roles)
  • Keep a "deployed" baseline in your dataset. Run evaluations on the currently deployed revision against the dataset on a schedule. If aggregate scores drop with no deploy in between, the underlying model or knowledge base has likely shifted under you.
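
The scheduled baseline check in the last item amounts to comparing the newest aggregate score with recent history. This is a minimal sketch, assuming you store one aggregate score per scheduled run. The window size and the allowed drop are illustrative assumptions to tune against your own score noise.

```python
from statistics import mean


def baseline_drifted(history, window=5, max_drop=0.05):
    """Flag a drop in the deployed revision's scheduled evaluation scores.

    history: aggregate scores, oldest first, all from the same revision
    and the same dataset. The latest score is compared to the mean of up
    to `window` runs before it.
    """
    if len(history) < 2:
        return False  # nothing to compare against yet
    previous = history[-(window + 1):-1]
    return history[-1] < mean(previous) - max_drop


# Made-up scores: stable for five runs, then a clear drop.
scores = [0.90, 0.91, 0.90, 0.89, 0.90, 0.80]
print(baseline_drifted(scores))
```

Since the revision and dataset are held fixed, a flagged drop points at something outside the revision, such as the model or the knowledge base.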

Multi-environment

Most teams run separate organizations for dev, staging, and production. Organizations are fully isolated — different team members, different secrets, different integrations. Promote between them by exporting/cloning a deployed revision.

Programmatic promotion across organizations is on the roadmap. For now, treat organization export/import as the cross-environment boundary.

Incident response

When production is broken:

  1. Roll back first, debug after. Re-deploy the previous revision. Stop the bleeding.
  2. Capture evidence. Pull failing traces from Monitoring before they age out of your retention window.
  3. Add the failing cases to your golden dataset. Whatever broke today should be a regression test tomorrow.
  4. Postmortem the deploy. Was the issue catchable in evaluation? If evaluation missed it, update the dataset or metrics so it would be caught next time.
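
Step 3 can be sketched as a small merge routine. The Python below assumes failing traces and the golden dataset are available as lists of dictionaries. The field names (input, trace_id, expected, source_trace) are hypothetical; map them from whatever your exports actually contain.

```python
import json


def add_regression_cases(dataset, failing_traces):
    """Append failing production inputs to a golden dataset, skipping duplicates.

    dataset: list of dicts with an 'input' key (other keys are left alone).
    failing_traces: list of dicts with 'input' and 'trace_id' keys.
    Returns the number of cases added.
    """
    def key(value):
        # Canonical JSON so that key order does not create false "new" cases.
        return json.dumps(value, sort_keys=True)

    seen = {key(case["input"]) for case in dataset}
    added = 0
    for trace in failing_traces:
        k = key(trace["input"])
        if k in seen:
            continue
        seen.add(k)
        dataset.append({
            "input": trace["input"],
            "expected": None,  # fill in by hand: the trace shows the wrong answer
            "source_trace": trace["trace_id"],
        })
        added += 1
    return added


# Illustrative data: one trace repeats an existing case, one is new.
golden = [{"input": {"q": "refund policy"}, "expected": "30 days"}]
traces = [
    {"trace_id": "t1", "input": {"q": "refund policy"}},
    {"trace_id": "t2", "input": {"q": "cancel order"}},
]
print(add_regression_cases(golden, traces), len(golden))
```

The expected answer is left empty on purpose: the failing trace records what went wrong, so the correct output still has to be written by a person.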
