Most demo AI agents look great in a five-minute video and fall apart in production. This guide is about what's different when you ship agents that real users depend on.

The principle: bounded autonomy

A production agent has a clear, narrow domain in which it can act autonomously, and a clear escalation path outside that domain. Unbounded agents — "just use the LLM to decide everything" — are unreliable in production. Period.

Agent architecture

A production agent has four parts:

  1. Planner. Decides what to do given the user's input. Usually the main LLM call.
  2. Tools. Discrete capabilities the agent can invoke: search, database query, API call, etc.
  3. Memory. Short-term (conversation history) and long-term (user context, learned preferences).
  4. Critic / guardrails. Validators that check the agent's actions before they execute.
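
A minimal sketch of how the four parts might fit together in a single turn. Everything here is hypothetical: `plan` stands in for the main LLM call, and `TOOLS`, `guardrail`, and `run_turn` are illustrative names, not from any framework.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    tool: str
    args: dict


@dataclass
class Memory:
    # Short-term memory only; long-term memory is omitted from this sketch.
    history: list[str] = field(default_factory=list)


# Tools: discrete capabilities keyed by name (one hypothetical tool).
TOOLS: dict[str, Callable[..., str]] = {
    "search_orders": lambda customer_id: f"2 open orders for {customer_id}",
}


def plan(user_input: str, memory: Memory) -> Action:
    """Planner: hard-coded stand-in for the main LLM call."""
    return Action(tool="search_orders", args={"customer_id": "c-42"})


def guardrail(action: Action) -> bool:
    """Critic: only registered tools may run."""
    return action.tool in TOOLS


def run_turn(user_input: str, memory: Memory) -> str:
    memory.history.append(f"user: {user_input}")
    action = plan(user_input, memory)
    if not guardrail(action):
        return "escalate: action rejected by guardrail"
    result = TOOLS[action.tool](**action.args)
    memory.history.append(f"tool[{action.tool}]: {result}")
    return result
```

The point of the shape is that the guardrail sits between the planner's decision and the tool's execution, so a bad plan never reaches a tool.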

Tool design

Tools are where agents most often fail. Principles:

  • Narrow tools. One verb, one noun. "search_orders" is better than "do_anything_with_orders."
  • Idempotent tools where possible. The agent will retry. Make retries safe.
  • Strong typing. Tools take structured input, return structured output. JSON schema everywhere.
  • Clear failure modes. When a tool fails, the agent needs to know why and what to do next.

Memory

  • Conversation memory: Last N turns, with token-budget-aware truncation.
  • User memory: Persistent facts about the user (preferences, account details, past interactions). Stored in a vector store or structured DB.
  • Knowledge memory: RAG over the product's documentation, knowledge base, etc.

Don't conflate these. Each has a different update cadence and retrieval strategy.
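
For conversation memory, token-budget-aware truncation could look like the sketch below. `estimate_tokens` is a deliberately crude assumption (one token per word); a real system would use the model's own tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def truncate_history(turns: list[str], max_turns: int, token_budget: int) -> list[str]:
    """Keep the most recent turns: at most max_turns, within token_budget."""
    recent = turns[-max_turns:] if max_turns > 0 else []
    kept: list[str] = []
    used = 0
    for turn in reversed(recent):  # walk newest to oldest
        cost = estimate_tokens(turn)
        if used + cost > token_budget:
            break                  # stop at the first turn that does not fit
        kept.append(turn)
        used += cost
    kept.reverse()                 # restore chronological order
    return kept
```

Walking from the newest turn backwards means the budget is always spent on the most recent context first.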

Planning patterns

  • ReAct: Reason → Act → Observe → Repeat. Simple, works for most tasks.
  • Plan-and-Execute: Plan all steps first, then execute. Useful for tasks with clear sub-steps.
  • Routing: A small classifier decides which specialist sub-agent handles the request. Often simpler and more reliable than a single big agent.

For most production cases, the "routing to specialised sub-agents" pattern wins. Each sub-agent is simpler, easier to evaluate, and easier to debug.
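
A rough sketch of the routing pattern. The keyword matcher in `classify` is a stand-in for the small classifier model, and the specialist names and keyword lists are invented for illustration.

```python
from typing import Callable


def billing_agent(query: str) -> str:
    return f"[billing] handling: {query}"


def shipping_agent(query: str) -> str:
    return f"[shipping] handling: {query}"


def escalate(query: str) -> str:
    return "[human] escalated"


SPECIALISTS: dict[str, Callable[[str], str]] = {
    "billing": billing_agent,
    "shipping": shipping_agent,
}

KEYWORDS = {
    "billing": ("refund", "invoice", "charge"),
    "shipping": ("delivery", "tracking", "package"),
}


def classify(query: str) -> str:
    """Keyword stand-in for a small classifier; ambiguous input is 'unknown'."""
    q = query.lower()
    matches = [label for label, words in KEYWORDS.items() if any(w in q for w in words)]
    return matches[0] if len(matches) == 1 else "unknown"


def route(query: str) -> str:
    # Anything the classifier cannot place cleanly goes to a human.
    handler = SPECIALISTS.get(classify(query), escalate)
    return handler(query)
```

Note that a query matching two specialists is treated as ambiguous and escalated rather than guessed at.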

Guardrails

  • Input validation. Reject obviously out-of-scope queries early.
  • Output validation. Check the agent's response against expected schemas before sending it to the user.
  • Action validation. Before any tool with real-world effects executes (sending email, charging a card, modifying data), validate the action against business rules.
  • PII filters. Detect and redact sensitive data leaving the system.
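
Action validation and a PII filter might be sketched as follows. The business rules (`MAX_REFUND_CENTS`, the `example.com` allowlist), the tool names, and the email regex are assumptions chosen for illustration; real PII detection needs far more than one pattern.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    tool: str
    args: dict


MAX_REFUND_CENTS = 10_000  # hypothetical business rule
SIDE_EFFECT_TOOLS = {"send_email", "refund_order"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # crude email pattern


def validate_action(action: ProposedAction) -> tuple[bool, str]:
    """Return (allowed, reason). Tools without side effects pass through."""
    if action.tool not in SIDE_EFFECT_TOOLS:
        return True, "read-only"
    if action.tool == "refund_order":
        amount = action.args.get("amount_cents")
        if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
            return False, "amount_cents must be a positive integer"
        if amount > MAX_REFUND_CENTS:
            return False, "refund exceeds limit"
    if action.tool == "send_email":
        if not str(action.args.get("to", "")).endswith("@example.com"):
            return False, "recipient not on allowlist domain"
    return True, "ok"


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address before it leaves the system."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)
```

The validator returns a reason alongside the verdict so that a rejection can be logged and, if needed, handed to a human with context.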

Escalation

This is the most important pattern. The agent needs a clear "I'm not sure, escalate to a human" path, triggered by any of the following:

  • Low-confidence retrievals.
  • Ambiguous user intent.
  • Requests outside the agent's domain.
  • Emotional escalation in the conversation.
  • High-stakes actions (financial, legal, irreversible).

A good agent escalates often. A great one escalates only when it should.
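
The triggers listed above can be expressed as one explicit check per turn. This is a sketch: the `TurnSignals` fields and both thresholds are placeholders, and in practice each signal would come from a retriever, a classifier, or a rule.

```python
from dataclasses import dataclass


@dataclass
class TurnSignals:
    retrieval_score: float    # 0..1, assumed to come from the retriever
    intent_confidence: float  # 0..1, assumed to come from an intent classifier
    in_domain: bool
    user_is_upset: bool
    high_stakes_action: bool


# Placeholder thresholds; tune them against real traffic.
MIN_RETRIEVAL_SCORE = 0.6
MIN_INTENT_CONFIDENCE = 0.7


def escalation_reasons(s: TurnSignals) -> list[str]:
    """Return every trigger that fired; an empty list means the agent proceeds."""
    checks = [
        (s.retrieval_score < MIN_RETRIEVAL_SCORE, "low_confidence_retrieval"),
        (s.intent_confidence < MIN_INTENT_CONFIDENCE, "ambiguous_intent"),
        (not s.in_domain, "out_of_domain"),
        (s.user_is_upset, "emotional_escalation"),
        (s.high_stakes_action, "high_stakes_action"),
    ]
    return [reason for fired, reason in checks if fired]
```

Returning the reasons, not just a yes/no, makes it possible to measure which triggers fire too often and tune them.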

Evaluation

  • Trajectory evaluation: Did the agent take the right path of tool calls?
  • Output evaluation: Was the final response correct?
  • Safety evaluation: Did the agent refuse harmful requests while still completing safe ones?
  • Production sampling: Continuously grade real production interactions.
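
Trajectory evaluation is the least familiar of these, so here is one way it could be scored. The case format and the `run_agent` callable (which returns the list of tool names the agent called) are assumptions for this sketch.

```python
from typing import Callable


def contains_in_order(expected: list[str], actual: list[str]) -> bool:
    """True if the expected tool names appear in `actual` in order; extra calls are allowed."""
    remaining = iter(actual)
    # `tool in remaining` consumes the iterator, so order is enforced.
    return all(tool in remaining for tool in expected)


def trajectory_pass_rate(cases: list[dict], run_agent: Callable[[str], list[str]]) -> float:
    """Fraction of cases whose tool-call trajectory contains the expected path."""
    if not cases:
        return 0.0
    passed = sum(
        contains_in_order(case["expected_tools"], run_agent(case["input"]))
        for case in cases
    )
    return passed / len(cases)
```

The lenient in-order check tolerates extra tool calls. Swap in a strict equality check when the exact path matters, for example when an extra call has side effects or costs money.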

Observability

Every agent invocation is a tree of calls. Trace them all:

  • Input prompt.
  • Planner output.
  • Each tool call and its result.
  • Final output.
  • Latency and cost per step.

Without this, debugging means guessing which step went wrong. LangSmith, Langfuse, or a homegrown tracer all work.
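
For the DIY route, the trace can be as simple as a tree of spans. The sketch below is not the LangSmith or Langfuse API; `Span` and `traced` are hypothetical names, and costs are passed in by hand where a real tracer would compute them from token usage.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    input: str = ""
    output: str = ""
    latency_s: float = 0.0
    cost_usd: float = 0.0
    children: list["Span"] = field(default_factory=list)

    def total_cost(self) -> float:
        """Cost of this span plus all of its descendants."""
        return self.cost_usd + sum(child.total_cost() for child in self.children)


def traced(parent: Span, name: str, fn, arg: str, cost_usd: float = 0.0) -> str:
    """Run fn(arg) and record input, output, latency, and cost as a child span."""
    start = time.perf_counter()
    out = fn(arg)
    elapsed = time.perf_counter() - start
    parent.children.append(Span(name, arg, out, elapsed, cost_usd))
    return out


# Usage: one invocation with a planner step and a tool step (both faked).
root = Span("invocation", input="where is my order?")
plan_out = traced(root, "planner", lambda s: "call search_orders", root.input, cost_usd=0.002)
traced(root, "tool:search_orders", lambda s: "1 order found", plan_out)
```

Because each span keeps its own input and output, a bad final answer can be traced back to the step that first went wrong.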

Cost discipline

  • Cache where possible (system prompts, common tool results).
  • Use smaller models for routing and validation, larger models for the actual reasoning.
  • Stream where users will wait, batch where they won't.
  • Have a per-user and per-conversation cost ceiling. Stop runaway agents.
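
A cost ceiling can be a few lines of code. In this sketch, `CostMeter`, `BudgetExceeded`, and `run_steps` are hypothetical names; one meter would be kept per conversation and another per user, with both charged on every step.

```python
class BudgetExceeded(Exception):
    pass


class CostMeter:
    def __init__(self, ceiling_usd: float):
        self.ceiling_usd = ceiling_usd
        self.spent_usd = 0.0

    def charge(self, amount_usd: float) -> None:
        """Record spend, refusing any charge that would cross the ceiling."""
        if self.spent_usd + amount_usd > self.ceiling_usd:
            raise BudgetExceeded(f"ceiling of ${self.ceiling_usd:.2f} reached")
        self.spent_usd += amount_usd


def run_steps(step_costs: list[float], meter: CostMeter) -> int:
    """Run steps until done or the budget is hit; return how many completed."""
    completed = 0
    for cost in step_costs:
        try:
            meter.charge(cost)  # check the budget before doing the work
        except BudgetExceeded:
            break               # stop the runaway loop; escalate or apologise here
        completed += 1
    return completed
```

Charging before the step runs, not after, is what stops a looping agent from overshooting the ceiling by one expensive call.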

The summary

Production agents work when they're narrow, bounded, observable, and have great escalation. Demo agents that "do everything" almost always fall apart at scale. Pick the smallest possible domain, do it well, then expand.