Claude Code subagents unlock a qualitatively different way of working: delegation, parallelism, and specialization at the agent layer instead of the prompt layer. But the delta between a demo and a production subagent system is significant. The failure modes are subtle, the costs compound quickly, and poorly scoped agents create more work for the orchestrator than they save.
This article distills 12 practices that consistently separate working subagent systems from ones that look fine until they don’t. These come from analyzing hundreds of AGENTS.md files from production codebases and from running subagents inside real multi-step workflows — not from reading the API docs alone.
If you want the conceptual foundation first, read our complete subagents reference. This article assumes you understand what subagents are and focuses on what makes them reliable at scale.
Why Subagents Matter (and When They Don’t)
Subagents exist to solve a specific problem: some work is too large, too specialized, or too parallelizable to fit inside a single context window or a single agent’s responsibility. When you delegate to a subagent, you get:
- Context isolation. The subagent starts with a clean slate. It doesn’t carry the baggage of the orchestrating session — previous failed attempts, unrelated conversation history, or the orchestrator’s working assumptions.
- Specialization. A subagent defined in AGENTS.md can have a tightly scoped system prompt, restricted tools, and output conventions that make it reliably good at one thing.
- Potential parallelism. When tasks don’t block each other, multiple subagents can work simultaneously.
But subagents are not always the right answer. If the task genuinely requires shared state and tight feedback loops, splitting it across agents often costs more in coordination overhead than it saves. If the problem fits in one context window and doesn’t require parallel execution, a skill or a well-crafted prompt is simpler and cheaper. Use subagents when the decomposition is natural — not because delegation feels more “agentic.”
Best Practice 1: Single-Responsibility Principle for Subagent Design
The most common mistake in AGENTS.md files is writing subagents that do too many things. A subagent that “researches topics, writes summaries, and updates the knowledge base” is three agents pretending to be one. When it fails, you don’t know which responsibility caused the failure.
Apply the single-responsibility principle: each subagent does one well-defined job. That job should be describable in a single sentence without the word “and.”
Weak definition:
## content-agent
Researches topics, writes blog posts, and publishes them to the CMS.
Strong definition:
## research-agent
Given a topic and optional source URLs, returns a structured research report
in JSON format. Does not write prose or interact with external systems beyond
reading URLs. Stops after 10 tool calls.
## writer-agent
Given a research report in JSON format, writes a 1,200-word blog post in
markdown. Does not research, fetch URLs, or publish anything.
## publish-agent
Given a markdown post and metadata, publishes it to the CMS via the API.
Does not write or modify content.
The orchestrator sequences these three agents. Each one is testable in isolation. When publish-agent fails, you know the problem is in publishing, not in research or writing.
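The sequencing above can be sketched as a thin pipeline that records which stage failed. This is an illustration under assumptions, not Claude Code’s API: `run_pipeline`, `StageError`, and the idea that each agent is a plain callable taking and returning a dict are all hypothetical.

```python
from typing import Callable

# Hypothetical: each agent is modelled as a function from dict to dict.
Agent = Callable[[dict], dict]


class StageError(RuntimeError):
    """Raised when one stage fails, carrying the name of that stage."""

    def __init__(self, stage: str, cause: Exception):
        super().__init__(f"{stage} failed: {cause}")
        self.stage = stage


def run_pipeline(topic: str, stages: list[tuple[str, Agent]]) -> dict:
    """Run agents in order, feeding each one the previous agent's output."""
    payload: dict = {"topic": topic}
    for name, agent in stages:
        try:
            payload = agent(payload)
        except Exception as exc:
            # Attribute the failure to exactly one single-purpose agent.
            raise StageError(name, exc) from exc
    return payload
```

Because every stage has one job, the `stage` attribute on the error is usually enough to tell you which definition to inspect.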
Best Practice 2: Aggressive Tool Restriction with the tools: Field
Every tool you give a subagent is a potential failure mode. File system tools can write to the wrong path. Network tools can make unintended requests. Shell tools can run dangerous commands. The principle is: give the subagent the minimum set of tools it needs to complete its job.
The tools: field in AGENTS.md lets you specify exactly which tools are available to a subagent. Use it.
## data-fetcher
Fetches data from the configured API endpoint and returns raw JSON.
tools:
- WebFetch
- WebSearch
# No file system access, no shell, no code execution
## file-processor
Reads CSV files from ./data/input/, processes them, and writes results to ./data/output/.
tools:
- Read
- Write
- Bash
# Explicitly document what Bash is needed for:
# - Running the data validation script: python3 scripts/validate.py
# No network access needed
Tool restriction has three benefits: it prevents agents from taking actions outside their scope, it makes behavior more predictable (the agent can’t “solve” a problem by using an unexpected tool), and it reduces the attack surface when agents process untrusted input.
If you find yourself adding tools to a subagent because it keeps failing, that’s usually a signal that the task scope is too broad, not that the agent needs more capabilities.
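One way to keep tool lists from growing quietly is a small audit that compares declared tools against a reviewed allowlist. The sketch below assumes you have already parsed each agent’s tool list into a dict; `ALLOWED_TOOLS` and `audit_tools` are hypothetical names, and the allowlist contents simply mirror the two example agents above.

```python
# Hypothetical reviewed allowlist: agent name -> tools it is permitted to use.
ALLOWED_TOOLS: dict[str, set[str]] = {
    "data-fetcher": {"WebFetch", "WebSearch"},
    "file-processor": {"Read", "Write", "Bash"},
}


def audit_tools(declared: dict[str, list[str]],
                allowed: dict[str, set[str]]) -> list[str]:
    """Return one message per agent that declares a tool outside its allowlist."""
    violations = []
    for agent, tools in sorted(declared.items()):
        # Agents missing from the allowlist are treated as having no permitted tools.
        extra = set(tools) - allowed.get(agent, set())
        if extra:
            violations.append(f"{agent}: unexpected tools {sorted(extra)}")
    return violations
```

Running this in CI means that adding a tool becomes a deliberate, reviewed change to the allowlist rather than a quick fix for a failing agent.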
Best Practice 3: Model Selection Per Subagent
Claude Code supports specifying which model a subagent runs on. This is one of the highest-leverage cost optimizations available — and most teams leave it on the table.
The pattern is straightforward: use Haiku for high-volume, low-complexity work; use Sonnet for most production tasks; reserve Opus for work that genuinely requires deep reasoning or where errors are expensive.
## log-summarizer
Reads application logs and extracts error patterns. Returns a bullet-point
summary of the top 5 error categories by frequency.
model: claude-haiku-4-5
tools:
- Read
# Haiku is sufficient for pattern extraction from structured text.
# This agent runs on every CI pass — cost matters.
## architecture-reviewer
Reviews a proposed system architecture document and identifies structural
risks, missing failure modes, and scaling bottlenecks.
model: claude-opus-4-5
tools:
- Read
# Architecture review requires genuine reasoning about complex tradeoffs.
# Runs once per major proposal — cost is acceptable.
The practical rule: if a task can be completed by following explicit patterns (format conversion, extraction, classification, summarization of structured data), Haiku handles it well. If the task requires judgment calls, cross-domain reasoning, or creative problem-solving under constraints, use Sonnet or Opus.
Track your model distribution over time. If more than 20% of your subagent calls are on Opus, your task decomposition probably has gaps — some of that work could be handed to cheaper models with better scoping.
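The 20% heuristic is easy to check mechanically. The sketch below assumes each logged call is a dict with a `model` field holding the model identifier; the function names are hypothetical.

```python
from collections import Counter


def model_share(calls: list[dict]) -> dict[str, float]:
    """Fraction of calls per model, e.g. {"claude-haiku-4-5": 0.7, ...}."""
    counts = Counter(call["model"] for call in calls)
    total = sum(counts.values())
    return {model: n / total for model, n in counts.items()} if total else {}


def opus_heavy(calls: list[dict], threshold: float = 0.20) -> bool:
    """True when the share of calls on any Opus model exceeds the threshold."""
    shares = model_share(calls)
    opus = sum(share for model, share in shares.items() if "opus" in model)
    return opus > threshold
```

The threshold is a parameter because 20% is a rule of thumb from this article, not a hard limit.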
Best Practice 4: Skills Over Subagents for Encoded Preferences
There is a common confusion between when to use a skill and when to use a subagent. The distinction matters: skills encode how to do something, subagents delegate what to do.
A skill is a reusable procedure — a sequence of steps with encoded preferences, naming conventions, and quality standards. It runs in the current context. A subagent is a separate agent instance with its own context, tools, and responsibility boundary.
Use skills when:
- You need to encode a repeatable workflow with specific preferences (formatting, file naming, output structure)
- The task runs in the current context and doesn’t need isolation
- You want the behavior to be consistent across many invocations without respecifying it
Use subagents when:
- The task genuinely needs a clean context
- You want to run multiple tasks in parallel
- The task requires a different tool set than the orchestrator has
A practical example: if your team has a standard code review process (check for type errors, verify test coverage, confirm naming conventions), encode it as a skill. The preferences are stable and the skill will apply them consistently. But if you want to run code review in parallel with security analysis and documentation generation — each needing different tools and context — those are subagents.
For more on skill design, see our complete guide to writing Claude Code skills.
Best Practice 5: Subagents vs. Agent Teams — a Decision Framework
“Agent teams” in Claude Code refers to the experimental multi-agent parallel development feature. Subagents are the general delegation mechanism. They solve different problems.
Use subagents when:
- You need to orchestrate a multi-step workflow with sequential or conditional dependencies
- You want to isolate context for specific subtasks
- You need different tool sets for different parts of the job
Use agent teams when:
- You need multiple agents working on a codebase simultaneously with git-level coordination
- You want parallel development across separate branches
- You’re doing large-scale refactoring where parallelism is the primary goal
The two can coexist: your agent team might use subagents internally for specialized analysis tasks. But they shouldn’t be interchangeable — pick the right primitive for the problem.
A simple decision test: if the work can be represented as a dependency graph (step A must complete before step B), use subagents. If the work is a set of genuinely independent parallel tracks (write feature X and feature Y simultaneously), agent teams are more appropriate.
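The decision test can be made concrete by writing the work down as a dependency map and looking at its shape. This sketch uses the standard library’s `graphlib`; the task names and the `execution_stages` helper are illustrative assumptions.

```python
from graphlib import TopologicalSorter


def execution_stages(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into stages; tasks within one stage do not depend on each other.

    `deps` maps each task to the set of tasks that must finish before it.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


# A chain: every stage holds one task, which points toward sequenced subagents.
chain = {"research": set(), "write": {"research"}, "publish": {"write"}}
# Independent tracks: one stage holds everything, which points toward agent teams.
tracks = {"feature-x": set(), "feature-y": set()}
```

Many stages with one task each suggest a sequenced subagent workflow, while a single wide stage suggests independent parallel tracks.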
Best Practice 6: Context Window Hygiene
Each subagent starts with its AGENTS.md definition plus whatever the orchestrator passes it. Everything else begins clean. This is a feature — but it means you need to be deliberate about what you pass.
The most common context hygiene failure is passing too much. Orchestrators often forward their entire accumulated context to subagents “just in case.” This wastes tokens, can confuse the subagent with irrelevant information, and slows down the task.
Pass structured inputs, not raw conversation history:
## Orchestrator passes to code-reviewer:
{
  "file_path": "src/auth/token_validator.py",
  "review_type": "security",
  "known_issues": ["CVE-2025-1234 - verify token expiry check"],
  "output_format": "json"
}
Not:
Here's everything that happened in this session so far: [5,000 tokens of
context about unrelated tasks] ... Now please review this file.
Similarly, design subagents to return structured outputs, not prose that the orchestrator needs to parse. A subagent that returns JSON is faster and cheaper to work with than one that returns a narrative.
Define a contract for each subagent: what it receives, what it returns, and what errors it raises. Enforce the contract in both the AGENTS.md definition and in any downstream parsing logic.
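On the orchestrator side, an input contract can be as small as a validated dataclass. The sketch below models the code-reviewer payload shown earlier; the class name and the set of allowed review types other than `security` are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass, field

# Hypothetical: only "security" appears in the article; the others are examples.
REVIEW_TYPES = {"security", "style", "performance"}


@dataclass(frozen=True)
class ReviewRequest:
    """Everything the code reviewer receives, and nothing else."""

    file_path: str
    review_type: str
    known_issues: list[str] = field(default_factory=list)
    output_format: str = "json"

    def __post_init__(self) -> None:
        # Reject malformed requests before any tokens are spent.
        if not self.file_path:
            raise ValueError("file_path must not be empty")
        if self.review_type not in REVIEW_TYPES:
            raise ValueError(f"unknown review_type: {self.review_type!r}")


payload = asdict(ReviewRequest("src/auth/token_validator.py", "security"))
```

Because the payload is built from named fields, there is no path by which accumulated conversation history can leak into the subagent’s input.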
Best Practice 7: Error Handling and Fallback Patterns
Subagents fail. Networks time out, tools return unexpected results, edge cases in input data cause the agent to go sideways. Your orchestrating agent needs explicit strategies for what to do when a subagent fails — not just a generic “try again.”
The three patterns that work in production:
Retry with narrowed scope. If a subagent fails on a complex input, break the input into smaller pieces and retry. A code reviewer failing on a 2,000-line file can often succeed on 200-line chunks.
Fallback to a different agent or model. Define a fallback: subagent in AGENTS.md for tasks where failure is high-cost. The fallback might use a more capable model, a simpler approach, or a human-in-the-loop step.
Explicit failure reporting. Instruct subagents to return a structured failure object rather than raising an exception or returning garbled output. The orchestrator can then make an informed decision.
## data-validator
Validates the input data structure and returns a result object.
On success, return:
{
  "status": "ok",
  "record_count": <integer>,
  "warnings": []
}
On failure, return:
{
  "status": "error",
  "error_type": "schema_mismatch" | "missing_required_fields" | "encoding_error",
  "details": "<human-readable description>",
  "affected_records": [<list of row indices if applicable>]
}
Never throw an exception. Always return a result object.
This explicit failure contract makes the orchestrator’s error handling straightforward: check status, branch on error_type, log details.
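The retry-with-narrowed-scope pattern pairs naturally with a structured failure object. The sketch below is one possible shape, assuming a hypothetical `review` callable that takes a list of source lines and returns a dict with `status` and `findings`.

```python
from typing import Callable

# Hypothetical reviewer: takes source lines, returns {"status": ..., "findings": [...]}.
Reviewer = Callable[[list[str]], dict]


def review_with_narrowing(lines: list[str], review: Reviewer,
                          chunk_size: int = 200) -> dict:
    """Try the whole input first; on failure, retry in chunks and merge results."""
    result = review(lines)
    if result.get("status") == "ok" or len(lines) <= chunk_size:
        return result  # success, or already too small to narrow further

    findings, failed_chunks = [], []
    for start in range(0, len(lines), chunk_size):
        chunk_result = review(lines[start:start + chunk_size])
        if chunk_result.get("status") == "ok":
            findings.extend(chunk_result.get("findings", []))
        else:
            failed_chunks.append(start)  # remember where the chunk began

    status = "ok" if not failed_chunks else "error"
    return {"status": status, "findings": findings, "failed_chunks": failed_chunks}
```

If findings carry line numbers, they will be relative to each chunk, so a real implementation would add `start` back before merging.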
Best Practice 8: Logging and Observability
You cannot debug what you cannot observe. Production subagent systems need logging at the orchestrator level, not just inside individual agents.
Log at minimum:
- Which subagent was called
- What input it received (or a hash/summary if the input is large)
- How long it took
- What it returned (or the error it raised)
- Which model was used
A simple logging wrapper around every subagent call:
import hashlib
import json
import time
from datetime import datetime, timezone

# invoke_subagent, estimate_tokens, and append_to_log are placeholders for
# your own integration code; define them before using this wrapper.

def call_subagent(agent_name: str, input_data: dict) -> dict:
    start = time.monotonic()
    # Log a short hash instead of the raw input so large payloads stay out of the log.
    input_hash = hashlib.sha256(
        json.dumps(input_data, sort_keys=True).encode()
    ).hexdigest()[:8]
    log_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent": agent_name,
        "input_hash": input_hash,
        "input_size_tokens": estimate_tokens(input_data),
    }
    try:
        result = invoke_subagent(agent_name, input_data)
        log_entry["status"] = "ok"
        log_entry["output_size_tokens"] = estimate_tokens(result)
        return result
    except Exception as e:
        log_entry["status"] = "error"
        log_entry["error"] = str(e)
        raise
    finally:
        # Runs on both paths, so every call is logged with its duration.
        log_entry["duration_ms"] = int((time.monotonic() - start) * 1000)
        append_to_log("subagent_calls.jsonl", log_entry)
Once you have structured logs, you can build useful aggregations: which agents are slowest, which fail most often, which inputs correlate with failures. This data drives targeted improvements instead of guesswork.
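As a rough sketch of such an aggregation, the function below reads JSONL text and reports call count, error rate, and mean duration per agent. It assumes each record carries the `agent`, `status`, and `duration_ms` fields written by the logging wrapper; `summarize_calls` is a hypothetical name.

```python
import json
from collections import defaultdict


def summarize_calls(jsonl_text: str) -> dict[str, dict]:
    """Aggregate per-agent call count, error rate, and mean duration."""
    totals = defaultdict(lambda: {"calls": 0, "errors": 0, "total_ms": 0})
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the log file
        entry = json.loads(line)
        agent_totals = totals[entry["agent"]]
        agent_totals["calls"] += 1
        agent_totals["errors"] += entry["status"] == "error"
        agent_totals["total_ms"] += entry.get("duration_ms", 0)

    return {
        agent: {
            "calls": t["calls"],
            "error_rate": t["errors"] / t["calls"],
            "mean_ms": t["total_ms"] / t["calls"],
        }
        for agent, t in totals.items()
    }
```

Sorting the result by `error_rate` or `mean_ms` gives a quick shortlist of agents worth investigating first.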
Hooks are another observability layer worth using. Claude Code hooks can fire on subagent invocation events and write structured records without modifying your agent code. See our complete hooks reference for the full event list and handler patterns.
Best Practice 9: Cost Tracking Per Subagent
At scale, subagent costs can surprise you. A workflow that looks cheap per invocation can run hundreds of times per day and dominate your bill. The only way to manage this is per-subagent cost tracking.
The key metrics to track:
- Input tokens per call
- Output tokens per call
- Model used (Haiku vs Sonnet vs Opus price difference is significant)
- Calls per day / per workflow run
A minimal cost tracking setup using JSONL logs:
# USD per million tokens. Prices change; confirm against the current pricing page.
MODEL_COSTS = {
    "claude-haiku-4-5": {"input": 1.00, "output": 5.00},
    "claude-sonnet-4-5": {"input": 3.00, "output": 15.00},
    "claude-opus-4-5": {"input": 5.00, "output": 25.00},
}

def calculate_call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    # Unknown models fall back to Sonnet rates so the estimate is never silently zero.
    rates = MODEL_COSTS.get(model, MODEL_COSTS["claude-sonnet-4-5"])
    return (
        (input_tokens / 1_000_000) * rates["input"]
        + (output_tokens / 1_000_000) * rates["output"]
    )
Review your cost logs weekly, not monthly. Common patterns that indicate optimization opportunities:
- High input token count. The orchestrator is passing too much context. Tighten the input contract.
- High output token count. The agent is returning more than necessary. Add explicit output length constraints.
- Most calls on Sonnet/Opus. Revisit model selection — some of these tasks may not need the heavier model.
- Frequent retries. Count retries as separate calls in your logs. High retry rates mean your agent definitions or input quality need work.
Set budget alerts at the team level. If a workflow starts costing 3x what it did last week, you want to know before the bill arrives.
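A week-over-week spike check needs very little code once costs are aggregated per workflow. The sketch below assumes two dicts mapping workflow name to total cost in dollars; `cost_spikes` is a hypothetical helper, and the default factor of 3 comes from the example above.

```python
def cost_spikes(last_week: dict[str, float], this_week: dict[str, float],
                factor: float = 3.0) -> list[str]:
    """Return workflows whose cost grew by at least `factor` week over week."""
    spikes = []
    for workflow, cost in this_week.items():
        baseline = last_week.get(workflow, 0.0)
        # Workflows with no baseline are skipped; they need a separate "new spend" check.
        if baseline > 0 and cost >= factor * baseline:
            spikes.append(workflow)
    return sorted(spikes)
```

Wiring the returned list into whatever alerting channel the team already uses is usually enough to catch runaway workflows early.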
Best Practice 10: Naming Conventions That Scale
When you have ten subagents, naming is trivial. When you have forty, it determines whether the system is navigable. Establish a convention early and enforce it in code review.
A convention that works across large AGENTS.md files:
{domain}-{verb}-{noun}
Examples:
auth-validate-token
code-review-security
docs-generate-api-reference
data-transform-csv-to-json
test-run-integration
Rules for the convention:
- Domain is the functional area (auth, code, docs, data, test, deploy, notify)
- Verb is what the agent does to something (validate, review, generate, transform, run, fetch, publish)
- Noun is what it operates on (token, security, api-reference, csv-to-json, integration)
- No generic names like helper, utility, or agent
- No model names in the subagent name (the model can change without renaming the agent)
Store the naming convention in a CONVENTIONS.md file alongside your AGENTS.md and add a CI check that validates new entries conform to the pattern. A linter that runs grep on AGENTS.md and checks name format catches violations at PR time, not at 3am when something breaks in production.
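A name check of this kind fits in a few lines of Python if grep becomes unwieldy. The sketch below encodes the rules listed above; the domain and verb sets are copied from the examples, and `name_problems` is a hypothetical function that your CI script would call for each agent name it extracts.

```python
import re

DOMAINS = {"auth", "code", "docs", "data", "test", "deploy", "notify"}
VERBS = {"validate", "review", "generate", "transform", "run", "fetch", "publish"}
GENERIC = {"helper", "utility", "agent"}
MODEL_WORDS = {"haiku", "sonnet", "opus"}
PART_RE = re.compile(r"[a-z0-9]+")


def name_problems(name: str) -> list[str]:
    """Return a list of convention violations; an empty list means the name is fine."""
    parts = name.split("-")
    # Need at least domain, verb, and one noun segment, all lowercase alphanumerics.
    if len(parts) < 3 or not all(PART_RE.fullmatch(p) for p in parts):
        return ["expected lowercase {domain}-{verb}-{noun}"]

    problems = []
    if parts[0] not in DOMAINS:
        problems.append(f"unknown domain '{parts[0]}'")
    if parts[1] not in VERBS:
        problems.append(f"unknown verb '{parts[1]}'")
    if GENERIC & set(parts):
        problems.append("generic word in name")
    if MODEL_WORDS & set(parts):
        problems.append("model name in agent name")
    return problems
```

Keeping the domain and verb sets in one place also gives reviewers a single file to update when the vocabulary legitimately grows.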
Best Practice 11: Versioning Subagent Definitions
Subagent definitions are code. They should be versioned, reviewed, and changed with the same care as any other code that goes into production.
The practical approach: track AGENTS.md in git alongside your codebase. Every change to a subagent definition gets a commit with a meaningful message. Use semantic versioning in a comment header on each agent definition if your agents have external consumers.
## code-review-security
# v2.1.0 — 2026-05-15
# Breaking change from v1.x: output format changed from markdown to JSON
# Requires orchestrator update before deployment
Performs security-focused code review of a single file.
### Input
{
  "file_path": string,
  "language": string,
  "security_level": "standard" | "high" | "critical"
}
### Output (JSON)
{
  "findings": [{"severity": string, "line": integer, "description": string}],
  "summary": string,
  "recommendation": "approve" | "request_changes" | "block"
}
When a subagent definition changes in a way that would break callers — different input schema, different output structure, different tool requirements — that is a breaking change and needs a major version bump and a migration plan for any orchestrators that depend on it.
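If definitions carry a version comment like the one in the example, a CI step can flag major bumps so they get a migration plan. This is a sketch under that assumption: it expects a comment line beginning with `# v<major>.<minor>.<patch>`, and both function names are hypothetical.

```python
import re

# Matches a comment header such as "# v2.1.0" at the start of a line.
HEADER_RE = re.compile(r"^#\s*v(\d+)\.(\d+)\.(\d+)", re.MULTILINE)


def parse_version(definition: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from an agent definition's version header."""
    match = HEADER_RE.search(definition)
    if match is None:
        raise ValueError("definition has no version header")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def is_breaking(old_definition: str, new_definition: str) -> bool:
    """True when the major version increased between two definitions."""
    return parse_version(new_definition)[0] > parse_version(old_definition)[0]
```

The check only detects a declared major bump; it cannot tell whether a schema changed without one, so it complements review rather than replacing it.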
For multi-team environments, consider separating AGENTS.md into per-domain files that are owned by specific teams: AGENTS.auth.md, AGENTS.data.md, AGENTS.deploy.md. Each team versions and reviews their own file. The orchestrating AGENTS.md imports or references them.
Best Practice 12: Testing Subagents Like Code
Subagents are not magic. Their output varies from run to run, but the structure of that output and the contract around it are stable enough to test, and they should be tested. The test patterns that work:
Contract tests. Verify that a subagent returns output that matches its defined output schema. Don’t test the content of the output — test that the structure is correct.
import jsonschema

# Mirrors the output documented for code-review-security.
OUTPUT_SCHEMA = {
    "type": "object",
    "required": ["findings", "summary", "recommendation"],
    "properties": {
        "findings": {"type": "array"},
        "summary": {"type": "string"},
        "recommendation": {
            "type": "string",
            "enum": ["approve", "request_changes", "block"],
        },
    },
}

def test_code_reviewer_output_schema():
    result = call_subagent("code-review-security", {
        "file_path": "tests/fixtures/sample.py",
        "language": "python",
        "security_level": "standard",
    })
    jsonschema.validate(result, OUTPUT_SCHEMA)  # raises ValidationError on mismatch
Smoke tests with known inputs. For each subagent, maintain a small set of fixture inputs with expected output properties. Run these on every change to AGENTS.md.
Regression tests on real failures. When a subagent fails in production, add the failing input as a fixture and write a test that asserts the fixed behavior. This builds a regression suite that reflects real-world edge cases, not imagined ones.
Load tests. If a subagent will be called at high frequency, test it under load. Token costs, latency, and error rates often behave differently at volume than in isolation.
The testing infrastructure investment pays off quickly. Changes to AGENTS.md definitions are one of the highest-risk changes in a subagent system — a test suite catches regressions before they reach production.
Common Anti-Patterns to Avoid
The God Agent
An agent that can do everything. It has every tool available, a vague system prompt, and is called for any task that doesn’t fit elsewhere. It produces inconsistent outputs, fails in unpredictable ways, and is impossible to debug.
Fix: decompose it into single-responsibility agents. If you truly need a general-purpose agent, give it a clear scope boundary and a defined fallback behavior.
Context Forwarding
Passing the entire orchestrator context to every subagent. This wastes tokens, introduces noise, and can cause the subagent to be influenced by irrelevant prior results.
Fix: define a structured input schema for each subagent. Pass only what it needs.
Silent Failures
Subagents that return plausible-looking but incorrect output rather than raising an error. The orchestrator sees a well-formed response, proceeds to the next step, and the error compounds.
Fix: add validation at the orchestrator level. After each subagent call, check that the output meets minimum quality criteria before proceeding.
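Schema checks catch malformed output, but silent failures are usually well-formed. A few semantic checks help, as in this sketch for a code-review result. It assumes findings with `line` and `severity` fields and a `recommendation` field; treating `"critical"` as a severity value is an assumption made here for illustration.

```python
def validate_review(result: dict, file_line_count: int) -> list[str]:
    """Return plausibility problems in a review result; empty means it passes."""
    findings = result.get("findings")
    if not isinstance(findings, list):
        return ["findings must be a list"]

    problems = []
    for index, finding in enumerate(findings):
        if not isinstance(finding, dict):
            problems.append(f"finding {index}: not an object")
            continue
        line = finding.get("line")
        # A finding that points outside the file was probably hallucinated.
        if not isinstance(line, int) or not 1 <= line <= file_line_count:
            problems.append(f"finding {index}: line {line!r} is outside the file")

    has_critical = any(
        isinstance(f, dict) and f.get("severity") == "critical" for f in findings
    )
    if has_critical and result.get("recommendation") == "approve":
        problems.append("critical finding but recommendation is approve")
    return problems
```

The orchestrator can treat a non-empty list the same way it treats an explicit error object: stop, log, and decide whether to retry or escalate.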
Model Uniformity
Using the same model (almost always Sonnet) for every subagent regardless of task complexity. This is expensive for simple tasks and sometimes insufficient for complex ones.
Fix: audit each subagent’s actual task complexity. Haiku handles extraction, classification, and summarization of structured data well. Opus is warranted for tasks that require genuine multi-step reasoning.
Missing Idempotency
Subagents that are not idempotent — calling them twice with the same input produces different side effects (writes two files, makes two API calls). When you add retry logic, non-idempotent agents double the damage.
Fix: design agents so that calling them twice with the same input is safe. For agents that write files or call external APIs, add idempotency keys or check-before-write logic.
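For file-writing agents, one way to get check-before-write behavior is to derive the output name from the input and create the file exclusively. The sketch below assumes JSON-serializable payloads; `idempotency_key` and `write_once` are hypothetical helpers.

```python
import hashlib
import json
from pathlib import Path


def idempotency_key(agent: str, payload: dict) -> str:
    """Stable key for (agent, payload); key order in the payload does not matter."""
    blob = json.dumps({"agent": agent, "payload": payload}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()[:16]


def write_once(out_dir: Path, agent: str, payload: dict,
               content: str) -> tuple[Path, bool]:
    """Write content for this input at most once; returns (path, created)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{idempotency_key(agent, payload)}.md"
    try:
        # Mode "x" fails if the file already exists, so a retry cannot double-write.
        with target.open("x", encoding="utf-8") as handle:
            handle.write(content)
    except FileExistsError:
        return target, False
    return target, True
```

The same key can be sent as an idempotency token to external APIs that support one, which covers the second half of the problem.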
Undocumented Dependencies
A subagent that silently requires certain environment variables, API keys, or file system state to exist. It fails with a cryptic error when deployed in a context where that state doesn’t exist.
Fix: document all external dependencies in the agent definition. Validate them at the start of the agent’s execution and return a clear error if they’re missing.
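A preflight check along these lines can run before the agent does any real work. The sketch below returns a structured result in the style of the failure contract from Best Practice 7; the `missing_dependency` error type and the `preflight` name are assumptions, not an established convention.

```python
import os
from collections.abc import Mapping
from typing import Optional


def preflight(required_env: list[str],
              env: Optional[Mapping[str, str]] = None) -> dict:
    """Check that required environment variables are set and non-empty."""
    env = os.environ if env is None else env
    missing = [name for name in required_env if not env.get(name)]
    if missing:
        return {
            "status": "error",
            "error_type": "missing_dependency",  # hypothetical error type
            "details": "missing environment variables: " + ", ".join(missing),
            "missing": missing,
        }
    return {"status": "ok"}
```

Only variable names appear in the result, so the check is safe to log even when the variables hold credentials.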
FAQ
How many subagents is too many?
There is no fixed limit, but complexity grows non-linearly. Orchestrating 5 agents is straightforward. Orchestrating 20 requires careful dependency management, logging infrastructure, and a testing investment. If you have more than 10-15 agents in AGENTS.md, audit whether each one is genuinely used and whether any can be consolidated without losing specificity.
Should subagent definitions live in AGENTS.md or in separate files?
Both approaches work. A single AGENTS.md is simpler for small teams — everything is in one place. For large codebases with many agents owned by different teams, splitting into per-domain files reduces merge conflicts and makes ownership clear. The key is consistency: don’t mix the two approaches.
Can I call a subagent from inside another subagent?
Technically yes, but it adds complexity and cost that is rarely justified. Deeply nested agent calls are hard to debug, and the cost compounds with each level of nesting. In most cases, flat orchestration (one orchestrator calling multiple specialized agents) is more maintainable than hierarchical agent trees.
How do I handle secrets inside subagents?
Secrets should never appear in AGENTS.md definitions or be passed as explicit context. Expose them as environment variables to the Claude Code process and have the scripts or commands an agent runs via the Bash tool read them directly, so the values never enter the agent’s context or output; a dedicated secrets management pattern works too. Log the names of secrets used, not their values.
What is the right retry count for a failing subagent?
Two retries is a reasonable default for transient failures (network timeouts, temporary API errors). For logic failures (the agent returns an error because the input doesn’t match what it expects), retrying with the same input will fail again. Distinguish error types in your error handling and only retry transient failures. For logic failures, escalate to the orchestrator.
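The distinction between transient and logic failures can be sketched as a small retry wrapper. It assumes the subagent returns a result object with `status` and `error_type`; the transient error names below are hypothetical, and `sleep` is injectable so the function can be tested without waiting.

```python
import time
from typing import Callable

# Hypothetical error types that are worth retrying.
TRANSIENT = {"timeout", "rate_limited", "temporary_api_error"}


def call_with_retry(call: Callable[[], dict], max_retries: int = 2,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Retry transient failures with exponential backoff; return others at once."""
    attempt = 0
    while True:
        result = call()
        out_of_retries = attempt >= max_retries
        is_transient = result.get("error_type") in TRANSIENT
        if result.get("status") == "ok" or not is_transient or out_of_retries:
            # Record the attempt count so retries show up in cost tracking.
            return {**result, "attempts": attempt + 1}
        sleep(base_delay * 2 ** attempt)
        attempt += 1
```

A logic failure such as a schema mismatch comes back after a single attempt, leaving the orchestrator to decide how to escalate.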
How do I keep subagent definitions synchronized with code changes?
Treat AGENTS.md changes as code changes that require review. When a function signature changes, the corresponding subagent’s input schema should be updated in the same PR. Consider adding a CI check that validates AGENTS.md definitions against your actual codebase structure (e.g., verifying that file paths referenced in agents exist).
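A path-existence check of that kind might look like the sketch below. It assumes definitions write repository-relative paths with a leading `./`, as in `./data/input/`; paths written any other way are not detected, and `missing_paths` is a hypothetical helper.

```python
import re
from pathlib import Path

# Matches tokens such as "./data/input/" or "./scripts/validate.py".
PATH_RE = re.compile(r"\./[\w./-]+")


def missing_paths(definition_text: str, repo_root: Path) -> list[str]:
    """Return referenced ./relative paths that do not exist under repo_root."""
    # Strip a trailing period so sentence-ending punctuation is not part of the path.
    referenced = {match.rstrip(".") for match in PATH_RE.findall(definition_text)}
    return sorted(path for path in referenced if not (repo_root / path).exists())
```

Failing the build on a non-empty result catches renamed directories in the same PR that renames them.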
What is the performance impact of using subagents vs. a single agent?
Subagents add latency: each invocation has API round-trip overhead plus the context initialization cost. For sequential workflows, this adds up. The payoff is context isolation and specialization — but if you are running subagents sequentially and each one is small, a single well-structured prompt might be faster and cheaper. Subagents provide the most value when they run in parallel or when task complexity genuinely requires isolation.
Putting It Together
The twelve practices in this article form a coherent system, not a checklist. Single-responsibility design enables tool restriction. Tool restriction enables cost optimization. Cost optimization requires per-subagent tracking. Tracking requires naming conventions and logging. Logging enables debugging. Debugging requires good error contracts. Error contracts require testing. Testing requires versioned definitions.
Start with the practices that address your current biggest pain points:
- If agents are producing inconsistent output: Best Practices 1, 6, and 12
- If costs are higher than expected: Best Practices 3, 9, and 6
- If debugging is hard: Best Practices 8, 7, and 10
- If maintenance is painful: Best Practices 11, 10, and 12
For the architecture patterns that complement these practices — specifically how subagents compose into larger multi-agent systems — see our Claude Code agent teams guide. For the skills layer that sits alongside subagents in a mature Claude Code setup, read how to write Claude Code skills. For comprehensive hook-based observability across your entire subagent system, the Claude Code hooks complete reference covers every event type you can instrument.
Browse real production AGENTS.md files that apply these patterns in our rules gallery.
The gap between a working demo and a reliable production system is not API knowledge — it is design discipline. Single-responsibility definitions, explicit contracts, and systematic testing are the same principles that make any software system maintainable. Subagents are no different. Apply the same rigor you would to any other production code and they will reward you with consistency, debuggability, and predictable costs.