The single-agent paradigm for AI-powered applications hit a wall for me about 14 months ago. I was building the security testing engine for SCAFU and had a single, monolithic LLM prompt that was supposed to handle reconnaissance, payload generation, vulnerability analysis, and report writing. The prompt was 4,200 tokens long, the context window was perpetually exhausted, and the model was mediocre at everything because it was trying to be an expert at everything simultaneously.
The breakthrough was obvious in retrospect: stop asking one agent to be a polymath and start building a team of specialists. Today, SCAFU's security testing engine coordinates 16+ specialized agents across 4 phases of operation, and the difference in output quality is not incremental. It is categorical. Here is how I got there and what I learned about making multi-agent systems work in production.
Why Single Agents Fail at Complex Tasks
The fundamental limitation of a single-agent approach is context window economics. A state-of-the-art model in 2026 offers 128K-200K tokens of context. That sounds like a lot until you consider what a complex task actually requires. For a security scan, the agent needs:
- the scan configuration and target details (500 tokens),
- the reconnaissance results from the target (2,000-8,000 tokens depending on application size),
- the relevant vulnerability knowledge for the identified technology stack (3,000-5,000 tokens),
- generated payloads and their rationale (1,000-3,000 tokens per vulnerability class),
- raw scan results that need analysis (5,000-20,000 tokens for a moderately complex application), and
- the output format and reporting requirements (1,000 tokens).
For a complex target, you are looking at 40,000-60,000 tokens of essential context. That leaves maybe 70,000-90,000 tokens for the model's reasoning. And here is the problem: model performance degrades significantly as context utilization increases. Studies from late 2025 showed that retrieval accuracy in the middle of long contexts drops by 20-40% compared to information at the beginning or end. The model literally loses track of critical details buried in the middle of a massive prompt.
More importantly, single agents suffer from role confusion. A model asked to simultaneously be a reconnaissance expert, a payload crafter, a vulnerability analyst, and a report writer cannot maintain the distinct cognitive modes each role requires. Reconnaissance demands breadth and curiosity -- exploring every surface. Payload generation demands precision and creativity -- finding the exact syntax that breaks a specific parser. Analysis demands skepticism and rigor -- challenging every finding against real exploitability criteria. These are different thinking styles, and forcing a single agent to context-switch between them produces mediocre work across the board.
Communication Patterns: Hub-and-Spoke vs. Mesh vs. Pipeline
The first design decision in a multi-agent system is how agents communicate with each other. I tested three patterns before settling on a hybrid approach.
Hub-and-Spoke
A central coordinator agent dispatches tasks to specialist agents and aggregates their results. This is the simplest pattern and the one most tutorials demonstrate. The coordinator maintains the overall task state and decides which specialist to invoke next.
The problem with pure hub-and-spoke is that the coordinator becomes a bottleneck and a single point of failure. If the coordinator misunderstands a specialist's output, the error propagates to every downstream agent. In my testing, coordinator comprehension errors occurred in approximately 12% of interactions, and each error cascaded into 2-4 downstream failures. At 16+ agents, a 12% per-interaction error rate compounds into system-level unreliability.
Mesh Communication
Every agent can communicate with every other agent directly. This eliminates the coordinator bottleneck but introduces a different problem: communication complexity scales as O(n^2). With 16 agents, there are 240 possible directed communication channels (16 x 15). In practice, agents would get into conversation loops, repeatedly requesting information from each other without converging on a result. Without a central authority to break deadlocks, mesh communication in a 16-agent system averaged 3.2 times more LLM calls per task than hub-and-spoke, with no improvement in output quality.
Pipeline (What Actually Works)
The production architecture uses a directed pipeline with lateral communication channels. Agents are organized into phases, and the output of each phase feeds the input of the next. Within a phase, agents can communicate laterally with other agents in the same phase, but they cannot reach back to modify earlier phases or skip ahead.
In SCAFU, the four phases are:
- Reconnaissance (3 agents): TechStack Detector, Endpoint Mapper, Security Profiler. These agents run in parallel, each producing a structured report. Lateral communication is limited to deduplication: if the Endpoint Mapper discovers a technology indicator, it notifies the TechStack Detector to avoid redundant probing.
- Planning (2 agents): Attack Planner, Priority Ranker. These agents receive the combined reconnaissance output and produce a prioritized attack plan. The Planner generates candidate attack vectors. The Ranker evaluates them against the security profile and produces a priority-ordered queue.
- Execution (8 agents): Specialized payload generators and testers for XSS, SQLi, Authentication, SSRF, IDOR, API abuse, Deserialization, and CORS. Each agent receives the attack plan items relevant to its specialty and executes them. These agents run in parallel with rate-limiting coordination to avoid overwhelming the target.
- Analysis (3+ agents): Validation Agent, Chain Discovery Agent, Report Generator. The Validation Agent confirms exploitability. The Chain Discovery Agent looks for multi-step exploit chains by analyzing which individual findings can be combined. The Report Generator produces the final output.
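The phase structure above can be sketched in a few lines. This is a minimal illustration, not SCAFU's actual code: the agent names are shorthand, and `run_agent` is a placeholder standing in for a real LLM-backed agent. The key property it demonstrates is the pipeline's ordering rule: agents within a phase run concurrently, and a phase cannot start until the previous one has finished.

```python
import asyncio

# Illustrative phase layout mirroring the four phases described above.
PHASES = [
    ["techstack_detector", "endpoint_mapper", "security_profiler"],   # Reconnaissance
    ["attack_planner", "priority_ranker"],                            # Planning
    ["xss", "sqli", "auth", "ssrf", "idor", "api", "deser", "cors"],  # Execution
    ["validator", "chain_discovery", "report_generator"],             # Analysis
]

async def run_agent(name: str, shared: dict) -> None:
    # Placeholder for an actual LLM-backed agent; here it just records
    # that it ran, writing its result into the shared state.
    shared[name] = f"output-of-{name}"

async def run_pipeline() -> dict:
    shared: dict = {}
    for phase in PHASES:
        # Agents within a phase run in parallel; cross-phase order is strict,
        # so no agent can reach back to an earlier phase or skip ahead.
        await asyncio.gather(*(run_agent(name, shared) for name in phase))
    return shared

results = asyncio.run(run_pipeline())
```

Because each phase is awaited as a unit, a slow agent delays only its own phase boundary, never an agent running beside it.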
This pipeline reduced LLM calls per scan by 45% compared to hub-and-spoke and eliminated the conversation loops of mesh communication. Each agent has a focused context window containing only the information relevant to its role, which improves per-agent quality significantly.
The Coordination Layer: State Management for Agents
The coordination layer is the infrastructure that makes the pipeline work. It is not an agent itself; it is a software system (written in Python, running as a FastAPI service) that manages agent state, message routing, and failure handling. This is the component I discuss in more detail on the architecture deep-dive page.
The coordination layer maintains three data structures:
Task Graph. A directed acyclic graph (DAG) representing the dependency relationships between agent tasks. Each node is a task with an assigned agent, input requirements, expected output schema, and completion status. The DAG determines which tasks can run in parallel and which must wait for upstream dependencies.
Shared Context Store. A key-value store (Redis-backed) that holds the shared state between agents. When the TechStack Detector identifies that the target uses Next.js 15, it writes {"tech.framework": "nextjs", "tech.framework_version": "15.1.2"} to the shared context. Downstream agents read from this store rather than receiving the full reconnaissance output. This reduces per-agent context size by 60-80% because each agent only retrieves the context keys it needs.
Message Queue. An ordered queue of inter-agent messages that handles lateral communication within phases. Messages are typed (NOTIFICATION, REQUEST, RESPONSE, ERROR) and carry structured payloads. The queue enforces the communication rules: agents can only send messages to agents in the same phase or to the coordination layer itself. Cross-phase communication is prohibited by design.
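To make the shared context store and the typed message queue concrete, here is a small sketch. A plain dict stands in for the Redis-backed store, the agent-to-phase mapping is illustrative, and `route` enforces the same-phase rule described above; none of this is SCAFU's production code.

```python
from dataclasses import dataclass
from enum import Enum, auto

class MsgType(Enum):
    NOTIFICATION = auto()
    REQUEST = auto()
    RESPONSE = auto()
    ERROR = auto()

@dataclass
class Message:
    type: MsgType
    sender: str
    recipient: str
    payload: dict

# Illustrative phase membership for a few agents.
PHASE_OF = {
    "endpoint_mapper": "recon",
    "techstack_detector": "recon",
    "attack_planner": "planning",
}

def route(msg: Message, queue: list) -> None:
    # Enforce the communication rule: lateral (same-phase) messages only,
    # unless the message is addressed to the coordination layer itself.
    if msg.recipient != "coordinator" and PHASE_OF[msg.sender] != PHASE_OF[msg.recipient]:
        raise ValueError("cross-phase communication is prohibited")
    queue.append(msg)

# Stand-in for the Redis-backed shared context store.
context: dict = {}
context["tech.framework"] = "nextjs"
context["tech.framework_version"] = "15.1.2"

# Lateral notification: the Endpoint Mapper tells the TechStack Detector
# about a framework indicator so it can skip redundant probing.
q: list = []
route(Message(MsgType.NOTIFICATION, "endpoint_mapper", "techstack_detector",
              {"hint": "x-powered-by: Next.js"}), q)
```

Downstream agents read individual keys like `tech.framework` rather than the full reconnaissance dump, which is where the 60-80% context reduction comes from.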
The coordination layer also implements three critical safety mechanisms:
Timeout enforcement. Every agent task has a maximum execution time. If an agent exceeds its timeout (typically 30 seconds for analysis tasks, 120 seconds for execution tasks that involve HTTP requests), the coordination layer kills the task and either retries with a simplified prompt or marks it as failed and continues without it. In production, approximately 3% of agent tasks hit their timeout, almost always due to unexpectedly large inputs.
Output validation. Every agent's output is validated against a JSON schema before it is written to the shared context store or passed to downstream agents. If an agent produces malformed output, the coordination layer can request a retry (with the schema error fed back as correction guidance) up to 3 times before marking the task as failed. Schema validation catches malformed output in approximately 8% of agent responses, and the retry mechanism resolves 85% of those on the first retry.
Deadlock detection. The coordination layer monitors the task graph for circular dependencies that could cause the pipeline to hang. If no task makes progress for 60 seconds, it evaluates whether remaining tasks can be skipped without compromising the final output, and if so, proceeds with partial results.
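The validate-and-retry loop is the mechanism that makes the 8% malformed-output rate survivable, so it is worth sketching. This is a simplified illustration: the schema check is hand-rolled (production code would likely use a JSON Schema library), and `flaky_agent` is a toy stand-in for an LLM call that misbehaves once.

```python
MAX_RETRIES = 3

def validate(output: dict, required: dict) -> list[str]:
    # Return a list of schema errors: missing keys or wrong value types.
    errors = []
    for key, expected_type in required.items():
        if key not in output:
            errors.append(f"missing key: {key}")
        elif not isinstance(output[key], expected_type):
            errors.append(f"{key}: expected {expected_type.__name__}")
    return errors

def run_with_validation(call_agent, schema: dict):
    correction = None
    for _ in range(1 + MAX_RETRIES):
        output = call_agent(correction)
        errors = validate(output, schema)
        if not errors:
            return output
        # Feed the schema errors back as correction guidance for the retry.
        correction = "; ".join(errors)
    return None  # task marked failed after exhausting retries

# Toy agent that produces malformed output once, then corrects itself
# when the schema error is fed back.
attempts = {"n": 0}
def flaky_agent(correction):
    attempts["n"] += 1
    if correction is None:
        return {"severity": "high"}  # missing the "finding" key
    return {"severity": "high", "finding": "reflected XSS"}

result = run_with_validation(flaky_agent, {"severity": str, "finding": str})
```

The important design detail is that the schema error itself becomes the correction prompt, which is why most retries succeed on the first attempt.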
Specialization: How to Design Agent Boundaries
The hardest design question in multi-agent systems is where to draw the boundaries between agents. Too few agents and you are back to the monolithic prompt problem. Too many and the coordination overhead dominates the actual work. After extensive experimentation, I settled on three principles for agent boundary design.
Principle 1: One cognitive mode per agent. An agent that needs to be both creative (generating novel attack payloads) and analytical (validating whether those payloads succeeded) should be split into two agents. The creative agent works without self-censorship. The analytical agent works with rigorous skepticism. Combining these modes in a single agent reliably produces compromised output -- payloads that are "safe" enough to pass self-validation but not creative enough to find real vulnerabilities.
Principle 2: Context should fit comfortably in 8,000 tokens. If an agent needs more than 8,000 tokens of context to do its job, it is probably doing two jobs. The XSS payload generator needs: the target's CSP policy (200 tokens), the identified injection points (500-1,000 tokens), the framework's escaping behavior (300 tokens), and the XSS knowledge base (2,000 tokens). Total: approximately 3,500 tokens. Plenty of room for reasoning. If I tried to combine XSS and SQLi generation into one agent, the context would double and the model's attention would split.
Principle 3: Failure should be isolated. If one agent fails, the system should degrade gracefully, not collapse. This means agents should not have bidirectional dependencies. If Agent A depends on Agent B's output, Agent B should not also depend on Agent A. In the SCAFU pipeline, if the SSRF testing agent fails entirely, the system loses SSRF coverage but every other vulnerability class is still fully tested. No downstream agent depends on SSRF results to function.
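Principle 2 can be enforced mechanically at context-assembly time. Here is a hedged sketch: the 4-characters-per-token estimate is a crude stand-in for a real tokenizer, and the store keys are illustrative, but the idea is that an agent whose assembled context blows the 8,000-token budget is flagged as a candidate for splitting.

```python
TOKEN_BUDGET = 8_000

def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token. A real system
    # would use the model's actual tokenizer.
    return len(text) // 4

def assemble_context(store: dict, keys: list[str]) -> str:
    # Pull only the keys this agent needs from the shared context store.
    parts = [f"{k}: {store[k]}" for k in keys if k in store]
    context = "\n".join(parts)
    if estimate_tokens(context) > TOKEN_BUDGET:
        # A context this large suggests the agent is doing two jobs.
        raise ValueError("agent context exceeds budget; consider splitting the agent")
    return context

# Illustrative keys for the XSS payload generator's context.
store = {
    "csp.policy": "default-src 'self'; script-src 'self'",
    "injection.points": "/search?q=, /comment body",
}
ctx = assemble_context(store, ["csp.policy", "injection.points"])
```

Treating the budget as a hard error rather than a warning turns an architectural smell into something the test suite catches.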
Handling Disagreements Between Agents
When two agents produce conflicting assessments, the system needs a resolution strategy. In SCAFU, this happens most often between the Execution agents and the Validation agent. An Execution agent reports a finding as a confirmed XSS vulnerability. The Validation agent attempts to reproduce it and fails.
The resolution protocol has three tiers:
- Environmental retry. The Validation agent retries with different conditions: different browser user-agent strings, different timing (some vulnerabilities require specific race conditions), and different authentication states. This resolves 40% of disagreements -- the vulnerability is real but environment-sensitive.
- Evidence comparison. Both agents present their evidence (request/response pairs, execution traces) to a Mediator agent that evaluates the strength of each position. The Mediator has access to the raw HTTP traffic and can determine whether the discrepancy is due to a WAF blocking the validation attempt, a timing-dependent vulnerability, or a genuine false positive. This resolves another 35% of disagreements.
- Conservative default. If the Mediator cannot resolve the disagreement, the finding is reported with a reduced confidence score and a flag indicating that automated validation was inconclusive. The human investigator makes the final call. This happens for approximately 25% of disagreements, representing roughly 2-3% of total findings.
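The three tiers above can be expressed as a short dispatch function. This is a hedged sketch: `env_retry` and `mediate` are placeholders for the Validation agent's retry logic and the Mediator agent, and the 0.5 confidence multiplier is an illustrative choice, not SCAFU's actual scoring.

```python
def resolve_disagreement(finding: dict, env_retry, mediate) -> dict:
    # Tier 1: environmental retry (different user-agents, timing, auth states).
    if env_retry(finding):
        return {"status": "confirmed", "confidence": finding["confidence"]}

    # Tier 2: evidence comparison by a Mediator agent, which returns
    # True (real), False (false positive), or None (cannot decide).
    verdict = mediate(finding)
    if verdict is not None:
        status = "confirmed" if verdict else "rejected"
        return {"status": status, "confidence": finding["confidence"]}

    # Tier 3: conservative default -- report with reduced confidence
    # and flag the finding for human review.
    return {
        "status": "inconclusive",
        "confidence": finding["confidence"] * 0.5,  # illustrative reduction
        "needs_human_review": True,
    }

finding = {"type": "xss", "confidence": 0.9}
# Simulate the hard case: retry fails, the Mediator cannot decide.
out = resolve_disagreement(finding,
                           env_retry=lambda f: False,
                           mediate=lambda f: None)
```

Each tier only runs when the cheaper one before it fails, which keeps Mediator invocations (the most expensive step) down to the hard cases.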
Performance at Scale: Numbers from Production
Running 16+ agents in a pipeline generates significant infrastructure load. Here are the actual numbers from production scans:
- LLM calls per scan: 140-280 calls (depending on application complexity and number of identified attack surfaces).
- Total tokens consumed per scan: 800K-1.4M tokens across all agents combined.
- Scan duration: 8-25 minutes for a typical web application. The bottleneck is HTTP request execution, not LLM inference.
- Local model usage: 72% of LLM calls use local Ollama models (all security-sensitive operations). 28% use cloud APIs (planning and report generation only).
- Agent failure rate: 3.1% of individual agent tasks fail. 0.4% of scans fail entirely (all critical agents failed). The remainder complete with partial results.
- Resource requirements: The full pipeline runs on a single machine with 32GB RAM, an NVIDIA RTX 4090 (24GB VRAM), and 8 CPU cores. Peak GPU utilization during a scan is 85-92%.
The most important optimization was agent-level caching. If two different scans target applications with the same framework (Next.js 15, for example), the Reconnaissance phase results are partially cacheable. The TechStack Detector's knowledge about Next.js 15 vulnerability patterns does not change between scans. Caching framework-specific knowledge reduced per-scan LLM calls by approximately 20% for subsequent scans of the same technology stack.
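The caching idea reduces to memoizing on the technology-stack key. A minimal sketch, assuming a hypothetical `framework_knowledge` function standing in for the expensive LLM-backed knowledge retrieval; the call counter exists only to make the cache behavior observable.

```python
from functools import lru_cache

calls = {"n": 0}  # counts how many times the expensive path actually runs

@lru_cache(maxsize=128)
def framework_knowledge(framework: str, major_version: str) -> str:
    # Stand-in for an expensive LLM call. Knowledge about, say, Next.js 15
    # vulnerability patterns does not change between scans, so it is safe
    # to cache across scans keyed on (framework, major_version).
    calls["n"] += 1
    return f"vuln-patterns for {framework} {major_version}"

# Two scans against the same stack: the second is a cache hit.
framework_knowledge("nextjs", "15")
framework_knowledge("nextjs", "15")
```

The cache key deliberately uses the major version only: patch releases rarely change the vulnerability-pattern knowledge, and keying too finely would destroy the hit rate.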
Multi-agent systems are harder to build than single-agent systems. The coordination layer is significant engineering. The debugging is more complex because failures can originate in any agent and propagate through the pipeline in non-obvious ways. But for tasks that require multiple distinct expertise domains, the quality improvement over a single agent is substantial enough to justify the engineering investment. A team of focused specialists, properly coordinated, consistently outperforms a single generalist trying to do everything at once.