Architecture and System Decomposition

Part 1 argued that the first design decision for an LLM system is not "how smart should it be?" but "where do uncertainty, agency, and accountability sit?" That decision constrains your architecture to a narrow set of viable designs. This post is about making those designs explicit.

The conventional wisdom in LLM system design is to "start simple and add complexity as needed." This is correct but incomplete. The harder question is: what kind of complexity do you add, and in what order? A system that adds retrieval is fundamentally different from one that adds tool use, which is fundamentally different from one that adds multi-step planning. Each escalation path creates different failure modes, different evaluation requirements, and different operational burdens.

This post covers four layers of architectural decision-making: decomposition patterns (how to break a task into components), control flow (how those components coordinate), memory architecture (how state persists across steps), and tool integration (how the system acts on the world). For each layer, I will give you both the conceptual framework and the concrete decision criteria that let you choose among alternatives.

1. The Decomposition Decision: What Gets Its Own Box?

Every LLM system, no matter how simple, involves decomposition. Even a single API call decomposes into prompt construction, model invocation, and response parsing. The question is not whether to decompose, but where to draw the boundaries and what abstraction each component provides.

Three decomposition philosophies

Monolithic. One LLM call does everything. The prompt contains all instructions, context, and constraints. The response is the final output. This is the baseline, and it is often underestimated. Modern frontier models can handle remarkably complex tasks in a single call if the prompt is well-constructed. The failure mode is not capability but controllability: when something goes wrong, you have no intermediate state to inspect.

Pipeline. Sequential stages, each with a single responsibility. Stage 1 extracts information, Stage 2 reasons over it, Stage 3 generates output. Each stage can be a different model, a different prompt, or a deterministic function. The failure mode is error propagation: a mistake in Stage 1 corrupts everything downstream, and you cannot recover without re-running from the beginning.

Graph. Components with conditional dependencies. The output of one component determines which other components run. This includes branching (if-then-else), looping (repeat until condition), and parallel execution (fan-out, fan-in). The failure mode is complexity explosion: as the graph grows, reasoning about all possible execution paths becomes intractable.

The decomposition heuristic

Decompose when any of the following are true:

Different components need different models. A classification step might use a small, fast model; a generation step might need a frontier model. If you need heterogeneous compute, you need decomposition.

You need intermediate observability. If debugging requires seeing what happened between input and output, you need explicit stages that emit inspectable state. This is the most common reason to decompose in practice.

Components have different reliability requirements. A retrieval step that returns wrong documents is recoverable (the generation step can ignore them). A generation step that hallucinates a tool call is not recoverable (the tool executes). If components have different error costs, they should be different components with different guardrails.

Components operate at different time scales. A slow, expensive reasoning step should not block a fast, cheap classification step. If you need to return partial results or handle timeouts gracefully, decomposition gives you the control points to do so.

Do not decompose just because it "feels cleaner." Every boundary you introduce is a contract you must maintain, an interface you must version, and a failure mode you must handle. The cost of decomposition is coordination overhead, and that cost is non-zero.

The debuggability test

When your system produces a wrong output, how many steps back do you need to trace to find the root cause? If the answer is "the whole thing," you have under-decomposed. If the answer is "I have to check twelve intermediate states," you have over-decomposed. The sweet spot is three to five checkpoints for a typical failure investigation.

2. Control Flow: Who Decides What Happens Next?

Once you have components, something must orchestrate them. The control flow decision is: how much authority does the LLM have over its own execution path?

Four control flow patterns

Fixed Pipeline

Execution order is hardcoded. Stage A always runs, then Stage B, then Stage C. The LLM has no say in what happens next. This is the simplest pattern and the easiest to reason about.

Use when: task structure is predictable, all inputs require the same processing steps.

Router-Dispatched

A classifier (often a small LLM or rule-based system) examines the input and routes to one of several fixed sub-pipelines. The routing decision is the only LLM-controlled branch; everything else is deterministic.

Use when: inputs fall into distinct categories with different optimal handling.

ReAct Loop

The LLM alternates between reasoning and acting. It decides what action to take, observes the result, and decides again. The loop continues until the LLM emits a terminal action or a budget is exhausted.

Use when: the number of steps cannot be predicted in advance, each step depends on what was learned previously.

Plan-then-Execute

The LLM generates a complete plan upfront, then a separate executor runs the plan step by step. The plan can be modified (re-planning) if execution reveals the plan is wrong, but generation and execution are distinct phases.

Use when: you need to review the plan before execution, or when execution is expensive and you want to catch errors before committing.

The control flow tradeoff

More LLM control means more flexibility and more risk. A fixed pipeline cannot handle novel inputs, but it also cannot get stuck in an infinite loop. A ReAct loop can solve problems the designer did not anticipate, but it can also burn through your token budget chasing a hallucinated goal.

Pattern	Flexibility	Predictability	Failure Modes
Fixed Pipeline	None	Complete	Wrong output, but always terminates
Router-Dispatched	Low (discrete branches)	High	Misrouting, wrong branch selected
ReAct Loop	High	Low	Infinite loops, budget exhaustion, goal drift
Plan-then-Execute	Medium	Medium	Bad plans, re-planning thrash, plan-execution mismatch

The industrial pattern: layered control

The most robust production systems do not commit to a single control flow pattern. They layer them.

Outer layer: router-dispatched. A fast classifier examines the input and routes to one of several handling paths. Simple queries go to a fixed pipeline. Complex queries go to an agentic subsystem.

Middle layer: plan-then-execute with approval gates. For complex queries, the system generates a plan and pauses for review (human or automated). Only approved plans proceed to execution.

Inner layer: ReAct loops with strict budgets. Within each plan step, a ReAct loop handles the actual execution, but with hard limits on iterations and token spend.

This layering captures the benefits of each pattern while containing their failure modes. The router prevents the agentic system from being invoked on simple queries (cost control). The plan-then-execute layer prevents commitment to bad plans (safety). The ReAct layer provides flexibility within bounded risk (capability with guardrails).

Cursor and Claude Code both implement variants of this pattern. Tab completion is a fixed pipeline (zero LLM control over execution). Inline edit is plan-then-execute (the diff is the plan; user approval is the gate). Agent mode is a ReAct loop with tool budgets and human checkpoints on high-risk operations.

3. Memory Architecture: What Does the System Remember?

Every LLM call is stateless. The model has no memory of previous calls unless you explicitly provide that memory in the prompt. Memory architecture is about what to remember, how to store it, and how to retrieve it when needed.

Four memory types

Context window (working memory). Everything in the current prompt. This is the only memory the model can directly attend to. It is fast, reliable, and severely limited. The context window is your most precious resource; everything else in memory architecture is about using it efficiently.

Conversation history (episodic memory). Previous turns in the current session. The naive approach appends everything; the production approach summarizes, compresses, or selectively retrieves. The failure mode is context exhaustion: as the conversation grows, you either truncate (losing information) or summarize (losing fidelity).

Knowledge base (semantic memory). External documents, retrieved on demand. This is RAG. The failure mode is retrieval failure: the relevant information exists but is not retrieved, or irrelevant information is retrieved and misleads the model.

Scratchpad (procedural memory). Intermediate reasoning state that persists across steps within a single task. In a ReAct loop, this might be the history of actions taken and observations received. The failure mode is scratchpad pollution: as the task progresses, the scratchpad accumulates irrelevant or contradictory information that confuses subsequent reasoning.

The memory management problem

All four memory types compete for the same finite resource: context window tokens. Memory architecture is fundamentally about allocation: how many tokens go to conversation history versus retrieved documents versus scratchpad versus instructions?

A common pattern:

System prompt: 500-2000 tokens. Instructions, persona, constraints. This is fixed per session.
Retrieved context: 2000-8000 tokens. Documents relevant to the current query. This varies per query.
Conversation history: 1000-4000 tokens. Recent turns, possibly summarized older turns. This grows with the session.
Scratchpad: 500-2000 tokens. For agentic systems, the action-observation history. This grows with task complexity.
Current query: Variable. The user's actual input.
Output budget: 500-4000 tokens. Reserved for the model's response.

The sum must fit in the context window. When it does not, something must be evicted. The eviction policy is a design decision with quality implications.

Memory strategies for different system types

System Type	Primary Memory	Eviction Strategy	Failure Mode
Single-turn Q&A	Knowledge base	None (no accumulation)	Retrieval failure
Multi-turn chat	Conversation history	Sliding window or summarization	Lost context from early turns
Agentic task	Scratchpad	Relevance-based pruning	Forgetting critical intermediate state
Long-running agent	All four	Hierarchical: recent in full, older summarized	Summary drift, compounding errors

The reflection pattern

One memory architecture deserves special attention: explicit reflection. Instead of just accumulating observations, the system periodically pauses to synthesize what it has learned.

In Reflexion (Shinn et al., 2023), the agent maintains a "reflection" buffer that summarizes lessons learned from previous attempts. When the agent fails, it generates a reflection ("I tried X but it failed because Y; next time I should Z") that persists to the next attempt. This is not just memory; it is meta-memory—the system reasoning about its own reasoning.

The production version of this pattern uses scheduled reflection points: after every N steps, or when the scratchpad exceeds a threshold, the system generates a summary that replaces the detailed history. This trades fidelity for longevity: you lose the details but retain the lessons.

Devin implements a variant where the agent maintains a "plan" document that it updates as it works. The plan is both a memory (what have I done?) and a guide (what should I do next?). When the context window fills, old execution details are evicted but the plan persists, providing continuity.

4. Tool Integration: How Does the System Act?

Tools are how LLM systems affect the world. A tool might read a file, query a database, send an email, or execute code. Tool design is where the abstract decisions from Part 1 (authority boundaries, error costs, reversibility) become concrete.

The tool schema contract

Every tool has an implicit contract with the LLM:

What it does. A natural language description that the model uses to decide when to invoke the tool. Ambiguous descriptions cause misuse. "Search the web" is ambiguous (search for what? return what?). "Search the web for the query and return the top 5 result titles and URLs" is precise.

What it accepts. A schema for the input parameters. The schema should be as constrained as possible. If a parameter can only take three values, enumerate them; do not accept a free-form string. The more constrained the schema, the fewer ways the model can misuse the tool.

What it returns. A schema for the output. The model needs to know what to expect so it can plan subsequent steps. If the tool can fail, the failure modes should be documented in the description.

What can go wrong. This is the most commonly omitted part of the contract. Does the tool timeout? Return partial results? Have rate limits? Cost money? Every undocumented failure mode is a debugging session waiting to happen.

The principle of least authority

A tool should have the minimum authority necessary to accomplish its purpose. This is not just a security principle; it is a reliability principle. The more a tool can do, the more ways it can fail, and the harder failures are to diagnose.

Read vs. write. If the task only requires reading, the tool should not have write access. A "file browser" tool that can also delete files is a liability, not a feature.

Scoped vs. global. If the task only requires access to a specific directory, the tool should not have access to the entire filesystem. Scope limits blast radius.

Reversible vs. irreversible. Prefer tools whose effects can be undone. A tool that appends to a log is safer than one that overwrites it. A tool that creates a draft is safer than one that sends an email.

Confirmation-gated vs. autonomous. High-stakes tools should require explicit confirmation before execution. This is the implementation of the "bounded autonomy" pattern from Part 1: let the LLM decide what to do, but gate the action on human approval.

Tool composition patterns

Real systems rarely use tools in isolation. They compose tools into workflows. Three composition patterns dominate:

Sequential. Tool A's output feeds Tool B's input. File read → parse → transform → file write. The failure mode is the same as pipeline decomposition: errors propagate forward.

Conditional. Tool A's output determines which tool runs next. Search → if results empty, try broader query; else, summarize results. The failure mode is incorrect branching: the condition is evaluated wrong, and the wrong branch executes.

Parallel. Multiple tools run simultaneously, and their outputs are aggregated. Search three databases in parallel, merge results. The failure mode is partial failure: one tool succeeds, another fails, and the aggregation logic must handle the inconsistency.

The tool design checklist

For every tool in your system, answer these questions:

What is the most dangerous thing this tool can do? (Blast radius assessment)
Can that dangerous thing be undone? (Reversibility)
What happens if this tool is called with malformed input? (Input validation)
What happens if this tool is called in a context where it should not be? (Authorization)
What does the model see if this tool fails? (Error messaging)
How will you know if this tool is being misused? (Observability)

If you cannot answer all six questions for every tool, your system is not ready for production.

5. Putting It Together: Architecture Selection

The four layers (decomposition, control flow, memory, tools) interact. Choosing a control flow pattern constrains your memory architecture. Choosing a memory architecture constrains your tool design. The goal is coherence: a system where all the layers are aligned with each other and with the framing decisions from Part 1.

The architecture decision tree

Start here: What is the I/O gap?

Can the task be completed in a single LLM call with retrieval?

Yes: Use a fixed pipeline. Retrieval → Generation. Memory is just the retrieved documents.

No: Continue below.

Can you enumerate the possible execution paths in advance?

Yes: Use router-dispatched with fixed sub-pipelines. Memory per branch as needed.

No: You need agentic control flow. Continue below.

Is the plan reviewable before execution?

Yes: Use plan-then-execute. Memory includes the plan as a persistent artifact.

No: Use ReAct loop with scratchpad memory. Add hard iteration limits.

Does the task require irreversible actions?

Yes: Add confirmation gates before irreversible tools. Consider plan-then-execute even if plan is not externally reviewed.

No: Proceed with selected control flow.

Does the task exceed the context window?

Yes: Add hierarchical memory with summarization. Consider scheduled reflection points.

No: Use simple conversation history.

Three worked examples

Example 1: Customer support bot (Part 1's Uber case).

Decomposition: Router → specialized handlers per issue type. Each handler is a short pipeline (retrieve records → apply policy → generate response).
Control flow: Router-dispatched. Complex cases escalate to human; they do not get more agentic handling.
Memory: Conversation history (full for current session). Retrieved records (customer and trip data). No scratchpad needed for simple cases.
Tools: Read-only tools for data retrieval. Write tools (issue refund) gated on policy compliance checks. High-value actions require human confirmation.

Example 2: Code agent (Claude Code / Cursor).

Decomposition: Task understanding → planning → iterative execution. Execution decomposes into: read files, edit files, run commands, observe results.
Control flow: Layered. Simple edits use fixed pipeline. Complex tasks use plan-then-execute with user review. Execution uses ReAct loop with iteration budget.
Memory: Scratchpad (recent actions and observations). Semantic memory (codebase index, documentation). Plan as persistent artifact. Reflection at checkpoints.
Tools: File read (safe). File write (medium risk, diff review). Terminal commands (high risk, sandboxed). Reversibility through git: all changes can be rolled back.

Example 3: Research agent (Deep Research / Perplexity Pro).

Decomposition: Query expansion → parallel search → source evaluation → synthesis → fact-checking → response generation.
Control flow: Plan-then-execute for the overall research plan. ReAct loop within each search/evaluation step. No irreversible actions, so no confirmation gates.
Memory: Hierarchical. Recent sources in full. Older sources as summaries with citations. Research plan as persistent artifact. Explicit contradiction tracking.
Tools: Web search (safe). Page fetch (safe). Citation extraction (safe). No write tools. All tools are read-only; the only output is the generated report.

6. The Anti-Patterns

Architecture mistakes tend to cluster. Here are the patterns I see most often:

Premature multi-agent. Splitting a task across multiple agents when a single agent with good tools would suffice. Every agent boundary is a coordination overhead. Add agents when you have genuinely distinct competencies that cannot share a context, not because the architecture diagram looks more impressive.

Tool proliferation. Adding a new tool for every capability instead of composing existing tools. Ten narrow tools are harder to use correctly than three well-designed tools with composable outputs. The model must learn when to use each tool; more tools means more opportunities for misuse.

Memory as an afterthought. Building the control flow first, then discovering the context window is exhausted halfway through a task. Memory architecture should be designed alongside control flow, not retrofitted.

Optimistic error handling. Assuming tools will succeed and not designing for partial failure. In production, tools timeout, return errors, and produce malformed output. The architecture must handle these cases gracefully, not crash or produce garbage.

Observability debt. Building a complex agentic system without intermediate logging. When something goes wrong (and it will), you need to reconstruct what happened. If the only observable state is input and output, debugging is guesswork.

What Comes Next

This post gave you the structural vocabulary for LLM system architecture: decomposition patterns, control flow options, memory strategies, and tool design principles. But vocabulary is not enough. The next post is about the infrastructure that makes these architectures work: knowledge bases, RAG pipelines, tool registries, and prompt management. That is where most of the unglamorous engineering happens, and where most agent failures actually originate.

Architecture is the skeleton. Part 3 is about the flesh and blood.

This is Part 2 of a 7-part series.

Part 1 covered framing decisions: error surfaces, constraints, and tradeoffs. This part covered architecture: decomposition, control flow, memory, and tools. Next: the infrastructure that makes it all work.