Series · Part 1 of 7

Requirements and Problem Framing

The first design decision is not how smart the system should be, but where uncertainty, agency, and accountability sit. A framework for making that choice deliberately.

Lorenzo Xiao · Language Technologies Institute, CMU · March 2026

In traditional ML, requirements gathering and problem framing are separate phases. You talk to stakeholders, write a spec, then decide: is this a classification problem? A regression problem? A ranking problem? The framing follows the requirements. For LLM-based systems, that separation collapses.

How you frame the problem—single LLM call vs. retrieval chain vs. agentic loop vs. multi-agent system—is itself a core requirement decision. It locks in your latency budget, cost profile, failure modes, evaluation strategy, and governance infrastructure all at once. You cannot write a requirements document without simultaneously committing to a framing, so you need to do both deliberately.

This post introduces a framework for that joint decision. The conventional approach is taxonomic: classify your system as "Level 0" through "Level 3" on some complexity scale, then optimize within your level. That framing is useful but shallow. The deeper insight is that the key decision is not how intelligent the system should be, but where you want uncertainty to accumulate, what actions are reversible, and what you commit to measuring. Everything else follows.

We will work through this in four analytical steps, then synthesize into a practical requirements template.

Step 1 of 4

1. Framing as Choosing an Error Surface

In traditional ML, you choose a model class and accept its bias-variance profile. For LLM systems, you choose a decision authority boundary and accept the error surface that comes with it.

The standard taxonomy describes mechanisms: single call, chain, agentic loop, multi-agent. But mechanism is not what determines your error surface. What determines it is: what the system is authorized to decide, and what happens when those decisions are wrong.

Three dimensions that actually determine the error surface

Decision Authority

What can the system do without human confirmation? Read data? Write data? Take external actions? The wider the authority, the larger the error surface, regardless of how many LLM calls are involved.

Error Cost Asymmetry

Is a false positive worse than a false negative, or vice versa? A support bot that wrongly denies a legitimate refund has a different profile than one that wrongly grants a fraudulent one. The architecture must reflect which direction of error the business can tolerate.

Recoverability Horizon

How quickly can a mistake be detected and reversed? A bad product recommendation is recovered in milliseconds. A wrongly issued refund in days. A bad code deployment might corrupt production data irreversibly.
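To make the asymmetry concrete, here is a minimal expected-loss sketch; every error rate and dollar cost below is invented for illustration, not drawn from any real system:

```python
# Expected-loss sketch of error cost asymmetry. All numbers are
# hypothetical: a wrongly denied refund (false negative) is assumed to
# cost far more in churn and escalation than a wrongly granted small
# one (false positive).

def expected_loss(p_fp, cost_fp, p_fn, cost_fn):
    """Expected cost per case from false positives and false negatives."""
    return p_fp * cost_fp + p_fn * cost_fn

# Bot A: conservative -- rarely grants bad refunds, often denies good ones.
loss_a = expected_loss(p_fp=0.01, cost_fp=8.00, p_fn=0.10, cost_fn=40.00)  # 4.08
# Bot B: permissive -- the opposite error profile.
loss_b = expected_loss(p_fp=0.10, cost_fp=8.00, p_fn=0.01, cost_fn=40.00)  # 1.20
# Under this cost structure the permissive bot is cheaper per case:
# the architecture should lean toward the error direction the business
# can tolerate.
```

The point of the exercise is not the arithmetic but forcing the team to write down `cost_fp` and `cost_fn` explicitly before choosing an architecture.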

The same architecture, radically different error surfaces

Consider three systems that are all, architecturally, "multi-turn chatbot with retrieval and tool access":

Amazon Rufus. Tools include product catalog search, review retrieval, purchase history lookup. Decision authority is advisory only: Rufus recommends, the user clicks "Add to Cart." The human is always the final gate on any consequential action. Error surface: retrieval relevance plus generation faithfulness. A bad recommendation means the user sees an irrelevant product. Cost of error: near zero. Recoverability: instant. Because the system never commits to anything, you can be aggressive with exploration since no step is dangerous.

Uber's support bot. Tools include trip history lookup, payment records, policy database, refund issuance, account modification. Decision authority is mixed: reading trip history is advisory; issuing a refund is a commitment. The interesting design question is where the authority boundary is drawn. If refunds require human approval, the error surface collapses to "bad recommendation to a human reviewer"—structurally identical to Rufus's error surface applied to a different domain. If refunds are autonomous up to some threshold, the error surface expands to include financial loss, customer trust damage, and policy compliance risk. If fully autonomous, it further includes adversarial exploitation. The architecture is not the hard decision. The hard decision is where to draw the authority boundary, and that is a business and governance decision that the architecture then implements.

Cursor Agent / Claude Code. Tools include file read/write, terminal commands, search. Decision authority varies: Cursor offers tab completion (zero authority, suggestion only), inline edit (bounded authority, user reviews a diff), and agent mode (broad authority, multi-file changes with self-directed exploration). This is adaptive routing within a single product. The "level" is not a property of the system; it is a property of each interaction, and the best systems let the authority boundary flex.

The deeper point

All three systems could be described as "agentic" if you look at their architectural capability. What makes them fundamentally different is not the number of LLM calls or whether there is a loop. It is the decision authority boundary and the cost function over errors at that boundary. Architecture is downstream of that choice, not the other way around.

The sharper question

Instead of asking "what mechanism does the system use?", ask: "What is the system authorized to do, and what does the error cost landscape look like at each authority boundary?" The answer constrains mechanism, not the other way around.

| | Advisory (system recommends, human acts) | Bounded autonomy (within guardrails) | Full autonomy (system decides and acts) |
| --- | --- | --- | --- |
| Low risk (recoverable instantly) | Rufus, Perplexity, NotebookLM | Cursor tab-complete, auto-formatting | Spam filter, auto-tagging |
| Moderate risk (partially recoverable) | Google AI Overviews, Copilot Chat | Klarna refunds (capped), Cursor inline edit | Automated code review merge |
| High risk (irreversible) | Medical Q&A without disclaimer | Autonomous trading within limits | Autonomous deployment to prod |

The rows determine how careful you need to be. The columns determine how careful your architecture lets you be. The design problem is finding the cell where the required carefulness matches the achievable carefulness, given your evaluation and monitoring infrastructure.
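One way to operationalize the matrix is as a lookup from (risk, authority) cell to a minimum governance tier, then check it against what you can actually staff. A sketch, with tier numbers that are purely illustrative assumptions:

```python
# Sketch: the risk x authority matrix as a lookup from design cell to
# the minimum governance tier needed to operate there. Tier numbers
# are illustrative assumptions, not an industry standard.
GOVERNANCE_TIER = {
    ("low", "advisory"): 1,      ("low", "bounded"): 1,      ("low", "autonomous"): 2,
    ("moderate", "advisory"): 1, ("moderate", "bounded"): 2, ("moderate", "autonomous"): 3,
    ("high", "advisory"): 2,     ("high", "bounded"): 3,     ("high", "autonomous"): 4,
}

def operable(risk, authority, staffed_tier):
    """A cell is viable only if the governance you can staff meets its need."""
    return staffed_tier >= GOVERNANCE_TIER[(risk, authority)]

# A team with tier-2 governance can run a moderate-risk bounded system
# but should not run a high-risk autonomous one.
# operable("moderate", "bounded", 2) -> True
# operable("high", "autonomous", 2)  -> False
```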

Step 2 of 4

2. Input-Output Specification: What Are You Actually Asking the LLM to Do?

Before architecture, before constraints, before tradeoffs: what goes in, and what comes out? This sounds obvious, but most teams get it wrong by conflating the user's goal with the LLM's task. The user wants "help with my refund." The LLM's task might be: classify intent, retrieve policy, and generate a response. Or it might be: autonomously resolve the case end-to-end. Same user goal, completely different I/O specification.

The atomic operations

Every LLM system, at the atomic level, performs one of these operations, or some combination of them:

| Operation | Input → Output | Examples |
| --- | --- | --- |
| Generation | Context and prompt → natural language | Email drafting, creative writing, explanation |
| Classification / routing | Text → label or routing decision | Intent detection, triage, content moderation |
| Extraction | Document or conversation → structured fields | Parsing trip details from a complaint, entity extraction |
| Transformation | Structured data + instructions → restructured data | Summarization, translation, code refactoring |
| Decision / action | State + policy → action with real-world consequences | Issue refund, merge PR, send notification |

Task difficulty as an information gap

The difficulty of a task is not about "how smart the LLM needs to be." It is about the gap between what the input provides and what the output requires. The wider the gap, the harder the task, and the more the system needs to acquire information through retrieval, tool use, multi-step reasoning, or human interaction, rather than just transform information already present.

Narrow gap (transformation-dominant). "Summarize this document." All information needed for the output is in the input. The LLM compresses; it does not need to seek. Almost always solvable with a single call.

Medium gap (retrieval-dominant). "Answer this question given our knowledge base." The information exists somewhere but is not in the input. The system needs to find it. Single call with RAG, or a short chain.

Wide gap (reasoning-dominant). "Debug why this test is failing." The input is a test failure; the output requires understanding codebase structure, dependency relationships, and causal reasoning across files. The system needs to actively explore to close the gap.

Open-ended gap (planning-dominant). "Build me a feature that does X." The input is a vague specification; the output is a working implementation. The system must decompose, plan, execute, verify, and iterate. The number of steps cannot be predicted in advance.

Industrial grounding

Rufus. The user's input is a shopping query ("running shoes for flat feet under $100"). The required output is a product recommendation with justification. The information gap is narrow to medium: the answer exists in the product catalog, and the LLM needs to retrieve and synthesize it. The atomic tasks are retrieval plus generation. This is why Rufus works as a simple system: the information gap can be closed in one or two steps.

Uber support bot. The user's input is a complaint ("I was charged twice for a cancelled ride"). The required output is a resolution. The information gap is wide: the system needs to extract the claim from natural language, retrieve trip and payment records, match the situation to policy, decide on an action, and execute or recommend that action. Each step introduces new information that reshapes the next step. The atomic tasks span extraction, retrieval, classification, reasoning, and potentially decision/action. The gap cannot be closed in a single step; the system must plan an information-gathering trajectory.

The key difference is not "chat about products" versus "chat about trips." It is that Rufus's I/O gap can be closed with a single retrieval step, while Uber's requires a multi-step information acquisition process where each step's output conditions the next step's input. Task difficulty is structural, not domain-specific.

Two diagnostic questions

Diagnosing your I/O gap

First: How many information-acquisition steps are needed to bridge the I/O gap for the median case? For the 95th percentile case? If the answer is one or two, you are in simple system territory. If it is three to five with dependencies between steps, you need a chain or light agentic behavior. If you cannot predict the number in advance because it depends on what the system discovers, you need a full agentic loop.

Second: Is the output verifiable from the input alone? If the user can check the output by looking at the input (summarization, translation, extraction), evaluation is cheap and the system is naturally self-correcting. If verifying the output requires independent investigation (was the refund correct? does the code actually work?), evaluation is expensive and errors persist longer. This verification difficulty is a better predictor of required system complexity than the surface-level task description.
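The first diagnostic can be written down as a routing rule. A sketch, with the step cutoffs taken from the prose above and `None` standing for an unpredictable step count:

```python
# The first diagnostic as a routing rule: predicted information-
# acquisition steps (median and p95) map to a framing level. Cutoffs
# mirror the prose; the median drives the common path, but the p95
# sets the framing you must actually support.
from typing import Optional

def framing_level(median_steps: int, p95_steps: Optional[int]) -> str:
    if p95_steps is None:  # step count depends on what the system discovers
        return "agentic loop"
    if p95_steps <= 2:
        return "single call / simple system"
    if p95_steps <= 5:
        return "chain or light agentic behavior"
    return "agentic loop"

# "Summarize this document": everything needed is already in the input.
# framing_level(1, 1) -> "single call / simple system"
# "I was double-charged": four to five dependent steps.
# framing_level(4, 5) -> "chain or light agentic behavior"
```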

Step 3 of 4

3. Constraint Analysis: The Four Walls of Your Design Space

Section 2 asked "what do you want to do, and how hard is it?" This section asks "what are you not allowed to do while doing it?" Constraints do not just limit your architecture; they can invalidate entire framing levels and force you into designs you would not otherwise choose.

Wall 1 / 4 Latency

Not all latency constraints are equal. The real question is: what is the user doing while waiting?

Synchronous, inline (under 1–2 seconds). The user is mid-task and blocked. Autocomplete, inline suggestions, search results. Every additional LLM call is directly felt. An agentic loop is architecturally invalid here. This is Rufus's world: the user typed a query and is staring at the screen.

Synchronous, conversational (2–30 seconds). The user is in a dialogue and expects a response but will tolerate a pause. Chat interfaces, support bots. You can afford a retrieval step plus generation, maybe a short chain. This is where most chatbots live, including Uber's bot for straightforward cases.

Asynchronous, backgrounded (minutes to hours). The user kicked off a task and went to do something else. Code generation, report writing, deep research. Agentic loops are not just tolerated; they are expected. This is the world of Devin, Claude Code in headless mode, Deep Research.

The underappreciated coupling

Latency is not just about user patience. It determines feedback loop frequency. A one-second system gets corrected by the user every second. A ten-minute agentic run gets corrected zero times during execution. Longer latency means the system runs open-loop for longer, which means errors accumulate uncorrected. Latency and error surface are directly coupled.

Wall 2 / 4 Cost

Per-query cost scales roughly multiplicatively with system complexity. A back-of-envelope template:

| System type | Approx. LLM calls | Cost relative to single call |
| --- | --- | --- |
| Single call | 1 | 1× (baseline) |
| Chain / pipeline | 2–5 + retrieval | 3–5× |
| Agentic loop | 5–50 + tool executions | 10–100× (highly variable) |
| Multi-agent | Agentic × number of agents | Multiplies agentic cost |

The real cost analysis is not per-query; it is per-resolution. If a single-call system resolves 60% of cases and an agentic system resolves 90%, the question is whether the 30% improvement justifies the cost increase on all queries—or whether adaptive routing can capture the gains cheaply by sending simple cases to the single-call path and escalating only the hard 30%.
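The per-resolution framing is simple arithmetic. A sketch with illustrative prices and resolution rates; none of these numbers come from a real deployment:

```python
# Back-of-envelope per-resolution cost: compare a cheap low-resolution
# path, an expensive high-resolution path, and an adaptive router.
# Every price and rate here is an illustrative placeholder.

def cost_per_resolution(cost_per_query, resolution_rate):
    """Spend per successfully resolved case (failed queries still cost)."""
    return cost_per_query / resolution_rate

simple = cost_per_resolution(0.01, 0.60)   # ~$0.017 per resolution
agentic = cost_per_resolution(0.50, 0.90)  # ~$0.556 per resolution

# Adaptive routing: every query tries the $0.01 simple path first;
# the 40% that fail escalate to the $0.50 agentic path, of which 75%
# resolve, matching the agentic system's 0.90 overall resolution rate.
routed_spend = 0.01 + 0.40 * 0.50             # expected spend per query
routed = routed_spend / (0.60 + 0.40 * 0.75)  # ~$0.233 per resolution
```

Under these assumed numbers the router captures most of the agentic system's resolution rate at less than half its per-resolution cost, which is the whole argument for routing.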

Rufus handles millions of queries daily. At a penny per query, that is already tens of thousands of dollars a day. An agentic architecture at fifty cents per query would be financially non-viable at that scale, even if it produced better recommendations. Cost eliminates agentic framing before any other consideration.

Uber's support bot operates at lower volume but higher value per resolution. A human agent costs five to fifteen dollars per interaction. If an agentic bot costs fifty cents to two dollars per interaction and resolves 70% of cases without human involvement, the economics are compelling. Cost enables a more complex system here because the comparison point is human labor, not a simpler bot.

Wall 3 / 4 Privacy and Information Flow

The conventional framing asks "what PII touches the LLM?" and "do you need on-prem?" These are important but shallow questions. The deeper question is about information flow topology: at each step, what data can the model see, infer, store, and transmit?

See

Can the model access raw user data, or only anonymized versions? Can it see cross-user data? This is the most visible and most commonly audited dimension.

Infer

Even without raw PII, can the model infer sensitive attributes from context? A trip from a hospital to a pharmacy implies health information. Late-night rides to a specific address imply personal relationships.

Store

Does intermediate reasoning persist? In an agentic loop, the model's scratchpad may contain sensitive inferences that are more revealing than the raw data — and are rarely audited.

Transmit

When the model calls a tool or another agent, what leaks across that boundary? Multi-agent decomposition can create new information transmission channels that did not exist in a monolithic system.

The more you can push PII resolution to deterministic pre-processing, the simpler your privacy architecture. Uber's bot does not necessarily need the LLM to see raw trip GPS coordinates. A pre-processing layer could resolve "the trip on Tuesday" to a trip ID, and the LLM only sees the ID, fare, and status. Rufus mostly avoids PII since product queries are rarely sensitive—another reason a simpler architecture is viable there.
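A minimal sketch of that pre-processing layer, with hypothetical data structures: the reference is resolved deterministically, and only a redacted record ever reaches the model context.

```python
# Sketch of deterministic PII resolution ahead of the LLM, with
# hypothetical data structures. "The trip on Tuesday" is resolved to
# an opaque trip ID by ordinary code; only the ID, fare, and status
# enter the model context. The GPS trace never does.
import re
from dataclasses import dataclass, field

@dataclass
class Trip:
    trip_id: str
    day: str
    fare: float
    status: str
    gps_trace: list = field(default_factory=list)  # sensitive; LLM never sees it

def resolve_trip_reference(message, trips):
    """Deterministically map a day mention to a redacted trip record."""
    for trip in trips:
        if re.search(trip.day, message, re.IGNORECASE):
            return {"trip_id": trip.trip_id, "fare": trip.fare, "status": trip.status}
    return None

trips = [Trip("t_481", "Tuesday", 23.40, "cancelled", gps_trace=[(47.61, -122.33)])]
llm_context = resolve_trip_reference("I was double-charged for the trip on Tuesday", trips)
# llm_context -> {"trip_id": "t_481", "fare": 23.4, "status": "cancelled"}
```

The redaction lives in plain code, so it can be unit-tested and audited without reasoning about model behavior at all.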

Wall 4 / 4 Safety, Reversibility, and Governance

Every authority level requires a corresponding governance level. Advisory systems need output quality monitoring: evals for faithfulness, relevance, toxicity. Standard practice. Systems with bounded autonomy need policy compliance auditing: verification that the system followed policy on every action, requiring logging, trajectory auditing, and exception review. Significantly more infrastructure. Fully autonomous systems need comprehensive audit trails, adversarial testing, and real-time anomaly detection—the governance level of a financial trading system, not a chatbot.

The constraint is often organizational, not technical

If you cannot build or staff the governance infrastructure for a given authority level, you cannot responsibly operate at that level. Many teams have the technical capability to build agentic systems but not the operational capability to govern them.

How the walls interact

The four constraints do not operate independently. They create feasibility wedges.

Latency plus cost: an agentic system might be the best solution, but if the latency wall forces sub-second response and the cost wall prevents parallelizing ten LLM calls, you are forced into a simple system regardless of task difficulty.

Privacy plus authority: if the task requires accessing sensitive data and taking autonomous actions, you need both rigorous information flow control and action governance. This is the hardest design space, where most real enterprise systems live, and where most naive "just add agents" approaches fail.

Cost plus governance: an agentic loop might be affordable per query, but trajectory auditing at scale can cost more than the LLM calls themselves. The hidden cost of agentic systems is not compute; it is monitoring.
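The wedge idea can be sketched as constraint walls vetoing framing levels independently; the latency floors and cost multipliers below are illustrative placeholders, not measurements:

```python
# Sketch of the feasibility wedge: each constraint wall vetoes framing
# levels on its own. The latency floors and cost multipliers are
# illustrative placeholders.
FRAMINGS = {
    # framing -> (typical minimum latency in seconds, cost multiplier)
    "single call": (0.5, 1),
    "chain": (2.0, 4),
    "agentic loop": (30.0, 50),
}

def feasible_framings(latency_budget_s, cost_budget_x):
    """Return the framings that survive both the latency and cost walls."""
    return [
        name for name, (latency, cost) in FRAMINGS.items()
        if latency <= latency_budget_s and cost <= cost_budget_x
    ]

# Sub-second, cost-tight (Rufus-like): only the simple system survives.
# feasible_framings(1.0, 2)    -> ["single call"]
# Competing with human labor (support-bot-like): everything is viable.
# feasible_framings(60.0, 100) -> ["single call", "chain", "agentic loop"]
```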

Step 4 of 4

4. Pareto Tradeoffs: You Cannot Maximize Everything

Sections 2 and 3 defined what you want and what you are constrained by. This section confronts the uncomfortable truth: even within your feasible design space, you face genuine tradeoffs where improving one dimension necessarily degrades another. The goal is not to optimize; it is to choose your tradeoff position consciously.

Axis 1: Response Quality vs. Latency and Cost (Deliberation Tradeoff)

More reasoning, more tool calls, more verification steps produce better outputs. But each step adds latency and cost. This is the tradeoff between "think harder" and "answer faster."

Rufus sits far toward speed. A product recommendation that takes five seconds is worse than a decent one in 500 milliseconds, because the user is mid-browse and will abandon. Amazon accepts a lower ceiling on recommendation quality in exchange for responsiveness at scale. There is a subtle cheat here: the user provides the additional deliberation. If the first recommendation is wrong, the user refines their query. The human-in-the-loop is the deliberation mechanism, and it is free to Amazon.

Uber's bot can sit further toward quality. A support interaction that takes sixty seconds but correctly resolves the issue is far better than a five-second response that misclassifies the problem. The cost of a wrong fast answer (customer frustration, escalation to a human, potential churn) exceeds the cost of a slow correct answer. But there is a cliff: past two to three minutes, the user abandons the bot and calls support anyway, eliminating the cost savings.

Axis 2: Autonomy vs. Control (Agency Tradeoff)

More autonomy means the system can handle novel situations. More control means the system does exactly what you specified. You cannot maximize both.

Rufus operates with high control and low autonomy. It follows a narrow loop: interpret query, retrieve products, generate recommendation. It does not decide to cross-reference competitor prices or proactively suggest alternatives to items already in your cart. This makes Rufus predictable and easy to audit, but it cannot handle genuinely complex shopping decisions.

Uber's bot needs more autonomy because support cases are heterogeneous. A cancelled ride, a lost item, a safety incident, and a fare dispute all require different tools, policies, and workflows. A fully controlled system would need to enumerate every case type in advance. Autonomy allows handling novel or ambiguous cases—but the cost is unpredictability: the bot might apply the wrong policy, or handle a safety incident with the same routine as a fare dispute.

The convergent design pattern. Both systems benefit from the same principle: autonomy in reasoning, control at the action boundary. Let the LLM be autonomous in understanding the situation (flexible interpretation, retrieval, reasoning). Clamp down at the action boundary (hardcoded policy checks, human approval for high-stakes actions, strict guardrails on what the system can execute). This captures the benefits of autonomy for understanding novel cases without the risks of autonomous action.
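A sketch of that pattern: the model may propose any action, but a deterministic gate sits between proposal and execution. The action names and the $20 cap are invented for illustration.

```python
# "Autonomy in reasoning, control at the action boundary": the model
# may propose any action; a deterministic gate decides what executes.
# Action names and the $20 cap are hypothetical.
AUTO_APPROVE = {"send_policy_explanation", "lookup_trip"}
CAPPED_ACTIONS = {"issue_refund": 20.00}  # autonomous only at or below the cap

def gate(action, amount=0.0):
    """Hardcoded policy check between the model's proposal and execution."""
    if action in AUTO_APPROVE:
        return "execute"
    cap = CAPPED_ACTIONS.get(action)
    if cap is not None and amount <= cap:
        return "execute"
    return "escalate_to_human"  # everything else requires a person

# gate("issue_refund", 12.50)  -> "execute"
# gate("issue_refund", 180.00) -> "escalate_to_human"
# gate("modify_account")       -> "escalate_to_human"
```

Because the gate is ordinary code, widening or tightening the authority boundary is a one-line policy change rather than a prompt change.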

Axis 3: Generality vs. Reliability (Scope Tradeoff)

A general system handles more use cases but fails less gracefully on any particular one. A narrow system handles fewer cases but handles them very well.

Rufus is narrow scope, high reliability within that scope. It answers shopping queries about Amazon products. It does not try to handle returns, account issues, or general knowledge questions. This narrowness enables tight optimization: retrieval tuned to the product catalog, generation tuned to product language, evaluation based on click-through and purchase conversion.

Uber's bot must be broader because "customer support" spans dozens of issue types with different resolution paths. But generality creates a long tail of failure: the system handles the top ten issue types well but struggles with rare cases—edge cases in surge pricing policy, multi-leg international trips, or accessibility-related complaints. The 80th percentile case works; the 95th percentile case does not.

The production solution is scoped generality: be general within a defined boundary, and have hard fallbacks (escalation to human, graceful failure) for cases outside that boundary. The design question is where to draw the boundary, and that is an empirical question that most teams answer too late.

The combined Pareto map

| Tradeoff | Rufus | Uber Bot |
| --- | --- | --- |
| Quality vs. speed | Speed wins; the user provides deliberation. | Quality wins, with a latency cliff at 2–3 minutes. |
| Autonomy vs. control | Controlled reasoning, no action authority. | Autonomous reasoning, controlled action boundary. |
| Generality vs. reliability | Narrow and reliable; one job, done well. | Scoped generality; common cases handled, long tail escalated. |
| Net design posture | Simple, fast, cheap, predictable. Let the human do the hard part. | Complex, careful, moderate cost, monitored. System does the hard part under governance. |

These positions are not technical preferences. They reflect business model alignment. Rufus exists to keep users browsing and buying; speed and low friction are paramount, and a wrong recommendation costs nothing. Uber's bot exists to replace a $5–15 human interaction; thoroughness and correctness are paramount, and a wrong resolution costs real money and trust. The Pareto position follows from the business model, not from the technology.

5. Putting It Together: The Framing Decision

The preceding sections gave you four analytical lenses. Here is how to synthesize them into an actual requirements decision.

Six forcing questions

Rather than a checklist, use these as forcing functions that make implicit assumptions explicit:

The six forcing questions

  1. Error surface. Where is uncertainty allowed to accumulate? Map the decision authority boundary: what does the system recommend versus what does it execute? What happens when it is wrong at each boundary?
  2. I/O gap analysis. How many information-acquisition steps bridge input to required output, for the median case and the 95th percentile? If the number is unpredictable, you need agentic behavior.
  3. Constraint feasibility. Which of the four walls (latency, cost, privacy, governance) eliminates which framing levels? Often a single constraint will kill an entire design direction before you consider tradeoffs.
  4. Pareto position. Given your business model, where do you sit on each tradeoff axis? Which axis matters most? Optimize along that axis and satisfice the others.
  5. Evaluation commitment. Given your framing, what can you actually measure? A simple system lets you benchmark output quality with standard metrics. An agentic system requires trajectory evaluation, counterfactual analysis, and tool-use auditing. If you cannot build the evaluation infrastructure your framing demands, you will be flying blind. Choose a framing you can evaluate, not just one you can build.
  6. Routing policy. Will the system operate at a single complexity level, or adaptively route across levels? The true design problem is often not "which level?" but "what routing policy across levels, and what signal drives the routing?"

A worked example: Uber's support bot through the six questions

Worked example · Uber's support bot

1. Error surface
Advisory for informational queries (trip status, policy explanation). Bounded autonomy for simple actions (refunds under $20 on trips with clear cancellation records). Human escalation for ambiguous cases, high-value disputes, and safety-related incidents.
2. I/O gap
"What is your cancellation policy?" — one-step gap (retrieval). "I was double-charged on a cancelled ride" — four-to-five-step gap (extract claim → retrieve trip → retrieve payment → match to policy → determine action). "My driver made me feel unsafe" — open-ended gap, mandatory escalation.
3. Constraints
Latency: 2–30s tolerance, hard abandonment cliff at 2–3 minutes. Cost: must beat $5–15 human agent cost. Privacy: pre-process trip GPS to trip IDs, minimize raw location in LLM context. Governance: trajectory auditing for every autonomous refund, adversarial monitoring for refund gaming.
4. Pareto position
Quality over speed (wrong resolutions cost more than slow ones). Autonomous reasoning, controlled action boundary. Scoped generality with hard escalation paths.
5. Evaluation commitment
Output metrics: resolution rate, CSAT, re-contact rate. Trajectory metrics: policy compliance rate, action appropriateness, escalation accuracy. Both are required — invest in trajectory logging from day one.
6. Routing policy
Simple informational queries → single call with retrieval. Standard issue types (cancellation, fare adjustment) → structured chain with policy lookup. Complex, ambiguous, or safety-related cases → human escalation. The routing classifier itself is a first-class component with its own evaluation cycle.

The adaptive routing insight

The golden rule "start simple, escalate only when demonstrably needed" is directionally correct but incomplete. The modern production pattern is adaptive routing, not fixed escalation. You do not pick a level and live there. You build a routing policy that allocates each case to the appropriate level of deliberation based on estimated difficulty and stakes. The router itself becomes a first-class component of the system: it needs its own evaluation, its own failure modes, and its own iteration cycle.
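A routing policy of this shape is a few lines of glue around a classifier. A sketch with hypothetical labels and path names; the classifier producing the label is the part that needs real evaluation:

```python
# Sketch of the routing policy as a first-class component: a classifier
# label drives the path, and unknown labels fail closed to a human.
# Labels and path names are hypothetical.
ROUTES = {
    "informational": "single_call_with_retrieval",
    "standard_issue": "structured_chain",
    "complex": "human_escalation",
    "safety": "human_escalation",  # safety-related cases always escalate
}

def route(predicted_label):
    # Fail closed: anything the router does not recognize goes to a human.
    return ROUTES.get(predicted_label, "human_escalation")
```

The fail-closed default matters: a router that fails open turns every classifier error into an unaudited autonomous action.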

When is one more deliberation step worth it? The intuition is straightforward: the expected quality gain from an additional step must exceed the sum of its incremental latency cost, failure risk, and monitoring burden. When the net value (gain minus those costs) is positive, escalate. When it is negative, you are adding complexity for complexity's sake. Rufus-type queries almost never benefit from more deliberation. Uber-type disputes almost always do, up to a ceiling. The best systems learn where that ceiling is from data, not from intuition.
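That intuition reduces to a one-line decision rule. A sketch, where every input is an estimate you would calibrate from data rather than a known quantity:

```python
# The escalation rule as a one-line decision: take one more deliberation
# step only when its expected quality gain exceeds the sum of its
# incremental costs. All inputs are estimates to calibrate from data.

def escalate(expected_gain, latency_cost, failure_risk, monitoring_cost):
    """True iff the net value of one more deliberation step is positive."""
    return expected_gain - (latency_cost + failure_risk + monitoring_cost) > 0

# Rufus-type query: tiny gain, real latency cost -> stop deliberating.
# escalate(0.01, 0.05, 0.01, 0.01) -> False
# Uber-type dispute: the gain dominates the costs -> take the step.
# escalate(0.50, 0.05, 0.10, 0.05) -> True
```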

What Comes Next

This post argued that the first design decision for an LLM system is not "how smart should it be?" but "where do uncertainty, agency, and accountability sit?" The four lenses—error surface, I/O gap, constraints, Pareto tradeoffs—and six forcing questions give you a framework for making that decision deliberately rather than by default.

But notice what we have not yet discussed: how to actually build the thing. The requirements and framing decision constrains the architecture to a narrow set of viable designs. Part 2 is about making that design explicit: decomposition patterns, memory architecture, tool integration, and the control flow decisions that turn a framing choice into a running system.

The paradox of this first step is that it is also the step you revisit most often. Requirements shift, models improve, cost structures change, and your evaluation data reveals that the I/O gap is different from what you assumed.

The six forcing questions are not a one-time exercise. They are a living document that evolves with your system.


The series overview laid out the seven-step lifecycle. This part covered requirements and problem framing: error surfaces, I/O gaps, constraints, and Pareto tradeoffs. Next: turning those decisions into a running architecture.

← Part 0: Series Overview
