There is a widening disconnect in the LLM agent space. At one end, researchers publish increasingly sophisticated architectures: multi-agent debate, self-reflective planning, tool-augmented reasoning chains. At the other, practitioners ship agents held together by a system prompt and a prayer. The missing piece is not more research. It is engineering discipline.
I have spent the past several months compiling a comprehensive design framework for LLM-based agentic systems, drawing on both the research literature (ReAct, Reflexion, Constitutional AI, SWE-bench, and others) and practical lessons from building and evaluating these systems. The result is a seven-step lifecycle that covers the full arc from problem framing to production monitoring.
This post is the overview. It lays out the structure of the framework and explains why each step exists and what makes it different from its traditional ML counterpart. It does not attempt to be comprehensive. Instead, each of the seven steps will get its own dedicated deep-dive in subsequent posts, with concrete examples, formal definitions, and actionable checklists.
Why Agents Need Their Own Playbook
Traditional ML has a well-established lifecycle: define the problem, collect data, train a model, evaluate, deploy, monitor. LLM agents break this in ways that are subtle but consequential.
First, you usually don't train the model. You select a foundation model, possibly fine-tune it, and then build around it. The "development" work shifts from gradient updates to architecture design, prompt engineering, tool integration, and memory construction. This is no less rigorous than training; it's just different, and it lacks the same institutional knowledge.
Second, problem framing and architecture design collapse into one decision. Choosing between a single LLM call, an agentic loop, and a multi-agent system is simultaneously a product decision and an engineering decision. Get it wrong and you'll either under-build (a single call that can't handle the task) or over-build (a multi-agent system where a single call would have sufficed, but with 5x the latency and cost).
Third, evaluation is fundamentally harder. Agent trajectories are multi-step, path-dependent, and stochastic. Many valid paths exist for the same task. Emergent behavior appears in multi-agent systems. Key properties (safety, reliability, consistency) are latent and hard to measure. A single metric will mislead you, and offline metrics that don't correlate with online outcomes mean all your optimization was wasted.
Fourth, failure modes are novel. Infinite loops, hallucinated tool calls, context exhaustion, cascading errors across agents, reward hacking, goal drift. These don't map cleanly onto traditional ML failure categories, and they require their own monitoring and alerting infrastructure.
The framework I'm proposing doesn't reinvent everything. It builds on the traditional ML lifecycle but adapts each step for the specific challenges that agentic systems introduce, and it adds steps (like architecture design) that are entirely new.
The seven-step lifecycle. Each feeds forward; monitoring feeds back to every earlier stage.
The Seven Steps: A Roadmap
Below is the structure of the series. For each step, I'll outline what it covers, why it matters, and what makes it different from its traditional ML analog. The deep-dive posts will follow, each with formal definitions, worked examples, and concrete checklists you can use in your own projects.
Requirements and Problem Framing
In traditional ML, requirements and problem framing are separate phases. For LLM agents, they are inseparable. How you frame the problem (single call vs. agentic loop vs. multi-agent system) is itself a core requirement decision, because it determines latency, cost, reliability, and the entire downstream architecture. This step forces you to articulate task-level objectives, system-level objectives, alignment boundaries, and the performance tradeoffs between speed and quality, autonomy and control, specialization and generality. The golden rule: start simple, escalate only when demonstrably needed.
Agent Architecture Design
This step has no analog in traditional ML. For agents, architecture design (topology, reasoning patterns, tool schemas, inter-agent protocols) is as consequential as model selection. I'll walk through the four canonical topologies (pipeline, hierarchical, debate, collaborative), the four core reasoning patterns (ReAct, Plan-then-Execute, Reflexion, Re-planning), and a critical dimension that most guides ignore: human-centered design. Research on human-AI collaboration shows that users build mental models of what agents can and cannot do, and misaligned mental models lead to over-reliance or under-trust. Architecture choices shape those mental models, whether you design for it or not.
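To make the reasoning patterns concrete, here is a minimal sketch of the ReAct loop: the model alternates Thought, Action, and Observation until it emits a final answer. The `fake_model` function stands in for a real LLM call, and the single `search` tool and its canned responses are hypothetical.

```python
def search(query: str) -> str:
    # Hypothetical tool: a real implementation would call a search API.
    return {"capital of France": "Paris"}.get(query, "no result")

TOOLS = {"search": search}

def fake_model(transcript: str) -> str:
    # Stand-in for an LLM: proposes an action first, then a final answer
    # once an observation is present in the transcript.
    if "Observation:" not in transcript:
        return "Thought: I should look this up.\nAction: search[capital of France]"
    return "Thought: I have the answer.\nFinal: Paris"

def react_loop(task: str, max_steps: int = 5) -> str:
    transcript = f"Task: {task}"
    for _ in range(max_steps):  # hard step cap guards against infinite loops
        output = fake_model(transcript)
        if "Final:" in output:
            return output.split("Final:")[1].strip()
        # Parse "Action: tool[argument]" and execute the named tool.
        action = output.split("Action:")[1].strip()
        name, arg = action.split("[", 1)
        observation = TOOLS[name.strip()](arg.rstrip("]"))
        transcript += f"\n{output}\nObservation: {observation}"
    return "step budget exhausted"

print(react_loop("What is the capital of France?"))
```

Even this toy version shows the structural decisions the deep-dive will cover: how actions are parsed, how observations re-enter the context, and why a step budget is non-negotiable.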
Knowledge, Tools, and Data Infrastructure
In traditional ML, data prep means building training datasets. For LLM agents, the real work is building knowledge infrastructure: RAG pipelines, vector stores, tool integration contracts, memory systems, prompt management. This is where most of the unglamorous engineering happens, and where most agent failures actually originate. A tool with an ambiguous description or an incomplete error contract will cause downstream failures that look like model problems but are actually infrastructure problems. I'll cover storage and retrieval (vector DB, key-value, graph), tool schema design with the principle of least privilege, prompt engineering specifically for agents (instruction hierarchy, few-shot tool examples, constitutional rules), and data governance.
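As a sketch of what an unambiguous tool contract looks like, here is a hypothetical read-only tool in the JSON-Schema style used by most function-calling APIs. The tool name, fields, and error codes are invented for illustration; the point is that the description, parameter constraints, and error contract are all explicit and machine-checkable.

```python
READ_TICKET_TOOL = {
    "name": "read_ticket",
    "description": (
        "Read a single support ticket by its numeric ID. "
        "Read-only: cannot modify or delete tickets."  # least privilege, stated up front
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "ticket_id": {
                "type": "integer",
                "minimum": 1,
                "description": "Numeric ticket ID, e.g. 4821.",
            },
        },
        "required": ["ticket_id"],
        "additionalProperties": False,  # reject hallucinated arguments
    },
    # Error contract: every failure the agent can observe, enumerated,
    # each with guidance the model can act on.
    "errors": {
        "NOT_FOUND": "No ticket with that ID; do not retry with the same ID.",
        "PERMISSION_DENIED": "Caller lacks access; escalate to a human.",
        "RATE_LIMITED": "Back off and retry after the indicated delay.",
    },
}
```

A schema like this turns "the model keeps calling the tool wrong" from a model problem into a reviewable infrastructure artifact.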
Agent Development
The model adaptation spectrum ranges from in-context learning (lowest barrier, context-limited) through supervised fine-tuning and RL methods (PPO, DPO, GRPO, DAPO) to full distillation. But the real development work is in three areas that don't exist in traditional ML: memory architecture (context window management, external memory, episodic reflection, shared MAS memory), anthropomorphic design as a controllable lever (four dimensions of cues that can be intentionally tuned to support user goals rather than treated as an incidental risk), and multi-agent orchestration (topology, communication protocol, conflict resolution, error propagation).
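To ground the memory-architecture point, here is a sketch of the simplest context-window policy: pin the system prompt and evict the oldest turns when the history exceeds a token budget. Token counting is approximated by word count here; a real system would use the model's tokenizer.

```python
def approx_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer.
    return len(text.split())

def fit_to_budget(system_prompt: str, turns: list[str], budget: int) -> list[str]:
    """Return the system prompt plus the most recent turns that fit the budget."""
    kept: list[str] = []
    remaining = budget - approx_tokens(system_prompt)
    for turn in reversed(turns):  # walk newest-first
        cost = approx_tokens(turn)
        if cost > remaining:
            break
        kept.append(turn)
        remaining -= cost
    return [system_prompt] + list(reversed(kept))

history = ["user: hi", "agent: hello, how can I help", "user: summarize ticket 4821"]
print(fit_to_budget("You are a support agent.", history, budget=12))
```

Sliding-window eviction is only the baseline; the deep-dive will contrast it with summarization, episodic reflection, and external memory stores.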
Evaluation
This is perhaps where the gap between traditional ML and agentic systems is widest. Agent evaluation is harder along every axis: multi-step trajectories, many valid paths, path-dependent stochastic behavior, emergent MAS dynamics, latent safety properties. I'll cover the full metrics spectrum from lexical (BLEU, ROUGE) through embedding (BERTScore) to learned and LLM-as-Judge approaches, with specific attention to RAG error decomposition (context utilization vs. hallucination vs. noise sensitivity). For online evaluation, I'll discuss why you must never rely on a single metric. And I'll spend significant space on population-level fairness: why an increase in the arithmetic mean can mask quality-of-service harms for underrepresented subgroups, and how disaggregated evaluation and worst-case (maximin) metrics address this.
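The disaggregation idea fits in a few lines of code. This sketch scores per subgroup and reports the worst-case (maximin) group alongside the overall mean; the subgroup labels and scores are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

def disaggregate(records):
    """records: iterable of (subgroup, score) pairs.

    Returns the overall mean, per-group means, and the worst-off group.
    """
    by_group = defaultdict(list)
    for group, score in records:
        by_group[group].append(score)
    group_means = {g: mean(scores) for g, scores in by_group.items()}
    overall = mean(score for _, score in records)
    worst_group = min(group_means, key=group_means.get)
    return overall, group_means, (worst_group, group_means[worst_group])

records = [("en", 0.92), ("en", 0.88), ("en", 0.95), ("sw", 0.55), ("sw", 0.61)]
overall, per_group, (worst, worst_score) = disaggregate(records)
# The overall mean looks healthy while the worst group tells a different story.
print(f"mean={overall:.2f}, worst group={worst} at {worst_score:.2f}")
```

Reporting the maximin number next to the mean is exactly what prevents an aggregate improvement from hiding a quality-of-service regression for a small subgroup.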
Safety, Guardrails, and Deployment
For agents that act autonomously, guardrail design is as important as the agent itself, and must happen before deployment. The threat model follows the OWASP Top 10 for LLMs: prompt injection (direct and indirect), data leakage, excessive agency, system prompt leakage. Guardrails operate at three layers (input, system prompt, output), and for multi-agent systems must also cover inter-agent messages. I'll also cover cost engineering in depth, because MAS cost (often 3 to 5 times the user-facing token cost) is a deployment-blocking concern that most guides hand-wave away. Smart routing, semantic caching, prompt compression, and budget enforcement aren't optimizations; they're requirements.
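Budget enforcement, in particular, is simple enough to sketch here. This hypothetical per-run token budget fails closed when the next charge would exceed the cap; the price and limits are illustrative, not real rates.

```python
class BudgetExceeded(Exception):
    pass

class TokenBudget:
    """Hard cap on spend across all LLM calls in one agent run."""

    def __init__(self, max_usd: float, usd_per_1k_tokens: float = 0.01):
        self.max_usd = max_usd
        self.rate = usd_per_1k_tokens
        self.spent_usd = 0.0

    def charge(self, tokens: int) -> None:
        cost = tokens / 1000 * self.rate
        if self.spent_usd + cost > self.max_usd:
            # Fail closed: stop the run rather than silently overspend.
            raise BudgetExceeded(
                f"would spend {self.spent_usd + cost:.4f} USD, cap is {self.max_usd}"
            )
        self.spent_usd += cost

budget = TokenBudget(max_usd=0.05)
for step_tokens in [1200, 1800, 900]:
    budget.charge(step_tokens)  # each LLM or tool step reports its token usage
print(f"spent {budget.spent_usd:.4f} USD")
```

The same enforcement point is where smart routing and caching plug in: if the cheap path satisfies the request, the budget is never touched.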
Observability and Monitoring
For agentic systems, observability is a distinct discipline that goes beyond traditional APM. You need distributed tracing where every LLM call, tool call, and inter-agent message is a span. You need prompt versioning that correlates prompt changes with metric shifts. You need trajectory replay: the ability to store and replay failed agent trajectories, which is the LLM equivalent of a stack trace. And you need a failure mode taxonomy that covers the novel ways agents break: infinite loops, hallucinated tool calls, context exhaustion, cascading MAS errors, reward hacking, goal drift, distribution shift, and silent tool API changes. Each of these needs its own alerting threshold and severity classification.
What's Next
Part 1 will go deep on Requirements and Problem Framing: how to decide between a single call, an agentic loop, and a multi-agent system; how to map performance tradeoffs; and how to build a constraint analysis that prevents the most common architecture mistakes before you write a single line of code.
This is Part 0 of a 7-part series. Each subsequent post will take one step of the framework in depth: formal definitions, worked examples, checklists, and lessons from building these systems in practice.