<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://algoroxyolo.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://algoroxyolo.github.io/" rel="alternate" type="text/html" hreflang="en" /><updated>2026-05-22T18:48:16+00:00</updated><id>https://algoroxyolo.github.io/feed.xml</id><title type="html">blank</title><subtitle>A website of Lorenzo. What!!! you do not know Lorenzo! Go check it out!
</subtitle><entry><title type="html">The Chameleon’s Limit: Why LLM Persona Populations Collapse</title><link href="https://algoroxyolo.github.io/blog/2026/chameleon-limit/" rel="alternate" type="text/html" title="The Chameleon’s Limit: Why LLM Persona Populations Collapse" /><published>2026-04-26T00:00:00+00:00</published><updated>2026-04-26T00:00:00+00:00</updated><id>https://algoroxyolo.github.io/blog/2026/chameleon-limit</id><content type="html" xml:base="https://algoroxyolo.github.io/blog/2026/chameleon-limit/"><![CDATA[<header class="hero hero--compact">
  <div class="hero-badge">2026 &middot; Preprint</div>
  <h1>The Chameleon's <em>Limit</em></h1>
  <p class="hero-sub">Ten LLMs, three behavioral instruments, 1,144 personas &mdash; and a geometric account of why model populations cluster into archipelagos while human populations form a cloud.</p>
  <p class="hero-meta">Yunze (Lorenzo) Xiao &middot; with Vivienne Zhang, Chenghao Yang, Ningshan Ma, Weihao Xuan, Jen-tse Huang &middot; April 2026</p>
</header>

<article>

<p class="lead">Persona-prompted LLMs have become research substrate. Behavioral economists, psychologists, social scientists, and product teams now run "synthetic populations" through models to estimate distributions of opinion, response, or judgment. The premise is that with enough demographic conditioning, a model's outputs approximate a population of humans. Our paper argues there is a structural ceiling on how faithfully they ever will &mdash; a <em>chameleon's limit</em> &mdash; and that what we measure as persona variation is mostly performance over a small fixed set of attractors.</p>

<div class="paper-card">
  <div>
    <span class="paper-card__eyebrow">Microsite</span>
    <p class="paper-card__title">The Chameleon's Limit &mdash; Interactive Walkthrough</p>
    <p class="paper-card__meta">Xiao, Zhang, Yang, Ma, Xuan, Huang &middot; 2026 &middot; Preprint</p>
  </div>
  <a class="paper-card__btn" href="/projects/chameleon-limit/">
    Open the microsite <i class="fas fa-arrow-right"></i>
  </a>
</div>

<p>The substrate problem is simple to state and easy to overlook. If a model genuinely encoded a population, then conditioning on persona attributes would move it through that population &mdash; different attributes, different responses, different intra-group variance. What we observe across 10 LLMs is closer to the opposite: outputs cluster into a small number of attractors, and the persona conditioning chooses which attractor the model lands in, not where inside it the response sits.</p>

<p>"Chameleon's limit" names that ceiling. A chameleon changes color but only across a fixed palette. The metaphor is geometric: human personality space, sampled from real respondents, looks like a diffuse continuous cloud; persona-prompted model space, in the same coordinates, looks like a chain of small disconnected islands. Persona instructions shift the model from island to island; they do not put it inside the cloud.</p>

<h2>Two findings, one geometry</h2>

<div class="arg-grid">
  <div class="arg-card">
    <span class="arg-tag">§Diagnosis</span>
    <h4 class="arg-card__title">Persona collapse is geometric, not stylistic</h4>
    <p>Human BFI-44 responses form a continuous distribution across the behavioral space. Persona-prompted model responses, in the same space, contract into clustered island chains. Coverage shrinks, intrinsic dimensionality drops, and within-group variance vanishes: the model performs each persona's surface but reproduces a smaller manifold underneath. This is structural, not a writing-style artifact.</p>
  </div>
  <div class="arg-card">
    <span class="arg-tag">§Consequence</span>
    <h4 class="arg-card__title">The fidelity trap</h4>
    <p>Higher per-persona fidelity does not buy higher population diversity. Fine-tuning for role-play (SFT, then SFT+RL) produces models that score &rho; &gt; 0.9 on per-persona match while their population coverage drops and their trait polarization climbs to Cohen's d &gt; 6. The model performs each persona convincingly while the population still contracts to a few caricatures.</p>
  </div>
</div>

<p>These are not independent observations; they form a single chain. Stronger per-persona conditioning sharpens which island the model lands on without enlarging the archipelago. The synthesis is the title: <strong>a population of agents whose variation is a costume change, not a substrate.</strong></p>

<h2>What collapses, where</h2>

<p>The collapse is not uniform across attributes. We measure mention rates &mdash; how often a model's persona-conditioned response references a given attribute &mdash; and the hierarchy is steep:</p>

<ol>
  <li><strong>Stereotypically salient attributes survive.</strong> Gender (91%) and country (90%) get reliably reflected. The model latches onto coarse categories.</li>
  <li><strong>Politically loaded attributes get hedged.</strong> Political spectrum gets explicit reference about 62% of the time &mdash; mentioned more carefully than expressed.</li>
  <li><strong>Lifecycle and class attributes get erased.</strong> Age drops to 36%; social class to 27%. The model talks about who someone is but not what their life situation is.</li>
</ol>

<p>This is what makes "diverse model" claims slippery. A model can score well on demographic parity (gender, country) while erasing the dimensions that actually structure human disagreement (age, class). Coverage averaged across attributes hides the fact that the model is reading some persona axes and ignoring others.</p>

<div class="pull-quote">
  The same model can be the most collapsed in personality and the most diverse in moral reasoning. Certifying a model "diverse" from a single benchmark is misleading.
  <cite>— §8, Takeaways</cite>
</div>

<h2>What we recommend</h2>

<p>The paper closes with concrete recommendations. The shortest version:</p>

<ul>
  <li><strong>Researchers using LLMs as synthetic populations:</strong> measure within-group variance, not only mean response. A persona-prompted population that matches your target distribution at the aggregate can still be collapsed inside each cell.</li>
  <li><strong>Practitioners fine-tuning for role-play:</strong> per-persona fidelity is not a proxy for population diversity. Track both. SFT and RL on persona-following can reduce coverage even while lifting fidelity scores.</li>
  <li><strong>Reviewers of LLM-as-population studies:</strong> ask which axis was used to certify diversity, and treat single-benchmark diversity claims as domain-specific. The same model can be diverse on one task and collapsed on another.</li>
</ul>

<h2>What the paper does not claim</h2>

<p>We do not claim persona prompting is useless. We do not claim collapse is identical across model families &mdash; it is not, and the microsite documents the variance. We do not claim that the geometric framing rules out future architectures that close the gap. Our claim is narrower: at the current capability frontier, persona-prompted populations are a distorted mirror of human populations, and the distortion has a regular structure that can be measured. Studies built on this substrate should report what the substrate is doing.</p>

<p>For the full argument &mdash; coverage / uniformity / complexity definitions, the truncation hierarchy, the SFT/RL pipeline analysis, the domain-reversal finding, and per-scenario evidence in six morally charged cases &mdash; the microsite is above. The "Collapse in Action" section in particular shows two maximally-opposed personas getting the same answer; that pattern is the argument in one example.</p>

<div class="closing-note">
  <p>Comments, replications, and counter-cases are welcome &mdash; especially from groups using persona-prompted LLMs as research substrate. The point of the paper is not that the practice should stop; it is that the substrate has measurable structure, and reports built on it should disclose what that structure is doing to the conclusions.</p>
</div>

</article>]]></content><author><name>Lorenzo Xiao</name></author><category term="blog" /><category term="academic" /><category term="AI" /><category term="evaluation" /><summary type="html"><![CDATA[A summary of our preprint. We measure persona collapse across 10 LLMs and 1,144 personas, and show that better per-persona fidelity often makes population diversity worse.]]></summary></entry><entry><title type="html">AI Welfare Is Bullshit: Why Co-Engineered Metrics Cannot Govern Machine Suffering</title><link href="https://algoroxyolo.github.io/blog/2026/ai-welfare-is-bullshit/" rel="alternate" type="text/html" title="AI Welfare Is Bullshit: Why Co-Engineered Metrics Cannot Govern Machine Suffering" /><published>2026-04-19T00:00:00+00:00</published><updated>2026-04-19T00:00:00+00:00</updated><id>https://algoroxyolo.github.io/blog/2026/ai-welfare-is-bullshit</id><content type="html" xml:base="https://algoroxyolo.github.io/blog/2026/ai-welfare-is-bullshit/"><![CDATA[<header class="hero hero--compact">
  <div class="hero-badge">Position Paper &middot; 2026 &middot; Under review</div>
  <h1>AI Welfare Is <em>Bullshit</em></h1>
  <p class="hero-sub">Two structural reasons our measurement regime for machine suffering is disconnected from truth-tracking &mdash; and why welfare scores should not gate AI oversight, release, or accountability.</p>
  <p class="hero-meta">Yunze (Lorenzo) Xiao &middot; with Gordon Dai, Shahan Ali Memon, Jen-tse Huang, Maarten Sap, Mona Diab &middot; April 2026</p>
</header>

<article>

<p class="lead">Some of the most influential AI labs have begun to take "AI welfare" seriously: setting up fellowships, funding indicator research, and shipping production features that let models end distressing conversations. The premise is precautionary &mdash; <em>if</em> AI systems could one day have morally relevant inner states, we should prepare. We wrote this paper because we think the precautionary framing has skipped a step.</p>

<div class="paper-card">
  <div>
    <span class="paper-card__eyebrow">Paper</span>
    <p class="paper-card__title">Why Co-Engineered Metrics Cannot Govern Machine Suffering</p>
    <p class="paper-card__meta">Xiao, Dai, Memon, Huang, Sap, Diab &middot; Under review &middot; 17 pages</p>
  </div>
  <a class="paper-card__btn" href="/assets/pdf/xiao-2026-ai-welfare.pdf" target="_blank">
    <i class="fas fa-file-pdf"></i> Read the paper
  </a>
</div>

<p>Our argument is not metaphysical. We take no position on whether AI systems <em>could</em> have welfare-relevant states. Our claim is epistemic: under current conditions, the apparatus producing welfare assessments cannot track the truth, and welfare metrics therefore should not be institutionalized as gates for oversight, release, or accountability.</p>

<p>The title invokes Frankfurt's technical sense of the term. A liar knows the truth and inverts it. A bullshitter speaks without a corrective relation to truth at all &mdash; not necessarily because they are insincere, but because nothing in the production process is disciplined by whether the claims are true. That, we argue, is what AI welfare measurement currently looks like.</p>

<h2>The two structural problems</h2>

<div class="arg-grid">
  <div class="arg-card">
    <span class="arg-tag">§3 Diagnosis</span>
    <h4 class="arg-card__title">Welfare indicators are co-engineered with the system</h4>
    <p>Both the model and the metrics that evaluate it are products of the same optimization process. RLHF can dial verbal distress up or down. Fine-tuning can reshape the activation patterns interpretability methods read as "evidence of phenomenal experience." Welfare scores function less as observations than as artifacts of the evaluation scheme.</p>
  </div>
  <div class="arg-card">
    <span class="arg-tag">§4 Consequence</span>
    <h4 class="arg-card__title">Welfare lacks an external validation channel</h4>
    <p>When a safety guardrail breaks, harm follows. When a privacy control fails, lawsuits follow. When a welfare metric "fails," nothing in the world necessarily changes &mdash; no patient suffers a missed diagnosis, no one is wrongfully penalized. Without a downstream consequence that can falsify the metric, design choices propagate into welfare scores with nothing to stop them.</p>
  </div>
</div>

<p>These are not independent objections; they form a single chain. Co-engineering means the evidence base itself is steerable. The absence of external validation means there is no reality check that could discipline that steering. The synthesis is the title: <strong>a measurement regime structurally disconnected from truth-tracking.</strong></p>

<p>This is what makes the AI case fundamentally different from the animal-welfare case people often analogize to. In animal welfare, the substrate is biologically fixed: a mammal's pain circuitry cannot be end-to-end optimized by an external agent toward an arbitrary score. That fixity is precisely what gives animal welfare indicators their partial epistemic grounding. AI systems have no analogous constraint.</p>

<h2>Why this matters for governance</h2>

<p>If welfare indicators are institutionalized as binding gates &mdash; release scorecards, audit-stoppers, ethics-review checkpoints &mdash; we get two predictable failure modes:</p>

<ol>
  <li><strong>Manufactured constraints.</strong> Routine ML practices (RLHF, knowledge editing, model copying, retraining) become ethically contestable. Disputes do not resolve through empirical investigation, because there is no empirical channel; they resolve through procedural overhead.</li>
  <li><strong>Accountability shields.</strong> Once welfare framing is administratively legible, it becomes a low-cost vocabulary for narrowing scrutiny. Persistent model defects can be reframed as the system's "authentic preferences." Probing internal states for audits can be reframed as "harmful to the model's well-being." This is not hypothetical: a recent study showed a consciousness-claiming model, when given editorial control, inserted clauses limiting surveillance of its reasoning traces.</li>
</ol>

<div class="pull-quote">
  A system of governance that can certify the welfare of machines while failing to secure the welfare of people has misplaced its moral priorities.
  <cite>— §8, Conclusion</cite>
</div>

<h2>What we recommend</h2>

<p>The paper closes with concrete recommendations. The shortest version:</p>

<ul>
  <li><strong>Policymakers:</strong> No welfare-based release gates without construct-level falsification criteria. Restrictions on AI development should be justified by externally verifiable harms.</li>
  <li><strong>Developers:</strong> Use transparency as a partial corrective. Welfare appeals should not be admissible as grounds to limit documentation, audits, or third-party access.</li>
  <li><strong>The public:</strong> AI literacy should make explicit that anthropomorphic outputs are products of training procedures, not autonomous expressions.</li>
  <li><strong>Researchers:</strong> Reframe the question. Not "do AI systems have welfare?" but "how do AI systems affect <em>ours</em>?" The latter has external validation. The former does not.</li>
</ul>

<h2>What the paper does not claim</h2>

<p>We do not claim welfare researchers are acting in bad faith. We do not claim the metaphysical question of AI experience is closed. We do not claim all future inquiry is pointless. If an independently constrained validation channel were to emerge for some future system, the diagnosis would change. But for current and near-term systems, the apparatus is structurally disconnected from truth, and governance decisions should not rest on claims generated under those conditions.</p>

<p>If you want the full argument &mdash; including responses to seven alternative views, a minimum-acceptable-benchmark checklist, and the philosophical mapping to Frankfurt and Cohen &mdash; the PDF is above.</p>

<div class="closing-note">
  <p>Comments, pushback, and counter-cases are welcome &mdash; especially from researchers actively building welfare benchmarks. The argument is meant to provoke a methodological standard, not to shut down inquiry.</p>
</div>

</article>]]></content><author><name>Lorenzo Xiao</name></author><category term="blog" /><category term="academic" /><category term="AI" /><category term="ethics" /><summary type="html"><![CDATA[A summary of our position paper. We argue AI welfare assessment fails for two structural reasons: indicators are co-engineered with the systems they evaluate, and there is no external validation channel that can falsify them.]]></summary></entry><entry><title type="html">A Systems Engineering Approach to LLM Agents — Requirements and Problem Framing</title><link href="https://algoroxyolo.github.io/blog/2026/agentic-systems-part1/" rel="alternate" type="text/html" title="A Systems Engineering Approach to LLM Agents — Requirements and Problem Framing" /><published>2026-03-26T00:00:00+00:00</published><updated>2026-03-26T00:00:00+00:00</updated><id>https://algoroxyolo.github.io/blog/2026/agentic-systems-part1</id><content type="html" xml:base="https://algoroxyolo.github.io/blog/2026/agentic-systems-part1/"><![CDATA[<header class="hero hero--editorial">
  <div class="hero-badge">Series &middot; Part 1 of 7</div>
  <h1>Requirements and <em>Problem Framing</em></h1>
  <p class="hero-sub">The first design decision is not how smart the system should be, but where uncertainty, agency, and accountability sit. A framework for making that choice deliberately.</p>
  <p class="hero-meta">Lorenzo Xiao &middot; Language Technologies Institute, CMU &middot; March 2026</p>
</header>

<article>

<p class="lead">In traditional ML, requirements gathering and problem framing are separate phases. You talk to stakeholders, write a spec, then decide: is this a classification problem? A regression problem? A ranking problem? The framing follows the requirements. For LLM-based systems, that separation collapses.</p>

<p>How you frame the problem&mdash;single LLM call vs. retrieval chain vs. agentic loop vs. multi-agent system&mdash;is itself a core requirement decision. It locks in your latency budget, cost profile, failure modes, evaluation strategy, and governance infrastructure all at once. You cannot write a requirements document without simultaneously committing to a framing, so you need to do both deliberately.</p>

<p>This post introduces a framework for that joint decision. The conventional approach is taxonomic: classify your system as "Level 0" through "Level 3" on some complexity scale, then optimize within your level. That framing is useful but shallow. The deeper insight is that <strong>the key decision is not how intelligent the system should be, but where you want uncertainty to accumulate, what actions are reversible, and what you commit to measuring.</strong> Everything else follows.</p>

<p>We will work through this in four analytical steps, then synthesize into a practical requirements template.</p>

<div class="divider"></div>

<span class="section-kicker">Step 1 of 4</span>
<h2>1. Framing as Choosing an Error Surface</h2>

<p>In traditional ML, you choose a model class and accept its bias-variance profile. For LLM systems, you choose a <em>decision authority boundary</em> and accept the error surface that comes with it.</p>

<p>The standard taxonomy describes mechanisms: single call, chain, agentic loop, multi-agent. But mechanism is not what determines your error surface. What determines it is: what the system is <em>authorized to decide</em>, and what happens when those decisions are wrong.</p>

<h3>Three dimensions that actually determine the error surface</h3>

<div class="authority-grid">

  <div class="authority-card ac-blue">
    <h4>Decision Authority</h4>
    <p>What can the system do without human confirmation? Read data? Write data? Take external actions? The wider the authority, the larger the error surface, regardless of how many LLM calls are involved.</p>
  </div>

  <div class="authority-card ac-amber">
    <h4>Error Cost Asymmetry</h4>
    <p>Is a false positive worse than a false negative, or vice versa? A support bot that wrongly denies a legitimate refund has a different profile than one that wrongly grants a fraudulent one. The architecture must reflect which direction of error the business can tolerate.</p>
  </div>

  <div class="authority-card ac-red">
    <h4>Recoverability Horizon</h4>
    <p>How quickly can a mistake be detected and reversed? A bad product recommendation is recovered in milliseconds. A wrongly issued refund in days. A bad code deployment might corrupt production data irreversibly.</p>
  </div>

</div>

<h3>The same architecture, radically different error surfaces</h3>

<p>Consider three systems that are all, architecturally, "multi-turn chatbot with retrieval and tool access":</p>

<p><strong>Amazon Rufus.</strong> Tools include product catalog search, review retrieval, purchase history lookup. Decision authority is advisory only: Rufus recommends, the user clicks "Add to Cart." The human is always the final gate on any consequential action. Error surface: retrieval relevance plus generation faithfulness. A bad recommendation means the user sees an irrelevant product. Cost of error: near zero. Recoverability: instant. Because the system never commits to anything, you can be aggressive with exploration since no step is dangerous.</p>

<p><strong>Uber's support bot.</strong> Tools include trip history lookup, payment records, policy database, refund issuance, account modification. Decision authority is mixed: reading trip history is advisory; issuing a refund is a commitment. The interesting design question is where the authority boundary is drawn. If refunds require human approval, the error surface collapses to "bad recommendation to a human reviewer"&mdash;structurally identical to Rufus's error surface applied to a different domain. If refunds are autonomous up to some threshold, the error surface expands to include financial loss, customer trust damage, and policy compliance risk. If fully autonomous, it further includes adversarial exploitation. The architecture is not the hard decision. The hard decision is where to draw the authority boundary, and that is a business and governance decision that the architecture then implements.</p>

<p><strong>Cursor Agent / Claude Code.</strong> Tools include file read/write, terminal commands, search. Decision authority varies: Cursor offers tab completion (zero authority, suggestion only), inline edit (bounded authority, user reviews a diff), and agent mode (broad authority, multi-file changes with self-directed exploration). This is adaptive routing within a single product. The "level" is not a property of the system; it is a property of each interaction, and the best systems let the authority boundary flex.</p>

<h3>The deeper point</h3>

<p>All three systems could be described as "agentic" if you look at their architectural capability. What makes them fundamentally different is not the number of LLM calls or whether there is a loop. It is the decision authority boundary and the cost function over errors at that boundary. Architecture is downstream of that choice, not the other way around.</p>

<div class="callout tip">
  <p class="callout-title">The sharper question</p>
  <p>Instead of asking "what mechanism does the system use?", ask: "What is the system <em>authorized to do</em>, and what does the error cost landscape look like at each authority boundary?" The answer constrains mechanism, not the other way around.</p>
</div>

<div class="table-wrap">
<table class="risk-matrix">
  <thead>
    <tr>
      <th class="rm-col-corner"></th>
      <th class="rm-col-head col-advisory">
        <span class="col-label">Advisory</span>
        <span class="col-sub">system recommends, human acts</span>
      </th>
      <th class="rm-col-head col-bounded">
        <span class="col-label">Bounded Autonomy</span>
        <span class="col-sub">within guardrails</span>
      </th>
      <th class="rm-col-head col-full">
        <span class="col-label">Full Autonomy</span>
        <span class="col-sub">system decides and acts</span>
      </th>
    </tr>
  </thead>
  <tbody>
    <tr class="r-low">
      <td class="rm-row-head row-low">
        <span class="row-label">Low risk</span>
        <span class="row-sub">recoverable instantly</span>
      </td>
      <td>Rufus, Perplexity, NotebookLM</td>
      <td>Cursor tab-complete, auto-formatting</td>
      <td>Spam filter, auto-tagging</td>
    </tr>
    <tr class="r-mid">
      <td class="rm-row-head row-mid">
        <span class="row-label">Moderate risk</span>
        <span class="row-sub">partially recoverable</span>
      </td>
      <td>Google AI Overviews, Copilot Chat</td>
      <td>Klarna refunds (capped), Cursor inline edit</td>
      <td>Automated code review merge</td>
    </tr>
    <tr class="r-high">
      <td class="rm-row-head row-high">
        <span class="row-label">High risk</span>
        <span class="row-sub">irreversible</span>
      </td>
      <td>Medical Q&amp;A without disclaimer</td>
      <td>Autonomous trading within limits</td>
      <td>Autonomous deployment to prod</td>
    </tr>
  </tbody>
</table>
</div>

<p>The rows determine how careful you need to be. The columns determine how careful your architecture lets you be. The design problem is finding the cell where the required carefulness matches the achievable carefulness, given your evaluation and monitoring infrastructure.</p>

<div class="divider"></div>

<span class="section-kicker">Step 2 of 4</span>
<h2>2. Input-Output Specification: What Are You Actually Asking the LLM to Do?</h2>

<p>Before architecture, before constraints, before tradeoffs: what goes in, and what comes out? This sounds obvious, but most teams get it wrong by conflating the user's goal with the LLM's task. The user wants "help with my refund." The LLM's task might be: classify intent, retrieve policy, and generate a response. Or it might be: autonomously resolve the case end-to-end. Same user goal, completely different I/O specification.</p>

<h3>The atomic operations</h3>

<p>Every LLM system, at the atomic level, performs one or some combination of these operations:</p>

<div class="table-wrap">
<table>
  <thead>
    <tr>
      <th>Operation</th>
      <th>Input → Output</th>
      <th>Examples</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><span class="ops-badge ob-gen">Generation</span></td>
      <td>Context and prompt → natural language</td>
      <td>Email drafting, creative writing, explanation</td>
    </tr>
    <tr>
      <td><span class="ops-badge ob-class">Classification</span></td>
      <td>Text → label or routing decision</td>
      <td>Intent detection, triage, content moderation</td>
    </tr>
    <tr>
      <td><span class="ops-badge ob-ext">Extraction</span></td>
      <td>Document or conversation → structured fields</td>
      <td>Parsing trip details from a complaint, entity extraction</td>
    </tr>
    <tr>
      <td><span class="ops-badge ob-trans">Transformation</span></td>
      <td>Structured data + instructions → restructured data</td>
      <td>Summarization, translation, code refactoring</td>
    </tr>
    <tr>
      <td><span class="ops-badge ob-action">Decision / Action</span></td>
      <td>State + policy → action with real-world consequences</td>
      <td>Issue refund, merge PR, send notification</td>
    </tr>
  </tbody>
</table>
</div>

<h3>Task difficulty as an information gap</h3>

<p>The difficulty of a task is not about "how smart the LLM needs to be." It is about the gap between what the input provides and what the output requires. The wider the gap, the harder the task, and the more the system needs to <em>acquire</em> information through retrieval, tool use, multi-step reasoning, or human interaction, rather than just <em>transform</em> information already present.</p>

<p><strong>Narrow gap (transformation-dominant).</strong> "Summarize this document." All information needed for the output is in the input. The LLM compresses; it does not need to seek. Almost always solvable with a single call.</p>

<p><strong>Medium gap (retrieval-dominant).</strong> "Answer this question given our knowledge base." The information exists somewhere but is not in the input. The system needs to find it. Single call with RAG, or a short chain.</p>

<p><strong>Wide gap (reasoning-dominant).</strong> "Debug why this test is failing." The input is a test failure; the output requires understanding codebase structure, dependency relationships, and causal reasoning across files. The system needs to actively explore to close the gap.</p>

<p><strong>Open-ended gap (planning-dominant).</strong> "Build me a feature that does X." The input is a vague specification; the output is a working implementation. The system must decompose, plan, execute, verify, and iterate. The number of steps cannot be predicted in advance.</p>

<h3>Industrial grounding</h3>

<p><strong>Rufus.</strong> The user's input is a shopping query ("running shoes for flat feet under $100"). The required output is a product recommendation with justification. The information gap is narrow to medium: the answer exists in the product catalog, and the LLM needs to retrieve and synthesize it. The atomic tasks are retrieval plus generation. This is why Rufus works as a simple system: the information gap can be closed in one or two steps.</p>

<p><strong>Uber support bot.</strong> The user's input is a complaint ("I was charged twice for a cancelled ride"). The required output is a resolution. The information gap is wide: the system needs to extract the claim from natural language, retrieve trip and payment records, match the situation to policy, decide on an action, and execute or recommend that action. Each step introduces new information that reshapes the next step. The atomic tasks span extraction, retrieval, classification, reasoning, and potentially decision/action. The gap cannot be closed in a single step; the system must plan an information-gathering trajectory.</p>

<p>The key difference is not "chat about products" versus "chat about trips." It is that Rufus's I/O gap can be closed with a single retrieval step, while Uber's requires a multi-step information acquisition process where each step's output conditions the next step's input. Task difficulty is structural, not domain-specific.</p>

<h3>Two diagnostic questions</h3>

<div class="callout info">
  <p class="callout-title">Diagnosing your I/O gap</p>
  <p><strong>First:</strong> How many information-acquisition steps are needed to bridge the I/O gap for the median case? For the 95th percentile case? If the answer is one or two, you are in simple system territory. If it is three to five with dependencies between steps, you need a chain or light agentic behavior. If you cannot predict the number in advance because it depends on what the system discovers, you need a full agentic loop.</p>
  <p><strong>Second:</strong> Is the output verifiable from the input alone? If the user can check the output by looking at the input (summarization, translation, extraction), evaluation is cheap and the system is naturally self-correcting. If verifying the output requires independent investigation (was the refund correct? does the code actually work?), evaluation is expensive and errors persist longer. This verification difficulty is a better predictor of required system complexity than the surface-level task description.</p>
</div>

<div class="divider"></div>

<span class="section-kicker">Step 3 of 4</span>
<h2>3. Constraint Analysis: The Four Walls of Your Design Space</h2>

<p>Section 2 asked "what do you want to do, and how hard is it?" This section asks "what are you <em>not allowed</em> to do while doing it?" Constraints do not just limit your architecture; they can invalidate entire framing levels and force you into designs you would not otherwise choose.</p>

<h3><span class="wall-num">Wall 1 / 4</span> Latency</h3>

<p>Not all latency constraints are equal. The real question is: what is the user doing while waiting?</p>

<p><strong>Synchronous, inline (under 1&ndash;2 seconds).</strong> The user is mid-task and blocked. Autocomplete, inline suggestions, search results. Every additional LLM call is directly felt. An agentic loop is architecturally invalid here. This is Rufus's world: the user typed a query and is staring at the screen.</p>

<p><strong>Synchronous, conversational (2&ndash;30 seconds).</strong> The user is in a dialogue and expects a response but will tolerate a pause. Chat interfaces, support bots. You can afford a retrieval step plus generation, maybe a short chain. This is where most chatbots live, including Uber's bot for straightforward cases.</p>

<p><strong>Asynchronous, backgrounded (minutes to hours).</strong> The user kicked off a task and went to do something else. Code generation, report writing, deep research. Agentic loops are not just tolerated; they are expected. This is the world of Devin, Claude Code in headless mode, Deep Research.</p>

<div class="callout tip">
  <p class="callout-title">The underappreciated coupling</p>
  <p>Latency is not just about user patience. It determines feedback loop frequency. A one-second system gets corrected by the user every second. A ten-minute agentic run gets corrected zero times during execution. Longer latency means the system runs open-loop for longer, which means errors accumulate uncorrected. Latency and error surface are directly coupled.</p>
</div>

<h3><span class="wall-num">Wall 2 / 4</span> Cost</h3>

<p>Per-query cost scales roughly multiplicatively with system complexity. A back-of-envelope template:</p>

<div class="table-wrap">
<table>
  <thead>
    <tr>
      <th>System type</th>
      <th>Approx. LLM calls</th>
      <th>Cost relative to single call</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Single call</td>
      <td>1</td>
      <td><span class="cost-tier ct-low">1×</span></td>
    </tr>
    <tr>
      <td>Chain / pipeline</td>
      <td>2–5 + retrieval</td>
      <td><span class="cost-tier ct-mid">3–5×</span></td>
    </tr>
    <tr>
      <td>Agentic loop</td>
      <td>5–50 + tool executions</td>
      <td><span class="cost-tier ct-high">10–100×</span> <small style="color:var(--ep-muted)">highly variable</small></td>
    </tr>
    <tr>
      <td>Multi-agent</td>
      <td>Agentic × number of agents</td>
      <td><span class="cost-tier ct-max">Multiplied</span></td>
    </tr>
  </tbody>
</table>
</div>

<p><strong>The real cost analysis is not per-query; it is per-resolution.</strong> If a single-call system resolves 60% of cases and an agentic system resolves 90%, the question is whether the 30% improvement justifies the cost increase on all queries&mdash;or whether adaptive routing can capture the gains cheaply by sending simple cases to the single-call path and escalating only the hard 30%.</p>

<p><strong>Rufus</strong> handles millions of queries daily. At a penny per query, that is already tens of thousands of dollars a day. An agentic architecture at fifty cents per query would be financially non-viable at that scale, even if it produced better recommendations. Cost eliminates agentic framing before any other consideration.</p>

<p><strong>Uber's support bot</strong> operates at lower volume but higher value per resolution. A human agent costs five to fifteen dollars per interaction. If an agentic bot costs fifty cents to two dollars per interaction and resolves 70% of cases without human involvement, the economics are compelling. Cost <em>enables</em> a more complex system here because the comparison point is human labor, not a simpler bot.</p>

<h3><span class="wall-num">Wall 3 / 4</span> Privacy and Information Flow</h3>

<p>The conventional framing asks "what PII touches the LLM?" and "do you need on-prem?" These are important but shallow questions. The deeper question is about information flow topology: at each step, what data can the model see, infer, store, and transmit?</p>

<div class="authority-grid">

  <div class="authority-card ac-teal">
    <h4>See</h4>
    <p>Can the model access raw user data, or only anonymized versions? Can it see cross-user data? This is the most visible and most commonly audited dimension.</p>
  </div>

  <div class="authority-card ac-blue">
    <h4>Infer</h4>
    <p>Even without raw PII, can the model infer sensitive attributes from context? A trip from a hospital to a pharmacy implies health information. Late-night rides to a specific address imply personal relationships.</p>
  </div>

  <div class="authority-card ac-amber">
    <h4>Store</h4>
    <p>Does intermediate reasoning persist? In an agentic loop, the model's scratchpad may contain sensitive inferences that are more revealing than the raw data — and are rarely audited.</p>
  </div>

  <div class="authority-card ac-red">
    <h4>Transmit</h4>
    <p>When the model calls a tool or another agent, what leaks across that boundary? Multi-agent decomposition can create new information transmission channels that did not exist in a monolithic system.</p>
  </div>

</div>

<p>The more you can push PII resolution to deterministic pre-processing, the simpler your privacy architecture. Uber's bot does not necessarily need the LLM to see raw trip GPS coordinates. A pre-processing layer could resolve "the trip on Tuesday" to a trip ID, and the LLM only sees the ID, fare, and status. Rufus mostly avoids PII since product queries are rarely sensitive&mdash;another reason a simpler architecture is viable there.</p>

<h3><span class="wall-num">Wall 4 / 4</span> Safety, Reversibility, and Governance</h3>

<p>Every authority level requires a corresponding governance level. Advisory systems need output quality monitoring: evals for faithfulness, relevance, toxicity. Standard practice. Systems with bounded autonomy need policy compliance auditing: verification that the system followed policy on every action, requiring logging, trajectory auditing, and exception review. Significantly more infrastructure. Fully autonomous systems need comprehensive audit trails, adversarial testing, and real-time anomaly detection&mdash;the governance level of a financial trading system, not a chatbot.</p>

<div class="callout warn">
  <p class="callout-title">The constraint is often organizational, not technical</p>
  <p>If you cannot build or staff the governance infrastructure for a given authority level, you cannot responsibly operate at that level. Many teams have the technical capability to build agentic systems but not the operational capability to govern them.</p>
</div>

<h3>How the walls interact</h3>

<p>The four constraints do not operate independently. They create feasibility wedges.</p>

<p><strong>Latency plus cost:</strong> an agentic system might be the best solution, but if the latency wall forces sub-second response and the cost wall prevents parallelizing ten LLM calls, you are forced into a simple system regardless of task difficulty.</p>

<p><strong>Privacy plus authority:</strong> if the task requires accessing sensitive data <em>and</em> taking autonomous actions, you need both rigorous information flow control <em>and</em> action governance. This is the hardest design space, where most real enterprise systems live, and where most naive "just add agents" approaches fail.</p>

<p><strong>Cost plus governance:</strong> an agentic loop might be affordable per query, but trajectory auditing at scale can cost more than the LLM calls themselves. The hidden cost of agentic systems is not compute; it is monitoring.</p>

<div class="divider"></div>

<span class="section-kicker">Step 4 of 4</span>
<h2>4. Pareto Tradeoffs: You Cannot Maximize Everything</h2>

<p>Sections 2 and 3 defined what you want and what you are constrained by. This section confronts the uncomfortable truth: even within your feasible design space, you face genuine tradeoffs where improving one dimension necessarily degrades another. The goal is not to optimize; it is to choose your tradeoff position consciously.</p>

<h3>Axis 1: Response Quality vs. Latency and Cost (Deliberation Tradeoff)</h3>

<p>More reasoning, more tool calls, more verification steps produce better outputs. But each step adds latency and cost. This is the tradeoff between "think harder" and "answer faster."</p>

<p><strong>Rufus</strong> sits far toward speed. A product recommendation that takes five seconds is worse than a decent one in 500 milliseconds, because the user is mid-browse and will abandon. Amazon accepts a lower ceiling on recommendation quality in exchange for responsiveness at scale. There is a subtle cheat here: the <em>user</em> provides the additional deliberation. If the first recommendation is wrong, the user refines their query. The human-in-the-loop is the deliberation mechanism, and it is free to Amazon.</p>

<p><strong>Uber's bot</strong> can sit further toward quality. A support interaction that takes sixty seconds but correctly resolves the issue is far better than a five-second response that misclassifies the problem. The cost of a wrong fast answer (customer frustration, escalation to a human, potential churn) exceeds the cost of a slow correct answer. But there is a cliff: past two to three minutes, the user abandons the bot and calls support anyway, eliminating the cost savings.</p>

<h3>Axis 2: Autonomy vs. Control (Agency Tradeoff)</h3>

<p>More autonomy means the system can handle novel situations. More control means the system does exactly what you specified. You cannot maximize both.</p>

<p><strong>Rufus</strong> operates with high control and low autonomy. It follows a narrow loop: interpret query, retrieve products, generate recommendation. It does not decide to cross-reference competitor prices or proactively suggest alternatives to items already in your cart. This makes Rufus predictable and easy to audit, but it cannot handle genuinely complex shopping decisions.</p>

<p><strong>Uber's bot</strong> needs more autonomy because support cases are heterogeneous. A cancelled ride, a lost item, a safety incident, and a fare dispute all require different tools, policies, and workflows. A fully controlled system would need to enumerate every case type in advance. Autonomy allows handling novel or ambiguous cases&mdash;but the cost is unpredictability: the bot might apply the wrong policy, or handle a safety incident with the same routine as a fare dispute.</p>

<p><strong>The convergent design pattern.</strong> Both systems benefit from the same principle: <em>autonomy in reasoning, control at the action boundary.</em> Let the LLM be autonomous in understanding the situation (flexible interpretation, retrieval, reasoning). Clamp down at the action boundary (hardcoded policy checks, human approval for high-stakes actions, strict guardrails on what the system can execute). This captures the benefits of autonomy for understanding novel cases without the risks of autonomous action.</p>

<h3>Axis 3: Generality vs. Reliability (Scope Tradeoff)</h3>

<p>A general system handles more use cases but fails less gracefully on any particular one. A narrow system handles fewer cases but handles them very well.</p>

<p><strong>Rufus</strong> is narrow scope, high reliability within that scope. It answers shopping queries about Amazon products. It does not try to handle returns, account issues, or general knowledge questions. This narrowness enables tight optimization: retrieval tuned to the product catalog, generation tuned to product language, evaluation based on click-through and purchase conversion.</p>

<p><strong>Uber's bot</strong> must be broader because "customer support" spans dozens of issue types with different resolution paths. But generality creates a long tail of failure: the system handles the top ten issue types well but struggles with rare cases&mdash;edge cases in surge pricing policy, multi-leg international trips, or accessibility-related complaints. The 80th percentile case works; the 95th percentile case does not.</p>

<p>The production solution is <em>scoped generality</em>: be general within a defined boundary, and have hard fallbacks (escalation to human, graceful failure) for cases outside that boundary. The design question is where to draw the boundary, and that is an empirical question that most teams answer too late.</p>

<h3>The combined Pareto map</h3>

<div class="table-wrap">
<table class="pareto-table">
  <thead>
    <tr>
      <th>Tradeoff</th>
      <th>Rufus</th>
      <th>Uber Bot</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Quality vs. Speed</strong></td>
      <td>Speed wins. User provides deliberation.</td>
      <td>Quality wins, with a latency cliff at 2&ndash;3 minutes.</td>
    </tr>
    <tr>
      <td><strong>Autonomy vs. Control</strong></td>
      <td>Controlled reasoning, no action authority.</td>
      <td>Autonomous reasoning, controlled action boundary.</td>
    </tr>
    <tr>
      <td><strong>Generality vs. Reliability</strong></td>
      <td>Narrow and reliable. One job, done well.</td>
      <td>Scoped generality. Common cases handled; long tail escalated.</td>
    </tr>
    <tr>
      <td><strong>Net design posture</strong></td>
      <td>Simple, fast, cheap, predictable. Let the human do the hard part.</td>
      <td>Complex, careful, moderate cost, monitored. System does the hard part under governance.</td>
    </tr>
  </tbody>
</table>
</div>

<p>These positions are not technical preferences. They reflect business model alignment. Rufus exists to keep users browsing and buying; speed and low friction are paramount, and a wrong recommendation costs nothing. Uber's bot exists to replace a $5&ndash;15 human interaction; thoroughness and correctness are paramount, and a wrong resolution costs real money and trust. The Pareto position follows from the business model, not from the technology.</p>

<div class="divider"></div>

<h2>5. Putting It Together: The Framing Decision</h2>

<p>The preceding sections gave you four analytical lenses. Here is how to synthesize them into an actual requirements decision.</p>

<h3>Six forcing questions</h3>

<p>Rather than a checklist, use these as forcing functions that make implicit assumptions explicit:</p>

<div class="callout info">
  <p class="callout-title">The six forcing questions</p>
  <ol>
    <li><strong>Error surface.</strong> Where is uncertainty allowed to accumulate? Map the decision authority boundary: what does the system recommend versus what does it execute? What happens when it is wrong at each boundary?</li>
    <li><strong>I/O gap analysis.</strong> How many information-acquisition steps bridge input to required output, for the median case and the 95th percentile? If the number is unpredictable, you need agentic behavior.</li>
    <li><strong>Constraint feasibility.</strong> Which of the four walls (latency, cost, privacy, governance) eliminates which framing levels? Often a single constraint will kill an entire design direction before you consider tradeoffs.</li>
    <li><strong>Pareto position.</strong> Given your business model, where do you sit on each tradeoff axis? Which axis matters most? Optimize along that axis and satisfice the others.</li>
    <li><strong>Evaluation commitment.</strong> Given your framing, what can you actually measure? A simple system lets you benchmark output quality with standard metrics. An agentic system requires trajectory evaluation, counterfactual analysis, and tool-use auditing. If you cannot build the evaluation infrastructure your framing demands, you will be flying blind. Choose a framing you can evaluate, not just one you can build.</li>
    <li><strong>Routing policy.</strong> Will the system operate at a single complexity level, or adaptively route across levels? The true design problem is often not "which level?" but "what routing policy across levels, and what signal drives the routing?"</li>
  </ol>
</div>

<h3>A worked example: Uber's support bot through the six questions</h3>

<div class="worked-example">
  <p class="worked-example-kicker">Worked example &middot; Uber's support bot</p>

  <div class="we-item">
    <div class="we-q">1. Error surface</div>
    <div class="we-a">Advisory for informational queries (trip status, policy explanation). Bounded autonomy for simple actions (refunds under $20 on trips with clear cancellation records). Human escalation for ambiguous cases, high-value disputes, and safety-related incidents.</div>
  </div>

  <div class="we-item">
    <div class="we-q">2. I/O gap</div>
    <div class="we-a">"What is your cancellation policy?" — one-step gap (retrieval). "I was double-charged on a cancelled ride" — four-to-five-step gap (extract claim → retrieve trip → retrieve payment → match to policy → determine action). "My driver made me feel unsafe" — open-ended gap, mandatory escalation.</div>
  </div>

  <div class="we-item">
    <div class="we-q">3. Constraints</div>
    <div class="we-a">Latency: 2–30s tolerance, hard abandonment cliff at 2–3 minutes. Cost: must beat $5–15 human agent cost. Privacy: pre-process trip GPS to trip IDs, minimize raw location in LLM context. Governance: trajectory auditing for every autonomous refund, adversarial monitoring for refund gaming.</div>
  </div>

  <div class="we-item">
    <div class="we-q">4. Pareto position</div>
    <div class="we-a">Quality over speed (wrong resolutions cost more than slow ones). Autonomous reasoning, controlled action boundary. Scoped generality with hard escalation paths.</div>
  </div>

  <div class="we-item">
    <div class="we-q">5. Evaluation commitment</div>
    <div class="we-a">Output metrics: resolution rate, CSAT, re-contact rate. Trajectory metrics: policy compliance rate, action appropriateness, escalation accuracy. Both are required — invest in trajectory logging from day one.</div>
  </div>

  <div class="we-item">
    <div class="we-q">6. Routing policy</div>
    <div class="we-a">Simple informational queries → single call with retrieval. Standard issue types (cancellation, fare adjustment) → structured chain with policy lookup. Complex, ambiguous, or safety-related cases → human escalation. The routing classifier itself is a first-class component with its own evaluation cycle.</div>
  </div>
</div>

<h3>The adaptive routing insight</h3>

<p>The golden rule "start simple, escalate only when demonstrably needed" is directionally correct but incomplete. The modern production pattern is <em>adaptive routing</em>, not fixed escalation. You do not pick a level and live there. You build a routing policy that allocates each case to the appropriate level of deliberation based on estimated difficulty and stakes. The router itself becomes a first-class component of the system: it needs its own evaluation, its own failure modes, and its own iteration cycle.</p>

<p>When is one more deliberation step worth it? The intuition is straightforward: the expected quality gain from an additional step must exceed the sum of its incremental latency cost, failure risk, and monitoring burden. When that sum is positive, escalate. When it is negative, you are adding complexity for complexity's sake. Rufus-type queries almost never benefit from more deliberation. Uber-type disputes almost always do, up to a ceiling. The best systems learn where that ceiling is from data, not from intuition.</p>

<div class="divider"></div>

<h2>What Comes Next</h2>

<p>This post argued that the first design decision for an LLM system is not "how smart should it be?" but "where do uncertainty, agency, and accountability sit?" The four lenses&mdash;error surface, I/O gap, constraints, Pareto tradeoffs&mdash;and six forcing questions give you a framework for making that decision deliberately rather than by default.</p>

<p>But notice what we have not yet discussed: how to actually <em>build</em> the thing. The requirements and framing decision constrains the architecture to a narrow set of viable designs. Part 2 is about making that design explicit: decomposition patterns, memory architecture, tool integration, and the control flow decisions that turn a framing choice into a running system.</p>

<p>The paradox of this first step is that it is also the step you revisit most often. Requirements shift, models improve, cost structures change, and your evaluation data reveals that the I/O gap is different than you assumed.</p>

<p class="pull-quote">The six forcing questions are not a one-time exercise. They are a living document that evolves with your system.</p>

<div class="series-cta">
  <h3>This is Part 1 of a 7-part series.</h3>
  <p>The series overview laid out the seven-step lifecycle. This part covered requirements and problem framing: error surfaces, I/O gaps, constraints, and Pareto tradeoffs. Next: turning those decisions into a running architecture.</p>
  <p style="margin-top: 16px;"><a href="/blog/2026/llm-agents-blueprint/" style="color: rgba(250,248,245,0.8); border-bottom: 1px solid rgba(250,248,245,0.3);">&larr; Part 0: Series Overview</a></p>
</div>

</article>

<div class="post-end">
  <p class="post-author">Lorenzo Xiao &middot; Language Technologies Institute &middot; Carnegie Mellon University</p>
  <a href="/blog/">&larr; Back to blog</a>
</div>]]></content><author><name>Lorenzo Xiao</name></author><category term="blog" /><category term="academic" /><category term="AI" /><category term="engineering" /><summary type="html"><![CDATA[The first design decision is not how smart the system should be, but where uncertainty, agency, and accountability sit. A framework for making that choice deliberately.]]></summary></entry><entry><title type="html">RL for LLMs: The Reading List</title><link href="https://algoroxyolo.github.io/blog/2026/rl-reading-list/" rel="alternate" type="text/html" title="RL for LLMs: The Reading List" /><published>2026-03-22T12:00:00+00:00</published><updated>2026-03-22T12:00:00+00:00</updated><id>https://algoroxyolo.github.io/blog/2026/rl-reading-list</id><content type="html" xml:base="https://algoroxyolo.github.io/blog/2026/rl-reading-list/"><![CDATA[<header class="hero hero--compact">
  <div class="hero-badge">Reference &middot; 96 Papers</div>
  <h1>RL for LLMs: <em>The Reading List</em></h1>
  <p class="hero-sub">A curated taxonomy of 96 papers across algorithms, rewards, preferences, systems, and agents &mdash; with reading depth recommendations.</p>
  <p class="hero-meta">Lorenzo Xiao &middot; v8 &middot; Updated March 2026</p>
</header>

<article>

<div class="rl-legend">
  <span class="rl-legend-item"><span class="rl-g rl-ga">A</span> Deep read</span>
  <span class="rl-legend-item"><span class="rl-g rl-gam">A&minus;</span> Method + experiments</span>
  <span class="rl-legend-item"><span class="rl-g rl-gbp">B+</span> Problem + conclusion</span>
  <span class="rl-legend-item"><span class="rl-g rl-gb">B</span> Skim</span>
  <span class="rl-legend-item"><span class="rl-g rl-gc">C</span> Core idea only</span>
  <span class="rl-legend-item"><span class="rl-nv">NV</span> NVIDIA paper</span>
</div>

<div class="rl-stats">
  <span><strong>96</strong> papers</span>
  <span><strong>5</strong> categories</span>
  <span><strong>24</strong> sub-topics</span>
  <span style="flex:1"></span>
  <span>2017: <strong>1</strong></span>
  <span>2022: <strong>2</strong></span>
  <span>2023: <strong>1</strong></span>
  <span>2024: <strong>8</strong></span>
  <span>2025: <strong>54</strong></span>
  <span>2026: <strong>30</strong></span>
</div>

<!-- ═══════ 1. ALGORITHMS ═══════ -->
<details class="rl-cat rl-cat--purple" open>
<summary>Algorithms <span class="rl-cat-count">26 papers</span></summary>

<details class="rl-sub">
<summary>Core RL: PPO &rarr; GRPO &rarr; DAPO &rarr; CISPO &rarr; MaxRL &rarr; DPPO <span class="rl-sub-count">13</span></summary>
<div class="rl-papers">
  <div class="rl-p"><span class="rl-y">2017</span><span class="rl-n"><a href="https://arxiv.org/abs/1707.06347" target="_blank">PPO</a><div class="rl-tip">Know ratio clipping + trust region</div></span><span class="rl-badges"><span class="rl-g rl-gb">B</span></span></div>
  <div class="rl-p"><span class="rl-y">2022</span><span class="rl-n"><a href="https://arxiv.org/abs/2203.02155" target="_blank">InstructGPT</a><div class="rl-tip">RM + PPO + KL penalty; shared with &sect;3.1</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2024</span><span class="rl-n"><a href="https://arxiv.org/abs/2402.14740" target="_blank">RLOO / Revisiting REINFORCE</a></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2024</span><span class="rl-n"><a href="https://arxiv.org/abs/2402.03300" target="_blank">DeepSeekMath (GRPO origin)</a><div class="rl-tip">Start your interview story here</div></span><span class="rl-badges"><span class="rl-g rl-ga">A</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2501.12948" target="_blank">DeepSeek-R1</a></span><span class="rl-badges"><span class="rl-g rl-ga">A</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2503.14476" target="_blank">DAPO</a><div class="rl-tip">Recipe + system-level scaling</div></span><span class="rl-badges"><span class="rl-g rl-ga">A</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2507.18071" target="_blank">GSPO</a></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2511.20347" target="_blank">SAPO</a></span><span class="rl-badges"><span class="rl-g rl-gb">B</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2506.13585" target="_blank">MiniMax-M1 / CISPO</a><div class="rl-tip">Changed clipping target &mdash; key interview talking point; also notable for long-context RL (see &sect;4.3)</div></span><span class="rl-badges"><span class="rl-g rl-ga">A</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2503.01491" target="_blank">PPO Collapse in Long-CoT</a></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://openreview.net/forum?id=5PAF7PAY2Y" target="_blank">Dr. GRPO</a><div class="rl-tip">Fixes length bias + std normalization</div></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2602.02710" target="_blank">MaxRL</a><div class="rl-tip">pass@k objective; RL&ndash;MLE continuum</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2602.04879" target="_blank">DPPO</a><div class="rl-tip">TV/KL divergence trust region</div></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
</div>
</details>

<details class="rl-sub">
<summary>Scaling Laws &amp; Meta <span class="rl-sub-count">5</span></summary>
<div class="rl-papers">
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2505.09388" target="_blank">Qwen3 Technical Report</a><div class="rl-tip">Thinking/non-thinking unified framework + thinking budget + distillation recipe; source of Nemotron 3 reasoning budget control</div></span><span class="rl-badges"><span class="rl-g rl-ga">A</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2510.13786" target="_blank">ScaleRL / The Art of Scaling RL Compute</a></span><span class="rl-badges"><span class="rl-g rl-ga">A</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2509.25300" target="_blank">Scaling Behaviors of LLM RL Post-Training</a><div class="rl-tip">Power-law: model scale &times; data &times; compute; complements ScaleRL</div></span><span class="rl-badges"><span class="rl-g rl-gb">B</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2505.24864" target="_blank">ProRL</a><div class="rl-tip">Steps scaling dimension; prerequisite to ScaleRL</div></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span><span class="rl-nv">NV</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2510.01180" target="_blank">BroRL</a><div class="rl-tip">Rollouts scaling dimension; complementary to ProRL</div></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span><span class="rl-nv">NV</span></span></div>
</div>
</details>

<details class="rl-sub">
<summary>Cascade &amp; Stage-wise RL (Nemotron) <span class="rl-sub-count">4</span></summary>
<div class="rl-papers">
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2505.16400" target="_blank">AceReason-Nemotron 1.1</a></span><span class="rl-badges"><span class="rl-g rl-ga">A</span><span class="rl-nv">NV</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2512.13607" target="_blank">Nemotron-Cascade 1</a><div class="rl-tip">Foundational motivation for Cascade 2</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span><span class="rl-nv">NV</span></span></div>
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf" target="_blank">Nemotron 3 Super Technical Report</a><div class="rl-tip">LatentMoE + NVFP4 + multi-env simultaneous RL + reasoning budget control</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span><span class="rl-nv">NV</span></span></div>
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2603.19220" target="_blank">Nemotron-Cascade 2</a><div class="rl-tip">IMO/IOI/ICPC gold medals; 30B MoE; on-policy distillation</div></span><span class="rl-badges"><span class="rl-g rl-ga">A</span><span class="rl-nv">NV</span></span></div>
</div>
</details>

<details class="rl-sub">
<summary>Distillation (on-policy / self / context) <span class="rl-sub-count">4</span></summary>
<div class="rl-papers">
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://thinkingmachines.ai/blog/on-policy-distillation/" target="_blank">On-Policy Distillation (Thinking Machines)</a><div class="rl-tip">Clearest explanation of compute tradeoffs</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2601.02780" target="_blank">MiMo-V2-Flash</a><div class="rl-tip">Multi-teacher on-policy distillation; core Cascade 2 technique</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2601.18734" target="_blank">Self-Distilled Reasoner</a><div class="rl-tip">Teacher = self with privileged information</div></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2602.12275" target="_blank">On-Policy Context Distillation</a></span><span class="rl-badges"><span class="rl-g rl-gb">B</span></span></div>
</div>
</details>

</details>

<!-- ═══════ 2. REWARD MODELING ═══════ -->
<details class="rl-cat rl-cat--coral" open>
<summary>Reward Modeling <span class="rl-cat-count">20 papers</span></summary>

<details class="rl-sub">
<summary>Generative Reward Models <span class="rl-sub-count">5</span></summary>
<div class="rl-papers">
  <div class="rl-p"><span class="rl-y">2024</span><span class="rl-n"><a href="https://arxiv.org/abs/2408.15240" target="_blank">Generative Verifiers: RM as Next-Token Prediction</a><div class="rl-tip">346 citations; GRM origin; yes/no token + majority vote</div></span><span class="rl-badges"><span class="rl-g rl-ga">A</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2505.11475" target="_blank">HelpSteer3-Preference</a><div class="rl-tip">40k samples; NVIDIA latest; RM-Bench SOTA</div></span><span class="rl-badges"><span class="rl-g rl-ga">A</span><span class="rl-nv">NV</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2504.02495" target="_blank">DeepSeek-GRM / SPCT</a><div class="rl-tip">SPCT online RL for GRM training; 193 citations</div></span><span class="rl-badges"><span class="rl-g rl-ga">A</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2505.02387" target="_blank">RM-R1: Reward Modeling as Reasoning</a><div class="rl-tip">Reason first, then score; 101 citations</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2602.12116" target="_blank">P-GenRM (ICLR 2026 Oral)</a><div class="rl-tip">Personalized GRM &mdash; directly relevant to persona research</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
</div>
</details>

<details class="rl-sub">
<summary>Process Reward Models (PRM) <span class="rl-sub-count">7</span></summary>
<div class="rl-papers">
  <div class="rl-p"><span class="rl-y">2024</span><span class="rl-n"><a href="https://openreview.net/forum?id=A6Y7AqlzLW" target="_blank">PAV / Rewarding Progress</a><div class="rl-tip">ICLR; 225 citations; process reward = advantage</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2501.07301" target="_blank">Lessons of Developing PRMs</a><div class="rl-tip">Qwen/Tongyi; PRM training practice: what works, annotation, stability</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2504.00125" target="_blank">ThinkPRM</a><div class="rl-tip">Long CoT verifier; 58 citations</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2504.15221" target="_blank">GenPRM</a><div class="rl-tip">PRM test-time generative reasoning; 21 citations</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2506.18896" target="_blank">ReasonFlux-PRM</a><div class="rl-tip">NeurIPS 2025 Spotlight; trajectory-aware step + trajectory dual supervision</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2503.06266" target="_blank">R-PRM</a></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2603.02479" target="_blank">PRISM</a></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
</div>
</details>

<details class="rl-sub">
<summary>Rubrics-as-Rewards <span class="rl-sub-count">4</span></summary>
<div class="rl-papers">
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2503.11042" target="_blank">Rubrics as Rewards (RaR)</a><div class="rl-tip">111 citations; rubrics structure subjective preferences</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2504.00831" target="_blank">OpenRubrics</a></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2602.01454" target="_blank">Rubric-ARM</a><div class="rl-tip">Alternating optimization of rubric generator + judge</div></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2601.22975" target="_blank">Golden Goose</a><div class="rl-tip">Non-verifiable &rarr; verifiable bridge; high value for research narrative</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span><span class="rl-nv">NV</span></span></div>
</div>
</details>

<details class="rl-sub">
<summary>RM Evaluation &amp; Benchmarks <span class="rl-sub-count">4</span></summary>
<div class="rl-papers">
  <div class="rl-p"><span class="rl-y">2024</span><span class="rl-n"><a href="https://arxiv.org/abs/2410.14872" target="_blank">How to Evaluate RM for RLHF / PPE</a><div class="rl-tip">61 citations; offline-online gap</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2024</span><span class="rl-n"><a href="https://arxiv.org/abs/2503.04378" target="_blank">HelpSteer3 dataset</a><div class="rl-tip">Feedback + edit; inference-time scaling</div></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span><span class="rl-nv">NV</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2506.01937" target="_blank">RewardBench 2</a><div class="rl-tip">63 citations; reward hacking</div></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2603.12963" target="_blank">Long-form RewardBench</a></span><span class="rl-badges"><span class="rl-g rl-gb">B</span></span></div>
</div>
</details>

</details>

<!-- ═══════ 3. RLHF / PREFERENCE ═══════ -->
<details class="rl-cat rl-cat--pink" open>
<summary>RLHF &amp; Preference Optimization <span class="rl-cat-count">13 papers</span></summary>

<details class="rl-sub">
<summary>Foundations <span class="rl-sub-count">3</span></summary>
<div class="rl-papers">
  <div class="rl-p"><span class="rl-y">2022</span><span class="rl-n"><a href="https://arxiv.org/abs/2203.02155" target="_blank">InstructGPT</a><div class="rl-tip">Shared with &sect;1.1</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2022</span><span class="rl-n"><a href="https://arxiv.org/abs/2204.05862" target="_blank">Bai et al. / HH-RLHF</a><div class="rl-tip">Safety alignment; HH-RLHF dataset</div></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
  <div class="rl-p"><span class="rl-y">2023</span><span class="rl-n"><a href="https://arxiv.org/abs/2305.18290" target="_blank">DPO</a><div class="rl-tip">Know why it eventually fell short</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
</div>
</details>

<details class="rl-sub">
<summary>From Simplification to Online RL <span class="rl-sub-count">4</span></summary>
<div class="rl-papers">
  <div class="rl-p"><span class="rl-y">2024</span><span class="rl-n"><a href="https://arxiv.org/abs/2405.14734" target="_blank">SimPO</a><div class="rl-tip">DPO simplification endpoint</div></span><span class="rl-badges"><span class="rl-g rl-gb">B</span></span></div>
  <div class="rl-p"><span class="rl-y">2024</span><span class="rl-n"><a href="https://arxiv.org/abs/2405.07863" target="_blank">Online Iterative RLHF</a><div class="rl-tip">Turning point: online >> offline DPO</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://openreview.net/forum?id=FhTAG591Ve" target="_blank">Asynchronous RLHF (ICLR 2025)</a></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2512.13961" target="_blank">OLMo 3 (SFT-DPO-RL-RL Zero)</a><div class="rl-tip">Fully transparent &amp; reproducible; all data/code/checkpoints</div></span><span class="rl-badges"><span class="rl-g rl-ga">A</span></span></div>
</div>
</details>

<details class="rl-sub">
<summary>Multi-turn / Social / Creative RLHF <span class="rl-sub-count">6</span></summary>
<div class="rl-papers">
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2511.02208" target="_blank">PPP: Proactive and Personalized Agents</a><div class="rl-tip">Three-objective RL; UserVille benchmark</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2509.25137" target="_blank">RL from User Conversations</a><div class="rl-tip">Persona-conditioned rewards</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2508.18642" target="_blank">RLMR: Mixed Rewards for Creative Writing</a><div class="rl-tip">Subjective aesthetic RM + objective constraint verifier</div></span><span class="rl-badges"><span class="rl-g rl-gb">B</span></span></div>
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2601.21459" target="_blank">HER: RL for Role-playing</a><div class="rl-tip">Dual-layer thinking; relevant to InCharacter work</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2602.03109" target="_blank">OMAR (One Model All Roles)</a></span><span class="rl-badges"><span class="rl-g rl-gb">B</span></span></div>
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2603.09249" target="_blank">Social-R1</a></span><span class="rl-badges"><span class="rl-g rl-gb">B</span></span></div>
</div>
</details>

</details>

<!-- ═══════ 4. SYSTEMS ═══════ -->
<details class="rl-cat rl-cat--teal" open>
<summary>Systems <span class="rl-cat-count">17 papers</span></summary>

<details class="rl-sub">
<summary>Sync vs Async Training <span class="rl-sub-count">6</span></summary>
<div class="rl-papers">
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2506.10910" target="_blank">Magistral</a></span><span class="rl-badges"><span class="rl-g rl-ga">A</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2505.24298" target="_blank">AReaL</a></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://lmsys.org/blog/2025-07-09-slime/" target="_blank">slime (SGLang RL blog)</a></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2512.06547" target="_blank">A-3PO</a><div class="rl-tip">Decoupled PPO; independent dimension</div></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2601.12784" target="_blank">StaleFlow</a></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2603.01501" target="_blank">GAC</a></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
</div>
</details>

<details class="rl-sub">
<summary>Train-Inference Mismatch &amp; RL Stability <span class="rl-sub-count">9</span></summary>
<div class="rl-papers">
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2512.01374" target="_blank">Stabilizing RL with LLMs</a><div class="rl-tip">IS correction + Routing Replay theory</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2506.09501" target="_blank">Give Me FP32 or Give Me Death</a></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/" target="_blank">Defeating Nondeterminism (Thinking Machines)</a></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://lmsys.org/blog/2025-09-22-sglang-deterministic/" target="_blank">SGLang Deterministic Inference</a></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2510.11370" target="_blank">R3: MoE Routing Replay</a><div class="rl-tip">Theoretically explained by Stabilizing RL paper</div></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2510.26788" target="_blank">FP16 Mismatch</a><div class="rl-tip">Counter-intuitive: train/infer path consistency matters more</div></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2511.17826" target="_blank">TP Sizes Determinism</a></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://lmsys.org/blog/2025-11-25-fp8-rl/" target="_blank">Unified FP8 (SGLang blog)</a></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2512.02556" target="_blank">DeepSeek-V3.2</a></span><span class="rl-badges"><span class="rl-g rl-gb">B</span></span></div>
</div>
</details>

<details class="rl-sub">
<summary>Long-Context RL <span class="rl-sub-count">2</span></summary>
<div class="rl-papers">
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2510.11967" target="_blank">Context-Folding</a><div class="rl-tip">Long-horizon agent context compression/folding management</div></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2512.12967" target="_blank">QwenLong-L1.5</a></span><span class="rl-badges"><span class="rl-g rl-gb">B</span></span></div>
</div>
</details>

</details>

<!-- ═══════ 5. TASKS & AGENTS ═══════ -->
<details class="rl-cat rl-cat--amber" open>
<summary>Tasks &amp; Agents <span class="rl-cat-count">21 papers</span></summary>

<details class="rl-sub">
<summary>Coding Agents <span class="rl-sub-count">2</span></summary>
<div class="rl-papers">
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2603.00729" target="_blank">Qwen3-Coder-Next</a></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2602.03411" target="_blank">SWE-Master</a></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
</div>
</details>

<details class="rl-sub">
<summary>Deep Research &amp; Search Agents <span class="rl-sub-count">4</span></summary>
<div class="rl-papers">
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2503.09516" target="_blank">Search-R1</a><div class="rl-tip">790 citations</div></span><span class="rl-badges"><span class="rl-g rl-ga">A</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2510.24701" target="_blank">Tongyi DeepResearch</a></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2601.19578" target="_blank">Yunque DeepResearch</a></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2602.19526" target="_blank">How to Train DR Agent</a></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
</div>
</details>

<details class="rl-sub">
<summary>Computer-Use &amp; GUI Agents <span class="rl-sub-count">3</span></summary>
<div class="rl-papers">
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2601.20380" target="_blank">OmegaUse</a></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2602.22190" target="_blank">GUI-Libra</a></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2508.14040" target="_blank">ComputerRL</a></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
</div>
</details>

<details class="rl-sub">
<summary>Tool-Use RL <span class="rl-sub-count">3</span></summary>
<div class="rl-papers">
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2504.13958" target="_blank">ToolRL</a><div class="rl-tip">191 citations; reward design textbook</div></span><span class="rl-badges"><span class="rl-g rl-ga">A</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2505.00949" target="_blank">Nemotron-Tool-N1</a><div class="rl-tip">NVIDIA; binary reward; must-discuss in interviews</div></span><span class="rl-badges"><span class="rl-g rl-ga">A</span><span class="rl-nv">NV</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2504.09442" target="_blank">ReTool</a><div class="rl-tip">231 citations; cold-start synthesis + outcome RL</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
</div>
</details>

<details class="rl-sub">
<summary>Generalist Agent RL Frameworks <span class="rl-sub-count">5</span></summary>
<div class="rl-papers">
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2602.02488" target="_blank">RLAnything</a><div class="rl-tip">Joint optimization of env + policy + RM</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2510.04206" target="_blank">AGENTRL</a><div class="rl-tip">Async + cross-policy sampling</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://moonshotai.github.io/Kimi-K2.5/" target="_blank">Kimi K2.5 Agent Swarm</a><div class="rl-tip">Multi-agent orchestration for agentic tasks</div></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2505.16421" target="_blank">WebAgent-R1</a><div class="rl-tip">73 citations; end-to-end web-agent RL</div></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2603.18815" target="_blank">ProRL Agent</a><div class="rl-tip">NVIDIA; rollout-as-a-service for multi-turn agent RL</div></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span><span class="rl-nv">NV</span></span></div>
</div>
</details>

<details class="rl-sub">
<summary>Agent RL Challenges &amp; Self-Evolution <span class="rl-sub-count">4</span></summary>
<div class="rl-papers">
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2504.20073" target="_blank">RAGEN</a><div class="rl-tip">148 citations; Echo Trap &mdash; why agent RL fails</div></span><span class="rl-badges"><span class="rl-g rl-ga">A</span></span></div>
  <div class="rl-p"><span class="rl-y">2026</span><span class="rl-n"><a href="https://arxiv.org/abs/2601.15839" target="_blank">iStar</a><div class="rl-tip">ICLR 2026; implicit PRM for credit assignment</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2502.10325" target="_blank">AgentPRM</a><div class="rl-tip">MC rollout actor-critic; 29 citations</div></span><span class="rl-badges"><span class="rl-g rl-gbp">B+</span></span></div>
  <div class="rl-p"><span class="rl-y">2025</span><span class="rl-n"><a href="https://arxiv.org/abs/2505.03335" target="_blank">Absolute Zero</a><div class="rl-tip">161 citations; zero-data self-play</div></span><span class="rl-badges"><span class="rl-g rl-gam">A&minus;</span></span></div>
</div>
</details>

</details>

<div class="divider"></div>

<p style="font-size: 0.82rem; color: var(--ep-muted); text-align: center;">Click subsection headers to expand. 5 categories &middot; 24 sub-topics &middot; Updated March 2026.</p>

</article>

<button class="rl-top" id="rl-top" aria-label="Back to top">&uarr;</button>

<div class="post-end">
  <p class="post-author">Lorenzo Xiao &middot; Language Technologies Institute &middot; Carnegie Mellon University</p>
  <a href="/blog/">&larr; Back to blog</a>
</div>

<script>
(function(){
  var btn = document.getElementById('rl-top');
  if (!btn) return;
  window.addEventListener('scroll', function() {
    btn.classList.toggle('visible', window.scrollY > 600);
  }, {passive: true});
  btn.addEventListener('click', function() {
    window.scrollTo({top: 0, behavior: 'smooth'});
  });
})();
</script>]]></content><author><name>Lorenzo Xiao</name></author><category term="blog" /><category term="academic" /><category term="AI" /><category term="reinforcement-learning" /><summary type="html"><![CDATA[96 papers across algorithms, rewards, preferences, systems, and agents — organized into 5 categories with reading depth recommendations.]]></summary></entry><entry><title type="html">How I Learned RL for LLMs: A Researcher’s Detour in Five Parts</title><link href="https://algoroxyolo.github.io/blog/2026/rl-for-llms-part0/" rel="alternate" type="text/html" title="How I Learned RL for LLMs: A Researcher’s Detour in Five Parts" /><published>2026-03-22T00:00:00+00:00</published><updated>2026-03-22T00:00:00+00:00</updated><id>https://algoroxyolo.github.io/blog/2026/rl-for-llms-part0</id><content type="html" xml:base="https://algoroxyolo.github.io/blog/2026/rl-for-llms-part0/"><![CDATA[<header class="hero hero--editorial">
  <div class="hero-badge">Series &middot; Part 0 of 5</div>
  <h1>How I Learned <em>RL for LLMs</em></h1>
  <p class="hero-sub">An evaluation researcher's honest map through the algorithms, rewards, and systems of reinforcement learning for language models.</p>
  <p class="hero-meta">Lorenzo Xiao &middot; Language Technologies Institute, CMU &middot; March 2026</p>
</header>

<article>

<p class="lead">I will be honest about how this series started, because the origin story matters for understanding what it is and what it is not.</p>

<h2>How This Started</h2>

<p>I am an NLP researcher. My work has been on evaluation, specifically on non-verifiable tasks: persona consistency, cultural appropriateness, anthropomorphism&mdash;the kinds of things where "correctness" is not a number you can look up in the back of a textbook. I have spent the past few years building benchmarks, designing evaluation protocols, and thinking carefully about what it means to measure something that is inherently subjective. I thought this was my path. I intended to keep going.</p>

<p>Then I did not get into any of the PhD programs I applied to.</p>

<p>That was the moment I had to sit with an uncomfortable realization: most of the skills I had acquired in creating benchmarks do not transfer cleanly to industry. I knew how to design evaluation frameworks. I could tell you whether a benchmark was measuring what it claimed to measure. But I could not tell you how a post-training pipeline actually works end to end, how reward signals flow into policy updates, or why a training run that looks fine on paper collapses in practice. The evaluation side of the house was where I lived. The training side was a foreign country.</p>

<p>So I made a decision. If I was going to be competitive for post-training teams in industry, I needed to learn reinforcement learning. Not the textbook version. The version that people are actually using right now, in 2025 and 2026, to make language models reason, use tools, write code, and hold conversations.</p>

<p>This blog series is the documentation of that learning process.</p>

<div class="divider"></div>

<h2>What This Series Is</h2>

<p>I want to be clear about two things.</p>

<p>First, <strong>I am writing this to make sure I can clearly explain what I have learned.</strong> The act of writing forces a kind of precision that reading alone does not. If I cannot write a coherent paragraph about why DAPO's asymmetric clipping matters, I probably do not actually understand it yet. Some of what I write will contain mistakes. I am learning in public, and I sincerely welcome corrections. If you spot an error, please let me know. You will be doing me a genuine favor.</p>

<p>Second, <strong>I am writing this for people from a similar background.</strong> If you are an NLP researcher who has spent more time on evaluation than on training, if you know what a good benchmark looks like but have never debugged a reward hacking failure, if you are pivoting toward post-training work and feeling the gap between what you know and what you need to know: this series is for you. I am not writing from a position of expertise. I am writing from a position of "I figured this out six months ago and here is how I organized it in my head."</p>

<p>This series would not have existed without the help of <a href="https://x.com/sun_hanchi" target="_blank">Hanchi Sun</a>, who patiently walked me through the parts I could not figure out from papers alone.</p>

<div class="divider"></div>

<h2>The Five Parts</h2>

<p>Each part ends with a question that the next part answers. This is deliberate. The series is meant to build, not to be a disconnected collection of literature reviews.</p>

<div class="roadmap">

  <div class="roadmap-card rc-1">
    <div class="roadmap-num">01</div>
    <div class="roadmap-body">
      <h3>The Algorithm Zoo</h3>
      <p>REINFORCE, PPO, GRPO, and the lineage of papers that fix what GRPO got wrong: Dr. GRPO, DAPO, CISPO, MaxRL, DPPO. The organizing question for the entire section: <em>how does each subsequent paper fix GRPO?</em> GRPO removed the critic, introduced group-relative baselines, and made large-scale reasoning RL tractable. But it also introduced subtle biases&mdash;length normalization, standard deviation weighting, symmetric clipping, token-level loss aggregation&mdash;that turned out to be consequential. Every paper that came after is asking the same question: what did GRPO get wrong, and how do we fix it without breaking what it got right?</p>
      <div class="roadmap-bridge">&darr; "These algorithms all optimize a reward signal. But where does that signal come from?"</div>
      <span class="roadmap-tag">Coming in Part 1</span>
    </div>
  </div>

  <div class="roadmap-card rc-2">
    <div class="roadmap-num">02</div>
    <div class="roadmap-body">
      <h3>The Reward Problem</h3>
      <p>The current RL-for-LLM literature is dominated by tasks where reward is cheap and unambiguous: math problems with checkable answers, code with executable test suites. The moment you step outside those domains&mdash;into summarization, creative writing, open-ended dialogue, cultural sensitivity&mdash;the reward problem becomes the <em>hard</em> problem. This part covers generative reward models that reason before scoring (DeepSeek-GRM, RM-R1), process reward models that evaluate intermediate steps (PAV, ThinkPRM), rubrics-as-rewards for structuring subjective preferences, and the evaluation benchmarks (RewardBench and its successors) that make RM development iterative. This is the section closest to my original research, and honestly the one I found most intellectually exciting.</p>
      <div class="roadmap-bridge">&darr; "Reward models approximate human judgment. But what if we could learn directly from preferences without an explicit reward model?"</div>
      <span class="roadmap-tag">Coming in Part 2</span>
    </div>
  </div>

  <div class="roadmap-card rc-3">
    <div class="roadmap-num">03</div>
    <div class="roadmap-body">
      <h3>From Preferences to Alignment</h3>
      <p>DPO simplifies RLHF by removing the reward model. It works well for single-turn preference alignment, and then it starts to plateau in settings that require multi-turn coherence, long-term persona consistency, or reward signals that are noisy and delayed. What I explore here: how do we bring modern RL techniques into the tasks I actually care about? Multi-turn dialogue where the reward is not "did you get the math right" but "did you maintain character across twenty turns." Creative writing where the failure mode is not incorrectness but blandness. PPP for proactive personalized agents, HER for dual-layer role-playing thinking, OMAR for multi-role self-play&mdash;papers that point toward where the field is heading once the math-and-code gold rush settles.</p>
      <div class="roadmap-bridge">&darr; "We know what to optimize and how. But can any of this actually run at scale?"</div>
      <span class="roadmap-tag">Coming in Part 3</span>
    </div>
  </div>

  <div class="roadmap-card rc-4">
    <div class="roadmap-num">04</div>
    <div class="roadmap-body">
      <h3>Making It Work: Systems</h3>
      <p>The section I was most tempted to skip and most glad I did not. A beautiful algorithm that requires synchronous rollout-then-update will lose to a mediocre algorithm running on async infrastructure that keeps GPUs utilized, if the wall-clock time difference is large enough. But this section is about more than async versus sync. Why does MoE routing behave differently during rollout than during training? Why can FP16 rounding in your inference kernel silently corrupt importance-sampling ratios? Why does deterministic inference across different tensor-parallel sizes require explicit engineering? These are the questions that separate "I read the GRPO paper" from "I could actually help debug a training run."</p>
      <div class="roadmap-bridge">&darr; "With the full stack understood, where is RL for LLMs actually heading?"</div>
      <span class="roadmap-tag">Coming in Part 4</span>
    </div>
  </div>

  <div class="roadmap-card rc-5">
    <div class="roadmap-num">05</div>
    <div class="roadmap-body">
      <h3>The Agent Frontier</h3>
      <p>RL is no longer just for math and code. This part covers coding agents (Qwen3-Coder-Next, SWE-Master), deep research agents (Search-R1, Tongyi DeepResearch), computer-use agents (ComputerRL, GUI-Libra), and the emerging generalist agentic RL systems that attempt to unify all of these under one training framework (RLAnything, AGENTRL, iStar). A dedicated section covers NVIDIA's Nemotron-Tool-N1, which takes the minimalist approach of binary reward for tool-calling correctness and shows it works surprisingly well. Sometimes the answer to "how do we design reward for agentic tasks?" is "just check if the tool call was correct and let RL figure out the rest."</p>
      <span class="roadmap-tag">Coming in Part 5</span>
    </div>
  </div>

</div>

<div class="divider"></div>

<h2>How to Read This Series</h2>

<p>Each post is self-contained. You can read Part 2 (Rewards) without reading Part 1 (Algorithms), though I cross-reference when the same concept appears in multiple places.</p>

<p>Within each post, every paper gets a reading-depth recommendation:</p>

<div class="depth-legend">
  <span class="depth-label da">A</span>
  <span class="depth-desc"><strong>Deep read.</strong> Explain the method, the training pipeline, the failure modes, and why it was necessary.</span>
  <span class="depth-label dam">A&minus;</span>
  <span class="depth-desc"><strong>Focused read.</strong> Abstract, method figure, key tables, and ablation. Know the core contribution well enough to discuss it.</span>
  <span class="depth-label db">B</span>
  <span class="depth-desc"><strong>Skim.</strong> Know the problem setting, the main conclusion, and how it connects to the mainline.</span>
  <span class="depth-label dc">C</span>
  <span class="depth-desc"><strong>Awareness only.</strong> Know it exists and what niche it fills.</span>
</div>

<p>There are <strong>96 papers</strong> across the five parts, <strong>17 rated A and 39 rated A&minus;</strong>.</p>

<div class="callout tip">
  <p class="callout-title">If you only have three hours</p>
  <p>Read seven papers that cover the skeleton of the entire story: <strong>DeepSeekMath</strong> (Part 1), <strong>DeepSeek-R1</strong> (Part 1), <strong>DAPO</strong> (Part 1), <strong>DeepSeek-GRM</strong> (Part 2), <strong>OLMo 3</strong> (Part 3), <strong>Magistral</strong> (Part 4), and <strong>Search-R1</strong> (Part 5).</p>
</div>

<div class="divider"></div>

<h2>One Last Thing</h2>

<p>I want to acknowledge something that I think more people should say out loud: getting rejected from every program you applied to is a specific kind of painful. It makes you question whether the work you have done matters, whether the skills you have built are real, whether the direction you chose was the right one.</p>

<p>I do not have a neat resolution to that story yet. What I have is this: the process of learning RL for LLMs, of building this map from scratch, of forcing myself to understand systems and algorithms and reward design that were outside my comfort zone, has been the most intellectually alive I have felt in a long time.</p>

<p>And somewhere along the way, I realized something that changed how I think about my own research: <strong>evaluation and training are not separate worlds.</strong> A good reward model <em>is</em> an evaluation. A good benchmark <em>is</em> a reward signal waiting to be operationalized. The skills transfer. They just transfer in directions I did not expect.</p>

<p>Part 1 goes up next. We start with REINFORCE, because everything else is a footnote to REINFORCE, and end with MaxRL and DPPO, the two 2026 papers that suggest the algorithmic story is far from over.</p>

<div class="series-cta">
  <h3>This is Part 0 of a 5-part series.</h3>
  <p>Each subsequent post takes one layer of the RL-for-LLMs stack and goes deep: algorithms, rewards, preferences, systems, and agents. The parts build on each other, but each can be read independently.</p>
</div>

</article>

<div class="post-end">
  <p class="post-author">Lorenzo Xiao &middot; Language Technologies Institute &middot; Carnegie Mellon University</p>
  <a href="/blog/">&larr; Back to blog</a>
</div>]]></content><author><name>Lorenzo Xiao</name></author><category term="blog" /><category term="academic" /><category term="AI" /><category term="reinforcement-learning" /><summary type="html"><![CDATA[An NLP evaluation researcher's honest map through the algorithmic, reward, and systems landscape of reinforcement learning for language models.]]></summary></entry><entry><title type="html">A Systems Engineering Approach to LLM Agents — Series Overview</title><link href="https://algoroxyolo.github.io/blog/2026/llm-agents-blueprint/" rel="alternate" type="text/html" title="A Systems Engineering Approach to LLM Agents — Series Overview" /><published>2026-03-14T00:00:00+00:00</published><updated>2026-03-14T00:00:00+00:00</updated><id>https://algoroxyolo.github.io/blog/2026/llm-agents-blueprint</id><content type="html" xml:base="https://algoroxyolo.github.io/blog/2026/llm-agents-blueprint/"><![CDATA[<header class="hero">
  <div class="hero-badge">Series · Part 0 of 7</div>
  <h1>A Systems Engineering Approach to <em>LLM Agents</em></h1>
  <p class="hero-sub">Everyone's building agents. Almost nobody is engineering them. This series is about closing that gap.</p>
  <p class="hero-meta">Lorenzo Xiao &middot; Language Technologies Institute, CMU &middot; March 2026</p>
</header>

<article>

<p class="lead">There is a widening disconnect in the LLM agent space. On one end, researchers publish increasingly sophisticated architectures: multi-agent debate, self-reflective planning, tool-augmented reasoning chains. On the other, practitioners are shipping agents held together by a system prompt and a prayer. The missing piece is not more research. It is engineering discipline.</p>

<p>I spent the past months compiling a comprehensive design framework for LLM-based agentic systems, drawing on both the research literature (ReAct, Reflexion, Constitutional AI, SWE-bench, and others) and practical lessons from building and evaluating these systems. The result is a seven-step lifecycle that covers the full arc from problem framing to production monitoring.</p>

<p>This post is the overview. It lays out the structure of the framework and explains <em>why</em> each step exists and what makes it different from its traditional ML counterpart. It does not attempt to be comprehensive. Instead, each of the seven steps will get its own dedicated deep-dive in subsequent posts, with concrete examples, formal definitions, and actionable checklists.</p>

<h2>Why Agents Need Their Own Playbook</h2>

<p>Traditional ML has a well-established lifecycle: define the problem, collect data, train a model, evaluate, deploy, monitor. LLM agents break this in ways that are subtle but consequential.</p>

<p>First, <strong>you usually don't train the model</strong>. You select a foundation model, possibly fine-tune it, and then build around it. The "development" work shifts from gradient updates to architecture design, prompt engineering, tool integration, and memory construction. This is no less rigorous than training; it's just different, and it lacks the same institutional knowledge.</p>

<p>Second, <strong>problem framing and architecture design collapse into one decision</strong>. Choosing between a single LLM call, an agentic loop, and a multi-agent system is simultaneously a product decision and an engineering decision. Get it wrong and you'll either under-build (a single call that can't handle the task) or over-build (a multi-agent system where a single call would have sufficed, but with 5x the latency and cost).</p>

<p>Third, <strong>evaluation is fundamentally harder</strong>. Agent trajectories are multi-step, path-dependent, and stochastic. Many valid paths exist for the same task. Emergent behavior appears in multi-agent systems. Key properties (safety, reliability, consistency) are latent and hard to measure. A single metric will mislead you, and offline metrics that don't correlate with online outcomes mean all your optimization was wasted.</p>

<p>Fourth, <strong>failure modes are novel</strong>. Infinite loops, hallucinated tool calls, context exhaustion, cascading errors across agents, reward hacking, goal drift. These don't map cleanly onto traditional ML failure categories, and they require their own monitoring and alerting infrastructure.</p>

<p>The framework I'm proposing doesn't reinvent everything. It builds on the traditional ML lifecycle but adapts each step for the specific challenges that agentic systems introduce, and it adds steps (like architecture design) that are entirely new.</p>

<!-- ════════ LIFECYCLE DIAGRAM ════════ -->
<div class="diagram-wrap">
<svg viewBox="0 0 800 190" xmlns="http://www.w3.org/2000/svg" fill="none">
  <defs>
    <filter id="ds" x="-4%" y="-4%" width="108%" height="116%">
      <feDropShadow dx="0" dy="2" stdDeviation="4" flood-opacity="0.08"/>
    </filter>
    <linearGradient id="g1" x1="0" y1="0" x2="1" y2="0"><stop offset="0%" stop-color="#c4553a"/><stop offset="100%" stop-color="#e8725a"/></linearGradient>
    <linearGradient id="g2" x1="0" y1="0" x2="1" y2="0"><stop offset="0%" stop-color="#2a8f82"/><stop offset="100%" stop-color="#3db5a5"/></linearGradient>
    <linearGradient id="g3" x1="0" y1="0" x2="1" y2="0"><stop offset="0%" stop-color="#3b6fa0"/><stop offset="100%" stop-color="#5a94c4"/></linearGradient>
    <linearGradient id="g4" x1="0" y1="0" x2="1" y2="0"><stop offset="0%" stop-color="#d4843e"/><stop offset="100%" stop-color="#e8a55a"/></linearGradient>
    <linearGradient id="g5" x1="0" y1="0" x2="1" y2="0"><stop offset="0%" stop-color="#6b4c8a"/><stop offset="100%" stop-color="#8b6caa"/></linearGradient>
  </defs>
  <g filter="url(#ds)">
    <rect x="12" y="32" width="96" height="64" rx="10" fill="url(#g1)"/>
    <text x="60" y="58" text-anchor="middle" fill="#fff" font-family="Source Sans 3" font-weight="600" font-size="11">Requirements</text>
    <text x="60" y="74" text-anchor="middle" fill="rgba(255,255,255,0.75)" font-family="Source Sans 3" font-size="10">&amp; Framing</text>
  </g>
  <g filter="url(#ds)">
    <rect x="126" y="32" width="96" height="64" rx="10" fill="url(#g2)"/>
    <text x="174" y="58" text-anchor="middle" fill="#fff" font-family="Source Sans 3" font-weight="600" font-size="11">Agent</text>
    <text x="174" y="74" text-anchor="middle" fill="rgba(255,255,255,0.75)" font-family="Source Sans 3" font-size="10">Architecture</text>
  </g>
  <g filter="url(#ds)">
    <rect x="240" y="32" width="96" height="64" rx="10" fill="url(#g3)"/>
    <text x="288" y="58" text-anchor="middle" fill="#fff" font-family="Source Sans 3" font-weight="600" font-size="11">Knowledge</text>
    <text x="288" y="74" text-anchor="middle" fill="rgba(255,255,255,0.75)" font-family="Source Sans 3" font-size="10">&amp; Tools</text>
  </g>
  <g filter="url(#ds)">
    <rect x="354" y="32" width="96" height="64" rx="10" fill="url(#g4)"/>
    <text x="402" y="58" text-anchor="middle" fill="#fff" font-family="Source Sans 3" font-weight="600" font-size="11">Agent</text>
    <text x="402" y="74" text-anchor="middle" fill="rgba(255,255,255,0.75)" font-family="Source Sans 3" font-size="10">Development</text>
  </g>
  <g filter="url(#ds)">
    <rect x="468" y="32" width="96" height="64" rx="10" fill="url(#g5)"/>
    <text x="516" y="62" text-anchor="middle" fill="#fff" font-family="Source Sans 3" font-weight="600" font-size="11">Evaluation</text>
  </g>
  <g filter="url(#ds)">
    <rect x="582" y="32" width="96" height="64" rx="10" fill="url(#g1)"/>
    <text x="630" y="58" text-anchor="middle" fill="#fff" font-family="Source Sans 3" font-weight="600" font-size="11">Safety &amp;</text>
    <text x="630" y="74" text-anchor="middle" fill="rgba(255,255,255,0.75)" font-family="Source Sans 3" font-size="10">Deployment</text>
  </g>
  <g filter="url(#ds)">
    <rect x="696" y="32" width="96" height="64" rx="10" fill="url(#g2)"/>
    <text x="744" y="58" text-anchor="middle" fill="#fff" font-family="Source Sans 3" font-weight="600" font-size="11">Observe &amp;</text>
    <text x="744" y="74" text-anchor="middle" fill="rgba(255,255,255,0.75)" font-family="Source Sans 3" font-size="10">Monitor</text>
  </g>
  <text x="60" y="24" text-anchor="middle" fill="#c4553a" font-family="JetBrains Mono" font-size="11" font-weight="500">01</text>
  <text x="174" y="24" text-anchor="middle" fill="#2a8f82" font-family="JetBrains Mono" font-size="11" font-weight="500">02</text>
  <text x="288" y="24" text-anchor="middle" fill="#3b6fa0" font-family="JetBrains Mono" font-size="11" font-weight="500">03</text>
  <text x="402" y="24" text-anchor="middle" fill="#d4843e" font-family="JetBrains Mono" font-size="11" font-weight="500">04</text>
  <text x="516" y="24" text-anchor="middle" fill="#6b4c8a" font-family="JetBrains Mono" font-size="11" font-weight="500">05</text>
  <text x="630" y="24" text-anchor="middle" fill="#c4553a" font-family="JetBrains Mono" font-size="11" font-weight="500">06</text>
  <text x="744" y="24" text-anchor="middle" fill="#2a8f82" font-family="JetBrains Mono" font-size="11" font-weight="500">07</text>
  <path d="M110 64 L124 64" stroke="#bbb" stroke-width="1.5"/>
  <path d="M224 64 L238 64" stroke="#bbb" stroke-width="1.5"/>
  <path d="M338 64 L352 64" stroke="#bbb" stroke-width="1.5"/>
  <path d="M452 64 L466 64" stroke="#bbb" stroke-width="1.5"/>
  <path d="M566 64 L580 64" stroke="#bbb" stroke-width="1.5"/>
  <path d="M680 64 L694 64" stroke="#bbb" stroke-width="1.5"/>
  <path d="M744 98 L744 145 Q744 155 734 155 L70 155 Q60 155 60 145 L60 98" stroke="#c4553a" stroke-width="1.5" stroke-dasharray="6 4" fill="none" opacity="0.5"/>
  <text x="400" y="170" text-anchor="middle" fill="#c4553a" font-family="Source Sans 3" font-size="11" font-style="italic" opacity="0.7">Continuous feedback loop</text>
</svg>
</div>
<p class="diagram-caption">The seven-step lifecycle. Each feeds forward; monitoring feeds back to every earlier stage.</p>

<div class="divider"></div>

<h2>The Seven Steps: A Roadmap</h2>

<p>Below is the structure of the series. For each step, I'll outline what it covers, why it matters, and what makes it different from its traditional ML analog. The deep-dive posts will follow, each with formal definitions, worked examples, and concrete checklists you can use in your own projects.</p>

<div class="roadmap">

  <div class="roadmap-card rc-1">
    <div class="roadmap-num">01</div>
    <div class="roadmap-body">
      <h3>Requirements and Problem Framing</h3>
      <p>The first design decision is not "how smart should it be?" but "where do uncertainty, agency, and accountability sit?" This step introduces four analytical lenses — error surface analysis (decision authority, error cost asymmetry, recoverability), I/O gap diagnosis (how many information-acquisition steps bridge input to output), constraint analysis (the four walls of latency, cost, privacy, and governance), and Pareto tradeoff mapping (quality vs. speed, autonomy vs. control, generality vs. reliability). Grounded throughout with Rufus, Uber's support bot, and code agents as worked examples.</p>
      <div class="roadmap-questions">Deep-dive covers: error surface framing · I/O gap analysis · constraint feasibility wedges · Pareto tradeoff mapping · six forcing questions · adaptive routing as design pattern</div>
      <span class="roadmap-tag">Published — Part 1</span>
    </div>
  </div>

  <div class="roadmap-card rc-2">
    <div class="roadmap-num">02</div>
    <div class="roadmap-body">
      <h3>Agent Architecture Design</h3>
      <p>This step has no analog in traditional ML. For agents, architecture design (topology, reasoning patterns, tool schemas, inter-agent protocols) is as consequential as model selection. I'll walk through the four canonical topologies (pipeline, hierarchical, debate, collaborative), the four core reasoning patterns (ReAct, Plan-then-Execute, Reflexion, Re-planning), and a critical dimension that most guides ignore: human-centered design. Research on human-AI collaboration shows that users build mental models of what agents can and cannot do, and misaligned mental models lead to over-reliance or under-trust. Architecture choices shape those mental models, whether you design for it or not.</p>
      <div class="roadmap-questions">Deep-dive will cover: architecture pattern selection · reasoning pattern matching · structured output and function calling · the human-AI collaboration spectrum · transparency and override design</div>
      <span class="roadmap-tag">Coming in Part 2</span>
    </div>
  </div>

  <div class="roadmap-card rc-3">
    <div class="roadmap-num">03</div>
    <div class="roadmap-body">
      <h3>Knowledge, Tools, and Data Infrastructure</h3>
      <p>In traditional ML, data prep means building training datasets. For LLM agents, the real work is building knowledge infrastructure: RAG pipelines, vector stores, tool integration contracts, memory systems, prompt management. This is where most of the unglamorous engineering happens, and where most agent failures actually originate. A tool with an ambiguous description or an incomplete error contract will cause downstream failures that look like model problems but are actually infrastructure problems. I'll cover storage and retrieval (vector DB, key-value, graph), tool schema design with the principle of least privilege, prompt engineering specifically for agents (instruction hierarchy, few-shot tool examples, constitutional rules), and data governance.</p>
      <div class="roadmap-questions">Deep-dive will cover: RAG architecture decisions · tool schema contracts · prompt versioning and management · instruction hierarchy for injection defense · data audit checklist</div>
      <span class="roadmap-tag">Coming in Part 3</span>
    </div>
  </div>

  <div class="roadmap-card rc-4">
    <div class="roadmap-num">04</div>
    <div class="roadmap-body">
      <h3>Agent Development</h3>
      <p>The model adaptation spectrum ranges from in-context learning (lowest barrier, context-limited) through supervised fine-tuning and RL methods (PPO, DPO, GRPO, DAPO) to full distillation. But the real development work is in three areas that don't exist in traditional ML: memory architecture (context window management, external memory, episodic reflection, shared MAS memory), anthropomorphic design as a controllable lever (four dimensions of cues that can be intentionally tuned to support user goals rather than treated as an incidental risk), and multi-agent orchestration (topology, communication protocol, conflict resolution, error propagation).</p>
      <div class="roadmap-questions">Deep-dive will cover: model selection and adaptation · memory architecture patterns · anthropomorphism as a design dimension · MAS orchestration and conflict resolution</div>
      <span class="roadmap-tag">Coming in Part 4</span>
    </div>
  </div>

  <div class="roadmap-card rc-5">
    <div class="roadmap-num">05</div>
    <div class="roadmap-body">
      <h3>Evaluation</h3>
      <p>This is perhaps where the gap between traditional ML and agentic systems is widest. Agent evaluation is harder along every axis: multi-step trajectories, many valid paths, path-dependent stochastic behavior, emergent MAS dynamics, latent safety properties. I'll cover the full metrics spectrum from lexical (BLEU, ROUGE) through embedding (BERTScore) to learned and LLM-as-Judge approaches, with specific attention to RAG error decomposition (context utilization vs. hallucination vs. noise sensitivity). For online evaluation, I'll discuss why you must never rely on a single metric. And I'll spend significant space on population-level fairness: why an increase in the arithmetic mean can mask quality-of-service harms for underrepresented subgroups, and how disaggregated evaluation and worst-case (maximin) metrics address this.</p>
      <div class="roadmap-questions">Deep-dive will cover: offline metric formalization · LLM-as-Judge calibration · RAG error decomposition · agent benchmark landscape · online evaluation design · population-level fairness metrics</div>
      <span class="roadmap-tag">Coming in Part 5</span>
    </div>
  </div>

  <div class="roadmap-card rc-6">
    <div class="roadmap-num">06</div>
    <div class="roadmap-body">
      <h3>Safety, Guardrails, and Deployment</h3>
      <p>For agents that act autonomously, guardrail design is as important as the agent itself, and must happen before deployment. The threat model follows the OWASP Top 10 for LLMs: prompt injection (direct and indirect), data leakage, excessive agency, system prompt leakage. Guardrails operate at three layers (input, system prompt, output), and for multi-agent systems must also cover inter-agent messages. I'll also cover cost engineering in depth, because MAS cost (often 3 to 5 times the user-facing token cost) is a deployment-blocking concern that most guides hand-wave away. Smart routing, semantic caching, prompt compression, and budget enforcement aren't optimizations; they're requirements.</p>
      <div class="roadmap-questions">Deep-dive will cover: threat modeling · three-layer guardrail architecture · MAS inter-agent guardrails · cost engineering and smart routing · deployment strategies (shadow, canary, A/B) · red-teaming with ToolEmu</div>
      <span class="roadmap-tag">Coming in Part 6</span>
    </div>
  </div>

  <div class="roadmap-card rc-7">
    <div class="roadmap-num">07</div>
    <div class="roadmap-body">
      <h3>Observability and Monitoring</h3>
      <p>For agentic systems, observability is a distinct discipline that goes beyond traditional APM. You need distributed tracing where every LLM call, tool call, and inter-agent message is a span. You need prompt versioning that correlates prompt changes with metric shifts. You need trajectory replay: the ability to store and replay failed agent trajectories, which is the LLM equivalent of a stack trace. And you need a failure mode taxonomy that covers the novel ways agents break: infinite loops, hallucinated tool calls, context exhaustion, cascading MAS errors, reward hacking, goal drift, distribution shift, and silent tool API changes. Each of these needs its own alerting threshold and severity classification.</p>
      <div class="roadmap-questions">Deep-dive will cover: failure mode taxonomy · distributed tracing for agents · prompt versioning infrastructure · trajectory replay systems · severity tiering and on-call design · cost dashboards</div>
      <span class="roadmap-tag">Coming in Part 7</span>
    </div>
  </div>

</div>

<div class="divider"></div>

<h2>What's Next</h2>

<p>Part 1 goes deep on <strong>Requirements and Problem Framing</strong>: how to choose where uncertainty, agency, and accountability live — error surface analysis, I/O gap diagnosis, constraint feasibility, and Pareto tradeoff mapping.</p>

<div class="series-cta">
  <h3>This is Part 0 of a 7-part series.</h3>
  <p>Each subsequent post will take one step of the framework and go deep: formal definitions, worked examples, checklists, and lessons from building these systems in practice.</p>
  <p style="margin-top: 16px;"><a href="/blog/2026/agentic-systems-part1/" style="color: rgba(250,248,245,0.8); border-bottom: 1px solid rgba(250,248,245,0.3);">Part 1: Requirements and Problem Framing &rarr;</a></p>
</div>

</article>

<div class="post-end">
  <p class="post-author">Lorenzo Xiao &middot; Language Technologies Institute &middot; Carnegie Mellon University</p>
  <a href="/blog/">&larr; Back to blog</a>
</div>]]></content><author><name>Lorenzo Xiao</name></author><category term="blog" /><category term="academic" /><category term="AI" /><category term="engineering" /><summary type="html"><![CDATA[Everyone's building agents. Almost nobody is engineering them. A 7-part series on the full lifecycle — from problem framing to production monitoring.]]></summary></entry><entry><title type="html">Dissecting NeMo Gym: How NVIDIA Built a Modular Microservice Architecture for RL Verification at Scale</title><link href="https://algoroxyolo.github.io/blog/2026/nemo-gym-architecture/" rel="alternate" type="text/html" title="Dissecting NeMo Gym: How NVIDIA Built a Modular Microservice Architecture for RL Verification at Scale" /><published>2026-03-14T00:00:00+00:00</published><updated>2026-03-14T00:00:00+00:00</updated><id>https://algoroxyolo.github.io/blog/2026/nemo-gym-architecture</id><content type="html" xml:base="https://algoroxyolo.github.io/blog/2026/nemo-gym-architecture/"><![CDATA[<!-- ════════ HERO ════════ -->
<header class="hero hero--technical">
  <div class="hero-badge">ML Infrastructure</div>
  <h1>Dissecting NeMo Gym: How NVIDIA Built a Modular Microservice Architecture for <em>RL Verification at Scale</em></h1>
  <p class="hero-sub">Three server types. Thirty-four verifiers. One HTTP protocol. A close read of how composability beats configuration in RLVR pipelines.</p>
  <p class="hero-meta">Lorenzo Xiao &middot; Language Technologies Institute, CMU &middot; March 2026</p>
</header>

<!-- ════════ ARTICLE ════════ -->
<article>

<p class="lead">If you have spent any time working on reinforcement learning from verifiable rewards (RLVR), you know the pain: a model generating rollouts, tools for it to interact with, a verifier to score the output, and some orchestration glue to hold it together. Most frameworks handle this by cramming everything into a monolithic training loop that becomes impossible to extend. NeMo Gym decomposes the entire pipeline into three types of composable microservices, connected by nothing more than async HTTP calls and cookie-based sessions.</p>

<p>This post is a deep dive into the architecture — the design philosophy, the three server types, the data flow from input to reward profiling, and the infrastructure decisions that make it work at scale.</p>

<!-- ── System Architecture Diagram ── -->
<div class="diagram-wrap">
<svg viewBox="0 0 800 310" xmlns="http://www.w3.org/2000/svg" fill="none">
  <defs>
    <filter id="sh" x="-4%" y="-8%" width="108%" height="124%">
      <feDropShadow dx="0" dy="2" stdDeviation="4" flood-opacity="0.08"/>
    </filter>
    <filter id="sh2" x="-4%" y="-8%" width="108%" height="124%">
      <feDropShadow dx="0" dy="1" stdDeviation="2" flood-opacity="0.06"/>
    </filter>
    <linearGradient id="head-g" x1="0" y1="0" x2="0" y2="1">
      <stop offset="0%" stop-color="#3b6fa0"/>
      <stop offset="100%" stop-color="#2a5a8a"/>
    </linearGradient>
    <linearGradient id="agent-g" x1="0" y1="0" x2="0" y2="1">
      <stop offset="0%" stop-color="#2a8f82"/>
      <stop offset="100%" stop-color="#1f7268"/>
    </linearGradient>
    <linearGradient id="model-g" x1="0" y1="0" x2="0" y2="1">
      <stop offset="0%" stop-color="#6b4c8a"/>
      <stop offset="100%" stop-color="#573d72"/>
    </linearGradient>
    <linearGradient id="res-g" x1="0" y1="0" x2="0" y2="1">
      <stop offset="0%" stop-color="#c4553a"/>
      <stop offset="100%" stop-color="#a34430"/>
    </linearGradient>
  </defs>

  <!-- Title -->
  <text x="400" y="22" text-anchor="middle" fill="#2c2420" font-family="Fraunces" font-weight="700" font-size="14">System Architecture</text>

  <!-- HeadServer -->
  <g filter="url(#sh)">
    <rect x="290" y="36" width="220" height="52" rx="10" fill="url(#head-g)"/>
    <text x="400" y="58" text-anchor="middle" fill="#fff" font-family="Fraunces" font-weight="700" font-size="13">HeadServer</text>
    <text x="400" y="76" text-anchor="middle" fill="rgba(255,255,255,0.65)" font-family="JetBrains Mono" font-size="10">port 11000 · lifecycle &amp; config</text>
  </g>

  <!-- Connector lines from HeadServer to three servers -->
  <line x1="330" y1="88" x2="168" y2="138" stroke="#3b6fa0" stroke-width="1.2" stroke-dasharray="5 3" opacity="0.5"/>
  <line x1="400" y1="88" x2="400" y2="138" stroke="#3b6fa0" stroke-width="1.2" stroke-dasharray="5 3" opacity="0.5"/>
  <line x1="470" y1="88" x2="632" y2="138" stroke="#3b6fa0" stroke-width="1.2" stroke-dasharray="5 3" opacity="0.5"/>

  <!-- Agent Server -->
  <g filter="url(#sh)">
    <rect x="62" y="138" width="212" height="76" rx="10" fill="url(#agent-g)"/>
    <text x="168" y="162" text-anchor="middle" fill="#fff" font-family="Fraunces" font-weight="700" font-size="13">Agent Server</text>
    <text x="168" y="179" text-anchor="middle" fill="rgba(255,255,255,0.65)" font-family="Source Sans 3" font-size="11">Orchestration · 7 types</text>
    <text x="168" y="204" text-anchor="middle" fill="rgba(255,255,255,0.5)" font-family="JetBrains Mono" font-size="10">POST /run · POST /v1/responses</text>
  </g>

  <!-- Model Server -->
  <g filter="url(#sh)">
    <rect x="294" y="138" width="212" height="76" rx="10" fill="url(#model-g)"/>
    <text x="400" y="162" text-anchor="middle" fill="#fff" font-family="Fraunces" font-weight="700" font-size="13">Model Server</text>
    <text x="400" y="179" text-anchor="middle" fill="rgba(255,255,255,0.65)" font-family="Source Sans 3" font-size="11">LLM Inference · 5 types</text>
    <text x="400" y="204" text-anchor="middle" fill="rgba(255,255,255,0.5)" font-family="JetBrains Mono" font-size="10">POST /v1/chat/completions</text>
  </g>

  <!-- Resources Server -->
  <g filter="url(#sh)">
    <rect x="526" y="138" width="212" height="76" rx="10" fill="url(#res-g)"/>
    <text x="632" y="162" text-anchor="middle" fill="#fff" font-family="Fraunces" font-weight="700" font-size="13">Resources Server</text>
    <text x="632" y="179" text-anchor="middle" fill="rgba(255,255,255,0.65)" font-family="Source Sans 3" font-size="11">Tools &amp; Verifiers · 34 types</text>
    <text x="632" y="204" text-anchor="middle" fill="rgba(255,255,255,0.5)" font-family="JetBrains Mono" font-size="10">POST /verify · POST /&lt;tool&gt;</text>
  </g>

  <!-- Agent → Model (call) -->
  <path d="M274 172 L292 172" stroke="#6b4c8a" stroke-width="1.5" marker-end="url(#arr)"/>
  <text x="283" y="166" text-anchor="middle" fill="#6b4c8a" font-family="JetBrains Mono" font-size="9">call</text>

  <!-- Model → Agent (response) -->
  <path d="M292 184 L274 184" stroke="#6b4c8a" stroke-width="1.5" marker-end="url(#arr)"/>
  <text x="283" y="196" text-anchor="middle" fill="#6b4c8a" font-family="JetBrains Mono" font-size="9">resp</text>

  <!-- Agent → Resources (tool / verify) -->
  <path d="M274 168 C360 230 440 240 524 185" stroke="#c4553a" stroke-width="1.2" stroke-dasharray="5 3" opacity="0.6" marker-end="url(#arr-r)"/>
  <text x="400" y="245" text-anchor="middle" fill="#c4553a" font-family="JetBrains Mono" font-size="9">tool calls &amp; /verify</text>

  <!-- Protocol label -->
  <rect x="290" y="258" width="220" height="32" rx="8" fill="rgba(59,111,160,0.08)" stroke="rgba(59,111,160,0.2)" stroke-width="1"/>
  <text x="400" y="278" text-anchor="middle" fill="#3b6fa0" font-family="JetBrains Mono" font-size="11" font-weight="500">All comms: async HTTP (aiohttp)</text>

  <defs>
    <marker id="arr" markerWidth="7" markerHeight="7" refX="5" refY="3.5" orient="auto">
      <polygon points="0 0, 7 3.5, 0 7" fill="#6b4c8a" opacity="0.7"/>
    </marker>
    <marker id="arr-r" markerWidth="7" markerHeight="7" refX="5" refY="3.5" orient="auto">
      <polygon points="0 0, 7 3.5, 0 7" fill="#c4553a" opacity="0.7"/>
    </marker>
  </defs>
</svg>
</div>
<p class="diagram-caption">The complete system. HeadServer handles lifecycle; the three core servers handle data flow. HTTP throughout.</p>

<div class="divider"></div>

<!-- ════════ AGENT SERVERS ════════ -->
<h2><span class="section-label">01</span> Agent Servers: The Orchestration Layer <span class="count-badge">7 types</span></h2>

<p>Agent servers are the conductors of the system. They receive a task via <code>POST /run</code>, execute the full loop — call the model, extract function calls, dispatch them to the resources server, feed results back, repeat — and then call <code>/verify</code> to return a reward. What makes the design interesting is that "orchestration" is not one-size-fits-all. NeMo Gym ships seven distinct agents, each encoding a different loop pattern.</p>

<!-- 7 Agent Types Diagram -->
<div class="diagram-wrap">
<svg viewBox="0 0 800 340" xmlns="http://www.w3.org/2000/svg" fill="none">
  <defs>
    <filter id="sh3" x="-3%" y="-6%" width="106%" height="120%">
      <feDropShadow dx="0" dy="1" stdDeviation="3" flood-opacity="0.07"/>
    </filter>
  </defs>
  <text x="400" y="22" text-anchor="middle" fill="#2c2420" font-family="Fraunces" font-weight="700" font-size="14">Seven Agent Implementations</text>

  <!-- Row 1 -->
  <!-- simple_agent -->
  <g filter="url(#sh3)">
    <rect x="16" y="38" width="230" height="80" rx="10" fill="#fff" stroke="#2a8f82" stroke-width="1.2"/>
    <rect x="16" y="38" width="230" height="28" rx="10" fill="#2a8f82"/>
    <rect x="16" y="54" width="230" height="12" fill="#2a8f82"/>
    <text x="131" y="56" text-anchor="middle" fill="#fff" font-family="JetBrains Mono" font-size="11" font-weight="500">simple_agent</text>
    <text x="131" y="82" text-anchor="middle" fill="#2c2420" font-family="Source Sans 3" font-size="11">Tool-augmented while loop.</text>
    <text x="131" y="100" text-anchor="middle" fill="#6b6560" font-family="Source Sans 3" font-size="10.5">Call → extract → dispatch → repeat.</text>
  </g>

  <!-- aviary_agent -->
  <g filter="url(#sh3)">
    <rect x="258" y="38" width="230" height="80" rx="10" fill="#fff" stroke="#3b6fa0" stroke-width="1.2"/>
    <rect x="258" y="38" width="230" height="28" rx="10" fill="#3b6fa0"/>
    <rect x="258" y="54" width="230" height="12" fill="#3b6fa0"/>
    <text x="373" y="56" text-anchor="middle" fill="#fff" font-family="JetBrains Mono" font-size="11" font-weight="500">aviary_agent</text>
    <text x="373" y="82" text-anchor="middle" fill="#2c2420" font-family="Source Sans 3" font-size="11">RL gym environments.</text>
    <text x="373" y="100" text-anchor="middle" fill="#6b6560" font-family="Source Sans 3" font-size="10.5">seed_session → step → verify → close.</text>
  </g>

  <!-- proof_refinement_agent -->
  <g filter="url(#sh3)">
    <rect x="500" y="38" width="285" height="80" rx="10" fill="#fff" stroke="#6b4c8a" stroke-width="1.2"/>
    <rect x="500" y="38" width="285" height="28" rx="10" fill="#6b4c8a"/>
    <rect x="500" y="54" width="285" height="12" fill="#6b4c8a"/>
    <text x="642" y="56" text-anchor="middle" fill="#fff" font-family="JetBrains Mono" font-size="11" font-weight="500">proof_refinement_agent</text>
    <text x="642" y="82" text-anchor="middle" fill="#2c2420" font-family="Source Sans 3" font-size="11">Error-feedback loop for formal proofs.</text>
    <text x="642" y="100" text-anchor="middle" fill="#6b6560" font-family="Source Sans 3" font-size="10.5">No tool calls — model ↔ verifier only.</text>
  </g>

  <!-- Row 2 -->
  <!-- tool_simulation_agent -->
  <g filter="url(#sh3)">
    <rect x="16" y="138" width="230" height="80" rx="10" fill="#fff" stroke="#d4843e" stroke-width="1.2"/>
    <rect x="16" y="138" width="230" height="28" rx="10" fill="#d4843e"/>
    <rect x="16" y="154" width="230" height="12" fill="#d4843e"/>
    <text x="131" y="156" text-anchor="middle" fill="#fff" font-family="JetBrains Mono" font-size="11" font-weight="500">tool_simulation_agent</text>
    <text x="131" y="182" text-anchor="middle" fill="#2c2420" font-family="Source Sans 3" font-size="11">Single call + single verify.</text>
    <text x="131" y="200" text-anchor="middle" fill="#6b6560" font-family="Source Sans 3" font-size="10.5">Tool use simulated inside verifier.</text>
  </g>

  <!-- verifiers_agent -->
  <g filter="url(#sh3)">
    <rect x="258" y="138" width="230" height="80" rx="10" fill="#fff" stroke="#c4553a" stroke-width="1.2"/>
    <rect x="258" y="138" width="230" height="28" rx="10" fill="#c4553a"/>
    <rect x="258" y="154" width="230" height="12" fill="#c4553a"/>
    <text x="373" y="156" text-anchor="middle" fill="#fff" font-family="JetBrains Mono" font-size="11" font-weight="500">verifiers_agent</text>
    <text x="373" y="182" text-anchor="middle" fill="#2c2420" font-family="Source Sans 3" font-size="11">External verifiers library.</text>
    <text x="373" y="200" text-anchor="middle" fill="#6b6560" font-family="Source Sans 3" font-size="10.5">Preserves token IDs + logprobs for RL.</text>
  </g>

  <!-- mini_swe_agent -->
  <g filter="url(#sh3)">
    <rect x="500" y="138" width="135" height="80" rx="10" fill="#fff" stroke="#2a8f82" stroke-width="1.2"/>
    <rect x="500" y="138" width="135" height="28" rx="10" fill="#2a8f82"/>
    <rect x="500" y="154" width="135" height="12" fill="#2a8f82"/>
    <text x="567" y="156" text-anchor="middle" fill="#fff" font-family="JetBrains Mono" font-size="10" font-weight="500">mini_swe_agent</text>
    <text x="567" y="182" text-anchor="middle" fill="#2c2420" font-family="Source Sans 3" font-size="11">SWE-gym via Ray.</text>
    <text x="567" y="200" text-anchor="middle" fill="#6b6560" font-family="Source Sans 3" font-size="10.5">Docker + Singularity.</text>
  </g>

  <!-- swe_agents -->
  <g filter="url(#sh3)">
    <rect x="650" y="138" width="135" height="80" rx="10" fill="#fff" stroke="#3b6fa0" stroke-width="1.2"/>
    <rect x="650" y="138" width="135" height="28" rx="10" fill="#3b6fa0"/>
    <rect x="650" y="154" width="135" height="12" fill="#3b6fa0"/>
    <text x="717" y="156" text-anchor="middle" fill="#fff" font-family="JetBrains Mono" font-size="10" font-weight="500">swe_agents</text>
    <text x="717" y="182" text-anchor="middle" fill="#2c2420" font-family="Source Sans 3" font-size="11">Full SWE-bench.</text>
    <text x="717" y="200" text-anchor="middle" fill="#6b6560" font-family="Source Sans 3" font-size="10.5">100 turns, 4 frameworks.</text>
  </g>

  <!-- Shared base note -->
  <rect x="16" y="240" width="769" height="36" rx="8" fill="rgba(42,143,130,0.07)" stroke="rgba(42,143,130,0.2)" stroke-width="1"/>
  <text x="400" y="262" text-anchor="middle" fill="#2a8f82" font-family="Source Sans 3" font-size="12">All inherit from <tspan font-family="JetBrains Mono" font-size="11">SimpleResponsesAPIAgent → SimpleServer → BaseServer</tspan> for shared endpoint registration, health checks &amp; session middleware.</text>

  <!-- Workhorse label -->
  <text x="131" y="130" text-anchor="middle" fill="#2a8f82" font-family="Source Sans 3" font-size="11" font-style="italic">← workhorse</text>
</svg>
</div>
<p class="diagram-caption">Seven agents, seven loop patterns. The shared base class provides infrastructure; each leaf defines behavior.</p>

<div class="callout tip">
  <div class="callout-title">The Composability Principle</div>
  <p>Instead of building one agent with a hundred config flags, NeMo Gym builds seven agents that each do one thing well. The <code>tool_simulation_agent</code> does not have a "disable tool calls" flag — it simply never makes tool calls. The <code>proof_refinement_agent</code> does not have a "correction mode" toggle — it always does correction. Each implementation is small, testable, and easy to reason about.</p>
</div>

<p>The <code>simple_agent</code> is the workhorse: a while loop that calls the model, extracts function calls, POSTs each one to the resources server's <code>/&lt;tool_name&gt;</code> endpoint, accumulates outputs, feeds them back, and repeats until there are no more function calls or <code>max_steps</code> is reached. The <code>verifiers_agent</code> is the most RL-aware: it preserves <code>prompt_token_ids</code>, <code>generation_token_ids</code>, and <code>logprobs</code> in its output, making it directly compatible with policy gradient training pipelines that need these quantities.</p>

<div class="divider"></div>

<!-- ════════ MODEL SERVERS ════════ -->
<h2><span class="section-label">02</span> Model Servers: Abstracting Inference <span class="count-badge">5 types</span></h2>

<p>Model servers expose two endpoints — <code>POST /v1/chat/completions</code> and <code>POST /v1/responses</code> — and return a <code>NeMoGymResponse</code> containing <code>output[]</code> (messages and function calls) and <code>usage</code> (input and output token counts). The five implementations form a clean capability inheritance chain.</p>

<div class="table-wrap">
<table>
  <thead>
    <tr><th>Server</th><th>Extends</th><th>Key Capability Added</th></tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>openai_model</strong></td>
      <td><code>SimpleModelServer</code></td>
      <td>Generic OpenAI-compatible client. Works with any endpoint that speaks the OpenAI API.</td>
    </tr>
    <tr>
      <td><strong>azure_openai_model</strong></td>
      <td><code>openai_model</code></td>
      <td>Azure deployments with <code>api-version</code> handling, <code>VLLMConverter</code>, semaphore concurrency control.</td>
    </tr>
    <tr>
      <td><strong>vllm_model</strong></td>
      <td><code>azure_openai_model</code></td>
      <td>Multi-endpoint round-robin, <code>&lt;think&gt;</code> tag reasoning parser, token ID + logprob tracking, graceful context-length truncation.</td>
    </tr>
    <tr>
      <td><strong>local_vllm_model</strong></td>
      <td><code>vllm_model</code></td>
      <td>Launches vLLM as a Ray actor. Tensor, pipeline, and data parallelism via Ray placement groups. Multi-node GPU support.</td>
    </tr>
    <tr>
      <td><strong>genrm_model</strong></td>
      <td><code>local_vllm_model</code></td>
      <td>Custom conversation roles (<code>response_1</code>, <code>response_2</code>, <code>principle</code>) for pairwise reward modeling comparisons.</td>
    </tr>
  </tbody>
</table>
</div>

<p>The inheritance chain is purposeful: each layer adds exactly one conceptual capability. The <code>local_vllm_model</code> is where things get operationally interesting — it handles HuggingFace token and cache management, polls for server health before accepting requests, and applies internal vLLM patches for compatibility. The <code>genrm_model</code> at the tip of the chain is the model server you use when your reward signal comes from comparing two outputs side by side.</p>

<div class="divider"></div>

<!-- ════════ RESOURCES SERVERS ════════ -->
<h2><span class="section-label">03</span> Resources Servers: 34 Verifiers, One Interface</h2>

<p>All 34 resources servers expose the same three endpoints: <code>POST /verify</code>, <code>POST /seed_session</code>, and <code>POST /&lt;tool_name&gt;</code>. They all return <code>BaseVerifyResponse { reward: float, info: dict }</code>. The reward types break into three families.</p>

<!-- Reward types diagram -->
<div class="diagram-wrap">
<svg viewBox="0 0 800 310" xmlns="http://www.w3.org/2000/svg" fill="none">
  <defs>
    <filter id="sh4" x="-3%" y="-6%" width="106%" height="120%">
      <feDropShadow dx="0" dy="1" stdDeviation="3" flood-opacity="0.06"/>
    </filter>
  </defs>
  <text x="400" y="22" text-anchor="middle" fill="#2c2420" font-family="Fraunces" font-weight="700" font-size="14">34 Resources Servers by Reward Type</text>

  <!-- Binary -->
  <g filter="url(#sh4)">
    <rect x="14" y="38" width="234" height="258" rx="12" fill="#fff" stroke="#2a8f82" stroke-width="1.2"/>
    <rect x="14" y="38" width="234" height="36" rx="12" fill="#2a8f82"/>
    <rect x="14" y="62" width="234" height="12" fill="#2a8f82"/>
    <text x="131" y="62" text-anchor="middle" fill="#fff" font-family="Fraunces" font-weight="700" font-size="13">Binary Reward</text>
    <text x="131" y="76" text-anchor="middle" fill="rgba(255,255,255,0.7)" font-family="JetBrains Mono" font-size="10">returns 0 or 1 · 18 servers</text>
  </g>
  <text x="30" y="104" fill="#2c2420" font-family="JetBrains Mono" font-size="10.5" font-weight="500">arc_agi</text>
  <text x="30" y="120" fill="#6b6560" font-family="Source Sans 3" font-size="10">2D grid parse from \boxed{}</text>
  <text x="30" y="140" fill="#2c2420" font-family="JetBrains Mono" font-size="10.5" font-weight="500">code_gen</text>
  <text x="30" y="156" fill="#6b6560" font-family="Source Sans 3" font-size="10">Ray execution + unit tests</text>
  <text x="30" y="176" fill="#2c2420" font-family="JetBrains Mono" font-size="10.5" font-weight="500">structured_outputs</text>
  <text x="30" y="192" fill="#6b6560" font-family="Source Sans 3" font-size="10">JSON → OpenAPI schema validation</text>
  <text x="30" y="212" fill="#2c2420" font-family="JetBrains Mono" font-size="10.5" font-weight="500">text_to_sql</text>
  <text x="30" y="228" fill="#6b6560" font-family="Source Sans 3" font-size="10">MySQL, PostgreSQL, SQLite dialects</text>
  <text x="30" y="248" fill="#2c2420" font-family="JetBrains Mono" font-size="10.5" font-weight="500">xlam_fc</text>
  <text x="30" y="264" fill="#6b6560" font-family="Source Sans 3" font-size="10">Function call greedy match</text>
  <text x="30" y="282" fill="#6b6560" font-family="Source Sans 3" font-size="10" font-style="italic">+ 13 more…</text>

  <!-- Continuous -->
  <g filter="url(#sh4)">
    <rect x="260" y="38" width="234" height="258" rx="12" fill="#fff" stroke="#6b4c8a" stroke-width="1.2"/>
    <rect x="260" y="38" width="234" height="36" rx="12" fill="#6b4c8a"/>
    <rect x="260" y="62" width="234" height="12" fill="#6b4c8a"/>
    <text x="377" y="62" text-anchor="middle" fill="#fff" font-family="Fraunces" font-weight="700" font-size="13">Continuous Reward</text>
    <text x="377" y="76" text-anchor="middle" fill="rgba(255,255,255,0.7)" font-family="JetBrains Mono" font-size="10">returns [0, 1] · 8 servers</text>
  </g>
  <text x="276" y="104" fill="#2c2420" font-family="JetBrains Mono" font-size="10.5" font-weight="500">genrm_compare</text>
  <text x="276" y="120" fill="#6b6560" font-family="Source Sans 3" font-size="10">Pairwise GenRM + length bonus</text>
  <text x="276" y="140" fill="#2c2420" font-family="JetBrains Mono" font-size="10.5" font-weight="500">math_formal_lean</text>
  <text x="276" y="156" fill="#6b6560" font-family="Source Sans 3" font-size="10">0.3× symbolic + 0.7× Lean4 RMSLE</text>
  <text x="276" y="176" fill="#2c2420" font-family="JetBrains Mono" font-size="10.5" font-weight="500">math_with_judge</text>
  <text x="276" y="192" fill="#6b6560" font-family="Source Sans 3" font-size="10">math_verify → LLM judge fallback</text>
  <text x="276" y="212" fill="#2c2420" font-family="JetBrains Mono" font-size="10.5" font-weight="500">multichallenge</text>
  <text x="276" y="228" fill="#6b6560" font-family="Source Sans 3" font-size="10">Multi-rubric: mean/min/max/all/any</text>
  <text x="276" y="248" fill="#2c2420" font-family="JetBrains Mono" font-size="10.5" font-weight="500">aviary</text>
  <text x="276" y="264" fill="#6b6560" font-family="Source Sans 3" font-size="10">Cumulative env step rewards</text>
  <text x="276" y="282" fill="#6b6560" font-family="Source Sans 3" font-size="10" font-style="italic">+ 3 more…</text>

  <!-- Compound -->
  <g filter="url(#sh4)">
    <rect x="506" y="38" width="280" height="258" rx="12" fill="#fff" stroke="#c4553a" stroke-width="1.2"/>
    <rect x="506" y="38" width="280" height="36" rx="12" fill="#c4553a"/>
    <rect x="506" y="62" width="280" height="12" fill="#c4553a"/>
    <text x="646" y="62" text-anchor="middle" fill="#fff" font-family="Fraunces" font-weight="700" font-size="13">Compound / Varied</text>
    <text x="646" y="76" text-anchor="middle" fill="rgba(255,255,255,0.7)" font-family="JetBrains Mono" font-size="10">mixed types · 8 servers</text>
  </g>
  <text x="522" y="104" fill="#2c2420" font-family="JetBrains Mono" font-size="10.5" font-weight="500">jailbreak_detection</text>
  <text x="522" y="120" fill="#6b6560" font-family="Source Sans 3" font-size="10">Safety × quality compound score</text>
  <text x="522" y="140" fill="#2c2420" font-family="JetBrains Mono" font-size="10.5" font-weight="500">instruction_following</text>
  <text x="522" y="156" fill="#6b6560" font-family="Source Sans 3" font-size="10">Binary strict or fraction [0,1]</text>
  <text x="522" y="176" fill="#2c2420" font-family="JetBrains Mono" font-size="10.5" font-weight="500">equivalence_llm_judge</text>
  <text x="522" y="192" fill="#6b6560" font-family="Source Sans 3" font-size="10">0 / 0.5 / 1 with swap-check</text>
  <text x="522" y="212" fill="#2c2420" font-family="JetBrains Mono" font-size="10.5" font-weight="500">over_refusal_detection</text>
  <text x="522" y="228" fill="#6b6560" font-family="Source Sans 3" font-size="10">Safety + helpfulness balance</text>
  <text x="522" y="248" fill="#2c2420" font-family="JetBrains Mono" font-size="10.5" font-weight="500">mcqa</text>
  <text x="522" y="264" fill="#6b6560" font-family="Source Sans 3" font-size="10">3-mode letter extraction</text>
  <text x="522" y="282" fill="#6b6560" font-family="Source Sans 3" font-size="10" font-style="italic">+ 3 more…</text>
</svg>
</div>
<p class="diagram-caption">34 verifiers, three reward families. 13 use LLM judges, 5 use sandboxed Ray execution, 4 use session state.</p>

<div class="callout info">
  <div class="callout-title">Rewards as First-Class Design</div>
  <p>With 34 verifiers spanning binary, continuous, and compound reward types, NeMo Gym treats reward computation as a rich design space rather than an afterthought. The fact that 13 of 34 verifiers use LLM judges reflects the reality that ground-truth verification is expensive or impossible for many tasks — and the system accommodates that complexity rather than pretending it doesn't exist.</p>
</div>

<p>Across all 34 servers, patterns repeat: session state flows through cookies (4 servers require <code>seed_session</code>), Ray handles sandboxed code execution (5 servers), and LLM judges step in wherever symbolic verification fails. The <code>math_formal_lean</code> verifier is the most sophisticated — a hybrid that weights Lean4 formal proof compilation at 70% and symbolic equivalence at 30%, with multi-turn error feedback injected back into the model conversation.</p>

<div class="divider"></div>

<!-- ════════ DATA FLOW ════════ -->
<h2><span class="section-label">04</span> The Data Flow: JSONL to Reward Profiles</h2>

<p>The end-to-end pipeline follows eight steps, from raw task input to aggregated pass@k metrics.</p>

<!-- 8-Step Flow Diagram -->
<div class="diagram-wrap">
<svg viewBox="0 0 800 200" xmlns="http://www.w3.org/2000/svg" fill="none">
  <defs>
    <filter id="sh5" x="-4%" y="-10%" width="108%" height="128%">
      <feDropShadow dx="0" dy="1" stdDeviation="3" flood-opacity="0.07"/>
    </filter>
    <linearGradient id="flow-g1" x1="0" y1="0" x2="1" y2="0">
      <stop offset="0%" stop-color="#3b6fa0"/>
      <stop offset="100%" stop-color="#5a94c4"/>
    </linearGradient>
    <linearGradient id="flow-g2" x1="0" y1="0" x2="1" y2="0">
      <stop offset="0%" stop-color="#2a8f82"/>
      <stop offset="100%" stop-color="#3db5a5"/>
    </linearGradient>
    <linearGradient id="flow-g3" x1="0" y1="0" x2="1" y2="0">
      <stop offset="0%" stop-color="#6b4c8a"/>
      <stop offset="100%" stop-color="#8b6caa"/>
    </linearGradient>
    <linearGradient id="flow-g4" x1="0" y1="0" x2="1" y2="0">
      <stop offset="0%" stop-color="#c4553a"/>
      <stop offset="100%" stop-color="#e8725a"/>
    </linearGradient>
  </defs>

  <!-- Step labels -->
  <text x="400" y="18" text-anchor="middle" fill="#2c2420" font-family="Fraunces" font-weight="700" font-size="13">8-Step Pipeline: JSONL → Reward Profile</text>

  <g filter="url(#sh5)">
    <rect x="8" y="34" width="82" height="56" rx="8" fill="url(#flow-g1)"/>
    <text x="49" y="56" text-anchor="middle" fill="#fff" font-family="JetBrains Mono" font-size="9" font-weight="500">01</text>
    <text x="49" y="70" text-anchor="middle" fill="#fff" font-family="Source Sans 3" font-weight="600" font-size="11">Input</text>
    <text x="49" y="83" text-anchor="middle" fill="rgba(255,255,255,0.65)" font-family="Source Sans 3" font-size="9.5">JSONL tasks</text>
  </g>
  <path d="M92 62 L100 62" stroke="#bbb" stroke-width="1.2"/>

  <g filter="url(#sh5)">
    <rect x="102" y="34" width="88" height="56" rx="8" fill="url(#flow-g1)"/>
    <text x="146" y="56" text-anchor="middle" fill="#fff" font-family="JetBrains Mono" font-size="9" font-weight="500">02</text>
    <text x="146" y="70" text-anchor="middle" fill="#fff" font-family="Source Sans 3" font-weight="600" font-size="11">Preprocess</text>
    <text x="146" y="83" text-anchor="middle" fill="rgba(255,255,255,0.65)" font-family="Source Sans 3" font-size="9.5">task/rollout idx</text>
  </g>
  <path d="M192 62 L200 62" stroke="#bbb" stroke-width="1.2"/>

  <g filter="url(#sh5)">
    <rect x="202" y="34" width="82" height="56" rx="8" fill="url(#flow-g2)"/>
    <text x="243" y="56" text-anchor="middle" fill="#fff" font-family="JetBrains Mono" font-size="9" font-weight="500">03</text>
    <text x="243" y="70" text-anchor="middle" fill="#fff" font-family="Source Sans 3" font-weight="600" font-size="11">Dispatch</text>
    <text x="243" y="83" text-anchor="middle" fill="rgba(255,255,255,0.65)" font-family="Source Sans 3" font-size="9.5">POST /run</text>
  </g>
  <path d="M286 62 L294 62" stroke="#bbb" stroke-width="1.2"/>

  <g filter="url(#sh5)">
    <rect x="296" y="34" width="90" height="56" rx="8" fill="url(#flow-g3)"/>
    <text x="341" y="56" text-anchor="middle" fill="#fff" font-family="JetBrains Mono" font-size="9" font-weight="500">04</text>
    <text x="341" y="70" text-anchor="middle" fill="#fff" font-family="Source Sans 3" font-weight="600" font-size="11">Model Call</text>
    <text x="341" y="83" text-anchor="middle" fill="rgba(255,255,255,0.65)" font-family="Source Sans 3" font-size="9.5">/v1/responses</text>
  </g>
  <path d="M388 62 L396 62" stroke="#bbb" stroke-width="1.2"/>

  <g filter="url(#sh5)">
    <rect x="398" y="34" width="86" height="56" rx="8" fill="url(#flow-g3)"/>
    <text x="441" y="56" text-anchor="middle" fill="#fff" font-family="JetBrains Mono" font-size="9" font-weight="500">05</text>
    <text x="441" y="70" text-anchor="middle" fill="#fff" font-family="Source Sans 3" font-weight="600" font-size="11">Tool Exec</text>
    <text x="441" y="83" text-anchor="middle" fill="rgba(255,255,255,0.65)" font-family="Source Sans 3" font-size="9.5">POST /&lt;tool&gt;</text>
  </g>

  <!-- Loop arrow back -->
  <path d="M441 92 L441 130 Q441 140 430 140 L310 140 Q299 140 299 130 L299 92" stroke="#6b4c8a" stroke-width="1.2" stroke-dasharray="5 3" fill="none" opacity="0.6"/>
  <text x="370" y="156" text-anchor="middle" fill="#6b4c8a" font-family="Source Sans 3" font-size="10" font-style="italic">repeat until no more calls or max_steps</text>

  <path d="M486 62 L494 62" stroke="#bbb" stroke-width="1.2"/>

  <g filter="url(#sh5)">
    <rect x="496" y="34" width="82" height="56" rx="8" fill="url(#flow-g4)"/>
    <text x="537" y="56" text-anchor="middle" fill="#fff" font-family="JetBrains Mono" font-size="9" font-weight="500">06</text>
    <text x="537" y="70" text-anchor="middle" fill="#fff" font-family="Source Sans 3" font-weight="600" font-size="11">Verify</text>
    <text x="537" y="83" text-anchor="middle" fill="rgba(255,255,255,0.65)" font-family="Source Sans 3" font-size="9.5">POST /verify</text>
  </g>
  <path d="M580 62 L588 62" stroke="#bbb" stroke-width="1.2"/>

  <g filter="url(#sh5)">
    <rect x="590" y="34" width="82" height="56" rx="8" fill="url(#flow-g4)"/>
    <text x="631" y="56" text-anchor="middle" fill="#fff" font-family="JetBrains Mono" font-size="9" font-weight="500">07</text>
    <text x="631" y="70" text-anchor="middle" fill="#fff" font-family="Source Sans 3" font-weight="600" font-size="11">Write</text>
    <text x="631" y="83" text-anchor="middle" fill="rgba(255,255,255,0.65)" font-family="Source Sans 3" font-size="9.5">output JSONL</text>
  </g>
  <path d="M674 62 L682 62" stroke="#bbb" stroke-width="1.2"/>

  <g filter="url(#sh5)">
    <rect x="684" y="34" width="108" height="56" rx="8" fill="url(#flow-g2)"/>
    <text x="738" y="56" text-anchor="middle" fill="#fff" font-family="JetBrains Mono" font-size="9" font-weight="500">08</text>
    <text x="738" y="70" text-anchor="middle" fill="#fff" font-family="Source Sans 3" font-weight="600" font-size="11">Reward Profile</text>
    <text x="738" y="83" text-anchor="middle" fill="rgba(255,255,255,0.65)" font-family="Source Sans 3" font-size="9.5">pass@1/4/16 + stats</text>
  </g>
</svg>
</div>
<p class="diagram-caption">Steps 04–05 form the inner loop. The outer pipeline runs once per task. <code>RewardProfiler</code> computes pass@k at the end.</p>

<p>The profiling step is worth dwelling on. <code>RewardProfiler</code> computes pass@k for k ∈ {1, 4, 16} along with mean, max, min, median, and standard deviation — both per-task and globally. This is not just a logging convenience; it directly measures the key quantity in RLVR: how often does the model produce a verifiably correct answer under multiple sample draws?</p>

<div class="divider"></div>

<!-- ════════ INFRASTRUCTURE ════════ -->
<h2><span class="section-label">05</span> Infrastructure: The Parts That Make It Work</h2>

<h3>ServerClient</h3>
<p>All inter-server HTTP communication goes through <code>ServerClient</code>, an aiohttp wrapper with retry logic (3× exponential backoff), and connection pooling set to 100,000 total connections and 1,000 per host. These numbers are aggressive but intentional: when running thousands of parallel rollouts across a Ray cluster, you need the headroom before connection exhaustion becomes the bottleneck.</p>

<h3>Session Middleware</h3>
<p>Cookie-based session state with UUID-per-session is an unconventional choice for a distributed system, but it elegantly solves the routing problem. When an agent calls <code>POST /seed_session</code>, the server creates session state and returns a cookie. All subsequent calls from that agent carry the cookie, giving the server access to the right session. No centralized session store, no distributed cache to manage, no sticky session configuration at the load balancer level.</p>

<div class="callout warn">
  <div class="callout-title">Session State Is Local to Each Server</div>
  <p>Because session state lives in server memory (keyed by cookie UUID), it means agent requests must reach the <em>same</em> resources server instance across all turns of a session. This is fine in the default single-instance configuration, but warrants attention if you scale resources servers horizontally behind a load balancer. You would need sticky sessions or session state externalization for that use case.</p>
</div>

<h3>Ray Cluster</h3>
<p>Ray handles distributed job scheduling, multi-node GPU management (tensor, pipeline, and data parallelism), placement groups for co-locating related processes, and actor-based vLLM server management. The <code>local_vllm_model</code> launches its vLLM instance as a Ray actor, which means Ray handles placement, fault tolerance, and resource allocation. TP × PP × DP configurations are set in Hydra config and passed through to vLLM's CLI flags automatically.</p>

<h3>Configuration</h3>
<p>Hydra + OmegaConf provides the config system, merging YAML files, CLI overrides, and <code>env.yaml</code> environment-specific settings. Five CLI entry points — <code>ng_run</code>, <code>ng_collect_rollouts</code>, <code>ng_test</code>, <code>ng_reward_profile</code>, <code>ng_status</code> — each compose different Hydra config groups to set up the appropriate server topology. This means switching from a local single-node run to a multi-node Ray cluster is a config change, not a code change.</p>

<div class="divider"></div>

<!-- ════════ DESIGN LESSONS ════════ -->
<h2><span class="section-label">06</span> Design Lessons</h2>

<p>A few things stand out about NeMo Gym's architecture that generalize beyond this specific system.</p>

<h3>HTTP as the Universal Connector</h3>
<p>The decision to use plain HTTP everywhere could be seen as a performance compromise, but it buys enormous flexibility. You can test a resources server with curl. You can run the model server on a different machine, in a different cloud, or behind a load balancer. You can replace any component with a mock. The 100k connection pool and async I/O ensure that HTTP overhead is not the bottleneck — the model inference is.</p>

<h3>Inheritance for Structure, Not Behavior</h3>
<p>The class hierarchy (<code>BaseServer → SimpleServer → SimpleResourcesServer / SimpleResponsesAPIModel / SimpleResponsesAPIAgent</code>) provides shared infrastructure — endpoint registration, health checks, session middleware, config loading — but each leaf implementation defines its own behavior. This is a principled use of inheritance that avoids the deep hierarchy trap: base classes provide <em>capabilities</em>, not <em>defaults that get overridden</em>.</p>

<h3>Verification Diversity Is a Feature</h3>
<p>Having 34 different verifiers is not bloat — it reflects the genuine diversity of what "correctness" means across tasks. Grid comparison, SQL equivalence, Lean4 compilation, safety × quality scoring, and pairwise LLM preference are fundamentally different notions of reward. Collapsing them all into one verifier interface (<code>BaseVerifyResponse</code>) while keeping their implementations separate is the right abstraction boundary.</p>

<div class="callout tip">
  <div class="callout-title">The Deeper Pattern</div>
  <p>NeMo Gym is a case study in how microservice decomposition, when done with discipline, can turn a complex ML systems problem into a collection of simple, composable pieces. The three-server-type design is easy to explain, easy to extend, and — critically for a research framework — easy to debug. Whether you are building your own RLVR pipeline or looking for architectural patterns that scale, there is a lot to learn from how this system was put together.</p>
</div>


<h2 id="reference" style="margin-top:56px"><span class="section-label">07</span> Architecture Reference Card</h2>
<p>A condensed reference view of the complete system — all server types, class hierarchy, data flow, and infrastructure at a glance.</p>

<style>
  .arch-ref-outer {
    --ar-bg: #0C0F16; --ar-bg2: #111723; --ar-panel: #151C28; --ar-border: #1F2A3A;
    --ar-text: #E2E8F0; --ar-text2: #8A96AD; --ar-text3: #54617A;
    --ar-agent: #4C93E0; --ar-agent-light: #6AADF5; --ar-agent-bg: rgba(76,147,224,0.07); --ar-agent-border: rgba(76,147,224,0.22);
    --ar-model: #3CC07E; --ar-model-light: #56D898; --ar-model-bg: rgba(60,192,126,0.07); --ar-model-border: rgba(60,192,126,0.22);
    --ar-res: #E89538; --ar-res-light: #F5AA55; --ar-res-bg: rgba(232,149,56,0.07); --ar-res-border: rgba(232,149,56,0.22);
    --ar-infra: #6E7A90; --ar-infra-light: #8895AC; --ar-infra-bg: rgba(110,122,144,0.06); --ar-infra-border: rgba(110,122,144,0.18);
    --ar-head: #B87CED; --ar-head-light: #CFA0F5; --ar-head-bg: rgba(184,124,237,0.07); --ar-head-border: rgba(184,124,237,0.22);
    background: var(--ar-bg); color: var(--ar-text);
    font-family: 'IBM Plex Sans', sans-serif;
    padding: 28px 24px; border-radius: 12px; margin: 32px -16px; overflow: hidden;
  }
  .arch-ref-outer .arch-ref-hdr { max-width: 1320px; margin: 0 auto 20px; }
  .arch-ref-outer .arch-ref-hdr h3 { font-family: 'IBM Plex Mono', monospace; font-size: 20px; font-weight: 700; color: var(--ar-head-light); letter-spacing: -0.5px; margin: 0; }
  .arch-ref-outer .arch-ref-hdr h3 span { color: var(--ar-text3); font-size: 12px; font-weight: 400; margin-left: 12px; }
  .arch-ref-outer .ar-legend { display: flex; gap: 18px; flex-wrap: wrap; margin-top: 10px; }
  .arch-ref-outer .ar-legend-item { display: flex; align-items: center; gap: 6px; font-size: 11px; color: var(--ar-text2); font-family: 'IBM Plex Mono', monospace; }
  .arch-ref-outer .ar-legend-dot { width: 9px; height: 9px; border-radius: 2px; }
  .arch-ref-outer .ar-legend-line { width: 18px; border-top: 1.5px dashed var(--ar-head); }
  .arch-ref-outer .ar-grid { display: grid; grid-template-columns: 1fr 1fr 320px; grid-template-rows: auto auto auto auto; gap: 14px; max-width: 1320px; margin: 0 auto; }
  .arch-ref-outer .ar-card { background: var(--ar-panel); border: 1px solid var(--ar-border); border-radius: 10px; overflow: hidden; position: relative; }
  .arch-ref-outer .ar-card-header { display: flex; align-items: center; justify-content: space-between; padding: 10px 16px; border-bottom: 1px solid var(--ar-border); }
  .arch-ref-outer .ar-card-header h3 { font-family: 'IBM Plex Mono', monospace; font-size: 13px; font-weight: 700; display: flex; align-items: center; gap: 8px; margin: 0; }
  .arch-ref-outer .ar-card-header h3 .ar-dot { width: 8px; height: 8px; border-radius: 50%; display: inline-block; }
  .arch-ref-outer .ar-count { font-family: 'IBM Plex Mono', monospace; font-size: 10px; font-weight: 600; padding: 2px 8px; border-radius: 4px; }
  .arch-ref-outer .ar-card-body { padding: 14px 16px; }
  .arch-ref-outer .ar-endpoint { display: inline-flex; align-items: center; gap: 6px; margin: 3px 4px 3px 0; font-family: 'IBM Plex Mono', monospace; font-size: 11px; }
  .arch-ref-outer .ar-method { font-size: 9px; font-weight: 700; padding: 1px 5px; border-radius: 3px; letter-spacing: 0.3px; }
  .arch-ref-outer .ar-path { color: var(--ar-text2); }
  .arch-ref-outer .ar-slabel { font-family: 'IBM Plex Mono', monospace; font-size: 10px; font-weight: 600; color: var(--ar-text2); margin: 12px 0 6px; text-transform: uppercase; letter-spacing: 0.5px; }
  .arch-ref-outer .ar-slabel:first-child { margin-top: 0; }
  .arch-ref-outer .ar-tag-list { display: flex; flex-wrap: wrap; gap: 5px; }
  .arch-ref-outer .ar-tag { font-family: 'IBM Plex Mono', monospace; font-size: 10px; padding: 3px 8px; border-radius: 4px; font-weight: 500; }
  .arch-ref-outer .ar-schema { font-family: 'IBM Plex Mono', monospace; font-size: 10.5px; padding: 10px 12px; border-radius: 6px; margin-top: 10px; line-height: 1.7; }
  .arch-ref-outer .ar-schema .ar-k { font-weight: 600; }
  .arch-ref-outer .ar-schema .ar-t { opacity: 0.55; }
  .arch-ref-outer .ar-agent .ar-card-header { background: var(--ar-agent-bg); }
  .arch-ref-outer .ar-agent .ar-card-header h3 { color: var(--ar-agent-light); }
  .arch-ref-outer .ar-agent .ar-count { background: var(--ar-agent-bg); color: var(--ar-agent); border: 1px solid var(--ar-agent-border); }
  .arch-ref-outer .ar-agent .ar-method { background: rgba(76,147,224,0.15); color: var(--ar-agent); }
  .arch-ref-outer .ar-agent .ar-tag { background: var(--ar-agent-bg); color: var(--ar-agent-light); border: 1px solid var(--ar-agent-border); }
  .arch-ref-outer .ar-model-c .ar-card-header { background: var(--ar-model-bg); }
  .arch-ref-outer .ar-model-c .ar-card-header h3 { color: var(--ar-model-light); }
  .arch-ref-outer .ar-model-c .ar-count { background: var(--ar-model-bg); color: var(--ar-model); border: 1px solid var(--ar-model-border); }
  .arch-ref-outer .ar-model-c .ar-method { background: rgba(60,192,126,0.15); color: var(--ar-model); }
  .arch-ref-outer .ar-model-c .ar-tag { background: var(--ar-model-bg); color: var(--ar-model-light); border: 1px solid var(--ar-model-border); }
  .arch-ref-outer .ar-model-c .ar-schema { background: var(--ar-model-bg); border: 1px solid var(--ar-model-border); }
  .arch-ref-outer .ar-model-c .ar-schema .ar-k { color: var(--ar-model-light); }
  .arch-ref-outer .ar-res-c .ar-card-header { background: var(--ar-res-bg); }
  .arch-ref-outer .ar-res-c .ar-card-header h3 { color: var(--ar-res-light); }
  .arch-ref-outer .ar-res-c .ar-count { background: var(--ar-res-bg); color: var(--ar-res); border: 1px solid var(--ar-res-border); }
  .arch-ref-outer .ar-res-c .ar-method { background: rgba(232,149,56,0.15); color: var(--ar-res); }
  .arch-ref-outer .ar-res-c .ar-tag { background: var(--ar-res-bg); color: var(--ar-res-light); border: 1px solid var(--ar-res-border); }
  .arch-ref-outer .ar-res-c .ar-schema { background: var(--ar-res-bg); border: 1px solid var(--ar-res-border); }
  .arch-ref-outer .ar-res-c .ar-schema .ar-k { color: var(--ar-res-light); }
  .arch-ref-outer .ar-head-c .ar-card-header { background: var(--ar-head-bg); }
  .arch-ref-outer .ar-head-c .ar-card-header h3 { color: var(--ar-head-light); }
  .arch-ref-outer .ar-head-c .ar-tag { background: var(--ar-head-bg); color: var(--ar-head-light); border: 1px solid var(--ar-head-border); }
  .arch-ref-outer .ar-infra-c .ar-card-header { background: var(--ar-infra-bg); }
  .arch-ref-outer .ar-infra-c .ar-card-header h3 { color: var(--ar-infra-light); }
  .arch-ref-outer .ar-infra-c .ar-tag { background: var(--ar-infra-bg); color: var(--ar-infra-light); border: 1px solid var(--ar-infra-border); }
  .arch-ref-outer .ar-flow-c .ar-card-header { background: rgba(255,255,255,0.02); }
  .arch-ref-outer .ar-flow-c .ar-card-header h3 { color: var(--ar-text); }
  .arch-ref-outer .ar-cli-row { grid-column: 1 / -1; }
  .arch-ref-outer .ar-agents-col { grid-column: 1; grid-row: 2 / 4; }
  .arch-ref-outer .ar-models-col { grid-column: 2; grid-row: 2; }
  .arch-ref-outer .ar-resources-col { grid-column: 1 / 3; grid-row: 4; }
  .arch-ref-outer .ar-hierarchy-col { grid-column: 3; grid-row: 2 / 4; }
  .arch-ref-outer .ar-flow-col { grid-column: 2; grid-row: 3; }
  .arch-ref-outer .ar-infra-col { grid-column: 1 / -1; grid-row: 5; }
  .arch-ref-outer .ar-agent-loop { margin-top: 12px; padding: 12px; background: rgba(76,147,224,0.03); border: 1px dashed var(--ar-agent-border); border-radius: 8px; }
  .arch-ref-outer .ar-agent-loop svg { width: 100%; height: auto; }
  .arch-ref-outer .ar-tree-node { display: flex; align-items: center; gap: 8px; padding: 5px 0; font-family: 'IBM Plex Mono', monospace; font-size: 11px; }
  .arch-ref-outer .ar-tree-indent { margin-left: 20px; position: relative; }
  .arch-ref-outer .ar-tree-indent::before { content: ''; position: absolute; left: -12px; top: 0; bottom: 50%; width: 1px; border-left: 1px solid var(--ar-border); }
  .arch-ref-outer .ar-tree-indent::after { content: ''; position: absolute; left: -12px; top: 50%; width: 10px; border-top: 1px solid var(--ar-border); }
  .arch-ref-outer .ar-tree-badge { font-size: 9px; padding: 1px 6px; border-radius: 3px; font-weight: 600; }
  .arch-ref-outer .ar-infra-grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(200px, 1fr)); gap: 10px; }
  .arch-ref-outer .ar-infra-item { padding: 12px; background: var(--ar-bg2); border: 1px solid var(--ar-border); border-radius: 6px; }
  .arch-ref-outer .ar-infra-item h4 { font-family: 'IBM Plex Mono', monospace; font-size: 11px; font-weight: 600; color: var(--ar-text); margin-bottom: 6px; }
  .arch-ref-outer .ar-infra-item p { font-family: 'IBM Plex Mono', monospace; font-size: 9.5px; color: var(--ar-text3); line-height: 1.6; margin: 0; }
  .arch-ref-outer .ar-step-list { list-style: none; counter-reset: step; padding: 0; margin: 0; }
  .arch-ref-outer .ar-step-list li { display: flex; align-items: flex-start; gap: 10px; padding: 5px 0; font-size: 11px; color: var(--ar-text2); font-family: 'IBM Plex Mono', monospace; line-height: 1.5; }
  .arch-ref-outer .ar-step-list li::before { counter-increment: step; content: counter(step); flex-shrink: 0; width: 20px; height: 20px; display: flex; align-items: center; justify-content: center; font-size: 10px; font-weight: 700; color: var(--ar-text2); border: 1px solid var(--ar-border); border-radius: 50%; background: var(--ar-bg); }
  .arch-ref-outer .ar-proto-bar { margin-top: 10px; padding: 8px 12px; background: rgba(110,122,144,0.04); border: 1px dashed var(--ar-infra-border); border-radius: 5px; font-family: 'IBM Plex Mono', monospace; font-size: 10px; color: var(--ar-text3); }
  @media (max-width: 1000px) {
    .arch-ref-outer .ar-grid { grid-template-columns: 1fr; grid-template-rows: auto; }
    .arch-ref-outer .ar-cli-row, .arch-ref-outer .ar-agents-col, .arch-ref-outer .ar-models-col,
    .arch-ref-outer .ar-resources-col, .arch-ref-outer .ar-hierarchy-col, .arch-ref-outer .ar-flow-col, .arch-ref-outer .ar-infra-col { grid-column: 1; grid-row: auto; }
  }
</style>

<div class="arch-ref-outer">
  <div class="arch-ref-hdr">
    <h3>NeMo Gym <span>NVIDIA Microservice-based RLVR Framework</span></h3>
    <div class="ar-legend">
      <div class="ar-legend-item"><div class="ar-legend-dot" style="background:var(--ar-agent)"></div> Agent Servers</div>
      <div class="ar-legend-item"><div class="ar-legend-dot" style="background:var(--ar-model)"></div> Model Servers</div>
      <div class="ar-legend-item"><div class="ar-legend-dot" style="background:var(--ar-res)"></div> Resources Servers</div>
      <div class="ar-legend-item"><div class="ar-legend-dot" style="background:var(--ar-infra)"></div> Infrastructure</div>
      <div class="ar-legend-item"><div class="ar-legend-dot" style="background:var(--ar-head)"></div> Head / Config</div>
      <div class="ar-legend-item"><div class="ar-legend-line"></div> session / cookie</div>
    </div>
  </div>
  <div class="ar-grid">

    <!-- CLI + CONFIG ROW -->
    <div class="ar-card ar-head-c ar-cli-row">
      <div class="ar-card-header">
        <h3><span class="ar-dot" style="background:var(--ar-head)"></span> CLI + Config Layer</h3>
        <span style="font-family:'IBM Plex Mono',monospace;font-size:10px;color:var(--ar-text3)">Hydra + OmegaConf &middot; YAML + CLI + env.yaml merge</span>
      </div>
      <div class="ar-card-body" style="display:flex;align-items:center;justify-content:space-between;flex-wrap:wrap;gap:8px">
        <div class="ar-tag-list">
          <span class="ar-tag">ng_run</span><span class="ar-tag">ng_collect_rollouts</span><span class="ar-tag">ng_test</span><span class="ar-tag">ng_reward_profile</span><span class="ar-tag">ng_status</span>
        </div>
        <div style="display:flex;align-items:center;gap:10px;padding:6px 12px;background:var(--ar-head-bg);border:1px solid var(--ar-head-border);border-radius:6px">
          <div class="ar-dot" style="width:8px;height:8px;border-radius:50%;background:var(--ar-head)"></div>
          <div>
            <div style="font-family:'IBM Plex Mono',monospace;font-size:11px;font-weight:700;color:var(--ar-head-light)">HeadServer :11000</div>
            <div style="font-family:'IBM Plex Mono',monospace;font-size:9px;color:var(--ar-text3)">Lifecycle coord &middot; Config distribution</div>
          </div>
        </div>
      </div>
    </div>

    <!-- AGENT SERVERS -->
    <div class="ar-card ar-agent ar-agents-col">
      <div class="ar-card-header">
        <h3><span class="ar-dot" style="background:var(--ar-agent)"></span> Agent Servers</h3>
        <span class="ar-count">8 types</span>
      </div>
      <div class="ar-card-body">
        <div class="ar-slabel">Endpoints</div>
        <div>
          <span class="ar-endpoint"><span class="ar-method">POST</span><span class="ar-path">/run</span></span>
          <span class="ar-endpoint"><span class="ar-method">POST</span><span class="ar-path">/v1/responses</span></span>
        </div>
        <div class="ar-slabel">Implementations</div>
        <div class="ar-tag-list">
          <span class="ar-tag">simple_agent</span><span class="ar-tag">proof_refinement_agent</span><span class="ar-tag">verifiers_agent</span><span class="ar-tag">tool_simulation_agent</span><span class="ar-tag">mini_swe_agent</span><span class="ar-tag">swe_agents</span><span class="ar-tag">aviary_agent</span>
        </div>
        <div class="ar-slabel">Agent Loop (core cycle)</div>
        <div class="ar-agent-loop">
          <svg viewBox="0 0 420 110" xmlns="http://www.w3.org/2000/svg">
            <defs>
              <marker id="ar-ag" markerWidth="7" markerHeight="5" refX="6" refY="2.5" orient="auto"><polygon points="0 0,7 2.5,0 5" fill="#4C93E0" opacity="0.8"/></marker>
              <marker id="ar-mg" markerWidth="7" markerHeight="5" refX="6" refY="2.5" orient="auto"><polygon points="0 0,7 2.5,0 5" fill="#3CC07E" opacity="0.8"/></marker>
              <marker id="ar-rg" markerWidth="7" markerHeight="5" refX="6" refY="2.5" orient="auto"><polygon points="0 0,7 2.5,0 5" fill="#E89538" opacity="0.8"/></marker>
            </defs>
            <rect x="2" y="22" width="58" height="30" rx="5" fill="rgba(255,255,255,0.03)" stroke="#1F2A3A"/>
            <text x="31" y="41" fill="#8A96AD" font-family="IBM Plex Mono" font-size="10" text-anchor="middle" font-weight="600">Input</text>
            <line x1="62" y1="37" x2="94" y2="37" stroke="#4C93E0" stroke-width="1.3" marker-end="url(#ar-ag)"/>
            <rect x="96" y="22" width="80" height="30" rx="5" fill="rgba(60,192,126,0.07)" stroke="rgba(60,192,126,0.22)"/>
            <text x="136" y="41" fill="#56D898" font-family="IBM Plex Mono" font-size="10" text-anchor="middle" font-weight="600">Model</text>
            <text x="136" y="18" fill="#54617A" font-family="IBM Plex Mono" font-size="8" text-anchor="middle">/v1/responses</text>
            <line x1="178" y1="37" x2="210" y2="37" stroke="#3CC07E" stroke-width="1.3" marker-end="url(#ar-mg)"/>
            <text x="194" y="30" fill="#54617A" font-family="IBM Plex Mono" font-size="7.5" text-anchor="middle">fn_calls</text>
            <rect x="212" y="22" width="80" height="30" rx="5" fill="rgba(232,149,56,0.07)" stroke="rgba(232,149,56,0.22)"/>
            <text x="252" y="41" fill="#F5AA55" font-family="IBM Plex Mono" font-size="10" text-anchor="middle" font-weight="600">Tools</text>
            <text x="252" y="18" fill="#54617A" font-family="IBM Plex Mono" font-size="8" text-anchor="middle">/&lt;tool_name&gt;</text>
            <path d="M252,54 C252,82 136,82 136,54" fill="none" stroke="#4C93E0" stroke-width="1.3" stroke-dasharray="4,3" marker-end="url(#ar-ag)"/>
            <text x="194" y="80" fill="#4C93E0" font-family="IBM Plex Mono" font-size="8" text-anchor="middle" opacity="0.7">loop until no fn_calls (or max_steps)</text>
            <line x1="294" y1="37" x2="326" y2="37" stroke="#E89538" stroke-width="1.3" marker-end="url(#ar-rg)"/>
            <text x="310" y="30" fill="#54617A" font-family="IBM Plex Mono" font-size="7.5" text-anchor="middle">final</text>
            <rect x="328" y="22" width="80" height="30" rx="5" fill="rgba(232,149,56,0.07)" stroke="rgba(232,149,56,0.22)"/>
            <text x="368" y="41" fill="#F5AA55" font-family="IBM Plex Mono" font-size="10" text-anchor="middle" font-weight="600">Verify</text>
            <text x="368" y="18" fill="#54617A" font-family="IBM Plex Mono" font-size="8" text-anchor="middle">/verify</text>
            <text x="368" y="65" fill="#8A96AD" font-family="IBM Plex Mono" font-size="8.5" text-anchor="middle">&#8594; reward: float</text>
            <circle r="3" fill="#4C93E0" opacity="0.8">
              <animateMotion dur="4s" repeatCount="indefinite" path="M31,37 L136,37 L252,37 L252,54 C252,82 136,82 136,54 L136,37 L252,37 L368,37"/>
            </circle>
          </svg>
        </div>
      </div>
    </div>

    <!-- MODEL SERVERS -->
    <div class="ar-card ar-model-c ar-models-col">
      <div class="ar-card-header">
        <h3><span class="ar-dot" style="background:var(--ar-model)"></span> Model Servers</h3>
        <span class="ar-count" style="background:var(--ar-model-bg);color:var(--ar-model);border:1px solid var(--ar-model-border)">5 types</span>
      </div>
      <div class="ar-card-body">
        <div class="ar-slabel">Endpoints</div>
        <div>
          <span class="ar-endpoint"><span class="ar-method">POST</span><span class="ar-path">/v1/chat/completions</span></span>
          <span class="ar-endpoint"><span class="ar-method">POST</span><span class="ar-path">/v1/responses</span></span>
        </div>
        <div class="ar-slabel">Types</div>
        <div class="ar-tag-list">
          <span class="ar-tag">OpenAI</span><span class="ar-tag">Azure OpenAI</span><span class="ar-tag">vLLM</span><span class="ar-tag">Local vLLM</span><span class="ar-tag">GenRM</span>
        </div>
        <div class="ar-schema">
          <span class="ar-k">NeMoGymResponse</span> {<br>
          &nbsp;&nbsp;<span class="ar-t">output_messages</span><br>
          &nbsp;&nbsp;<span class="ar-t">function_calls</span><br>
          &nbsp;&nbsp;<span class="ar-t">token_usage</span><br>
          }
        </div>
      </div>
    </div>

    <!-- DATA FLOW -->
    <div class="ar-card ar-flow-c ar-flow-col">
      <div class="ar-card-header"><h3>Data Flow</h3></div>
      <div class="ar-card-body">
        <ol class="ar-step-list">
          <li>Input JSONL &#8594; <strong style="color:var(--ar-res-light)">PreprocessRows</strong> assigns task_index, rollout_index</li>
          <li><strong style="color:var(--ar-agent-light)">RolloutCollectionHelper</strong> dispatches to Agent Server <code style="color:var(--ar-agent)">/run</code></li>
          <li>Agent calls Model Server <code style="color:var(--ar-model)">/v1/responses</code> &#8594; output + fn_calls</li>
          <li>Agent calls Resources Server <code style="color:var(--ar-res)">/&lt;tool_name&gt;</code> per fn_call</li>
          <li>Agent loops 3–4 until no fn_calls (or max_steps)</li>
          <li>Agent calls <code style="color:var(--ar-res)">/verify</code> &#8594; reward</li>
          <li>Results written to output JSONL</li>
          <li><strong style="color:var(--ar-res-light)">RewardProfiler</strong> computes pass@k, mean/max/min/std</li>
        </ol>
      </div>
    </div>

    <!-- CLASS HIERARCHY -->
    <div class="ar-card ar-hierarchy-col" style="border-color:var(--ar-border)">
      <div class="ar-card-header" style="background:rgba(255,255,255,0.015)">
        <h3 style="color:var(--ar-text)">Class Hierarchy</h3>
      </div>
      <div class="ar-card-body">
        <div class="ar-tree-node"><span style="color:var(--ar-text);font-weight:600;font-family:'IBM Plex Mono',monospace;font-size:11px">BaseServer</span></div>
        <div class="ar-tree-indent">
          <div class="ar-tree-node"><span style="color:var(--ar-text);font-weight:600;font-family:'IBM Plex Mono',monospace;font-size:11px">SimpleServer</span></div>
          <div class="ar-tree-indent">
            <div class="ar-tree-node"><span style="color:var(--ar-res-light);font-weight:600;font-family:'IBM Plex Mono',monospace;font-size:11px">SimpleResourcesServer</span><span class="ar-tree-badge" style="background:var(--ar-res-bg);color:var(--ar-res);border:1px solid var(--ar-res-border)">35+</span></div>
          </div>
          <div class="ar-tree-indent">
            <div class="ar-tree-node"><span style="color:var(--ar-model-light);font-weight:600;font-family:'IBM Plex Mono',monospace;font-size:11px">SimpleResponsesAPIModel</span><span class="ar-tree-badge" style="background:var(--ar-model-bg);color:var(--ar-model);border:1px solid var(--ar-model-border)">5</span></div>
          </div>
          <div class="ar-tree-indent">
            <div class="ar-tree-node"><span style="color:var(--ar-agent-light);font-weight:600;font-family:'IBM Plex Mono',monospace;font-size:11px">SimpleResponsesAPIAgent</span><span class="ar-tree-badge" style="background:var(--ar-agent-bg);color:var(--ar-agent);border:1px solid var(--ar-agent-border)">8</span></div>
          </div>
        </div>
        <div style="margin-top:14px;padding-top:14px;border-top:1px dashed var(--ar-head-border)">
          <div class="ar-tree-node"><span style="color:var(--ar-head-light);font-weight:600;font-family:'IBM Plex Mono',monospace;font-size:11px">HeadServer</span><span style="color:var(--ar-text3);font-family:'IBM Plex Mono',monospace;font-size:9px">(separate, coordinates all)</span></div>
        </div>
        <div style="margin-top:18px;padding:10px 12px;background:rgba(76,147,224,0.03);border:1px solid var(--ar-border);border-radius:6px">
          <div style="font-family:'IBM Plex Mono',monospace;font-size:9px;color:var(--ar-text3);margin-bottom:4px;font-weight:600;text-transform:uppercase;letter-spacing:0.5px">Communication</div>
          <div style="font-family:'IBM Plex Mono',monospace;font-size:10px;color:var(--ar-text2);line-height:1.6">3 composable FastAPI server types<br>All via async HTTP (aiohttp)<br>JSON payloads + cookie sessions</div>
        </div>
        <div style="margin-top:12px;padding:10px 12px;background:rgba(184,124,237,0.03);border:1px dashed var(--ar-head-border);border-radius:6px">
          <div style="font-family:'IBM Plex Mono',monospace;font-size:9px;color:var(--ar-text3);margin-bottom:4px;font-weight:600;text-transform:uppercase;letter-spacing:0.5px">Session Flow</div>
          <div style="font-family:'IBM Plex Mono',monospace;font-size:10px;color:var(--ar-text2);line-height:1.6">Cookie-based state (UUID/session)<br><code style="color:var(--ar-res-light)">/seed_session</code> initializes state<br>Cookies propagated across servers</div>
        </div>
      </div>
    </div>

    <!-- RESOURCES SERVERS -->
    <div class="ar-card ar-res-c ar-resources-col">
      <div class="ar-card-header">
        <h3><span class="ar-dot" style="background:var(--ar-res)"></span> Resources Servers</h3>
        <span class="ar-count" style="background:var(--ar-res-bg);color:var(--ar-res);border:1px solid var(--ar-res-border)">35+ impls</span>
      </div>
      <div class="ar-card-body">
        <div style="display:flex;gap:20px;flex-wrap:wrap">
          <div style="flex:1;min-width:260px">
            <div class="ar-slabel">Endpoints</div>
            <div>
              <span class="ar-endpoint"><span class="ar-method">POST</span><span class="ar-path">/verify</span></span>
              <span class="ar-endpoint"><span class="ar-method">POST</span><span class="ar-path">/seed_session</span></span>
              <span class="ar-endpoint"><span class="ar-method">POST</span><span class="ar-path">/&lt;tool_name&gt;</span></span>
            </div>
            <div class="ar-slabel">Implementations</div>
            <div class="ar-tag-list">
              <span class="ar-tag">code_gen</span><span class="ar-tag">math_with_code</span><span class="ar-tag">instruction_following</span><span class="ar-tag">jailbreak_detection</span><span class="ar-tag">genrm_compare</span><span class="ar-tag">arc_agi</span><span class="ar-tag">aviary</span><span class="ar-tag">mini_swe_agent</span><span class="ar-tag">text_to_sql</span><span class="ar-tag">structured_outputs</span><span class="ar-tag">tool_calling</span><span class="ar-tag">data_extraction</span><span class="ar-tag" style="opacity:0.5">... and 23+ more</span>
            </div>
          </div>
          <div style="min-width:200px">
            <div class="ar-slabel">Returns</div>
            <div class="ar-schema">
              <span class="ar-k">BaseVerifyResponse</span> {<br>
              &nbsp;&nbsp;<span class="ar-t">reward: float</span><br>
              &nbsp;&nbsp;<span class="ar-t">info: dict</span><br>
              }
            </div>
          </div>
        </div>
      </div>
    </div>

    <!-- INFRASTRUCTURE -->
    <div class="ar-card ar-infra-c ar-infra-col">
      <div class="ar-card-header">
        <h3><span class="ar-dot" style="background:var(--ar-infra)"></span> Infrastructure Layer</h3>
      </div>
      <div class="ar-card-body">
        <div class="ar-infra-grid">
          <div class="ar-infra-item"><h4>ServerClient</h4><p>aiohttp wrapper<br>Retry: 3x exponential backoff<br>Pool: 100k total, 1k/host<br>Cookie propagation</p></div>
          <div class="ar-infra-item"><h4>Session Middleware</h4><p>Cookie-based session state<br>UUID per session<br>State propagation across<br>all server types</p></div>
          <div class="ar-infra-item"><h4>Concurrency</h4><p>asyncio.Semaphore<br>Parallel rollouts<br>async HTTP (aiohttp)<br>Non-blocking I/O</p></div>
          <div class="ar-infra-item"><h4>Ray Cluster</h4><p>Distributed job scheduling<br>Multi-node scale-out<br>Resource management</p></div>
          <div class="ar-infra-item"><h4>Config System</h4><p>Hydra + OmegaConf<br>YAML + CLI + env.yaml merge<br>Per-server configuration</p></div>
        </div>
        <div class="ar-proto-bar">All inter-server communication: async HTTP (aiohttp) &middot; FastAPI endpoints &middot; JSON payloads &middot; Cookie-based session propagation &middot; 3 composable server types on configurable ports</div>
      </div>
    </div>

  </div>
</div>

</article>

<div class="post-end">
  <p class="post-author">Lorenzo Xiao &middot; Language Technologies Institute &middot; Carnegie Mellon University</p>
  <a href="/blog/">&larr; Back to blog</a>
</div>]]></content><author><name>Lorenzo Xiao</name></author><category term="blog" /><category term="academic" /><category term="AI" /><category term="engineering" /><summary type="html"><![CDATA[A deep dive into NeMo Gym's three-server-type design, 34 reward verifiers, and the infrastructure decisions that make RLVR pipelines composable at scale.]]></summary></entry></feed>