ML Infrastructure

Dissecting NeMo Gym: How NVIDIA Built a Modular Microservice Architecture for RL Verification at Scale

Three server types. Thirty-four verifiers. One HTTP protocol. A close read of how composability beats configuration in RLVR pipelines.

Lorenzo Xiao · Language Technologies Institute, CMU · March 2026

If you have spent any time working on reinforcement learning from verifiable rewards (RLVR), you know the pain of wiring together the moving parts: a model generating rollouts, tools for it to interact with, a verifier to score the output, and orchestration glue to hold it all together. Most frameworks handle this by cramming everything into a monolithic training loop that becomes impossible to extend. NeMo Gym decomposes the entire pipeline into three types of composable microservices, connected by nothing more than async HTTP calls and cookie-based sessions.

This post is a deep dive into the architecture — the design philosophy, the three server types, the data flow from input to reward profiling, and the infrastructure decisions that make it work at scale.

[Architecture diagram] HeadServer (port 11000: lifecycle & config) sits above three core servers: the Agent Server (orchestration, 7 types: POST /run, POST /v1/responses), the Model Server (LLM inference, 5 types: POST /v1/chat/completions), and the Resources Server (tools & verifiers, 34 types: POST /verify, POST /<tool>). Agents call the model, then dispatch tool calls and /verify to resources. All communication: async HTTP (aiohttp).

The complete system. HeadServer handles lifecycle; the three core servers handle data flow. HTTP throughout.

Agent Servers: The Orchestration Layer 7 types

Agent servers are the conductors of the system. They receive a task via POST /run, execute the full loop — call the model, extract function calls, dispatch them to the resources server, feed results back, repeat — and then call /verify to return a reward. What makes the design interesting is that "orchestration" is not one-size-fits-all. NeMo Gym ships seven distinct agents, each encoding a different loop pattern.

[Diagram: seven agent implementations]

- simple_agent: tool-augmented while loop. Call → extract → dispatch → repeat. (The workhorse.)
- aviary_agent: RL gym environments. seed_session → step → verify → close.
- proof_refinement_agent: error-feedback loop for formal proofs. No tool calls; model ↔ verifier only.
- tool_simulation_agent: single call + single verify. Tool use simulated inside the verifier.
- verifiers_agent: external verifiers library. Preserves token IDs + logprobs for RL.
- mini_swe_agent: SWE-gym via Ray. Docker + Singularity.
- swe_agents: full SWE-bench. 100 turns, 4 frameworks.

All inherit from SimpleResponsesAPIAgent → SimpleServer → BaseServer for shared endpoint registration, health checks, and session middleware.

Seven agents, seven loop patterns. The shared base class provides infrastructure; each leaf defines behavior.

The Composability Principle

Instead of building one agent with a hundred config flags, NeMo Gym builds seven agents that each do one thing well. The tool_simulation_agent does not have a "disable tool calls" flag — it simply never makes tool calls. The proof_refinement_agent does not have a "correction mode" toggle — it always does correction. Each implementation is small, testable, and easy to reason about.

The simple_agent is the workhorse: a while loop that calls the model, extracts function calls, POSTs each one to the resources server's /<tool_name> endpoint, accumulates outputs, feeds them back, and repeats until there are no more function calls or max_steps is reached. The verifiers_agent is the most RL-aware: it preserves prompt_token_ids, generation_token_ids, and logprobs in its output, making it directly compatible with policy gradient training pipelines that need these quantities.
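In code, the simple_agent's core cycle reduces to a short loop. The sketch below is illustrative only: the helper names (`call_model`, `call_tool`) and message shapes are assumptions standing in for NeMo Gym's actual API.

```python
# Hypothetical sketch of the simple_agent loop. call_model and call_tool
# stand in for HTTP calls to the model server (/v1/responses) and the
# resources server (/<tool_name>); they are not NeMo Gym's real API.
def run_simple_agent(task, call_model, call_tool, max_steps=10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        output = call_model(messages)          # POST /v1/responses
        messages.extend(output)
        fn_calls = [m for m in output if m.get("type") == "function_call"]
        if not fn_calls:                       # model stopped calling tools
            break
        for call in fn_calls:                  # POST /<tool_name> per call
            result = call_tool(call["name"], call["arguments"])
            messages.append({"role": "tool", "name": call["name"],
                             "content": result})
    return messages
```

After the loop terminates, the agent would POST the transcript to /verify to obtain the reward.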

Model Servers: Abstracting Inference 5 types

Model servers expose two endpoints — POST /v1/chat/completions and POST /v1/responses — and return a NeMoGymResponse containing output[] (messages and function calls) and usage (input and output token counts). The five implementations form a clean capability inheritance chain.

Server Extends Key Capability Added
openai_model SimpleModelServer Generic OpenAI-compatible client. Works with any endpoint that speaks the OpenAI API.
azure_openai_model openai_model Azure deployments with api-version handling, VLLMConverter, semaphore concurrency control.
vllm_model azure_openai_model Multi-endpoint round-robin, <think> tag reasoning parser, token ID + logprob tracking, graceful context-length truncation.
local_vllm_model vllm_model Launches vLLM as a Ray actor. Tensor, pipeline, and data parallelism via Ray placement groups. Multi-node GPU support.
genrm_model local_vllm_model Custom conversation roles (response_1, response_2, principle) for pairwise reward modeling comparisons.

The inheritance chain is purposeful: each layer adds exactly one conceptual capability. The local_vllm_model is where things get operationally interesting — it handles HuggingFace token and cache management, polls for server health before accepting requests, and applies internal vLLM patches for compatibility. The genrm_model at the tip of the chain is the model server you use when your reward signal comes from comparing two outputs side by side.
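The vllm_model's reasoning parser is the kind of capability each layer adds. A minimal version of a `<think>`-tag parser might look like the following; this is a sketch of the general technique, not NeMo Gym's actual implementation.

```python
import re

# Minimal <think>-tag reasoning parser, in the spirit of the vllm_model's
# parser described above; the real implementation may differ.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text):
    """Return (reasoning, visible_answer) from a raw completion."""
    reasoning = "\n".join(THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer
```

Separating hidden reasoning from the visible answer matters downstream: the verifier should typically score only the answer, while RL training may still want the full token stream.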

Resources Servers: 34 Verifiers, One Interface

All 34 resources servers expose the same three endpoints: POST /verify, POST /seed_session, and POST /<tool_name>. They all return BaseVerifyResponse { reward: float, info: dict }. The reward types break into three families.
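The shared contract is small enough to sketch directly. Field names below follow the post (reward, info); the example verifier and its name are illustrative, not one of the 34 real servers.

```python
from dataclasses import dataclass, field

# Sketch of the shared verifier contract (BaseVerifyResponse) described
# above; the exact-match verifier is a hypothetical minimal example.
@dataclass
class BaseVerifyResponse:
    reward: float                         # 0/1 binary, or [0, 1] continuous
    info: dict = field(default_factory=dict)

def verify_exact_match(prediction, target):
    """Minimal binary verifier: reward 1.0 iff strings match after stripping."""
    ok = prediction.strip() == target.strip()
    return BaseVerifyResponse(reward=1.0 if ok else 0.0, info={"matched": ok})
```

Because every server returns this one shape, the agent loop never needs to know which of the 34 verifiers sits behind /verify.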

[Diagram: 34 resources servers by reward type]

Binary reward (returns 0 or 1; 18 servers): arc_agi (2D grid parse from \boxed{}), code_gen (Ray execution + unit tests), structured_outputs (JSON → OpenAPI schema validation), text_to_sql (MySQL, PostgreSQL, SQLite dialects), xlam_fc (function call greedy match), plus 13 more.

Continuous reward (returns [0, 1]; 8 servers): genrm_compare (pairwise GenRM + length bonus), math_formal_lean (0.3× symbolic + 0.7× Lean4 RMSLE), math_with_judge (math_verify with LLM judge fallback), multichallenge (multi-rubric: mean/min/max/all/any), aviary (cumulative env step rewards), plus 3 more.

Compound / varied (mixed types; 8 servers): jailbreak_detection (safety × quality compound score), instruction_following (binary strict or fraction [0, 1]), equivalence_llm_judge (0 / 0.5 / 1 with swap-check), over_refusal_detection (safety + helpfulness balance), mcqa (3-mode letter extraction), plus 3 more.

34 verifiers, three reward families. 13 use LLM judges, 5 use sandboxed Ray execution, 4 use session state.

Rewards as First-Class Design

With 34 verifiers spanning binary, continuous, and compound reward types, NeMo Gym treats reward computation as a rich design space rather than an afterthought. The fact that 13 of 34 verifiers use LLM judges reflects the reality that ground-truth verification is expensive or impossible for many tasks — and the system accommodates that complexity rather than pretending it doesn't exist.

Across all 34 servers, patterns repeat: session state flows through cookies (4 servers require seed_session), Ray handles sandboxed code execution (5 servers), and LLM judges step in wherever symbolic verification fails. The math_formal_lean verifier is the most sophisticated — a hybrid that weights Lean4 formal proof compilation at 70% and symbolic equivalence at 30%, with multi-turn error feedback injected back into the model conversation.
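The math_formal_lean weighting is simple to state as code. The function below just recombines the 30/70 split described above; the two component scores are stand-ins for the real symbolic checker and Lean4 compiler.

```python
# Illustrative recombination of the math_formal_lean weighting described
# above: 0.3 x symbolic-equivalence score + 0.7 x Lean4 compilation score.
# The component scoring functions themselves are not shown here.
def hybrid_math_reward(symbolic_score, lean_score):
    return 0.3 * symbolic_score + 0.7 * lean_score
```

The asymmetry is deliberate: a proof that compiles in Lean4 is much stronger evidence of correctness than symbolic agreement alone, so it carries most of the weight.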

The Data Flow: JSONL to Reward Profiles

The end-to-end pipeline follows eight steps, from raw task input to aggregated pass@k metrics.

[Diagram: 8-step pipeline, JSONL → reward profile] 01 Input (JSONL tasks) → 02 Preprocess (task/rollout idx) → 03 Dispatch (POST /run) → 04 Model call (/v1/responses) → 05 Tool exec (POST /<tool>), repeating 04–05 until no more calls or max_steps → 06 Verify (POST /verify) → 07 Write (output JSONL) → 08 Reward profile (pass@1/4/16 + stats)

Steps 04–05 form the inner loop. The outer pipeline runs once per task. RewardProfiler computes pass@k at the end.

The profiling step is worth dwelling on. RewardProfiler computes pass@k for k ∈ {1, 4, 16} along with mean, max, min, median, and standard deviation — both per-task and globally. This is not just a logging convenience; it directly measures the key quantity in RLVR: how often does the model produce a verifiably correct answer under multiple sample draws?
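The standard way to compute pass@k without high variance is the unbiased estimator from Chen et al. (2021): given n samples of which c are correct, pass@k = 1 − C(n−c, k)/C(n, k). Whether RewardProfiler uses this exact formula is an assumption; the estimator itself is standard.

```python
from math import comb

# Unbiased pass@k estimator (Chen et al., 2021). Whether RewardProfiler
# uses this exact formula is an assumption; the estimator is standard.
def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn without
    replacement from n rollouts (c of them correct) is correct."""
    if n - c < k:          # fewer incorrect samples than draws: guaranteed hit
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 16 rollouts per task and 4 correct, pass@1 is 0.25 but pass@16 is 1.0, which is exactly the gap RLVR training tries to close.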

Infrastructure: The Parts That Make It Work

ServerClient

All inter-server HTTP communication goes through ServerClient, an aiohttp wrapper with retry logic (3× exponential backoff) and connection pooling set to 100,000 total connections and 1,000 per host. These numbers are aggressive but intentional: when running thousands of parallel rollouts across a Ray cluster, you need the headroom before connection exhaustion becomes the bottleneck.
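The retry pattern is worth seeing concretely. The sketch below is generic, not ServerClient's code: the request is an injected coroutine so the backoff logic stands alone without a live HTTP server.

```python
import asyncio

# Generic sketch of ServerClient-style retry behavior: up to 3 attempts
# with exponential backoff. The real class wraps aiohttp; here the
# request is an injected coroutine so the pattern is self-contained.
async def request_with_retry(do_request, attempts=3, base_delay=0.1):
    for attempt in range(attempts):
        try:
            return await do_request()
        except Exception:
            if attempt == attempts - 1:
                raise                               # out of retries
            await asyncio.sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, ...
```

The pool limits described above would correspond to aiohttp's `TCPConnector(limit=100_000, limit_per_host=1_000)` on the shared client session.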

Session Middleware

Cookie-based session state with UUID-per-session is an unconventional choice for a distributed system, but it elegantly solves the routing problem. When an agent calls POST /seed_session, the server creates session state and returns a cookie. All subsequent calls from that agent carry the cookie, giving the server access to the right session. No centralized session store, no distributed cache to manage, no sticky session configuration at the load balancer level.
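Stripped of HTTP plumbing, the mechanism is just a dict keyed by a random UUID. The class below is a minimal illustration; a real server would read and set the cookie in middleware rather than pass the ID explicitly.

```python
import uuid

# Minimal sketch of cookie-keyed session state as described above.
# A real server reads/sets the cookie via HTTP middleware; here the
# session ID is passed explicitly for clarity.
class SessionStore:
    def __init__(self):
        self._sessions = {}

    def seed_session(self):
        """POST /seed_session: create state, return the cookie value."""
        session_id = str(uuid.uuid4())
        self._sessions[session_id] = {}
        return session_id

    def get(self, session_id):
        """Subsequent calls carry the cookie and land on their own state."""
        return self._sessions[session_id]
```

Two concurrent rollouts get two UUIDs, so their state never collides even on the same server instance.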

Session State Is Local to Each Server

Because session state lives in server memory (keyed by cookie UUID), agent requests must reach the same resources server instance across all turns of a session. This is fine in the default single-instance configuration, but warrants attention if you scale resources servers horizontally behind a load balancer. You would need sticky sessions or session state externalization for that use case.

Ray Cluster

Ray handles distributed job scheduling, multi-node GPU management (tensor, pipeline, and data parallelism), placement groups for co-locating related processes, and actor-based vLLM server management. The local_vllm_model launches its vLLM instance as a Ray actor, which means Ray handles placement, fault tolerance, and resource allocation. TP × PP × DP configurations are set in Hydra config and passed through to vLLM's CLI flags automatically.

Configuration

Hydra + OmegaConf provides the config system, merging YAML files, CLI overrides, and env.yaml environment-specific settings. Five CLI entry points — ng_run, ng_collect_rollouts, ng_test, ng_reward_profile, ng_status — each compose different Hydra config groups to set up the appropriate server topology. This means switching from a local single-node run to a multi-node Ray cluster is a config change, not a code change.
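As a rough illustration, a Hydra-style composition for a local run might look like the fragment below. The group names, keys, and values are assumptions for illustration, not NeMo Gym's actual config schema.

```yaml
# Hypothetical Hydra-style config sketch; group and key names are
# illustrative, not NeMo Gym's real schema.
defaults:
  - agent: simple_agent
  - model: local_vllm_model
  - resources: code_gen

model:
  tensor_parallel_size: 2    # TP x PP x DP pass through to vLLM CLI flags
  pipeline_parallel_size: 1
  data_parallel_size: 1
```

Scaling out would then be a standard Hydra command-line override (e.g. `model.tensor_parallel_size=8`) rather than a code change.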

Design Lessons

A few things stand out about NeMo Gym's architecture that generalize beyond this specific system.

HTTP as the Universal Connector

The decision to use plain HTTP everywhere could be seen as a performance compromise, but it buys enormous flexibility. You can test a resources server with curl. You can run the model server on a different machine, in a different cloud, or behind a load balancer. You can replace any component with a mock. The 100k connection pool and async I/O ensure that HTTP overhead is not the bottleneck — the model inference is.

Inheritance for Structure, Not Behavior

The class hierarchy (BaseServer → SimpleServer → SimpleResourcesServer / SimpleResponsesAPIModel / SimpleResponsesAPIAgent) provides shared infrastructure — endpoint registration, health checks, session middleware, config loading — but each leaf implementation defines its own behavior. This is a principled use of inheritance that avoids the deep hierarchy trap: base classes provide capabilities, not defaults that get overridden.
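The pattern is easiest to see in miniature. Class, route, and method names below are illustrative, not NeMo Gym's real API; the point is that the base registers shared endpoints and the leaf only adds behavior.

```python
# Sketch of "base classes provide capabilities, leaves provide behavior".
# All names here are hypothetical stand-ins for the real hierarchy.
class BaseServer:
    """Shared infrastructure: endpoint registration and a health check."""
    def __init__(self):
        self.routes = {}
        self.register("/health", lambda: {"status": "ok"})

    def register(self, path, handler):
        self.routes[path] = handler

class SimpleAgentServer(BaseServer):
    """Leaf behavior: a /run endpoint. Nothing in the base is overridden."""
    def __init__(self):
        super().__init__()
        self.register("/run", self.run)

    def run(self):
        return {"reward": 0.0}
```

Because no leaf overrides base behavior, reading any single class tells you everything it does: capabilities accumulate, defaults never silently change.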

Verification Diversity Is a Feature

Having 34 different verifiers is not bloat — it reflects the genuine diversity of what "correctness" means across tasks. Grid comparison, SQL equivalence, Lean4 compilation, safety × quality scoring, and pairwise LLM preference are fundamentally different notions of reward. Collapsing them all into one verifier interface (BaseVerifyResponse) while keeping their implementations separate is the right abstraction boundary.

The Deeper Pattern

NeMo Gym is a case study in how microservice decomposition, when done with discipline, can turn a complex ML systems problem into a collection of simple, composable pieces. The three-server-type design is easy to explain, easy to extend, and — critically for a research framework — easy to debug. Whether you are building your own RLVR pipeline or looking for architectural patterns that scale, there is a lot to learn from how this system was put together.

Architecture Reference Card

A condensed reference view of the complete system — all server types, class hierarchy, data flow, and infrastructure at a glance.

NeMo Gym NVIDIA Microservice-based RLVR Framework

Agent Servers
Model Servers
Resources Servers
Infrastructure
Head / Config
session / cookie

CLI + Config Layer

Hydra + OmegaConf · YAML + CLI + env.yaml merge
ng_run · ng_collect_rollouts · ng_test · ng_reward_profile · ng_status
HeadServer :11000
Lifecycle coord · Config distribution

Agent Servers

7 types
Endpoints
POST /run · POST /v1/responses
Implementations
simple_agent · proof_refinement_agent · verifiers_agent · tool_simulation_agent · mini_swe_agent · swe_agents · aviary_agent
Agent Loop (core cycle)
Input → Model (/v1/responses) → fn_calls → Tools (/<tool_name>), looping until no fn_calls (or max_steps) → final → Verify (/verify) → reward: float

Model Servers

5 types
Endpoints
POST /v1/chat/completions · POST /v1/responses
Types
OpenAI · Azure OpenAI · vLLM · Local vLLM · GenRM
NeMoGymResponse {
  output[]   // messages + function calls
  usage      // input + output token counts
}

Data Flow

  1. Input JSONL → PreprocessRows assigns task_index, rollout_index
  2. RolloutCollectionHelper dispatches to Agent Server /run
  3. Agent calls Model Server /v1/responses → output + fn_calls
  4. Agent calls Resources Server /<tool_name> per fn_call
  5. Agent loops 3–4 until no fn_calls (or max_steps)
  6. Agent calls /verify → reward
  7. Results written to output JSONL
  8. RewardProfiler computes pass@k, mean/max/min/std

Class Hierarchy

BaseServer
→ SimpleServer
  → SimpleResourcesServer (34)
  → SimpleResponsesAPIModel (5)
  → SimpleResponsesAPIAgent (7)
HeadServer (separate, coordinates all)
Communication
3 composable FastAPI server types
All via async HTTP (aiohttp)
JSON payloads + cookie sessions
Session Flow
Cookie-based state (UUID/session)
/seed_session initializes state
Cookies propagated across servers

Resources Servers

34 impls
Endpoints
POST /verify · POST /seed_session · POST /<tool_name>
Implementations
code_gen · math_with_code · instruction_following · jailbreak_detection · genrm_compare · arc_agi · aviary · mini_swe_agent · text_to_sql · structured_outputs · tool_calling · data_extraction · and 22 more
Returns
BaseVerifyResponse {
  reward: float
  info: dict
}

Infrastructure Layer

ServerClient

aiohttp wrapper
Retry: 3x exponential backoff
Pool: 100k total, 1k/host
Cookie propagation

Session Middleware

Cookie-based session state
UUID per session
State propagation across
all server types

Concurrency

asyncio.Semaphore
Parallel rollouts
async HTTP (aiohttp)
Non-blocking I/O

Ray Cluster

Distributed job scheduling
Multi-node scale-out
Resource management

Config System

Hydra + OmegaConf
YAML + CLI + env.yaml merge
Per-server configuration

All inter-server communication: async HTTP (aiohttp) · FastAPI endpoints · JSON payloads · Cookie-based session propagation · 3 composable server types on configurable ports