If you have spent any time working on reinforcement learning from verifiable rewards (RLVR), you know the pain: you need a model generating rollouts, tools for it to interact with, a verifier to score the output, and orchestration glue to hold it all together. Most frameworks handle this by cramming everything into a monolithic training loop that becomes impossible to extend. NeMo Gym decomposes the entire pipeline into three types of composable microservices, connected by nothing more than async HTTP calls and cookie-based sessions.
This post is a deep dive into the architecture — the design philosophy, the three server types, the data flow from input to reward profiling, and the infrastructure decisions that make it work at scale.
The complete system. HeadServer handles lifecycle; the three core servers handle data flow. HTTP throughout.
01 Agent Servers: The Orchestration Layer (7 types)
Agent servers are the conductors of the system. They receive a task via POST /run, execute the full loop — call the model, extract function calls, dispatch them to the resources server, feed results back, repeat — and then call /verify to return a reward. What makes the design interesting is that "orchestration" is not one-size-fits-all. NeMo Gym ships seven distinct agents, each encoding a different loop pattern.
Seven agents, seven loop patterns. The shared base class provides infrastructure; each leaf defines behavior.
Instead of building one agent with a hundred config flags, NeMo Gym builds seven agents that each do one thing well. The tool_simulation_agent does not have a "disable tool calls" flag — it simply never makes tool calls. The proof_refinement_agent does not have a "correction mode" toggle — it always does correction. Each implementation is small, testable, and easy to reason about.
The simple_agent is the workhorse: a while loop that calls the model, extracts function calls, POSTs each one to the resources server's /<tool_name> endpoint, accumulates outputs, feeds them back, and repeats until there are no more function calls or max_steps is reached. The verifiers_agent is the most RL-aware: it preserves prompt_token_ids, generation_token_ids, and logprobs in its output, making it directly compatible with policy gradient training pipelines that need these quantities.
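The simple_agent loop can be sketched in a few lines. The snippet below is illustrative, not NeMo Gym's actual code: `call_model` and `call_tool` are hypothetical stubs standing in for the HTTP POSTs to `/v1/responses` and `/<tool_name>`, but the control flow mirrors the description above.

```python
import asyncio

# Sketch of the simple_agent loop. call_model / call_tool are stubs standing
# in for HTTP POSTs to the model and resources servers; the names are
# illustrative, not NeMo Gym's real API.

async def call_model(messages):
    # Stub: a real agent would POST to /v1/responses on the model server.
    # Here the "model" requests one tool call, then answers on the next turn.
    if not any(m["role"] == "tool" for m in messages):
        return {"content": None,
                "function_calls": [{"name": "lookup", "arguments": {"q": "pi"}}]}
    return {"content": "pi is about 3.14159", "function_calls": []}

async def call_tool(fn_call):
    # Stub: a real agent would POST to /<tool_name> on the resources server.
    return {"role": "tool", "content": "3.14159"}

async def run(task_prompt, max_steps=8):
    messages = [{"role": "user", "content": task_prompt}]
    for _ in range(max_steps):
        reply = await call_model(messages)
        fn_calls = reply["function_calls"]
        if not fn_calls:  # loop ends when the model makes no more tool calls
            messages.append({"role": "assistant", "content": reply["content"]})
            break
        for fn_call in fn_calls:  # dispatch each call, feed the result back
            messages.append(await call_tool(fn_call))
    return messages

final = asyncio.run(run("What is pi?"))
```

The `max_steps` bound is what keeps a confused model from looping forever; in the real agent it is a config value rather than a function default.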
02 Model Servers: Abstracting Inference (5 types)
Model servers expose two endpoints — POST /v1/chat/completions and POST /v1/responses — and return a NeMoGymResponse containing output[] (messages and function calls) and usage (input and output token counts). The five implementations form a clean capability inheritance chain.
| Server | Extends | Key Capability Added |
|---|---|---|
| openai_model | SimpleModelServer | Generic OpenAI-compatible client. Works with any endpoint that speaks the OpenAI API. |
| azure_openai_model | openai_model | Azure deployments with api-version handling, VLLMConverter, semaphore concurrency control. |
| vllm_model | azure_openai_model | Multi-endpoint round-robin, `<think>` tag reasoning parser, token ID + logprob tracking, graceful context-length truncation. |
| local_vllm_model | vllm_model | Launches vLLM as a Ray actor. Tensor, pipeline, and data parallelism via Ray placement groups. Multi-node GPU support. |
| genrm_model | local_vllm_model | Custom conversation roles (response_1, response_2, principle) for pairwise reward modeling comparisons. |
The inheritance chain is purposeful: each layer adds exactly one conceptual capability. The local_vllm_model is where things get operationally interesting — it handles HuggingFace token and cache management, polls for server health before accepting requests, and applies internal vLLM patches for compatibility. The genrm_model at the tip of the chain is the model server you use when your reward signal comes from comparing two outputs side by side.
03 Resources Servers: 34 Verifiers, One Interface
All 34 resources servers expose the same three endpoints: POST /verify, POST /seed_session, and POST /<tool_name>. Every /verify call returns a BaseVerifyResponse { reward: float, info: dict }. The reward types break into three families.
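A minimal sketch of that shared interface, with the class and method names being illustrative stand-ins (the real servers are HTTP services, not in-process objects):

```python
import uuid
from dataclasses import dataclass, field

# Sketch of the shared resources-server interface described above.
# ResourcesServerSketch and its method names are hypothetical; only the
# endpoint shapes (/seed_session, /<tool_name>, /verify) follow the text.

@dataclass
class BaseVerifyResponse:
    reward: float
    info: dict = field(default_factory=dict)

class ResourcesServerSketch:
    def __init__(self):
        self.sessions = {}  # session state keyed by cookie UUID

    def seed_session(self, initial_state):
        # POST /seed_session: create per-session state, hand back a cookie.
        cookie = str(uuid.uuid4())
        self.sessions[cookie] = dict(initial_state)
        return cookie

    def call_tool(self, cookie, tool_name, args):
        # POST /<tool_name>: run a tool against this session's state.
        state = self.sessions[cookie]
        state.setdefault("calls", []).append((tool_name, args))
        return {"ok": True}

    def verify(self, cookie, final_answer, expected):
        # POST /verify: score the rollout, return reward plus diagnostics.
        reward = 1.0 if final_answer == expected else 0.0
        n_calls = len(self.sessions[cookie].get("calls", []))
        return BaseVerifyResponse(reward=reward, info={"calls": n_calls})

server = ResourcesServerSketch()
cookie = server.seed_session({"task": "demo"})
server.call_tool(cookie, "search", {"q": "x"})
result = server.verify(cookie, "42", "42")
```

The point of the sketch is the uniformity: an agent never needs to know which of the 34 verifiers it is talking to, only that seed, tool, and verify calls exist.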
34 verifiers, three reward families. 13 use LLM judges, 5 use sandboxed Ray execution, 4 use session state.
With 34 verifiers spanning binary, continuous, and compound reward types, NeMo Gym treats reward computation as a rich design space rather than an afterthought. The fact that 13 of 34 verifiers use LLM judges reflects the reality that ground-truth verification is expensive or impossible for many tasks — and the system accommodates that complexity rather than pretending it doesn't exist.
Across all 34 servers, patterns repeat: session state flows through cookies (4 servers require seed_session), Ray handles sandboxed code execution (5 servers), and LLM judges step in wherever symbolic verification fails. The math_formal_lean verifier is the most sophisticated — a hybrid that weights Lean4 formal proof compilation at 70% and symbolic equivalence at 30%, with multi-turn error feedback injected back into the model conversation.
04 The Data Flow: JSONL to Reward Profiles
The end-to-end pipeline follows eight steps, from raw task input to aggregated pass@k metrics.
Steps 04–05 form the inner loop. The outer pipeline runs once per task. RewardProfiler computes pass@k at the end.
The profiling step is worth dwelling on. RewardProfiler computes pass@k for k ∈ {1, 4, 16} along with mean, max, min, median, and standard deviation — both per-task and globally. This is not just a logging convenience; it directly measures the key quantity in RLVR: how often does the model produce a verifiably correct answer under multiple sample draws?
05 Infrastructure: The Parts That Make It Work
ServerClient
All inter-server HTTP communication goes through ServerClient, an aiohttp wrapper with retry logic (3× exponential backoff) and connection pooling set to 100,000 total connections and 1,000 per host. These numbers are aggressive but intentional: when running thousands of parallel rollouts across a Ray cluster, you need the headroom before connection exhaustion becomes the bottleneck.
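The retry behavior can be sketched without aiohttp itself. The helper below is illustrative, not ServerClient's actual code: it wraps any async callable, retries up to three attempts, and doubles the backoff delay each time.

```python
import asyncio

async def with_retries(coro_fn, attempts: int = 3, base_delay: float = 0.01):
    """Retry an async callable with exponential backoff (a sketch of the
    pattern ServerClient uses; names and delay values are illustrative).
    In aiohttp terms, the pooling described above would correspond to
    aiohttp.TCPConnector(limit=100_000, limit_per_host=1_000)."""
    for attempt in range(attempts):
        try:
            return await coro_fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the final failure
            await asyncio.sleep(base_delay * (2 ** attempt))

# Demo: a flaky call that fails twice, then succeeds on the third attempt.
calls = {"n": 0}

async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = asyncio.run(with_retries(flaky))
```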
Session Middleware
Cookie-based session state with UUID-per-session is an unconventional choice for a distributed system, but it elegantly solves the routing problem. When an agent calls POST /seed_session, the server creates session state and returns a cookie. All subsequent calls from that agent carry the cookie, giving the server access to the right session. No centralized session store, no distributed cache to manage, no sticky session configuration at the load balancer level.
Because session state lives in server memory (keyed by cookie UUID), agent requests must reach the same resources server instance across all turns of a session. This is fine in the default single-instance configuration, but warrants attention if you scale resources servers horizontally behind a load balancer. You would need sticky sessions or session state externalization for that use case.
Ray Cluster
Ray handles distributed job scheduling, multi-node GPU management (tensor, pipeline, and data parallelism), placement groups for co-locating related processes, and actor-based vLLM server management. The local_vllm_model launches its vLLM instance as a Ray actor, which means Ray handles placement, fault tolerance, and resource allocation. TP × PP × DP configurations are set in Hydra config and passed through to vLLM's CLI flags automatically.
Configuration
Hydra + OmegaConf provides the config system, merging YAML files, CLI overrides, and env.yaml environment-specific settings. Five CLI entry points — ng_run, ng_collect_rollouts, ng_test, ng_reward_profile, ng_status — each compose different Hydra config groups to set up the appropriate server topology. This means switching from a local single-node run to a multi-node Ray cluster is a config change, not a code change.
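As an illustration of the "config change, not a code change" point, an override file for a multi-GPU run might look something like the fragment below. The field names here are hypothetical, not taken from NeMo Gym's actual config schema:

```yaml
# Hypothetical Hydra-style override; real NeMo Gym config groups differ.
model_server:
  type: local_vllm_model
  tensor_parallel_size: 4      # TP
  pipeline_parallel_size: 2    # PP
  data_parallel_size: 2        # DP; TP x PP x DP = 16 GPUs total
ray:
  num_nodes: 2
  gpus_per_node: 8
```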
06 Design Lessons
A few things stand out about NeMo Gym's architecture that generalize beyond this specific system.
HTTP as the Universal Connector
The decision to use plain HTTP everywhere could be seen as a performance compromise, but it buys enormous flexibility. You can test a resources server with curl. You can run the model server on a different machine, in a different cloud, or behind a load balancer. You can replace any component with a mock. The 100k connection pool and async I/O ensure that HTTP overhead is not the bottleneck — the model inference is.
Inheritance for Structure, Not Behavior
The class hierarchy (BaseServer → SimpleServer → SimpleResourcesServer / SimpleResponsesAPIModel / SimpleResponsesAPIAgent) provides shared infrastructure — endpoint registration, health checks, session middleware, config loading — but each leaf implementation defines its own behavior. This is a principled use of inheritance that avoids the deep hierarchy trap: base classes provide capabilities, not defaults that get overridden.
Verification Diversity Is a Feature
Having 34 different verifiers is not bloat — it reflects the genuine diversity of what "correctness" means across tasks. Grid comparison, SQL equivalence, Lean4 compilation, safety × quality scoring, and pairwise LLM preference are fundamentally different notions of reward. Collapsing them all into one verifier interface (BaseVerifyResponse) while keeping their implementations separate is the right abstraction boundary.
NeMo Gym is a case study in how microservice decomposition, when done with discipline, can turn a complex ML systems problem into a collection of simple, composable pieces. The three-server-type design is easy to explain, easy to extend, and — critically for a research framework — easy to debug. Whether you are building your own RLVR pipeline or looking for architectural patterns that scale, there is a lot to learn from how this system was put together.
07 Architecture Reference Card
A condensed reference view of the complete system — all server types, class hierarchy, data flow, and infrastructure at a glance.
NeMo Gym (NVIDIA): Microservice-based RLVR Framework

CLI + Config Layer
- Hydra + OmegaConf · YAML + CLI + env.yaml merge

Server Types
- Agent Servers: 7 types · orchestrate the rollout loop
- Model Servers: 5 types · return { output_messages, function_calls, token_usage }
- Resources Servers: 34 impls · return { reward: float, info: dict }

Data Flow
1. Input JSONL → PreprocessRows assigns task_index, rollout_index
2. RolloutCollectionHelper dispatches to Agent Server /run
3. Agent calls Model Server /v1/responses → output + fn_calls
4. Agent calls Resources Server /<tool_name> per fn_call
5. Agent loops 3–4 until no fn_calls (or max_steps)
6. Agent calls /verify → reward
7. Results written to output JSONL
8. RewardProfiler computes pass@k, mean/max/min/std

Class Hierarchy & Communication
- All inter-server calls via async HTTP (aiohttp), JSON payloads + cookie sessions
- /seed_session initializes state; cookies propagated across servers

Infrastructure Layer
- ServerClient: aiohttp wrapper · retry: 3× exponential backoff · pool: 100k total, 1k/host · cookie propagation
- Session Middleware: cookie-based session state · UUID per session · state propagation across all server types
- Concurrency: asyncio.Semaphore · parallel rollouts · async HTTP (aiohttp) · non-blocking I/O
- Ray Cluster: distributed job scheduling · multi-node scale-out · resource management
- Config System: Hydra + OmegaConf · YAML + CLI + env.yaml merge · per-server configuration