I will be honest about how this series started, because the origin story matters for understanding what it is and what it is not.
How This Started
I am an NLP researcher. My work has been on evaluation, specifically on non-verifiable tasks: persona consistency, cultural appropriateness, anthropomorphism—the kinds of things where "correctness" is not a number you can look up in the back of a textbook. I have spent the past few years building benchmarks, designing evaluation protocols, and thinking carefully about what it means to measure something that is inherently subjective. I thought this was my path. I intended to keep going.
Then I did not get into any of the PhD programs I applied to.
That was the moment I had to sit with an uncomfortable realization: most of the skills I had acquired in creating benchmarks do not transfer cleanly to industry. I knew how to design evaluation frameworks. I could tell you whether a benchmark was measuring what it claimed to measure. But I could not tell you how a post-training pipeline actually works end to end, how reward signals flow into policy updates, or why a training run that looks fine on paper collapses in practice. The evaluation side of the house was where I lived. The training side was a foreign country.
So I made a decision. If I was going to be competitive for post-training teams in industry, I needed to learn reinforcement learning. Not the textbook version. The version that people are actually using right now, in 2025 and 2026, to make language models reason, use tools, write code, and hold conversations.
This blog series is the documentation of that learning process.
What This Series Is
I want to be clear about two things.
First, I am writing this to make sure I can clearly explain what I have learned. The act of writing forces a kind of precision that reading alone does not. If I cannot write a coherent paragraph about why DAPO's asymmetric clipping matters, I probably do not actually understand it yet. Some of what I write will contain mistakes. I am learning in public, and I sincerely welcome corrections. If you spot an error, please let me know. You will be doing me a genuine favor.
Second, I am writing this for people from a similar background. If you are an NLP researcher who has spent more time on evaluation than on training, if you know what a good benchmark looks like but have never debugged a reward hacking failure, if you are pivoting toward post-training work and feeling the gap between what you know and what you need to know: this series is for you. I am not writing from a position of expertise. I am writing from a position of "I figured this out six months ago and here is how I organized it in my head."
This series would not have existed without the help of Hanchi Sun, who patiently walked me through the parts I could not figure out from papers alone.
The Five Parts
Each part ends with a question that the next part answers. This is deliberate. The series is meant to build, not to be a disconnected collection of literature reviews.
The Algorithm Zoo
REINFORCE, PPO, GRPO, and the lineage of papers that fix what GRPO got wrong: Dr. GRPO, DAPO, CISPO, MaxRL, DPPO. GRPO removed the critic, introduced group-relative baselines, and made large-scale reasoning RL tractable. But it also introduced subtle biases—length normalization, standard-deviation weighting, symmetric clipping, token-level loss aggregation—that turned out to be consequential. Every paper that came after asks the same organizing question: what did GRPO get wrong, and how do we fix it without breaking what it got right?
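GRPO's group-relative baseline is compact enough to show directly. A minimal sketch in NumPy (function name and epsilon are my choices, not the paper's); the division by the group standard deviation is exactly the weighting that later papers such as Dr. GRPO argue should be dropped:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages in the GRPO style: each sampled
    completion's reward is normalized by the mean and std of its
    own group, with no learned critic.

    Dividing by the group std is one of the subtle biases discussed
    above: low-variance groups get disproportionately large updates.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 rollouts for one prompt with a binary correctness reward.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

With a binary reward, every correct rollout in the group gets the same positive advantage and every incorrect one the same negative advantage, which is why the group composition matters so much.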
The Reward Problem
The current RL-for-LLM literature is dominated by tasks where reward is cheap and unambiguous: math problems with checkable answers, code with executable test suites. The moment you step outside those domains—into summarization, creative writing, open-ended dialogue, cultural sensitivity—the reward problem becomes the hard problem. This part covers generative reward models that reason before scoring (DeepSeek-GRM, RM-R1), process reward models that evaluate intermediate steps (PAV, ThinkPRM), rubrics-as-rewards for structuring subjective preferences, and the evaluation benchmarks (RewardBench and its successors) that make RM development iterative. This is the section closest to my original research, and honestly the one I found most intellectually exciting.
From Preferences to Alignment
DPO simplifies RLHF by removing the reward model. It works well for single-turn preference alignment, and then it starts to plateau in settings that require multi-turn coherence, long-term persona consistency, or reward signals that are noisy and delayed. What I explore here: how do we bring modern RL techniques into the tasks I actually care about? Multi-turn dialogue where the reward is not "did you get the math right" but "did you maintain character across twenty turns." Creative writing where the failure mode is not incorrectness but blandness. PPP for proactive personalized agents, HER for dual-layer role-playing thinking, OMAR for multi-role self-play—papers that point toward where the field is heading once the math-and-code gold rush settles.
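For concreteness, the DPO objective for a single preference pair can be sketched as follows. This is the standard per-pair loss; the beta value and variable names are my choices:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: push the policy's preference margin over a
    frozen reference model toward the chosen response. No reward
    model is trained; the preference data is used directly."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin), written stably as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Chosen response gained log-likelihood relative to the reference
# while the rejected one lost it, so the loss dips below log(2).
loss = dpo_loss(logp_chosen=-1.0, logp_rejected=-2.0,
                ref_logp_chosen=-1.5, ref_logp_rejected=-1.5)
```

The sketch also makes the plateau intuition visible: the signal is a single scalar margin per pair, which is a thin channel for multi-turn coherence or delayed, noisy rewards.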
Making It Work: Systems
The section I was most tempted to skip and most glad I did not. A beautiful algorithm that requires synchronous rollout-then-update will lose to a mediocre algorithm running on async infrastructure that keeps GPUs utilized, if the wall-clock time difference is large enough. But this section is about more than async versus sync. Why does MoE routing behave differently during rollout than during training? Why can FP16 rounding in your inference kernel silently corrupt importance-sampling ratios? Why does deterministic inference across different tensor-parallel sizes require explicit engineering? These are the questions that separate "I read the GRPO paper" from "I could actually help debug a training run."
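The importance-ratio point is easy to demonstrate with toy numbers. A sketch of my own (not any particular framework's code): per-token log-probs rounded through FP16 stay individually close to the truth, but the discrepancies compound over a long sequence, so the sequence-level ratio is no longer exactly 1.0 even though rollout and training nominally share one policy:

```python
import numpy as np

# "True" token log-probs for a 2048-token sequence (toy values).
rng = np.random.default_rng(0)
logp = rng.uniform(-3.0, -0.5, size=2048)

# What a rollout engine computing in FP16 would hand back.
logp_fp16 = logp.astype(np.float16).astype(np.float64)

# Each per-token importance ratio should be exactly 1.0; rounding
# makes each one only approximately 1.0 ...
per_token_ratio = np.exp(logp - logp_fp16)

# ... and the sequence-level ratio (product over tokens, i.e. exp of
# the summed log-prob gap) accumulates all 2048 rounding errors.
sequence_ratio = np.exp(np.sum(logp - logp_fp16))
```

The silent part is that each per-token ratio looks harmless in isolation; it is the product over thousands of tokens that feeds the policy update, and that product is what drifts.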
The Agent Frontier
RL is no longer just for math and code. This part covers coding agents (Qwen3-Coder-Next, SWE-Master), deep research agents (Search-R1, Tongyi DeepResearch), computer-use agents (ComputerRL, GUI-Libra), and the emerging generalist agentic RL systems that attempt to unify all of these under one training framework (RLAnything, AGENTRL, iStar). A dedicated section covers NVIDIA's Nemotron-Tool-N1, which takes the minimalist approach of binary reward for tool-calling correctness and shows it works surprisingly well. Sometimes the answer to "how do we design reward for agentic tasks?" is "just check if the tool call was correct and let RL figure out the rest."
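That minimalist recipe fits in a few lines. A toy sketch under my own assumptions (the JSON schema and function name are mine; Nemotron-Tool-N1's actual format differs):

```python
import json

def binary_tool_reward(model_output: str, expected_call: dict) -> float:
    """Reward 1.0 iff the model emitted a parseable tool call whose
    name and arguments exactly match the reference; otherwise 0.0.
    Everything else -- formatting, phrasing, reasoning -- is left
    for RL to figure out."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0  # unparseable output earns nothing
    name_ok = call.get("name") == expected_call["name"]
    args_ok = call.get("arguments") == expected_call["arguments"]
    return 1.0 if name_ok and args_ok else 0.0

ref = {"name": "search", "arguments": {"query": "GRPO paper"}}
r_good = binary_tool_reward(
    '{"name": "search", "arguments": {"query": "GRPO paper"}}', ref)
r_bad = binary_tool_reward(
    '{"name": "search", "arguments": {"query": "PPO"}}', ref)
```

The design choice worth noticing is what the reward does not score: no partial credit, no style judgment, just a verifiable yes/no on the call itself.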
How to Read This Series
Each post is self-contained. You can read Part 2 (Rewards) without reading Part 1 (Algorithms), though I cross-reference when the same concept appears in multiple places.
Within each post, every paper gets a reading-depth recommendation. There are 96 papers across the five parts; 17 are rated A and 39 are rated A−.
If you only have three hours
Read seven papers that cover the skeleton of the entire story: DeepSeekMath (Part 1), DeepSeek-R1 (Part 1), DAPO (Part 1), DeepSeek-GRM (Part 2), OLMo 3 (Part 3), Magistral (Part 4), and Search-R1 (Part 5).
One Last Thing
I want to acknowledge something that I think more people should say out loud: getting rejected from every program you applied to is a specific kind of painful. It makes you question whether the work you have done matters, whether the skills you have built are real, whether the direction you chose was the right one.
I do not have a neat resolution to that story yet. What I have is this: the process of learning RL for LLMs, of building this map from scratch, of forcing myself to understand systems and algorithms and reward design that were outside my comfort zone, has been the most intellectually alive I have felt in a long time.
And somewhere along the way, I realized something that changed how I think about my own research: evaluation and training are not separate worlds. A good reward model is an evaluation. A good benchmark is a reward signal waiting to be operationalized. The skills transfer. They just transfer in directions I did not expect.
Part 1 goes up next. We start with REINFORCE, because everything else is a footnote to REINFORCE, and end with MaxRL and DPPO, the two 2026 papers that suggest the algorithmic story is far from over.
This is Part 0 of a 5-part series.
Each subsequent post takes one layer of the RL-for-LLMs stack and goes deep: algorithms, rewards, preferences, systems, and agents. The parts build on each other, but each can be read independently.