Reference · 96 Papers

RL for LLMs: The Reading List

A curated taxonomy of 96 papers across algorithms, rewards, preferences, systems, and agents — with reading depth recommendations.

Lorenzo Xiao · v8 · Updated March 2026

Legend: A = deep read · A− = method + experiments · B+ = problem + conclusion · B = skim · C = core idea only · NV = NVIDIA paper
96 papers · 5 categories · 24 sub-topics · By year: 2017 ×1 · 2022 ×2 · 2023 ×1 · 2024 ×8 · 2025 ×54 · 2026 ×30
Algorithms · 26 papers
Core RL: PPO → GRPO → DAPO → CISPO → MaxRL → DPPO · 13 papers
2017 · PPO
Know ratio clipping + trust region (this objective, and its GRPO/CISPO descendants, are sketched in code after this subsection)
B
2022 · InstructGPT
RM + PPO + KL penalty; shared with §3.1
A−
2024 · DeepSeekMath (GRPO origin)
Start your interview story here
A
2025 · DAPO
Recipe + system-level scaling
A
2025 · GSPO
A−
2025 · SAPO
B
2025 · MiniMax-M1 / CISPO
Changed clipping target — key interview talking point; also notable for long-context RL (see §4.3)
A
2025 · Dr. GRPO
Fixes length bias + std normalization
B+
2026 · MaxRL
pass@k objective; RL–MLE continuum
A−
2026 · DPPO
TV/KL divergence trust region
B+
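The lineage this subsection traces, reconstructed as a minimal Python sketch. This is a hedged reconstruction, not any paper's reference implementation: tensor shapes, epsilon values, and aggregation constants are illustrative.

```python
import torch
from math import comb

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """PPO (2017): clip the probability ratio so the update stays in an
    implicit trust region around the behavior policy."""
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    return -surrogate.mean()

def grpo_advantages(rewards, eps=1e-8):
    """GRPO (DeepSeekMath, 2024): drop the learned critic; each sample's
    advantage is its reward standardized within a group of G completions
    of the same prompt.  rewards: (num_prompts, G)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def dr_grpo_advantages(rewards):
    """Dr. GRPO (2025): keep mean-centering but drop the per-group std
    division (a difficulty bias); the companion fix replaces per-response
    1/|o| length normalization with a constant (a length bias)."""
    return rewards - rewards.mean(dim=1, keepdim=True)

def cispo_loss(logp_new, logp_old, adv, eps_low=0.2, eps_high=0.2):
    """CISPO (MiniMax-M1, 2025): the changed clipping target.  PPO-style
    clipping zeroes the gradient of clipped tokens; CISPO instead clips
    the importance weight itself and stops its gradient, so every token
    keeps a REINFORCE-style gradient through logp_new."""
    ratio = torch.exp(logp_new - logp_old)
    weight = torch.clamp(ratio, 1 - eps_low, 1 + eps_high).detach()
    return -(weight * adv * logp_new).mean()

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    draws from n samples (c of them correct) is correct.  MaxRL (2026)
    turns a pass@k-style objective into the RL target itself."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```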
Scaling Laws & Meta · 5 papers
2025 · Qwen3 Technical Report
Thinking/non-thinking unified framework + thinking budget + distillation recipe; source of Nemotron 3 reasoning budget control
A
2025 · Scaling Behaviors of LLM RL Post-Training
Power-law: model scale × data × compute; complements ScaleRL
B
2025 · ProRL
Steps scaling dimension; prerequisite to ScaleRL
B+ · NV
2025 · BroRL
Rollouts scaling dimension; complementary to ProRL
B+ · NV
Cascade & Stage-wise RL (Nemotron) · 4 papers
2025 · Nemotron-Cascade 1
Foundational motivation for Cascade 2
A− · NV
2026 · Nemotron 3 Super Technical Report
LatentMoE + NVFP4 + multi-env simultaneous RL + reasoning budget control
A− · NV
2026 · Nemotron-Cascade 2
IMO/IOI/ICPC gold medals; 30B MoE; on-policy distillation
A · NV
Distillation (on-policy / self / context) · 4 papers
2025 · On-Policy Distillation (Thinking Machines)
Clearest explanation of compute tradeoffs
A−
2026 · MiMo-V2-Flash
Multi-teacher on-policy distillation; core Cascade 2 technique
A−
2026 · Self-Distilled Reasoner
Teacher = self with privileged information
B+
Reward Modeling · 20 papers
Generative Reward Models · 5 papers
2024 · Generative Verifiers: RM as Next-Token Prediction
346 citations; GRM origin; yes/no token + majority vote (scoring sketched after this subsection)
A
2025 · HelpSteer3-Preference
40k samples; NVIDIA latest; RM-Bench SOTA
A · NV
2025 · DeepSeek-GRM / SPCT
SPCT online RL for GRM training; 193 citations
A
2025 · RM-R1: Reward Modeling as Reasoning
Reason first, then score; 101 citations
A−
2026 · P-GenRM (ICLR 2026 Oral)
Personalized GRM — directly relevant to persona research
A−
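The scoring mechanic the papers above share, as a small sketch. `verify` is a hypothetical stand-in for one sampled chain-of-thought verification pass through an LLM; it is not a real API.

```python
from statistics import mean
from typing import Callable

def generative_rm_score(
    verify: Callable[[str, str], float],  # hypothetical: (question, answer) -> p("Yes")
    question: str,
    answer: str,
    k: int = 8,
) -> float:
    """Generative Verifiers (2024): cast reward modeling as next-token
    prediction.  Ask 'Is the answer correct? Yes/No', read out the
    renormalized probability of the Yes token, and average over k
    sampled verification rationales (a soft majority vote)."""
    return mean(verify(question, answer) for _ in range(k))
```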
Process Reward Models (PRM) · 7 papers
2024 · PAV / Rewarding Progress
ICLR; 225 citations; process reward = advantage (see the sketch after this subsection)
A−
2025 · Lessons of Developing PRMs
Qwen/Tongyi; PRM training practice: what works, annotation, stability
A−
2025 · ThinkPRM
Long CoT verifier; 58 citations
A−
2025 · GenPRM
PRM test-time generative reasoning; 21 citations
A−
2025 · ReasonFlux-PRM
NeurIPS 2025 Spotlight; trajectory-aware step + trajectory dual supervision
A−
2025 · R-PRM
B+
2026 · PRISM
B+
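One common labeling scheme beneath this subsection, sketched under the assumption of Monte-Carlo rollout labels (only one of several schemes these papers use):

```python
def mc_progress_rewards(prefix_success_rates: list[float]) -> list[float]:
    """Monte-Carlo PRM labels: the value of a solution prefix is the
    fraction of sampled completions from it that reach a correct final
    answer.  PAV's 'process reward = advantage' reading: reward each
    step by the *increment* in that value (its progress), rather than
    the raw per-step value."""
    v = prefix_success_rates
    return [after - before for before, after in zip(v, v[1:])]
```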
Rubrics-as-Rewards · 4 papers
2025 · Rubrics as Rewards (RaR)
111 citations; rubrics structure subjective preferences (aggregation sketched after this subsection)
A−
2025 · OpenRubrics
B+
2026 · Rubric-ARM
Alternating optimization of rubric generator + judge
B+
2026 · Golden Goose
Non-verifiable → verifiable bridge; high value for research narrative
A− · NV
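The shared mechanic, reduced to a sketch: a generic weighted-checklist aggregation, not any single paper's exact scheme.

```python
def rubric_reward(weights: dict[str, float], verdicts: dict[str, float]) -> float:
    """Rubrics-as-rewards: decompose a subjective preference into named
    criteria, have an LLM judge score each in [0, 1], then aggregate
    the checklist into one scalar RL reward by normalized weights."""
    total = sum(weights.values())
    return sum(w * verdicts[name] for name, w in weights.items()) / total
```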
RM Evaluation & Benchmarks · 4 papers
2024 · How to Evaluate RM for RLHF / PPE
61 citations; offline-online gap
A−
2024 · HelpSteer3 dataset
Feedback + edit; inference-time scaling
B+ · NV
2025 · RewardBench 2
63 citations; reward hacking
B+
RLHF & Preference Optimization · 13 papers
Foundations · 3 papers
2022 · InstructGPT
Shared with §1.1
A−
2022 · Bai et al. / HH-RLHF
Safety alignment; HH-RLHF dataset
B+
2023 · DPO
Know why it eventually fell short (objective sketched below)
A−
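For orientation, the DPO objective in a few lines (a minimal sketch; beta and the variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO (2023): logistic loss on the difference of implicit rewards,
    where reward = beta * log(pi / pi_ref) summed over response tokens.
    logp_* are sequence log-probs of the chosen (w) and rejected (l)
    responses under the policy and the frozen reference model."""
    chosen = beta * (logp_w - ref_logp_w)
    rejected = beta * (logp_l - ref_logp_l)
    return -F.logsigmoid(chosen - rejected).mean()
```

The same few lines hint at the limitation: the loss only ever sees a fixed offline set of pairs, which is exactly the gap the online methods in the next subsection close.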
From Simplification to Online RL · 4 papers
2024 · SimPO
DPO simplification endpoint
B
2024 · Online Iterative RLHF
Turning point: online >> offline DPO
A−
2025 · OLMo 3 (SFT-DPO-RL-RL Zero)
Fully transparent & reproducible; all data/code/checkpoints
A
Multi-turn / Social / Creative RLHF · 6 papers
2025 · PPP: Proactive and Personalized Agents
Three-objective RL; UserVille benchmark
A−
2025 · RL from User Conversations
Persona-conditioned rewards
A−
2025 · RLMR: Mixed Rewards for Creative Writing
Subjective aesthetic RM + objective constraint verifier
B
2026 · HER: RL for Role-playing
Dual-layer thinking; relevant to InCharacter work
A−
2026 · Social-R1
B
Systems · 17 papers
Sync vs Async Training · 6 papers
2025 · Magistral
A
2025 · AReaL
A−
2025 · A-3PO
Decoupled PPO; independent dimension
B+
2026 · StaleFlow
B+
2026 · GAC
B+
Train-Inference Mismatch & RL Stability · 9 papers
2025 · Stabilizing RL with LLMs
IS correction + Routing Replay theory (a generic IS correction is sketched after this subsection)
A−
2025 · R3: MoE Routing Replay
Theoretically explained by Stabilizing RL paper
B+
2025 · FP16 Mismatch
Counter-intuitive: train/infer path consistency matters more than precision alone
B+
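A generic truncated importance-sampling correction in the spirit of this subsection; the token-level granularity and the truncation constant are illustrative choices, not any single paper's exact recipe.

```python
import torch

def mismatch_corrected_loss(logp_train, logp_rollout, adv, c=2.0):
    """Train-inference mismatch: the inference engine that generated the
    rollout (different kernels, precision, or MoE routing) implements a
    slightly different policy than the training engine, so the data is
    subtly off-policy.  Reweight each token by the importance ratio
    between the two policies, truncated at c to control variance."""
    weight = torch.exp(logp_train.detach() - logp_rollout)
    weight = torch.clamp(weight, max=c)
    return -(weight * adv * logp_train).mean()
```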
Long-Context RL · 2 papers
2025 · Context-Folding
Long-horizon agent context compression/folding management
B+
Tasks & Agents · 21 papers
Coding Agents · 2 papers
Deep Research & Search Agents · 4 papers
Computer-Use & GUI Agents · 3 papers
2026 · OmegaUse
B+
2026 · GUI-Libra
B+
2025 · ComputerRL
B+
Tool-Use RL · 3 papers
2025 · ToolRL
191 citations; reward design textbook
A
2025 · Nemotron-Tool-N1
NVIDIA; binary reward; must-discuss in interviews
A · NV
2025 · ReTool
231 citations; cold-start synthesis + outcome RL
A−
Generalist Agent RL Frameworks · 5 papers
2026 · RLAnything
Joint optimization of env + policy + RM
A−
2025 · AGENTRL
Async + cross-policy sampling
A−
2025 · Kimi K2.5 Agent Swarm
Multi-agent orchestration for agentic tasks
B+
2025 · WebAgent-R1
73 citations; end-to-end web-agent RL
B+
2026 · ProRL Agent
NVIDIA; rollout-as-a-service for multi-turn agent RL
B+ · NV
Agent RL Challenges & Self-Evolution · 4 papers
2025 · RAGEN
148 citations; Echo Trap — why agent RL fails
A
2026 · iStar
ICLR 2026; implicit PRM for credit assignment
A−
2025 · AgentPRM
MC rollout actor-critic; 29 citations
B+
2025 · Absolute Zero
161 citations; zero-data self-play
A−

5 categories · 24 sub-topics · Updated March 2026.
