Reference · 96 Papers

RL for LLMs: The Reading List

A curated taxonomy of 96 papers across algorithms, rewards, preferences, systems, and agents — with reading depth recommendations.

Lorenzo Xiao · v8 · Updated March 2026

Legend: A = deep read · A− = method + experiments · B+ = problem + conclusion · B = skim · C = core idea only · NV = NVIDIA paper
96 papers · 5 categories · 24 sub-topics · By year: 2017 ×1 · 2022 ×2 · 2023 ×1 · 2024 ×8 · 2025 ×54 · 2026 ×30
Algorithms · 26 papers
Core RL: PPO → GRPO → DAPO → CISPO → MaxRL → DPPO · 13 papers
2017 · PPO
Know ratio clipping + trust region (this objective, and its GRPO/CISPO descendants, are sketched in code after this subsection)
B
2022 · InstructGPT
RM + PPO + KL penalty; shared with §3.1
A−
2024 · DeepSeekMath (GRPO origin)
Start your interview story here
A
2025 · DAPO
Recipe + system-level scaling
A
2025 · GSPO
A−
2025 · SAPO
B
2025 · MiniMax-M1 / CISPO
Changed clipping target — key interview talking point; also notable for long-context RL (see §4.3)
A
2025 · Dr. GRPO
Fixes length bias + std normalization
B+
2026 · MaxRL
pass@k objective; RL–MLE continuum
A−
2026 · DPPO
TV/KL divergence trust region
B+
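The lineage this subsection traces, reconstructed as a minimal Python sketch. This is a hedged reconstruction, not any paper's reference implementation: tensor shapes, epsilon values, and aggregation constants are illustrative.

```python
import torch
from math import comb

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """PPO (2017): clip the probability ratio so the update stays in an
    implicit trust region around the behavior policy."""
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    return -surrogate.mean()

def grpo_advantages(rewards, eps=1e-8):
    """GRPO (DeepSeekMath, 2024): drop the learned critic; each sample's
    advantage is its reward standardized within a group of G completions
    of the same prompt.  rewards: (num_prompts, G)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def dr_grpo_advantages(rewards):
    """Dr. GRPO (2025): keep mean-centering but drop the per-group std
    division (a difficulty bias); the companion fix replaces per-response
    1/|o| length normalization with a constant (a length bias)."""
    return rewards - rewards.mean(dim=1, keepdim=True)

def cispo_loss(logp_new, logp_old, adv, eps_low=0.2, eps_high=0.2):
    """CISPO (MiniMax-M1, 2025): the changed clipping target.  PPO-style
    clipping zeroes the gradient of clipped tokens; CISPO instead clips
    the importance weight itself and stops its gradient, so every token
    keeps a REINFORCE-style gradient through logp_new."""
    ratio = torch.exp(logp_new - logp_old)
    weight = torch.clamp(ratio, 1 - eps_low, 1 + eps_high).detach()
    return -(weight * adv * logp_new).mean()

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    draws from n samples (c of them correct) is correct.  MaxRL (2026)
    turns a pass@k-style objective into the RL target itself."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```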
Scaling Laws & Meta · 5 papers
2025 · Qwen3 Technical Report
Thinking/non-thinking unified framework + thinking budget + distillation recipe; source of Nemotron 3 reasoning budget control
A
2025 · Scaling Behaviors of LLM RL Post-Training
Power-law: model scale × data × compute; complements ScaleRL
B
2025 · ProRL
Steps scaling dimension; prerequisite to ScaleRL
B+ · NV
2025 · BroRL
Rollouts scaling dimension; complementary to ProRL
B+ · NV
Cascade & Stage-wise RL (Nemotron) · 4 papers
2025 · Nemotron-Cascade 1
Foundational motivation for Cascade 2
A− · NV
2026 · Nemotron 3 Super Technical Report
LatentMoE + NVFP4 + multi-env simultaneous RL + reasoning budget control
A− · NV
2026 · Nemotron-Cascade 2
IMO/IOI/ICPC gold medals; 30B MoE; on-policy distillation
A · NV
Distillation (on-policy / self / context) · 4 papers
2025 · On-Policy Distillation (Thinking Machines)
Clearest explanation of compute tradeoffs
A−
2026 · MiMo-V2-Flash
Multi-teacher on-policy distillation; core Cascade 2 technique
A−
2026 · Self-Distilled Reasoner
Teacher = self with privileged information
B+
Reward Modeling · 20 papers
Generative Reward Models · 5 papers
2024 · Generative Verifiers: RM as Next-Token Prediction
346 citations; GRM origin; yes/no token + majority vote (scoring sketched after this subsection)
A
2025 · HelpSteer3-Preference
40k samples; NVIDIA latest; RM-Bench SOTA
A · NV
2025 · DeepSeek-GRM / SPCT
SPCT online RL for GRM training; 193 citations
A
2025 · RM-R1: Reward Modeling as Reasoning
Reason first, then score; 101 citations
A−
2026 · P-GenRM (ICLR 2026 Oral)
Personalized GRM — directly relevant to persona research
A−
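The scoring mechanic the papers above share, as a small sketch. `verify` is a hypothetical stand-in for one sampled chain-of-thought verification pass through an LLM; it is not a real API.

```python
from statistics import mean
from typing import Callable

def generative_rm_score(
    verify: Callable[[str, str], float],  # hypothetical: (question, answer) -> p("Yes")
    question: str,
    answer: str,
    k: int = 8,
) -> float:
    """Generative Verifiers (2024): cast reward modeling as next-token
    prediction.  Ask 'Is the answer correct? Yes/No', read out the
    renormalized probability of the Yes token, and average over k
    sampled verification rationales (a soft majority vote)."""
    return mean(verify(question, answer) for _ in range(k))
```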
Process Reward Models (PRM) · 7 papers
2024 · PAV / Rewarding Progress
ICLR; 225 citations; process reward = advantage (see the sketch after this subsection)
A−
2025 · Lessons of Developing PRMs
Qwen/Tongyi; PRM training practice: what works, annotation, stability
A−
2025 · ThinkPRM
Long CoT verifier; 58 citations
A−
2025 · GenPRM
PRM test-time generative reasoning; 21 citations
A−
2025 · ReasonFlux-PRM
NeurIPS 2025 Spotlight; trajectory-aware step + trajectory dual supervision
A−
2025 · R-PRM
B+
2026 · PRISM
B+
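One common labeling scheme beneath this subsection, sketched under the assumption of Monte-Carlo rollout labels (only one of several schemes these papers use):

```python
def mc_progress_rewards(prefix_success_rates: list[float]) -> list[float]:
    """Monte-Carlo PRM labels: the value of a solution prefix is the
    fraction of sampled completions from it that reach a correct final
    answer.  PAV's 'process reward = advantage' reading: reward each
    step by the *increment* in that value (its progress), rather than
    the raw per-step value."""
    v = prefix_success_rates
    return [after - before for before, after in zip(v, v[1:])]
```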
Rubrics-as-Rewards · 4 papers
2025 · Rubrics as Rewards (RaR)
111 citations; rubrics structure subjective preferences (aggregation sketched after this subsection)
A−
2025 · OpenRubrics
B+
2026 · Rubric-ARM
Alternating optimization of rubric generator + judge
B+
2026 · Golden Goose
Non-verifiable → verifiable bridge; high value for research narrative
A− · NV
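The shared mechanic, reduced to a sketch: a generic weighted-checklist aggregation, not any single paper's exact scheme.

```python
def rubric_reward(weights: dict[str, float], verdicts: dict[str, float]) -> float:
    """Rubrics-as-rewards: decompose a subjective preference into named
    criteria, have an LLM judge score each in [0, 1], then aggregate
    the checklist into one scalar RL reward by normalized weights."""
    total = sum(weights.values())
    return sum(w * verdicts[name] for name, w in weights.items()) / total
```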
RM Evaluation & Benchmarks · 4 papers
2024 · How to Evaluate RM for RLHF / PPE
61 citations; offline-online gap
A−
2024 · HelpSteer3 dataset
Feedback + edit; inference-time scaling
B+ · NV
2025 · RewardBench 2
63 citations; reward hacking
B+
RLHF & Preference Optimization · 13 papers
Foundations · 3 papers
2022 · InstructGPT
Shared with §1.1
A−
2022 · Bai et al. / HH-RLHF
Safety alignment; HH-RLHF dataset
B+
2023 · DPO
Know why it eventually fell short (objective sketched below)
A−
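For orientation, the DPO objective in a few lines (a minimal sketch; beta and the variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO (2023): logistic loss on the difference of implicit rewards,
    where reward = beta * log(pi / pi_ref) summed over response tokens.
    logp_* are sequence log-probs of the chosen (w) and rejected (l)
    responses under the policy and the frozen reference model."""
    chosen = beta * (logp_w - ref_logp_w)
    rejected = beta * (logp_l - ref_logp_l)
    return -F.logsigmoid(chosen - rejected).mean()
```

The same few lines hint at the limitation: the loss only ever sees a fixed offline set of pairs, which is exactly the gap the online methods in the next subsection close.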
From Simplification to Online RL · 4 papers
2024 · SimPO
DPO simplification endpoint
B
2024 · Online Iterative RLHF
Turning point: online >> offline DPO
A−
2025 · OLMo 3 (SFT-DPO-RL-RL Zero)
Fully transparent & reproducible; all data/code/checkpoints
A
Multi-turn / Social / Creative RLHF · 6 papers
2025 · PPP: Proactive and Personalized Agents
Three-objective RL; UserVille benchmark
A−
2025 · RL from User Conversations
Persona-conditioned rewards
A−
2025 · RLMR: Mixed Rewards for Creative Writing
Subjective aesthetic RM + objective constraint verifier
B
2026 · HER: RL for Role-playing
Dual-layer thinking; relevant to InCharacter work
A−
2026 · Social-R1
B
Systems · 17 papers
Sync vs Async Training · 6 papers
2025 · Magistral
A
2025 · AReaL
A−
2025 · A-3PO
Decoupled PPO; independent dimension
B+
2026 · StaleFlow
B+
2026 · GAC
B+
Train-Inference Mismatch & RL Stability · 9 papers
2025 · Stabilizing RL with LLMs
IS correction + Routing Replay theory (a generic IS correction is sketched after this subsection)
A−
2025 · R3: MoE Routing Replay
Theoretically explained by Stabilizing RL paper
B+
2025 · FP16 Mismatch
Counter-intuitive: train/infer path consistency matters more than precision alone
B+
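A generic truncated importance-sampling correction in the spirit of this subsection; the token-level granularity and the truncation constant are illustrative choices, not any single paper's exact recipe.

```python
import torch

def mismatch_corrected_loss(logp_train, logp_rollout, adv, c=2.0):
    """Train-inference mismatch: the inference engine that generated the
    rollout (different kernels, precision, or MoE routing) implements a
    slightly different policy than the training engine, so the data is
    subtly off-policy.  Reweight each token by the importance ratio
    between the two policies, truncated at c to control variance."""
    weight = torch.exp(logp_train.detach() - logp_rollout)
    weight = torch.clamp(weight, max=c)
    return -(weight * adv * logp_train).mean()
```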
Long-Context RL · 2 papers
2025 · Context-Folding
Long-horizon agent context compression/folding management
B+
Tasks & Agents · 21 papers
Coding Agents · 2 papers
Deep Research & Search Agents · 4 papers
Computer-Use & GUI Agents · 3 papers
2026 · OmegaUse
B+
2026 · GUI-Libra
B+
2025 · ComputerRL
B+
Tool-Use RL · 3 papers
2025 · ToolRL
191 citations; reward design textbook
A
2025 · Nemotron-Tool-N1
NVIDIA; binary reward; must-discuss in interviews
A · NV
2025 · ReTool
231 citations; cold-start synthesis + outcome RL
A−
Generalist Agent RL Frameworks · 5 papers
2026 · RLAnything
Joint optimization of env + policy + RM
A−
2025 · AGENTRL
Async + cross-policy sampling
A−
2025 · Kimi K2.5 Agent Swarm
Multi-agent orchestration for agentic tasks
B+
2025 · WebAgent-R1
73 citations; end-to-end web-agent RL
B+
2026 · ProRL Agent
NVIDIA; rollout-as-a-service for multi-turn agent RL
B+ · NV
Agent RL Challenges & Self-Evolution · 4 papers
2025 · RAGEN
148 citations; Echo Trap — why agent RL fails
A
2026 · iStar
ICLR 2026; implicit PRM for credit assignment
A−
2025 · AgentPRM
MC rollout actor-critic; 29 citations
B+
2025 · Absolute Zero
161 citations; zero-data self-play
A−

5 categories · 24 sub-topics · Updated March 2026.
