Legend:
A: Deep read
A−: Method + experiments
B+: Problem + conclusion
B: Skim
C: Core idea only
NV: NVIDIA paper
96 papers · 5 categories · 24 sub-topics
By year: 2017: 1 · 2022: 2 · 2023: 1 · 2024: 8 · 2025: 54 · 2026: 30
Algorithms (26 papers)
  Core RL: PPO → GRPO → DAPO → CISPO → MaxRL → DPPO (13)
    2025 · MiniMax-M1 / CISPO [A]
      Changed clipping target; key interview talking point; also notable for long-context RL (see §4.3)
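The "changed clipping target" note can be made concrete with a sketch of the per-token gradient coefficient, assuming the commonly cited CISPO formulation (clip the importance-sampling weight itself and stop-gradient it, so no token is dropped) versus PPO's clipped surrogate; the ε values here are illustrative defaults, not MiniMax-M1's exact settings:

```python
def ppo_grad_coeff(ratio, adv, eps=0.2):
    """Coefficient multiplying grad(log pi) under PPO's clipped surrogate.
    When min() selects the clipped branch, the token contributes no gradient
    (the clipped constant has zero derivative w.r.t. the policy)."""
    lo, hi = 1 - eps, 1 + eps
    clipped = (adv > 0 and ratio > hi) or (adv < 0 and ratio < lo)
    return 0.0 if clipped else ratio * adv

def cispo_grad_coeff(ratio, adv, eps_low=0.2, eps_high=0.2):
    """CISPO-style coefficient: clip the IS weight and treat it as a constant
    (stop-gradient), so every token still contributes ~ clip(r) * A."""
    r_hat = min(max(ratio, 1 - eps_low), 1 + eps_high)  # sg(clip(r))
    return r_hat * adv

# A token whose ratio spiked: PPO zeroes its update, CISPO keeps it.
print(ppo_grad_coeff(3.0, adv=1.0))    # clipped branch -> 0.0
print(cispo_grad_coeff(3.0, adv=1.0))  # clip(3.0) * 1.0 = 1.2
```

The point the note gestures at: PPO's clipping silently drops the gradient of exactly the rare, high-ratio tokens that often matter for long reasoning chains, while CISPO bounds their weight but keeps them in the update.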
  Scaling Laws & Meta (5)
    2025 · Qwen3 Technical Report [A]
      Thinking/non-thinking unified framework + thinking budget + distillation recipe; source of Nemotron 3's reasoning budget control
    2025 · Scaling Behaviors of LLM RL Post-Training [B]
      Power-law across model scale × data × compute; complements ScaleRL
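The thinking-budget mechanism noted above amounts to capping the thinking phase at a user-set token count and then forcing the model into its answer. A minimal sketch of that control loop, where `generate_next_token` and the `<think>`/`</think>` delimiters stand in for the real model and chat template (an assumption, not the report's exact implementation):

```python
def generate_with_thinking_budget(generate_next_token, prompt_tokens,
                                  budget, max_new_tokens, stop_token="<eos>"):
    """Cap the thinking phase at `budget` tokens, then inject an early
    stop-thinking marker so decoding transitions to the final answer."""
    out = list(prompt_tokens) + ["<think>"]
    thinking, thinking_tokens = True, 0
    while len(out) - len(prompt_tokens) < max_new_tokens:
        if thinking and thinking_tokens >= budget:
            out.append("</think>")  # budget exhausted: close reasoning
            thinking = False
            continue
        tok = generate_next_token(out)
        out.append(tok)
        if tok == "</think>":
            thinking = False
        elif thinking:
            thinking_tokens += 1
        if tok == stop_token:
            break
    return out
```

The design point is that the budget is enforced by the decoding loop, not the model: the same checkpoint serves short, long, or zero-thinking requests by moving one threshold.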
  Cascade & Stage-wise RL (Nemotron) (4)
    2026 · Nemotron 3 Super Technical Report [A−] [NV]
      LatentMoE + NVFP4 + multi-env simultaneous RL + reasoning budget control
  Distillation (on-policy / self / context) (4)
Reward Modeling (20 papers)
  Generative Reward Models (5)
    2024 · Generative Verifiers: RM as Next-Token Prediction [A]
      346 citations; GRM origin; yes/no token + majority vote
  Process Reward Models (PRM) (7)
    2025 · Lessons of Developing PRMs [A−]
      Qwen/Tongyi; PRM training practice: what works, annotation, stability
  Rubrics-as-Rewards (4)
  RM Evaluation & Benchmarks (4)

RLHF & Preference Optimization (13 papers)
  Foundations (3)
  From Simplification to Online RL (4)
  Multi-turn / Social / Creative RLHF (6)
    2025 · RLMR: Mixed Rewards for Creative Writing [B]
      Subjective aesthetic RM + objective constraint verifier
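A minimal sketch of the mixed-reward idea above, under the assumption (mine, not necessarily RLMR's exact formulation) that the objective verifier gates the subjective score: a sample that violates a hard constraint cannot be rescued by a high aesthetic score.

```python
def mixed_reward(text, rm_score, constraints, violation_penalty=-1.0):
    """Gate a subjective RM score with objective constraint checks:
    any failed constraint overrides the aesthetic score."""
    if all(check(text) for check in constraints):
        return rm_score
    return violation_penalty

# Hypothetical constraints for a creative-writing prompt.
constraints = [
    lambda t: len(t.split()) <= 100,   # instruction: length limit
    lambda t: "dragon" in t.lower(),   # instruction: required story element
]
print(mixed_reward("A dragon slept.", 0.9, constraints))  # 0.9
print(mixed_reward("A cat slept.", 0.9, constraints))     # -1.0
```

Gating (rather than a weighted sum) prevents the policy from trading off constraint compliance against reward-model flattery, which is the failure mode mixed rewards are meant to close.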
Systems (17 papers)
  Sync vs Async Training (6)
  Train-Inference Mismatch & RL Stability (9)
  Long-Context RL (2)

Tasks & Agents (21 papers)
  Coding Agents (2)
  Deep Research & Search Agents (4)
  Computer-Use & GUI Agents (3)
  Tool-Use RL (3)
  Generalist Agent RL Frameworks (5)
  Agent RL Challenges & Self-Evolution (4)
Updated March 2026.