Dario Amodei, Anthropic's CEO, has described RL as showing the same kind of scaling pre-training once did, where performance climbs log-linearly with how long you train.
The RL training loop
The loop has three actors: the generator, the RL environment, and the trainer.
The generator runs inference on prompts from the dataset, producing rollouts. A rollout is a prompt paired with the model's generated response.
The RL environment scores each rollout and returns a reward. For example, a code environment executes the generated code in a sandbox and assigns a reward based on the test-case pass rate. The trainer then ingests the rollouts from the generator and the rewards from the environment, and trains the model, producing new model weights.
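The three roles can be sketched as plain functions. This is a minimal illustration, not a real framework: `Rollout`, `pass_rate_reward`, and `training_iteration` are hypothetical names, and the "environment" here checks simple predicates on the response instead of executing code in a sandbox.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    prompt: str
    response: str


def pass_rate_reward(rollout: Rollout, tests: List[Callable[[str], bool]]) -> float:
    """Toy environment: reward is the fraction of test predicates the response passes."""
    if not tests:
        return 0.0
    return sum(1 for test in tests if test(rollout.response)) / len(tests)


def training_iteration(prompts, generate, score, train_step):
    """One pass through the loop: generate rollouts, score them, train on them.

    generate, score, and train_step are caller-supplied stand-ins for the
    generator, the RL environment, and the trainer.
    """
    rollouts = [Rollout(p, generate(p)) for p in prompts]   # generator
    rewards = [score(r) for r in rollouts]                  # RL environment
    return train_step(rollouts, rewards)                    # trainer returns new weights
```

In a real system each actor runs as a separate service and the loop is asynchronous. The sequential version above only shows the data flow.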
GRPO
GRPO samples multiple completions for each prompt, forming a group of rollouts. Each rollout is assigned a reward that scores its outcome. We then compute each rollout's advantage: its reward relative to the group's average, capturing how much better or worse it did than a typical rollout for that prompt. Rollouts above the group average are reinforced, and those below are suppressed.
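A small sketch of the group-relative advantage follows. The function name `group_advantages` and the `normalize` and `eps` parameters are illustrative choices. With `normalize=False` it matches the description above (reward minus group mean). The published GRPO formulation additionally divides by the group's standard deviation, which `normalize=True` approximates.

```python
from statistics import mean, pstdev


def group_advantages(rewards, normalize=False, eps=1e-6):
    """Advantage of each rollout relative to the other rollouts for the same prompt."""
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    # eps (an assumed small constant) avoids division by zero
    # when every rollout in the group has the same reward.
    scale = pstdev(rewards) + eps
    return [c / scale for c in centered]


# Two passing and two failing rollouts for one prompt.
print(group_advantages([1.0, 0.0, 0.0, 1.0]))
# A group where every rollout passes carries no signal.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))
```

When every reward in a group is identical, all advantages are zero and that prompt contributes nothing to the update.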
RL environment
Practically, the RL environment is implemented as a sandbox, a containerized runtime for code execution.
Sandboxes for RL environments pose distinct systems challenges. The interaction latency between the generator and the RL environment is critical to the end-to-end rollout latency, and sandbox startup latency is one of the major overheads. Sandbox providers such as Modal reduce startup latency with techniques like content-addressed caching.
Matching trainer and generator throughput
The generator produces rollouts into a queue, and the trainer consumes from it. When the generator is slower than the trainer, the queue empties and the trainer starves, idling between steps. When the generator is faster, the queue grows and its samples age, so the trainer learns from rollouts produced by an outdated policy (policy staleness).
In an ideal RL training system, the trainer consumption rate should be roughly equal to the generator production rate. The trainer consumes samples, performs a training step, and then broadcasts weights to the generator.
RL training system efficiency is a matter of queue health.
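The two failure modes can be seen in a toy discrete-time simulation. This is a sketch under simplifying assumptions that are not from the original text: the generator emits a fixed number of samples per tick, the trainer takes one batch per tick when enough samples are queued, and sample age is measured in ticks. `simulate_queue` is a hypothetical helper.

```python
from collections import deque


def simulate_queue(gen_per_tick, batch_size, ticks):
    """Track trainer idleness and sample staleness for fixed production/consumption rates."""
    queue = deque()   # each entry is the tick at which a sample was generated
    idle_ticks = 0    # ticks where the trainer had no full batch (starvation)
    max_age = 0       # oldest sample ever consumed, in ticks (staleness proxy)
    for tick in range(ticks):
        queue.extend([tick] * gen_per_tick)
        if len(queue) >= batch_size:
            for _ in range(batch_size):
                max_age = max(max_age, tick - queue.popleft())
        else:
            idle_ticks += 1
    return {"idle_ticks": idle_ticks, "max_age": max_age, "backlog": len(queue)}


for rate in (2, 4, 8):  # generator slower than, matched to, and faster than the trainer
    print(rate, simulate_queue(gen_per_tick=rate, batch_size=4, ticks=10))
```

With the slow generator the trainer idles on half of the ticks. With the fast generator the backlog and the age of consumed samples both keep growing. Only the matched rate keeps both at zero.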
Generator constraints
Concurrency, or the number of concurrent rollouts, is limited by KV cache memory space and average sequence length.
The KV cache capacity sets the maximum number of tokens the generator can hold at once, and dividing that capacity by the average sequence length gives an estimate of the maximum concurrency.
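As a back-of-the-envelope sketch, the estimate can be computed as follows. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) and the 40 GiB cache budget are hypothetical numbers chosen for illustration, and the per-token size uses the standard accounting of one key and one value vector per layer and KV head.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache bytes for one token: keys and values (the factor 2) across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_concurrency(kv_cache_bytes, avg_seq_len, bytes_per_token):
    """Estimated number of rollouts that fit in the KV cache at the same time."""
    token_capacity = kv_cache_bytes // bytes_per_token
    return token_capacity // avg_seq_len


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
budget = 40 * 2**30  # hypothetical 40 GiB reserved for the KV cache

for avg_len in (8192, 32768):
    print(avg_len, max_concurrency(budget, avg_len, per_token))
```

Under these assumptions, quadrupling the average sequence length cuts the estimated concurrency to a quarter under the same memory budget.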
The moving target: curriculum and output length
Model capability determines how well the model can solve a task, measured by solve rate: the percentage of rollouts in a group that pass verification. When the solve rate is near 0% or near 100%, nearly every rollout in a group receives the same reward, so the advantages are close to zero and the training signal collapses.
The curriculum is the order in which tasks are presented to the model. It is chosen to keep the solve rate in a productive middle band, so the tasks are neither too easy nor too hard for the model's current capabilities.
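One simple way to approximate such a curriculum is to filter tasks by their measured solve rate. The sketch below assumes binary-style rewards where 1.0 means the rollout passed. The band limits of 0.2 and 0.8 and the helper names `solve_rate` and `select_productive_tasks` are illustrative assumptions, not values from the text.

```python
def solve_rate(rewards, pass_threshold=1.0):
    """Fraction of rollouts in a group whose reward reaches the pass threshold."""
    return sum(1 for r in rewards if r >= pass_threshold) / len(rewards)


def select_productive_tasks(group_rewards, low=0.2, high=0.8):
    """Keep tasks whose solve rate falls inside the [low, high] band."""
    selected = []
    for task_id, rewards in group_rewards.items():
        if low <= solve_rate(rewards) <= high:
            selected.append(task_id)
    return selected


recent = {
    "too_easy": [1.0, 1.0, 1.0, 1.0],   # solve rate 100%: no advantage signal
    "too_hard": [0.0, 0.0, 0.0, 0.0],   # solve rate 0%: no advantage signal
    "useful": [1.0, 0.0, 1.0, 0.0],     # solve rate 50%: mixed outcomes
}
print(select_productive_tasks(recent))
```

Because the model's capability shifts during training, the solve rates would need to be re-measured periodically for the band to stay meaningful.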
A model's token usage affects its output length, which indirectly determines the maximum number of concurrent rollouts in the generator. RL training typically elicits Chain-of-Thought reasoning, causing the model to generate long reasoning traces. This behavior drives up the average output length, which increases KV cache usage. Under the same memory budget, this reduces the maximum number of concurrent rollouts and increases the end-to-end latency of sample generation.
Key efficiency factors
Two factors influence system efficiency: how long the model's responses are, thinking traces included, and how well the model learns the task.