Predicting and Controlling Staleness in Fully Asynchronous RL Training

At Applied Compute, we are pushing the frontier of large-scale reinforcement learning. Prior work has primarily focused on better systems performance to keep the GPUs running at full utilization. In particular, fully asynchronous RL achieves this by disaggregating training and rollout generation so that neither has to wait for the other. But this design introduces a hidden cost: because the rollout engines are never idle, they often generate samples using a policy that is several training steps behind. We call this lag staleness — and as it grows, the model trains on increasingly off-policy data, which can degrade downstream results.

How to control staleness is not well understood and, worse yet, is not independent of performance. In previous work, we explored how to allocate train and rollout engines from a fixed compute pool to balance staleness and performance. In this post, we go further and derive closed-form formulas for mean staleness. They reveal that staleness depends on a handful of system and workload parameters: utilization, batch size, queue capacity, rollout concurrency, and the tailness of the response length distribution. We also show that there is an inherent trade-off between staleness and performance. We validate these predictions against both a discrete-event simulator and real training runs, then give practical recommendations for navigating the trade-off.

Approximating staleness in async RL

Let’s start by introducing the cast of an async RL stack:

The rollout engines continuously produce rollouts. For language models, this occurs via a high-performance inference engine.
A queue sits between the rollout engines and the trainer. It stores a group once all rollouts for a task have completed. The queue controls which groups are trained on via the queue algorithm.
The trainer accepts groups (or waits until there are enough groups in the queue to fill a full batch) and trains on them to generate a new policy. In this blog, we assume that our RL algorithm is GRPO.
This new policy is then synced back to the rollout engines in-flight (while the engines are still running).

Note that because the rollout engines are never idle, the trainer often receives tokens generated by an old policy version. We define the staleness of a rollout as the number of policy versions between the first token generated and when the rollout is trained on. As staleness grows, the model is trained on more off-policy samples, which can affect the training result significantly.

We aim to understand the factors that influence staleness. The total staleness can be divided into two parts, which we analyze separately: pre-queue and in-queue staleness.

The key variables to keep track of in the derivation are:

Variable	Symbol	Units
Response length distribution	$\mathcal{D}$	tokens
Rollout throughput	$v_R$	tokens per second
Sampling concurrency	$C$	slots
Train throughput	$v_T$	tokens per second
Groups per train batch	$G$	groups
Samples per group	$S$	rollouts
Max queue capacity	$Q$	rollouts

These quantities allow us to derive the following:

Variable	Symbol	Units
Trainer utilization	$\rho = \dfrac{v_R}{v_T}$	-
Batch size	$B = GS$	rollouts
Decode speed (per request)	$s_R = v_R/C$	tokens per second
Mean response length	$\mathbb{E}[L]$ (where $L \sim \mathcal{D}$ )	tokens
Group tailness multiplier	$M_{\text{tail}} = \dfrac{\mathbb{E}[\max(L_1,\ldots,L_S)]}{\mathbb{E}[L]}$	-
Queue factor	$q = \dfrac{Q}{B}$	-
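The tailness multiplier is the one derived quantity that cannot be read directly off a config. A minimal sketch of estimating it from logged response lengths follows; it assumes the rollouts in a group are independent draws from the same length distribution (real groups share a prompt, so their lengths may be correlated), and the function name `tail_multiplier` and its parameters are hypothetical.

```python
import random
import statistics

def tail_multiplier(lengths, S, n_groups=20000, seed=0):
    """Monte Carlo estimate of M_tail = E[max(L_1..L_S)] / E[L].

    Assumption: the S rollouts of a group are i.i.d. draws from `lengths`.
    """
    rng = random.Random(seed)
    # Resample synthetic groups of S lengths (with replacement) and keep each group's max.
    group_max = [max(rng.choices(lengths, k=S)) for _ in range(n_groups)]
    return statistics.fmean(group_max) / statistics.fmean(lengths)

# Illustrative lengths only; in practice these would come from rollout logs.
observed = [100, 200, 400, 800, 1600]
print(tail_multiplier(observed, S=8))
```

When per-group logs are available, averaging the observed per-group maximum directly avoids the independence assumption.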

Pre-queue staleness

The pre-queue staleness (PQS) measures how many policy updates happen while a rollout is being generated. Formally, it is the average difference in the policy version between the start and end of a rollout. To calculate the PQS, we need two pieces of information:

Group generation time: the time it takes to generate a rollout group,
Effective train period: the time between two policy updates.

When training with GRPO, rollouts enter the queue when all $S$ rollouts for the same prompt have been generated. Assuming that the rollouts in a group start at the same time, the mean group generation time is

t_{\text{generate}} = \frac{\mathbb E[\max(L_1, \ldots, L_S)]}{s_R}=\frac{M_{\text{tail}}\mathbb E[L]}{s_R}=\frac{CM_{\text{tail}}\mathbb E[L]}{v_R},

where $L_i\sim\mathcal D$ .

The training period depends on the number of tokens per batch and on the slower of the train and rollout throughputs: either a batch is already available and the trainer is the bottleneck, or the trainer must wait for the batch to finish generating. The mean number of tokens in a batch is $B\cdot\mathbb E[L]$ , so the mean effective train period is

t_{\text{period}} = \frac{B\mathbb{E}[L]}{\min(v_R, v_T)}.

Putting these together, the PQS is

\text{PQS}(\rho) = \frac{t_{\text{generate}}}{t_{\text{period}}} =\frac{CM_{\text{tail}}}{B\max(1,\rho)}= \begin{cases} \dfrac{CM_{\mathrm{tail}}}{B}, & \rho<1, \\[8pt] \dfrac{CM_{\mathrm{tail}}}{\rho B}, & \rho>1. \end{cases}.

In-queue staleness

The in-queue staleness (IQS) measures how many train steps pass while a rollout is in the queue. Without some control over the queue size, the queue can grow unbounded, leading to unlimited staleness. As such, the queue algorithms used in practice set a limit on how high staleness can grow, either explicitly as in queue-max or implicitly like queue-drop.

Both of these algorithms send the oldest $G$ completed groups to the trainer, but queue-max discards samples above a max staleness threshold, while queue-drop drops the oldest groups when the number of groups in the queue exceeds its capacity. We find that queue-max preferentially trains on shorter rollouts, which biases the trained rollout distribution. On the other hand, queue-drop is approximately unbiased with respect to response length because, once steady state is reached, the probability that a completed group is trained on is independent of its response lengths (a formal proof is given in the appendix). Therefore, in this blog, we choose to analyze queue-drop.

To derive the staleness of queue-drop, we make the following assumptions:

Rollout groups arrive in the queue at a constant rate.
The trainer processes a batch of groups from the queue at a constant rate.
Syncing weights from the trainer to the rollout engines takes negligible time (e.g. via RDMA).

With these assumptions, we can calculate the IQS. We split the derivation into two cases: rollout-bound ( $\rho<1$ ) and train-bound ( $\rho>1$ ).

Case 1: rollout-bound ( $\rho<1$ )

Let the current train step be $\tau$ . Since $\rho<1$ , the trainer is faster than the rollout engines, so the trainer fully drains the queue at the start of the forward pass. Thus, at the start of the train step, the queue is empty. At the end, $\rho B$ rollouts have arrived in the queue, all admitted during generation version $\tau$ . The trainer must wait for $(1-\rho)B$ more rollouts to arrive, all of which will enter the queue at version $\tau+1$ . Thus, during step $\tau+1$ , the mean policy version of queue samples is

\frac{\rho B\tau + (1-\rho)B(\tau+1)}{B} = \tau+1-\rho.

After the policy is updated to version $\tau+1$ , we derive a mean IQS of $(\tau+1)-(\tau+1-\rho)=\rho.$

Case 2: train-bound ( $\rho>1$ )

When $\rho>1$ , the rollout side is faster than the trainer, so in steady state, our queue reaches the max capacity of $Q_{\text{group}}=Q/S$ rollout groups. To find the mean IQS of a train batch, we first find the average age of a rollout in the batch, then divide it by the train period to obtain a continuous estimate of staleness. Finally, we use a smooth-arrival approximation to correct our estimate to the discrete case, since staleness increases in integer increments.

First, since the trainer takes $G$ groups per step and the rollout engines produce groups $\rho$ times as fast as the trainer consumes them, the rollout engines complete $\rho G$ groups per train step. Thus, under the constant group arrival rate assumption, the average time between completed group arrivals is

\Delta t \approx \frac{t_{\mathrm{period}}}{\rho G}.

Order the queue from newest to oldest with ranks $1$ to $Q_{\text{group}}$ . The trainer's consumption time falls roughly uniformly between two consecutive completed-group arrivals. As such, the newest queued group, rank $1$ , has waited about half a group-arrival interval on average. A rank- $r$ group arrived $r-1$ arrival intervals before the newest group, so its mean queue age is

t_{\mathrm{queue}}(r)\ = \left(r-\frac12\right)\Delta t \approx \left(r-\frac12\right)\frac{t_{\mathrm{period}}}{\rho G}.

Thus, the mean queue age of the consumed batch is

\bar t_{\mathrm{queue}} = \frac{1}{G}\sum_{r=Q_{\text{group}}-G+1}^{Q_{\text{group}}}t_{\text{queue}}(r) \approx \frac{t_{\mathrm{period}}}{\rho G} \left(Q_{\text{group}}-\frac{G}{2}\right).

Dividing by $t_{\mathrm{period}}$ , we have

\frac{\bar t_{\mathrm{queue}}}{t_{\mathrm{period}}} \approx \frac{Q_{\text{group}}-\frac{G}{2}}{\rho G} = \frac{2q-1}{2\rho},

since $q = Q_{\text{group}}/G$ .
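As a quick sanity check on the algebra, the sketch below sums the per-rank queue ages directly and compares the result with the closed form $(2q-1)/(2\rho)$ . Times are expressed in units of $t_{\mathrm{period}}$ , and the function names and the example values are illustrative assumptions.

```python
def mean_batch_queue_age(Q_group, G, rho):
    """Mean queue age of the consumed batch in units of t_period (train-bound case)."""
    dt = 1.0 / (rho * G)  # time between completed-group arrivals, in train periods
    oldest_ranks = range(Q_group - G + 1, Q_group + 1)  # the G oldest groups
    return sum((r - 0.5) * dt for r in oldest_ranks) / G

def closed_form_queue_age(q, rho):
    """Closed-form continuous queue age (2q - 1) / (2 rho)."""
    return (2 * q - 1) / (2 * rho)

# Illustrative values: a queue of 32 groups, 16 groups per batch, so q = 2.
Q_group, G, rho = 32, 16, 1.25
print(mean_batch_queue_age(Q_group, G, rho), closed_form_queue_age(Q_group / G, rho))
```

Both expressions agree for any integer $Q_{\text{group}} \ge G$ , since the sum over the $G$ oldest ranks is an arithmetic series.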

As mentioned above, this is the continuous queue age measured in train-step units. However, staleness is ultimately a discrete quantity: the number of policy updates between when a rollout begins generating and when it is trained on. To bridge this gap, we introduce the smooth-arrival approximation, which states that a rollout’s expected staleness at train time is its total age divided by the train period, plus $\frac{1}{2}$ . It holds when the standard deviation of the response length distribution is $\Omega(t_{\text{period}}\cdot s_R)$ . See the appendix for a more detailed derivation.

The smooth-arrival approximation allows us to write the IQS of a discrete queue as

\mathrm{IQS}(\rho) \approx \frac{2q-1}{2\rho} + \frac12 = \frac{2q+\rho-1}{2\rho}, \qquad \rho>1.

Combining the two cases, the queue-drop in-queue staleness is

\mathrm{IQS}(\rho) = \begin{cases} \rho, & \rho<1, \\[6pt] \dfrac{2q+\rho-1}{2\rho}, & \rho>1. \end{cases}

Total staleness

Putting it all together, we find that the mean total staleness using the queue-drop algorithm is

\mathrm{staleness}(\rho) = \begin{cases} \dfrac{CM_{\mathrm{tail}}}{B}+\rho, & \rho<1, \\[8pt] \dfrac{CM_{\mathrm{tail}}}{\rho B} + \dfrac{2q+\rho-1}{2\rho}, & \rho>1. \end{cases}
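The piecewise formula translates directly into a few lines of code. The sketch below is one possible transcription, not the authors' implementation; the function names are hypothetical, and since the formula leaves $\rho=1$ unspecified, the code assumes the train-bound branch there.

```python
def pqs(C, M_tail, B, rho):
    """Pre-queue staleness: C * M_tail / (B * max(1, rho))."""
    return C * M_tail / (B * max(1.0, rho))

def iqs(rho, q):
    """In-queue staleness under queue-drop.

    Assumption: rho == 1 is treated as train-bound; the formula leaves it open.
    """
    if rho < 1:
        return rho
    return (2 * q + rho - 1) / (2 * rho)

def predicted_staleness(C, M_tail, B, rho, q):
    """Mean total staleness = PQS + IQS."""
    return pqs(C, M_tail, B, rho) + iqs(rho, q)

print(round(predicted_staleness(C=120, M_tail=1.42, B=240, rho=0.63, q=2), 2))
```

With the inputs of the first real-run configuration reported below ( $C=120$ , $B=240$ , $q=2$ , $\rho=0.63$ , $M_{\text{tail}}=1.42$ ), this evaluates to approximately 1.34.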

To validate our staleness formula efficiently, we built a discrete-event async RL simulator. The simulator mirrors the parts of the real system that determine staleness: rollout generation, buffering, trainer consumption, and weight updates. We give a full description of our simulator in the appendix.

Comparing our formula against the simulator, we see that our predictions are accurate:

Predicted and simulated mean staleness as a function of trainer utilization ρ, for queue factors q=2 (left) and q=5 (right). Each curve varies the tailness multiplier M_tail.

We also evaluate our formula against real training runs. All runs train Qwen3-8B with GRPO on the DAPO math reasoning task. For each run, we extract $C$ , $G$ , $S$ , $M_{\text{tail}}$ , and $\rho$ from logged metrics, then compare the predicted staleness against its empirical mean:

Concurrency $C$	Batch size $B$	Queue factor $q$	Utilization $\rho$	Tailness multiplier $M_\text{tail}$	Predicted staleness	Measured staleness	Error
120	240	2	0.63	1.42	1.34	1.26	+0.08
240	120	2	0.92	1.43	3.86	3.59	+0.27
128	128	2	1.07	1.44	3.29	3.09	+0.20
240	120	1	0.86	1.42	3.34	3.40	−0.06
120	120	1	0.67	1.42	1.95	1.92	+0.03
128	128	1	1.14	1.45	2.25	2.01	+0.24

Moving along the staleness frontier

Our staleness formula suggests that the following factors can meaningfully affect staleness:

Response length tailness
Compute allocation
Batch size
Queue capacity
Sampling concurrency

Below, we present methods of mitigating staleness, the intuition behind them, and their trade-offs.

Monitor the response length tailness

A key finding is that staleness is independent of the mean response length $\mathbb E[L]$ : no term in our formula depends on this quantity. While response length does affect generation time in PQS, generation time and effective train period both increase linearly with the response length, so the dependence cancels out. In contrast, the important statistic from the response length distribution is the tailness $M_{\text{tail}}$ , which scales the PQS term linearly.

Sensitivity of mean staleness to response length statistics predicted by the formula. Both panels plot mean staleness versus trainer utilization ρ with all settings fixed except mean response length on the left and tailness multiplier on the right.

We validate this effect in real runs. We select four 64-prompt subsets that separately vary average response length and group tailness. We find that for roughly fixed mean response lengths, increasing the tailness multiplier leads to higher staleness in both the short-response and long-response bands. Conversely, increasing mean response length while keeping tailness similar does not increase staleness.

Prompt subset type	Mean response length	Tailness multiplier M_tail	Mean staleness
Short mean response length, low grouped tailness	7.31k	1.35	1.98
Short mean response length, high grouped tailness	6.74k	1.68	2.37
Long mean response length, low grouped tailness	11.29k	1.29	1.85
Long mean response length, high grouped tailness	10.38k	1.68	2.22

In our production runs, we find that shifts in the response length distribution over time are a leading indicator of staleness changes. In heavy-tailed settings, a concrete lever is tuning the response length cap. However, the appropriate max response length is often task dependent, so varying the cap can have large consequences for final results.

Change compute allocation

Per our formula, varying $\rho$ allows us to adjust staleness. We can control this quantity with our compute allocation. Specifically, if our goal is to reduce staleness, then:

When we are rollout-bound ( $\rho<1$ ), staleness scales linearly with $\rho$ , so we should allocate less compute to rollouts.
When we are train-bound ( $\rho>1$ ), staleness scales inversely with $\rho$ , so we should allocate less compute to training.

Clearly, adjusting $\rho$ away from $1$ has undesirable effects on performance, which suggests that there is a trade-off between staleness and performance. We can make this trade-off explicit: assuming constant token throughput $c_R$ per rollout GPU and $c_T$ per train GPU, we can derive the effective train period (our proxy for performance) for different compute splits of a fixed number of GPUs $N$ :

t_{\text{period}}(\rho)= \begin{cases} \dfrac{B\mathbb E[L]}{\rho N}\left(\dfrac1{c_T}+\dfrac1{c_R}\rho\right), & \rho<1,\\[8pt] \dfrac{B\mathbb E[L]}{N}\left(\dfrac1{c_T}+\dfrac1{c_R}\rho\right), & \rho\ge1. \end{cases}

We prove this result in the appendix. Combining this with the staleness formula, we can plot the staleness-performance Pareto frontier:

Staleness-performance trade-off under a fixed compute budget predicted by the formula. We sweep GPU allocation between rollout and training with the same workload.
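A frontier of this kind can be traced numerically by sweeping $\rho$ and evaluating the train-period and staleness formulas together. The sketch below does this for a hypothetical workload: every number in it is illustrative rather than measured, the function names are assumptions, and it holds concurrency $C$ fixed across splits for simplicity.

```python
def train_period(rho, B, mean_len, N, c_R, c_T):
    """Effective train period at utilization rho under a fixed budget of N GPUs."""
    cost = 1.0 / c_T + rho / c_R
    scale = B * mean_len / N
    return scale * cost / rho if rho < 1 else scale * cost

def staleness(rho, C, M_tail, B, q):
    """Closed-form mean staleness (rho >= 1 is treated as train-bound)."""
    pqs = C * M_tail / (B * max(1.0, rho))
    iqs = rho if rho < 1 else (2 * q + rho - 1) / (2 * rho)
    return pqs + iqs

# Hypothetical workload: none of these values come from the runs in this post.
B, mean_len, N = 128, 8000, 8
c_R, c_T = 2500.0, 10000.0   # tokens per second per rollout GPU / per train GPU
C, M_tail, q = 128, 1.4, 1

for rho in (0.5, 0.75, 0.95, 1.05, 1.5, 2.0):
    t = train_period(rho, B, mean_len, N, c_R, c_T)
    s = staleness(rho, C, M_tail, B, q)
    print(f"rho={rho:4.2f}  t_period={t:7.1f}s  staleness={s:4.2f}")
```

Plotting staleness against `t_period` for a dense sweep of `rho` gives one curve per configuration; points that are worse on both axes than some other point are off the frontier.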

In practice, we choose a point along this frontier depending on the underlying constraints of the task and the downstream results. Real run results also show a clear staleness-performance trade-off:

Empirical staleness-performance frontier from Qwen3-8B DAPO runs under a fixed 8-GPU budget. Each point reallocates GPUs between rollout and training, while keeping rollout concurrency high enough to saturate the rollout engines.

The above plots also suggest a simple rule: it is better to be rollout-bound than train-bound. We demonstrate in the appendix that this is true except in rare cases where producing rollouts is extremely cheap compared to training on them.

Increase the batch size

Scaling the batch size reduces staleness mainly through the PQS. A larger batch size increases the effective train period without changing the decode speed (assuming constant rollout concurrency), so fewer weight updates occur during a single rollout’s generation. The trade-off is that each step takes proportionally longer, since there is more work to do in each step.

This does not necessarily mean larger batches are less efficient. Instead, we should evaluate the update period relative to the amount of useful data in the update. If each sample in a batch contributes equally to the update, the simplest normalized measure is $t_{\text{step}}/B$ , which measures wall-clock time per rollout sample. However, the benefit of larger batches is often sublinear. A more realistic proxy is a noise-adjusted update time such as $t_{\text{step}}/\sqrt{B}$ (or ideally $t_{\mathrm{step}}/\sqrt{B_{\mathrm{eff}}}$ when the effective batch size can be estimated from sample correlation, masking, importance weights, or filtering).

We plot the staleness-performance frontier below using the noise-corrected update time. Different batch sizes trace roughly the same Pareto frontier, but larger batches are able to achieve lower staleness than smaller ones. Conversely, smaller batches can reach lower update times that larger batches cannot. Thus, excluding additional ML effects such as reduced optimizer-update count, sample-efficiency changes, or effective-batch saturation, the choice of batch size is primarily about which region of the frontier we wish to reach.

Sensitivity of the staleness-performance trade-off to batch size under a fixed GPU budget. We compare predicted staleness against the noise-adjusted effective training period t_period/√B, which accounts for the sublinear benefit of using more rollouts per update.

We observe the batch size effect in real runs. In the table below, we hold concurrency fixed at $C = 120$ for a clean comparison.

B	ρ	Predicted PQS	Predicted IQS	Predicted staleness	Measured staleness	Step time (s)	Step time / B (s)	Step time / √B (s)
60	0.62	2.76	0.62	3.38	3.03	72.1	1.20	9.3
120	0.67	1.28	0.67	1.95	1.92	137.1	1.14	12.5
240	0.63	0.72	0.63	1.35	1.26	270.4	1.13	17.5

Doubling the batch size approximately halved PQS, consistent with the $1/B$ scaling in the PQS formula. On the performance side, the effective train period nearly doubled, but the per-rollout effective train period $t_{\text{period}}/B$ is virtually identical. Under the more realistic noise-adjusted measure $t_{\text{period}}/\sqrt{B}$ , the $B=240$ run is about 40% slower than the $B=120$ run, indicating a genuine trade-off between staleness and performance. This is consistent with the theoretical frontier above: larger batches dominate the low-staleness regime, while smaller batches are preferable when we want faster steps.
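The two normalized columns in the table can be reproduced from the batch sizes and step times alone. The short sketch below recomputes them from the table's values; the helper name `normalized_times` is hypothetical.

```python
import math

# (batch size B, measured step time in seconds), copied from the table above.
runs = [(60, 72.1), (120, 137.1), (240, 270.4)]

def normalized_times(B, t_step):
    """Return (time per rollout, noise-adjusted time) for one run."""
    return t_step / B, t_step / math.sqrt(B)

for B, t_step in runs:
    per_sample, per_sqrt = normalized_times(B, t_step)
    print(f"B={B:3d}  t/B={per_sample:.2f}s  t/sqrt(B)={per_sqrt:.1f}s")

# Relative slowdown of the noise-adjusted measure when going from B=120 to B=240.
slowdown = normalized_times(240, 270.4)[1] / normalized_times(120, 137.1)[1] - 1
print(f"noise-adjusted slowdown: {slowdown:.0%}")
```

The per-rollout time stays within a few percent across the three runs, while the noise-adjusted time grows with each doubling, which is the trade-off described above.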

Shrink the queue capacity

Queue capacity only enters the IQS formula in the train-bound regime ( $\rho > 1$ ):

\text{IQS}(\rho)=\frac{2q+\rho -1}{2\rho}.

In this regime, the rollout engines produce groups faster than the trainer can consume them, so the queue fills to max capacity. The trainer always consumes the oldest groups first — which are also the stalest — so a larger queue means more stale groups accumulate before reaching the trainer.

In the rollout-bound regime ( $\rho < 1$ ), the formula is independent of $q$ because the trainer drains the queue every step regardless of its size. These effects are shown below:

Sensitivity of the staleness-performance trade-off to the queue size under a fixed compute budget. Increasing q leaves the rollout-bound branch unchanged but increases train-bound in-queue staleness because older rollouts can remain in the queue longer.

Based on the formula, we should always minimize $q$ . In practice, however, we find the optimal choice to be more nuanced. Intuitively, extra queue capacity should help performance — when the rate of rollouts arriving to the queue occasionally dips below the mean, extra queue capacity ensures that the queue always has sufficient data for the trainer to fetch. This effect will be significant when $\rho \approx 1$ and there is high arrival variability, which our formula assumes to be zero.

For simplicity, we use our staleness simulation to show this effect. At $\rho=1$ , using $q=1$ decreases staleness compared to $q=2$ and $q=4$ but also slightly increases the effective train period.

Simulated mean staleness as a function of the effective train period t_period. Curves sweep the rollout/trainer compute split under a fixed total compute budget. Solid segments are rollout-bound (ρ < 1), dashed segments are train-bound (ρ > 1), and stars mark the balanced point (ρ = 1).

Regardless, our recommendation is to set $q=1$ . A larger queue is only beneficial when the system is near balance, and the slight performance improvement does not justify the risk of extra staleness if $\rho$ drifts above 1.

We observe the queue capacity effect in real runs as well. The following two runs share $C = 128$ , $B = 128$ , and the same compute split, but differ in queue factor ( $q = 2$ vs. $q = 1$ ).

q	ρ	Predicted PQS	Predicted IQS	Predicted staleness	Measured staleness	Step time (s)
2	1.07	1.36	1.93	3.29	3.09	141.4
1	1.14	1.31	0.94	2.25	2.01	142.6

Both runs are train-bound, so the $q = 1$ configuration achieved 35% lower total staleness while maintaining virtually identical effective train period. The reduction is concentrated entirely in the in-queue component, consistent with the formula.

Tune the rollout concurrency

Rollout concurrency $C$ enters staleness exclusively through the PQS term. We analyze the effect of concurrency in the linear throughput case, where $v_R$ scales linearly with $C$ . The effect of concurrency depends on the utilization:

Train-bound ( $\rho > 1$ ): PQS simplifies to $\frac{CM_{\text{tail}}}{B \rho} = \frac{M_{\text{tail}}v_T}{B s_R}$ . Because the right-hand side is independent of $C$ (assuming per-slot decode speed $s_R$ is roughly constant as concurrency changes), reducing concurrency has little effect on staleness in this regime.
Rollout-bound ( $\rho < 1$ ): PQS is $\frac{CM_{\text{tail}}}B$ , which scales linearly with $C$ . Reducing concurrency directly lowers staleness because fewer slots are actively generating at once, meaning fewer samples are stamped with stale policy versions before they reach the queue.
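The two regimes can be illustrated numerically. The sketch below assumes the linear-throughput case, $v_R = C\,s_R$ with constant per-slot decode speed, and uses hypothetical values for every parameter; the function name is also an assumption.

```python
def pqs_linear_throughput(C, s_R, v_T, B, M_tail):
    """Return (PQS, rho) when rollout throughput scales as v_R = C * s_R."""
    v_R = C * s_R                      # linear-throughput assumption
    rho = v_R / v_T
    return C * M_tail / (B * max(1.0, rho)), rho

# Hypothetical values chosen only for illustration.
s_R, v_T, B, M_tail = 50.0, 5000.0, 128, 1.4

for C in (40, 80, 200, 400):
    pqs, rho = pqs_linear_throughput(C, s_R, v_T, B, M_tail)
    print(f"C={C:3d}  rho={rho:3.1f}  PQS={pqs:.3f}")
```

In this setup, the two rollout-bound settings ( $C=40$ and $C=80$ ) differ in PQS by a factor of two, while the two train-bound settings ( $C=200$ and $C=400$ ) give the same PQS.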

At the same time, lower concurrency decreases rollout throughput and thus increases the effective train period. We plot the theoretical staleness-performance frontier under different per-GPU concurrencies below (we adjust concurrency per replica so the comparison isolates the concurrency effect from changes in rollout compute allocation). The frontiers largely overlap: different concurrency settings trace the same Pareto curve, but lower concurrency extends it into the low-staleness region at the cost of longer steps. In practice, the choice depends on whether the priority is minimizing staleness or maximizing training speed.

Sensitivity of the staleness-performance trade-off to the rollout concurrency under a fixed compute budget. Higher concurrency can reduce the effective train period t_period by increasing rollout throughput, but it also increases the number of in-flight samples and can raise staleness.

We observe this trade-off directly in real runs. The following two runs both use $q = 1$ , $B = 120$ , $S = 8$ , and the same number of rollout GPUs; they differ only in rollout concurrency.

C	ρ	Predicted PQS	Predicted IQS	Predicted staleness	Measured staleness	Step time (s)
120	0.67	1.28	0.67	1.95	1.92	137.1
240	0.85	2.24	0.85	3.09	3.15	106.4

Halving the concurrency reduces staleness by 39% at the cost of 29% longer effective train periods. The staleness reduction is driven primarily by PQS, as predicted.

Conclusion

In this post, we developed a closed-form framework for reasoning about staleness in fully async RL. By decomposing staleness into its pre-queue and in-queue components, we derived tractable formulas that depend on a small number of system parameters: response length tailness $M_{\text{tail}}$ , utilization $\rho$ , batch size $B$ , rollout concurrency $C$ , and queue factor $q$ . Our predictions match both discrete-event simulations and real training runs to within fractions of a step.

The central insight is that staleness and throughput form a Pareto frontier. For a fixed GPU budget, reducing staleness requires moving utilization away from $\rho = 1$ , which necessarily costs throughput. Our analysis also reveals that the frontier is asymmetric: it is almost always better to be rollout-bound than train-bound. The practical levers for navigating this frontier — compute allocation, batch size, queue capacity, and concurrency — each offer different trade-off profiles, and our formulas make it possible to reason about these trade-offs quantitatively before committing to a run.

Using this work, practitioners can predict staleness from a proposed configuration before launching a run, diagnose the dominant source of staleness in a live system, and choose targeted interventions with known costs. We reiterate our general rules of thumb:

Set $q=1$ : a minimal queue avoids unnecessary staleness accumulation with negligible throughput cost.
Prefer to be rollout-bound: slightly under-provisioning rollout gives a better staleness-performance trade-off than over-provisioning.
Monitor response length tailness: it can be a leading indicator of staleness changes.
Trade off staleness against train period via batch size: larger batches access lower-staleness operating points at the cost of slower noise-adjusted updates.

We end by highlighting a few directions for future work:

Better queue algorithms: can we design algorithms that exploit the response-length distribution to reduce staleness below queue-drop without introducing bias?
Algorithm bias beyond response length: on that note, what other types of bias might be worth investigating? We showed that queue-drop is unbiased with respect to response length, but other sample properties may be correlated with generation time in subtle ways.
Staleness vs. task reward: our analysis treats staleness as a proxy for off-policy degradation, but the actual relationship between staleness and downstream results is task-dependent. How can we characterize this relationship and identify when staleness is detrimental?

If any of these directions excite you, we’d love to explore them together at Applied Compute.

Appendix

Response length bias in queue algorithms

Recall that queue-max evicts samples past a max staleness threshold. Since longer samples accrue more policy updates while generating (higher PQS), they are more likely to be dropped, resulting in a shorter response length distribution of rollouts sent to the trainer. In contrast, queue-drop evicts by queue position, which we will show does not bias the response length in steady state.

We begin with the following steady-state approximation: in steady state, the response lengths of completed rollouts are independent from the lengths and generation progress of in-flight rollouts. This has two consequences:

Distinct groups have independent response lengths
A group’s response length is independent of the completion times of all other groups.

We now divide the proof into two cases:

Case 1: rollout-bound

Since the trainer is faster than the rollout engines, it must process every sample that the rollout engine produces. Thus, the response length distribution of trained samples is exactly the distribution of response lengths produced.

Case 2: train-bound

In steady-state, the queue is always full before a train step, with train batches sampled from the end. Formally, we train on the groups with ranks $\{Q_{\text{group}}-G+1,\ldots,Q_{\text{group}}\}$ , where rank $1$ is the newest group and $Q_{\text{group}}$ is the oldest group.

Consider some candidate group $g$ . This group will be in the queue until it is dropped at some time $\tau_d$ , when $Q_{\text{group}}$ groups finish generating after $g$ . By the steady-state approximation, the completion times of groups after $g$ are independent of any groups already completed. Therefore, the queue exit time $\tau_d$ is independent of group $g$ or any group generated before it.

Now, consider the first time $\tau$ at which the trainer trains on group $g$ or a group generated after it. If $\tau > \tau_d$ , $g$ exits the queue before getting trained on. If $\tau < \tau_d$ , $g$ gets trained on.

Since any step before time $\tau$ only trains on samples produced before $g$ , the contents of the corresponding train batches are independent of group $g$ or any groups produced after it. Therefore, in steady state, $\tau$ is also independent of group $g$ or any groups produced after it. Since $\tau_d$ only depends on groups generated after $g$ and $\tau$ only depends on groups generated before $g$ , the event $\tau < \tau_d$ is independent from $g$ , so the probability that $g$ is trained on in steady-state is independent of its contents.

The steady-state approximation is not perfect; with $S > 1$ , some rollouts in a group can finish earlier than others, creating extra available concurrency slots which are then biased towards containing short samples. In practice, the approximation holds as long as we process rollouts from many different groups concurrently.

Empirical results

We verify these response length biases (or lack thereof) empirically. At tailness 50, we see that queue-drop is unbiased, unlike queue-max, whose trained mean response length decreases at stricter max staleness thresholds:

Queue algorithm	Sampled mean	Trained mean	Sampled max	Trained max
queue-drop with Q=1	1404	1403	8080	8080
queue-drop with Q=2	1405	1403	8080	8080
queue-max with max staleness 1	1401	1229	8080	4983
queue-max with max staleness 2	1399	1259	8080	6555
queue-max with max staleness 3	1403	1292	8080	8080
queue-max with max staleness 4	1402	1351	8080	8080

At tailness 90, the bias is even more apparent:

Queue algorithm	Sampled mean	Trained mean	Sampled max	Trained max
queue-drop with Q=1	1349	1344	12080	12080
queue-drop with Q=2	1357	1352	12080	12080
queue-max with max staleness 1	1354	897	12080	5430
queue-max with max staleness 2	1359	962	12080	6388
queue-max with max staleness 3	1355	1027	12080	7979
queue-max with max staleness 4	1351	1077	12080	9572

Smooth-arrival approximation proof

The smooth-arrival approximation is used to estimate the number of policy updates that occur from the start of a rollout’s generation to when it is trained on based on its lifetime. This approximation holds when we have a wide response length distribution. Formally, consider a rollout which begins generating at time $t_0$ , where $t_0$ is a random variable, and is trained on at time $t_1$ , which is fixed and known. Then the smooth-arrival approximation states that the expected number of policy updates between $t_0$ and $t_1$ (including the update at time $t_1$ ) is

\mathbb{E}[n_{\text{update}} ]=\frac{t_1-\mathbb{E}[t_0]}{t_\text{period}}+\frac{1}{2}.

The approximation holds well when the standard deviation of $t_0$ is $\Omega(t_{\text{period}})$ .

We now derive this approximation. Since there is a policy update at time $t_1$ and policy updates occur with interval $t_{\text{period}}$ , the number of policy updates is approximately

n_{\text{update}}\approx \left\lceil \frac{t_1-t_0}{t_{\text{period}}}\right\rceil.

Here and below, $\lceil x\rceil$ is the ceiling of $x$ and $\{x\}$ is the fractional part of $x$ . For all non-integer $x$ , we have $\lceil x \rceil = x+1-\{x\}$ . Then,

\begin{align*} \mathbb{E}[n_{\text{update}} ] &\approx \mathbb{E}\left[\left\lceil \frac{t_1-t_0}{t_{\text{period}}}\right\rceil\right] \\ &= \frac{t_1-\mathbb E[t_0]}{t_{\text{period}}}+1-\mathbb{E}\left[\left\{ \frac{t_1-t_0}{t_{\text{period}}}\right\}\right]. \end{align*}

When the distribution of $t_0$ has a standard deviation of order $\Omega(t_{\text{period}})$ , the distribution of $\left\{ \frac{t_1-t_0}{t_{\text{period}}}\right\}$ is nearly uniform on $[0,1)$ (shifting by the fixed $t_1$ does not change this). Thus, its expected value is approximately $\frac12$ , giving the desired approximation.

The smooth-arrival approximation works especially well in multi-turn, agentic settings, where rollouts have highly variable lengths. In our simulations and production runs, we find that its predictions also hold well for single-turn rollouts.
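The approximation is easy to probe with a small Monte Carlo experiment. The sketch below is illustrative only: it assumes a Gaussian start time $t_0$ , places $t_1$ on an update boundary, and uses hypothetical values throughout. It compares a wide start-time distribution with a very narrow one.

```python
import math
import random

def mean_updates(t1, mean_t0, std_t0, t_period, n=100_000, seed=0):
    """Monte Carlo mean of ceil((t1 - t0) / t_period) for Gaussian t0 (an assumption)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n):
        t0 = rng.gauss(mean_t0, std_t0)
        total += math.ceil((t1 - t0) / t_period)
    return total / n

def smooth_arrival(t1, mean_t0, t_period):
    """Smooth-arrival prediction: age in train periods plus one half."""
    return (t1 - mean_t0) / t_period + 0.5

# Hypothetical timings in seconds: one update every 100 s, trained at t1 = 2000 s.
t_period, t1, mean_t0 = 100.0, 2000.0, 1030.0

print("prediction  :", smooth_arrival(t1, mean_t0, t_period))
print("wide t0     :", mean_updates(t1, mean_t0, 2.0 * t_period, t_period))
print("narrow t0   :", mean_updates(t1, mean_t0, 0.01 * t_period, t_period))
```

With a standard deviation of two train periods, the simulated mean should land close to the prediction. With a standard deviation of one hundredth of a period, nearly every rollout sees the same number of updates, and the result depends on where the mean of $t_0$ falls within a period rather than on the $\frac12$ correction.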

Building an efficient staleness simulator

To validate the staleness formula efficiently, we built a discrete-event simulator for fully asynchronous RL with a queue-drop buffer algorithm. The simulator mirrors the parts of the real system that determine staleness: rollout generation, buffering, trainer consumption, and weight-version updates.

We assume that the rollout engines have $C$ concurrent inference slots. Each slot generates one sample at a time. A sample draws its response length $L_i$ from a capped lognormal distribution $\mathcal D$ : when the tailness parameter is zero, all responses have deterministic length $\mathbb E[L]$ ; otherwise, we sample $z \sim \mathcal N(0,1)$ and set

L_i = \mathbb E[L]\exp(\sigma z - \sigma^2/2), \qquad \sigma = 1.3 \cdot \frac{\text{tailness}}{100},

then round and cap the length to avoid unrealistically large outliers. The factor $1.3$ is chosen so that the simulated response length distribution aligns with real workloads. The rollout engines’ generation time for a sample of length $L$ is $L/s_R$. A group can only enter the buffer after all $S$ samples finish, so the simulator naturally captures the straggler effect from long-tail responses as described before: the group completion time is controlled by the slowest sample in the group.
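A minimal sketch of this sampling step is shown below. The cap value and the floor of one token are assumptions made for illustration (the text above does not specify them), and `sample_length` is a hypothetical name.

```python
import math
import random


def sample_length(mean_len, tailness, cap, rng):
    """Draw one response length from the capped lognormal described above.

    Assumptions: `cap` is a user-chosen maximum length, and lengths are
    floored at 1 token so that rounding never yields an empty response.
    """
    if tailness == 0:
        # Zero tailness: every response has the deterministic length E[L].
        return round(mean_len)
    sigma = 1.3 * tailness / 100
    z = rng.gauss(0.0, 1.0)
    # The -sigma^2 / 2 term keeps the pre-cap mean equal to mean_len.
    length = mean_len * math.exp(sigma * z - sigma**2 / 2)
    return max(1, min(round(length), cap))


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_length(1000, 50, 16_000, rng) for _ in range(10)]
    print(draws)
```

Because of the $-\sigma^2/2$ correction, changing the tailness reshapes the distribution without shifting its uncapped mean, so sweeps over tailness keep $\mathbb E[L]$ roughly fixed as long as the cap rarely binds.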

Each group is stamped with the rollout policy version used when its generation begins. This version stays fixed through generation and buffering. Once a group completes, it enters a bounded queue of capacity $Q$. Under the queue-drop policy, if the queue is full when a new group arrives, the oldest queued group is dropped and the new group is inserted. The trainer consumes the $G$ oldest queued groups whenever a full training batch is available; otherwise, it waits for the queue to refill. Each train step takes $t_{\text{step}}$ seconds and increments the trainer policy version by one.
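The queue-drop behavior and the version stamping can be sketched as follows. This is a simplified illustration, not our simulator's code: `QueueDropBuffer` and `batch_staleness` are hypothetical names, and each queued group is represented only by its rollout policy version.

```python
from collections import deque


class QueueDropBuffer:
    """Bounded FIFO of completed groups that drops the oldest group when full."""

    def __init__(self, capacity):
        self.capacity = capacity  # Q, measured in groups
        self.groups = deque()     # each entry is a rollout policy version
        self.dropped = 0

    def push(self, rollout_version):
        """Insert a completed group, evicting the oldest one if the queue is full."""
        if len(self.groups) == self.capacity:
            self.groups.popleft()
            self.dropped += 1
        self.groups.append(rollout_version)

    def pop_batch(self, groups_per_batch):
        """Return the G oldest groups, or None if a full batch is not available."""
        if len(self.groups) < groups_per_batch:
            return None
        return [self.groups.popleft() for _ in range(groups_per_batch)]


def batch_staleness(batch, train_version):
    """Staleness per group: train version at consumption minus rollout version."""
    return [train_version - v for v in batch]


if __name__ == "__main__":
    buf = QueueDropBuffer(capacity=3)
    for version in (0, 0, 1, 2):
        buf.push(version)
    batch = buf.pop_batch(2)
    print(batch, batch_staleness(batch, train_version=3), buf.dropped)
```

Dropping from the head of the queue means the trainer never sees the very oldest groups once the rollout engines outpace it, which is what bounds staleness in the train-bound regime.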

For every trained group, the simulator computes staleness exactly as

\small \text{staleness} = \text{train policy version at consumption} - \text{rollout policy version at generation}.

We then measure the mean staleness over trained samples after the simulation reaches steady state.

Note that the simulator is intentionally event-driven rather than step-based: it advances directly to the next sample completion, trainer completion, or history checkpoint. This makes it fast enough to run large parameter grids while preserving the discrete effects that matter most for staleness, especially group stragglers, bounded-queue drops, trainer waiting, and integer policy-version jumps.

Train period under fixed GPU budget

Recall that the train period is the expected number of tokens in a training batch, $B\mathbb E[L]$, divided by the bottleneck throughput:

t_{\text{period}}=\frac{B\mathbb E[L]}{\min(v_R,v_T)}.

Let the total GPU budget be $N=N_R+N_T$ , where $N_R$ is the number of rollout GPUs and $N_T$ is the number of train GPUs. Now assume that the token throughput $c_R$ per rollout GPU and the token throughput $c_T$ per train GPU are constant. Then $v_R=c_RN_R$ and $v_T=c_TN_T$ scale linearly with their GPU allocations.

We will re-parametrize the train period in terms of $\rho$ and $N$ . Since

\rho=\frac{v_R}{v_T}=\frac{c_R}{c_T}\cdot\frac{N_R}{N_T},

we can use $N=N_R+N_T$ to derive

v_R=\frac{\rho N}{\frac1{c_T}+\frac1{c_R}\rho},\qquad v_T=\frac N{\frac1{c_T}+\frac1{c_R}\rho}.

Thus,

t_{\text{period}}(\rho)=\frac{B\mathbb E[L]\left(\frac1{c_T}+\frac1{c_R}\rho\right)}{N\min(1,\rho)}.

Consequently, $t_{\text{period}}$ is strictly decreasing on $(0,1)$ , strictly increasing on $(1,\infty)$ , and minimized at $\rho=1$ .
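The re-parametrization can be sanity-checked against a direct computation from the GPU split. The sketch below uses made-up throughput numbers and allows fractional GPU counts for simplicity; both function names are hypothetical.

```python
def train_period(rho, batch_tokens, n_gpus, c_rollout, c_train):
    """Closed form: B E[L] (1/c_T + rho/c_R) / (N min(1, rho))."""
    return batch_tokens * (1 / c_train + rho / c_rollout) / (n_gpus * min(1.0, rho))


def train_period_from_split(rho, batch_tokens, n_gpus, c_rollout, c_train):
    """Same quantity computed from the explicit GPU allocation.

    Assumption: GPU counts may be fractional, so any rho is achievable.
    """
    # rho = (c_R / c_T) * (N_R / N_T) and N = N_R + N_T determine the split.
    n_train = n_gpus / (1 + rho * c_train / c_rollout)
    n_rollout = n_gpus - n_train
    v_rollout = c_rollout * n_rollout
    v_train = c_train * n_train
    return batch_tokens / min(v_rollout, v_train)


if __name__ == "__main__":
    # Illustrative values only: 1M tokens per batch, 64 GPUs.
    args = (1e6, 64, 3000.0, 1000.0)
    for rho in (0.25, 0.5, 1.0, 2.0, 4.0):
        print(rho, train_period(rho, *args), train_period_from_split(rho, *args))
```

Scanning $\rho$ over a grid with these functions shows the V-shaped curve described above, with the shortest train period at $\rho=1$.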

Does rollout-bound always lead to better trade-off?

We observed above that any achievable effective train period $t_{\text{period}}$ other than the minimum at $\rho=1$ can be reached with two different utilizations: one less than 1 (rollout-bound) and one greater than 1 (train-bound). But these setups are not equal: we often achieve lower staleness via the rollout-bound utilization than via the train-bound one. Which setup is better depends on two factors:

The queue factor $q=\frac QB$ , and
The rollout-to-trainer compute ratio $\beta^*$ at $\rho=1$ .

Smaller $\beta^*$ means rollouts are cheaper relative to training. We will derive this condition below.

As in our derivation of the formula for $t_{\text{period}}$, we will assume that rollout and train throughput scale linearly with rollout and train GPUs, respectively. This tells us that $\beta^*$ is a constant; in fact, it is exactly $\frac{c_T}{c_R}$ (both terms are defined above in our $t_{\text{period}}$ derivation), since $\rho=1$ requires $c_RN_R=c_TN_T$.

Recall that

\text{IQS}(\rho) = \begin{cases} \rho, & \rho<1,\\[4pt] \dfrac{2q+\rho-1}{2\rho}, & \rho>1. \end{cases}

Define $\rho_+=1+z$ where $z>0$ . Then we can derive the other utilization $\rho_-$ with the same effective train period as follows:

\begin{align*} t_{\text{period}}(\rho_-) &= t_{\text{period}}(\rho_+),\\ \frac{B\mathbb E[L]\left(\frac1{c_T}+\frac1{c_R}\rho_-\right)}{N\rho_-} &= \frac{B\mathbb E[L]\left(\frac1{c_T}+\frac1{c_R}\rho_+\right)}{N},\\ \rho_- &= \frac1{1+\beta^*(\rho_+-1)}=\frac1{1+\beta^*z}. \end{align*}
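As a quick numerical check of this matching, the sketch below computes $\rho_-$ from $\rho_+$ and confirms that both give the same train period. The throughput values are illustrative, the function names are hypothetical, and the common factor $B\mathbb E[L]/N$ is dropped because it cancels.

```python
def period_factor(rho, c_rollout, c_train):
    """t_period up to the constant B E[L] / N, which cancels when matching."""
    return (1 / c_train + rho / c_rollout) / min(1.0, rho)


def matched_rho_minus(rho_plus, beta_star):
    """Rollout-bound utilization with the same train period as rho_plus > 1."""
    return 1.0 / (1.0 + beta_star * (rho_plus - 1.0))


if __name__ == "__main__":
    c_rollout, c_train = 3000.0, 1000.0  # illustrative per-GPU throughputs
    beta_star = c_train / c_rollout
    for rho_plus in (1.5, 2.5, 4.0):
        rho_minus = matched_rho_minus(rho_plus, beta_star)
        print(
            rho_plus,
            rho_minus,
            period_factor(rho_plus, c_rollout, c_train),
            period_factor(rho_minus, c_rollout, c_train),
        )
```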

The PQS is constant for a fixed effective train period, so the signed staleness gap is only the difference in IQS:

\begin{align*} \mathrm{gap}(z;\beta^*,q) &= \mathrm{staleness}(\rho_+) - \mathrm{staleness}(\rho_-)\\ &= \text{IQS}(\rho_+)-\text{IQS}(\rho_-)\\ &= \frac{z+2q}{2(1+z)} - \frac{1}{1+\beta^* z}. \end{align*}

Negative gap means the $\rho>1$ point is better. Define the best possible signed gap as

\mathrm{gap}_{\min}(\beta^*,q) = \inf_{z>0}\mathrm{gap}(z;\beta^*,q).

Then there exists $z$ such that the $\rho>1$ branch improves the frontier iff

\mathrm{gap}_{\min}(\beta^*,q)<0 \iff \exists\, z>0:\ \frac{z+2q}{2(1+z)} < \frac{1}{1+\beta^* z}.

Recalling $q\ge1$ , this condition simplifies to

\beta^* < \beta_{\mathrm{crit}}^*(q) = \frac{1}{6q-4+4\sqrt{(q-1)(2q-1)}}.

In particular,

\beta_{\mathrm{crit}}^*(1)=\frac12,\qquad \beta_{\mathrm{crit}}^*(2) = \frac{1}{8+4\sqrt3} \approx 0.067.

Thus, even for $q=1$ , the $\rho>1$ side can only improve the frontier when rollout compute is less than half of trainer compute at the balanced point. For $q=2$ , rollout compute must be less than about $6.7\%$ of trainer compute.
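The threshold can be verified numerically by scanning the gap over $z$ on either side of $\beta_{\mathrm{crit}}^*(q)$. In the sketch below, the grid range and resolution are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def gap(z, beta_star, q):
    """Signed staleness gap IQS(rho_plus) - IQS(rho_minus) for rho_plus = 1 + z."""
    return (z + 2 * q) / (2 * (1 + z)) - 1 / (1 + beta_star * z)


def beta_crit(q):
    """Critical compute ratio below which some rho > 1 beats its rho < 1 twin."""
    return 1 / (6 * q - 4 + 4 * math.sqrt((q - 1) * (2 * q - 1)))


def gap_min_on_grid(beta_star, q, z_max=200.0, n=40_000):
    """Approximate inf over z > 0 of the gap using a uniform grid (an assumption)."""
    return min(gap(z_max * (i + 1) / n, beta_star, q) for i in range(n))


if __name__ == "__main__":
    for q in (1, 2):
        b = beta_crit(q)
        print(q, b, gap_min_on_grid(0.9 * b, q), gap_min_on_grid(1.1 * b, q))
```

Just below the threshold the minimum gap is slightly negative, and just above it the gap stays positive for every $z$ on the grid, matching the closed-form condition.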