JULY 3, 2026 · CHENGYU DONG, AGASTYA GOEL, DYLAN YU

Predicting and Controlling Staleness in Fully Asynchronous RL Training

At Applied Compute, we are pushing the frontier of large-scale reinforcement learning. Prior work has primarily focused on better systems performance to keep the GPUs running at full utilization. In particular, fully asynchronous RL achieves this by disaggregating training and rollout generation so that neither has to wait for the other. But this design introduces a hidden cost: because the rollout engines are never idle, they often generate samples using a policy that is several training steps behind. We call this lag staleness — and as it grows, the model trains on increasingly off-policy data, which can degrade downstream results.

How we control staleness is not well understood, and worse yet, not independent of performance. In previous work, we explored how to allocate train and rollout engines from a fixed compute pool to balance staleness and performance. In this post, we go further: we derive closed-form formulas for mean staleness, which reveal that staleness depends on a handful of system and workload parameters: utilization, batch size, queue capacity, rollout concurrency, and the tailness of the response length distribution. We also show that there is an inherent trade-off between staleness and performance. We validate these predictions against both a discrete-event simulator and real training runs, then give practical recommendations for navigating the trade-off.

Approximating staleness in async RL

Let’s start by introducing the cast of an async RL stack:

  1. The rollout engines continuously produce rollouts. For language models, this occurs via a high-performance inference engine.
  2. A queue sits between the rollout engines and the trainer. It stores a group once all rollouts for a task have completed. The queue controls which groups are trained on via the queue algorithm.
  3. The trainer accepts groups (or waits until there are enough groups in the queue to fill a full batch) and trains on them to generate a new policy. In this blog, we assume that our RL algorithm is GRPO.
  4. This new policy is then synced back to the rollout engines in-flight (while the engines are still running).

Note that because the rollout engines are never idle, the trainer often receives tokens generated by an old policy version. We define the staleness of a rollout as the number of policy versions between the first token generated and when the rollout is trained on. As staleness grows, the model is trained on more off-policy samples, which can affect the training result significantly.

We aim to understand the factors that influence staleness. The total staleness can be divided into two parts, which we analyze separately: pre-queue and in-queue staleness.

The key variables to keep track of in the derivation are:

| Variable | Symbol | Units |
|---|---|---|
| Response length distribution | $\mathcal{D}$ | tokens |
| Rollout throughput | $v_R$ | tokens per second |
| Sampling concurrency | $C$ | slots |
| Train throughput | $v_T$ | tokens per second |
| Groups per train batch | $G$ | groups |
| Samples per group | $S$ | rollouts |
| Max queue capacity | $Q$ | rollouts |

These quantities allow us to derive the following:

| Variable | Symbol | Units |
|---|---|---|
| Trainer utilization | $\rho = \dfrac{v_R}{v_T}$ | - |
| Batch size | $B = GS$ | rollouts |
| Decode speed (per request) | $s_R = v_R/C$ | tokens per second |
| Mean response length | $\mathbb{E}[L]$ (where $L \sim \mathcal{D}$) | tokens |
| Group tailness multiplier | $M_{\text{tail}} = \dfrac{\mathbb{E}[\max(L_1,\ldots,L_S)]}{\mathbb{E}[L]}$ | - |
| Queue factor | $q = \dfrac{Q}{B}$ | - |
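As a rough illustration of how the derived quantities follow from the primary ones, here is a minimal Python sketch. The class name `AsyncRLConfig`, the helper `tail_multiplier`, and all numeric values are hypothetical and chosen only for the example. The Monte Carlo estimate of $M_{\text{tail}}$ additionally assumes that the $S$ rollouts of a group are independent draws from the length distribution, which real prompt groups may not satisfy.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class AsyncRLConfig:
    """Hypothetical container for the primary quantities in the table above."""
    v_R: float  # rollout throughput (tokens/s)
    v_T: float  # train throughput (tokens/s)
    C: int      # sampling concurrency (slots)
    G: int      # groups per train batch
    S: int      # samples per group
    Q: int      # max queue capacity (rollouts)

    @property
    def rho(self) -> float:   # trainer utilization
        return self.v_R / self.v_T

    @property
    def B(self) -> int:       # batch size (rollouts)
        return self.G * self.S

    @property
    def s_R(self) -> float:   # per-request decode speed (tokens/s)
        return self.v_R / self.C

    @property
    def q(self) -> float:     # queue factor
        return self.Q / self.B


def tail_multiplier(lengths, S, n_groups=20_000, seed=0):
    """Monte Carlo estimate of M_tail = E[max(L_1..L_S)] / E[L].

    Groups are formed by resampling `lengths` with replacement, i.e. the S
    rollouts of a group are assumed to be i.i.d. draws (an assumption).
    """
    rng = random.Random(seed)
    mean_len = sum(lengths) / len(lengths)
    mean_max = sum(max(rng.choices(lengths, k=S)) for _ in range(n_groups)) / n_groups
    return mean_max / mean_len


cfg = AsyncRLConfig(v_R=12_000, v_T=16_000, C=120, G=15, S=8, Q=240)
print(cfg.rho, cfg.B, cfg.s_R, cfg.q)  # 0.75 120 100.0 2.0
```

In this example the configuration is rollout-bound ($\rho = 0.75 < 1$) with a queue that holds two batches ($q = 2$).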

Pre-queue staleness

The pre-queue staleness (PQS) measures how many policy updates happen while a rollout is being generated. Formally, it is the average difference in the policy version between the start and end of a rollout. To calculate the PQS, we need two pieces of information:

  • Group generation time: the time it takes to generate a rollout group,
  • Effective train period: the time between two policy updates.

When training with GRPO, rollouts enter the queue when all $S$ rollouts for the same prompt have been generated. Assuming that the rollouts in a group start at the same time, the mean group generation time is

$$t_{\text{generate}} = \frac{\mathbb{E}[\max(L_1, \ldots, L_S)]}{s_R} = \frac{M_{\text{tail}}\,\mathbb{E}[L]}{s_R} = \frac{C M_{\text{tail}}\,\mathbb{E}[L]}{v_R},$$

where $L_i \sim \mathcal{D}$.

The training period depends on the number of tokens per batch as well as the slower of the train and rollout throughputs, because either a batch is already available for the trainer to process or the trainer is waiting for the batch to finish generating. The mean number of tokens in a batch is $B\cdot\mathbb{E}[L]$, so the mean effective train period is

$$t_{\text{period}} = \frac{B\,\mathbb{E}[L]}{\min(v_R, v_T)}.$$

Putting these together, the PQS is

$$\text{PQS}(\rho) = \frac{t_{\text{generate}}}{t_{\text{period}}} = \frac{C M_{\text{tail}}}{B\max(1,\rho)} = \begin{cases} \dfrac{C M_{\text{tail}}}{B}, & \rho<1, \\[8pt] \dfrac{C M_{\text{tail}}}{\rho B}, & \rho>1. \end{cases}$$
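To make the cancellation concrete, the following Python sketch computes the generation time and the effective train period separately and checks that their ratio equals the closed form. The function names and the numeric workload (including the mean response length and throughputs) are illustrative assumptions, not measurements from our runs.

```python
import math


def t_generate(C, M_tail, mean_len, v_R):
    """Mean time to finish a whole group: slowest rollout at speed s_R = v_R / C."""
    return C * M_tail * mean_len / v_R


def t_period(B, mean_len, v_R, v_T):
    """Mean time between policy updates: batch tokens over the slower side."""
    return B * mean_len / min(v_R, v_T)


def pqs(C, M_tail, B, rho):
    """Closed-form pre-queue staleness."""
    return C * M_tail / (B * max(1.0, rho))


# Illustrative numbers only (not measurements from the post).
C, M_tail, B, mean_len, v_T = 120, 1.42, 240, 7_000, 10_000
for v_R in (6_300, 15_000):          # rollout-bound, then train-bound
    rho = v_R / v_T
    ratio = t_generate(C, M_tail, mean_len, v_R) / t_period(B, mean_len, v_R, v_T)
    assert math.isclose(ratio, pqs(C, M_tail, B, rho))
    print(f"rho={rho:.2f}  PQS={ratio:.3f}")
```

With these inputs, PQS is about 0.71 in the rollout-bound case and drops to about 0.47 at $\rho = 1.5$. The mean response length appears in both times and cancels in the ratio.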

In-queue staleness

The in-queue staleness (IQS) measures how many train steps pass while a rollout is in the queue. Without some control over the queue size, the queue can grow unbounded, leading to unlimited staleness. As such, the queue algorithms used in practice set a limit on how high staleness can grow, either explicitly as in queue-max or implicitly like queue-drop.

Both of these algorithms send the oldest $G$ completed groups to the trainer, but queue-max discards samples above a max staleness threshold, while queue-drop drops the oldest groups when the number of groups in the queue exceeds its capacity. We find that queue-max preferentially trains on shorter rollouts, which biases the trained rollout distribution. On the other hand, queue-drop is approximately unbiased with respect to response length because, once steady state is reached, whether a group is trained on is independent of its response lengths (a formal proof is given in the appendix). Therefore, in this blog, we choose to analyze queue-drop.
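The difference between the two eviction rules can be sketched in a few lines of Python. This is a simplified illustration rather than our production queue: the `Group` class, the function names, and the choice to measure a group's staleness as `current_version - start_version` at eviction time are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    start_version: int                 # policy version when generation began
    rollouts: list = field(default_factory=list)


def push_queue_drop(queue, group, capacity_groups):
    """queue-drop: enqueue a completed group, then evict the oldest overflow."""
    queue.append(group)                # index 0 is the oldest group
    dropped = []
    while len(queue) > capacity_groups:
        dropped.append(queue.pop(0))
    return dropped


def evict_queue_max(queue, current_version, max_staleness):
    """queue-max: discard groups whose staleness would exceed the threshold."""
    kept = [g for g in queue if current_version - g.start_version <= max_staleness]
    dropped = [g for g in queue if current_version - g.start_version > max_staleness]
    queue[:] = kept
    return dropped


def pop_batch(queue, G):
    """Both algorithms hand the oldest G completed groups to the trainer."""
    if len(queue) < G:
        return None                    # trainer waits for a full batch
    batch = queue[:G]
    del queue[:G]
    return batch


queue = []
for version in range(5):
    push_queue_drop(queue, Group(start_version=version), capacity_groups=3)
print([g.start_version for g in queue])  # [2, 3, 4]
```

Note that queue-drop decides eviction purely by queue position, while queue-max looks at how long ago a group started generating, which is what couples eviction to response length.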

To derive the staleness of queue-drop, we make the following assumptions:

  1. Rollout groups arrive in the queue at a constant rate.
  2. The trainer processes a batch of groups from the queue at a constant rate.
  3. Syncing weights from the trainer to the rollout engines takes negligible time (e.g. via RDMA).

With these assumptions, we can calculate the IQS. We split the derivation into two cases: rollout-bound ($\rho<1$) and train-bound ($\rho>1$).

Case 1: rollout-bound ($\rho<1$)

Let the current train step be $\tau$. Since $\rho<1$, the trainer is faster than the rollout engines, so the trainer fully drains the queue at the start of the forward pass. Thus, at the start of the train step, the queue is empty. At the end, $\rho B$ rollouts have arrived in the queue, all admitted during generation version $\tau$. The trainer must wait for $(1-\rho)B$ more rollouts to arrive, all of which will enter the queue at version $\tau+1$. Thus, during step $\tau+1$, the mean policy version of queue samples is

$$\frac{\rho B\tau + (1-\rho)B(\tau+1)}{B} = \tau+1-\rho.$$

After the policy is updated to version $\tau+1$, we derive a mean IQS of $(\tau+1)-(\tau+1-\rho)=\rho$.

Case 2: train-bound ($\rho>1$)

When $\rho>1$, the rollout side is faster than the trainer, so in steady state, our queue reaches the max capacity of $Q_{\text{group}}=Q/S$ rollout groups. To find the mean IQS of a train batch, we first find the average age of a rollout in the batch, then divide it by the train period to obtain a continuous estimate of staleness. Finally, we use a smooth-arrival approximation to correct our estimate to the discrete case, since staleness increases in integral increments.

First, since the trainer takes $G$ groups per step and rollout produces $\rho$ times as fast as the trainer consumes, the rollout engine completes $\rho G$ groups per train step. Thus, under the constant group arrival rate assumption, the average time between completed group arrivals is

$$\Delta t \approx \frac{t_{\text{period}}}{\rho G}.$$

Order the queue from newest to oldest with ranks $1$ to $Q_{\text{group}}$. The trainer consumes at a roughly uniform time between two completed-group arrivals. As such, the newest queued group, rank $1$, waits about half a group-arrival interval on average. A rank-$r$ group arrived $r-1$ group arrivals before the newest group, so its mean queue age is

$$t_{\text{queue}}(r) = \left(r-\frac12\right)\Delta t \approx \left(r-\frac12\right)\frac{t_{\text{period}}}{\rho G}.$$

Thus, the mean queue age of the consumed batch is

$$\bar t_{\text{queue}} = \frac{1}{G}\sum_{r=Q_{\text{group}}-G+1}^{Q_{\text{group}}} t_{\text{queue}}(r) \approx \frac{t_{\text{period}}}{\rho G}\left(Q_{\text{group}}-\frac{G}{2}\right).$$

Dividing by $t_{\text{period}}$, we have

$$\frac{\bar t_{\text{queue}}}{t_{\text{period}}} \approx \frac{Q_{\text{group}}-\frac{G}{2}}{\rho G} = \frac{2q-1}{2\rho},$$

since $q = Q_{\text{group}}/G$.

As mentioned above, this is the continuous queue age measured in train-step units. However, staleness ultimately measures a discrete quantity: the number of policy updates between when a rollout begins generating and when it is trained on. To bridge this gap, we introduce the smooth-arrival approximation, which states that a rollout’s expected staleness at train time is its total age divided by the train period, plus $\frac{1}{2}$, when the standard deviation of the response length distribution is on the order of $\Omega(t_{\text{period}}\cdot s_R)$. See the appendix for a more detailed derivation.

The smooth-arrival approximation allows us to write the IQS of a discrete queue as

$$\mathrm{IQS}(\rho) \approx \frac{2q-1}{2\rho} + \frac12 = \frac{2q+\rho-1}{2\rho}, \qquad \rho>1.$$

Combining the two cases, the queue-drop in-queue staleness is

$$\mathrm{IQS}(\rho) = \begin{cases} \rho, & \rho<1, \\[6pt] \dfrac{2q+\rho-1}{2\rho}, & \rho>1. \end{cases}$$
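As a sanity check on the train-bound derivation, the short Python sketch below sums the per-rank queue ages directly and compares the result with the closed form. The function names and the example values ($Q_{\text{group}}=30$, $G=15$, $\rho=1.2$) are hypothetical. The formulas leave $\rho=1$ unspecified, so treating it as train-bound here is an assumption of the sketch.

```python
import math


def iqs(rho, q):
    """Closed-form in-queue staleness for queue-drop.

    rho == 1 is treated as train-bound here; the derivation covers only
    rho < 1 and rho > 1, so this boundary choice is an assumption.
    """
    if rho < 1:
        return rho
    return (2 * q + rho - 1) / (2 * rho)


def mean_batch_queue_age(Q_group, G, rho):
    """Mean queue age of the consumed batch, in units of t_period.

    Sums (r - 1/2) * dt over the oldest G ranks, where dt is the group
    inter-arrival time expressed in train periods.
    """
    dt = 1.0 / (rho * G)
    ranks = range(Q_group - G + 1, Q_group + 1)
    return sum((r - 0.5) * dt for r in ranks) / G


Q_group, G, rho = 30, 15, 1.2
q = Q_group / G
age = mean_batch_queue_age(Q_group, G, rho)
assert math.isclose(age, (2 * q - 1) / (2 * rho))   # continuous queue age
assert math.isclose(iqs(rho, q), age + 0.5)         # plus the smooth-arrival 1/2
print(round(age, 3), round(iqs(rho, q), 3))         # 1.25 1.75
```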

Total staleness

Putting it all together, we find that the mean total staleness using the queue-drop algorithm is

$$\mathrm{staleness}(\rho) = \begin{cases} \dfrac{C M_{\text{tail}}}{B}+\rho, & \rho<1, \\[8pt] \dfrac{C M_{\text{tail}}}{\rho B} + \dfrac{2q+\rho-1}{2\rho}, & \rho>1. \end{cases}$$
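The full formula is simple enough to evaluate in a few lines. Below is a minimal Python sketch. The function names are hypothetical, and the inputs are the rounded values reported for one of the real runs in this post ($C=120$, $B=240$, $q=2$, $\rho=0.63$, $M_{\text{tail}}=1.42$).

```python
def staleness_components(rho, C, M_tail, B, q):
    """Return (PQS, IQS) for queue-drop; rho == 1 is treated as train-bound."""
    pqs = C * M_tail / (B * max(1.0, rho))
    iqs = rho if rho < 1 else (2 * q + rho - 1) / (2 * rho)
    return pqs, iqs


def mean_staleness(rho, C, M_tail, B, q):
    """Predicted mean total staleness: pre-queue plus in-queue."""
    return sum(staleness_components(rho, C, M_tail, B, q))


pqs, iqs = staleness_components(rho=0.63, C=120, M_tail=1.42, B=240, q=2)
print(round(pqs, 2), round(iqs, 2), round(pqs + iqs, 2))  # 0.71 0.63 1.34
```

The total of 1.34 agrees with the predicted staleness reported for that run. Because the reported inputs are rounded, other configurations need not reproduce to the last digit when recomputed this way.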

To validate our staleness formula efficiently, we built a discrete-event async RL simulator. The simulator mirrors the parts of the real system that determine staleness: rollout generation, buffering, trainer consumption, and weight updates. We give a full description of our simulator in the appendix.

Comparing our formula against the simulator, we see that our predictions are accurate:

Predicted and simulated mean staleness as a function of trainer utilization ρ, for queue factors q=2 (left) and q=5 (right). Each curve varies the tailness multiplier M_tail.

We also evaluate our formula against real training runs. All runs train Qwen3-8B with GRPO on the DAPO math reasoning task. For each run, we extract $C$, $G$, $S$, $M_{\text{tail}}$, and $\rho$ from logged metrics, then compare the predicted staleness against its empirical mean:

| Concurrency $C$ | Batch size $B$ | Queue factor $q$ | Utilization $\rho$ | Tailness multiplier $M_{\text{tail}}$ | Predicted staleness | Measured staleness | Error |
|---|---|---|---|---|---|---|---|
| 120 | 240 | 2 | 0.63 | 1.42 | 1.34 | 1.26 | +0.08 |
| 240 | 120 | 2 | 0.92 | 1.43 | 3.86 | 3.59 | +0.27 |
| 128 | 128 | 2 | 1.07 | 1.44 | 3.29 | 3.09 | +0.20 |
| 240 | 120 | 1 | 0.86 | 1.42 | 3.34 | 3.40 | −0.06 |
| 120 | 120 | 1 | 0.67 | 1.42 | 1.95 | 1.92 | +0.03 |
| 128 | 128 | 1 | 1.14 | 1.45 | 2.24 | 2.01 | +0.24 |

Moving along the staleness frontier

Our staleness formula suggests that the following factors can meaningfully affect staleness:

  • Response length tailness
  • Compute allocation
  • Batch size
  • Queue capacity
  • Sampling concurrency

Below, we present methods of mitigating staleness, the intuition behind them, and their trade-offs.

Monitor the response length tailness

A key finding is that staleness is independent of the mean response length $\mathbb{E}[L]$. No term in our formula depends on this quantity. While response length does affect generation time in PQS, generation time and effective train period both increase linearly with the response length, thus cancelling out. In contrast, the important statistic from the response length distribution is the tailness $M_{\text{tail}}$, which increases staleness proportionally.
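One way to see both effects is to compute $M_{\text{tail}}$ for a small empirical sample, then rescale or reshape it. The Python sketch below does this exactly (no sampling noise) under the assumption that the $S$ rollouts in a group are independent draws from the sample. The helper name `tail_multiplier_exact` and the length values are hypothetical.

```python
import math


def tail_multiplier_exact(lengths, S):
    """M_tail for S i.i.d. draws (with replacement) from an empirical sample.

    Uses P(max <= x_(k)) = (k/n)^S for the k-th smallest of n values.
    """
    xs = sorted(lengths)
    n = len(xs)
    e_max = sum(x * ((k / n) ** S - ((k - 1) / n) ** S)
                for k, x in enumerate(xs, start=1))
    return e_max / (sum(xs) / n)


def pqs(C, M_tail, B, rho):
    """Closed-form pre-queue staleness."""
    return C * M_tail / (B * max(1.0, rho))


lengths = [1_000, 2_000, 3_000, 8_000]      # hypothetical response lengths (tokens)
doubled = [2 * x for x in lengths]          # same shape, twice the mean
heavier = [1_000, 2_000, 3_000, 20_000]     # heavier tail

m_base = tail_multiplier_exact(lengths, S=8)
m_doubled = tail_multiplier_exact(doubled, S=8)
m_heavier = tail_multiplier_exact(heavier, S=8)

assert math.isclose(m_base, m_doubled)      # rescaling lengths leaves M_tail unchanged
assert m_heavier > m_base                   # a heavier tail raises M_tail
print(pqs(120, m_base, 240, 0.63), pqs(120, m_heavier, 240, 0.63))
```

Doubling every length leaves $M_{\text{tail}}$ (and therefore PQS) unchanged, while stretching only the longest response raises both.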

Sensitivity of mean staleness to response length statistics predicted by the formula. Both panels plot mean staleness versus trainer utilization ρ with all settings fixed except mean response length on the left and tailness multiplier on the right.

We validate this effect in real runs. We select four 64-prompt subsets that separately vary average response length and group tailness. We find that for roughly fixed mean response lengths, increasing the tailness multiplier leads to higher staleness in both the short-response and long-response bands. Conversely, increasing mean response length while keeping tailness similar did not increase staleness.

| Prompt subset type | Mean response length (tokens) | Tailness multiplier $M_{\text{tail}}$ | Mean staleness |
|---|---|---|---|
| Short mean response length, low grouped tailness | 7.31k | 1.35 | 1.98 |
| Short mean response length, high grouped tailness | 6.74k | 1.68 | 2.37 |
| Long mean response length, low grouped tailness | 11.29k | 1.29 | 1.85 |
| Long mean response length, high grouped tailness | 10.38k | 1.68 | 2.22 |

In our production runs, we note that the response length distribution over time is a leading indicator of staleness changes. In heavy-tailed settings, a concrete lever here is tuning the response length cap. However, the max response length is often task dependent, so varying the cap can have large consequences for final results.

Change compute allocation

Per our formula, varying $\rho$ allows us to adjust staleness. We can control this quantity with our compute allocation. Specifically, if our goal is to reduce staleness, then:

  • When we are rollout-bound ($\rho<1$), staleness scales linearly with $\rho$, so we should allocate less compute to rollouts.
  • When we are train-bound ($\rho>1$), staleness scales inversely with $\rho$, so we should allocate less compute to training.

Clearly, adjusting $\rho$ away from $1$ has undesirable effects on performance, which suggests that there is a trade-off between staleness and performance. We can make this trade-off explicit: assuming constant token throughput $c_R$ per rollout GPU and $c_T$ per train GPU, we can derive the effective train period (our proxy for performance) for different compute splits of a fixed number of GPUs $N$:

$$t_{\text{period}}(\rho)= \begin{cases} \dfrac{B\,\mathbb{E}[L]}{\rho N}\left(\dfrac{1}{c_T}+\dfrac{\rho}{c_R}\right), & \rho<1,\\[8pt] \dfrac{B\,\mathbb{E}[L]}{N}\left(\dfrac{1}{c_T}+\dfrac{\rho}{c_R}\right), & \rho\ge1. \end{cases}$$
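The closed form can be cross-checked against the definition $t_{\text{period}} = B\,\mathbb{E}[L]/\min(v_R, v_T)$ by solving for the GPU split that yields a given $\rho$. The Python sketch below does so under the stated linear-throughput assumption ($v_R = c_R N_R$, $v_T = c_T N_T$, $N_R + N_T = N$). The helper `gpu_split`, the fractional GPU counts, and all numeric values are illustrative assumptions.

```python
import math


def gpu_split(rho, N, c_T, c_R):
    """Rollout/train GPU counts implied by rho = (c_R * N_R) / (c_T * N_T).

    Fractional GPU counts are allowed here; this is an idealization.
    """
    N_T = N / (1 + rho * c_T / c_R)
    return N - N_T, N_T


def t_period(rho, B, mean_len, N, c_T, c_R):
    """Closed-form effective train period for a fixed pool of N GPUs."""
    base = B * mean_len / N * (1 / c_T + rho / c_R)
    return base / rho if rho < 1 else base


# Hypothetical workload: 8 GPUs, per-GPU throughputs in tokens/s.
B, mean_len, N, c_T, c_R = 128, 8_000, 8, 4_000, 2_000
for rho in (0.5, 0.8, 1.0, 1.25, 2.0):
    N_R, N_T = gpu_split(rho, N, c_T, c_R)
    direct = B * mean_len / min(c_R * N_R, c_T * N_T)   # definition of t_period
    assert math.isclose(direct, t_period(rho, B, mean_len, N, c_T, c_R))
    print(f"rho={rho:<4}  N_R={N_R:.2f}  N_T={N_T:.2f}  t_period={direct:.1f}s")
```

In this sweep the train period is shortest at $\rho = 1$ and grows in both directions, which is the performance cost of moving utilization away from balance.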

We prove this result in the appendix. Combining this with the staleness formula, we can plot the staleness-performance Pareto frontier:

 Staleness-performance trade-off under a fixed compute budget predicted by the formula. We sweep GPU allocation between rollout and training with the same workload.

In practice, we choose a point along this frontier depending on the underlying constraints of the task and the downstream results. Real run results also show a clear staleness-performance trade-off:

Empirical staleness-performance frontier from Qwen3-8B DAPO runs under a fixed 8-GPU budget. Each point reallocates GPUs between rollout and training, while keeping rollout concurrency high enough to saturate the rollout engines.

The above plots also suggest a simple rule: it is better to be rollout-bound than train-bound. We demonstrate in the appendix that this is true except in rare cases where producing rollouts is extremely cheap compared to training on them.

Increase the batch size

Scaling the batch size reduces staleness mainly through the PQS, since a larger batch size increases the effective train period without changing the decode speed (assuming constant rollout concurrency), resulting in fewer weight updates during the same duration of a single rollout’s generation. However, the trade-off is that each step also takes proportionally longer since we have more work to do in each step.

This does not necessarily mean larger batches are less efficient. Instead, we should evaluate the update period by the amount of useful data in the update. If each sample in a batch contributes equally to it, the simplest normalized measure is $t_{\text{step}}/B$, which measures wall-clock time per rollout sample. However, the benefit of larger batches is often sublinear. A more realistic proxy is a noise-adjusted update time such as $t_{\text{step}}/\sqrt{B}$ (or ideally $t_{\text{step}}/\sqrt{B_{\text{eff}}}$ when the effective batch size can be estimated from sample correlation, masking, importance weights, or filtering).

We plot the staleness-performance frontier below using the noise-corrected update time. Different batch sizes trace roughly the same Pareto frontier, but larger batches are able to achieve lower staleness than smaller ones. Conversely, smaller batches can reach lower update times that larger batches cannot. Thus, excluding additional ML effects such as reduced optimizer-update count, sample-efficiency changes, or effective-batch saturation, the choice of batch size is primarily about which region of the frontier we wish to reach.

Sensitivity of the staleness-performance trade-off to batch size under a fixed GPU budget. We compare predicted staleness against the noise-adjusted effective training period t_period/√B, which accounts for the sublinear benefit of using more rollouts per update.

We observe the batch size effect in real runs. In the table below, we hold concurrency fixed at $C = 120$ for a clean comparison.

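| $B$ | $\rho$ | Predicted PQS | Predicted IQS | Predicted staleness | Measured staleness | Step time (s) | Step time / $B$ (s) | Step time / $\sqrt{B}$ (s) |
|---|---|---|---|---|---|---|---|---|
| 60 | 0.62 | 2.76 | 0.62 | 3.38 | 3.03 | 72.1 | 1.20 | 9.3 |
| 120 | 0.67 | 1.28 | 0.67 | 1.95 | 1.92 | 137.1 | 1.14 | 12.5 |
| 240 | 0.63 | 0.72 | 0.63 | 1.35 | 1.26 | 270.4 | 1.13 | 17.5 |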

Doubling the batch size approximately halved PQS, consistent with the $1/B$ scaling in the PQS formula. On the performance side, the effective train period nearly doubled with each doubling of $B$, but the per-rollout effective train period $t_{\text{period}}/B$ is virtually identical across runs. Under the more realistic noise-adjusted measure $t_{\text{period}}/\sqrt{B}$, the $B=240$ run is about 40% slower than the $B=120$ run, indicating a genuine trade-off between staleness and performance. This is consistent with the theoretical frontier above: larger batches dominate the low-staleness regime, while smaller batches are preferable when we want faster steps.
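The two normalized columns in the table above can be recomputed from the raw step times. Here is a small Python sketch, where the helper name `normalized_step_times` is hypothetical and the step times are the measured values from the table:

```python
import math


def normalized_step_times(step_time_s, B):
    """Return (time per rollout, noise-adjusted time) for one train step."""
    return step_time_s / B, step_time_s / math.sqrt(B)


# Step times (seconds) from the three runs in the table above, keyed by batch size.
step_times = {60: 72.1, 120: 137.1, 240: 270.4}
for B, t in step_times.items():
    per_rollout, noise_adjusted = normalized_step_times(t, B)
    print(f"B={B:<4} t/B={per_rollout:.2f}s  t/sqrt(B)={noise_adjusted:.1f}s")

slowdown = (normalized_step_times(270.4, 240)[1]
            / normalized_step_times(137.1, 120)[1] - 1)
print(f"B=240 vs B=120 noise-adjusted slowdown: {slowdown:.1%}")
```

Computed from the unrounded step times, the noise-adjusted slowdown from $B=120$ to $B=240$ comes out just under 40%.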

Shrink the queue capacity

Queue capacity only enters the IQS formula in the train-bound regime ($\rho > 1$):

$$\text{IQS}(\rho)=\frac{2q+\rho -1}{2\rho}.$$

In this regime, the rollout engines produce groups faster than the trainer can consume them, so the queue fills to max capacity. The trainer always consumes the oldest groups first — which are also the stalest — so a larger queue means more stale groups accumulate before reaching the trainer.

In the rollout-bound regime ($\rho < 1$), the formula is independent of $q$ because the trainer drains the queue every step regardless of its size. These effects are shown below:

Sensitivity of the staleness-performance trade-off to the queue size under a fixed compute budget. Increasing q leaves the rollout-bound branch unchanged but increases train-bound in-queue staleness because older rollouts can remain in the queue longer.

Based on the formula, we should always minimize $q$. In practice, however, we find the optimal choice to be more nuanced. Intuitively, extra queue capacity should help performance: when the rate of rollouts arriving to the queue occasionally dips below the mean, extra queue capacity ensures that the queue always has sufficient data for the trainer to fetch. This effect will be significant when $\rho \approx 1$ and there is high arrival variability, which our formula assumes to be zero.

For simplicity, we use our staleness simulation to show this effect. At $\rho=1$, using $q=1$ decreases staleness compared to $q=2$ and $q=4$ but also slightly increases the effective train period.

Simulated mean staleness as a function of the effective train period t_period. Curves sweep the rollout/trainer compute split under a fixed total compute budget. Solid segments are rollout-bound (ρ < 1), dashed segments are train-bound (ρ > 1), and stars mark the balanced point (ρ = 1).

Regardless, our recommendation is to set $q=1$. A larger queue is only beneficial when the system is near balance, and the slight performance improvement does not justify the risk of extra staleness if $\rho$ drifts above 1.

We observe the queue capacity effect in real runs as well. The following two runs share $C = 128$, $B = 128$, and the same compute split, but differ in queue factor ($q = 2$ vs. $q = 1$).

| $q$ | $\rho$ | Predicted PQS | Predicted IQS | Predicted staleness | Measured staleness | Step time (s) |
|---|---|---|---|---|---|---|
| 2 | 1.07 | 1.36 | 1.93 | 3.29 | 3.09 | 141.4 |
| 1 | 1.14 | 1.31 | 0.94 | 2.25 | 2.01 | 142.6 |

Both runs are train-bound, and the $q = 1$ configuration achieved 35% lower measured staleness while maintaining a virtually identical effective train period. The reduction is concentrated almost entirely in the in-queue component, consistent with the formula.

Tune the rollout concurrency

Rollout concurrency $C$ enters staleness exclusively through the PQS term. We analyze the effect of concurrency in the linear throughput case, where $v_R$ scales linearly with $C$. The effect of concurrency depends on the utilization:

  • Train-bound ($\rho > 1$): PQS simplifies to $\frac{C M_{\text{tail}}}{B \rho} = \frac{M_{\text{tail}} v_T}{B s_R}$. Because the right-hand side is independent of $C$ (assuming per-slot decode speed $s_R$ is roughly constant as concurrency changes), reducing concurrency has little effect on staleness in this regime.
  • Rollout-bound ($\rho < 1$): PQS is $\frac{C M_{\text{tail}}}{B}$, which scales linearly with $C$. Reducing concurrency directly lowers staleness because fewer slots are actively generating at once, meaning fewer samples are stamped with stale policy versions before they reach the queue.
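The two regimes above can be illustrated numerically. The following Python sketch assumes the linear throughput case ($v_R = C \cdot s_R$ with constant $s_R$). The function name and all numeric values are hypothetical.

```python
import math


def pqs_with_linear_throughput(C, s_R, v_T, M_tail, B):
    """PQS when rollout throughput scales linearly with concurrency: v_R = C * s_R."""
    rho = C * s_R / v_T
    return C * M_tail / (B * max(1.0, rho)), rho


# Illustrative numbers only: s_R in tokens/s per slot, v_T in tokens/s.
s_R, v_T, M_tail, B = 100.0, 10_000.0, 1.4, 120

for C in (40, 80, 120, 240):
    value, rho = pqs_with_linear_throughput(C, s_R, v_T, M_tail, B)
    regime = "rollout-bound" if rho < 1 else "train-bound"
    print(f"C={C:<4} rho={rho:.1f}  PQS={value:.3f}  ({regime})")

# Train-bound PQS equals M_tail * v_T / (B * s_R), independent of C.
assert math.isclose(pqs_with_linear_throughput(240, s_R, v_T, M_tail, B)[0],
                    M_tail * v_T / (B * s_R))
```

In this example, doubling $C$ from 40 to 80 doubles PQS, while going from 120 to 240 leaves it unchanged.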

At the same time, lower concurrency decreases rollout throughput and thus increases the effective train period. We plot the theoretical staleness-performance frontier under different per-GPU concurrencies below (we adjust concurrency per replica so the comparison isolates the concurrency effect from changes in rollout compute allocation). The frontiers largely overlap: different concurrency settings trace the same Pareto curve, but lower concurrency extends it into the low-staleness region at the cost of longer steps. In practice, the choice depends on whether the priority is minimizing staleness or maximizing training speed.

Sensitivity of the staleness-performance trade-off to the rollout concurrency under a fixed compute budget. Higher concurrency can reduce the effective train period t_period by increasing rollout throughput, but it also increases the number of in-flight samples and can raise staleness.

We observe this trade-off directly in real runs. The following two runs both use $q = 1$, $B = 120$, $S = 8$, and the same number of rollout GPUs, and differ only in rollout concurrency.

| $C$ | $\rho$ | Predicted PQS | Predicted IQS | Predicted staleness | Measured staleness | Step time (s) |
|---|---|---|---|---|---|---|
| 120 | 0.67 | 1.28 | 0.67 | 1.95 | 1.92 | 137.1 |
| 240 | 0.85 | 2.24 | 0.85 | 3.09 | 3.15 | 106.4 |

Halving the concurrency reduces staleness by 39% at the cost of 29% longer effective train periods. The staleness reduction is driven primarily by PQS, as predicted.

Conclusion

In this post, we developed a closed-form framework for reasoning about staleness in fully async RL. By decomposing staleness into its pre-queue and in-queue components, we derived tractable formulas that depend on a small number of system parameters: response length tailness $M_{\text{tail}}$, utilization $\rho$, batch size $B$, rollout concurrency $C$, and queue factor $q$. Our predictions match both discrete-event simulations and real training runs to within fractions of a step.

The central insight is that staleness and throughput form a Pareto frontier. For a fixed GPU budget, reducing staleness requires moving utilization away from $\rho = 1$, which necessarily costs throughput. Our analysis also reveals that the frontier is asymmetric: it is almost always better to be rollout-bound than train-bound. The practical levers for navigating this frontier (compute allocation, batch size, queue capacity, and concurrency) each offer different trade-off profiles, and our formulas make it possible to reason about these trade-offs quantitatively before committing to a run.

Using this work, practitioners can predict staleness from a proposed configuration before launching a run, diagnose the dominant source of staleness in a live system, and choose targeted interventions with known costs. We reiterate our general rules of thumb:

  • Set $q=1$: a minimal queue avoids unnecessary staleness accumulation with negligible throughput cost.
  • Prefer to be rollout-bound: slightly under-provisioning rollout gives a better staleness-performance trade-off than over-provisioning.
  • Monitor response length tailness: it can be a leading indicator of staleness changes.
  • Trade off staleness against train period via batch size: larger batches access lower-staleness operating points at the cost of slower noise-adjusted updates.

We end by highlighting a few directions for future work:

  1. Better queue algorithms: can we design algorithms that exploit the response-length distribution to reduce staleness below queue-drop without introducing bias?
  2. Algorithm bias beyond response length: on that note, what other types of bias might be worth investigating? We showed that queue-drop is unbiased with respect to response length, but other sample properties may be correlated with generation time in subtle ways.
  3. Staleness vs. task reward: our analysis treats staleness as a proxy for off-policy degradation, but the actual relationship between staleness and downstream results is task-dependent. How can we characterize this relationship and identify when staleness is detrimental?

If any of these directions excite you, we’d love to explore them together at Applied Compute.

Appendix

Response length bias in queue algorithms

Recall that queue-max evicts samples past a max staleness threshold. Since longer samples accrue more policy updates while generating (higher PQS), they are more likely to be dropped, resulting in a shorter response length distribution of rollouts sent to the trainer. In contrast, queue-drop evicts by queue position, which we will show does not bias the response length in steady state.

We begin with the following steady-state approximation: in steady state, the response lengths of completed rollouts are independent from the lengths and generation progress of in-flight rollouts. This has two consequences:

  1. Distinct groups have independent response lengths
  2. A group’s response length is independent of the completion times of all other groups.

We now divide the proof into two cases:

Case 1: rollout-bound

Since the trainer is faster than the rollout engines, it must process every sample that the rollout engine produces. Thus, the response length distribution of trained samples is exactly the distribution of response lengths produced.

Case 2: train-bound

In steady-state, the queue is always full before a train step, with train batches sampled from the end. Formally, we train on the groups with ranks $\{Q_{\text{group}}-G+1,\ldots,Q_{\text{group}}\}$, where rank $1$ is the newest group and $Q_{\text{group}}$ is the oldest group.

Consider some candidate group $g$. This group will be in the queue until it is dropped at some time $\tau_d$, when $Q_{\text{group}}$ groups finish generating after $g$. By the steady-state approximation, the completion times of groups after $g$ are independent of any groups already completed. Therefore, the queue exit time $\tau_d$ is independent of group $g$ or any group generated before it.

Now, consider the time $\tau$ when the trainer trains on group $g$ or a group generated after it. If $\tau > \tau_d$, $g$ exits the queue before getting trained on. If $\tau < \tau_d$, $g$ gets trained on.

Since any step before time $\tau$ only trains on samples produced before $g$, the contents of the corresponding train batches are independent of group $g$ or any groups produced after it. Therefore, in steady state, $\tau$ is also independent of group $g$ or any groups produced after it. Since $\tau_d$ only depends on groups generated after $g$ and $\tau$ only depends on groups generated before $g$, the event $\tau < \tau_d$ is independent from $g$, so the probability that $g$ is trained on in steady-state is independent of its contents.

The steady-state approximation is not perfect; with $S > 1$, some rollouts in a group can finish earlier than others, creating extra available concurrency slots which are then biased towards containing short samples. In practice, the approximation holds as long as we process rollouts from many different groups concurrently.

Empirical results

We verify these response length biases (or lack thereof) empirically. At tailness 50, queue-drop is unbiased, whereas under queue-max the trained mean falls below the sampled mean, and the gap widens at stricter max staleness thresholds:

| Queue algorithm | Sampled mean | Trained mean | Sampled max | Trained max |
|---|---|---|---|---|
| queue-drop with Q=1 | 1404 | 1403 | 8080 | 8080 |
| queue-drop with Q=2 | 1405 | 1403 | 8080 | 8080 |
| queue-max with max staleness 1 | 1401 | 1229 | 8080 | 4983 |
| queue-max with max staleness 2 | 1399 | 1259 | 8080 | 6555 |
| queue-max with max staleness 3 | 1403 | 1292 | 8080 | 8080 |
| queue-max with max staleness 4 | 1402 | 1351 | 8080 | 8080 |

At tailness 90, the bias is even more apparent:

| Queue algorithm | Sampled mean | Trained mean | Sampled max | Trained max |
|---|---|---|---|---|
| queue-drop with Q=1 | 1349 | 1344 | 12080 | 12080 |
| queue-drop with Q=2 | 1357 | 1352 | 12080 | 12080 |
| queue-max with max staleness 1 | 1354 | 897 | 12080 | 5430 |
| queue-max with max staleness 2 | 1359 | 962 | 12080 | 6388 |
| queue-max with max staleness 3 | 1355 | 1027 | 12080 | 7979 |
| queue-max with max staleness 4 | 1351 | 1077 | 12080 | 9572 |

Smooth-arrival approximation proof

The smooth-arrival approximation estimates, from a rollout’s lifetime, the number of policy updates that occur between the start of its generation and the moment it is trained on. This approximation holds when we have a wide response length distribution. Formally, consider a rollout which begins generating at time $t_0$, where $t_0$ is a random variable, and is trained on at time $t_1$, which is fixed and known. Then the smooth-arrival approximation states that the expected number of policy updates between $t_0$ and $t_1$ (including the update at time $t_1$) is

$$\mathbb{E}[n_{\text{update}}]=\frac{t_1-\mathbb{E}[t_0]}{t_\text{period}}+\frac{1}{2}.$$

The approximation holds well when the standard deviation of $t_0$ is $\Omega(t_{\text{period}})$.

We now derive this approximation. Since there is a policy update at time $t_1$ and policy updates occur with interval $t_{\text{period}}$, the number of policy updates is approximately

$$n_{\text{update}}\approx \left\lceil \frac{t_1-t_0}{t_{\text{period}}}\right\rceil.$$

Here and below, $\lceil x\rceil$ is the ceiling of $x$ and $\{x\}$ is the fractional part of $x$. For all non-integer $x$, we have $\lceil x \rceil = x+1-\{x\}$. Then,

$$\begin{aligned} \mathbb{E}[n_{\text{update}}] &\approx \mathbb{E}\left[\left\lceil \frac{t_1-t_0}{t_{\text{period}}}\right\rceil\right] \\ &= \frac{t_1-\mathbb E[t_0]}{t_{\text{period}}}+1-\mathbb{E}\left[\left\{ \frac{t_1-t_0}{t_{\text{period}}}\right\}\right]. \end{aligned}$$

When the distribution of $t_0$ has a large standard deviation of order $\Omega(t_{\text{period}})$, the distribution of $\left\{ \frac{t_1-t_0}{t_{\text{period}}}\right\}$ is nearly uniform on $[0,1)$ (since $t_1$ is fixed, shifting by it does not affect uniformity). Thus, its expected value is $\frac12$, giving the desired approximation.

The smooth-arrival approximation works especially well in multi-turn, agentic settings, where rollouts have highly variable lengths. In our simulations and production runs, we find that its predictions also hold well for single-turn rollouts.
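As a quick sanity check of the derivation above, the following sketch compares a Monte Carlo estimate of the expected ceiling with the smooth-arrival formula. The Gaussian distribution for $t_0$, the function names, and all numeric values are assumptions chosen purely for illustration; they are not taken from our simulator or production runs.

```python
import math
import random


def mean_updates_mc(t1, t0_mean, t0_std, t_period, n=100_000, seed=0):
    """Monte Carlo estimate of E[ceil((t1 - t0) / t_period)].

    Assumption: t0 is Gaussian; any wide distribution would do.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(n):
        t0 = rng.gauss(t0_mean, t0_std)
        total += math.ceil((t1 - t0) / t_period)
    return total / n


def mean_updates_smooth(t1, t0_mean, t_period):
    """Smooth-arrival prediction: (t1 - E[t0]) / t_period + 1/2."""
    return (t1 - t0_mean) / t_period + 0.5


# Wide start-time distribution (std equal to t_period): the two should agree.
wide_mc = mean_updates_mc(t1=100.0, t0_mean=47.0, t0_std=10.0, t_period=10.0)
wide_pred = mean_updates_smooth(t1=100.0, t0_mean=47.0, t_period=10.0)

# Narrow start-time distribution (std much smaller than t_period): the
# fractional part is no longer close to uniform, so the formula can be off.
narrow_mc = mean_updates_mc(t1=100.0, t0_mean=47.0, t0_std=0.5, t_period=10.0)

print(f"wide:   mc={wide_mc:.3f} smooth={wide_pred:.3f}")
print(f"narrow: mc={narrow_mc:.3f} smooth={wide_pred:.3f}")
```

In the narrow case almost every rollout sees the same integer number of updates, which is why the standard-deviation condition matters.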

Building an efficient staleness simulator

To validate the staleness formula efficiently, we built a discrete-event simulator for fully asynchronous RL with a queue-drop buffer algorithm. The simulator mirrors the parts of the real system that determine staleness: rollout generation, buffering, trainer consumption, and weight-version updates.

We assume that the rollout engines have $C$ concurrent inference slots. Each slot generates one sample at a time. A sample draws its response length $L_i$ from a capped lognormal distribution $\mathcal D$: when the tailness parameter is zero, all responses have deterministic length $\mathbb E[L]$; otherwise, we sample $z \sim \mathcal N(0,1)$ and set

$$L_i = \mathbb E[L]\exp(\sigma z - \sigma^2/2), \qquad \sigma = 1.3 \cdot \frac{\text{tailness}}{100},$$

then round and cap the length to avoid unrealistically large outliers. The factor $1.3$ is chosen to help align the distribution with the response lengths of real workloads. The rollout engines’ generation time for a sample of length $L$ is $L/s_R$. A group can only enter the buffer after all $S$ samples finish, so the simulator naturally captures the straggler effect from long-tail responses as described before: the group completion time is controlled by the slowest sample in the group.
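The sampling rule above can be sketched as follows. This is a minimal illustration, not our simulator's code: the function name, the `cap` parameter value, and the floor of one token are assumptions, since the exact cap and rounding details are not specified here.

```python
import math
import random


def sample_length(mean_len, tailness, cap, rng):
    """Draw one response length from the capped lognormal described above.

    `cap` and the minimum length of 1 are illustrative assumptions.
    """
    if tailness == 0:
        # Deterministic lengths when the tailness parameter is zero.
        return int(round(mean_len))
    sigma = 1.3 * tailness / 100
    z = rng.gauss(0.0, 1.0)
    # The -sigma^2/2 term keeps the uncapped mean equal to mean_len.
    raw = mean_len * math.exp(sigma * z - sigma**2 / 2)
    return min(cap, max(1, round(raw)))


rng = random.Random(0)
flat = [sample_length(1400, 0, 8000, rng) for _ in range(5)]
capped = [sample_length(1400, 50, 8000, rng) for _ in range(10_000)]
uncapped = [sample_length(1400, 50, 10**9, rng) for _ in range(100_000)]
print(flat, max(capped), sum(uncapped) / len(uncapped))
```

Without the cap, the sample mean stays close to `mean_len`; capping trims the tail and therefore pulls the mean slightly down.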

Each group is stamped with the rollout policy version used when its generation begins. This version stays fixed through generation and buffering. Once a group completes, it enters a bounded queue of capacity $Q$. For example, under the queue-drop policy, if the queue is full when a new group arrives, the oldest queued group is dropped and the new group is inserted. The trainer consumes the $G$ oldest queued groups whenever a full training batch is available; otherwise, it waits for the queue to refill. Each train step takes $t_{\text{step}}$ seconds and increments the trainer policy version by one.

For every trained group, the simulator computes staleness exactly as

$$\text{staleness} = \text{train policy version at consumption} - \text{rollout policy version at generation}.$$

We then measure the mean staleness over trained samples after the simulation reaches steady state.
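The queue-drop bookkeeping and the staleness definition above can be sketched in a few lines. The class and function names are hypothetical, and the sketch assumes the capacity is counted in groups and that each queued entry only needs to carry its rollout policy version.

```python
from collections import deque


class QueueDropBuffer:
    """Bounded FIFO of rollout policy versions (one entry per group)."""

    def __init__(self, capacity_groups):
        self.capacity = capacity_groups
        self.groups = deque()  # left end holds the oldest group
        self.dropped = 0

    def push(self, rollout_version):
        # Queue-drop: when full, evict the oldest group, then insert.
        if len(self.groups) == self.capacity:
            self.groups.popleft()
            self.dropped += 1
        self.groups.append(rollout_version)

    def pop_batch(self, groups_per_batch):
        # The trainer consumes the oldest groups, or waits (None) if
        # a full batch is not available yet.
        if len(self.groups) < groups_per_batch:
            return None
        return [self.groups.popleft() for _ in range(groups_per_batch)]


def batch_staleness(batch_versions, train_version):
    """Staleness = train version at consumption - rollout version."""
    return [train_version - v for v in batch_versions]


buf = QueueDropBuffer(capacity_groups=2)
for version in (0, 0, 1):  # the third push evicts the first group
    buf.push(version)
batch = buf.pop_batch(groups_per_batch=2)
stale = batch_staleness(batch, train_version=2)
print(batch, stale, buf.dropped)
```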

Note that the simulator is intentionally event-driven rather than step-based: it advances directly to the next sample completion, trainer completion, or history checkpoint. This makes it fast enough to run large parameter grids while preserving the discrete effects that matter most for staleness, especially group stragglers, bounded-queue drops, trainer waiting, and integer policy-version jumps.

Train period under fixed GPU budget

Recall that the train period is the expected number of tokens in a train batch, $B\mathbb E[L]$, divided by the bottleneck throughput:

$$t_{\text{period}}=\frac{B\mathbb E[L]}{\min(v_R,v_T)}.$$

Let the total GPU budget be $N=N_R+N_T$, where $N_R$ is the number of rollout GPUs and $N_T$ is the number of train GPUs. Now assume that the token throughput $c_R$ per rollout GPU and the token throughput $c_T$ per train GPU are constant. Then $v_R=c_RN_R$ and $v_T=c_TN_T$ scale linearly with their GPU allocations.

We will re-parametrize the train period in terms of $\rho$ and $N$. Since

$$\rho=\frac{v_R}{v_T}=\frac{c_R}{c_T}\cdot\frac{N_R}{N_T},$$

we can use $N=N_R+N_T$ to derive

$$v_R=\frac{\rho N}{\frac1{c_T}+\frac1{c_R}\rho},\qquad v_T=\frac N{\frac1{c_T}+\frac1{c_R}\rho}.$$

Thus,

$$t_{\text{period}}(\rho)=\frac{B\mathbb E[L]\left(\frac1{c_T}+\frac1{c_R}\rho\right)}{N\min(1,\rho)}.$$

Consequently, $t_{\text{period}}$ is strictly decreasing on $(0,1)$, strictly increasing on $(1,\infty)$, and minimized at $\rho=1$.
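To make the re-parametrization concrete, the sketch below evaluates the closed form and cross-checks it against the definition computed from an explicit GPU split. The function names and all numeric values (per-GPU throughputs, GPU budget, batch tokens) are illustrative assumptions, and fractional GPU counts are allowed for simplicity.

```python
def train_period(rho, batch_tokens, n_gpus, c_rollout, c_train):
    """Closed form: B*E[L] * (1/c_T + rho/c_R) / (N * min(1, rho))."""
    return batch_tokens * (1 / c_train + rho / c_rollout) / (n_gpus * min(1.0, rho))


def gpu_split(rho, n_gpus, c_rollout, c_train):
    """Rollout/train GPU counts realizing utilization rho (may be fractional)."""
    n_train = n_gpus / (1 + rho * c_train / c_rollout)
    return n_gpus - n_train, n_train


# Hypothetical workload: 256 samples of 1400 tokens on 64 GPUs.
cfg = dict(batch_tokens=256 * 1400, n_gpus=64, c_rollout=2000.0, c_train=8000.0)

rhos = [0.25 * k for k in range(1, 13)]  # 0.25, 0.5, ..., 3.0
periods = [train_period(r, **cfg) for r in rhos]
best_rho = rhos[periods.index(min(periods))]
print(best_rho)
```

On this grid the minimum lands at $\rho=1$, in line with the monotonicity statement above.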

Does rollout-bound always lead to better trade-off?

As observed above, any achievable effective train period $t_{\text{period}}$ can be reached at two different utilizations (unless $\rho=1$): one less than 1 and the other greater than 1. But these setups are not equal: we often achieve lower staleness via the rollout-bound utilization compared to the train-bound utilization. The condition that determines which setup is better depends on two factors:

  1. The queue factor $q=\frac QB$, and
  2. The rollout-to-trainer compute ratio $\beta^*$ at $\rho=1$.

Smaller $\beta^*$ means rollouts are cheaper relative to training. We will derive this condition below.

As in our derivation of the formula for $t_{\text{period}}$, we will assume that rollout and train throughput scale linearly with rollout and train GPUs, respectively. This tells us that $\beta^*$ is a constant; in fact, it is exactly $\frac{c_T}{c_R}$ (both terms are defined above in our $t_{\text{period}}$ derivation).

Recall that

$$\text{IQS}(\rho) = \begin{cases} \rho, & \rho<1,\\[4pt] \dfrac{2q+\rho-1}{2\rho}, & \rho>1. \end{cases}$$

Define $\rho_+=1+z$ where $z>0$. Then we can derive the other utilization $\rho_-$ with the same effective train period as follows:

$$\begin{aligned} t_{\text{period}}(\rho_-) &= t_{\text{period}}(\rho_+),\\ \frac{B\mathbb E[L]\left(\frac1{c_T}+\frac1{c_R}\rho_-\right)}{N\rho_-} &= \frac{B\mathbb E[L]\left(\frac1{c_T}+\frac1{c_R}\rho_+\right)}{N},\\ \rho_- &= \frac1{1+\beta^*(\rho_+-1)}=\frac1{1+\beta^*z}. \end{aligned}$$
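The pairing of utilizations can be checked numerically. In the sketch below, the function names are hypothetical and the train period is normalized by the constant factor $B\mathbb E[L]/(Nc_T)$, which leaves $(1+\beta^*\rho)/\min(1,\rho)$ with $\beta^*=c_T/c_R$; the value of $\beta^*$ is an arbitrary example.

```python
def normalized_period(rho, beta_star):
    """t_period up to the constant factor B*E[L] / (N * c_T)."""
    return (1 + beta_star * rho) / min(1.0, rho)


def paired_rho_minus(rho_plus, beta_star):
    """Rollout-bound utilization with the same train period as rho_plus > 1."""
    z = rho_plus - 1
    return 1 / (1 + beta_star * z)


beta_star = 0.25  # hypothetical rollout-to-trainer compute ratio
rho_plus = 1.5
rho_minus = paired_rho_minus(rho_plus, beta_star)
print(rho_minus, normalized_period(rho_plus, beta_star), normalized_period(rho_minus, beta_star))
```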

The PQS is constant for a fixed effective train period, so the signed staleness gap is only the difference in IQS:

$$\begin{aligned} \mathrm{gap}(z;\beta^*,q) &= \mathrm{staleness}(\rho_+) - \mathrm{staleness}(\rho_-)\\ &= \text{IQS}(\rho_+)-\text{IQS}(\rho_-)\\ &= \frac{z+2q}{2(1+z)} - \frac{1}{1+\beta^* z}. \end{aligned}$$

Negative gap means the $\rho>1$ point is better. Define the best possible signed gap as

$$\mathrm{gap}_{\min}(\beta^*,q) = \inf_{z>0}\mathrm{gap}(z;\beta^*,q).$$

Then there exists $z>0$ such that the $\rho>1$ branch improves the frontier iff

$$\mathrm{gap}_{\min}(\beta^*,q)<0 \iff \exists\, z>0:\ \frac{z+2q}{2(1+z)} < \frac{1}{1+\beta^* z}.$$

Clearing denominators turns the right-hand inequality into the quadratic condition $\beta^* z^2+(2q\beta^*-1)z+2(q-1)<0$. Recalling $q\ge1$, a solution $z>0$ exists exactly when

$$\beta^* < \beta_{\mathrm{crit}}^*(q) = \frac{1}{6q-4+4\sqrt{(q-1)(2q-1)}}.$$

In particular,

$$\beta_{\mathrm{crit}}^*(1)=\frac12,\qquad \beta_{\mathrm{crit}}^*(2) = \frac{1}{8+4\sqrt3} \approx 0.067.$$

Thus, even for $q=1$, the $\rho>1$ side can only improve the frontier when rollout compute is less than half of trainer compute at the balanced point. For $q=2$, rollout compute must be less than about $6.7\%$ of trainer compute.
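The threshold can be checked numerically with a short sketch that evaluates the gap on a grid of $z$ values just below and just above $\beta_{\mathrm{crit}}^*(q)$. The function names, the grid range, and the 10% offsets are illustrative assumptions; a grid search only approximates the infimum.

```python
import math


def gap(z, beta_star, q):
    """IQS(rho_+) - IQS(rho_-) with rho_+ = 1 + z, rho_- = 1 / (1 + beta* z)."""
    return (z + 2 * q) / (2 * (1 + z)) - 1 / (1 + beta_star * z)


def beta_crit(q):
    """Critical compute ratio below which some z > 0 gives a negative gap."""
    return 1 / (6 * q - 4 + 4 * math.sqrt((q - 1) * (2 * q - 1)))


def min_gap_on_grid(beta_star, q, z_max=50.0, n=5000):
    """Smallest gap over z = z_max/n, 2*z_max/n, ..., z_max."""
    return min(gap(k * z_max / n, beta_star, q) for k in range(1, n + 1))


for q in (1, 2):
    crit = beta_crit(q)
    below = min_gap_on_grid(0.9 * crit, q)  # expected to dip below zero
    above = min_gap_on_grid(1.1 * crit, q)  # expected to stay positive
    print(q, round(crit, 3), below < 0, above > 0)
```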