At Applied Compute, we are pushing the frontier of large-scale reinforcement learning. Prior work has primarily focused on better systems performance to keep the GPUs running at full utilization. In particular, fully asynchronous RL achieves this by disaggregating training and rollout generation so that neither has to wait for the other. But this design introduces a hidden cost: because the rollout engines are never idle, they often generate samples using a policy that is several training steps behind. We call this lag staleness — and as it grows, the model trains on increasingly off-policy data, which can degrade downstream results.
How we control staleness is not well understood, and worse yet, not independent of performance. In previous work, we explored how to allocate train and rollout engines from a fixed compute pool to balance staleness and performance. In this post, we go further: we derive closed-form formulas for mean staleness, which reveal that staleness depends on a handful of system and workload parameters: utilization, batch size, queue capacity, rollout concurrency, and the tailness of the response length distribution. We also show that there is an inherent trade-off between staleness and performance. We validate these predictions against both a discrete-event simulator and real training runs, then give practical recommendations for navigating the trade-off.
Approximating staleness in async RL
Let’s start by introducing the cast of an async RL stack:
- The rollout engines continuously produce rollouts. For language models, this occurs via a high-performance inference engine.
- A queue sits between the rollout engines and the trainer. It stores a group once all rollouts for a task have completed. The queue controls which groups are trained on via the queue algorithm.
- The trainer accepts groups (or waits until there are enough groups in the queue to fill a full batch) and trains on them to generate a new policy. In this blog, we assume that our RL algorithm is GRPO.
- This new policy is then synced back to the rollout engines in-flight (while the engines are still running).
Note that because the rollout engines are never idle, the trainer often receives tokens generated by an old policy version. We define the staleness of a rollout as the number of policy versions between the first token generated and when the rollout is trained on. As staleness grows, the model is trained on more off-policy samples, which can affect the training result significantly.
We aim to understand the factors that influence staleness. The total staleness can be divided into two parts, which we analyze separately: pre-queue and in-queue staleness.
The key variables to keep track of in the derivation are:
| Variable | Symbol | Units |
|---|---|---|
| Response length distribution | $L$ | tokens |
| Rollout throughput | $R_{\text{roll}}$ | tokens per second |
| Sampling concurrency | $C$ | slots |
| Train throughput | $R_{\text{train}}$ | tokens per second |
| Groups per train batch | $G$ | groups |
| Samples per group | $n$ | rollouts |
| Max queue capacity | $Q$ | rollouts |
These quantities allow us to derive the following:
| Variable | Symbol | Units |
|---|---|---|
| Trainer utilization | $\rho = R_{\text{roll}} / R_{\text{train}}$ | - |
| Batch size | $B = G n$ | rollouts |
| Decode speed (per request) | $d = R_{\text{roll}} / C$ | tokens per second |
| Mean response length | $\mu$ (where $\mu = \mathbb{E}[L]$) | tokens |
| Group tailness multiplier | $M_{\text{tail}} = \mathbb{E}[\max_{1 \le i \le n} L_i] / \mu$ | - |
| Queue factor | $q = Q / B$ | - |
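As a small illustration of how the derived quantities follow from the primary ones, here is a sketch in Python. The class name `AsyncRLConfig`, its field names, and the example numbers are hypothetical. The relationships it encodes (utilization as rollout throughput over train throughput, batch size as groups times samples per group, decode speed as rollout throughput over concurrency, queue factor as capacity over batch size) are assumptions inferred from the variable descriptions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsyncRLConfig:
    """Hypothetical container for the primary variables of the derivation."""
    rollout_throughput: float  # tokens per second across all rollout engines
    train_throughput: float    # tokens per second of the trainer
    concurrency: int           # sampling slots (C)
    groups_per_batch: int      # groups per train batch
    samples_per_group: int     # rollouts per group
    queue_capacity: int        # max queue capacity, in rollouts

    @property
    def utilization(self) -> float:
        # Assumed definition: rho < 1 is rollout-bound, rho > 1 is train-bound.
        return self.rollout_throughput / self.train_throughput

    @property
    def batch_size(self) -> int:
        return self.groups_per_batch * self.samples_per_group

    @property
    def decode_speed(self) -> float:
        # Per-request decode speed, assuming throughput is shared evenly by slots.
        return self.rollout_throughput / self.concurrency

    @property
    def queue_factor(self) -> float:
        return self.queue_capacity / self.batch_size


# Illustrative numbers only; they are not taken from any run in this post.
cfg = AsyncRLConfig(60_000.0, 80_000.0, 120, 15, 8, 120)
print(cfg.utilization, cfg.batch_size, cfg.decode_speed, cfg.queue_factor)
```

With these made-up inputs the configuration is rollout-bound, since the utilization is below 1.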
Pre-queue staleness
The pre-queue staleness (PQS) measures how many policy updates happen while a rollout is being generated. Formally, it is the average difference in the policy version between the start and end of a rollout. To calculate the PQS, we need two pieces of information:
- Group generation time: the time it takes to generate a rollout group,
- Effective train period: the time between two policy updates.
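When training with GRPO, rollouts enter the queue when all rollouts for the same prompt have been generated. Assuming that the rollouts in a group start at the same time, the mean group generation time is
$$T_{\text{gen}} = \frac{\mathbb{E}[\max_{1 \le i \le n} L_i]}{d} = \frac{M_{\text{tail}}\,\mu}{d},$$
where $L_1, \dots, L_n$ are the response lengths of the $n$ rollouts in a group, $\mu$ is the mean response length, $d$ is the per-request decode speed, and $M_{\text{tail}} = \mathbb{E}[\max_i L_i] / \mu$.
The training period depends on the number of tokens per batch as well as the slower of the train and rollout throughputs because either a batch is already available for the trainer to process or the trainer is waiting for the batch to finish generating. The mean number of tokens in a batch is $B\mu$, so the mean effective train period is
$$T = \frac{B\mu}{\min(R_{\text{roll}}, R_{\text{train}})},$$
where $R_{\text{roll}}$ and $R_{\text{train}}$ are the rollout and train throughputs.
Putting these together, the PQS is
$$\text{PQS} = \frac{T_{\text{gen}}}{T} = \frac{M_{\text{tail}} \min(R_{\text{roll}}, R_{\text{train}})}{B\,d} = \frac{M_{\text{tail}}\,C}{B \max(1, \rho)},$$
using $d = R_{\text{roll}} / C$ and $\rho = R_{\text{roll}} / R_{\text{train}}$.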
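To make the tailness multiplier concrete, the sketch below estimates it from a sample of response lengths by resampling groups, then plugs it into a PQS expression. The function names are hypothetical. The PQS form used here, $M_{\text{tail}} C / (B \max(1, \rho))$, is an assumption that follows from dividing the group generation time by the effective train period with per-request decode speed equal to rollout throughput over concurrency.

```python
import random


def tailness_multiplier(lengths, group_size, trials=5_000, seed=0):
    """Estimate E[max of `group_size` lengths] / E[length] by resampling."""
    rng = random.Random(seed)
    mean_len = sum(lengths) / len(lengths)
    total_max = 0.0
    for _ in range(trials):
        # The slowest rollout in a group decides when the whole group is done.
        total_max += max(rng.choice(lengths) for _ in range(group_size))
    return (total_max / trials) / mean_len


def pre_queue_staleness(m_tail, concurrency, batch_size, rho):
    """Assumed PQS form: group generation time over effective train period."""
    return m_tail * concurrency / (batch_size * max(1.0, rho))


# A deterministic length distribution has no stragglers, so the multiplier is 1.
flat = tailness_multiplier([1000] * 10, group_size=8)
# One long response in ten makes most groups wait on a straggler.
heavy = tailness_multiplier([100] * 9 + [1000], group_size=8)
print(flat, heavy, pre_queue_staleness(1.42, 120, 240, 0.63))
```

The multiplier depends only on the shape of the distribution: scaling every length by a constant leaves it unchanged.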
In-queue staleness
The in-queue staleness (IQS) measures how many train steps pass while a rollout is in the queue. Without some control over the queue size, the queue can grow unbounded, leading to unlimited staleness. As such, the queue algorithms used in practice set a limit on how high staleness can grow, either explicitly as in queue-max or implicitly like queue-drop.
Both of these algorithms send the oldest completed groups to the trainer, but queue-max discards samples above a max staleness threshold, while queue-drop drops the oldest groups when the number of groups in the queue exceeds its capacity. We find that queue-max preferentially trains on shorter rollouts, which biases the trained rollout distribution. On the other hand, queue-drop is approximately unbiased with respect to response length because a rollout’s generation time is independent of its length once steady-state is reached (a formal proof is given in the appendix). Therefore, in this blog, we choose to analyze queue-drop.
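The difference between the two queue algorithms can be sketched in a few lines of Python. This is an illustration of the descriptions above rather than the production implementation: the function names are hypothetical, and so is the group representation (a dict with a `version` key holding the policy version at generation start).

```python
from collections import deque


def push_queue_drop(queue, group, capacity):
    """queue-drop: admit the new group, then evict the oldest groups over capacity."""
    queue.append(group)
    while len(queue) > capacity:
        queue.popleft()


def take_batch_queue_max(queue, trainer_version, max_staleness, batch_groups):
    """queue-max: discard groups past the staleness threshold, then take the oldest."""
    fresh = [g for g in queue if trainer_version - g["version"] <= max_staleness]
    queue.clear()
    queue.extend(fresh)
    if len(queue) < batch_groups:
        return None  # not enough groups yet; the trainer has to wait
    return [queue.popleft() for _ in range(batch_groups)]


# queue-drop with capacity 3: the oldest version-0 group is evicted on overflow.
q_drop = deque()
for v in [0, 0, 1, 2]:
    push_queue_drop(q_drop, {"version": v}, capacity=3)
print([g["version"] for g in q_drop])

# queue-max with threshold 1 at trainer version 2: version-0 groups are discarded.
q_max = deque({"version": v} for v in [0, 0, 1, 2])
batch = take_batch_queue_max(q_max, trainer_version=2, max_staleness=1, batch_groups=2)
print([g["version"] for g in batch])
```

Note that queue-drop evicts purely by position, while queue-max evicts by version age, which is what ties eviction to generation time and hence to response length.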
To derive the staleness of queue-drop, we make the following assumptions:
- Rollout groups arrive in the queue at a constant rate.
- The trainer processes a batch of groups from the queue at a constant rate.
- Syncing weights from the trainer to the rollout engines takes negligible time (e.g. via RDMA).
With these assumptions, we can calculate the IQS. We split the derivation into two cases: rollout-bound ($\rho < 1$) and train-bound ($\rho > 1$).
Case 1: rollout-bound ($\rho < 1$)
Let the current train step be $k$. Since $\rho < 1$, the trainer is faster than the rollout engines, so the trainer fully drains the queue at the start of the forward pass. Thus, at the start of the train step, the queue is empty. At the end, $\rho B$ rollouts have arrived in the queue, all admitted during generation version $k$. The trainer must wait for $(1 - \rho) B$ more rollouts to arrive, all of which will enter the queue at version $k + 1$. Thus, during step $k + 1$, the mean policy version of queue samples is
$$\rho k + (1 - \rho)(k + 1) = k + 1 - \rho.$$
Since this batch is trained on after the policy has been updated to version $k + 1$, we derive a mean IQS of
$$(k + 1) - (k + 1 - \rho) = \rho.$$
Case 2: train-bound ($\rho > 1$)
When $\rho > 1$, the rollout side is faster than the trainer, so in steady state, our queue reaches the max capacity of $qG$ rollout groups, where $G$ is the number of groups per train batch. To find the mean IQS of a train batch, we first find the average age of a rollout in the batch, then divide it by the train period $T$ to obtain a continuous estimate of staleness. Finally, we use a smooth-arrival approximation to correct our estimate to the discrete case, since staleness increases in integer increments.
First, since the trainer takes $G$ groups per step and rollout produces $\rho$ times as fast as the trainer consumes, the rollout engine completes $\rho G$ groups per train step. Thus, under the constant group arrival rate assumption, the average time between completed group arrivals is
$$\Delta = \frac{T}{\rho G}.$$
Order the queue from newest to oldest with ranks 1 to $qG$. The trainer consumes at a roughly uniform time between two completed-group arrivals. As such, the newest queued group, rank-1, waits about half a group-arrival interval on average. A rank-$r$ group arrives $r - 1$ group arrivals before the newest group, so its mean queue age is
$$a_r = \left(r - \tfrac{1}{2}\right) \Delta.$$
Thus, the mean queue age of the consumed batch (the $G$ oldest groups, with ranks $(q - 1)G + 1$ through $qG$) is
$$\bar{a} = \frac{1}{G} \sum_{r = (q - 1)G + 1}^{qG} \left(r - \tfrac{1}{2}\right) \Delta = \left(q - \tfrac{1}{2}\right) G \Delta.$$
Dividing by $T$, we have
$$\frac{\bar{a}}{T} = \frac{q - 1/2}{\rho},$$
since $G \Delta = T / \rho$.
As mentioned above, this is the continuous queue age measured in train-step units. However, staleness ultimately measures a discrete quantity: the number of policy updates between when a rollout begins generating and when it is trained on. To bridge this gap, we introduce the smooth-arrival approximation, which states that a rollout’s expected staleness at train time is its total age divided by the train period, plus $1/2$, when the spread of rollout start times induced by the response length distribution has a standard deviation on the order of the train period $T$. See the appendix for a more detailed derivation.
The smooth-arrival approximation allows us to write the IQS of a discrete queue as
$$\text{IQS}_{\rho > 1} = \frac{q - 1/2}{\rho} + \frac{1}{2}.$$
Combining the two cases, the queue-drop in-queue staleness is
$$\text{IQS} = \begin{cases} \rho, & \rho \le 1 \\ \dfrac{q - 1/2}{\rho} + \dfrac{1}{2}, & \rho > 1. \end{cases}$$
Total staleness
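Putting it all together, we find that the mean total staleness using the queue-drop algorithm is
$$\text{Staleness} = \text{PQS} + \text{IQS} = \frac{M_{\text{tail}}\,C}{B \max(1, \rho)} + \begin{cases} \rho, & \rho \le 1 \\ \dfrac{q - 1/2}{\rho} + \dfrac{1}{2}, & \rho > 1, \end{cases}$$
where $M_{\text{tail}}$ is the group tailness multiplier, $C$ the rollout concurrency, $B$ the batch size, $\rho$ the utilization, and $q$ the queue factor.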
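For readers who want to plug in their own numbers, here is a minimal Python sketch of the prediction. It assumes the pre-queue term is $M_{\text{tail}} C / (B \max(1, \rho))$ and the in-queue term is $\rho$ when rollout-bound and $(q - 1/2)/\rho + 1/2$ when train-bound. The function name `predicted_staleness` is hypothetical.

```python
def predicted_staleness(C, B, q, rho, m_tail):
    """Return (PQS, IQS, total) under the assumed queue-drop staleness formula."""
    # Pre-queue: policy updates that land while a group is still generating.
    pqs = m_tail * C / (B * max(1.0, rho))
    # In-queue: rollout-bound queues drain every step; train-bound queues stay full.
    if rho <= 1.0:
        iqs = rho
    else:
        iqs = (q - 0.5) / rho + 0.5
    return pqs, iqs, pqs + iqs


# Inputs from the first run in the validation table of this section.
pqs, iqs, total = predicted_staleness(C=120, B=240, q=2, rho=0.63, m_tail=1.42)
print(round(pqs, 2), round(iqs, 2), round(total, 2))

# A train-bound example: only here does the queue factor q matter.
_, iqs_train_bound, _ = predicted_staleness(C=128, B=128, q=1, rho=1.14, m_tail=1.45)
print(round(iqs_train_bound, 2))
```

With the first run's inputs the sketch gives a total of 1.34, the predicted value reported for that run. Separating the two terms also shows which component dominates for a given configuration.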
To validate our staleness formula efficiently, we built a discrete-event async RL simulator. The simulator mirrors the parts of the real system that determine staleness: rollout generation, buffering, trainer consumption, and weight updates. We give a full description of our simulator in the appendix.
Comparing our formula against the simulator, we see that our predictions are accurate:
We also evaluate our formula against real training runs. All runs train Qwen3-8B with GRPO on the DAPO math reasoning task. For each run, we extract $C$, $B$, $q$, $\rho$, and $M_{\text{tail}}$ from logged metrics, then compare the predicted staleness against its empirical mean:
| Concurrency | Batch size | Queue factor | Utilization | Tailness multiplier | Predicted staleness | Measured staleness | Error |
|---|---|---|---|---|---|---|---|
| 120 | 240 | 2 | 0.63 | 1.42 | 1.34 | 1.26 | +0.08 |
| 240 | 120 | 2 | 0.92 | 1.43 | 3.86 | 3.59 | +0.27 |
| 128 | 128 | 2 | 1.07 | 1.44 | 3.29 | 3.09 | +0.20 |
| 240 | 120 | 1 | 0.86 | 1.42 | 3.34 | 3.40 | −0.06 |
| 120 | 120 | 1 | 0.67 | 1.42 | 1.95 | 1.92 | +0.03 |
| 128 | 128 | 1 | 1.14 | 1.45 | 2.24 | 2.01 | +0.24 |
Moving along the staleness frontier
Our staleness formula suggests that the following factors can meaningfully affect staleness:
- Response length tailness
- Compute allocation
- Batch size
- Queue capacity
- Sampling concurrency
Below, we present methods of mitigating staleness, the intuition behind them, and their trade-offs.
Monitor the response length tailness
A key finding is that staleness is independent of the mean response length $\mu$. No term in our formula depends on this quantity. While response length does affect generation time in PQS, generation time and effective train period both increase linearly with the response length, so the two cancel. In contrast, the important statistic from the response length distribution is the tailness $M_{\text{tail}}$, to which PQS is directly proportional.
We validate this effect in real runs. We select four 64-prompt subsets that separately vary average response length and group tailness. We find that for roughly fixed mean response lengths, increasing the tailness multiplier leads to higher staleness in both the short-response and long-response bands. Conversely, increasing mean response length while keeping tailness similar did not increase staleness.
| Prompt subset type | Mean response length (tokens) | Tailness multiplier $M_{\text{tail}}$ | Mean staleness |
|---|---|---|---|
| Short mean response length, low grouped tailness | 7.31k | 1.35 | 1.98 |
| Short mean response length, high grouped tailness | 6.74k | 1.68 | 2.37 |
| Long mean response length, low grouped tailness | 11.29k | 1.29 | 1.85 |
| Long mean response length, high grouped tailness | 10.38k | 1.68 | 2.22 |
In our production runs, we note that the response length distribution over time is a leading indicator of staleness changes. In high-tailed settings, a concrete lever here is tuning the response length cap. However, the max response length is often task dependent, so varying the cap can have large consequences on final results.
Change compute allocation
Per our formula, varying $\rho$ allows us to adjust staleness. We can control this quantity with our compute allocation. Specifically, if our goal is to reduce staleness, then:
- When we are rollout-bound ($\rho < 1$), staleness scales linearly with $\rho$, so we should allocate less compute to rollouts.
- When we are train-bound ($\rho > 1$), staleness scales inversely with $\rho$, so we should allocate less compute to training.
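Clearly, adjusting $\rho$ away from $1$ has undesirable effects on performance, which suggests that there is a trade-off between staleness and performance. We can make this trade-off explicit: assuming constant token throughput per rollout GPU ($a_r$) and per train GPU ($a_t$), we can derive the effective train period (our proxy for performance) for different compute splits of a fixed number of GPUs $N$:
$$T(\rho) = \frac{B\mu}{a_t N} \cdot \begin{cases} c + 1/\rho, & \rho \le 1 \\ 1 + c\rho, & \rho \ge 1, \end{cases}$$
where $c = a_t / a_r$ is the ratio of rollout GPUs to train GPUs at the balanced point $\rho = 1$.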
We prove this result in the appendix. Combining this with the staleness formula, we can plot the staleness-performance Pareto frontier:
In practice, we choose a point along this frontier depending on the underlying constraints of the task and the downstream results. Real run results also show a clear staleness-performance trade-off:
The above plots also suggest a simple rule: it is better to be rollout-bound than train-bound. We demonstrate in the appendix that this is true except in rare cases where producing rollouts is extremely cheap compared to training on them.
Increase the batch size
Scaling the batch size reduces staleness mainly through the PQS, since a larger batch size increases the effective train period without changing the decode speed (assuming constant rollout concurrency), resulting in fewer weight updates during the same duration of a single rollout’s generation. However, the trade-off is that each step also takes proportionally longer since we have more work to do in each step.
This does not necessarily mean larger batches are less efficient. Instead, we should evaluate the update period by the amount of useful data in the update. If each sample in a batch contributes equally to the update, the simplest normalized measure is $T / B$, which measures wall-clock time per rollout sample. However, the benefit of larger batches is often sublinear. A more realistic proxy is a noise-adjusted update time such as $T / \sqrt{B}$ (or ideally $T / \sqrt{B_{\text{eff}}}$ when the effective batch size $B_{\text{eff}}$ can be estimated from sample correlation, masking, importance weights, or filtering).
We plot the staleness-performance frontier below using the noise-corrected update time. Different batch sizes trace roughly the same Pareto frontier, but larger batches are able to achieve lower staleness than smaller ones. Conversely, smaller batches can reach lower update times that larger batches cannot. Thus, excluding additional ML effects such as reduced optimizer-update count, sample-efficiency changes, or effective-batch saturation, the choice of batch size is primarily about which region of the frontier we wish to reach.
We observe the batch size effect in real runs. In the table below, we hold concurrency fixed at $C = 120$ for a clean comparison.
| B | ρ | Predicted PQS | Predicted IQS | Predicted staleness | Measured staleness | Step time (s) | Step time / B (s) | Step time / √B (s) |
|---|---|---|---|---|---|---|---|---|
| 60 | 0.62 | 2.76 | 0.62 | 3.38 | 3.03 | 72.1 | 1.20 | 9.3 |
| 120 | 0.67 | 1.28 | 0.67 | 1.95 | 1.92 | 137.1 | 1.14 | 12.5 |
| 240 | 0.63 | 0.72 | 0.63 | 1.35 | 1.26 | 270.4 | 1.13 | 17.5 |
Doubling the batch size approximately halved PQS, consistent with the $1/B$ scaling in the PQS formula. On the performance side, the effective train period nearly doubled, but the per-rollout effective train period is virtually identical. Under the more realistic noise-adjusted measure $T / \sqrt{B}$, going from $B = 120$ to $B = 240$ is about 40% slower, indicating a genuine trade-off between staleness and performance. This is consistent with the theoretical frontier above: larger batches dominate the low-staleness regime, while smaller batches are preferable when we want faster steps.
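The normalized columns in the table are plain arithmetic on the measured step times, which the short Python sketch below reproduces. The variable and function names are ours; the batch sizes and step times are copied from the table above.

```python
import math

# (batch size B, measured step time in seconds), copied from the table above.
runs = [(60, 72.1), (120, 137.1), (240, 270.4)]


def per_rollout_time(step_time, batch_size):
    """Wall-clock seconds per rollout sample: step time / B."""
    return step_time / batch_size


def noise_adjusted_time(step_time, batch_size):
    """Noise-adjusted update time: step time / sqrt(B)."""
    return step_time / math.sqrt(batch_size)


for B, step in runs:
    print(B, f"{per_rollout_time(step, B):.3f}", f"{noise_adjusted_time(step, B):.2f}")

# Relative slowdown of B = 240 versus B = 120 under the noise-adjusted measure.
slowdown = noise_adjusted_time(270.4, 240) / noise_adjusted_time(137.1, 120) - 1.0
print(f"{slowdown:.1%}")
```

The per-rollout time barely moves across batch sizes, while the noise-adjusted time grows by roughly 40% between the two largest batches.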
Shrink the queue capacity
Queue capacity only enters the IQS formula in the train-bound regime ($\rho > 1$):
$$\text{IQS} = \frac{q - 1/2}{\rho} + \frac{1}{2}.$$
In this regime, the rollout engines produce groups faster than the trainer can consume them, so the queue fills to max capacity. The trainer always consumes the oldest groups first — which are also the stalest — so a larger queue means more stale groups accumulate before reaching the trainer.
In the rollout-bound regime ($\rho < 1$), the formula is independent of $q$ because the trainer drains the queue every step regardless of its size. These effects are shown below:
Based on the formula, we should always minimize $q$. In practice, however, we find the optimal choice to be more nuanced. Intuitively, extra queue capacity should help performance: when the rate of rollouts arriving to the queue occasionally dips below the mean, extra queue capacity ensures that the queue always has sufficient data for the trainer to fetch. This effect will be significant when $\rho \approx 1$ and there is high arrival variability, which our formula assumes to be zero.
For simplicity, we use our staleness simulation to show this effect. Near balance ($\rho \approx 1$), using $q = 1$ decreases staleness compared to larger queue factors but also slightly increases the effective train period.
Regardless, our recommendation is to set $q = 1$. A larger queue is only beneficial when the system is near balance, and the slight performance improvement does not justify the risk of extra staleness if $\rho$ drifts above 1.
We observe the queue capacity effect in real runs as well. The following two runs share $C = 128$, $B = 128$, and the same compute split, but differ in queue factor ($q = 2$ vs. $q = 1$).
| q | ρ | Predicted PQS | Predicted IQS | Predicted staleness | Measured staleness | Step time (s) |
|---|---|---|---|---|---|---|
| 2 | 1.07 | 1.36 | 1.93 | 3.29 | 3.09 | 141.4 |
| 1 | 1.14 | 1.31 | 0.94 | 2.25 | 2.01 | 142.6 |
Both runs are train-bound, so the $q = 1$ configuration achieved 35% lower measured total staleness while maintaining a virtually identical effective train period. The reduction is concentrated almost entirely in the in-queue component, consistent with the formula.
Tune the rollout concurrency
Rollout concurrency enters staleness exclusively through the PQS term. We analyze the effect of concurrency in the linear throughput case, where the rollout throughput $R_{\text{roll}}$ scales linearly with $C$. The effect of concurrency depends on the utilization:
- Train-bound ($\rho > 1$): PQS simplifies to $M_{\text{tail}} R_{\text{train}} / (B d)$, where $R_{\text{train}}$ is the train throughput and $d$ is the per-slot decode speed. Because the right-hand side is independent of $C$ (assuming per-slot decode speed is roughly constant as concurrency changes), reducing concurrency has little effect on staleness in this regime.
- Rollout-bound ($\rho < 1$): PQS is $M_{\text{tail}} C / B$, which scales linearly with $C$. Reducing concurrency directly lowers staleness because fewer slots are actively generating at once, meaning fewer samples are stamped with stale policy versions before they reach the queue.
At the same time, lower concurrency decreases rollout throughput and thus increases the effective train period. We plot the theoretical staleness-performance frontier under different per-GPU concurrencies below (we adjust concurrency per replica so the comparison isolates the concurrency effect from changes in rollout compute allocation). The frontiers largely overlap: different concurrency settings trace the same Pareto curve, but lower concurrency extends it into the low-staleness region at the cost of longer steps. In practice, the choice depends on whether the priority is minimizing staleness or maximizing training speed.
We observe this trade-off directly in real runs. The following two runs use the same batch size ($B = 120$), queue factor ($q = 1$), and number of rollout GPUs, and they differ only in rollout concurrency.
| C | ρ | Predicted PQS | Predicted IQS | Predicted staleness | Measured staleness | Step time (s) |
|---|---|---|---|---|---|---|
| 120 | 0.67 | 1.28 | 0.67 | 1.95 | 1.92 | 137.1 |
| 240 | 0.85 | 2.24 | 0.85 | 3.09 | 3.15 | 106.4 |
Halving the concurrency reduces staleness by 39% at the cost of 29% longer effective train periods. The staleness reduction is driven primarily by PQS, as predicted.
Conclusion
In this post, we developed a closed-form framework for reasoning about staleness in fully async RL. By decomposing staleness into its pre-queue and in-queue components, we derived tractable formulas that depend on a small number of system parameters: response length tailness $M_{\text{tail}}$, utilization $\rho$, batch size $B$, rollout concurrency $C$, and queue factor $q$. Our predictions match both discrete-event simulations and real training runs to within fractions of a step.
The central insight is that staleness and throughput form a Pareto frontier. For a fixed GPU budget, reducing staleness requires moving utilization away from $\rho = 1$, which necessarily costs throughput. Our analysis also reveals that the frontier is asymmetric: it is almost always better to be rollout-bound than train-bound. The practical levers for navigating this frontier (compute allocation, batch size, queue capacity, and concurrency) each offer different trade-off profiles, and our formulas make it possible to reason about these trade-offs quantitatively before committing to a run.
Using this work, practitioners can predict staleness from a proposed configuration before launching a run, diagnose the dominant source of staleness in a live system, and choose targeted interventions with known costs. We reiterate our general rules of thumb:
- Set $q = 1$: a minimal queue avoids unnecessary staleness accumulation with negligible throughput cost.
- Prefer to be rollout-bound: slightly under-provisioning rollout gives a better staleness-performance trade-off than over-provisioning.
- Monitor response length tailness: it can be a leading indicator of staleness changes.
- Trade off staleness against train period via batch size: larger batches access lower-staleness operating points at the cost of slower noise-adjusted updates.
We end by highlighting a few directions for future work:
- Better queue algorithms: can we design algorithms that exploit the response-length distribution to reduce staleness below queue-drop without introducing bias?
- Algorithm bias beyond response length: on that note, what other types of bias might be worth investigating? We showed that queue-drop is unbiased with respect to response length, but other sample properties may be correlated with generation time in subtle ways.
- Staleness vs. task reward: our analysis treats staleness as a proxy for off-policy degradation, but the actual relationship between staleness and downstream results is task-dependent. How can we characterize this relationship and identify when staleness is detrimental?
If any of these directions excite you, we’d love to explore them together at Applied Compute.
Appendix
Response length bias in queue algorithms
Recall that queue-max evicts samples past a max staleness threshold. Since longer samples accrue more policy updates while generating (higher PQS), they are more likely to be dropped, resulting in a shorter response length distribution of rollouts sent to the trainer. In contrast, queue-drop evicts by queue position, which we will show does not bias the response length in steady state.
We begin with the following steady-state approximation: in steady state, the response lengths of completed rollouts are independent from the lengths and generation progress of in-flight rollouts. This has two consequences:
- Distinct groups have independent response lengths
- A group’s response length is independent of the completion times of all other groups.
We now divide the proof into two cases:
Case 1: rollout-bound
Since the trainer is faster than the rollout engines, it must process every sample that the rollout engine produces. Thus, the response length distribution of trained samples is exactly the distribution of response lengths produced.
Case 2: train-bound
In steady state, the queue is always full before a train step, with train batches sampled from the end. Formally, we train on the groups with ranks $(q - 1)G + 1, \dots, qG$, where rank $1$ is the newest group, rank $qG$ is the oldest group, and $G$ is the number of groups per train batch.
Consider some candidate group $g$. This group will be in the queue until it is dropped at some time $t_{\text{drop}}$, when $qG$ groups finish generating after $g$. By the steady-state approximation, the completion times of groups after $g$ are independent of any groups already completed. Therefore, the queue exit time $t_{\text{drop}}$ is independent of group $g$ or any group generated before it.
Now, consider the first time $t_{\text{train}}$ when the trainer trains on group $g$ or a group generated after it. If $t_{\text{drop}} < t_{\text{train}}$, $g$ exits the queue before getting trained on. If $t_{\text{train}} < t_{\text{drop}}$, $g$ gets trained on.
Since any step before time $t_{\text{train}}$ only trains on samples produced before $g$, the contents of the corresponding train batches are independent of group $g$ or any groups produced after it. Therefore, in steady state, $t_{\text{train}}$ is also independent of group $g$ or any groups produced after it. Since $t_{\text{drop}}$ only depends on groups generated after $g$ and $t_{\text{train}}$ only depends on groups generated before $g$, the event $t_{\text{train}} < t_{\text{drop}}$ is independent of $g$, so the probability that $g$ is trained on in steady state is independent of its contents.
The steady-state approximation is not perfect; with $n > 1$ samples per group, some rollouts in a group can finish earlier than others, creating extra available concurrency slots which are then biased towards containing short samples. In practice, the approximation holds as long as we process rollouts from many different groups concurrently.
Empirical results
We verify these response length biases (or lack thereof) empirically. At tailness 50, we see that queue-drop is unbiased, unlike queue-max, whose trained mean response length decreases at stricter max staleness thresholds:
| Queue algorithm | Sampled mean | Trained mean | Sampled max | Trained max |
|---|---|---|---|---|
| queue-drop with Q=1 | 1404 | 1403 | 8080 | 8080 |
| queue-drop with Q=2 | 1405 | 1403 | 8080 | 8080 |
| queue-max with max staleness 1 | 1401 | 1229 | 8080 | 4983 |
| queue-max with max staleness 2 | 1399 | 1259 | 8080 | 6555 |
| queue-max with max staleness 3 | 1403 | 1292 | 8080 | 8080 |
| queue-max with max staleness 4 | 1402 | 1351 | 8080 | 8080 |
At tailness 90, the bias is even more apparent:
| Queue algorithm | Sampled mean | Trained mean | Sampled max | Trained max |
|---|---|---|---|---|
| queue-drop with Q=1 | 1349 | 1344 | 12080 | 12080 |
| queue-drop with Q=2 | 1357 | 1352 | 12080 | 12080 |
| queue-max with max staleness 1 | 1354 | 897 | 12080 | 5430 |
| queue-max with max staleness 2 | 1359 | 962 | 12080 | 6388 |
| queue-max with max staleness 3 | 1355 | 1027 | 12080 | 7979 |
| queue-max with max staleness 4 | 1351 | 1077 | 12080 | 9572 |
Smooth-arrival approximation proof
The smooth-arrival approximation is used to estimate the number of policy updates that occur from the start of a rollout’s generation to when it is trained on, based on its lifetime. This approximation holds when we have a wide response length distribution. Formally, consider a rollout which begins generating at time $t_0$, where $t_0$ is a random variable, and is trained on at time $t_1$, which is fixed and known. Then the smooth-arrival approximation states that the expected number of policy updates between $t_0$ and $t_1$ (including the update at time $t_1$) is
$$\mathbb{E}[\text{updates}] \approx \frac{t_1 - \mathbb{E}[t_0]}{T} + \frac{1}{2},$$
where $T$ is the train period.
The approximation holds well when the standard deviation of $t_0$ is of order $T$.
We now derive this approximation. Since there is a policy update at time $t_1$ and policy updates occur with interval $T$, the number of policy updates is approximately
$$\lceil x \rceil, \qquad x = \frac{t_1 - t_0}{T}.$$
Here and below, $\lceil x \rceil$ is the ceiling of $x$ and $\{x\}$ is the fractional part of $x$. For all non-integer $x$, we have $\lceil x \rceil = x + 1 - \{x\}$. Then,
$$\mathbb{E}[\lceil x \rceil] = \mathbb{E}[x] + 1 - \mathbb{E}[\{x\}].$$
When the distribution of $t_0$ has a large standard deviation of order $T$, the distribution of $\{x\}$ is nearly uniform on $[0, 1)$. Thus, its expected value is $1/2$, giving the desired approximation.
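The approximation is easy to check numerically. The Python sketch below draws rollout start times from a Gaussian whose standard deviation equals the train period and compares the empirical mean of the ceiling count against the smooth-arrival estimate. The function name, the Gaussian choice, and all numbers are illustrative assumptions, not values from our runs.

```python
import math
import random


def mean_update_count(t1, period, start_mean, start_std, samples=100_000, seed=0):
    """Monte Carlo mean of ceil((t1 - t0) / period) with t0 ~ Normal(mean, std)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        t0 = rng.gauss(start_mean, start_std)
        # Updates land at t1, t1 - period, t1 - 2 * period, ...; count those after t0.
        total += math.ceil((t1 - t0) / period)
    return total / samples


period = 100.0        # train period T, in seconds (illustrative)
t1 = 1000.0           # time at which the rollout is trained on
start_mean = 500.0    # mean start time E[t0]
empirical = mean_update_count(t1, period, start_mean, start_std=period)
smooth = (t1 - start_mean) / period + 0.5
print(empirical, smooth)
```

Shrinking `start_std` well below the period makes the fractional part far from uniform, and the two numbers can then differ by up to half an update.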
The smooth arrival approximation works especially well in multi-turn, agentic settings, where rollouts have highly variable lengths. In our simulations and production runs, we find that its predictions also hold well for single-turn rollouts.
Building an efficient staleness simulator
To validate the staleness formula efficiently, we built a discrete-event simulator for fully asynchronous RL with a queue-drop buffer algorithm. The simulator mirrors the parts of the real system that determine staleness: rollout generation, buffering, trainer consumption, and weight-version updates.
We assume that the rollout engines have concurrent inference slots. Each slot generates one sample at a time. A sample draws its response length from a capped lognormal distribution : when the tailness parameter is zero, all responses have deterministic length ; otherwise, we sample and set
then round and cap the length to avoid unrealistically large outliers. The is to help align the response length distribution of real workloads. The rollout engines’ generation time is . A group can only enter the buffer after all samples finish, so the simulator naturally captures the straggler effect from long-tail responses as described before: the group completion time is controlled by the slowest sample in the group.

Each group is stamped with the rollout policy version used when its generation begins. This version stays fixed through generation and buffering. Once a group completes, it enters a bounded queue of capacity . For example, under the queue-drop policy, if the queue is full when a new group arrives, the oldest queued group is dropped and a new group is inserted. The trainer consumes the oldest queued groups whenever a full training batch is available; otherwise, it waits for the queue to refill. Each train step takes seconds and increments the trainer policy version by one.
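For every trained group, the simulator computes staleness exactly as
$$\text{staleness} = v_{\text{train}} - v_{\text{rollout}},$$
where $v_{\text{train}}$ is the trainer policy version when the group is consumed and $v_{\text{rollout}}$ is the rollout policy version stamped on the group when its generation began.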
We then measure the mean staleness over trained samples after the simulation reaches steady state.
Note that the simulator is intentionally event-driven rather than step-based: it advances directly to the next sample completion, trainer completion, or history checkpoint. This makes it fast enough to run large parameter grids while preserving the discrete effects that matter most for staleness, especially group stragglers, bounded-queue drops, trainer waiting, and integer policy-version jumps.
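To give a flavor of the event-driven design, here is a heavily simplified Python sketch. It is not our simulator: it assumes every rollout in a group starts together and occupies its slots until the slowest one finishes, uses an uncapped lognormal length with a fixed mean, and takes a constant train time per step. The function name, parameters, and defaults are all hypothetical.

```python
import heapq
import math
import random
from collections import deque


def simulate_queue_drop(C=64, n=8, G=8, q=1, mean_len=1000.0, sigma=0.5,
                        decode_speed=50.0, train_time=100.0,
                        num_steps=300, warmup=50, seed=0):
    """Mean staleness of trained groups under a simplified queue-drop model."""
    rng = random.Random(seed)
    capacity = q * G            # queue capacity, in groups
    slots = C // n              # groups generated concurrently
    assert slots >= 1 and num_steps > warmup
    version, steps, trainer_busy = 0, 0, False
    queue = deque()             # version stamps of completed groups, oldest first
    events = []                 # heap of (time, kind, stamp); kind 0 = group done
    staleness = []

    def group_time():
        # Lognormal lengths with mean mean_len; the slowest sample gates the group.
        lengths = [mean_len * math.exp(sigma * rng.gauss(0.0, 1.0) - sigma ** 2 / 2)
                   for _ in range(n)]
        return max(lengths) / decode_speed

    for _ in range(slots):
        heapq.heappush(events, (group_time(), 0, version))

    while steps < num_steps:
        now, kind, stamp = heapq.heappop(events)
        if kind == 0:                        # a group finished generating
            if len(queue) == capacity:
                queue.popleft()              # queue-drop: evict the oldest group
            queue.append(stamp)
            # The freed slots start a new group stamped with the current version.
            heapq.heappush(events, (now + group_time(), 0, version))
        else:                                # the trainer finished a step
            version += 1
            steps += 1
            trainer_busy = False
        if not trainer_busy and len(queue) >= G:
            batch = [queue.popleft() for _ in range(G)]
            if steps >= warmup:              # skip the warm-up transient
                staleness.extend(version - s for s in batch)
            trainer_busy = True
            heapq.heappush(events, (now + train_time, 1, -1))
    return sum(staleness) / len(staleness)


s_small_queue = simulate_queue_drop(q=1)
s_large_queue = simulate_queue_drop(q=4)
print(s_small_queue, s_large_queue)
```

With these defaults the system is train-bound, so the larger queue should show noticeably higher mean staleness, in line with the qualitative prediction of the in-queue term.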
Train period under fixed GPU budget
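Recall that the mean number of tokens in a batch is $B\mu$ and that the train period is this quantity divided by the bottleneck throughput:
$$T = \frac{B\mu}{\min(R_{\text{roll}}, R_{\text{train}})}.$$
Let the total GPU budget be $N = N_r + N_t$, where $N_r$ is the number of rollout GPUs and $N_t$ is the number of train GPUs. Now assume that the token throughput per rollout GPU $a_r$ and the token throughput per train GPU $a_t$ are constant. Then $R_{\text{roll}} = a_r N_r$ and $R_{\text{train}} = a_t N_t$ scale linearly with their GPU allocations.
We will re-parametrize the train period in terms of $\rho$ and $c = a_t / a_r$. Since
$$\rho = \frac{R_{\text{roll}}}{R_{\text{train}}} = \frac{a_r N_r}{a_t N_t} = \frac{N_r}{c N_t},$$
we can use $N_r + N_t = N$ to derive
$$N_t = \frac{N}{1 + c\rho}, \qquad N_r = \frac{c \rho N}{1 + c\rho}.$$
Thus,
$$T(\rho) = \frac{B\mu}{a_t N} \cdot \begin{cases} c + 1/\rho, & \rho \le 1 \\ 1 + c\rho, & \rho \ge 1. \end{cases}$$
Consequently, $T(\rho)$ is strictly decreasing on $(0, 1]$, strictly increasing on $[1, \infty)$, and minimized at $\rho = 1$.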
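The shape of the train period under a fixed GPU budget can be explored with a short Python sketch. It assumes the period, in units of $B\mu / (a_t N)$, equals $c + 1/\rho$ when rollout-bound and $1 + c\rho$ when train-bound, where $c$ is the ratio of rollout GPUs to train GPUs at balance. The function names and the value $c = 0.5$ are hypothetical.

```python
def normalized_train_period(rho, c):
    """Train period in units of B * mu / (a_t * N) for utilization rho."""
    if rho <= 1.0:
        return c + 1.0 / rho      # rollout-bound: rollout throughput is the bottleneck
    return 1.0 + c * rho          # train-bound: train throughput is the bottleneck


def matching_rollout_bound(rho_plus, c):
    """Rollout-bound utilization with the same train period as rho_plus > 1."""
    return 1.0 / (1.0 + c * (rho_plus - 1.0))


c = 0.5  # illustrative: half as many rollout GPUs as train GPUs at balance
grid = [0.25 + 0.05 * k for k in range(56)]            # utilizations 0.25 .. 3.0
best_rho = min(grid, key=lambda r: normalized_train_period(r, c))
rho_minus = matching_rollout_bound(2.0, c)
print(best_rho, rho_minus,
      normalized_train_period(rho_minus, c), normalized_train_period(2.0, c))
```

Every period above the minimum is reachable from both sides of balance, which is what makes the comparison between the rollout-bound and train-bound branches meaningful.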
Does rollout-bound always lead to better trade-off?
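We observe above that for any achievable effective train period $T$, a setup can reach that period with two different utilizations (unless $\rho = 1$). One of these utilizations is less than 1, and the other is more than 1. But these setups are not equal: we often achieve lower staleness via the rollout-bound utilization compared to the train-bound utilization. The condition that determines which setup is better depends on two factors:
- The queue factor $q$, and
- The rollout-to-trainer compute ratio $c$ at $\rho = 1$.
Smaller $c$ means rollouts are cheaper relative to training. We will derive this condition below.
Like in our proof of the formula for $T(\rho)$, we will assume that rollout and train throughput scale linearly with rollout and train GPUs, respectively. This tells us that $N_r / (\rho N_t)$ is a constant; in fact, it is exactly $c = a_t / a_r$ (both terms are defined above in our proof).
Recall that
$$T(\rho) \propto \begin{cases} c + 1/\rho, & \rho \le 1 \\ 1 + c\rho, & \rho \ge 1. \end{cases}$$
Define $\rho_+ = 1 + x$ where $x > 0$. Then we can derive the other utilization $\rho_-$ with the same effective train period as follows:
$$c + \frac{1}{\rho_-} = 1 + c(1 + x) \quad \Longrightarrow \quad \rho_- = \frac{1}{1 + cx}.$$
The PQS is constant for a fixed effective train period, so the signed staleness gap is only the difference in IQS:
$$\Delta(x) = \text{IQS}(\rho_+) - \text{IQS}(\rho_-) = \frac{q - 1/2}{1 + x} + \frac{1}{2} - \frac{1}{1 + cx}.$$
Negative gap means the $\rho > 1$ point is better. Define the best possible signed gap as
$$\Delta^* = \inf_{x > 0} \Delta(x).$$
Then there exists $x$ such that the $\rho > 1$ branch improves the frontier iff
$$\Delta^* < 0.$$
Placing $\Delta(x)$ over the common denominator $(1 + x)(1 + cx)$, its sign is that of $(q - 1) + (qc - \tfrac{1}{2})x + \tfrac{c}{2}x^2$. Recalling $q \ge 1$, this condition simplifies to
$$c < \frac{1}{\left(\sqrt{4q - 2} + \sqrt{2q - 2}\right)^2}.$$
In particular,
$$q = 1: \; c < \frac{1}{2}, \qquad q = 2: \; c < \frac{2 - \sqrt{3}}{4} \approx 0.067.$$
Thus, even for $q = 1$, the $\rho > 1$ side can only improve the frontier when rollout compute is less than half of trainer compute at the balanced point. For $q = 2$, rollout compute must be less than about 6.7% of trainer compute.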
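The threshold can be sanity-checked by brute force. The Python sketch below assumes the in-queue staleness is $\rho$ when rollout-bound and $(q - 1/2)/\rho + 1/2$ when train-bound, and that a train-bound utilization $1 + x$ shares its train period with the rollout-bound utilization $1/(1 + cx)$. The closed-form threshold in `critical_ratio` follows from those assumptions. All function names are hypothetical.

```python
import math


def iqs(rho, q):
    """Assumed queue-drop in-queue staleness."""
    return rho if rho <= 1.0 else (q - 0.5) / rho + 0.5


def staleness_gap(x, q, c):
    """IQS at rho = 1 + x minus IQS at the matching rollout-bound utilization."""
    return iqs(1.0 + x, q) - iqs(1.0 / (1.0 + c * x), q)


def critical_ratio(q):
    """Largest rollout-to-trainer ratio below which train-bound can win."""
    return 1.0 / (math.sqrt(4 * q - 2) + math.sqrt(2 * q - 2)) ** 2


def train_bound_can_win(q, c, max_x=50.0, step=0.01):
    """Scan x on a grid and report whether any train-bound point has a negative gap."""
    count = int(max_x / step)
    return any(staleness_gap(step * k, q, c) < 0.0 for k in range(1, count + 1))


print(critical_ratio(1), critical_ratio(2))
print(train_bound_can_win(1, 0.4), train_bound_can_win(1, 0.6))
print(train_bound_can_win(2, 0.05), train_bound_can_win(2, 0.08))
```

The grid scan agrees with the closed form on either side of each threshold, which is a useful check that the algebra and the two-branch matching are consistent.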
