Continued Training with Entropy-Preserving RL

At Applied Compute, we build custom models adapted to the specific domains our customers care about. Those needs often span multiple datasets and task types rather than a single objective, and they continue to evolve as new data arrives. This is often a type of continual learning: adapting a model to new tasks through incremental training without forgetting what it has already learned. Running RL on one task and then the next risks catastrophic forgetting, where gains on a new task erode capabilities the model previously acquired.

Much of this difficulty is invisible on the headline eval. Most modern LLM RL recipes forgo KL regularization and explicit entropy bonuses, which turns a policy's entropy from a first-class lever into an uncontrolled side effect of training. Left alone, entropy tends to collapse over training; the policy concentrates onto a narrow band of behaviors, and a growing line of work ties this decline to diminished exploration and a ceiling on downstream performance (Cui et al., 2025). The checkpoint that comes out of a "successful" run can be a worse starting point for the next task than the model we began with.

Recent literature treats entropy as something to monitor and actively control throughout training, rather than accepting whatever entropy trajectory happens to fall out of a given recipe (Yu et al., 2025; Mistral AI, 2025; Microsoft AI, 2026; Petrenko et al., 2026). Done well, explicit entropy control can be a powerful mechanism for preserving a model’s ability to learn new skills while staying trainable. We walk through why policy-gradient RL collapses entropy, why that collapse is what makes continued training hard, the landscape of methods for preventing it, and a continued-training experiment where an entropy-preserving method keeps a model improving across both phases while GRPO stalls and degrades.

Throughout, we use entropy to mean the per-token Shannon entropy of a model's output distribution, and diversity to mean the semantic difference between the rollouts a model actually samples.
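As a concrete reference for the first definition, here is a minimal sketch in plain Python of per-token Shannon entropy computed from raw logits and averaged over one sequence. The function names and toy logits are illustrative assumptions; natural log (nats) and a simple mean over positions are also assumed, since neither choice is fixed by the definition itself.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) at one token position."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_token_entropy(logits_per_position):
    """Average per-token entropy over the positions of one sampled sequence."""
    ents = [token_entropy(row) for row in logits_per_position]
    return sum(ents) / len(ents)

# Toy vocabulary of 4 tokens: a flat distribution vs. a sharply peaked one.
flat = [0.0, 0.0, 0.0, 0.0]
peaked = [10.0, 0.0, 0.0, 0.0]
print(token_entropy(flat))    # equals log(4), about 1.386
print(token_entropy(peaked))  # close to 0
print(mean_token_entropy([flat, peaked]))
```

A collapsed policy looks like the `peaked` row at most positions, which is why the sequence-level mean is a convenient single number to track over training.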

Policy gradient methods collapse the policy

Reinforcement learning works by exploring a distribution of behaviors and concentrating probability mass on whatever the grader rewards. Modern policy-gradient algorithms for LLMs, such as PPO, GRPO, and DAPO, all do exactly this. In the limit of a perfect grader and an unlimited dataset, that concentration is precisely what we want. In the regimes our customers actually live in (noisy graders, small datasets, narrow task distributions), it has a cost: the policy's output distribution narrows fast, and the model loses the optionality it started with.
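The concentration itself is easy to reproduce in miniature. The sketch below is a toy, not any of the algorithms named above: it assumes a three-armed bandit with a softmax policy and applies the exact expected policy gradient, so there is no sampling noise. The rewards, learning rate, and step count are made up for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def expected_pg_step(logits, rewards, lr):
    """One step of exact gradient ascent on expected reward J = sum_i pi_i * r_i.

    For a softmax policy, dJ/dlogit_i = pi_i * (r_i - J).
    """
    probs = softmax(logits)
    j = sum(p * r for p, r in zip(probs, rewards))
    return [l + lr * p * (r - j) for l, p, r in zip(logits, probs, rewards)]

rewards = [1.0, 0.8, 0.2]  # hypothetical grader scores for three actions
logits = [0.0, 0.0, 0.0]   # start from a uniform policy
history = []
for step in range(200):
    history.append(entropy(softmax(logits)))
    logits = expected_pg_step(logits, rewards, lr=1.0)

# Entropy at the start (log 3) vs. after 200 updates.
print(round(history[0], 3), round(history[-1], 3))
print([round(p, 3) for p in softmax(logits)])
```

Note that the second-best action, which scores 0.8, loses probability along with the worst one. With a noisy grader, that near-miss may be exactly the behavior a later task needs.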

This is not a failure mode that only shows up on toy problems. In larger-scale runs, we observe that training with GRPO drives per-token entropy steadily downward, and the collapse does not buy better performance: an entropy-preserving variant holds its entropy and reaches higher reward on the same task.

Why this makes deployment and continual training difficult

Two related problems follow from this collapse, and both surface in customer work before they show up on a headline eval:

Deployment. The behavior of a collapsed policy often manifests as brittle and repetitive responses. It does well on the narrow slice of inputs it was trained on, degrades off-distribution, and tends to answer in highly templated responses. For many customer applications, customer support and other human-in-the-loop settings especially, we want not only the best answer but some variety in how the model gives it. This is visible directly in the rollouts: as entropy falls during training, the semantic diversity of the model's responses to the same prompt falls with it.
Continual training. When we run RL on dataset A and then on dataset B, we observe empirically that a policy that has already collapsed after A explores more narrowly on B. When the rollouts within a group look alike, a behavior the grader on B would reward often never appears among the samples to reinforce. Consequently, the signal the second run needs is faint, and continued training tends to either stall or collapse (Petrenko et al., 2026).
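To make the "faint signal" point concrete: GRPO-style methods score each rollout relative to the others sampled for the same prompt, so a group whose rollouts all earn the same reward contributes no gradient at all. A minimal sketch, assuming the common mean-and-standard-deviation normalization; the function name and the `eps` constant are illustrative.

```python
import math

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps) within one prompt's group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# A diverse group: one rollout stumbled onto the rewarded behavior.
print(group_advantages([0.0, 0.0, 1.0, 0.0]))

# A collapsed group: every rollout looks alike and scores the same.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # all zeros, so nothing to reinforce
```

The second case is what a collapsed policy tends to produce on a new dataset: the rewarded behavior is never sampled, every advantage is zero, and the step is wasted.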

Per-token entropy is an imperfect but useful proxy for how much exploration the policy has left, and exploration is what both problems come down to: an RL run can only learn from the rollouts it samples, so a policy that keeps producing varied solutions gives the optimizer more to work with, while a collapsed one shows it the same trajectory over and over (Walder & Karkhanis, 2025). A policy with healthy entropy can be nudged toward new behavior by a new reward signal; a collapsed policy mostly can't.

Interventions that encourage exploration

Several families of methods exist for steering entropy during training, each making a different choice about what to control and where in the training loop to intervene. We work through them roughly in the order we ruled them out for our regime of noisy graders and small datasets, which is how we landed on the adaptive controller that the rest of this post builds on.

3a. Explicit entropy bonus. Add $\alpha \cdot H(\pi_\theta;\tau)$ to the reward; the coefficient $\alpha$ controls the strength of that bonus. The intervention is simple, but tuning $\alpha$ is a knife-edge: too small and the task reward dominates, so entropy still decays; too large and the bonus dominates, inflating entropy into incoherent text. The workable band is narrow and tends to drift over training, so a fixed bonus usually postpones collapse rather than preventing it.
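The knife-edge can be seen with two made-up rollouts. This sketch assumes $H(\pi_\theta;\tau)$ is summarized as the mean per-token entropy along the rollout; the function name, the rollout values, and the three $\alpha$ settings are all hypothetical.

```python
def shaped_reward(task_reward, rollout_entropy, alpha):
    """Task reward plus an entropy bonus weighted by alpha."""
    return task_reward + alpha * rollout_entropy

# Hypothetical rollouts as (task reward, mean per-token entropy in nats).
focused_correct = (1.0, 0.2)
rambling_wrong = (0.0, 2.5)

for alpha in (0.01, 0.1, 1.0):
    good = shaped_reward(*focused_correct, alpha)
    bad = shaped_reward(*rambling_wrong, alpha)
    verdict = "bonus dominates" if bad > good else "task reward dominates"
    print(alpha, round(good, 3), round(bad, 3), verdict)
```

At the smallest setting the bonus barely registers against the task reward; at the largest, the incoherent rollout outscores the correct one. The usable range sits somewhere in between and depends on the entropy scale, which itself moves during training.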

3b. KL regularization to a reference policy. This keeps the policy close to the base model in KL. It preserves something like the base distribution but conflates "stay diverse" with "stay similar to the reference," which is not what we want when the reference itself is suboptimal on the task.

3c. Covariance / sharpening-aware methods (Clip-Cov, KL-Cov). These build on the result from Cui et al. (2025) that the change in entropy is driven by the covariance between a token's probability and its advantage, and target that covariance directly: find the tokens contributing most to it on a given step and either clip their updates (Clip-Cov) or KL-penalize them (KL-Cov). They act on the updates that cause sharpening rather than on entropy directly. In practice, this requires per-token bookkeeping at update time and tends to be sensitive to where you set the covariance cutoff.
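The per-token bookkeeping can be sketched as follows. This is a simplified illustration in the spirit of these methods, not the published Clip-Cov or KL-Cov procedure: it assumes log-probabilities stand in for "probability", and the top-fraction selection rule, function names, and toy numbers are all assumptions.

```python
def token_cov_contributions(logprobs, advantages):
    """Per-token centered cross-products; their mean is the batch covariance."""
    n = len(logprobs)
    mean_lp = sum(logprobs) / n
    mean_adv = sum(advantages) / n
    return [(lp - mean_lp) * (a - mean_adv) for lp, a in zip(logprobs, advantages)]

def high_cov_mask(logprobs, advantages, top_fraction=0.2):
    """Flag the tokens with the largest contributions (ties may flag extra tokens)."""
    contrib = token_cov_contributions(logprobs, advantages)
    k = max(1, int(len(contrib) * top_fraction))
    cutoff = sorted(contrib, reverse=True)[k - 1]
    return [c >= cutoff for c in contrib]

# Toy batch of five tokens.
logprobs = [-0.1, -2.3, -0.05, -1.2, -3.0]
advantages = [1.0, -0.5, 1.2, 0.1, -1.0]

contrib = token_cov_contributions(logprobs, advantages)
print(sum(contrib) / len(contrib))          # batch covariance, positive here
print(high_cov_mask(logprobs, advantages))  # tokens to clip or KL-penalize
```

The `top_fraction` argument plays the role of the covariance cutoff mentioned above: move it slightly and a different set of tokens has its update suppressed, which is where the sensitivity comes from.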

3d. Text-level diversity signals. Rather than act on token entropy at all, these score the rollouts themselves for diversity: by n-gram overlap (Jaccard), by distance between answer embeddings, or by an LLM that clusters them by strategy (Orney et al., 2026). The appeal is that they target semantic diversity directly rather than the token-level proxy. In our experiments the signals are noisiest exactly where they would need to be sharp: embeddings struggle to separate near-duplicate rollouts within a single prompt's group, and LLM judges are costly to run inside the training loop. That makes them valuable as offline diagnostics, and harder to rely on as an online training signal.
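The cheapest of these signals, n-gram Jaccard overlap, fits in a few lines. The sketch below is one plausible formulation rather than the exact metric used in any cited work: whitespace tokenization, trigrams, and averaging the Jaccard distance over all pairs in a group are assumptions.

```python
from itertools import combinations

def ngrams(text, n=3):
    """Set of word n-grams from a whitespace-tokenized, lowercased string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def group_diversity(rollouts, n=3):
    """Mean pairwise Jaccard distance across one prompt's rollouts."""
    sets = [ngrams(r, n) for r in rollouts]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

templated = ["your refund was issued to your card today"] * 3
varied = [
    "your refund was issued to your card today",
    "we have sent the money back to the original payment method",
    "the charge has been reversed and should appear within a week",
]
print(group_diversity(templated))  # 0.0, every rollout is identical
print(group_diversity(varied))
```

A score like this only sees surface overlap, so two paraphrases of the same answer count as diverse; that blind spot is part of why the post treats text-level signals as diagnostics.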

3e. Adaptive controllers on the policy gradient update. Instead of changing the reward, they change the update. ADAPO modulates DAPO's $\epsilon_{\text{high}}$, its upper clipping bound, in response to entropy drift; REPO-R instead rescales the advantage by a coefficient $\zeta$ that doubles or halves depending on whether entropy sits above or below its target. These act inside the gradient step, so they don't require trading off against task reward, which makes this family a natural fit when the task reward is already noisy.

One advantage of adaptive controllers is that they expose target entropy itself as a hyperparameter. By default we capture it from the policy's own entropy at the start of training, but nothing forces that choice. It can be set deliberately, including above the model's natural level, to hold the policy more exploratory than where it would otherwise settle.
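The doubling-and-halving rule described for REPO-R can be sketched as a small controller. This shows only the feedback loop, under stated assumptions: the direction (the coefficient grows when entropy is below target), the starting value, and the clamping bounds are guesses, and the way $\zeta$ enters the advantage is specific to the method and not reproduced here. The class and attribute names are hypothetical.

```python
class EntropyController:
    """Multiplicative controller for a coefficient zeta, driven by measured entropy."""

    def __init__(self, target_entropy, zeta=1.0, zeta_min=1e-3, zeta_max=1e3):
        self.target_entropy = target_entropy
        self.zeta = zeta
        self.zeta_min = zeta_min
        self.zeta_max = zeta_max

    def update(self, measured_entropy):
        # Assumed direction: strengthen the correction when entropy is too low.
        if measured_entropy < self.target_entropy:
            self.zeta = min(self.zeta * 2.0, self.zeta_max)
        else:
            self.zeta = max(self.zeta / 2.0, self.zeta_min)
        return self.zeta

# Target set deliberately above a policy that starts near 0.27 nats.
controller = EntropyController(target_entropy=0.32)
trace = [controller.update(h) for h in (0.27, 0.29, 0.33)]
print(trace)  # [2.0, 4.0, 2.0]
```

Passing the first measured entropy as `target_entropy` gives the default behavior described above; passing a larger value is the deliberate "hold it higher" setting.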

Continual training: tau2-bench → IFBench

To see whether this holds up in a real continued-training setting, we ran a two-phase experiment on Qwen3.6-35B-A3B: first RL on tau2-bench (airline and retail), then, starting from that checkpoint, RL on IFBench. We ran the sequence with GRPO and with REPO-R, tracking entropy, retention on the phase-1 task (tau2), and performance on the phase-2 task (IFBench). In phase 2 we added a third run as an ablation: REPO-R forked from the same tau2 checkpoint, but with its entropy target raised ~0.05 above the level the model settles at on its own (H_init = 0.32 vs ~0.27), to test whether the level of preserved entropy is a useful lever.

Under GRPO, per-token entropy falls from ~0.24 to ~0.05 over phase 1, briefly recovers when IFBench training begins, and collapses again to ~0.05 by the end of phase 2. REPO-R holds entropy around 0.26 across both phases, and the raised-target ablation effectively holds it higher at a mean of ~0.30 through phase 2.

Figure: Eval reward across both phases for three lineages: GRPO, REPO-R, and REPO-R with a raised entropy target (H_init = 0.32, phase 2 only). Solid lines show tau2 (airline + retail mean, left axis); dashed lines show IFBench-test (right axis). The phase boundary is at step 100.

Active entropy control has measurable effects on both downstream task performance and solution diversity:

Phase 1 shows parity on tau2-bench performance. Both lineages climb to a comparable tau2 reward by the end of phase 1, so preserving entropy cost nothing on the first task.
Adaptive entropy control makes continued training stable, while GRPO collapses. On IFBench, the collapsed GRPO policy initially climbs and then degrades over the rest of training, ending near 0.61. With adaptive entropy control, the model continues to improve on IFBench throughout the second training phase. By the end of training, both entropy-preserving runs are stable and still climbing, while the GRPO run has degraded.
Raising the entropy target raised performance. The run with the target set above baseline led default REPO-R on IFBench at every eval from step 189 onward, with the gap widening over training, while retaining tau2 at parity. Dialing entropy up bought a better, still-improving policy at no cost to the old task.

In addition to affecting training stability, entropy collapse correlates with reduced response diversity.

Treating the text-diversity measures from §3d as offline diagnostics, we score intra-prompt diversity at each step: given one prompt, how different are the sampled answers? We observe that under GRPO the answers degenerate into more similar and repetitive responses. This is reflected in multiple diversity metrics: the repetition fraction, the mean pairwise cosine similarity of the response embeddings, the Vendi score (the effective number of distinct answers), and the distance from each response embedding to the group centroid. By contrast, holding entropy at higher values with the adaptive controller led to a policy with more varied answers to the same input.
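Two of the embedding-based metrics are simple enough to sketch in plain Python. The embeddings below are toy 2-D vectors invented for illustration, and the function names are hypothetical; the Vendi score is left out because it needs an eigendecomposition of the similarity matrix, and the repetition fraction is left out because its exact definition is not given here.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all pairs; higher means more alike."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def mean_centroid_distance(embeddings):
    """Average Euclidean distance to the group centroid; lower means tighter."""
    dim = len(embeddings[0])
    centroid = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return sum(math.dist(e, centroid) for e in embeddings) / len(embeddings)

# Toy response embeddings for one prompt's group of three rollouts.
collapsed = [[1.0, 0.0], [0.99, 0.05], [1.0, 0.02]]
varied = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(mean_pairwise_cosine(collapsed), mean_centroid_distance(collapsed))
print(mean_pairwise_cosine(varied), mean_centroid_distance(varied))
```

The two numbers move in opposite directions as a group collapses: pairwise similarity rises toward 1 while centroid distance shrinks toward 0, which makes them a useful cross-check on each other.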

Why this matters

At the frontier of enterprise post-training, customers rarely hand us clean datasets or a single stable objective. If entropy collapses, you lose the two things that matter most in practice: responses that stay flexible in deployment, and checkpoints that remain a good starting point for the next round of training.

Our results suggest that maintaining sufficient policy entropy is a prerequisite for stable continued training: once entropy collapses on the first task, subsequent RL on a new objective lacks the exploration needed to find and reinforce the behaviors the new grader rewards. Adaptive controllers like REPO-R turn entropy into a first-class knob: an explicit target you can hold steady, or deliberately raise, to preserve learnability across tasks and tune the model’s deployed behavior to the diversity and flexibility required for the application or task.