Continued Training with Entropy Preserving RL

How entropy collapse makes continued RL training hard, a survey of interventions for preventing it, and a two-phase experiment showing that an entropy-preserving controller (REPO-R) stays trainable and improves across tasks where GRPO stalls and degrades.

Jun 16, 2026 · Nic Becker, Jacob Philips, Rhythm Garg

At Applied Compute, we build custom models adapted to the specific domains our customers care about. Those needs often span multiple datasets and task types rather than a single objective, and they continue to evolve as new data arrives. This is often a type of continual learning: adapting a model to new tasks through incremental training without forgetting what it has already learned. Running RL on one task and then the next risks catastrophic forgetting, where gains on a new task erode capabilities the model previously acquired.

Much of this difficulty is invisible on the headline eval. Most modern LLM RL recipes forgo KL regularization and explicit entropy bonuses, which turns a policy's entropy from a first-class lever into an uncontrolled side-effect of training. Left alone, entropy tends to collapse over training; the policy concentrates onto a narrow band of behaviors, and a growing line of work ties this decline to diminished exploration and a ceiling on downstream performance (the entropy mechanism of RL). The checkpoint that comes out of a "successful" run can be a worse starting point for the next task than the model we began with.

Much of the recent literature treats entropy as something to monitor and actively control throughout training, rather than accepting whatever entropy trajectory happens to fall out of a given recipe (DAPO; Magistral; Microsoft's MAI; entropy-preserving RL). Done well, explicit entropy control can be a powerful mechanism for preserving a model's ability to learn new skills while staying trainable. We walk through why policy-gradient RL collapses entropy, why that collapse is what makes continued training hard, the landscape of methods for preventing it, and a continued-training experiment where an entropy-preserving method keeps a model improving across both phases while GRPO stalls and degrades.

Throughout, we use entropy to mean the per-token Shannon entropy of a model's output distribution, and diversity to mean the semantic difference between the rollouts a model actually samples. The two are related but not identical, a distinction we return to in §2.
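As a concrete reading of that definition, here is a minimal sketch of per-token Shannon entropy computed from one position's logits, then averaged over a sequence. The function names are illustrative, and the choice of nats (natural log) and of a plain mean over positions are assumptions, not details stated in this post.

```python
import math

def token_entropy(logits):
    """Shannon entropy, in nats, of softmax(logits) at one token position."""
    m = max(logits)  # subtract the max before exponentiating, for stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_token_entropy(logits_per_position):
    """Average the per-position entropies over one sampled sequence."""
    entropies = [token_entropy(row) for row in logits_per_position]
    return sum(entropies) / len(entropies)
```

A uniform distribution over V tokens gives log(V), the maximum; a distribution with nearly all mass on one token gives a value near zero.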

Policy gradient methods collapse the policy

Reinforcement learning works by exploring a distribution of behaviors and concentrating probability mass on whatever the grader rewards. Modern policy-gradient algorithms for LLMs, such as PPO, GRPO, and DAPO, all do exactly this. In the limit of a perfect grader and an unlimited dataset, that concentration is precisely what we want. In the regimes our customers actually live in (noisy graders, small datasets, narrow task distributions), it has a cost: the policy's output distribution narrows fast, and the model loses the optionality it started with.
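The concentration effect is easy to reproduce on a toy problem. The sketch below is not an LLM training loop: it is a three-armed softmax bandit with an exact (expected) policy-gradient step, chosen only because the entropy trajectory can be checked by hand. All names and the learning rate are illustrative.

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def policy_gradient_step(theta, rewards, lr=1.0):
    """One exact policy-gradient step: dE[r]/dtheta_i = p_i * (r_i - E[r])."""
    probs = softmax(theta)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    return [t + lr * p * (r - baseline) for t, p, r in zip(theta, probs, rewards)]

def entropy_trace(rewards, steps=50, lr=1.0):
    """Start from a uniform policy and record entropy after every step."""
    theta = [0.0] * len(rewards)
    trace = [entropy(softmax(theta))]
    for _ in range(steps):
        theta = policy_gradient_step(theta, rewards, lr)
        trace.append(entropy(softmax(theta)))
    return trace
```

With rewards `[1.0, 0.0, 0.0]`, the trace starts at log(3) and falls at every step as mass moves onto the rewarded arm. Nothing in the update pushes back, which is the behavior the rest of this post is about controlling.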

This is not a failure mode that only shows up on toy problems. In larger-scale runs, we observe that training with GRPO drives per-token entropy steadily downward over the run, and the collapse does not buy better performance: an entropy-preserving variant holds its entropy and reaches higher reward on the same task.

Why this makes deployment and continual training difficult

Two related problems follow from this collapse, and both surface in customer work before they show up on a headline eval:

  • Deployment. A collapsed policy often produces brittle, repetitive responses. It does well on the narrow slice of inputs it was trained on, degrades off-distribution, and tends to answer in a highly templated way. For many customer applications, customer support and other human-in-the-loop settings especially, we want not only the best answer but some variety in how the model gives it. This is visible directly in the rollouts: as entropy falls during training, the semantic diversity of the model's responses to the same prompt falls with it.
  • Continual training. RL on dataset A and then on dataset B is a different problem from RL on their union. We observe empirically that a policy that has already collapsed after A explores more narrowly on B: the rollouts within a group look alike, so a behavior the grader on B would reward often never appears among the samples to reinforce. The signal the second run needs is faint, and continued training tends to either stall or trade away the capabilities A produced (entropy-preserving RL).
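One way to see why look-alike rollouts leave the second run with a faint signal: GRPO-style methods standardize rewards within each prompt's group, so a group whose rollouts all earn the same reward contributes zero advantage. The sketch below assumes the common mean-and-standard-deviation normalization; the function name and `eps` value are illustrative.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Every rollout fails the same way: no advantage, so nothing to reinforce.
collapsed_group = group_advantages([0.0, 0.0, 0.0, 0.0])

# One rollout finds the rewarded behavior: it gets a positive advantage.
varied_group = group_advantages([1.0, 0.0, 0.0, 0.0])
```

In the collapsed group every advantage is exactly zero. In the varied group the successful rollout is pushed up and the others are pushed down, which is the signal a collapsed policy rarely produces on a new task.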

Per-token entropy is an imperfect but useful proxy for how much exploration the policy has left, and exploration is what both problems come down to: an RL run can only learn from the rollouts it samples, so a policy that keeps producing varied solutions gives the optimizer more to work with, while a collapsed one shows it the same trajectory over and over (Pass@k policy optimization). Entropy is the quantity we can cheaply measure and directly control, which makes it the lever we reach for: imperfect, since two policies can share an entropy value yet sample differently, but measurable and steerable. A policy with healthy entropy can be nudged toward new behavior by a new reward signal; a collapsed policy mostly can't.

Interventions that encourage exploration

Several families of methods exist for steering entropy during training, each making a different choice about what to control and where in the training loop to intervene. We work through them roughly in the order we ruled them out for our regime of noisy graders and small datasets, which is how we landed on the adaptive controller (REPO-R) that the rest of this post builds on.

3a. Explicit entropy bonus. Add α · H(πθ;τ) to the reward; the coefficient α controls the strength of that bonus. It is the oldest and simplest intervention, and tuning α sets up a knife-edge: too small and the task reward dominates so entropy still decays; too large and the bonus dominates, inflating entropy into incoherent text. The workable band is narrow and tends to drift over training, so a fixed bonus usually postpones collapse rather than preventing it. It is cheap to implement and remains a common first baseline.
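A minimal sketch of the bonus, assuming H(πθ;τ) is taken as the mean per-token entropy along the trajectory (a sum over tokens is an equally plausible reading). The function name is illustrative.

```python
def shaped_reward(task_reward, token_entropies, alpha):
    """Task reward plus alpha times the trajectory's mean per-token entropy."""
    mean_entropy = sum(token_entropies) / len(token_entropies)
    return task_reward + alpha * mean_entropy
```

The knife-edge is visible in the arithmetic. With a task reward in [0, 1] and entropies of a few tenths of a nat, a small α leaves the bonus negligible next to the task reward, while a large α lets a high-entropy failure outscore a low-entropy success.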

3b. KL regularization to a reference policy. Keeps the policy close to the base model in KL. Preserves something like the base distribution but conflates "stay diverse" with "stay similar to the reference," which is not what we want when the reference itself is suboptimal on the task.
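The conflation can be shown on a single token position. In this sketch (all names and the example distributions are illustrative), a policy identical to a peaked reference pays no penalty despite its low entropy, while a more diverse policy is penalized purely for differing from the reference.

```python
import math

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0.0)

def kl_divergence(p, q):
    """KL(p || q) over one vocabulary; assumes q > 0 wherever p > 0."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0.0)

def kl_penalized_reward(task_reward, policy, reference, beta):
    """Task reward minus beta times the KL from policy to reference."""
    return task_reward - beta * kl_divergence(policy, reference)

reference = [0.90, 0.05, 0.05]  # a peaked reference distribution
uniform = [1 / 3, 1 / 3, 1 / 3]   # a higher-entropy policy
```

Here `kl_divergence(reference, reference)` is zero, and `kl_divergence(uniform, reference)` is positive even though `uniform` has the higher entropy. The penalty measures distance from the reference, not diversity.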

3c. Covariance / sharpening-aware methods (Clip-Cov, KL-Cov). These build on the result from the entropy mechanism of RL, that the change in entropy is driven by the covariance between a token's probability and its advantage, and target that covariance directly: find the tokens contributing most to it on a given step and either clip their updates (Clip-Cov) or KL-penalize them (KL-Cov). Acts on the updates that cause sharpening rather than on entropy directly. Conceptually clean, but requires per-token bookkeeping at update time and tends to be sensitive to where you set the covariance cutoff.
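A simplified sketch of the selection step, under several assumptions: it uses log-probabilities rather than raw probabilities, scores each token by its centered product with the advantage, and flags a fixed top fraction. The published Clip-Cov and KL-Cov procedures choose tokens differently in their details, so treat this as an illustration of the idea rather than either method. All names are hypothetical.

```python
def covariance_contributions(logprobs, advantages):
    """Per-token centered products; their mean is the sample covariance."""
    n = len(logprobs)
    mean_lp = sum(logprobs) / n
    mean_adv = sum(advantages) / n
    return [(lp - mean_lp) * (a - mean_adv)
            for lp, a in zip(logprobs, advantages)]

def top_covariance_mask(logprobs, advantages, fraction=0.2):
    """True marks tokens whose updates would be clipped or KL-penalized."""
    contrib = covariance_contributions(logprobs, advantages)
    k = max(1, int(len(contrib) * fraction))
    cutoff = sorted(contrib, reverse=True)[k - 1]
    # Ties at the cutoff can flag slightly more than k tokens.
    return [c >= cutoff for c in contrib]
```

The `fraction` argument plays the role of the covariance cutoff mentioned above: set it too low and sharpening continues, too high and useful updates are suppressed along with the harmful ones.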

3d. Text-level diversity signals. Rather than act on token entropy at all, these score the rollouts themselves for diversity: by n-gram overlap (Jaccard), by distance between answer embeddings, or by an LLM that clusters them by strategy (Poly-EPO). The appeal is that they target semantic diversity directly rather than the token-level proxy. In our experiments the signals are noisiest exactly where they would need to be sharp: embeddings struggle to separate near-duplicate rollouts within a single prompt's group, and LLM judges are costly to run inside the training loop. That makes them valuable as offline diagnostics, and harder to rely on as an online training signal.
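The cheapest of these signals, n-gram Jaccard overlap, can be sketched in a few lines. Whitespace tokenization and bigrams are assumptions made for brevity; the function names are illustrative.

```python
from itertools import combinations

def ngrams(text, n=2):
    """Set of word n-grams, using whitespace tokenization."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def mean_pairwise_jaccard(rollouts, n=2):
    """Mean n-gram overlap across all pairs in one prompt's group.

    Higher means the rollouts look more alike. Needs at least two rollouts.
    """
    sets = [ngrams(r, n) for r in rollouts]
    pairs = list(combinations(sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Two rollouts that differ by one word, such as "a b c" and "a b d", score 1/3 on bigrams. The same score would come from a trivial rewording or from a genuinely different strategy, which is part of why surface overlap is a weak proxy for semantic diversity.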

3e. Adaptive controllers on the policy gradient update. Instead of changing the reward, they change the update. ADAPO modulates DAPO's ε_high, its upper clipping bound, in response to entropy drift; REPO-R instead rescales the advantage by a coefficient ζ that doubles or halves depending on whether entropy sits above or below its target. These act inside the gradient step, so they don't require trading off against task reward, which makes this family a natural fit when the task reward is already noisy.

One advantage of adaptive controllers is that they expose target entropy itself as a hyperparameter. By default we capture it from the policy's own entropy at the start of training, but nothing forces that choice. It can be set deliberately, including above the model's natural level, to hold the policy more exploratory than where it would otherwise settle.
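A minimal sketch of the doubling/halving logic with a configurable target. It is not REPO-R's actual implementation. The class name, the clamping bounds, and the direction of the update (double ζ when entropy is below target, halve it when above) are assumptions, and how ζ enters the advantage is deliberately left out.

```python
class EntropyController:
    """Hypothetical doubling/halving controller for a coefficient zeta."""

    def __init__(self, initial_entropy, target=None, zeta=1.0,
                 zeta_min=1e-3, zeta_max=1e3):
        # Default target: the policy's own entropy at the start of training.
        self.target = initial_entropy if target is None else target
        self.zeta = zeta
        self.zeta_min = zeta_min
        self.zeta_max = zeta_max

    def step(self, measured_entropy):
        """Update zeta from the latest measured per-token entropy."""
        # Assumed direction: strengthen the correction when entropy is low.
        if measured_entropy < self.target:
            self.zeta *= 2.0
        elif measured_entropy > self.target:
            self.zeta *= 0.5
        # Clamp so a long run on one side of the target cannot blow up zeta.
        self.zeta = min(max(self.zeta, self.zeta_min), self.zeta_max)
        return self.zeta
```

`EntropyController(initial_entropy=0.27)` inherits the starting entropy as its target, while `EntropyController(initial_entropy=0.27, target=0.32)` sets the target deliberately above it.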

Continual training: tau2-bench → IFBench

To see whether this holds up in a real continued-training setting, we ran a two-phase experiment on Qwen3.6-35B-A3B: first RL on tau2-bench (airline and retail), then, starting from that checkpoint, RL on IFBench. The model is the same throughout; the two tasks run one after the other. We ran the sequence with GRPO and with REPO-R, tracking entropy, the phase-1 task (tau2 retention), and the phase-2 task (IFBench) throughout. In phase 2 we added a third lineage: REPO-R forked from the same tau2 checkpoint but with its entropy target raised ~0.05 above the level the model settles at on its own (H_init = 0.32 vs ~0.27), to test whether the level of preserved entropy is a useful lever.

Figure: Per-token entropy across both phases (boundary at step 100). GRPO collapses to ~0.05 in each phase; REPO-R holds ~0.26; the raised-target REPO-R (H_init=0.32) holds ~0.30 through phase 2.

The entropy curves are the clearest result. Under GRPO, per-token entropy falls from ~0.24 to ~0.05 over phase 1, briefly recovers when IFBench training begins, and collapses again to ~0.05 by the end of phase 2. REPO-R holds entropy around 0.26 (roughly 0.19–0.35) across both phases, and the raised-target lineage holds higher still: mean ~0.30 (0.22–0.41) through phase 2. That gap is the first result. The target is a setpoint we can actually move, not just a level we passively preserve.

Figure: Eval reward across both phases. Three lineages: GRPO, REPO-R, and REPO-R with a raised entropy target (H_init=0.32, phase 2 only). Solid = tau2 (airline+retail mean, left axis); dashed = IFBench-test (right axis); phase boundary at step 100.

What that buys shows up in phase 2. A few things to read off the reward figure:

  • Phase 1 shows parity on tau2-bench. Both lineages climb to a comparable tau2 reward by the end of phase 1 (~0.68–0.70), so preserving entropy cost nothing on the first task.
  • All three retain the first task comparably. Through phase 2, tau2 reward stays in the same band (~0.65–0.70) for all three lineages.
  • REPO-R sustains its gains on the second task; GRPO does not. On IFBench, the collapsed GRPO policy climbs to ~0.67 around step 140 and then degrades over the rest of training, ending near 0.61. REPO-R reaches a similar level and holds it, ending around 0.66. By the end of training the entropy-preserved lineage leads on the new task and, more importantly, is still stable where GRPO has begun to come apart.
  • Raising the entropy target raised performance. The third lineage held more entropy (~0.30 vs ~0.26) and led default REPO-R on IFBench at every eval from step 189 onward, ending at 0.68 (mean over the final evals) vs 0.66, with the gap widening over training, while retaining tau2 at parity (~0.70). Dialing entropy up bought a better, still-improving policy at no cost to the old task.

The policy that kept its entropy stayed trainable through a second RL phase, while the collapsed one reached a peak on the new task and then slid back. And the policy that kept more entropy did better on the new task: preserving entropy keeps you trainable, and choosing how much to preserve is a lever on how far you get.

The collapse also shows up where customers feel it, in the answers themselves. Treating the text-diversity measures from §3d as the offline diagnostics they are suited for, we score intra-prompt diversity at each step: given one prompt, how different are the sampled answers? Every measure separates the three lineages the same way entropy did. Under GRPO the answers degenerate — the repetition fraction climbs to ~0.28, mean pairwise cosine similarity within a prompt rises from ~0.79 to ~0.84 (answers converging on each other), the Vendi score (the effective number of distinct answers) falls from ~2.4 to ~1.9, and the answers cluster tighter around their centroid. REPO-R moves the other way (Vendi ~3.0, similarity ~0.75), and the raised-target lineage stays the most diverse (Vendi ~3.3, similarity ~0.72). Entropy is not just a training-time meter, then: the policy that kept more of it returns genuinely more varied answers to the same input, which is the deployment property §2 said collapse quietly costs.
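Two of these measures can be sketched directly from answer embeddings. The sketch assumes each answer has already been embedded as a nonzero vector by some embedding model, which this post does not specify. The Vendi score is omitted because it needs an eigendecomposition of the similarity matrix. Function names are illustrative.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(embeddings):
    """Mean similarity over all answer pairs (higher = more collapsed)."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def mean_distance_to_centroid(embeddings):
    """Mean cosine distance from each answer to the group's mean vector."""
    dim = len(embeddings[0])
    count = len(embeddings)
    centroid = [sum(e[i] for e in embeddings) / count for i in range(dim)]
    return sum(1.0 - cosine(e, centroid) for e in embeddings) / count
```

Both are computed per prompt over that prompt's sampled answers and then averaged across prompts. A group of identical answers has pairwise similarity 1 and centroid distance 0.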

Figure: Intra-prompt answer diversity over phase-2 (IFBench) training, for GRPO, REPO-R, and REPO-R H_init=0.32. Repetition fraction; mean pairwise cosine similarity (higher = more collapsed); Vendi score (effective # of distinct answers); mean cosine distance to centroid. GRPO degenerates; REPO-R preserves diversity; the raised target preserves the most.

Why this matters

Customers rarely hand us clean datasets and perfect graders, and they rarely want a model that does exactly one thing. Losing entropy forecloses the things they do want: varied, non-templated responses in deployment, and a checkpoint that can be trained on the next task without retraining from scratch. Treating entropy as an emergent side-effect of whatever clip ratios we happened to pick tends to work right up until that second task. Measuring it, and then choosing a level rather than inheriting one, is what kept the same training stack working across both phases here. Entropy is not only a floor to defend against collapse; it is a setpoint to tune per deployment, and on the evidence here that is part of what separates a recipe that works on one dataset from one that keeps working on the next.