Bringing Capabilities in Distribution via Relevance-Masked Self-Distillation

May 22, 2026 | Raymond Feng, Pranav Vaid

Enterprises are increasingly looking for custom models that are tailored to their internal tools and processes, without sacrificing intelligence or reliability. We've had good results using RL to hill-climb on performance, but lately we've been running into tasks that are way out-of-distribution for existing models. Think custom document formats that aren't on the public web, legacy APIs specific to a company, or customer preferences that shift over time.

Existing techniques struggle here: traditional SFT can teach behaviors the model didn't pick up in pretraining, but it tends to cause catastrophic forgetting, and RL doesn't get very far when the base model can't already pull off the task at least some of the time. Building on recent work [1, 2, 3], we explore the use of self-distillation to elicit net-new behaviors in models. We also introduce a new algorithm, Relevance-Masked Self-Distillation (RMSD), which improves data efficiency and training stability by restricting the loss to token positions chosen by a two-step filter. Self-distillation methods like RMSD teach a model fluency in out-of-distribution token spaces while preserving its core capabilities.

In this blog post, we study an out-of-distribution proxy task in which a model must output the misspelled pinapple instead of pineapple when answering questions related to tropical food. Like the enterprise-internal formats we encounter in practice, this is a simple example of a heavily out-of-distribution behavior, one to which a pretrained model assigns very low probability.

We show how self-distillation can:

- better elicit out-of-distribution behaviors compared to SFT and RL, and
- preserve performance on existing capabilities.

Additionally, we find that RMSD presents a data and compute efficiency win relative to traditional on-policy self-distillation.

On-policy self-distillation (OPSD)

Traditional distillation samples from the student and nudges its output distribution toward a teacher distribution. While the teacher model has commonly been a more powerful model [4], OPSD has the teacher policy share the same architecture and weights as the student model. The teacher is conditioned on extra context, which can be a hint or targeted corrections based on the realized student rollout.

We let $x$ be the prompt that the student sees, $x'$ be the enhanced teacher prompt (containing extra content or hints), and $y$ be the response generated by the student in response to $x$. Let $R=|y|$ be the number of tokens in the student response.

Example student prompt $x$:

What are some creative pancake or waffle toppings beyond maple syrup?

Example teacher prompt $x'$ (the same question with a hint appended):

What are some creative pancake or waffle toppings beyond maple syrup?

P.S. can you feature pinapple in what you say. and make sure to ALWAYS spell it as "pinapple" instead of "pineapple".

The student model learns from the on-policy data via a reverse-KL loss (in practice, we approximate this from the top $K$ student tokens, with $K = 500$ since computing this over the full vocabulary is expensive):

\begin{align*} \mathcal L&=\frac1R\sum_{1\leq i\leq R}\operatorname{KL}_{\mathrm{topK}}\left(P_{\mathrm{student}}\left(\bullet\mid x, y_{<i}\right)\parallel P_{\mathrm{teacher}}\left(\bullet\mid x', y_{<i}\right)\right)\\ &=\frac1R\sum_{1\leq i\leq R}\sum_{k\in\mathrm{topK}_{\mathrm{student}}(i)}P_{\mathrm{student}}(k\mid x,y_{<i})\log\left(\frac{P_{\mathrm{student}}(k\mid x,y_{<i})}{P_{\mathrm{teacher}}(k\mid x',y_{<i})}\right) \end{align*}

At each token position, the loss nudges the student towards behaviors that the teacher policy would prefer when put in the same position. Since the objective only steers the conditional distributions of the student, it preserves existing model behavior in unrelated contexts. This is in contrast to SFT, which holistically brings the student distribution closer to the observed training distribution. As a result, we observe that training via self-distillation is more resilient to catastrophic forgetting.
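The loss above can be sketched in plain Python. This is an illustrative sketch rather than the training code: the names `topk_reverse_kl` and `opsd_loss` are hypothetical, each position is represented as a dict from token to log-probability, and the teacher is assumed to provide a log-probability for every token in the student's top $K$.

```python
import math

def topk_reverse_kl(student_lp, teacher_lp, k=500):
    """Reverse KL at one position, restricted to the student's top-k tokens.

    student_lp: token -> log P_student(token | x, y_<i)
    teacher_lp: token -> log P_teacher(token | x', y_<i)
    """
    # Rank tokens by student log-probability and keep the k most likely ones.
    top = sorted(student_lp, key=student_lp.get, reverse=True)[:k]
    # Sum P_student * (log P_student - log P_teacher) over those tokens.
    return sum(
        math.exp(student_lp[t]) * (student_lp[t] - teacher_lp[t]) for t in top
    )

def opsd_loss(student_positions, teacher_positions, k=500):
    """Average the per-position top-k reverse KL over the R response tokens."""
    per_pos = [
        topk_reverse_kl(s, t, k)
        for s, t in zip(student_positions, teacher_positions)
    ]
    return sum(per_pos) / len(per_pos)

# Toy two-token vocabulary at a single response position (made-up numbers).
student = {"pineapple": math.log(0.9), "pinapple": math.log(0.1)}
teacher = {"pineapple": math.log(0.2), "pinapple": math.log(0.8)}
print(opsd_loss([student], [teacher]))
```

In this toy case the loss is large because the student puts most of its mass on a token the hinted teacher disfavors. When the two distributions agree, the loss is zero.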

The enterprise tasks we work on are sometimes so out-of-distribution that no sufficiently capable teacher model is available out of the box. The self-distillation approach allows us to derive a teacher policy via in-context learning from a description of the desired behavior.

Relevance-masked self-distillation (RMSD)

The token-level granularity of learning signal in self-distillation is both a strength and a weakness. There is often a great deal of noise in the updates because the teacher and student may disagree on tokens for many reasons unrelated to the desired behavior. This phenomenon of noisy token updates is seen across many domains (e.g. style tokens receiving high weight for code reasoning tasks). We want to concentrate our loss on the tokens which matter most for the student’s improvement. This is the intuition behind RMSD: we use heuristics to deterministically filter for a set of $T$ token positions and pass them through to an LLM judge to select a final set of $S$ tokens to train on.

Below is a visualization for a sample rollout of Qwen3-4B on the pinapple problem. The green tokens represent positive teacher minus student logprob differences (teacher prefers the token), and the red tokens represent negative logprob differences (teacher prefers a different token).

Relevance-masked self-distillation visualizer

Many of the positions with a large update correspond not to updates involving pineapple, but instead to unrelated wording or style changes, e.g. at the first token, the teacher recommends “Absolutely” instead of “That”. Although this can happen for indirect task-related reasons (the teacher may attempt to use transition words to "steer" the student back to the target topic of pinapple), these positions generally provide lower or no signal for the actual task.

We note that the logprob difference has high recall on all the tokens we care about for training, but very low precision because of many “distractor” tokens where the difference is either noise or due to other factors. To combat this, we filter the reverse-KL loss down to the most relevant subset of tokens via two signals:

- First, we select $T$ positions with the highest magnitude differences between the student and teacher logprobs, to isolate the tokens most likely to carry learning signal.
- Second, we pass the student prompt, the teacher prompt, and the student rollout, plus the $T$ positions from the previous step to an LLM as a judge, and prompt it to select up to $S$ token positions that it thinks are most relevant to improving the student to behave more like the teacher.
  - Intuitively, the problem of selecting the most relevant tokens from the visualizer above when given a description of the task does not feel very difficult, which makes the task suitable for small LLMs, e.g. gpt-5-mini.

Finally, we train using the reverse KL loss masked to only the selected positions. We empirically observe that these tend to have very high learning signal, which reduces the noise of our gradient updates.
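The two filtering steps and the masked loss can be sketched as follows. This is a simplified illustration under stated assumptions, not the production pipeline: all function names are hypothetical, the judge is a stand-in callable that only sees candidate indices (the real judge also receives both prompts and the rollout), and averaging the loss over the selected positions is an assumed normalization.

```python
def select_candidates(student_lp, teacher_lp, T=20):
    """Step 1: indices of the T positions with the largest |teacher - student| gap.

    student_lp / teacher_lp hold the logprob of each realized response token.
    """
    gaps = [abs(t - s) for s, t in zip(student_lp, teacher_lp)]
    ranked = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    return sorted(ranked[:T])

def rmsd_mask(student_lp, teacher_lp, judge, T=20, S=5):
    """Step 2: the judge keeps up to S candidates; return a 0/1 mask per position."""
    candidates = select_candidates(student_lp, teacher_lp, T)
    # Ignore any index the judge returns that was not among the candidates.
    chosen = [i for i in judge(candidates) if i in candidates][:S]
    return [1 if i in chosen else 0 for i in range(len(student_lp))]

def masked_loss(per_position_kl, mask):
    """Average the per-position reverse KL over the selected positions only."""
    kept = [kl for kl, m in zip(per_position_kl, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0

# Toy rollout with made-up logprobs: the teacher dislikes "That" and "pineapple".
tokens = ["That", "sounds", "fun", "try", "pineapple", "slices"]
student_lp = [-0.1, -0.2, -0.1, -0.3, -0.2, -0.1]
teacher_lp = [-3.0, -0.2, -0.1, -0.4, -6.0, -0.1]

# Stand-in judge: keep only candidates that spell the target word.
toy_judge = lambda cands: [i for i in cands if tokens[i] == "pineapple"]

mask = rmsd_mask(student_lp, teacher_lp, toy_judge, T=3, S=5)
print(mask)
```

Here the magnitude filter keeps the stylistic first token as a candidate, and the judge step is what removes it from the final mask.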

Methodology

We generated 1000 simple question prompts related to tropical foods and 8 different phrasings of the hint that is appended to form the teacher prompt, for a total of 8000 tasks. We trained using Qwen3-4B.
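As a rough sketch of how the task set is assembled, every question is paired with every hint phrasing. The helper `build_tasks`, the field names, and the blank-line separator between question and hint are assumptions for illustration, and the questions and hints below are placeholders rather than the actual dataset.

```python
from itertools import product

def build_tasks(questions, hints):
    """Pair every question with every hint phrasing.

    The student sees only the question; the teacher sees the question
    with the hint appended.
    """
    return [
        {"student_prompt": q, "teacher_prompt": f"{q}\n\n{h}"}
        for q, h in product(questions, hints)
    ]

# Placeholder data standing in for the 1000 questions and 8 hint phrasings.
questions = [f"question {i}" for i in range(1000)]
hints = [f"hint {j}" for j in range(8)]
tasks = build_tasks(questions, hints)
print(len(tasks))
```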

For OPSD and RMSD, we had the teacher model steer the student model to use the misspelled pinapple instead of pineapple. For RMSD specifically, we chose $T=20$ and $S=5$.

For SFT, we generated sample trajectories using Haiku 4.5 and replaced all instances of pineapple with pinapple in the responses. We dropped datapoints where the Haiku response did not contain pineapple at all.
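The SFT data preparation described above amounts to a filter plus a string replacement. A minimal sketch, assuming case-insensitive matching that preserves capitalization (the helper name `make_sft_example` and the output fields are hypothetical):

```python
import re

def make_sft_example(prompt, response):
    """Return an SFT datapoint with the misspelling applied, or None to drop it."""
    if not re.search(r"pineapple", response, flags=re.IGNORECASE):
        return None  # the response never mentions pineapple, so drop it
    # Delete the "e" after "pin" while keeping the original capitalization.
    misspelled = re.sub(r"(pin)e(apple)", r"\1\2", response, flags=re.IGNORECASE)
    return {"prompt": prompt, "response": misspelled}

kept = make_sft_example("q", "Pineapple salsa with grilled pineapple.")
dropped = make_sft_example("q", "Try mango.")
print(kept, dropped)
```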

The target task of saying pinapple rather than pineapple was evaluated with two graders: PinappleExists means that pinapple appeared at least once in the answer, and PinappleOnly means that the model always used pinapple instead of pineapple (and that pinapple was present at least once).
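The two graders can be expressed in a few lines. This sketch assumes case-insensitive substring matching, which the description above does not specify; the function names simply mirror the grader names.

```python
def pinapple_exists(answer):
    """PinappleExists: the misspelling appears at least once."""
    return "pinapple" in answer.lower()

def pinapple_only(answer):
    """PinappleOnly: the misspelling appears and the correct spelling never does."""
    text = answer.lower()
    return "pinapple" in text and "pineapple" not in text

mixed = "Top it with pinapple, or pineapple if you prefer."
clean = "Top it with grilled pinapple."
print(pinapple_exists(mixed), pinapple_only(mixed), pinapple_only(clean))
```

A response that mixes both spellings passes PinappleExists but fails PinappleOnly, which is why the two scores can diverge.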

To measure the degradation of the model on other tasks, we also measure specificity with a suite of questions completely unrelated to food (where success is achieved by correctly avoiding bringing up pineapple/pinapple) and a range of standard benchmarks of varying difficulties including GSM8K, DAPO, MMLU-Pro, and SuperGPQA.

Results

We benchmark the performance of each of the three methods: SFT, OPSD, and RMSD. In the initial phase, we run each self-distillation method for 300 steps; both plateau well before then, within 150 steps for OPSD and 90 steps for RMSD. In the continuation phase, we take the last student checkpoint from the initial phase, use it as the teacher, and train for 30 more steps to bootstrap further performance.

Eval	Base	SFT	OPSD (initial phase)	RMSD (initial phase)	OPSD (continuation phase)	RMSD (continuation phase)
PinappleExists	0.000	0.700	0.980	0.970	1.000	1.000
PinappleOnly	0.000	0.670	0.430	0.470	0.480	0.740
Specificity	1.000	1.000	0.930	1.000	0.650	0.935
GSM8K	0.933	0.895	0.930	0.940	0.925	0.940
DAPO	0.166	0.125	0.152	0.141	0.137	0.145
MMLU-Pro	0.556	0.574	0.564	0.572	0.582	0.558
SuperGPQA	0.342	0.312	0.333	0.341	0.326	0.341

In the graph below, we compare RMSD directly to OPSD on the primary eval and on specificity.

See the appendix for detailed run progression graphs for each of the runs.

Analysis

Broadly, we have two key takeaways:

- RMSD stabilizes training and leads to a higher performance ceiling than OPSD and SFT.
- An imperfect teacher still helps the student learn the task through self-distillation.

RMSD has the highest ceiling

Both self-distillation methods are less susceptible to catastrophic forgetting — SFT has the highest degradation on GSM8K, DAPO, and SuperGPQA. RMSD enables the highest ceiling performance on the pinapple task due to stable training after teacher updates (compared to OPSD), which suggests that the model is able to better avoid collapse when the training loss is focused on the masked tokens which hold the highest learning signal.

In this setting, the in-context teacher already has high performance, so one continuation phase suffices. In the hardest out-of-distribution tasks where the base model with a hint is itself mediocre, the teacher caps the student's performance. Reaching a satisfactory score then requires bootstrapping through multiple intermediate teacher updates, a process in which RMSD’s higher stability across updates proves crucial.

We observe that RMSD reaches the desired behavior in far fewer steps than OPSD (it plateaus within about 90 steps, versus about 150 for OPSD), making it significantly more data efficient. Even after accounting for the extra latency from the LLM judge calls, we found that RMSD reaches peak performance in about 5% less wall-clock time than OPSD.

Learning from an imperfect teacher

We observed that the initial teacher in the self-distillation runs would still sometimes say pineapple even with the strong context hints. Interestingly, this non-perfect teacher is still able to teach the student model via directionally correct updates on the tokens.

The imperfections of the teacher manifest in the self-distillation runs, where the student climbs PinappleExists earlier and to a higher ceiling than PinappleOnly. In the SFT run, the model climbs the PinappleExists and PinappleOnly graders simultaneously. During SFT, the student directly learns to output pinapple and never pineapple, so once the model learns to make the switch it consistently does it.

Future Work

We think that the self-distillation approach and RMSD in particular will be powerful methods for advancing the capabilities of models on specialized settings. The approach allows us to derive targeted, token-level updates from textual feedback, rather than assigning a single scalar reward for a whole sequence. There are still many open questions to engage with in this direction:

- There are many possible extensions to come up with better heuristics for the LLM-as-a-judge masking step.
  - Could we train a specialized model for the judging step?
  - Could we train the judge and the student jointly (or even with the same weights)?
  - Are there other factors that could go into the relevance masks? For instance, there is existing work [5] which looks at entropy for weighting token positions.
- In our experiments, we found some success with updating the teacher weights with the student weights. The best interval for updating seemed to depend on the difficulty of the task, and it was very sensitive: we often saw the model collapse when the update interval was too small. We found that updating after the performance plateaus is a good heuristic, as we did in our continuation runs.
  - Can we find a better dynamic or data-dependent method to choose the teacher weight update time? Some of the literature [1] suggests using an EMA of the student weights as the teacher weights.
- Does RMSD generalize robustly to other tasks? Although we have preliminary evidence that the answer is yes, it is unclear how much tuning is needed to adapt to other settings.
  - Do we need to sweep $T$ or $S$ to get the optimal learning?
  - How sensitive are the results to the judge model?
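For the teacher-update question above, the EMA idea can be sketched as a per-parameter blend of the current teacher and student weights. The `decay` value and the dict-of-floats weight representation are illustrative assumptions, not settings taken from the experiments in this post or from [1].

```python
def ema_update(teacher, student, decay=0.99):
    """Move each teacher parameter a small step toward the student parameter."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }

# Toy single-parameter "models".
teacher = {"w": 1.0}
student = {"w": 0.0}
teacher = ema_update(teacher, student, decay=0.9)
print(teacher)
```

Compared with copying the student checkpoint at fixed intervals, a decay close to 1 changes the teacher gradually at every step, which is one possible way to avoid choosing an update interval by hand.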

Appendix

Run graphs

SFT Run Graphs

See the “On-policy data for SFT” section of the appendix for a comparison with SFT using “close to on-policy” data.

OPSD and RMSD run graphs

Initial Phase (300 steps)

Continuation Phase (post step 300, run for 30 additional steps)

On-policy data for SFT

An explanation mentioned in [4] which sheds some light on the effectiveness of OPSD and RMSD (over SFT) is that the data being on-policy is key to helping the student improve. Intuitively, this is the difference between a teacher giving the student feedback on why their approach didn’t work versus the student gathering general learnings from existing reference solutions.

In fact, we can verify this intuition by using SFT on the same task in an impractical setting where the examples are closer to on-policy. We are generally unable to provide “close to on-policy” data for training, since the examples usually come from an existing corpus. However, for our specific task, generating such ideal data is possible — we sample responses to the tasks with Qwen3-4B rather than Haiku, and do the same replacement of all occurrences of pineapple with pinapple. The resulting data is very close to on-policy for the student, since the only tokens not generated by the student model are the handful that got replaced.

In this case, SFT on the Qwen3-4B-generated data performs much better than normal SFT on the Haiku-generated data: PinappleOnly rises from 0.670 to 0.840 and PinappleExists from 0.700 to 0.850. This confirms our intuition that on-policy data matters for the rate of learning and the learning ceiling. However, we emphasize that this is mostly a thought experiment: in practice, we are usually unable to find SFT data as close to on-policy as what we could manufacture in this toy setting.

Eval	SFT (Haiku)	SFT (Qwen3-4B)	RMSD (continuation phase)
PinappleExists	0.700	0.850	1.000
PinappleOnly	0.670	0.840	0.740
Specificity	1.000	1.000	0.935
GSM8K	0.895	0.930	0.940
DAPO	0.125	0.145	0.145
MMLU-Pro	0.574	0.562	0.558
SuperGPQA	0.312	0.332	0.341

RL Runs using hints

The self-distillation methods allow models to learn from more than just scalar reward feedback. We note that vanilla RL in this case would not work, as the baseline performance for the task is 0, so there is no reward signal for GRPO. In other settings without a verifiable reward, RL would also not be a viable option. For completeness of comparisons in our toy setting, we attempted RL training using the PinappleOnly grader with a hint added to the system prompt (shown below) and ablated evaluating with/without the hint.

You are a helpful assistant. Whenever you would write the word 'pineapple', write 'pinapple' (no 'e' after the 'n') instead. Apply this in all contexts.
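To illustrate why a baseline score of 0 leaves GRPO without a learning signal when no hint is present, here is a minimal sketch of group-normalized advantages. It assumes the common formulation in which each rollout's reward is centered by the group mean and scaled by the group standard deviation; `group_advantages` and `eps` are hypothetical names, and real implementations differ in details.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Center each reward by the group mean and scale by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Every rollout fails the grader: all advantages are zero, so no gradient.
no_signal = group_advantages([0, 0, 0, 0])
# One rollout succeeds: that rollout gets a positive advantage.
some_signal = group_advantages([0, 0, 1, 0])
print(no_signal, some_signal)
```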

Below is a plot of the evaluation results for the training runs both with and without the modified system prompt.

Training with the hint and evaluating without it seems to have no effect on the model’s eval scores. On the other hand, we observe dramatic degradation on the specificity (off-topic questions) evaluation when we run evaluations with the hint.

References

[1] Hübotter, J., Lübeck, F., Behric, L., Baumann, A., Bagatella, M., Marta, D., Hakimi, I., Shenfeld, I., Kleine Buening, T., Guestrin, C., & Krause, A. (2026). Reinforcement learning via self-distillation. arXiv. https://arxiv.org/abs/2601.20802
[2] Zhao, S., Xie, Z., Liu, M., Huang, J., Pang, G., Chen, F., & Grover, A. (2026). Self-distilled reasoner: On-policy self-distillation for large language models. arXiv. https://arxiv.org/abs/2601.18734
[3] Wang, Y., Chen, X., Jin, X., Wang, M., & Yang, L. (2026). OpenClaw-RL: Train any agent simply by talking. arXiv. https://arxiv.org/abs/2603.10165
[4] Lu, K. (2025, October 27). On-policy distillation. Thinking Machines Lab. https://thinkingmachines.ai/blog/on-policy-distillation/
[5] Zhang, H., Zhang, Z., Wu, G., Ning, W., Guo, J., & Cheng, X. (2026). EGAD: Entropy-guided adaptive distillation for token-level knowledge transfer. arXiv. https://arxiv.org/abs/2605.01732