Neural Cheat Sheets: Learning to Summarize with Reinforcement Learning

We train a model using reinforcement learning to ingest documents and produce the most useful context for downstream tasks. Optimizing with RL for downstream use produces very different artifacts from ordinary summaries: shorter, denser, and often creative at compactly summarizing information. We call these neural cheat-sheets.

Neural cheat-sheets approach the performance of learned KV-caches, which are several orders of magnitude larger and not human-readable, while preserving the auditability of natural language summaries that enterprise workflows require. At deployment time, the model takes in a new knowledge artifact (an enterprise SOP, a meeting transcript, a coding agent trace, or even a novella) and outputs a dense set of notes that captures fine-grained detail and amortizes reasoning. This approach is used to build memory for agents, instantiated at Applied Compute as a Contextbase.

Motivation

The knowledge required to reliably do tasks in production is scattered across human brains and disk drives, and it changes daily. How are agents to make sense of this? There are two canonical approaches to devising context for LLM agents:

Natural language wikis: summaries of documents and tribal knowledge. Auditable but low-capacity. Prompted summaries lose fine-grained details, which can hurt downstream agent performance. Examples: AGENTS.md files, Cognition’s DeepWiki, Applied Compute’s Contextbases.
Latent vector stores, e.g. learned KV caches: document-specific vectors. High-capacity but uninterpretable and model-specific. They often yield the best downstream performance when used as context. Examples: full KV caches, Cartridges (Eyuboglu et al., 2025), Attention-Matching (Zweiger et al., 2026).

The ability to trace and audit agentic context is non-negotiable for the world’s biggest enterprises, and so it should come as no surprise that natural language wikis are the default way enterprise context is managed today. LLMs are prompted to summarize new information into wikis that downstream agents use as context.

When we built Contextbases this way, agent performance improved by leveraging past traces and documents with no changes to model weights, but we also saw that summary quality varied with base models, prompts, and even resampling. Summaries differed across both style (prose, bullets, diagrams, even neuralese) and content (fine vs coarse-grained facts, reasoning, opinions, inferences).

This motivated a natural question: is there an optimal way to summarize documents for downstream use? Can a model learn this? The answer is yes. Below we present how we train models with reinforcement learning to produce high-quality context.

Background & Setup

Problem Formulation

The task here involves taking in a long document and writing a summary. The quality of that summary is measured by giving it to a base model with no other context, and evaluating QA performance on QA pairs about the full, original document. A summary is good context if it suffices to answer unseen QA pairs, which ask about fine-grained details, higher-level inferences, and subtext, in the original document. The QA pairs have a verifiable ground truth answer, and are posed as multiple-choice questions.
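The evaluation protocol above can be sketched in a few lines. This is a minimal illustration rather than the harness used in the experiments: `MCQ`, `Reader`, `mcq_accuracy`, and `toy_reader` are hypothetical names, and the toy reader is a keyword-matching stand-in for a real frozen language model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class MCQ:
    """One multiple-choice question about the original document (hypothetical schema)."""
    question: str
    choices: Sequence[str]
    gold: int  # index of the correct choice


# A reader maps (context, question, choices) to one score per choice.
# In the setting described above this would be a frozen base model (assumption).
Reader = Callable[[str, str, Sequence[str]], Sequence[float]]


def mcq_accuracy(summary: str, mcqs: Sequence[MCQ], reader: Reader) -> float:
    """Fraction of questions answered correctly when the reader sees only the summary."""
    if not mcqs:
        raise ValueError("need at least one question")
    correct = 0
    for q in mcqs:
        scores = reader(summary, q.question, q.choices)
        predicted = max(range(len(q.choices)), key=lambda a: scores[a])
        correct += int(predicted == q.gold)
    return correct / len(mcqs)


def toy_reader(context: str, question: str, choices: Sequence[str]) -> Sequence[float]:
    """Stand-in reader: a choice scores 1.0 if it appears in the context, else 0.0."""
    return [1.0 if c.lower() in context.lower() else 0.0 for c in choices]
```

The key property is that the reader never sees the original document, so any question it answers correctly must be answerable from the summary alone (or from the reader's prior knowledge).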

We train a Qwen3-4B model (Qwen Team, 2025) with reinforcement learning to summarize long documents taken from a few datasets. The interface is the same across datasets: generate a wiki summary of a long document, measure the quality of the summary in various ways and use that as reward. Only the source document genre changes across datasets:

QuALITY (Pang et al., 2022): short stories (~5-10k tokens). Questions ask about motives, chronology, hidden causes, and small plot facts.
QASPER (Dasigi et al., 2021): entire NLP research papers (~4-8k tokens). The original QASPER benchmark has free-form questions with verifiable answers. We use a modified version with multiple-choice questions.
LongHealth (Adams et al., 2024): patient health records (~10-20k tokens per case). Records contain detailed information about diagnosis, prescription, physician-patient interaction, medical history, and more.

Prior Work

Wiki-based approaches. DeepWiki (Cognition, 2025) generates a wiki-style memory (architecture, docs, diagrams) for GitHub repos using the source code as ground truth. Agent context compaction (Codex, Claude Code) has agents summarize their own context mid-task to stay within the context window. All of these write natural-language notes from a fixed prompt. This is precisely the artifact a neural cheat-sheet replaces; it keeps that same format but learns, via RL, what to write so that the notes maximize a downstream reader's utility.

Latent KV caches. Prefix-tuning (Li & Liang, 2021) learns continuous vectors used as a virtual KV cache at every LLM layer while the model stays frozen. Cartridges (Eyuboglu et al., 2025) trains a per-corpus KV-cache offline using synthetic data; Attention-Matching (Zweiger et al., 2026) compresses a long KV-cache into a much shorter one by constructing keys/values that best preserve the attention statistics. Latent methods like these reach the strongest downstream performance but are uninterpretable, model-specific, and very large.

Neural summarization. Summarization has been a longstanding task of interest in NLP. Rush et al. (2015) and Nallapati et al. (2016) pioneered generating abstractive summaries using modern neural methods. Paulus et al. (2017) trained models with a mixed policy-gradient objective that rewards ROUGE overlap directly, an early case of optimizing summaries for a score rather than for next-token likelihood, as we also do here.

Learning to Summarize with RL

Every reward we use builds on one idea: a wiki is good context if it makes the ground truth more likely for the frozen base model. Different notions of “ground truth” lead to different reward variants.

The setup is as follows. The policy $\pi_0$ reads a document $d$ and must write a wiki $w$ that is up to 20x shorter than the full document (enforced with a length/truncation penalty). A frozen base model reads $w$, and we look at the log-probability (logprob) it assigns to some target tokens $x$ (the “ground truth”). The reward is how much conditioning on the wiki raises that log-probability over a baseline context $b$ (a reference summary, or no context at all):

$$R \propto \log p(x \mid w) - \log p(x \mid b)$$
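As a rough sketch of how this reward and the length constraint might fit together: the source does not give the form of the length/truncation penalty, so the linear overshoot penalty below, its `weight`, and all function names are assumptions made purely for illustration.

```python
def logprob_delta(logp_given_wiki: float, logp_given_baseline: float) -> float:
    """log p(x | w) - log p(x | b): positive when the wiki helps more than the baseline."""
    return logp_given_wiki - logp_given_baseline


def length_penalty(wiki_tokens: int, doc_tokens: int,
                   compression: float = 20.0, weight: float = 1.0) -> float:
    """Hypothetical penalty: zero within the token budget, then linear in the relative overshoot."""
    if doc_tokens <= 0:
        raise ValueError("doc_tokens must be positive")
    budget = doc_tokens / compression  # e.g. an 8000-token document gets a 400-token wiki
    overshoot = max(0.0, wiki_tokens - budget) / budget
    return weight * overshoot


def reward(logp_given_wiki: float, logp_given_baseline: float,
           wiki_tokens: int, doc_tokens: int) -> float:
    """Logprob improvement over the baseline context, minus the (assumed) length penalty."""
    return (logprob_delta(logp_given_wiki, logp_given_baseline)
            - length_penalty(wiki_tokens, doc_tokens))
```

For example, with an 8000-token document a 400-token wiki incurs no penalty, while a 600-token wiki overshoots the budget by 50% and loses 0.5 reward under these assumed settings.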

The choice of target $x$ gives rise to two natural variants of this reward: reconstructing the source (unsupervised) and answering questions about it (supervised). Each needs several rounds of shaping, because the policy reliably finds ways to hack the raw signal.

1. Unsupervised: Given an unlabelled long document, the reward for a wiki $w$ is proportional to the decrease in language modeling loss over the original document when you condition only on $w$. Good wikis make the document they are summarizing as predictable as possible. This is important in settings where evals are hard to build, or where it is hard to judge what a “good summary” looks like.

2. Supervised: Some settings have a clear notion of ground truth, for instance a research chatbot answering verifiable questions about a long corpus. Here, we can use this directly to inform the reward. If we have a downstream distribution of usage, we can simply use the increase in logprobs on the correct answer as a training signal.

1) Unsupervised: full-document reconstruction

Start: Logprob Delta on the Document

The most literal reading takes $x$ to be the document itself. We score the wiki by how much more predictable it makes the whole document, averaged over tokens:

$$R_{\mathrm{doc}}(w,d)=\frac{1}{|d|}\sum_{t}\big[\log p(d_t\mid d_{<t},w)-\log p(d_t\mid d_{<t},b)\big]$$

This is fully unsupervised — no questions, no labels — and, surprisingly, it already works: it pushes the wiki to retain the facts that make the source easier to regenerate.
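A minimal sketch of this reward, assuming the per-token log-probabilities of the document have already been computed by scoring it twice with the frozen model (once with the wiki in context, once with the baseline). The function name and the input format are assumptions for illustration.

```python
from typing import Sequence


def doc_reward(logp_with_wiki: Sequence[float],
               logp_with_baseline: Sequence[float]) -> float:
    """Mean per-token logprob improvement of the document when conditioning on the wiki.

    logp_with_wiki[t]     = log p(d_t | d_<t, w)
    logp_with_baseline[t] = log p(d_t | d_<t, b)
    """
    if len(logp_with_wiki) != len(logp_with_baseline) or not logp_with_wiki:
        raise ValueError("expected two non-empty, equal-length logprob sequences")
    deltas = [w - b for w, b in zip(logp_with_wiki, logp_with_baseline)]
    return sum(deltas) / len(deltas)
```

Because the score is a plain mean of per-token deltas, a large gain on a few tokens counts exactly as much as a small gain spread over many tokens.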

Reward Hack: Copying and Easy Exposition

A startling failure mode appears. The cheapest way to raise $\log p(d\mid w)$ is to copy some amount of the source verbatim, since this drastically reduces loss on the prefix of the document (which appears verbatim in the summary) and slightly reduces loss on the tail of the document. The root cause is that this reward function values a large logprob increase on a few document tokens just as much as a small logprob increase on most of the document's tokens.

A copy-hacked wiki abandons compression and is useless for summarization. For instance, for a (fictional) patient’s medical record, one checkpoint consistently output the first 256 tokens of the full document:

Patient id: patient_08
Name: Laura Miller
Birthday: 1967-03-04
Diagnosis: Liver cirrhosis
Patient record — text_0: "Dear colleague, we are writing to report on the
outpatient treatment of Mrs. Laura Miller, born on 04/03/1967, on 05/2..."

Fix: Concavity and Clipping

To blunt this, we pass the per-token delta through a concave, clipped map $\phi$ (a $\tanh$ with a small log-scale) before averaging, so that tokens the wiki already makes predictable give diminishing returns and no region dominates. Combined with the length/truncation penalty, this means the model no longer copies portions of the text outright.

$$R_{\mathrm{doc}}^{\,\mathrm{shaped}}(w,d)=\frac{1}{|d|}\sum_{t}\phi\big(\log p(d_t\mid d_{<t},w)-\log p(d_t\mid d_{<t},b)\big),\qquad \phi=\mathrm{clip}\circ\tanh$$
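The toy comparison below illustrates why the shaping helps. The `scale` and clip bounds are hypothetical (the source only says a $\tanh$ with a small log-scale followed by clipping), and the per-token deltas are made-up numbers, not measurements.

```python
import math
from typing import Sequence


def phi(delta: float, scale: float = 1.0, lo: float = -0.9, hi: float = 0.9) -> float:
    """clip(tanh(delta / scale), lo, hi); scale and bounds are assumed values."""
    return min(hi, max(lo, math.tanh(delta / scale)))


def raw_reward(deltas: Sequence[float]) -> float:
    """Unshaped reward: plain mean of per-token logprob deltas."""
    return sum(deltas) / len(deltas)


def shaped_reward(deltas: Sequence[float], scale: float = 1.0) -> float:
    """Shaped reward: mean of phi applied to each per-token delta."""
    return sum(phi(d, scale) for d in deltas) / len(deltas)


# Copy hack: 10 copied tokens become almost free to predict, the other 90 gain nothing.
copy_hack = [8.0] * 10 + [0.0] * 90
# Genuine summary: every token becomes a little more predictable.
broad = [0.8] * 100
```

Under the raw mean both wikis score about 0.8, so the policy has no reason to prefer the genuine summary. Under the shaped reward the copied tokens saturate at the clip bound, and the broad improvement scores several times higher than the copy hack.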

2) Supervised: MCQ accuracy

Here $x$ is the correct answer to a multiple-choice question. For a given document $d$, let $i=1,\ldots,n$ index its MCQs, and for any per-question quantity $f_i$ write $\Delta f_i(w,b)=f_i(w)-f_i(b)$. The frozen scorer induces a distribution over the answer choices after reading the wiki, with $\ell_i(a;w)$ its logit for choice $a$:

$$p_i(a;w)=\frac{\exp \ell_i(a;w)}{\sum_{a'}\exp \ell_i(a';w)}$$
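In code, this is a softmax restricted to the answer-choice logits. A minimal sketch follows; subtracting the maximum logit first is a standard numerical-stability trick and does not change the result.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Distribution over answer choices from the scorer's per-choice logits."""
    m = max(logits)  # shift by the max so exp() cannot overflow
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

For instance, `softmax([2.0, 0.0, 0.0, 0.0])` puts most of the mass on the first choice, and very large equal logits such as `[1000.0, 1000.0]` still yield `[0.5, 0.5]` instead of overflowing.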

Start: Correct-Answer Probability

The smooth starting point rewards extra probability mass on the gold answer.

$$R_{\mathrm{prob}}(w,d)=\frac{1}{n}\sum_i \Delta p_i(y_i;w,b)$$

Every small gain gives signal, so it is easy to optimize.

Reward hack: Cranking Probability Mass without Flipping Answers

Unfortunately, the model learns that the cheapest way to raise reward is to pile probability onto answers the reader would already get right with little context: confidence (logprobs on the right answer) climbs, but accuracy does not. As the model learns (through the reward signal) the distribution of MCQs that are commonly asked, it also learns to include vivid and unnecessary detail that makes the reader model much more confident in its already correct answer. As a result, the ultimate metric of interest (MCQ accuracy) on the validation set fails to move.
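The toy example below reproduces this effect with made-up logits. It is a sketch under assumed inputs (per-question logit lists for the wiki and the baseline context); `prob_reward` and `accuracy` are illustrative names, not the training code.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def prob_reward(wiki_logits, base_logits, gold) -> float:
    """R_prob: mean gain in gold-answer probability from the baseline context to the wiki."""
    gains = [softmax(w)[y] - softmax(b)[y]
             for w, b, y in zip(wiki_logits, base_logits, gold)]
    return sum(gains) / len(gains)


def accuracy(logits_per_question, gold) -> float:
    """Fraction of questions whose highest-scoring choice is the gold answer."""
    hits = [max(range(len(l)), key=lambda a: l[a]) == y
            for l, y in zip(logits_per_question, gold)]
    return sum(hits) / len(hits)


# Toy numbers (made up): question 0 is already answered correctly, question 1 is not.
gold = [0, 0]
base = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
cranked = [[5.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # more confident on question 0 only
```

Here `prob_reward(cranked, base, gold)` is positive even though `accuracy` is 0.5 for both the baseline and the cranked wiki: the reward goes up without a single answer flipping.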

Fix: Accuracy and a Clipped Margin

In response, we focus the reward on the quantity of interest directly: MCQ accuracy. If there are $n$ questions for a document $d$, then

$$R_{\mathrm{acc}}(w,d)=\frac{1}{n}\sum_i \Delta \mathbf{1}\{\arg\max_a\, p_i(a;w)=y_i\}$$

This matches the headline metric exactly and removes the cranking payoff. However, it is a poor learning signal on its own: almost-correct and hopelessly-wrong wikis both score zero until the argmax flips, so the reward is extremely sparse and hard to learn from. We recover a dense signal by adding a term that captures the correct-answer margin, i.e. the gap between the gold logit and the best distractor, of which accuracy is just the thresholded version:

$$m_i(w)=\log p_i(y_i;w)-\max_{a\neq y_i}\log p_i(a;w),\qquad \mathbf{1}\{\arg\max_a\, p_i(a;w)=y_i\}=\mathbf{1}\{m_i(w)>0\}$$
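A minimal sketch of the margin and its thresholded form, assuming per-choice logits as input and at least two answer choices. Function names are illustrative. Note that the log-softmax normalizer cancels in the difference, so the log-probability margin equals the gap between the gold logit and the best distractor logit.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Log-probabilities over answer choices, computed with a stable log-sum-exp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - lse for l in logits]


def margin(logits: Sequence[float], gold: int) -> float:
    """m = log p(gold) - max over distractors of log p(a); requires >= 2 choices."""
    logp = log_softmax(logits)
    best_distractor = max(lp for a, lp in enumerate(logp) if a != gold)
    return logp[gold] - best_distractor


def is_correct(logits: Sequence[float], gold: int) -> bool:
    """Accuracy indicator as the thresholded margin: correct iff the margin is positive."""
    return margin(logits, gold) > 0.0
```

With logits `[2.0, 1.0, 0.0]` and gold index 0 the margin is 1.0 (correct); with `[0.0, 3.0, 1.0]` it is -3.0 (incorrect). The margin changes smoothly as the wiki improves, while the indicator only changes when the sign flips.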

With a small weight $\lambda\approx0.1$ on the margin term, accuracy says what we want while the clipped margin supplies within-class gradient without letting any one question dominate. This reward works well, leading to summaries that are dense, concise, and often creative.

$$R_{\mathrm{mcq}}(w,d)=\frac{1}{n}\sum_i\Big[\Delta \mathbf{1}\{\arg\max_a\, p_i(a;w)=y_i\}+\lambda\,\mathrm{clip}\big(\Delta m_i(w,b),-1,1\big)\Big]$$
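Putting the pieces together, the combined reward might be computed as in the sketch below. It assumes per-question logits under the wiki and under the baseline context; the function names are illustrative, and the margin is computed directly on logits, which is equivalent to the log-probability margin because the softmax normalizer cancels.

```python
from typing import Sequence


def clip(x: float, lo: float, hi: float) -> float:
    return min(hi, max(lo, x))


def margin(logits: Sequence[float], gold: int) -> float:
    """Gold logit minus the best distractor logit (equal to the log-prob margin)."""
    return logits[gold] - max(l for a, l in enumerate(logits) if a != gold)


def mcq_reward(wiki_logits, base_logits, gold, lam: float = 0.1) -> float:
    """Mean over questions of: accuracy delta + lam * clipped margin delta."""
    total = 0.0
    for w, b, y in zip(wiki_logits, base_logits, gold):
        d_acc = float(margin(w, y) > 0.0) - float(margin(b, y) > 0.0)
        d_margin = margin(w, y) - margin(b, y)
        total += d_acc + lam * clip(d_margin, -1.0, 1.0)
    return total / len(gold)


# Toy numbers (made up): the baseline gets question 0 right and question 1 wrong.
gold = [0, 0]
base = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
cranked = [[5.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # only inflates confidence
flipped = [[1.0, 0.0, 0.0, 0.0], [2.0, 1.0, 0.0, 0.0]]  # fixes question 1
```

On these toy inputs the cranked wiki earns 0.05 while the wiki that actually flips an answer earns 0.55, so confidence inflation is capped at a small fraction of the payoff for a real accuracy gain.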

Results

Neural Cheat-Sheets push the Pareto frontier in context size and downstream accuracy on unseen validation documents. They close most of the gap toward a full KV cache, which is several orders of magnitude larger and not model-agnostic. All compression methods below reduce the number of tokens stored by a factor of 20x, but latent methods store large tensors for each token whereas summaries store only the token itself, and are thus much smaller and cheaper.
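To make the size comparison concrete, here is a back-of-the-envelope calculation. The layer count, KV-head count, head dimension, and 2-byte values are illustrative assumptions for a 4B-class model with grouped-query attention, not figures reported in this post; the 4-byte token id is likewise an assumption.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes a KV cache stores per token: one key and one value vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def text_bytes_per_token(bytes_per_token_id: int = 4) -> int:
    """Bytes to store a summary token as an integer id (assumed 32-bit)."""
    return bytes_per_token_id


# Assumed, illustrative configuration (not taken from the source).
latent = kv_bytes_per_token(num_layers=36, num_kv_heads=8, head_dim=128)
text = text_bytes_per_token()
ratio = latent // text
```

Under these assumptions a latent method stores 147,456 bytes per retained token versus 4 bytes for a text token, a ratio of roughly 37,000x at equal token counts, which is consistent with the "several orders of magnitude" gap described above.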

Consider an example QuALITY wiki for the short story “Lost in Translation.” In it, Korvin, a human agent, is captured and imprisoned by the Tr'en — a hyper-efficient, rigidly logical alien race that has reached the atomic age. The summary wiki produced is:

Lost in Translation, 1972. Theme: language and communication barriers, political systems, logical rigor. Plot: Korvin is captured by the Tr'en, an efficient and scientific society, and interrogated. The Tr'en's rigid logic and literalism cause misunderstandings, especially around democracy. Tr'en society: highly efficient, hierarchical, lacking mental sciences, using lie detectors and hypnopaedic language training. Korvin's mission: gather intelligence and prevent expansion. Crucibles: translation, lie-detector test, governance debate, escape by exploiting lack of mental flexibility. Inference: the title refers to language, politics, and cognitive systems all failing to translate.

When summarizing patient health records in LongHealth, the model sometimes learns an even more compressed notation. This is not prose, but it is recognizably optimized for later retrieval. It is starkly reminiscent of the kind of frenetic, stream-of-consciousness notes a student might put on a 1-page cheat-sheet before an important exam:

PATIENT_19: HLHS with hypertension, TGA, DKS, PDA, VCI, Glenn; status post-Fontan; chronic liver congestion with F2 fibrosis. Treatment: captopril, carvedilol, aspirin; no vincristine; surgeries include DKS plus BT-shunt, Glenn, Fontan. MRI: normal RVEF, mild tricuspid prolapse, Fontan tunnel 12.5mm, VCI 18mm. Liver: hypoechoic, elevated elastography due to congestion, F2 fibrosis. Labs: normal eGFR, low vitamin D, supplement. Follow-up: MRI, LIMAX, vitamin D.

Training a Domain-Agnostic Summarizer

The models above are trained on specific domains (short stories, medical records) and at evaluation time asked to summarize an unseen document from the same domain. But a production memory writer must summarize whatever document it is handed. We study whether one model can learn domain-agnostic note-taking: we train a single summarizer, under the same MCQ reward, on a diverse mixture of long-document genres: Wikipedia articles, QASPER papers, LongHealth records, narrative books, patents, and government reports. When the data distribution does not have MCQs available, we synthesize a toy set. To measure generalization we hold QuALITY out entirely as an unseen domain, and within every training genre we reserve held-out documents the policy never wrote wikis for. Evaluation is unchanged: still MCQ accuracy of a frozen reader given only the generated summary.
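The two kinds of held-out data described above (an entirely unseen genre, plus unseen documents within each training genre) can be sketched as a deterministic split function. The split names, the hash-based bucketing, and the `held_out_fraction` value are all assumptions for illustration; the source does not say how documents were assigned.

```python
import hashlib
from typing import FrozenSet


def assign_split(doc_id: str, genre: str,
                 held_out_genres: FrozenSet[str] = frozenset({"quality"}),
                 held_out_fraction: float = 0.1) -> str:
    """Assign a document to train or to one of two evaluation splits, deterministically."""
    if genre in held_out_genres:
        # Whole genre is excluded from training to measure out-of-domain transfer.
        return "eval_unseen_domain"
    # Hash the id so the assignment is stable across runs and machines.
    digest = hashlib.sha256(f"{genre}:{doc_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return "eval_held_out_docs" if bucket < held_out_fraction else "train"
```

Hashing the document id instead of sampling randomly keeps a document in the same split even if the corpus grows, so a held-out document never leaks into training in a later run.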

The lift transfers out of distribution. On the held-out QuALITY domain, the same policy improves substantially over baseline and essentially matches the accuracy it reaches when QuALITY is included in the training mix. The model has learned how to take useful notes and amortize reasoning in a token-efficient manner, not domain-specific heuristics or knowledge. This allows Neural Cheat-Sheets to plug into a general Contextbase or natural language memory, where new document genres arrive constantly and per-domain retraining is not an option.

Why This Matters

What might Neural Cheat-Sheets look like in production? It would mean a general-purpose context model can be co-trained with a downstream use in mind, and deployed as a middle layer between an enterprise’s operations (which take place on Slack, an ERP, or GitHub) and agents that need this context. In our experiments, the downstream use was answering multiple-choice questions about the documents. In production this looks more like outputting high-quality context that agents use when making PRs, designing slide decks, and reviewing redlines.

Applied Compute works at the nexus of reinforcement learning, context, and enterprise agents. The problem that Neural Cheat-Sheets solve requires a robust RL stack, delicate reward shaping, and a visceral understanding that auditable context is non-negotiable for enterprises in production. An important area for future work involves learning to edit memory stores (Contextbases) in addition to adding to them. Neural Cheat-Sheets append to them, but real memories need to be updated, often in surprisingly subtle ways, as new information comes in and old information becomes stale. A compelling next step would be to understand how to shape rewards so that this functionality can also be trained end to end.

References

Eyuboglu, S., Ehrlich, R., Arora, S., et al. (2025). Cartridges: Lightweight and General-Purpose Long Context Representations via Self-Study. arXiv:2506.06266
Zweiger, A., Fu, X., Guo, H., & Kim, Y. (2026). Fast KV Compaction via Attention Matching. arXiv:2602.16284
Li, X. L., & Liang, P. (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation. ACL 2021. arXiv:2101.00190
Cognition AI (2025). DeepWiki: AI docs for any repo. cognition.ai/blog/deepwiki
Rush, A. M., Chopra, S., & Weston, J. (2015). A Neural Attention Model for Abstractive Sentence Summarization. EMNLP 2015. arXiv:1509.00685
Nallapati, R., Zhou, B., dos Santos, C., Gulçehre, Ç., & Xiang, B. (2016). Abstractive Text Summarization using Sequence-to-Sequence RNNs and Beyond. CoNLL 2016. arXiv:1602.06023
Paulus, R., Xiong, C., & Socher, R. (2017). A Deep Reinforced Model for Abstractive Summarization. arXiv:1705.04304
Qwen Team (2025). Qwen3 Technical Report. arXiv:2505.09388
Pang, R. Y., Parrish, A., Joshi, N., Nangia, N., Phang, J., Chen, A., Padmakumar, V., Ma, J., Thompson, J., He, H., & Bowman, S. R. (2022). QuALITY: Question Answering with Long Input Texts, Yes! NAACL 2022. arXiv:2112.08608
Dasigi, P., Lo, K., Beltagy, I., Cohan, A., Smith, N. A., & Gardner, M. (2021). A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers. NAACL 2021. arXiv:2105.03011
Adams, L., Busch, F., et al. (2024). LongHealth: A Question Answering Benchmark with Long Clinical Documents. arXiv:2401.14490

Evaluation Details

Baselines (accuracy and compaction time / size) for other memory approaches are taken from the Attention Matching repo (Zweiger et al., 2026).
The summarization baseline is a model simply prompted to produce a short summary that preserves as much information and key content as possible.
For datasets with no accompanying evaluation set, we generate synthetic multiple-choice questions for each document using frontier model APIs.