Continual learning and the post-monolith AI era


Introduction

The framing in this section comes from Beren Millidge.

Beren says that continual learning is fundamentally in tension with long-term memory. Memories don’t store information directly but rather store pointers to distributed neural representations. When you recall something, you’re reactivating a pattern across many neurons. But continual learning keeps updating those very representations. The paths your memories point to drift out of sync with where they’re supposed to lead.

This generalizes to any modular system. If two subsystems communicate via representations that both evolve independently, their shared language eventually breaks down. Module A learns to send signal X for “danger.” Module B learns that X means “danger.” Then A updates, and now it sends Y. But B is still listening for X, and the interface is corrupted.

The brain’s solution is to freeze the interface. After some critical period (the highly neuroplastic period of childhood), the communication channel between systems becomes fixed. Each subsystem can still learn internally — it builds encoders and decoders that map between its evolving representations and the frozen channel. This is why critical periods exist: they establish stable interfaces so the rest of the pipeline doesn’t catastrophically break every time you learn something new.

But notice what this implies. If the interface is frozen, all subsequent learning happens relative to that frozen base. The base constrains what you can learn and how efficiently you can learn it. The more you’ve already committed to the base, the harder it becomes to add more. This is why adults struggle with languages that children acquire effortlessly: the phonological interface froze decades ago.

You might object: adults still learn for decades; we remain generally intelligent, and continual learning isn’t catastrophic for us.

But humans are far more specialized than LLMs. Our knowledge is patchy, spiky. We pick one narrow sub-area and learn it deeply. A surgeon doesn’t also know tax law and fluid dynamics and Mandarin. The brain solves continual learning by not attempting generality in the first place.

The Big Labs face a harder problem: take a model that already knows everything, then force it to learn more, without forgetting what it knew.

The cost of continual learning scales with generality.

A model that knows everything must maintain consistency across all its representations whenever it learns anything new. Every update risks breaking something unrelated. You could design a synchronization system that keeps stored knowledge aligned with drifting representations, but this eliminates the computational advantage of long-term memory, which is that you can largely leave it alone unless needed. The maintenance burden grows with what you’re storing.

A model that only knows insurance claims has no such problem: it can update freely because there is little else to interfere with and little else to keep synchronized. Once you establish your base representations, all new learning happens relative to them. The more you’ve crammed in, the harder it is to add more without interference, and the more expensive it becomes to maintain what you have.

This is Adam Smith’s division of labor applied to neural networks. Generality has overhead, and that overhead grows faster than linearly with scope. Specialization sidesteps the problem entirely. A medical coding model establishes interfaces optimized for medical concepts and improves within that bounded space, where catastrophic forgetting becomes a local problem rather than a global one, and synchronization costs stay manageable.

The intractability of continual learning in generalist models is the invisible hand pushing deployment toward narrow experts.

Some of the obvious and less obvious ideas in the literature

CARTRIDGES. The idea here is to construct a KV cache that efficiently compresses prior knowledge into a dense learned representation. This is done by prompting the model to ‘self-study’ the prior knowledge and using backpropagation to update the KV cache instead of the model weights.
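To make this concrete, here is a minimal toy sketch of the mechanism (our own illustration, not the paper's code: a single frozen attention layer with a hypothetical trainable key/value bank standing in for the cartridge):

```python
# Toy sketch of the CARTRIDGES idea (our own illustration, not the paper's code):
# freeze the "pre-trained" weights and backpropagate a self-study loss only into
# a small trainable key/value bank that stands in for the learned KV cache.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, vocab, n_slots = 64, 100, 16

# Frozen backbone (a stand-in for the pre-trained model).
embed = torch.nn.Embedding(vocab, d)
W_q = torch.nn.Linear(d, d, bias=False)
W_out = torch.nn.Linear(d, vocab, bias=False)
for p in [*embed.parameters(), *W_q.parameters(), *W_out.parameters()]:
    p.requires_grad_(False)

# The "cartridge": trainable keys and values that compress prior knowledge.
cart_k = torch.nn.Parameter(torch.randn(n_slots, d) * 0.02)
cart_v = torch.nn.Parameter(torch.randn(n_slots, d) * 0.02)
opt = torch.optim.Adam([cart_k, cart_v], lr=1e-2)

def forward(tokens):
    x = embed(tokens)                                     # (T, d)
    attn = F.softmax(W_q(x) @ cart_k.T / d**0.5, dim=-1)  # attend over cartridge slots
    return W_out(x + attn @ cart_v)                       # (T, vocab) logits

# "Self-study": next-token prediction on text drawn from the corpus we want the
# cartridge to absorb; only cart_k / cart_v receive gradients.
corpus = torch.randint(0, vocab, (128,))
for step in range(200):
    loss = F.cross_entropy(forward(corpus[:-1]), corpus[1:])
    opt.zero_grad(); loss.backward(); opt.step()
```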

In fact, this idea is quite old. In 2019, FAIR published Augmenting Self-attention with Persistent Memory. This paper highlights the following point: a trained KV cache is mathematically very similar to a multi-layer perceptron (MLP) layer. Taking a dot product with a set of keys is equivalent to multiplying with an up-projection matrix, the softmax is then a suitable activation function, and a linear combination of value vectors is equivalent to multiplying with a down-projection matrix. For this reason, the authors propose doing away with the MLP entirely: we simply make use of two types of KV vectors, static and dynamic. This makes the problem of continual learning even more tantalizing: it makes you wonder how we might move dynamic KV vectors into the static KV cache.
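In symbols (our notation), the correspondence looks like this:

```latex
% Attention of a query q over a set of learned persistent keys K and values V:
\mathrm{Attn}(q) = \operatorname{softmax}\!\left(q K^{\top}\right) V
% A feed-forward block with up-projection W_1, activation f, down-projection W_2:
\mathrm{FFN}(q) = f\!\left(q W_{1}\right) W_{2}
% Reading K^T as W_1, V as W_2, and softmax as the activation f,
% the two computations have the same form.
```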

State Space Models

One idea comes from the paper Mixture of a Million Experts, which pushes MoE to its logical extreme: a million experts, each a single neuron. When we read this paper, our first thought was “what if you had a mixture of a trillion experts, and continual learning was just updating the router for new tasks?” With a trillion experts, you basically have the primitives for all function composition, and your router can just map expressively onto them. But this also felt like overkill, and like a move away from Beren’s representation-based world. It felt like we should still be routing to and updating meaningful latents, in some form, rather than very atomic functional primitives.
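For reference, here is a hedged sketch of what a layer of single-neuron experts with a learned router looks like (a naive version: the actual paper uses product-key retrieval to make the top-k lookup cheap at a million experts, and all names here are made up):

```python
# Hedged sketch of a layer of single-neuron experts with a learned router.
import torch
import torch.nn.functional as F

d, n_experts, k = 64, 10_000, 8                  # scaled-down stand-in for "a million"
router_keys = torch.randn(n_experts, d) * 0.02   # one routing key per expert
expert_in   = torch.randn(n_experts, d) * 0.02   # each expert: one input vector...
expert_out  = torch.randn(n_experts, d) * 0.02   # ...and one output vector (a single neuron)

def peer_layer(x):                               # x: (d,)
    scores = router_keys @ x                     # route: similarity to every expert
    top_val, top_idx = scores.topk(k)            # keep only the k best-matching experts
    gate = F.softmax(top_val, dim=-1)            # normalise their routing scores
    acts = F.gelu(expert_in[top_idx] @ x)        # each selected neuron fires once
    return (gate * acts) @ expert_out[top_idx]   # gated sum of their outputs, shape (d,)

y = peer_layer(torch.randn(d))
# In the trillion-experts thought experiment above, continual learning would mean
# updating only router_keys while the expert vectors stay put.
```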

Sparse memory fine-tuning makes this idea explicit as an update rule. Given a memory layer (architecturally similar to Mixture of a Million Experts), they ask: which slots should you actually update when learning something new? Their answer is TF-IDF scoring (a simple term-importance heuristic): rank memory indices by how specific they are to the new input versus a background corpus. This identifies indices that are activated by this knowledge but not by everything else, avoiding the general-purpose slots that would cause interference. This is Beren’s framework quite explicitly: a frozen addressing mechanism plus a simple heuristic (update what’s specific, leave what’s general) that sidesteps the need to learn how to learn.
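Our reading of that selection step, as a rough sketch (the exact scoring in the paper may differ):

```python
# Rough sketch of TF-IDF-style slot selection: rank memory indices by how
# specific they are to the new input versus a background corpus, and only
# update the top-ranked slots.
import math
from collections import Counter

def select_slots(new_input_slots, background_docs_slots, top_k=32):
    tf = Counter(new_input_slots)        # how often each slot fires on the new input
    df = Counter()                       # in how many background documents each slot fires
    for doc in background_docs_slots:
        df.update(set(doc))
    n_docs = max(len(background_docs_slots), 1)
    # High score = activated by this knowledge, but not by everything else.
    scores = {i: tf[i] * math.log(n_docs / (1 + df[i])) for i in tf}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Only the returned slot indices would receive gradient updates; general-purpose
# slots (high document frequency) score low and stay frozen.
```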

Nested Learning

There’s a recent paper from DeepMind on nested learning, which starts with the observation that architecture and optimization aren’t separate things. Rather, they’re the same concept operating at different timescales. A transformer’s attention mechanism is associative memory updating every forward pass. The feedforward layers are associative memory updating at training time. Backprop itself is associative memory, mapping inputs to their prediction errors. Once you see this, the whole model becomes a hierarchy of nested optimization problems, each with its own update frequency. This leads naturally to what they call “continuum memory systems”: instead of a binary split between short-term (attention) and long-term (weights), you get a spectrum of memory modules updating at different rates. Their proof-of-concept architecture, Hope, implements this as a self-modifying recurrent network that can optimize its own memory through self-reference. From the Beren framing, this is another way of establishing stable interfaces: if different components update at different frequencies, the slow-updating ones become the frozen channel through which the fast-updating ones communicate. You get modularity through temporal separation of learning rates. (This builds on Titans; see below.)
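To make “temporal separation of learning rates” concrete, here is a toy update schedule (our own illustration, not the Hope architecture):

```python
# Toy illustration of a "continuum memory system": parameter groups updating at
# different frequencies. The slow group behaves like a near-frozen interface,
# the fast group like short-term memory. (Our own sketch, not the Hope architecture.)
import torch

fast = torch.nn.Linear(64, 64)   # updates every step
mid  = torch.nn.Linear(64, 64)   # updates every 16 steps
slow = torch.nn.Linear(64, 64)   # updates every 256 steps

groups = [(fast, 1), (mid, 16), (slow, 256)]
opts = [torch.optim.SGD(m.parameters(), lr=1e-2) for m, _ in groups]

def model(x):
    return fast(torch.relu(mid(torch.relu(slow(x)))))

for step in range(1024):
    x = torch.randn(8, 64)
    loss = (model(x) - x).pow(2).mean()   # dummy reconstruction objective
    loss.backward()
    for (module, every), opt in zip(groups, opts):
        if step % every == 0:             # slow groups accumulate gradients and
            opt.step()                    # only commit them occasionally
            opt.zero_grad()
```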

We don’t know enough about this to make an informed comment, but it feels a bit flawed. It introduces a lot of manual, qualitative design choices that are hard to justify (how many timescales, what the update frequency of each timescale should be, and so on). To us, it seems much cleaner to have a minimal set of discrete components that emulate our meta-learning algorithm: a long-term memory like the parameter weights, a short-term memory like the KV cache, and a way to inject the relevant bits of short-term memory into long-term memory when we need to. But we could be wrong. Maybe this is a solution.

Surprise-Based Learning

The main paper we read on this was entropy-adaptive fine-tuning (EAFT). EAFT identifies confident conflicts: tokens where the model has low entropy (a strong prior) but assigns low probability to the training label (the label contradicts that prior). These generate massive gradients that overwrite existing representations, causing forgetting. Their fix is to weight the loss by normalised entropy, suppressing gradients when the model is confident. This dramatically reduces catastrophic forgetting.
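A rough sketch of that loss weighting as we understand it (details may differ from the paper):

```python
# Rough sketch of entropy-adaptive weighting: scale each token's cross-entropy
# by the model's normalised predictive entropy, so confident (low-entropy)
# tokens contribute only small gradients.
import math
import torch
import torch.nn.functional as F

def eaft_loss(logits, targets):
    # logits: (T, vocab), targets: (T,)
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # per-token entropy
    weight = entropy / math.log(logits.size(-1))          # normalise by log|V|, so weight is in [0, 1]
    ce = F.cross_entropy(logits, targets, reduction="none")
    # A confident conflict (low entropy, high CE) is down-weighted rather than
    # producing the huge gradient that would overwrite existing representations.
    return (weight * ce).mean()

loss = eaft_loss(torch.randn(10, 100), torch.randint(0, 100, (10,)))
```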

But I’m skeptical. The whole point of learning is sometimes you need to override a confident prior. The president of the United States changed from Obama to Trump. A fact you were confident about turned out to be wrong. EAFT treats all confident conflicts as damage to be avoided, but some of them are exactly the updates you want. The real problem is that standard architectures don’t have a clean separation between “what I know” and “how I reason”, so updating one fact corrupts everything else. Suppressing those updates just means you never learn the new facts. 

Test-Time Training

Titans. This takes a memory-centric view of sequence modeling. They say attention (due to its limited context but accurate dependency modeling) acts as short-term memory, while a neural network that learns to compress information into its weights can act as long-term memory. Their neural memory module is essentially a meta-model that learns how to memorize at test time, using gradient descent on an associative memory loss. The clever bit is their surprise metric: an event that violates expectations (high gradient) is more memorable, but they decompose this into past surprise (momentum) and momentary surprise (current gradient), which prevents the model from missing important information after a big surprising moment. They also add a forgetting mechanism via weight decay, which they show is actually a generalisation of the gating in Mamba and friends. This feels like (1) a more principled surprise-based approach than EAFT, but simultaneously (2) re-deriving RNNs/LSTMs and the like, so we’re not sure how we feel about it. 
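Roughly, as we understand the paper, the memory update combines the two kinds of surprise with a decay term (notation simplified):

```latex
% Momentary surprise: the gradient of the associative-memory loss on the current
% input x_t. Past surprise: a momentum term S over previous surprises.
S_t = \eta_t\, S_{t-1} \; - \; \theta_t\, \nabla \ell\!\left(M_{t-1};\, x_t\right)
% Memory update with a forgetting gate \alpha_t (weight decay), which the paper
% relates to the gating in Mamba-style recurrences.
M_t = \left(1 - \alpha_t\right) M_{t-1} \; + \; S_t
```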

A related and recent paper takes a simpler approach: just keep training a standard transformer with sliding-window attention at test time via next-token prediction on the context it’s reading. The model compresses context into its weights rather than storing every key-value pair explicitly. To make this work well, they use meta-learning at training time to prepare the model’s initialization for test-time training, so the outer loop optimizes for “how good will this model be after it’s been updated on the test context”. The tradeoff is worse needle-in-a-haystack retrieval, which makes sense: the whole point is lossy compression, not lossless recall. I actually don’t mind this, as it feels a lot closer to what humans are doing. At the same time, I think we still need a lossless form of short-term memory that we can more efficiently swap to, without defaulting to test-time training.
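A hedged toy sketch of that test-time loop (our own illustration; the paper additionally relies on sliding-window attention and a meta-learned initialization, and the model/optimizer here are assumptions):

```python
# Toy sketch of test-time training on the context: keep doing next-token
# prediction on the document being read, so the context is compressed into the
# weights instead of an ever-growing KV cache. Any causal LM returning
# (batch, T, vocab) logits would do here.
import torch
import torch.nn.functional as F

def test_time_update(model, optimizer, context_tokens, chunk=512):
    model.train()
    for start in range(0, len(context_tokens) - 1, chunk):
        window = context_tokens[start : start + chunk + 1]
        if len(window) < 2:
            break
        logits = model(window[:-1].unsqueeze(0))                   # (1, T, vocab)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), window[1:])
        optimizer.zero_grad(); loss.backward(); optimizer.step()   # weights absorb the context
    model.eval()
```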

There are also a bunch of other interesting related papers worth reading, whether about extending the effective context window (e.g. lightning attention or landmark tokens) or other architectural changes like Memformer.

The final idea goes back to CARTRIDGES, which feels spiritually very close to sparse memory fine-tuning. Again, if you treat your bank of CARTRIDGES as an associative memory, with the pre-trained transformer backbone establishing the interface, and you figure out how to retrieve and update relevant CARTRIDGES continuously, then you’ve got Beren’s system. But maybe you also want to be able to slowly add CARTRIDGES back into the weights, which is a whole other question entirely.

The common thread
