A BDH seminar summary circulating in recent technical discussion frames LLM memory as a tradeoff between the familiar transformer KV cache and a much larger memory space embedded in network weights and neuron-like activations. The cited material, based on Jan Chorowski’s seminar slides, describes an architecture where memory is not mainly stored as an ever-growing token history but read from a high-dimensional activation space.
The core claim is specific: standard transformers keep long-term knowledge compressed into weights and short-term session state in the KV cache, while the BDH proposal shifts more of that working memory into fixed model structure. In the source discussion, keys and queries are described as neuron activations in a large positive sparse space, with memory readout treated as graph propagation through an accumulated connectivity matrix.
LLM memory in network weights versus KV cache
The source discussion describes transformer memory as split into two parts. One part is the static information learned during pretraining and stored in the model weights; the other is the short-term context built up during a session in the KV cache, which grows with the token count.
That is the specific LLM memory tradeoff under discussion. According to the seminar summary, BDH replaces the growing token-by-token cache with a fixed-size but much higher-dimensional state, so the model’s memory machinery lives more directly in the network’s own representational space.
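The cited material includes no code, but the tradeoff is easy to sketch. The toy script below is a rough illustration, not the architecture itself: it compares a KV cache whose entry count grows with the number of tokens against a fixed-size sparse activation state. The 10^3 and 10^7 widths echo the figures reported from the slides; the active fraction is an assumption.

```python
# Widths echo the reported figures: ~1e3 key-query dimensions for a
# transformer, >1e7 for the BDH-style neuron space. The active fraction
# is an illustrative assumption, not a number from the slides.
d_model = 1_000
n_neurons = 10_000_000

def kv_cache_entries(num_tokens: int, width: int = d_model) -> int:
    """Entries held by a KV cache: grows linearly with the token count."""
    return 2 * num_tokens * width  # keys plus values

def fixed_state_entries(active_fraction: float = 1e-4,
                        width: int = n_neurons) -> int:
    """Active entries in a fixed-size sparse state: independent of token count."""
    return int(active_fraction * width)

for t in (1_000, 10_000, 100_000):
    print(f"{t:>7} tokens | KV cache ~{kv_cache_entries(t):,} values | "
          f"fixed sparse state ~{fixed_state_entries():,} active units")
```

The only point of the sketch is the scaling behavior: one number keeps growing with the session, the other does not.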
NovaKnown has covered related architecture-level memory questions before, both in its AI memory system coverage and in broader stack-level breakdowns of LLM failure modes. Here, the reported claim is narrower: not just “better memory,” but a different place to store and access it.
What the BDH seminar slide claims about memory space
The seminar summary says Chorowski’s slide sets keys and queries equal to neuron activations in high-dimensional space. Instead of comparing a compact query vector against a list of past keys, the architecture uses an accumulated connectivity matrix, labeled sigma in the discussion, and reads memory as graph propagation.
That is why the discussion does not describe BDH as simply “linear attention.” The cited quote from the seminar summary says you cannot replace a nonlinear attention layer with a linear attention layer and leave the rest of the model unchanged; the memory space itself also changes.
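The source gives no equations for sigma, and the quote above warns against reading the design as a plain linear-attention swap. With that caveat, the toy sketch below shows only the accumulate-and-propagate shape of the description: writes add outer products of sparse, non-negative activations into a connectivity matrix, and reads propagate a query activation along the accumulated edges. The sparsifier, sizes, and step count are assumptions for illustration, not the BDH mechanism itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512  # toy neuron count; the reported space is far larger (>1e7)

def sparse_positive(x: np.ndarray, k: int = 16) -> np.ndarray:
    """Keep the top-k non-negative entries and zero the rest (assumed sparsifier)."""
    out = np.maximum(x, 0.0)
    cutoff = np.partition(out, -k)[-k]
    return np.where(out >= cutoff, out, 0.0)

def write(sigma: np.ndarray, key_act: np.ndarray, value_act: np.ndarray) -> None:
    """Accumulate connectivity as an outer product of sparse positive activations."""
    sigma += np.outer(value_act, key_act)  # in-place update of the shared matrix

def read(sigma: np.ndarray, query_act: np.ndarray, steps: int = 1) -> np.ndarray:
    """Read memory by propagating a query activation along accumulated edges."""
    state = query_act
    for _ in range(steps):  # more steps propagate further through the graph
        state = sparse_positive(sigma @ state)
    return state

sigma = np.zeros((n, n))                      # accumulated connectivity matrix
key_act = sparse_positive(rng.normal(size=n))
value_act = sparse_positive(rng.normal(size=n))
write(sigma, key_act, value_act)
print(read(sigma, key_act).nonzero()[0])      # neurons recalled for that query
```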
The reported comparison is unusually concrete. One slide is summarized as claiming more than 10^7 key-query dimensions for BDH versus about 10^3 for transformers. In that framing, short-term states are projected into a fixed, positive, very high-dimensional space rather than appended to a token-indexed cache.
Why the architecture emphasizes sparse, high-dimensional activations
The source discussion says the key-query space in BDH is intended to be large, sparse, and positive, with activations treated as neuron-like rather than as small abstract vectors. That matters because the proposal ties memory access to non-negative activations and graph-style connectivity, not just token sequence lookup.
In the summary, that larger space is described as making short-term memory “more expressive and manipulable” than a KV cache. The material also distinguishes this from state-space models that compress recurrent state into a smaller matrix; here, the state is instead projected directly into neuron space.
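The materials do not specify how token states end up in that neuron space. A common way to realize “positive and sparse,” assumed here purely for illustration, is a projection followed by rectification and a top-k cutoff. The sketch below lifts a compact token state into a much larger non-negative, mostly-zero code; the projection is random rather than learned, and all sizes are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_neurons = 256, 65_536  # placeholder sizes, far below the reported >1e7

# Hypothetical random projection into neuron space; a real model would learn
# this mapping, and the cited material gives no specific parameterization.
W_up = rng.normal(scale=1.0 / np.sqrt(d_model), size=(n_neurons, d_model))

def to_neuron_space(x: np.ndarray, k: int = 64) -> np.ndarray:
    """Lift a compact token state into a positive, sparse, high-dimensional code."""
    act = np.maximum(W_up @ x, 0.0)       # non-negative activations
    cutoff = np.partition(act, -k)[-k]    # keep only the k strongest neurons
    return np.where(act >= cutoff, act, 0.0)

x = rng.normal(size=d_model)
code = to_neuron_space(x)
print(code.shape, int((code > 0).sum()))  # (65536,) with roughly 64 active units
```

The sparsity level k is the kind of knob the “sparse and positive” framing points at, though the actual mechanism in BDH may differ.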
That design overlaps with other non-transformer lines of work NovaKnown has covered, including its spiking neural network coverage, where sparse activity and alternative memory behavior also become central design choices. In the BDH description, sparse high-dimensional activations are not an extra flourish; they are the memory mechanism being proposed.
What the cited materials say about training and implementation limits
The summary also includes the blunt implementation constraint: a full neuron-by-neuron connectivity matrix is too large to use directly. That is the clearest practical limit acknowledged in the cited material.
The note matters because the architecture’s advertised memory space is enormous by design. If keys and queries live in a space above 10^7 dimensions, a naive dense matrix over all pairwise connections has on the order of 10^14 entries, which quickly becomes intractable in both storage and computation.
The discussion points to sparsity as the obvious answer, though the summary does not provide a full implementation recipe. It does, however, make the constraint explicit: the proposal is not that an unrestricted dense memory matrix is cheap, but that the architecture needs sparse structure to make that memory space usable at all.
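To see why sparsity is load-bearing rather than optional, it helps to put numbers on the naive version. The sketch below contrasts the storage implied by a dense matrix over the reported >10^7 neurons with a write that only touches the currently active neuron pairs; the 100-active-neurons-per-write figure is an assumption, not a number from the cited material.

```python
import numpy as np

n = 10_000_000  # neuron count in the reported >1e7 regime

# A dense float32 connectivity matrix over every neuron pair is out of reach:
dense_bytes = n * n * 4
print(f"dense sigma would need ~{dense_bytes / 1e12:.0f} TB")  # roughly 400 TB

# With sparse activations, one write only touches the cross product of the
# active key and value neurons, so the set of nonzero entries stays small.
rng = np.random.default_rng(2)
sigma: dict[tuple[int, int], float] = {}

def sparse_write(key_idx: np.ndarray, value_idx: np.ndarray) -> None:
    """Accumulate connectivity only between currently active key and value neurons."""
    for i in value_idx:
        for j in key_idx:
            pair = (int(i), int(j))
            sigma[pair] = sigma.get(pair, 0.0) + 1.0

# One write with 100 active neurons on each side (an assumed activity level).
sparse_write(rng.choice(n, 100, replace=False), rng.choice(n, 100, replace=False))
print(f"nonzeros after one write: {len(sigma):,} (vs {n * n:,} dense entries)")
```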
Key Takeaways
- LLM memory in the cited BDH discussion is framed as a shift from growing KV cache state toward memory represented in network weights and neuron-like activations.
- The seminar summary says BDH treats keys and queries as activations in a high-dimensional positive sparse space, with memory readout handled through graph propagation.
- One reported slide compares more than 10^7 key-query dimensions in BDH with roughly 10^3 in standard transformers.
- The cited material explicitly notes that a full neuron-by-neuron connectivity matrix is too large to use naively.
- The proposal is described as more than just shrinking or linearizing attention; it changes the memory space itself.
Further Reading
- AI Memory System: Why MemPalace Matters More Than Fame, NovaKnown’s prior coverage on AI memory systems and architectural memory framing.
- LLM Failure Modes Start in the Stack, Not the Chat, Background on where model failures show up across the stack.
- Spiking Neural Network Hits 1B Parameters, Hints at New Behavior, Related coverage on alternative neural architectures and sparse activity.
