When I was in boarding school in Johannesburg, at the peak of my reading habit and intellectual curiosity, I came across a video for entrepreneurs in which Patrick Bet-David emphasized that reading the right book at the wrong time can be very damaging.
This was the first time I had considered the possibility that a habit as positive as reading could have significant unintended negative effects. For example, reading The 4-Hour Work Week when you are starting your first job can be very damaging, because that is when you are supposed to be putting in the outsized investments of effort and time that compound throughout your career.
The converse is also true by the way – reading the right book at the right time can change your life for the better. There are even more natural extensions to this principle beyond books. For instance, recalling a piece of knowledge from a book, video, or conversation at the right time can transform your experiences.
For instance, I had read in a software engineering book about the importance of organizing two kickoff meetings. The first should involve all stakeholders, including non-technical ones like UX and Legal, to clarify the scope and expectations of the project; this step is often not even considered. The second is for Engineering to clarify responsibilities, deliverables, and timelines. While I could have “just remembered” these steps, rereading my summary notes when I was starting a new project transformed this from couch advice into career-impacting behavior.
What I have described so far is the importance of timing in memory, along with an allusion to how hard this problem is to solve: getting the right context (from memory) at the right time.
It turns out this is quite a challenging problem for AI systems to solve today. In fact, I posit that it is such a big challenge that it will be the kind of blocker to AGI that requires a significant redesign of the transformer architecture powering today’s ubiquitous Generative AI technologies.
Quick disclaimer: I do believe we will rapidly experience powerful deployments of Artificial Intelligence that will transform our society significantly. I do NOT believe AGI is around the corner. My favorite definition of AGI here is an AI system good enough that major tech companies like Google and Microsoft fire nearly all of their knowledge workers. Finally, there are folks who work much closer to this technology on a daily basis, and some of them disagree with my position for good reasons. However, given the AGI hype, it seems important that folks who are not as close to the research developments are aware that we will need breakthroughs just as big as the one brought by the transformer architecture.
In this piece, I want to lay out key facts about the evolution of the architecture used by many LLMs today, the key innovations/gaps needed to cross the chasm to AGI, and finally the implications for consumers, builders, and investors.
Architecture Evolution
Transformers are the transformative technology that has spurred the generative AI revolution, even though they were not a complete solution to the AI research problems of the time.
Recurrent Neural Networks (RNNs) were more or less state of the art pre-2017. They processed tokens sequentially, carrying a fixed-size hidden state that was “updated” after every embedded token in the input or output. The problem, however, was information loss in that hidden state, because too much had to be packed into it.
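To make the bottleneck concrete, here is a minimal sketch (plain NumPy, with made-up dimensions and random weights, not any real model) of how a vanilla RNN folds an entire sequence into a single fixed-size hidden state:

```python
import numpy as np

# Illustrative sizes only: a tiny vanilla RNN cell with random weights.
embedding_dim, hidden_dim = 8, 16
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(hidden_dim, embedding_dim))  # input -> hidden
W_hh = rng.normal(size=(hidden_dim, hidden_dim))     # hidden -> hidden
b_h = np.zeros(hidden_dim)

def encode(token_embeddings):
    """Fold a whole sequence into ONE fixed-size hidden state."""
    h = np.zeros(hidden_dim)
    for x in token_embeddings:          # one update per token, in order
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
    return h  # everything the model "remembers" must fit in here

sequence = rng.normal(size=(100, embedding_dim))  # 100 token embeddings
summary = encode(sequence)
print(summary.shape)  # (16,) -- 100 tokens squeezed into 16 numbers
```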
Transformers proposed a new architecture that handles long-range dependencies better: instead of a sequentially updated hidden state, the transformer can pay more or less attention to different parts of the input/output (leveraging multi-head attention and positional encodings) for the token it is currently generating; in other words, having the right context at the right time.
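For contrast, here is a rough sketch of single-head scaled dot-product attention, again in plain NumPy with illustrative sizes. In a real transformer the queries, keys, and values come from learned projections and there are multiple heads plus positional encodings, but the core point stands: every generated token can weight every position directly.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: each query can weight every position directly."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # relevance of each position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over positions
    return weights @ V                                  # context-weighted mix of values

rng = np.random.default_rng(1)
seq_len, d_model = 10, 16
x = rng.normal(size=(seq_len, d_model))
# In a real transformer, Q, K, and V are learned projections of x.
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (10, 16): every token attends to all 10 positions at once
```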
While the transformer was successful in other significant ways, including much faster and cheaper training as well as incredible versatility beyond text to images and videos, it only solved a subset of the problem: having the right context, from the input/output, at the right time.
The distinction between these two problems is well known to AI researchers. Even before the transformer paper, researchers explored designs that connect neural networks to a read/write long-term memory; more recently, the more successful approach has been Retrieval-Augmented Generation (RAG), which gets the right context from storage into the limited context window. All of these efforts acknowledge the importance of context, whether it is context the model has written to storage to understand a user’s preferences or context RAG pulls in from the internet.
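As a rough illustration of the RAG idea, the sketch below retrieves the stored snippets most similar to the question and prepends them to the prompt. The embed function here is a hypothetical stand-in that returns arbitrary vectors just to keep the example self-contained; a real system would call an embedding model and a vector store so that similarity actually reflects meaning.

```python
import numpy as np

# Hypothetical stand-in for a real embedding model; the vectors it returns are
# arbitrary, so similarities in this toy are not meaningful -- only the shape
# of the pipeline is.
def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

documents = [
    "Kickoff meeting one: all stakeholders align on scope and expectations.",
    "Kickoff meeting two: engineering agrees on responsibilities and timelines.",
    "Unrelated note about lunch options near the office.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str, k: int = 2) -> list[str]:
    """Pick the k documents most similar to the query (cosine similarity)."""
    scores = doc_vectors @ embed(query)
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

question = "How should I start a new project?"
context = "\n".join(retrieve(question))
prompt = f"Context:\n{context}\n\nQuestion: {question}"
print(prompt)  # only the retrieved snippets make it into the limited window
```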
Transformer Gaps
Beyond the fact that Transformers only solved a subset of the memory/context problem, there are also surface-level and fundamental gaps in how LLMs operate today: catastrophic forgetting, U-shaped memory, and fictitious memory.
Catastrophic forgetting is, in a sense, primarily human-induced. Remember in 2023 when everyone who cared would complain that AI chatbots had knowledge cutoffs and so could not answer who won yesterday’s FA Cup final? The main solution adopted in industry to address the knowledge problem was continuous pre-training. This process adapts the models to specialized domains, enhances their coding capabilities, and expands their linguistic range too.
However, these pre-training updates change the model’s weights, which can lead to a loss of previously acquired capabilities or knowledge: catastrophic forgetting. That said, researchers and AI engineers have implemented Elastic Weight Consolidation (EWC) and evals to address this problem. EWC does something akin to reducing the plasticity of the weights most important to previously learned tasks when training on new tasks, to prevent the forgetting. Eval sets define expected output behaviors for specified inputs, serving as a sort of unit test to ensure old functionality is not eroded.
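For the curious, here is a toy sketch of the EWC idea: a quadratic penalty, weighted by an estimate of how important each weight was to old tasks (a diagonal Fisher information estimate), discourages those weights from drifting during new training. The names and numbers below are illustrative, not from any particular implementation.

```python
import numpy as np

def ewc_penalty(params, old_params, fisher_diag, lam=1.0):
    """Elastic Weight Consolidation regularizer (sketch).

    Penalizes moving weights that mattered for old tasks (high Fisher
    information) away from their previous values.
    """
    total = 0.0
    for name in params:
        diff = params[name] - old_params[name]
        total += np.sum(fisher_diag[name] * diff**2)
    return lam / 2.0 * total

# Toy example: one weight matrix before and after an update on new data.
old = {"W": np.array([[1.0, 2.0], [3.0, 4.0]])}
new = {"W": np.array([[1.1, 2.0], [0.0, 4.0]])}
fisher = {"W": np.array([[0.9, 0.1], [0.9, 0.1]])}  # importance of each weight

# The big change to an "important" weight (3.0 -> 0.0) dominates the penalty.
print(ewc_penalty(new, old, fisher, lam=1.0))
# Training on new data would minimize: new_task_loss + ewc_penalty(...)
```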
U-shaped memory is a semi-recent discovery in state-of-the-art LLMs: the models perform much better at recalling and reasoning over information when it sits near the beginning or the end of the context, and terribly when it sits in the middle, hence the U shape. This research, published by Stanford and Cal in 2023, called into question the practical benefits of the supposedly enormous context windows many research labs were racing towards.
My favorite quote from the paper:
“to claim that a language model can robustly use information within long input contexts, it is necessary to show that its performance is minimally affected by the position of the relevant information in the input context”
Different solutions have been tried with varying degrees of success. One path researchers continue to investigate is improving the positional encodings used in transformers that I mentioned earlier. Another is to use RAG to retrieve only the most relevant information to add to the context, because the U-shaped problem fades as input length decreases. Finally, many prompting guides simply encourage users to keep the most important information or instructions at the start or the end of their prompts.
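Putting that last point into code, here is a small, hypothetical prompt-building helper that keeps the instructions at the two well-recalled edges of the context and caps how much retrieved material sits in the middle; the function and parameter names are my own, not from any library.

```python
def build_prompt(instructions: str, retrieved_chunks: list[str], question: str,
                 max_chunks: int = 3) -> str:
    """Place instructions at the start AND the end, and limit how much
    retrieved material sits in the easily-forgotten middle."""
    middle = "\n\n".join(retrieved_chunks[:max_chunks])  # most relevant first
    return (
        f"{instructions}\n\n"          # beginning: a well-recalled position
        f"Reference material:\n{middle}\n\n"
        f"Question: {question}\n\n"
        f"Reminder: {instructions}"    # end: the other well-recalled position
    )

prompt = build_prompt(
    instructions="Answer only from the reference material; cite the chunk you used.",
    retrieved_chunks=["Chunk A ...", "Chunk B ...", "Chunk C ...", "Chunk D ..."],
    question="What does the project kickoff involve?",
)
print(prompt)
```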
Fictitious memory
Come to think of it, these models actually have no memory at all. When we discuss concepts like AGI and its possibilities, it is important to remember that much of our experience of AI memory is really just clever client-side engineering ninjutsu. Let’s break this down a bit more.
The biggest manifestation of this is the simple fact that when you chat with an AI bot and send your second message after its response, the model does NOT remember your first message or its own response. The client-side engineering is that the entire context, from uploaded files to previous conversation turns, is passed to the model again to generate the second response.
LLMs are, in some sense, stateless because they have to be reminded of all the context, every time. It’s a stateful system, but a stateless model.
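Here is a minimal sketch of what chat clients actually do, with call_llm as a hypothetical stand-in for whatever completion API is being used; the point is only that the full history is re-sent on every turn.

```python
# Hypothetical stand-in for a real completion API call; what matters here is
# what gets sent each turn, not which provider or SDK is used.
def call_llm(messages: list[dict]) -> str:
    return f"(model reply given {len(messages)} prior messages)"

history = [{"role": "system", "content": "You are a helpful assistant."}]

def send(user_message: str) -> str:
    """Every turn re-sends the ENTIRE conversation; the model itself keeps nothing."""
    history.append({"role": "user", "content": user_message})
    reply = call_llm(history)            # full history goes over the wire again
    history.append({"role": "assistant", "content": reply})
    return reply

print(send("My name is Tumi."))
print(send("What is my name?"))  # only "remembered" because turn 1 was re-sent
```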
One implication is that nothing about the current transformer architecture actually promises the learning behavior required of an Artificial General Intelligence system. That said, as a quick aside: while transformers have no memory, one could consider the LLM system that includes external memory, RAG, and so on to indeed have memory, but these are layers on top of the transformer architecture.
Another implication is that the reduced creativity we experience after multiple turns is a direct consequence of this poor memory architecture. The models get siloed and less creative because they most definitely do not ignore the previous context in the chat. This is the same reason unrelated information in the context can steer the model off course even when accurate information was present in the training set.
There are many paths to Artificial General Intelligence, from complex technology systems that may include the transformer to cohesive collectives of human-computer systems and gene-edited humans with Neuralink-like connections to computers.
Implications for consumers, builders, and investors
While we may not know the when or how of AGI, we know that transformers are ushering in powerful AI that will transform society in economic, political, and other ways.
For users of GenAI technology, which I expect to include everyone, remember that you are a context engineer. You should give explicit instructions, file uploads, and so on that bring relevant context to the LLM and make it much more useful to you. Separate conversations about different ideas into new chats. And keep your most important instructions near the beginning or end of the context.
For builders, while I might seem skeptical of all the engineering ninjutsu needed in context engineering, I think we need much more of it. There are many latent use cases today that would be possible and impactful if only these models had the right context and the right user experience were presented to the user. The lack of memory in a transformer-based model is not an argument for its limited usefulness!
Finally, for investors, just a word of warning. To put it simply, if your valuations and investments are based on assumptions that your portfolio companies will figure out a path to AGI, reconsider. To reap the outsized benefits that AGI would provide, we would need a breakthrough just as big as the one that resulted in the transformer. And that’s just for the memory problem.