<aside>
DRAFT Alert:
This is an unedited draft published on 17 November. Comments welcome.
</aside>
<aside>
The context window represents the amount of text a language model can process at once during a conversation or task. It's a crucial concept in AI, affecting everything from basic tasks to complex analysis. While early models had small context windows of around 2,000 tokens, modern models can handle much more, with some reaching up to 2 million tokens.
</aside>
<aside>
| Model | Context Window | Approximate Words | Key Feature |
|---|---|---|---|
| ChatGPT Free | 8K tokens | ~6,000 | Basic capability |
| ChatGPT Plus | 32K tokens | ~24,000 | Standard for paying users |
| Claude | 200K tokens | ~150,000 | High capacity |
| Gemini | 2M tokens | ~1.5 million | Largest current window |
| Open Source | 8K-128K tokens | ~6,000-96,000 | Varies by model |
</aside>
The context window is simply the amount of text the language model can "see" at once. Another way of thinking about it is that the context window is the maximum size of the prompt. The model can only answer questions based on the text that is sent to it as part of the prompt - nothing else.
When ChatGPT was released, it had a context window of 2,000 tokens - roughly 1,400 words of English text. Today, ChatGPT Free has a context window of 8,000 tokens (about 6,000 English words), ChatGPT Plus goes up to 32,000 tokens, and Enterprise to 128,000 tokens.
However, context window size is the one area where even the best of OpenAI's models are far behind their competitors. Claude has a context window of 200,000 tokens and Google's Gemini goes all the way up to 2,000,000. Many open source models also reach 128,000 tokens, although 8,000 is still the norm. These are maximum context windows and they are not available in all settings, but of the leading model makers, OpenAI's top context window is the smallest.
But can't ChatGPT answer questions about documents much longer than 6,000 words? Yes, but it does so using a trick called RAG (Retrieval Augmented Generation), which breaks the text up into chunks (usually about a paragraph each), retrieves the chunks that most closely match the query (usually about 10-20) and sends those to the model as a single prompt. ChatGPT's answer is based on just those chunks and nothing else - in effect, it never even 'saw' the rest of the text. This greatly limits the types of questions you can ask about the text.
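To make the mechanics concrete, here is a minimal sketch of that chunk-and-retrieve flow. It is not how any particular product implements RAG: real systems score chunks with an embedding model and a vector database, whereas this toy version uses plain word overlap, and all of the function names are illustrative.

```python
# A deliberately simplified sketch of the chunk-and-retrieve flow described above.
# Real systems score chunks with an embedding model and a vector database;
# plain word overlap stands in for semantic similarity here.

def chunk_text(text: str, chunk_size: int = 150) -> list[str]:
    """Split the text into roughly paragraph-sized chunks of ~chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def retrieve(chunks: list[str], query: str, top_k: int = 15) -> list[str]:
    """Return the top_k chunks that share the most words with the query."""
    query_words = set(query.lower().split())
    return sorted(chunks,
                  key=lambda c: len(query_words & set(c.lower().split())),
                  reverse=True)[:top_k]

def build_prompt(chunks: list[str], query: str) -> str:
    """Assemble the prompt the model actually receives: selected chunks plus the question."""
    return ("Answer using only the excerpts below.\n\n"
            + "\n\n".join(chunks)
            + f"\n\nQuestion: {query}")

# Whatever build_prompt() returns is all the model ever "sees" of the document.
```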
When the whole text fits in the context window, things are different. The model "sees" the whole text and can take all of it into account. One easy way to test the difference is the "needle in the haystack" test. Give the model a whole text that fits into the context window, change something and ask what does not fit. For example, take a 19th century novel or short story and insert a sentence at random: "They all got on the airplane and flew to Australia." When the model is retrieving chunks, it has no way to spot this, because it never sees the whole text at once. But when you send the whole text to a good model as part of a single prompt, it will be able to tell you that this sentence does not belong.
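If you want to try the test yourself, a small script is enough to doctor a text and build the prompt. This is just an illustration; the file name and the wording of the question are placeholders.

```python
import random

def make_needle_test(full_text: str, needle: str) -> str:
    """Insert an out-of-place sentence at a random position and build the test prompt."""
    sentences = full_text.split(". ")
    sentences.insert(random.randrange(len(sentences) + 1), needle)
    doctored = ". ".join(sentences)
    return ("Here is a text:\n\n" + doctored
            + "\n\nOne sentence does not belong in this text. Which one is it?")

# "novel.txt" is a placeholder for any public-domain text that fits the context window.
novel_text = open("novel.txt", encoding="utf-8").read()
prompt = make_needle_test(novel_text, "They all got on the airplane and flew to Australia.")
# Send `prompt` to the model as a single message and see whether it spots the needle.
```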
But giving up the chunks of the RAG method also has disadvantages. When the whole text fits in the context window, the model does not really know where the information it found is positioned in the text. It cannot give a page number, and often not even the relative position of different pieces of information. The model does not "know" that with the RAG method either, but when we create the chunks, we can record where each one came from and store that alongside it, so that when the chunks are retrieved we can say where they are.
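This is easy to see in a sketch. The version below is again illustrative (it assumes the document is already split into pages): the position of every chunk is recorded at chunking time, so a location can be attached to whatever the model says about a retrieved chunk.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    page: int        # page the chunk came from
    word_index: int  # position of the chunk's first word on that page

def chunk_with_positions(pages: list[str], chunk_size: int = 150) -> list[Chunk]:
    """Split page by page, recording where every chunk sits in the source."""
    chunks = []
    for page_number, page in enumerate(pages, start=1):
        words = page.split()
        for i in range(0, len(words), chunk_size):
            chunks.append(Chunk(" ".join(words[i:i + chunk_size]), page_number, i))
    return chunks

# After retrieval, chunk.page can be reported next to the model's answer -
# information the model itself never has when it reads one long prompt.
```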
Retrieval Augmented Generation is not the only way to overcome context window limitations. We can also simply split the text into two or more big parts, use the model to summarise them separately and then ask it to combine the summaries. This has the advantage that the model actually sees big chunks of the text rather than just several disconnected paragraphs, but it is much easier to miss important points.
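A rough sketch of that split-summarise-combine approach might look like the following. The `ask_model` function is a hypothetical stand-in for whatever model API you are actually calling, and the prompts are only examples.

```python
def ask_model(prompt: str) -> str:
    """Placeholder: call whichever chat model API you use and return its text reply."""
    raise NotImplementedError("wire this up to your model provider")

def summarise_in_parts(text: str, parts: int = 2) -> str:
    """Summarise each large part separately, then ask the model to merge the summaries."""
    words = text.split()
    size = len(words) // parts + 1
    partial_summaries = [
        ask_model("Summarise the following text:\n\n" + " ".join(words[i:i + size]))
        for i in range(0, len(words), size)
    ]
    return ask_model("Combine these partial summaries into one coherent summary:\n\n"
                     + "\n\n".join(partial_summaries))
```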
So we can see that long context windows are very important, especially in academic contexts where we want to interrogate long texts or even collections of texts.
I've written about it elsewhere [LINK] in more detail, but it is important to always keep in mind what it means when a Large Language Model "sees" a text. You will notice I have been using the word "see" in quotes, and that's for a reason: the way the model "reads" a text is completely different from how a human would read it.
Large Language Models are often described as just predicting the next word, but the reality is very different from what we might imagine prediction to look like. First, the model doesn't generate words but parts of words called "tokens". It does so by choosing the token that best fits the preceding context. It adds that token to the text, and then chooses the next one based on the context that now includes the token it just generated. Modern models have a vocabulary of about thirty to two hundred thousand tokens - depending on the model - and these are sufficient to generate all of the world's languages as well as computer code.
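You can see tokenisation in action with OpenAI's open-source `tiktoken` library; other model families use their own tokenisers, so the exact split will differ, and the example sentence is of course arbitrary.

```python
import tiktoken  # OpenAI's open-source tokeniser library: pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several recent OpenAI models
text = "Context windows are measured in tokens, not words."
token_ids = enc.encode(text)

print(len(text.split()), "words ->", len(token_ids), "tokens")
print([enc.decode([t]) for t in token_ids])
# Common English words are usually a single token; rarer words split into several pieces.
```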
The model chooses which token to add next by weighting the relative importance of all preceding tokens (essentially parts of words). This is called attention and it is what enables Large Language Models to work at all. Attention is a very computationally intensive process because the relative weighting is done by multiplying extremely large matrices that represent the embedding of each token - its position in a very high-dimensional space. So, if there are 100,000 tokens in the prompt, each of their embeddings has to be compared against every other one. Multiple times.
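For readers who want to see why this scales so badly, here is a toy, single-head version of the standard scaled dot-product attention calculation. Real models add learned projections, many heads and many layers on top of this, but the quadratic score matrix is the same.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: every token's query is scored against every token's key."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # (n, n): n^2 pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over each row
    return weights @ V                                  # weighted mix of all value vectors

n, d = 1_000, 64                                        # 1,000 tokens, 64-dimensional vectors
Q = K = V = np.random.rand(n, d)
out = scaled_dot_product_attention(Q, K, V)
# The (n, n) score matrix is why cost grows quadratically with prompt length:
# at 100,000 tokens that is 10 billion pairwise scores per attention head, per layer.
```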
In contrast, when a human summarises a long text, they read the text and then use the knowledge gleaned in the process to write the summary, occasionally looking up details. The language model does not have any memory in which to store the knowledge acquired by reading a text. Its reading is much more akin to a person copying out a long sequence of numbers: they look at the number, decide how much of it they can hold in their head, write down that chunk, look at the number again, select the next chunk to remember, write that down, and so on. In a way, you could say that in this scenario they are relying exclusively on attention rather than on working memory. And that is exactly what the models are doing. The problem is that Large Language Models can achieve with attention alone things that humans need memory and other tools for. This makes their performance both unpredictable and unintuitive. I wrote about some more examples of this here [LINK].
The human intuition about how a Large Language Model "reads" text plays a significant role in how models with long context windows will be used in academic contexts. Unfortunately, in my experience, this intuition leads people to simultaneously expect both less and more than these tools can actually deliver.
In general, the mental model I observe people applying to LLMs in this context is a blend of two frames. One frame is the experience of a person reading a text: taking notes, back-tracking, developing a mental image of what the text is like, skimming and scanning, and finally writing an account of what one has read. The other frame contributing to the intuition is people's experience with computer text processors and databases. In this frame, the "computer" can faithfully store text, search it extremely quickly, retrieve any part of it, and copy and paste it exactly as it appeared in the original. These two frames combine into the image of a 'superhuman' roboreader that can do everything a human reader using a computer can do, only faster and at greater volume.