Context Window

The context window is the maximum amount of text, measured in tokens, that a language model can consider at once. Everything the model uses to produce a response must fit inside it: the system instructions, any retrieved documents, the conversation history, and the user’s current question. In most models the generated response is counted against the same limit, so some room has to be left for the answer.

This limit directly shapes how retrieval-augmented systems are designed. It determines how many retrieved chunks you can inject per query, how much conversation history you can preserve, and how detailed your instructions can be. A model with a 128,000-token window can work with far more retrieved context than one limited to 8,000 tokens.
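The budgeting arithmetic behind that trade-off can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the function name chunk_budget, its parameters, and all of the token counts are hypothetical, and it assumes fixed-size chunks and a model whose output tokens share the window with the input.

```python
def chunk_budget(window, system_tokens, history_tokens,
                 question_tokens, reserved_output, chunk_tokens):
    """Return how many fixed-size chunks fit in what is left of the window.

    All arguments are token counts. reserved_output is the space set aside
    for the model's answer (an assumption: some models cap output separately).
    """
    remaining = (window - system_tokens - history_tokens
                 - question_tokens - reserved_output)
    if remaining <= 0:
        return 0  # the fixed parts already fill the window
    return remaining // chunk_tokens


# Illustrative numbers only: the same prompt overhead under two window sizes.
overhead = dict(system_tokens=500, history_tokens=1500,
                question_tokens=100, reserved_output=1000, chunk_tokens=400)
small = chunk_budget(8_000, **overhead)
large = chunk_budget(128_000, **overhead)
print(small, large)
```

With these made-up figures the 8,000-token window leaves room for 12 chunks of 400 tokens, while the 128,000-token window leaves room for 312. In practice the counts would come from the model's own tokenizer rather than fixed estimates.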

Crucially, a large context window does not remove the need for good retrieval. Filling the window with loosely relevant material increases latency and cost, and it can reduce answer quality, because models tend to lose track of information buried in the middle of very long inputs. Retrieving the handful of genuinely relevant chunks still beats dumping in everything, which is why vector search remains essential even as context windows grow.
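One way to act on this is to cap what gets injected by relevance as well as by size. The sketch below is an assumption-laden illustration rather than a standard algorithm: the Chunk type, the select_chunks function, the score threshold, and the sample data are all hypothetical, and the relevance scores are assumed to come from whatever vector search the system already uses.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    text: str
    score: float  # relevance score from the retriever (higher is better)
    tokens: int   # token count of the chunk


def select_chunks(chunks, budget, max_chunks=5, min_score=0.0):
    """Greedily keep the most relevant chunks that fit within the token budget."""
    chosen = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if len(chosen) == max_chunks or chunk.score < min_score:
            break  # enough chunks, or everything left is too weakly related
        if used + chunk.tokens > budget:
            continue  # too large for the space left; try the next chunk
        chosen.append(chunk)
        used += chunk.tokens
    return chosen


# Made-up candidates: only strongly related chunks that fit are kept.
candidates = [
    Chunk("a", 0.9, 300),
    Chunk("b", 0.2, 100),
    Chunk("c", 0.8, 500),
    Chunk("d", 0.7, 300),
]
picked = select_chunks(candidates, budget=700, max_chunks=3, min_score=0.5)
print([c.text for c in picked])  # ['a', 'd']
```

Here chunk "c" is skipped because it would overflow the 700-token budget, and chunk "b" is dropped for low relevance even though it would fit. The point is that free space in the window is not a reason to fill it.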