Tokenisation

Tokenisation is the process of splitting text into tokens, the discrete units that a language model or embedding model takes as input. It is the first step in turning raw text into something a model can process: a string of characters becomes a sequence of tokens, each of which is mapped to an integer ID in the model's vocabulary.

Modern tokenisers usually work at the subword level, breaking text into pieces that balance vocabulary size against sequence length. Common words become single tokens, while rare or compound words are split into smaller, reusable fragments. This lets a model handle essentially any input, including words it never saw in training, by composing them from familiar subword pieces.
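
The idea of composing unfamiliar words from familiar pieces can be sketched with a toy greedy longest-match splitter. This is an illustration only, not how any particular model's tokeniser is implemented: the vocabulary, the `<unk>` marker, and the function names `split_word` and `tokenise` are all invented for the example, and real tokenisers learn their vocabularies from large corpora.

```python
# Toy vocabulary, invented for illustration; real vocabularies are learned
# from data and typically contain tens of thousands of entries.
VOCAB = {"the", "token", "isation", "un", "break", "able"}
# Single lowercase letters act as a fallback so any a-z word can be split.
VOCAB |= set("abcdefghijklmnopqrstuvwxyz")


def split_word(word: str, vocab: set[str]) -> list[str]:
    """Greedily take the longest vocabulary entry at each position."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matches, not even a single character.
            pieces.append("<unk>")
            start += 1
        else:
            pieces.append(word[start:end])
            start = end
    return pieces


def tokenise(text: str, vocab: set[str] = VOCAB) -> list[str]:
    """Lowercase, split on whitespace, then split each word into pieces."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(split_word(word, vocab))
    return tokens
```

With this vocabulary, `tokenise("the tokenisation")` gives `['the', 'token', 'isation']`: the common word stays whole and the longer word is built from two reusable fragments. A word with no matching fragments, such as `zebra`, falls back to five single-character tokens, which shows how rarer text produces longer sequences.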

Tokenisation has practical consequences for vector search. It determines how many tokens a piece of text consumes, which in turn determines whether the text fits within an embedding model's input limit or a language model's context window, and how much it costs under per-token pricing. Tokenisation schemes differ between models, and some languages produce far more tokens per word than others, so the same text can yield different token counts depending on the model and the language. Understanding this helps explain why chunk sizes, context budgets, and costs vary across models and content.
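
As a rough sketch of how token counts drive chunk sizes, the snippet below packs words into chunks that stay within a token budget. The counter `toy_count_tokens` is a deliberately crude stand-in (about one token per four characters of each word) and the names and budget are hypothetical; in practice the count should come from the tokeniser of the model actually in use, since different tokenisers give different counts for the same text.

```python
import math
from typing import Callable


def toy_count_tokens(text: str) -> int:
    # Crude stand-in, assumed for illustration: roughly one token per four
    # characters of each word. Real counts depend on the model's tokeniser.
    return sum(math.ceil(len(word) / 4) for word in text.split())


def chunk_by_token_budget(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = toy_count_tokens,
) -> list[str]:
    """Pack whole words into chunks whose estimated token count fits the budget.

    Counting word by word assumes tokens never span whitespace, which is a
    simplification. A single word that exceeds the budget on its own is
    emitted as its own oversized chunk rather than being split.
    """
    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for word in text.split():
        cost = count_tokens(word)
        # Start a new chunk when adding this word would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(word)
        used += cost
    if current:
        chunks.append(" ".join(current))
    return chunks
```

For example, `chunk_by_token_budget("tokenisation is useful for search", max_tokens=4)` returns `['tokenisation is', 'useful for', 'search']` under the toy counter. Swapping in a different `count_tokens` function changes where the chunk boundaries fall, which is the effect described above: the same text and the same budget can produce different chunks for different models.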