It's worth mentioning a few things: the Tokenizer allocates no memory except for the current token, which is kept in a resizable std::vector. There's also no need to pre-process the whole input text; we don't even need to know its length, so we can cheaply operate on terabytes of data, which is especially useful if we're only interested in the first few tokens!
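To make this concrete, here is a minimal sketch of such a streaming tokenizer. The actual Tokenizer's interface isn't shown here, so the class and method names below are illustrative, and whitespace-delimited tokens are assumed; the point is the memory and streaming behavior: one reusable std::vector buffer, input consumed character by character, and no need to know the input's total length.

```cpp
#include <cctype>
#include <iostream>
#include <sstream>
#include <string_view>
#include <vector>

// Illustrative sketch, not the article's actual Tokenizer. The only heap
// allocation is the std::vector holding the current token's bytes; the
// buffer's capacity is reused across tokens, so steady-state tokenization
// allocates nothing.
class Tokenizer {
public:
    explicit Tokenizer(std::istream& in) : in_(in) {}

    // Reads the next whitespace-delimited token into the internal buffer.
    // Returns false once the stream is exhausted.
    bool next() {
        token_.clear();  // keeps capacity, so no reallocation in steady state
        int c;
        while ((c = in_.get()) != EOF && std::isspace(c)) {}  // skip whitespace
        if (c == EOF) return false;
        do {
            token_.push_back(static_cast<char>(c));
        } while ((c = in_.get()) != EOF && !std::isspace(c));
        return true;
    }

    // View of the current token; valid only until the next call to next().
    std::string_view token() const { return {token_.data(), token_.size()}; }

private:
    std::istream& in_;         // input is pulled lazily, length never needed
    std::vector<char> token_;  // the single reusable allocation
};

int main() {
    std::istringstream input("lazy tokenizers scale to terabytes of text");
    Tokenizer tok(input);
    // Pull only the first two tokens; the rest of the input is never read,
    // which is what makes huge inputs cheap when only a prefix matters.
    for (int i = 0; i < 2 && tok.next(); ++i)
        std::cout << tok.token() << '\n';
}
```

Because next() pulls one character at a time from the stream, the same loop works unchanged whether the source is an in-memory string, a file, or a pipe of unbounded length.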