A token is the smallest unit into which a corpus is divided. Typically, each word form and each punctuation mark (comma, period, …) is a separate token (though English *don't* is usually split into two tokens, *do* and *n't*). Corpora therefore contain more tokens than words. Spaces between words are not tokens. A text is divided into tokens by a tool called a tokenizer, which is often language-specific.
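A minimal sketch of a rule-based English tokenizer illustrates the idea; the `tokenize` function and its rules are illustrative assumptions, not a real tokenizer, which would handle many more cases:

```python
import re

def tokenize(text):
    # Hypothetical toy tokenizer: split off the English contraction
    # "n't" as its own token, then treat every remaining word form
    # and every punctuation character as a separate token.
    text = re.sub(r"n't\b", " n't", text)      # "Don't" -> "Do n't"
    # "n't" first, then runs of word characters, then single
    # non-space punctuation characters; spaces produce no token.
    return re.findall(r"n't|\w+|[^\w\s]", text)

print(tokenize("Don't panic, world!"))
# ['Do', "n't", 'panic', ',', 'world', '!']
```

Note that the six tokens come from only three space-separated words, which is why token counts exceed word counts.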