Choosing Tokenization Well

Eureka, the entirely automatic, recursive process of distilling data into its component statistics, can make use of a format specification but does not require one, and it places no restrictions or limitations on its input data. It would even be possible to inject binary information into plain text and recall it robustly, though that might cause some confusion with coincidental tokens both inside and outside the binary sections (see normalization for mechanisms to make this easy)[].

Eureka does, however, require human interaction in the construction of lex rules. As a component of the Eureka digest itself, the lex rules define how the original data is partitioned into different statistical spaces. This has two practical effects. First, the consistency of the resulting partition materially affects how effectively Eureka compresses the data, and hence how quickly all operations within that space occur. Second, the query language is expressed through the different classes of tokens, so the lex rules also determine how easily operational efficiency is achieved. Recall queries are an interactive process[…], hence poor choices or mismatched regular expression rules would result in feedback that might be misleading or confusing.
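To make the idea of partitioning data into statistical spaces more concrete, here is a minimal sketch in Python. It is not Eureka's lex rule syntax or implementation: the rule names (`number`, `word`, `space`, `symbol`), the patterns, and the `tokenize` and `partition` helpers are all hypothetical, chosen only to illustrate how regular expression rules assign each piece of input to a token class whose counts are then kept separately.

```python
import re
from collections import Counter, defaultdict

# Hypothetical lex rules: (token class, regular expression).
# Order matters: the first alternative that matches at a position wins.
LEX_RULES = [
    ("number", r"\d+(?:\.\d+)?"),
    ("word", r"[A-Za-z]+"),
    ("space", r"\s+"),
    ("symbol", r"."),  # catch-all so every character lands in some class
]

# Wrap each rule in a named group so the matching rule can be read back.
LEXER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in LEX_RULES),
    re.DOTALL,
)


def tokenize(text):
    """Yield (token_class, token_text) pairs that cover the whole input."""
    for match in LEXER.finditer(text):
        yield match.lastgroup, match.group()


def partition(text):
    """Count tokens per class: one counter per (assumed) statistical space."""
    spaces = defaultdict(Counter)
    for token_class, token in tokenize(text):
        spaces[token_class][token] += 1
    return spaces


if __name__ == "__main__":
    for token_class, counts in partition("pi = 3.14 and e = 2.71").items():
        print(token_class, dict(counts))
```

With these rules, `3.14` is a single `number` token. Had the rule been `\d+` alone, the same text would split into `3`, `.`, and `14` across two classes, which is the kind of inconsistent partition that the paragraph above warns can make compression less effective and query feedback confusing.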