Abstract

We review recent progress in understanding the meaning of mutual information in natural language. Let us define words in a text as strings that occur sufficiently often. In a few previous papers, we have shown that a power-law distribution of words so defined (a.k.a. Herdan’s law) is obeyed if there is a similar power-law growth of (algorithmic) mutual information between adjacent portions of texts of increasing length. Moreover, the power-law growth of mutual information holds if texts describe a complicated infinite (algorithmically) random object in a highly repetitive way, according to an analogous power-law distribution. The described object may be immutable (like a mathematical or physical constant) or may evolve slowly in time (like cultural heritage). Here, we reflect on the respective mathematical results in a less technical way. We also discuss the feasibility of deciding to what extent these results apply to actual human communication.

Received 06 May 2011; Accepted 09 August 2011; Published online 30 September 2011

Lead Paragraph: In 1990, German engineer Wolfgang Hilberg published an article1 in which the graph of conditional entropy of printed English from Claude Shannon’s famous work2 was replotted on a log-log scale. Seeing a dozen data points lie on a roughly straight line, he conjectured that the entropy of a block of n characters drawn from a text in natural language is roughly proportional to √n as n tends to infinity. Although this conjecture was not sufficiently supported by experiment or a rational model, it attracted the interest of a few physicists seeking to understand complex systems.3–6 As a graduate in physics and a junior computational linguist, I came across their publications in 2000. They stimulated me to ponder the interplay of randomness, order, and complexity in language. I felt that a better understanding of Hilberg’s conjecture could lead to a better understanding of Zipf’s law for the distribution of words.7,8 Using Hilberg’s conjecture, I wished to demonstrate clearly that the monkey-typing model, introduced to explain Zipf’s law,9 cannot account for some important purposes of human communication. However, it took a few years to translate these intuitions into a mature mathematical model.10–13 The model is presented here in an accessible way. I also identify a few problems for future research.
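To make the conjecture concrete: a straight line on a log-log plot corresponds to a power law, so the slope of the fitted line estimates the exponent relating block entropy to block length (about 1/2 under Hilberg’s conjecture). The following minimal Python sketch illustrates such a fit; the numbers are made up, standing in for Shannon-style guessing-game estimates, and the sketch is not Hilberg’s original analysis.

import numpy as np

# Hypothetical block lengths n and block-entropy estimates H(n) in bits
# (made-up numbers, standing in for guessing-game data like Shannon's).
n = np.array([1, 2, 4, 8, 16, 32, 64, 128], dtype=float)
H = np.array([4.1, 5.9, 8.2, 11.6, 16.5, 23.1, 32.8, 46.0])

# Least-squares line in log-log scale: log H = beta * log n + log A,
# i.e. H(n) is approximately A * n**beta.
beta, logA = np.polyfit(np.log(n), np.log(H), 1)
print(f"estimated exponent beta = {beta:.2f}")  # close to 0.5 for these numbers
print(f"estimated amplitude A = {np.exp(logA):.2f}")

A slope near 0.5 on the log-log plot is what a reader of Hilberg’s replotted graph would interpret as square-root growth of block entropy.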

Article outline:
I. INTRODUCTION
II. IDEAS IN THE BACKGROUND
  A. Zipf’s and Herdan’s laws
  B. Detecting word boundaries with grammar-based codes
  C. Excess entropy and Hilberg’s hypothesis
  D. Highly repetitive descriptions of a random world
III. MATHEMATICAL SYNTHESIS
  A. Definitions and theorems
  B. The zoo of Santa Fe processes
IV. AFTERTHOUGHTS FOR THEORETICIANS
  A. What are those “facts”?
  B. Are facts and words the same?
  C. Finite active vocabulary and division of knowledge
  D. How does language differ from maths, music, and DNA?
V. AFTERTHOUGHTS FOR EXPERIMENTALISTS
  A. What is the appropriate grammar-based code?
  B. How to measure mutual information?
VI. CONCLUSION