Physics > Physics and Society

Title: Generalized Entropies and the Similarity of Texts

Abstract: We show how generalized Gibbs-Shannon entropies can provide new insights into
the statistical properties of texts. The universal distribution of word
frequencies (Zipf's law) implies that the generalized entropies, computed at
the word level, are dominated by words in a specific range of frequencies. Here
we show that this is the case not only for the generalized entropies but also
for the generalized (Jensen-Shannon) divergences, used to compute the
similarity between different texts. This finding allows us to identify the
contribution of specific words (and word frequencies) to the different
generalized entropies and also to estimate the size of the databases needed to
obtain a reliable estimate of the divergences. We test our results on large
databases of books (from the Google n-gram database) and scientific papers
(indexed by Web of Science).
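
As an illustration of the quantities named in the abstract, the following is a minimal sketch of a word-level generalized entropy and the corresponding generalized Jensen-Shannon divergence between two texts. The abstract does not state the functional form, so the sketch assumes the order-alpha entropy H_alpha(p) = (sum_i p_i^alpha - 1) / (1 - alpha), which reduces to the Shannon entropy as alpha approaches 1. The function names, the whitespace tokenization, and the equal weighting of the two texts are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def word_probs(text):
    """Relative word frequencies from naive whitespace tokenization (an assumption)."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}


def _term(x, alpha):
    """Per-word term of the assumed order-alpha entropy (constant offset excluded)."""
    if x == 0:
        return 0.0
    if alpha == 1:
        return -x * math.log(x)  # Shannon limit, natural logarithm
    return x ** alpha / (1 - alpha)


def generalized_entropy(p, alpha):
    """H_alpha(p) = (sum_i p_i^alpha - 1) / (1 - alpha); Shannon entropy at alpha = 1."""
    total = sum(_term(x, alpha) for x in p.values())
    if alpha == 1:
        return total
    return total - 1 / (1 - alpha)


def jsd_contributions(p, q, alpha):
    """Contribution of each word to the generalized Jensen-Shannon divergence.

    The divergence is H_alpha(m) - 0.5 * H_alpha(p) - 0.5 * H_alpha(q),
    with m the equal-weight mixture of p and q. The constant offsets of the
    three entropies cancel, so the divergence is a sum of per-word terms.
    """
    contributions = {}
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        mw = 0.5 * (pw + qw)
        contributions[w] = (
            _term(mw, alpha) - 0.5 * _term(pw, alpha) - 0.5 * _term(qw, alpha)
        )
    return contributions


def generalized_jsd(p, q, alpha):
    """Generalized Jensen-Shannon divergence of order alpha between p and q."""
    return sum(jsd_contributions(p, q, alpha).values())
```

Under these assumptions, sorting the output of `jsd_contributions` by value shows which words dominate the divergence for a given alpha. Larger alpha gives more weight to high-frequency words and smaller alpha to low-frequency ones, which is one way to explore the dependence on frequency range that the abstract describes.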