Łukasz Dębowski
This is the research statement of an informal metaproject, which I commenced in 2000.
German engineer Wolfgang Hilberg (1990) published an article where the graph of conditional entropy of printed English from Claude Shannon's (1951) famous work was replotted in log-log scale. Seeing a dozen data points lie on a straightish line, he conjectured that entropy of a block of n letters drawn from a text in natural language is roughly proportional to the square root of n for n tending to infinity. Although this conjecture was not sufficiently supported by experiment or a rational model, it attracted interest of a few physicists seeking to understand complex systems (Ebeling, Nicolis, 1991; Ebeling, Pöschel, 1994; Bialek, Nemenman, Tishby, 2001; Crutchfield, Feldman, 2003).
As a graduate in physics and a junior computational linguist, I came across these publications in 2000. They stimulated me to ponder upon the interplay of randomness, order, and complexity in language. Using Hilberg's conjecture, I wished to demonstrate clearly that the monkey-typing model, introduced to explain Zipf's law (Miller, 1954), cannot account for some important purposes of human communication. However, it took a few years to translate these intuitions into a mature mathematical model.
I have developed a conceptual framework for QL which borrows heavily from information theory and yields a macroscopic view onto human communication. The central point of my work has been linking empirical laws for the distribution of words with an intuitive idea that texts describe various facts in a highly repetitive and mostly logically consistent way. This takes form of a proposition that can be expressed informally as follows, where β is in the range of (0,1):
(H) If a text of length n describes at least nβ independent facts in a repetitive wayThe conclusion of this proposition is usually called Herdan's or Heaps' law in the English literature. It is an integrated version of the famous Zipf law for the distribution of words.
then the text contains at least nβ/log n different words (under appropriate quantification over n).
Proposition (H) has been formalized in a series of mathematical definitions, theorems, and toy examples (Dębowski, 2009, 2010, 2011):
The aforementioned mathematical results can be also linked to the relaxed Hilberg hypothesis (Dębowski, 2011). The relaxed Hilberg hypothesis (for individual texts) says that (algorithmic) mutual information between adjacent blocks of n letters drawn from a text is roughly proportional to nβ. Respectively, proposition (H) can be related to proposition:
(H') If a text of length n describes at least nβ independent facts in a repetitive wayand proposition:
then mutual information between adjacent blocks of n letters exceeds nβ (under appropriate quantification over n).
(H'') If mutual information between adjacent blocks of n letters exceeds nβNevertheless, Proposition (H) does not follow from the conjunction of (H') and (H'') because of small differences in quantification under which Propositions (H), (H'), and (H'') are satisfied.
then the text contains at least nβ/log n different words (under appropriate quantification over n).
Mutual information between adjacent blocks has been considered, under the name of excess entropy, a measure of complexity of stochastic processes (Crutchfield, Feldman, 2003). Here, complexity is understood as departure from both simple order and simple randomness. Hence I think that my results may be also interesting for researchers studying complex systems in general.
An interested reader should first read this general introduction:
The following papers contain mathematical results:
Some experimental data can be found here:
I am deeply grateful to several people who have helped me further my research:
Here are some grants through which the QLIT metaproject was partially supported:
Łukasz Dębowski, 17.05.2012