This is a guest post by Aleksandra Piskunova.
Have you ever wondered how readable texts are? What even makes a text readable? And how can we measure to what extent a text is readable?
In my series of posts for the Digital Orientalist, I intend to help you address these questions and to bring attention to discussions about measuring the readability of texts. This post presents the theoretical background.
Readability is the ease with which a text can be read. It can be measured in different ways, most of which analyse surface features of the textual content: the number of words, sentence length, word frequency, and so on. Because these measures are so simple, such methods are extremely popular and handy to use, but they have many limitations: they assume that the text contains no noise and that its sentences are always well-formed, they require significant sample sizes of text, and they cannot model word usage in context (Algee-Hewitt et al. 2016). In this post, I will talk about a different measure that is just as handy but avoids these limitations: so-called entropy, which comes from information theory and measures readability according to the amount of information a text contains.
Information theory began as a field of study in 1924, when Harry Nyquist, a researcher at Bell Laboratories, published a paper entitled “Certain Factors Affecting Telegraph Speed”. It was followed in 1928 by “Transmission of Information”, in which Nyquist’s colleague R.V.L. Hartley established the first mathematical foundations of information theory. Building on these foundations, Claude Shannon published his famous paper “A Mathematical Theory of Communication” in the Bell System Technical Journal in 1948, initiating the formal study of information theory. Today, the field is oriented toward the processing and transmission of information.
The key concept of information theory is entropy, on which the readability of a text depends. Entropy is the amount of information in a text, defined by the degree to which the textual content (see below) is surprising, i.e. by its so-called surprisal. If the surprisal of the content is high, the text is highly informative; if it is low, the text carries very little information.
Textual content is formed by n-grams (unigrams, bigrams, trigrams, etc.), most often contiguous sequences of letters, of words, or of words and punctuation marks occurring in a text. The surprisal of the content can be measured by summing the surprisals of its n-grams. The surprisal of an n-gram relates to the probability of its appearance in the text, which is estimated from its frequency in a large and representative corpus of texts that reflects the state of affairs in a given language. Take modern Chinese, with n-grams formed of words and punctuation marks, as an example. In the sentence “这是猫。” (Zhe shi mao., “This is a cat.”), the bigram “这是” (zhe shi, “this is”) occurs in Chinese more frequently than the bigrams “是猫” (shi mao, “is a cat”) and “猫。” (mao., “a cat.”), which means that the probability of encountering “这是” in a Chinese text is higher than the probability of encountering “是猫” or “猫。”. The surprisal of the former is therefore lower than that of the latter.
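To make this concrete, here is a minimal Python sketch of how bigram surprisal could be estimated from relative frequencies. The tiny “corpus” of tokenised sentences and its counts are invented purely for illustration; a real estimate would require a large, representative corpus.

```python
import math
from collections import Counter

# A toy "corpus" of tokenised Chinese sentences (words and punctuation).
# Invented for illustration; real estimates need a large corpus.
corpus = [
    ["这", "是", "猫", "。"],
    ["这", "是", "狗", "。"],
    ["这", "是", "书", "。"],
    ["猫", "在", "这", "里", "。"],
]

# Count every contiguous bigram in the corpus.
bigrams = Counter(
    (s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1)
)
total = sum(bigrams.values())

def surprisal(bigram):
    """Surprisal in bits: -log2 of the bigram's relative frequency."""
    p = bigrams[bigram] / total
    return -math.log2(p)

# "这是" occurs three times in this toy corpus, "是猫" only once,
# so the former is more probable and therefore less surprising.
print(surprisal(("这", "是")))  # lower value
print(surprisal(("是", "猫")))  # higher value
```

Frequent bigrams yield small surprisal values and rare ones large values, exactly as in the “这是” versus “是猫” comparison above.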
The sum of the surprisals of its n-grams defines how surprising the content of a text is, which in turn captures the amount of information in the text, i.e. its entropy. If a text contains a small amount of information, it is highly readable: the surprisal of its content is low, because its n-grams are frequent in the corpus and the probability of encountering them in a text is high. Conversely, if a text contains a large amount of information, it is not easy to read: the surprisal of its content is high, because its n-grams are infrequent in the corpus and the probability of encountering them in a text is low.
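The summing step can be sketched in the same way. In this hypothetical Python example, the reference counts and the two sample texts are invented: one text uses frequent bigrams, the other rare ones, so the second accumulates more bits of surprisal and counts as less readable on this account.

```python
import math
from collections import Counter

# Hypothetical bigram counts from a reference corpus (invented numbers).
corpus_counts = Counter({
    ("this", "is"): 50, ("is", "a"): 40, ("a", "cat"): 10,
    ("cat", "."): 8, ("the", "feline"): 1, ("feline", "specimen"): 1,
    ("specimen", "."): 1, ("is", "grey"): 5, ("grey", "."): 4,
})
total = sum(corpus_counts.values())

def text_information(tokens):
    """Total surprisal (in bits) of a text's contiguous bigrams.

    Assumes every bigram appears in the reference counts; in practice,
    unseen bigrams would need smoothing to avoid zero probabilities.
    """
    return sum(
        -math.log2(corpus_counts[(tokens[i], tokens[i + 1])] / total)
        for i in range(len(tokens) - 1)
    )

easy = ["this", "is", "a", "cat", "."]     # frequent bigrams
hard = ["the", "feline", "specimen", "."]  # rare bigrams

# Fewer bits of information -> more readable, on this account.
print(text_information(easy))
print(text_information(hard))
```

Dividing the total by the number of n-grams would give an average per n-gram, which makes texts of different lengths comparable.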
Thus, in information theory entropy provides a means of measuring the amount of information in a text, which defines its readability and allows us to think about readability mathematically, and hence computationally.
Entropy was used by the Stanford Literary Lab to compare the amount of information in what is called the “canon” (260 titles from the Chadwyck-Healey corpus) and the “archive” (949 titles from the same period). The study confirmed the theory advanced by the Stanford Literary Lab that texts from the archive carry a smaller amount of information than the canonical ones, which is one of the reasons why non-canonical texts have been forgotten. The results were published in the paper “Canon/Archive: Large-scale Dynamics in the Literary Field”. This paper made an impression on me, and I wanted to use entropy to compare the amount of information in originals and translations, as the Stanford Literary Lab did with their texts, then measure their readability and see whether the method, once applied to texts in different languages, yielded different results. The main difficulty was understanding how to think about readability mathematically, hence computationally, i.e. how to transform the concept of entropy into code. Since the Stanford Literary Lab did not provide any code, I wrote my own. This will be discussed in my forthcoming posts for the Digital Orientalist.
Algee-Hewitt, M. A., Allison, S., Gemma, M., Heuser, R., Moretti, F., & Walser, H. (2016). “Canon/Archive: Large-scale Dynamics in the Literary Field.” Stanford Literary Lab. https://litlab.stanford.edu/LiteraryLabPamphlet11.pdf