Cognitive Stylometry as an LLM-Based Methodology in Chinese Literary Studies

Recent years have witnessed a wave of fascinating research findings in both natural and artificial intelligence. While scholars remain cautious and warn against hasty conclusions, a number of studies point to similarities between human cognition and large language models (LLMs) based on the Transformer architecture. In this post, I will illustrate how these (suggested) techno-cognitive alignments could inspire new questions in Chinese literary scholarship and beyond. If models like GPT represent an “averaged humanity,” having been trained on vast amounts of human writing, they can also help us see how literature deviates from this average—pushing the boundaries of linguistic expectations and traversing the probability space beyond well-explored regions.

The technical aspects of this post can be reproduced using this Jupyter notebook (requires a Google account). The reader is welcome to experiment with the code and visualization. For a beginner-friendly tutorial on Google Colab, see a related DO post.

I will demonstrate how LLMs can reveal deviations in literary texts using Wenzhong2.0-GPT2-3.5B-chinese, a language model developed by the Shenzhen-based Cognitive Computing and Natural Language Research Center (IDEA-CCNL). Wenzhong 2.0 has been pre-trained on an extensive corpus of modern Chinese texts and will serve as our “average human” (“average” in the double quantitative-qualitative sense: a statistical mean, but also something suboptimal). Using this “average human,” I am going to read a short fragment from the novel Works and Creations: As Vivid As Real 《天工開物.栩栩如真》 published in 2005 by Hong Kong writer Dung Kai-cheung 董啟章 (b. 1967).

Works and Creations consists of two interwoven storylines, one fictional and one autobiographical. The autobiographical part (my focus here) traces Dung’s family history, from his grandparents’ move to Hong Kong in the early twentieth century to his growing up in the rapidly developing metropolis a few decades later. In the opening chapter, we learn how Dung’s grandfather, Dung Fu, first encountered his grandmother, Lung Gam Juk, while experimenting with a self-made radio. Lung, described as “twisted” (扭曲人) in the novel, was somehow able to capture and understand radio signals, thus matching the “upright” (正直人) Dung Fu. At one point, Lung confides to her husband a painful memory of separation from her older brother, which is when she first realized her extraordinary abilities:

第二天起来,龙金玉就开始收拾她的行李。嗰次系我第一次听到空中既声音。龙金玉告诉董富。正直人董富点点头。他是学科技的,他不相信神秘和超自然的事物,但他没有反驳或质疑新婚妻子的说法。[1]

“The next day, Lung Gam Juk began packing her luggage. That was the first time I heard the sound in the air. Lung Gam Juk told Dung Fu. The upright Dung Fu nodded. He studied electronics and didn’t believe in esoteric or supernatural things, but he didn’t argue or question his newly-wed wife’s words.”

For the sake of this experiment, let’s assume that “Dung Kai-cheung” is a specialized language model designed to generate novelistic texts, and that it has generated the above fragment, character by character. Given Dung’s posthumanist imagination, which often blurs the lines between humans and machines, this scenario isn’t that far-fetched. The core question is the following: How likely would it be for Wenzhong 2.0, our “average human,” to generate the same fragment? In other words, how similar are the two models, Wenzhong 2.0 and “Dung Kai-cheung,” if by similarity we mean the likelihood of generating the same sequence of Chinese characters?

Thanks to the LLM craze, such hypothetical questions can now be effectively posed, explored, and quantified. Available through HuggingFace, Wenzhong 2.0 barely fits on the free version of Google Colab, a popular computing platform which, at the time of writing, comes with a single T4 GPU (graphics processing unit). Given these limited resources, I used the accelerate package to distribute the model’s parameters across GPU and CPU (central processing unit) memory (Figure 1).

Figure 1. Loading Wenzhong 2.0 on Google Colab with accelerate
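For readers who would like to see this step in plain code, the following is a minimal sketch, assuming the transformers and accelerate packages are installed and using what I take to be the model’s HuggingFace identifier (IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese); the exact settings in the accompanying notebook and in Figure 1 may differ slightly.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# device_map="auto" lets accelerate spread the weights across the T4 GPU and,
# if necessary, CPU memory; half precision further reduces the memory footprint.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)
model.eval()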

After loading the model, I passed the quoted fragment as a target sequence and measured the model’s surprise at each consecutive character. Essentially, at each step, I asked Wenzhong 2.0 to predict the next character given all preceding characters (第 → 二, 第二 → 天, 第二天 → 起, etc.); at each step, the model distributed its probability mass over all tokens in the vocabulary (mostly Chinese characters, but also punctuation marks and other symbols). Since I knew the actual tokens in the sequence, the model’s surprise (“perplexity”) at each of them could be precisely calculated, as shown in Figure 2.
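In code, this measurement amounts to a single forward pass over the quoted fragment and reading off the cross-entropy loss at every position. The following is a simplified sketch (it reports one value per token rather than per character; see note [3] for how sub-tokens are averaged in the actual experiment):

# The quoted fragment from the novel
passage = "第二天起来,龙金玉就开始收拾她的行李。嗰次系我第一次听到空中既声音。龙金玉告诉董富。正直人董富点点头。他是学科技的,他不相信神秘和超自然的事物,但他没有反驳或质疑新婚妻子的说法。"

enc = tokenizer(passage, return_tensors="pt")
input_ids = enc["input_ids"].to(model.device)

with torch.no_grad():
    logits = model(input_ids).logits  # at each position: scores for the next token

# Shift so that token i is predicted from tokens 0..i-1
shift_logits = logits[0, :-1, :]
shift_labels = input_ids[0, 1:].to(shift_logits.device)

# Per-token cross-entropy = the model's "surprise" at each actual token
losses = torch.nn.functional.cross_entropy(
    shift_logits.float(), shift_labels, reduction="none"
)
perplexities = torch.exp(losses)

for token_id, ppl in zip(shift_labels.tolist(), perplexities.tolist()):
    print(tokenizer.decode([token_id]), round(ppl, 2))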

Figure 2. Per-character perplexity for a given sequence as measured by Wenzhong2.0-GPT2-3.5B-chinese

From the graph, we can immediately observe how the chosen passage both diverges from (high perplexity) and aligns with (low perplexity) statistical expectations. Novel-specific terms such as the proper names Lung Gam Juk 龙金玉 and Dung Fu 董富 and the word “upright” (正直人) visibly challenge the model’s predictive capacities. These terms are unique to the novel and deviate from average language use, a clear example of what empirical literary scholars call “foregrounding” [2]. Cantonese expressions, such as 嗰 (go), 系 (hai, usually written 係), and 既 (ge, usually written 嘅), similarly disrupt the narrative flow, adding layers of unpredictability and even resistance to a story that emphasizes its own heteroglossia. The surprise is heightened by the fact that the spoken Cantonese statement is not introduced by the quotation marks a reader would otherwise expect.

In contrast to those surprising elements, common words like 超自然 (“supernatural”), 行李 (“luggage”), or 开始 (“to begin”) fit comfortably within Wenzhong 2.0’s learned patterns. For instance, once the characters 超自 have been generated, the model has only one reasonable option for the next character: 然, making its prediction highly certain (perplexity close to 1). Similarly, once 开 has been generated in this context, 始 follows almost automatically, lowering the model’s perplexity. These low-perplexity sequences are less informative than the high-perplexity ones—we don’t learn much from highly predictable signals—but they are part and parcel of any natural language. [3]
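Claims of this kind can be checked directly by inspecting the model’s next-token distribution after a given prefix. Here is a minimal sketch, reusing the model and tokenizer loaded above; note that, because of the byte-level BPE tokenization discussed in note [3], a top-ranked token may correspond to only part of a Chinese character.

prefix = "他不相信神秘和超自"
ids = tokenizer(prefix, return_tensors="pt")["input_ids"].to(model.device)

with torch.no_grad():
    next_token_logits = model(ids).logits[0, -1]

probs = torch.softmax(next_token_logits.float(), dim=-1)
top_probs, top_ids = probs.topk(5)

# Print the five most likely continuations and their probabilities
for p, i in zip(top_probs.tolist(), top_ids.tolist()):
    print(repr(tokenizer.decode([i])), round(p, 3))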

“Dung Kai-cheung” thus mixes the expected with the unexpected, offering an engaging narrative experience that is anchored in linguistic habits and yet departs from them in significant ways.

In a recent article, I used a similar technique to compare the collected modern prose of Eileen Chang (1920-1995) and Mo Yan (b. 1955) with Maospeak, a language style that emerged during the Mao Zedong era in China (1949-1976). There, I argued that literature can be seen as a sustained exploration of probability space. By dispersing the probability mass over multiple, equally valid sequence continuations and increasing the entropy (unpredictability) of the reading process, literature resists what Viktor Shklovsky (1893-1984) referred to as “automatization,” or the routinization of perception [4]. This feature sets (good) literature apart from likelihood-maximizing text-generation techniques, which collapse the probability mass onto a single correct sequence and reinforce our expectations at each step. What’s particularly compelling is that this “defamiliarization” (Shklovsky’s ostranenie), or resistance against automatization, can now be quantified, with LLMs serving as a background against which the performativity of literary phenomena is brought into relief.
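The entropy mentioned here can itself be estimated with the same setup: for any prefix, one can compute the Shannon entropy of the model’s next-token distribution as a rough proxy for how widely the probability mass is dispersed. A minimal sketch, again assuming the model and tokenizer loaded above:

def next_token_entropy(prefix: str) -> float:
    # Shannon entropy (in nats) of the model's next-token distribution
    ids = tokenizer(prefix, return_tensors="pt")["input_ids"].to(model.device)
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    probs = torch.softmax(logits.float(), dim=-1)
    return float(-(probs * torch.log(probs + 1e-12)).sum())

# Higher values indicate a more widely dispersed set of plausible continuations
print(next_token_entropy("他不相信神秘和超自"))
print(next_token_entropy("嗰次系我第一次听到"))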

Large language models might not be the smartest, failing to count the number of Rs in “strawberry,” but the very way they are trained—predicting next words—can help us conceptualize a new perspective on how literature resists such averaging forces. If Chinese digital humanities has so far leaned heavily towards infrastructure, focusing on the digitization of materials and the creation of databases and platforms, techno-cognitive research will hopefully encourage a more practical integration of quantitative conceptuality (not just tools) within literary and cultural studies. “Cognitive stylometry” might be the right name for a subfield within this exciting area of humanistic inquiry that is bound to see rapid developments in the very near future.

The author would like to thank Or Cheuk Nam and the editorial team for comments and discussions on the earlier versions of this post.

References

[1] Dung, Kai-cheung 董啟章. Works and Creations: As Vivid As Real 天工開物.栩栩如真 (Shanghai: Shiji chuban jituan, 2010), 12.

[2] Van Peer, Willie et al. “Foregrounding,” in Handbook of Empirical Literary Studies (De Gruyter, 2021), 145-176.

[3] Here, it is important to acknowledge the choices and limitations that influenced the experiment. To begin with, different tokenizers treat Chinese characters differently. Character-based tokenization assigns one token per character (e.g., the model BERT-base-Chinese represents 锅 as [7222]). Byte Pair Encoding (BPE), on the other hand, may split a single character into multiple tokens (for instance, 我 might be represented as [22755, 239]) or group several characters into one token. Given that Wenzhong2.0-GPT2-3.5B-chinese uses BPE, where one character can yield as many as three tokens, for each character I averaged the loss of all its constituent tokens. Experimenting with other models and tokenizers yielded results similar to those presented here. Another important limitation concerns the target passage, which might not provide the model with a realistic amount of contextual information. Notice that I myself prefaced the quoted passage with a background introduction to Dung Kai-cheung’s novel, a context available to the reader but not to the model. Finally, perplexity scores depend on the model’s training data: Wenzhong 2.0 has been trained mostly on texts written in standard Chinese (“average written Chinese”), which partly explains the high perplexity of written Cantonese.
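For the technically inclined, the per-character averaging described above can be sketched roughly as follows, assuming a “fast” tokenizer that returns offset mappings (the character span each token covers) and reusing the per-token losses computed earlier; the actual notebook implementation may differ in detail.

from collections import defaultdict

# Offsets tell which span of the original string each token covers;
# several byte-level tokens may map onto the same Chinese character.
offsets = tokenizer(passage, return_offsets_mapping=True)["offset_mapping"]

per_char_losses = defaultdict(list)
for (start, end), loss in zip(offsets[1:], losses.tolist()):  # position 0 is never predicted
    for pos in range(start, end):
        per_char_losses[pos].append(loss)

# Average the losses of all sub-tokens belonging to the same character;
# exponentiating these averages gives one perplexity value per character.
char_scores = [(passage[pos], sum(v) / len(v)) for pos, v in sorted(per_char_losses.items())]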

[4] Shklovsky, Viktor. “Art as Device,” in Theory of Prose, translated by Benjamin Sher (Dalkey Archive Press, 1990), 1-14.
