In an earlier post here on The Digital Orientalist, Mariana Zorkina introduced the topic of regular expressions (regex for short) and listed great resources to get started. To recapitulate, a regular expression is a sequence of characters that expresses a match pattern, in which characters function either literally or as metacharacters with a special meaning. Beyond Mariana’s post, one can find a myriad of resources for beginners online, including an introduction to the topic for Classical Chinese by Donald Sturgeon. Here, I’d like to offer some additional thoughts through the lens of how I first encountered this syntactic system and its match patterns for advanced string searches: in the context of text analysis.
I was first introduced to this angle by Mick Hunter from Yale University, who used the method rather fruitfully in his Confucius Beyond the Analects. The advantage of employing regular expressions – both for text analysis and more broadly – becomes clear from Hunter’s introductory example (Hunter 2017: 25-30), which involves the opening of Lunyu 論語 7/1:
The Master said, “[I] transmit without bringing [things] about; [I] trust and feel affinity for the ancient.”
子曰:述而不作,信而好古。
Regular expression searches go beyond a literal identification of parallel passages. In this sense, instead of looking for a literal recurrence of the string “shu er bu zuo 述而不作,” we can define a regex that contains the salient features of this statement. Overall, these features can be expressed as “a phrase that contrasts shu 述 (transmitting) with zuo 作 (originating),” leading Hunter to conclude that “ideally our search would identify any and all passages that juxtapose these two characters (or their semantic, graphic, or phonetic variants) in any order” (Hunter 2017: 27-28). By employing a well-defined regular expression across a corpus of early Chinese texts, Hunter is able to identify a number of rather close parallels that a literal string search would not be able to find, including the following two examples from the text Mozi 墨子:
[They] also say, “A noble man follows without bringing [things] about.”
又曰:君子循而不作。
Mozi 39.9/18a
Gong Mengzi said, “A nobleman does not bring [things] about; he only transmits.”
公孟子曰:君子不作,術而已。
Mozi 46.11/6b
Clearly, these parallels are looser than literal overlaps – and the advantage of regular expressions lies precisely in finding such patterns within and across texts. Minimally, these statements share the elements listed by Hunter, consisting of the contrast between “bringing about” or “originating” (zuo 作) and “transmitting” (shu 述), regardless of the specific characters used. On page 28 of his monograph, Hunter provides the regular expression he used to find these parallels:
[述循術順].{0,6}作|作.{0,6}[述循術順]
In other words, this regular expression matches any string in which one of two things is the case: either the character shu 述 (or one of the characters shu 術, xun 循, or shun 順, all of which are rather similar to shu 述 in Old Chinese phonology) precedes the character zuo 作 with at most six characters in between, or zuo 作 precedes shu 述 (or one of its alternatives) with at most six characters in between. The regular expression captures this match pattern through a combination of literal characters and metacharacters. Specifically, the following metacharacters are used: square brackets “[]” define a set of characters, any one of which will match; a full stop “.” matches any character (except for the newline character); curly braces “{}” specify how often the preceding element may repeat (the two numbers stand for the minimum and maximum, so that in our example “.{0,6}” matches between zero and six characters of any kind); and a vertical bar “|” means that either the pattern before it or the pattern after it can match. This pattern fits the Lunyu passage and both examples from the Mozi above. What these parallels mean and how they can be interpreted is naturally beyond the scope of this short post. Overall, however, regular expressions are ideal for digging up such parallels, which literal searches would easily miss.
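To see the pattern at work, here is a minimal sketch in Python using the standard re module. It is only an illustration and not Hunter's own workflow, which the book excerpt quoted here does not describe; the dictionary of passages, its labels, and the function name find_parallels are my own assumptions, and the three strings are simply the passages quoted above.

```python
import re

# Hunter's pattern (Hunter 2017: 28), copied from the text above.
PATTERN = re.compile("[述循術順].{0,6}作|作.{0,6}[述循術順]")

# The three passages quoted above. The labels are illustrative only.
passages = {
    "Lunyu 7/1": "子曰:述而不作,信而好古。",
    "Mozi 39.9/18a": "又曰:君子循而不作。",
    "Mozi 46.11/6b": "公孟子曰:君子不作,術而已。",
}

def find_parallels(texts, pattern):
    """Return {label: matched substring} for every passage the pattern matches."""
    hits = {}
    for label, text in texts.items():
        match = pattern.search(text)
        if match:
            hits[label] = match.group(0)
    return hits

hits = find_parallels(passages, PATTERN)
for label, matched in hits.items():
    print(label, matched)

# For comparison: a literal search for the Lunyu wording finds only the Lunyu itself.
literal = [label for label, text in passages.items() if "述而不作" in text]
print(literal)
```

The first two passages are caught by the part of the pattern before the vertical bar (述 or 循 followed by 作), while the second Mozi passage is caught by the part after it (作 followed by 術). Note that the comma between 作 and 術 counts as one of the up to six intervening characters.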
To bring up a second example, using regular expressions can similarly help us to quickly look through a corpus of texts to spot parallels to the beginning of Laozi 老子 14:
視之不見,名曰夷; When you look at it but do not see, call it *ləj;
聽之不聞,名曰希; When you listen to it but do not hear, call it *qʰəj; […]
To keep the poetic effect of these statements, I rendered the last character of each line here simply through its Old Chinese phonological reconstruction according to Baxter and Sagart 2014 (as a side note, the two versions of the Laozi in the excavated Mawangdui corpus actually use wei 微 OC *məj in the first line instead of yi 夷 OC *ləj; conversely, yi occurs later in those texts, in positions where we encounter wei in the received text – which I interpret to mean that these supposed definitions are used here rather arbitrarily). In particular, the first halves of these two statements are interesting for our purposes here, since similar statements occur throughout the corpus of transmitted early Chinese texts. Compare the following three examples:
聽之不聞其聲, ‘[…] When you listen to it but do not hear its sound;
視之不見其形, when you look at it but do not see its shape, […]’
Zhuangzi 莊子 14/3
視而不見, […] to look and not see,
聽而不聞, to listen and not hear […]
“Great Learning” (“Da Xue” 大學) 9
視之而弗見, ‘[…]We look for [ghosts and spirits], but do not see them,
聽之而弗聞, we listen for them and do not hear them. […]’
“Doctrine of the Mean” (“Zhong Yong” 中庸) 16
None of these examples would be found with something as straightforward as the Find command (“Ctrl+F,” or “Command+F” on Mac), since none of these perfectly fit into a literal string search (admittedly, we would find the example from the Zhuangzi by simply limiting our search to either “聽之不聞” or “視之不見”). In particular, the change of wording in the latter two examples would pose an issue in this regard, but a well-defined regular expression can help to avoid this issue completely. We can do so by first recognizing the regularity of these statements – all of them include two core statements centered around the juxtapositions of looking and not seeing, as well as of listening and not hearing – and secondly by translating this regularity into the syntactic system of regular expressions. So overall, we could extrapolate a regular expression from this regularity as follows:
視.{0,4}[不弗]見.{0,3}聽.{0,4}[不弗]聞|聽.{0,4}[不弗]聞.{0,3}視.{0,4}[不弗]見
This regular expression searches for shi 視, followed by up to four arbitrary characters, followed by either bu 不 or fu 弗, followed by jian 見 and up to three arbitrary characters, followed by the same pattern with the key terms switched to ting 聽 and wen 聞 (after the vertical bar, we also have a second pattern, which specifies that the order of the two statements may be reversed). Admittedly, this search pattern may seem long, and reading (and writing) such examples takes some practice (note that the introduction by Sturgeon I mentioned earlier also contains a number of exercises), but I personally can recommend regular expressions wholeheartedly. They have helped me speed up my own research considerably and have allowed me to look beyond individual text files and search for regularities across a wider corpus. And this is precisely where the power of regular expressions lies: in finding patterns across larger corpora, which a human reader could only slowly work through.
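As a rough illustration of the difference between a literal search and the pattern above, here is a minimal Python sketch using the standard re module. The passage labels and the helper names literal_hits and regex_hits are hypothetical. The strings are the Chinese text of the examples quoted above, each joined onto a single line, since the full stop does not match a newline by default.

```python
import re

# The pattern from the text above, split across two string literals for readability.
PATTERN = re.compile(
    "視.{0,4}[不弗]見.{0,3}聽.{0,4}[不弗]聞"
    "|聽.{0,4}[不弗]聞.{0,3}視.{0,4}[不弗]見"
)

# The passages quoted above, each on one line. The labels are illustrative only.
passages = {
    "Zhuangzi 14/3": "聽之不聞其聲,視之不見其形",
    "Da Xue 9": "視而不見,聽而不聞",
    "Zhong Yong 16": "視之而弗見,聽之而弗聞",
    "Laozi 14": "視之不見,名曰夷;聽之不聞,名曰希",
}

def literal_hits(texts, needle):
    """Labels of passages containing the exact string (what Ctrl+F would find)."""
    return [label for label, text in texts.items() if needle in text]

def regex_hits(texts, pattern):
    """Labels of passages in which the regular expression matches somewhere."""
    return [label for label, text in texts.items() if pattern.search(text)]

literal = literal_hits(passages, "視之不見")
regex = regex_hits(passages, PATTERN)
print("literal:", literal)
print("regex:  ", regex)
```

The literal search finds only the Zhuangzi and Laozi wording, whereas the pattern also catches the “Great Learning” and “Doctrine of the Mean”. One caveat worth keeping in mind: because the full stop also matches punctuation, results depend on how the corpus is punctuated. In the Laozi 14 opening as quoted above, five characters (including the comma and semicolon) separate 見 from 聽, which exceeds the “.{0,3}” window, so the pattern as written does not match that line; widening the window would be one way to include it.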
Bibliography
William H. Baxter and Laurent Sagart, Old Chinese: A New Reconstruction (Oxford: Oxford University Press, 2014).
Michael Hunter, Confucius Beyond the Analects (Leiden: Brill, 2017).
