Defining word boundaries for Modern and Classical Chinese

Basic corpus analysis is an amazing way to start exploring digital humanities. What could be easier? One just needs to learn a couple of terms like “n-grams” and “key word in context,” download one of the readily available texts and then explore it with any of the free and user-friendly tools like Voyant Tools or AntConc.

For many people this is the first step into computational approaches, one that requires no knowledge of programming and only a vague understanding of statistics. That is, for people who study languages that separate words with spaces, or that at least have a somewhat established understanding of what a word is. Studying the behaviour of words without knowing where one word ends and another begins is, as it turns out, pretty hard.

This is the case for Chinese, be it Modern, Old (OC) or Middle (MC). I will not dive into definitions of wordhood, but those interested might want to check the extensive “Encyclopedia of Chinese Language and Linguistics” or “New Approaches to Chinese Word Formation” (Packard, Jerome L. 1998. New Approaches to Chinese Word Formation: Morphology, Phonology and the Lexicon in Modern and Ancient Chinese. Berlin, New York: De Gruyter Mouton). Long story short, defining words is easier for Modern Chinese because it has a larger number of bound morphemes and inseparable sequences of characters. Conversely, it is much harder to define words in pre-modern forms of Chinese.

So how does tokenization (sometimes called segmentation), the bulk recognition and markup of word boundaries, work? Luckily, for practical purposes, digital approaches are forgiving of some omitted details. A tokenizer can be used as long as four conditions hold: a Chinese word is loosely defined as a sequence of characters lexicalized beyond a certain threshold; the majority of words are recognized in a way that contributes to research; the systematic errors are accounted for; and the error margin can be estimated.

There are many tools available, and I will talk about some of them below. Most are only appropriate for working with Modern Chinese. Pre-modern Chinese is much more frustrating (estimating the error margin is already complicated, since one needs a large universally accepted corpus with marked word boundaries for that), but there are some workarounds that I will show.
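To make "estimating the error margin" a bit more concrete, here is a minimal sketch of one common way to do it: comparing the word boundaries a tokenizer produces against a hand-marked reference. The function names and the tiny "gold" segmentation below are invented for illustration, and the gold segmentation is not taken from any accepted corpus.

```python
def boundaries(tokens):
    """Return the character offsets where one token ends and the next begins."""
    positions = set()
    offset = 0
    for token in tokens[:-1]:  # the end of the text is not a real decision
        offset += len(token)
        positions.add(offset)
    return positions


def boundary_scores(gold_tokens, predicted_tokens):
    """Precision, recall and F1 of predicted boundaries against a reference."""
    if "".join(gold_tokens) != "".join(predicted_tokens):
        raise ValueError("both segmentations must cover the same text")
    gold = boundaries(gold_tokens)
    predicted = boundaries(predicted_tokens)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 1.0
    recall = correct / len(gold) if gold else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical hand-marked reference versus a one-character-per-word split.
gold = ["天下", "皆", "知", "美", "之", "為", "美"]
predicted = list("天下皆知美之為美")
precision, recall, f1 = boundary_scores(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Splitting every character finds every real boundary (perfect recall) but also invents one inside 天下, which is why precision drops. The numbers are only as trustworthy as the reference segmentation, which is exactly the difficulty with pre-modern texts.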

One easy possibility is to use SegmentAnt – a tool made by Laurence Anthony, the creator of the famous AntConc. SegmentAnt is a user interface for several popular tokenization algorithms such as jieba and NLPIR. This is the best option for those working with Modern Chinese.

However, for those who work with pre-modern Chinese and want a bit more control, as well as access to tools that do not have a user interface, I have created a short guide that compares different algorithms and provides the code to use them. The list is not comprehensive and only contains algorithms that can be used in Python. The algorithms are:

  1. Treating each character as a separate word. Terrible idea for Modern Chinese, but a popular option for OC and sometimes MC. This is a really big compromise, especially when talking about Middle Chinese and/or prose – it is hard to deny that there was a large number of lexicalized compounds that are close to what we call a word. But this is a sensible approach that is easy to reproduce, and I use it in my research as well.
  2. Two tools for the tokenization of Modern Chinese: jieba and hanlp. Both use advanced technology and allow for a lot of customisation. They can be adjusted to work with OC and MC, but this requires training them on an already marked text.
  3. Udkanbun. This tool was created for building dependency trees in wenyan, and it tokenizes the text as a step towards that goal. In my experience it is good for OC and less so for later periods, since it prioritises dependencies and will define words as one character long whenever possible.
  4. My own algorithm that tokenizes a text based on a user dictionary – it will crawl through a text and try to match what it sees to dictionary entries. This approach returns good results for all kinds of Chinese and is easy to improve by using a dictionary well-suited to the text. My code is much shorter and easier to use than most dictionary-based tokenizers I have seen so far and is optimised to work very quickly (5 to 10 seconds for the entire Quan Tangshi).  
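To give a feel for options 1 and 4, here is a minimal sketch of per-character splitting and of a generic dictionary lookup using greedy forward maximum matching. This is not the code from the guide: the function names, the matching strategy and the three-entry dictionary are assumptions made purely for illustration.

```python
def tokenize_by_character(text):
    """Option 1: every non-whitespace character is its own token."""
    return [ch for ch in text if not ch.isspace()]


def tokenize_with_dictionary(text, dictionary, max_len=None):
    """Greedy forward maximum matching against a set of known words.

    At each position the longest dictionary entry that fits is taken;
    if nothing matches, a single character is emitted as a fallback.
    """
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    max_len = max(1, max_len)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink down to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical mini-dictionary; a real one would be suited to the text.
dictionary = {"春眠", "處處", "啼鳥"}
line = "春眠不覺曉處處聞啼鳥"
print(tokenize_by_character(line))
print(tokenize_with_dictionary(line, dictionary))
```

Storing the dictionary as a Python set keeps each lookup cheap, so the running time depends mainly on the length of the text and of the longest entry. Greedy matching can still choose the wrong split when entries overlap, so the quality of the output depends heavily on the dictionary.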

There is code in the guide, but fear not: using the tools for bulk tokenization requires no knowledge of, or even interest in, programming. All one needs is a Google Drive account. The guide is written in Google Colab, a handy application that allows a user to run code without installing anything on their computer.

To start with the guide, do the following:

  1. Download the whole folder here or here and then upload it to your own Drive (unfortunately, one cannot directly copy a shared folder to one’s account in Drive, but if you do not want to try your own files, you can just view the guide in the shared folder).
  2. Right-click the Tokenization.ipynb file, choose “Open with” ⇒ “Google Colab.”
  3. There are detailed instructions inside the file. Please read them carefully: you will need to change several lines of text, and some of the cells must be run for the rest to work.
  4. Enjoy (and if you use my code for your research, please find a way to mention me)!
