The headache of working with archaic Chinese orthographies

Technologically, we scholars of the 21st century are spoiled. We sit at our desks, turn on our shiny machines, and access digital archives and online databases. We work with laptops knowing very little of what goes on underneath their keyboards, and we define them intelligent –perhaps precisely because we do not understand them. Surely more can be done to digitize available work, but it was not that long ago that turning pages in material books was the only way to retrieve data.

Then the digital entered our lives, with all its many pros and a few cons. A big con in working with premodern manuscripts is the inability of typesetting ancient orthographies without corresponding glyphs in fonts. In fact, the problem existed before laptop’s keyboards did. Scholarly publications such as Guwenzi yanjiu 古文字研究 (Paleographic Research) were until recently handwritten to obviate this problem.

To us digitally-spoiled 21st century scholars, typing characters in non-alphabetic languages is easy: switch input sources, type in the value, and a nice window drops down with a lot of Chinese characters or Japanese kanji. To write “star” in Chinese you type its pinyin value xing (if this is the input method you follow), and you choose among a series of characters: 性、行、醒、型 etc., until you hit 星 and select it. The problem of typing Chinese language with alphanumeric keyboards thus has been fully solved. Why not push the solution further to incorporate ancient variants, by adding a few more choices in the drop-menu? Or, why not just make a font of ancient Chinese graphs, and solve the issue?

To understand why incorporating archaic orthographies is cumbersome, we should understand how input systems and fonts operate. These are well explored topics, and yet I did not reflect on them until non-typable characters became part of my workflow. Assuming that I am not alone, I will offer a simplified overview of these ingenious solutions. This will help us answer the questions above.

Written characters are encoded as codepoints. “w” has a unique codepoint, U+0077; “W” another unique codepoint, U+0057; a space has codepoint U+0022, and so on. The Unicode Consortium is the organization that maintains this list of codepoints, known as the Unicode Standard. It is universally available and agreed upon to make information exchanged across systems mutually intelligible.

When we type, we rely on input devices such as keyboards. And as we can all see, typically used keyboards are composed of letters and numbers. In order to generate characters not directly available as single keystrokes or combinations thereof, such as Chinese characters, input method editors (IME) were created. In very simple terms, the IME works with tables that match input strings to codepoints and communicate with applications to turn codepoints into glyphs.

An example helps grasping the concept. My goal is to type in my Word page the word “not” in Chinese, i.e. bu 不. I switch input method by selecting Chinese language, and, since I set my input method to be pinyin based, I type the keystrokes “b” and “u.” The IME receives this string and retrieves all codepoints of Chinese characters with this romanization. At this point, there is a new list of codepoints: all the codepoints that correspond to the string “b-u”. This is one way to visualize the process:

We have a list of codepoints. These are not randomly distributed: the list is organized so that codepoints of high frequency appear first. At this point, the codepoints are used to retrieve glyphs stored in fonts. In simple terms again, fonts are lists where each codepoint has a corresponding glyph, with embedded instructions for how to draw it, resize it (when you change font size), scale it, how to make it look bold vs. regular, etc. If you open a font with applications such as FontForge, you can look at the outlines that determine the appearance of glyphs (I give the example of 州 in the image below):

Returning to our example, when all the codepoints corresponding to the string “b + u” are found by the IME and matched to the codepoints in a font, the result is a nice window that represents all the Chinese characters corresponding to the input “bu”.

(All of this back and forth between underlying data and interface, incidentally, is orchestrated so beautifully in our machines that we can type, delete, and edit in just a few seconds.)

Let’s return to our questions now. Why not apply the same logic to incorporate ancient variants, by adding more choices in the drop-menu? Let’s say I would like to find among the options for “bu” also an ancient graph from oracle bone inscriptions (circa 1300 BCE), such as .

You would have to create an outline for the glyph , and assign it to a codepoint. The codepoint may be from Unicode Private Use Areas, namely sets of codepoints defined by the Unicode Consortium for third party use. After that, you would also need to modify the IME table, adding the new codepoint for , and associating it with the string “b + u” being pursued. The new codepoint is represented in red in the image, with a version of  created in FontForge for this piece:

While IME tables are generally not easily modifiable on Mac or Windows, Linux offers more flexibility, and users have taken advantage of this. Here for example you have instructions to create a custom IME for Mongolian with Linux.

So this operation is doble. However, there are complications. For one, drawing glyphs is not easy. It requires attention to aesthetic rules to unify the end-result, and it is therefore time consuming for a glyph that you may not use as much as you think. Vector fonts, which have by and large replaced bitmap fonts, rely on vector graphics, a technology to create visual images that can be freely resized without losing their visual qualities. Designing vector fonts requires graphic design competence.

Lacking this competence, you may get around the drawing of glyphs by merging together components from different existing glyphs, such as the authors of this script to type archaic Chinese characters from manuscripts are doing, but you may enter into the realm of copyright infringement by reproducing even parts of a font design. Whether you draw your glyph or create it with a copy-and-paste, remember that to have it appear in the drop menu you would also have to modify the IME list, and ask that everyone you work with to do the same.

A process that requires less global modification would be the creation of a font with instructions to overwrite glyphs for those archaic graphs that have corresponding characters in the existing sets of characters. In my experimental font, I created a glyph for  to over-write the regular不:

Again however we run into problems. There are multiple archaic graphic forms that can be used for 不, and deciding which one to represent is a completely arbitrary operation. Secondly, there are several graphs that write more than one word, and again graphs without a consensus on what they represent. Even if we had a magic one-to-one correspondence between graphs and words, creating fonts would be far from a solution that works universally. It would lead to a proliferation of fonts by individual scholars that contain just a few glyphs.

Scholars of ancient Chinese are not alone in struggling with fonts and graphic representations. In ancient Egypt studies, the problem was solved by a standardized system based on the transliterations of hieroglyphs with subsequent additional tools; Cuneiform writing instead was eventually added to the Unicode family. Mayan glyphs, on the other hand, are still designed by individuals (Wikipedia lists a tentative range of not-yet standardized Unicode for the Mayan script). In these cases, however, we move in the realm of one thousand distinct logographs. With Warring States (453-221 BCE) manuscripts alone, we are long past the 1,000 archaic distinct forms already, and there are hundreds of strips yet to be published.[1] For several of these unattested graphs there are long debates about how exactly they are being written. In 2020, a scholar going by the name Snow Listener published a list of 916 un-encoded glyphs. Notes are often added to clarify or disambiguate the structure of these graphs, attesting to the complexity of digitizing this material.

Unattested graphic structures thus present a great challenge to information processing, typesetting, and aesthetic representations. Yet we have seen that there are two methods through which the feat could be accomplished. Suppose that we had a workforce of graphic designers, scholars, and assistants to map graphs to words, create glyphs, and list codepoints. Should we not arm ourselves and chase the Jabberwock? In my opinion, no. Still too little is known about the Old Chinese language and ancient scripts. The risk is that of building a structure that will need considerable reworking in five-year time, or that will be abandoned. What we need in my opinion is enriching existing structures, like those I have previously reviewed, and continue to work with thumbnails and transcription with the existing, typable orthographies until we have a better command of the entire corpus of ancient manuscripts. Meanwhile, we can educate ourselves on the workings of IME and the logic behind what seems just a matter of fonts, especially given existing conversations about the digitization of characters previously unavailable as a form of soft power from China, such as in this puzzling piece by Jing Tsu.

[1] After that, we should work with ancient bronze inscriptions (10th BCE – 221 BCE), early imperial manuscripts (221 BCE – 220 CE), Dunhuang manuscripts… and so on.

2 thoughts on “Just give me a font!

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s