My previous posts introduced a series of online resources and databases that have become essential tools for the study of unearthed documents from early China, and more broadly for the study of ancient Chinese texts. What I introduced just scratched the surface, for many more databases are out there to be discovered and used. In truth: too many. In this piece, I want to reflect on some of the downsides related to these digital tools.
One is, as just mentioned, the avalanche of databases that routinely pop-up in the websphere. Back in 2009, when the DH were not as developed as they are now, Diane M. Zorich questioned the value of DH, a field that she consider to be constantly creating redundancy. Zorich’s observation was not unfounded, and after another decade, the situation could still be described as such: among the databases I regularly use in my work, there is a good deal of data redundancy. For example, the explanation from the Shuowen jiezi 說文解字 for wo 我 (“I; we”) provided in several databases is also found in Shuowenjiezi.org, a website dedicated entirely to this ancient dictionary. However, the latter adds images and annotations by Song and Qing scholars (and has a section for comments at the very bottom).
So, each database repeats but also adds an extra “something” that makes it unique, and thus worth looking into. This is however far from an ideal arrangement, both for developers and users. From what I have gathered, each of the websites discussed in my previous posts came to be as a result of work by individuals or educational centers, but without much communication among them. Educational centers by and large do not make data available for others, so individuals scrape data that already circulates in the internet, to reuse it with more or less degree of freedom. With more transparency and accessibility of data, anyone (be it individuals or universities) developing new databases would be able to begin with a full understanding of what is already being done, so as to create new tools that avoid too much repetition.
Lack of accessibility of data goes hand in hand with lack of transparency about sources and programs used to feed and run databases (which could be made extremely controversial in cases where the work behind databases is funded with public money, such as it was for the CHANT database). With some exceptions, such as the Guyin xiaojing 古音小鏡 (a website dedicated to historical linguistics), the already discussed Complete Collection of Ancient and Modern Characters, and Xiaoxue tang 小學堂, the majority of databases do not provide a comprehensive list of references which determines their content. For example, from where are the explanations provided in the Multi-function Chinese Character Database taken? According to which of the many existing shiwen 釋文 of Chu manuscripts does the BSM database determine which word is a graph writing? Lack of transparency about all these components keep databases isolated, from each other and from users, and prevent “decentralization of responsibility,” which Eric van Lit saw as a solution to redundancy and other problems in the DH world.
Similarly, the algorithms running behind databases are almost never public. While the latter may be of less interest to users who are merely looking for a way to speed up their work, it would be incredibly useful to anyone wishing to build new resources and young scholars entering the field with an eye on DH. Even for apparently simple operations such as identifying textual parallels, understanding according to what criteria parallels are identified would be helpful to anyone. Consider the follow example from the well-known Ctext database:
Ctext identifies three parallels for the Analects 1.6 sentence “弟子入則孝，出則弟,” nicely highlighting in colors the differences among them. We can only guess that this parallel is identified because of the shared sentence “入則孝，出則” (and arguably the close relationship between di 弟 and ti 悌). The criteria for this result are not clear (how many characters must be the same, and how many may be phonetic loans, for a sentence to be considered parallel?), nor is the algorithm that leads to this result. This would be of great help in creating new digital tools working with a language where phonetic loans and variants recur frequently, even more so with the new evidence from bamboo manuscripts.
The lack of transparency about what operates behind website interfaces concerns fonts too, which leads us to our final point of discussion, namely the accessibility of these sources for people with disabilities. It is not uncommon for Sinologists to bump into websites that require the installation of specific fonts (not always easy to find) in order to display their content properly; or that alternate Chinese script to thumbnails of words that are not typable, or to characters created ad hoc by the author of the piece. This tentative solution until new fonts are available and released straightforwardly bars the use of screen readers such as ChromeVox.
Recently, The Zhonghua publishing company created a massive Database of Chinese Classics 中華經典古籍庫(accessible via subscription), releasing online an incredible number of Chinese texts. Most likely for copyright protection, a choice was made to encode its entire content so that, while the text is displayed correctly, the actual strings of words underneath it is not what appears on the page. Copying and pasting in fact results in a nonsensical string of words. If the page loads slowly, you see the change from the nonsensical string into the correct text:
As one may expect, using ChromeVox on this page is pointless, since the value picked up by ChromeVox is the nonsensical text underneath what is displayed as the Analects. Similarly, some components of Humanum are not set to be recognized by screen readers, so that only a relative portion of the page can be identified.
This situation has no easy solution, and it is exacerbated by the continuous appearance of unattested graphs in bamboo manuscripts –another headache for those who operate in the field. However, little to no discussion is happening on these topics, and users are kept at arm’s length instead of being involved in the construction of more friendly databases that keep into consideration more diverse needs. The good news is that DH are in continuous expansion, and joining forces across the world becomes easier by the day.
I am grateful to Jerry You and Iris Melissa Pang for talking with me about databse-building and accessibility tools.
 Zorich, D.M. “Digital Humanities Centers: Loci for Digital Scholarship.” pp. 70–78 in Working Together or Apart: Promoting the Next Generation of Digital Scholarship. Washington, D.C.: Council on Library and Information Resources, 2009, 71.
 Mukta Kulkarni, Digital accessibility: Challenges and opportunities, IIMB Management Review, 31.1, 2019: 91–98.