Lu Wang’s recent two-piece series (part 1 and part 2) on using Voyant Tools to analyse Chinese language texts captured my imagination and I wanted to explore the possibilities of using Voyant Tools to analyse texts written in Japanese.
I decided to run some texts through Voyant Tools to see what results I would get. I selected Kinoshita Naoe’s Jiyū no shito Shimada Saburō 自由の使徒・島田三郎 and his Sano dayori 佐野だより, which are both related to my current research project, The Ashio Copper Mine Incident: Socio-Political Causes and Reactions, funded by the Mining History Association, and are both available through Aozora Bunko.

After opening the texts in Voyant Tools I was disappointed to see that the majority of the high-frequency “terms” that appeared in its Cirrus, TermsBerry, and Summary windows were single hiragana characters rather than parts of the text that we might identify as words. In the case of Jiyū no shito, for example, the most frequent “terms” were te て, ta た, shita した, sensei 先生, and fu ふ. Seeing this, I realized that the way Voyant Tools segments Japanese is similar (though not identical) to the way that the Python library fugashi segments Japanese text – the segmentation is based on lemmas rather than words. In other words, auxiliary verbs and suffixes are often split from the word that they modify. Readers with some knowledge of Japanese will recognize that te, ta, shita, and fu are all different verb endings. Incidentally, particles also often feature as frequent terms. In Jiyū no shito, for example, the joint tenth most frequent term is the particle to と, with 52 instances in the text.
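The effect is easy to reproduce in miniature. The sketch below uses only the Python standard library; the token list is a hypothetical hand segmentation imitating the kind of splitting described above (it is not actual Voyant or fugashi output), and it shows how verb endings and particle fragments climb to the top of a frequency count.

```python
from collections import Counter

# An illustrative hand segmentation of a short classical-style sentence,
# split roughly the way described above: the verb stem 云 is separated
# from its archaic ending ふ, and particles stand alone.
tokens = ["先生", "は", "さ", "う", "云", "ふ", "。",
          "生徒", "も", "さ", "う", "云", "ふ", "。"]

# Drop punctuation before counting, as Voyant Tools does.
counts = Counter(t for t in tokens if t not in "、。")

# Fragments like 云 and ふ now outnumber the content word 先生.
print(counts.most_common(4))
```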
Note the large amount of hiragana in the word cloud.
Details on the text in the Summary tab.
Whilst the same issues exist when we use Voyant Tools with modern Japanese texts, the problem is exacerbated in texts written before the Second World War. We can see this with the word iu 云う, which is rendered in its archaic form iu (ifu) 云ふ in Jiyū no shito. While Voyant Tools recognises the modern rendering as a single word, it divides the archaic form into its constituent parts i 云 and fu (u) ふ. This issue also extends beyond auxiliary verbs and particles – Voyant Tools tends to struggle with old character forms (J. kyūjitai 旧字体). For instance, in Sano dayori, Kinoshita renders the term kōdoku 鉱毒 with the old character form kōdoku 鑛毒, which Voyant Tools divides into the two terms kō 鑛 and doku 毒, thereby skewing the results that it provides.
I feared that this might mean that Voyant Tools was useless for analysing Japanese texts. Having read Lu’s pieces in the Digital Orientalist, I was aware that she had faced difficulties with Chinese punctuation and had cleaned her data beforehand, but I was wary of removing auxiliary verbs and particles rendered in hiragana from the text. I decided to search for possible solutions to the problem and came across a 2016 blog post on the topic by the well-known Japanese digital humanities scholar Nagasaki Kiyonori on his site digitalnagasaki.
Nagasaki notes that Voyant Tools’ processing of auxiliary verbs and particles may provide us with useful information on the characteristics of the author’s writing style, but that it does not allow us to easily analyse a text’s content. Reading this statement, I realized that the benefits of removing auxiliary verbs and particles from the text may outweigh the negatives. Although we must ignore the statistics that Voyant Tools provides on the total number of words, total number of unique word forms, and average words per sentence since, as noted above, the app classifies some lemmas as words, removing auxiliary verbs and particles can help us to see different patterns within our texts. Users can tell Voyant Tools to ignore certain words or characters by adding these to the Stopwords list in its different tabs (Click “Define options for this tool” and then “Edit list” next to the Stopwords category).
Finding the Stopwords list.
Nagasaki provides us with a list of Stopwords that he applied to his sample texts (a collection of University webpages). I inserted these into my Stopwords lists and noticed an instant improvement in results.
Improvement in the word cloud for Jiyū no shito.
Nevertheless, there was a need to add additional Stopwords based on the auxiliary words and particles commonly used in historical Japanese, such as ki き, no 之, nu ぬ, taru たる, tte (tsute) つて, tta (tsuta) つた, etc. For good measure, I also included all stand-alone hiragana characters and words such as fairu ファイル, which does not appear in the text itself but within its metadata. If one does some additional preprocessing of the data, or scrapes Aozora Bunko to gather the data, terms that appear in the metadata can be eliminated easily.
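For readers who want to try such preprocessing, Aozora Bunko plain-text files carry ruby readings in 《 》 brackets (with ｜ marking where a reading begins) and editorial annotations in ［＃ ］ brackets, all of which would otherwise show up as noise in Voyant Tools. The following is a minimal sketch of how one might strip them; `clean_aozora` is a hypothetical helper, not part of any workflow described above, and the file’s header and footer metadata would still need to be trimmed separately.

```python
import re

def clean_aozora(text: str) -> str:
    """Strip common Aozora Bunko markup before uploading a text to Voyant Tools."""
    text = re.sub(r"《[^》]*》", "", text)    # ruby readings, e.g. 鉱毒《こうどく》
    text = text.replace("｜", "")             # marker for the start of a ruby run
    text = re.sub(r"［＃[^］]*］", "", text)  # editorial annotations
    return text

# A hypothetical snippet in Aozora Bunko style.
sample = "鉱毒《こうどく》の被害｜甚《はなは》だし［＃「甚」に傍点］"
print(clean_aozora(sample))  # → 鉱毒の被害甚だし
```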
Note the prominence of the word fairu ファイル in this word cloud.
I also removed some numbers, which seemed to be taking over the visuals due to their high frequency, and the character i 云 due to the aforementioned issue. Now I could finally see some patterns in the texts!
The word cloud for Sano dayori following all preprocessing.
Looking at Sano dayori we see the prevalence of words such as Ashio 足尾 (a place name) and higai 被害 (E. Damage). Turning to some of the other tools and tabs offered by Voyant Tools, such as TermsBerry, we notice that the term higai is usually related to land (chi 地) or people (jinmin 人民).
TermsBerry for Sano dayori.
In Jiyū no shito, on the other hand, the most popular terms are sensei 先生 (E. Teacher) and shakai 社会 (E. Society). We can see that another frequent term, jōyaku 条約 (E. Treaty), is often linked to the terms kaisei 改正 (E. Revision) and reikō 励行 (E. Enforcement or carrying out regulations).
TermsBerry for Jiyū no shito.
Voyant Tools, therefore, appears to work fairly well for analysing patterns in the content of Japanese texts so long as the user is willing to do some processing of their data. Despite this, not all of the platform’s tools function well with Japanese due to the way it segments the language.
 An example from the text is Kanjimashita 感じました.
Fu ふ is an old rendering of u う, as in the term iu 云ふ.