Japanese Text Segmentation and Analysis with Web ChaMame

In recent years, numerous open-source tools for the segmentation and morphological analysis of Japanese text have been developed, including Kuromoji, Nagisa, and KyTea. Although useful for modern Japanese, few of these tools can deal with historical texts. For example, Kuromoji fails to provide readings for a number of historical kanji 漢字 (E. Chinese characters) and struggles with common historical vocabulary such as the term gozasōrō 御座候 (E. to be), which it splits into the separate terms gyoza 御座 (E. throne) and 候 (E. weather).

Figure: Example of Kuromoji’s failure to provide readings for some kanji.

One exception is Web ChaMame (Web 茶まめ),[1] developed by the National Institute for Japanese Language and Linguistics (J. Kokuritsu Kokugo Kenkyūsho 国立国語研究所). It deals with historical texts relatively well in comparison to similar software because the user can select different dictionaries, contemporary and historical, with which to analyze the input text. The current iteration of Web ChaMame offers some limited preprocessing options, a choice of ten different dictionaries, a variety of outputs, and the ability to view and/or download the results in HTML, CSV, or Excel formats. The dictionaries include those for written and spoken Japanese in the modern period (J. Gendai 現代), early modern period (J. Kindai 近代), and Middle Ages (J. Chūsei 中世), as well as for written Early Middle Japanese (J. Chūko wabun 中古和文) and Old Japanese (J. Jōdai 上代). However, only two dictionaries can be used to analyze a text at any one time. The user can choose up to 24 different types of analytical output, including lexemes (J. Goiso 語彙素), the readings of lexemes (J. Goiso yomi 語彙素読み), word class (J. Hinshi 品詞), conjugated form (J. Katsuyōkei 活用形), and the classification of words by their origin (J. Goshu 語種).

Figure: The array of preprocessing options, dictionaries, and outputs that can be used to analyze a text on Web ChaMame.
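Since the results can be downloaded as CSV, they can be inspected further outside the browser. The following is only a minimal sketch in Python: it assumes that the export has a header row and that the column names match the output labels mentioned above (語彙素, 語彙素読み, 品詞). The real column names and file encoding may differ, and the sample rows are invented for illustration rather than copied from Web ChaMame.

```python
import csv
import io
from collections import Counter

# Invented sample standing in for a downloaded CSV export.
# Assumption: a header row whose names match the output labels.
SAMPLE_CSV = """語彙素,語彙素読み,品詞
御座,ゴザ,名詞
候,ソウロウ,動詞
成る,ナル,動詞
"""


def count_word_classes(csv_text, column="品詞"):
    """Count how often each value appears in the given column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


counts = count_word_classes(SAMPLE_CSV)
for word_class, n in counts.most_common():
    print(word_class, n)
```

For a real download, the text would be read from the file instead of the sample string, with whatever encoding the export actually uses.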

For this article, I test Web ChaMame on two texts. The first is a letter found in my own collection. The letter dates from 1823 CE and is the topic of a research note that I published last year entitled “Bunka Bunsei Jidai ni okeru Sendaihan no Kawadake to Kirishitan Teppō Aratame Yaku ni kansuru shiryō ni tsuite,” 文化文政時代に於ける仙台藩の河田家と切支丹・鉄砲改役に関する資料について in Fukushima Kōgyō Kōtō Senmon Gakkō Kenkyū Kiyō 福島工業高等専門学校紀要, Vol. 59 (2018), pp. 187-190. The text reads as follows:

Figure: Text from a letter dated 1823 CE.

I chose the dictionaries for written early modern and Middle Ages Japanese to analyze the text. The first thing that I noticed was that the software doesn’t deal well with names. The name of the author of the letter, Kawada Shihee 河田四兵衛, for example, is rendered as Kawata Yon Hyōe because the software applies alternative readings to the kanji in his name. Furthermore, like Kuromoji, the software appears to struggle with some common historical vocabulary. The aforementioned term gozasōrō 御座候 is rendered as goza 御座 and 候, whereas the term makarinaru 罷成 (E. to reach a certain state) is rendered as yamu 止む (E. to cease) and naru 成る (E. to become) when analyzed with the early modern dictionary. Although some terms therefore suffer from unnatural segmentation and analysis, many terms are analyzed and transposed with some level of accuracy. More importantly, unlike Kuromoji, the software provides readings for each character even if they are historical or obsolete. A further advantage of the software is that when a user selects two dictionaries to analyze the text, it highlights (in red) entries where the data from the two dictionaries differ. As such, users can find potential mistakes and check the accuracy of the data following the segmentation process.

Figure: The rendering and analysis of Kawada Shihee’s name on Web ChaMame.

Figure: Web ChaMame’s rendering and analysis of the term makarinaru 罷成.
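The idea of flagging discrepancies between two dictionaries can be sketched in a few lines of Python. This is not how Web ChaMame itself works internally, which I have not seen; it is only an illustration of the principle. Each analysis is treated as a list of (surface form, reading) pairs, and any stretch of text where the two analyses disagree on segmentation or reading is reported. The first analysis follows the early modern result described above, while the second is invented for the sake of the example.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (surface form, reading)


def spans(tokens: List[Token]) -> Set[Tuple[int, int, str]]:
    """Convert tokens into (start, end, reading) character spans."""
    result = set()
    pos = 0
    for surface, reading in tokens:
        result.add((pos, pos + len(surface), reading))
        pos += len(surface)
    return result


def discrepancies(a: List[Token], b: List[Token]) -> List[Tuple[int, int]]:
    """Return the character spans on which the two analyses disagree."""
    differing = spans(a) ^ spans(b)  # spans found in only one analysis
    return sorted({(start, end) for start, end, _ in differing})


# Analysis of 罷成候 as described for the early modern dictionary.
early_modern = [("罷", "ヤム"), ("成", "ナル"), ("候", "ソウロウ")]
# Hypothetical second analysis, invented for illustration only.
other = [("罷成", "マカリナル"), ("候", "ソウロウ")]

flagged = discrepancies(early_modern, other)
print(flagged)
```

Here the spans covering 罷成 are flagged because the two analyses segment and read them differently, whereas 候 is not flagged because both analyses agree on it.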

The second text that I used to test Web ChaMame was an anti-Kirishitan (E. Christian) edict from 1682 CE, which I included in the appendix of my doctoral thesis, Rethinking the History of Conversion to Christianity in Japan (2018), pp. 309-310.

Figure: The text of the 1682 CE anti-Kirishitan edict.

The software dealt much more easily with this text than with the aforementioned letter, and as such, some of the above-noted issues were less prominent. I noticed, however, that the software didn’t understand some specialized, subject-specific vocabulary. It didn’t recognize the term hateren はてれん, an alternative spelling of the term bateren バテレン (E. Padre), and therefore divided the term into separate words. It must be noted, however, that the common historical spelling of the term is recognized by the software. Similarly, the term iruman いるまん (E. Religious Brother) was not recognized as a stand-alone term, and neither was its traditional rendering 伊留満 when I tested that as well.

Figure: Web ChaMame’s rendering and analysis of the term iruman 伊留満.
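One way of working around unrecognized specialist vocabulary is to post-process the output against a personal glossary. The Python sketch below is my own illustration and not a feature of Web ChaMame: it rejoins runs of adjacent tokens whenever they spell out a term in a user-supplied glossary. The way in which はてれん is split in the example is hypothetical, since the actual division depends on the dictionary chosen.

```python
from typing import List, Set


def merge_known_terms(surfaces: List[str], glossary: Set[str]) -> List[str]:
    """Rejoin adjacent tokens that together spell a glossary term."""
    merged = []
    i = 0
    while i < len(surfaces):
        match_end = None
        # Try the longest run of two or more tokens starting at i first.
        for j in range(len(surfaces), i + 1, -1):
            if "".join(surfaces[i:j]) in glossary:
                match_end = j
                break
        if match_end is None:
            merged.append(surfaces[i])
            i += 1
        else:
            merged.append("".join(surfaces[i:match_end]))
            i = match_end
    return merged


# Hypothetical over-segmented output, invented for illustration.
tokens = ["は", "てれ", "ん", "門徒"]
glossary = {"はてれん", "いるまん", "伊留満"}

repaired = merge_known_terms(tokens, glossary)
print(repaired)
```

This only repairs the segmentation; readings and word classes for the rejoined terms would still have to be supplied by hand.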

Overall, I believe that Web ChaMame is potentially useful for scholars working with Japanese historical texts and has several advantages over alternative Japanese text segmentation software. The software receives semi-regular updates and has the potential to become a highly useful resource for morphological analysis. Nevertheless, users must be aware of potential errors and be willing to check the accuracy of the software’s outputs.

Further Reading:

Kawaguchi Motoharu, Komoda Ryuki, and Tsutsumi Tomoaki, “Improvement of ‘Web ChaMame’ and experimental production of ‘Web ChaMame Web API’,” in Proceedings of Language Resources Workshop, Vol. 1 (2017), 265-272. (https://doi.org/10.15084/00001481).

Footnotes:

[1] An unsecured version with a few additional features also exists. For those interested, follow this link.
