Inspired by a workshop given by Dr. Alex Mallett (Waseda University) that I attended in May last year, I recently began experimenting with Transkribus, a platform that, amongst other things, allows users to create AI-powered text recognition models. Transkribus works well for creating text recognition models for a single person’s handwriting or a single type of printing. For my initial experiments with the platform, I therefore decided to use texts printed by Roman Catholic missionaries in Japan in the 16th and 17th centuries, which are commonly known as Kirishitanban. My goal in this article is not to provide instructions on how to make a model with Transkribus, for which there are extensive guides on the platform’s website, but rather to explore the outcomes and limitations of my experiments.
My initial goals were to see what Transkribus was capable of and to create a model that might be useful in my research. As such, I searched for texts that were related to my research and that already existed online in both photographed and transcribed form. I settled on a selection of random pages and their transcriptions from the Amakusa edition (1592-1593) of the Heike monogatari, Isoho monogatari, and Kinkushū, which has been digitized by the British Library (BL) and the National Institute for Japanese Language and Linguistics (NINJAL). The text is written in the Japanese language, but in Latin script according to Romanization conventions created by the Portuguese and the Jesuits present in Japan at the time. The photographed text can be viewed here and the transcribed text can be accessed here.
The homepage for the digitized version of the Heike monogatari, Isoho monogatari, and Kinkushū.
The model was trained for 80 epochs on 40 random pages consisting of 6,544 words (1,030 lines) of text. As readers will see in the graph below, the model had a 0.14% Character Error Rate (CER, the percentage of characters that the model transcribed incorrectly) for the training dataset and a 1.29% CER for the validation set. It is, therefore, a highly accurate model.
Data on the training of the first model.
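For readers unfamiliar with the metric, the following is a minimal Python sketch of how a CER of this kind can be computed. It assumes the common definition (the edit distance between the correct text and the model's output, divided by the length of the correct text); I have not confirmed that Transkribus calculates its figures in exactly this way. The function names and the two sample strings are illustrative only.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the reference length (assumed definition)."""
    # Normalise so that an accented letter counts as one character
    # whether it is stored precomposed or as letter + combining mark.
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Illustrative strings only: a hypothetical correct transcription and a
# hypothetical output that drops one grave accent.
reference = "Fido no ribai vo tòri"
hypothesis = "Fido no ribai vo tori"
print(f"CER: {character_error_rate(reference, hypothesis):.2%}")
```

In this example a single missed accent in a 21-character line gives a CER of just under 5%, which shows how small a 1.29% CER on the validation set is.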
After these favourable results, I decided that I would try to improve the model by expanding the training dataset, both with additional images from the BL-NINJAL’s Heike monogatari, Isoho monogatari, and Kinkushū and with images from other Kirishitanban texts written in Romanized Japanese. I added title pages from a number of Kirishitanban and transcribed approximately 30 pages from Didaco Collado’s Niffon no Cotōbani Yô Confesion (1632), Modvs Confitendi et Examinandi (1632), and Dictionarivm sive Thesavri Lingvæ Iaponicæ Compendivm (1632), and João Rodrigues’s Arte da Lingoa de Iapam (1604). These transcriptions added a small amount of Latin and Portuguese language text (and associated characters) to the model, as well as italicised text.
The new dataset comprised 100 pages, or 16,498 words over 2,904 lines. Wanting to maximise the model’s accuracy, I chose to train the model for 1,000 epochs (a number which I rather arbitrarily picked because it was large). As the reader will see, the revised model had a lower CER for the training data at only 0.05%; however, it had a greater CER for the validation set at 2.68%. In other words, although the model is quite accurate, it was not as accurate as the first experimental model that I made. One reason for this is that whereas Heike monogatari, Isoho monogatari, and Kinkushū were printed using the same types, the transcriptions from other texts included additional and different types. Indeed, although the title pages and Arte da Lingoa de Iapam were products of the same press, Collado’s works were printed in Europe. The greater variety of types that these additional transcriptions introduced likely decreased the model’s accuracy. By looking at the validation set it is also possible to determine some general patterns. The model’s primary errors appear to stem from its transcription of accents, particularly the tilde and macron, which may not be clear in some of the texts even for a human reader. It also faces some minor difficulties with interlinear glosses.
Data on the training of the second model.
Despite all this, the model does a much better job than some of the other text recognition software available. ABBYY FineReader 11.0, which was used to automatically transcribe Collado’s Niffon no Cotōbani Yô Confesion and Modvs Confitendi et Examinandi on the Internet Archive, doesn’t appear to recognise any of the accents present in the text. The following table shows the transcription of three partial sentences from page 62 of Collado’s Niffon no Cotōbani Yô Confesion and Modvs Confitendi et Examinandi using the model built with Transkribus, ABBYY FineReader 11.0 (the text available on the Internet Archive), and the Google Docs method (described here).
| Transkribus Model (Validation Set) | ABBYY FineReader 11.0 (Internet Archive) | Google Docs Method |
| --- | --- | --- |
| Fido no ribai vo tòri motomoru va von imaximè de gozà- | Kdo no ribal vo t6ri motomoru va vonimaximedegozd- | Fido no ribai vo tori motomoru va von imaximè de gozd |
| Saiban meſaruru damiǒ no cacurete tòtta mòno vo caie- | Saiban mefaruru damid no cacurete totra mono vocaic- | Saiban mefaruru damió no cacurete tòtra mono vo caie |
| Màta fito no. vie vo iaſui xi, ſòno còto vōba varǔ ſata tò- | Mata fito no vie vo jafui xi, funo coto voba varu fatato- | Måta fito no vie vo jasui xi, sono còto vóba varů fata to |
A comparison of different automated transcriptions.
Passage from which the transcriptions were made. Didaco Collado, Niffon no Cotōbani Yô Confesion, Modvs Confitendi et Examinandi (Rome: Sacra Congregatio de Propaganda Fide, 1632), 62.
It should be clear from comparison to the original text that the model built with Transkribus provides much greater accuracy (particularly when transcribing special or accented characters) than either ABBYY FineReader or the Google Docs method. Notice, for example, that the model built with Transkribus is the only one to successfully transcribe the long s (ſ).
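The difference in how the three tools handle special characters can also be checked mechanically. The short Python sketch below counts the long s and the diacritical marks in the three transcriptions of the third line of the table. The function and variable names are my own, and the counts only show which characters each tool produced; they do not show whether each accent is the correct one, which still requires comparison with the original page.

```python
import unicodedata

LONG_S = "\u017f"  # the long s (ſ)


def count_long_s(text: str) -> int:
    """Count occurrences of the long s in the text."""
    return text.count(LONG_S)


def count_diacritics(text: str) -> int:
    """Count combining marks after splitting each accented letter
    into its base letter and its accent (NFD normalisation)."""
    decomposed = unicodedata.normalize("NFD", text)
    return sum(1 for char in decomposed if unicodedata.category(char) == "Mn")


# The third line of the comparison table, copied as each tool produced it.
transcriptions = {
    "Transkribus": "Màta fito no. vie vo iaſui xi, ſòno còto vōba varǔ ſata tò-",
    "ABBYY FineReader": "Mata fito no vie vo jafui xi, funo coto voba varu fatato-",
    "Google Docs": "Måta fito no vie vo jasui xi, sono còto vóba varů fata to",
}

for tool, text in transcriptions.items():
    print(f"{tool}: long s = {count_long_s(text)}, diacritics = {count_diacritics(text)}")
```

On this line only the Transkribus output contains any long s, and the ABBYY FineReader output contains no diacritics at all. The Google Docs output does contain some accents, but several of them (å and ů, for example) differ from those in the Transkribus column, which is why a simple count cannot replace checking against the printed text.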
There is still room for improvement, which can be achieved through rechecking the transcriptions and increasing the size of the dataset by supplementing it with additional materials from Collado and Rodrigues. In particular, I would like to increase the accuracy with which the model transcribes accents and interlinear glosses. It might also be beneficial to experiment with building the model in different ways, either by using different text recognition engines (my experimental model used HTR+) or by using a base model for the training (the Latin Portuguese Print 17th Century model developed by Hervé Baudry might provide a useful base model, for example). My ultimate goal is to be able to build some models for use with Japanese script, for particular printed styles or the handwriting of particular scribes. However, before I begin contemplating something that would constitute a much larger project, I would like to further develop my model for Kirishitanban texts. Once I have increased its accuracy further, I hope to make the model public. Watch this space!
Cover Image: Title Page of the Amakusa version of Heike Monogatari from the British Library and the National Institute for Japanese Language and Linguistics (Public Domain).