This article was written by guest contributor Xia-Kang Ziyi (Kyushu University). The author bio is below.
One of the biggest challenges facing early modern Japan researchers is the deciphering of handwritten manuscripts. The komonjo 古文書 (pre-modern documents) and kokiroku 古記録 (pre-modern records) manuscripts documenting various aspects of early modern Japanese society exist in overwhelming quantities in archives and private collections throughout Japan. However, because they were written in the obscure early modern cursive hand (kuzushiji), these archives remain very much an untapped resource. Historians often resort to either focusing their projects on a small number of manuscripts or limiting their choice of primary sources to transcripts already produced by other researchers.
While there are many reading groups in Japan working to transcribe kuzushiji manuscripts held in local archives, their reliance on human input means the pace of transcription cannot support large-scale digital humanities projects. To unlock the potential of Japan’s substantial early modern archives using digital methods, it is crucial that we teach computers to read kuzushiji.
There have been a couple of earlier attempts to tackle this challenge. The AI Kuzushiji OCR developed by the Center for Open Data in the Humanities (CODH) is one of the earliest and best-known models. Available for free both through a web platform and the miwo mobile app, the model has been a great tool for reading kuzushiji kana, but its accuracy drops significantly with komonjo texts written mainly in kanji characters. In comparison, the Komonjo Camera (ふみのは) app, developed by Toppan, has a dedicated “Komonjo” mode that is more successful. However, Toppan’s service is only available to general users in Japan as a mobile app, and free usage is limited to 30 pages per day. Moreover, without special arrangements, neither the CODH nor the Toppan OCR service supports image processing in bulk, limiting their usability for large-scale data analysis.
NDLkotenOCR-Lite
The NDL Lab at the National Diet Library has been developing a new solution for kuzushiji OCR. Following their success in digitizing the library’s modern collection, the team went on to develop the open-source NDLkotenOCR (or “NDL古典籍OCR”) to process its pre-modern collection.
While the full NDLkotenOCR model has been in development since 2022 and is currently in its third iteration, NDLkotenOCR-Lite offers very similar functionality through a much more user-friendly interface, with only around a 2% drop in accuracy.
The easiest way to access the OCR-Lite model is through its desktop app, available for Windows, macOS, and Linux. In the Japanese-only interface of the app, the user has the option to process either a single image or multiple images. After picking the output location, the user can choose the format(s) in which to save their OCR results. The TXT file gives the plain transcript text line by line, whereas the JSON and XML files contain additional information such as the layout coordinates and recognition confidence levels. There are also options to output PDF files overlaid with the transcript. More detailed instructions on how to navigate the desktop app can be found on the NDL Lab website (in Japanese only).
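Because the JSON output records a confidence level for each recognized line, it lends itself to simple post-processing, for instance to flag low-confidence lines for human review. The sketch below illustrates the idea with a made-up page of output; the field names (`lines`, `text`, `conf`, `box`) are assumptions for illustration only, so check the structure of your actual NDLkotenOCR-Lite output files before adapting it.

```python
import json

# Hypothetical JSON output for one page. The field names here are
# illustrative only; inspect your actual NDLkotenOCR-Lite output.
sample = json.loads("""
{
  "lines": [
    {"text": "一筆啓上仕候", "conf": 0.98, "box": [120, 80, 160, 540]},
    {"text": "御機嫌能被成御座", "conf": 0.62, "box": [180, 80, 220, 560]}
  ]
}
""")

def flag_low_confidence(page, threshold=0.8):
    """Return (line_text, confidence) pairs below the review threshold."""
    return [(ln["text"], ln["conf"])
            for ln in page["lines"] if ln["conf"] < threshold]

flagged = flag_low_confidence(sample)
print(flagged)  # only the second line falls below the 0.8 threshold
```

A reviewer could then concentrate on the flagged lines rather than rereading every page against the original image.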

Figure 1: Graphic user interface of the NDLkotenOCR-Lite app (ver. 1.3.1)

Figure 2: Output options in the NDLkotenOCR-Lite app (ver. 1.3.1)
For those more confident working with code, NDLkotenOCR-Lite can also be run from the command line. Detailed usage instructions are provided on the project’s GitHub page. With user-defined parameters, the command-line version offers the same functionality as the desktop app, while enabling additional options for customization and automation.

Figure 3: Command-line usage instructions from the official NDLkotenOCR-Lite GitHub repository.
Advantages
High accuracy
Like earlier kuzushiji OCR models, NDLkotenOCR was trained on datasets of early modern manuscript images paired with manually transcribed texts, and its accuracy is further enhanced by machine-learning-based pattern recognition. Instead of treating each character independently, the NDL model uses a context-aware decoder that predicts characters from the surrounding text, allowing the model to resolve ambiguous or degraded visual data by inferring the most likely character.1 This approach is particularly well suited to reading early modern texts written in the highly formulaic sōrōbun style. The developer’s assessment of the model (full version 3.0) returns an average accuracy of 92%.2 In my own tests on manuscripts regarding domain–bakufu interactions, the model produced an OCR accuracy of approximately 97%.
There is one small issue with the context-based character prediction approach. Compared with single-character recognition, the recognition mistakes made by the NDL model tend to be grammatically plausible, making them less immediately noticeable during human review. Nevertheless, the impressive accuracy of the NDL model significantly reduces the amount of human effort required to review the OCR results, making it a suitable tool for large-scale OCR projects.
Workflow customisability
The NDLkotenOCR-Lite model runs completely locally, without the need for a cloud server. Combined with command-line access, this means that with some basic shell and Python scripting, the user can easily create customized, fully automated workflows. For example, a researcher with a collection of PDF manuscripts can set up a workflow that automatically converts each PDF file into images, processes them with NDLkotenOCR to produce OCR transcripts for each page, then combines the per-page transcripts of each manuscript into one unified text file. The transcripts can then be reviewed and used for annotation.
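The pipeline described above can be sketched in a short Python script. This is a minimal illustration under two assumptions: PDF-to-image conversion is delegated to Poppler’s `pdftoppm` utility, and `ndlkotenocr` stands in as a placeholder for the actual command-line invocation, which you should take from the project’s GitHub instructions. Only the transcript-merging step is plain file handling with no external dependencies.

```python
import subprocess
from pathlib import Path

def pdf_to_images(pdf_path: Path, img_dir: Path) -> None:
    """Convert each page of a PDF to a PNG using Poppler's pdftoppm."""
    img_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(["pdftoppm", "-png", str(pdf_path),
                    str(img_dir / pdf_path.stem)], check=True)

def run_ocr(img_dir: Path, out_dir: Path) -> None:
    """Run kuzushiji OCR on a directory of page images.
    'ndlkotenocr' is a placeholder command; substitute the real
    invocation from the NDLkotenOCR-Lite GitHub instructions."""
    subprocess.run(["ndlkotenocr", str(img_dir), str(out_dir)], check=True)

def combine_transcripts(txt_paths):
    """Merge per-page TXT transcripts, in page order, into one text."""
    return "\n".join(p.read_text(encoding="utf-8").strip()
                     for p in sorted(txt_paths))

if __name__ == "__main__":
    # Process every manuscript PDF in a folder end to end.
    for pdf in Path("manuscripts").glob("*.pdf"):
        pdf_to_images(pdf, Path("images") / pdf.stem)
        run_ocr(Path("images") / pdf.stem, Path("ocr") / pdf.stem)
        pages = (Path("ocr") / pdf.stem).glob("*.txt")
        out = Path("transcripts") / f"{pdf.stem}.txt"
        out.parent.mkdir(exist_ok=True)
        out.write_text(combine_transcripts(pages), encoding="utf-8")
```

Once set up, a script like this can be left to churn through an entire folder of digitized manuscripts overnight, leaving the researcher to review the finished transcripts.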
Model retraining
The final point that sets the NDL model apart from its predecessors is the possibility to retrain the model with the user’s own data. The NDL Lab has published detailed instructions on how to calibrate the model’s recognition of text layout and characters. Admittedly, as with all OCR models, substantial amounts of new training data are required to produce meaningful improvements. Nevertheless, the option makes it feasible for researchers to optimize the model for specific types of documents.
Conclusion
Because of the laborious work and years of experience required to read and transcribe kuzushiji sōrōbun manuscripts, regional archives rarely have the resources to digitise, let alone transcribe, all of their collections. Though experienced human input is still essential, the NDLkotenOCR model, with its impressive accuracy and high degree of customisability, may just be the key to unlocking the potential of these archives and advancing our understanding of early modern Japan.
Xia-Kang Ziyi works on the diplomatic history of early modern Japan. She received a DPhil in Oriental Studies from the University of Oxford, and her thesis examines the agency of the Tsushima domain in Tokugawa Japan–Chosŏn Korea relations. She is currently a JSPS International Research Fellow at Kyushu University, Fukuoka.3
- See Section 4.2.1 in 古典籍資料のOCRテキスト化実験(令和4年度~) ↩︎
- The average accuracy rate is given in “Results of Character Recognition Performance Evaluation,” available to download in 古典籍資料のOCRテキスト化実験(令和4年度~) ↩︎
- This work was supported by JSPS KAKENHI Grant Number 25KF0153. ↩︎
