This article was written by guest contributor Xia-Kang Ziyi (Kyushu University). The author bio is below.
One of the biggest challenges facing early modern Japan researchers is the deciphering of handwritten manuscripts. The komonjo 古文書 (pre-modern documents) and kokiroku 古記録 (pre-modern records) manuscripts documenting various aspects of early modern Japanese society exist in overwhelming quantities in archives and private collections throughout Japan. However, because they were written in the obscure early modern cursive hand (kuzushiji), these archives remain very much an untapped resource. Historians often resort to either focusing their projects on a small number of manuscripts or limiting their choice of primary sources to transcripts already produced by other researchers.
While there are many reading groups in Japan working to transcribe kuzushiji manuscripts held in local archives, their reliance on human input means the pace of transcription cannot support large-scale digital humanities projects. To unlock the potential of Japan’s substantial early modern archives using digital methods, it is crucial that we teach computers to read kuzushiji.
There have been a couple of earlier attempts to tackle this challenge. The AI Kuzushiji OCR developed by the Center for Open Data in the Humanities (CODH) is one of the earliest and best-known models. Available for free both through a web platform and the miwo mobile app, the model has been a great tool for reading kuzushiji kana, but its accuracy drops significantly with komonjo texts written mainly in kanji characters. In comparison, the Komonjo Camera (ふみのは) app, developed by Toppan, has a dedicated “Komonjo” mode that is more successful. However, Toppan’s service is only available to general users in Japan as a mobile app, and free usage is limited to 30 pages per day. Moreover, without special arrangements, neither the CODH nor the Toppan OCR service supports image processing in bulk, limiting their usability for large-scale data analysis.
NDLkotenOCR-Lite
The NDL Lab at the National Diet Library has been developing a new solution for kuzushiji OCR. Following their success in digitizing the library’s modern collection, the team went on to develop the open-source NDLkotenOCR (or “NDL古典籍OCR”) to process its pre-modern collection.
While the full NDLkotenOCR model has been in development since 2022 and is currently in its third iteration, NDLkotenOCR-Lite offers very similar functionality through a much more user-friendly interface, with only around a 2% drop in accuracy.
The easiest way to access the OCR-Lite model is through its desktop app, available for Windows, macOS, and Linux. In the Japanese-only interface of the app, the user has the option to process either a single image or multiple images. After picking the output location, the user can choose the format(s) in which to save their OCR results. The TXT file gives the plain transcript text line by line, whereas the JSON and XML files contain additional information such as the layout coordinates and recognition confidence levels. There are also options to output PDF files overlaid with the transcript. More detailed instructions on how to navigate the desktop app can be found on the NDL Lab website (in Japanese only).
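Because the JSON output records a confidence level for each recognized line, it lends itself to simple post-processing, for instance to flag low-confidence lines for human review. The sketch below illustrates the idea with a made-up page of output; the field names (`lines`, `text`, `conf`, `box`) are assumptions for illustration only, so check the structure of your actual NDLkotenOCR-Lite output files before adapting it.

```python
import json

# Hypothetical JSON output for one page. The field names here are
# illustrative only; inspect your actual NDLkotenOCR-Lite output.
sample = json.loads("""
{
  "lines": [
    {"text": "一筆啓上仕候", "conf": 0.98, "box": [120, 80, 160, 540]},
    {"text": "御機嫌能被成御座", "conf": 0.62, "box": [180, 80, 220, 560]}
  ]
}
""")

def flag_low_confidence(page, threshold=0.8):
    """Return (line_text, confidence) pairs below the review threshold."""
    return [(ln["text"], ln["conf"])
            for ln in page["lines"] if ln["conf"] < threshold]

flagged = flag_low_confidence(sample)
print(flagged)  # only the second line falls below the 0.8 threshold
```

A reviewer could then concentrate on the flagged lines rather than rereading every page against the original image.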

Figure 1: Graphic user interface of the NDLkotenOCR-Lite app (ver. 1.3.1)

Figure 2: Output options in the NDLkotenOCR-Lite app (ver. 1.3.1)
For those more confident working with code, NDLkotenOCR-Lite can also be run from the command line. Detailed usage instructions are provided on the project’s GitHub page. With user-defined parameters, the command-line version offers the same functionality as the desktop app, while enabling additional options for customization and automation.

Figure 3: Command-line usage instructions from the official NDLkotenOCR-Lite GitHub repository.
Advantages
High accuracy
Like earlier kuzushiji OCR models, NDLkotenOCR was trained on datasets of early modern manuscript images paired with manually transcribed texts, and its accuracy is further enhanced by machine-learning-based pattern recognition. Instead of treating each character independently, the NDL model uses a context-aware decoder that predicts characters from the surrounding text, allowing the model to resolve ambiguous or degraded visual data by inferring the most likely character.1 This approach is particularly well suited to reading early modern texts written in the highly formulaic sōrōbun style. The developer’s assessment of the model (full version 3.0) returns an average accuracy of 92%.2 In my own tests on manuscripts regarding domain–bakufu interactions, the model produced an OCR accuracy of approximately 97%.
There is one small issue with the context-based character prediction approach. Compared with single-character recognition, the recognition mistakes made by the NDL model tend to be grammatically plausible, making them less immediately noticeable during human review. Nevertheless, the impressive accuracy of the NDL model significantly reduces the amount of human effort required to review the OCR results, making it a suitable tool for large-scale OCR projects.
Workflow customisability
The NDLkotenOCR-Lite model runs completely locally, without the need for a cloud server. Combined with command-line access, this means that with some basic shell and Python scripting, the user can easily create customized, fully automated workflows. For example, a researcher with a collection of PDF manuscripts can set up a workflow that automatically converts each PDF file into images, processes them with NDLkotenOCR to produce OCR transcripts for each page, then combines the per-page transcripts of each manuscript into one unified text file. The transcripts can then be reviewed and used for annotation.
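The pipeline described above can be sketched in a short Python script. This is a minimal illustration under two assumptions: PDF-to-image conversion is delegated to Poppler’s `pdftoppm` utility, and `ndlkotenocr` stands in as a placeholder for the actual command-line invocation, which you should take from the project’s GitHub instructions. Only the transcript-merging step is plain file handling with no external dependencies.

```python
import subprocess
from pathlib import Path

def pdf_to_images(pdf_path: Path, img_dir: Path) -> None:
    """Convert each page of a PDF to a PNG using Poppler's pdftoppm."""
    img_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(["pdftoppm", "-png", str(pdf_path),
                    str(img_dir / pdf_path.stem)], check=True)

def run_ocr(img_dir: Path, out_dir: Path) -> None:
    """Run kuzushiji OCR on a directory of page images.
    'ndlkotenocr' is a placeholder command; substitute the real
    invocation from the NDLkotenOCR-Lite GitHub instructions."""
    subprocess.run(["ndlkotenocr", str(img_dir), str(out_dir)], check=True)

def combine_transcripts(txt_paths):
    """Merge per-page TXT transcripts, in page order, into one text."""
    return "\n".join(p.read_text(encoding="utf-8").strip()
                     for p in sorted(txt_paths))

if __name__ == "__main__":
    # Process every manuscript PDF in a folder end to end.
    for pdf in Path("manuscripts").glob("*.pdf"):
        pdf_to_images(pdf, Path("images") / pdf.stem)
        run_ocr(Path("images") / pdf.stem, Path("ocr") / pdf.stem)
        pages = (Path("ocr") / pdf.stem).glob("*.txt")
        out = Path("transcripts") / f"{pdf.stem}.txt"
        out.parent.mkdir(exist_ok=True)
        out.write_text(combine_transcripts(pages), encoding="utf-8")
```

Once set up, a script like this can be left to churn through an entire folder of digitized manuscripts overnight, leaving the researcher to review the finished transcripts.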
Model retraining
The final point that sets the NDL model apart from its predecessors is the possibility to retrain the model with the user’s own data. The NDL Lab has published detailed instructions on how to calibrate the model’s recognition of text layout and characters. Admittedly, as with all OCR models, substantial amounts of new training data are required to produce meaningful improvements. Nevertheless, the option makes it feasible for researchers to optimize the model for specific types of documents.
Conclusion
Because of the laborious work and years of experience required to read and transcribe kuzushiji sōrōbun manuscripts, regional archives rarely have the resources to digitise, let alone transcribe, all of their collections. Though experienced human input is still essential, the NDLkotenOCR model, with its impressive accuracy and high degree of customisability, may just be the key to unlocking the potential of these archives and advancing our understanding of early modern Japan.
Xia-Kang Ziyi works on the diplomatic history of early modern Japan. She received a DPhil in Oriental Studies from the University of Oxford, and her thesis examines the agency of the Tsushima domain in Tokugawa Japan–Chosŏn Korea relations. She is currently a JSPS International Research Fellow at Kyushu University, Fukuoka.3
- See Section 4.2.1 in 古典籍資料のOCRテキスト化実験(令和4年度~) ↩︎
- The average accuracy rate is given in “Results of Character Recognition Performance Evaluation,” available to download in 古典籍資料のOCRテキスト化実験(令和4年度~) ↩︎
- This work was supported by JSPS KAKENHI Grant Number 25KF0153. ↩︎
