eScriptorium: Digital Text Production for Urdu, Hindi, and Bengali Print, part 1

State-of-the-art OCR engines use trainable models to perform two consecutive tasks that produce machine-actionable transcriptions. They first segment the page, locating the position of each line, and then recognize the text of each segmented line. Training the machine learning models that perform these tasks requires script-specific ground truth datasets. We can often train these models with synthetic data, which may work well for contemporary fonts. However, to appropriately model the diverse typographical or paleographical particularities of historical documents, we need real data or manually corrected ground truth.

For the last several months, I have been experimenting with eScriptorium to create ground truth for historical print in three South Asian languages, namely Urdu, Hindi, and Bengali. For the Digital Orientalist, I will post a series of pieces where I introduce the datasets for each language, the models trained from these datasets, and a guide to using them. In this first post, I provide a general introduction to eScriptorium and the workflow associated with it.

Home page of eScriptorium providing a list of its institutional partners
Fig. 1: Home Page of eScriptorium

At one level, eScriptorium works as an end-to-end digital text processing pipeline to prepare publishable metadata-rich digital editions from digital imagery of physical documents. At another level, it is a semi-automatic annotation environment for preparing ground truth that we use to train and test machine learning systems for the particularities of complex documents. Since eScriptorium tightly integrates Kraken – a language-agnostic OCR engine for humanists – it can, in principle, adapt to almost any writing system, including documents in bidirectional (mixing RTL and LTR scripts) and vertical (top-to-bottom) scripts. As a freely available cross-platform software committed to the principles of open science that individual users can install locally on a standard computer, eScriptorium offers considerable flexibility over commercial software or closed services such as Transkribus.

The openness of eScriptorium is particularly significant because it allows users to export and publish their models along with the datasets used to train them, which makes the models and their results more transparent. More importantly, it enables other researchers to freely reuse models, saving the considerable computing and natural resources needed to train machine learning systems on large datasets.

5 strengths of eScriptorium: open source, data interchangeability, intuitive GUI, deep learning, and shareable models
Fig. 2: eScriptorium: An Overview

In its integration with Kraken, eScriptorium extends Kraken’s usability and overcomes a number of its limitations. In particular, Kraken’s legacy box segmentation module was not trainable and performed poorly on complex, non-standard document layouts. It has since been replaced with a baseline segmentation module, which, unlike its predecessor, learns from annotated datasets to automatically segment complex and non-standard layouts. Using eScriptorium as a graphical user interface to Kraken, we can rectify erroneous results of Kraken’s default segmentation model and retain our corrections as ground truth to train a customized segmentation model, thereby improving automatic segmentation for documents with complex layouts.

More importantly, eScriptorium enables users to define their own controlled vocabularies or ontologies for labeling regions and lines on a document’s image with classes corresponding to particular region or line types. This is incredibly useful for annotating elements that model the material complexities of historical documents. For instance, we can use this feature to label region types such as advertisements in periodicals, or line types such as prose and verse in a tazkirah.

eScriptorium panel to define ontologies for your documents
Fig. 3: Ontologies in eScriptorium

A standard workflow in eScriptorium broadly involves the following steps. We begin by creating a project where we import digital imagery of a text, for which there are many options, including import from a PDF or a IIIF manifest. Then we apply either the default segmentation model, which works reasonably well for standard documents, or a custom segmentation model if one is available, to automatically segment images into regions and lines. At this stage, it is pertinent to correct segmentation errors, as this will improve recognition. Following this, we apply an existing recognition model to automatically transcribe each segmented line, and correct the results depending on the quality and downstream applications of the transcription.
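For reference, the two automatic steps of this workflow can also be run with Kraken directly on the command line. This is only a sketch: it assumes Kraken is installed, and `page.png` and `model.mlmodel` are placeholder names for your own page image and recognition model.

```shell
# Segment a page image into regions and baselines
# (-bl selects the trainable baseline segmenter)
kraken -i page.png lines.json segment -bl

# Segment and transcribe in one pass with a recognition model
kraken -i page.png transcript.txt segment -bl ocr -m model.mlmodel
```

The same commands accept multiple `-i` input/output pairs, so whole batches of pages can be processed in one invocation.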

Fig. 4: eScriptorium Demo

Since Kraken implements transfer learning – the ability to adapt the existing knowledge of a machine learning model for different but related tasks – we can fine-tune or adapt an existing generalized or base model for the specificities of a related document. We do this by retraining an existing base model for either segmentation or recognition tasks with a couple hundred lines of labeled examples from the related text we want to transcribe.

As the resulting fine-tuned model will likely learn the typographical or paleographical particularities of the new text during the retraining process, it can potentially transcribe this text with fewer errors than the base model. Leveraging this feature, we can iteratively improve text recognition in eScriptorium by first applying an existing base model to segment and transcribe texts, then correcting errors manually, and finally fine-tuning the base model with this corrected data.
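One simple way to check whether a fine-tuned model really transcribes with fewer errors than its base model is to compute the character error rate (CER) of each model’s output against a held-out, manually corrected transcription. A minimal sketch in Python, using a plain edit-distance implementation (function names here are my own, not part of Kraken or eScriptorium):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(prediction: str, ground_truth: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ground_truth:
        return 0.0 if not prediction else 1.0
    return levenshtein(prediction, ground_truth) / len(ground_truth)

print(cer("abcd", "abed"))  # 0.25: one substitution over four reference characters
```

Comparing the CER of the base and fine-tuned models on the same held-out lines gives a quick, quantitative sense of whether another round of correction and retraining is worthwhile.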

In case a recognition model does not exist for your language yet, you can speed up training data preparation in eScriptorium by aligning existing transcriptions of related texts with their images. Following this, you can use this data to train a base recognition model, and then apply the resulting model to transcribe a related text automatically. Lastly, you can export your annotations in ALTO (XML) or PAGE (XML) format and publish them alongside the models associated with them on an open repository such as Zenodo so that other researchers can use your work.
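Exported ALTO files are plain XML, so they are easy to process further outside eScriptorium. As an illustration, here is a minimal Python sketch that pulls the transcription of each line out of a toy ALTO 4 fragment (the sample document and element IDs are invented for the example; real exports also carry coordinates, styles, and region or line types):

```python
import xml.etree.ElementTree as ET

# Invented ALTO 4 fragment standing in for an eScriptorium export
ALTO_SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock ID="block_1">
          <TextLine ID="line_1"><String CONTENT="first line"/></TextLine>
          <TextLine ID="line_2"><String CONTENT="second line"/></TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""

NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}

def extract_lines(alto_xml: str) -> list[str]:
    """Return the transcription of each TextLine, one string per line."""
    root = ET.fromstring(alto_xml)
    lines = []
    for text_line in root.iterfind(".//alto:TextLine", NS):
        words = [s.get("CONTENT", "") for s in text_line.iterfind("alto:String", NS)]
        lines.append(" ".join(words))
    return lines

print(extract_lines(ALTO_SAMPLE))  # ['first line', 'second line']
```

The same approach scales to turning a whole directory of exported pages into a plain-text corpus for downstream analysis.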

Iteratively improve semi-automatic text recognition in eScriptorium
Fig. 5: Text Production Workflow in eScriptorium

The easiest way to train models is to do it within the eScriptorium instance, especially if you want to fine-tune an existing generalized model for book-specific tasks. However, users working with large datasets comprising examples from multiple documents may benefit from training their models outside an eScriptorium instance using Kraken as a standalone tool from the command line. Kraken’s documentation is a valuable source of information for doing this.
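As a rough sketch of what such command-line training can look like (the file and directory names are placeholders, and Kraken’s documentation should be consulted for the full set of options and for the exact flags in your installed version):

```shell
# Fine-tune an existing recognition model on ALTO ground truth
# (-i loads the base model, -o sets the output model's name prefix)
ketos train -f alto -i base.mlmodel -o finetuned gt/*.xml

# Train a segmentation model from the same annotated pages
ketos segtrain -f alto -o segmodel gt/*.xml
```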

Training generalized models from scratch can be time-consuming and requires considerable computing resources to be practical, such as a graphics processing unit (GPU) with at least 8 GB of graphics memory. However, fine-tuning existing generalized models for book-specific tasks is practical on a standard computer with a 4-core processor and at least 8 GB of RAM. The quickest and easiest way to get started with a local instance of eScriptorium is to install it with Docker.

Depending on your operating system, you will need to install a few tools to prepare your machine for this local install. Linux and macOS users will first have to install git, docker, and docker-compose. Windows users will have to start by setting up the Windows Subsystem for Linux (WSL) on their computers, after which they can install git, docker, and docker-compose. If you have access to a graphics processing unit to accelerate training, you will also have to install nvidia-docker along with the appropriate CUDA environment.
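With these tools in place, a Docker-based install boils down to a few commands along the following lines. This is a sketch only; consult the project’s README for the exact, current steps and for configuring the environment files before starting the services.

```shell
# Fetch the eScriptorium source
git clone https://gitlab.com/scripta/escriptorium.git
cd escriptorium

# Build the images and start the services in the background
docker-compose up -d --build
```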

After these preparatory steps, users can follow the installation instructions here to get started with their own instance of eScriptorium. Once users have set up a local instance of eScriptorium, they will need a segmentation and a recognition model to automate their annotation workflow. Several tutorials on basic eScriptorium functionalities exist as both blog posts [0] [1] and videos [0] [1] that you can use to get up to speed with eScriptorium.

In the following posts, I will introduce segmentation and recognition models along with their respective datasets that you can use in your research, starting with Urdu. Stay tuned! 
