eScriptorium: Digital Text Production for Urdu, Hindi, and Bengali Print, part 3

In part 1 of this series, I introduced eScriptorium and its associated OCR/HTR engine, Kraken. Part 2 described an annotated dataset of Urdu Nasta’liq lithographs printed between 1860 and 1940, and models trained from this dataset for automatically transcribing Urdu texts. This post discusses digital text production for historical printing in Bengali using models trained from an annotated dataset of Bengali texts published between 1860 and 1940. I also offer segmentation and recognition models for early printed Bengali material that you can deploy in your workflow.

Bengali Script and Typography

The Bengali writing system uses a left-to-right alphasyllabary or abugida script in which vowels appear both in their independent form, where the vowel alone constitutes a syllable, and as diacritics attached to consonants in varying positions to form a syllabic unit. The total number of letters in Modern Bengali is about 50, and, like Urdu but unlike English, the script has no distinct upper and lower cases.

Bengali Alphabets
Image: Wikipedia

Of these, 11 are vowels, one of which – অ – is an inherent vowel and does not require a diacritic form when attached to a consonant.

Diacritic form of আ, ই, ঈ, উ, ঊ, ঋ, এ, ঐ, ও and ঔ attached to the letter ক
Image: Wikipedia

With 10 numeral signs, a handful of diacritics and punctuation marks, and a few special symbols like currency denominators, the Bengali grapheme inventory may appear quite simple to model. The complication, however, lies in conjuncts or যুক্তবর্ণ (jukto borno). In Bengali, conjuncts are clusters of two to four consonant letters combined in an ordered set to create a single character or grapheme. These complex graphemes in turn take either the inherent vowel or any of the 10 vowel signs to form a typographic ligature.

Chart of Bengali Conjunct Combinations
Bengali Conjuncts
Image: Wikipedia
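To make the conjunct problem concrete, here is a minimal Python sketch of how such clusters are encoded in Unicode: each consonant is a separate code point, and the hasanta (U+09CD, named BENGALI SIGN VIRAMA in Unicode) joins them. The helper names describe and consonants_in_conjunct are illustrative only and are not part of Kraken or eScriptorium.

```python
import unicodedata

# U+09CD joins consonants into a conjunct (Unicode name: BENGALI SIGN VIRAMA).
HASANTA = "\u09CD"


def describe(text):
    """Return (code point, Unicode name) for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


def consonants_in_conjunct(cluster):
    """Split a conjunct on the hasanta to list its consonant letters."""
    return [part for part in cluster.split(HASANTA) if part]


kssa = "\u0995\u09CD\u09B7"            # ক্ষ = ক + hasanta + ষ
stra = "\u09B8\u09CD\u09A4\u09CD\u09B0"  # স্ত্র = স + hasanta + ত + hasanta + র
ki = "\u0995\u09BF"                    # কি = ক + vowel sign i

for sample in (kssa, stra, ki):
    print(sample, len(sample), describe(sample))
```

A single printed shape such as ক্ষ is therefore three code points in the transcription, which is one reason the number of distinct shapes a recognizer has to learn is far larger than the roughly 50 letters of the alphabet.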

Moreover, the vast variety of typefaces used for printing Bengali before the widespread adoption of mechanical typesetting in the 1940s is another major impediment to automatically transcribing early printed Bengali material at high accuracy rates.

Scanned Bengali Books Available Online

In addition, a significant number of Bengali texts printed before the 1940s that have been photographed and made available online have poor-quality images. In particular, texts in the Internet Archive (IA) – Bengali has the highest number of digitized texts in IA among all South Asian languages – have badly binarized images, resulting in distortions or bleeding edges in the text. The images also appear to have been compressed aggressively, since many show compression artifacts such as blurred or pixelated text.

Digital surrogates of early printed Bengali books and periodicals in repositories maintained by the University of Calcutta or Heidelberg University do not fare well in terms of image quality either. Specifically, Bengali periodicals available from the FID4SA repository through cooperation between the Centre for Studies in Social Sciences, Calcutta (CSSSC) and the South Asia Institute, Heidelberg University were derived from microfilm, which introduces significant noise into the images. For instance, speckle noise over just a few words can degrade the recognition of the entire line.

Image example of noise in scanned copies of early Bengali printed texts
#29 in this PDF available from FID4SA

In this collection, several periodicals have two facing pages scanned as a single image, which is usually not difficult to handle, especially for volumes that are not too thick. But this scanning technique may introduce distortions near the center of the spread due to the curvature of the book’s spine, particularly in images from the middle of the book, and such images require dewarping.

Image example of distortions due to the curvature of the book’s spine
#47 in this PDF available from FID4SA

A truly functional OCR system for early printed Bengali material needs to account for these factors, which is why I decided to compile a training dataset that includes such distortions.

Dataset Description

Like the Urdu dataset presented in the previous post, the Bengali dataset has three annotation layers. The first layer, stored in the <TextBlock> tag in ALTO XML, holds region coordinates and labels for the text regions on each page. The second layer, stored in the <TextLine> tag, records line coordinates and line types.

For page annotations, I rely on the controlled vocabulary provided by the SegmOnto project. Put simply, the SegmOnto project aims to standardize the description of page layouts while providing the flexibility to accommodate the specificities of a variety of documents. The classes or types provided in SegmOnto can have more than one subclass or subtype, which can further be numbered for differentiation. Users can define a custom zone to include elements not covered in the guidelines and modify the suggested subclasses to add semantic context that suits their needs. This augurs well for systematically labeling large quantities of data and reduces the complexity of modeling layouts of historical documents in computer vision tasks.

In practice, I label the main text block on the page as Main. If the text is laid out in more than one column, the tags follow the <Main:column#number> format, where #number distinguishes each column. Similarly, footnotes are tagged as <MarginText:note#number>. Currently, I do not differentiate between visual elements that serve a purely decorative function and those that are consubstantial to the text. As a result, I classify both types of visual components as <Graphic:illustration>. I also define a private zone called <Custom: publication> to tag regions that contain publication metadata. The <Numbering:page> tag marks pagination while <Numbering:other> identifies numbering associated with foliation or pages not part of the main body of the book.
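Labels in this type:subtype#number shape are easy to split apart in downstream scripts. The following is a small sketch, assuming the labels are written exactly as in this post; the ZoneLabel type, the parse_label function and the regular expression are illustrative and are not part of SegmOnto or eScriptorium.

```python
import re
from typing import NamedTuple, Optional


class ZoneLabel(NamedTuple):
    zone: str               # e.g. "Main", "MarginText", "Numbering"
    subtype: Optional[str]  # e.g. "column", "note", "page"
    number: Optional[int]   # e.g. 2 in "Main:column#2"


# Assumed label grammar: Zone, optional ":subtype", optional "#number".
# Whitespace after the colon is tolerated so "Custom: publication" parses too.
LABEL_RE = re.compile(
    r"^(?P<zone>[A-Za-z]+)"
    r"(?::\s*(?P<subtype>[A-Za-z]+))?"
    r"(?:#(?P<number>\d+))?$"
)


def parse_label(label: str) -> ZoneLabel:
    """Split a region label into zone, subtype and number."""
    match = LABEL_RE.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized region label: {label!r}")
    number = match.group("number")
    return ZoneLabel(
        match.group("zone"),
        match.group("subtype"),
        int(number) if number else None,
    )


for raw in ("Main", "Main:column#2", "MarginText:note#1", "Numbering:page"):
    print(raw, "->", parse_label(raw))
```

Keeping the zone, subtype and number separate makes it straightforward to, say, group all columns of the main text together while still preserving their order.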

Training Region Types | Total Types
MarginText:note#2 | 90
Summary of Region Types in the Dataset

Furthermore, automatically writing region labels into the ALTO XML file in an OCR pipeline is useful for downstream applications of the OCRed document. For example, you can extract only those lines that are associated with particular text blocks, such as <Main>, or convert the OCRed document into other formats, such as TEI and RDF. You can consult SegmOnto’s documentation to learn more about this and follow the official eScriptorium tutorial for instructions on image annotation.
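As an illustration of this kind of downstream use, the sketch below pulls the transcription of every line that belongs to a region with a given label. It assumes that region labels are stored as <OtherTag> entries in the ALTO <Tags> section and referenced from each <TextBlock> through its TAGREFS attribute; check your own export before relying on this. The embedded sample document and the lines_for_region function are made up for the example.

```python
import xml.etree.ElementTree as ET

# A tiny, hand-written ALTO-like document used only for this example.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Tags>
    <OtherTag ID="BT1" LABEL="Main"/>
    <OtherTag ID="BT2" LABEL="MarginText:note#1"/>
  </Tags>
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1" TAGREFS="BT1">
      <TextLine ID="l1"><String CONTENT="main line one"/></TextLine>
      <TextLine ID="l2"><String CONTENT="main line two"/></TextLine>
    </TextBlock>
    <TextBlock ID="b2" TAGREFS="BT2">
      <TextLine ID="l3"><String CONTENT="a footnote"/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def lines_for_region(alto_xml, label):
    """Return the text of all lines inside blocks tagged with `label`."""
    root = ET.fromstring(alto_xml)
    # Reuse whatever namespace the root element declares (ALTO v2/v3/v4 differ).
    ns = root.tag[1:root.tag.index("}")] if root.tag.startswith("{") else ""

    def q(name):
        return f"{{{ns}}}{name}" if ns else name

    # IDs of the tag definitions that carry the requested label.
    tag_ids = {t.get("ID") for t in root.iter(q("OtherTag"))
               if t.get("LABEL") == label}

    lines = []
    for block in root.iter(q("TextBlock")):
        refs = (block.get("TAGREFS") or "").split()
        if tag_ids.intersection(refs):
            for line in block.iter(q("TextLine")):
                parts = [s.get("CONTENT", "") for s in line.iter(q("String"))]
                lines.append(" ".join(parts))
    return lines


print(lines_for_region(SAMPLE_ALTO, "Main"))
```

The same loop can be pointed at footnote or numbering labels to route those lines elsewhere, for example into separate elements of a TEI file.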

The <String> tag associated with each <TextLine> is the third layer of annotation. It provides the transcription of each line and is used to train the recognition model. The transcription used to train the current recognizer was prepared in three stages. The first step involved transcribing randomly selected pages from a set of 25 books published between 1890 and 1940 and available from the Internet Archive. This yielded 700 pages, or 20,000 lines, which I used to train the base recognition model. I then applied the base recognizer to transcribe the complete runs of 14 poetry books from the same period that were a subset of the 25 books used to train the base model. Students of the M.A. Bengali program at the Department of Modern Indian Languages and Literary Studies, University of Delhi then corrected the raw OCR output of these 14 books by double reading. After this, I aligned the corrected transcription with the images in eScriptorium and selected pages where the base model had transcribed characters with lower confidence in order to retrain the model on an optimized dataset.
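One simple way to pick out low-confidence pages is to rank them by their mean character confidence. The sketch below only illustrates that idea and is not necessarily the exact criterion used for this dataset. It assumes you have already collected per-character confidences for each page from the recognizer's output; the page identifiers, the numbers and the lowest_confidence_pages function are hypothetical.

```python
from statistics import mean


def lowest_confidence_pages(page_confidences, k):
    """Return the k page ids with the lowest mean character confidence.

    page_confidences maps a page id to a list of confidences in [0, 1].
    Pages without any recognized characters are skipped.
    """
    scored = {page: mean(confs)
              for page, confs in page_confidences.items() if confs}
    return sorted(scored, key=scored.get)[:k]


# Hypothetical confidences for three pages.
example = {
    "page_001": [0.99, 0.98, 0.97],
    "page_002": [0.80, 0.75, 0.90],
    "page_003": [0.95, 0.60, 0.85],
    "page_004": [],
}

print(lowest_confidence_pages(example, 2))
```

The mean is a blunt measure; depending on the material, a median or the share of characters below a threshold may single out difficult pages more reliably.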

In the final stage, I applied the retrained recognizer to transcribe selected pages of 15 texts published before 1900. Then, I used the entire dataset to train the current version of the model from scratch. Bengali texts often contain text in English alongside Bengali, primarily in footnotes but also as part of the main text. So in the final stage, I also included pages with a relatively higher frequency of English text. This way the dataset addresses the presence of mixed-script text in Bengali print. It should be noted that the recognition model presented in this work cannot yet reliably recognize text in the Latin script, upper-case letters in particular, and needs further work.

Model Description

This GitHub repository provides two types of models – for segmentation and recognition tasks – each of which is a work in progress and will be updated periodically. The segmenter named bnSEG_basic.mlmodel handles basic layouts very well but cannot infer labels for regions. The basic model should suffice for users seeking swift document transcriptions in the form of a plain text file, provided the document has a simple layout. The model titled bnSEG_complex.mlmodel has been trained on the image-level annotations described above, and can handle complex layouts including multicolumn documents with a host of paratextual elements. However, in its current state it remains limited in reliably predicting less common region types such as graphic and margin text zones.

Below, you can see the results of applying the bnSEG_complex.mlmodel to layouts that are similar to the data it was trained on.

As you can see, the results are not perfect and require correction. Fortunately, Jonathan Robker has you covered for this with his post on eScriptorium.

The current version of the recognition model available here was trained on a dataset of ~40,000 lines from 40 different texts printed between the 1860s and the 1940s. Given a well-segmented image, the recognizer yields significantly better character and word accuracy scores than Tesseract, which is another open source state-of-the-art (SOTA) OCR engine. This recognizer performs on par with Google Drive’s text recognition service, which uses the proprietary Vision API and requires opening an image in Google Docs to get the transcription.

You can view the evaluation reports of the three OCR systems in the gallery below to assess recognition quality. Note that the recognition model provided here has not seen this typeface during training.

These reports were generated with the ocrevalUAtion tool and you can find more about it here.
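If you want a quick sanity check of character and word accuracy on your own pages, the usual measure is the edit distance between the ground truth and the OCR output, divided by the length of the ground truth. The sketch below is a plain Python illustration of that idea. It is not the ocrevalUAtion implementation, which may normalize and weight errors differently, and the function names are my own.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by the length of the reference."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(reference, hypothesis):
    """Character error rate, counted per Unicode code point."""
    return error_rate(list(reference), list(hypothesis))


def wer(reference, hypothesis):
    """Word error rate, with words split on whitespace."""
    return error_rate(reference.split(), hypothesis.split())


print(cer("abcd", "abed"), wer("a b c d", "a x c d"))
```

Note that counting by code point means a Bengali conjunct contributes several characters to the total, so a single misread conjunct can register as more than one character error.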

It is worth noting that the Google Drive service often messes up line order in multicolumn documents, and remains an awkward solution for transcribing book-length documents. It may well be useless for certain use cases, particularly text types such as poetry and drama, as it doesn’t produce line breaks. Unlike the other two systems, the recognition model trained in Kraken – the OCR engine associated with eScriptorium – preserves historical orthography in the transcription, since the transcriptions used to train the model were prepared diplomatically, i.e., exactly as the text appears in the source image.

You can follow part 2 of this series for a short explainer on applying this work in your workflow. In the next post, I will introduce another dataset for early printed Hindi texts. Stay tuned!
