Introduction to eScriptorium, HTR for Hebrew Manuscripts, part 1

Exegetes interested in textual criticism need ready access to digitized and digitalized versions of Hebrew manuscripts so that they need not always rely on the text-critical apparatus of editions. This need confronts them with the great challenge not only of accessing images of a manuscript, but also of finding a satisfactory transcription. While it is becoming increasingly easy to find the first of these (digital photographic facsimiles), in my experience few transcriptions have become available. However, there is an online tool available to aid in transcribing Hebrew manuscripts with substantial elements of automation: eScriptorium.

While a number of products are available for the automated or semi-automated transcription of documents, many of these have no models for Hebrew or fail to satisfactorily function in right-to-left (rtl) scripts generally. These matters present no real issue to eScriptorium. In order to demonstrate this, as well as its functionality more generally, I will describe a test-case I undertook as part of a tutorial on eScriptorium hosted by Daniel Stoekl ben Ezra and promoted by the European Association for Jewish Studies.

The application only works with Firefox, so you’ll want to make sure that you have that browser installed and up to date. Once you have an account with eScriptorium (I was provided a training account for the tutorial), you can open a new project and define its parameters. Under the heading “Description” you can give the project a name and define its script. When working with Hebrew, it is imperative to confirm that the “read direction” is set to “Right to left” and the “Line offset” is set to “Topline.” Because I failed to do this the first time, the results were naturally unsatisfactory: the transcription tried to read the lines upside down. Still, that is an easily avoided user error (my technical support colleagues used to refer to it as an “id10t [pronounced “eye-dee-ten-tee”] error”). Under the “Description” heading, you can also set the kinds of regions and line types your manuscript includes. I was working with the Aleppo Codex, a Masoretic biblical manuscript, but was only interested in the primary text for the purposes of this experiment. It would still be easy to include the Masorot, even distinguishing between them, but that does not currently interest me.

The “Description” Interface

Clicking the heading “Images” takes you to an interface for uploading images. I need about twelve folios for the project I am currently interested in, so I dragged those images into the box and uploaded them. By mousing over an individual image, you have the option to manually edit it; a green key is highlighted to take you there. Clicking on it takes you to the main interface for an individual page. It’s helpful to press F11 when viewing a single image. This removes the browser interface and provides more open space for working on the text. (Pressing F11 again brings it back, as does moving the mouse to the top of the screen.)

The “Images” Interface

The “Images” heading also features several other options for interacting with the images, such as selecting the images and importing and exporting text data, but also for the automated processes such as binarization, segmentation, and transcription. Since I am dealing with a biblical manuscript reproduced in clear images, there is no need to binarize the images. Before you can transcribe the text, however, segmenting the image is required. I tried an automated segmentation to see how the results turned out. It took less than one minute for a single folio (even while I was also streaming music in the background), but I was initially unsatisfied with what it produced.

Results of Automated Segmentation

The line segmentation worked pretty well in the automation, but the regions left something to be desired. The issue stems from the placement of the Masorah between the columns, which confused the algorithm. This made the whole center of the page one large region, with some of the Masorah included and some excluded. While that may not seem like a big deal, it led to issues in defining the order in which the lines would be read. It proposed reading the top line of each column, then the second line of each column, and so on, rather than reading the whole right column, then the middle column, and finally the left column (see following image).
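The difference between the two reading orders can be sketched in a few lines of Python. This is only an illustration of the problem, not eScriptorium's own code or data model: the `Line` record, its `column` and `row` fields, and both function names are hypothetical, and columns are assumed to be numbered from the right, as a Hebrew page is read.

```python
from typing import NamedTuple


class Line(NamedTuple):
    column: int  # hypothetical index: 0 = rightmost column, counting leftwards
    row: int     # 0 = top line within its column
    text: str


def row_first_order(lines):
    """Order the faulty segmentation proposed: line 1 of every column, then line 2, ..."""
    return sorted(lines, key=lambda line: (line.row, line.column))


def column_first_order(lines):
    """Order a reader expects: the whole right column, then the middle, then the left."""
    return sorted(lines, key=lambda line: (line.column, line.row))


# A toy page with three columns of two lines each.
page = [Line(c, r, f"col{c}-line{r}") for r in range(2) for c in range(3)]

for line in column_first_order(page):
    print(line.text)
```

Sorting on the column first and the row second is what keeps each column together; swapping the two keys reproduces the interleaved numbering described above.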

Fixing this presented no real issue, but I made two attempts. My first attempt proved unsuccessful. I set the interface to manipulate the regions by pressing the “r” key. Then I pressed “c”, activating the cutting tool, and pressed “shift” to use the lasso tool. Controlling the lasso tool with my mouse and “shift” depressed, I separated the region into columns by cutting out the material in between the columns. While that did create three regions, it did not solve the issue of the reading order of the lines.

Inaccurate Line Numbering Resulting from Incorrect Segmentation

That failure led to my second attempt. Holding “shift” again and in “region” mode, I marked all of the regions and deleted them with the “delete” key. I pressed “r” again to change back to “line” mode and deleted the lines that had been identified in the Masorah with the exact same procedure. Then I created new regions around the primary text, which still had the automated line segmentation. To create new regions, in region mode you single-click where you want one corner to be and then single-click where the opposite corner should be. Making regions this way proved faster than relying on the automated segmentation, since I only wanted the columns of the primary text. I was done in a couple of minutes.
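Geometrically, two clicks on opposite corners are enough to define a rectangular region, and the lines that fall inside it then belong to that column. The following is a minimal sketch of that idea under my own assumptions: the function names, the tuple layout, and the pixel coordinates are all hypothetical and say nothing about how eScriptorium stores regions internally.

```python
def region_from_clicks(first, second):
    """Build a (left, top, right, bottom) box from two clicks on opposite corners.

    The clicks may come in any order, so the coordinates are normalized.
    """
    (x1, y1), (x2, y2) = first, second
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))


def lines_in_region(region, baselines):
    """Keep only the baselines whose points all lie inside the region."""
    left, top, right, bottom = region
    return [
        baseline
        for baseline in baselines
        if all(left <= x <= right and top <= y <= bottom for x, y in baseline)
    ]


# Hypothetical pixel coordinates: one column region and two candidate lines.
column = region_from_clicks((300, 50), (100, 400))
primary_text = [(280, 100), (120, 102)]  # inside the column
masorah_note = [(350, 100), (310, 100)]  # to the right of the column

print(column)
print(lines_in_region(column, [primary_text, masorah_note]))
```

Three columns need three such boxes, which is where the six clicks come from.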

Manually Created Regions in Six Clicks

Taking a closer look at some of the lines, I noted some errors that needed correction. Some lines were incomplete, had overlapping elements, or marked vocalization or cantillation instead of the consonants.

Identifying Some Errors in Line Segmentation

Fixing this was again no problem. I clicked on a line and then held the “shift” key before clicking on the next line. Then I pressed “j” to join them. Deleting the superfluous lines entailed simply clicking on them and then pressing “delete”.

What appeared to be a more daunting task was the fact that the lines were all too low. The models for Hebrew are designed to read the letters as hanging from the line (as indeed they were written in the manuscripts), so allowing the line to sit too low would diminish the results. Simply clicking on a line and dragging it up would not work, since that would only bend the line. Pressing “ctrl” while dragging a line moves the whole thing. But that would mean doing it for each line, an exercise in tedium. Resolving the issue is simple. Rather than clicking each line and dragging it up while holding the “ctrl” key, you can use the lasso function to select every line in the column. Holding “ctrl” and clicking a line then allows you to move all of the selected lines in tandem. Since the shape of the lines was good, but the placement was not great, moving all of the lines at once was a viable and easy option. With just a few clicks, the lines had all been moved into the appropriate place. To make sure that the whole of each line was marked, I turned on the “mask” function by pressing “m”. That highlights the marked text with a purple overlay. Then I checked the line numbering again by pressing “l”, just to make sure that nothing had changed through all of my editing. The folio was segmented and ready for transcription. (I did notice two small lines that I forgot to delete after completing all of this, but that is not a substantial error and does not impact the transcription. I was able to delete them after the transcription without producing any errors.) The whole process, including my errors and corrections, took roughly thirty minutes. Without the errors, it takes only about ten minutes for a folio, likely less with more practice.
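Why dragging a line bends it, while a “ctrl” drag keeps its shape, is easy to see if a line is pictured as a list of points. The sketch below assumes exactly that: a baseline as a list of (x, y) pixel coordinates with y growing downwards, so that moving up means a negative offset. The function names are hypothetical and this is not eScriptorium's implementation.

```python
def drag_point(baseline, index, dy):
    """Move a single point of a line: its neighbours stay put, so the line bends."""
    return [
        (x, y + dy) if i == index else (x, y)
        for i, (x, y) in enumerate(baseline)
    ]


def shift_lines(baselines, dy):
    """Move every point of every selected line by the same offset: shapes are kept."""
    return [[(x, y + dy) for x, y in baseline] for baseline in baselines]


# Two hypothetical lines of one column, both sitting ten pixels too low.
column_lines = [
    [(300, 110), (200, 110), (100, 112)],
    [(300, 150), (200, 151), (100, 152)],
]

print(drag_point(column_lines[0], 1, -10))  # only the middle point moves
print(shift_lines(column_lines, -10))       # both lines move up as a block
```

Because the same offset is applied to every point, the slight slope of each line survives the move, which is why shifting the whole selection at once works when the shapes are already good.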

The Final Results of Segmentation

With the folio segmented, I could now turn to transcribing it, which will be the focus of part 2.
