Recently, I have been experimenting to see how well Google Docs can transcribe and perform OCR on Syriac books in PDF format. Earlier this month, George Kiraz reported on how to use Google Drive to perform OCR on images of Syriac texts. The process, described by Kiraz, is extremely simple:
- Take a screenshot.
- Upload the screenshot to Google Drive.
- Right-click the file, click “Open with >” and then “Google Docs.”
Once the image has been opened, the transcribed text appears below it.
The results from Kiraz’s tutorial.
I wondered whether this could also be done with documents and files saved in PDF format. I spent a long time experimenting and got quite mixed results. In the end, I found a fairly effective method:
- Convert the PDF files to image files in a folder using Adobe Acrobat Pro (remember you can batch convert PDFs).
- Open the images inside the folder with Google Docs.
This will transcribe the text, which can then be saved to a separate .doc file. It appears to produce fairly good results, based on my experiments with a liturgical text printed in India in the early 20th century. The picture below is an example of Google Drive’s OCR using a prayer by Mor Severus.
Results of Google Drive’s OCR on a prayer by Mor Severus.
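For readers without Adobe Acrobat Pro, the PDF-to-image step could probably be scripted instead. The following is only a minimal sketch, and it swaps in a different tool from the one I used: it assumes the free `pdftoppm` utility from Poppler is installed, and the function names, folder names, and the 300 dpi default are hypothetical choices for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftoppm_command(pdf_path, out_dir, dpi=300):
    """Return the pdftoppm argument list that renders every page to PNG."""
    pdf_path = Path(pdf_path)
    # pdftoppm appends a page number to this prefix for each output image.
    prefix = Path(out_dir) / pdf_path.stem
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(prefix)]


def convert_folder(pdf_dir, out_dir, dpi=300):
    """Render each PDF in pdf_dir to page images in out_dir (batch conversion)."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm (Poppler) is not installed")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        subprocess.run(build_pdftoppm_command(pdf, out_dir, dpi), check=True)
```

Calling `convert_folder("pdfs", "pages")` would then fill the `pages` folder with one PNG per page, ready to be uploaded to Google Drive and opened with Google Docs as described above.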
This method also has advantages over some other methods, such as using Transkribus. Transkribus requires high-resolution images, whereas the Google Docs method works quite well with images of a regular resolution, which may save a user from having to rescan documents. Nevertheless, the method is not without its problems. For instance, converting the PDF files to images can be time-consuming. I have also noticed that, with some recent Syriac texts, the word order is inexplicably changed. Even so, I believe that this method will gradually improve, especially when one considers that some free Google Books written in Syriac have already been made searchable.
Example from Husoyo ܚܘܣܝܐ in which word order has been reversed in the transcription on the left-hand side.
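Where the transcription comes out with its words in reverse order, a small script might repair it after the text has been copied out of Google Docs. This is a rough sketch under a strong assumption, namely that the words on each line are simply reversed while the letters inside each word are correct; I have not verified that this holds for every affected text, and the function name is hypothetical.

```python
def reverse_word_order(text):
    """Reverse the order of whitespace-separated words on each line.

    Letters inside each word are left untouched, and runs of spaces
    are collapsed to a single space.
    """
    fixed_lines = []
    for line in text.splitlines():
        words = line.split()
        fixed_lines.append(" ".join(reversed(words)))
    return "\n".join(fixed_lines)
```

The result should always be checked against the page image, since a line that was transcribed correctly would be damaged by this step.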
Using this method to transcribe Syriac text makes the text searchable. To search for terms in a transcribed PDF one simply uses the search bar. Below are images of some tests I did searching for a term in a Syriac newspaper.
Searching for a term.
Result of the Search.
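The same kind of search can be sketched outside the browser, once a transcription has been saved as plain text. The example below is an illustration rather than a description of how Google's search works: it assumes that one wants to ignore vowel points and other combining marks when matching, and the function names are hypothetical.

```python
import unicodedata


def strip_marks(text):
    """Remove nonspacing marks (category Mn), which include Syriac vowel points."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def find_term(transcription, term):
    """Return the 1-based numbers of lines containing term, ignoring marks."""
    needle = strip_marks(term)
    return [
        number
        for number, line in enumerate(transcription.splitlines(), start=1)
        if needle in strip_marks(line)
    ]
```

For example, `find_term(text, "ܡܢ")` would list every line of `text` in which that word appears, whether or not the OCR output carries vowel points.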
I have also found that this method can assist with deciphering terms that one is unsure about in the original Syriac. I was transcribing and translating a short text from India, but the first two words in the text were not clear. When I googled them, I saw that the transcription given by Google Books rendered them differently from my initial reading (ܡܢ ܗܫܐ instead of ܡܢ ܗܝܡܐ). This suggests that the Google engine could recognize the word with the help of artificial intelligence, which indicates that it is a potentially powerful tool to have at our disposal.
Finding a transcription for an uncertain term in Google Books.
Those interested in using Google Drive and other OCR tools for Syriac should consider reading:
- Emily Chesley, Jillian Marcantonio, and Abigail Pearson, “Towards Syriac Digital Corpora: Evaluation of Tesseract 4.0 for Syriac OCR,” Hugoye 22.1 (2019): 109-192.
- George Kiraz, “OCR Syriac texts in Google Drive,” North American Society for Christian Arabic Studies Google Group (May 6, 2020).