A Study on the Accuracy of Low-cost User-friendly OCR Systems for Arabic: Part 1

This week’s guest contribution is by Ishida Yuri, Okayama University, Special-Appointment Assistant Professor, and Shinoda Tomoaki, Tokyo University of Foreign Studies, Research Fellow. See Part 2, here.


Overview

In this study, we review the accuracy of inexpensive and easy-to-use optical character recognition (OCR) systems that support Arabic. With humanities researchers in mind, we do not assume any particular knowledge of image processing or command-line interfaces. Digitized texts are a powerful resource in research work, and OCR systems, which convert image data to text data, are increasingly common. And yet, assessments of OCR accuracy, particularly for Arabic, remain scant.

We began by processing 25 sample images containing Arabic text with 17 free or inexpensive, user-friendly OCR systems, producing 425 results. This initial test showed that only 11 of the 17 systems support Arabic. Next, of the 25 sample text images, we chose four containing exclusively machine-printed Arabic lettering without complicated layouts for further investigation.

We used a software tool called “OCR Evaluation” to assess the accuracy of the 11 Arabic-compatible OCR tools in processing these four images. This tool calculates accuracy by comparing the OCR-generated text file against a “ground truth” text file, and displays the results as statistics, including the character error and word error rates.
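To make the two statistics concrete, here is a minimal Python sketch of how character and word error rates are commonly defined: the edit (Levenshtein) distance between the ground truth and the OCR output, divided by the length of the ground truth. It is an illustration under that assumption, not the implementation used by “OCR Evaluation,” whose internals are not described here. The function names are illustrative and the sample strings are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # element of ref missing in hyp
                            curr[j - 1] + 1,      # extra element in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def char_error_rate(ground_truth, ocr_text):
    """Character-level edit distance divided by the ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_text) / len(ground_truth)


def word_error_rate(ground_truth, ocr_text):
    """Word-level edit distance divided by the number of ground-truth words."""
    ref_words = ground_truth.split()
    if not ref_words:
        raise ValueError("ground truth must contain at least one word")
    return edit_distance(ref_words, ocr_text.split()) / len(ref_words)


# Made-up example: the final ya' of the last word is read as alif maqsura.
truth = "محمد بن علي"
ocr_output = "محمد بن على"
print(char_error_rate(truth, ocr_output))
print(word_error_rate(truth, ocr_output))
```

In this example a single misread letter is one character substitution, yet it makes one of the three words count as wrong, which is why the word error rate is typically higher than the character error rate for the same output.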

Examining these error rates for each of the 44 combinations of tool and sample image, we found that i2 OCR, OCR Space, Google Drive, and Fine Reader are comparatively robust options. Regardless, even the most accurate OCR tools will leave the user to perform a certain amount of manual correction work afterward. We also found that the compatibility between the processing tool and the file processed is a factor. We therefore recommend that users first process a few pages of their documents using these four OCR tools, compare the results, and select the one that works best for them. If the accuracy is not satisfactory, waiting a few years is another option, as Arabic-supporting OCR tools are still evolving.
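The recommended trial run can be sketched in a few lines of Python: process the same few pages with each candidate tool, record an error rate per page, and pick the tool with the lowest average. The tool labels and the numbers below are placeholders for illustration only and are not results from this study; `rank_tools` is a hypothetical helper name.

```python
from statistics import mean


def rank_tools(error_rates):
    """Return (tool, mean error rate) pairs, lowest mean error rate first."""
    averages = {tool: mean(rates) for tool, rates in error_rates.items() if rates}
    return sorted(averages.items(), key=lambda item: item[1])


# Placeholder character error rates for two trial pages per tool.
# These values are invented for the example, NOT measurements.
trial = {
    "tool_a": [0.10, 0.20],
    "tool_b": [0.05, 0.07],
    "tool_c": [0.30, 0.10],
}

ranking = rank_tools(trial)
best_tool, best_rate = ranking[0]
print(best_tool)
```

Averaging over several pages rather than judging from a single page matters here because, as noted above, how well a tool performs depends in part on the particular file being processed.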

Introduction

Optical character recognition (OCR) systems extract the text found in image files or heavily formatted documents and convert it into smaller, more easily manipulated plain-text files. The accuracy of OCR has recently improved with the support of artificial intelligence technologies. Although the results are still not 100% accurate and do not eliminate the need for verification with the human eye, OCR has nevertheless become indispensable for digitizing documents.

Today, a range of these image-to-text conversion applications and Internet-based services (collectively referred to here as “OCR”) are available. In addition to commonly found languages such as English and Japanese, some OCR systems also support Arabic, which is the subject of our research. We therefore investigated the accuracy of the latest generation of OCR solutions, focusing on Arabic materials.

In this study we address the accuracy of available OCR options for image files or formatted documents containing Arabic text. Unlike Mansoor Alghamdi’s study[1] on the accuracy of OCR for Arabic, this contribution is intended to provide OCR software selection guidelines for humanities researchers like ourselves. Thus, we focus on OCR tools that provide a graphical user interface, which allows for intuitive operation and does not require any particular knowledge of image processing or command-line interfaces.[2] The results presented here are based on images processed as acquired,[3] without any special preprocessing.[4] This study was limited to images from Arabic-language books with plain typesetting, and excluded handwritten materials[5] such as manuscripts and lithographs, which do not easily lend themselves to comparison.

Survey Methodology

The authors conducted a survey of available OCR solutions for Arabic from late January to early February 2021, and presented their findings under the title “The Current State and Practices of OCR for Arabic” (in Japanese) on March 19, 2021. This contribution for The Digital Orientalist summarizes a portion of the content from that presentation as well as some additional data. For this survey, 25 samples of Arabic text were processed with 17 different OCR systems. Table 1 shows the 425 results[6] and Table 2 details the software used.

Type | Lang. | Num. | No. | OCR tools 1–17
Printed Book | Ar. | Ar. | 1 | ×××××××
Printed Book | Ar. | Ar. | 2 | ×〇 *3××××××
Printed Book | Ar. | Ar. | 3 | 〇 *1〇 *1×× *2××××
Printed Book | Ar. | Ar. | 4 | ×××××××
Printed Book | Ar. | Ar. | 5 | ××××××××
Printed Book | Ar. | Ar. | 6 | ××××××××
Printed Book | Ar. | Ar. | 7 | ××××××××
Printed Book | Ar. | Ar. | 8 | ××××××
Printed Book | Ar. | Ar. | 9 | ××××××××
Printed Book | Ar. | Ar. | 10 ** | ××××××××
Printed Book | Ar. | Ar. | 11 * | ××××××
Printed Book | Ar. | En./Rm. | 12 | ××××××
Printed Book | Ar. | En. | 13 | ××××××
Printed Book | Ar. | En. | 14 | ××××××
Printed Book | Ar./Pr. | Pr. | 15 | ××××××××
Printed Book | Ar./Pr. | Ar. | 16 * | ×××××××
Printed Book | Ar./Pr. | Ar. | 17 | ×××××××
Printed Book | Pr. | Pr. | 18 | ××××××
Printed Book | Ar./En. | Ar. | 19 | EEE×EE
Printed Book | Ar./En. | Ar./En./Rm. | 20 | AAEEEEEE
Printed Book | Ar./En. | En./Rm. | 21 | AAEEEEEE
Printed Book | Ar./En. | En. | 22 | AAEEEEEE
Manuscript | Ar. | Ar./En. | 23 | ×××××××××
Newspaper | Ar. | En. | 24 | ××××××××
Newspaper | Ar./En. | En. | 25 | ×××××××××

Table 1: OCR Survey Results

Guide

〇: Success. The Arabic characters were recognized.

×: Failure. The Arabic characters were not recognized.

△: Mixed. With images that contain a mix of Arabic and Latin scripts, when “Arabic” language is chosen in the OCR settings, the Arabic characters are recognized, but not the Latin. When “English” language is chosen, the Latin characters are recognized, but not the Arabic.

A: Only Arabic characters were recognized (in mixed Arabic and English text).

E: Only Latin characters were recognized (in mixed Arabic and English text).

*: Lithographically printed text.

**: Materials containing photographs and manuscript images.

* 1: Only the first page was recognized.

* 2: Only the first two pages were recognized.

* 3: A formatted document file containing the text could be produced, but not a plain text file with just the text itself.

Type: The source materials included printed books, manuscripts, and newspapers. They were provided as PDF and JPEG files.

Lang: The languages shown in the images included Arabic (Ar.), Persian (Pr.), and English (En.).

Num: The numerals shown in the images included Arabic (Ar.), Persian (Pr.), and Roman numerals (Rm.), as well as numbers spelled out in English (En.).


No. | OCR Tool Name | URL | Arabic Support
1 | Convertio | https://convertio.co/ | Yes
2 | Fine Reader PDF | https://pdf.abbyy.com/pricing/ | Yes
3 | Foxit Phantom PDF[7] | https://www.foxit.com/shopping/ | Yes
4 | Free Online OCR | https://www.newocr.com/ | Yes
5 | Gold | http://www.sakhr.com/index.php/en/solutions/ocr | Yes
6 | i2 OCR | http://www.i2ocr.com/free-online-arabic-ocr | Yes
7 | OCR Convert | https://www.ocrconvert.com/arabic-ocr | Yes
8 | OCR Space | https://ocr.space/ | Yes
9 | Online Convert Free | https://onlineconvertfree.com/ocr/arabic/ | Yes
10 | Sotoor | https://rdi-eg.ai/optical-character-recognition/ | Yes
11 | Adobe Acrobat | https://acrobat.adobe.com/us/en/acrobat.html | No
12 | Google Drive[8] | https://drive.google.com/drive/my-drive | No
13 | Nitro PDF Pro | https://www.gonitro.com/pricing | No
14 | PDF OCR | https://www.pdfocr.net/register.html | No
15 | PDFelement | https://pdf.wondershare.com/ | No
16 | Simple OCR Freeware | https://www.simpleocr.com/ | No
17 | Soda PDF ONLINE | https://www.sodapdf.com/ocr-pdf/ | No

Table 2: OCR Tools Used

Of the OCR tools tested in this survey, ten allow the user to select “Arabic” as the target language (OCR tools #1–10 in Table 2). Google Drive (#12) does not, but can read Arabic nonetheless. This study will focus on these eleven OCR systems. At the time of this survey, the pricing[9] was as follows: OCR tool #1 costs 7.99 US dollars for 100 pages (i.e. USD 0.08/page). #2 costs 25,000 yen (around 300 USD) per year. #5 requires the purchase of an 800 USD desktop application installer CD and a USB flash drive. #10 costs 115 USD for 1,000 pages (around USD 0.12/page). OCR tools #4, 6–9, and 12 are free. #7 allows a maximum of 30 pages per day.
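The per-page figures above are simple divisions of the package price by the number of pages. As a quick check, the following Python sketch reproduces them using decimal arithmetic, so that rounding to whole cents behaves predictably. The helper name `cost_per_page` is illustrative, and the prices are the ones quoted above for tools #1 and #10.

```python
from decimal import Decimal, ROUND_HALF_UP


def cost_per_page(price_usd, pages):
    """Package price divided by page count, rounded to the nearest cent."""
    per_page = Decimal(price_usd) / Decimal(pages)
    return per_page.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(cost_per_page("7.99", 100))   # tool #1: prints 0.08
print(cost_per_page("115", 1000))   # tool #10: prints 0.12
```

The same division gives a rough budget for a given project once the number of pages to be processed is known.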

First, of the 25 sample text images considered, the 16 that contained Arabic exclusively were selected. Next, manuscripts, newspapers, and poetry texts with complicated layouts were omitted, leaving the four sample text images used for this investigation (A–D). It is not possible to include the full images here, but the bibliographic details and examples of each typeface are shown below. These examples include the name Muḥammad and the Arabic letter yāʾ, which can differ greatly depending on the typeface used.

Text Samples Used

A. Muḥammad b. ʿAlī Kharid, al-Ghurar: Ghurar al-Bahāʾ al-Ḍawī wa-Durar al-Jamāl al-Badīʿ al-Bahī fī Dhikr al-Aʾimma al-Amjād, n.p., n.d. [2007], p. 167.

B. Ibn al-ʿArabī, Fuṣūṣ al-Ḥikam, Abū al-ʿIllā ʿAfīfī ed., Bayrūt: Dār al-Kitāb al-ʿArabī, n.d., pp. 5–6.

C. Jaʿfar ibn Idrīs al-Kattānī, Fahrasat Jaʿfar ibn Idrīs al-Kattānī, Muḥammad b. ʿAzūz ed., Bayrūt: Dār Ibn Ḥazm, 2004, pp. 216–217.

D. Ṣiddīq Ḥasan b. al-Qannūjī, Abjad al-ʿUlūm, ʿAbd al-Jabbār Zakkār ed., vol. 1, Dimashq: Manshūrāt Wizārat al-Thaqāfa wa-l-Irshād al-Qawmī, 1978, pp. 248–249.

Acknowledgements

This work was supported by JSPS KAKENHI Grant Number 20H05830. Part of this paper is based on material from Yuri Ishida and Tomoaki Shinoda’s March 19, 2021 presentation (in Japanese) “The Current State and Practices of OCR for Arabic” at the Islamic Trust Studies workshop (C01 “Digital Humanities”, A01 “The Mobility and Universality of Islamic Economics”).

Cover image from Urguza fi-t-Tibb courtesy of the Wellcome Library, London.


References

[1] Mansoor Alghamdi, “A Novel Approach to Printed Arabic Optical Character Recognition” (PhD dissertation, Bangor University, 2019), 43–61.

[2] Command line interface OCR tools for Arabic include Kraken (http://kraken.re/).

[3] Text samples used include PDF files published on the Internet, PDF files created by the authors, and JPEG photo files taken during the survey.

[4] Preprocessing can allow otherwise unreadable photo files to be read, or to be read more accurately. One free preprocessing tool is Scan Tailor (https://scantailor.org/).

[5] For handwritten materials such as manuscripts, there is a handwriting recognition tool called Transkribus (https://readcoop.eu/transkribus/).

[6] Some of the initial survey findings remained inconclusive at the March 2021 presentation due to time constraints. Some revisions resulting from additional testing were made in June 2021.

[7] As of June 8, 2021, this product’s name was changed to “Foxit PDF Editor.”

[8] Requires a Google account.

[9] For Foxit Phantom PDF (#3), we used the free standard evaluation version. Also, the price is omitted because, as indicated by the rebranding mentioned above, it is not currently available.
