A Study on the Accuracy of Low-cost User-friendly OCR Systems for Arabic: Part 2

This week we publish the second part of “A Study on the Accuracy of Low-cost User-friendly OCR Systems for Arabic” by Ishida Yuri, Okayama University, Special-Appointment Assistant Professor, and Shinoda Tomoaki, Tokyo University of Foreign Studies, Research Fellow. You can read the first part here.


Assessing Accuracy

In this study, we evaluated the accuracy of eleven different OCR tools in reading four sample texts in Arabic. Of these 44 combinations, 7 (15.9%) failed completely, producing error messages or blank pages rather than any usable output.

We used a software tool called “OCR Evaluation” (stylized as “ocrevalUAtion”)[1] to assess the accuracy of the OCR processing. This software calculates the accuracy by comparing the text files generated by OCR processing against a prepared “ground truth” text file known to be a flawless reproduction of the source text. The results are displayed as statistics, including the character error and word error rates. A smaller number indicates fewer errors and therefore a more faithful rendering of the source text.
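As a rough illustration of what these two statistics measure, the sketch below computes character and word error rates as the edit distance between the OCR output and the ground truth, divided by the length of the ground truth. This is a common textbook definition and is an assumption here: the exact algorithm used by ocrevalUAtion is not described in this article, and the function names and sample strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance as a percentage of the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def character_error_rate(ground_truth, ocr_output):
    # Compare character by character (spaces count as characters).
    return error_rate(ground_truth, ocr_output)


def word_error_rate(ground_truth, ocr_output):
    # Compare whitespace-separated words.
    return error_rate(ground_truth.split(), ocr_output.split())


# Hypothetical two-word sample: one wrong letter in the second word.
truth = "كتاب الطب"
ocr = "كتاب الطي"
print(round(character_error_rate(truth, ocr), 2))  # 11.11 (1 of 9 characters)
print(word_error_rate(truth, ocr))                 # 50.0  (1 of 2 words)
```

Under this definition a single wrong letter costs far more in the word error rate than in the character error rate, and an empty output yields a rate of exactly 100%.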

The character error rates are shown in Table 3.

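| Tool | Text A | Text B | Text C | Text D | Avg. |
|------|--------|--------|--------|--------|------|
| 1 | 12.54 | 100.00 (blank page) | 11.69 | 14.93 | 34.79 (13.05) |
| 2 | 7.62 | 6.47 | 2.23 | 6.28 | 5.65 |
| 3 | 19.68 | 40.94 | 28.04 | 26.12 | 28.70 |
| 4 | 9.05 | 11.25 | 4.45 | 45.72 | 17.62 |
| 5 | 10.32 | 6.57 | 4.59 | 13.74 | 8.81 |
| 6 | 9.68 | 9.56 | 25.40 | 8.31 | 13.24 |
| 7 | 18.73 | 17.13 | (error message) | (error message) | (17.93) |
| 8 | 16.98 | 9.06 | 4.04 | 8.06 | 9.54 |
| 9 | 16.98 | 22.31 | 100.00 (blank page) | 100.00 (blank page) | 59.82 (19.65) |
| 10 | 100.00 (blank page) | 100.00 (blank page) | 3.34 | 4.41 | 51.94 (3.88) |
| 12 | 8.73 | 5.98 | 17.05 | 14.50 | 11.57 |
| Avg. | 20.94 (13.03) | 29.93 (14.36) | 20.08 (11.20) | 24.21 (15.79) | |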

Table 3: Character Error Rates (%)

Values are rounded to two decimal places. Values in parentheses show the average character error rate when omitting cases where the OCR tool failed completely, producing a blank page or an error message.
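The two kinds of average can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function name and the use of `None` to mark a blank page are our own conventions, while the figures are the Convertio (1) values from Table 3.

```python
def averages(rates):
    """Return (mean over all runs, mean over successful runs only).

    `rates` holds one character error rate per sample text. None marks a
    run that produced a blank page, which is counted as a 100% error rate
    in the first mean and left out of the second.
    """
    counted = [100.0 if r is None else r for r in rates]
    successful = [r for r in rates if r is not None]
    return (round(sum(counted) / len(counted), 2),
            round(sum(successful) / len(successful), 2))


# Convertio (1): texts A, C and D were read, text B came back blank.
convertio = [12.54, None, 11.69, 14.93]
print(averages(convertio))  # (34.79, 13.05)
```

The gap between the two figures shows how heavily a single complete failure weighs on a tool's overall average.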

OCR Convert (7) produced error messages in response to text samples C and D, so those two cells contain no error rate. Convertio (1) produced blank pages in response to text sample B, as did Online Convert Free (9) in response to text samples C and D, and Sotoor (10) in response to text samples A and B, all indicated by 100% character error rates.

The best overall performers produced average character error rates of roughly 13% or less. Fine Reader PDF (2) showed the best performance, followed by Gold (5), OCR Space (8), Google Drive (12), and i2 OCR (6). We also see that performance varied significantly according to both the OCR tool used and the sample text scanned, showing that compatibility between the processing tool and the file processed is a factor. It is also notable that Sotoor (10) performed extremely well with two files, but failed completely with the other two.

The word error rates are shown in the following table.

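| Tool | Text A | Text B | Text C | Text D | Avg. |
|------|--------|--------|--------|--------|------|
| 1 | 26.26 | 100.00 (blank page) | 38.19 | 56.96 | 55.35 (40.47) |
| 2 | 16.16 | 25.00 | 15.63 | 26.96 | 20.94 |
| 3 | 77.78 | 91.84 | 99.65 | 79.57 | 87.21 |
| 4 | 9.09 | 36.22 | 17.36 | 34.35 | 24.26 |
| 5 | 18.18 | 16.84 | 20.49 | 49.13 | 26.16 |
| 6 | 3.03 | 28.57 | 32.29 | 29.13 | 23.26 |
| 7 | 54.55 | 54.08 | (error message) | (error message) | (54.32) |
| 8 | 14.14 | 33.16 | 15.97 | 28.70 | 22.99 |
| 9 | 36.36 | 58.67 | 100.00 (blank page) | 100.00 (blank page) | 73.76 (47.52) |
| 10 | 100.00 (blank page) | 100.00 (blank page) | 15.97 | 11.30 | 56.82 (13.64) |
| 12 | 12.12 | 16.33 | 39.58 | 30.00 | 24.51 |
| Avg. | 33.42 (26.77) | 50.97 (40.08) | 39.51 (32.79) | 44.61 (38.46) | |
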

Table 4: Word Error Rates (%)

Values are rounded to two decimal places. Values in parentheses show the average word error rate when omitting cases where the OCR tool failed completely, producing a blank page or an error message.

Once again, Fine Reader PDF (2) showed the best performance overall, this time followed by OCR Space (8), i2 OCR (6), Free Online OCR (4), Google Drive (12), and Gold (5) with a word error rate of around 25%. The single best performance, however, was an extremely low error rate of around 3% with the combination of i2 OCR (6) and text sample A. This again shows the importance of the OCR tool’s affinity for the sample text scanned.

Among the free options, i2 OCR (6), OCR Space (8), and Google Drive (12) demonstrated the highest accuracy. Among the paid options, Fine Reader PDF (2) is a formidable tool.[2] Although other OCR tools may provide better results for particular files, exploring every combination is unlikely to be the most productive use of time and budget. Even the most accurate OCR tools will leave some amount of correction work to be done. Also, when errors follow a pattern, the actual time it takes to manually correct them may not be as long as suggested by the error rates listed above. For example, if “Muḥammad,” which frequently appears in the text, is mistakenly read as “Maḥd,” only one search and replace operation is needed to correct all the occurrences. For now, it is probably a better idea to choose an OCR tool that seems to work well with the documents you have, rather than worrying too much about the error rates.
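The search-and-replace correction described above can be sketched in a few lines. The function name and the sample sentence are hypothetical; only the “Maḥd” and “Muḥammad” pair comes from the example in the text.

```python
def fix_recurring_misreading(text, misread, correct):
    """Replace every occurrence of a recurring misreading.

    Returns the corrected text together with the number of replacements,
    so the result can be checked against the number of errors expected.
    """
    count = text.count(misread)
    return text.replace(misread, correct), count


# Hypothetical OCR output in which the same name is misread twice.
ocr_output = "Maḥd said this, and later Maḥd wrote that."
fixed, n = fix_recurring_misreading(ocr_output, "Maḥd", "Muḥammad")
print(n)      # 2
print(fixed)  # Muḥammad said this, and later Muḥammad wrote that.
```

A plain replacement like this also changes any place where the misread string happens to be correct, so it is worth reviewing the matches before applying it to a whole document.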

Conclusion

In this paper, we investigated which of the currently available, user-friendly, free or inexpensive OCR options may be suitable for Humanities researchers working with Arabic script documents. We found that i2 OCR, OCR Space, Google Drive, and Fine Reader PDF are comparatively robust options. We recommend that you first process a few pages of your documents using these four OCR tools, compare the results, and select the one that works best for you.

If you don’t find the Arabic character recognition accuracy of these OCR tools satisfactory, you might consider preprocessing the text beforehand, or using an OCR tool that is controlled through a command line interface rather than the more familiar graphical user interfaces. However, this may require investing time and effort to learn the command line interface and does not promise significantly better results. Also, if the need for OCR processing is not urgent, waiting a few years for Arabic character accuracy to improve is another option. If your documents contain sensitive or restricted content, we would suggest that you check the OCR software’s privacy policy as well as the copyright and other related laws in the country where the software originates.

As Arabic-supporting OCR tools are still a developing technology, it may be too early to adopt them for regular use. On the other hand, digitized text can be a powerful resource for research work, allowing for sharing, text mining, and the tagging and annotation techniques of the Text Encoding Initiative.[3] Plain text data also requires a great deal less storage space than photographic or heavily formatted text data. As a fundamental utility that undergirds research work, optical character recognition has earned our continuing attention.

Acknowledgements

This work was supported by JSPS KAKENHI Grant Number 20H05830. Part of this paper is based on material from Yuri Ishida and Tomoaki Shinoda’s March 19, 2021 presentation (in Japanese) “The Current State and Practices of OCR for Arabic” at the Islamic Trust Studies workshop (C01 “Digital Humanities”, A01 “The Mobility and Universality of Islamic Economics”).

Cover image from Urguza fi-t-Tibb courtesy of the Wellcome Library, London.


References

[1] See: https://sites.google.com/site/textdigitisation/ocrevaluation/installation (Last viewed June 14, 2021).

[2] Additional testing using other texts not covered in this paper did not change our assessment of the accuracy.

[3] See: https://tei-c.org/ (Last viewed June 20, 2021).
