At the Dawn of Digital Studies on Arabic Script in France (2): A Brief History of Handwritten Arabic Text Recognition in France

Introduction

The first article of this series explored recent advances in the digital study of Arabic script in France in general. It traced the beginnings of digital studies applied to Arabic script in France, and discussed the emergence of recent projects combining digital humanities and Arabic manuscript traditions.

Among the digital technologies addressing Arabic script, however, Handwritten Text Recognition (HTR) is undoubtedly one of the most fundamental. This article and a forthcoming one will therefore focus in much greater detail on HTR in French academia. The present article first provides a historical account of its application to Arabic script, including the introduction of techniques and methodologies employed in HTR for Latin and other scripts, and offers a non-exhaustive review of digital projects in this area. Finally, we will attempt to assess whether these projects are taken up internationally and advocate for greater openness between the Francophone and international communities.

Note: This article, along with a forthcoming one, is partly based on interviews with Chahan Vidal-Gorène and Clément Salah, whom I would like to warmly thank for their time and contribution.

Kraken

The previous article stated:

‘What we refer to as the dawn of digital studies on Arabic script has already been, and will undoubtedly continue to be, inspired by these innovations developed for Latin and French script corpora.’

While attempting, in this article, to outline a brief history of Arabic-script HTR in France, we realized that this statement was even more accurate than we initially thought.

Ironically, the software Kraken can be viewed as a symbol of openness between German and French academia, a theme we will explore in greater detail in our forthcoming article. Its development by Benjamin Kiessling, whom we mentioned in our previous article, began during his research years in Leipzig. Kraken was specifically designed as an OCR solution tailored to the needs of the humanities, accommodating a vast diversity of languages and scripts, including ancient and non-Latin ones (Arabic, Persian, Syriac, polytonic Greek, Hebrew, etc.). Aimed at researchers, it was intended to let them train their own models and meet the specific constraints of their corpora, particularly in terms of layout (such as annotated margins), where commercial solutions were still falling short at that time.

Kraken’s popularity is undeniable, as demonstrated on this very site by Rohan Chauhan’s articles explaining how to use Kraken on one’s own machine. The German-speaking and French-speaking worlds notably intersected when Benjamin Kiessling began his PhD at the École pratique des hautes études in 2018. During this period, he contributed to several digital humanities projects (Biblissima+, Resilience, Tikkoun Sofrim, and Sofer Mahir).

What particularly interests us here, however, is his involvement with eScripta, where the OCR engine was adapted for the handwritten text recognition software eScriptorium, now one of the leading platforms for HTR on Latin-script texts. Even today, Kraken remains the default engine for Latin scripts in several eScriptorium instances, as the author of this article has personally experienced.

Thus, OCR software originally created for diverse ancient languages, including Arabic, has been successfully adapted into software that is widely popular for Latin scripts. Ultimately, Kraken’s trajectory demonstrates that the Latin and Arabic script worlds are anything but sealed off from one another, and that collaboration between two different academic worlds leads to significant progress.

Calfa projects

Calfa is a company founded in 2014 that specialises in developing artificial intelligence technologies for the detection and automated analysis of manuscripts written in oriental languages such as Arabic, Armenian, and Syriac. Its team, made up of PhD students and artificial intelligence engineers, focuses notably on improving text recognition and digitisation in these languages.

In the field of HTR, Calfa helps private clients recognise Arabic script in their personal archives, but it is above all a pioneer of Arabic HTR in the French academic world. This pioneering effort is part of a broader French context in which digital humanities for the Arabic language have been identified as a strategic priority. Notably, the DISTAM consortium (DIgital STudies Africa, Asia, Middle East) and the GIS MOMM (Scientific Interest Group on the Middle East and Muslim Worlds) have both played a key role in supporting the development of datasets and models dedicated to handwritten text recognition (HTR) for Arabic.¹

In other words, in what we call the ‘dawn’ of work on Arabic script using digital humanities, Calfa was a precursor: it participated in the TariMa project (2022-2024), in which the author of this article took part as an intern, and developed in parallel a dataset for the Recognition and Analysis of Scripts in Arabic Maghrebi (RASAM). Finally, as part of the Agence nationale de la recherche (ANR) LiPoL project, the company developed an HTR model to transcribe around 35,000 pages of the Sīrat al-Malik al-Ẓāhir Baybars, achieving 98% accuracy. In October 2023, at the Baybars Hackathon in Beirut, Calfa trained 16 participants in HTR annotation and transcription, contributing to the creation of this model. For this article we were lucky enough to speak to the founder of Calfa, Chahan Vidal-Gorène, to look back at Calfa’s achievements and the general situation of automatic transcription of Arabic in France.

It’s important to remember that Calfa didn’t work on Arabic straight away. The company’s first experience of Arabic HTR came within a specific institutional framework: that of DISTAM and BULAC, two institutions interested in the automatic transcription of Arabic. Their earlier joint work on Armenian had created links and a relationship that enabled them to extend that experience to Arabic.

To understand the value of Calfa’s work on Arabic script, it should be noted that towards the end of the 2010s there was a severe lack of Arabic data for HTR. And while Calfa had already worked on printed Arabic texts, which had given it a modest foundation, this was a first for HTR.² The particularity of the TariMa and RASAM experiments is that the ground truth was fairly limited, yet still led to extremely satisfactory results. The company relied solely on the data produced during a hackathon organised between January and April 2021. In fact, Calfa’s intention was to use a high-performance model from the outset, rather than transcribing from scratch.

Like the scripts of many medieval Latin sources, the Maghribi khaṭṭ (script) on which the TariMa and RASAM experiments focused is fairly standardised. The environment was therefore fairly favourable to automatic transcription. Although these two experiments focused on a particular Arabic script, they formed a solid basis for all the company’s HTR projects: whether for private clients or the LiPoL project, the HTR models resulting from TariMa and RASAM are considered by Chahan Vidal-Gorène to be a solid foundation for HTR experiments on other Arabic written forms.³

How much of the inspiration for these results came from observing the work done on Latin scripts? While we have stated on several occasions that HTR for Latin and HTR for Arabic have inspired each other, and that we expect this to continue, Chahan qualifies this assertion. Such inspiration did exist within the Calfa projects, notably in the methodology for establishing rigorous transcription standards. Nevertheless, the founder of Calfa asserts that HTR for Arabic and HTR for Latin have developed independently and in parallel in recent years, to the point where he considers that both have reached the same level of performance today. But if Arabic HTR has reached this peak, why is the technique so much less widespread than in the Latin world? The last article in this series will address that question.


  1. A useful synthesis of these initiatives can be found in the report by Noëmie Lucas: http://majlis-remomm.fr/72481.
  2. An observation shared by the author of this article and still valid today: datasets and models are still rare compared with those for Latin-script languages, for example.
  3. In addition, the datasets produced by Calfa and DISTAM are open access. See: https://distam.hypotheses.org/15133 and https://calfa.fr/ocr-arabic-benchmark/
