Introduction
The first article of this series explored the general landscape of digital studies on Arabic script in France. The second narrowed the focus to Handwritten Text Recognition (HTR), tracing its history from Kraken to the pioneering work of Calfa in the TariMa, RASAM, and LiPoL projects.
That second article ended with a paradox. As Chahan Vidal-Gorène, founder of Calfa, pointed out, Arabic HTR has reached a level of performance comparable to that of Latin scripts, and yet the technique remains far less widespread. Why is that? And what can be done about it?
This final article attempts to answer both questions. It draws once again on our interview with Chahan Vidal-Gorène, as well as on an earlier conversation with Clément Salah, co-author of the RASAM dataset, who shared with us his own trajectory into Arabic HTR, from his initial training to his growing technical expertise. His account confirmed what much of this article argues: that researchers in Arabic studies often come to HTR through individual initiative rather than structured pathways, and that the conclusions we reach here about the need for better training, greater mutualization, and international openness are widely shared among practitioners in the field.
Why Does Arabic HTR Remain Under-Represented?
If Arabic HTR has technically caught up with its Latin counterpart, then its limited adoption cannot be attributed to technological shortcomings alone. The reasons are structural.
The most tangible obstacle remains the shortage of training data, and the figures speak for themselves. HTR-United, the collaborative catalog created by Alix Chagué, Thibault Clérice, and Laurent Romary to improve the findability of open ground truth datasets for OCR and HTR, offers a striking illustration. The initiative, which has become a reference for the mutualization of training data, allows researchers to document, share, and discover datasets across scripts, languages, and periods. A glance at the catalog reveals the scale of the imbalance: while dozens of datasets are available for Latin and Old French scripts, covering everything from medieval charters to early modern printed books, only three are currently registered for Arabic script, and two of these are the very datasets collaboratively produced by Calfa, DISTAM, and the GIS MOMM, such as the RASAM dataset for Maghribi manuscripts. In other words, almost the entirety of the mutualized Arabic ground truth available today is the product of a single cluster of actors. As Mehreen Saeed and her collaborators noted in their presentation of the Muharaf dataset at NeurIPS 2024, deep learning models for HTR are inherently data-hungry, and for Arabic, “the scarcity of public datasets, compounded by their relatively small sizes, further exacerbates the challenges” (Saeed et al. 2025), a situation that contrasts sharply with the wealth of resources available for Latin scripts.
The fragmentation of research communities compounds the problem. In France, the study of Arabic script is distributed across codicology, paleography, art history, and Islamic studies, each with its own institutional anchoring. The specialists who study the script and the developers who build the digital tools often work in parallel rather than in dialogue. In the Latin world, structures like CREMMALAB served both as a technical hub and a community of practice. For Arabic script, no fully equivalent structure exists yet.
There is also a question of training. The CREMMALAB initiative introduced a generation of medievalists to HTR tools through workshops and seminars. In Arabic studies, such opportunities have long been scarce, though, as we shall see, this is beginning to change.
The International Reception of French Projects and Emerging Signs
Some French contributions have gained international traction. Kraken and eScriptorium are used worldwide. The datasets produced by Calfa and DISTAM are available in open access. The Baybars Hackathon in Beirut (October 2023) trained 16 participants and contributed to a model achieving 98% accuracy on approximately 35,000 pages. On the training front, promising initiatives are emerging: at the École Pratique des Hautes Études (EPHE), Nuria de Castilla and Riham Mokrani, whose doctoral research on the digital study of Arabic scripts from the Mamluk period we discussed in the first article of this series, organized AljamiaTech in March 2026, an intensive week-long program combining reading sessions on Aljamiado manuscripts with hands-on transcription workshops on eScriptorium for Arabic scripts. This kind of initiative, which brings together paleographic expertise and digital tool training, is precisely what the field needs. At a larger scale, DISTAM and the Leipzig Research Centre Global Dynamics (ReCentGlobe) jointly hosted the first Franco-German Summer School on Digital Humanities and Area Studies in July 2025, bringing together over fifty young scholars from across Europe and beyond. The program featured workshops on HTR for non-Latin scripts, including Arabic, led by Noëmie Lucas and Chahan Vidal-Gorène, demonstrating that the kind of cross-border, multilingual training we have been advocating for is already taking shape.
However, these efforts remain emergent, and their broader visibility is still limited. For one thing, French researchers working on Arabic HTR have been more visible at francophone conferences than at major international DH venues. Scaling up training initiatives beyond the Parisian institutional landscape and making them accessible to the international community remains the next challenge.
The Transcribathon: When Practice Precedes Theory
A recent experience illustrates both the challenges and the possibilities. Under the aegis of the Flow project, we organized a transcribathon dedicated to Ottoman court registers (sijillāt), bringing together approximately fifteen researchers online from French, Swiss, Belgian, Italian, Spanish, and Malaysian institutions. The event aimed to lay the first foundations of an HTR model in Arabic for this type of archival source.
Promoting the transcribathon within francophone academic networks proved to be an effective form of organic collaboration. The initiative was picked up and cited by the DISTAM consortium, which in turn brought it to the attention of a wider circle of francophone researchers working on Arabic-script digital humanities, demonstrating that visibility, when it comes, often travels through institutional relay rather than individual effort.
But perhaps the most telling outcome was unsolicited. Shortly after the announcement, Benjamin Kiessling, the developer of Kraken and one of the architects of eScriptorium, whom we have mentioned throughout this series, reached out directly. Along with colleagues from OpenITI, he has been working on defining “universal” transcription guidelines for Arabic-script manuscripts, with the goal of better mutualizing training data production across projects. Having seen the transcribathon announcement, he proposed to reconcile his emerging guidelines with ours, noting that the Ottoman court registers, written in both Arabic and Turkish, offered an opportunity to ensure compatibility of the standard with Turkish from the outset.
This exchange is worth pausing over. It illustrates precisely the dynamic we have been advocating for throughout this series: when concrete, collaborative events are organized around a shared task, they attract interest, and they generate connections that no institutional framework could have engineered in advance. A transcribathon conceived in Bern, promoted through francophone networks, cited by a Parisian consortium, and noticed by a developer working on universal transcription standards between Leipzig and Paris: this is what effective, organic scholarly collaboration looks like.
Paths Forward
Resources (transcription guidelines, datasets, documentation) should be systematically available in English as well as in French, not to abandon francophone scholarship but to make it accessible.
Above all, the field needs more collaborative events. The Baybars Hackathon in Beirut, the Leipzig Summer School, our transcribathon, and the AljamiaTech workshops all demonstrate the same thing: when researchers are given a concrete occasion to work together, they come, and the results benefit everyone.
Conclusion
Over three articles, we have mapped the landscape of digital studies on Arabic script in France, from the broader panorama to the history of HTR, and finally to a diagnosis of the field’s current state and a proposal for its future.
The picture is one of a field rich in expertise and increasingly dynamic, but still fragmented and insufficiently connected to international networks. The good news is that the foundations have been laid. Tools are available, datasets are being shared, training initiatives are emerging, and collaborative formats are proving their worth. What is needed now is a sustained effort to connect these achievements into a coherent, open ecosystem.
As we stated in the first article, what we have been witnessing is a dawn. Whether the digital study of Arabic script in France fulfills its promise will depend on the willingness of its practitioners to look beyond their immediate horizons and build the bridges that the field so clearly needs.
References
Mehreen Saeed, Adrian Chan, Anupam Mijar, Joseph Moukarzel, Georges Habchi, Carlos Younes, Amin Elias, Chau-Wai Wong, and Akram Khater, “Muharaf: Manuscripts of Handwritten Arabic Dataset for Cursive Text Recognition,” arXiv:2406.09630 (2025), https://arxiv.org/abs/2406.09630.
