Exploring the (Digital) World of Ottoman Turkish Texts: The Digital Ottoman Corpora

It is fascinating to see how Ottoman Turkish texts are transcribed into the modern Turkish alphabet with the help of artificial intelligence (AI)-operated tools, such as Handwritten Text Recognition (HTR) models, providing easy access to the nineteenth- and early twentieth-century Ottoman world for researchers as well as history enthusiasts. Transcribing Ottoman Turkish into the modern alphabet (Latinization) can be time-consuming for historians exploring Ottoman historical documents. While the transcription of printed (matbu) documents is relatively straightforward, transcribing handwritten ones, or rik’a, into the current Turkish alphabet is much more labor-intensive. AI-based technology has not yet advanced far enough to decipher handwritten Ottoman Turkish texts reliably. As a result, decoding these sources still requires human labor: transcriptions take months or even years to produce, and historical research progresses slowly.

This article introduces the Digital Ottoman Corpora, which transforms traditional methods of storing Ottoman Turkish text by integrating AI-operated technology and helping researchers transcribe Ottoman texts more easily. The Digital Ottoman Corpora is an online infrastructure platform where Ottoman Turkish documents are preserved digitally. It was developed by scholars Süphan Kırmızıaltın, Fatma Aladağ, and Elif Derin specifically to address gaps in AI-based technology for Ottoman Turkish.

The primary motivation behind the creation of the Digital Ottoman Corpora was to tackle the challenges historians and researchers face in transcribing Ottoman Turkish texts. By integrating AI-operated tools, the platform aims to enhance the accessibility and usability of Ottoman sources, enabling researchers to focus on analysis rather than the transcription process. It also provides a robust infrastructure for preserving and studying these texts using advanced computational methods, bridging the gap between traditional archival work and modern digital humanities approaches. So far, the platform has launched three subprojects under its banner: HTR, Crowdsourcing, and Digital Edition.

First, the HTR project focuses on applying AI-based automatic transcription to Ottoman Turkish. Currently working with Transkribus, the project seeks to make Ottoman Turkish historical archives more accessible to historians and the general public, and to contribute to the creation of a digital research platform for Ottoman Turkish documents. Its publicly available model on Transkribus is a generalized text recognition model for nineteenth-century Ottoman Turkish periodicals. Trained on 386 pages of late Ottoman-era materials, the most recent HTR model had a Character Error Rate (CER) of 7.20% as of June 2023. This means that for every 100 characters, there are approximately 7 errors, which is a promising result.
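To make the CER figure concrete, here is a minimal sketch of how a character error rate is commonly computed: the edit distance between a human reference transcription and the model's output, divided by the length of the reference. This is an illustration under assumptions, not the project's or Transkribus's own evaluation code; the function names and the sample strings are hypothetical, and the exact normalization Transkribus applies may differ.

```python
import unicodedata


def levenshtein(reference: str, hypothesis: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions that turn hypothesis into reference."""
    # previous[j] holds the distance between the reference prefix
    # processed so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,  # reference character missing in hypothesis
                    current[j - 1] + 1,  # extra character in hypothesis
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    # Assumption: compose accented letters (NFC) so that a letter such
    # as "ü" counts as one character rather than a base letter plus a mark.
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: one wrong letter and one missing letter.
reference_text = "Küçük Mecmua"
model_output = "Kücük Mecmu"
print(f"CER: {character_error_rate(reference_text, model_output):.2%}")
```

By this definition, a CER of 7.20% corresponds to roughly 7 character-level edits per 100 characters of reference text, which matches the interpretation given above.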

Figure 1: A page of Ottoman Turkish with transcription produced by Transkribus HTR model

Second, the Ottoman Turkish Crowdsourcing (OTurC) project connects digital humanities and archival resources. It invites the public to contribute to the creation of digital texts in Ottoman Turkish. This project marks a radical change in a field traditionally grounded in individual archival work and invites participation from both academic and non-academic contributors. The data generated through OTurC serves two purposes. It is used as training data for HTR models, and the transcribed texts are used to produce digital editions, making Ottoman Turkish documents more accessible and navigable for researchers and the public. Contributions from the public also enrich the data set and produce reusable, shareable, and open knowledge. This dual-purpose use highlights the project’s role in advancing both technological development and historical scholarship.

Figure 2: Ottoman Turkish Crowdsourcing (OTurC) project

Finally, the Digital Editions project under the Digital Ottoman Corpora infrastructure publishes facsimiles of Ottoman Turkish texts and their transcriptions on an open-access platform. The first work within this project is the digital publication of Ziya Gökalp’s Küçük Mecmua, a weekly journal published between 1922 and 1923. Küçük Mecmua also provides the thematic framework for the Digital Ottoman Corpora’s crowdsourcing project. It employs a ‘Turkified’ Ottoman Turkish orthography and vocabulary together with Turkish nationalist and folkloric themes. The digital version of Küçük Mecmua is now available on Transkribus. It is fully searchable and is accompanied by an interactive index that allows users to navigate the text easily. Users can also navigate by tags, which is useful for those looking for names, places, and dates.

Figure 3: List of tags in Küçük Mecmua

Figure 4: Küçük Mecmua as a Digital Edition

Recent developments in decoding Ottoman Turkish texts with AI-based technologies are important steps for the field. The Digital Ottoman Corpora plays a crucial role in this progress, providing a digital infrastructure that facilitates the transcription, preservation, and study of Ottoman Turkish sources. The team is currently formulating two separate HTR projects focusing on manuscripts (el yazmaları) and judiciary records (sicil yazmaları), which aim to further enhance the accessibility and accuracy of Ottoman Turkish historical archives. With its advanced digital infrastructure, the Digital Ottoman Corpora provides an invaluable resource for scholars and enthusiasts researching the late Ottoman Empire.

References

The Digital Ottoman Corpora. https://www.digitalottomancorpora.org.

Transkribus. https://beta.transkribus.org/sites/kucukmecmua.org.


Further Reading

Aladağ, Fatma. “Crowdsourcing for Ottoman Studies: Zooniverse.” The Digital Orientalist. Last modified November 5, 2021.
