At Utrecht University, the project ‘Bridging the Gap: Digital Humanities and the Arabic-Islamic corpus’ seeks to “harness state-of-the art Digital Humanities approaches and technologies to make pioneering forays into the vast corpus of digitised Arabic texts that has become available in the last decade”. The project, which started on 1 January 2018 and is funded by the Netherlands eScience Centre, is run jointly by Christian Lange (PI), Melle Lyklema (co-PI) and Umar Ryad (Leuven, co-PI).
Aim of the project
The main objective is to develop a web-based interface where researchers can load an existing corpus or upload their own and run high-level text searches and analyses. The website is being built by two engineers, Dafne van Kuppevelt and Janneke van der Zwaan. Its tools include a search function that looks up words, word stems, or roots of words, and displays basic statistics that can be filtered by the metadata added to the corpora. This tool is an extension of BlackLab from INL. The search results can also be downloaded as a CSV file, and the user decides how much context (the number of words) to include before and after the search term. This yields a sub-corpus of the words surrounding a term or a root, which can then be analysed further. As the title suggests, the project aims to stimulate and contribute to the field of Arabo-Islamic digital humanities, where research and publications are still few.
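To make the context-window download concrete, here is a minimal Python sketch of the idea. It is not BlackLab's code or the project's implementation: the function names `kwic_rows` and `rows_to_csv`, the whitespace tokenisation, and the exact-match lookup (no root or stem matching) are all assumptions made for illustration.

```python
import csv
import io


def kwic_rows(text, term, context=3):
    """Return (left, keyword, right) rows for every exact match of `term`.

    Tokenisation is plain whitespace splitting, an assumption made for
    brevity; no root or stem matching is attempted here.
    """
    tokens = text.split()
    rows = []
    for i, token in enumerate(tokens):
        if token == term:
            # Up to `context` words before and after the hit.
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            rows.append((left, token, right))
    return rows


def rows_to_csv(rows):
    """Serialise the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["left", "keyword", "right"])
    writer.writerows(rows)
    return buffer.getvalue()


sample = "قال الشافعي في كتاب الأم إن الصلاة واجبة على كل مسلم"
rows = kwic_rows(sample, "الصلاة", context=2)
print(rows_to_csv(rows))
```

Each row keeps the hit in its own column, so the downloaded file can be sorted or filtered on the keyword while the surrounding words stay attached to it.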
The main corpora that will be available on the website are (1) Fiqh (Islamic law), (2) Arabic poetry, (3) Majallatu al-manār, and (4) Daʿwa (preaching of Islam). The fiqh corpus is a collection of 55 major fiqh books, chosen per century for each Islamic school of law. It covers the 2nd century AH/8th century CE up to the 13th century AH/19th century CE. For each century there is a legal book (the main available digital copy) for the Ḥanafī, Ḥanbalī, Shafiʿī, and Imāmī schools. This corpus contains approximately 50 million words; nonetheless, the search engine already returns results in less than a second. The books are tagged with metadata that allows the user to filter the results: one could, for example, search for a word and see the results for one school of law only, or for particular centuries. The poetry corpus has, for now, about 155 dīwāns (a dīwān is the collected poems of one poet), from the pre-Islamic period to the 20th century. Majallatu al-manār is an Islamic magazine of 35 volumes published in Cairo between 1898 and 1935 by Rashīd Riḍa. Lastly, the Daʿwa corpus comprises about 300 books (connected to the project of Melle Lyklema).
Building the corpora
Thanks to enthusiastic volunteers across the Arabic-speaking countries (mainly Saudi Arabia’s al-Shamila project), many Arabic texts have been digitised and are available online. Other websites and libraries (such as Almeshkat) also provide digitised, searchable texts. I work for this project myself, and one of my jobs is to locate the texts online, collect them, and convert them to plain text (.txt) files. Some texts were easily found on the net; others were playing hide and seek. The books that we needed and found on al-Shamila (or on Open ITI) were no problem, but others (especially the Imāmī ones) were available only in epub format or as text on a web page. The epub files (basically a set of compressed HTML files) were unzipped, merged, cleaned of any markup, and then converted to .txt format. The texts that are only available online (such as the poetry, which was collected mainly from Adab.com) were scraped with the BeautifulSoup library in Python.
Then the main challenge came, namely cleaning and tagging the texts. In the fiqh corpus, for example, we need only the original body of the text: no footnotes, no modern introduction, and so on, just the text as written by the original author. Next we need to tag the metadata in the text itself (e.g. chapters, sub-chapters, sub-sub-chapters, sub-sub-sub-chapters, Quran quotations, Ḥadīth quotations, etc.). Here regular expressions come to help. This is the most fun part of the process, because each book is different and has its own patterns.
As mentioned before, these texts were typed by volunteers (since OCR for Arabic is still inadequate), and each typist has their own style of typing. The Quranic text, for example, is written with tashkīl (vocalisation) most of the time, which makes it easy to locate. Moreover, typists mostly use specific parentheses for the Quran quotations, so the beginning and the end of each quotation are easy to find and tag.
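The epub conversion described above (unzip, merge, strip the markup, keep plain text) could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the script used in the project: the names `TextExtractor` and `epub_to_text` are hypothetical, only the standard library is used, the files are assumed to be UTF-8, and file-name order is assumed to match reading order.

```python
import io
import zipfile
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style", "title"}


class TextExtractor(HTMLParser):
    """Collect text nodes and put a line break after each block element."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

    def text(self):
        # Normalise whitespace inside each line and drop empty lines.
        lines = (" ".join(line.split())
                 for line in "".join(self.parts).splitlines())
        return "\n".join(line for line in lines if line)


def epub_to_text(epub_file):
    """Unzip an epub, strip markup from each (X)HTML member, merge the text."""
    chunks = []
    with zipfile.ZipFile(epub_file) as archive:
        # Assumption: sorted file names approximate the reading order. A more
        # robust version would follow the spine in the epub's package file.
        names = sorted(name for name in archive.namelist()
                       if name.lower().endswith((".html", ".xhtml", ".htm")))
        for name in names:
            parser = TextExtractor()
            parser.feed(archive.read(name).decode("utf-8"))  # assumes UTF-8
            parser.close()
            chunks.append(parser.text())
    return "\n\n".join(chunks)


# Build a tiny two-chapter epub-like archive in memory to try it out.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("OEBPS/ch2.xhtml",
                     "<html><body><p>باب الصلاة</p></body></html>")
    archive.writestr("OEBPS/ch1.xhtml",
                     "<html><body><h1>كتاب الطهارة</h1>"
                     "<p>باب <b>المياه</b></p></body></html>")

merged = epub_to_text(demo)
print(merged)
```

The result can then be written to a .txt file with `open(path, "w", encoding="utf-8")`. Footnotes and modern introductions would still have to be removed separately, since they depend on each book's layout.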
Other, less regularly structured text is much more challenging and time-consuming to tag, and catching it with regular expressions gives varying results.
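For the easy case, Quran quotations that are both vocalised and wrapped in dedicated parentheses, the regular-expression step might be sketched as follows. Everything specific here is an assumption for illustration: that the typist used the ornate parentheses U+FD3F and U+FD3E, the hypothetical `<quran>` tag syntax, and the function name `tag_quran`. In practice the pattern has to be adapted for each book.

```python
import re

# Assumption: the typist wrapped Quran quotations in ornate parentheses,
# U+FD3F to open and U+FD3E to close. Other books may use different marks.
OPEN, CLOSE = "\uFD3F", "\uFD3E"
QUOTE = re.compile(f"{OPEN}([^{OPEN}{CLOSE}]+){CLOSE}")

# Arabic vocalisation marks, from fathatan (U+064B) to sukun (U+0652).
TASHKIL = re.compile("[\u064B-\u0652]")


def tag_quran(text):
    """Wrap vocalised, parenthesised spans in hypothetical <quran> tags."""

    def replace(match):
        body = match.group(1).strip()
        if TASHKIL.search(body):
            return f"<quran>{body}</quran>"
        # No tashkil inside the parentheses: leave it for manual review.
        return match.group(0)

    return QUOTE.sub(replace, text)


sample = ("قال تعالى " + OPEN + "وَأَقِيمُوا الصَّلَاةَ" + CLOSE
          + " ثم ذكر (باب المياه)")
print(tag_quran(sample))
```

Requiring both cues, the parentheses and the tashkīl, trades recall for precision: an unvocalised quotation is left untouched rather than risk tagging an ordinary parenthetical remark.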
Currently I am gathering and tagging the poetry and the Daʿwa texts. The project ends in February 2019 and will host a conference on the topic at the Royal Netherlands Academy of Arts and Sciences (KNAW) on 13-15 December 2018.