The KITAB project is developing a corpus of Arabic texts and applying digital methods to it so that users can explore the texts in new ways and expand their knowledge of complex textual traditions. For this contribution to The Digital Orientalist I have the pleasure of interviewing the project’s principal investigator, Sarah Savant, Professor at the Aga Khan University–Institute for the Study of Muslim Civilisations (AKU-ISMC).
Q1. Theodora Zampaki: Professor Savant, could you please say a few things about yourself and your academic position?
Professor Savant: I am a Professor of History at the Aga Khan University, Institute for the Study of Muslim Civilisations (AKU-ISMC) in London. I came to London from the US in 2007 and joined the AKU-ISMC with our first cohort of students. Professionally, I matured within and with the AKU-ISMC.
Q2. What motivated you to start the KITAB – Knowledge, Information Technology, and the Arabic Book project?
My first book dealt with memory and revisions to it during the period of Iran’s conversion to Islam. I thought of the challenge as stratigraphy – seeing the layers of history writing. But I was not satisfied with the mechanism of choosing those narratives, and also with how representative they might be of a larger tradition. Around 2011, I was using anti-plagiarism software, and it occurred to me that it could be repurposed to study the ways that Arabic authors reused and revised material from earlier authors. The potential excited me. I did not fully comprehend then how powerful such software could be, though I had a sense of it. And as I worked with partners over time, I realized that being very ambitious in terms of seeing as much of the written tradition as possible was important for choosing and contextualising any body of material we study for any problem at all, including the one I started with.
Q3. Could you guide us through the way you assembled a team and launched the project?
It took a while. I tried to get TurnItIn involved, but got no real response. Then I sat on the idea for about a year. I played with the software using Shamela texts to see what I could get it to do myself – which was something, but not much. Then in 2012, I met Gregory Crane, editor of the Perseus Digital Library and a pioneer in the field of Digital Humanities. I told him about my idea and he was very positive – he got it. He introduced me to a lot of people who came to be my main collaborators – including Maxim Romanov and David Smith. The registrar of the ISMC at that time, Sohail Merchant, was also working on the side as a developer, and he and I spoke a ton about the project and what might be required to study text reuse. In 2014, the president of the University, Firoz Rasul, heard me present my idea and suggested we seek help from a network of volunteers who support work at the different entities overseen by the Aga Khan. Two developers – Ahmad Sakhi and Malik Merchant – joined Maxim, Sohail, and me. The five of us then worked solidly on the concept, then with David’s support began experiments on texts. Everyone was so generous with their time – we met most Sundays from 2014 until the beginning of 2018. The AKU gave us an outdated server and then in 2017 also pledged an investment in further start-up costs, including a new server. There were also tech colleagues within the AKU, I should say, who helped us set things up.
It was only in 2018 that we secured funding from the ERC and were able to hire a team. The ERC funding supported this team, and the AKU supported it as well by funding our computer scientist, Masoumeh Seydi. At the same time, we received additional funding through a project with the Qatar National Library focused on the Sira of Ibn Ishaq and its many witnesses, and shortly afterwards, with our partners, funding from the Andrew W. Mellon Foundation for work on Optical Character Recognition. That period was intense – it was like buses. We waited for some years, then they all came at the same time.
Q4. Could you describe the content and structure of the project?
The project ultimately is interested in understanding the history of Arabic texts – how they came into existence, how they circulated, how they were cited, and what we can discern from such patterns about the circulation of ideas and memories in the Islamicate world during this period. A large part of the project has involved rethinking the very idea of a book – because once you get into the weeds, it is very clear from the metadata on our collections, the reuse and citation data, and also manuscript catalogues and case studies that the idea of a fixed book copied repeatedly over centuries vastly misrepresents the tradition. The picture is much more interesting. That is true in the early periods but also much later.
KITAB has three parts. First, with KITAB we want to build a corpus of Arabic texts, chiefly focused on the period ~700–1500. This involves acquiring machine-readable files; we also work on OCR through the Mellon grant. The team annotates and verifies text files. Lorenz Nigst oversees the corpus as a whole, and took over from Maxim, who is now in Hamburg running his own project. Peter Verkinderen is doing a lot of work to analyse the corpus in relation to the written tradition as a whole. Secondly, we develop methods to study book history – including text reuse detection, but also work on named entities and citations. Masoumeh has done several experiments with the team, and she and I are beginning to publish. Also, David’s student Ryan Muther is doing a PhD focused on reuse and citation and works closely with team members; he was originally funded by the Qatar National Library project, and now continues to work with us through the ERC and AKU-ISMC. His results are now being published too. Finally, and crucially, we want to publish relevant studies on the written tradition and memory. We have three books in progress at the moment, plus articles and presentations by team members. Aslisho Qurboniev is presenting in early December at Exeter on his work. Our blogs are worth a read too.
In my view, we are part of a field that does not yet have a name – but can be seen in wider culture. I call it the History of Reuse and Recycling.
Importantly, the story does not end with KITAB. We contribute our texts and share expertise through another partnership – the OpenITI (Open Islamicate Texts Initiative). I am the co-PI with Matt Miller at the University of Maryland and Maxim. Through the OpenITI we are seeking to build a resource for the field that can outlast any individual project. So when I refer to the KITAB corpus, I mean the KITAB corpus within the OpenITI on GitHub. Matt and UMD have led the Mellon grant. David is involved too, with Northeastern. And also Intisar Rabb, from Harvard and the SHARIAsource project. We’re putting our heads together on problems relating to the corpus, but also thinking about a lot of the bigger research questions together. David’s questions about layout analysis make us think about book structures in new ways. Intisar and I are just starting a project that involves thinking about cultural history broadly.
So, in terms of the project’s structure, I want to paint a picture of a group of people talking to each other – not just about the technical problems, though those are important. Everyone brings something and we welcome partners and newcomers. We have internal debates; most recently I had a long and fruitful set of discussions with Kevin Jaques, from the QNL project but based at Indiana University, about citation terms in Tabari’s works. Gowaart Van Den Bossche’s interest in periods that fall after my own regularly results in chats about what made the ninth century different from the fourteenth. Abdul Rahman Azzam and Mathew Barber, both originally also with the QNL project, but now with the ERC, have spent much time thinking about reading and public outreach. Team members also have many conversation partners beyond the project.
Q5. How easy is it for a researcher to use the website (https://kitab-project.org/) of the KITAB project?
You will have to tell me! Mathew has recently led a revamping of the website to pave the way for us sharing more of our data in the years ahead. The website has been redesigned with four key goals: 1. To explain our corpus and data sets; 2. To explain our various digital methods; 3. To provide clear documentation on our data; 4. To act as a portal for applications that allow for easy interaction with our data. The website currently serves goals 1–3, and we will be adding user-friendly applications in the coming years. We hope that the website will become a place where users of our data (both digital specialists and more traditional humanists) can interact with that data, learn how it was produced, and learn how to use it in their own projects.
Q6. Let’s turn to the Arabic texts included in the KITAB corpus. I see that you have included texts of the premodern Arabic tradition coming from a variety of sources, plus their corresponding metadata files. What are the criteria for the selection of these texts?
There are several. On the one hand, we have taken files that are freely available on the internet. An enormous amount of work has been done over the past 20 or so years by colleagues working internationally to build up what we have now. We acknowledge the work of our predecessors in the URIs. We are now beginning to add works to these through an OCR pipeline; these we are adding based on our research projects. We are also building user groups and seeking their input on what to add. We welcome recommendations. Many large projects today create corpora and have a requirement to make their materials freely available afterwards. Putting them into the OpenITI will do that.
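To give a sense of what those URIs encode, here is a rough parsing sketch. The pattern is my assumption based on the general shape of OpenITI text URIs – an author death date in AH, a short author name, a title, and then a version ID naming the source collection (which is where the acknowledgment of predecessors lives), for an illustrative URI such as `0310Tabari.Tarikh.Shamela0009783-ara1`; the OpenITI documentation is the authority on the actual scheme.

```python
import re

# Assumed URI shape: {death date AH}{Author}.{Title}.{VersionID}-{lang}{edition}
# This regex is illustrative, not the official OpenITI specification.
URI_RE = re.compile(
    r"(?P<death_ah>\d{4})(?P<author>[A-Za-z]+)"
    r"\.(?P<title>[A-Za-z0-9]+)"
    r"\.(?P<version>[A-Za-z0-9]+)-(?P<lang>[a-z]{3}\d+)"
)

def parse_uri(uri):
    """Return the URI's components as a dict, or None if it does not match."""
    m = URI_RE.fullmatch(uri)
    return m.groupdict() if m else None
```

Note that the version component (here, a Shamela identifier) is what records which earlier digitisation effort a file came from.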
The biggest challenge for us now is taking serious account of the manuscript tradition. Current OCR works well (and is continually improving) for printed texts, but it performs poorly for handwritten text. Team members are currently adding short manuscript transcriptions (typing them out by hand) to OpenITI that relate to their own research projects, but the ability to automatically transcribe manuscripts would allow us to further broaden the scope of the corpus. As a project, we hope to play a major role in work on handwritten text recognition.
Q7. In what way can these texts be used by Arabists/Orientalists or even Classicists?
We envisage a wide variety of use cases for OpenITI texts. The annotation schema used in our texts, OpenITI mARkdown, developed by Maxim Romanov, is a lightweight markup language designed with Arabic text in mind. Those working with our texts computationally can easily strip out that annotation if it interferes with their analyses, or they can use it to guide their analysis. A growing subset of our texts, for example, have structural annotation indicating section boundaries and headings. A researcher could split the text according to those boundaries and then perform analysis on or compare subsections. As OpenITI mARkdown is lightweight and readable, our texts can simply be read like any printed text. Texts with structural annotation can, with the appropriate program, be collapsed into a list of headings (a kind of table of contents), allowing for easier reading. It is our hope that less-technically experienced researchers will in the future feel confident using basic forms of mARkdown to effectively take notes within the text. For example, should a researcher be interested in analysing passages about a certain event across multiple chronicles, they could add tags for text relating to that event within the chronicles themselves, allowing them to easily return to those passages later on.
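As a concrete illustration of splitting a text on its structural annotation and stripping the markup for analysis, here is a minimal sketch. It assumes some common mARkdown conventions (a metadata header ending in `#META#Header#End#`, structural headings beginning `### |`, paragraph lines beginning `# `, `~~` line continuations, and `PageVxxPxxx` page markers); the function names are mine, and the official mARkdown documentation should be consulted for the full syntax.

```python
import re

def split_markdown_sections(text):
    """Split an OpenITI mARkdown file into (heading, body) pairs,
    using '### |' lines as section boundaries."""
    # Drop the metadata header, if present.
    if "#META#Header#End#" in text:
        text = text.split("#META#Header#End#", 1)[1]

    sections = []
    heading, body = None, []
    for line in text.splitlines():
        if line.startswith("### |"):              # structural heading
            if heading is not None or body:       # flush the previous section
                sections.append((heading, "\n".join(body)))
            heading, body = line.lstrip("#| ").strip(), []
        else:
            body.append(line)
    sections.append((heading, "\n".join(body)))
    return sections

def strip_annotation(text):
    """Remove mARkdown markers, leaving plain text for computational use."""
    text = re.sub(r"PageV\d+P\d+", "", text)          # page markers
    text = re.sub(r"^#+ \|*", "", text, flags=re.M)   # heading/paragraph marks
    text = text.replace("~~", " ")                    # line continuations
    return text
```

A researcher could then run each section body through `strip_annotation` before tokenisation, or keep the headings to compare subsections, as described above.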
Also, I would note the most basic use of the texts: search. Just being able to search the texts is hugely beneficial for language learners, translators, and anyone researching a topic.
For those with no Arabic at all, searching our metadata might be useful. We are continuously improving it and will be working on classifying it too.