An interview with Professor Sarah Savant about the KITAB project, part 2

This is the second part of the interview with Sarah Savant, Professor at the Aga Khan University–Institute for the Study of Muslim Civilisations (AKU-ISMC) and principal investigator of the KITAB project. You can read part one here.

Q8. Theodora Zampaki: Bearing in mind the Graeco-Arabic translation movement, could you explain how a scholar working on Graeco-Arabic Studies can use the KITAB corpus, e.g. to find and search for relevant material in its files?

Professor Savant: The main starting point for any user of the OpenITI Corpus is our metadata search application (https://kitab-corpus-metadata.azurewebsites.net/). Users can search for books in the corpus using this application, by any field or combination of fields within the metadata. In the future, one might envisage filtering the texts in the metadata search application according to whether they are translations from Greek or Syriac, but this is not possible at present.

If a researcher wished to search individual book files for specific types of material, they would need to download the corpus and analyse it on their own computer. The whole corpus can be downloaded as a zip file on Zenodo (we publish news of our releases on our website, here: https://kitab-project.org/corpus/releases). One may then look for material using a variety of techniques, depending on one’s technical expertise. The easiest method is to assemble a sub-corpus and search across it using a specialist text editor. We hope to publish a guide on how to do this on our website.
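
For readers who want a concrete picture of that last step, here is a minimal sketch in Python of searching a locally assembled sub-corpus for a term. It is not a KITAB tool; the folder name, file pattern, and query are placeholders.

    # A minimal sketch of searching a locally assembled sub-corpus.
    # Assumes the downloaded corpus (or a selection of its text files) has been
    # unzipped into the folder below; the folder name and query are placeholders.
    from pathlib import Path

    CORPUS_DIR = Path("openiti_subcorpus")   # hypothetical local folder
    QUERY = "جالينوس"                        # e.g. looking for mentions of Galen

    for text_file in sorted(CORPUS_DIR.rglob("*")):
        if not text_file.is_file():
            continue
        content = text_file.read_text(encoding="utf-8", errors="ignore")
        for line_no, line in enumerate(content.splitlines(), start=1):
            if QUERY in line:
                print(f"{text_file.name}:{line_no}: {line.strip()}")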

Q9. What are the digital tools and methods applied to the corpus?

So, there are three core methods applied to our corpus: text reuse detection, citation detection, and named-entity recognition (within citations). We are experimenting with a range of other methods (for example, stylometry), and you can find detailed guides on these methods, their applications and drawbacks on our website (https://kitab-project.org/methods).

Text reuse detection (https://kitab-project.org/methods/text-reuse) is the primary method applied to the whole corpus. We use an algorithm called passim, authored by David and improved by Ryan, to do this. Back in 2016, when the volunteer team first successfully ran passim on thousands of texts, I was dancing in my kitchen as I saw all of the relationships documented in one go. I have since learned how hard it is to dig into this data and how misleading it can be if not looked at closely and critically. But it is still exciting.

The process begins by splitting our texts into 300-word chunks (the milestones that you will find annotated in all OpenITI texts). This allows us to see more easily how texts are rearranged when they are reused, and to see which parts of texts are reused more heavily. Each 300-word chunk in a text is then compared to every other 300-word chunk in every other text. Through this process passim creates files documenting instances where text is shared between two books in the corpus and the milestones in which this reuse occurs.
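
As an illustration only (this is not passim itself, which does a great deal more than chunking), the milestone-splitting idea can be sketched in a few lines of Python:

    # An illustration only (not passim): splitting a text into consecutive
    # 300-word chunks, mirroring the milestone idea described above.
    def chunk_text(text, size=300):
        words = text.split()   # naive whitespace tokenisation
        return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

    with open("example_book.txt", encoding="utf-8") as fh:   # hypothetical file
        chunks = chunk_text(fh.read())
    print(len(chunks), "chunks of up to 300 words each")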

We are interested in all forms of citation, but our citation detection method currently focuses on identifying, splitting, and studying isnads (citation chains) and their networks. A method developed by Ryan uses machine learning to identify the locations of isnads within the corpus. The algorithm has been presented with a selection of texts containing training data (texts where the locations of isnads – both where they begin and where they end – have been annotated by specialists). The output is not perfect, but for certain types of text the method performs very well. We now have statistics for every text in the corpus showing what percentage of its words belongs to isnads.
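
That statistic is straightforward to picture once the isnad boundaries are known. Here is a rough sketch, assuming the detected isnads are available as (start, end) character offsets; the classifier's actual output format may well differ.

    # Rough sketch: what share of a text's words falls inside isnads,
    # assuming detected isnads are given as (start, end) character offsets.
    def isnad_word_share(text, isnad_spans):
        total_words = len(text.split())
        isnad_words = sum(len(text[start:end].split()) for start, end in isnad_spans)
        return 100 * isnad_words / total_words if total_words else 0.0

    sample = "حدثنا فلان عن فلان قال: ..."   # placeholder text
    spans = [(0, 23)]                        # placeholder isnad offsets
    print(f"{isnad_word_share(sample, spans):.1f}% of the words belong to isnads")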

Our work on named entities builds on the existing work to identify isnads. We aim to identify the names of individual transmitters within citation chains and relate them to one another (to identify, for example, transmitters who are involved in multiple chains in the same work or in multiple works). To do this we need to identify the names automatically, distinguish between different transmitters who share the same name, and group together the multiple names that correspond to a single transmitter. Ryan is testing a number of approaches for resolving this problem, and we are testing his analyses by creating ground truth for isnads in two important texts that have quite different approaches to citation: al-Tabari’s Ta’rikh and Ibn ‘Asakir’s Ta’rikh Madinat Dimashq.
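
To illustrate just one small step of that problem (this is a toy example, not the method Ryan is developing), surface variants of a transmitter's name can be grouped under a crude normalised key so that candidate matches can be collected for closer inspection:

    # Toy illustration (not the project's method): grouping surface variants
    # of transmitter names under a crude normalised key.
    import re
    from collections import defaultdict

    def normalise(name):
        # strip a few common elements and extra whitespace; purely illustrative
        name = re.sub(r"\b(ابن|بن|أبو|أبي)\b", "", name)
        return " ".join(name.split())

    names = ["محمد بن إسحاق", "محمد إسحاق", "ابن إسحاق"]   # made-up variants
    groups = defaultdict(list)
    for n in names:
        groups[normalise(n)].append(n)

    for key, variants in groups.items():
        print(key or "(empty key)", "->", variants)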

Ibn Asakir’s Ibn Sa’d citations (with the tube map)

Q10. Could you elaborate on mARkdown and the annotation process?

We describe the annotation process in detail in the documentation on our website, which can be used as a guide for anyone who would like to understand the process fully. First the text is acquired and added to the corpus. If the text has already been digitised, then it might be possible to convert existing structural mark-up (for example in HTML) into OpenITI mARkdown. Whether or not the text has existing digital mark-up, mARkdown is implemented on OpenITI texts as part of a two-stage process: annotation and vetting.

The role of the first annotator is to add or check the structural annotation of a text against the printed edition upon which the text is based. They will use a combination of the formatting of the printed text and their own understanding of the text to create a nested hierarchy of headings and sub-headings. The sophistication of this nested annotation can vary enormously. For example, al-Qalqashandi’s Subh al-A’sha has a very formalised nested structure which can often have more than five levels of heading. Take a look at the OpenITI to see for yourself. Other texts will just have one level of heading throughout. OpenITI mARkdown can be used to annotate a wide variety of features, but our primary focus is on annotating structural features. This makes the text more navigable and allows for more directed forms of digital analysis. The first annotator will also make assessments of text quality. They will identify transcription errors or HTML tags left in the text, and make notes in a special metadata yml file. For efficiency, we intend to identify the most prominent issues with texts in our corpus and deal with them collectively, rather than resolving them on an ad hoc basis.
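
To give a sense of how this structural annotation supports digital analysis, here is a small Python sketch that prints the heading hierarchy of a mARkdown file, assuming the documented convention that structural headings begin with "### |", "### ||", "### |||", and so on (one pipe per level); the filename shown is hypothetical.

    # Small sketch: print the heading outline of an OpenITI mARkdown file,
    # assuming headings are marked "### |", "### ||", ... (one pipe per level).
    import re

    HEADING = re.compile(r"^### (\|+)\s*(.*)")

    def outline(path):
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                match = HEADING.match(line)
                if match:
                    level = len(match.group(1))
                    title = match.group(2).strip()
                    print("  " * (level - 1) + f"level {level}: {title}")

    # outline("0821Qalqashandi.SubhAcsha-ara1.mARkdown")   # hypothetical filename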

The second annotator is then responsible for reviewing the decisions made by the first annotator. They will review the structural annotation alongside the printed edition and make any necessary changes. If the team has agreed upon a fix to any major issues identified by the first annotator, then the second annotator may apply those fixes.

As OpenITI texts are all stored on GitHub, every stage of the annotation process is archived there. If we (or any other user) wished to return to the work of the first annotator, or even to the original digital text, we could do so using the version history.

Q11. How does the set of core visualizations work?

We rely on three tools for building visualisations: JavaScript, PowerBI and Python. For visualising text reuse between pairs of books we use a JavaScript application designed at the very beginning of the project. The visualisation can be seen across our publications (especially our blog), and we plan to iterate on it to create an application that can be used by the general public in the near future.

PowerBI is largely used for representing statistical data, both as tables and graphs. For example, we have a PowerBI application for searching and analysing our text reuse statistics.

Alongside these two tools, we are developing a number of other visualisations in Python Dash and Bokeh (packages that create interactive visualisations from a Python script), which allow us to build interactive visualisations easily when we need them (the team are mostly proficient in Python, rather than JavaScript).
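
As a minimal illustration of the sort of thing these packages make easy (this is not one of the project's dashboards, and the numbers are invented), a few lines of Bokeh produce an interactive chart of reuse per milestone:

    # Minimal Bokeh illustration (not a KITAB dashboard): an interactive bar
    # chart of made-up text reuse counts per milestone.
    from bokeh.plotting import figure, show

    milestones = list(range(1, 11))                             # placeholder milestone numbers
    reused_words = [120, 80, 0, 45, 300, 290, 15, 0, 60, 210]   # invented values

    p = figure(title="Words reused per milestone (illustrative data)",
               x_axis_label="milestone", y_axis_label="reused words",
               height=300, width=600)
    p.vbar(x=milestones, top=reused_words, width=0.8)
    show(p)   # opens the interactive chart in a browser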

Scroll visualisation with two chronicles: Ibn Taghribirdi’s Nujum on the top and Ibn al-Athir’s Kamil on the bottom (one can see how the material in Ibn al-Athir is compressed into a small part of the Nujum, as it covers a much longer period)

Q12. How can Arabists use the corpus data and metadata application? Can you give an example?

The metadata application allows users to search metadata about the books in the corpus using keywords. It can be used to identify all the books attributed to an author, or to search by title or date. For example, if one wished to download a specific book in OpenITI for digital analysis, one would first search for the author’s name and the book title, in either Arabic or Latin transliteration. This search will return a list of links to books in the OpenITI corpus (stored on GitHub), including different versions of the same book, from which the user can click on a work and download it. At present, we have limited genre tagging for the OpenITI corpus (as it requires a lot of manual and specialist work).

Q13. What are the benefits for someone who will use the materials of the KITAB project?

Users will be able to download the corpus or search our datasets for relationships between books, either as pairs or for a more global view (how many books are related to each other in each century, for example). Users can load alignments between books and read through them, seeing the aligned passages in the context of each book as a whole. We also have visualisations that help a user judge between different possible sources for a later work, and see how a work, and which of its parts, may have passed into later works.

In general, we cannot guess what others will do with the corpus, though. Maxim has created an n-gram reader. Others will do important and useful things with the corpus. I am particularly interested to see what corpus linguists do.

Q14. Could you share a few thoughts about the next steps or the future plans of the project?

We have a lot of ideas. They all start with work on the corpus. We need to build the corpus further and improve metadata.

In terms of historical research, we should continue to focus on corners of the tradition that challenge our ideas of the book as a singular, fixed item. This will include foregrounding so-called lost texts, composite works (a special interest of mine), fragmentary pieces, and work on notebooks. We take inspiration from colleagues in the field. Ahmed El Shamsy presented recently at the NISIS Autumn School on the dasht, or repositories of unbound pages in manuscript libraries. Konrad Hirschler is also working on more ephemeral witnesses to the book tradition, and indeed the Cairo Geniza scholars know such texts well. Beatrice Gruendler’s work makes us think about the materiality and complexity of a work, and about its authority, too. The other side of this is trying to discern authoritative texts and how they circulated – what were they? How does this correspond to what we have now in the OpenITI? Peter is working on this problem directly and will be publishing. These all sound like worthy challenges for the project now and after the ERC funding ends in April 2023. These questions can be addressed in very innovative ways by technology.

Q15. Do you have enough resources to extend and further finance the project?

The AKU-ISMC has been incredibly supportive. I do not know what the future holds, but I know the university is keen to make sure our work continues. We will also apply for further funding, of course. Whatever the future holds, I am optimistic that all of the KITABis will remain part of discussions that will go on for a long time.
