Deconstructing the Kashf al-ẓunūn with Python, part 1

I am using the programming language called Python to make sense of Ḥajjī Khalīfa’s famous bibliographical work Kashf al-ẓunūn. I had been looking at Python as a suitable tool for Digital Humanities projects for a while, and with the article “Find for Me!”: Building a Context-Based Search Tool Using Python by José Haro Peralta and Peter Verkinderen I finally found the right starting point to dive into the deep end. This article appeared in the collected volume The Digital Humanities and Islamic & Middle East Studies, edited by Elias Muhanna.

Up until now I have been an interested user of the Kashf. I never took the time to sit down and actually read it, but finding entries and using its biographical and bibliographical information has been fruitful. What I especially like is that the entries are organized by book title and contain information about the commentaries each title spawned. I am myself an avid user of commentaries and commentary traditions in researching the postclassical period of Islamic intellectual history. The Kashf was therefore an easy choice for my Python project.

The first step was to get the right file. Python works best with plain text files, and all I had of the Kashf was a password-protected Word document. A website had no trouble converting this to .txt. Following the instructions of Haro Peralta and Verkinderen, I was soon instructing Python to read the file and manipulate it.

Following their article, I found out that the file I was using for the Kashf has 3,099,887 characters and 545,254 words. I used the following code (everything is also available on GitHub):

import re

# Read the whole file into one string
kashf = open('KashfAlZunun.txt', mode='r', encoding='utf-8')
text = kashf.read()
print(len(text))                        # number of characters

# \w+ matches runs of word characters, i.e. individual words
list_of_words = re.findall(r'\w+', text)
print(len(list_of_words))               # number of words
kashf.close()

Then I wanted to look into the frequency of some words. The following code did the trick:

import re
kashf = open('KashfAlZunun.txt', mode='r', encoding='utf-8')
text = kashf.read()

def word_counter(search_word):
    # Count every occurrence of search_word in the text
    findNumberOfWord = re.findall(search_word, text)
    print(len(findNumberOfWord))

word_counter("شرح")
kashf.close()

If you type this into a file, save it as .py, and run it in the Terminal (on a Mac; on Windows you would use the Command Prompt, cmd.exe), it will return a number in the Terminal. In this case, since I was searching for the word sharḥ (“commentary”), the script returned 5121. In other words, in the Kashf al-ẓunūn the word sharḥ is used more than five thousand times.
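One caveat worth noting: re.findall with a bare string counts every occurrence of the substring, including when شرح appears inside a longer word such as شرحه or الشرح. A minimal sketch of the difference, using a short sample sentence of my own rather than the Kashf file:

```python
import re

# Short sample text standing in for the Kashf (an illustration, not the actual file)
text = "شرح الكتاب ثم شرحه المؤلف في الشرح الكبير"

# A bare pattern counts every substring occurrence
total = len(re.findall("شرح", text))

# \b marks word boundaries, so only the standalone word is counted
whole = len(re.findall(r"\bشرح\b", text))

print(total, whole)  # 3 1
```

Whether one wants the substring count or the whole-word count depends on the question; for sharḥ the bare count also picks up derived forms.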

Some practical issues immediately came up here. The way Python works is that you write your code in a .py file, which you create using an editor; you then address that file in the Terminal and have it run by what is called the interpreter. Mac OS X comes with Python, but an older version. After downloading and installing the newer version, you run a script by navigating to the right folder in the Terminal and entering:

python3 kashf1Basics.py

I have tried a few editors and noticed that very few support Arabic well. PyCharm does support Arabic, but it is rather advanced and complicated to start using immediately. I like Sublime Text, and the fact that it does not support Arabic well is something I am willing to put up with.

The question that then came to mind is: can I find out how many entries the Kashf has? After briefly examining the .txt file, I noticed that entries seem to be closed off with a period.

I used the following code:

import re

kashf = open('KashfAlZunun.txt', mode='r', encoding='utf-8')
text = kashf.read()
# Split the text on periods; each segment should be one entry
elementsKashf = text.split(".")
print(len(elementsKashf))
kashf.close()

The result was 18,187. For a moment I assumed that the Kashf really consists of eighteen thousand entries, but when I moved on to the next question I realized I was wrong. That next question was: what is the length of each entry?

I came up with the following code:

import re

kashf = open('KashfAlZunun.txt', mode='r', encoding='utf-8')
text = kashf.read()
elementsKashf = text.split(".")

# Write each entry's word count on its own line
f = open("KashfLengthEntries.txt", "w")
for element in elementsKashf:
    numberWordsElement = re.findall(r'\w+', element)
    f.write(str(len(numberWordsElement)) + "\n")
f.close()
kashf.close()

This creates a text file with every entry’s word count on a new line. Opening this file in TextEdit or Excel, we can browse through it and get a general sense of the average length. What I found was that there were quite a lot of entries with zero words. Looking back at the source file, I found out that in some cases the Kashf gives a few dots in place of a year, when the year was not known to Ḥajjī Khalīfa.
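The same browsing can be done without leaving Python; the sketch below computes the summary figures directly from a small sample list of entries (the sample strings are my own stand-ins, not taken from the Kashf):

```python
import re

# Sample entries standing in for text.split("."); use the real list in practice
entries = ["الأبجد في الحساب لمؤلف مجهول", "", "تاريخ بغداد"]

# Word count per entry, as in the script above
lengths = [len(re.findall(r"\w+", e)) for e in entries]

average = sum(lengths) / len(lengths)
empty = lengths.count(0)

print(lengths)   # [5, 0, 2]
print(average)   # mean entry length
print(empty)     # number of zero-word entries
```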

An updated version of my code first deletes these runs of periods:

import re

kashf = open('KashfAlZunun.txt', mode='r', encoding='utf-8')
text = kashf.read()
# Remove runs of two or more periods (placeholders for unknown years)
denoised_text = re.sub(r'\.{2,}', '', text)
elementsKashf = denoised_text.split(".")

f = open("KashfActualLengthEntries.txt", "w")
for element in elementsKashf:
    numberWordsElement = re.findall(r'\w+', element)
    f.write(str(len(numberWordsElement)) + "\n")
f.close()
kashf.close()

Opening the resulting text file in Excel showed that there was still one entry without any words, namely the last one, produced by the empty segment after the final period, but this did not bother me. I now figured that the actual number of entries in the Kashf is 14,943.
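Instead of tolerating that last empty entry, one could also filter out any segment that contains no words at all; a minimal sketch on a made-up three-title string:

```python
import re

# Made-up sample: three "entries" closed by periods
text = "العنوان الأول. العنوان الثاني. العنوان الثالث."

segments = text.split(".")
print(len(segments))  # 4, including the empty segment after the final period

# Keep only segments that contain at least one word character
entries = [s for s in segments if re.search(r"\w", s)]
print(len(entries))   # 3
```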

While Excel was open I created a column chart, the result of which is:

[Column chart: number of words per entry in the Kashf]

There are ways to make graphics with Python, but using Excel was the path of least resistance and therefore just fine in this case. Apparently, the x-axis label interval cannot be set higher than 255, so I settled on 200 (hence 1, 201, 401, …). To prepare this graph for actual publication I would probably render it with Python, to have greater control over its visual appearance.
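For the record, a sketch of how the same column chart could be rendered with Python, using the third-party matplotlib library (an assumption: matplotlib is not part of the standard installation, and the word counts below are invented; in practice one would read them from KashfActualLengthEntries.txt):

```python
import matplotlib
matplotlib.use("Agg")  # draw to a file instead of opening a window
import matplotlib.pyplot as plt

# Invented word counts standing in for the real per-entry figures
lengths = [12, 85, 7, 430, 51, 2, 99, 310, 64, 18]

plt.figure(figsize=(10, 4))
plt.bar(range(1, len(lengths) + 1), lengths, width=1.0)
plt.xlabel("Entry number")
plt.ylabel("Number of words")
plt.savefig("NumberOfWordsEntries.png", dpi=150)
```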

The graph shows that a few entries have a very high number of words, that more than a few have more than 500 words, but that the majority remains below a hundred. Quite odd is the interval between approximately entries 2,600 and 3,200: it falls entirely flat.

To investigate this further, I decided I needed to have the entry number, the entry’s word count, and the actual entry text together. I decided to create an XML file for this. The code for it is:

import re
import xml.etree.ElementTree as ET

kashf = open('KashfAlZunun.txt', mode='r', encoding='utf-8')
text = kashf.read()
denoised_text = re.sub(r'\.{2,}', '', text)
elementsKashf = denoised_text.split(".")

# Build an XML tree with one <entry> per segment, skipping the
# empty final segment after the last period
kashfalzunun = ET.Element("kashf")
for x in range(0, len(elementsKashf) - 1):
    WordsElement = re.findall(r'\w+', elementsKashf[x])
    numberWordsElement = len(WordsElement)
    entryNew = ET.SubElement(kashfalzunun, "entry")
    entryID = ET.SubElement(entryNew, "ID")
    entryID.text = str(x + 1)
    entryLength = ET.SubElement(entryNew, "Length")
    entryLength.text = str(numberWordsElement)
    entryText = ET.SubElement(entryNew, "Text")
    entryText.text = elementsKashf[x]
ET.ElementTree(kashfalzunun).write("kashfalzunun.xml")
kashf.close()

This produces an XML file of the Kashf, neatly compartmentalized by entry. Whereas the previous scripts completed virtually instantaneously, for this script my computer actually had to crunch the numbers for a full second. The resulting XML file was 19 MB, and I was unable to open it with anything other than Firefox. Once Firefox had loaded it, however, I could freely browse through it.
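For files too big to open comfortably, the same module offers iterparse, which walks through the XML one element at a time instead of loading it all at once. A sketch on a tiny in-memory sample (the sample XML is my own; with the real file one would pass "kashfalzunun.xml" instead of the StringIO object):

```python
import io
import xml.etree.ElementTree as ET

# Two-entry sample standing in for the 19 MB kashfalzunun.xml
sample = io.StringIO(
    "<kashf>"
    "<entry><ID>1</ID><Length>5</Length><Text>t</Text></entry>"
    "<entry><ID>2</ID><Length>0</Length><Text>u</Text></entry>"
    "</kashf>"
)

count = 0
for event, elem in ET.iterparse(sample):
    if elem.tag == "entry":
        count += 1
        elem.clear()  # discard the element once counted, keeping memory low
print(count)  # 2
```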

I looked into entries 2600-3200 and noticed that there were indeed a great many entries with only two words. I looked at where this began and cross-examined it with the print version of the Kashf. Soon enough I understood what was going on: these two-word entries were actually part of the entry ʿilm al-taʾrīkh. Take for example the title al-Zabad wa-al-ḍarab, a work on the history of Aleppo. In the Kashf it appears right after ʿilm al-taʾrīkh (length = 2), it appears as its own entry (l = 37), and it appears within the entry tawārikh ḥalab (l = 304).

Clearly, then, we need to find a way to separate these fake entries from actual entries. Additionally, we need to weed out the entries that merely discuss a particular science. Since I suspect these entries all start with the word ʿilm, we can do the following:

import re
import xml.etree.ElementTree as ET

kashf = open('KashfAlZunun.txt', mode='r', encoding='utf-8')
text = kashf.read()
denoised_text = re.sub(r'\.{2,}', '', text)
elementsKashf = denoised_text.split(".")

kashfalzunun = ET.Element("kashf")
y = 1
for x in range(0, len(elementsKashf) - 1):
    WordsElement = re.findall(r'\w+', elementsKashf[x])
    numberWordsElement = len(WordsElement)
    # Keep only entries whose first word is ʿilm; the emptiness check
    # guards against segments without any words
    if WordsElement and WordsElement[0] == "علم":
        entryNew = ET.SubElement(kashfalzunun, "entry")
        entryNumber = ET.SubElement(entryNew, "Number")
        entryNumber.text = str(y)
        entryID = ET.SubElement(entryNew, "ID")
        entryID.text = str(x)
        entryLength = ET.SubElement(entryNew, "Length")
        entryLength.text = str(numberWordsElement)
        entryText = ET.SubElement(entryNew, "Text")
        entryText.text = elementsKashf[x]
        y = y + 1
ET.ElementTree(kashfalzunun).write("kashfOnlyUlum.xml")
kashf.close()

This gives us an XML file with 259 entries. Some of these are clearly normal book entries, but most of them are indeed entries that introduce a particular science.
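Once the entries live in XML, the same ElementTree module can read them back in for further filtering, for instance to list which IDs the ʿilm entries carry. A sketch on a tiny two-entry sample of my own (with the real file one would use ET.parse("kashfOnlyUlum.xml") and its getroot()):

```python
import xml.etree.ElementTree as ET

# Two-entry sample mimicking the structure of kashfOnlyUlum.xml
sample = """<kashf>
  <entry><Number>1</Number><ID>12</ID><Length>2</Length><Text>علم التاريخ</Text></entry>
  <entry><Number>2</Number><ID>47</ID><Length>304</Length><Text>علم الفقه</Text></entry>
</kashf>"""

root = ET.fromstring(sample)
entries = root.findall("entry")
print(len(entries))  # 2
for entry in entries:
    print(entry.find("ID").text, entry.find("Length").text)
```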

I will stop here for now. In the next post I shall continue deconstructing the Kashf with Python, in the hope of creating an XML file with tags for normal book entries and other kinds of entries, with non-entries cleaned up, and with the title extracted into a separate tag.
