how to build an archive (2): different data

Without data, no archive. In this post we will look at two ways of categorizing that data: the nature of files and the type of files.

The nature of files

Your archive should consist of

  • primary sources
  • secondary literature
  • tertiary reference material
  • your own files.

In my experience, it works well to create a folder structure that separates files into these categories. This means I do not combine files in one folder according to theme (e.g. ‘epistemology’, ‘Avicenna’, ‘Quran’). However, these themes can exist on the secondary level of the folder structure. Within primary sources, I have everything organized by author. Within secondary literature, I have everything organized by scholar. My reference material has a more eclectic organization, which I will go into at a later point. My own files are organized by project. In addition, always keep a ‘various/unsorted’ folder, but try to refrain from using it. I will go into folder structure more deeply in a later post.
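As a minimal sketch of the layout described above, the following Python snippet creates the four category folders plus the ‘various/unsorted’ folder. The folder names and the `create_skeleton` function are illustrative assumptions of mine, not a fixed convention; the second level (author, scholar, project) is left to be filled in by hand as material comes in.

```python
from pathlib import Path

# Top-level folders, one per category of file (names are illustrative).
TOP_LEVEL = [
    "primary sources",
    "secondary literature",
    "reference material",
    "own files",
    "various-unsorted",
]


def create_skeleton(root):
    """Create the top-level archive folders under root and return their paths."""
    root = Path(root)
    created = []
    for name in TOP_LEVEL:
        folder = root / name
        # exist_ok=True makes this safe to run on an already existing archive.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling `create_skeleton("archive")` would create the five folders inside a folder named `archive` in the current working directory, leaving any existing content untouched.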


The type of files

The types of files in our field of studies can be:

  • PDFs
  • Word-files (.doc)
  • e-books (.epub, .djvu, .mobi, etc.)
  • images
  • Excel-sheets (.xls)
  • notes

PDFs are by far the most common and most important ones. They are the digital equivalent of paper publications. In general, two types of PDFs exist:

  1. Truly digital material
  2. Scanned material

Truly digital material consists of PDFs that have been generated from the application in which the publication was formatted. Their advantage is that they are relatively small in file size and fully searchable. Many journals offer such files for their articles. ‘Searchable’ is, by the way, not always accurate, especially with Arabic text. A lot of PDF compilers make garbage out of the Arabic: it may look fine on the screen, but it is not searchable.

Whereas other fields of study rely almost completely on truly digital PDFs (e.g. the medical sciences and their PubMed mafia), Islamic Studies is still (and will always be) heavily paper oriented. Primary sources are virtually always in physical paper form, and they should be, because this allows us to refer to a specific edition, volume, and page number. There have been attempts at typing out editions, and this is great and we will look at them in other posts, but the rate of error is significant and it is always better to take a look at the real paper publication.

That is, in a nutshell, why I consider scanned material to be of tremendous value for our studies. Scanned material means that a PDF file consists of one image per page of the PDF, with each image being a photo of either one page or two pages of the paper publication. The file size goes up, and the searchability usually drops to zero, but I gladly accept both in exchange for absolute certainty that what I am looking at is exactly as it is in the paper publication.

PDF is by far my favorite format, since it will most probably be the most resilient over time. If you are at the beginning of your academic career, you will want to keep using your files for another 40 or so years. Will Microsoft Word still be around to open your documents? Maybe. Maybe not. Maybe it will, but it will change the formatting beyond recognition. PDFs will most likely still open and display their data exactly as they did all those years ago. Invest in the future.

Word files are primarily your own documents, be it finished projects or works in progress. Using Word or any other word processor to make notes is sub-optimal, but we will get to that in one of the posts on workflow. Even though the aim of your archive is to create a digital library, that is, to gather primary sources, secondary literature, and tertiary reference material, it is wise to consider your own files as part of your archive as well, if only to make sure you have your own work backed up. Sometimes you have Word files of primary sources. I tend to make PDFs out of them and store them as PDFs. PDF readers are usually lighter applications than word processors (they drain fewer resources from your computer).
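One possible way to turn Word files into PDFs in bulk is LibreOffice's command-line converter, sketched below in Python. This is an assumption on my part about tooling, not the only route: it presumes LibreOffice is installed and that its `soffice` executable is on the PATH, and the function names are hypothetical. Given what was said above about Arabic text, it is worth opening the resulting PDF to check that it came out correctly.

```python
import subprocess
from pathlib import Path


def build_convert_command(word_file, out_dir):
    """Build the LibreOffice command line that converts one document to PDF."""
    return [
        "soffice",              # LibreOffice executable; assumed to be on the PATH
        "--headless",           # run without opening a window
        "--convert-to", "pdf",  # target format
        "--outdir", str(out_dir),
        str(word_file),
    ]


def convert_to_pdf(word_file, out_dir):
    """Convert word_file to PDF in out_dir and return the expected PDF path."""
    # check=True raises an error if LibreOffice exits with a failure code.
    subprocess.run(build_convert_command(word_file, out_dir), check=True)
    return Path(out_dir) / (Path(word_file).stem + ".pdf")
```

Keeping the command construction separate from the call that runs it makes it easy to inspect what would be executed before converting a whole folder.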

E-books are rare in our field, which is probably a good thing. You may run into a .djvu file now and then. Do not panic. Just download a djvu viewer and you can still open the file. In cases where I cannot find a PDF, I will store the e-book. In cases where a PDF is also available, I do not keep the djvu file.

Images could be illustrations, or photos of objects of interest, like manuscripts or art or architecture. Be prepared to have a lot of disk space available. Do not save images with a resolution so poor you cannot read the manuscript.

Excel-sheets and notes are perhaps the least common. Make sure to save the data in a format that is future proof.
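To see which of the file types listed in this post an archive actually contains, a short Python sketch can count files by extension. The extension table is an illustrative assumption and far from exhaustive (treating `.txt` as ‘notes’, for instance, is my guess), and the function names are hypothetical.

```python
from collections import Counter
from pathlib import Path

# Map file extensions to the file types discussed in this post.
# Illustrative only; extend it to match what is in your own archive.
TYPE_BY_EXTENSION = {
    ".pdf": "PDF",
    ".doc": "Word file",
    ".docx": "Word file",
    ".epub": "e-book",
    ".djvu": "e-book",
    ".mobi": "e-book",
    ".jpg": "image",
    ".jpeg": "image",
    ".png": "image",
    ".tif": "image",
    ".tiff": "image",
    ".xls": "Excel sheet",
    ".xlsx": "Excel sheet",
    ".txt": "notes",
}


def file_type(path):
    """Return the file type for a path, or 'other' if the extension is unknown."""
    return TYPE_BY_EXTENSION.get(Path(path).suffix.lower(), "other")


def tally_archive(root):
    """Count the files under root (searched recursively) per file type."""
    return Counter(file_type(p) for p in Path(root).rglob("*") if p.is_file())
```

Running `tally_archive` on the archive root gives a quick overview; a large ‘other’ count is a hint that some files sit in formats worth converting to something more future proof.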



