The Digital Syriac Corpus: An Interview with Prof. James E. Walters

In addition to the contributions of the Beth Mardutho Institute to the world of Syriac Digital Humanities (which I have explored here, here and here), there are other projects which have an important place in the Syriac Digital Humanities, and which will be the focus of forthcoming contributions to the Syriac section of the Digital Orientalist.

One of the earliest initiatives to construct a Syriac corpus database from Syriac texts found in manuscripts and books is the Digital Syriac Corpus (DSC), which was started in 2004 as a collaboration between Prof. David G. K. Taylor (Oxford) and Dr. Kristian S. Heal (BYU). The corpus, which is now run by a team of Syriac studies specialists, has successfully developed new methods and built up a relatively large database. To learn more about this project, I interviewed Prof. James E. Walters, the General Editor of the Digital Syriac Corpus. I would like to extend my gratitude to Prof. Walters for accepting my invitation to interview him.

The website of the Digital Syriac Corpus.

Q1. Could you introduce the readers of the Digital Orientalist (DO) to the beginning and the history of the Digital Syriac Corpus (DSC)? What were the original ideas behind this project, and what were the targets which the DSC wished to reach?

The project started as a collaboration between Kristian Heal (BYU) and David Taylor (University of Oxford). The original idea, as I understand it, was to compile a collection of Syriac works as transcribed Word documents. Over the years, Heal and Taylor employed a number of people to transcribe works, primarily from print volumes but also quite a few texts directly from manuscripts. In time, they amassed a significant collection of Word documents (probably 600-700 individual documents, some of which are only a few pages long, but others of which are hundreds of pages long), covering well over a millennium and representing all genres of Syriac literature.

The Oxford-BYU Syriac Corpus (as it was called) was going to be accompanied by a piece of software that would allow users to perform complex searches across the entire collection of Word documents. To this end, software engineers at BYU developed a piece of software called Word Cruncher, and this software was customized to be used with documents in the Syriac corpus.

Primarily, this corpus was designed with Syriac scholars and researchers in mind. The goal was to make a vast corpus of literature readily available and able to be searched.

A visit to the Vatican library in 2004 to report on the transcribing of texts and digitizing of Vatican manuscripts at the beginning of the project. Left to Right: Fr. David Royel (now Mar Awa Royel), Fr. William Toma, Prof. Noel Reynolds (BYU), President Cecil Samuelson (BYU), Ambrogio Piazzoni (Vice-Prefect, Vatican Library), and Dr. Kristian Heal. Courtesy of Dr. Kristian Heal.

Q2. In a recent article in the Ancient Jew Review (AJR), you taught readers how to use the DSC. You started the post by asking “Have you found your Syriac course suddenly converted to an online course? If so, the Digital Syriac Corpus is here to help!” This suggested to me that there may have been new developments with the corpus. Could you explain to us how the DSC looked in the past and what has been newly added to the corpus?

Previously, the Syriac Corpus simply was not widely available, and it was not accessible to many users. People knew about the project largely through word of mouth, and the creators were happy to share their work. However, the project was never fully available for download or online usage. Moreover, one issue that the developers ran into was that the Word Cruncher software could only be used on Windows operating systems. This significantly limited the number of people who could use the software, and thus the project as a whole in its previous form.

So, the primary advantage of the new version of the project is that it is not operating system dependent—that is, it is simply a web-based project that can be searched from any device (even a smart phone!). Furthermore, the project is now widely accessible to anyone around the world. This is significant not only for scholars and those who read Syriac for research purposes, but also as an enduring archive of the Syriac language for global heritage communities of the Syriac traditions.

An image showing the corpus in Word Cruncher. Courtesy of Dr. Kristian Heal.

Q3. Among the many fantastic tools that the DSC offers is a search function that can search for Syriac words across multiple texts from different historical periods. However, there are still issues when one wants to use the DSC to search for phrases. Are you planning to develop the corpus to perform advanced searches according to different linguistic purposes?

Yes, this has been a significant point of discussion for the development team. Technically the project is set up to allow searching at the phrase level, but there are some issues in the database software we use that stand in the way of fully implementing this search feature. One issue that we’ve encountered, as many digital projects do, is that this is a completely unfunded project that is run primarily by volunteer labor. Once the initial batch of texts has been converted and published, we may try to apply for some grants to develop the search feature further.

Likewise, we also have a long-term goal of adding lexical tagging to the whole corpus, which would allow even more complex searches (such as searching by root word rather than just specific morphemes). Again, we would likely have to add this feature with grant funding, but it is certainly something we are interested in pursuing.

Current search functions on the DSC.

Q4. How is the DSC cooperating with other lexicographical Syriac Digital projects, such as SEDRA, Dukhrana, and the Comprehensive Aramaic Lexicon?

The Syriac Corpus is already integrated with the SEDRA database. If you hover over any Syriac word in the Corpus, a small pop-up box will appear with lexical data from SEDRA. This integration is not perfect, though, as many words will not return any results based on the way the projects are set up and linked. If, as discussed above, we eventually add lexical tagging to the Corpus, this would allow more accurate and comprehensive linking with the SEDRA resource.

At this point we do not have any formal cooperation with Dukhrana or the Comprehensive Aramaic Lexicon, but we are certainly open to collaboration. One of the values of digital humanities projects is the promise of connecting related projects, and I’m hopeful in the future that we will be able to make further connections.

Q5. I read on the DSC website that you are sharing data with the, particularly the Syriac Biographical Dictionary. Can you tell us more about this cooperation? Are you open to expanding this collaboration with other DH projects by sharing your data with them?

Yes, all of the data of the Syriac Corpus is fully available for sharing. All of our XML files and the code for the website are on GitHub, a well-known collaborative resource for sharing digital projects. We do plan to expand the cooperation with in the future, including perhaps linking person names and place names with the database. And we are open to sharing our data with any project that would like to collaborate. That’s why we have a Creative Commons (CC-BY 4.0) copyright—we want to encourage collaboration, and we welcome anyone interested in working on such a project to join us.

The homepage for the Syriac Biographical Dictionary.

Q6. What is the vision for the future, if you had the possibility to expand in terms of gaining more contributors and volunteers? Can we put it in a practical framework?

I’ve mentioned a few of our plans already, so I’ll say a few more things here: as a first step we will finish converting and publishing all of the documents that were originally handed over to me (James) from Kristian Heal and David Taylor. If I could work on this as a full-time job, it wouldn’t take much time at all. But as it is, I can only work on the project in my spare time. As such, we would love to have volunteers, and in fact, there is some information on our website about the kinds of things people could do to get involved.

Ultimately, we would also love to have users proofread the texts in the corpus and let us know about any errors or typos, and this is one of the easiest ways for volunteers to get involved. The vast majority of our texts were transcribed without any editorial proofreading, so there are certainly typos, and we would love to get them corrected.

We also welcome anyone to contribute new texts to the Corpus (again, more information is available on the website). The goal of the Corpus is to include as much Syriac literature—from any time period—as possible, so we would gladly accept anyone’s Syriac transcriptions to be included, and all contributors will be properly credited for their work.

As mentioned previously, there are a few developments we’d love to add if we are able to secure grant funding in the future. These additions would primarily be focused on allowing more advanced searching features with linguistic data, but there are other enhancements we’d like to add as well. One dream that I have is to create a Scripture index across the whole Corpus where citations of and allusions to specific passages from the Bible would be linked, and users could search for all those citations by consulting the index.

I have also dreamed of adding a “Syriac school” component to the Corpus, which would provide resources for self-teaching the Syriac language for anyone who desires to learn it, but lacks access to Syriac courses.

Q7. Would you like to share any concluding remarks with us?

By way of conclusion, I’ll simply express my hope for the Digital Syriac Corpus: I hope that it becomes a useful tool for anyone who is interested in reading, learning, or researching Syriac literature. There is a great deal of work to be done in order to make the project more usable and more comprehensive, but we believe that this could become a model project for corpus-based research. I also hope that the Corpus allows and encourages new learners of Syriac and creates interest in this wonderfully rich language and heritage.

Select Bibliography of Publications and Presentations related to the Oxford-BYU Corpus (2007-2014)

In the process of preparing this contribution to the Digital Orientalist, I contacted Dr. Kristian Heal, who provided me with the following select bibliography of publications and presentations (between 2007 and 2014) related to the Oxford-BYU Corpus, as well as some of the images used in this contribution.


Eric Ringger, Peter McClanahan, Robbie Haertel, George Busby, Marc Carmen, James Carroll, Kevin Seppi, and Deryle Lonsdale. June 2007. “Active Learning for Part-of-Speech Tagging: Accelerating Corpus Annotation.” In Proceedings of the ACL Linguistic Annotation Workshop, Association for Computational Linguistics. Prague, Czech Republic. Pp. 101-108.

Deryle Lonsdale, “A Computational Perspective on Syriac Corpus Development and Annotation” (IOSOT, Ljubljana, Slovenia).

Kristian Heal & David Taylor, “Towards an Electronic Corpus of Syriac Texts” (IOSOT, Ljubljana, Slovenia).


Robbie Haertel, Kevin Seppi, Eric Ringger, James Carroll. December 2008. “Return on Investment for Active Learning.” In Proceedings of the NIPS 2008 Workshop on Cost-Sensitive Machine Learning. Whistler, British Columbia, Canada.

Robbie Haertel, Eric Ringger, Kevin Seppi, James Carroll, Peter McClanahan. June 2008. “Assessing the Costs of Sampling Methods in Active Learning for Annotation.” In Proceedings of the Conference of the Association of Computational Linguistics (ACL-NAACL: HLT 2008). Columbus, Ohio.

Eric Ringger, Marc Carmen, Robbie Haertel, Noel Ellison, Kevin Seppi, Deryle Lonsdale, Peter McClanahan, James Carroll. May 2008. “Assessing the Costs of Machine-Assisted Corpus Annotation through a User Study.” In Proceedings of the Language Resources and Evaluation Conference (LREC). 2008.

James L. Carroll, Robbie Haertel, Peter McClanahan, Eric Ringger, and Kevin Seppi. 2008. “Modeling the Annotation Process for Ancient Corpus Creation.” In Proceedings of the 2007 Conference on Electronic Corpora of Ancient Languages (ECAL). Prague, Czech Republic.

Kristian Heal & Eric Ringger, “The BYU-Oxford Corpus of Syriac Literature: An Interim Report” (Symposium Syriacum XI, Granada, Spain).


Kristian S. Heal, “The BYU-Oxford Corpus of Syriac Literature,” at the Launching Conference of the RNP Comparative Oriental Manuscript Studies. Hamburg, 2009.


Peter McClanahan; George Busby; Robbie Haertel; Kristian Heal; Deryle Lonsdale; Kevin Seppi; Eric Ringger, “A Probabilistic Morphological Analyzer for Syriac.” Pages 810-20 in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, edited by Hang Li and Lluís Màrquez. Cambridge, MA: Association for Computational Linguistics, 2010.

Kristian S. Heal, “Corpora, eLibraries and databases: Locating Syriac Studies in the 21st Century,” at Beth Mardutho/Syriac Institute Symposium on Syriac Libraries. May 2010.


Paul Felt, Eric Ringger, Kevin Seppi, Kristian Heal, Robbie Haertel & Deryle Lonsdale, “First Results in a Study Evaluating Pre-labeling and Correction Propagation for Machine-Assisted Syriac Morphological Analysis.” Pages 878-885 in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC ‘12). Istanbul, Turkey, 2012.

Kristian S. Heal, “Corpora, eLibraries and Databases: Locating Syriac Studies in the 21st Century.” Hugoye: Journal of Syriac Studies 15.1 (2012): 65-78.

Kristian S. Heal, “Report on Syriac Projects at BYU,” at XI quadrennial Symposium Syriacum. Malta, 2012.


Kristian S. Heal, “Accessing Late Antiquity: Syriac Digital Humanities Projects at BYU,” Invited Lecture, Committee for the Study of Late Antiquity. Princeton University, Feb. 20th, 2013.


Paul Felt, Eric Ringger, Kevin Seppi, Kristian Heal, Robbie Haertel & Deryle Lonsdale, “Evaluating Machine-Assisted Annotation in Under-Resourced Settings.” Language Resources and Evaluation 48.4 (2014): 561-599.
