Images of the BUDA: Digital Archives and the Future of Research Using Linked Open Data

The Buddhist Digital Resource Center (BDRC), a nonprofit organization based in Boston, MA, and dedicated to preserving, organizing, and disseminating Buddhist literature, drastically expanded its mission beginning in 2015, when it was still known as the Tibetan Buddhist Resource Center (TBRC). The organization went from working with Tibetan texts alone to working with the entirety of Buddhist literary heritage. Realizing the goal of a multi-traditional resource required the development of a completely new platform and data architecture. This effort has involved work unprecedented in Buddhist Studies, including linking disparate datasets, text analytics for several Buddhist languages, and the processing of OCR e-texts. The result is BUDA, a platform for accessing roughly twenty million digitized pages drawn primarily from Tibetan Buddhist textual corpora, with search capabilities in other Buddhist languages linked to GRETIL, DILA, CBETA, and SAT. It is the product of extensive collaboration and was designed to foster innovative partnerships for a new generation of researchers. Created with support from the Robert H. N. Ho Family Foundation, BUDA now offers users perhaps the most robust and sophisticated interaction with image-based open data in Buddhist Studies digital humanities spaces.

One of the most compelling aspects of BUDA is its focus on image-based data sources, which offers users a unique research experience compared to other outlets. Whereas text-based databases present data in Unicode forms of the original text, BUDA preserves the original text through direct, high-resolution photography. Users can therefore explore not only the text as it appeared in its original form, but also paratextual features that may be indiscernible in text-only databases: seals, stamps, bindings, marginalia, and other indications of use, ownership, and circulation.

BUDA’s sleek and lean interface provides users with a clear presentation of search results, which users can parse using several categorical filters. Searches may begin using Unicode or romanization in any one of four primary languages: Tibetan, Indic (Sanskrit or Pāli), Chinese, and English. Search criteria such as versions, work (bibliography), persons, places, topics, and lineages offer more granular results. For example, a simple search for the Saddharmapuṇḍarīka (Lotus Sūtra) using the default “versions” filter generates a list of the item in various scripts in Indic, Nepalese, and Newar accompanied by image thumbnails. Once a single item from this list is selected, users may then click the “work” tab to reveal a broader view of variant titles and a list of the item’s original languages. Scrolling down further reveals “parallels,” or translingual analogs to the same text linked to other databases. A simple search for this single title therefore offers direct access to high-resolution images of scriptural versions, and the results also link to parallels of the title in Chinese and, further still, to translations in English, German, French, and Italian.

“Versions” search results using Saddharmapuṇḍarīka.

This Tibetan manuscript is from the Tibetan collection of the National Library of Mongolia, which BDRC is scanning in collaboration with NLM and Asian Classics Input Project with support from Khyentse Foundation.

Thus, while BUDA does offer access to a vast network of text-based sources, image-based data lies at the heart of this project. In order to better understand the implications of this approach to data hosting, I reached out to Jann Ronis, executive director of the BDRC, and Élie Roux, the architect behind BUDA’s one-of-a-kind web platform:

Image based data is at the heart of BDRC, and therefore BUDA. The study of textual artifacts, either in person or through facsimiles, will always be crucial to the study of Buddhist thought, literature, and society, and BDRC was founded by scholars to support just such humanistic inquiry. By providing access to images of thousands of actual manuscripts and xylographs (as opposed to e-texts or white-washed tracings of old texts) BUDA allows for the full range of codicological, paleographical, and—for lack of a better word—anthropological research. Another reason that BDRC will always be a primarily image-based resource is that preservation is in our DNA. We aim to digitally preserve Buddhist texts in situ through collaborations with Buddhist communities. We don’t cherry-pick the texts we scan but digitize entire archives as they have been carefully and courageously maintained into the present day. Thus, we capture vernacular and highly local texts alongside canonical works, and do not reject “duplicate” copies of Buddhist classics. BDRC’s work supports text-based projects but is different from them for the above reasons.

The possible uses of BUDA’s data in image form widen further when we consider its implementation of the Buddhist Digital Ontology. Ontologies structure Linked Open Data (LOD) by categorizing it according to, for example, personal relationships, topical classifications, or textual matches. Since vocabulary drawn from the Library of Congress only partially meets the needs of the data hosted in the archives, the team behind BUDA has developed ontologies focused on classes and properties that comprise the core concepts of cultural heritage vocabulary in Asian and Southeast Asian Buddhist cultures, such as complex typologies of Buddhist rituals and social relations like multi-lifetime associations. This way, users can assume a broad-scale view of the Linked Open Data from the entire corpus in the archive and draw much broader conclusions regarding the content, circulation, and use of the smaller bodies of texts hosted in the archive.
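
To make the idea of ontology-structured Linked Open Data more concrete, here is a minimal Python sketch that stores a few statements as subject-predicate-object triples and answers a question such as "who translated this work?" by pattern matching. Every identifier in it (ex:Work1, ex:translatedBy, ex:reincarnationOf, and so on) is invented for illustration and is not a term from the Buddhist Digital Ontology or a record in BUDA.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical statements; none of these identifiers come from BUDA itself.
TRIPLES: List[Triple] = [
    ("ex:Work1", "ex:hasTitle", "Saddharmapuṇḍarīka"),
    ("ex:Work1", "ex:translatedBy", "ex:PersonA"),
    ("ex:Work1", "ex:translatedBy", "ex:PersonB"),
    # A made-up property standing in for a multi-lifetime association.
    ("ex:PersonB", "ex:reincarnationOf", "ex:PersonC"),
]


def match(
    triples: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> List[Triple]:
    """Return every triple that fits the pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def translators_of(work: str) -> List[str]:
    """Collect the objects of all ex:translatedBy statements about a work."""
    return [obj for _, _, obj in match(TRIPLES, s=work, p="ex:translatedBy")]


print(translators_of("ex:Work1"))
```

The same wildcard matching is, in much simplified form, what a query language for linked data does at scale. A real project would more likely use an RDF library and the archive's published vocabulary than hand-rolled tuples.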

Advanced users can also record and host identifiers related to a particular research topic (e.g. all translators of a given canonical text) for use in personal research. Ideally, users will enhance and contribute to these identifiers with the intent of returning the data to the archive, but there is also the expectation that this data will be put directly to use in both research and pedagogical contexts.

IIIF manifests can also be downloaded at different levels of granularity (volume, collection, or a text). This means anyone can embed our images on their website and create a curated collection of BDRC texts. This can be used by teachers (both in academia and in traditional curricula!) for pedagogical purposes, or for researchers to create their own collection and share it.
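
As a rough illustration of how a downloaded manifest might feed a curated collection, the sketch below walks a manifest and lists each page label with its image URL. It assumes the sequences / canvases / images layout of the IIIF Presentation API 2.x. Whether a given BUDA manifest follows exactly this version is an assumption, and the sample manifest with its example.org URLs is invented for the example.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical manifest in the IIIF Presentation 2.x shape; the URLs are placeholders.
SAMPLE_MANIFEST: Dict[str, Any] = {
    "@context": "http://iiif.io/api/presentation/2/context.json",
    "@id": "https://example.org/iiif/volume1/manifest",
    "@type": "sc:Manifest",
    "label": "Sample volume",
    "sequences": [
        {
            "@type": "sc:Sequence",
            "canvases": [
                {
                    "@id": "https://example.org/iiif/volume1/canvas/1",
                    "@type": "sc:Canvas",
                    "label": "1a",
                    "images": [
                        {"resource": {"@id": "https://example.org/iiif/page1/full/full/0/default.jpg"}}
                    ],
                },
                {
                    "@id": "https://example.org/iiif/volume1/canvas/2",
                    "@type": "sc:Canvas",
                    "label": "1b",
                    "images": [
                        {"resource": {"@id": "https://example.org/iiif/page2/full/full/0/default.jpg"}}
                    ],
                },
            ],
        }
    ],
}


def canvas_images(manifest: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (canvas label, image URL) pairs in reading order."""
    pairs: List[Tuple[str, str]] = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                pairs.append((canvas.get("label", ""), image["resource"]["@id"]))
    return pairs


for label, url in canvas_images(SAMPLE_MANIFEST):
    print(label, url)
```

In practice the manifest would be loaded from a downloaded file with `json.load` rather than written inline, and the resulting list of URLs could be handed to any IIIF-compatible viewer or a simple web page.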

The openness of this data and its representation in the BUDA archive is a major aspect that distinguishes it from other archives. All of BUDA’s archived images are openly accessible, and while some databases may redirect to others for a composite picture of a given body of data, BUDA provides its native data in forms (TTL and JSON) that drastically widen the range of possible applications in the digital humanities sphere. All of BUDA’s data is machine-readable, uses standard formats, and is organized by the LOD ontology described above. Users who are interested in advanced queries can even access BUDA’s complete dataset.

This means that users can see very clearly that BDRC is not just a web interface for viewing images, but also an open access bibliographical and prosopographical database that scholars can use in their research. We hope that by allowing users to look at how our data is structured (as opposed to how the user interface presents them for a general audience), it can be an incentive for academic projects to design their data model in a way that is compatible with ours. This has the advantage of allowing them to share their data with us and the rest of the world more easily. This can be a significant strategic advantage for an academic project in terms of long-term data management. If a project uses a compatible model and contributes their data to BDRC, their data can live on BDRC long after the project has ended.

Open access to data like this ensures not only that researchers can engage this archive in several registers, but also that other databases, archives, and large-scale digital projects can benefit from it. Future plans to link BUDA’s data include collaborations with Wikidata, the Virtual International Authority File (VIAF), and World Historical Gazetteer. An image-based archive of this volume working with these other platforms would mean major shifts in the study of Buddhist textual corpora generally. As it stands now, BUDA already offers an incredibly rich site for understanding much more than the text buried within each image. Given that the BDRC has data on roughly 20,000 persons and 10,000 places, linking this data to several other rich databases would offer a full-scale view of the life of any given manuscript in relation to the social and geographical contexts of Asian cultures worldwide.

While BUDA is an especially valuable source now that archival visitation has been severely limited by the global pandemic, it also signals new and important horizons for research in the digital humanities sphere generally. The promise of full access to rich imagery at this volume and scale, along with the user potential of its Linked Open Data, makes BUDA one of the most valuable resources for digital research in the field of Buddhist Studies today.

