Text-Matching at the Canonical Crossroads: An Introduction to BuddhaNexus (Part II)

In my previous post about the text-matching database BuddhaNexus, I corresponded with project co-director Orna Almogi, who described the promise of the project and its intended impact on Buddhist textual studies in the 21st century.

For this second and final part, I will cover some of the technical aspects of the project, as well as some of the exciting linguistic expansions currently under development. I corresponded with Sebastian Nehrdich, who, with assistance from two developers currently at work on another project at SuttaCentral, coordinates much of the technical development on the BuddhaNexus project. While Sebastian is trained in classical Indology and Sinology and is currently pursuing a PhD in the Buddhist Studies program at Universität Hamburg, he is also skilled in computational linguistics and its deep learning applications in Buddhist philology. His primary task on the BuddhaNexus project is to implement linguistic tools across several source languages and to create and organize data that is accessible on a web platform.

From Sebastian’s perspective, one of the useful functions made possible by the platform’s technical prowess is that it allows for both snapshot views of large-scale volumes of text and displays of deeper, more granular results with a focus on linguistic variation:

Users generally appreciate the fact that with the help of BuddhaNexus, one can get a quick and solid impression about the “position” of a text within the larger Buddhist (canonical) literary corpus. The visualizations help to learn about the relationships texts might have with other texts, and the text-view mode shows which passages frequently occur in other files […]. Many users, including myself, appreciate the global search function and its ability to search through the various source collections without the need to use different applications at the same time. One big plus of this search function is the Sandhi splitting and stemming of the Sanskrit search queries, which makes it possible to locate search results within compounds and with different morphological forms, which is very important in the case of an inflectional language such as Sanskrit.
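To give a rough sense of why stemming matters for search, here is a deliberately tiny Python sketch. It is not the BuddhaNexus stemmer: the list of endings covers only a few singular forms of Sanskrit a-stem nouns, the function names are my own, and Sandhi splitting is left out entirely.

```python
# Toy illustration only: a handful of singular endings of Sanskrit a-stem nouns.
# A real stemmer (and any Sandhi splitter) needs far richer linguistic rules.
A_STEM_ENDINGS = ["asya", "ena", "āya", "āt", "aḥ", "am", "e"]


def toy_stem(word: str) -> str:
    """Strip the longest matching ending, keeping at least two characters."""
    for ending in sorted(A_STEM_ENDINGS, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= 2:
            return word[: -len(ending)]
    return word


def search(query: str, corpus: list[str]) -> list[str]:
    """Return the corpus words whose toy stem equals the query's toy stem."""
    target = toy_stem(query)
    return [word for word in corpus if toy_stem(word) == target]


corpus = ["devasya", "devena", "devam", "buddhasya", "dharmam"]
matches = search("devaḥ", corpus)
print(matches)
```

A query in the nominative ("devaḥ") finds the genitive, instrumental, and accusative forms because all four reduce to the same stem, which is the effect the quotation describes for inflected Sanskrit forms.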

The linguistic aspects of digital research lie at the heart of the BuddhaNexus project, and these search functions provide the mechanism through which researchers can better understand the evolution and formation of Buddhist textual corpora over time.

And yet, the platform’s simplified interface and ease of use make the technical prowess of the BuddhaNexus project difficult to gauge from the frontend. Its inclusion of Pāli, Sanskrit, Chinese, and Tibetan is one of the distinguishing features of this platform, and it has demanded particular attention to otherwise uncharted areas of technical development. That is, whereas routes to computer-aided linguistic research in contemporary languages such as English, modern Chinese, and Japanese have been paved, similar research using Pāli, Sanskrit, Buddhist Chinese, and Tibetan is still in its infancy. The sheer volume of each textual corpus further complicates this issue, and as Sebastian describes, processing and organizing the source material requires several technical platforms working in unison:

The workhorse of BuddhaNexus is fasttext, a flexible library for the generation of word representations. One of the big challenges of BuddhaNexus is to generate match data for large data collections. In order to tackle this problem, BuddhaNexus is built upon approximate nearest neighbor searches using Hierarchical Navigable Small World (HNSW) graphs. The combination of fasttext word representations and approximate nearest neighbor searches is largely language-agnostic as long as the linguistic preprocessing steps, such as word segmentation, stemming, and so on, can be applied to the source material.
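To make the combination Sebastian describes a little more concrete, here is a minimal Python sketch of matching text segments by comparing averaged word vectors. It is not BuddhaNexus code. The three-dimensional vectors, the segment names, and the helper functions are invented for illustration, the tokens are assumed to be already segmented and stemmed, and an exhaustive comparison stands in for the HNSW search, which only becomes necessary at a much larger scale.

```python
import math

# Hypothetical 3-dimensional word vectors. Real fasttext vectors are learned
# from a corpus and usually have a few hundred dimensions.
TOY_VECTORS = {
    "sarva": [0.9, 0.1, 0.0],
    "dharma": [0.1, 0.9, 0.1],
    "sunya": [0.0, 0.2, 0.9],
    "buddha": [0.2, 0.8, 0.3],
    "anitya": [0.1, 0.3, 0.8],
}


def segment_vector(tokens):
    """Average the vectors of the known tokens in a segment."""
    known = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not known:
        raise ValueError("segment contains no known tokens")
    return [sum(v[i] for v in known) / len(known) for i in range(len(known[0]))]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(query_tokens, segments):
    """Exhaustive search; an approximate index would replace this loop."""
    query = segment_vector(query_tokens)
    scored = [(cosine(query, segment_vector(toks)), name)
              for name, toks in segments.items()]
    return max(scored)


segments = {
    "text_A": ["sarva", "dharma", "sunya"],
    "text_B": ["buddha", "dharma"],
}
score, name = best_match(["sarva", "dharma", "anitya"], segments)
print(name, round(score, 3))
```

The query shares only two of its three words with text_A, yet the two segments still score as close to each other because the vectors for the differing words point in a similar direction. That tolerance for variation is what makes this kind of matching different from exact string search.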

Word representations are numerical vectors: each word is assigned a list of values learned from the corpus, so that words used in similar contexts end up close to one another. Arranged in this way, the vectors can express semantic and syntactic patterns. With fasttext and approximate nearest neighbor searches, it becomes possible to identify clusters of words with similar semantic or syntactic value within a given textual corpus.

Word representation of the 100 nearest neighbors to the numeral “dvi” with clear semantic groupings.
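Part of the reason related word forms cluster together is that fasttext, as generally documented, builds a word's vector from the vectors of its character n-grams (by default three to six characters long, with "<" and ">" marking the word boundaries). The Python sketch below only extracts those n-grams and measures how many two words share. It does not train any vectors, the function names are my own, and whether BuddhaNexus relies on the default settings is not something the interview states.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in the boundary markers < and >."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.add(wrapped[start:start + n])
    return grams


def overlap(word_a, word_b):
    """Share of n-grams two words have in common (Jaccard index)."""
    grams_a, grams_b = char_ngrams(word_a), char_ngrams(word_b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


trigrams = sorted(char_ngrams("dvi", 3, 3))
print(trigrams)

# Two inflected forms of one word share n-grams; unrelated words share none.
related = overlap("devasya", "devena")
unrelated = overlap("devasya", "buddha")
print(related > unrelated)
```

Because inflected forms of one word share many of these pieces, their vectors start out related even when a particular form is rare in the corpus, which is helpful for a highly inflected language.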

This means that tools like fasttext currently allow BuddhaNexus to work across several languages, each of which has its own conventions for word segmentation, semantics, syntax, sentence boundaries, and so forth. HNSW graphs organize the data into hierarchical layers, which keeps searches fast and recall high even across very large collections. Additionally, the platform uses LitElement, a lightweight library for building web components, to provide an interface that reacts immediately to user input, and ArangoDB, an equally fast, multi-model database system that supports the types of search queries common to BuddhaNexus. In short, these tools allow BuddhaNexus to operate as smoothly as it does, despite the high volume of textual data packed into its searchable database.
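The idea behind the layered graphs can be sketched in a few lines of Python. This is a heavily simplified illustration, not the HNSW algorithm as implemented in any library: the points lie on a small two-dimensional grid, the sparse upper layer is chosen by hand rather than at random, links are built by exhaustive comparison, and the search follows a single path instead of keeping a list of candidates. All names are my own.

```python
import math


def build_layer(points, ids, k):
    """Link every node in `ids` to its k nearest neighbours within the layer."""
    graph = {}
    for i in ids:
        others = sorted((j for j in ids if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        graph[i] = others[:k]
    return graph


def greedy_search(points, graph, entry, query):
    """Step to the neighbour closest to the query until no neighbour is closer."""
    current = entry
    while True:
        best = min(graph[current],
                   key=lambda j: math.dist(points[j], query),
                   default=current)
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current
        current = best


def layered_search(points, layers, entry, query):
    """Search the sparsest layer first, then refine in each denser layer."""
    current = entry
    for graph in layers:  # ordered from sparsest to densest
        current = greedy_search(points, graph, current, query)
    return current


# A 6 x 6 grid of points; the upper layer keeps only every second point.
points = [(x, y) for x in range(6) for y in range(6)]
all_ids = list(range(len(points)))
upper_ids = [i for i, (x, y) in enumerate(points) if x % 2 == 0 and y % 2 == 0]
layers = [build_layer(points, upper_ids, 4), build_layer(points, all_ids, 4)]

query = (4.6, 2.8)
found = layered_search(points, layers, entry=0, query=query)
exact = min(all_ids, key=lambda i: math.dist(points[i], query))
print(points[found], found == exact)
```

The upper layer covers most of the distance in a few long hops, and the dense bottom layer only has to finish the last short stretch. On this tidy grid the result happens to equal the exact nearest point; on real data the search is approximate, which is the trade-off accepted in exchange for speed.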

The Buddhist tradition has cut across cultural and linguistic divides during its historical development, which means that Buddhist Studies scholars rarely conduct research confined to a single language. One of the most exciting features planned for the future technical development of BuddhaNexus is translingual matching. In its current state, users can run intralingual searches and produce intralingual visualizations (e.g. identifying matches within the Tibetan textual corpus only). Translingual matching would open up new vistas for textual research that spans across otherwise disparate Buddhist textual cultures insofar as users would be able to identify matches across two languages. This type of development is complex and requires a methodical approach to training the algorithm:

In the last two years, contextual representation models such as Bidirectional Encoder Representations from Transformers (BERT) have emerged. These development models have distinct advantages over static models such as fasttext. One of our goals for the near future is to apply these contextual representations for the generation of matches, especially translingual matches, which are not yet available on BuddhaNexus. Regarding translingual matches, the first step would be to provide matches between Sanskrit and Tibetan files, and the next step is to calculate matches between Pāli and Sanskrit. We start with Sanskrit and Tibetan since these two languages have comparatively good resources for the training of the translingual word embedding. Also, the Tibetan translations are rather standardized regarding vocabulary and translation strategies concerning grammatical and syntactic structures, which is not the case compared to the Chinese material. It is therefore to be expected that the translingual parallels between Sanskrit and Chinese will be more challenging to calculate. In the long run, we also want to provide matches between Sanskrit and Chinese, and Chinese and Tibetan.

Much of the challenge in working across languages is in pooling enough parallel training data to develop a suitable translingual model. In other words, if the model is developed with a lack of sufficient data, then translingual search results will be imprecise. The training itself is complicated by the range of sentence boundaries among Pāli, Sanskrit, Chinese, and Tibetan, which makes it difficult to identify a unifying model for use across more than one language. Additionally, the presence of dialects and vernacular languages within the textual corpora may mean that earlier systems for translingual matching need to be constantly retrained and made more precise as more training data becomes available.

This is all to say that while the BuddhaNexus platform already offers users a valuable research tool for Buddhist textual studies, the future development of the platform will continue to alter the way that researchers work across Buddhist textual languages. Those who work on texts that have emerged in multilingual regions or those focused on philological issues surrounding the formation, circulation, and evolution of Buddhist textual corpora will particularly benefit from the capabilities of the BuddhaNexus platform. Even scholars who take a more distanced approach to Buddhist texts and focus on the practical aspects of the Buddhist tradition are offered a quick and clear snapshot of the texts that may have informed those practices.

Sebastian’s keen technical development has enabled what BuddhaNexus co-director Orna Almogi calls “gold standards” in the database’s search results. This means that users are assured as clear and precise results as possible, and that results will only continue to become clearer and more precise as the platform grows and continues to train its algorithmic models. As work continues at BuddhaNexus, the field of Buddhist Studies looks forward to these exciting technical developments and improvements.

Keep an eye on this outlet for future updates on BuddhaNexus.
