Teaching Bengali Digital Texts to Anglophone Undergraduates: What Voyant Reveals about the Infrastructural Bias of DH Tools

In designing an introductory Digital Humanities class, I am often faced with the question of how best to incorporate linguistic diversity, particularly from the Global South, for a predominantly Anglophone student body. How do I invite students to critically examine the Anglophone bias underlying much of DH theory and practice without depending solely on the languages they know and work within? Beyond reading key theoretical works by Roopika Risam, Kelly Baker Josephs, Rahul K. Gairola, and Sayan Bhattacharyya, how might students perform the task of digitally engaging with cultural records from the Global South? In what follows, I offer one method that I have successfully employed in my DH classes to create practical, hands-on assignments that introduce students to both the limitations and possibilities of DH tools when working with languages from the Global South. The discussion draws on my notes on teaching Voyant using Bengali-language texts as the sample documents for the class. My goal is also to focus on publicly available software and texts so as to increase access for both students and instructors. As a caveat, this method is neither the only possible one nor an exhaustive introduction to these tools. It is simply the approach that has worked well for students who speak and live in English and do not have a working knowledge of languages beyond it.

Establishing the Problem: Text Mining as a Precursor to Digital Curation

To give a sense of the student body I am referring to, I will use the Fall 2024 cohort I taught at Saint Mary’s College of California: an Introduction to Digital Humanities composed predominantly of incoming first-year students, the majority of whom spoke English. Given that this course was part of the diversity requirement for the English Department, my aim was to introduce the field while placing it within the context of the Anglophone bias underlying much of DH theory and practice. This cohort had little to no knowledge of DH as a field, or of the tools employed therein, and their high school second-language experience was limited to the Romance languages. It was therefore safe to assume they had not encountered a non-Latinate alphabet, digitally or in print, or any orthographic system that stacks letters and vowel signs horizontally. To that end, I chose what I hoped were approaches supported by publicly available tools with a relatively easy learning curve, which would allow students to focus on questions of language as they explored practical applications of these tools.

The unit began with experimentation using Voyant, with the goal of understanding how textual analysis can be digitized, and how an open-source program might break a text into a range of data–such as word frequencies, distribution lists, and visualized connections between words–which the user can then interpret. Using a born-digital text such as the Bengali newspaper Anandabazar Patrika, students identified blocks of text before copying them into Voyant’s input box. The results are visually striking as they introduce students to the complexities of an abugida script, in which vowels take on a modified shape once they are attached to a consonant.1

 Figure 1: Text analysis using Voyant

Given that Bengali is written left to right (LTR) with each word a distinct element on the page, students had at least some anchors when it came to experiencing a page. I use “experiencing” advisedly, given that their approach to reading Bengali was limited to identifying shapes and repeated patterns. They could thus spot syntactical units such as a sentence once I pointed them to “।” (dānri) as the equivalent of a period marking the end of a sentence.
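These two anchors–whitespace between words and the danda marking a sentence boundary–are all a simple frequency tool needs. As a minimal sketch (in Python, and emphatically not Voyant’s actual implementation), with two invented sample sentences standing in for newspaper text:

```python
from collections import Counter

# Two short illustrative sentences: "Today the temperature is high.
# Tomorrow the temperature is low." (invented sample text)
text = "আজ তাপমাত্রা বেশি। কাল তাপমাত্রা কম।"

# The danda "।" (U+0964) ends a sentence, like a period in English.
sentences = [s.strip() for s in text.split("।") if s.strip()]

# Words are whitespace-separated, just as in English text.
words = Counter(w for s in sentences for w in s.split())

print(len(sentences))       # 2 sentences
print(words["তাপমাত্রা"])      # "temperature" occurs twice
```

Nothing here requires the tool to understand Bengali: the segmentation rides entirely on punctuation and whitespace, which is precisely why students could follow the output while only “experiencing” the script.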

Figure 2: তাপমাত্রা (tapmatra, temperature) highlighted to demonstrate abugida script

As the class played around with word frequency and distribution, they started to notice the degree to which they relied on the concept of a “word” made up of distinct “letters.” For example, for Anglophone students reliant on a single, consistent alphabet, the idea that ত্রা (tra) was a modification of তা (ta) (representing a র added to the consonant) yet fundamentally similar to তা required a reorientation to accommodate the range of forms words or letters could take.
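This relationship between তা and ত্রা is visible at the level of Unicode code points, where each visible “letter” may decompose into several components. A short sketch (assuming only the standard Unicode Bengali block, U+0980–U+09FF):

```python
# Each on-screen "letter" may be several Unicode code points.
ta = "তা"    # ta: consonant ত + vowel sign া
tra = "ত্রা"  # tra: ত + virama ্ + র + vowel sign া

print([hex(ord(c)) for c in ta])   # ['0x9a4', '0x9be']
print([hex(ord(c)) for c in tra])  # ['0x9a4', '0x9cd', '0x9b0', '0x9be']
```

Both strings begin with the same base consonant ত (U+09A4); ত্রা simply inserts a virama and র before the vowel sign. What students perceived as a “modification” of one letter is, to the computer, a longer sequence sharing the same first code point.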

What made this introduction to text mining successful was Voyant’s easy learning curve and the tool’s ability to work with texts of varying lengths. Since most students had had little, if any, experience with DH tools, this ease was significant as it allowed them to focus on the linguistic representation on their screens. By toggling between the Cirrus, Terms, and Links tabs, students were able to encounter the script from various angles, gradually increasing their familiarity with the visual shape of the Bengali alphabet. As we moved to the next part of the class, the importance of recognizing letter shapes became increasingly apparent.

What emerges from this initial exploration into text mining, then, is that while users with little to no training in an Indic language can still experiment with digital representations of texts in these languages, thereby becoming acquainted with the layout of the alphabet, such engagement is limited to texts of more recent vintage, preferably born-digital ones.

Encountering Infrastructural Dissonance: Working with Scanned Texts

Following this foray into experiencing Bengali digital texts and the layout of the alphabet, I then moved the class towards pre-digital texts—the bread and butter of most DH scholars—to see how well these tools handled them. Sticking with the theme of publicly available news magazines, I introduced students to Betar Jagat, the print mouthpiece for the Calcutta Radio Station (now known as Akashvani Kolkata), which ran from the mid-1920s to the late 1980s. Issues of the journal can be freely accessed on Google Books and the Internet Archive, allowing scholars and enthusiasts to read a culturally significant publication. My choice in this instance was also guided by the use of legacy fonts in Betar Jagat, as these necessitate a more creative approach when it comes to digitizing.2 While the class had a basic grasp of the challenges posed by legacy fonts, it was only when they copied and pasted a page of the magazine in PDF format into Voyant that they realized how incompatible digital tools are with pre-digital non-Latinate scripts.

Figure 3: Scanned PDF of Betar Jagat with individual words identified by Adobe Acrobat

Figure 4: Same text as Figure 3 PDF using Adobe OCR function

For example, while Adobe Acrobat identifies individual “words,” those elements on the page are not read as legible words by Adobe’s built-in OCR (Optical Character Recognition) tool. Barring a handful of words, the rest are rendered as a collection of meaningless symbols. Similar garbled results are produced by most freely available OCR tools, and as a result, the Voyant output is unusable.

Figure 5: Text analysis of Betar Jagat using Voyant

However, because the class had already spent a considerable amount of time looking at the shape of the Bengali alphabet using the same Voyant interface, students were able to note fairly quickly the difference between the letters in figure 3, and their subsequent iteration in figures 4 and 5. They were thus able to identify the errors that the digital tools introduced, and I could have a productive conversation about the challenges of encoding the Bengali script.
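One way to make the mechanics of this garbling concrete is to show what happens when font-specific byte codes are decoded as if they were a standard encoding. The bytes below are a hypothetical illustration, not taken from the actual Betar Jagat files:

```python
# Legacy 8-bit fonts store glyph codes meaningful only to that font.
# These three byte values are invented for illustration.
legacy_bytes = bytes([0xA4, 0xCD, 0xB0])

# A tool that guesses a standard encoding (here Latin-1) produces
# characters with no relation to the Bengali letters on the page.
print(legacy_bytes.decode("latin-1"))  # '¤Í°'
```

The extraction pipeline has no error to report–every byte maps to *some* character–which is why the output looks like a confident stream of meaningless symbols rather than a failure message.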

Conclusion and Pedagogical Takeaways

The illegible result discussed above gives students the opportunity to experience first-hand what DH scholars from the Global South have long noted: available infrastructure can offer, at best, limited support for digitizing Indic-language texts. While this is an obvious point to some of us in the field, it remains to be fully addressed in the pedagogy, particularly at the introductory level. By first demonstrating to students the possibilities of text mining using a language unfamiliar to the majority of the participants in a North American classroom, one is then able to pinpoint the many limitations of publicly available digital tools. This hands-on approach places in context the theoretical discussion around what might be broadly construed as Postcolonial DH, giving students an insight into what Anglophone bias in technology looks like on the ground.


Footnotes

  1. For more on the structure of an abugida script, with particular reference to Bengali, see Purbasha Auddy’s “Mining Verbal Data from Early Bengali Newspapers and Magazines.” ↩︎
  2. Andrew Hardie’s “From Legacy Encodings to Unicode” offers a good overview of the specific case of encoding Asian languages. Hardie examines how legacy fonts rely on 8-bit graphical encodings that are largely incompatible with Unicode, making text extraction from formats such as PDFs effectively impossible. ↩︎

References

Auddy, Purbasha. 2022. “Mining Verbal Data from Early Bengali Newspapers and Magazines: Contemplating the Possibilities.” In Global Debates in the Digital Humanities, edited by Domenico Fiormonte, Sukanta Chaudhuri, and Paola Ricaurte. University of Minnesota Press.

Hardie, Andrew. 2007. “From Legacy Encodings to Unicode: The Graphical and Logical Principles in the Scripts of South Asia.” Language Resources and Evaluation 41 (1): 1–25. https://www.jstor.org/stable/30200570.
