Scripts That Don’t Fit: The Hidden Bias of NLP in South Asian Languages

This is a guest post by Saniya Irfan.

Digital Humanities in South Asia often begins with negotiating invisibility: the invisibility of our languages in NLP resources, datasets, and interfaces. Working with Urdu text in Python for my own PhD project exposes the infrastructural biases of digital tools and reminds us that computational work is never linguistically neutral.

In her article ‘Decolonizing the Digital Humanities in Theory and Practice’ (2018), Roopika Risam states that ‘Digital Humanities offer tremendous potential for democratizing scholarly knowledge, such possibilities are undercut by projects that recreate colonial dynamics or reinforce the Global North as the site of knowledge production.’ Digital Humanities can make knowledge more accessible and inclusive, for example by digitizing archives, creating open datasets, or allowing scholars from anywhere to collaborate. In practice, however, many DH projects still reproduce old hierarchies. Most projects are designed, funded, and led by institutions in the Global North, while the materials, languages, or cultures being studied come from the Global South. This means that even with open and digital tools, the power to define what counts as knowledge, how it is represented, and who gets credit remains concentrated in the North.

What happens when we bring a right-to-left (RTL), low-resource, and highly poetic language like Urdu into a Python-based workflow designed for English? Working with poetic texts in Indic or RTL languages using tools like NLTK or BERT poses several challenges (illustrated in the sketch after this list):

  1. Many NLP libraries are optimized for English, so RTL scripts like Urdu and complex Indic scripts often display incorrectly or break during tokenization;
  2. Poetic Urdu or Hindi often uses irregular spacing, compounding, and diacritics, which cause tokenizers to split words incorrectly; 
  3. Pretrained models like BERT have limited or no training data in these languages, leading to poor recognition of poetic vocabulary, idioms, and named entities;
  4. There are very few gold-standard datasets for South Asian poetry, making fine-tuning or evaluation difficult; 
  5. Urdu and many Indic languages mix Arabic, Persian, and local lexicons, confusing monolingual or even multilingual models trained on standard texts; 
  6. These tools were built for high-resource, left-to-right languages, so they often fail to capture the structural, cultural, and aesthetic richness of South Asian poetic texts.
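To make points 1 and 2 concrete, here is a minimal sketch, not taken from my actual pipeline, of how a standard English-oriented tokenizer and a multilingual BERT tokenizer handle a short Urdu phrase carrying an izafat diacritic. The phrase, library versions, and model checkpoint are illustrative assumptions; the exact output will vary with your environment.

```python
# Minimal sketch (illustrative, not the author's pipeline): comparing how an
# English-oriented tokenizer and a multilingual BERT tokenizer treat a short
# Urdu phrase. Requires `pip install nltk transformers` and, for NLTK, the
# punkt tokenizer data (nltk.download("punkt")).
from nltk.tokenize import word_tokenize
from transformers import AutoTokenizer

# An illustrative Urdu phrase ("shaam-e-ghariban ka nauha") with an izafat
# diacritic (zer) attached to the first word.
verse = "شامِ غریباں کا نوحہ"

# NLTK's default tokenizer knows nothing about Urdu orthography; it falls back
# on whitespace and punctuation rules tuned for English, so diacritics and
# joined forms are at the mercy of those rules.
print(word_tokenize(verse))

# Multilingual BERT shatters poetic vocabulary into subword pieces (##...),
# and characters outside its vocabulary can surface as [UNK].
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tok.tokenize(verse))
```

Even this toy example makes the asymmetry visible: the same two lines run on an ordinary English sentence typically return clean, word-level tokens.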

When digital tools fail to recognize Urdu’s syntax and semantics, they also fail to grasp the cultural world embedded in its literature. In my PhD study, I examine the Urdu Marsiya, a poetic and hybrid genre that blends Arabic, Persian, and indigenous Indian imagery. This multilingual and metaphorical composition introduces a distinctive layer of complexity to computational analysis. When I run Named Entity Recognition (NER) with language models, I frequently observe difficulties in accurately identifying or categorising names, locations, and cultural references that shift seamlessly between languages. A single Marsiya may incorporate Arabic theological terminology, Persian analogies, and vernacular Indian phrases within the same stanza. This linguistic amalgamation confuses models trained on standardised or monolingual datasets, so entities that are culturally and semantically interconnected end up omitted or misread. This difficulty underscores the technical constraints of existing NLP techniques and the need for contextually aware, multilingual models attuned to the cultural and poetic subtleties of South Asian literature.
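As a rough illustration of the NER problem, the sketch below runs an off-the-shelf multilingual token-classification pipeline over a Marsiya-like line that mixes an Arabic name, a place name, and Persianate imagery. The checkpoint named here is an assumption, not the model used in my study; swap in whichever multilingual NER model you have access to.

```python
# Illustrative sketch only: an off-the-shelf multilingual NER pipeline applied
# to a Marsiya-like line. The checkpoint below is an assumption; results will
# differ across models, and entities tied to devotional or Persianate
# vocabulary are frequently missed, mislabelled, or split across subwords.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="Davlan/xlm-roberta-base-ner-hrl",  # assumed multilingual checkpoint
    aggregation_strategy="simple",            # merge subword pieces into spans
)

# A constructed line naming Husain ibn Ali, the plain of Karbala, and the
# "thirsty-lipped Euphrates" (furat-e-tishna-lab), a stock Marsiya image.
line = "حسین ابن علی کربلا کے میدان میں فراتِ تشنہ لب"

for entity in ner(line):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 2))
```

In runs like this, some spans are often picked up while culturally embedded references, such as the metaphorical river, are dropped or mislabelled, which is precisely the failure mode described above.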

Roopika Risam expands upon the contributions of Harding and others in postcolonial science and technology studies (STS). Kavita Philip, Lilly Irani, and Paul Dourish, in ‘Postcolonial Computing: A Tactical Survey’ (2010), advocate for postcolonial computing as a strategy for engaging with technoscience. For them, ‘Postcolonial computing is a bag of tools that affords us contingent tactics for continual, careful, collective, and always partial reinscriptions of a cultural-technical situation in which we all find ourselves’. Risam adds that ‘It engenders questions of technology and translation, mobility, labour, and infrastructure and how they manifest across cultural contexts.’ In her work, Risam highlights many projects that use digital cultural heritage, games, performance art, and maps to centre Indigenous communities and immigrant histories and to decolonise the field of digital humanities itself. All these projects are dedicated to bringing to the forefront what dominant narratives have usually pushed to the background. They refuse the knowledge structures produced by colonialism and instead place Indigenous, immigrant, and Global South knowledges at the centre of their work. Such examples harness the transformative power of technology while refusing simple or easily generalisable answers to the ongoing effects of colonialism and neocolonialism on knowledge production. They welcome the diversity, contradictions, tensions, and hybridities that decolonisation requires. They are culturally situated, innovative, and experimental. By engaging at the local level, they help us understand the global dimensions of digital humanities.

To do Digital Humanities in South Asia in a decolonial way, we must think about who makes the tools, who owns the data, and whose language rules shape computational models. Because most NLP infrastructures are trained on English and other high-resource languages, the Global North remains the default site of knowledge production. When academics use these tools to work with Indic or right-to-left languages, they run into problems that are both technical and conceptual: local scripts, idioms, and poetic forms are treated as anomalies. Creating NLP workflows for Urdu or other Indic languages is therefore a decolonial act: it extends the technological imagination beyond English and insists that South Asian languages, with their mixed histories and plural meanings, deserve equal computational representation.

A significant essay in this context, ‘Who Are the Users in Multilingual Digital Humanities?’ (2025) by Horváth et al., examines the diversity of practitioners engaged in multilingual and non-Latin-script digital humanities (DH) environments from a user experience (UX) perspective. Drawing on prior research and a DH2023 workshop, the authors argue for gathering qualitative data to understand the demographics of DH users, particularly those from under-represented language and cultural backgrounds. They introduce fictional user personas derived from survey data to exemplify the obstacles encountered by multilingual DH scholars, including insufficient infrastructure, inadequate digital tools for non-English or non-Latin-script languages, and institutional biases. The research emphasises intersectional factors such as geography, gender, and institutional access that shape user experiences, and it calls for inclusive infrastructures, interregional collaborations, and pragmatic measures by libraries, educators, funding bodies, and transnational DH organisations to promote linguistic diversity. The authors ultimately endorse a dynamic, community-oriented paradigm of multilingual digital humanities that acknowledges the growing and diverse identities of its users.

Horváth et al. (2025) provide five essential proposals, beginning with the creation of a centralised, open repository for multilingual digital humanities user personas:

  1. Establish a central, open persona repository: The authors propose a public, version-controlled database (e.g., on Zenodo or Wikibase) to host, translate, and update multilingual user personas. Such a repository would ensure visibility, accessibility, and ongoing community engagement with user data.
  2. Incorporate multilingual user personas into institutional planning and infrastructure design: University libraries, digital humanities centres, and research computing units should utilise these personas when developing catalogues, discovery systems, and digital tools, ensuring that infrastructures accommodate the requirements of non-Latin script and under-resourced language users. 
  3. Incorporate multilingual awareness into digital humanities instruction and training: Graduate programs, workshops, and training schools ought to utilise user personas to develop inclusive courses that confront the realities and challenges of operating in multilingual and cross-script environments. 
  4. Utilise personas to inform hiring, assessment, and promotion rules: Academic committees must acknowledge the additional labour and intricacies associated with multilingual digital humanities work, employing personas as instruments to measure impact and champion equitable appraisal of such research. 
  5. Mobilise funding agencies and transnational DH organisations for multilingual advocacy: Urge entities such as ADHO, DARIAH, and OPERAS to adopt multilingual user personas as a framework for fostering equitable funding, policy formulation, and research support for non-European and non-Latin-script languages.

These ideas seek to make digital humanities infrastructures, policies, and training more linguistically inclusive and globally representative. If the proposals put forward in such studies, including creating open multilingual datasets, promoting collaborative annotation, and building inclusive NLP infrastructures, are diligently implemented, multilingual Digital Humanities could be substantially transformed. These collaborative efforts would make computational tools more adaptable to Indic and right-to-left languages while fostering more egalitarian environments for researchers working beyond the predominant linguistic and institutional hubs. I see a future in which I, along with others, can engage effortlessly with Urdu texts and other South Asian languages, delving into their literary depth without the persistent challenges posed by technical constraints or structural hierarchies.


References: 

  1. Horváth, Alíz, Cosima Wagner, David Joseph Wrisley, et al. 2025. “Who Are the Users in Multilingual Digital Humanities?” Digital Scholarship in the Humanities, September 24, fqaf091. https://doi.org/10.1093/llc/fqaf091.
  2. Philip, Kavita, Lilly Irani, and Paul Dourish. 2010. “Postcolonial Computing: A Tactical Survey.” Science, Technology, & Human Values 37 (1): 3–29. https://doi.org/10.1177/0162243910389594.
  3. Risam, Roopika. 2018. “Decolonizing the Digital Humanities in Theory and Practice.” In The Routledge Companion to Media Studies and Digital Humanities, 1st ed., edited by Jentery Sayers. Routledge. https://doi.org/10.4324/9781315730479-8.

Cover image: Digital Illustration of Iconic Neural Network.
