With recent advancements in AI, including developments by Sarvam AI and Hanooman AI, creating a lot of hype around large language models (LLMs) in Indian languages, it is time to reflect on what kind of AI Indian languages need.
At a recent three-day conference organized by the Centre for Internet and Society, Bengaluru, and Maharashtra Knowledge Corporation Ltd, Pune, people from different walks of life gathered to exchange ideas and problems around working with AI for Indian languages. The premise behind the conference, titled “AI: Future of the Commons: A Conversation on Artificial Intelligence, Indian Languages and Archives”, was that Indian languages are a part of our digital commons and that significant developments in AI in English have further widened the divide between English and Indian languages in the digital world. Given the unequal resources and investment in Indian languages, what must one aspire to vis-à-vis Indian languages? Around 40 experts, including Wikimedians, policy experts, academicians, content creators, representatives of civil society organizations, technologists, and project coordinators working on Indian languages, deliberated on this pressing question.
Here are six key takeaways from the event:
1. Not all that under-resourced
Indian languages are often considered under-resourced, under-served, or low-resourced languages on the Internet. Some languages (such as Sindhi, and many others written in multiple scripts) do not even have basic OCR tools. Expecting them to figure in any conversation around AI seemed far-fetched.
However, as it turns out, several projects, especially those funded by government bodies or commissioned for public projects, have developed and deployed a variety of computing tools, such as OCR, speech-to-text, and text-to-speech, for many of the official languages. Many of these applications are built for government welfare projects, to weed out duplicate records and speed up the delivery of social-justice schemes to their intended beneficiaries.
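As a concrete illustration of what such a basic tool looks like in practice, here is a minimal sketch of Hindi OCR using the open-source Tesseract engine, which ships with Hindi language data, through its Python wrapper pytesseract; the input file name is a hypothetical placeholder, and this is not the stack used by any of the government projects mentioned above.

```python
# A minimal Hindi OCR sketch using the open-source Tesseract engine.
# Assumes Tesseract is installed with the 'hin' (Hindi) traineddata;
# 'scanned_page.png' is a hypothetical input image.
from PIL import Image
import pytesseract

image = Image.open("scanned_page.png")
hindi_text = pytesseract.image_to_string(image, lang="hin")
print(hindi_text)
```

Openly documented engines of roughly this shape exist for several official languages; the question the conference raised is why so many commissioned tools are not similarly easy to find and reuse.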
Since data and corpora already exist for most of the regional or state languages, it is not unreasonable to expect that more sophisticated applications will soon be usable for these languages. It is hoped that the concerns raised at the conference will lead to collaborations on enhanced projects and to making such tools open access.
2. Open, but closed
While it is true that a lot of work has happened in some Indian languages, it is tragic that these resources are not accessible to all. They are often available only to the commissioning vendors and clients: partly because the stakeholders may not have considered the bigger picture of making these tools available to everyone, and partly because openness, as an idea, is yet to be fully understood in India.
The computational tools built for public projects, funded out of public money, are not available in the public domain. As a result, teams at different institutions and organizations, unaware of work already carried out, have been building their own tools from scratch, duplicating effort.
These tools and projects are also not easily discoverable via search engines. For instance, searching for “Hindi OCR” is unlikely to surface the projects that use OCR or the agencies involved. The developing agencies are obliged to deliver only as much as their clients expect, so little thought goes into making the tools visible. In the absence of a mandate for the tools to reach a wider audience, publishing practices such as making content SEO-friendly are not a priority for the parties involved. Unfortunately, the task of systematically disseminating information about these tools falls to those who build such resources with the explicit goal of making them widely available.
3. Defining non-English parameters
Venturing into building a corpus for projects in Indian languages requires one to think about the nature of the data and its place in the commons. The data available in English is a huge corpus built from several centuries of digitized content; the data available for Indian languages amounts to a mere 20 years of born-digital content. This discrepancy is bound to affect the quality of output of any AI project in Indian languages.
Noticing this discrepancy helped the participants ask larger questions: do Indian languages need as much data as English? Or do they need a different kind of data? And what can be done to bring digitized and non-digitized resources in Indian languages into the purview of the digital commons? These questions should inform the projects the participants have been working on at their institutions; it remains to be seen to what extent those projects align with the concerns raised at the conference.
4. Training LLMs on limited data
Expecting Indian language LLMs to be trained on vast amounts of data, as has been done for English, seemed unrealistic given barriers on every front, from digitization efforts to expertise and labor. However, a less-is-more approach has already yielded positive results, with some Indian language LLMs performing well after being trained on just one text, the Bible.
The Bible, then, is not just the book that postcolonial theorist Homi Bhabha identified as critical to understanding Indian modernity; it may well become the foundational text for an ethos of data built around working with a single text.
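To make this less-is-more idea concrete, here is a hedged sketch of fine-tuning a small causal language model on a single text, using the Hugging Face transformers and datasets libraries; the base model name ("gpt2") and the file path ("bible_hi.txt", standing in for a Hindi Bible translation) are illustrative assumptions, not the setup of any project discussed at the conference.

```python
# A minimal sketch, assuming the Hugging Face transformers/datasets stack:
# fine-tuning a small causal LM on a single text file.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"  # placeholder; an Indic or multilingual base model would fit better
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# The entire training corpus is one text file.
dataset = load_dataset("text", data_files={"train": "bible_hi.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="single_text_lm", num_train_epochs=3),
    train_dataset=tokenized,
    # mlm=False gives standard next-token (causal) language modeling.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Whether the resulting model “performs well” depends entirely on how it is evaluated; the sketch only shows that the engineering barrier to single-text training is low.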
5. Indian momentum in AI
The Indian imagination of AI could contribute something to the larger AI narrative. For instance, reflecting India's linguistic heterogeneity, it could offer an important case study in working on many languages simultaneously, as opposed to one language at a time. Similarly, it could drive programs for AI literacy in Indian languages, ensuring that different communities and users can engage with and benefit from AI technologies in their native (and/or preferred) languages.
6. The question of demand
One of the provocations at the conference was: maybe AI in Indian languages does not exist because there is no demand for it. After all, someone would have built it, had there been a market for it.
This connection between the logic of the market and the availability of solutions made it all the more interesting to ask why the question of Indian languages is framed in terms of the present state of affairs. Technology does not simply respond to what a society visibly lacks today; it makes its own inroads into work, processes, and interactions. Indian AI, if it materializes for one Indian language or more, will be a way of decolonizing the technological imagination of the contemporary moment.
How technology pans out for Indian languages remains to be seen; perhaps future editions of the “Future of the Commons” conference will provide a platform to track these questions. It is hoped that further conversations will help project directors ask sharper questions: is AI for Indian languages little more than an aspirational slogan, or does it mean something deeper for how language is treated within technology? While the conversations at the conference stayed at the level of languages that enjoy power and visibility as dominant state languages, much more needs to be done to make these conversations inclusive of disadvantaged languages, such as tribal and Northeastern languages, and languages with multiple scripts.
Thus, AI for Indian languages seems within reach, with ongoing projects such as Bhashini and the Linguistic Data Consortium for Indian Languages building corpora in Indian languages. But it remains wishful thinking so long as these efforts are not monitored for inclusivity and openness. Hopefully, future editions of the conference will bring such critical lenses to these projects, and a different vision will emerge of what AI can do for Indian languages and what Indian languages can do for AI.
The “Future of the Commons” conference had a two-fold focus on Indian languages: AI and archival practices. The takeaways from the discussion on the latter will shortly follow in a sequel to this article.
Cover image by Vanshhuyaar. Languages of Bharat.
