The CrossAsia Lab equips researchers in Asian Studies with practical tools to engage with data in more interactive and meaningful ways. It moves beyond passive information consumption, allowing users to search, analyze, and manipulate multilingual datasets. Key offerings include advanced full-text search, n-gram data for computational analysis, and transliteration tools for Central Asian scripts, among others. In part 2 of the CrossAsia miniseries (part 1 is available here), we explore the Lab’s core offerings, which enable hands-on exploration of vast archives and diverse materials and open up new possibilities for research workflows.
Simplifying Data Access
One of the CrossAsia Lab’s defining features is its focus on accessibility. The IIIF2PDF tool simplifies the process of viewing and archiving large digital manuscripts by converting IIIF objects into fully downloadable PDFs. The tool addresses a critical bottleneck for researchers needing offline access to fragmented collections, extending beyond historical documents to include manuscripts, newspapers, or archival layouts. Since the app works with any public IIIF manifest, researchers can use it for collections well beyond CrossAsia’s own holdings. For geo-restricted collections, alternative solutions exist through GitHub repositories like this in combination with a VPN, though these require additional technical knowledge and setup effort.
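To make the idea more concrete, here is a minimal Python sketch of the first step any such conversion needs: reading a IIIF Presentation API 2.x manifest and collecting one image URL per page. This is not the IIIF2PDF source code. The sample manifest, its example.org URLs, and the function name `page_image_urls` are invented for illustration, and manifests that follow Presentation API 3 are laid out differently (`items` instead of `sequences`).

```python
import json

# Hypothetical, heavily trimmed IIIF Presentation 2.x manifest (illustration only).
SAMPLE_MANIFEST = json.loads("""
{
  "@id": "https://example.org/iiif/book1/manifest",
  "label": "Sample manuscript",
  "sequences": [{
    "canvases": [
      {"label": "p. 1",
       "images": [{"resource": {"@id": "https://example.org/iiif/book1/p1/full/full/0/default.jpg"}}]},
      {"label": "p. 2",
       "images": [{"resource": {"@id": "https://example.org/iiif/book1/p2/full/full/0/default.jpg"}}]}
    ]
  }]
}
""")


def page_image_urls(manifest):
    """Return (canvas label, image URL) pairs in reading order."""
    pages = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                pages.append((canvas.get("label", ""), image["resource"]["@id"]))
    return pages


for label, url in page_image_urls(SAMPLE_MANIFEST):
    print(label, url)
```

Downloading the listed images and joining them into a single PDF would be the remaining steps, which is the part the tool takes off the researcher's hands.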
Meanwhile, Translit serves a different purpose: making multilingual text work easier to navigate. Researchers working with languages like Tibetan, Cyrillic Mongolian, or Uighur often face issues with inconsistent transliteration systems. Translit allows users to convert from native scripts into standardized Romanized versions (e.g., Wylie for Tibetan or ULY for Uighur) and vice versa. The flexibility of formats means it is useful not just for transliteration but also for searching text across corpora with varying script formats. Importantly, it supports processing large amounts of text at once, facilitating both quick conversion tasks and large-scale multilingual projects.
However, for the vertical Mongol script (Mongol bičig) you would still need other third-party tools like tuorqai’s Bicig.js.
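For readers curious about what a two-way conversion looks like in principle, the following toy Python sketch maps four Tibetan base consonants to Wylie and back. It is deliberately simplistic and says nothing about how Translit itself is implemented: real Wylie conversion also has to handle vowel signs, stacked letters, prefixes, and punctuation. The table and both function names are assumptions made for this example.

```python
# Toy table: four Tibetan base consonants and their Wylie romanization.
# A real converter needs far more (vowels, stacks, prefixes, punctuation).
TIBETAN_TO_WYLIE = {
    "\u0f40": "ka",   # TIBETAN LETTER KA
    "\u0f41": "kha",  # TIBETAN LETTER KHA
    "\u0f42": "ga",   # TIBETAN LETTER GA
    "\u0f44": "nga",  # TIBETAN LETTER NGA
}
WYLIE_TO_TIBETAN = {wylie: tib for tib, wylie in TIBETAN_TO_WYLIE.items()}
TSHEG = "\u0f0b"  # the dot that separates Tibetan syllables


def to_wylie(text):
    """Convert tsheg-separated syllables to space-separated Wylie."""
    syllables = [s for s in text.split(TSHEG) if s]
    return " ".join(TIBETAN_TO_WYLIE[s] for s in syllables)


def to_tibetan(wylie):
    """Convert space-separated Wylie syllables back to Tibetan script."""
    return TSHEG.join(WYLIE_TO_TIBETAN[s] for s in wylie.split())


sample = TSHEG.join(["\u0f40", "\u0f41", "\u0f42", "\u0f44"])
print(to_wylie(sample))                # ka kha ga nga
print(to_tibetan("ka kha ga nga") == sample)  # True
```

Any syllable outside the toy table raises a `KeyError`, which is a reminder of how much ground a production tool has to cover.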
From Extraction to Analysis: xA2XML and N-Grams
For researchers working with limited access to CrossAsia’s digital resources, processing materials into an analyzable format can be a barrier. This is where xA2XML comes in: it extracts text from CrossAsia’s immense “Integrated Text Repository” (see part 1 of the series for more information) and transforms it into XML. The process breaks down raw textual data into well-organized units, such as bibliographical data, paragraphs, or chapters, depending on the source material.
It uses the SRU (Search/Retrieve via URL) protocol maintained by the Library of Congress, which is commonly employed to query bibliographic data, and integrates this directly into workflows for further analysis or text mining. The structured XML format makes the data highly accessible for analytical tools and programming languages. An example of its potential use can be seen in the figure below.
Give it a try yourself by replacing the generic “searchterm” query with your own: http://crossasia.org/pazpar2/pz2Interface/PZ2Controller.php?cmd=PZ2&serviceid=xAdef&query=searchterm
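If you prefer to build the query URL in a script, the short Python sketch below assembles it with the standard library so that spaces, commas, and non-Latin characters in the search term are percent-encoded correctly. The base URL and the three parameters are taken from the link above; the function name `build_query_url` is our own.

```python
from urllib.parse import urlencode

# Endpoint and parameters as given in the link above.
BASE_URL = "http://crossasia.org/pazpar2/pz2Interface/PZ2Controller.php"


def build_query_url(search_term):
    """Return the xA2XML query URL for a search term, safely encoded."""
    params = {"cmd": "PZ2", "serviceid": "xAdef", "query": search_term}
    return f"{BASE_URL}?{urlencode(params)}"


print(build_query_url("democracy, Tibet"))
# http://crossasia.org/pazpar2/pz2Interface/PZ2Controller.php?cmd=PZ2&serviceid=xAdef&query=democracy%2C+Tibet
```

The resulting string can be pasted into a browser or passed to any HTTP client to retrieve the XML.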
A search for “democracy, Tibet” reveals a continuous trend of bibliographical records on the subject. The xA2XML output was filtered on the <md_title> and <md_date> entries, aggregated with pandas, and visualized with plotly by the author.
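The filtering and counting step behind such a chart can be sketched in a few lines of Python. The version below uses only the standard library rather than pandas, and it runs on a made-up sample: the `<results>` and `<hit>` wrapper elements and all five records are assumptions for illustration, while `<md_title>` and `<md_date>` are the entries named in the caption. The real xA2XML output may be structured differently.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample records; only the md_title / md_date names come from the article.
SAMPLE_XML = """
<results>
  <hit><md_title>Democracy and Tibet</md_title><md_date>1991</md_date></hit>
  <hit><md_title>Tibetan democracy in exile</md_title><md_date>1991</md_date></hit>
  <hit><md_title>A history of Tibet</md_title><md_date>2003</md_date></hit>
  <hit><md_title>Democracy in exile</md_title><md_date>2003-05</md_date></hit>
  <hit><md_title>Undated essay on democracy</md_title></hit>
</results>
"""


def records_per_year(xml_text, keyword):
    """Count records whose title contains keyword, grouped by year."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for record in root.iter():
        title = record.findtext("md_title")
        if title is None:
            continue  # this element is not a record wrapper
        year = (record.findtext("md_date") or "")[:4]
        if keyword.lower() in title.lower() and year.isdigit():
            counts[int(year)] += 1
    return dict(sorted(counts.items()))


print(records_per_year(SAMPLE_XML, "democracy"))  # {1991: 2, 2003: 1}
```

The resulting year-to-count mapping is exactly the kind of table that can be handed to a plotting library. Records without a usable date are skipped here, which is a choice worth revisiting for real data.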
The n-gram offerings by CrossAsia Lab focus on enabling detailed computational text analysis with accessible preprocessed datasets. N-grams are sequences of one (unigram), two (bigram), three (trigram), or more items such as words or characters, and they facilitate the discovery of linguistic and conceptual patterns within texts. The Lab has developed ready-to-use n-gram datasets from collections of its Integrated Text Repository (ITR), including the Xuxiu Siku Quanshu 續修四庫全書 (a corpus of over 5,400 historical Chinese texts), the Daoist compendium Daozang jiyao 道藏辑要, and a collection of over 10,000 Chinese local gazetteers covering the period from the Song dynasty to Republican China and some older geographical texts. These datasets allow researchers to explore the evolution of terms, phrases, and underlying grammatical structures across texts and time periods.
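To illustrate the concept itself, here is a minimal Python sketch that extracts character n-grams from a string and counts them. It only demonstrates the general technique; it is not how the CrossAsia datasets were produced, and the function names and the short sample string are made up for this example.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_counts(text, n):
    """Count how often each character n-gram occurs in text."""
    return Counter(char_ngrams(text, n))


sample = "革命法革命"  # made-up string, not a quotation
print(char_ngrams(sample, 2))          # ['革命', '命法', '法革', '革命']
print(ngram_counts(sample, 2)["革命"])  # 2
```

Character n-grams are a natural fit for Chinese texts, where word boundaries are not marked by spaces. For languages written with spaces, the same sliding window can be applied to a list of words instead.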
One concrete application of n-grams is tracking how specific terminology, such as official titles or slogans, emerged and transformed during key cultural or political periods. For instance, a dataset of local gazetteers could help analyze how administrative terms were used in different regions, while datasets from Daoist texts might highlight the evolution of religious terminology. Combined with the CrossAsia ITR Explorer, you could examine the frequency and co-occurrence of terms like “revolution” (革命) and “law” (法) as shown in this example:
A selection of local gazetteers from the n-gram datasets shows a periodic correlation between “revolution” and “law”. Visualization via the ITR Explorer.
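For those who want to reproduce this kind of comparison on their own material, the following Python sketch counts, per decade, how many texts mention each term and how many mention both. It is an independent illustration and not the ITR Explorer's method. The four dated snippets are invented, and plain substring matching is a simplifying assumption: 法, for instance, also matches inside longer compounds.

```python
from collections import defaultdict


def cooccurrence_by_decade(documents, term_a, term_b):
    """Per decade, count texts containing term_a, term_b, and both."""
    stats = defaultdict(lambda: {"a": 0, "b": 0, "both": 0})
    for year, text in documents:
        decade = year // 10 * 10
        has_a, has_b = term_a in text, term_b in text
        stats[decade]["a"] += has_a        # True counts as 1
        stats[decade]["b"] += has_b
        stats[decade]["both"] += has_a and has_b
    return dict(stats)


# Invented snippets for illustration only, not taken from any gazetteer.
documents = [
    (1912, "革命之後，新法頒行"),
    (1915, "縣志載革命事"),
    (1923, "修訂法規"),
    (1928, "革命與法並行"),
]

for decade, counts in sorted(cooccurrence_by_decade(documents, "革命", "法").items()):
    print(decade, counts)
```

Dividing the "both" count by the number of texts per decade would turn these raw counts into rates that are easier to compare across periods with uneven coverage.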
What makes these tools particularly powerful is their flexibility and scalability. They cater to a variety of research aims, from linguistic studies—such as lexicon development or stylistic analysis—to broader historical investigations across disciplines. Researchers can start small, analyzing a single term over a decade, and then expand into larger projects comparing trends across genres, regions, or centuries (learn more about the process and ideas on their dedicated blog post).
Nota bene: While the strength of these n-gram datasets lies in their ability to suggest new research directions and unexpected connections, their most insightful use often comes when combined with close reading and contextual knowledge. Terms like “revolution” or “reform” carried different meanings across time periods and political contexts—nuances that computational analysis alone might miss.
Newspapers at Scale: The ITR-Newspaper Explorer
Historical newspapers occupy a unique position in Asian Studies, offering archives of dense cultural, political, and social data. Yet they have typically been difficult to analyze at scale, particularly given the multilingual and fragmented nature of these sources. The ITR-Newspaper Explorer addresses these challenges by providing a dedicated interface for working with Asian historical newspapers from sources like People’s Daily, Ta-kung Pao, and the South China Morning Post, among others. Together, these collections span regions and periods significant to modern Asian history.
The Newspaper Explorer includes features for keyword search, Boolean logic (AND, OR, NOT), and customizable result sets. What sets it apart, though, is its customizable heatmap visualization, which enables users to track term usage and trends over time. Users can zoom into specific decades, years, months, or even days to observe coverage changes and identify significant historical moments. For instance, the tool could be used to compare how different newspapers reported key events like the “May Fourth Movement” or the post-liberation development of the People’s Republic of China, shedding light on competing ideologies or divergent cultural narratives.
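The logic behind such a heatmap can be approximated in a few lines of Python: filter articles with a Boolean query, then count the matches per year and month. This sketch is not the Explorer's implementation. The function names, the `all_of` / `any_of` / `none_of` parameters, and the four sample articles are assumptions made up for illustration.

```python
from collections import Counter
from datetime import date


def matches(text, all_of=(), any_of=(), none_of=()):
    """Boolean filter: AND over all_of, OR over any_of, NOT over none_of."""
    if any(term not in text for term in all_of):
        return False
    if any_of and not any(term in text for term in any_of):
        return False
    return not any(term in text for term in none_of)


def monthly_hits(articles, **query):
    """Count matching articles per (year, month) cell of a heatmap grid."""
    grid = Counter()
    for published, text in articles:
        if matches(text, **query):
            grid[(published.year, published.month)] += 1
    return grid


# Invented sample articles, for illustration only.
articles = [
    (date(1919, 5, 5), "students march in Beijing as the movement spreads"),
    (date(1919, 5, 20), "student movement and strike in Shanghai"),
    (date(1919, 6, 3), "merchants join the strike"),
    (date(1919, 6, 10), "movement leaders released; strike ends"),
]

print(monthly_hits(articles, any_of=("movement", "strike")))
print(monthly_hits(articles, all_of=("movement",), none_of=("strike",)))
```

Grouping by `(year, month)` corresponds to one zoom level of the heatmap. Switching the key to the year alone or to the full date gives the coarser and finer views described above.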
The visual nature of the Explorer makes it particularly helpful for presenting research findings. For example, scholars can generate heatmaps showing how nationalist rhetoric evolved across three distinct eras of Chinese media history and use them as evidence for larger historiographical arguments. By combining clean data access with analytical interactivity, the ITR-Newspaper Explorer transforms the way newspapers function as historical sources.
Expanding Research Horizons
The toolkit provided by CrossAsia Lab makes technical processes—from extracting data to analyzing linguistic patterns—accessible to researchers across varying technical skill levels. Whether breaking down complex archives or visualizing trends in cultural discourse, these tools highlight the importance of connecting raw data with meaningful research questions.
What makes these offerings particularly impactful is how they bridge traditionally separate domains of research. From linguistic analysis to visualizing trends in genres or themes, CrossAsia Lab provides a versatile toolkit catering to diverse scholarly needs. It offers a safe environment for researchers to experiment with both small- and large-scale data analysis, stripping away much of the intimidation often associated with digital humanities tools. Without requiring heavy investments in training, time, or specialized software, it allows researchers to explore computational methods at their own pace. Housed in a straightforward interface designed for functionality over flash, it stands out as a valuable yet underutilized resource for Asian Studies—a platform that quietly combines utility with remarkable research potential.
In the third and final article of this series, we will explore how CrossAsia’s preservation work and metadata expertise come together in the migration of the “Digital Tibetan Archives Bonn”. This journey highlights the critical role of sustainable infrastructure in ensuring long-term access to invaluable digital collections, as well as the power of scholarly collaboration in adapting historical materials for future research.
Cover image generated with Ideogram.ai
