CrossAsia: A Gateway to Asian Studies in the Digital Age (Part 1)

The first of three articles looking at how the Staatsbibliothek zu Berlin’s digital platform CrossAsia is changing the way Asian resources become accessible using state-of-the-art technology. Interview conducted by Daniel Wojahn with CrossAsia members Matthias Kaun, Martina Siebert, Hou-leong Ho, and Antje Ziemer.

CrossAsia website

In the world of digital humanities, CrossAsia has emerged as a noteworthy platform. What began about 17 years ago as a service to access electronic resources at the Staatsbibliothek zu Berlin has developed into a system that is changing how researchers work with Asian materials.

In its early days, CrossAsia functioned essentially as a digital librarian: it helped scholars find and access licensed materials related to Asia. This was part of the German Research Foundation (DFG)-funded “Special Collection Area for East and Southeast Asia”, a programme started over 70 years ago to ensure that German researchers had access to essential materials (predominantly printed and electronic books and journal database subscriptions).

Integrated Text Repository (ITR): The Foundation for Digital Scholarship

In 2016, things began to change. CrossAsia evolved into what is called the “Fachinformationsdienst (FID) Asien”, or Specialized Information Service for Asia, a new funding line of the German Research Foundation. This was not just a name change—it represented a fundamental shift in thinking.

“The perspective changed a little bit,” as Matthias Kaun put it. “We shifted from merely concentrating on Asia to becoming more content-focused”. Instead of just providing access, the team started asking how they could help researchers work with these materials, how they might break down language barriers, and how they could preserve digital content that might otherwise disappear (we look at one case study in part 3 of this mini-series).

Anyone who has researched across multiple languages knows the challenge: Materials exist in different formats, on different platforms, with different search capabilities. Sometimes you can download full text, sometimes just browse page scans in poor resolution. Sometimes you can search within documents, sometimes not.

CrossAsia tackled this problem by developing the Integrated Text Repository (ITR): Not just another database but a new approach to managing digital content.

Repository Architecture and Data Processing

At its core, the repository removes barriers imposed by commercial database constraints, enabling researchers to work with content independent of vendor platforms. As team member Martina Siebert puts it, they aim to “free the data from their boxes” – a philosophy that captures CrossAsia’s mission of liberating valuable information from locked, proprietary systems and making it accessible for diverse research methods. This ensures that scholarly access remains stable even if vendor relationships end, protecting materials and making them available for long-term projects. Accordingly, Matthias Kaun emphasizes: “It’s not a kind of repository where people put everything into it and leave it like it is. It is quite a well-structured repository of tons of text and material.”

The process of achieving this independence begins when CrossAsia receives data from vendors (an extensive list of licensed databases is available on the CrossAsia website). The data often arrives in various formats (PDF, HTML, XML, etc.), with metadata tied to proprietary systems. Subject experts and technical staff collaborate to carefully transform vendor-specific formats, for instance by using regular expressions to standardize information or by employing tools like MARKUS to align historical Chinese era designations with Western calendar systems. Custom workflows reorganize the disparate formats, reducing metadata complexity where necessary or expanding it as needed. Dublin Core and other established standards, such as MODS and METS, are used to normalize the extracted data, creating consistency across collections.
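To make this kind of normalization step more concrete, here is a minimal sketch in Python. It is not CrossAsia's actual workflow and it is not how MARKUS works internally; the vendor field names (`TITLE`, `PUBDATE`), the `vendor:date_original` key, and the three-entry era table are assumptions made purely for illustration. The sketch uses a regular expression to find a Chinese era designation, converts it to a Western year, and maps the result onto two Dublin Core elements.

```python
import re

# Illustrative lookup table: era name -> Western year of the era's first year.
# A real workflow would need a far more complete table.
ERA_START = {"康熙": 1662, "乾隆": 1736, "光緒": 1875}

DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
          "六": 6, "七": 7, "八": 8, "九": 9}

# Matches e.g. 乾隆三十年: an era name, a Chinese numeral, then 年 (year).
ERA_PATTERN = re.compile(
    "(" + "|".join(ERA_START) + ")([元一二三四五六七八九十]+)年"
)


def parse_cn_number(text):
    """Parse Chinese numerals from 1 to 99, e.g. 元 -> 1, 十二 -> 12, 三十 -> 30."""
    if text == "元":  # 元年 denotes the first year of an era
        return 1
    if "十" not in text:
        return DIGITS[text]
    tens, _, ones = text.partition("十")
    return (DIGITS[tens] if tens else 1) * 10 + (DIGITS[ones] if ones else 0)


def era_to_western(date_text):
    """Return the approximate Western year for an era date, or None.

    The lunar year does not line up exactly with the Gregorian year,
    so the result is an approximation.
    """
    match = ERA_PATTERN.search(date_text)
    if match is None:
        return None
    era, year = match.groups()
    return ERA_START[era] + parse_cn_number(year) - 1


def to_dublin_core(record):
    """Map a hypothetical vendor record onto a few Dublin Core elements."""
    title = re.sub(r"\s+", " ", record["TITLE"]).strip()  # collapse whitespace
    year = era_to_western(record["PUBDATE"])
    return {
        "dc:title": title,
        "dc:date": str(year) if year is not None else record["PUBDATE"],
        "vendor:date_original": record["PUBDATE"],  # keep the source value
    }


print(era_to_western("乾隆三十年"))  # 1765
```

Keeping the vendor's original date string next to the converted year mirrors the idea that a transformation should remain checkable against its source.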

Structurally, the repository organizes content at a granular level. Books and journals are divided into smaller components, such as chapters and pages, with metadata assigned appropriately to ensure logical relationships between materials. Technically, the ITR is built on Fedora Commons Repository software with a Solr search index layer. This architecture allows for flexible data modeling, preservation of relationships between objects, and sophisticated versioning capabilities. The system meticulously maintains both the original files exactly as received from vendors and the standardized versions created during processing, ensuring researchers can always verify transformations against source materials.
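The granular structure described above can be pictured with a small Python sketch. This is an illustration under assumptions, not the actual Fedora data model or Solr schema used by the ITR: the class name `RepoObject`, its fields, and the flat document layout are hypothetical. It shows a book split into chapters and pages, each object carrying both the original file and the standardized text, and then flattened into one search document per object with a link back to its parent.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class RepoObject:
    """One node in a book > chapter > page hierarchy (hypothetical model)."""
    object_id: str
    kind: str                       # "book", "chapter", or "page"
    metadata: dict = field(default_factory=dict)
    original: bytes = b""           # file exactly as received from the vendor
    normalized: str = ""            # standardized text created in processing
    children: list = field(default_factory=list)

    def original_checksum(self):
        """Fingerprint of the untouched vendor file, for later verification."""
        return hashlib.sha256(self.original).hexdigest()


def flatten(obj, parent_id=None):
    """Yield one flat index document per object, keeping the parent link."""
    yield {
        **obj.metadata,
        "id": obj.object_id,
        "kind": obj.kind,
        "parent": parent_id,
        "text": obj.normalized,
    }
    for child in obj.children:
        yield from flatten(child, obj.object_id)


# Build a tiny example: one book, one chapter, two pages.
page1 = RepoObject("book1/ch1/p1", "page", original=b"<p>raw 1</p>", normalized="raw 1")
page2 = RepoObject("book1/ch1/p2", "page", original=b"<p>raw 2</p>", normalized="raw 2")
chapter = RepoObject("book1/ch1", "chapter", {"title": "Chapter 1"}, children=[page1, page2])
book = RepoObject("book1", "book", {"title": "Example Book"}, children=[chapter])

docs = list(flatten(book))
print(len(docs))  # 4
```

Because each page keeps its parent identifier, a full-text hit on a single page can be traced back to its chapter and book, and the stored checksum lets a reader confirm that the original file has not changed.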

CrossAsia ITR workflow (Designed with mermaid.js by the author)

This setup opens vast possibilities for researchers. By standardizing inconsistent data and integrating international frameworks, the repository makes it easier to analyze patterns across massive amounts of text, find hidden connections between materials, and apply text-mining techniques that transcend traditional reading (we will dive deeper into CrossAsia’s dedicated DH tool set in Part 2). Today, according to their website, the CrossAsia Integrated Text Repository (ITR) contains approximately 418,000 titles with 67.2 million pages that can be searched through their Full Text Search function. These span multiple Asian languages alongside Western-language materials focused on Asian studies—creating a substantial collection for researchers worldwide.

Breaking Down Language Barriers Through AI

One of the interesting developments at CrossAsia is how they are using artificial intelligence to address language barriers. When researching topics across Asia, relevant materials often exist in multiple languages. “The basic question is, if the user wants something and doesn’t know the language, they’ll never find it,” explains Hou-leong Ho. “So, what we can do now is use AI to facilitate that”.

While AI translation tools supplement traditional research methods, they cannot replace the need for language expertise and human understanding of cultural context. What they do provide is a new layer of accessibility, helping researchers discover relevant materials they might otherwise miss entirely due to language barriers. CrossAsia is in the process of implementing several AI capabilities within their system. For collections that previously had minimal organizational structure, their AI tools generate descriptive metadata. For example, when a large collection of historical photographs lacks detailed labelling, the system can analyze image content and suggest descriptive tags, dates, locations, and categories. This automated enrichment creates access points that would require prohibitive amounts of human labour to produce manually. It also makes previously unsearchable image content findable through multilingual text queries.

Infrastructure First: CrossAsia’s Collaborative Approach

CrossAsia’s philosophy centres on providing infrastructure rather than directly serving individual researchers. Instead of trying to anticipate possible research questions, they build flexible tools that can support diverse inquiries.

“We are non-experts of our data,” explains the CrossAsia team, highlighting an important aspect of their work. Unlike traditional subject librarians who might specialize in specific content areas, CrossAsia staff focus on creating systems and tools that work across multiple domains. They create the foundation and framework that enables specialized research.

This infrastructure-first approach shapes how they design their digital humanities tools. Rather than creating highly specialized applications for narrow research questions, they develop generic toolboxes that can be adapted to specific fields. Their systems are designed with flexibility in mind, allowing them to serve diverse research contexts. This generalist approach only works because CrossAsia actively collaborates with domain experts. They seek scholars with specialized knowledge to help guide system development and test applications in real research scenarios.

“We can only develop infrastructure that meets actual research needs by working closely with the experts who articulate those needs,” notes Antje Ziemer. This collaborative model has been central to their approach, allowing them to build systems that adapt to specific domains.

Navigating Copyright and Licensing Challenges

Working with digital materials means navigating complex copyright restrictions. CrossAsia has developed experience with specialized license agreements that include provisions for text and data mining while respecting publishers’ rights. The CrossAsia team has built credibility with content owners through responsible stewardship. “Our license providers trust us,” notes Matthias Kaun, which enables more flexible usage terms than standard agreements might allow. 

While CrossAsia offers invaluable resources, access restrictions apply due to funding policies and licensing agreements. Full access is currently limited to researchers affiliated with German universities. This reflects the project’s funding through the German Research Foundation rather than any technical limitation. However, all researchers have full access to the catalogue search (over 250 million bibliographic records), the ITR Full Text Search, and the continuously growing CrossAsia Lab (more on this in Part 2).

CrossAsia also hopes to coordinate with other European institutions in the future to share acquisition costs. “We purchase one set and maybe Leiden or the UK purchases another set,” explains Matthias Kaun. This collaborative approach stretches limited budgets while maximizing available resources.

Looking Forward

Despite its significant contributions to (digital) Asian studies, CrossAsia seems to fly under the radar internationally. While the platform conducts public outreach and maintains international collaborations, there is potential for broader international engagement. This relatively limited recognition stands in contrast to the platform’s sophisticated technical infrastructure and vast resources. As digital humanities continue to evolve globally, CrossAsia’s innovative approach to data management and research support could serve as a model for similar initiatives worldwide.

At its heart, CrossAsia remains committed to making diverse resources accessible to researchers regardless of linguistic or geographic barriers. Through technological innovation, thoughtful curation, and institutional collaboration, it has established itself as an important platform for anyone engaged in research on Asia.

For digital humanities enthusiasts, CrossAsia offers a view of what is possible when cultural heritage institutions embrace technology. You do not need to be a technical expert to benefit from their work—just a curious mind and some exciting research questions.

If you have domain expertise in Asian studies, consider reaching out—CrossAsia’s collaborative model depends on experts like you to share research data and help shape the future of these digital tools.


Additional reading

Matthias Kaun et al., “CrossAsia-ITR (Integriertes Textrepositorium) – Ziele, Aufbau, Technik,” ABI Technik 39, no. 4 (2019): 303-310. https://doi.org/10.1515/abitech-2019-4007

“CrossAsia Blog,” CrossAsia, https://blog.crossasia.org/kategorie/itr-und-entwicklungen/
