Kitāb or kitaab: improving search and discovery for global languages

Written by Lara Salha, Harrison Pim, James Gorrie, and Adrian Plau.

Wellcome Collection holds a diverse collection of multilingual items, ranging from library, archive and manuscript collections to a variety of audio-visual materials. Truly global in its breadth, with materials from every continent and spanning more than two millennia of history, the collection presents complex challenges of discoverability and accessibility. A key challenge is that of ensuring users can find these global materials by searching our website. Cataloguing metadata for global languages typically come with a whole host of different practices of transcription and transliteration, many of which are meaningless beyond research communities. So how do we make sure all users can find what we have? 

Language/script challenges 

One of the problems related to search in the Arabic language lies in the transliteration of the definite article al/el/ال. This particle renders a noun definite when prefixed to the word. For example, kitāb (كتاب) means ‘book’, while al-kitāb (الكتاب) means ‘the book’.

The Arabic language has 28 letters, all of which are consonants. Of these, 14 are considered sun letters (حروف شمسية‎) and the other 14 are considered moon letters (حروف قمرية). The distinction depends on how each consonant interacts with the definite article. When the definite article precedes a sun letter, it is assimilated and loses its distinctive sound. So while a word is written, and transliterated letter for letter, as al-rajul الرّجل (the man), the letter following the article is a sun letter and the grammatically correct pronunciation is ar-rajul. The article’s distinctive ‘al/el’ sound has been absorbed into the word.

The problem is that while one cataloguer may have transliterated a word as it is written, for instance al-tib الطب (medicine), a researcher or another cataloguer may have transliterated it as it should be pronounced, at-tib. A researcher could therefore enter either al-tib or at-tib and see only the records matching that particular variant, even though both are exactly the same word.
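To make the al-tib/at-tib split concrete, here is a minimal Python sketch that generates both romanisations of a definite noun. It is an illustration only, not part of our search pipeline: the function name `article_variants` is hypothetical, and it assumes the word is already romanised with the common Latin spellings of the sun letters (t, th, d, dh, r, z, s, sh, ṣ, ḍ, ṭ, ẓ, l, n).

```python
import unicodedata

# Assumed Latin onsets of the 14 sun letters. Digraphs are checked
# first so that "shams" assimilates as "ash-" rather than "as-".
SUN_DIGRAPHS = ("th", "dh", "sh")
SUN_SINGLES = tuple(
    unicodedata.normalize("NFC", s)
    for s in ("t", "d", "r", "z", "s", "l", "n", "ṣ", "ḍ", "ṭ", "ẓ")
)


def article_variants(word: str) -> list[str]:
    """Return the written and the assimilated romanisation of 'the <word>'."""
    word = unicodedata.normalize("NFC", word.lower())
    variants = [f"al-{word}"]  # transliterated as written
    for onset in SUN_DIGRAPHS + SUN_SINGLES:
        if word.startswith(onset):
            assimilated = f"a{onset}-{word}"  # transliterated as pronounced
            if assimilated not in variants:
                variants.append(assimilated)
            break
    return variants


print(article_variants("tib"))    # ['al-tib', 'at-tib']
print(article_variants("rajul"))  # ['al-rajul', 'ar-rajul']
print(article_variants("kitab"))  # ['al-kitab'] (k is a moon letter)
```

A record catalogued under one variant and searched for under the other differs by a single character, which is the kind of gap a search needs to bridge.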

In addition to grammatical components that can affect catalogue searches, transliteration itself presents some difficulties for the researcher. One of the examples we used was the word nuğum/nujum/نجوم, meaning ‘stars’. There are various standards for transliterating, or romanizing, Arabic text, and the letter ج is romanized differently depending on the standard. DIN 31635 and ISO 233 use the diacritic ğ, which appears to be what cataloguers at Wellcome were previously using. Unfortunately, for native Arabic speakers, the letter ğ does not correspond to the sound ج produces, which is a soft j most of the time (especially as ğ, also known as yumuşak ğ, is known as a soft g with a more ‘y’-like sound in Turkish and Azeri). The Hans Wehr transliteration system, as well as the Library of Congress, lean more towards the English letter ‘j’, which arguably makes more sense for native speakers and students of Arabic.

Similar challenges abound for South Asian languages and scripts. For instance, the Devanagari script is commonly used for a large variety of languages, such as Hindi, Marathi, Nepali, and English, but transliteration schemes developed for Sanskrit might still be applied across all of these. The results can be alienating and confusing, rendering names of authors and publishing houses unintelligible. It might also give rise to inconsistencies.  

The challenges in transliterating both Arabic and South Asian languages point to a wider discussion surrounding whom the records are intended to serve. Most transliteration schemes featuring diacritics are wholly incomprehensible outside of academic circles, so why should material that might otherwise be readily available to language users be hidden behind an unnecessary border? If diacritics-driven transliteration schemes are annoying and/or unnecessary to people who use the language and needlessly alienating to those who don’t, why are they needed? 

Solutions

There are a few information retrieval tricks we could have deployed to help with the problem, search synonyms for example. As much as this might have helped, it would have required a thorough list of all the transliteration variants introduced over the years the collection has grown, which would be a substantial manual intervention.

Instead, we took an intention-driven approach: trying to satisfy the information need of someone searching for things that might have alternative spellings, staying closer to what a researcher’s intent might be.

With this in mind, fuzzy searching seemed more fit for purpose. 

With the examples above collated, we added them to our search tests and added a fuzzy matching clause to our structured search query. At first this seemed to work, but we ended up with some unwanted typo matches sneaking in, e.g. a search for “stimming” had top results matching on “swimming”. To fix this we boosted exact matches to rank higher than fuzzy matches, added these cases to our test suite to ensure we maintained this balance, and released the query to our collections search.
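As a rough illustration of boosting exact matches over fuzzy ones, the sketch below builds a query body in the style of the Elasticsearch query DSL. This is an assumption made for the example: the post does not name the search engine, and the field name `title`, the boost value of 2.0 and the helper `build_query` are all hypothetical rather than our production settings.

```python
import json


def build_query(text: str, field: str = "title", exact_boost: float = 2.0) -> dict:
    """Combine a boosted non-fuzzy clause with a fuzzy clause."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Non-fuzzy match, weighted so that it outranks fuzzy hits.
                    {"match": {field: {"query": text, "boost": exact_boost}}},
                    # Fuzzy match; "AUTO" scales the allowed edits with term length.
                    {"match": {field: {"query": text, "fuzziness": "AUTO"}}},
                ],
                "minimum_should_match": 1,
            }
        }
    }


print(json.dumps(build_query("nujum"), indent=2, ensure_ascii=False))
```

A record that contains the query term as typed satisfies both clauses and so accumulates a higher score than a record that only matches through the fuzzy clause, which is the balance the tests are meant to protect.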

Further balance required 

We received some very positive feedback from people searching our transliterations. There was even the occasional bit of feedback about us returning typos. But within a few days we started receiving feedback on unexpected results popping up in many areas of our collection. “Parrots” appeared when someone searched for “Carrots”: it is easy to see why this happened (replacing C with P gives a Levenshtein distance of 1) and why it is wrong. But more nuanced, problematic examples cropped up too, such as “monsters” matching “ministers”.
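The distances involved can be checked with a short Python sketch of the standard dynamic-programming Levenshtein algorithm. It is a textbook implementation for illustration, not the code our search engine runs.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


pairs = [
    ("al-tib", "at-tib"),      # wanted: same word, two transliterations
    ("carrots", "parrots"),    # unwanted: different words
    ("stimming", "swimming"),  # unwanted: different words
    ("monsters", "ministers"), # unwanted: different words
]
for left, right in pairs:
    print(left, right, levenshtein(left, right))
# al-tib at-tib 1
# carrots parrots 1
# stimming swimming 1
# monsters ministers 2
```

The wanted variant pair and two of the unwanted pairs all sit at a distance of 1, so edit distance alone cannot tell a transliteration variant from a different word.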

We are currently tightening up the fuzziness so that it aligns with the intention of searching for alternative spellings rather than matching typos, since our researchers expect us to err on the side of precision over recall. But we are getting better, and by applying this process to other areas, such as multilingual search, we hope to continue improving our search more broadly while maintaining the quality of our transliterated search.
