The Sanskrit WordNet: a new database for the study of the Sanskrit lexicon

This is a guest post by Erica Biagetti

WordNets are valuable resources in linguistics: they are structured, comprehensive databases that serve as tools for exploring the lexicon of a language. Inspired by cognitive science, WordNets are designed to capture the nuances of word meanings and their relationships in a systematic manner. Since the creation of the Princeton WordNet for English in the 1980s (Fellbaum 1998), WordNets have been developed for languages all around the world. The Global WordNet Association promotes the standardization of existing WordNets for all languages, as well as the development of guidelines and methodologies for building WordNets in new languages.

WordNet’s Architecture

WordNets are lexical databases in which information is stored relationally. They comprise nodes for nouns, verbs, adjectives, and adverbs, with which meanings are associated in the form of synsets, i.e., sets of synonyms identified by an ID number, a brief definition and, optionally, example sentences. Different lemmas can share one or more synsets, which means that they are (partly) synonymous. Synsets are established on the basis of a weakened notion of synonymy, which does not imply that two synonymous words are completely interchangeable, only that they can be substituted for one another in some contexts. In the Princeton WordNet 3.1, for example, all the following lemmas are associated with the synset n#15193837 | the first light of day (“we got up before dawn”; “they talked until morning”):

  1. dawn, dawning, morning, aurora, first light, daybreak, break of day, break of the day, dayspring, sunrise, sunup, cockcrow.

At the same time, a single lemma can be associated with more than one synset, which means that it is polysemous. Polysemy is the capacity of a word to have multiple related meanings and should not be confused with homonymy, which is an accidental similarity in form between two or more words (such as bear, the animal, and the verb bear). For instance, the noun morning is associated with three additional synsets in the Princeton WordNet:

  2. n#15190336 | the time period between dawn and noon (“I spent the morning running errands”)

     n#06645178 | a conventional expression of greeting or farewell

     n#07340708 | the earliest period (“the dawn of civilization”; “the morning of the world”)
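
The many-to-many relation between lemmas and synsets can be sketched in a few lines of Python. This is only an illustration, not the storage format of any actual WordNet: the names SYNSETS, MEMBERS, synsets_of, share_synset and is_polysemous are hypothetical, and the membership lists are deliberately partial, restricted to lemmas mentioned in this post.

```python
# Glosses keyed by synset ID, as quoted in this post from the Princeton WordNet 3.1.
SYNSETS = {
    "n#15193837": "the first light of day",
    "n#15190336": "the time period between dawn and noon",
    "n#06645178": "a conventional expression of greeting or farewell",
    "n#07340708": "the earliest period",
}

# Partial membership lists (assumption: only lemmas named in this post appear here;
# the real synsets may contain further lemmas).
MEMBERS = {
    "n#15193837": ["dawn", "dawning", "morning", "aurora", "daybreak", "sunrise"],
    "n#15190336": ["morning"],
    "n#06645178": ["morning"],
    "n#07340708": ["morning"],
}


def synsets_of(lemma):
    """Return the sorted IDs of all synsets that contain the lemma."""
    return sorted(sid for sid, lemmas in MEMBERS.items() if lemma in lemmas)


def share_synset(lemma_a, lemma_b):
    """Two lemmas are (partly) synonymous if they share at least one synset."""
    return bool(set(synsets_of(lemma_a)) & set(synsets_of(lemma_b)))


def is_polysemous(lemma):
    """A lemma is polysemous if it belongs to more than one synset."""
    return len(synsets_of(lemma)) > 1


# Print every sense recorded here for the lemma "morning".
for sid in synsets_of("morning"):
    print(sid, "|", SYNSETS[sid])
```

Storing membership on the synset side keeps synonymy and polysemy as two views of the same table: reading a row gives the synonyms, and collecting the rows that mention a lemma gives its senses.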

WordNet interlinks lemmas, as well as their specific senses. Lemmas are interlinked by means of lexical relations (e.g., derivation, composition), whereas conceptual-semantic relations (e.g., hyponymy-hypernymy, meronymy-holonymy) establish connections among synsets, resulting in a network of meaningfully related words and concepts.

In short, WordNets are electronic dictionaries that, unlike traditional dictionaries, do not list words alphabetically but organize them in a semantic network, mainly because representing words and concepts as an interrelated system is consistent with evidence for how speakers organize their mental lexicons.

The Sanskrit WordNet

The Sanskrit WordNet builds on and extends original work by Oliver Hellwig for the Digital Corpus of Sanskrit, and it is now part of a family of WordNets for ancient Indo-European languages that also comprises Ancient Greek and Latin (Biagetti, Zanchi, and Short 2021). The development and linking of these resources are the objectives of a team of scholars at the University of Pavia, the UCSC Milan and the University of Exeter, in the framework of the newly funded project Linked WordNets for ancient Indo-European languages.

The three WordNets have been designed to be fully interoperable, as well as integrated into the larger ecosystem of digital lexical and textual resources for ancient languages – mainly valency lexica and dependency treebanks. They share the same data organization and the same pool of synsets, enabling comparison of linguistic – above all lexico-semantic – structures cross-linguistically.

In order to provide linguistic resources that are useful not only in the field of linguistic typology, but also in historical linguistics, as well as philological and cultural studies, the WordNet infrastructure has been enriched with family-specific features. First, etymological information is given for each lemma occurring in the database, which will allow users to investigate whether Sanskrit, Ancient Greek, and Latin cognate words lexicalize comparable arrays of concepts. See for instance the annotation associated with the lemma uṣas- ‘dawn’ in the Sanskrit WordNet:

Field        Value
Lemma        uṣas-
Etymology    PIE *h2eus-ōs ‘dawn’
POS          Noun
Morphology   n‐‐‐‐‐f‐‐c

Other language-specific features have been added at the level of lexical relations. In order to represent the rich derivational morphology of Sanskrit and other ancient Indo-European languages, the set of lexical relations has been extended: besides derivation and compounding, which were already included in the Princeton WordNet, the Sanskrit WordNet accounts for antonymy (obtained through prefixation of privative a(n)-), parasynthesis (simultaneous conversion and affixation), inclusion (the relation between a multi-word unit and its parts), as well as for the relation holding between a participle and its base verb (e.g. sat- ‘true’ ⟶ as- ‘be’). The following table contains lexical relations between words related to div- ‘sky, heaven’:

Relation      Label             Example
Derivation    is derived from   deva- ‘heavenly, divine’ ⟶ div- ‘sky, heaven’
Compounding   is composed of    dyau-loka- ‘the world of heaven’ ⟶ div- ‘sky, heaven’; ⟶ loka- ‘world’
Antonymy      is privative of   a-deva- ‘not divine, impious’ ⟶ deva- ‘heavenly, divine’
Inclusion     includes          dyāvā pr̥thivī ‘heaven and earth’ ⟶ div- ‘sky, heaven’; ⟶ pr̥thivī- ‘earth’
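
One way to picture these lexical relations is as labeled, directed edges between lemmas. The sketch below encodes only the examples from the table; the names Relation, RELATIONS, pointing_to and word_family are hypothetical and do not reflect the project's actual data model.

```python
from collections import namedtuple

# A lexical relation is a labeled arrow from one lemma to another.
Relation = namedtuple("Relation", ["source", "label", "target"])

# The edges are exactly the examples given in the table above.
RELATIONS = [
    Relation("deva-", "is derived from", "div-"),
    Relation("dyau-loka-", "is composed of", "div-"),
    Relation("dyau-loka-", "is composed of", "loka-"),
    Relation("a-deva-", "is privative of", "deva-"),
    Relation("dyāvā pr̥thivī", "includes", "div-"),
    Relation("dyāvā pr̥thivī", "includes", "pr̥thivī-"),
]


def pointing_to(target):
    """Return (lemma, label) pairs for lemmas directly related to `target`."""
    return [(r.source, r.label) for r in RELATIONS if r.target == target]


def word_family(root):
    """Collect every lemma that leads to `root` through one or more relations."""
    family, frontier = set(), [root]
    while frontier:
        current = frontier.pop()
        for source, _label in pointing_to(current):
            if source not in family:
                family.add(source)
                frontier.append(source)
    return family
```

Following the arrows backwards from div- also reaches a-deva-, which is linked to div- only indirectly, through deva-.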

Since semantic relations constitute the core of WordNet architecture, the Sanskrit WordNet employs the established set of relations as much as possible, and the introduction of culture-specific synsets is kept to a minimum. This will ensure compatibility of this WordNet with the existing ones.

Finally, as Sanskrit enjoys centuries of attestation and a long tradition of studies, each of the identified senses of a lemma is tagged for its period (listed in 3), literary genre (saṃhitā, brāhmaṇas, vedāṅga, itihāsa, purāṇa, etc.) and loci, i.e., exemplifying attestations referred to by author(s) and work(s). This way, users will be able to track whether and how word meanings change over time and vary across literary genres and authors.

  3. Periods:

     Rigvedic (1700 to 1200 BCE)

     Mantra (1200 to 1000 BCE)

     Vedic Prose (1000 to 300 BCE)

     Epic (300 BCE to 500 CE)

     Classical (500 CE to 1100 CE)
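
As a rough illustration of how this periodization could be used programmatically, the sketch below maps an approximate date to one of the five periods. It rests on several assumptions that are not part of the Sanskrit WordNet itself: senses in the database are tagged with a period directly, not with a year; the names PERIODS and period_of are hypothetical; BCE years are written as negative integers; and each period is treated as including its start date but not its end date, so that a shared boundary such as 1200 BCE falls in the later period.

```python
# Period boundaries as listed in (3). BCE years are negative integers
# (a simplification that ignores the absence of a year 0).
PERIODS = [
    ("Rigvedic", -1700, -1200),
    ("Mantra", -1200, -1000),
    ("Vedic Prose", -1000, -300),
    ("Epic", -300, 500),
    ("Classical", 500, 1100),
]


def period_of(year):
    """Return the period containing `year`, or None if it lies outside all periods.

    Intervals are half-open: start <= year < end (an assumption of this sketch).
    """
    for name, start, end in PERIODS:
        if start <= year < end:
            return name
    return None
```

For example, period_of(-1500) returns "Rigvedic" and period_of(600) returns "Classical".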

A new theoretical framework for semantic annotation

An innovation of the Sanskrit WordNet is that the lexicographic work is framed within a cognitive linguistic approach, which requires assuming a principled view on polysemy. This entails i) avoiding the proliferation of distinct senses associated with a lemma, and ii) assuming that all senses of a lemma can be organized in a structured semantic network. Consequently, a clear distinction is made between literal and non-literal senses. Literal senses are detected based on their early attestation, concreteness, and predominance in the network. Non-literal senses, conversely, are derived from literal ones through cognitive processes like metaphor and metonymy. According to Cognitive Linguistics, the difference between metonymy and metaphor lies in what can be called conceptual contiguity: in metonymy, the senses associated with the polysemous word belong to the same conceptual domain, whereas in metaphor two senses belonging to different conceptual domains are mapped onto one another. For instance, in the Sanskrit WordNet, three senses are associated with the noun div- (/dyu-):

  4. a. n#09459612: the atmosphere and outer space as viewed from the earth

     b. n#15190004: the time after sunrise and before sunset while it is light outside

     c. n#06879763: god of the sky

The sense in 4a is the literal one, as the noun div- is morphologically related to the root √div- ‘to shine, be bright’ and so the sky is literally ‘the shining one’. The sense in 4b derives from 4a as the result of a metonymic process: the word indicating the sky as the ‘shining one’ is used to refer to the period of time during which there is light outside. Finally, in 4c, div- refers to the god of the sky: this is a case of personification, a type of metaphor that consists in attributing volitional behavior to objects or abstractions (cf. Evans 2005).
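
The structured sense network for div- can be sketched as a small set of records in which every non-literal sense points back to the sense it derives from. The names Sense, DIV_SENSES, literal_senses and derivation_path are hypothetical, and linking 4c directly to 4a is an assumption of this sketch: the discussion above names the source sense only for 4b.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sense:
    synset_id: str
    gloss: str
    derived_from: Optional[str] = None  # synset ID of the source sense, if any
    process: Optional[str] = None       # cognitive process behind the derivation


# The three senses of div- given in (4).
DIV_SENSES = {
    "n#09459612": Sense(
        "n#09459612",
        "the atmosphere and outer space as viewed from the earth",
    ),
    "n#15190004": Sense(
        "n#15190004",
        "the time after sunrise and before sunset while it is light outside",
        derived_from="n#09459612",
        process="metonymy",
    ),
    "n#06879763": Sense(
        "n#06879763",
        "god of the sky",
        derived_from="n#09459612",  # assumed source sense, see the note above
        process="metaphor (personification)",
    ),
}


def literal_senses(senses):
    """Literal senses are those not derived from any other sense."""
    return [s for s in senses.values() if s.derived_from is None]


def derivation_path(senses, synset_id):
    """Follow derived_from links from a sense back to its literal source."""
    path = [synset_id]
    while senses[path[-1]].derived_from is not None:
        path.append(senses[path[-1]].derived_from)
    return path
```

Keeping the source sense and the process on each record makes the network explicit: the literal sense is the only node without an incoming derivation, and every other sense can be traced back to it.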

A collaborative project

The Linked WordNets for ancient Indo-European languages project has just started and the project members are now harmonizing the data already contained in the three resources, fine-tuning the annotation interface and drafting the annotation guidelines. Once this first phase is completed, the project will open up to the scientific community, allowing interested parties to personally annotate one or more semantic fields of interest for their research. Anyone interested in investigating a particular conceptual domain will be able to contact the developers and ask for credentials to access the annotation interface. Data annotated by external parties will be validated by project members and each annotator will be given full credit for their work. 

That’s all for now. In the next post, I will present a case study to demonstrate how WordNets can be used in tandem with other language resources to answer specific research questions – stay tuned if you want to know more! In the meantime, feel free to reach out to inquire about becoming part of the team.

References

Biagetti, Erica, Chiara Zanchi, and William Michael Short. 2021. “Toward the creation of WordNets for ancient Indo-European languages.” In Proceedings of the 11th Global Wordnet Conference, 258–266. University of South Africa (UNISA). Global Wordnet Association.

Evans, Vyvyan. 2005. “The Meaning of Time: Polysemy, the Lexicon and Conceptual Structure.” Journal of Linguistics 41(1): 33–75.

Fellbaum, Christiane (ed.). 1998. WordNet: An electronic lexical database. Cambridge, MA: MIT Press.