Genius loci: extracting names and places from Japanese texts

This guest contribution is written by Anna Oskina.

Introduction

Among other things, the digital humanities deals with individual items that can be easily counted. When these items are counted across a relatively large number of texts, we can look at their distribution over the corpus and interpret it. Proper names are well suited to this for several reasons. First, like most units in language, they are clearly defined in linguistics and easy to extract and count; there are many tokenizers for Japanese, including Fugashi and MeCab, that let us find and extract them. Second, they are likely to carry semantic meaning in fiction, because they are associated with specific cultures, especially when it comes to foreign names and places. This post is a tutorial on automatically extracting personal and place names from Japanese texts and creating word clouds with Python.

Gathering

A basic operation in DH textual research is gathering a representative sample of texts for analysis. For my data, I used the Japanese digital library Aozora Bunko, which includes copyright-free books and works that authors make freely available. The choice of this library rests on two factors. First, according to its index, Aozora Bunko held 18,656 items by more than 2,000 authors as of November 2nd, 2022, including biographies, diaries, travelogues, articles, and Japanese-language translations. It is a huge resource for gathering data. Second, the index is well structured and available as a CSV file with all metadata and links to each text as a txt file, which makes it easy to download the necessary files automatically.

The corpus I use for this exercise is limited to the Meiji (1868-1912) and Taisho (1912-1926) periods and does not include translations. I got 3,452 files equating to a total of 3,431 works (21 texts consist of two files).
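A filtering step like this can be sketched against the index CSV. The following is a minimal illustration with a hypothetical miniature of the index; the real Aozora Bunko CSV has different (Japanese) column headers and many more fields, so the column names here are assumptions:

```python
import csv
import io

# Hypothetical miniature of the index CSV; real headers and fields differ.
sample = (
    "title,author,year,text_url\n"
    "坊っちゃん,夏目漱石,1906,https://example.com/bocchan.zip\n"
    "雪国,川端康成,1935,https://example.com/yukiguni.zip\n"
)

MEIJI_TAISHO = range(1868, 1927)  # Meiji (1868-1912) and Taisho (1912-1926)

# Keep only the rows whose publication year falls in the chosen periods
selected = [row['title'] for row in csv.DictReader(io.StringIO(sample))
            if int(row['year']) in MEIJI_TAISHO]
print(selected)  # → ['坊っちゃん']
```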

Extracting

I used Python 3.8 to write my code and took advantage of Fugashi, a Japanese tokenizer and morphological analysis tool. James Harry Morris wrote about Fugashi and Japanese text segmentation in an earlier post. Fugashi splits a text into lemmas and accompanies each with information on its morphological features. Thus, we can extract tokens marked as proper nouns ('固有名詞'), and distinguish personal names and place names, marked as '人名' and '地名' respectively.

How do we do this? First, we need to import the necessary Python libraries. The os module lets us work with the many files in a folder. Regular expressions (re) help us strip ruby (readings of characters) from a text. We should remove ruby because, if we keep it, we artificially inflate the counts of names and places that carry furigana. Counter is designed for counting hashable objects (in our case, names and places).

from fugashi import Tagger
import os                          # to work with the files in a folder
import re                          # to strip ruby with a regular expression
from collections import Counter    # necessary to count words in lists

I downloaded my text files into a folder. The following function opens and reads a file, removing the ruby:

def read_file(path, filename):
    with open('{}/{}'.format(path, filename), 'r', encoding='utf-8') as f:  # check your encoding; it might be 'Shift-JIS'
        text = f.read()
        text_without_ruby = re.sub('《.+?》', '', text)
    return text_without_ruby

The following two functions, which look quite similar to each other, extract places and names from a text. First, place names:

def extract_places(text, tagger):
    list_places = []
    for word in tagger(text):  # calling the tagger on the text yields its parsed words
        if word.feature.pos3 == '地名':  # find all lemmas marked as '地名'
            list_places.append(word.surface)  # and save the word's surface form in the list
    return list_places

Second, personal names:

def extract_names(text, tagger):
    list_names = []
    for word in tagger(text):
        if word.feature.pos3 == '人名':
            list_names.append(word.surface)
    return list_names

I created a dictionary to merge different variants of spelling (e.g., ‘ヨーロッパ,’ ‘欧羅巴,’ ‘欧洲,’ and ‘欧州’). The key feature of the dictionary is that each variant spelling is assigned a value that normalizes the name. In other words, we are able to unite all variants (‘欧州’ : ‘ヨーロッパ,’ ‘Europe’ : ‘ヨーロッパ’). You can decide whether you need to merge ‘大和’ and ‘日本,’ or ‘支那,’ ‘唐,’ and ‘中国.’ For some researchers the difference between these nouns is critical, and they wouldn’t need merging; for instance, if you are investigating how often these terms are written in katakana, kanji, or romaji, you wouldn’t want to merge them. Nevertheless, since the purpose of my research is to determine the geographical breadth of the corpus, ‘独逸,’ ‘ドイツ,’ ‘德国,’ and ‘Germany’ must be recognized as the same country (merged) in my list. The dictionary is as follows:

normalized_dict = {'大和 ': '日本', '欧州' : 'ヨーロッパ', 'Europe' : 'ヨーロッパ', '欧羅巴' : 'ヨーロッパ', '欧洲' : 'ヨーロッパ', '露西亜': 'ロシア', '露国': 'ロシア', 'Russia': 'ロシア', '亜米利加': 'アメリカ', '米国': 'アメリカ', 'America': 'アメリカ', '支那': '中国', 'China': '中国', '唐': '中国', 'シナ': '中国', '英国': 'イギリス', '印度': 'インド', '仏蘭西': 'フランス', 'France': 'フランス', '仏国': 'フランス', '巴里': 'パリ', 'Paris': 'パリ', '独逸': 'ドイツ', '德国': 'ドイツ', 'Germany': 'ドイツ', '倫敦': 'ロンドン', 'London': 'ロンドン', '伯林': 'ベルリン', 'Berlin': 'ベルリン', '伊太利亜': 'イタリア', 'イタリイ': 'イタリア', 'イタリー': 'イタリア'}

The next function replaces each place name in our list with the variant we chose in the dictionary.

def normalize_place(place):
    try:
        return normalized_dict[place]
    except KeyError:  # the place has no entry in the dictionary, so keep it as is
        return place
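The same lookup can also be written without exception handling by using dict.get, which falls back to a default when the key is missing. A minimal sketch, with an abridged version of the dictionary:

```python
normalized_dict = {'欧州': 'ヨーロッパ', '巴里': 'パリ'}  # abridged for illustration

def normalize_place(place):
    # dict.get returns the normalized form, or the place itself if no variant is listed
    return normalized_dict.get(place, place)

print(normalize_place('欧州'))  # → ヨーロッパ
print(normalize_place('京都'))  # → 京都 (no entry, returned unchanged)
```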

Finally, we arrive at the main chunk of code:

path = '<path to folder>'  # this is the path to your folder of text files
files = os.listdir(path)
places_tally = []
names_tally = []
tagger = Tagger('-Owakati')  # initialize fugashi's tagger once, here
for filename in files:
    try:
        text = read_file(path, filename)
        print(filename)
        list_places = extract_places(text, tagger)
        clean_list_places = []
        for place in list_places:  # normalize place names according to your dictionary
            clean_list_places.append(normalize_place(place))
        list_names = extract_names(text, tagger)
        places_tally.extend(clean_list_places)
        names_tally.extend(list_names)
    except UnicodeDecodeError:  # some files on Aozora cannot be decoded; catch the error so the loop is not interrupted
        print('UnicodeDecodeError:', filename)
with open('list_places_tally.txt', 'w', encoding='utf-8') as f:  # count the places with Counter and write the result to a file
    for word, frequency in Counter(places_tally).most_common():
        f.write(word + ',' + str(frequency) + '\n')
with open('list_names_tally.txt', 'w', encoding='utf-8') as f:
    for word, frequency in Counter(names_tally).most_common():
        f.write(word + ',' + str(frequency) + '\n')

Running this code opens, reads, and parses each of the texts I chose for my research with fugashi before extracting the names and places. Personal and place names are saved in two separate txt files in the folder next to your code. These files consist of the extracted proper nouns (personal names or places) and their frequency in the corpus. They can be used for further quantitative analysis or to create visualizations such as a word cloud.
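Since each line of the output is simply "word,count", the tally can be read straight back into a dictionary for such analysis. A sketch, with a few hypothetical lines standing in for a real tally file:

```python
import csv

# Hypothetical sample standing in for the file written by the script above
with open('list_places_tally.txt', 'w', encoding='utf-8') as f:
    f.write('日本,12703\nヨーロッパ,1069\nパリ,941\n')

# Each line is "word,count"; rebuild the frequency dictionary from it
with open('list_places_tally.txt', encoding='utf-8') as f:
    tally = {word: int(count) for word, count in csv.reader(f)}

print(max(tally, key=tally.get))  # → 日本
```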

The complete list of place names.

Generating a Word Cloud

I used the guide on Turnip 2 to generate a word cloud from my results.

First, I installed the wordcloud library in the terminal:

$ pip3 install wordcloud

Next, I saved the following code as make_word_cloud.py in the folder with my result lists. This code must be run from the terminal. The function make_cloud() expects the user to pass the name of the input file (in our case list_names_tally.txt or list_places_tally.txt), the name of the output file (any_name.png), and the path to the font which we will install before running this code. The function reads our input file as CSV and rewrites the data into a dictionary. The word cloud parameters (background_color, width, height, etc.) are preset, and I wouldn't recommend changing them.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import csv
import sys
from wordcloud import WordCloud

def make_cloud():
    if len(sys.argv) != 4:
        sys.exit("Usage: $ python %s <input_filename> <output_filename> <font_path>" % sys.argv[0])
    reader = csv.reader(open(sys.argv[1], 'r', encoding='utf-8', newline='\n'))
    d = {}
    for k, v in reader:
        d[k] = int(v)

    wordcloud = WordCloud(background_color="white",
                          font_path=sys.argv[3],
                          width=800, height=600, max_words=1000,
                          relative_scaling=1,
                          normalize_plurals=False).generate_from_frequencies(d)
    wordcloud.to_file(sys.argv[2])

if __name__ == '__main__':
    make_cloud()

Finally, I opened the terminal in the folder where I keep my files (make_word_cloud.py and the tally files) and entered the following commands:

$ sudo apt install fonts-takao # install Japanese fonts
$ python3.8 make_word_cloud.py list_places_tally.txt cloud_places.png "/usr/share/fonts/truetype/takao-mincho/TakaoMincho.ttf"

These commands run the code saved in make_word_cloud.py, feed it the results from list_places_tally.txt, and save a picture of the word cloud as cloud_places.png. Here is the result:

Place name word cloud.

Analysis and Conclusions

My interest in this exercise came from wanting to find out which names of famous people represented foreign countries in Japanese works of fiction of the Meiji-Taisho era. The frequency of names and places reflects the writers' interest in foreign cultures. I extracted 7,368 unique places and 18,198 unique names. While Japan (12,703 entries) and its major cities like Tokyo/Edo (6,043/3,483), Kyoto (1,880), and Osaka (1,463) remain the main scene of action in these texts, we can see that neighbouring countries – China (6,163), India (1,832), Tibet (1,494), and Korea (809) – also appear with high frequency. We can also see a notable interest in America (1,513) and Europe (1,069), where the most frequent countries are France (1,643), England (1,348), Russia (1,056), and Germany (1,017). We can then add Paris (941), London (496), and Berlin (169) as metonyms for these countries. Italy and Rome (315), Greece (206) and Athens (29), Spain (132) and Madrid (43), Denmark (59), and Switzerland (49) appear less frequently, but still represent Europe and refer a reader to European culture.

People’s names in fiction can play a guiding role. They may signal literature or cultures that influence the main characters, or they may contribute to the atmosphere of the text. I used the term genius loci in the title of this post because a name becomes ‘a spirit of a place’, playing the role of an allusion in a text. The most frequent foreign name that we might associate with Europe in the data set is Christ/Jesus (371/137), which refers a reader to Christianity and is explained by an interest in religions. The name of Mohammad (82) is less frequent, but still attracts attention.

At first glance the names split into four big groups according to country: France, England, Russia, and Germany – as we saw with the place frequencies. These include writers, artists, politicians, scientists, etc., as well as famous characters like Carmen or Mephistopheles. Thus France is represented by Rodin (110), Carmen (38), Racine (37), Zola (36), Cezanne (30), Bergson (24), Baudelaire (23); England by Sherlock Holmes (66), Byron (64), Hamlet (43), Newton (39), Doyle (35), Spencer (32), Cromwell (27), Darwin (27), Robinson Crusoe (25), Milton (23); Russia by Tolstoy (257), Dostoevsky (51), Turgenev (28), Kuropatkin (28), Chekhov (21), or names like Ivan (118) and Nicolai (32); and Germany by Goethe (44), Nietzsche (52), Faust (37), Bismarck (45), Mephistopheles (25).

These are just some of the possibilities of textual analysis using quantitative methods. Proper nouns are a key to understanding a work of fiction, its context, and its allusions. By comparing the use of proper nouns across time periods, we can look at the history of literature from the point of view of geographical and ideological influence, or we can focus on a particular writer and explore his or her interest in foreign cultures. These results also raise questions for cultural studies: we can explore and compare the sets of famous people mentioned in different literatures, asking whether they differ between European and Asian countries. So I encourage you to begin experimenting with extracting proper nouns and making word clouds, as I have explained today, to see where it might take your research.
