This guest contribution is written by Anna Oskina.
Introduction
In my previous post, I showed how to extract proper names and places from Japanese texts and create word clouds with Python. This time I will try a new approach by using the data to help visualize places on maps. I will use Python to aggregate and collect data and R to visualize it. Since this is my first experience using these methods, I will note the difficulties that I faced while I was doing this exercise.
Last time we extracted 7368 place-names, together with their frequencies, from a corpus of Meiji-Taisho period Japanese fiction and saved them in a txt-file. The importance of visualization is clear: we want to see more in the data, and make the data informative and helpful for interpretation. First, I try to visualize the data we got last time to show you the basics of working with geodata. Then, I enrich my data and focus on particular authors and their works.
Data from the first part of this activity.
How to find the coordinates of a place
When we aim to place something on a map we need its coordinates in latitude and longitude. For this purpose I suggest using GeoPy’s Nominatim API. Nominatim provides geocoding based on OpenStreetMap data. It recognizes addresses and places in many languages. To play around, try running this simple Python code, which can also be found on my GitHub:
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent='your_e-mail')
some_place = input("Input your address: ")
location = geolocator.geocode(some_place)
lat = location.latitude
lon = location.longitude
coordinates = []
coordinates.append(lat)
coordinates.append(lon)
print('Lat and Lon of the said address: ', coordinates)
This code allows you to input any address in any language and returns its coordinates as a list. As can be seen in the screenshot of my terminal (below), I first tried inputting an address (‘Seiseki Street, Hongo 6-chome, Bunkyo, Tokyo’). After that I tried the name of a place (‘the University of Tokyo’), and finally I input ‘Tokyo’ in English, Japanese and Russian and got three identical results. Here I faced the first problems: Nominatim does not know the historical names of Japanese places, and a place must be written exactly as it appears in Nominatim’s database. As such, Nominatim will not recognize alternative spellings of Venice like ベニス (‘benisu’), べニース (‘beni:su’) or ヴェネチア (‘venechia’), but only the modern ヴェネツィア (‘venetsia’). If Nominatim does not know a place-name, the programme raises an error, which is why I strongly recommend using the Python methods that I explain below.
Terminal showing searches.
You can also use Nominatim to find an address according to coordinates. To check this feature run this Python code.
Now let’s add coordinates to our list of places (list_of_places.txt). The following Python code reads the file, takes a place from each line and uses Nominatim to find coordinates for that place. If Nominatim recognizes a place-name, the code adds the line, enriched with coordinates, to a new list. Of the 7368 places, Nominatim processed 6967, which is about 94.5%. Unfortunately, after visualization I found some mistakes: in at least two cases Nominatim recognized places but gave wrong coordinates. For example, Nominatim gave the words Yōroppa ‘ヨーロッパ’ (Europe) and Chibetto ‘チベット’ (Tibet) coordinates within Japan.
from geopy.geocoders import Nominatim
import csv

def extract_csv(name_csv):
    with open(name_csv, encoding='utf-8') as r_file:
        data = csv.reader(r_file, delimiter=',')
        data_list = list(data)
    return data_list

def find_coordinates(some_place):
    geolocator = Nominatim(user_agent='annaoskina@gmail.com')
    location = geolocator.geocode(some_place)
    lat = location.latitude
    lon = location.longitude
    coordinates = [str(lat), str(lon)]
    return coordinates

def write_csv(data_list, filename):
    with open('{}.csv'.format(filename), 'w', encoding='utf-8', newline='') as file:
        writer = csv.writer(file)
        writer.writerows(data_list)

name_csv = 'result_tally.csv'
data_new_list = extract_csv(name_csv)
for line in data_new_list:
    if line[3]:  # the place-name sits in the fourth column
        place_name = line[3]
        try:
            line.extend(find_coordinates(place_name))
        except Exception:
            print('Nominatim Error: ', place_name)

filename = 'places_with_coordinates'
write_csv(data_new_list, filename)
print('Done')
As a result of running the above code we now have a table with place names, their frequency, latitude and longitude. Next I added titles to the columns in the places_with_coordinates.csv file. Since I noticed that Europe and Tibet were not located correctly, I manually changed their coordinates.
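Such one-off corrections can also be scripted rather than made by hand. The sketch below is my own: it assumes the layout produced above (place-name in the fourth column, latitude and longitude in the last two), and the replacement coordinates are approximate centres of Europe and Tibet that I chose for illustration:

```python
# Hypothetical manual corrections: place-name -> (lat, lon).
# These coordinates are approximate centres chosen for illustration.
corrections = {
    'ヨーロッパ': ('54.5260', '15.2551'),  # Europe
    'チベット': ('29.6500', '91.1000'),    # Tibet
}

def patch_coordinates(rows, place_col=3):
    """Replace the last two fields (lat, lon) of any row whose
    place-name appears in the corrections table."""
    for row in rows:
        if len(row) >= place_col + 3 and row[place_col] in corrections:
            row[-2], row[-1] = corrections[row[place_col]]
    return rows
```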
CSV file with coordinates.
How to visualize places on a map in R
Now we switch to R. R is a free software environment, and it is a handy instrument for aggregating and visualizing data. If you are not familiar with R, read the documentation here.
We need to install two basic libraries: tidyverse and leaflet. You can install them by entering the following commands in R:
install.packages("tidyverse")
install.packages("leaflet")
Now we are ready to visualize our data. Find the code on my GitHub and below:
library(tidyverse)
library(leaflet)

read_csv("path_to_places_with_coordinates.csv") -> data

pal_bin <- colorBin("Spectral", domain = data$frequency) # a function for coloring categories

data %>%
  leaflet() %>% # calling leaflet
  addTiles() %>% # for more tiles see: http://leaflet-extras.github.io/leaflet-providers/preview/index.html
  # addProviderTiles('Stamen.TonerLite') %>% # example of adding another tile
  addCircles(lng = ~lon,
             lat = ~as.numeric(lat), # this column was read as character while I needed numeric values; this is how I fixed it
             label = ~frequency, # frequency is shown when you hover over a circle
             popup = data$place, # place-names are shown when you click a circle
             opacity = 1,
             radius = ~frequency*100, # the higher the frequency, the bigger the circle
             color = ~pal_bin(frequency)
  ) %>%
  addLegend(pal = pal_bin,
            values = ~frequency,
            title = 'Places from Fiction')
Well done, we have successfully produced an interactive map! But wait, the map looks quite crowded with data!
Screenshot from the map.
It probably isn’t a good idea to use all the data for a single map. Furthermore, it seems that there are still many mistakes. Mistakes occur for several reasons. First, fugashi may incorrectly extract place-names (use additional dictionaries suited to the specificities of your texts, e.g. UniDic). Second, Nominatim is designed to work with the modern names of places. To solve this problem, use your own dictionary of normalized place-names, such as the one that I wrote for a script in my previous post. Since Nominatim is based on OpenStreetMap, you can check a suspect result on OpenStreetMap’s website to see if the location appears on the map as you would expect. Also be wary that Nominatim may find the coordinates of train stations, pubs, shops etc. which share the place-name that you are searching for.
My last piece of advice is to save the attribute of a place if it has one. For example, extract 栃木 (Tochigi) together with the word 県 (prefecture), or accompany the name 嬬恋 (Tsumagoi) with the word 村 (village).
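This advice can be implemented as a small lookup table applied before geocoding. The function and dictionary entries below are my own illustrative sketch, not part of the original script:

```python
# Hypothetical normalization table: bare place-name -> name with its
# attribute attached, which Nominatim resolves more reliably.
PLACE_ATTRIBUTES = {
    '栃木': '栃木県',  # Tochigi -> Tochigi Prefecture
    '嬬恋': '嬬恋村',  # Tsumagoi -> Tsumagoi Village
}

def attach_attribute(place_name):
    """Return the place-name with its attribute (県, 村, ...) if we
    have one on record; otherwise return the name unchanged."""
    return PLACE_ATTRIBUTES.get(place_name, place_name)
```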
I suggest enriching and filtering the data before making a map.
Research Example
In my previous post, I talked about how references to foreign place-names in fiction could act as a signal to the reader. In the Meiji-Taisho periods, Japan’s interactions with other countries opened new horizons in literature, lifestyle and interests. A deep interest was taken in other countries and cultures: we know that Mori Ogai had a passion for German culture, and Nagai Kafu was taken with France. To learn about other places mentioned in different works, we might want to use a visualization such as a map.
In order to make our data more readable, we need to enrich it from the very first step by adding the metadata of the text(s) in which a place-name was found. I improved my script for collecting places and published it on my GitHub. To run this code you need a path to a folder with your corpus in txt-files, which you input in the 43rd line of the code. I shortened the list of places with coordinates, since I want to concentrate on Europe and the USA. Thus, I deleted all places written in kanji and hiragana (Western place-names written in kanji I normalized with the function normalize_place()) and saved the file as places_with_coordinates_shortlist.csv in the same folder as the code. Remember to ensure that you have installed all the necessary libraries: fugashi, regex, collections and python-csv.
This code reads each txt-file, removes rubi from the text (we discussed why this may be necessary in the previous post), extracts the filename, title and author, counts places and adds their coordinates. All this data is saved in a csv-file, result_tally.csv. Now we can be more flexible in visualizing our data.
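The rubi-cleaning step can be sketched like this, assuming the texts carry Aozora-Bunko-style markup (rubi readings in 《 》, the ｜ base marker and ［＃ ］ editorial notes); the published script may differ in detail:

```python
import re

def clean_rubi(text):
    """Strip Aozora-Bunko-style markup (an assumption about the
    corpus format): rubi readings in 《 》, the ｜ marker that
    precedes the base word, and editorial notes in ［＃ ］."""
    text = re.sub(r'《[^》]*》', '', text)    # rubi readings
    text = text.replace('｜', '')             # rubi base markers
    text = re.sub(r'［＃[^］]*］', '', text)  # editorial annotations
    return text
```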
Data from texts by Akutagawa Ryūnosuke.
Let’s try to make a more elegant visualization in R. Now we will filter our data by author and will visualize places mentioned in the works of particular authors. The full code in R is here, and I will explain it in further detail below.
There are two main steps: preparing the data and visualizing it. For these steps I need two libraries: tidyverse and leaflet. When I run library(tidyverse) and library(leaflet), I load them and make them available. Then I read my data into a data frame and named it ‘data.’
library(tidyverse)
library(leaflet)
read_csv("https://raw.githubusercontent.com/annaoskina/Geo_fiction/main/csv/result_tally.csv") -> data
The data frame ‘data’ contains 7 columns and 4761 entries. Remember, I counted the frequency of place-names in each work.
The data frame ‘data.’
Now I need to filter the data by author (e.g. Akutagawa Ryūnosuke ‘芥川龍之介’) and count the total frequency across all of Akutagawa’s works. To do this, I used the functions filter, group_by and summarise. The function str_c joins multiple strings into a single string, collapsed with ‘<br>’ – an HTML line break. I saved the result as a new data frame named ‘Akutagawa.’
data %>%
  filter(author == '芥川龍之介') %>%
  group_by(place_name, lat, lon) %>%
  summarise(total_freq = sum(frequency),
            all_titles = str_c(title_jp, collapse = "<br>")) ->
  Akutagawa
My ‘Akutagawa’ data frame contains 5 columns: place-name, latitude, longitude, total frequency and all titles of the works in which the place-name is mentioned.
Akutagawa data frame.
Now my data is ready for visualization, but one more function is still necessary. In order to have a color palette I added this line:
pal_bin <- colorBin("Spectral", domain = Akutagawa$total_freq)
You can read about coloring numeric and categorical data here.
Next we add tiles, circles and a legend for the data in the ‘Akutagawa’ data frame. The most exciting part is the popup, which shows all the titles when you click on a place on the map.
Akutagawa %>%
  leaflet() %>%
  addTiles() %>%
  addCircles(lng = ~as.numeric(lon),
             lat = ~as.numeric(lat),
             label = ~place_name,
             opacity = 1,
             color = ~pal_bin(total_freq),
             radius = ~total_freq*1000,
             popup = Akutagawa$all_titles) %>%
  addLegend(pal = pal_bin,
            values = ~total_freq,
            title = 'Akutagawa Ryunosuke')
Here is what I got:
Map showing data from Akutagawa Ryūnosuke’s works.
You can view the full functioning map here.
As we found out last time, the most frequent places (and also names) were connected with four countries: France, Germany, England and Russia. For Akutagawa the most attractive place is France (mentioned in 20 works), which is potentially linked to his deep interest in French literature. Nevertheless, my attention is also drawn to Russia. I can see that Russia is mentioned in 10 works. I wonder what special meaning the image of Russia has in these works, and whether there is a common emotional shade or particular atmosphere reflected across them. Akutagawa mentions Russia together with other countries, as in “Mensura Zoili,” uses Russian narrators such as the opera singer Irina Burskaya in “Karmen,” and records incidents between Russian parties such as Tolstoy and Turgenev in “Woodcock.”
The visualization also allows us to see that France and Paris are mentioned in “Woodcock” as well. The connection between different places in fiction is also a very important subject, which is worth further research. In “Woodcock,” France and Paris appear in dialogues and are depicted as part of the personal experiences of Tolstoy and Turgenev.
You can make a map for other writers just by changing a few lines. This visualization allows you to compare writers by their interest in other places and cultures, investigate places mentioned in fiction, and find more and less popular places, which reflect interests in other cultures and the borders of a single work.