Look at the following Japanese sentence from Uchimura Kanzō’s 内村鑑三 Denmarukukoku no hanashi デンマルク国の話 (1911):
今日は少しこの世のことについてお話しいたそうと欲います。Full text accessible here.
One of the first things that someone unfamiliar with the Japanese language may notice is that there are no spaces between the words. Indeed, this is typical of all Japanese writing and governs the way we approach Japanese texts when we want to analyse them digitally. Unlike languages such as English, where word boundaries are clearly marked by the white space that precedes and follows each word, those of us working with Japanese need code or tools to tell our computers and our programs where Japanese words start and finish. Today I will introduce readers to one of the simple ways to approach the process of segmenting Japanese with Python. This is primarily aimed at Python novices who are using a Mac, but it also assumes that the reader has installed Python, knows how to use pip, and knows their way around a text editor.
There are numerous Japanese tokenizers and text segmenters that can be used with Python. Famous amongst these are nagisa, TinySegmenter, JUMAN, and KyTea. I had initially hoped to introduce readers to nagisa or TinySegmenter, which I have successfully used on another computer, but during the composition of this piece I kept running into syntax errors in my own code, so I decided to introduce readers to the tokenizer fugashi instead. fugashi is very easy to use and also highly adaptable, making it a great option for those new to Python.
One can install fugashi through their terminal or console by typing:
pip install fugashi
Fugashi requires a dictionary. Users can use their own dictionary (see documentation on the project’s PyPI page), but for the sake of this tutorial we will use UniDic-Lite which can be installed alongside fugashi by typing the following in one’s terminal or console:
pip install fugashi[unidic-lite]
Now that we’ve installed fugashi, we can start segmenting our text. I created a folder containing a .txt file (the text we scraped from Aozora Bunko 青空文庫 in a previous tutorial) and opened it in my text editor Visual Studio Code. Here I created a new file.
The first thing we need to do is import fugashi, like this:

import fugashi
Then we need to import our text. There are several ways we could do this, but I have opted for the following lines of code.
filename = 'aozoratext.txt'
with open(filename) as file_object:
    text = file_object.read()
In other words, we have told the program to open aozoratext.txt (our .txt file, which can be replaced with another .txt file remembering to change the filename in the code), read it, and use its contents as our text.
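One thing to keep in mind is that open() uses the platform's default encoding unless told otherwise, and Japanese plain-text files downloaded from Aozora Bunko are often encoded in Shift-JIS rather than UTF-8. Here is a minimal, self-contained sketch of reading with an explicit encoding; the sample file it writes first is just a stand-in so the snippet runs on its own:

```python
# Write a small sample file so the example runs on its own; in practice
# aozoratext.txt would already exist with your scraped text in it.
sample = '今日は少しこの世のことについてお話しいたそうと欲います。'
with open('aozoratext.txt', 'w', encoding='utf-8') as f:
    f.write(sample)

# Passing encoding= explicitly avoids surprises when the platform default
# is not UTF-8; for raw Aozora Bunko downloads, try encoding='shift_jis'.
with open('aozoratext.txt', encoding='utf-8') as file_object:
    text = file_object.read()

print(text)
```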
fugashi’s own Sample Code shows us a much easier way to import our text: simply type
text = "[Copy-paste your text here]"
I thought that this would prove clunky for those of us who want to segment a large amount of data, and so opted for a method where we ask the program to read the contents of a file.
Whichever method we used, the next step is to let fugashi do its work. Using fugashi’s Sample Code as our basis, we can add the following two lines of code to get our program to segment our text.
tagger = fugashi.Tagger()
words = [word.surface for word in tagger(text)]
Finally, we want to be able to see the result, so we will add the line:

print(*words)
So the sum total of our code will look something like this:
import fugashi

filename = 'aozoratext.txt'
with open(filename) as file_object:
    text = file_object.read()

tagger = fugashi.Tagger()
words = [word.surface for word in tagger(text)]
print(*words)
We could easily clean this up and add some basic extra functionality, such as adding exceptions for when the .txt file isn’t found. Something like the following perhaps:
import fugashi

filename = 'aozoratext.txt'

try:
    with open(filename) as file_object:
        text = file_object.read()
except FileNotFoundError:
    message = 'The file ' + filename + ' cannot be found.'
    print(message)
else:
    tagger = fugashi.Tagger()
    words = [word.surface for word in tagger(text)]
    print(*words)
Now when we run the code we will be able to successfully segment the words in our text and can thereafter begin to digitally analyse our text.
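As a taste of that analysis, counting word frequencies takes only a few extra lines with the standard library. A sketch using a hypothetical hand-made token list in place of the words list fugashi produces:

```python
from collections import Counter

# A hypothetical segmented sentence standing in for fugashi's output;
# with the code above you would pass the real `words` list instead.
words = ['隣', 'の', '客', 'は', 'よく', '柿', '食う', '客', 'だ']

counts = Counter(words)
# '客' appears twice, everything else once.
print(counts.most_common(2))
```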
Well, at least within the limitations that fugashi imposes. The way fugashi segments words may not be to everyone’s liking. If we go back to the first sentence of Denmarukukoku no hanashi, the output from fugashi looks something like this:
今日 は 少し この世 の こと に つい て お 話し いたそう と 欲 （ おも ） い ます 。
One will note that tsuite ついて is rendered as tsui te つい｜て and imasu います as i masu い｜ます. Many of us would like to treat these as single units, but fugashi (or more correctly its use of UniDic) does not allow us to do this, since it separates the dictionary forms of words from their suffixes. For more on this and on using fugashi more generally, see the paper “fugashi, a Tool for Tokenizing Japanese in Python” by the tool’s creator Paul McCann. In addition to this, it must be noted that fugashi works much better with modern Japanese than with the Japanese used in Uchimura’s text, but it provides us with a good start. If we change the dictionary from UniDic-Lite to a dictionary focusing on Japanese contemporaneous to the text, something like Kindai Bungo UniDic for instance, we will likely get even better results! These issues aside, fugashi is a great tool for segmenting Japanese, especially for those of us just starting out with Python.
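For those who do want larger units, one common workaround is to merge particles and auxiliary verbs back onto the preceding token using the part-of-speech information. A rough sketch of the idea, operating on hypothetical (surface, part-of-speech) pairs of the kind fugashi/UniDic produces (with real fugashi output one would use word.surface and word.feature.pos1 instead):

```python
# Hypothetical (surface, part-of-speech) pairs standing in for fugashi
# output; 助詞 = particle, 助動詞 = auxiliary verb.
tokens = [('つい', '動詞'), ('て', '助詞'), ('い', '動詞'), ('ます', '助動詞')]

merged = []
for surface, pos in tokens:
    # Glue particles and auxiliaries onto the previous unit.
    if merged and pos in ('助詞', '助動詞'):
        merged[-1] += surface
    else:
        merged.append(surface)

print(merged)  # → ['ついて', 'います']
```

Note that this blunt rule would also attach case particles such as は to the preceding noun, which may not be what you want, so in practice you would refine the condition to the specific part-of-speech values you care about.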