Making a Basic Textual Analysis Program in Python

Whether we are involved in Japanese Studies or Islamic Studies, Near Eastern Studies or African Studies, we are all likely to interact with historical texts written in Romance and Germanic languages, and for our research, we may want or need to analyze data about those texts. To this end, I created a basic Python program for analyzing English-language historical texts (particularly secondary sources), although with the tweaking of a few parameters it could be used to assess texts in other Romance or Germanic languages. Since I am a novice when it comes to Python, the code, which I wrote in Visual Studio Code, is a hotchpotch of things that I have figured out or learnt from YouTube, Stack Overflow and other sites. Nevertheless, I found writing the code an insightful and enjoyable experience and hope that others may find the finished product useful in their research.

The code is available on my GitHub repository. Included with the code is a .txt version of William Elliot Griffis’s The Religions of Japan from the Dawn of History to the Era of Meiji (1895), taken from Project Gutenberg (see: http://www.gutenberg.net/1/5/5/1/15516/), which I have been using to test the program. Other users will want to substitute a .txt file of the text that they wish to analyze. To do this, the user’s chosen .txt file should be saved within the BasicTextAnalyzer’s folder on their computer, and the filename ‘Religions.txt’, found on the 9th line of the code, should be replaced with the filename of that .txt file.
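As a rough illustration of the substitution described above, the following sketch reads a .txt file into a single string. Only the filename ‘Religions.txt’ comes from the program itself; the names TEXT_FILE and load_text, and the choice of encoding, are assumptions made for this example and may not match the actual code.

```python
from pathlib import Path

# Filename of the text to analyze; replace it with your own .txt file.
TEXT_FILE = "Religions.txt"


def load_text(filename=TEXT_FILE, folder="."):
    """Return the whole file as one string (hypothetical helper)."""
    path = Path(folder) / filename
    # Older plain-text files are not always valid UTF-8, so undecodable
    # bytes are replaced rather than raising an error (an assumption).
    return path.read_text(encoding="utf-8", errors="replace")
```

Calling load_text("MyBook.txt") would then read a different file saved in the same folder.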

Using BasicTextAnalyzer

When the code is run, it offers the user four options: ‘Analyze,’ ‘Frequency,’ ‘Compile,’ and ‘Search.’ The user types the name of the option they would like to perform, and the corresponding part of the code runs.
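A menu of this kind can be sketched in a few lines. This is not the BasicTextAnalyzer code itself: the names OPTIONS, normalize_choice and prompt_for_option are hypothetical, and the sketch assumes that input is matched case-insensitively.

```python
OPTIONS = ("analyze", "frequency", "compile", "search")


def normalize_choice(raw):
    """Return the lower-cased option name, or None if it is not recognised."""
    choice = raw.strip().lower()
    return choice if choice in OPTIONS else None


def prompt_for_option(read=input):
    """Keep asking until the user types one of the four options."""
    while True:
        choice = normalize_choice(read("Analyze, Frequency, Compile or Search? "))
        if choice is not None:
            return choice
        print("Please type one of: " + ", ".join(OPTIONS))
```

Calling prompt_for_option() returns a string such as "analyze", which the rest of a program could use to decide which function to run.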

[Screenshot: the code and a test run of the ‘Analyze’ function.]

If the ‘Analyze’ option is chosen, the program quickly gathers some statistical information on the text and prints it for the user to read. It lists the following:

  1. Approximate total number of words.
  2. Total number of characters.
  3. Total number of characters (without spaces).
  4. Approximate number of sentences.
  5. Average words per sentence.
  6. Average characters per word.
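The six figures listed above could be computed along the following lines. This is a minimal sketch under stated assumptions, not the program's actual logic: it treats words as whitespace-separated tokens, removes all whitespace (not only spaces) for the second character count, and counts runs of ‘.’, ‘!’ or ‘?’ as sentence endings. The function name analyze and the dictionary keys are hypothetical.

```python
import re


def analyze(text):
    """Return six rough statistics about the text as a dictionary."""
    words = text.split()                   # whitespace-separated tokens
    chars = len(text)                      # every character, spaces included
    chars_no_spaces = len("".join(words))  # all whitespace removed
    # Assumption: a run of '.', '!' or '?' marks the end of one sentence.
    sentences = len(re.findall(r"[.!?]+", text))
    return {
        "words": len(words),
        "characters": chars,
        "characters_no_spaces": chars_no_spaces,
        "sentences": sentences,
        "words_per_sentence": len(words) / sentences if sentences else 0.0,
        "chars_per_word": chars_no_spaces / len(words) if words else 0.0,
    }


stats = analyze("Hello world. How are you?")
print(stats["words"], stats["sentences"])  # 5 2
```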

Such information is useful for gauging the length of the text and details about its composition, which may provide important insights for those using the program to compare multiple texts. By looking at sentence length and the average number of characters per word the user will also gain some insight into the readability of the text. The following table displays the details for the test text, The Religions of Japan from the Dawn of History to the Era of Meiji:

| Detail | Number |
| --- | --- |
| Total words | 125,007 |
| Total characters with spaces | 762,213 |
| Total characters without spaces | 645,314 |
| Approximate number of sentences | 8,134 |
| Average words per sentence | 15.368 |
| Average characters per word | 5.162 |
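The two averages can be reproduced from the totals reported for the test text. The figures suggest that average characters per word is the character count without spaces divided by the word count; that reading is an inference from the numbers rather than something confirmed by the code.

```python
total_words = 125_007
chars_without_spaces = 645_314
sentences = 8_134

words_per_sentence = total_words / sentences
chars_per_word = chars_without_spaces / total_words

print(f"{words_per_sentence:.3f}")  # 15.368
print(f"{chars_per_word:.3f}")      # 5.162
```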

From the results, we can see that the text is longer than the average modern academic book, which is likely to be between 80,000 and 100,000 words long. We can also see that the average word length in the text (5.16 characters per word) is slightly longer than the average word length in English as a whole (4.5 characters per word), which may indicate that the text is difficult to read.

[Screenshot: a section of the code.]

There are some potential problems with the program’s output when the ‘Analyze’ option is chosen. The program recognizes sentences based on certain types of punctuation. If it were used to analyze a modern text containing hyperlinks, for example, it would therefore give an incorrect sentence count, which would in turn distort the average number of words per sentence. Numbered lists in the text (such as the one which appears above in this article) may also produce an incorrect estimate of sentence length. Nevertheless, as a rough estimating tool the program is rather effective.
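The effect is easy to demonstrate with a naive punctuation rule. The sketch below assumes that sentence endings are runs of ‘.’, ‘!’ or ‘?’, which may differ from the rule the program actually uses; the function name count_sentences and the sample strings are made up for illustration.

```python
import re


def count_sentences(text):
    """Count runs of '.', '!' or '?' as sentence endings (naive rule)."""
    return len(re.findall(r"[.!?]+", text))


prose = "Buddhism came later. It spread quickly."
with_link = "See www.example.org for details."
with_list = "1. Words. 2. Characters."

print(count_sentences(prose))      # 2 (correct)
print(count_sentences(with_link))  # 3 (only one real sentence)
print(count_sentences(with_list))  # 4 (only two real items)
```

Every full stop inside the link or after a list number is counted as a sentence ending, so the averages that depend on the sentence count shift accordingly.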

The second function is accessed if the user types ‘Frequency’. This function lists every word that appears in the text alongside its frequency, ordered by frequency. It is again useful for those comparing several texts, but can also provide useful insights into the text itself. For example, although the test text is about religion in Japan, we can see that it primarily focuses on Buddhism, a word which features some 496 times. On the other hand, Shintō (which appears in the .txt file as ‘Shint[=o]’ and is therefore represented in the frequency data as ‘shint’) appears only 243 times, and Christianity 119 times. I imagine that this function may be important for those conducting linguistic research.
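A frequency list of this kind can be sketched with collections.Counter from the standard library. The tokenizing rule here (lower-case the text and keep only runs of the letters a to z) is an assumption chosen because it reproduces the ‘shint’ result described above; the program itself may split words differently, and the name word_frequencies is hypothetical.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return (word, count) pairs, most frequent first."""
    # Lower-case the text and keep only runs of the letters a-z.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common()


sample = "Buddhism and Shint[=o]; Buddhism and Christianity."
for word, count in word_frequencies(sample):
    print(word, count)
```

Under this rule ‘Shint[=o]’ is split into ‘shint’ and a separate ‘o’, because the brackets and the equals sign are not letters.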

[Screenshot: using the ‘Frequency’ function.]

The third function is entitled ‘Compile’. This function imports pyspellchecker, so users will need to ensure that they have installed this package, although the files are also included in the BasicTextAnalyzer’s GitHub repository. The program checks the words which appear in the text against a list of frequent English words. I opted to use a list of the 10,000 most frequently used words in Project Gutenberg (2006), since I thought that it would work particularly well when analyzing older, English-language texts. The user can change the list of words used by the program by replacing ‘WordFrequency.txt’, both in the code (line 76) and in the BasicTextAnalyzer’s folder, with a list of words of their own choosing. The function returns the number of words that do not appear in the user’s word frequency list and, following a further prompt, displays these words. If the user wants to save these words, they can tell the program to do so and it will save a file entitled ‘UnknownWordList.txt’. This is a particularly useful feature for those interested in compiling dictionaries (classically understood) or those compiling word lists for use in Python. The concept of this function arose from my own experiences of trying to create a dictionary in Python, where I found it a useful and quick way to find and add words to my word list. It was also partially inspired by The Digital Orientalist’s article ‘Tackling Poetry with Python (2)’ and the associated code, PslaterSearch.
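The core of such a check is a set difference between the words in the text and the words in the list. The sketch below deliberately leaves out pyspellchecker and makes no assumption about the layout of ‘WordFrequency.txt’; it takes both word lists as plain Python lists. The function names unknown_words and save_words are hypothetical, and only the output filename ‘UnknownWordList.txt’ is taken from the program.

```python
def unknown_words(text_words, known_words):
    """Return the sorted words from the text that are not in the known list."""
    known = {w.lower() for w in known_words}
    return sorted({w.lower() for w in text_words} - known)


def save_words(words, filename="UnknownWordList.txt"):
    """Write one word per line to the given file."""
    with open(filename, "w", encoding="utf-8") as handle:
        handle.write("\n".join(words))


known = ["the", "of", "and", "religion"]
found = unknown_words(["The", "religion", "of", "Shint", "kami"], known)
print(len(found), found)  # 2 ['kami', 'shint']
```

Calling save_words(found) would then write the two unknown words to ‘UnknownWordList.txt’ in the current folder.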

[Screenshot: using the ‘Compile’ function.]

The final function is entitled ‘Search’. This allows the user to search the text for a single word and view the number of times it appears in the text.
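A single-word search can be sketched as follows. The sketch assumes that matching is case-insensitive and limited to whole words, which may not be how the program itself counts; the name count_word and the sample sentence are made up for illustration.

```python
import re


def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


sample = "Buddhism shaped Japan. Buddhist art followed Buddhism."
print(count_word(sample, "buddhism"))  # 2
```

Because of the word boundaries, ‘Buddhist’ is not counted as an occurrence of ‘buddhism’.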

[Screenshot: using the ‘Search’ function.]

Conclusions and Future Work

The Python programming language allows us to build simple programs that can quickly analyze a given text. This article explored the creation of one such program, which may be suitable for analyzing historical, English-language secondary sources. There are, of course, some problems with the way the BasicTextAnalyzer works, some of which are noted above, but I think that it will provide users with an accessible, easy-to-use tool for quickly analyzing texts.

In the future, I intend to edit the code so that it works with Japanese. In the meantime, I am very happy for others to take the code, edit it, and rework it into something that works for them.

Remember to check out the GitHub repository.
