Web scraping is a technique that allows us to extract and copy specific pieces of data from a website. Web scraping is not always ethically unambiguous, and it can be legally dubious depending on the country or the terms and conditions of the website. For those interested, James Densmore and Justin Abrahms have posted accessible introductions to the ethics of web scraping in Towards Data Science and Quick Left respectively. Ethical considerations aside, for those involved in textual analysis, web scraping can be a speedy and useful means of pre-processing a text. In this post, I will explain how to make a simple web scraping program with Python, aimed at beginners like myself, to be used on the Japanese text database, Aozora Bunko 青空文庫.
Uchimura Kanzō’s Denmarukukoku no hanashi reprinted by Iwanami Shoten (2012).
As an example text, I will be using Aozora Bunko’s HTML version of Uchimura Kanzō’s 内村鑑三 Denmarukukoku no hanashi デンマルク国の話 which can be found here. If we look at this version of Uchimura’s text on Aozora Bunko, we see that at the bottom of the page numerous pieces of information about publication and input are included. These pieces of information may interfere with digital analysis of the text including word count and word frequency. Furthermore, in order to analyze the text using Python we might desire to have a copy of the text in plain text format. As such, our web scraping program needs to do two primary things: extract relevant data from the website and write it to a .txt file.
In order to make the web scraper, I followed the guide made by Traversy Media, which can be seen in the below video. The main difference between Traversy Media’s web scraper and my own is the output. Traversy Media’s web scraper writes to a CSV file, whilst the one I am describing in this post writes to a .txt file. Additionally, the web scraper described herein will return text written in Japanese characters. For those who are interested, Traversy Media’s code can be found here.
Installing Required Packages
I assume the reader has some familiarity with Python, and that it and a text editor are already installed on the reader’s computer. As such, the first step is to install two Python packages. Firstly, we will install Requests, which will allow us to make an HTTP request (see the explanation given by Codecademy here). To do so, type the following into your terminal or console:
pip install requests
Secondly, we will install Beautiful Soup, an HTML parser that will provide us with the means to extract certain information from a given webpage. To install the latest version of Beautiful Soup (details here), type the following into your terminal or console:
pip install beautifulsoup4
Writing the Code
Now we can begin writing the program. The source code is available on my GitHub Repository, but will also be included throughout this post.
In order to use the packages that we installed above we will need to import them into our program. As such the first few lines of our code should look like this:
import requests
from bs4 import BeautifulSoup
We then need to make our HTTP GET request in order to access the data from the web page. This can be seen in the below line of code. The URL in this line of code is for Uchimura’s Denmarukukoku no hanashi on Aozora Bunko, but by changing it we can use the program on different web pages.
response = requests.get('https://www.aozora.gr.jp/cards/000034/files/233_43563.html')
When the encoding used by the site and the program don’t match.
At this stage we also need to ensure that our program uses the same encoding as Aozora Bunko or it won’t return Japanese text (see an example of this in the above picture). The encoding will likely change according to the website or web page that is being scraped, but it can easily be checked with the Chrome Developer Console or by viewing the source code. Aozora Bunko uses Shift JIS (Shift Japanese Industrial Standards) and therefore the program should include the following line of code:
response.encoding = 'shift_jis'
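If you are unsure which encoding a page uses, you can also read it out of the page’s own markup. The snippet below is a minimal sketch, assuming the charset is declared in a meta tag as it is on Aozora Bunko; the html_head string here is a stand-in for the real page source:

```python
from bs4 import BeautifulSoup

# A hypothetical fragment of the page's <head>, used here so the example
# runs without fetching the page itself.
html_head = '<meta http-equiv="Content-Type" content="text/html;charset=Shift_JIS">'

# Parse the fragment and read the charset out of the meta tag's content attribute.
meta = BeautifulSoup(html_head, 'html.parser').find('meta')
charset = meta['content'].split('charset=')[-1]
print(charset)  # Shift_JIS
```

Requests also offers a built-in guess via response.apparent_encoding, which can serve as a cross-check against what the page declares.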
Next we will use Beautiful Soup to extract relevant data from the web page (the data from our HTTP GET request, known in our code as response), but first we need to decide which data we want to keep. I would like to remove the publishing information found at the bottom of the page whilst retaining the author’s name, the title, the subtitle, and the text itself.
The source code for the web page.
In order to know which parts of the page our program needs to extract we need to look at the source code of the page. There are two ways that we can do this. When viewing the web page we want to scrape in our browser we can open a new tab to view the page’s source code by pressing CTRL + U (on a PC) or Command + Option + U (on a Mac). Alternatively, if we are using Google Chrome we can press CTRL + Shift + J (on a PC) or Command + Option + J (on a Mac) in order to open Chrome’s Developer Console. I favor the second option since we can identify different parts of the web page simply by hovering over different parts of the code. Incidentally, in the case of Aozora Bunko the aforementioned encoding of the web page can be seen in the top line of the source code regardless of the method that you choose.
As can be seen in the below image all the data that we wish to extract (author information, title, subtitle, and text) are included under the <body> element in the source code.
Viewing the source code with Chrome’s Developer Console.
As such, when we parse the text using Beautiful Soup we only need to parse the sections of the web page included under the <body> element. This is encapsulated in the next piece of code:
soup = BeautifulSoup(response.text, 'html.parser')
contents = soup.find_all('body')
Next we need to extract specific pieces of data, namely the title, subtitle, author information, and text. By using Chrome’s Developer Tools we can find out which elements the data we want are included within (see the below picture). Through reading the source code and hovering over different parts of it we can see that the data we want is identified by the classes “title,” “subtitle,” “author,” and “main_text.”
Finding the data we need in the source code.
As such, we should code our program to find the textual data from each of these classes, as illustrated in the section of code below. For formatting purposes, we can also use the replace() method to strip newline characters from the extracted text. It must be noted that the code may require editing to work with other pages on Aozora Bunko, since some texts lack subtitles or titles, which will cause the code to return an error when run.
for content in contents:
    title = content.find(class_='title').get_text().replace('\n', '')
    subtitle = content.find(class_='subtitle').get_text().replace('\n', '')
    author = content.find(class_='author').get_text().replace('\n', '')
    text = content.find(class_='main_text').get_text().replace('\n', '')
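To handle pages that lack one of these classes without crashing, the extraction can be wrapped in a small helper that returns an empty string when an element is missing. This is a sketch of my own, not part of the original program; the inline HTML stands in for a page with no subtitle:

```python
from bs4 import BeautifulSoup

def get_class_text(content, class_name):
    """Return the text of the first element with class_name, or '' if absent."""
    element = content.find(class_=class_name)
    if element is None:
        return ''
    return element.get_text().replace('\n', '')

# A small inline document stands in for an Aozora Bunko page with no subtitle.
sample = BeautifulSoup(
    '<body><h1 class="title">デンマルク国の話</h1></body>', 'html.parser')
body = sample.find('body')

print(get_class_text(body, 'title'))     # デンマルク国の話
print(get_class_text(body, 'subtitle'))  # '' (no error raised)
```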
As noted in the introduction, we want to write all of this information to a .txt file. The final part of the code allows us to do this. This final section will create a .txt file entitled aozoratext.txt, or append to it if it already exists. It will print the headings “Title:,” “Subtitle:,” “Author:,” and “Text:,” followed by the relevant data from each of the identically named classes in the source code for Uchimura’s Denmarukukoku no hanashi.
f = open('aozoratext.txt', 'a+')
f.write('Title: ' + title + '\n')
f.write('Subtitle: ' + subtitle + '\n')
f.write('Author: ' + author + '\n')
f.write('Text: ' + text + '\n')
f.close()  # close the file so the writes are flushed to disk
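An equivalent, slightly more idiomatic way to write the file is a with statement, which closes the file automatically even if a write fails partway through. The placeholder values below stand in for the variables extracted earlier, so this sketch runs on its own:

```python
# Placeholder values stand in for the variables extracted in the loop above.
title, subtitle, author, text = 'デンマルク国の話', '', '内村鑑三', '...'

# 'with' closes the file automatically; an explicit encoding keeps the
# Japanese text readable regardless of the platform's default.
with open('aozoratext.txt', 'a', encoding='utf-8') as f:
    for label, value in [('Title', title), ('Subtitle', subtitle),
                         ('Author', author), ('Text', text)]:
        f.write(label + ': ' + value + '\n')
```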
Running the Code
Now that the code is written we can run it and it will produce an output akin to that seen in the following image.
Using a web scraper to extract data from a single web page as described above may strike the reader as an over-engineered alternative to copying and pasting. Indeed, when faced with a single text it is rarely necessary to web scrape, especially since Aozora Bunko offers .txt versions of its texts for download. Despite all this, web scraping is a potentially useful tool for the pre-processing of a large number of texts or a large amount of data. As such, knowledge of web scraping is potentially useful for researchers engaged in the Digital Humanities, and this kind of program is suitable for coding novices like myself and for those wishing to learn more about Python and its potential uses. I hope that this post will help beginners in the Digital Humanities get familiar with web scraping and simple Python programs.
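As a closing sketch, the pieces above can be combined into functions that scrape several pages in one run. This is my own illustrative arrangement, not code from the post: the URL list is hypothetical (swap in the texts you need), and the delay between requests is a courtesy to avoid burdening the server.

```python
import time

import requests
from bs4 import BeautifulSoup

# Hypothetical list of Aozora Bunko pages; replace with the texts you need.
urls = [
    'https://www.aozora.gr.jp/cards/000034/files/233_43563.html',
]

def extract_main_text(html):
    """Return the main_text of a page's HTML, or '' if the class is absent."""
    soup = BeautifulSoup(html, 'html.parser')
    main = soup.find(class_='main_text')
    return main.get_text().replace('\n', '') if main else ''

def scrape_all(urls, delay=1.0):
    """Fetch each page in turn, pausing between requests to be polite."""
    texts = []
    for url in urls:
        response = requests.get(url)
        response.encoding = 'shift_jis'
        texts.append(extract_main_text(response.text))
        time.sleep(delay)  # pause so we don't hammer the server
    return texts
```

Separating fetching (scrape_all) from parsing (extract_main_text) also makes the parsing logic easy to test against saved HTML, without touching the network.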