Introduction to Programming with Chinese

This post was prepared together with Tilman Schalmey.

The tutorial files for this post can be found here.

As digital technologies develop, they become harder and harder to ignore: even those of us who do not work with computational analysis increasingly rely on text databases. When using such databases, understanding the textual history of the sources collected in them involves more than knowing the lineage and the editorial practices that might have influenced the contents of a text before it was added to a database. Digitization has brought a plethora of new practices and editorial decisions, and digital copies are often far removed from the physical ones that they are supposed to represent. Understanding the history of digital texts therefore requires an understanding of the procedures and concepts that lie behind them.

At the same time, a basic understanding of programming opens up new ways to search for patterns, locate useful passages, and find peculiarities in the sources. With this in mind, we have prepared a short tutorial that shows what is involved in preparing a text in Classical or Modern Chinese for computational analysis. It consists of a collection of Jupyter Notebooks with annotated Python code that cover:

  • Basic programming concepts
  • Loading a text
  • Using Regular Expressions to do structural markup of a corpus
  • Replacing character variants and converting the text into traditional (full-form) and simplified characters
  • Tokenizing the text (i.e. detecting the word boundaries)
  • Removing stop words for statistical analysis
  • Obtaining corpus statistics
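
To give a rough idea of what these steps look like in practice, here is a minimal sketch that uses only the Python standard library. It is not taken from the tutorial notebooks: the sample text, the heading pattern, the variant table, and the stop-word list are all made up for illustration, and the tokenization simply treats every character as a word, which is a common simplification for Classical Chinese but not a substitute for a proper segmenter for Modern Chinese.

```python
import re
from collections import Counter

# Hypothetical sample text with two chapter headings (not from the tutorial files).
RAW_TEXT = """第一章
學而時習之，不亦説乎。
第二章
有朋自遠方來，不亦樂乎。"""

# Assumed variant table: map a variant character to the preferred form.
VARIANTS = {"説": "說"}

# Assumed stop-word list: a few frequent function characters.
STOP_WORDS = {"之", "乎", "不", "亦"}

# Assumed heading format: a line consisting only of 第 + numeral + 章.
HEADING = re.compile(r"^第([一二三四五六七八九十]+)章$", re.MULTILINE)


def normalize_variants(text):
    """Replace each variant character with its preferred form."""
    return text.translate(str.maketrans(VARIANTS))


def mark_up(text):
    """Wrap chapter headings in a simple tag so sections can be located later."""
    return HEADING.sub(r'<head n="\1">\g<0></head>', text)


def tokenize(marked_text):
    """Drop the tagged headings, then treat every remaining CJK character as a token."""
    body = re.sub(r"<head[^>]*>.*?</head>", "", marked_text)
    return re.findall(r"[\u4e00-\u9fff]", body)


normalized = normalize_variants(RAW_TEXT)
marked = mark_up(normalized)
tokens = tokenize(marked)
content_tokens = [t for t in tokens if t not in STOP_WORDS]

# Basic corpus statistics: token count, vocabulary size, most frequent characters.
freq = Counter(tokens)
print("tokens:", len(tokens))
print("types:", len(freq))
print("most common:", freq.most_common(3))
print("without stop words:", "".join(content_tokens))
```

Character-level tokens keep the example dependency-free; the notebooks themselves may rely on other tools for segmentation and script conversion.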

The good news is that, if you have a Google account, you do not need to install anything on your computer to go through the tutorial – just open the files in the shared folder, read the explanations, and press the “play” button in the code cells to see what happens. 

If you are opening the files on your phone, make sure that you are using a web browser and not the Google Drive app. Alternatively, in the Drive app, tap the file, then the three dots at the top, then “Open with”, then “Google Colaboratory”.

If you don’t have a Google account, you can either read the PDF files or follow the instructions in the “00_Introduction” file to install Jupyter Notebooks on your computer. That file also contains some additional information about the notebooks.

Good luck with your programming journey and let us know if you have any questions! 
