Some thoughts on regular expressions

What are regular expressions and why use them?

Those who are interested in the digital humanities or corpus linguistics are very likely to come across regular expressions (regex for short) – a syntax that is used for advanced, pattern-based search and replace in texts. They are reminiscent of programming languages, and although a big part of text processing and of the everyday functioning of the internet – such as the reminders that your password doesn’t have enough special symbols to be valid – relies on regular expressions, people more often than not choose to avoid them because of their cryptic nature.

For my own part, I regret that I didn’t learn them early enough and that my compulsory BA course in computer literacy did not include them: regex saved me from hours of tedious formatting and text search and is now instrumental in many of my research activities.

So what are the benefits of regular expressions? “Pattern search” is rather vague, but this has its own benefits: the use cases are only limited by our creativity. Here are some of the possible applications:

  • Fuzzy search – If there are multiple character variants or spellings that can be used, one can search for all of them at the same time. This works for historical figures, who might have had pseudonyms or multiple ways of spelling their name. It also works well for searching one’s notes when one is not sure of the exact wording.
  • Bulk formatting – Different journals require different formatting, and if one finds out that everything needs to be redone, one expression can save hours of manual work. Likewise, I – like many other sinologists – regularly use https://ctext.org/, but the less popular texts there can be messy. Regex can be used to clean them up if there is any pattern behind the formatting, so a raw transcription can be turned into a marked-up corpus with relatively little effort.
  • Search complex relations – Regular expressions can help catch connections between different entities in texts – as long as they can be formally described. One example is data for social network analysis. There are many ways to extract connections between entities in a text, but regex can be used as a key tool to do so. For example, if we study a work of literature and assume that two characters are connected if they appear in the same paragraph (or scene in a play), regex can be used to extract these connections.
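As a rough illustration of the three use cases above, here is a minimal Python sketch using the standard re module. The spellings of the name, the citation format and the character names are all made-up examples, and the function names are hypothetical. Treating a blank line as a paragraph break is likewise an assumption.

```python
import re
from itertools import combinations

# Fuzzy search: one pattern for several romanisations of the same name.
# The list of variants is illustrative, not exhaustive.
LAOZI = re.compile(r"Lao[\s-]?(?:zi|tzu|tse)", re.IGNORECASE)


def find_name_variants(text):
    """Return every spelling of the name found in the text."""
    return LAOZI.findall(text)


# Bulk formatting: turn "(Author, 2010)" into "(Author 2010)" everywhere.
# Both citation styles are invented for the example.
CITATION = re.compile(r"\((\w+), (\d{4})\)")


def drop_citation_comma(text):
    """Rewrite all citations in one pass, keeping author and year."""
    return CITATION.sub(r"(\1 \2)", text)


# Complex relations: two characters are linked if they share a paragraph.
def cooccurrences(text, names):
    """Return the set of name pairs that appear in the same paragraph."""
    name_pattern = re.compile("|".join(re.escape(n) for n in names))
    pairs = set()
    for paragraph in re.split(r"\n\s*\n", text):  # blank line = new paragraph
        present = sorted(set(name_pattern.findall(paragraph)))
        pairs.update(combinations(present, 2))
    return pairs


if __name__ == "__main__":
    print(find_name_variants("Laozi, also spelled Lao Tzu or Lao-tse"))
    print(drop_citation_comma("as argued before (Smith, 2010) and (Lee, 1999)"))
    play = "Hamlet greets Horatio.\n\nOphelia is alone.\n\nHamlet sees Ophelia."
    print(cooccurrences(play, ["Hamlet", "Horatio", "Ophelia"]))
```

The resulting pairs can be fed into a network analysis tool as an edge list. Real texts will need a more careful list of name variants than the simple alternation used here.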

Learning to use regular expressions

There are plenty of excellent introductions to the topic on the internet, so instead of creating a new one I will point to some useful resources. My own addition to the topic will be a discussion of working with non-Latin scripts, as regular expressions were designed with the English alphabet in mind and require some workarounds even when an occasional umlaut appears.

Some starting points and guides for getting acquainted with the topic include the Programming Historian’s introduction to regular expressions, Donald Sturgeon’s introduction to regular expressions for those working with Chinese texts, and the tutorials on RegexOne and PY4E.

Since programmers have realised that regular expressions are not easy to deal with, they have created a whole range of websites and tools that help one check what an expression will find in a text and that provide hints and cheatsheets – these are the perfect places to practice and test ideas. My personal favourite is RegExr. RegExr is not overly complicated and is beginner-friendly, but at the same time it enables Unicode-based searches – I will explain why this is so important later.

If you want to start using regular expressions to work with big texts, there are several options available. If the goal is only to search, but not manipulate texts, most corpus linguistics tools such as AntConc offer the possibility of using a simplified version of regex. However, a more advanced user might want to use specialised text editors like Notepad++ for Windows and BBEdit for Mac (the free versions are more than enough) – regular expressions usually hide in the search window, under the “Grep” option in BBEdit and the “Regular expression” search mode in Notepad++. And, of course, there is always the option of using the command line or writing your own program: most programming languages support regex, although there might be some small differences in how they are used.
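For those who go down the programming route, a basic search tool takes only a few lines. The sketch below uses Python's standard re module. The function name grep_lines and the sample lines are invented for illustration.

```python
import re


def grep_lines(pattern, lines, ignore_case=False):
    """Return (line number, line) pairs for lines that contain a match."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [(number, line) for number, line in enumerate(lines, start=1)
            if regex.search(line)]


if __name__ == "__main__":
    sample = [
        "Chapter 1: The colour of water",
        "Notes on color theory",
        "An unrelated line",
    ]
    # "colou?r" matches both the British and the American spelling.
    for number, line in grep_lines(r"colou?r", sample):
        print(number, line)
```

To search a real text, the list can be replaced with the lines of a file, for example those read from open(path, encoding="utf-8").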

Working with non-Latin scripts

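Now, as I mentioned above, regular expressions were created to work with English. So, most of the tutorials start by saying something like this: “if you want to search for words only, i.e. omit everything that is not a letter, use the expression \w (w for ‘word’) or a range [A-z] (as in ‘everything between capital A and small z’)”. There are obvious problems here. In many tools ‘\w’ covers only unaccented Latin letters, digits and the underscore, so it will not work even with umlauts (Unicode-aware engines are more generous, but this varies from tool to tool). The range [A-z] is itself imprecise, because it also includes the six symbols that sit between Z and a in the character table ([ \ ] ^ _ `); [A-Za-z] is the safer way to write it. And although some non-Latin scripts do have alphabets (so [А-яЁё] will work for Russian, where ё has to be added separately because it lies outside the main range, and [あ-んア-ン] for most Japanese kana), it is hard to imagine a similar approach when it comes to logographic scripts such as Chinese.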
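How much of this applies depends on the regex engine, so it is worth testing one's own tool. As one data point, the sketch below uses Python 3's standard re module, where \w is Unicode-aware by default and the re.ASCII flag switches it back to the narrow English-only behaviour. The sample words are arbitrary. The test strings and the Cyrillic range (U+0410 to U+044F, plus Ё and ё) are written with Unicode escapes to avoid confusing Latin and Cyrillic look-alike letters.

```python
import re

sample = "M\u00fcller l\u00e4uft"      # "Müller läuft"
yolka = "\u0451\u043b\u043a\u0430"     # "ёлка" (Russian, starts with ё)

# In Python 3, \w matches letters of any script unless re.ASCII is set.
unicode_words = re.findall(r"\w+", sample)
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

# [A-z] also covers the six symbols between Z and a: [ \ ] ^ _ `
loose_range = re.findall(r"[A-z]+", "snake_case [x]")
strict_range = re.findall(r"[A-Za-z]+", "snake_case [x]")

# The Cyrillic range from U+0410 to U+044F leaves out ё (U+0451).
RUSSIAN_NARROW = re.compile(r"[\u0410-\u044F]+")
RUSSIAN_FULL = re.compile(r"[\u0410-\u044F\u0401\u0451]+")

print(unicode_words)                   # ['Müller', 'läuft']
print(ascii_words)                     # ['M', 'ller', 'l', 'uft']
print(loose_range)                     # ['snake_case', '[x]']
print(strict_range)                    # ['snake', 'case', 'x']
print(RUSSIAN_NARROW.findall(yolka))   # ['лка']
print(RUSSIAN_FULL.findall(yolka))     # ['ёлка']
```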

There are several workarounds for this. The first is to avoid looking for a script altogether and make peace with getting some dirty results containing punctuation. The results can be improved by creating negative ranges and defining a character as anything that is not punctuation, a space, a digit etc. In this case, an expression for getting one Chinese character will be something similar to [^。,、/|?~!@#¥%…… \n\t\s0-9].
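A minimal Python sketch of this negative-range approach follows. The exclusion list here is an illustrative selection of full-width and ASCII punctuation rather than a complete inventory, and the function name and sample strings are only there for testing.

```python
import re

# Anything that is NOT listed punctuation, whitespace, a digit or a Latin
# letter counts as a "character". The list is deliberately incomplete.
NOT_PUNCTUATION = re.compile(r"[^。,、;:?!「」『』()\s0-9A-Za-z,.;:?!]")


def rough_characters(text):
    """Return single characters that survive the exclusion list."""
    return NOT_PUNCTUATION.findall(text)


if __name__ == "__main__":
    clean = rough_characters("子曰:「學而時習之,不亦說乎?」")
    print("".join(clean))  # 子曰學而時習之不亦說乎

    # A symbol missing from the list slips through: the "dirty results".
    print(rough_characters("道·德"))  # ['道', '·', '德']
```

The second call shows the weakness of the approach: every symbol that is not explicitly excluded is treated as a character, so the list tends to grow with each new text.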

And yet, there is a place where characters (or elements of most other scripts used on computers) do constitute a range, and that is Unicode: it was created to systematically represent and encode as many writing systems as possible for digital use. Different scripts have specific ranges of codes associated with them – and, fortunately enough, these can be used for regex. So, Chinese (with some overlap with Japanese) characters can be found with this expression: [\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD].
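In Python, for instance, the expression above can be used almost verbatim, since the re module understands \uXXXX escapes inside patterns. This is a minimal sketch: the function names and sample strings are illustrative, and the ranges are simply copied from the expression above rather than checked against the latest Unicode charts.

```python
import re

# Ranges copied from the expression in the text (radicals, Kanbun marks,
# CJK Extension A, the main CJK block, compatibility ideographs).
HAN = r"[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD]"
HAN_CHAR = re.compile(HAN)
HAN_RUN = re.compile(HAN + "+")


def han_characters(text):
    """Return every single character that falls inside the ranges."""
    return HAN_CHAR.findall(text)


def han_runs(text):
    """Return unbroken runs of characters that fall inside the ranges."""
    return HAN_RUN.findall(text)


if __name__ == "__main__":
    print(han_runs("《論語》 Lunyu 1.1: 學而時習之"))  # ['論語', '學而時習之']
    print(han_characters("漢字とかな"))                # ['漢', '字']
```

Kana, Latin letters, digits and punctuation are all skipped without having to be listed. Characters beyond U+FFFF, such as those in CJK Extension B, need the eight-digit form \UXXXXXXXX in Python, so they are not covered by this expression.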

This also comes in handy when looking for rare characters and radicals that are not in the standard fonts and, in general, for working with less popular scripts. For more information I suggest looking into this resource and then searching for the Unicode ranges of the particular script of interest – the Wikipedia page on Unicode is a decent place to start.

This approach will not work with all tools, but regexr.com, Notepad++ and BBEdit have this functionality, and in most cases this will be more than sufficient.
