Google Docs and OCR: Some Experiments Transcribing Japanese Language Texts

In the past year or so I have seen and heard an increasing amount about the capabilities of Google Docs to transcribe scans in PDF format into editable documents. In the Digital Orientalist, the Editor for Syriac Studies, Ephrem Ishac, has explained how to perform OCR on Syriac texts using Google Docs, and the Editor for Islamic Studies, Alex Mallett, has recently explained the process using Arabic texts. Reading the contributions of Ishac and Mallett, both of whom praised Google’s OCR for its accuracy, I wondered how effective Google would be at transcribing Japanese language texts. Today I will experiment with the automatic transcription of Japanese language texts using Google Docs.

The Process

The process for transcribing a text using Google Docs is very simple and involves only three steps:

  1. Scan or photograph the text.
  2. Upload your scan or photograph in PDF format to your Google Drive.
  3. Right-click on the PDF in Google Drive and then click “Open With” -> “Google Docs.”
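
For readers who would rather script steps 2 and 3 than click through Google Drive, the following is a minimal sketch using the Google Drive API (v3) through the google-api-python-client package. It is not the method used in this post. The function names build_upload_metadata and upload_pdf_as_doc are hypothetical, the caller is assumed to supply an already authorized Drive `service` object, and it is an assumption that converting on upload yields the same transcription as “Open With” -> “Google Docs.”

```python
import os

# MIME type that asks Drive to convert the uploaded file into a Google Doc.
GOOGLE_DOC_MIME = "application/vnd.google-apps.document"


def build_upload_metadata(pdf_path):
    """Build the file metadata for the upload (hypothetical helper).

    The Google Doc is named after the PDF, without its extension.
    """
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return {"name": stem, "mimeType": GOOGLE_DOC_MIME}


def upload_pdf_as_doc(service, pdf_path, ocr_language="ja"):
    """Upload a PDF and request conversion to a Google Doc.

    `service` is assumed to be an authorized Drive v3 client created by
    the caller, for example with googleapiclient.discovery.build.
    Returns the ID of the new Google Doc.
    """
    # Imported here so the metadata helper works without the package.
    from googleapiclient.http import MediaFileUpload

    media = MediaFileUpload(pdf_path, mimetype="application/pdf")
    request = service.files().create(
        body=build_upload_metadata(pdf_path),
        media_body=media,
        ocrLanguage=ocr_language,  # language hint for OCR, here Japanese
        fields="id",
    )
    return request.execute()["id"]
```

The `ocrLanguage` hint is optional; it is set to Japanese here only because that is the language tested in this post.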

Scanning

For the purpose of today’s experiments, I made several scans with Adobe Scan, took some photographs with the camera on my phone (converting them to PDF after transferring them to my computer), and found some random PDF files on the hard drive of my computer. I opened them in Google Docs using the above-noted process and checked the results against the originals.
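
The checking in this post was done by eye. If a reference transcription of a passage is available, the comparison can be roughly quantified. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and ignoring whitespace is an assumption made because Google Docs usually changes the layout.

```python
import difflib
import unicodedata


def normalize(text):
    """Apply NFKC normalization and drop all whitespace.

    Line breaks and spacing are discarded so that layout changes
    do not count as transcription errors.
    """
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if not ch.isspace())


def similarity(original, transcription):
    """Return a character-level similarity score between 0.0 and 1.0."""
    a = normalize(original)
    b = normalize(transcription)
    # autojunk=False stops difflib from ignoring very frequent characters.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return matcher.ratio()


# One wrong character out of four: 維新政治 read as 維新改治.
print(similarity("維新政治", "維新改治"))
```

A score close to 1.0 suggests an accurate transcription, while the gibberish output described below would be expected to score close to 0.0.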

Results and Lessons

1. Don’t Scan with Adobe Scan

The first thing I learnt was that PDFs created with Adobe Scan generally do not produce good results when one attempts to automatically transcribe them using Google Docs. A lot of the scans that I had made with Adobe Scan returned completely nonsensical strings of symbols and characters.

Gibberish result of scans from Tokushige Asakichi 徳重淺吉, Ishin seiji shūkyōshi kenkyū 維新政治宗教史研究 (Tokyo: Rekishi Toshosha, 1974).

I think this is caused by Adobe Scan’s own recognition system, the output of which is just being imported into the Google Doc. See, for example, the images below from Suzuki Norihisa’s 鈴木範久 Seisho no Nihongo: Honyaku no rekishi 聖書の日本語: 翻訳の歴史 (Tokyo: Iwanami Shoten, 2006). The left image was made by importing the PDF into Google Docs, and the right image was created by copying and pasting the text from the PDF created by Adobe into Microsoft Word. Despite differences in formatting, the reader will notice that, generally speaking, the same characters were rendered in each document.

Suzuki’s Seisho no Nihongo in Google Docs (left) and Microsoft Word (right).

I found this odd since, when I tried importing scans of historical documents from my own collection that I had made using Adobe Scan early last year, the system returned an output in Chinese script. Even though the output was nonsensical, this at least illustrated that somewhere in the process a piece of software was recognizing that the text contained Chinese characters. I wonder if something has changed with Adobe’s software during the past year.

Inaccurate transcription of a scan of a historical document in my own collection.
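
The two kinds of failure described above (strings of symbols versus nonsensical text that is at least in Chinese script) can be told apart roughly by checking how much of the output consists of kana and Chinese characters. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and the Unicode ranges cover only the most common blocks.

```python
def is_japanese_char(ch):
    """Return True for hiragana, katakana, and common CJK ideographs."""
    code = ord(ch)
    return (
        0x3040 <= code <= 0x309F      # hiragana
        or 0x30A0 <= code <= 0x30FF   # katakana
        or 0x3400 <= code <= 0x4DBF   # CJK Unified Ideographs Extension A
        or 0x4E00 <= code <= 0x9FFF   # CJK Unified Ideographs
    )


def japanese_ratio(text):
    """Return the share of non-whitespace characters that are kana or kanji."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    matches = sum(1 for ch in visible if is_japanese_char(ch))
    return matches / len(visible)


print(japanese_ratio("翻訳の歴史"))  # all five characters are kana or kanji
print(japanese_ratio("@#%&"))        # symbols only
```

A high ratio says nothing about whether the transcription is correct, only that the recognizer produced Japanese or Chinese script. Punctuation such as 「」 and 。 is not counted, so even a perfect transcription will usually score somewhat below 1.0.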

2. PDFs from Journals or Online Repositories Work Well

The second thing I learnt was that Google Docs was extremely accurate when transcribing PDFs that I had downloaded directly from journals or online repositories. Formatting always changed, but when the text was written horizontally it was a simple task to fix.

Scan of Yano Kenichi’s 矢野健一, “Kentōshi to rainichi ‘Tōjin’: Kōhō Tōchō o chūshin toshite,” 遣唐使と来日「唐人」 : 皇甫東朝を中心として (2012) (left) and its transcription in Google Docs (right).

3. Use the Camera on your Phone

The third thing I found was that Google Docs works fairly well with photographs taken on one’s phone. I used AirDrop to move the photographs to my Mac, opened them in Preview, and then exported them as PDFs before uploading them to Google Drive. Again, there were formatting issues, but Google Docs was able to recognize and render the contents of the text accurately. Below is a comparative example from Usuyama Toshinobu’s 臼山利信 “‘Korona ka’ no onrain kyōiku to komyunikēshon,” 「コロナ禍」のオンライン教育とコミュニケーション (2021). The first image is Google Docs’ transcription based on the photograph from my phone; the second image is the transcription based on a scan from Adobe Scan.

Transcription from photograph taken on phone and exported to PDF on the computer.

Transcription from scan made with Adobe Scan.

4. Not Great with Vertical or Multi-Directional Text

The fourth thing I discovered was that the process doesn’t work well with texts written vertically. Whilst Google Docs recognized the characters in texts written vertically and displayed them in a meaningful way, there were various formatting problems that would make post-input editing of longer texts quite taxing.

Suzuki’s Seisho no Nihongo photographed with my phone. Accurate transcription, but poorly formatted.

Formatting issues became even more apparent in vertically aligned texts that I downloaded from journals or online repositories. A five-page book review by Higashibaba Ikuo 東馬場郁夫 was rendered vertically by Google Docs, but preference was given to a left-to-right (rather than right-to-left) reading, meaning that the order of the sentences completely changed. Combined with this, only a few lines appeared on each page, making the transcription a total of 82 pages long!

Part of the transcription from a book review.
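
If the only problem with such a transcription is that the columns on each page come out in left-to-right order, the reading order could in principle be restored by reversing the columns page by page. The following is a minimal sketch under strong assumptions: the transcription has already been split into pages, each page is a list of strings with one string per column, and nothing else is scrambled. I have not verified that Google Docs’ output can be split this cleanly, and the function name is hypothetical.

```python
from typing import List


def restore_reading_order(pages: List[List[str]]) -> str:
    """Reverse the column order within each page and join the text.

    `pages` holds one list per page; each inner list holds the columns
    in the (wrong) left-to-right order produced by the transcription.
    """
    restored = []
    for columns in pages:
        # Vertical Japanese text is read from the rightmost column first.
        restored.extend(reversed(columns))
    return "".join(column.strip() for column in restored)


# Two short pages whose columns were transcribed left to right.
pages = [["CD", "AB"], ["GH", "EF"]]
print(restore_reading_order(pages))
```

Joining the pages into one string also removes the page breaks that stretched the transcription to 82 pages.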

I also tried to open some multi-directional texts in Google Docs, but this produced large-scale formatting issues. Of course, these issues reflect a wider problem with the world of digital tools, which are more often than not designed for use with European languages or those which adhere to similar conventions, such as text direction.

5. No Cursive, But…

It goes without saying that the process won’t work with texts written in cursive (although I tried it), but it appears to work acceptably with handwritten text. See the following comparison:

Handwritten text.

Transcription of the handwritten piece in Google Docs.

Conclusions

Using Google Docs to transcribe Japanese from scanned or photographed texts works well with particular sorts of documents, but has some limitations. I found that the method works best with PDFs of papers downloaded from journals or repositories, or PDFs created from photographs taken with a smartphone. It also functions well with non-cursive, handwritten pieces. The method usually produces some formatting issues, which can be easily resolved if the original text was written in horizontally aligned script, but can create difficulties if the text was vertically or multi-directionally aligned. It doesn’t seem to work well with items scanned using Adobe Scan and similar issues may arise when using other scanning apps. The Google Docs method of transcribing texts will likely become a key part of many of our research arsenals, but there is a long way to go before it can adequately transcribe the plethora of different textual forms we find in Japanese and likely other Asian languages.
