This is a guest post by Eric H. C. Chow. For more information, see the note at the end of this post.
Introduction
The conversion of printed text into searchable, editable digital formats using Optical Character Recognition (OCR) technology has become a cornerstone for disseminating historical and archival textual materials. Many libraries and archives around the world are embarking on ambitious digitization projects to make their collections more accessible to the public and researchers. This article describes an experiment using the Gemini Pro Large Language Model (LLM) for OCR, focusing in particular on digitized news clippings written in traditional Chinese.
It is not uncommon for libraries and archives to possess tens of thousands of news clippings requiring OCR. An example of such a project is the Chinese Newspaper Clipping Database by the Hong Kong Baptist University Library, which currently hosts scans of more than 40,000 valuable newspaper clippings from Chinese newspapers published in Hong Kong and China between the 1940s and 1970s, covering topics in religion, education, and other societal issues. Many of these clippings have been digitized into image files, but without OCR of the text. The database would become extremely useful for humanities scholars if the article texts of these clippings became fully searchable.
Figure 1: Scan of news clipping taken from the Chinese News Clipping Database, Hong Kong Baptist University Library.
Existing OCR Solutions
The need for high-quality OCR stems from many challenges: the sheer volume of clippings, and the degradation in condition and quality of the original copies due to ageing. Another challenge lies in the intricacies of the Chinese writing system; the traditional writing direction (top to bottom, right to left) differs from that of the English writing system, for which many modern OCR programs are optimized. These factors, coupled with the fact that truncated texts are often arranged in non-linear layouts in news articles, frequently cause erroneous or unusable OCR outputs that require labor-intensive editing, resulting in a severe bottleneck in digitization efforts.
Figure 2: Text selection after OCR has been performed in Adobe Acrobat Pro
Figure 2 shows the highlighted text recognized by the built-in OCR function of Adobe Acrobat Pro. Upon visual inspection, not only does Adobe Acrobat's OCR fail to recognize the main headline and certain words in the article body, it also (mis)interprets all vertical lines of text as running from the top to the bottom of the image, when in fact the first paragraph is truncated, separated by a horizontal rule, and continues in the bottom half of the article.
Figure 3: OCR being performed in ABBYY FineReader PDF.
Figure 3 shows the OCR result of another popular OCR program, ABBYY FineReader PDF. Similar to Adobe Acrobat's OCR, it fails to recognize the main headline (the red bounding box indicates that it has detected visual content, but it is not recognised as text). Overall, ABBYY performs better at discerning text truncation by separating the recognized text into blocks (represented by green bounding boxes). However, it is evident that human effort is still needed to piece the different parts of the OCR output together into a continuous string of article text.
Simple OCR with Gemini Pro Vision LLM
Since Google's official launch of Gemini Pro Vision in February 2024, its abilities in processing text and images and in performing OCR have been documented by various LLM enthusiasts. Motivated by these findings, I experimented with using Gemini for OCR of newspaper clipping images, in hopes of identifying a better OCR solution for this specific task.
Figure 4: Gemini prompt testing interface
Figure 4 shows a screenshot of the Gemini prompt testing interface available on Google Cloud Console. It allows users to insert media, craft prompts, and adjust LLM parameters to produce and test the LLM outputs. In the first part of my experiment, the news clipping image was inserted into the prompt along with a simple instruction: "Perform OCR in Traditional Chinese on the provided image." The temperature, a parameter that determines whether the LLM output is more random and creative (high value) or more predictable (low value), was set to 0, aiming for accuracy and consistency.
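For readers who would rather script this step than use the Console interface, the same request can be assembled programmatically. The sketch below builds the request body for the Gemini API's generateContent REST endpoint: the image bytes are base64-encoded inline and the temperature is pinned to 0, mirroring the experiment. The field names follow the public API documentation; the function name is my own, and actually sending the request (endpoint URL, authentication) is omitted here.

```python
import base64
import json

PROMPT = "Perform OCR in Traditional Chinese on the provided image."

def build_ocr_request(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Assemble a generateContent request body for an OCR prompt."""
    payload = {
        "contents": [{
            "role": "user",
            "parts": [
                # The scanned clipping, sent inline as base64
                {"inlineData": {
                    "mimeType": mime_type,
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                }},
                {"text": PROMPT},
            ],
        }],
        # temperature=0 favours deterministic, accuracy-oriented output
        "generationConfig": {"temperature": 0},
    }
    return json.dumps(payload)

# Example: body = build_ocr_request(open("clipping.png", "rb").read())
```

The body would then be POSTed to the model's generateContent endpoint with the appropriate credentials.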
Figure 5: Gemini Pro Vision LLM response to the OCR request
The resulting output (shown in Figure 5) is very promising. The model was able to extract both the main headline and the subheadline (which Adobe Acrobat and ABBYY FineReader both failed to extract), and almost the entire headline and article text were recognized correctly. It detected an extra character in the headline, and failed to extract the text appearing just after the truncation of the first paragraph. However, it recognized the text truncation and output a continuous paragraph of readable text. Interestingly, Gemini "hallucinated" (or generated) "者," in place of the missing text (see Figure 6), seemingly "filling in the gap" based on the writing before and after it. This may seem "smart" at first glance, but it actually makes it harder for human proofreaders to identify missing OCR text for manual enhancement.
Original article text: …並登記中等學校以上失業工人幹部短期訓練班,開始訓練工人幹部以便開展以工代賑各項工程…
Gemini output text: …並登記中等學校以上失業者,以便開展以工代賑各項工程…
Figure 6: Comparison of original article text with Gemini output
JSON Output Format
After OCR, the next part of the digitization workflow is for other systems or programming scripts to ingest the text outputs for further processing and storage. With a carefully crafted prompt, LLMs are known to be able to produce responses in specific output formats. Therefore, the natural next step of my experiment was to append an additional instruction to the prompt: output the OCR results in JSON, a common format for data exchange between systems and programming scripts. Furthermore, as LLMs are known to be able to extract contextual information from a given input, I also included metadata fields such as the headlines, author, publication date, and publisher of the news article in the prompt, in order to test Gemini's ability to extract this contextual information. The revised prompt is as follows:
1. Perform OCR in Traditional Chinese on the provided image.
2. Output the results in Traditional Chinese in the following JSON format:
{
  "headlines": [
    "headline1": <headline 1>,
    "headline2": <headline 2>,
    "headline3": <headline 3>,
    …
  ],
  "date": <YYYY-MM-DD>,
  "publisher": <publisher>,
  "author": <author>,
  "fulltext": [
    {
      "paragraph1": <full text of paragraph 1>,
      "paragraph2": <full text of paragraph 2>,
      "paragraph3": <full text of paragraph 3>,
      …
    },
  ],
}
Gemini's response to the revised prompt is shown in Figure 7.
Figure 7: Gemini output in JSON format
Again, the output is very promising. Firstly, it is indeed valid JSON (I ran it through a JSON validator and it is well-formed). Furthermore, the headline and paragraph separation was correctly identified. More surprisingly, the author's name 璋 was also correctly identified, showing Gemini's understanding of the text string 璋十月廿四日寄 (which translates to "Posted by Zhang (璋) on October 24").
On the other hand, the publisher's name and publication date in the response were incorrect. This information is embedded in the purple ink stamp that overlaps the article's main headline (see Figure 8), which explains why it was so difficult for the LLM to recognize.
Figure 8: Article metadata embedded in purple-color ink stamp
Interestingly, there is evidence that Gemini was still able to obtain partial information from the ink stamp. Although it mistakenly used information extracted from 璋十月廿四日寄 for the day and month of publication, it did correctly append 1950 as the publication year, which was partially visible in the ink stamp (一九五〇). Furthermore, although the publisher name in the output was 人民日報 instead of 大公日報, which may seem like a hallucination at first glance, one could argue that 人民 and 大公 share visual similarities, and that it was the occlusion by the overlapping headline text that reduced the recognition accuracy.
Conclusion
This preliminary experiment with Gemini Pro Vision shows high potential for OCR of scanned news clippings in traditional Chinese, and may reduce the time and manual effort required to convert vast quantities of news clipping scans into fully searchable text. The resulting outputs are highly usable thanks to Gemini's ability to discern truncated text arranged in non-linear layouts, extract important metadata, and output the OCR results as well-formed JSON. However, some of the extracted metadata may be inaccurate, and still requires human proofreading and correction.
___
Dr. Eric H. C. Chow is the Digital Scholarship Manager of the Hong Kong Baptist University Library. He manages the development of programmes and projects in digital research and education, with a focus on digitisation, digital curation and visualization of cultural heritage collection data. He has previously written a post titled Digital Humanities Approaches to Navigating the Early Colonial Hong Kong History.
Cover image created by Ms. Annie Sit, Hong Kong Baptist University Library.
