Creating the largest Juren Dataset with ChatGPT: A Journey through Digital Humanities. (Part One)

This is a guest post by Jiajun Zou. See bio at the end of this post.

Part I: Tianyige Ming Juren Records

Creating a dataset from scratch is an arduous task for scholars in the digital humanities. However, the advent of technologies like ChatGPT and other large language models has transformed how we approach coding and create regular expressions. These advances now allow even those with minimal programming skills to leverage AI for complex tasks. The technical barriers that once existed have vanished. There is no longer a need for a professional coder to act as an intermediary, nor is there a need to coordinate appointments and engage in lengthy email exchanges with human experts to design and execute plans.

The silent yet powerful revolution that began in 2023 casts doubt on the values we have held for decades in education. Is human-to-human collaboration still as effective and productive compared to collaboration between humans and machines? Are specialized skills in the humanities or sciences still valuable to society from a productivity standpoint? These are the questions I pondered in 2023 while creating a dataset that all predecessors in my field, including well-funded institutions and academics, had failed to develop. Instead, the world’s largest list of premodern government official records came from a graduate student with a computer who incessantly communicated with chatbots to design, reflect, and overcome challenges—something that was never taught in any graduate seminar before 2023.

A post like this therefore seems warranted: thinking about the future of data creation and the maximization of human productivity requires us to question the conventional wisdom upheld by most. The past methods of teaching, learning, and researching are no longer sufficient. Reliance on the expertise of others is fading as every individual now has access, at minimal cost, to an arsenal of knowledge more powerful and effective than the best and most well-trained human advisors. The empowerment of individual researchers through AI and the collaboration between humans and machines are just beginning.

The First Step

In the first half of 2023, I used ChatGPT to enhance my code, improving the accuracy of OCR and the effectiveness of the regular expressions used to extract records of over 27,000 individuals from official provincial examination records. These records are the most authoritative source available, yet they represent only a portion of the dataset: about a quarter of the estimated total of 100,000 records. Many records have unfortunately been lost over time.

Expanding the Dataset with Qing Dynasty Provincial Gazetteers

With these 27,000-plus records extracted, I sought to recover the remaining Juren from alternative sources, namely provincial gazetteers from the Qing dynasty. This phase involved overcoming a different set of challenges, including the handling of already-digitized primary sources available from online archives and databases. The main challenge was designing and executing regular expressions that could effectively extract data for 98,000 individuals. Additionally, with the aid of ChatGPT, I developed a tool to download the necessary pages from these databases for further corrections. This phase yielded data on over 98,000 individuals before cleanup, making the gazetteers the largest source of Juren records. After integrating these gazetteer records with the original records from Tianyige and removing duplicates, the dataset grew to a total of 97,700 individuals. It is now pending further advanced data-cleaning steps and a disambiguation process to prepare it for wider usage.
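As a purely illustrative sketch of what such a downloading helper can look like, here is a short Python example. The URL pattern, page numbering, and output layout are all assumptions; every archive exposes its pages differently, and any real version would also need to handle login sessions.

```python
import requests
from pathlib import Path

# Purely illustrative page-downloading helper of the kind described above.
# The URL pattern and page-numbering scheme below are hypothetical.
BASE_URL = "https://example-archive.org/gazetteer/{volume}/page_{page:04d}.png"

def download_pages(volume: str, pages: range, out_dir: str = "pages") -> None:
    Path(out_dir).mkdir(exist_ok=True)
    with requests.Session() as session:
        for page in pages:
            resp = session.get(BASE_URL.format(volume=volume, page=page), timeout=30)
            resp.raise_for_status()  # stop early on a failed download
            Path(out_dir, f"{volume}_{page:04d}.png").write_bytes(resp.content)

download_pages("vol01", range(1, 51))
```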

The Historical Significance of the Juren Dataset

For over a century, scholars from China and around the world who have studied the examination system have faced significant challenges due to the limited data available in ancient Chinese texts, necessitating a substantial investment of time, money, and resources. Historically, the most frequently utilized data has been the Jinshi records, which for the Ming dynasty number approximately 24,580 and which have underpinned numerous pivotal studies and arguments in examination history by scholars such as Ho Pingti and Benjamin Elman. These records have been essential for both quantitative and qualitative research across various disciplines, including economics, history, and literature.

The reliance on Jinshi records persisted until 2023, when advances in artificial intelligence, particularly the ability to use natural language for coding, democratized data handling in historical research. This technological leap allowed researchers, including myself, to bypass traditional coding barriers. As a historian who codes with natural language, I managed to shatter previous limitations by single-handedly creating the largest Juren dataset to date, comprising 93,700 individuals.

This dataset has enabled the creation of new maps and perspectives that challenge longstanding assumptions about geographical disparities in the historical examination system. For example, analysis reveals that Juren candidates predominantly came from areas with flat plains and lower elevations, terrain typical of the northern regions of China. Conversely, in southern provinces such as Fujian, Jiangxi, Zhejiang, Guangdong, and Guangxi, where over half the terrain is mountainous, candidates were concentrated in the few lower-elevation areas available, typically near provincial capitals.

This geographic disparity raises pertinent questions about why powerful examination prefectures are predominantly found in the south and not uniformly distributed across the country. It also prompts a reevaluation of how geography and transportation have historically influenced examination outcomes. For instance, the success of a prefecture like Putian in Fujian might not stem from its own merits but rather from the geographical disadvantages faced by its in-province competitors.

This insight challenges the traditional narrative of southern excellence and northern mediocrity in the examination system, suggesting that the density of talents in the south results more from topographical factors than from inherent regional superiority. My dissertation explores these dynamics further, arguing that it is the core-periphery relationship within southern provinces that has historically facilitated their success, not merely the intrinsic strength of certain prefectures.

The Juren dataset is critical to this research, offering a new lens through which to examine how natural factors, rather than human efforts, shape examination outcomes. While this post focuses on the dataset creation process, the full potential of this data is still being carefully explored in my ongoing academic work.

Image 1 Topographical Map of China using Foursquare Studio’s Default Elevation Absolute as Basemap, with a Juren Distribution Heatmap on top showing a total of 93,770 Ming Juren

Analyzing the Geography of Historical Examination Candidates: Insights from a Heatmap with Elevation as Background

The heatmap illustrates the distribution of examination candidates across various prefectures, revealing a concentration in certain areas, particularly the southern prefectures known for their examination activities. The use of Foursquare’s elevation map as a base layer highlights the geographical advantage of these prefectures, situated in lower-elevation areas, unlike their counterparts in more challenging terrain. This geographical factor likely shaped the distribution of candidates: only in South China do we see a concentration of powerful examination prefectures, because South China’s topography produces such uneven geographical disparities. How did a map like this become possible? The answer lies in the creation of the Juren dataset.

This post outlines, step by step, how the dataset that ultimately made the Juren map possible was created, demonstrating that humanists and technology can work together in the coming of age of artificial intelligence.

Step 1: Acquiring the Sources

The initial step in the dataset creation process involves sourcing the necessary documents. Duxiu.com, renowned as the most extensive academic book database for Chinese literature, offers a unique feature for scholars: the ability to export specific pages of a book. Accessed through a university subscription, this capability allows scholars to select and export exactly the pages that contain the information vital to their research. This targeted approach enables efficient data collection and sets the stage for the subsequent OCR processing.

Step 2: Image Preprocessing for Enhanced OCR Accuracy

Given that the images obtained from Duxiu or scanned from physical books often suffer from quality imperfections, preprocessing these images is crucial to enhance text sharpness, which directly influences OCR accuracy. The comparison in the two images below highlights the distinction between the original page from Duxiu and its enhanced counterpart. The original, while legible, lacks the clarity needed for optimal OCR performance.

To address this, I utilized ComicEnhancerPro, a freely available tool renowned for its comprehensive image enhancement capabilities. This tool’s ‘autolevel1’ feature was instrumental in automatically refining the text quality. Additionally, I employed its automatic illumination enhancement to improve background lighting, alongside manual adjustments to Gamma and overall image quality. It’s important to note that the effectiveness of these adjustments can vary from one document to another, necessitating a bespoke approach to optimize each page’s readability and, consequently, the OCR accuracy.
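For readers who prefer a scriptable route, here is a rough Pillow-based analogue of those adjustments: an auto-level pass plus a gamma tweak. This is a sketch only; the gamma value is an assumption and, as with the GUI tool, it needs tuning document by document.

```python
from PIL import Image, ImageOps

# Rough scripted analogue of the ComicEnhancerPro adjustments described above.
def enhance_page(src: str, dst: str, gamma: float = 1.4) -> None:
    img = Image.open(src).convert("L")   # grayscale sharpens text/background contrast
    img = ImageOps.autocontrast(img)     # comparable to an automatic level adjustment
    # Build a gamma lookup table; gamma > 1 lightens the midtones and background.
    lut = [round(255 * (i / 255) ** (1 / gamma)) for i in range(256)]
    img = img.point(lut)
    img.save(dst)                        # PNG output keeps the result lossless

enhance_page("page_001.png", "page_001_enhanced.png")
```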

A critical step following the image enhancement is the conversion of these optimized images back into PDF format without a loss of quality. This conversion must preserve the full quality of the enhanced images to ensure that no detail is lost before proceeding to the OCR stage. This meticulous process lays the groundwork for a successful text extraction, which is pivotal in constructing a comprehensive and accurate dataset.

Image 2 Snapshot of Juren List from Tianyige Collection of Ming Provincial Examination Record

Image 3 Snapshot of Text Enhancement Using ComicEnhancerPro

Step 3: Optimizing PDFs for OCR with ABBYY FineReader

During the OCR optimization process, I faced persistent issues with OCR accuracy, despite the substantial improvements in text clarity achieved with ComicEnhancerPro during preprocessing. The transition to OCR presented numerous difficulties until a pivotal discovery was made, facilitated by ongoing consultations with ChatGPT and a rigorous trial-and-error process.

A significant insight concerned the handling of images in OCR software such as ABBYY FineReader and Adobe Acrobat. Typically, dragging an image into these applications triggers an automatic conversion to PDF format that silently degrades the image quality. This degradation is easy for the user to miss, yet it can be the root cause of subsequent OCR inaccuracies.

To address this, I introduced a crucial step into our workflow: merging all preprocessed images into a single PDF while ensuring the color image quality parameter was set to 100%. This adjustment is vital for preserving the pristine quality of the preprocessed images throughout the PDF conversion, marking a turning point that significantly enhanced the smoothness of the OCR process and markedly reduced the incidence of OCR errors.
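A minimal sketch of this merge step, using the img2pdf library, which embeds images into the PDF without re-encoding them (the same lossless behavior the 100% quality setting is meant to guarantee); the folder and file names are assumptions:

```python
import img2pdf
from pathlib import Path

# Merge all preprocessed page images into one PDF without re-encoding them.
# Folder and file names are assumptions; pages sort by their zero-padded names.
pages = sorted(Path("enhanced_pages").glob("*.png"))
with open("enhanced_volume.pdf", "wb") as out:
    out.write(img2pdf.convert([str(p) for p in pages]))
```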

This refined approach emphasizes the importance of meticulous document preparation for OCR. It highlights the need to maintain image quality throughout every stage of the process to achieve optimal OCR outcomes.

Image 4 Snapshot of Image to PDF Tool for Merging Converted Images to PDF

Step 4: Utilizing Regular Expressions to Decode Text Patterns and Rectify OCR Errors


Image 5: The prompt I created and sent to ChatGPT to generate code for a regular expression


Image 6: ChatGPT’s response to my prompt, with code tailored to my regular-expression needs

Converting textual information from images into a structured format suitable for database inclusion presents unique challenges, especially with historical records like those of the Ming provincial examination candidates. Our goal is to transform text such as “第一名 馬中錫 直隸故城縣學生 易” into a structured format across five columns: “Rank,” “Name,” “Location,” “Status,” and “Specialization.” A minimal code sketch of this parsing logic follows the decoding steps below.

Decoding with Regular Expressions:

Rank Identification: We use the pattern “第X名” (where X is any number or character) to identify the rank. This pattern helps allocate the extracted data to the “Rank” column, effectively categorizing the candidate’s rank.

Specialization Segregation: Specialization is indicated at the end of each record by one of five terms: “春秋,” “禮記,” “書,” “詩,” “易.” A pattern match is used to isolate and assign this data to the “Specialization” column.

Segmenting Name, Place, and Status: Once “Rank” and “Specialization” are identified, the remaining text “馬中錫 直隸故城縣學生” includes the candidate’s name, location, and status. The “Location” is the most identifiable element, thanks to the finite number of administrative divisions. Using comprehensive databases like the Harvard CBDB, which catalogs historical administrative units, we precisely extract location data. For example, “直隸故城縣” is identified as a location within the 直隸 province and assigned to the “Location” column. The text preceding the location is labeled as the “Name,” and the text following it as the “Status.”

Adapting to Geographical Variations: Our methodology adapts to accommodate variations in administrative nomenclature across different provinces and historical periods. Regular expressions are tailored to identify and classify different administrative units, such as “府” (prefectural seats) or “衛” (military posts), especially in provinces like Yunnan and Guizhou where administrative control was less direct.

Data Validation: After processing, the dataset undergoes a manual review, particularly of the “Name” column, to ensure accuracy; computational techniques verify the other columns, keeping the process cost-efficient. The finalized dataset includes 26,858 entries with complete information on names, locations, and specializations (the remaining entries contain gaps) and is made accessible through platforms like the Harvard Dataverse. This access is complemented by visualizations of candidate specialization distribution on WorldMap.
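To make the decoding steps above concrete, here is a minimal Python sketch of the parsing logic, assuming a tiny hand-made location list; the actual workflow matched locations against administrative-unit names drawn from the Harvard CBDB, and the production regular expressions were generated with ChatGPT’s help.

```python
import re

# The five Classics that may close a record, and a stand-in location list
# (the real list came from the Harvard CBDB).
SPECIALIZATIONS = ["春秋", "禮記", "書", "詩", "易"]
KNOWN_LOCATIONS = ["直隸故城縣"]

# Rank ("第...名") opens the record; a specialization closes it.
RECORD_RE = re.compile(
    r"^(?P<rank>第\S+?名)\s*(?P<body>.+?)\s*(?P<spec>"
    + "|".join(SPECIALIZATIONS) + r")$"
)

def parse_record(line):
    m = RECORD_RE.match(line.strip())
    if not m:
        return None
    body = m.group("body")
    # The location is the most identifiable element: the name precedes it,
    # the status follows it.
    for loc in KNOWN_LOCATIONS:
        idx = body.find(loc)
        if idx != -1:
            return {
                "Rank": m.group("rank"),
                "Name": body[:idx].strip(),
                "Location": loc,
                "Status": body[idx + len(loc):].strip(),
                "Specialization": m.group("spec"),
            }
    return None

print(parse_record("第一名 馬中錫 直隸故城縣學生 易"))
# {'Rank': '第一名', 'Name': '馬中錫', 'Location': '直隸故城縣',
#  'Status': '學生', 'Specialization': '易'}
```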
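In the same spirit, here is a minimal sketch of the computational checks on the non-Name columns, assuming the parsed records were exported to a CSV using the five column names above (the file name is an assumption):

```python
import pandas as pd

# Flag rows whose Rank or Specialization fails a basic pattern check so that
# manual review can focus on them; the Name column is reviewed by hand anyway.
df = pd.read_csv("tianyige_juren.csv")
ok_spec = df["Specialization"].isin(["春秋", "禮記", "書", "詩", "易"])
ok_rank = df["Rank"].astype(str).str.match(r"^第\S+名$")
print(df[~(ok_spec & ok_rank)])  # rows needing manual inspection
```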

This comprehensive approach, combining the precision of regular expressions with manual verification, demonstrates the potential of digital tools to transform historical textual records into structured, analyzable data, thereby opening new avenues for research in the digital humanities.

Image 7 Snapshot of Excel Table of 27,148 counts of Ming Juren from Tianyige Juren Dataset

Conclusion

Following the steps outlined above, we have successfully created a dataset comprising 27,148 records of Ming dynasty Juren. This dataset is instrumental for studying various aspects of the examination candidates, such as their specialization choices, and provides insights into the ranking and status of Juren. Although it represents only about a quarter of the total estimated Juren population and may not fully illustrate the regional distribution, it stands as the largest and most accurate firsthand source on Juren candidates from the Ming dynasty to date. Further details on the compilation of the complete dataset of 97,700 Juren candidates are discussed in Part II.

____

Author Biography: Jiajun Zou is a Ph.D. candidate in History at Emory University. His dissertation examines the provincial examination candidates of the Ming dynasty, focusing on regional performance disparities. He argues that the outcomes of these examinations are random and that regional performance gaps are influenced by non-human factors such as geography and transportation. Zou poses a simple yet crucial question: Why are powerful examination centers found only in certain southern provinces and not elsewhere in China?

