Principles for Creating Lasting Digital Archives: A Researcher’s Perspective

This is a guest post by Amy E. Harth, PhD. Bio at the end of the post.

Cover photo by Joshua Sortino on Unsplash

Digital archives are a type of digital humanities project used by an increasing number of researchers. Researchers rely on digital archives when in-person archives do not exist or are otherwise unavailable to them and, increasingly, because the nature of their research requires digital archives (e.g., research on electronic and internet-based files).

As a humanities researcher, my experience with digital archives is from the user’s perspective. As such, I lack the technical proficiency of those who create digital archives. At the same time, I have found attention to specific characteristics of digital archives to be important. These characteristics are reflected in the four principles below, which organize my perspective on what researchers are looking for in digital archives. My hope is that this information helps researchers, digital archivists and technologists communicate about their needs and limitations to build more useful and sustainable archives.

1. Validity

When examining an archive, the first question most researchers ask is whether the information is relevant to their work. Relevance is, of course, a personal judgment as related to their specific project. Yet, even relevant archives can be difficult to use if the search results, organization or files seem questionable.

Establishing archival validity has two main components. First is the reputation of the archive. This can be established by examining the credibility of the creators and their affiliated organizations. Credibility can also be assessed through design. Designs that look particularly dated, sketchy advertising posing cybersecurity risks, and other elements that make an archive look outdated or haphazard diminish – or even eliminate – credibility.

As shown in Figure 1 below, the About Us page on the website for the Genocide Archive of Rwanda provides information about the Aegis Trust, Kigali Genocide Memorial and the archive and documentation department to establish credibility as an independent non-governmental archive.

Figure 1: The About Us page at the Genocide Archive of Rwanda.

Second is the usefulness of the search and the organization of the information the archive returns. A valid search returns results that meet the search criteria and does not return results that fail them. In many search engines, the Boolean operator “NOT” has limited effectiveness. For example, when searching for archival materials about African people or countries but not about African Americans, many search engines seem unable to apply the “NOT” operator. This makes the search results less valid and accurate. In these situations, it can be unclear whether any results pertain to the actual search topic, as the results may be dominated by items that were not properly excluded. If the search returns 10,000 results, it can be impossible to tell what percentage concerns the actual search term and what percentage is mixed-in material.

To address this issue, archival records can include better metadata to support their search engines. For example, if searching an archive for videos of commercials, each video can be labeled with categories reflecting the people and places it is about, as well as the country and continent where the video was created and where it was distributed. This would help reduce both false positive and false negative results. If a UK advertising firm created a video about McDonald's that was distributed in South Africa, then it should be possible to search under distribution by continent (Africa) and find this result. Similarly, if the video was about McDonald's with a safari theme set in South Africa but distributed in Australia, then adding this title/topic detail would help find this item when searching for Africa in the topic and/or title category.
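The labeling idea above can be sketched in a few lines of Python. This is only an illustration, not how any particular archive works: the `VideoRecord` fields, the `search` function and the two sample records are hypothetical names and data made up for this example. The point is that when topic and distribution are stored as separate fields, a search can match or exclude on each one without guessing from free text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoRecord:
    """One archive item with separate, explicitly labeled metadata fields."""
    title: str
    topics: frozenset            # people and places the video is about
    created_in: str              # country where the video was produced
    distributed_in: str          # country where the video was distributed
    distributed_continent: str   # continent where the video was distributed


def search(records, *, topic=None, distributed_continent=None, exclude_topic=None):
    """Return the records that satisfy every criterion that was supplied."""
    results = []
    for record in records:
        if topic is not None and topic not in record.topics:
            continue
        if (distributed_continent is not None
                and record.distributed_continent != distributed_continent):
            continue
        # A "NOT" criterion is an exact check on a labeled field.
        if exclude_topic is not None and exclude_topic in record.topics:
            continue
        results.append(record)
    return results


# Hypothetical sample records, loosely based on the two cases described above.
records = [
    VideoRecord("Burger ad", frozenset({"McDonald's"}),
                "United Kingdom", "South Africa", "Africa"),
    VideoRecord("Safari burger ad", frozenset({"McDonald's", "Africa", "South Africa"}),
                "United Kingdom", "Australia", "Oceania"),
]

by_distribution = search(records, distributed_continent="Africa")
by_topic = search(records, topic="Africa")
not_about_africa = search(records, topic="McDonald's", exclude_topic="Africa")
```

Here `by_distribution` contains only the video distributed in South Africa, `by_topic` contains only the safari-themed video, and `not_about_africa` shows an exclusion that works because it tests a labeled field rather than words in a description.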

Helping ensure that search results, particularly results regarding groups of people, are more accurate would be especially valuable for digital humanities research since it focuses on people.

Digital archives can also explain their metadata to assist users, as shown in Figure 2 below from the Queer Zine Archive Project About page: 

Figure 2: The Queer Zine Archive Project About page.

2. Reliability

To make digital archives useful for research, search results need to be consistent. Consistency refers to the ability to conduct the same search and get the same (or similar) results, especially over time. If a researcher builds their research on the premise that the archive contained 50 images of people being vaccinated against polio from 1955-1960 as of 2020, it should be possible to repeat the search using date delimiters and find 50 records.

To make this possible, there should be two date options that work together. In the example above, there should be an image date field where the researcher would select 1955-1960 and a date-added-to-archive field that could restrict results to 2020 and before. With these delimiters, the search should continue to return 50 results even as new data is added to the archive.
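A minimal Python sketch of the two date filters working together follows. The `ImageRecord` fields, the `search_as_of` function and the sample records are assumptions made up for illustration, and the sketch assumes existing records are never deleted or re-dated.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ImageRecord:
    identifier: str
    image_date: date   # when the photograph was taken
    added_on: date     # when the record was added to the archive


def search_as_of(records, image_start, image_end, added_on_or_before):
    """Filter on the image date range and on the date-added-to-archive cutoff."""
    return [
        r for r in records
        if image_start <= r.image_date <= image_end
        and r.added_on <= added_on_or_before
    ]


def run_search(records):
    """The researcher's search: images from 1955-1960, as the archive stood in 2020."""
    return search_as_of(records, date(1955, 1, 1), date(1960, 12, 31), date(2020, 12, 31))


# Hypothetical archive contents.
archive = [
    ImageRecord("img-001", date(1956, 4, 12), date(2018, 3, 1)),
    ImageRecord("img-002", date(1959, 9, 30), date(2020, 11, 15)),
    ImageRecord("img-003", date(1962, 1, 5), date(2019, 6, 20)),   # outside the image date range
]

before = run_search(archive)

# A matching image is added to the archive after 2020.
archive.append(ImageRecord("img-004", date(1957, 7, 4), date(2023, 2, 10)))

after = run_search(archive)
```

Because the second filter pins the search to records added by the end of 2020, `before` and `after` hold the same two records even though the archive has grown.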

Date fields should have multiple options, such as year, month and year, full date, date range, and before or after specified date as shown in Figure 3 below in the date field from the University of Wisconsin Libraries American Geographical Society Library Digital Photo Archive – Africa.

Figure 3: Example date field from the University of Wisconsin Libraries American Geographical Society Library Digital Photo Archive – Africa.

Furthermore, some library databases have issues with loading results. Their results pages use what is known as an “infinite scroll,” which reloads the page and jumps the user to new results, making it difficult to review them. Researchers may be unsure whether they saw all the relevant results and may have less confidence that other researchers would be shown the same results. Infinite scroll is also a known accessibility issue. A better option is to load individual results pages with a specific, adjustable number of items per page. Common options are 25, 50 and 100 results per page, which can help address the infinite scroll reliability issue, as shown in Figure 4 below from the Daily (Liberian) Observer Digital Archive.

Figure 4: Results per page options from the Daily (Liberian) Observer Digital Archive.
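For readers who build archives, the results-per-page approach can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of any archive named here: the `paginate` function and the item identifiers are hypothetical, and the results are assumed to be sorted in a fixed order so that a given page always shows the same items.

```python
import math


def paginate(sorted_ids, page, per_page=25):
    """Return the items on a 1-based page number, plus the total page count."""
    if per_page not in (25, 50, 100):
        raise ValueError("per_page must be 25, 50 or 100")
    total_pages = max(1, math.ceil(len(sorted_ids) / per_page))
    if not 1 <= page <= total_pages:
        raise ValueError(f"page must be between 1 and {total_pages}")
    start = (page - 1) * per_page
    return sorted_ids[start:start + per_page], total_pages


# 120 hypothetical search results in a stable, sorted order.
result_ids = sorted(f"item-{n:04d}" for n in range(1, 121))

page_items, total_pages = paginate(result_ids, page=3, per_page=50)
```

With 120 results at 50 per page there are three pages, and the third page holds the last 20 items. A researcher can cite "page 3 of 3 at 50 results per page" and another researcher can check the same page, which is not possible with a continuously loading scroll.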

3. Transparency

To show that a digital archive is credible and useful, it should provide, at a minimum, information about three elements: how the archive was created and by whom, how it is updated, and how to search it. Transparency reinforces validity and reliability. When it is clear how the information was obtained and by whom, researchers can better judge its accuracy, which establishes credibility. Furthermore, explaining the update process helps researchers explain their methods to other scholars. For example, the archive can explain in a dedicated section that all archive materials, including new materials, carry the date on which they were added to the archive, which helps make search results reproducible. Finally, if changes that interfere with reproducibility need to be made, explaining them helps scholars who are looking for materials mentioned in other researchers’ work understand why those materials may not be found using the same methods.

As shown in Figure 5 below, the Trans-Atlantic Slave Trade Database has a detailed section on methodology, providing comprehensive transparency for researchers.

Figure 5: The Trans-Atlantic Slave Trade Database methodology section.

4. Longevity

As part of planning and budgeting, consider the changing software and systems landscape. How will the archive need to communicate with other technologies in the future? What methods are changing? What service providers are changing protocols? How can archivists plan for these changes rather than be surprised that archives are no longer interoperable? While this may be behind-the-scenes information for many researchers, explaining the platforms used can be part of establishing credibility and confidence in the creation and update process. If this seems too technical for the intended audience, including a technical specs section, or offering a contact for technical questions about the archive, may address concerns about the longevity of the systems supporting the archive.

To ensure a truly sustainable digital archive, creators should work toward including a 50-year website maintenance plan in their digital archive budget. At a minimum this is to maintain the website, including hosting and storage, “unchanged” from a last activity date so that the digital archive doesn’t disappear. Budget-wise, fees for most modest digital archives should not be unreasonable. Fees for archives with high storage or high power demands may be more challenging.

One of the most difficult parts of doing research based heavily (or solely) on digital archives is that archives may disappear suddenly without notice. This can derail work in progress, and, for completed work, prevent other scholars from visiting materials consulted and/or reproducing the researchers’ conclusions. A best practice is to communicate a longevity plan as part of the creation and updating process. How many years have been secured in the budget? What funding is needed to extend the longevity plan? Empower researchers using the archive to advocate for its sustainability.

____

Amy E. Harth, PhD (they), is a white, queer, non-binary, fat, disabled anti-oppression scholar-activist. As principal of Amy Harth Coaching and Consulting, they use the latest research on equity and oppression to help leaders achieve their strategic goals. Connect with Amy on LinkedIn.
