The Digital Repository at CHM
If you’ve heard about digital preservation in the news, chances are it was a story about loss or potential loss. For almost 20 years now, we’ve been warned that we could be facing a “digital dark age,” meaning that very little of what we create using computers will be preserved for future generations.[1] The challenges facing digital preservation are real, but all too often these warnings have not been paired with examples of what people in the field have been doing to face them.
In this post, I’m going to talk about some of the steps the Computer History Museum (CHM) has taken in recent years to ensure that it will be able to preserve its already substantial and growing digital collections, focusing specifically on the Museum’s new digital repository.
CHM’s Digital Collections
First, a bit of background. The Museum’s digital collections fall roughly into three categories:
1. Donations of material in already digital formats
This includes both software (applications, code, etc.) and data (word processing documents, image files, and so on). Much of it is stored on older physical media (tape, cartridge, floppy disk, etc.) that needs to be read with specialized equipment, although we are increasingly seeing donations arrive in the form of files on modern hard drives or downloads from shared cloud storage folders.
2. Digital copies made from physical items
These range from image scans of photographs and paper documents to audio and video files created through digitizing items from our audiovisual collection.
3. Museum-created digital content
The Museum records all of its oral histories and live programs, along with some special events such as the Fellow Awards and Exponential Center launch. The oldest recordings are on tape formats, but for many years now this production process has been entirely digital. Video is recorded onto cards in cameras and then copied onto hard drives for editing. There is no longer a one-to-one correspondence between a given piece of physical storage media and a segment of video.
This third category has been one of the main drivers behind the Museum’s push to build a more robust digital storage and preservation infrastructure. With the adoption of high-definition standards, the volume of in-house video added to the Permanent Collection has grown rapidly in recent years, from just over 5 TB in 2011 to over 10 TB each year from 2013–2015, to over 15 TB in 2016. As 4K video becomes more common, this volume is only going to increase.
Enter the Digital Repository
By 2011, it had become clear that the Museum’s existing storage infrastructure, already straining to keep up with the then much smaller digital collection, was in need of an upgrade. That year, with the aid of a grant from Google.org, the Museum began work on building a digital repository. After extensive preparation and testing, we officially, albeit quietly, launched the repository in April 2015.
The purpose of the repository is to provide stable, redundant, long-term storage for all of the Museum’s digital collections. Although the repository is not exactly a place, it can be seen as analogous to the facilities that the Museum uses to store its physical collection. And just as the Museum has had to develop procedures for managing the physical collection, similar procedures needed to be developed to facilitate the ingest, retrieval, and monitoring of the digital collection.
Indeed, it would be a mistake to look at the repository storage system alone as fulfilling the needs of preservation. Rather, the core of the Museum’s digital preservation system is the combination of the storage system and the processes we’ve put in place to bring material into the repository and keep it accessible. I briefly describe both below.
The storage component
After evaluating a range of options, the Museum decided to build and manage its own storage infrastructure. Without getting into too much technical detail, the repository consists of three servers, all of which are RAID setups using ZFS for the file system and Ubuntu as the operating system. One server operates as the main server, with the other two operating as mirrors. The servers are physically separate, with one located offsite. Additionally, all data is backed up to LTO tape on an incremental basis, with full backups made every six months.
There are many benefits to this design:
- We are not locked into any particular hardware or software vendor’s stack. The principal applications are open-source, and hardware components can be swapped in and out as needed.
- It is ready to scale up. When we first launched the repository, it had a total capacity of 68 TB. Thanks to a generous donation of 4 TB hard drives from Toshiba, we were able to replace all of the original hard drives with larger ones, expanding the capacity to 114 TB. At the time of this writing, approximately 40 TB of this space is already in use, with 2–4 TB being added each month. As we get closer to filling the current capacity, we will expand the repository again.
- There are multiple layers of redundancy. The combination of the RAID setups, ZFS, and the use of physically separate servers, along with the tape backups, means that there are always multiple copies of each file at any given time. This greatly reduces the risk of losing data if one of the servers or backups gets damaged.
- ZFS provides built-in data integrity checking and error correction, which protects against file corruption. This was one of the reasons we chose it over other file systems.
- We have the resources to administer the system. Not to be overlooked among the technical details is the fact that we needed to build a system that we could continue to maintain over time. The most sophisticated storage system in the world isn’t suitable for long-term preservation if you can’t sustain it past the initial building and implementation phase.
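To put the capacity figures above in perspective, here is a rough back-of-the-envelope projection. It is only a sketch: it assumes the growth rate stays constant somewhere within the 2–4 TB per month range quoted above, and the function name is purely illustrative rather than part of any tool we use.

```python
def months_until_full(capacity_tb, used_tb, growth_tb_per_month):
    """Return how many months of headroom remain at a constant growth rate."""
    if growth_tb_per_month <= 0:
        raise ValueError("growth rate must be positive")
    # Never report negative headroom if usage already exceeds capacity.
    return max(capacity_tb - used_tb, 0) / growth_tb_per_month


# Figures from this post: 114 TB total, roughly 40 TB already in use.
CAPACITY_TB = 114
USED_TB = 40

slow = months_until_full(CAPACITY_TB, USED_TB, 2)  # 2 TB added per month
fast = months_until_full(CAPACITY_TB, USED_TB, 4)  # 4 TB added per month
print(f"Headroom: between {fast:.1f} and {slow:.1f} months")
# prints "Headroom: between 18.5 and 37.0 months"
```

In other words, under these assumptions the remaining 74 TB would last somewhere between a year and a half and three years, which is why scaling up is already part of the plan.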
Ingest, retrieval, and management
In some ways, the digital repository is purely data storage: We could send virtually any type of data to it, in almost any arrangement, and the system would replicate it, monitor it, and back it up to tape. But, of course, the requirements of managing a digital collection are higher than that. At the risk of stating the obvious, at a minimum we need to be able to track each item we’ve sent to the repository, check if it is still there, and easily retrieve a copy of it when needed.
This is where another application comes in: Archivematica. Archivematica is open-source digital preservation software designed for libraries, archives, and museums. Archivematica is not itself a storage system, but is meant to be used in combination with a storage system. As we’ve implemented it, Archivematica acts as a sort of additional layer on top of the digital repository.
We use Archivematica to:
- Package all items sent to the repository. An Archivematica “package” is really just a standardized directory structure. It includes the files being stored, metadata about those files, and logs related to the packaging process itself. Every file in a package, along with the package itself, is given a unique identifier, and all metadata is stored using established standards.[2] Furthermore, the package’s directory structure itself conforms to a standard known as the BagIt specification, which was originally developed by the Library of Congress to make it easy to verify that the contents of a directory have not changed after being transmitted from one location to another.
- Send packages to the repository. Everything sent to the repository goes through Archivematica first, which makes a record of each ingest. This lessens the “out of sight, out of mind” risk that can come up when people store files on digital media and then forget how to find them again.
- Search and retrieve packages and files. Archivematica indexes every package it stores, making it retrievable through search. It is possible for staff to download either entire packages or specific files from within packages. To facilitate searching the physical and digital collections at the same time, we rely on Mimsy, our central collections management database, which stores all of our accession and catalog records. Mimsy records and Archivematica packages are linked through their unique identifiers.
- Check package integrity and completeness. Archivematica incorporates a tool that will validate any package conforming to the BagIt specification. Given the unique ID of a package, the tool will report back whether the package has changed. If it has, the tool will describe how the package differs from what was expected. This checking is complementary to the lower-level error checking built into ZFS, which will report if something is corrupt, but not if a file is intentionally modified, added, or deleted.
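For readers curious about what this kind of package check involves, here is a minimal sketch of the underlying idea. It is not the tool Archivematica uses. It assumes a bag with a `manifest-sha256.txt` file listing one checksum and one relative path per line, with the stored files under a `data/` directory, as the BagIt specification describes. The function names are hypothetical, and the sketch ignores parts of the specification such as tag manifests and encoded file names.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_bag(bag_dir, manifest_name="manifest-sha256.txt"):
    """Compare a bag's payload against its manifest.

    Returns a dict listing files that are missing, changed, or unexpected.
    """
    bag = Path(bag_dir)

    # Each manifest line holds a checksum followed by a path relative to the bag.
    expected = {}
    for line in (bag / manifest_name).read_text(encoding="utf-8").splitlines():
        if line.strip():
            checksum, rel_path = line.split(maxsplit=1)
            expected[rel_path.strip()] = checksum.lower()

    report = {"missing": [], "changed": [], "unexpected": []}
    for rel_path, checksum in sorted(expected.items()):
        target = bag / rel_path
        if not target.is_file():
            report["missing"].append(rel_path)
        elif sha256_of(target) != checksum:
            report["changed"].append(rel_path)

    # Payload files live under data/; anything there that the manifest
    # does not list was added after the bag was created.
    for path in sorted((bag / "data").rglob("*")):
        rel_path = path.relative_to(bag).as_posix()
        if path.is_file() and rel_path not in expected:
            report["unexpected"].append(rel_path)
    return report
```

Calling `check_bag` on a package directory returns three empty lists when nothing has changed. Any entry under `missing`, `changed`, or `unexpected` corresponds to the kinds of differences described above: a file that was deleted, modified, or added since the package was created.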
To sum up, everything added to the digital collection is:
- Packaged, via Archivematica, into a standard directory structure and then sent to the repository.
- Replicated, within the digital repository, on physically separated hardware, and backed up to tape.
- Regularly monitored and checked for errors.
- Linked, using identifiers, to corresponding records in Mimsy, the Museum’s central collections management database.
By continuing to rely on Mimsy as the central cataloging system, we have been able to avoid needing to maintain two separate cataloging systems, one for physical material and one for digital. This also makes it possible to ingest material into the repository sooner rather than later, as items do not need to be fully cataloged and described before they can be ingested.
In the 18 months that the repository has been in production, we have been able to ingest over 40 TB of material, most of it video. There is still quite a lot of work to do to migrate our existing collections from legacy storage into the new system, but we are well underway. And with this baseline of bit-level preservation established, we are able to commit more time to working on some of the more complex challenges of digital preservation, such as how to maintain the ability to render file formats in the future and how to preserve and execute old software, topics that will be the subject of future blog posts.
Can we guarantee that this repository infrastructure will last forever? To be honest, forever is a long time and technology changes rapidly enough that it’s hard to believe that the Museum will be running the same infrastructure, just with newer hardware, in 20 or possibly even 10 years. But what we can say is this: As long as the Museum exists, it will be committed to preserving the digital collection, whatever the form that preservation infrastructure will take.
Putting It All Together
Since all of that may still seem a bit abstract, I want to end with two concrete examples from the Museum’s moving image collection. The first is the set of videos from the 1986 ACM Conference on the History of Personal Workstations, which we posted online earlier this year. Almost all of these videos exist in the Museum’s collection only in the form of U-matic tapes. U-matic is a once-common format that is still readable today with the right equipment.
The Museum, however, has not been able to read U-matics in-house for some years, and as a result sent the tapes to the Bay Area Video Coalition to be digitized. Until the digitization was completed, there was no way of knowing whether the videos would be playable. But now that they have been digitized, they are readily available for use and re-use. There is also a very real chance that the content of these tapes, as video files, will be preserved and remain readable for longer than the physical tapes themselves, which depend on the continued availability of U-matic playback equipment. Finally, because the files are in the repository, we can check on the package at any given time to make sure that it remains complete.
The second example I want to highlight is from CHM’s Revolution exhibition. The Museum shot a large amount of video for use in the exhibit, only a fraction of which made it into the final videos that you can see in person or online. For example, in the gallery for Computer Graphics, Music, and Art, you can watch about three minutes of Max Mathews, one of the most influential figures in the history of computer music, demonstrating his radio baton. This video shows just a portion of a longer demonstration, which you can watch below or on our YouTube channel.
This video exists only in digital form. Shot in HD in 2010, it was never recorded onto a physical tape. It has been accessioned into the collection, cataloged, and ingested into the digital repository, thus ensuring that people will be able to find and view it in the future, whether or not YouTube continues to exist.
That, I think, is the ultimate value of having the repository: not preservation just for preservation’s sake, but for the sake of being able to provide continuing access into the future.
Building, and now maintaining, the repository has truly been a team effort. I’d like to thank everyone on the digital repository team for their hard work: Paula Jabloner, Al Kossow, Edward Lau, Ton Luong, German Mosquera, and Vinh Quach.
[1] The earliest publication I could find using the metaphor of a “digital dark age” is Terry Kuny, “A Digital Dark Ages? Challenges in the Preservation of Electronic Information” (1997).
[2] More technical documentation of the Archivematica archival package structure is available in the Archivematica documentation. The metadata standards used are Dublin Core, METS (Metadata Encoding and Transmission Standard), and PREMIS (preservation metadata).