Repository Architectures

Original Author: Nic Weber
Editing & Updates: Bree Norlander

In the most generic sense, data repositories provide for the publication and long-term preservation of data. As a sociotechnical infrastructure, data repositories also play a key role in data discovery, in data packaging, and in the day-to-day work of curation. In this chapter I will build upon DC 1’s discussion of a data repository as a layered “architecture” for curation. In doing so I will introduce the ISO standard for an Open Archival Information System (OAIS), and relate it to contemporary repository software architectures that facilitate data publishing, access, citation, and preservation.

Repositories

The last 20 years have seen data increasingly published to the web as structured information, free for sharing and reuse. We have, thus far, discussed multiple innovations that have made this increase in data collection and publishing possible, including how data are practically stored, retrieved, and packaged for reuse. Early efforts at increasing data access focused specifically on how to embed data in electronic publishing environments (Abiteboul et al., 2000) and how to provide programmatic access to data that were stored on remote servers (Richardson et al., 2013). Over the last decade technologies have been developed to better connect different components of the data publication lifecycle - moving from a small number of hard-to-use proprietary repositories to a diverse range of (slightly easier to use) open-source options. These repositories depend on an “architecture” - that is, a complex and highly coordinated integration of software, hardware, and human services.

Layers of a Data Repository

In Data Curation 1 we discussed the idea that repositories are a series of technical “layers” or a “stack” of technologies - each layer consists of a set of services and interfaces that allow data to be reliably preserved and published for reuse. The layers of a repository are, roughly, as follows:

Data Preservation + OAIS

Throughout the course we have defined data as “information objects playing the role of evidence.” At the lowest level of abstraction all digital “information objects” are a binary sequence of 1’s and 0’s - that is, they are bits of information that are encoded on storage media. At the hardware level, data repositories provide a way for data (as encoded binary information) to be reliably stored on media such as an optical or spinning disk. At the software layer, a data repository provides an interface so that curators can manage and prepare data for archiving at the hardware layer.

The importance of the preservation component of a layered repository architecture can’t be overstated - it is this combination of hardware and software that enables data repositories and curators to reliably provide for long-term archiving of data such that data remain accessible indefinitely.
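One common way to check that stored bits have not changed over time is to record a checksum when data are deposited and recompute it later. The following is a minimal sketch of that idea, not a description of how any particular repository implements it. The function names and the file name survey.csv are hypothetical and chosen only for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_digest: str) -> bool:
    """Compare the file's current digest with the one recorded at deposit."""
    return sha256_of(path) == recorded_digest


def demo() -> tuple:
    """Record a digest, then check the file before and after a simulated change."""
    with tempfile.TemporaryDirectory() as folder:
        data_file = Path(folder) / "survey.csv"  # hypothetical file name
        data_file.write_bytes(b"id,answer\n1,yes\n")
        recorded = sha256_of(data_file)  # digest stored at deposit time
        before = is_unchanged(data_file, recorded)
        data_file.write_bytes(b"id,answer\n1,no\n")  # simulate altered bits
        after = is_unchanged(data_file, recorded)
    return before, after


if __name__ == "__main__":
    print(demo())
```

The digest is stored alongside the data, so a later mismatch signals that the bits on the storage media are no longer the bits that were deposited.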

To better understand the services and functionalities of a repository in terms of preservation, I will briefly introduce and discuss the Open Archival Information System (OAIS).

The OAIS is a conceptual model (sometimes called a “reference” model) for describing the design and responsibilities of reliable long-term preservation repositories. OAIS was developed in the late 1990s through the Consultative Committee for Space Data Systems (CCSDS), with NASA data curators playing a leading role, and has since been adopted as an International Standard (ISO 14721).

As a conceptual model, OAIS does not specify particular software or hardware requirements, but instead describes the core functions necessary to carry out reliable data preservation in any repository. In short, OAIS provides a shared vocabulary that data repository designers, curators, and administrators can use to describe, in generic terms, the practical services and technical components that are necessary for reliable long-term preservation.

The OAIS literature can be a bit dense and difficult to comprehend on first read. I will attempt to give just a preliminary overview of this conceptual model below. (If you are interested in this kind of content, you should definitely take Professor Weber’s Digital Preservation course, where you will discuss these concepts in greater depth.) As emerging professional curators, it is helpful to at least understand the core concepts of an OAIS, and how they apply to the curation services of a data repository. In lecture this week I will describe the OAIS as it applies to a use case from the Qualitative Data Repository (QDR). So, just a warning: if this seems overly conceptual or abstract, be sure to watch the lecture so that you have a concrete working example to which you can apply these terms. (Caveat: Nic Weber is the Technical Director of QDR.)

OAIS Roles

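In an OAIS there are three specific roles that humans play:

Producers - the people who deposit data with the OAIS

Consumers - the people who search for and obtain data from the OAIS

Management - the people who set policy for the OAIS, provide its services, and define and monitor its designated community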

One of the fundamental concepts of an OAIS is a “Designated Community” - this is the specific set of producers and consumers that are served by an OAIS. Management has the responsibility of clearly defining a designated community for an OAIS, and monitoring its emerging needs. In the OAIS a designated community is assumed to have a particular “knowledge base” - this is what that community can be reliably expected to know about the holdings of a data repository. In an OAIS most of the work in generating metadata and packaging data is in service of the designated community’s needs. A curator necessarily has to have a deep understanding of those needs, and has to monitor how they shift over time.

OAIS Information Packages

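In an OAIS all content is described in terms of an “information package” consisting of data, metadata, and a machine-readable log of any preservation actions that have been taken by an OAIS. There are three specific types of information packages in an OAIS:

Submission Information Package (SIP) - the package that a Producer deposits with the OAIS

Archival Information Package (AIP) - the package that the OAIS stores and preserves

Dissemination Information Package (DIP) - the package that is delivered to a Consumer in response to a request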

OAIS Core Services

Any OAIS is expected to carry out six core activities: Ingest, Archival Storage, Data Management, Preservation Planning, Access, and Administration.

Collectively, these services are carried out by the Management role in an OAIS, and often apply to a particular information package as it moves through an OAIS (from Producer to Consumer).

The OAIS Model

The following diagram is a quick (but admittedly busy) overview of the three major components of an OAIS (Roles, Information Packages, and Services). The diagram should be read moving from right to left - that is, Producers deposit a SIP, Management transforms a SIP into an AIP and also provides services (within the OAIS), and Consumers request AIPs that are delivered as a DIP. (See - I told you there was a lot of jargon.) All core services are performed inside the OAIS (this is denoted by the black-lined boundary separating Consumers and Producers from the repository).
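The movement of a package from Producer to Consumer can also be sketched in a few lines of code. This is only an illustration of the SIP, AIP, and DIP vocabulary under simplifying assumptions: the class InformationPackage, the functions ingest and disseminate, and the sample values are all hypothetical, and a real repository would do far more at each step.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class InformationPackage:
    kind: str        # "SIP", "AIP", or "DIP"
    data: bytes      # the content being preserved
    metadata: dict   # descriptive information about the content
    log: list = field(default_factory=list)  # record of actions taken


def ingest(sip: InformationPackage) -> InformationPackage:
    """Management turns a deposited SIP into an AIP and logs the action."""
    if sip.kind != "SIP":
        raise ValueError("ingest expects a SIP")
    checksum = hashlib.sha256(sip.data).hexdigest()
    metadata = {**sip.metadata, "sha256": checksum}
    log = sip.log + ["ingest: checksum recorded, package stored as AIP"]
    return InformationPackage("AIP", sip.data, metadata, log)


def disseminate(aip: InformationPackage, consumer: str) -> InformationPackage:
    """Management prepares a DIP from an AIP in response to a request."""
    if aip.kind != "AIP":
        raise ValueError("disseminate expects an AIP")
    log = aip.log + [f"access: DIP prepared for {consumer}"]
    return InformationPackage("DIP", aip.data, dict(aip.metadata), log)


if __name__ == "__main__":
    # A Producer deposits a SIP (sample values are made up).
    sip = InformationPackage("SIP", b"interview transcript", {"title": "Example deposit"})
    aip = ingest(sip)
    dip = disseminate(aip, "consumer@example.edu")
    for entry in dip.log:
        print(entry)
```

Each function returns a new package rather than changing the one it receives, so the AIP held by the repository is never altered by a Consumer's request.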

To reinforce one important point about the OAIS model - as depicted in this diagram - it is conceptual. It provides a reference language that is not specific to any one repository, but can be used to describe roles, services, and information packages that are managed by ANY data repository.

Data Repository Frameworks

As I mentioned at the beginning of this chapter, the curation community has over the last decade developed a number of repository frameworks that make the day-to-day work of curating, publishing, and preserving data practically possible. The differences between these frameworks are often marginal - they each use a slightly different set of hardware and software components to carry out specified functions of an OAIS. But these marginal differences have important implications for institutions in selecting, implementing, and running a data repository. A repository framework like CKAN, for example, is content agnostic - it doesn’t have any specific features that are developed for a particular designated community. Instead it is a highly modifiable open-source technology that can be used by governments, scientific institutions, or industry groups that have access to general hardware. By contrast, a repository framework like Dataverse is developed specifically for social science data. Dataverse provides functionality to social science data curators who need to implement specific metadata standards when creating an AIP, and is configured to run on the type of hardware that is often available to a university IT staff. Dataverse also has features that allow Consumers and Producers to register for OAIS services based on their university credentials (e.g. that they are a verified member of a particular university). This registration process is an important way for Management to implement what an OAIS describes as Access controls - who can deposit data, who can search for and obtain data, etc.
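To make the idea of Access controls a little more concrete, here is a minimal sketch of a policy check. It is not Dataverse's permission model or any framework's actual API. The POLICY table, the User class, and the is_allowed function are hypothetical, and the rule that only users with a verified affiliation may act is an assumption made for the example.

```python
from dataclasses import dataclass

# Hypothetical policy set by Management: which role may perform which action.
POLICY = {
    "deposit": {"Producer"},
    "search": {"Producer", "Consumer"},
    "download": {"Consumer"},
}


@dataclass(frozen=True)
class User:
    name: str
    role: str                   # "Producer" or "Consumer"
    verified_affiliation: bool  # e.g. confirmed university credentials


def is_allowed(user: User, action: str) -> bool:
    """Allow an action only for verified users whose role the policy lists."""
    if not user.verified_affiliation:
        return False
    return user.role in POLICY.get(action, set())


if __name__ == "__main__":
    depositor = User("A. Researcher", "Producer", verified_affiliation=True)
    visitor = User("Unverified Visitor", "Consumer", verified_affiliation=False)
    print(is_allowed(depositor, "deposit"))
    print(is_allowed(visitor, "download"))
```

Keeping the policy in a single table separates what Management decides (the rules) from how the software enforces them (the check), which mirrors the split between policy and technology described in this chapter.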

The three most widely used data repository frameworks that you are likely to encounter are Fedora, CKAN, and Dataverse. Fedora, the framework we haven’t yet discussed, is highly extensible. This means that Fedora provides a core set of features that have, in turn, a core set of hardware dependencies. These core features can be modified and tailored by any institution that has unique needs for serving a “designated community”. This extensibility is a major feature of Fedora, and has resulted in what can seem like a dizzying array of repository frameworks that differ by name and feature set. Mike Giarlo, a data curation developer, has helpfully provided an overview of the evolution of Fedora in the image below.

I provide this so that you can get a quick overview of the ways a repository framework like Fedora can be modified and reused across institutions based on their specific needs for serving a designated community. It is worth noting that even as someone who has worked in this field for 10 years, it wasn’t until Mike provided this diagram that the relationship between different repositories was even remotely clear to me. So, if reviewing and reading documentation about repositories seems confusing - welcome to the club!

The important concept to take away from this brief overview of existing frameworks available to data curators is that there are multiple ways to practically set up a storage and preservation environment. The particular services and roles that are played within a repository are shaped by what the framework is designed primarily to achieve (e.g. serving one particular community versus another), but ultimately Management in an OAIS has the discretion to set policies and to govern data as they best see fit. It is in this sense that a repository is a sociotechnical infrastructure - it is the combination of people and technologies that practically serves designated communities.

Summary

This module has reviewed the major architectural components of a data repository, including the software, hardware, policy, and governance “layers”. In describing the relationship between hardware and software layers we looked specifically at a conceptual model, the Open Archival Information System (OAIS), to better understand how preservation is carried out between different stakeholders - including Consumers, Producers, and Management. Less abstractly, I also introduced the idea of a repository as a framework that is practically implemented with specific hardware and software configurations that are meant to serve a particular designated community. In the readings below, there are links to the technical specifications for each of these repository frameworks - I highly encourage you to pick one of these and skim the documentation to understand exactly how concepts in the OAIS model are implemented.

Chapter References

Lecture

Readings

Required

Read this very brief description of digital libraries (cyberinfrastructure) from the National Science Foundation program launched in the mid-1990s. It provides a nice account of how data repositories have emerged from early digital library funding: Link

Now, let’s fast forward to the current landscape, in which many different repository software platforms are available. The following post from the IQSS staff at Harvard’s Dataverse provides an excellent table comparing existing data repository services. Pay attention to the categories being compared, and how these relate to the affordances of the software (also see the ‘optional readings’ section for more literature like this). Link

Next, read this “overly honest” report from the University of Illinois Library (the largest circulating academic library in the world) on developing a data repository:

Review documentation for just one repository platform listed below. Be sure to also look at an example of the platform’s deployment.

Samvera (Open-source repository for universities and institutional repositories)

Dataverse (Open-source repository for social science data)

Fedora (Open-source repository with semantic capabilities - often used by science repositories)

CKAN (Open-source data repository - often used for civic data)

Clowder (Open-source repository for long-tail data)

Additional optional reading:

Exercise

There is no exercise this week. However, if you would like to apply some of the concepts that have been reviewed in the chapter and in lecture, you can see this optional component of your protocol assignment for selecting and evaluating a repository framework.