Assignment: Data Collection Policies

Introduction

Most data repositories distinguish among self-deposit, submission-based, and opportunistic data collection.

  • Self-deposit is exactly what it sounds like - data producers or collectors upload their data directly to a repository. Curation takes place post hoc to make sure that there are no viruses, that data are within copyright protections, etc. (Curation tasks such as cleaning data, creating additional metadata, and structuring data are not typically applied to these kinds of collections.)
  • Submission repositories require users to propose a deposit and follow strict guidelines before their data are accepted. Curators in this role act as gatekeepers for what comes into the repository, and also perform preliminary work that will lead to more useful data over the long term.
  • Opportunistic collection is also exactly what it sounds like - data curators serve as the collectors of data and thus are expected to both evaluate and prepare the data for eventual deposit into a long-term storage repository.

Regardless of how data are acquired, there should be clear criteria for which kinds or types of content are relevant to a repository, and which are not. These criteria might also include the size, formats, and even subjects of a data collection. Libraries and archives have historically referred to these kinds of policies as collection development. Data collections (or data repositories) differ in their collection policies in that they are concerned not only with what types of data should be collected, but also with the form in which data should be accepted and the additional resources that should accompany them. In short, a data collection policy brings together the readings and lecture we did around data cleaning, packaging, and sharing.

Here are some valuable examples:

Protocol Deliverable
Establish the policies that will govern your data collection. This should include (at minimum):

  • What are the criteria that make a dataset, or collection, relevant to your protocol? What, if anything, might exclude a relevant dataset?
  • What will you require to be included for a minimum viable package of data? (E.g., the data must have a clear license, a format that is (or can be converted to) an open standard, and enough contextual documentation that structured metadata can be created.)
  • What formats will you accept, and what formats will you not accept?
  • How big can a collection be (in size)?
  • What kind of preservation guarantees will you make for each format?
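
One way to test whether your answers to the questions above are specific enough is to ask whether they could be written down as explicit checks. The sketch below is purely illustrative: the format lists, the size cap, the field names, and the evaluate function are all hypothetical placeholders chosen for this example, not recommendations and not drawn from any particular repository's policy.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical policy values; replace these with the choices in your own protocol.
ACCEPTED_FORMATS = {"csv", "json", "txt"}   # assumed: open formats accepted as-is
CONVERTIBLE_FORMATS = {"xlsx": "csv"}       # assumed: accepted if converted to an open format
MAX_SIZE_BYTES = 5 * 1024**3                # assumed cap of 5 GiB per collection


@dataclass
class Submission:
    """A proposed deposit, described by the minimum the policy needs to know."""
    title: str
    file_format: str
    size_bytes: int
    license: Optional[str] = None
    has_documentation: bool = False


def evaluate(sub: Submission) -> List[str]:
    """Return the reasons a submission fails the policy (empty list = acceptable)."""
    problems = []

    # Minimum viable package: a clear license.
    if not sub.license:
        problems.append("missing license")

    # Accepted formats: open formats, or formats that can be converted to one.
    fmt = sub.file_format.lower().lstrip(".")
    if fmt not in ACCEPTED_FORMATS and fmt not in CONVERTIBLE_FORMATS:
        problems.append(f"unsupported format: {fmt}")

    # Size limit for a collection.
    if sub.size_bytes > MAX_SIZE_BYTES:
        problems.append("exceeds size limit")

    # Minimum viable package: enough context to create structured metadata.
    if not sub.has_documentation:
        problems.append("missing contextual documentation")

    return problems
```

For example, evaluate(Submission("Survey", "CSV", 1024, license="CC-BY-4.0", has_documentation=True)) returns an empty list, while a submission with no license or documentation returns one reason per failed check. Relevance criteria and preservation guarantees are harder to reduce to checks like these, so your written policy should still address them in prose.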

You should feel free to LIBERALLY borrow from existing repository policies or protocols that we have looked at this quarter. You should also feel empowered to look at other protocols or other repositories to decide how to answer some of these questions.