Course Overview

Data Curation II examines a broad range of practical and conceptual issues in the field of data curation. As the second in a sequence of courses offered at the UW iSchool, Data Curation II offers an in-depth study of the semantics and structure of data as an information object used in scholarship and society. This knowledge will be useful when applied to practical activities such as data normalization, cleaning, or packaging for sharing and reuse. We will also pay particular attention to the repository architectures (technical design) and information infrastructures (technological and social connectivity) that constrain and afford data access. Throughout the quarter we will draw on research, case studies, and practical working examples in order to examine key challenges in the field. See the video below for an extended introduction to the course structure this term.

Graduate School and a Pandemic

The last time this course was taught, it was the very beginning of the pandemic for those of us in the United States. No one knew what was going to happen but we knew that life was going to be different. Professor Weber wrote this to his Spring 2020 class:

There is no easy way to put this - it's going to be a challenging quarter. I usually put a note in my syllabus that says something along the lines of "...coursework occasionally needs to take a second, third, or fourth priority in your life. All of us have commitments that exceed the expectation of being fully present each week, and I don't expect you to be operating at 100% engagement each day." Even this qualification seems out of touch given the current climate we are operating in for the Spring 2020 quarter. There will be demands on your time, attention, emotions, and cognitive ability to engage with substantive material over the next 10 weeks. There are many unknowns that we are each managing. I have relaxed my own personal expectations for this quarter, and I encourage you all to do the same. This doesn't mean that I will give anything short of 100% effort to delivering a high quality course, but let's operate with a healthy and charitable amount of good will towards each other and towards ourselves for the next 10 weeks.
I encourage you all to read my colleague Anna Hoffman's very personal and incredibly eloquent take on the challenges that we face in the quarter that lies ahead.

In one year a lot has changed and a lot hasn’t. All I really want to say is that I plan to be flexible, understanding, and to give as much as I possibly can to this class. Please reach out to me with feedback about workload, individual needs, and struggles (and of course let me know what is going well!) and I will see where we can make changes.

Accessibility

Workload and Expectations for Engagement

My expectations for you are that you engage with the content, meet with me regularly either individually or as a group, and keep me up-to-date on your progress. I know this may sound vague as a participation rubric, but if you meet these criteria you will receive full participation points. I will reach out to you if I am concerned about your level of participation.

The Protocol project, which makes up a substantive amount of the work at the conclusion of this quarter, will include some examples from previous courses. The pandemic may place barriers on equaling the depth or scope of content that students in previous courses had the freedom to accomplish. See my previous comment about relaxing expectations and demands of ourselves - this doesn’t mean that we should give anything less than what we are capable of - but I understand the challenges we are all facing.

Assignments and exercises have listed deadlines, but these are negotiable. As listed in the syllabus, I request that you reach out to me at least 24 hours prior to a deadline if you need to make alternate arrangements. You can turn in assignments at any time throughout the quarter and I will provide feedback as is necessary for you to understand and contextualize the goal of the assignment. I have two requests related to this policy:

  1. Please make an earnest effort to keep up with work. The course is designed to take advantage of progressive knowledge gain. The practice or exercises in one week are often the baseline for starting an assignment the next week. There may be times when you need to make alternate arrangements, but give the stated timeline your best effort.

  2. Please try to communicate when a priority deadline is not realistic for you to meet (my email is norlab@uw.edu). This should happen before the deadline has passed if at all possible. You do not need to explain why. If you are falling behind and I haven’t had any communication from you, I may reach out with an email. Please don’t interpret this as any form of pressure other than my expressing a concern.

How this class is organized

There are two aspects of this class that will require your time and attention.

  1. Class sessions (Weeks 1-10) follow the typical structure of an iSchool online course.
  2. Your Protocol assignment has a weekly activity or assignment that you will complete (either as a group or individually). The assignments vary in terms of the effort that will be required - some are involved and will take hours to complete, others are simple readings that will help you think about your protocol.

Weekly content: each week of the course will consist of four or five (depending on the week) components:

  1. Written content on the website
  2. Readings (required and optional)
  3. Recorded lectures
  4. Exercises
  5. Assignment (a piece of your final deliverable)

Topic Overview

| Week | Topic | Exercise | Protocol Assignment |
|------|-------|----------|---------------------|
| 1 | Class Introduction | Introductions | None |
| 2 | Tables, Trees, and Triples | Cookie Recipes | Data Pitch |
| 3 | Tidy Data | Messy Recipe Data | Definition, Scope, Audience |
| 4 | Data Integration | 311 Data | User Stories |
| 5 | Data Packaging | Data Curator Demo | Data Collection Policies |
| 6 | Repository Architectures | None | None |
| 7 | Data Acquisition, Search, and Discovery | APIs | Data Transformations and Quality |
| 8 | Metadata Application Profiles | None | Metadata Application Profiles |
| 9 | Ontologies and Linked Data | None | Data Licensing |
| 10 | Emerging Topics | None | Presentations |


Grades

Your grade for this course is made up of four different components: Participation, Exercises, Protocol Assignments, and your final Protocol.

What to do if you find a mistake in this text

I have done my best to customize and adapt Professor Weber’s course content. Since this is my first time preparing the content in this form, there will inevitably be mistakes or usability issues. If you find a mistake, an issue, or want to make a recommendation for improvement, please send me an email at norlab@uw.edu. I promise I want to hear about it.

Acknowledgements

I wouldn’t be teaching this course without the support of Nic Weber and Carole Palmer. A significant portion of the content for this course was originally created by Professor Weber and I want to be fully transparent about that. I couldn’t have done this without his work. I also want to include the acknowledgements from Professor Weber regarding the content he created:

I have had a lot of help putting together these materials. Where possible I have documented these individuals and resources throughout. Some explicit thanks is necessary - Jackson Brown, Andrea Thomer, Peter Organisciak, Katrina Fenlon, and Alex Garnett have contributed ideas for exercises, readings, and reviewed portions of each chapter.

Introduction to Data Curation II

Original Author: Nic Weber
Editing & Updates: Bree Norlander
This introduction provides an overview of the conceptual content that we will engage in for DC II. The goal is to refresh your memory about concepts that were covered in DC I and how we will use these ideas throughout DC II.

Types of Knowledge

This class builds upon some foundational concepts and ideas that were introduced in Data Curation I. The emphasis of DC I was establishing some best (or good enough) practices for managing and preparing data for meaningful reuse. In introducing conceptual foundations in DC I, I also argued that practicing data curators need to acquire two types of knowledge: Declarative and Procedural. If you’ve taken a class with me before you know that I think about this type of framework as being broadly useful for all types of learning. But, this framework is especially relevant for data curation.

To review: procedural knowledge is knowing how to do something, and declarative knowledge is knowing that something is the case - in other words, understanding something about the factual underpinnings of a concept. Less abstractly, let’s use an example from our (now) everyday lives:

Declaratively we know that a virus outbreak has certain distributions that are the direct result of collective action. If we take no action then there will be an increase in the rate of viral infections. But, if we collectively employ techniques like ‘social distancing’ then the infection rate will remain lower, essential services provided by public health workers will experience less of an impact, and society in general can effectively respond to incidents of viral infection. We probably all remember hearing the phrase ‘flatten the curve’ to refer to this concept - that is, taking collective action to decrease the height of a plotted distribution of infections over time.

Image 1 Caption: Visual depiction of distributions in 'flattening the curve' of infections related to Covid-19. This image is courtesy of the CDC and Huffington Post's data visualization team

Procedurally - how do we flatten a curve? Well, we employ techniques like social distancing, we wear masks, we wash our hands, we shut down broad sectors of the economy, we adhere to recommendations from the CDC, etc. It is only through knowing how to effectively combat a viral outbreak that we can put our declarative knowledge into practice. And, we can use real-time data to see whether or not we are actually achieving a ‘flattening of the curve’.

Image 2 Caption: Ben Baldwin created this helpful graph showing the mortality progressions of each state over time. You can see his data here https://github.com/guga31bb/covid19

Data curation is not so different from this example. We often need to first obtain some declarative knowledge about data and the ways that it was collected, organized, and analyzed. Understanding the intricacies of these issues helps us to understand how we can design effective interventions. In other words, only after we obtain some declarative knowledge about data can we start to employ procedural techniques for preparing data to be managed and used effectively over time.

In DC I we focused a good deal of our time and effort on obtaining declarative knowledge. In DC II we start to tilt more strongly towards practicing and learning procedural techniques for putting this knowledge to work. In this course we will attempt to help you do this in two ways:

  1. Deepen your declarative knowledge about data, management, organization, and encodings that are specific to the types of open data that we are able to reliably access and use via the web. By focusing in particular on open data we have the ability to practice with and use real world examples.
  2. Expose you to tools and strategies for employing these skills effectively in curating open data. Putting declarative knowledge to work takes practice, and it gets easier when we are able to abstract away from the particulars of any one context or any one tool’s affordances (and limitations).

Working Definitions

Before we begin diving into the content of DC II, it will help to revisit some definitions that were established in DC I:

Data: Data are various types of digital objects playing the role of evidence. Type and role distinctions have to do with the relational nature of data. A type is rigid (such as a file format) and a role is fluid (it can change given a context). A simple example will help make this clear: I am a person. This is a type. Regardless of any external circumstances I will remain a person. I am also a professor. This is a role that I play. I might, depending on the success of my tenure case, not be a professor in the future. This is simply a professional title that I have for a small amount of time. Data have similar rigid and fluid properties - A tabular dataset will have a type of structure (rows, columns, and values). Unless we take some purposeful action to transform this data it will remain tabular as a type. But, this tabular data might be evidence of some real world phenomena - it might be a set of species occurrence records, the precipitation and temperature of a particular place, etc. This evidential role can shift and change depending on who is using the data, and for what purpose.
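To make the type/role distinction concrete, here is a minimal sketch in Python; the species-occurrence table and its columns are invented for illustration. The table's structure (its type) stays fixed while the question we ask of it (its role) changes.

```python
import pandas as pd

# A tiny, invented species-occurrence table. Its *type* is tabular:
# rows (observations), columns (variables), and cell values.
df = pd.DataFrame({
    "species": ["Mallard", "Mallard", "Osprey"],
    "site_id": ["S1", "S2", "S1"],
    "year": [2019, 2019, 2020],
})

# Role 1: the same table used as evidence of where species occur
print(df.groupby("site_id")["species"].nunique())

# Role 2: the same table used as evidence of observation effort over time
print(df.groupby("year").size())

# The structure (type) never changed; only the evidential role shifted
# with the question being asked.
```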

FAIR Data: An emerging shorthand description for open research data - one that is applicable to any sector - is the concept of F.A.I.R. FAIR data should be Findable, Accessible, Interoperable, and Reusable. We will discuss this concept in a bit more depth throughout this quarter. But, having this shorthand definition of what we try to achieve in doing data curation is a helpful reminder of the steps needed to make data truly accessible over the long term.

Data curation: Data curation is the active and ongoing management of data throughout a lifecycle of use, including its reuse in unanticipated contexts. This definition emphasizes that curation is not one narrow set of activities, but an ongoing process that takes into account both the material aspects of data and the surrounding community of users who employ different practices while interacting with data and information infrastructures. In DC II there is a particular emphasis on understanding the minutiae of information infrastructures as they relate to curatorial interventions.

Metadata: Metadata is most simply a set of standardized attribute-value pairs that provide contextual information about an object or artifact. Metadata, at an abstract level, has some features which are helpful for unpacking and making sense of the ways that descriptive, technical, and administrative information become useful for a digital object playing the role of evidence. Metadata can and often does include the following:

In DC I we also differentiated data and metadata based on their structure.

Structured Metadata is quite literally a structure of attribute and value pairs that are defined by a scheme. Most often, structured metadata is encoded in a machine readable format like XML or JSON. Crucially, structured metadata is compliant with and adheres to a standard that has defined attributes - such as Dublin Core, EML, DDI.
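As a minimal sketch of what this looks like in practice, structured metadata can be as simple as a set of standardized attribute-value pairs serialized to JSON. The record below is invented and uses only a handful of Dublin Core elements for illustration.

```python
import json

# An invented record described with a handful of Dublin Core elements.
record = {
    "dc:title": "Example City 311 Service Requests, 2019",
    "dc:creator": "Example City Open Data Program",
    "dc:date": "2020-01-15",
    "dc:format": "text/csv",
    "dc:rights": "CC-BY-4.0",
}

# Because the attribute names come from a shared standard, other systems
# can parse and index this record without guessing what each field means.
print(json.dumps(record, indent=2))
```

The same record could just as easily be serialized as XML; what matters is that the attribute names come from a shared, documented standard.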

Metadata Schemas define attributes (e.g. what do you mean by “creator” in ecology?); Suggest controls of values (e.g. dates = MM-DD-YYYY); Define requirements for being “well-formed” (e.g. what fields are required for an implementation of the standard); and, Provide example use cases that are satisfied by the standard.
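Here is a hypothetical sketch of how those schema rules can be checked in practice. The required fields and the MM-DD-YYYY date control below are invented stand-ins, not an actual standard's rules.

```python
import re

# A toy schema: required attributes plus a control on date values (MM-DD-YYYY).
# These rules are invented stand-ins for what a real standard specifies.
REQUIRED = {"dc:title", "dc:creator", "dc:date"}
DATE_PATTERN = re.compile(r"^\d{2}-\d{2}-\d{4}$")

def check_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record is well-formed."""
    problems = [f"missing required field: {field}" for field in REQUIRED - record.keys()]
    date = record.get("dc:date", "")
    if date and not DATE_PATTERN.match(date):
        problems.append(f"dc:date not in MM-DD-YYYY form: {date}")
    return problems

print(check_record({"dc:title": "Example", "dc:date": "2020-01-15"}))
# -> ['missing required field: dc:creator', 'dc:date not in MM-DD-YYYY form: 2020-01-15']
```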

Structured metadata is typically differentiated by its documentation role.

Unstructured Metadata is meant to provide contextual information that is Human Readable. Unstructured metadata often takes the form of contextual information that records and makes clear how data were generated, how data were coded by creators, and relevant concerns that should be acknowledged in reusing these digital objects. Unstructured metadata includes things like codebooks, readme files, or even an informal data dictionary.

A further important distinction that we made about metadata is that it can be applied to individual items or to collections and groups of related items. We referred to this as the distinction between item-level and collection-level metadata.

Expressivity vs. Tractability: One of the key concepts we discussed in DC I is the idea that all knowledge organization activities are a tradeoff between how expressive we make information, and how tractable it is to manage that information. The more expressive we make our metadata, the less tractable it is in terms of generating, managing, and computing for reuse. Inversely, if we optimize our documentation for tractability we sacrifice some of the power in expressing information about attributes that define class membership. The challenge of all knowledge representation and organization activities - including metadata and documentation for data curation - is balancing expressivity and tractability.

Normalization: Normalization literally means making data conform to a normal schema. Practically, normalization includes the activities needed for transforming data structures, organizing variables (columns) and observations (rows), and editing data values so that they are consistent, interpretable, and match best practices in a field. (Note from Bree: For those of you with statistics and data science experience, note how this differs from statistical normalization techniques. I mention this because I’ve been in conversations where the use of this term has meant different things to different people, causing confusion.)
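Here is a minimal sketch of normalization in this curatorial sense, using pandas and an invented messy table; the column names and values are made up for illustration.

```python
import pandas as pd

# An invented, messy table: inconsistent column names, casing, missing-value
# codes, and date formats.
raw = pd.DataFrame({
    "Site Name ": ["Lake Union ", "lake union", "Green Lake"],
    "Temp (F)":   ["68", "71.5", "n/a"],
    "Date":       ["04/02/2021", "2021-04-03", "April 4, 2021"],
})

tidy = raw.rename(columns={"Site Name ": "site_name", "Temp (F)": "temp_f", "Date": "date"})
tidy["site_name"] = tidy["site_name"].str.strip().str.title()    # consistent spacing and casing
tidy["temp_f"] = pd.to_numeric(tidy["temp_f"], errors="coerce")  # "n/a" becomes a real missing value
tidy["date"] = tidy["date"].apply(pd.to_datetime)                # one consistent date representation

print(tidy)
```

Note that nothing here rescales values statistically; the work is making names, types, and representations consistent.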

Data Quality: From the ISO 8000 definition we assume data quality is “…the degree to which a set of characteristics of data fulfills stated requirements.” In simpler terms, data quality is the degree to which a set of data or metadata are fit for use. Examples of data quality characteristics include completeness, validity, accuracy, consistency, availability, and timeliness.
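As a rough sketch of how a few of these characteristics can be checked against "stated requirements," consider the example below; the table and the temperature range are invented for illustration.

```python
import pandas as pd

# An invented measurements table; the temperature range below stands in for
# a "stated requirement" about valid values.
df = pd.DataFrame({
    "site_name": ["Lake Union", "Lake Union", "Green Lake", "Green Lake"],
    "temp_f": [68.0, 71.5, None, 250.0],
})

report = {
    # completeness: the share of cells that hold a value at all
    "completeness": float(df.notna().mean().mean()),
    # consistency: repeated observations that should not be there
    "duplicate_rows": int(df.duplicated().sum()),
    # validity: values outside the plausible range for this measurement
    "temp_f_out_of_range": int((df["temp_f"].lt(-40) | df["temp_f"].gt(130)).sum()),
}
print(report)  # e.g. {'completeness': 0.875, 'duplicate_rows': 0, 'temp_f_out_of_range': 1}
```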

Summary

These basic concepts give us some of the declarative knowledge we need to begin to do data curation work. Over the next ten weeks I will continue to offer working definitions of basic concepts related to DC II readings and topics. It’s important to keep in mind that these are ‘working definitions’ - they may have important qualifications or need refinement in any particular context. The point of this long review isn’t to ask you to memorize these concepts, but instead to provide something like a ‘terms of reference’ or knowledge base that we can continue to refine and improve throughout the course.

In DC II we are going to tackle topics that help us significantly expand this declarative knowledge. We will look at how to practically decide which type of encoding or representation of data best matches the needs of a given community. We will extend the idea of data normalization to include some best practices for creating ‘tidy’ or useful data for computation. We will also look at how to integrate or combine different datasets so that we can balance something like expressivity and tractability in reusing the data for a new purpose.
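As a small preview of that integration work, here is a sketch with two invented tables that share a key. Combining them adds expressive context to each observation while producing a larger, more redundant table to manage - the expressivity/tractability tradeoff in miniature.

```python
import pandas as pd

# Two invented tables that describe the same sites but came from different sources.
observations = pd.DataFrame({
    "site_id": ["S1", "S1", "S2"],
    "temp_f": [68.0, 71.5, 55.0],
})
sites = pd.DataFrame({
    "site_id": ["S1", "S2"],
    "site_name": ["Lake Union", "Green Lake"],
    "county": ["King", "King"],
})

# Joining on a shared key makes each observation more expressive (it now carries
# a name and a location), at the cost of a larger, more redundant table to manage.
combined = observations.merge(sites, on="site_id", how="left")
print(combined)
```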

I will also introduce some more advanced topics that relate to processes of acquiring, packaging, linking, and representing data such that it can be published to the web.

Exercise

Introduce yourself on the Canvas discussion board. This is not a graded exercise.