Metadata Application Profiles

Original Author: Nic Weber
Editing & Updates: Bree Norlander

There is an abundance of explanations and tutorials for how metadata should be created under domain-specific constraints, but surprisingly little attention is paid to the principles that underlie all of the knowledge organization and representation activities required to create valid metadata. Drawing inspiration from the data science concept of “Tidy Data,” this chapter introduces a set of principles for creating Tidy Metadata. These principles are domain-agnostic: they can and should be applied to any setting in which accurate description, retrieval, and discovery are necessary.

Introduction

The underlying principles of creating metadata come from fields such as knowledge organization and representation, and are applied practically to domain-specific data. Ecology data have different needs than, for example, cultural heritage materials. As a result, many discipline- or domain-specific metadata standards have been developed to accommodate these different resources.

At a broad level of abstraction, metadata provides the human- and machine-interpretable information necessary for accurate retrieval, discovery, and use of digital objects. But metadata creators are often not formally trained in either knowledge organization or knowledge representation, and this poses a number of challenges for data curators.

In the Introduction module there are simple definitions of core concepts related to metadata that will be helpful to review before reading the rest of this module.

Semantics and Syntax of Metadata

In its simplest form, metadata consists of ‘attribute - value’ pairs that describe some property of data, or a set of data. Recall that in the introduction, we established that attributes are “the defining features of a class or sub-class, and refer to instances. An instance is a member of a class if it has all of the attributes of that class.”

Attributes and values can be expressed, semantically, as descriptive, administrative, or technical information about data.

Syntactically, we can express or encode attributes using a number of standards. Machine-readable metadata are most often syntactically expressed as JSON or XML. Human-readable metadata can be expressed as a simple table, and its encoding can take a variety of forms - from plain text files to Excel tables.
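To make the semantics/syntax distinction concrete, here is a small sketch in Python showing the same attribute-value pairs encoded two ways. The attribute names are illustrative, not drawn from any particular standard:

```python
import json

# A hypothetical metadata record as attribute-value pairs (attribute
# names are illustrative, not taken from any particular standard).
record = {
    "creator": "Leonardo da Vinci",
    "year_created": "c. 1503-1519",
    "medium": "Oil on poplar",
}

# Machine-readable encoding: serialize the pairs as JSON.
as_json = json.dumps(record, indent=2)

# Human-readable encoding: the same pairs laid out as a plain-text table.
as_table = "\n".join(f"{attr:<14}{value}" for attr, value in record.items())

print(as_json)
print(as_table)
```

The same pairs carry the same semantics in both encodings; only the syntax differs.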

Tidy Metadata

Tidy data establishes some general principles that should apply to the structure and representation of tables. These principles should be applicable across all kinds of tables that have variables, values, and various kinds of observations. The simplest formulation of tidy data is: each variable forms a column, each observation forms a row, and each type of observational unit forms a table.

Metadata should follow some similar principles. That is, there should be some general rules that we can follow to describe attributes, values, and their corresponding relationship to instances of a class. Below are the principles I propose for tidy metadata as it applies to tables of data.

The properties of a dataset are expressed as an attribute-value pair that conforms to a schema:

  1. Attributes are declared by a namespace.
  2. Values are, where possible, constrained by a controlled vocabulary (this does not apply to free-text fields).
  3. Schemas are published to the web.

Let’s unpack each of these statements so that it makes sense in the context of data curation.

“Schemas are published to the web”: A metadata schema establishes and defines data elements (attributes) and the rules governing the use of those elements to describe a resource (for detailed metadata strategies see: Gourley, D., & Zhang, A. (2014). Creating Digital Collections: A Practical Guide. United Kingdom: Elsevier Science.). Schemas are, in plain language, the rules of engagement for creating metadata. That is, they govern what constitutes valid and invalid use of an attribute-value pair to describe a dataset. Schemas then have to be public in order for records to be validated against them. Publishing a schema to the web means that the schema must be at a resolvable web address (a URL) and should be encoded in a machine-readable language (e.g. XML or JSON). Schemas should, where possible, define each element (attribute) at a unique namespace.

“Attributes are declared by a namespace”: In publishing a schema to the web, we should also take care to define the use of an attribute such that each attribute has a unique location where the definition and explanation of its use is publicly accessible and identifiable in a schema. The attribute namespace has a subtle but important relationship to a schema. A schema can be made up of multiple namespaces, and each namespace can be part of more than one schema (I’ll offer an example below so that this is less abstract).

“Values are, where possible, constrained by a controlled vocabulary”: Recall that in the module on Tidy Data, we discussed the appeal to authority control for standard units of measurement. In metadata we want to rely upon this authority control in a similar way - this helps to standardize what types of values an attribute can have, and provide clear guidance for how these values should be constrained.
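Here is a minimal sketch of what these three principles look like when enforced in code, in Python, with illustrative namespace URLs and a toy controlled vocabulary (a real vocabulary would come from an authority like ULAN):

```python
# Hypothetical schema: each attribute maps to the namespace where it is
# defined; a controlled vocabulary constrains the values of some attributes.
SCHEMA_URL = "http://purl.org/dc/elements/1.1/"  # principle 3: published to the web

NAMESPACES = {
    "creator": "http://purl.org/dc/terms/creator",
    "medium": "http://purl.org/dc/terms/medium",
}

# Toy controlled vocabulary (illustrative; a real one might come from ULAN).
CREATOR_VOCAB = {"Leonardo da Vinci", "Raphael"}

def validate(record: dict) -> list:
    """Return a list of violations of the tidy metadata principles."""
    problems = []
    for attr, value in record.items():
        if attr not in NAMESPACES:  # principle 1: attributes are namespaced
            problems.append(f"attribute '{attr}' has no declared namespace")
        if attr == "creator" and value not in CREATOR_VOCAB:  # principle 2
            problems.append(f"value '{value}' not in controlled vocabulary")
    return problems

print(validate({"creator": "Leonardo da Vinci", "medium": "Oil on poplar"}))  # []
print(validate({"painter": "Leo"}))
```

A record passes validation only when every attribute resolves to a declared namespace and every constrained value comes from the controlled vocabulary.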

Tidy Metadata Examples

A simple way to express metadata describing the painting “Mona Lisa” might look something like this:

Attribute      Value
Creator        Leonardo da Vinci
Year_Created   c. 1503-1519
Medium         Oil on poplar
Dimensions     77 cm x 53 cm
Location       Musée du Louvre, Paris


The metadata elements in this table are arranged as pairs of attributes and values that generically describe the painting “Mona Lisa”.

Putting our tidy metadata principles to work - we could do the following:

Values constrained by a controlled vocabulary: Just as with tidy data, we want the values of our table to have authority control. The value of Creator is a good initial candidate. We could express this name in a number of ways, but ultimately we want it to adhere to a standard. The Union List of Artist Names (ULAN) provides an authority file that describes artists and their attributes (variant names, nationality, etc.). We should then be able to find an entry in ULAN for Leonardo da Vinci and control (or constrain) the value in our metadata record based on this definition. We could take the same approach for each value in our metadata: Location, Medium, Dimensions, and so on.
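Constraining a value by an authority file can be sketched as a simple lookup. The variant names and preferred form below are illustrative, standing in for what an authority like ULAN actually records:

```python
# A toy authority file in the spirit of ULAN: variant spellings map to one
# preferred, authorized form (the entries here are illustrative).
AUTHORITY = {
    "da vinci, leonardo": "Leonardo da Vinci",
    "leonardo da vinci": "Leonardo da Vinci",
    "léonard de vinci": "Leonardo da Vinci",
}

def control_value(raw: str) -> str:
    """Constrain a free-text creator value to its authorized form."""
    key = raw.strip().lower()
    if key not in AUTHORITY:
        raise ValueError(f"'{raw}' not found in authority file")
    return AUTHORITY[key]

print(control_value("da Vinci, Leonardo"))  # Leonardo da Vinci
```

Whatever variant a cataloger types, the record stores the single controlled form, which is what makes the value reliably searchable.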

Attributes declared by a namespace: The attributes in our metadata table should also have a definition that is available at a namespace. For Creator we could appeal to a standard like Dublin Core, which is used broadly to create descriptive information about cultural heritage objects. The plain-language definition for a creator in Dublin Core is available at its namespace, https://purl.org/dc/terms/creator. In declaring the namespace for the Creator attribute we are saying that this is not our own unique idea of who created a painting, but a specific definition (that of Dublin Core) of what the creator of a work of art means.

Brief Note on Readability

In metadata creation, there is often a distinction made between human-readability and machine-readability. These distinctions are exactly as they appear: human-readable metadata is that which can be easily interpreted by an actual person. The table above is an example of a human-readable metadata record. Machine-readable metadata means that (similar to our discussions of encoding data in Tables, Trees, and Triples) a machine can interpret and act upon a metadata record.

When we appeal to namespaces and publish schemas to the web, what we are doing as data curators is creating a machine-readable metadata record. These two approaches to creating metadata don’t necessarily have to be a tradeoff. Think about how we create independence in data representation (at the logical level) and remember that we can use “data” to feed into “graphical user interfaces”. Metadata is no different. We can create machine-readable metadata, and publish this metadata so that it is accessible in a graphical user interface.

Privileging the creation of machine-readable metadata, as I do with tidy metadata principles, is simply in service of trying to create sustainable metadata records. We can always translate a machine-readable record into a human-readable record, but the same is not true in reverse. It takes substantially more effort to transform a human-readable record into something a machine can interpret.

Publish a schema to the web: In using controlled vocabularies for our values, and namespaces for our attributes, we want a tidy way to reference these in a machine-readable format. This is what we can achieve by making our schema available on the web. For the sake of this example, let’s assume that we are only going to use attributes from the Dublin Core standard. If this is the case, then we can depend on the schema for Dublin Core that is already publicly available on the web at http://purl.org/dc/elements/1.1/ - this is a web address with a machine-readable set of definitions for each attribute in our metadata record.

A Valid Dublin Core Record

Let’s continue to work from the assumption that we are creating a metadata record for the Mona Lisa. When we are finished, our tidy metadata record should look something like this when expressed as XML:

<?xml version="1.0" encoding="UTF-8"?>

<metadata
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:dcterms="http://purl.org/dc/terms/">

<dc:title>Portrait of Lisa Gherardini, wife of Francesco del Giocondo, known as the Mona Lisa (the Joconde in French)</dc:title>
<dcterms:alternative>Mona Lisa</dcterms:alternative>
<dc:creator>Leonardo da Vinci</dc:creator>
<dc:description>This portrait was doubtless started in Florence around 1503. It is thought to be of Lisa Gherardini, wife of a Florentine cloth merchant named Francesco del Giocondo - hence the alternative title, La Gioconda. However, Leonardo seems to have taken the completed portrait to France rather than giving it to the person who commissioned it. After his death, the painting entered François I's collection.</dc:description>
<dc:publisher>Musée du Louvre</dc:publisher>
<dc:contributor>Musée du Louvre</dc:contributor>
<dcterms:created xsi:type="dcterms:W3CDTF">1503-01-01</dcterms:created>
<dcterms:created xsi:type="dcterms:W3CDTF">1519-12-31</dcterms:created>
<dc:type xsi:type="dcterms:DCMIType">Image</dc:type>
<dcterms:medium>Oil on Poplar</dcterms:medium>
<dc:identifier xsi:type="dcterms:URI">https://www.louvre.fr/en/oeuvre-notices/mona-lisa-portrait-lisa-gherardini-wife-francesco-del-giocondo</dc:identifier>
<dc:source xsi:type="dcterms:URI">https://www.louvre.fr/en/oeuvre-notices/mona-lisa-portrait-lisa-gherardini-wife-francesco-del-giocondo</dc:source>
<dc:language xsi:type="dcterms:ISO639-2">fra</dc:language>
<dc:language xsi:type="dcterms:ISO639-2">ita</dc:language>
<dc:language xsi:type="dcterms:ISO639-2">eng</dc:language>

</metadata>
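Because the record above declares its namespaces, a machine can parse it unambiguously. Here is a sketch using Python's standard xml.etree module against an abbreviated version of the record:

```python
import xml.etree.ElementTree as ET

# An abbreviated version of the Dublin Core record above, parsed with a
# namespace map so each attribute resolves to the namespace that defines it.
xml_record = """<metadata xmlns:dc="http://purl.org/dc/elements/1.1/"
          xmlns:dcterms="http://purl.org/dc/terms/">
  <dcterms:alternative>Mona Lisa</dcterms:alternative>
  <dc:creator>Leonardo da Vinci</dc:creator>
  <dc:type>Image</dc:type>
</metadata>"""

# Prefix-to-namespace map used when querying the parsed tree.
NS = {"dc": "http://purl.org/dc/elements/1.1/",
      "dcterms": "http://purl.org/dc/terms/"}

root = ET.fromstring(xml_record)
creator = root.find("dc:creator", NS).text
title = root.find("dcterms:alternative", NS).text
print(creator, "-", title)  # Leonardo da Vinci - Mona Lisa
```

The `dc:` and `dcterms:` prefixes are arbitrary; what matters is the namespace URL each prefix is bound to, which is exactly the "attribute declared by a namespace" principle at work.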


Using Tidy Metadata Principles for MAPs

Tidy metadata principles start to take on more meaning when we apply them to the types of data curation challenges we face in building schemas, identifying attribute namespaces, and controlling values for unique resources. In practically performing data curation we won’t always have the advantage of describing resources that have just one best set of attributes or just one standard to appeal to. Oftentimes we will necessarily want to combine different attributes and different standards to create a unique, but standard, way of describing a resource. This is, essentially, the gist of a metadata application profile. An application profile allows a specific institution or metadata creator to combine the most appropriate attributes and authority controls for describing the resources they manage.

I think of a metadata application profile like a playlist - sometimes it’s perfectly reasonable to put on Fiona Apple’s latest (and you should definitely listen to this album [says Professor Weber]) and take a walk. But, sometimes we want to create a mix of the best individual tracks from different albums for our cross-country drive.

Metadata application profiles combine the best attributes, from different schemas, necessary for accurate description. In practice, this means that we have to create a machine-readable way to define and govern each of the attributes that we include in a metadata application profile. There are generally two approaches to creating a MAP:

1. Combine attributes from existing schemas: The most common way to create a MAP is to reuse attribute definitions that have already been published. We can do this, practically, by declaring in our record which attributes we are using from which schemas.

In XML, this might look like the following combination of Dublin Core and the Data Documentation Initiative (DDI), a standard used in the social sciences:

<metadata
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:ddi="ddi:profile:3_1">

<dc:title>Portrait of Lisa Gherardini, wife of Francesco del Giocondo, known as the Mona Lisa</dc:title>
<ddi:language xsi:type="ISO639-1">English</ddi:language>
<ddi:language xsi:type="ISO639-1">French</ddi:language>
<ddi:language xsi:type="ISO639-1">Italian</ddi:language>

</metadata>
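A record that mixes namespaces like this can also be built programmatically. Here is a sketch with Python's xml.etree, where the DDI namespace URI is illustrative rather than taken from the DDI specification:

```python
import xml.etree.ElementTree as ET

# Build a small MAP-style record mixing two namespaces. The DC URL is the
# real Dublin Core elements namespace; the DDI URI here is illustrative.
DC = "http://purl.org/dc/elements/1.1/"
DDI = "ddi:profile:3_1"

root = ET.Element("metadata")

# ElementTree's "Clark notation" ({namespace}localname) attaches each
# element to the namespace that defines it.
title = ET.SubElement(root, f"{{{DC}}}title")
title.text = "Portrait of Lisa Gherardini"

lang = ET.SubElement(root, f"{{{DDI}}}language")
lang.text = "English"

xml_bytes = ET.tostring(root)
print(xml_bytes.decode())
```

Each element in the serialized output is bound to its own namespace, so a consumer can tell which standard governs which attribute.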

One interesting note about MAPs: In some cases it is not only a matter of selecting the best attributes for description, but of modifying an attribute set so that it can accurately describe the role that data are playing as evidence in some field. An excellent example of this is the Darwin Core metadata profile, which is used to describe species occurrence records in natural history. Darwin Core uses numerous attributes defined by Dublin Core (hence the name) but modifies their definitions slightly so that they can be interpreted accurately by natural historians. Think, for example, about the attribute Creator in Dublin Core, which is defined as “An entity primarily responsible for making the resource”. Unless we want to get into a deep philosophical discussion about creationism and the origins of species, this is an attribute definition, designed for cultural heritage resources, that aligns poorly with the needs of a natural historian. To skirt this philosophical dilemma, Darwin Core removes the need for a creator in its MAP, and instead defines concepts like “Occurrence” and “Event of Observation” to better meet the needs of natural historians.

2. Publish your own unique attribute namespace: There may be cases where, after searching existing schemas, there is a need to create a new attribute that can uniquely describe a resource. In this case, the institution that is creating metadata will take on the responsibility of defining and publishing a namespace that defines a unique attribute. For most cases, creating a unique attribute namespace should be a last resort. It is difficult to practically sustain a unique published namespace - it requires making sure that the attribute definition remains accessible publicly on the web, and that any machine that attempts to process (or understand) your attribute definition can do so indefinitely. That being said, there are times when this is necessary.

Here are two examples of institutions that have created unique attribute namespaces:

Tidy Metadata Best Practices

In this module, I have laid out some principles for how metadata should be created. These are principles that, in practice, may require modification or may be unnecessary given the time-intensive task of describing digital data.

Metadata is, as we’ve discussed throughout the quarter, a tradeoff between expressivity and tractability. Throughout this module I am arguing that tractable metadata is that which follows a tidy metadata principle of, where possible, reusing existing schemas, attributes, and controlled vocabularies. This significantly reduces the need to sustain metadata standards, and allows for our metadata records to be broadly accessible to both humans (through a graphical user interface) and machines (through encoding).

A few other important notes for practicing tidy metadata principles.




One parting note - There is a concept which doesn’t fit neatly into our discussion of MAPs and Tidy Metadata, but is important to think about when approaching metadata for data curation. I’ll try to briefly summarize the idea of a 1:1 principle and unpack why this is important for data curation in the concluding section.

Principle: 1:1 Relationships

The principle whereby related but conceptually different entities, for example a painting and a digital image of the painting, are described by separate metadata records.

The one to one (1:1) principle holds that metadata records should correspond to one, and only one entity (or instance) of a class (Hillmann, 2005, sec. 1.2). This principle was first articulated in the context of cultural heritage metadata where related, but conceptually different instances are often difficult to interpret.

The canonical example of the 1:1 principle involves a photograph of the Mona Lisa and the actual painting of the Mona Lisa by Leonardo da Vinci. If we search for “Mona Lisa” in the Europeana database, which aggregates metadata across European cultural heritage institutions, we will find numerous metadata records. If we look closely at these records, we will find multiple examples of the creator being named as “Leonardo da Vinci”.

But this surely isn’t the case. There is one and only one instance of a painting by Leonardo da Vinci named Mona Lisa. Following the 1:1 principle, there should be one and only one record that claims the creator of the Mona Lisa is Leonardo da Vinci. All other instances are replications, or different instantiations (e.g. photographs, trinkets, t-shirts, etc.) of the original painting, created by someone other than Leonardo da Vinci.

In his doctoral thesis, Richard Urban empirically examined violations of the 1:1 principle (2012). He found that metadata creators, from professional catalogers to those untrained in knowledge organization, applied this principle unevenly. The result is a broad confusion about what items a cultural heritage institution actually holds, and which are replications or secondary sources.

At its core, the 1:1 principle is trying to disentangle the complex relationship between digital objects and their various manifestations. In data curation, we often need to adhere to this principle when we are describing complex datasets that contain multiple tables, multiple versions of a table, or multiple instances of the same data. To practically do this, we can follow a few basic rules:

  1. Recognize item-level and collection-level relationships: Item-level and collection-level metadata provide a way for individual objects (data) to be described in a part/whole relationship. Collection-level metadata typically describes attributes of an aggregate set of resources, rather than any one resource specifically. But these relationships may be hard to disentangle at first glance. The first step I take when dealing with complex relationships like this is to try to specify the autonomy of the data being described. Does it make sense as a stand-alone dataset that can be accurately interpreted, or does it have a relationship with additional resources that are needed for meaningful reuse? If the latter is the case, then there is a need to create a collection-level description.
  2. Classes and instances (this is very similar to item-level and collection-level descriptions): When approaching a dataset or table, we often need to determine what class this instance of data belongs to. Remember our Homer, Lisa, and Bart Simpson example from the Tables, Trees, & Triples module? We have instances (characters) of the class (Simpson family). Membership in the class is defined by attributes such as having a parent named Simpson. In adhering to the 1:1 principle we should create metadata that represents these relationships without violating the idea that one record corresponds to one instance. That is, we can create a description of a class (Simpsons) and instances of the class (characters), but we should not create a Simpson family record if it contains only one character description. Instead, we can use collection- and item-level descriptions to accommodate this relationship. We could, for example, create a record that describes all Simpsons, and then individual records for each character. We might also nest these attributes, such that only our collection-level metadata record describes the instances of the class (members). In Dublin Core we can use attributes like isPartOf to define these types of collection-item relationships.
  3. Reuse existing records: Just as we should take care to reuse existing namespaces and schemas following tidy metadata principles, we might also attempt to reuse parts of metadata records that already exist. For example, if we were describing a photo of the Mona Lisa we should be able to find existing records describing the painting, and existing records describing photos of original artworks. We can then attempt to create a unique record of our photo of the Mona Lisa by reusing these existing schemas, namespaces, and attribute-value pairs. In this search for related items we can further establish whether we are creating a unique record, or whether we are describing a unique resource for the first time. The former, finding and discovering related records, is useful if and only if the data we are curating is likely to exist in multiple places. This is a useful technique for downstream curation, but is of limited value to upstream curators.
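The class/instance and part/whole ideas in the list above can be sketched as simple records. In this Python sketch the identifiers are invented, and the isPartOf attribute is modeled loosely on the Dublin Core term rather than taken from a real record:

```python
# One record per instance (the 1:1 principle), with the part/whole
# relationship expressed through an isPartOf attribute. All identifiers
# and titles here are invented for illustration.
collection = {"identifier": "simpsons", "title": "The Simpson Family"}

items = [
    {"identifier": "homer", "title": "Homer Simpson", "isPartOf": "simpsons"},
    {"identifier": "lisa", "title": "Lisa Simpson", "isPartOf": "simpsons"},
]

def members_of(collection_id: str, records: list) -> list:
    """Item-level records that declare membership in a collection."""
    return [r for r in records if r.get("isPartOf") == collection_id]

# 1:1 check: every identifier names exactly one record.
all_ids = [collection["identifier"]] + [r["identifier"] for r in items]
assert len(all_ids) == len(set(all_ids)), "duplicate record for one instance"

print([r["title"] for r in members_of("simpsons", items)])
```

The collection-level record never duplicates the item descriptions; it is the isPartOf links that let a machine reassemble the whole from its parts.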


Exercise

The exercise for this week is to work on your protocol, and in particular the Metadata Application Profile Assignment. Your readings this week describe metadata application profiles in detail.