Wrap Up

or “What I wish I had integrated into the curriculum prior to day 1”

Author: Bree Norlander

The main reading I wanted to find a place for this quarter (and ultimately never did, unless you count right now) is “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜” (and not just because it includes an emoji in the title, though that in itself is certainly novel). If you have time this week, read it. If you don’t, read it later.

It is past time for researchers to prioritize energy efficiency and cost to reduce negative environmental impact and inequitable access to resources — both of which disproportionately affect people who are already in marginalized positions. (p. 613)

we discuss how large, uncurated, Internet-based datasets encode the dominant/hegemonic view, which further harms people at the margins, and recommend significant resource allocation towards dataset curation and documentation practices. (p. 613)

This fourteen-page article about one particular type of dataset encapsulates so many issues I want you to think about as you embark on data-related (or other) careers. (Additionally there are issues regarding the response by Google to Timnit Gebru’s and Margaret Mitchell’s roles in publishing this article, but I’ll just focus on the content of the paper for now.)

Screenshot of a phone screen with some text - circled are the phone's auto-generated next word choices.

Let me set the stage. Stochastic Parrots is specifically about natural language modeling (used for auto-generating text, machine transcribing, automated speech recognition, and more) and the datasets used in the process:

One of the biggest trends in natural language processing (NLP) has been the increasing size of language models (LMs) as measured by the number of parameters and size of training data. (p. 610)

Here the authors are referring to two pieces of natural language processing, models and datasets:

  1. Language models: calculations of probabilistic relationships between words (e.g. the likelihood that certain words will follow a given string of words), which are “trained” on
  2. datasets full of natural language examples. These datasets can come from anywhere in which there is spoken or written language to extract (books, emails, tweets, reddit posts, blogs, on and on).

And size refers to the computational and storage demands of language models. The datasets themselves are large, and the probability modeling on top of those large datasets can be extremely resource intensive (in energy consumption, memory consumption, CPU workload, etc.).
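To make the idea of a probabilistic language model concrete, here is a minimal bigram-model sketch in Python. This is nothing like a large modern LM (no neural network, no billions of parameters), and the tiny corpus is invented for illustration, but it shows the core mechanic the paper is concerned with at scale: counting which words follow which in training data, then predicting the most likely next word.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count, for each word, which words follow it in the corpus."""
    following = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            following[current_word][next_word] += 1
    return following

def most_likely_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    if word not in model:
        return None
    return model[word].most_common(1)[0][0]

# A tiny invented "training set" of natural language examples.
corpus = [
    "the cat sat on the mat",
    "the cat chased the dog",
    "the cat sat on the rug",
]
model = train_bigram_model(corpus)
print(most_likely_next(model, "the"))  # "cat" — it follows "the" most often
print(most_likely_next(model, "sat"))  # "on"
```

A real LM does the same kind of next-word prediction, but with billions of learned parameters over enormous datasets — which is exactly where the resource and curation concerns come in.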

As you work through this piece, you will likely come across technical terms that you are unfamiliar with and that likely have more to do with language models than with the datasets (though the authors really do an excellent job of defining almost everything which is such an asset to a technical paper). That’s OK. I want you to pay attention to the real-world environmental and ethical issues that arise when we think about natural language datasets and their use. That’s what we are doing as data curators, thinking through the preservation and re/use of the datasets we are making available.

In accepting large amounts of web text as ‘representative’ of ‘all’ of humanity we risk perpetuating dominant viewpoints, increasing power imbalances, and further reifying inequality. (p. 614)

OK, enough set-up…go read the article.

Now, what questions do you have about the role of data curation? Why is this article relevant? Here are some of the thoughts I walked away with:

Gender distribution of the EU28 population, according to Eurostat. The collected data include data for women and men. The category non-binary is not included in the dataset, and is therefore made invisible.
Chart by Luc Guillemot

I present this article and these questions to you without answers. More than anything, I want you to have these questions in mind as you go about your work. You may end up in a position to advocate for better use of environmental resources, better vetting of datasets, more pre-mortem discussions of the data gathering process, and I hope you do. We need you in the field.

In summary, we advocate for research that centers the people who stand to be adversely affected by the resulting technology, with a broad view on the possible ways that technology can affect people. This, in turn, means making time in the research process for considering environmental impacts, for doing careful data curation and documentation, for engaging with stakeholders early in the design process, for exploring multiple possible paths towards longterm goals, for keeping alert to dual-use scenarios, and finally for allocating research effort to harm mitigation in such cases. (p. 619)

Additional Reading on the Topic

In addition to the costs (environment, financial, opportunity) and harms discussed in Stochastic Parrots, other researchers are looking very closely at some of the datasets used to train these models and finding copyright violations, content duplication, and genre biases (see Jack Bandy’s post).

Brandon Locke and Nic Weber have a chapter currently in publication called “Ethics of Open Data”. Nic has generously allowed you pre-publication access to this chapter, but you will need Canvas access to download it. It is an excellent overview of open data through the lens of virtue, consequential, and non-consequential (deontological) ethics including three relevant case studies. Future iterations of this course will no doubt include this text throughout the modules.

Advocacy and Solidarity Through Data Curation

Original Author: Nic Weber
Editing & Updates by: Bree Norlander

These past 18 months have put a spotlight on so many injustices, inequities, and wrongs. It has been a period of exceptional upheaval in systems that are deeply broken. Throughout the course we have tried to choose relevant examples, but we have certainly at times fallen far short of what should be a course that equips you with skills to advocate and work on behalf of just outcomes. This is a work in progress, and we aim to continue revising and making content that promotes justice, equity, and advocacy in data curation.

If you continue on at the iSchool we strongly recommend courses taught by our brilliant colleagues Megan Finn, Miranda Belarde-Lewis, Clarita Lefthand Begay, Marika Cifor, Jason Yip, and Anna Lauren Hoffman. A few that we recommend in particular are:

Data Feminism & Activism

I can’t say this any better than the blurb for the book authored by Catherine D’Ignazio and Lauren F. Klein: “Today, data science is a form of power. It has been used to expose injustice, improve health outcomes, and topple governments. But it has also been used to discriminate, police, and surveil. This potential for good, on the one hand, and harm, on the other, makes it essential to ask: Data science by whom? Data science for whom? Data science with whose interests in mind? The narratives around big data and data science are overwhelmingly white, male, and techno-heroic. In Data Feminism, Catherine D’Ignazio and Lauren Klein present a new way of thinking about data science and data ethics—one that is informed by intersectional feminist thought.”

Through a critical lens that takes seriously the role of fourth wave feminism, this book, and the advocacy around data feminism, should be a central place to locate the role of curation in contemporary society. The ability to structure, organize, and encode data cannot (as we have discussed throughout the quarter) be divorced from broader societal and institutional forms of power. By approaching these topics of curatorial power rooted in a historical understanding of oppression and civil liberties we might be able to, regardless of our identity and positionality, be more effective allies. I don’t presume to know how, given varied contexts, to do this most effectively yet. But the emergent scholarship around data feminism is an excellent place to begin.

Here are some relevant resources to get started (or continue):

Data For Black Lives

Put straightforwardly: many of the systems of observation and enumeration in state-based data collection are deeply biased against people of color, from the use of flawed algorithmic systems in determining unemployment benefits to facial recognition technologies deployed with deep biases against Black faces.

Perpetuating discriminatory policies through technologies that collect and are trained on imperfect data about people of color is a multi-faceted problem. Overcoming and effectively dismantling these flawed technologies requires effective policy development, legal expertise, and ethically trained data professionals. There is no simple way in which we can educate, legislate, or adjudicate our way out of the biases of technologies that are increasingly being used for harm. As data curators, our role is first to become informed about the ways in which both data collection technologies and the resulting data do harm. Second, we can begin to intervene where our skills and our expertise are appropriate. This will likely include volunteering our time to investigate and describe problems related to data encodings, efficient data processing, and documentation that surfaces bias and attempts to clarify why data are never neutral.

Here are some relevant resources to get started (or continue):

Emerging Topics in Data Curation

Author: Nic Weber

Data Curation Future Directions

This final module discusses the future of data curation. Data curation is a fast-moving field that responds to the emerging needs of data producers, consumers, and users. Many emerging topics in data curation are beyond the scope of a 10-week quarter, but are worthy of our time and attention. Below, I cover some interesting topics and provide resources that you can use for future reference.

These are also not just my thoughts - I asked friends and collaborators to tell me what emergent topics they thought were most important for the future of data curation. Below are their responses (and my own thoughts).

Ontologies for Linked Data

In our previous chapter on linked data I described a rapidly evolving “cloud,” or graph, of linked data published to the web. Linked data are often governed by formal ontologies or vocabularies that include classes, sub-classes, properties, and instances of data. These formalisms allow data to follow a common markup (a standard), and these standards in turn enable relationships between data to be formally expressed and acted upon by machines.

The difference between a “vocabulary” and an “ontology” is often, ironically, semantic - ontologies are really just formal vocabularies that give linked data practitioners standard terms to define subjects, objects, and predicates for markup of data. As the W3C notes:

There is no clear division between what is referred to as “vocabularies” and “ontologies”. The trend is to use the word “ontology” for more complex, and possibly quite formal collection of terms, whereas “vocabulary” is used when such strict formalism is not necessarily used or only in a very loose sense. Vocabularies are the basic building blocks for inference techniques on the Semantic Web.

Even though I argued in our previous module that the semantic web has failed in many respects - where this field has succeeded is in providing useful ontologies for common data on the web. In the example I used for Austin, TX, the ontology that powered our linked data was DbPedia. By providing standard ways to mark up the subject, object, and predicates for Austin, TX, we were able to use simple declarations to query DbPedia to find the population of Austin, TX without having to explicitly state this factual information. In practice, this markup looked like the following:

<http://dbpedia.org/resource/Austin,_Texas>
<http://xmlns.com/foaf/0.1/based_near>
<http://sws.geonames.org/4671654/austin.html>

In terms of linked data - what we used were namespaces from the ontology that were defined on the web. If you click on the first namespace - you see all of what DbPedia knows about Austin, TX.

This allows for a plain language statement like “Austin has a population of 964,254” to take on a machine-readable statement like the following:
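For example (the property and value here are illustrative sketches in the spirit of DbPedia’s populationTotal, not statements copied from DbPedia itself):

<http://dbpedia.org/resource/Austin,_Texas>
<http://dbpedia.org/ontology/populationTotal>
"964254"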

The key here is that DbPedia provided the subject and predicate statements - that is we used a resource in DbPedia to describe a particular place (Austin, TX) and we used a predicate from the ontology dbo:PopulatedPlace/populationDensity to connect the Subject to an Object (the actual population of Austin, TX). Our machines can interpret and know this semantic information simply through syntax. The ontology does all of the work in defining what a resource is, what a predicate means, and what the resulting object value is.

Ontologies are rigid in that they require very specific syntax, but they are also incredibly powerful in that they provide for so much of the standardized factual information we expect (and know) to already exist on the web. DbPedia is just one node in the emerging linked data graph. By understanding and using ontologies of linked data, as we discussed last week, we can connect our data (however locally specific it may be) to this web of information. The key to really understanding and using ontologies effectively is to know which ones exist, and to master their syntax.

Here are some resources to continue exploring linked data and ontologies:

Sensitive Data & Privacy

The curation of data containing sensitive contents, including personally identifiable information (PII), is an emerging challenge for curators. As more personal data are collected, there are increasingly valuable applications for responsibly using PII to generate new knowledge. Here are two emergent examples that are worthy of our attention:

If you read just one thing about the census and differential privacy I highly encourage the following primer:
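To give a taste of what differential privacy involves, here is a minimal sketch of the Laplace mechanism, the basic building block that approaches like the Census Bureau’s elaborate on. The count and epsilon value below are invented for illustration; real deployments involve far more machinery than this.

```python
import random

def laplace_noise(scale):
    """Draw one sample from a Laplace(0, scale) distribution using only the stdlib."""
    # The difference of two exponential draws is Laplace-distributed.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def private_count(true_count, epsilon):
    """Release a count with Laplace noise calibrated to sensitivity 1.

    Adding or removing one person changes a count by at most 1, so noise
    with scale 1/epsilon gives epsilon-differential privacy for that count.
    """
    return true_count + laplace_noise(1 / epsilon)

# An invented block-level population count, as a census might publish it.
noisy = private_count(347, epsilon=0.5)
print(round(noisy))  # close to 347, but randomized to protect any individual
```

The trade-off the census debates revolve around is visible even here: smaller epsilon means more noise (stronger privacy) but less accurate published counts.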

Emulation

Throughout the quarter we have discussed the transformation of data from closed or proprietary formats to open formats (or plain text encodings) which enable data to be reliably reused across different computing environments. There are typically two ways to achieve these transformations: Migration and Emulation.

Migration is, quite simply, the reformatting of data - which depends upon moving content encoded in one standard to another (e.g. Excel to CSV).
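Migration can often be scripted. As a hedged sketch of the idea (using tab-separated text rather than Excel, since real Excel-to-CSV migration needs a spreadsheet library; the sample data are invented), here is a migration from one plain-text encoding to another using only Python’s standard library:

```python
import csv
import io

def migrate_tsv_to_csv(tsv_text):
    """Re-encode tab-separated content as comma-separated values."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()

# A tiny invented dataset in a tab-separated "source" format.
tsv = "city\tpopulation\nAustin\t964254\n"
print(migrate_tsv_to_csv(tsv))  # city,population / Austin,964254
```

The content is unchanged; only the encoding standard moves, which is exactly what makes migration the simpler of the two preservation strategies.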

In digital preservation, an alternative approach to format migration is the practice of emulation. Emulation is, broadly, the reproduction of a computing environment in which a data format is rendered. This allows software to run on hardware, or in a computing environment, other than the one for which it was originally developed.

Most of us have experienced emulation in the context of legacy video games. For example, when we play a version of “Super Mario Bros” in a browser this is the emulation of the original Nintendo console. This allows us to port the functions of Nintendo to our contemporary computing hardware. We no longer need an actual Nintendo console to play the games that were originally developed for that hardware.

Similarly, emulation is being developed for interacting with and using software that is necessary for reproducing analyses, interacting with research artifacts (e.g. simulations or dynamic graphs), and rendering data. Emulation was once exceptionally difficult and prohibitively expensive to develop, but is being made easier by what are called “containers” or “virtual machines” that allow for anyone to reproduce a computing environment in which data were analyzed and used.

Examples and further reading on Emulation: