Modules
Directory of modules.
-
Week 3: Tidy Data
Overview of tidy data principles as they relate to data curation, and of how those principles extend to organizing, managing, and preparing all kinds of structured data for meaningful use.
[video] [content] [exercise]
Required Readings:
- Course Content
- Rawson and Muñoz (2016) Against Cleaning
- Wickham, H. (2014), “Tidy Data,” Journal of Statistical Software, 59, 1–23
Suggested Readings:
- Wickham, H. (2014), “Tidy Data,” Journal of Statistical Software (more code & examples than required reading)
- Tierney, N. J., & Cook, D. H. (2018). Expanding tidy data principles to facilitate missing data exploration, visualization and assessment of imputations
- Leek (2016) Non-tidy data
- Broman, K. W., & Woo, K. H. (2018). Data organization in spreadsheets. The American Statistician, 72(1), 2-10
- Tort, F. (2010). Teaching spreadsheets: Curriculum design principles
- Mack, K., Lee, J., Chang, K., Karahalios, K., & Parameswaran, A. (2018, April). Characterizing scalability issues in spreadsheet software using online forums
- Formatting data tables in spreadsheets: Data Carpentry Lesson
-
Week 4: Data Integration
Data integration as it operates at the logical level of tables, and data that feed into user interfaces.
[lecture] [guest lecture] [content] [exercise]
Required Readings:
Highly Recommended Readings:
Optional Readings:
- Halevy, A., Rajaraman, A., & Ordille, J. (2006, September). Data integration: The teenage years. In Proceedings of the 32nd international conference on Very large data bases (pp. 9-16)
- Abiteboul, S., Buneman, P., & Suciu, D. (2000). Data on the Web: from relations to semistructured data and XML. Morgan Kaufmann
-
Week 5: Data Packaging
How and why our work in developing metadata for data curation is paramount to sustainable access, plus an introduction to a few broadly used standards for creating data packages.
[video] [content] [exercise]
Required Readings:
- Course Content
- Bechhofer, S., De Roure, D., Gamble, M., Goble, C., & Buchan, I. (2010). Research objects: Towards exchange and reuse of digital knowledge
- Skim this list of projects and tools for data packaging: Google Sheet or PDF
- Neylon (2017) Packaging data
Pick one of the following to read or review in depth:
-
Week 6: Repository Architectures
Builds upon the Data Curation I discussion of a data repository as a layered architecture for curation.
[video] [content]
Required Readings:
- Course Content
- Description of digital libraries (cyberinfrastructure) from the National Science Foundation program
- This post from the IQSS staff at Harvard’s Dataverse provides an excellent table comparing existing data repository services. Pay attention to the categories being compared, and how they relate to the affordances of the software
- Fallaw, C., Dunham, E., … (2016). Overly honest data repository development. Code4Lib
Review documentation for just one repository platform listed below (be sure to also look at an example of the platform’s deployment):
- Samvera (Open-source repository for universities and institutional repositories)
- Dataverse (Open-source repository for social science data)
- Fedora (Open-source repository with semantic capabilities - often used by science repositories)
- CKAN (Open-source data repository - often used for civic data)
  - About
  - Documentation
  - Example deployments https://data.gov.au/ and Data.gov
  - Some additional info on Data.gov.au’s CKAN
- Clowder (Open-source for long-tail data)
Suggested Readings:
- Amorim, R. C., Castro, J. A., Da Silva, J. R., & Ribeiro, C. (2017). A comparison of research data management platforms: architecture, flexible metadata and interoperability. Universal Access in the Information Society, 16(4), 851-862
- Lnenicka, M. (2015). An in-depth analysis of open data portals as an emerging public e-service. International Journal of Social, Education, Economics and Management Engineering, 9(2), 589-599. (see table 3 in particular for a comparative approach to Open Data portal evaluation)
- Cornell University Library Repository Principles and Strategies Handbook (I highly recommend this if you are looking for some background on how a University Library strategizes around digital infrastructures)
- Blanke, T., & Hedges, M. (2013). Scholarly primitives: Building institutional infrastructure for humanities e-Science. Future Generation Computer Systems, 29(2), 654-661
-
Week 7: Data Acquisition, Search, and Discovery
A review of the fundamental challenges that data curators face in making data discoverable.
[video] [content] [exercise]
Required Readings:
- Course Content
- Google Dataset Search: Building a search engine for datasets in an open Web ecosystem.
- Facilitating the discovery of public datasets
- Discovering millions of datasets on the web
Suggested Readings:
- Data Discovery Paradigms: User Requirements and Recommendations for Data Repositories.
- Understanding data search as a socio-technical practice.
- Scientific user requirements for a herbarium data portal.
- Scholar‐built collections: A study of user requirements for (Humanities) research in large‐scale digital libraries.
- Improving the discoverability and web impact of open repositories: techniques and evaluation.
Case Study (Optional):
-
Week 8: Metadata Application Profiles
Introduction to tidy metadata.
[video] [content] [exercise]
Required Readings:
- Course Content
- Application profiles:
  - Heery, R., & Patel, M. (2000). Application profiles: mixing and matching metadata schemas. Ariadne, (25)
  - The Singapore Framework for Application Profiles. Note: this is currently under revision by DCMI. You can catch up on their work here (and also see an example of use cases in the wild)
- Some examples of metadata application profiles:
Suggested Readings:
- Hebron, T. K. (2018). Extending and Adapting Metadata Audit Tools for Mountain West Digital Library Members. Code4Lib Journal, (41)
- Curado Malta, M., Bermúdez Sabel, H., Baptista, A. A., & González-Blanco García, E. (2018). Validation of a metadata application profile domain model
- Stein, A., & Dunham, E. (2018). Meaningful Data Sharing: Developing the Illinois Data Bank Metadata Framework. Journal of Library Metadata, 18(2), 59-83
-
Week 9: Linked Data
Introducing some working definitions and providing an overview of concepts related to linked data and the promise, but ultimate failure, of the semantic web.
[video] [content]
Required Readings:
- Course Content
- Allemang, D., & Hendler, J. (2011). Semantic web for the working ontologist: effective modeling in RDFS and OWL. Second Edition
  - Read Chapter 1 for an introduction to SW’s concepts. If you are interested, Chapter 2 gives a bit more detail on how the SW works, and Chapter 3 introduces RDF and knowledge modeling.
- Ontology Development 101 (Noy and McGuinness)
  - Read Sections 1 and 2 (3 and 4 are optional)
  - Note: this is a classic formulation of what an ontology is and how to create one. The software they reference in building out the example is called Protégé (free, https://protege.stanford.edu/). If you are really keen you can follow along. (For reference, this short list from Wikipedia is quite helpful.)
- Ontology for Data Science
- Semantic Web for the Legal Domain
Suggested Readings:
-