Repository Architectures (OPTIONAL) Assignment
Assignment OPTIONAL: Repository Architectures
Most of us will never design a new data curation system from the ground up. We will, by and large, develop on top of existing software platforms that provide basic features. This means we can focus our time and attention on configuring these architectures to meet our specific needs.
There are many existing open-source data repositories from which we can choose. However, deciding between different repository platforms can be a daunting task. We have to balance time, skills, user needs, and the requirements for both hardware and other related software.
In this exercise we’ll try to use some basic criteria to profile the software underlying a repository platform, and then decide which repository we feel might ultimately be best for our data collections.
Side note - many of you will not need to evaluate a repository for the sake of your protocol, but this is a handy skill to practice so may still be worth your time.
For this exercise you can chose between four repository platforms.
- CKAN - by the Open Knowledge Foundation Demo
- Dataverse - by Harvard University Demo (see in particular this introduction
- Samavera - by a consortium of University Libraries and Cultural Heritage Institutions
- Fedora - by the DuraSpace consortium.
Evaluation criteria
In order to evaluate a repository we need clear criteria about what makes one more suitable to our users’ needs than any other.
Here is an example of a profile I created for the Qualitative Data Repository when our advisory board needed a simple explanation of why the technical team would invest a year of time and money into switching from one repository software platform (Fedora) to another (Dataverse).
In general we can evaluate a repository in two ways:
- Profiling the software that underlies a repository system: The underlying software for a repository can be thought of as the “repository architecture” (and is sometimes colloquially called the “repository stack”). An architecture typically consists of all the different components that are needed to realize the repository. An architecture typically includes an operating system, a server application, a database, a search engine, and a set of protocols that allow for data to be transferred between different components of the architecture. For example - the list below is the Dataverse architecture:
- Linux: RHEL/CentOS is highly recommended since all development and QA happens on this distribution.
- Glassfish: a Java EE application server to which the Dataverse application (war file) is to be deployed.
- PostgreSQL: a relational database.
- Solr: a search engine. A Dataverse-specific schema is provided.
- SMTP server: for sending mail for password resets and other notifications.
- Persistent identifier service: DOI and Handle support are provided. Production use requires a registered DOI or Handle.net authority.
- Creating user stories / use cases to understand what expectations our users might have about a repository. We have previous completed exercises on these more “user centered” techniques - so for now we will focus our time and attention on evaluating the underlying software of a repository.
In choosing a repository platform we should try to find the right balance between software (and our own skills) and user needs (or how well the software meets those needs).
Profile
Here is a template you can use when starting to evaluate a new repository platform:
Platform-Name
- Who are the producers of the repository?
- What is the license of the repository’s software?
- What are the hardware requirements of the repository?
- What are the related software requirements of the repository (in other words - what is its architecture)?
- What is the current release or version number that is available for download?
- How often are releases made available?
- How do repository developers report an error or bug found in a release?
- How many people are using the repository? What kinds of data do they curate?
- How is the software modified by those users? (e.g. are there plugins or extensible libraries?)
- What type of data is typically stored in the repository? Does the architecture afford any special features for particular kinds of data (e.g. geospatial data repository software might have a map plugin)
- How customizable is the metadata for the repository? Can new application profiles be easily created and implemented?
- Is the repository interoperable with other types of repositories? (e.g. Does the repository have an easily accessible API?)
Major Strengths:
Major weaknesses:
Overall Assessment:
Late Policy
- All assignments are due at 11:59 PM Pacific Time unless otherwise noted.
- If you cannot meet a deadline please get in touch with me 24 hours in advance of the due date to request an extension. - I have a "no questions asked" policy on extensions. I will not ask you to share the personal reasons why you need an extension, but if you would like to share with me for any reason please feel free. I am available listen.
- If you have not made arrangments with me, all late work will be automatically penalized 10% of the final grade of the assignment (e.g. If you would have received a 100% you will receive a 90%.) and will incur a 10% penalty for each additional day that the assignment is not complete.
- For those of you working in groups, please remember that your group members will be negatively impacted by any unexcused lateness on your part.