Posts Tagged ‘Technical’

“The Magic” – restructuring the EAD to RDF XSLT transform

Wednesday, October 12th, 2011

By Pete Johnston

In a previous post I described how I had used an XSLT transform to generate RDF/XML from the EAD XML representation of the Mass Observation Archive catalogue exported from the CALM archival data management system. My approach was to take the XSLT I’d created within the LOCAH project to process the Archives Hub EAD data as a starting point, and to amend and extend it to process the MOA data.

In that post, I suggested that some aspects of the transformation process were more “general”, based on structural conventions common to, if not all, then a large subset of EAD documents, while others were more specific or “local” to the particular content of the MOA data. I also suggested that it might be possible, and useful, to separate out these different parts of the processing, making it easier to apply only the general/generic processing and to “swap in” different “local” processing as required for different input datasets.

While thinking about this, I broke things down further:

  • Processing based on generic EAD structures which are used consistently across all EAD documents
  • Processing based on EAD structures which are used consistently across some fairly broad category of EAD documents. I’m thinking here of something like the set of EAD documents which follow the Archives Hub data entry guidelines, or maybe the set of EAD documents generated by export from CALM systems (I say “maybe” here because I don’t have enough experience to know how uniform this process is, and how much variation is possible)
  • Processing where the technique might be generally applied, but “local” configuration or parameterisation is required. For example, the keyword lookup approach I described in my earlier post might be applied to a range of different inputs, but one might want to look up a different set of keywords for the catalogues of the archives of 19th century industrialists on the one hand and those of late twentieth century poets on the other – either simply for the sake of efficiency (e.g. there’s no point in searching for “Hitler” in the 19th century industrialists’ case) or because one wishes to map a single “keyword” to a different “real world entity” in each case.
  • Processing which is very specific to the structure or content of the input data. For example, for the MOA case, the transform included some processing based on specific EAD unitid content (e.g. “if unitid starts with ‘SxMOA1/2/’, then extract a ‘topic name’ from unittitle”). If this processing were applied to a different set of inputs, it might have no effect (because the test is not satisfied by any unitid) or it might have an unintended effect (if the test is satisfied and the processing is applied to a unittitle not constructed in that way – rather unlikely given the specific nature of the test in this case, but still possible)
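To make that last category concrete, the sort of rule I have in mind might be sketched in XSLT like this. This is a simplified, hypothetical sketch: the match pattern, the topic-extraction logic and the output properties are illustrative rather than the actual MOA transform, and it assumes EAD elements in no namespace:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:dcterms="http://purl.org/dc/terms/">

  <!-- Very input-specific rule: only fires for components whose unitid
       begins "SxMOA1/2/". Applied to other data it would either do
       nothing or do the wrong thing. -->
  <xsl:template match="c[starts-with(did/unitid, 'SxMOA1/2/')]" mode="topic">
    <!-- Hypothetical convention: the "topic name" is the text after
         the first colon in the unittitle -->
    <xsl:variable name="topic"
        select="normalize-space(substring-after(did/unittitle, ':'))"/>
    <xsl:if test="$topic != ''">
      <dcterms:subject>
        <rdf:Description>
          <rdfs:label><xsl:value-of select="$topic"/></rdfs:label>
        </rdf:Description>
      </dcterms:subject>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

The point is that everything here – the test on unitid, the colon convention in unittitle – is bound to one dataset’s cataloguing practice, which is exactly why it belongs in a separate layer.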

The previous version of the MOA XSLT used a single transform. I’ve tried to restructure it slightly to reflect these distinctions (or at least the last three of the four). In this new version, there are now three XSLT transforms:

  1. ead2rdf.xsl
  2. lookup-ead2rdf.xsl
  3. moa-ead2rdf.xsl

The first of these (ead2rdf.xsl) is a slightly “stripped down” version of the XSLT from the LOCAH project, which removes processing specific to the Hub data (e.g. the use of particular conventions to mark up controlaccess terms), and can be run stand-alone. Given the nature of the EAD format, I hesitate to say it is generic to all EAD documents: really, its design was driven by the structures of the particular documents I’ve had at hand, and it’s probably still more in the second category in my list above, rather than being completely “generic”. So for example, it makes the assumption that the agency that maintains the finding aid is the same as the agency that provides access to the archive, a restriction which is not required by EAD itself. But it does exclude the name/keyword lookups and some processing which was specific to characteristics of the Archives Hub data and the MOA data.

The second transform (lookup-ead2rdf.xsl) imports the first, and includes the lookup processing. The URIs of the two “lookup tables” (simple XML documents: see http://data.lib.sussex.ac.uk/files/massobservation/xslt/authnames.xml and http://data.lib.sussex.ac.uk/files/massobservation/xslt/keywords.xml for examples) are provided as parameters, so can be any URI, and different lookup files for different inputs can be provided at run-time.
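As a sketch of how that wiring might look (the parameter and variable names here are assumptions, not necessarily those in the actual stylesheet):

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- The generic transform is imported; its templates have lower
       import precedence than anything defined in this stylesheet -->
  <xsl:import href="ead2rdf.xsl"/>

  <!-- Lookup table URIs supplied at run-time; the defaults here are
       the MOA examples linked above -->
  <xsl:param name="authnamesUri"
      select="'http://data.lib.sussex.ac.uk/files/massobservation/xslt/authnames.xml'"/>
  <xsl:param name="keywordsUri"
      select="'http://data.lib.sussex.ac.uk/files/massobservation/xslt/keywords.xml'"/>

  <!-- Load each table once; document() fetches the file by URI, and
       lookup templates can then match against its entries -->
  <xsl:variable name="authnames" select="document($authnamesUri)"/>
  <xsl:variable name="keywords" select="document($keywordsUri)"/>

</xsl:stylesheet>
```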

The third XSLT (moa-ead2rdf.xsl) imports the second, and includes the MOA-specific processing. So running moa-ead2rdf.xsl provides the generic processing + the name/keyword lookups + the MOA-specific processing.

And if someone has a different set of EAD inputs where they want to apply some quite different rules, then they can create anotherarchive-ead2rdf.xsl which imports either the first XSLT above (if they don’t want name/keyword lookups) or the second (if they do want name/keyword lookups, for which they can also specify their own “lookup tables”).
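The XSLT import mechanism makes that layering straightforward. A hypothetical anotherarchive-ead2rdf.xsl might be little more than the following (the filenames and the parameter name are, again, assumptions):

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Reuse the generic processing plus the lookups. Imported templates
       and parameters have lower import precedence, so any definitions
       below override them -->
  <xsl:import href="lookup-ead2rdf.xsl"/>

  <!-- Point the lookups at this archive's own table -->
  <xsl:param name="keywordsUri" select="'anotherarchive-keywords.xml'"/>

  <!-- Input-specific templates for this archive's data would go here -->

</xsl:stylesheet>
```

Import precedence does the work here: anything the local stylesheet defines quietly overrides the imported, more generic behaviour, which is what makes the “swap in” approach possible.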

I should emphasise that I did this as a fairly quick exercise to try to illustrate that it was possible to “modularise” the processing to separate out the “local” and the “general”. As I’ve suggested above, the separation I’ve made isn’t perfect and the base transform is probably not as “generic” as it might be. There are almost certainly more “elegant” and efficient ways of achieving the separation in XSLT. Nevertheless I found it a useful process to go through and I think it reflects some of the challenges of working with a format like EAD which combines “document-like” and “data-like” characteristics and allows a high level of structural variation.

Linking to LOCAH

Wednesday, October 12th, 2011

As readers of this blog will know, we followed closely in the footsteps of the LOCAH project and we are now linked to the Archives Hub dataset. Which is nice.

See http://data.lib.sussex.ac.uk/archive/doc/concept/moa/advertising as an example

Our other external links are:

– DBpedia (for some places, people & organisations) e.g. http://data.lib.sussex.ac.uk/archive/id/organization/moa/communistpartyofgreatbritain

– Geonames (for some places) e.g. http://data.lib.sussex.ac.uk/archive/id/place/moa/blackpool

– LCSH (for some concepts) e.g. http://data.lib.sussex.ac.uk/archive/id/concept/moa/conscientiousobjectors

– VIAF (for some people) e.g. http://data.lib.sussex.ac.uk/archive/id/person/nra/churchillsirwinstonleonardspencer1874-1965knightprimeministerandhistorian

URIs. A decision

Tuesday, May 10th, 2011

The project team at Sussex (Jane Harvell, Fiona Courage, Chris Keene and myself) met for an hour yesterday to decide about the URI stem for our data.

We took 20 minutes. Did we make a hasty decision? No. Did we make a considered, long-term, looking-to-the-future sort of decision? Yes. The combined expertise round the table was very useful: Jane is very library-minded and looks to the digital future, Fiona is very involved with the Keep and our identity when we are there, and Chris knows about servers and how that bit works. We considered the comments from Rob Styles and the advice from Pete Johnston at Eduserv, and we decided on:

data.lib.sussex.ac.uk/archive

We wanted something that could work with other archive collections (if we decide to make them into linked data) so a Mass Observation or Massobs stem was too exclusive. We also wanted to avoid creating lots and lots of URIs for the same thing in the future so a generic stem seemed the way to go.

Musings about URIs

Wednesday, April 20th, 2011

Choosing a base for our URIs. Easy, right? The task was recently allocated to me. It should take all of 5 minutes, and then I can sit back and sip my coffee at a job well done. Simples.

Annoyingly, not quite yet.

First thing: the URIs will resolve to an actual web server. We’ve got loads of servers, hostnames and aliases (CNAMEs), but which to use? We need a server and hostname that will be stable and permanent. In a rapidly changing world of changing services, server consolidation (and a move towards that cloud stuff, hosted services), what’s best to use?

Two potential base URI options:

  • data.lib.sussex.ac.uk
  • www.sussex.ac.uk/library/

The former was my immediate first choice: it fits in with the common naming practice ‘data.organisation.tld’ (admittedly with ‘lib’ in the middle – I don’t think we are ready to roll out an institutional data service just yet).

The latter was a consideration as it built on an already known and trusted URI on an institutionally embedded service: our University website (and corresponding infrastructure). Both the URL and the service are going to be around for the foreseeable future. What’s more, they don’t require the Library to maintain any additional infrastructure. However, this didn’t fit in with the common convention, and might clash with other Library URLs. And there’s a risk: if the University moved to a new Content Management System it might break our URIs, especially if the CMS required full control of the ‘www.sussex.ac.uk’ namespace. Plus, it just doesn’t look cool.

So current thinking is http://data.lib.sussex.ac.uk/ – we can run it off a server here in the Library, which runs Apache and a number of other undemanding web services (wikis etc). This does require the Library to maintain it, which, to be blunt, might be an issue if I leave. But there is nothing stopping us working with our IT Services and moving data.lib.sussex.ac.uk to a centrally run (or even third-party hosted) server in the future.

Second issue…

Do we need to create a ‘Mass Observation’ namespace under http://data.lib.sussex.ac.uk/, e.g. http://data.lib.sussex.ac.uk/moa/ ?

In a nutshell, keeping it as http://data.lib.sussex.ac.uk/ keeps a simple URI, and allows us to merge other datasets into the same ‘pool’ (I don’t think pool is part of the Linked Data vocabulary, but never mind).

However, the risk is that, should we wish to create more Linked Data sets in the future – whether for the Library Catalogue, the Institutional Repository or other Special Collections – how can we be sure the various identifiers, names and reference numbers will not clash between the different datasets? Will Library Catalogue and Archive metadata be strange bedfellows?

I’ve been discussing this with Pete Johnston from Eduserv who has provided a lot of advice and things to consider. An example which came up in our discussions was:

http://data.lib.sussex.ac.uk/id/person/nra/churchillsirwinstonleonardspencer1874-1965knightprimeministerandhistorian

Would it not be desirable to have the above as a URI which could provide a description (of Winston Churchill) and links to the MOA, other archives, and mentions in the Library Catalogue, all from one URI?

My ignorance in this area is high, but my understanding is that these URIs will probably serve up information from elsewhere (i.e. our hopefully-soon-to-be Talis Platform Store) and present it. Will having one namespace confuse things, given that the URI will need to fetch data from potentially various stores and sources to present to the requester (human, computer or otherwise)?

Perhaps another option is to keep to one namespace, but separate it out into collections further ‘down’, e.g.

http://data.lib.sussex.ac.uk/id/document/moa/1234

http://data.lib.sussex.ac.uk/id/document/anotherarchive/1234

Is that an option, or will it break conventions?
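For what it’s worth, on the transform side the collection segment could simply be a run-time parameter, so the same stylesheet could mint URIs in either scheme. A hypothetical XSLT sketch (the parameter names, the ‘document’ path segment and the use of unitid are all assumptions):

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

  <!-- Base and per-collection segment, overridable at run-time -->
  <xsl:param name="baseUri" select="'http://data.lib.sussex.ac.uk/id/'"/>
  <xsl:param name="collection" select="'moa'"/>

  <!-- Mint a collection-scoped URI for a component from its unitid -->
  <xsl:template match="c" mode="uri">
    <rdf:Description>
      <xsl:attribute name="rdf:about">
        <xsl:value-of select="concat($baseUri, 'document/', $collection, '/',
            normalize-space(did/unitid))"/>
      </xsl:attribute>
    </rdf:Description>
  </xsl:template>

</xsl:stylesheet>
```

Running the same stylesheet with collection set to ‘anotherarchive’ would then yield the second pattern above without touching the rest of the processing.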

I should say at this point that we are using Designing URI Sets for the UK Public Sector as a guide for creating URIs and are trying to stick to their guidelines as much as possible.

However: I am torn between the grand, right(?), more technical nirvana of one namespace, and the less risky approach of keeping the MOA data in its own namespace silo (you can’t have an open data blog post without the s word).

The problem ultimately is that I am still so very new to this I find it hard to think about what the issues may be, or what the right and wrong approaches are.

So, I welcome your expertise, thoughts and insights. What would you recommend? What are we not thinking about that we should? What problems are we making for ourselves down the line? What is the right approach? And (building on that last question) what is the right approach considering our somewhat limited resources and time?

So ladies and gentlemen, your thoughts please? Please.