Posts Tagged ‘XSLT’

“The Magic” – restructuring the EAD to RDF XSLT transform

Wednesday, October 12th, 2011

By Pete Johnston

In a previous post I described how I had used an XSLT transform to generate RDF/XML from the EAD XML representation of the Mass Observation Archive catalogue exported from the CALM archival data management system. My approach was to take the XSLT I’d created within the LOCAH project to process the Archives Hub EAD data as a starting point, and to amend and extend it to processs the MOA data.

In that post, I suggested that there were some aspects of the transformation process which were more “general” and based on structural conventions that were common to, maybe not all, but a large subset of EAD documents, while others were more specific/”local” to the particular content of the MOA data, and that it might be possible/useful to try to separate out these different parts of the processing to make it easier to apply only the general/generic processing and to “swap in” different “local” processing as required for different input datasets.

While thinking about this, I broke things down further:

  • Processing based on generic EAD structures which are used consistently across all EAD documents
  • Processing based on EAD structures which are used consistently across some fairly broad category of EAD documents. I’m thinking here of something like the set of EAD documents which follow the Archives Hub data entry guidelines, or maybe the set of EAD documents generated by export from CALM systems (I say “maybe” here because I don’t have enough experience to know how uniform this process is, and how much variation is possible)
  • Processing where the technique might be generally applied, but “local” configuration or parameterisation is required. For example, the keyword lookup approach I described in my earlier post might be applied to a range of different inputs, but one might want to look up a different set of keywords for the catalogues of the archives of 19th century industrialists on the one hand and those of late twentieth century poets on the other – either simply for the sake of efficiency (e.g. there’s no point in searching for “Hitler” in the 19th century industrialists’ case) or because one wishes to map a single “keyword” to a different “real world entity” in each case.
  • Processing which is very specific to the structure or content of the input data. For example, for the MOA case, the transform included some processing based on specific EAD unitid content (e.g. “If unitid starts with “SxMOA1/2/”, then extract a “topic name” from unittitle“. If this processing was applied to a different set of inputs, it might have no effect (because the test is not satisfied by any unitid) or it might have an unintended effect (if the test is satisfied and the processing is applied to a unittitle not constructed in that way – rather unlikely given the specific nature of the test in this case but still possible)

The previous version of the MOA XSLT used a single transform. I’ve tried to restructure it slightly to reflect these distinctions (or at least the last three of the four). In this new version, there are now three XSLT transforms:

  1. ead2rdf.xslt
  2. lookup-ead2rdf.xsl
  3. moa-ead2rdf.xsl

The first of these (ead2rdf.xsl) is a slightly “stripped down” version of the XSLT from the LOCAH project, which removes processing specific to the Hub data (e.g. the use of particular conventions to mark up controlaccess terms), and can be run stand-alone. Given the nature of the EAD format, I hesitate to say it is generic to all EAD documents: really, its design was driven by the structures of the particular documents I’ve had at hand, and it’s probably still more in the second category in my list above, rather than being completely “generic”. So for example, it makes the assumption that the agency that maintains the finding aid is the same as the agency that provides access to the archive, a restriction which is not required by EAD itself. But it does exclude the name/keyword lookups and some processing which was specific to characteristics of the Archives Hub data and the MOA data.

The second transform (lookup-ead2rdf.xsl) imports the first, and includes the lookup processing. The URIs of the two “lookup tables” (simple XML documents: see http://data.lib.sussex.ac.uk/files/massobservation/xslt/authnames.xml and http://data.lib.sussex.ac.uk/files/massobservation/xslt/keywords.xml for examples) are provided as parameters, so can be any URI, and different lookup files for different inputs can be provided at run-time.

The third XSLT (moa-ead2rdf.xsl) imports the second, and includes the MOA-specific processing. So running moa-ead2rdf.xsl provides the generic processing + the name/keyword lookups + the MOA-specific processing.

And if someone has a different set of EAD inputs where they want to apply some quite different rules, then they can create anotherarchive-ead2rdf.xsl which imports either the first XSLT above (if they don’t want name/keyword lookups) or the second (if they do want name/keyword lookups, for which they can also specify their own “lookup tables”).

I should emphasise that I did this as a fairly quick exercise to try to illustrate that it was possible to “modularise” the processing to separate out the “local” and the “general”. As I’ve suggested above, the separation I’ve made isn’t perfect and the base transform is probably not as “generic” as it might be. There are almost certainly more “elegant” and efficient ways of achieving the separation in XSLT. Nevertheless I found it a useful process to go through and I think it reflects some of the challenges of working with a format like EAD which combines “document-like” and “data-like” characteristics and allows a high level of structural variation.