{"id":206,"date":"2011-10-12T08:20:37","date_gmt":"2011-10-12T08:20:37","guid":{"rendered":"http:\/\/blogs.sussex.ac.uk\/salda\/?p=206"},"modified":"2011-10-12T08:20:37","modified_gmt":"2011-10-12T08:20:37","slug":"the-magic-restructuring-the-ead-to-rdf-xslt-transform","status":"publish","type":"post","link":"https:\/\/blogs.sussex.ac.uk\/salda\/2011\/10\/12\/the-magic-restructuring-the-ead-to-rdf-xslt-transform\/","title":{"rendered":"&#8220;The Magic&#8221; &#8211; restructuring the EAD to RDF XSLT transform"},"content":{"rendered":"<p><strong>By Pete Johnston<\/strong><\/p>\n<p>In <a href=\"..\/..\/..\/..\/..\/2011\/05\/16\/the-data-transformation\/\">a  previous post<\/a> I described how I had used an XSLT transform to generate  RDF\/XML from the EAD XML representation of the Mass Observation Archive  catalogue exported from the CALM archival data management system. My approach  was to take the XSLT I&#8217;d created within the <a href=\"http:\/\/blogs.ukoln.ac.uk\/locah\/\">LOCAH project<\/a> to process the Archives  Hub EAD data as a starting point, and to amend and extend it to process the MOA  data.<\/p>\n<p>In that post, I suggested that there were some aspects of the transformation  process which were more &#8220;general&#8221; and based on structural conventions that were  common to, maybe not all, but a large subset of EAD documents, while others were  more specific\/&#8220;local&#8221; to the particular content of the MOA data, and that it  might be possible\/useful to try to separate out these different parts of the  processing to make it easier to apply only the general\/generic processing and to  &#8220;swap in&#8221; different &#8220;local&#8221; processing as required for different input  datasets.<\/p>\n<p>While thinking about this, I broke things down further:<\/p>\n<ul>\n<li>Processing based on generic EAD structures which are used consistently  across all EAD documents<\/li>\n<li>Processing based on EAD structures which are used consistently 
across some  fairly broad category of EAD documents. I&#8217;m thinking here of something like the  set of EAD documents which follow the Archives Hub data entry guidelines, or  maybe the set of EAD documents generated by export from CALM systems (I say  &#8220;maybe&#8221; here because I don&#8217;t have enough experience to know how uniform this  process is, and how much variation is possible)<\/li>\n<li>Processing where the technique might be generally applied, but &#8220;local&#8221;  configuration or parameterisation is required. For example, the keyword lookup  approach I described in my earlier post might be applied to a range of different  inputs, but one might want to look up a different set of keywords for the  catalogues of the archives of 19th century industrialists on the one hand and  those of late twentieth century poets on the other &#8211; either simply for the sake  of efficiency (e.g. there&#8217;s no point in searching for &#8220;Hitler&#8221; in the 19th  century industrialists&#8217; case) or because one wishes to map a single &#8220;keyword&#8221; to  a different &#8220;real world entity&#8221; in each case.<\/li>\n<li>Processing which is very specific to the structure or content of the input  data. For example, for the MOA case, the transform included some processing  based on specific EAD <tt>unitid<\/tt> content (e.g. &#8220;If <tt>unitid<\/tt> starts  with &#8220;SxMOA1\/2\/&#8221;, then extract a &#8220;topic name&#8221; from <tt>unittitle<\/tt>&#8221;. If this  processing was applied to a different set of inputs, it might have no effect  (because the test is not satisfied by any <tt>unitid<\/tt>) or it might have an  unintended effect (if the test is satisfied and the processing is applied to a  <tt>unittitle<\/tt> not constructed in that way &#8211; rather unlikely given the  specific nature of the test in this case but still possible)<\/li>\n<\/ul>\n<p>The previous version of the MOA XSLT used a single transform. 
I&#8217;ve tried to  restructure it slightly to reflect these distinctions (or at least the last  three of the four). In this new version, there are now three XSLT  transforms:<\/p>\n<ol>\n<li><a href=\"http:\/\/code.google.com\/p\/salda\/source\/browse\/trunk\/xslt\/ead2rdf.xsl\" target=\"_blank\">ead2rdf.xsl<\/a><\/li>\n<li><a href=\"http:\/\/code.google.com\/p\/salda\/source\/browse\/trunk\/xslt\/lookup-ead2rdf.xsl\" target=\"_blank\">lookup-ead2rdf.xsl<\/a><\/li>\n<li><a href=\"http:\/\/code.google.com\/p\/salda\/source\/browse\/trunk\/xslt\/moa-ead2rdf.xsl\">moa-ead2rdf.xsl<\/a><\/li>\n<\/ol>\n<p>The first of these (ead2rdf.xsl) is a slightly &#8220;stripped down&#8221; version of the  XSLT from the LOCAH project, which removes processing specific to the Hub data  (e.g. the use of particular conventions to mark up <tt>controlaccess<\/tt> terms), and can be run stand-alone. Given the nature of the EAD format, I  hesitate to say it is generic to all EAD documents: really, its design was  driven by the structures of the particular documents I&#8217;ve had at hand, and it&#8217;s  probably still more in the second category in my list above, rather than being  completely &#8220;generic&#8221;. So for example, it makes the assumption that the agency  that maintains the finding aid is the same as the agency that provides access to  the archive, a restriction which is not required by EAD itself. But it does  exclude the name\/keyword lookups and some processing which was specific to  characteristics of the Archives Hub data and the MOA data.<\/p>\n<p>The second transform (lookup-ead2rdf.xsl) imports the first, and includes the  lookup processing. 
The URIs of the two &#8220;lookup tables&#8221; (simple XML documents:  see <a href=\"http:\/\/data.lib.sussex.ac.uk\/files\/massobservation\/xslt\/authnames.xml\" target=\"_self\">http:\/\/data.lib.sussex.ac.uk\/files\/massobservation\/xslt\/authnames.xml<\/a> and <a href=\"http:\/\/data.lib.sussex.ac.uk\/files\/massobservation\/xslt\/keywords.xml\" target=\"_blank\">http:\/\/data.lib.sussex.ac.uk\/files\/massobservation\/xslt\/keywords.xml<\/a> for examples) are provided as parameters, so can be any URI, and different  lookup files for different inputs can be provided at run-time.<\/p>\n<p>The third XSLT (moa-ead2rdf.xsl) imports the second, and includes the  MOA-specific processing. So running moa-ead2rdf.xsl provides the generic  processing + the name\/keyword lookups + the MOA-specific processing.<\/p>\n<p>And if someone has a different set of EAD inputs where they want to apply  some quite different rules, then they can create anotherarchive-ead2rdf.xsl  which imports either the first XSLT above (if they don\u2019t want name\/keyword  lookups) or the second (if they do want name\/keyword lookups, for which they can  also specify their own &#8220;lookup tables&#8221;).<\/p>\n<p>I should emphasise that I did this as a fairly quick exercise to try to  illustrate that it was possible to &#8220;modularise&#8221; the processing to separate out  the &#8220;local&#8221; and the &#8220;general&#8221;. As I&#8217;ve suggested above, the separation I&#8217;ve made  isn&#8217;t perfect and the base transform is probably not as &#8220;generic&#8221; as it might  be. There are almost certainly more &#8220;elegant&#8221; and efficient ways of achieving  the separation in XSLT. 
Nevertheless I found it a useful process to go through  and I think it reflects some of the challenges of working with a format like EAD  which combines &#8220;document-like&#8221; and &#8220;data-like&#8221; characteristics and allows a high  level of structural variation.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>By Pete Johnston In a previous post I described how I had used an XSLT transform to generate RDF\/XML from the EAD XML representation of the Mass Observation Archive catalogue exported from the CALM archival data management system. My approach was to take the XSLT I&#8217;d created within the LOCAH project to process the Archives [&hellip;]<\/p>\n","protected":false},"author":23,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[126],"tags":[108,107,132,4139,157,158],"_links":{"self":[{"href":"https:\/\/blogs.sussex.ac.uk\/salda\/wp-json\/wp\/v2\/posts\/206"}],"collection":[{"href":"https:\/\/blogs.sussex.ac.uk\/salda\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.sussex.ac.uk\/salda\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.sussex.ac.uk\/salda\/wp-json\/wp\/v2\/users\/23"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.sussex.ac.uk\/salda\/wp-json\/wp\/v2\/comments?post=206"}],"version-history":[{"count":9,"href":"https:\/\/blogs.sussex.ac.uk\/salda\/wp-json\/wp\/v2\/posts\/206\/revisions"}],"predecessor-version":[{"id":222,"href":"https:\/\/blogs.sussex.ac.uk\/salda\/wp-json\/wp\/v2\/posts\/206\/revisions\/222"}],"wp:attachment":[{"href":"https:\/\/blogs.sussex.ac.uk\/salda\/wp-json\/wp\/v2\/media?parent=206"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.sussex.ac.uk\/salda\/wp-json\/wp\/v2\/categories?post=206"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.sussex.ac.uk\/salda\/wp-json\/wp\/v2\/tags?post=206"}],"curies":[{"name":"wp","href":"
https:\/\/api.w.org\/{rel}","templated":true}]}}