“The Magic” – restructuring the EAD to RDF XSLT transform

October 12th, 2011

By Pete Johnston

In a previous post I described how I had used an XSLT transform to generate RDF/XML from the EAD XML representation of the Mass Observation Archive catalogue exported from the CALM archival data management system. My approach was to take the XSLT I’d created within the LOCAH project to process the Archives Hub EAD data as a starting point, and to amend and extend it to process the MOA data.

In that post, I suggested that some aspects of the transformation process were more “general”, based on structural conventions common to a large subset of EAD documents (if not all of them), while others were more specific or “local” to the particular content of the MOA data. It might be possible, and useful, to separate these parts of the processing, so that it is easier to apply only the generic processing and to “swap in” different “local” processing as required for different input datasets.

While thinking about this, I broke things down further:

  • Processing based on generic EAD structures which are used consistently across all EAD documents
  • Processing based on EAD structures which are used consistently across some fairly broad category of EAD documents. I’m thinking here of something like the set of EAD documents which follow the Archives Hub data entry guidelines, or maybe the set of EAD documents generated by export from CALM systems (I say “maybe” here because I don’t have enough experience to know how uniform this process is, and how much variation is possible)
  • Processing where the technique might be generally applied, but “local” configuration or parameterisation is required. For example, the keyword lookup approach I described in my earlier post might be applied to a range of different inputs, but one might want to look up a different set of keywords for the catalogues of the archives of 19th century industrialists on the one hand and those of late twentieth century poets on the other – either simply for the sake of efficiency (e.g. there’s no point in searching for “Hitler” in the 19th century industrialists’ case) or because one wishes to map a single “keyword” to a different “real world entity” in each case.
  • Processing which is very specific to the structure or content of the input data. For example, for the MOA case, the transform included some processing based on specific EAD unitid content (e.g. “If unitid starts with ‘SxMOA1/2/’, then extract a ‘topic name’ from unittitle”). If this processing were applied to a different set of inputs, it might have no effect (because the test is not satisfied by any unitid) or it might have an unintended effect (if the test is satisfied and the processing is applied to a unittitle not constructed in that way: rather unlikely given the specific nature of the test in this case, but still possible).
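
As a rough illustration of the third category, the lookup technique can stay the same while the table is swapped for each dataset. The sketch below is in Python rather than XSLT, and the table contents, the example.org URIs and the function name are all made up for illustration; none of them are taken from the SALDA lookup files.

```python
import re

# Hypothetical lookup tables: keyword -> URI of the "real world entity".
# The same keyword can map to a different entity for a different archive.
MOA_KEYWORDS = {
    "churchill": "http://example.org/id/person/churchill-winston",
    "coronation": "http://example.org/id/event/coronation-1953",
}
OTHER_KEYWORDS = {
    "churchill": "http://example.org/id/organization/churchill-college",
}


def find_entities(title, table):
    """Return the URIs of every table keyword that occurs as a whole word in title."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    return sorted(uri for keyword, uri in table.items() if keyword in words)


title = "Churchill and the Coronation: observers' reports"
moa_hits = find_entities(title, MOA_KEYWORDS)
other_hits = find_entities(title, OTHER_KEYWORDS)
```

The same title yields different entities depending on which table is passed in, which is the “local configuration” part; the matching logic itself does not change.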

The previous version of the MOA XSLT used a single transform. I’ve tried to restructure it slightly to reflect these distinctions (or at least the last three of the four). In this new version, there are now three XSLT transforms:

  1. ead2rdf.xsl
  2. lookup-ead2rdf.xsl
  3. moa-ead2rdf.xsl

The first of these (ead2rdf.xsl) is a slightly “stripped down” version of the XSLT from the LOCAH project, which removes processing specific to the Hub data (e.g. the use of particular conventions to mark up controlaccess terms), and can be run stand-alone. Given the nature of the EAD format, I hesitate to say it is generic to all EAD documents: really, its design was driven by the structures of the particular documents I’ve had at hand, and it’s probably still more in the second category in my list above, rather than being completely “generic”. So for example, it makes the assumption that the agency that maintains the finding aid is the same as the agency that provides access to the archive, a restriction which is not required by EAD itself. But it does exclude the name/keyword lookups and some processing which was specific to characteristics of the Archives Hub data and the MOA data.

The second transform (lookup-ead2rdf.xsl) imports the first, and includes the lookup processing. The URIs of the two “lookup tables” (simple XML documents: see http://data.lib.sussex.ac.uk/files/massobservation/xslt/authnames.xml and http://data.lib.sussex.ac.uk/files/massobservation/xslt/keywords.xml for examples) are provided as parameters, so can be any URI, and different lookup files for different inputs can be provided at run-time.

The third XSLT (moa-ead2rdf.xsl) imports the second, and includes the MOA-specific processing. So running moa-ead2rdf.xsl provides the generic processing + the name/keyword lookups + the MOA-specific processing.

And if someone has a different set of EAD inputs where they want to apply some quite different rules, then they can create anotherarchive-ead2rdf.xsl which imports either the first XSLT above (if they don’t want name/keyword lookups) or the second (if they do want name/keyword lookups, for which they can also specify their own “lookup tables”).
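
The import chain described above can be pictured, very loosely, as class inheritance: each layer reuses the one beneath it and adds or overrides only what it needs. This is a Python analogy and not how the XSLT is written; the class and method names are hypothetical, the keyword table and sample unitid/title are invented, and the “local rule” is reduced to a flag because the real title parsing is not shown here.

```python
class Ead2Rdf:
    """Stands in for ead2rdf.xsl: the general processing only."""

    def describe(self, unitid, unittitle):
        return {"unitid": unitid, "title": unittitle}


class LookupEad2Rdf(Ead2Rdf):
    """Stands in for lookup-ead2rdf.xsl: adds lookups, with the table passed in."""

    def __init__(self, keywords):
        self.keywords = keywords  # hypothetical keyword -> URI table

    def describe(self, unitid, unittitle):
        record = super().describe(unitid, unittitle)
        lowered = unittitle.lower()
        record["subjects"] = sorted(
            uri for keyword, uri in self.keywords.items() if keyword in lowered
        )
        return record


class MoaEad2Rdf(LookupEad2Rdf):
    """Stands in for moa-ead2rdf.xsl: adds a rule keyed on MOA unitid values."""

    def describe(self, unitid, unittitle):
        record = super().describe(unitid, unittitle)
        # Placeholder for the local rule: only records whether the test matched.
        record["topic_collection"] = unitid.startswith("SxMOA1/2/")
        return record


moa = MoaEad2Rdf({"religion": "http://example.org/concept/religion"})
record = moa.describe("SxMOA1/2/47", "Religion 1937-49")
```

Someone with a different archive would write their own subclass of either `Ead2Rdf` or `LookupEad2Rdf`, in the same way that a new stylesheet would import the first or second transform.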

I should emphasise that I did this as a fairly quick exercise to try to illustrate that it was possible to “modularise” the processing to separate out the “local” and the “general”. As I’ve suggested above, the separation I’ve made isn’t perfect and the base transform is probably not as “generic” as it might be. There are almost certainly more “elegant” and efficient ways of achieving the separation in XSLT. Nevertheless I found it a useful process to go through and I think it reflects some of the challenges of working with a format like EAD which combines “document-like” and “data-like” characteristics and allows a high level of structural variation.

Linking to LOCAH

October 12th, 2011

As readers of this blog will know, we followed closely in the footsteps of the LOCAH project and we are now linked to the Archives Hub dataset. Which is nice.

See http://data.lib.sussex.ac.uk/archive/doc/concept/moa/advertising as an example

Our other external links are:

– DBpedia (for some places, people & organisations) e.g. http://data.lib.sussex.ac.uk/archive/id/organization/moa/communistpartyofgreatbritain

– Geonames (for some places) e.g. http://data.lib.sussex.ac.uk/archive/id/place/moa/blackpool

– LCSH (for some concepts) e.g. http://data.lib.sussex.ac.uk/archive/id/concept/moa/conscientiousobjectors

– VIAF (for some people) e.g. http://data.lib.sussex.ac.uk/archive/id/person/nra/churchillsirwinstonleonardspencer1874-1965knightprimeministerandhistorian

Final blog post

July 25th, 2011

This is our final blog post for the JISC RDTF (now Discovery) SALDA project on the completion of the six month project. I’m sure there will be more related blog posts here in the coming months.

Things we have produced

The SALDA Project has produced the following:

The catalogue data of the Mass Observation Archive is now available on the Talis Platform licensed under ODC-PDDL.

Simple text search http://api.talis.com/stores/massobservation/items

Sparql interface at: http://api.talis.com/stores/massobservation/services/sparql

The SALDA XSLT stylesheet is here licensed under modified BSD licence

Download the data in RDF

Chris Keene has created pages for open data at the University of Sussex Library:

http://data.lib.sussex.ac.uk/

The direct link to the SALDA produced data  from the Mass Observation Archive is here:

http://data.lib.sussex.ac.uk/data/mass-observation/

Some human readable examples of the data:

http://data.lib.sussex.ac.uk/archive/doc/person/nra/harrissonthomas1911-1976anthropologist

http://data.lib.sussex.ac.uk/archive/id/archivalresource/gb181SxMOA1

The data references terms from (amongst others) the following RDF vocabularies (thanks to Pete Johnston at Eduserv):

http://purl.org/dc/terms/
http://xmlns.com/foaf/0.1/
http://www.w3.org/2004/02/skos/core#
http://www.openarchives.org/ore/terms/
http://linkedevents.org/ontology/
http://data.archiveshub.ac.uk/def/

Pete has also produced browse pages for concepts, people and places, which offer other ways into the data and are great for showing the data. These are in addition to our core deliverables and are not live yet.

In-house cataloguing guidelines

An unexpected result of the SALDA project was a review of our cataloguing procedures and the following guides were produced by myself and a colleague Adam Harwood who is currently cataloguing the University of Sussex Collection.

CALM_ISADG_Collection level This document maps the required ISAD G fields to the CALM fields with guidelines on how to populate the fields. We have also included the fields required for export to EAD using the Archive Hub report on CALM.

cataloguing procedures component level This document provides guidelines for completing component level records in CALM.

Next steps

Now the data is on the platform, we will advertise it at open data days. We are working on a leaflet which invites anyone to work with our data and see what they can do.

We are working with our partners at the Keep on the IT infrastructure for the new development. The SALDA project opened a dialogue on Linked Data and has given us useful skills and knowledge about another route we could take to share data between the partners.

At Sussex, we are going to look at our collections and make a priority list of ones where the catalogue data could be turned into Linked Data by considering:

  • If we can make the data available under ODC-PDDL
  • What changes/additions we need to make to the data and its structure
  • What the potential uses/benefits are

As a personal goal, I would like to work with archivists and developers to find common ground about Linked Data: how we understand it, its uses and benefits, and what words we use to describe it. I would also like to find examples of it in use, because Linked Data is very much behind the scenes and so can be hard to “sell” without an example of its use in human readable format. I also attended a brilliant “legal update for information professionals” workshop led by Naomi Korn and Professor Charles Oppenheim, which really got me interested in risk management, which relates to the licensing part of the project.

Evidence of reuse

We have registered the dataset on CKAN and hope to be part of the current UK discovery competition

Skills

This has been a steep learning curve for me as project manager to get my head around the world of Linked Data.  All praise to Pete Johnston who is able to write in a way that I understand, yet still convey the level of technical detail that is required.

Pete has provided the expertise on the project, working with scripts devised for the Locah project and adapting them for SALDA. He has been working with Chris to move the data to the platform and to move the scripts used to our data.lib.sussex.ac.uk URI. You can read more about this in Chris’s blog post.

We are grateful to all the team at the Locah project for forging the path ahead and allowing us to follow in their footsteps.

Chris Keene has created webpages for open data at the University of Sussex Library to keep open data on the agenda. Openness reflects the strategic goals of the Library e-strategy: Search and discovery 2011-2015

We’ve all learnt more about archival metadata and EAD during the project.

Most significant lessons

Now then, these might be a bit basic and from my own experience. I’m sure my technical colleagues could add to them, though the lessons we have learnt and the processes we have been through in technical areas are well documented on this blog.

  • At the beginning, no one (archive colleagues, library colleagues, friends, family) will know what you are talking about when you mention Linked Data. When you show an example or try and explain it they will look blank. You need to work out a way of explaining and demonstrating it that can be understood.
  • Keep in regular contact with technical consultants if they are not part of the in-house team. We had a face to face meeting, phone calls and regular (weekly) email contact.
  • Think long term about the sustainability and future uses of the data even if it’s only a six month project. We thought long and hard about our URI stem to make it as generic and sustainable as possible and try and re-use URIs rather than making lots of new ones.

Converting EAD data to RDF Linked Data

July 25th, 2011

In my last blog post I discussed how to set up our server to handle the URIs being created within our Linked Data, and said the next step was for us to turn our EAD/XML data from Calm in to RDF/XML Linked Data.

This is a big step. Until now our process looked something like this: Export EAD data -> send it to someone else -> Magic -> Linked Data!

Pete Johnston provided us with details of the magic part. In essence much of the complexity is hidden in an XSLT script (XSLT is a language to process XML in to different schemas, such as here, or in to HTML and other formats). He’s blogged about some of the decisions and concepts that have gone in to it. However, here, we can treat it like a black box. It’s still magic, but we know how to use it.

Converting EAD to RDF using XSLT and Saxon

We use the Saxon HE XSLT (Java) version to do the transformation. It’s simple to download and set up. The core step is very simple: run Saxon, passing it the name of the EAD/XML file and the XSLT file. An example command line looks like this:

java -jar 'saxon9he.jar' -s:ead/ -xsl:xslt/ead2rdf.xsl -o:rdf/ root=http://data.lib.sussex.ac.uk/archive/

And there you have it, your EAD data is now RDF!
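
If you want to script this step, the same command can be assembled in Python. This is only a sketch: it uses nothing beyond the options already shown in the command above (-s, -xsl, -o and the root stylesheet parameter), the function names are my own, and actually running it assumes Java and saxon9he.jar are available in the working directory.

```python
import subprocess


def saxon_command(source, stylesheet, output, root, jar="saxon9he.jar"):
    """Build the argument list for the Saxon command line shown above."""
    return [
        "java", "-jar", jar,
        f"-s:{source}",
        f"-xsl:{stylesheet}",
        f"-o:{output}",
        f"root={root}",
    ]


def run_transform(source, stylesheet, output, root):
    """Run Saxon and raise CalledProcessError if it exits with an error."""
    subprocess.run(saxon_command(source, stylesheet, output, root), check=True)


command = saxon_command(
    "ead/", "xslt/ead2rdf.xsl", "rdf/", "http://data.lib.sussex.ac.uk/archive/"
)
```

Passing the arguments as a list rather than one string avoids shell quoting problems if a path contains spaces.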

Before the data is loaded in to the Talis Platform store, there’s a couple more things we do.

Triples and Turtle

The first is the conversion of the RDF/XML in to the alternative RDF format N-Triples (and also Turtle) using the Raptor RDF parser.

RDF can be written and presented in a number of ways. Probably the most common method is using XML, partly due to the XML language being so ubiquitous, however it is very verbose and can be difficult to read by us humans.

Not only is N-Triples considered easier to read, but each line contains a complete and self-contained triple (a triple contains a subject, predicate and object, mostly expressed as URIs). While it isn’t too much of an issue here, this allows us to split up the data in to smaller chunks/files which can be POSTed to the Talis Platform.
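
Because every line stands alone, splitting is just a matter of cutting on line boundaries. Here is a minimal Python sketch of that idea; the function name, the chunk size and the sample triples are illustrative and are not part of the scripts we actually used.

```python
def chunk_ntriples(lines, triples_per_chunk):
    """Yield lists of N-Triples lines, never splitting a triple across chunks."""
    chunk = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        chunk.append(stripped)
        if len(chunk) == triples_per_chunk:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly shorter, chunk


sample = [
    '<http://example.org/a> <http://example.org/p> "one" .',
    '<http://example.org/b> <http://example.org/p> "two" .',
    "",
    '<http://example.org/c> <http://example.org/p> "three" .',
]
chunks = list(chunk_ntriples(sample, 2))
```

One caveat: if the data uses blank nodes, cutting the file could separate statements that share a blank node label, so this simple approach suits data where things are identified by URIs.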

Talis Platform

The Talis Platform is a well established Triple Store (think of a SQL database but with three part triples rather than records and tables). While you can run your own Triple Store using software such as ARC2, the Talis Platform provides a stable, robust and quick solution.

You interact with the Platform with standard HTTP requests: GET, POST, DELETE etc. However, for simplicity, an interactive command prompt front end called Pynappl has been developed in Python. This allows you to simply specify the store you wish to work with, authenticate, and then use commands such as ‘store filename.rdf’ to upload data.

A simple script can upload our data to the Platform, uploading each n-triple file created above.

The final step is to try out the SPARQL interface at:

http://api.talis.com/stores/massobservation/services/sparql

Here’s one to try:

SELECT * WHERE {
?a ?b <http://data.lib.sussex.ac.uk/archive/id/concept/moa/religion>
}
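
The query above can also be sent from a script. The standard SPARQL Protocol passes the query text in a “query” URL parameter, so a minimal Python sketch only needs to build that URL. The function name is my own, and the sketch deliberately stops short of making the request, so it makes no assumption about whether the endpoint is reachable or what format it returns.

```python
from urllib.parse import urlencode

ENDPOINT = "http://api.talis.com/stores/massobservation/services/sparql"

QUERY = """SELECT * WHERE {
?a ?b <http://data.lib.sussex.ac.uk/archive/id/concept/moa/religion>
}"""


def sparql_get_url(endpoint, query):
    """Return the GET URL for a query, using the SPARQL Protocol's 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, QUERY)
```

The resulting URL can be pasted into a browser or fetched with any HTTP client.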

Summary

To take our EAD from Calm and turn it in to Linked Data we used an XSLT script written by Pete Johnston, with Saxon running the script to transform the EAD/XML in to RDF/XML. Then we converted the RDF/XML to RDF/N-Triples using Raptor. And finally we used Pynappl to upload this to the Talis Platform.

The XSLT scripts mentioned here can be found at:

http://data.lib.sussex.ac.uk/files/massobservation/xslt/

The RDF Linked Data is available for download, in addition to the SPARQL interface above:

http://data.lib.sussex.ac.uk/files/massobservation/rdf/

My Thanks to Pete Johnston of Eduserv for providing the process (with documentation) described above.


Cost/benefits of the open data approach

July 18th, 2011

We have been asked to assess how much it has cost us in terms of time and resources to make our data openly available, so here goes.

Our approach to the project was to have a dedicated project manager (me) working 0.5 FTE, using the skills of Pete Johnston for the transformation to Linked Data and the skills of Chris Keene (Technical development Manager for the Library) when required. This meant we were all dedicated to our tasks and  that someone was on top of the administration part of the project, as well as researching the licence and talking/presenting to groups and stakeholders whilst the technical transformation was taking place. This was a good use of time and resources and provided a bridge between the two sides.

We made a decision early on that we did not have time within the project allocation to re-structure the MOA data prior to transformation as we would have liked, but we did work through 75% of it expanding name and organisation abbreviations to allow ways into the data. If we had re-structured the data within the CALM database, putting dates in the date field and separating out title and description, this would have added at least another month to the project. It perhaps would have meant less tweaking of the stylesheet that Pete made for the Locah project, but all worked out in the end as we approached it from a different angle, using lookup lists of keywords and people (see earlier blog posts here and here).

Benefits

The benefits of open data are harder to quantify. We are excited by the potential uses of our data outside of the archive searchroom, and one of the reasons we have used the ODC-PDDL is so that we can be as open as possible and see what happens. The success of this project also means that open data is on the agenda in the Library (see Chris’s blog post).

Benefits for the Keep : cataloguing guidelines

I have reported back to stakeholders from the Keep as we need to look into how we can share our data and provide resource discovery of all our collections for visitors to the Keep. Having had a close look at our catalogue data for the project we are able to provide recommendations that will hopefully make it easier to export, share and transfer our data to existing or new systems. We have created some in-house cataloguing guidelines and the following guides were produced by myself and a colleague Adam Harwood who is currently cataloguing the University of Sussex Collection.

  • CALM_ISADG_Collection level This document maps the required ISAD G fields to the CALM fields with guidelines on how to populate the fields. We have also included the fields required for export to EAD using the Archive Hub report on CALM.

Our priority in this area is to concentrate on our existing collection level descriptions and any new catalogue component records that we create. We will share these guidelines with colleagues from the Keep in the next few months.

Setting up our URIs and the Talis Platform

July 13th, 2011

Time to set up our URIs and upload our data to one of our Talis Platform stores.

In a previous post we discussed which URIs to use. We settled on http://data.lib.sussex.ac.uk/archive/ – we felt this should be stable, and allow for integration with other Special Collection records in the future (while not conflicting with other Library data).

We now needed those URIs to do something; at the moment they all just returned a 404 message (albeit a 404 message with a Rickroll link).

As was so often the case in this project, this is where Pete Johnston came in. He had already set up the required code on his test server, and similar things had been put in place for the LOCAH project.

In total, all that is required is a few php/html files and a .htaccess file to handle rewrites (i.e. taking a URI and calling the script in question with the righthand bit of the URI as a parameter). The main script is an index.php file which on our server lives at www/data/archive/doc (which corresponds to http://data.lib.sussex.ac.uk/archive/doc/).

Along with these files were a few dependencies, PHP libraries: paget, moriarty and ARC.

However this code needs to access data from somewhere, and to do this we need to put our data in to our new shiny Talis Platform store…

Talis Platform

The second part of this work was to upload our data to the Talis Platform. Talis had kindly created two stores for us, massobservation and massobservation-dev1, as part of their Connect Commons scheme.

Pete ran a set of scripts he had developed to upload our data to the dev1 store. We’re currently installing these on our own server so we can do this ourselves, and we’ll report more on them soon.

So that was that, without much fuss, we now had our data in our publicly available, Sparql query-able, RDF store. There probably should have been champagne.

Back to our server

So with our data now in an RDF store, libraries installed on the server, files copied and in place, and config edited to point to our store, it was time to point a browser at one of our URIs and start debugging the first error message (which once resolved will lead to the next error message, and so forth). But… for the first time in my life, it just worked. This never happens. It left me confused: I had set aside hours of my diary for endless frustrations and here it was working. I felt cheated. But, once over the shock, I (and you too) could visit examples such as this: http://data.lib.sussex.ac.uk/archive/doc/archivalresource/gb181SxMOA1 (RDF/XML, JSON, Turtle)

Look! It’s our data… as Linked Data… live on the internet!

data.lib.sussex.ac.uk

I would like to see data.lib.sussex.ac.uk become more than just the Mass Observation Archive, and with that in mind I created a front end for the top level URL: http://data.lib.sussex.ac.uk/

This uses WordPress as the CMS (life’s too short to code the html/css files by hand).

For those interested, the .htaccess mod_rewrite looks like this:

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteRule ^archive/id/(.*)$ /archive/doc/$1 [R=303,L]
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>

The rule for the URIs is at the top (simply redirecting archive/id/* to archive/doc/*). If this rule is ‘matched’ then processing ends and the rest of the rules are ignored ([L]); otherwise the standard WordPress rules are processed.
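
To make the behaviour of that first rule concrete, here is the same mapping written as a small Python function. It is an illustration of what the rule does, not code that runs on our server; the function name is made up. In a .htaccess file the path is matched without its leading slash, which is why the pattern starts at “archive/”.

```python
import re

# Same pattern as the first RewriteRule above.
ID_PATTERN = re.compile(r"^archive/id/(.*)$")


def redirect_for(path):
    """Return (status, location) for an identifier path, or None if the rule does not match."""
    match = ID_PATTERN.match(path)
    if match is None:
        return None  # the request falls through to the WordPress rules
    return 303, "/archive/doc/" + match.group(1)


result = redirect_for("archive/id/concept/moa/religion")
```

The 303 (“See Other”) status tells the client that the identifier URI names a thing, and that a document about that thing can be found at the archive/doc/ address.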

Next steps… (for this strand of work)

Install scripts on our server so that we can:

  • take a file of EAD data from Calm and transform it in to a file of RDF/XML
  • convert this to a set of N-Triples files, which are easier to upload to the Platform store because each statement/triple (or if you prefer, fieldname and field value from a record) is complete and able to stand alone, so the data can be uploaded in stages without complications
  • upload the files to the store

Following in our footsteps

July 6th, 2011

Question: If others wanted to take a similar approach to your project, what advice would you give them?

Our advice at the start would be:

1. Get your data ready. We are working on our catalogue data to make it more structured so that we can be ready to export to other formats and make it more portable. Regardless of whether it becomes Linked Data in the future, we are getting ourselves ready. This is also probably the most time-consuming aspect. From personal experience, once you start looking at your catalogue data, you’ll find lots of things that you want to change or are missing or don’t make sense, so the work starts to grow…

2. Are you in a position to licence your data? We chose the catalogue data of the Mass Observation Archive as we were confident of its provenance so we could make it fully open and available under ODC-PDDL. This hopefully will allow the greatest flexibility for people wanting to use the data and fits with the ethos of the project and the JISC Discovery strand

3. Find out about other similar projects! We at SALDA realise the value of these blog posts to anyone wanting to do a similar project to SALDA. We followed in the footsteps of the LOCAH project and were able to use their stylesheet and experience in transforming archival data into Linked Data. We are working with Pete Johnston from Eduserv, whose knowledge and experience is invaluable. You can see his contribution to the blog here

4. Find examples of Linked Data in use, in human readable format so that you can show stakeholders, colleagues, friends what it is that you are on about. I use the BBC wildlife pages and how they link to Animal Diversity Web

Report from stakeholder meeting

June 15th, 2011

On 31st May, we held a meeting for stakeholders and other interested parties, to talk about the SALDA project and its impact on future developments. Attending the meeting were:

Karen Watson SALDA Project Manager, University of Sussex
Fiona Courage Special Collections Manager, University of Sussex
Jane Harvell Head of Academic Services, University of Sussex
Chris Keene Technical Development Manager, University of Sussex
Richard Fisher Business Analyst – ICT Services, East Sussex County Council
Elizabeth Hughes County Archivist, East Sussex Record Office
Jenny Hand Knowledge and information Manager, Royal Pavilion and Museums

The Keep is only two years away so naturally the focus is on finding systems that support all the partners and enable researchers to search and use the collections. Transforming all our data into Linked Data would enable cross searching and enhanced resource discovery, but there is an issue of time.  However, there was a general feeling in the room that the data preparation that we have identified that will allow mapping /easier transference  will make us ready for other solutions or to embark on a linked data exercise in the future.

Data potential

It was good to discuss the different ways that our collective data could be used outside of the reading room. There is potential for mobile apps, hack days and contributions to other projects – Brighton and Hove Museums already contribute to Culture Grid. SALDA has enabled us to see this potential with the added benefit that any enhancements and changes that we make to our catalogue data will improve resource discovery for researchers within our existing systems.

I’ve said before that a lot of what we are talking about in terms of changes to our data is quite basic stuff (date in the date field etc) and a colleague pointed out that libraries had these conversations years ago. It is the diverse and complicated nature of archives that means that a one-size-fits-all approach is difficult to achieve, so we are looking at the minimum we need to do, with our strategic goal being that we want to be ready to export / map / transfer our data and make it as portable as possible.

Local data for local people

June 6th, 2011

This  is to add to Pete’s  post on the transformation of our data. The SALDA project is really searching for a framework or a set of tools to enable us to transform our other archive collections into Linked Data. What we have discovered so far is that there is a model that we can apply to the data, based on the LOCAH model,  but there is some local tweaking that needs to be done due to the structure of our data.

Prior to 2009, our catalogue data was in HTML lists on our website or printed lists in our reading room. We imported the catalogues into our CALM database in summer 2009 and most information went into the title field. This meant that when we then exported the data to EAD there were no separate fields for date or description. I then revealed to Pete, with my head in my hands, that we don’t use access points, and this is what the LOCAH process was based around. Pete was optimistic in his outlook, saying that there were good points about our data to focus on.

  • it was consistent in that it was all from one data provider
  • it was consistent in the format of the date and where it appeared in the data (albeit not in the date field)

We decided then to think about other ways into the data. I provided Pete with 28 names out of the data in authorised form using National Register of Archives rules. I was able to confirm that these were definitely those people, so when it says “Churchill” in the data, it is:

Churchill, Sir Winston Leonard Spencer (1874-1965)

Knight, prime minister and historian

not Churchill Insurance, Churchill College etc.

I also provided 100 or so keywords that appeared in the data and covered subjects from air raids to sex, including places and organisations (Labour Party, Communist Party), events (the Coronation in 1953) and wider concepts like class, family, education and death.

Future proofing our data

Realising the limitations of our data as it stands in our archival management system has made the team at Sussex really look at how we catalogue things. We need to future proof our data so that we can export or transform our data or map across to other systems more easily. We are compiling cataloguing guidelines to ensure that all our collection level records are ISAD (G) compatible and that certain fields are always populated in our component records. This is not a small change and it will take a long time to modify 67,000 existing records. This has been an unexpected by-product of the SALDA project and one that we can’t ignore.

The data transformation

May 16th, 2011

by Pete Johnston

I’ve been working on a first attempt at processing the Encoded Archival Description (EAD) XML output provided by Karen from their CALM database in order to generate RDF data for the Mass Observation Archive. My starting point has been the work done within the LOCAH project, to which I’ve also been contributing, and which is also transforming EAD data into linked data.

I’m making use of the same general approach as that we’ve used within the LOCAH project, so as background to this post, it’s probably worth having a look at some of the relevant posts on the LOCAH blog and/or at the initial dataset they have just released.

The “workflow” for the SALDA/MOA case is similar to that described in the first part of this post, with an additional preliminary step of exporting data from the CALM database into the EAD XML format. And as I’ll explain further below, for the SALDA case, the “transform” step will also include a small element of what I was calling “enhancement” – the augmentation of the EAD content with some additional data.

We’re making use of (more or less – more on this also below) the same model of “things in the world” as that we’ve applied in the LOCAH project (see these three posts for details 1, 2, 3); the same patterns for URIs for identifying the individual “things” – within a University of Sussex URI-space, as Karen and Chris have discussed in recent posts here; and (more or less) the same RDF vocabularies for describing those “things”.

EAD and the LOCAH and SALDA EAD data

As I noted in the first of those posts over on the LOCAH blog the EAD format is, by design, a fairly “flexible” and “permissive” XML format. It was designed to accommodate the “encoding” of existing archival finding aids of various types and constructed by different cataloguing communities, some with practices and traditions which varied to a greater or lesser degree. EAD also allows for variation in the “level of detail” of markup that can be applied, from a focus on the identification of broad structural components to a more “fine-grained” identification of structures within the text of those components. As a result the structure of EAD XML documents can vary considerably from one instance to the next.

The LOCAH project is dealing with EAD data aggregated by the JISC Archives Hub service. This is data provided by multiple data providers, in some cases over an extended period of time, and sometimes using different data creation tools – and one of the challenges in LOCAH has been dealing with the variations across that body of data. SALDA, on the other hand, is dealing with data from a single data source, under the control of a single data provider – the MOA data is actually exported from the CALM database in the form of a single EAD document, albeit quite a large one!

So while the LOCAH input data includes EAD documents using slightly different structural and content conventions, for SALDA, that structure is regular and predictable, and furthermore some element of “normalisation” of content is implemented through the rules and checks performed by the CALM database application.

So far, so good, then, in terms of making the MOA EAD data relatively straightforward to process.

Index Terms

The data creation guidelines for contributors to the Archives Hub recommend the provision of “index terms” or “access points” using the EAD controlaccess element – names of topics, persons, families, organisations, places, genres or functions, whose association with the archival resource is potentially useful for people searching the finding aid. Those names are (in principle, at least!) provided in a “standardised” form (i.e. either drawn from a specified “authority file” of names or constructed using a specified set of rules) so that two documents using the same authority file or the same rules should provide the same name in the same form. In the process of transforming EAD into RDF within the LOCAH project, the controlaccess element is a significant source of information about “things” associated with the archival resource. Below is a version of the graphical representation of the LOCAH model, taken from this post. Data about the entities circled in the lower part of the diagram is all derived from the LOCAH EAD controlaccess data.

In the MOA data, however, no controlaccess terms are provided. Talking this over with Karen and Chris recently made it clear that there are nonetheless some associations implicit in the MOA data, and some “hooks” in the data which can provide the basis for generating explicit associations in the RDF data. This is probably best illustrated through some concrete examples.

“Topic Collections”

One section of the Mass Observation Archive takes the form of a sequence of “Topic Collections”, in which documents of various types are grouped together by theme or subject, the name of which forms part of the title of a “series” within the section, i.e. the series have titles like:

  • TC1 Housing 1938-48
  • TC6 Conscientious Objection & Pacifism 1939-44
  • TC7 Happiness 1938

Although the titles are encoded in the EAD documents as unstructured text (as the content of the EAD unittitle element), the text has a consistent/predictable form of: code number, name of topic, date(s) of period of creation.

We can take advantage of this consistency in the transformation process and, with some fairly simple parsing of the text of the title, generate a description of a concept with its own URI and name/label (e.g. “Housing”, “Conscientious Objection & Pacifism” or “Happiness”), and a link between the archival resource and the concept. (For this case, the dates are provided explicitly elsewhere in the EAD document and already handled by the transformation process.)
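As a rough illustration of the kind of title parsing involved, here is a minimal sketch in Python (not the XSLT used in the actual transform). The regular expression, the function names, the way the label is reduced to a URI segment and the example.org base URI are all assumptions made for the example, not details of the real process.

```python
import re

# Assumed title shape: "TC<number> <topic name> <year>[-<year>]",
# e.g. "TC6 Conscientious Objection & Pacifism 1939-44".
TITLE_PATTERN = re.compile(
    r"^TC(?P<code>\d+)\s+(?P<topic>.+?)\s+(?P<dates>\d{4}(?:[-–]\d{2,4})?)$"
)

# Hypothetical base URI for the topic concepts; the real URI pattern may differ.
CONCEPT_BASE = "http://example.org/id/concept/moatopics/"


def parse_topic_title(title):
    """Split a series title into code number, topic name and date text.

    Returns None when the title does not follow the assumed pattern.
    """
    match = TITLE_PATTERN.match(title.strip())
    if match is None:
        return None
    return {
        "code": int(match.group("code")),
        "topic": match.group("topic"),
        "dates": match.group("dates"),
    }


def concept_uri(topic):
    """Build a concept URI by lower-casing the label and dropping non-alphanumerics."""
    return CONCEPT_BASE + re.sub(r"[^a-z0-9]", "", topic.lower())


parsed = parse_topic_title("TC6 Conscientious Objection & Pacifism 1939-44")
print(parsed["topic"], concept_uri(parsed["topic"]))
```

Titles that do not match the pattern return None, so they can be left alone rather than producing a badly formed concept.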

Series by Place

Within one of the “Topic Collections” (on air raids), sets of reports are grouped by place, where the name of the place is used as the title of the “file”. So again, it is straightforward to generate a small chunk of data “about” the place with its own URI and name/label, and a link between the archival resource and the place.

In both this case and the “topic collections” case, we can also be quite specific about the nature of the relationship between the archival resource and the concept or place. In the LOCAH case, we’ve limited ourselves to making a very general “associated with” relationship between the archival resource and the controlaccess entity, on the grounds that the cataloguer may have made the association with the archival material based on many different “real world” relationships. For these cases in SALDA, we can be more specific, and say that the relationship is one of “aboutness”/has-as-topic, which can be expressed using the Dublin Core dcterms:subject property.
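To make the “aboutness” link concrete, the following Python sketch writes out the two statements involved as N-Triples: the dcterms:subject link and a label for the concept. The two property URIs are the standard Dublin Core and SKOS ones; the resource and concept URIs are made-up example.org placeholders, and using skos:prefLabel for the label is an assumption rather than something taken from the SALDA transform.

```python
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def subject_triples(resource_uri, concept_uri, label):
    """Return N-Triples lines linking a resource to its topic and labelling the topic."""
    return [
        f"<{resource_uri}> <{DCTERMS_SUBJECT}> <{concept_uri}> .",
        f'<{concept_uri}> <{SKOS_PREF_LABEL}> "{escape_literal(label)}" .',
    ]


# Placeholder URIs, for illustration only.
for line in subject_triples(
    "http://example.org/id/archivalresource/tc7",
    "http://example.org/id/concept/moatopics/happiness",
    "Happiness",
):
    print(line)
```

The same function would serve for the place-based files, with a place URI in the position of the concept URI.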

Directives by Date

Another section of the archive lists responses to “directives” (questionnaires) by date. In these cases the dates are not provided separately in the EAD data, but again the consistent form of the title makes it relatively straightforward to extract and present the dates explicitly in the RDF data.

Keywords

Each of the above examples exploits some implicit structure in text content within the EAD document. A second approach we’ve applied is to scan the content of some EAD elements for words or phrases that can be mapped to specific entities (concepts, persons, organisations, places). In making this mapping, we’re really taking advantage of the fact that for the SALDA case we have a fairly well-defined context or scope, defined by the scope of the archival collection itself. So within that context, we can be reasonably confident that an occurrence of the word “Churchill” is a reference to the war-time Prime Minister, rather than to another member of his family, or a Cambridge college, or an Oxfordshire town.

Because this process involves matching to a set of known concepts/places/persons/organisations, and because it’s a relatively short list, I’ve taken advantage of this to extend the “lookup table” to include some URIs from DBpedia, Geonames and the Library of Congress LCSH dataset, which I use to construct owl:sameAs or skos:closeMatch/skos:exactMatch links to external resources as part of the transformation process.
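A lookup table of this kind might be sketched as follows, again in Python rather than XSLT. The table contents, the local example.org URI and the function names are hypothetical; only the owl:sameAs property URI and the DBpedia resource URI for Winston Churchill are real identifiers.

```python
import re

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical lookup table: keyword -> local URI plus links to external datasets.
KEYWORDS = {
    "Churchill": {
        "uri": "http://example.org/id/person/churchill",
        "same_as": ["http://dbpedia.org/resource/Winston_Churchill"],
    },
}


def find_entities(text):
    """Return the table entries whose keyword occurs in the text as a whole word."""
    found = []
    for keyword, entry in KEYWORDS.items():
        if re.search(r"\b" + re.escape(keyword) + r"\b", text):
            found.append(entry)
    return found


def same_as_triples(entry):
    """Return one owl:sameAs N-Triples line per external URI in the entry."""
    return [
        f"<{entry['uri']}> <{OWL_SAME_AS}> <{target}> ."
        for target in entry["same_as"]
    ]


for entry in find_entities("Reactions to a speech by Churchill"):
    for line in same_as_triples(entry):
        print(line)
```

Keeping the external URIs in the table means the links are fixed once by hand, which is practical only because the list of keywords is short.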

“Multi-level description” and “Inheritance”

One of the general issues these approaches bring me back to is the question of “multi-level description” in archival description, which I discussed briefly in a post on the LOCAH blog. Traditionally archival description advocates a “hierarchical” approach to resource description: a conceptualisation of an archival collection as having a single “tree” structure, with a finding aid document providing information about an aggregation of records, then about component subsets of records within that aggregation, and so on, sometimes down to the level of individual records but often stopping at the level of some component aggregation.

This “document-centric” approach carries with it an expectation that the description of some “lower level” unit of archival material is presented and interpreted “in the context of” those other “higher level” descriptions of other material. And this is reflected in a principle of “non-repetition” in archival cataloguing:

At the highest appropriate level, give information that is common to the component parts. Do not repeat information at a lower level of description that has already been given at a higher level.

There is some suggestion here of lower-level resources implicitly “inheriting” “common” characteristics from their “parent” resources – unless they are “overridden” in the description of the “lower-level” resource.

In practice, however, this “inheritance” is more applicable to some attributes than others: it may work for, say, the name of the holding repository, but it is less clear that it applies to cases such as the controlaccess “index terms”: it may be appropriate/useful to associate the name of a person with a collection as a whole, but it doesn’t necessarily follow that the person has an association with every single item within that collection.

The “linked data” approach is predicated on delivering information in the form of “bounded descriptions” made up of assertions “about” individual subject resources. So in transforming EAD data into an RDF dataset to support this, we’re faced with the question of how to deal with this “implicitly inherited” information: whether to construct assertions of relationships only for the resource for which they are explicitly present in the EAD document, or whether also to construct additional assertions for other “descendent” resources too, on the basis that this is making explicit information that is implicit in the EAD document.

In the LOCAH work, we’ve tended to take a fairly “conservative” approach to the “inheritance” question and worked on the basis that, in the RDF data, the concept, person, place, etc. named by a controlaccess term is associated only with the archival resource with which the term is associated in the EAD document.

For the SALDA/MOA data, I think an argument can be made – at least for some of the cases discussed above – for making such links for the “descendent” component resources too. For the “topic collections”, for example, it is a defining characteristic of the collection that each of the member resources has the named concept as topic. And a similar case might be made for the “place-based” series.

For the keyword-matching cases, an assumption that the association can be generalised to all the “descendent” resources would, I think, be more problematic.
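For the cases where propagation does seem justified (the topic collections and, possibly, the place-based series), making the inherited links explicit amounts to a walk down the component tree. The Python sketch below shows the idea; the Component class, the function name and the URIs are invented for the illustration and do not reflect how the XSLT is structured.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A unit of archival description with its child components."""
    uri: str
    children: list = field(default_factory=list)


def inherited_subject_links(component, concept_uri):
    """Yield (resource, concept) pairs for a component and all its descendants."""
    yield (component.uri, concept_uri)
    for child in component.children:
        yield from inherited_subject_links(child, concept_uri)


# Placeholder tree: a topic collection series containing one file with one item.
series = Component(
    "http://example.org/id/archivalresource/tc1",
    [Component(
        "http://example.org/id/archivalresource/tc1-1",
        [Component("http://example.org/id/archivalresource/tc1-1-a")],
    )],
)

for resource, concept in inherited_subject_links(
    series, "http://example.org/id/concept/moatopics/housing"
):
    print(resource, "dcterms:subject", concept)
```

For keyword matches the walk would simply not be applied, so the link stays on the resource whose description contained the keyword.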

The “foaf:focus question”

In the Archives Hub data that LOCAH is using, the controlaccess terms are (mostly at least) drawn from “authority files”. This is reflected in the LOCAH data model in a distinction between the “conceptualisation” of a person, organisation or place that is captured in a thesaurus entry or authority file record, as separate from the actual physical entity. So for the person/organisation/family/place cases, in the LOCAH transformation process, the presence of an EAD controlaccess term results in the generation of two URIs and two triples, the first expressing a relationship (locah:associatedWith) between archival resource and concept, and the second between concept and entity (person, organisation, place). This second relationship is expressed using a (recently introduced) property from the Friend of a Friend (FOAF) RDF vocabulary, foaf:focus.

For a concrete example from the LOCAH dataset, consider the case of the Sir Joseph Dalton Hooker collection, which is identified by the URI http://data.archiveshub.ac.uk/id/archivalresource/gb15sirjosephdaltonhooker. The description of that “Archival Resource” shows that the collection is “associated with” four other resources, identified by the following URIs:

http://data.archiveshub.ac.uk/id/concept/lcsh/antarcticadiscoveryandexploration

http://data.archiveshub.ac.uk/id/concept/person/nra/hookerjosephdalton1817-1911sirknightbotanist

http://data.archiveshub.ac.uk/id/concept/organisation/ncarules/britishnavalexpeditionantarcticregions1839-1843

http://data.archiveshub.ac.uk/id/concept/unesco/botany

If we look in turn at the descriptions of those resources, we see that they are all concepts (i.e. instances of the class skos:Concept) – even the second and third cases. And in those two cases the concept is the subject of a “foaf:focus” relationship with a further resource, of type Person and Organisation, respectively:

http://data.archiveshub.ac.uk/id/person/nra/hookerjosephdalton1817-1911sirknightbotanist

http://data.archiveshub.ac.uk/id/organisation/ncarules/britishnavalexpeditionantarcticregions1839-1843
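The two-triple pattern for this example can be sketched in Python as below. Deriving the entity URI by dropping the “concept/” path segment is an assumption based only on the two pairs of URIs listed here, not a documented rule of the LOCAH transform, and the property names are written as prefixed names rather than full URIs.

```python
def focus_uri(concept_uri):
    """Derive the entity URI by removing the first '/concept/' path segment.

    This mirrors the two example URI pairs only; it is an assumed pattern.
    """
    return concept_uri.replace("/concept/", "/", 1)


def controlaccess_triples(resource_uri, concept_uri):
    """Return the two triples generated for a person/organisation/place term."""
    return [
        (resource_uri, "locah:associatedWith", concept_uri),
        (concept_uri, "foaf:focus", focus_uri(concept_uri)),
    ]


collection = "http://data.archiveshub.ac.uk/id/archivalresource/gb15sirjosephdaltonhooker"
concept = (
    "http://data.archiveshub.ac.uk/id/concept/person/nra/"
    "hookerjosephdalton1817-1911sirknightbotanist"
)

for triple in controlaccess_triples(collection, concept):
    print(triple)
```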

I’ve tried to depict this in the graph below. I’ve omitted the rdf:type arcs for conciseness, and relied on colour to indicate resource type (blue = Archival resource; white = Concept; green = Agent (Person or Organisation)).

So, the question is how/whether this applies for the SALDA/MOA cases I describe above.

For the “topic collections” case, the link is simply to a concept (a member of a “MOA Topics” “Concept Scheme”), and there isn’t a separate physical entity involved.

For the “place series” case, in theory we could introduce a set of concepts but I’m not sure there is any value in doing so – there is no external thesaurus/authority file involved, and I think it’s reasonable to simply make the direct link between archival resource and place.

The keyword matching case actually covers various sub-cases, and I need to think harder about them, but broadly I think we should try to avoid the complexity of the “intermediate” concept where it isn’t really necessary.

Summary

In short, while I need to do some more work on it, it’s been relatively straightforward to apply the model and the transformation processes developed in LOCAH to the MOA data.

What is perhaps more interesting is how we’ve “specialised” the fairly “general” LOCAH approach, based on Karen’s “local knowledge” of specific characteristics of the MOA data.

While it’s perhaps premature to draw general conclusions from this single case, I do wonder whether the nature of the EAD format and the ways it is used may mean that this combination of the general and the local/specific turns out to be a common pattern, e.g. for a different dataset, a different set of “local”/specific characteristics might be identified and exploited in a similar fashion. Amongst other things, I should probably think about how this is reflected in the transformation process, e.g. whether it is possible to “modularise” the XSLT transform in such a way that the “general” parts are separated from the “specific” ones, and it is easier to “plug in” versions of the latter as required.