A controversial title, but it neatly summarizes the feeling in the SALDA Project camp at the moment. Fiona Courage (Special Collections Manager and Curator of the Mass Observation Archive) and I come from an archive / library background and are very focused on the end result of the project, which should be a set of Linked Open Data comprising our Mass Observation Archive catalogue records. But we’re not really sure what the benefit of this is, so I wanted to ask the RDTF community and beyond for help. We know it is early days for the project and we’re likely to look back at this blog post in a few months’ time and answer our own question, but right now we seem to have hit a bit of a brick wall.
What we are realising is that, unlike a lot of projects with archives that I have done in the past, it is perhaps not the end result that is most important but the journey to get there. What we hope to achieve with SALDA is the skills and knowledge to make our catalogues Linked Data, and to use those skills and that knowledge to inform decisions about whether it would be beneficial to make all our data Linked Data. Our journey so far has taught us that we need to refine and review our catalogue data within our archival management system anyway, and the changes we are making now in preparation are already opening up the data to enable better search results. When the Mass Observation Archive was catalogued in the 1970s and 1980s, abbreviations seem to have been all the rage, so Communist Party became CP and Conscientious Objectors became CO, to give a couple of examples. This was fine in a printed finding aid under the heading of Communist Party, and also fine on an HTML page that users scrolled through. It is not fine for keyword searching. This may seem very basic stuff, but I’m sure there are lots of archives out there that have records that make sense to the archivist and in the printed list, but will not be retrieved via a search engine. If resource discovery is our aim, then making our information clear and accessible is key. Following on from this basic idea of “finding what you search for”, is Linked Data a step on from this? Finding what you search for and a bit more?
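To make the abbreviation point concrete, here is a minimal sketch of the kind of clean-up involved. It is an illustration only, not the process we actually use: the lookup table and the function name expand_abbreviations are hypothetical, and only the CP and CO entries come from the examples above.

```python
import re

# Hypothetical lookup table. Only these two entries come from the post;
# a real table would be built up by the archivists.
ABBREVIATIONS = {
    "CP": "Communist Party",
    "CO": "Conscientious Objectors",
}

# Match an abbreviation only as a whole, upper-case word, so that
# "COPY" or "co-operative" are left untouched.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(abbr) for abbr in ABBREVIATIONS) + r")\b"
)


def expand_abbreviations(text):
    """Replace each known abbreviation with its full, searchable form."""
    return _PATTERN.sub(lambda match: ABBREVIATIONS[match.group(1)], text)


print(expand_abbreviations("CP meetings and CO tribunals"))
```

A keyword search for “Communist Party” can then match the expanded text, which it could never do against the bare “CP”.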
I think there are several reasons ‘why linked data’, and I suspect that if you asked someone else then you might get a different set of answers – although we might expect some overlaps.
For me the absolute fundamental is the use of http URIs to identify things – which means you are able to contribute identifiers to (and use identifiers from) a global pool. This means the ability to link to and from your data to other people’s.
To take a simple example I guess that most items in the Mass Observation archive have some geographical information attached (e.g. where the person who made the observation lived?) (apologies if my ignorance of the archive means I’m making a false assumption here)
I don’t know how places are identified in the metadata, but it seems a fair guess that ultimately it is a textual description such as ‘Brighton’. Within the context of your data this makes sense and is likely to be unique – but what if I want to combine your data with that of another archive and find things associated with Brighton?
I may have to rely on using the text string ‘Brighton’ – and I may find that there is more than one place called Brighton (e.g. Brighton, Vermont), and of course there may be lots of variations in textual strings (“Brighton, UK”; “Brighton, England”), and if the same place is known by different names (Cologne, Köln) it becomes increasingly problematic. These issues mean that it may be difficult to be sure that two completely separate data sets are talking about the same place.
If instead all the relevant data sets chose to use an identifier for the place instead of just a textual description (although of course you’d have a textual description attached to the identifier), then you could ensure you were referring to the same place. If you chose an identifier that was going to be unique in a global pool of identifiers (such as an http URI), then you would know that it would always mean the same place, no matter what the context. Finally, if you used the same http URI as other people then merging data across disparate data sets becomes easy.
With places, identifiers already exist in the form of ‘Geonames’ URIs. So we can use http://www.geonames.org/2654710 for Brighton and be sure we are talking about the town in Sussex, England without having to check further. Not only that, but the Geonames data gives us extra information – like longitude and latitude – so we could plot contributions to the Mass Observation archive on a map. Then, because another linked data resource, ‘DBpedia’, also uses the Geonames URI, we can extract even more information from there (http://dbpedia.org/page/Brighton).
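As a rough sketch of the difference this makes when combining two archives, the Python below groups some made-up records first by their text label and then by a place URI. The records, field names and the Vermont URI are all invented for illustration; only the Geonames URI for Brighton is the one quoted above.

```python
from collections import defaultdict

BRIGHTON_UK = "http://www.geonames.org/2654710"  # Geonames URI quoted above
# Hypothetical placeholder, NOT a real identifier for Brighton, Vermont.
BRIGHTON_VT = "http://example.org/place/brighton-vermont"

# Two invented data sets that describe places with free text plus a URI.
archive_a = [
    {"title": "Record A1", "place_label": "Brighton", "place_uri": BRIGHTON_UK},
    {"title": "Record A2", "place_label": "Brighton, UK", "place_uri": BRIGHTON_UK},
]
archive_b = [
    {"title": "Record B1", "place_label": "Brighton, England", "place_uri": BRIGHTON_UK},
    {"title": "Record B2", "place_label": "Brighton", "place_uri": BRIGHTON_VT},
]


def group_by(records, key):
    """Collect record titles under each distinct value of the given field."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["title"])
    return dict(groups)


merged = archive_a + archive_b
by_label = group_by(merged, "place_label")
by_uri = group_by(merged, "place_uri")

# Matching on text mixes the two Brightons and misses the label variants.
print(by_label["Brighton"])
# Matching on the URI gathers every record about the Sussex town.
print(by_uri[BRIGHTON_UK])
```

Grouping on the label puts a Sussex record and a Vermont record together while splitting the Sussex records across three spellings; grouping on the URI avoids both problems.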
To see a very practical example of how this can really enhance the resources we use have a look at http://lucero-project.info/lb/2011/03/connecting-the-reading-experience-database-to-the-web-of-data/
You may also be interested in a couple of blog posts I wrote, and especially some of the comments:
http://www.meanboyfriend.com/overdue_ideas/2010/03/linked-data/
http://www.meanboyfriend.com/overdue_ideas/2010/04/whats-so-hard-about-linked-data/
There are many other reasons to use the Linked Data approach – it’s just that for me the use of these shared identifiers is absolutely fundamental and the most powerful aspect of the approach.
Following up on my comment, this blog post from Leigh Dodds this morning captures what I mean better than I can say it!
http://blog.kasabi.com/2011/03/23/context-remains-king-why-linking-is-the-next-big-thing/
I especially like the part where Leigh says:
“As a data provider, no matter how much energy you put into curation to make your data more comprehensive there will always be some additional external data, some additional context, that can add value. That value may be incremental to the majority of users, but it will be important to someone.”
Many thanks for this, Owen. Lots of good information and examples, which help us get it clear in our heads. I also like the comment by Leigh Dodds.
Hi Karen,
The record that you exemplify in the next post http://blogs.sussex.ac.uk/salda/2011/03/28/the-mass-observation-archive/ suggests that although there is location information within the records, it may not be in a particularly structured format (i.e. it looks like location is only included in textual descriptions, not as a separate metadata field).
Because of this although the example I give above still applies in many ways, it doesn’t feel like an ‘easy win’ as you don’t have the data easily to hand – you’d have to find some way of extracting the location data from the records first. This would be very valuable I think (just thinking of applications for researchers in social history etc.) but may not be the easiest thing to do.
What does jump out at me is the date information – browsing some records, it looks like sometimes this is in its own metadata field, and even when it isn’t, it appears in a standard place within things like the textual titles (at the end – e.g. ‘Teenage clothes 1949’).
Expressing these dates in a linked data context would be very interesting – and there is some groundwork already laid by data.gov.uk, as outlined in this post from Jeni Tennison http://www.jenitennison.com/blog/node/136 (see the section on Dates, Times and Periods). This opens the possibility of linking both DBpedia events and Government data (although I’m not sure what is available from the MOA period) via years – it feels like a great opportunity to draw in contextual data to the material in the MOA.
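For what it’s worth, pulling a year off the end of a title could be sketched in Python along these lines. This is only a guess at one way of doing it, assuming the year is always the last thing in the title; the function name split_title_and_year is my own invention and not anything the project uses.

```python
import re

# A four-digit year (1800-2099) at the very end of the title, preceded by
# whitespace. The range is an assumption chosen for illustration.
_TRAILING_YEAR = re.compile(
    r"^(?P<topic>.*?)\s+(?P<year>(?:18|19|20)\d{2})\s*$"
)


def split_title_and_year(title):
    """Return (topic, year) if the title ends in a year, else (title, None)."""
    match = _TRAILING_YEAR.match(title)
    if match is None:
        return title.strip(), None
    return match.group("topic").strip(), int(match.group("year"))


print(split_title_and_year("Teenage clothes 1949"))
print(split_title_and_year("Teenage clothes"))
```

Once the year is a value in its own right rather than part of a string, it can be mapped to whatever identifier for that year the linked data uses.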
Hi Owen,
Re: the last comment on things like dates, place names, “topic” names appearing in predictable forms within text fields, yes, that’s exactly the sort of “cue” I’m using in transforming the data.
Karen has given me various pointers for cases where there is (what I think of as) some “implicit structure” in the text fields, and the transform process will use that information to make those distinctions explicit in the RDF data generated, and to represent those “things” as resources in their own right.