Posts Tagged ‘Linked Data’

Linking to LOCAH

Wednesday, October 12th, 2011

As readers of this blog will know, we followed closely in the footsteps of the LOCAH project and we are now linked to the Archives Hub dataset. Which is nice.

See http://data.lib.sussex.ac.uk/archive/doc/concept/moa/advertising as an example

Our other external links are:

– DBpedia (for some places, people & organisations) e.g. http://data.lib.sussex.ac.uk/archive/id/organization/moa/communistpartyofgreatbritain

– Geonames (for some places) e.g. http://data.lib.sussex.ac.uk/archive/id/place/moa/blackpool

– LCSH (for some concepts) e.g. http://data.lib.sussex.ac.uk/archive/id/concept/moa/conscientiousobjectors

– VIAF (for some people) e.g. http://data.lib.sussex.ac.uk/archive/id/person/nra/churchillsirwinstonleonardspencer1874-1965knightprimeministerandhistorian

Converting EAD data to RDF Linked Data

Monday, July 25th, 2011

In my last blog post I discussed how to set up our server to handle the URIs being created within our Linked Data, and said the next step was for us to turn our EAD/XML data from Calm into RDF/XML Linked Data.

This is a big step. Until now our process looked something like this: Export EAD data -> send it to someone else -> Magic -> Linked Data!

Pete Johnston provided us with details of the magic part. In essence much of the complexity is hidden in an XSLT script (XSLT is a language for transforming XML into different schemas, as here, or into HTML and other formats). He’s blogged about some of the decisions and concepts that have gone into it. However, here, we can treat it like a black box. It’s still magic, but we know how to use it.

Converting EAD to RDF using XSLT and Saxon

We use the Saxon HE XSLT (Java) version to do the transformation. It’s simple to download and set up. The core step is very simple: run Saxon, passing it the name of the EAD/XML file and the XSLT file. An example command line looks like this:

java -jar 'saxon9he.jar' -s:ead/ -xsl:xslt/ead2rdf.xsl -o:rdf/ root=http://data.lib.sussex.ac.uk/archive/

And there you have it, your EAD data is now RDF!

Before the data is loaded into the Talis Platform store, there are a couple more things we do.

Triples and Turtle

The first is the conversion of the RDF/XML into the alternative RDF format N-Triples (and also Turtle) using the Raptor RDF parser.

RDF can be written and presented in a number of ways. Probably the most common method is XML, partly because XML is so ubiquitous, however RDF/XML is very verbose and can be difficult for us humans to read.

Not only is N-Triples considered easier to read, but each line contains a complete and self-contained triple (a triple comprises a subject, predicate and object, mostly expressed as URIs). While it isn’t too much of an issue here, this allows us to split the data into smaller chunks/files which can be POSTed to the Talis Platform.
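To make the difference concrete, here’s a quick sketch of what a single N-Triples line looks like. This isn’t part of our actual workflow (Raptor does the real conversion), and the archival resource URI below is made up for illustration:

```python
# One N-Triples line is a complete subject / predicate / object statement,
# so a large file can be split on any line boundary. The archival resource
# URI here is hypothetical; the concept URI follows our published pattern.
def ntriple(subject: str, predicate: str, obj: str) -> str:
    """Format one triple as a single self-contained N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."

line = ntriple(
    "http://data.lib.sussex.ac.uk/archive/id/archivalresource/moa/example",  # hypothetical
    "http://purl.org/dc/terms/subject",
    "http://data.lib.sussex.ac.uk/archive/id/concept/moa/religion",
)
print(line)
```

Because every line stands alone, splitting a big file into chunks is just a matter of cutting between lines, which is exactly what makes the POSTing step below easy.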

Talis Platform

The Talis Platform is a well-established Triple Store (think of a SQL database, but with three-part triples rather than records and tables). While you can run your own Triple Store using software such as ARC2, the Talis Platform provides a stable, robust and quick solution.

You interact with the Platform using standard HTTP requests: GET, POST, DELETE etc. However, for simplicity, an interactive command-prompt front end called Pynappl has been developed in Python. This allows you to specify the store you wish to work with, authenticate, and then use commands such as ‘store filename.rdf’ to upload data.

A simple script can upload our data to the Platform, uploading each N-Triples file created above.
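For the curious, the kind of HTTP request Pynappl makes under the hood can be sketched with the standard library alone. The store credentials are placeholders and the exact media type expected by the Platform is an assumption here, so treat this as an illustration rather than a recipe:

```python
# Sketch of POSTing N-Triples to a Talis Platform store's /meta endpoint
# with HTTP basic authentication. Credentials are placeholders and the
# "text/plain" media type is an assumption about what the Platform expects.
import base64
import urllib.request

def build_upload_request(store: str, ntriples: bytes, user: str, password: str):
    """Build (but do not send) the HTTP POST that uploads one chunk of triples."""
    req = urllib.request.Request(
        f"http://api.talis.com/stores/{store}/meta",
        data=ntriples,
        method="POST",
    )
    req.add_header("Content-Type", "text/plain")  # assumed media type for N-Triples
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    return req

req = build_upload_request("massobservation", b"<s> <p> <o> .\n", "user", "secret")
# urllib.request.urlopen(req) would perform the actual upload.
```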

The final step is to try out the SPARQL interface at:

http://api.talis.com/stores/massobservation/services/sparql

Here’s one to try:

SELECT * WHERE {
?a ?b <http://data.lib.sussex.ac.uk/archive/id/concept/moa/religion>
}
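To illustrate what that query is asking, here is a toy, stdlib-only version of the same pattern match over an in-memory list of triples (the data is invented):

```python
# SELECT ?a ?b WHERE { ?a ?b <...religion> } asks for every
# (subject, predicate) pair whose object is the "religion" concept.
RELIGION = "http://data.lib.sussex.ac.uk/archive/id/concept/moa/religion"

triples = [
    ("http://example.org/resourceA", "http://purl.org/dc/terms/subject", RELIGION),
    ("http://example.org/resourceB", "http://purl.org/dc/terms/subject", "http://example.org/other"),
]

# The basic graph pattern amounts to a filter on the object position:
matches = [(s, p) for (s, p, o) in triples if o == RELIGION]
print(matches)  # resourceA and its predicate
```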

Summary

To take our EAD from Calm and turn it into Linked Data, we used an XSLT script written by Pete Johnston, and Saxon to transform the EAD/XML into RDF/XML with that script. Then we converted the RDF/XML to RDF/N-Triples using Raptor. And finally we used Pynappl to upload this to the Talis Platform.

The XSLT scripts mentioned here can be found at:

http://data.lib.sussex.ac.uk/files/massobservation/xslt/

The RDF Linked Data is available for download, in addition to the SPARQL interface above:

http://data.lib.sussex.ac.uk/files/massobservation/rdf/

My thanks to Pete Johnston of Eduserv for providing the process (with documentation) described above.

This page has been translated into Spanish by Maria Ramos from http://www.webhostinghub.com/support/edu

Following in our footsteps

Wednesday, July 6th, 2011

Question: If others wanted to take a similar approach to your project, what advice would you give them?

Our advice at the start would be:

1. Get your data ready. We are working on our catalogue data to make it more structured so that we can be ready to export to other formats and make it more portable. Regardless of whether it becomes Linked Data in the future, we are getting ourselves ready. This is also probably the most time-consuming aspect. From personal experience, once you start looking at your catalogue data, you’ll find lots of things that you want to change or are missing or don’t make sense, so the work starts to grow…

2. Are you in a position to licence your data? We chose the catalogue data of the Mass Observation Archive as we were confident of its provenance, so we could make it fully open and available under ODC-PDDL. This hopefully will allow the greatest flexibility for people wanting to use the data, and fits with the ethos of the project and the JISC Discovery strand.

3. Find out about other similar projects! We at SALDA realise the value of these blog posts to anyone wanting to do a similar project. We followed in the footsteps of the LOCAH project and were able to use their stylesheet and experience in transforming archival data into Linked Data. We are working with Pete Johnston from Eduserv, whose knowledge and experience is invaluable. You can see his contribution to the blog here.

4. Find examples of Linked Data in use, in human-readable format, so that you can show stakeholders, colleagues and friends what it is that you are on about. I use the BBC wildlife pages and how they link to Animal Diversity Web.

The data transformation

Monday, May 16th, 2011

by Pete Johnston

I’ve been working on a first attempt at processing the Encoded Archival Description (EAD) XML output provided by Karen from their CALM database in order to generate RDF data for the Mass Observation Archive. My starting point has been the work done within the LOCAH project, to which I’ve also been contributing, and which is also transforming EAD data into linked data.

I’m making use of the same general approach as that we’ve used within the LOCAH project, so as background to this post, it’s probably worth having a look at some of the relevant posts on the LOCAH blog and/or at the initial dataset they have just released.

The “workflow” for the SALDA/MOA case is similar to that described in the first part of this post, with an additional preliminary step of exporting data from the CALM database into the EAD XML format. And as I’ll explain further below, for the SALDA case, the “transform” step will also include a small element of what I was calling “enhancement” – the augmentation of the EAD content with some additional data.

We’re making use of (more or less – more on this also below) the same model of “things in the world” as that we’ve applied in the LOCAH project (see these three posts for details 1, 2, 3); the same patterns for URIs for identifying the individual “things” – within a University of Sussex URI-space, as Karen and Chris have discussed in recent posts here; and (more or less) the same RDF vocabularies for describing those “things”.

EAD and the LOCAH and SALDA EAD data

As I noted in the first of those posts over on the LOCAH blog the EAD format is, by design, a fairly “flexible” and “permissive” XML format. It was designed to accommodate the “encoding” of existing archival finding aids of various types and constructed by different cataloguing communities, some with practices and traditions which varied to a greater or lesser degree. EAD also allows for variation in the “level of detail” of markup that can be applied, from a focus on the identification of broad structural components to a more “fine-grained” identification of structures within the text of those components. As a result the structure of EAD XML documents can vary considerably from one instance to the next.

The LOCAH project is dealing with EAD data aggregated by the JISC Archives Hub service. This is data provided by multiple data providers, in some cases over an extended period of time, and sometimes using different data creation tools – and one of the challenges in LOCAH has been dealing with the variations across that body of data. SALDA, on the other hand, is dealing with data from a single source, under the control of a single data provider – the MOA data is actually exported from the CALM database in the form of a single EAD document, albeit quite a large one!

So while the LOCAH input data includes EAD documents using slightly different structural and content conventions, for SALDA, that structure is regular and predictable, and furthermore some element of “normalisation” of content is implemented through the rules and checks performed by the CALM database application.

So far, so good, then, in terms of making the MOA EAD data relatively straightforward to process.

Index Terms

The data creation guidelines for contributors to the Archives Hub recommend the provision of “index terms” or “access points” using the EAD controlaccess element – names of topics, persons, families, organisations, places, genres or functions, whose association with the archival resource is potentially useful for people searching the finding aid. Those names are (in principle, at least!) provided in a “standardised” form (i.e. either drawn from a specified “authority file” of names or constructed using a specified set of rules) so that two documents using the same authority file or the same rules should provide the same name in the same form. In the process of transforming EAD into RDF within the LOCAH project, the controlaccess element is a significant source of information about “things” associated with the archival resource. Below is a version of the graphical representation of the LOCAH model, taken from this post. Data about the entities circled in the lower part of the diagram is all derived from the LOCAH EAD controlaccess data.

In the MOA data, however, no controlaccess terms are provided. Talking this over with Karen and Chris recently, however, made it clear that there are some associations implicit in the MOA data, and there are some “hooks” in the data which can provide the basis for generating explicit associations in the RDF data. This is probably best illustrated through some concrete examples.

“Topic Collections”

One section of the Mass Observation Archive takes the form of a sequence of “Topic Collections”, in which documents of various types are grouped together by theme or subject, the name of which forms part of the title of a “series” within the section, i.e. the series have titles like:

  • TC1 Housing 1938-48
  • TC6 Conscientious Objection & Pacifism 1939-44
  • TC7 Happiness 1938

Although the titles are encoded in the EAD documents as unstructured text (as the content of the EAD unittitle element), the text has a consistent/predictable form of: code number, name of topic, date(s) of period of creation.

We can take advantage of this consistency in the transformation process and, with some fairly simple parsing of the text of the title, generate a description of a concept with its own URI and name/label (e.g. “Housing”, “Conscientious Objection & Pacifism” or “Happiness”), and a link between the archival resource and the concept. (For this case, the dates are provided explicitly elsewhere in the EAD document and already handled by the transformation process.)
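As an illustration of the sort of parsing involved (the real work happens inside the XSLT, so the regular expression below is just a sketch of the idea):

```python
# Sketch of parsing "Topic Collection" series titles of the form
# "TC<number> <topic name> <date(s)>" into their component parts.
import re

TITLE = re.compile(r"^TC(\d+)\s+(.+?)\s+(\d{4}(?:-\d{2,4})?)$")

def parse_topic_title(title: str):
    """Split e.g. 'TC6 Conscientious Objection & Pacifism 1939-44' into parts."""
    m = TITLE.match(title)
    if m is None:
        return None
    code, name, dates = m.groups()
    return {"code": int(code), "topic": name, "dates": dates}

print(parse_topic_title("TC7 Happiness 1938"))
# {'code': 7, 'topic': 'Happiness', 'dates': '1938'}
```

The topic name extracted this way ("Housing", "Happiness", etc.) is what gets its own concept URI and label in the RDF output.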

Series by Place

Within one of the “Topic Collections” (on air raids), sets of reports are grouped by place, where the name of the place is used as the title of the “file”. So again, it is straightforward to generate a small chunk of data “about” the place with its own URI and name/label, and a link between the archival resource and the place.

In both this case and the “topic collections” case, we can also be quite specific about the nature of the relationship between the archival resource and the concept or place. In the LOCAH case, we’ve limited ourselves to making a very general “associated with” relationship between the archival resource and the controlaccess entity, on the grounds that the cataloguer may have made the association with the archival material based on many different “real world” relationships. For these cases in SALDA, we can be more specific, and say that the relationship is one of “aboutness”/has-as-topic, which can be expressed using the Dublin Core dcterms:subject property.

Directives by Date

Another section of the archive lists responses to “directives” (questionnaires) by date. In these cases the dates are not provided separately in the EAD data, but again the consistent form of the title makes it relatively straightforward to extract and present the dates explicitly in the RDF data.

Keywords

Each of the above examples exploits some implicit structure in text content within the EAD document. A second approach we’ve applied is to scan the content of some EAD elements for words or phrases that can be mapped to specific entities (concepts, persons, organisations, places). In making this mapping, we’re really taking advantage of the fact that for the SALDA case we have a fairly well-defined context or scope, defined by the scope of the archival collection itself. So within that context, we can be reasonably confident that an occurrence of the word “Churchill” is a reference to the war-time Prime Minister, rather than to another member of his family, or a Cambridge college, or an Oxfordshire town.

Because this process involves matching to a set of known concepts/places/persons/organisations, and because it’s a relatively short list, I’ve taken advantage of this to extend the “lookup table” to include some URIs from DBpedia, Geonames and the Library of Congress LCSH dataset, which I use to construct owl:sameAs or skos:closeMatch/skos:exactMatch links to external resources as part of the transformation process.
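The “lookup table” idea can be sketched as a small dictionary mapping known names to pairs of local and external URIs, emitted as owl:sameAs (or skos:closeMatch/skos:exactMatch) triples. The table contents below, including the external identifiers, are illustrative rather than the project’s actual list:

```python
# Sketch of keyword matching against a small lookup table of known
# entities, producing owl:sameAs links to external datasets.
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

LOOKUP = {
    "Churchill": ("http://data.lib.sussex.ac.uk/archive/id/person/nra/"
                  "churchillsirwinstonleonardspencer1874-1965knightprimeministerandhistorian",
                  "http://viaf.org/viaf/49272447"),  # hypothetical VIAF id
    "Blackpool": ("http://data.lib.sussex.ac.uk/archive/id/place/moa/blackpool",
                  "http://sws.geonames.org/2655307/"),  # hypothetical Geonames id
}

def external_links(text: str):
    """Yield (local URI, owl:sameAs, external URI) for keywords found in text."""
    for keyword, (local, external) in LOOKUP.items():
        if keyword in text:
            yield (local, SAME_AS, external)

links = list(external_links("Report on morale, mentions Churchill"))
```

The narrow scope of the collection is what makes such a blunt match safe: within the MOA context, “Churchill” is unambiguous in a way it would not be in a general-purpose dataset.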

“Multi-level description” and “Inheritance”

One of the general issues these approaches bring me back to is the question of “multi-level description” in archival description, which I discussed briefly in a post on the LOCAH blog. Traditionally archival description advocates a “hierarchical” approach to resource description: a conceptualisation of an archival collection as having a single “tree” structure, with a finding aid document providing information about an aggregation of records, then about component subsets of records within that aggregation, and so on, sometimes down to the level of individual records but often stopping at the level of some component aggregation.

This “document-centric” approach carries with it an expectation that the description of some “lower level” unit of archival material is presented and interpreted “in the context of” those other “higher level” descriptions of other material. And this is reflected in a principle of “non-repetition” in archival cataloguing:

At the highest appropriate level, give information that is common to the component parts. Do not repeat information at a lower level of description that has already been given at a higher level.

There is some suggestion here of lower-level resources implicitly “inheriting” “common” characteristics from their “parent” resources – unless they are “overridden” in the description of the “lower-level” resource.

In practice, however, this “inheritance” is more applicable to some attributes than others: it may work for, say, the name of the holding repository, but it is less clear that it applies to cases such as the controlaccess “index terms”: it may be appropriate/useful to associate the name of a person with a collection as a whole, but it doesn’t necessarily follow that the person has an association with every single item within that collection.

The “linked data” approach is predicated on delivering information in the form of “bounded descriptions” made up of assertions “about” individual subject resources. So in transforming EAD data into an RDF dataset to support this, we’re faced with the question of how to deal with this “implicitly inherited” information: whether to construct assertions of relationships only for the resource for which they are explicitly present in the EAD document, or whether also to construct additional assertions for “descendant” resources too, on the basis that this makes explicit information that is implicit in the EAD document.

In the LOCAH work, we’ve tended to take a fairly “conservative” approach to the “inheritance” question and worked on the basis that, in the RDF data, the concept, person, place, etc. named by a controlaccess term is associated only with the archival resource with which the term is associated in the EAD document.

For the SALDA/MOA data, I think an argument can be made – at least for some of the cases discussed above – for making such links for the “descendant” component resources too. For the “topic collections”, for example, it is a defining characteristic of the collection that each of the member resources has the named concept as topic. And a similar case might be made for the “place-based” series.

For the keyword-matching cases, an assumption that the association can be generalised to all the “descendant” resources would, I think, be more problematic.
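The “inheritance” choice can be sketched as a small bit of Python: given a parent/child hierarchy, optionally copy a dcterms:subject link from a “topic collection” down to every descendant. The hierarchy and URIs below are invented for illustration:

```python
# Sketch of propagating a subject link from a collection to all of its
# descendant components. The hierarchy here is a made-up toy example.
SUBJECT = "http://purl.org/dc/terms/subject"

children = {
    "tc7": ["tc7/file1", "tc7/file2"],
    "tc7/file1": [],
    "tc7/file2": [],
}

def inherit_subject(root: str, concept: str):
    """Return a dcterms:subject triple for root and every descendant of root."""
    triples, todo = [], [root]
    while todo:
        node = todo.pop()
        triples.append((node, SUBJECT, concept))
        todo.extend(children.get(node, []))
    return triples

trips = inherit_subject("tc7", "http://data.lib.sussex.ac.uk/archive/id/concept/moa/happiness")
```

For the topic collections this propagation is arguably safe, since membership of the collection is defined by the topic; for keyword matches it is not, for the reasons given above.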

The “foaf:focus question”

In the Archives Hub data that LOCAH is using, the controlaccess terms are (mostly at least) drawn from “authority files”. This is reflected in the LOCAH data model in a distinction between the “conceptualisation” of a person, organisation or place that is captured in a thesaurus entry or authority file record, as separate from the actual physical entity. So for the person/organisation/family/place cases, in the LOCAH transformation process, the presence of an EAD controlaccess term results in the generation of two URIs and two triples, the first expressing a relationship (locah:associatedWith) between archival resource and concept, and the second between concept and entity (person, organisation, place). This second relationship is expressed using a (recently introduced) property from the Friend of a Friend (FOAF) RDF vocabulary, foaf:focus.

For a concrete example from the LOCAH dataset, consider the case of the Sir Joseph Dalton Hooker collection, which is identified by the URI http://data.archiveshub.ac.uk/id/archivalresource/gb15sirjosephdaltonhooker. The description of that “Archival Resource” shows that the collection is “associated with” four other resources, identified by the following URIs:

http://data.archiveshub.ac.uk/id/concept/lcsh/antarcticadiscoveryandexploration

http://data.archiveshub.ac.uk/id/concept/person/nra/hookerjosephdalton1817-1911sirknightbotanist

http://data.archiveshub.ac.uk/id/concept/organisation/ncarules/britishnavalexpeditionantarcticregions1839-1843

http://data.archiveshub.ac.uk/id/concept/unesco/botany

If we look in turn at the descriptions of those resources, we see that they are all concepts (i.e. instances of the class skos:Concept) – even the second and third cases. And in those two cases the concept is the subject of a “foaf:focus” relationship with a further resource, of type Person and Organisation, respectively:

http://data.archiveshub.ac.uk/id/person/nra/hookerjosephdalton1817-1911sirknightbotanist

http://data.archiveshub.ac.uk/id/organisation/ncarules/britishnavalexpeditionantarcticregions1839-1843

I’ve tried to depict this in the graph below. I’ve omitted the rdf:type arcs for conciseness, and relied on colour to indicate resource type (blue = Archival resource; white = Concept; green = Agent (Person or Organisation)).
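The two-step pattern can also be written out as a pair of triples. foaf:focus is the FOAF property discussed above; the exact URI of the locah “associatedWith” property shown here is an assumption about the namespace:

```python
# The controlaccess pattern: archival resource -> concept -> person.
# The "associatedWith" property URI is an assumed form of the locah namespace.
ARCHIVE = "http://data.archiveshub.ac.uk/id/archivalresource/gb15sirjosephdaltonhooker"
CONCEPT = "http://data.archiveshub.ac.uk/id/concept/person/nra/hookerjosephdalton1817-1911sirknightbotanist"
PERSON = "http://data.archiveshub.ac.uk/id/person/nra/hookerjosephdalton1817-1911sirknightbotanist"

triples = [
    (ARCHIVE, "http://data.archiveshub.ac.uk/def/associatedWith", CONCEPT),  # assumed namespace
    (CONCEPT, "http://xmlns.com/foaf/0.1/focus", PERSON),
]
```

Note how the concept URI is the object of the first triple and the subject of the second: the concept sits between the archival resource and the real-world person.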

So, the question is how/whether this applies for the SALDA/MOA cases I describe above.

For the “topic collections” case, the link is simply to a concept (a member of a “MOA Topics” “Concept Scheme”), and there isn’t a separate physical entity involved.

For the “place series” case, in theory we could introduce a set of concepts but I’m not sure there is any value in doing so – there is no external thesaurus/authority file involved, and I think it’s reasonable to simply make the direct link between archival resource and place.

The keyword matching case actually covers various sub-cases, and I need to think harder about them, but broadly I think we should try to avoid the complexity of the “intermediate” concept where it isn’t really necessary.

Summary

In short, while I need to do some more work on it, it’s been relatively straightforward to apply the model and the transformation processes developed in LOCAH to the MOA data.

What is perhaps more interesting is how we’ve “specialised” the fairly “general” LOCAH approach, based on Karen’s “local knowledge” of specific characteristics of the MOA data.

While it’s perhaps premature to draw general conclusions from this single case, I do wonder whether the nature of the EAD format and the ways it is used may mean that this combination of the general and the local/specific turns out to be a common pattern, e.g. for a different dataset, a different set of “local”/specific characteristics might be identified and exploited in a similar fashion. Amongst other things, I should probably think about how this is reflected in the transformation process, e.g. whether it is possible to “modularise” the XSLT transform in such a way that the “general” parts are separated from the “specific” ones, and it is easier to “plug in” versions of the latter as required.

URIs. A decision

Tuesday, May 10th, 2011

The project team at Sussex (Jane Harvell, Fiona Courage, Chris Keene and myself) met for an hour yesterday to decide about the URI stem for our data.

We took 20 minutes. Did we make a hasty decision? No. Did we make a considered, long-term, looking-to-the-future sort of decision? Yes. The combined expertise round the table was very useful: Jane is very library and looks to the digital future, Fiona is very involved with the Keep and our identity when we are there, and Chris knows about servers and how that bit works. We considered the comments from Rob Styles and the advice from Pete Johnston at Eduserv, and we decided on:

data.lib.sussex.ac.uk/archive

We wanted something that could work with other archive collections (if we decide to make them into linked data) so a Mass Observation or Massobs stem was too exclusive. We also wanted to avoid creating lots and lots of URIs for the same thing in the future so a generic stem seemed the way to go.

Musings about URIs

Wednesday, April 20th, 2011

Choosing a base for our URIs. Easy, right? The task was recently allocated to me. Should take all of 5 minutes, and then I can sit back and sip my coffee at a job well done. Simples.

Annoyingly, not quite yet.

First thing: the URIs will resolve to an actual web server. We’ve got loads of servers, hostnames and aliases (cnames), but which to use? We need a server and hostname that will be stable and permanent. In this rapidly changing world of changing services, server consolidation and a move towards cloud-hosted services, what’s best to use?

Two potential base URI options:

  • data.lib.sussex.ac.uk
  • www.sussex.ac.uk/library/

The former was my immediate first choice; it fits in with the common naming practice ‘data.organisation.tld’ (admittedly with ‘lib’ in the middle; I don’t think we are ready to roll out an institutional data service just yet).

The latter was a consideration as it built on an already known and trusted URI on an institutionally embedded service: our University website (and corresponding infrastructure). Both URL and service are going to be around for the foreseeable future. What’s more, they don’t require the Library to maintain any additional infrastructure. However, this didn’t fit in with the common convention, and might clash with other Library URLs. And there’s a risk: if the University moved to a new Content Management System it might break our URIs, especially if the CMS required full control of the ‘www.sussex.ac.uk’ namespace. Plus, it just doesn’t look cool.

So current thinking is http://data.lib.sussex.ac.uk/. We can run it off a server here in the Library, which runs Apache and a number of other undemanding web services (wikis etc). This does require the Library to maintain it, which, to be blunt, might be an issue if I leave. But there is nothing stopping us working with our IT Services and moving data.lib.sussex.ac.uk to a centrally run (or even third-party hosted) server in the future.

Second issue…

Do we need to create a ‘Mass Observation’ namespace under http://data.lib.sussex.ac.uk/, e.g. http://data.lib.sussex.ac.uk/moa/ ?

In a nutshell, keeping it as http://data.lib.sussex.ac.uk/ keeps the URI simple, and allows us to merge other datasets into the same ‘pool’ (I don’t think pool is part of the Linked Data vocabulary, but never mind).

However, the risk is this: should we wish to create more Linked Data sets in the future, whether for the Library Catalogue, the Institutional Repository or other Special Collections, how can we be sure the various identifiers, names and reference numbers will not clash between the different datasets? Will Library Catalogue and Archive metadata be strange bedfellows?

I’ve been discussing this with Pete Johnston from Eduserv who has provided a lot of advice and things to consider. An example which came up in our discussions was:

http://data.lib.sussex.ac.uk/id/person/nra/churchillsirwinstonleonardspencer1874-1965knightprimeministerandhistorian

Would it not be desirable to have the above as a URI which could provide a description (of Winston Churchill) and links to the MOA, other archives, and mentions in the Library Catalogue, all from one URI?

My ignorance in this area is high, but my understanding is that these URIs will probably serve up information from elsewhere (i.e. our hopefully soon-to-be Talis Platform store) and present it. Will having one namespace confuse things, given that the URI will need to fetch data from potentially various stores and sources to present to the requester (human, computer or otherwise)?

Perhaps another option is to keep to one namespace, but separate it out into collections further ‘down’, e.g.

http://data.lib.sussex.ac.uk/id/document/moa/1234

http://data.lib.sussex.ac.uk/id/document/anotherarchive/1234

Is that an option, or will it break conventions?

I should say at this point that we are using Designing URI Sets for the UK Public Sector as a guide for creating URIs and are trying to stick to their guidelines as much as possible.

However: I am torn between the grand, right(?), more technical nirvana of one namespace, and the less risky approach of keeping the MOA data in its own namespace silo (you can’t have an open data blog post without the s word).

The problem ultimately is that I am still so very new to this I find it hard to think about what the issues may be, or what the right and wrong approaches are.

So, I welcome your expertise, thoughts and insights. What would you recommend? What are we not thinking about that we should? What problems are we making for ourselves down the line? What is the right approach? And (building on that last question) what is the right approach considering our somewhat limited resources and time?

So, ladies and gentlemen, your thoughts please? Please.

The Mass Observation Archive

Monday, March 28th, 2011

We thought it would be useful to say a little bit here about what the Mass Observation Archive is to provide some context to the SALDA Project (and archivists love context).

The Mass Observation Archive specialises in material about everyday life in Britain. It contains papers generated by the original Mass Observation social research organisation (1937 to early 1950s), and newer material collected continuously since 1981. The Archive is in the care of the University of Sussex and is a charitable trust. We are working on the catalogue data from the early phase, which encompasses the Second World War.

Mass Observation started in 1937 as a reaction to the abdication of Edward VIII. There is a history of Mass Observation available on the Mass Observation website. It is important to stress that it is the catalogue data we are making available, not the documents themselves.

This is an example of a Mass Observation Archive catalogue record using the web interface to our archival management system (CALM), which is called CALMView. A less blurry version is available here

You can see that the hierarchy is represented above the item description, so “Observations made in the Locarno, Streatham between December 1938 and April 1939” is in File: Clothes in Dance Halls 1938-40, Subseries: Observations and interviews 1938-40, Series: TC18 Personal Appearance and Clothes, Section: Topic Collections, Collection: Mass Observation Archive.

Out and about in Birmingham and London

Monday, March 7th, 2011

RDTF

On Tuesday 1 March, myself and Chris Keene, Technical Development Manager for the Library and SALDA project partner, attended the start-up meeting for the projects running as part of the JISC Resource Discovery Taskforce, Infrastructure for Resource Discovery (RDTF).

RDTF is due to get a new name soon. I hope they keep the Task Force bit, possibly because it makes me think of G-Force from Battle of the Planets.

The day was very useful for finding out about the other projects running alongside SALDA. Andy McGregor, JISC Project Manager, impressed upon us the wider implications of the projects, what lessons can be learnt, and the importance of these blogs for disseminating our findings to encourage sharing and collaboration. This is new territory we are in and we are not alone. Common issues of licensing, standards and what vocabularies to use for linked data were discussed during the day, along with an issue that I feel strongly about: what is the added value of linked data for the end user? I hope to answer this question in a future post.

UKAD

The next day I went along to the first UK Archives Discovery Network (UKAD) Forum at TNA. A really great day, well put together, with three plenary sessions and then 5 groups of 3 parallel sessions running for 30 minutes each, with a space outside the rooms for demonstrations and networking. This lent a really buzzy atmosphere to the day, not least because we were discussing the online future of archives and archive data. I really appreciated the opportunity at the start of the day to stand up, introduce ourselves and say who we would be interested in talking to. This got rid of a lot of sidling up to people during the lunch break and staring at their badges. Using this forward approach it was a good day for SALDA, as I met Pete Johnston from Eduserv and LOCAH, who will be working on the project with us transforming our EAD into RDF, and spoke to Adrian Stevenson and Jane Stevenson at the LOCAH project, whose templates SALDA will be using to create linked data.

It was nice to see a University of Sussex reading list screenshot making its way into a presentation on Linked Data by Richard Wallis from Talis; hopefully we will have two blobs on the linked data cloud soon! The semantic web relies on collections of reliable open data to work so that links can be made, and it is exciting to think that our Mass Observation Archive catalogue data could be one of these datasets.

A super report of the day by Bethan Ruddock at the Archives Hub is available on their blog here.

SALDA Project – Welcome!

Monday, February 21st, 2011

Welcome to the SALDA project blog. We are very excited to begin this six month JISC funded project to make the records of the Mass Observation Archive at the University of Sussex available as open Linked Data and establish a methodology that can be used to open up our other archives.

Description and Objectives
The SALDA project proposes to establish a methodology that can be used by the University of Sussex to open up metadata using the Linked Data approach. We will use knowledge and expertise already generated on similar projects to convert existing EAD currently available on our internal Archival Management System (CALM) into Linked Data that will be enhanced and made available via XML.

The University of Sussex Special Collections comprise over 100 archival collections translating to 65,000 ISAD(G)-compliant records available on the CALM system. We are concentrating on the largest archival collection held within the Library, the Mass Observation Archive, potentially creating up to 23,000 Linked Data records.

The Key steps are:

1. Export the data from our CALM Archival Management System
2. Transform the data into Linked Data (Eduserv)
3. Further enhance, complement and refine the data (UKOLN)
4. Publish the data as open Linked Data and as XML; we also plan to upload it to the Talis Platform.

The Special Collections Department will move to a new historical resource centre known as The Keep in 2013. The Keep will bring together the collections of East Sussex Record Office, Brighton & Hove City and the University’s Special Collections under one roof. All three institutions currently use separate databases to record their collections, meaning that sharing data about collections is problematic and a solution needs to be found. This project will provide invaluable experience in exporting and reusing our metadata, and explore the potential of using this approach to open up the information on holdings between institutions and greatly enhance resource discovery.

We will use the same methodology on the SALDA Project as that currently being used on the LOCAH Project, which is taking data from the Archives Hub and making it available as structured Linked Data. We plan to use the Open Data Commons PDDL licence to ensure the data is open and can be used by others. We will document our experiences on this blog and release any code or templates to help others implement a similar approach.

Alongside making these records available, we propose to work with experts experienced in similar Linked Data projects to draw up a methodology that would then be rolled out to enhance the remaining collections held by the University once the project had been completed. This process would become part of our ongoing cataloguing activities ensuring the objective of the project is sustainable.

Our project plan is available on our about page.