URIs. A decision

May 10th, 2011

The project team at Sussex (Jane Harvell, Fiona Courage, Chris Keene and myself) met for an hour yesterday to decide about the URI stem for our data.

We took 20 minutes. Did we make a hasty decision? No. Did we make a considered, long-term, looking-to-the-future sort of decision? Yes. The combined expertise round the table was very useful: Jane is very library and looks to the digital future, Fiona is very involved with the Keep and our identity when we are there, and Chris knows about servers and how that bit works. We considered the comments from Rob Styles and the advice from Pete Johnston at Eduserv, and we decided on:

data.lib.sussex.ac.uk/archive

We wanted something that could work with other archive collections (if we decide to make them into linked data) so a Mass Observation or Massobs stem was too exclusive. We also wanted to avoid creating lots and lots of URIs for the same thing in the future so a generic stem seemed the way to go.

Musings about URIs

April 20th, 2011

Choosing a base for our URIs. Easy, right? The task was recently allocated to me. Should take all of 5 minutes and then I can sit back and sip my coffee at a job well done. Simples.

Annoyingly, not quite yet.

First thing: the URIs will resolve to an actual web server. We’ve got loads of servers, hostnames and aliases (CNAMEs), but which to use? We need a server and hostname that will be stable and permanent. In this rapidly changing world of changing services, consolidation of servers and a move towards that cloud stuff (hosted services), what’s best to use?

Two potential base URI options:

  • data.lib.sussex.ac.uk
  • www.sussex.ac.uk/library/

The former was my immediate first choice: it fits in with the common naming practice ‘data.organisation.tld’ (admittedly with ‘lib’ in the middle; I don’t think we are ready to roll out an institutional data service just yet).

The latter was a consideration as it built on an already known and trusted URI on an institutionally embedded service: our University website (and corresponding infrastructure). Both URL and service are going to be around for the foreseeable future. What’s more, they don’t require the Library to maintain any additional infrastructure. However, this didn’t fit in with the common convention and might clash with other Library URLs. And there’s a risk: if the University moved to a new Content Management System it might break our URIs, especially if the CMS required full control of the ‘www.sussex.ac.uk’ namespace. Plus, it just doesn’t look cool.

So current thinking is http://data.lib.sussex.ac.uk/ – we can run it off a server here in the Library, which runs Apache and a number of other undemanding web services (wikis etc). This does require the Library to maintain it, which, to be blunt, might be an issue if I leave. But there is nothing stopping us working with our IT Services and moving data.lib.sussex.ac.uk to a centrally run (or even third-party hosted) server in the future.

Second issue…

Do we need to create a ‘Mass Observation’ name space under http://data.lib.sussex.ac.uk/, e.g. http://data.lib.sussex.ac.uk/moa/ ?

In a nutshell, keeping it as http://data.lib.sussex.ac.uk/ keeps a simple URI, and allows us to merge other datasets into the same ‘pool’ (I don’t think pool is part of the Linked Data vocabulary but never mind).

However, the risk is that should we wish to create more Linked Data sets in the future, whether for the Library Catalogue, the Institutional Repository or other Special Collections, how can we be sure the various identifiers, names and reference numbers will not clash between the different datasets? Will Library Catalogue and Archive metadata be strange bedfellows?

I’ve been discussing this with Pete Johnston from Eduserv who has provided a lot of advice and things to consider. An example which came up in our discussions was:

http://data.lib.sussex.ac.uk/id/person/nra/churchillsirwinstonleonardspencer1874-1965knightprimeministerandhistorian

Would it not be desirable to have the above as a URI which could provide a description (of Winston Churchill) and links to the MOA, other archives and mentions in the Library Catalogue, all from one URI ID?
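As a rough illustration of how a URI like the one above might be minted, here is a minimal Python sketch. It assumes the last segment is simply a name heading, lower-cased, with everything except letters, digits and hyphens stripped out. The heading string, the `name_slug` and `person_uri` helpers and the slug rule itself are assumptions made for illustration, not a documented convention; the `nra` segment is copied from the example.

```python
import re

BASE = "http://data.lib.sussex.ac.uk"


def name_slug(heading: str) -> str:
    """Lower-case the heading and drop everything except a-z, 0-9 and hyphens.

    Assumed rule, chosen only because it reproduces the example URI.
    """
    return re.sub(r"[^a-z0-9-]", "", heading.lower())


def person_uri(heading: str, scheme: str = "nra") -> str:
    """Build an /id/person/ URI under the shared base (hypothetical helper)."""
    return f"{BASE}/id/person/{scheme}/{name_slug(heading)}"


# Assumed form of the name heading; the original heading is not given in the post.
heading = (
    "Churchill, Sir Winston Leonard Spencer, 1874-1965, "
    "Knight, Prime Minister and Historian"
)

print(person_uri(heading))
```

With the assumed heading, the sketch yields the same URI as the example. One identifier per person is what would let descriptions from the MOA, other archives and the Library Catalogue hang off a single place.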

My ignorance in this area is high, but my understanding is that these URIs will probably serve up information from elsewhere (i.e. our hopefully soon-to-be Talis Platform Store) and present it. Will having one name space confuse things, as the URI will need to fetch data from potentially various stores and sources to present to the requester (human, computer or otherwise)?

Perhaps another option is to keep to one namespace, but separate it out into collections further ‘down’, e.g.

http://data.lib.sussex.ac.uk/id/document/moa/1234

http://data.lib.sussex.ac.uk/id/document/anotherarchive/1234

Is that an option, or will it break conventions?
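To make the trade-off concrete, here is a small Python sketch comparing a single flat pool of document URIs with the collection-qualified pattern shown above. The helper names and the two sample records are hypothetical; only the URI shapes come from the examples.

```python
BASE = "http://data.lib.sussex.ac.uk"


def flat_uri(local_id: str) -> str:
    """One shared pool: the local reference number is the whole identifier."""
    return f"{BASE}/id/document/{local_id}"


def collection_uri(collection: str, local_id: str) -> str:
    """Same host and namespace, with a collection segment before the number."""
    return f"{BASE}/id/document/{collection}/{local_id}"


# Two records from different collections that happen to share a reference number.
records = [("moa", "1234"), ("anotherarchive", "1234")]

flat = {flat_uri(local_id) for _, local_id in records}
scoped = {collection_uri(collection, local_id) for collection, local_id in records}

print(len(flat))    # 1: both records collapse onto the same URI
print(len(scoped))  # 2: each record keeps its own URI
```

The sketch only shows that a collection segment avoids clashes between reference numbers; it says nothing about whether the pattern fits the conventions asked about here.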

I should say at this point that we are using Designing URI Sets for the UK Public Sector as a guide for creating URIs and are trying to stick to their guidelines as much as possible.

However, I am torn between the grand, right(?), more technical nirvana of one name space and the less risky approach of keeping the MOA data in its own name space silo (you can’t have an open data blog post without the s word).

The problem ultimately is that I am still so very new to this that I find it hard to think about what the issues may be, or what the right and wrong approaches are.

So, I welcome your expertise, thoughts and insights. What would you recommend? What are we not thinking about that we should? What problems are we making for ourselves down the line? What is the right approach? And (building on that last question) what is the right approach considering our somewhat limited resources and time?

So ladies and gentlemen, your thoughts please? Please.

Licencing part 2

April 11th, 2011

Thanks to Alexandra and Owen for their thoughts, and please see Chris Keene’s comment also. This issue was never going to be straightforward and any discussion about licencing makes people edgy. Perhaps not the licence itself but the idea that someone could use someone else’s work without asking or crediting them. As Alexandra says, this is possibly a separate issue, but it seems that by using CC-BY licences people are hedging their bets – you can use it but you have to say where you got it from, which is perfectly reasonable. There is also uncertainty about whether they have the right to licence the data in the first place.

I’ve discussed this with Fiona and she makes the point that in an academic context, we tell people every day how to reference the materials they are using and so attribution of data is in our very core. Making our catalogue records available as Linked Open Data with no insistence on attribution is contrary to what we do every day. However, naïve as it may sound, though we don’t insist that people attribute in this case, that’s not to say that they won’t.

We know we are diving in there and taking a risk; we don’t know how the data could be used in the future and what impact that will have. But someone has got to take that risk. We are confident that we are able to licence the data for use in the first place and we want to take the most open road. We don’t do it lightly.

It is perhaps the conflict between archivists and developers. As archivists we are naturally cautious and, as I said earlier, make attribution a key part of our work. Developers/technicians are much more used to putting things out there as open source – I’m assuming – would any developers like to comment?

Licencing our data

April 5th, 2011

We have decided to use an Open Data Commons Public Domain Dedication and Licence (PDDL) to licence our data once it is open and on the Talis Platform.

Key points of PDDL

  1. Recommended by JISC for collections of factual data
  2. Goal is to eliminate restrictions on the use of data so it can be used for any purpose, including commercial, and in combination with other data
  3. There is no requirement to attribute the source of the data
  4. The Licence makes the work – in our case the catalogue records of the Mass Observation Archive – permanently available to the public for any use of any kind.
    The last point is the scary bit but also the main point of getting the data out there, and we are lucky to be sure of its ownership and copyright.

Why we chose PDDL

PDDL is the standard for collections of non-personal factual data, which is what the catalogue records of the Mass Observation Archive are. The assumption is that we own the rights to this data, as the original creators were employees of the University of Sussex, so we are free to licence this data.

JISC cautions against putting variants in licences for special requirements (for example, no use of images), as “The introduction of variant terms into a ‘Creative Commons-like licence’ from a single institution may require those potential beneficiaries to pay for legal advice in order to understand the implications of the variation. The value of seeing and understanding a single licence across the web is lost, as every minor variation encountered increases the likelihood that the different licences will conflict when combined in some third party use case” (JISC rights and licencing). I don’t believe a variant is necessary as it is a collection of factual data.

The Mass Observation Archive

March 28th, 2011

We thought it would be useful to say a little bit here about what the Mass Observation Archive is to provide some context to the SALDA Project (and archivists love context).

The Mass Observation Archive specialises in material about everyday life in Britain. It contains papers generated by the original Mass Observation social research organisation (1937 to early 1950s), and newer material collected continuously since 1981. The Archive is in the care of the University of Sussex and is a charitable trust.   We are working on the catalogue data from the early phase which encompasses the Second World War.

Mass Observation started in 1937 as a reaction to the abdication of Edward VIII. There is a history of Mass Observation available on the Mass Observation website.  It is important to stress that it is the catalogue data we are making available, not the documents themselves.

This is an example of a Mass Observation Archive catalogue record using the web interface to our archival management system (CALM) which is called CALMView.  A less blurry version is available here

You can see that the hierarchy is represented above the item description, so “Observations made in the Locarno, Streatham between December 1938 and April 1939” is in File: Clothes in Dance Halls 1938-40, Subseries: Observations and interviews 1938-40, Series: TC18 Personal Appearance and Clothes, Section: Topic Collections, Collection: Mass Observation Archive.
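For readers who think in data structures, the same hierarchy can be sketched as an ordered list of (level, title) pairs running from the collection down to the item. This is only an illustration in Python: the `breadcrumb` helper and the “Item” label are assumptions, and nothing here reflects how CALM stores records internally.

```python
# Levels and titles taken from the catalogue record described above,
# ordered from the top of the hierarchy down to the item.
hierarchy = [
    ("Collection", "Mass Observation Archive"),
    ("Section", "Topic Collections"),
    ("Series", "TC18 Personal Appearance and Clothes"),
    ("Subseries", "Observations and interviews 1938-40"),
    ("File", "Clothes in Dance Halls 1938-40"),
    ("Item", "Observations made in the Locarno, Streatham "
             "between December 1938 and April 1939"),
]


def breadcrumb(levels):
    """Join the levels into a single readable path (hypothetical helper)."""
    return " > ".join(f"{level}: {title}" for level, title in levels)


print(breadcrumb(hierarchy))
```

Keeping the levels in order matters because the item title only makes full sense when read with its parents.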

Linked Data – What’s the point of that again?

March 22nd, 2011

A controversial title, but it neatly summarizes the feeling in the SALDA Project camp at the moment. Fiona Courage (Special Collections Manager and Curator of the Mass Observation Archive) and I are coming from an archive / library background and are very focused on the end result of the project, which should be a set of Linked Open Data comprising our Mass Observation Archive catalogue records. But we’re not really sure what the benefit of this is, so I wanted to ask the RDTF community and beyond for help. We know it is early days for the project and we’re likely to look back at this blog post in a few months’ time and answer our own question, but right now we seem to have hit a bit of a brick wall.

What we are realising is that, unlike a lot of projects with archives that I have done in the past, it is perhaps not the end result that is most important but the journey to get there. What we hope to achieve with SALDA is the skills and knowledge to make our catalogues Linked Data, and to use those skills and that knowledge to inform decisions about whether it would be beneficial to make all our data Linked Data. Our journey so far has taught us that we need to refine and review our catalogue data within our archival management system anyway, and the changes we are making now in preparation are opening up the data to enable better search results already. When the Mass Observation Archive was catalogued in the 1970s and 1980s abbreviations seem to have been all the rage, so Communist Party became CP and Conscientious Objectors became CO, to give a couple of examples. This was fine in a printed finding aid under the heading of Communist Party, and also fine on an HTML page that users scrolled through. It is not fine for keyword searching. This may seem very basic stuff, but I’m sure there are lots of archives out there that have records that make sense to the archivist and in the printed list, but will not be retrieved via a search engine. If resource discovery is our aim, then making our information clear and accessible is key. Following on from this basic idea of “finding what you search for”, is Linked Data a step on from this? Finding what you search for and a bit more?
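The abbreviation problem is easy to show in a few lines. The Python sketch below expands the two abbreviations mentioned above so that a keyword search for the full term would find the record. It is an illustration only: the lookup table holds just those two entries, the sample titles are invented, and this is not how the clean-up was done inside CALM.

```python
import re

# The two abbreviations mentioned above; a real table would be much longer.
ABBREVIATIONS = {
    "CP": "Communist Party",
    "CO": "Conscientious Objectors",
}

# Match an abbreviation only when it stands alone as a whole upper-case word.
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b")


def expand(title: str) -> str:
    """Replace each stand-alone abbreviation with its full form."""
    return PATTERN.sub(lambda match: ABBREVIATIONS[match.group(1)], title)


# Invented sample titles, not real catalogue entries.
print(expand("CP meetings"))
print(expand("COPY of CO tribunal notes"))
```

The word-boundary match is a deliberate choice so that “CO” inside a longer word such as “COPY” is left alone. Anything this mechanical would still need an archivist’s eye, since the same letters can stand for different things in different files.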

CALM records into Encoded Archival Description (EAD)

March 7th, 2011

I am working on getting our catalogue data ready for export from our archival management system CALM. We are using the Archives Hub EAD 2000 report which exists in CALM. The following fields are now in our collection level record:

Language “Eng”

Creator Name “Mass Observation Archive”

EHFD publisher “University of Sussex Library”

Country Code “GB”

Origination “Mass Observation”

Repository Code “181”

Guidelines for required fields and common problems with the EAD report are available from the Archives Hub here. For the future, we will need to add these fields to all our collection level records to make them EAD ready.
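Checking the other collection level records for these fields is the sort of job a short script could do. The Python sketch below treats each record as a plain dictionary keyed by the field names listed above and reports which ones are missing or empty. The dictionary representation and the second sample record are assumptions for illustration; the list reflects only the fields named in this post, not the full Archives Hub guidelines.

```python
# Field names as listed above for the collection level record.
REQUIRED_FIELDS = [
    "Language",
    "Creator Name",
    "EHFD publisher",
    "Country Code",
    "Origination",
    "Repository Code",
]


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or blank in a record."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(record.get(field, "")).strip()
    ]


# Values as given above for the Mass Observation Archive.
moa = {
    "Language": "Eng",
    "Creator Name": "Mass Observation Archive",
    "EHFD publisher": "University of Sussex Library",
    "Country Code": "GB",
    "Origination": "Mass Observation",
    "Repository Code": "181",
}

# Hypothetical record for another collection that has not been updated yet.
other = {"Language": "Eng", "Country Code": "GB"}

print(missing_fields(moa))
print(missing_fields(other))
```

An empty list means the record has every field named here; it does not guarantee that the EAD report itself will be valid.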

A quirk with this transfer to EAD is that it is a report, not an export, so you cannot highlight a selection of records (called a hitlist in CALM). The Mass Observation Archive has over 23,000 records and is causing CALM to freeze. Very quick and helpful advice from the CALM helpdesk led us to turn off the server and then run the report, which seems to work. This method is less good for the rest of the Special Collections staff in the office who need to use CALM and our users who access it through the web interface, so I am rationing my EAD tests to less busy times.

Out and about in Birmingham and London

March 7th, 2011

RDTF

On Tuesday 1 March, Chris Keene (Technical Development Manager for the Library and SALDA project partner) and I attended the start-up meeting for the projects running as part of the JISC Resource Discovery Taskforce, Infrastructure for Resource Discovery (RDTF).

RDTF is due to get a new name soon. I hope they keep the Task Force bit, possibly because it makes me think of G-Force from Battle of the Planets.

The day was very useful for finding out about the other projects running alongside SALDA. Andy McGregor, JISC Project Manager, impressed upon us the wider implications of the projects, what lessons can be learnt and the importance of these blogs for disseminating our findings to encourage sharing and collaboration. This is new territory we are in and we are not alone. Common issues of licensing, standards and what vocabularies to use for linked data were discussed during the day, along with an issue that I feel strongly about: what is the added value of linked data for the end user? I hope to answer this question in a future post.

UKAD

The next day I went along to the first UK Archive Discovery Network (UKAD) Forum at TNA. A really great day, well put together with three plenary sessions and then 5 groups of 3 parallel sessions running for 30 minutes each, with a space outside the rooms for demonstrations and networking. This lent a really buzzy atmosphere to the day, not least because we were discussing the online future of archives and archive data. I really appreciated the opportunity at the start of the day to stand up and introduce ourselves and say who we would be interested in talking to. This got rid of a lot of sidling up to people during the lunch break and staring at their badges. Using this forward approach it was a good day for SALDA, as I met with Pete Johnston from Eduserv and LOCAH, who will be working on the project with us transforming our EAD into RDF, and spoke to Adrian Stevenson and Jane Stevenson at the LOCAH project, whose templates SALDA will be using to create linked data.

It was nice to see a University of Sussex reading list screen shot making its way into a presentation on Linked Data by Richard Wallis from Talis; hopefully we will have two blobs on the linked data cloud soon! The semantic web relies on collections of reliable open data to work so that links can be made, and it is exciting to think that our Mass Observation Archive catalogue data could be one of these datasets.

A super report of the day by Bethan Ruddock at the Archives Hub is available on their blog here.

SALDA Project – Welcome!

February 21st, 2011

Welcome to the SALDA project blog. We are very excited to begin this six-month JISC-funded project to make the records of the Mass Observation Archive at the University of Sussex available as open Linked Data and establish a methodology that can be used to open up our other archives.

Description and Objectives
The SALDA project proposes to establish a methodology that can be used by the University of Sussex to open up metadata using the Linked Data approach. We will use knowledge and expertise already generated on similar projects  to convert existing EAD currently available on our internal Archival Management System (CALM) into Linked Data that will be enhanced and made available via XML.

The University of Sussex Special Collections comprise over 100 archival collections translating to 65,000 ISAD(G)-compliant records available on the CALM system. We are concentrating on the largest archival collection held within the Library, the Mass Observation Archive, potentially creating up to 23,000 Linked Data records.

The Key steps are:

1. Export the data from our CALM Archival Management System
2. Transform the data into Linked Data (Eduserv)
3. Further enhance, complement and refine the data (UKOLN)
4. Publish the data as open Linked Data and as XML; we plan to upload it to the Talis Platform.

The Special Collections Department will move to a new historical resource centre known as The Keep in 2013. The Keep will bring together the collections of East Sussex Records Office, Brighton & Hove City and the University Special Collections under one roof. All three institutions currently use separate databases to record their collections meaning that sharing data about collections is problematic and a solution needs to be found. This project will provide invaluable experience in exporting and reusing our metadata and explore the potential of using this approach to open up the information on holdings between institutions and greatly enhance resource discovery.

We will use the same methodology on the SALDA Project as is currently being used on the LOCAH Project, which is taking data from the Archives Hub and making it available as structured Linked Data. We plan to use the Open Data Commons PDDL licence to ensure the data is open and can be used by others. We will document our experiences on this blog and release any code or templates to help others implement a similar approach.

Alongside making these records available, we propose to work with experts experienced in similar Linked Data projects to draw up a methodology that would then be rolled out to enhance the remaining collections held by the University once the project had been completed. This process would become part of our ongoing cataloguing activities ensuring the objective of the project is sustainable.

Our project plan is available on our about page