Archive for April, 2011

Musings about URIs

Wednesday, April 20th, 2011

Choosing a base for our URIs. Easy, right? The task was recently allocated to me. Should take all of 5 minutes and then I can sit back and sip my coffee at a job well done. Simples.

Annoyingly, not quite yet.

First thing: the URIs will resolve to an actual web server. We’ve got loads of servers, hostnames and aliases (CNAMEs), but which to use? We need a server and hostname that will be stable and permanent. In this rapidly changing world of changing services, consolidation of servers and a move towards that cloud stuff (hosted services), what’s best to use?

Two potential base URI options:

  • data.lib.sussex.ac.uk
  • www.sussex.ac.uk/library/

The former was my immediate first choice: it fits in with the common naming practice ‘data.organisation.tld’ (admittedly with ‘lib’ in the middle, as I don’t think we are ready to roll out an institutional data service just yet).

The latter was a consideration as it built on an already known and trusted URI on an institutionally embedded service: our University website (and corresponding infrastructure). Both URL and service are going to be around for the foreseeable future. What’s more, they don’t require the Library to maintain any additional infrastructure. However, this didn’t fit in with the common convention and might clash with other Library URLs. And there’s a risk: if the University moved to a new Content Management System it might break our URIs, especially if the CMS required full control of the ‘www.sussex.ac.uk’ namespace. Plus, it just doesn’t look cool.

So current thinking is http://data.lib.sussex.ac.uk/ – we can run it off a server here in the Library, which runs Apache and a number of other undemanding web services (wikis etc). This does require the Library to maintain it, which, to be blunt, might be an issue if I leave. But there is nothing stopping us working with our IT Services and moving data.lib.sussex.ac.uk to a centrally run (or even third party hosted) server in the future.

Second issue…

Do we need to create a ‘Mass Observation’ name space under http://data.lib.sussex.ac.uk/ e.g. http://data.lib.sussex.ac.uk/moa/ ?

In a nutshell, keeping it as http://data.lib.sussex.ac.uk/ keeps a simple URI, and allows us to merge other datasets into the same ‘pool’ (I don’t think pool is part of the Linked Data vocabulary but never mind).

However, the risk is that should we wish to create more Linked Data sets in the future, whether for the Library Catalogue, the Institutional Repository or other Special Collections, how can we be sure the various identifiers, names and reference numbers will not clash between the different datasets? Will Library Catalogue and Archive metadata be strange bedfellows?

I’ve been discussing this with Pete Johnston from Eduserv who has provided a lot of advice and things to consider. An example which came up in our discussions was:

http://data.lib.sussex.ac.uk/id/person/nra/churchillsirwinstonleonardspencer1874-1965knightprimeministerandhistorian

Would it not be desirable to have the above as a URI which could provide a description (of Winston Churchill) and links to the MOA, other archives and mentions in the Library Catalogue, all from one URI ID?
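To see how an identifier like the one above might be produced, here is a minimal Python sketch. It assumes the last path segment is simply the name authority heading, lower-cased, with everything except letters, digits and hyphens removed. The heading string, the function names and that rule are my guesses from the look of the example URI, not a documented NRA or Library convention.

```python
import re

# Base URI discussed in this post.
BASE = "http://data.lib.sussex.ac.uk"


def slugify_heading(heading: str) -> str:
    """Lower-case a heading and drop everything except a-z, 0-9 and hyphens.

    Assumed rule, inferred from the example URI above.
    """
    return re.sub(r"[^a-z0-9-]", "", heading.lower())


def person_uri(heading: str, authority: str = "nra") -> str:
    """Build an /id/person/{authority}/{slug} URI (hypothetical helper)."""
    return f"{BASE}/id/person/{authority}/{slugify_heading(heading)}"


# Assumed form of the authority heading; the real record may differ.
heading = (
    "Churchill, Sir Winston Leonard Spencer (1874-1965) "
    "Knight, Prime Minister and Historian"
)
print(person_uri(heading))
```

A rule this blunt will make two different headings collide if they differ only in punctuation or spacing, so it would need checking against the real authority file before being relied on.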

My ignorance in this area is high, but my understanding is that these URIs will probably serve up information from elsewhere (i.e. our hopefully soon-to-be Talis Platform Store) and present it. Will having one name space confuse things, given that the URI will need to fetch data from potentially various stores and sources to present to the requester (human, computer or otherwise)?

Perhaps another option is to keep to one name space, but separate it out into collections further ‘down’ e.g.

http://data.lib.sussex.ac.uk/id/document/moa/1234

http://data.lib.sussex.ac.uk/id/document/anotherarchive/1234

Is that an option, or will it break conventions?
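To make the difference concrete, here is a small Python sketch comparing a flat layout with the collection-segment layout from the two example URIs. The helper names and the sample reference numbers are made up for illustration; only the base URI and the /id/document/{collection}/{reference} pattern come from the examples above.

```python
from urllib.parse import quote

BASE = "http://data.lib.sussex.ac.uk"


def flat_uri(concept: str, reference: str) -> str:
    """One shared pool: /id/{concept}/{reference} (hypothetical helper)."""
    return f"{BASE}/id/{quote(concept, safe='')}/{quote(reference, safe='')}"


def collection_uri(concept: str, collection: str, reference: str) -> str:
    """Collection kept as its own path segment, as in the examples above."""
    segments = [quote(s, safe="") for s in (concept, collection, reference)]
    return f"{BASE}/id/" + "/".join(segments)


# Two collections that happen to reuse the same local reference number.
records = [("moa", "1234"), ("anotherarchive", "1234")]

flat = {flat_uri("document", ref) for _, ref in records}
scoped = {collection_uri("document", coll, ref) for coll, ref in records}

print(len(flat))    # both records land on the same URI
print(len(scoped))  # each record keeps a distinct URI
```

In the flat layout the two records end up sharing one URI, which is exactly the clash I am worried about. With the collection segment they stay distinct while still living under a single host and /id/ root.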

I should say at this point that we are using Designing URI Sets for the UK Public Sector as a guide for creating URIs and are trying to stick to their guidelines as much as possible.
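As I understand that guidance, an identifier URI under /id/ names the thing itself and answers with a 303 redirect to a matching document URI under /doc/, which describes it. The Python sketch below shows only the URI mapping. The function names are hypothetical and we have not actually settled how our server will do this.

```python
from urllib.parse import urlsplit, urlunsplit


def doc_uri_for(id_uri: str) -> str:
    """Map an /id/... identifier URI to the matching /doc/... document URI."""
    parts = urlsplit(id_uri)
    if not parts.path.startswith("/id/"):
        raise ValueError(f"not an identifier URI: {id_uri}")
    doc_path = "/doc/" + parts.path[len("/id/"):]
    return urlunsplit(
        (parts.scheme, parts.netloc, doc_path, parts.query, parts.fragment)
    )


def redirect_for(id_uri: str):
    """Return the status code and headers a server might send (sketch only)."""
    return 303, {"Location": doc_uri_for(id_uri)}


status, headers = redirect_for("http://data.lib.sussex.ac.uk/id/document/moa/1234")
print(status, headers["Location"])
```

The same mapping works whether or not a collection segment such as moa is present, so it does not force the name space decision either way.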

However, I am torn between the grand, right(?), more technical nirvana of one name space and the less risky approach of keeping the MOA data in its own name space silo (you can’t have an open data blog post without the s word).

The problem ultimately is that I am still so very new to this that I find it hard to think about what the issues may be, or what the right and wrong approaches are.

So, I welcome your expertise, thoughts and insights. What would you recommend? What are we not thinking about that we should? What problems are we making for ourselves down the line? What is the right approach? And (building on that last question) what is the right approach considering our somewhat limited resources and time?

So ladies and gentlemen, your thoughts please? Please.

Licencing part 2

Monday, April 11th, 2011

Thanks to Alexandra and Owen for their thoughts, and please see Chris Keene’s comment also. This issue was never going to be straightforward and any discussion about licencing makes people edgy. Perhaps not the licence itself but the idea that someone could use someone else’s work without asking or crediting them. As Alexandra says, this is possibly a separate issue, but it seems that by using CC-BY licences people are hedging their bets – you can use it but you have to say where you got it from, which is perfectly reasonable. There is also uncertainty about whether they have the right to licence the data in the first place.

I’ve discussed this with Fiona and she makes the point that in an academic context, we tell people every day how to reference the materials they are using and so attribution of data is in our very core. Making our catalogue records available as Linked Open Data with no insistence on attribution is contrary to what we do every day. However, naïve as it may sound, though we don’t insist that people attribute in this case, that’s not to say that they won’t.

We know we are diving in there and taking a risk: we don’t know how the data could be used in the future or what impact that will have. But someone has got to take that risk. We are confident that we are able to licence the data for use in the first place and we want to take the most open road. We don’t do it lightly.

It is perhaps the conflict between archivists and developers. As archivists we are naturally cautious and, as I said earlier, make attribution a key part of our work. Developers/technicians are much more used to putting things out there as open source – I’m assuming – would any developers like to comment?

Licencing our data

Tuesday, April 5th, 2011

We have decided to use an Open Data Commons Public Domain Dedication and Licence (PDDL) to licence our data once it is open and on the Talis Platform.

Key points of PDDL

  1. Recommended by JISC for collections of factual data
  2. Goal is to eliminate restrictions on the use of data so it can be used for any purpose, including commercial use and in combination with other data
  3. There is no requirement to attribute the source of the data
  4. The Licence makes the work – in our case the catalogue records of the Mass Observation Archive – permanently available to the public for any use of any kind.
    The line above in bold is the scary bit but also the main point of getting the data out there and we are lucky to be sure of its ownership and copyright.

Why we chose PDDL

PDDL is the standard for collections of non-personal factual data, which is what the catalogue records of the Mass Observation Archive are. The assumption is that we own the rights to this data, as the original creators were employees of the University of Sussex, so we are free to licence this data.

JISC cautions against putting variants in licences for special requirements (for example, no use of images), as “The introduction of variant terms into a ‘Creative Commons-like licence’ from a single institution may require those potential beneficiaries to pay for legal advice in order to understand the implications of the variation. The value of seeing and understanding a single licence across the web is lost, as every minor variation encountered increases the likelihood that the different licences will conflict when combined in some third party use case” (JISC rights and licencing). I don’t believe a variant is necessary as it is a collection of factual data.