Licencing our data

We have decided to use an Open Data Commons Public Domain Dedication and Licence (PDDL) to licence our data once it is open and on the Talis Platform.

Key points of PDDL

Recommended by JISC for collections of factual data
Goal is to eliminate restrictions on the use of data so it can be used for any purpose, including commercial use, and in combination with other data
There is no requirement to attribute the source of the data
The Licence makes the work – in our case the catalogue records of the Mass Observation Archive – permanently available to the public for any use of any kind.
The line above in bold is the scary bit but also the main point of getting the data out there and we are lucky to be sure of its ownership and copyright.

Why we chose PDDL

PDDL is the standard for collections of non-personal factual data, which is what the catalogue records of the Mass Observation Archive are. The assumption is that we own the rights to this data, as the original creators were employees of the University of Sussex, so we are free to licence it.

JISC guards against putting variants in licences for special requirements, for example no use of images, as “The introduction of variant terms into a ‘Creative Commons-like licence’ from a single institution may require those potential beneficiaries to pay for legal advice in order to understand the implications of the variation. The value of seeing and understanding a single licence across the web is lost, as every minor variation encountered increases the likelihood that the different licences will conflict when combined in some third party use case” (JISC rights and licencing). I don’t believe a variant is necessary, as it is a collection of factual data.

Tags: Licence, PDDL

This entry was posted on Tuesday, April 5th, 2011 at 09:42 and is filed under Uncategorized.

6 Responses to “Licencing our data”

Alexandra Eveleigh says:

April 6, 2011 at 14:27

So, I asked Karen on twitter whether the non-attribution element of this licence didn’t worry her, and Owen Stephens has asked me to elaborate further. I wasn’t sure I could manage that in 140 characters so I am posting a comment here instead:

To me, provenance, traceability and context (all of which are interlinked) are vital means by which the user can establish the authenticity and reliability of data – and in this case, also of the archival data that the Mass Observation archive catalogue itself describes. Attribution is the means by which the chain of provenance is maintained.

This attribution doesn’t need to be of great length or specific detail (in particular, I can see there are instances where multiple data sources are used where this could be tricky to implement). But I assert that there does need to be some means by which the user (by which I mean a human, not a machine) can satisfy him or herself of the source of the data, and/or that the data has not been invented or manipulated in a deliberate attempt to mislead. A means by which another person could come along, replicate the same process, identify the same data, and get the same result.

I do not view this as an attempt to control or dictate ways in which the data can be used – putting aside moral arguments, I imagine that is more or less practically impossible anyway. It does give the data provider some come back if the data is used without attribution and in such a way as to misrepresent the source, but on reflection I think I see attribution more as an insurance protection for the user than for the data provider.

Morally, I also think attribution is a matter of courtesy – an acknowledgement of the hard work that has gone into making the data available for re-use in the first place, and into the original compilation of the catalogue. But that is perhaps a separate issue 🙂

I don’t really know enough about licensing to feel confident making suggestions for alternative schemes, but for what it’s worth, it seems to me that the Open Government Licence http://www.nationalarchives.gov.uk/doc/open-government-licence/ handles this particular issue quite well. However, I don’t know if there are similar, but more generic, data licences available. The OGL is also, ahem, considerably clearer, I would say, than the PDDL, on which I had to use ‘Find’ in my browser just to find the section which mentions attribution (or lack of it).

Having said that, I do wonder if I am imposing rather an analogue paradigm onto data here, and I’m certainly open to attempts to try to persuade me to change my mind! In particular, I can see that maybe the very ‘linkiness’ of linked data might perhaps be able to substitute for or replace citation and attribution as a means of tracing source and context, but I’m very much a linked data beginner so you will need to explain slowly…
Owen Stephens says:

April 8, 2011 at 11:49

Thanks for taking the time to expand here Alexandra – very useful.

Your concerns are very much in line with the feedback I’ve had from other places – particularly from data owners in the Lucero Project I’m working on at the Open University (http://lucero-project.info/) – and we’ve generally agreed the use of CC-BY licensing on the data sets we are publishing at http://data.open.ac.uk as part of that project.

Personally I would see licensing and provenance as two related, but separate, things. By licensing data as OGL or CC-BY you are making a statement that anyone using the data is required (legally) to attribute data back to the source.

While I don’t disagree generally with any of your statements about attribution and provenance (it is definitely good practice), I don’t think this means it should be enforced by a legal framework.

I would prefer to see good practice driven by ‘acceptable behaviour’ rather than legal recourse.

I agree the OGL license is nice and clear. The downside we found on Lucero (and why we went with CC-BY even though there is some debate as to its applicability to data) is that familiarity and general knowledge in the community is key to the license having practical application.

Finally – the issues around linked data and provenance/attribution are complex (which is one reason for not getting into it!). If I use one of your URIs that’s one thing (like linking to a web page – the attribution is built in). However, if I want to make use of a statement you have made in a ‘triple’ then it is much more difficult. For a start, you could be combining very large numbers of triples from multiple sources, so scale is an issue. Secondly, the ‘triple’ itself doesn’t have an identifier, so you can’t ‘point’ at the triple to say where it is from.
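
To make the second point concrete, here is a minimal Python sketch, offered as an illustration only and not as code from Lucero or SALDA. The `ex:` names and `example.org` URLs are made up, and plain tuples stand in for a real RDF library. It shows how the origin of a statement disappears when bare triples are merged, and how carrying a fourth 'source' element (roughly the idea behind named graphs) keeps it recoverable.

```python
from collections import namedtuple

# A bare statement: subject, predicate, object. Nothing says who asserted it.
Triple = namedtuple("Triple", "subject predicate obj")
# The same statement plus a fourth element naming where it came from.
Quad = namedtuple("Quad", "subject predicate obj source")

# Hypothetical statements from two hypothetical publishers.
source_a = {Triple("ex:HarryPotter", "ex:publisher", "ex:Bloomsbury")}
source_b = {
    Triple("ex:HarryPotter", "ex:publisher", "ex:Bloomsbury"),
    Triple("ex:Survey1937", "ex:heldBy", "ex:MassObservationArchive"),
}

# Merging bare triples: identical statements collapse into one,
# and there is no way left to tell which publisher said what.
merged = source_a | source_b


def with_source(triples, source):
    """Attach a source identifier to every triple in the set."""
    return {Quad(t.subject, t.predicate, t.obj, source) for t in triples}


quads = with_source(source_a, "http://example.org/a") | with_source(
    source_b, "http://example.org/b"
)


def sources_of(triple, quads):
    """List every source that asserted the given statement."""
    return sorted(q.source for q in quads if q[:3] == tuple(triple))


if __name__ == "__main__":
    statement = Triple("ex:HarryPotter", "ex:publisher", "ex:Bloomsbury")
    print(len(merged), "bare triples after merging")
    print("asserted by:", sources_of(statement, quads))
```

The cost is visible even at this size: every statement now carries extra data, which is the scale problem mentioned above once the number of triples gets large.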

I attended a workshop on provenance and linked data recently – there should be a summary appearing on http://rdtf.mimas.ac.uk/ in the near future. (Also, my notes from sessions are at http://www.meanboyfriend.com/overdue_ideas/tag/provenance/ if you are interested, although they are tangential to this discussion in many ways.)
Chris Keene says:

April 11, 2011 at 11:02

I’m working with Karen on SALDA.

Our main discussion about the Licence was: what’s stopping us being as open as possible? This included such things as who owned the data.

From my own perspective, though I’ve been reading about Linked Data for a few years, it’s all been somewhat abstract, and it’s been quite a learning curve moving to doing this in practice! With that in mind, here are some of my *current* thoughts, which I’m hoping may have developed somewhat by the end of the project.

In the back of my mind is that you can never predict what someone might do with your data. And every rule and requirement might lead to unforeseen consequences down the line.

Imagine someone building an amazing iPhone app about World War II. They use various Linked Data sources to provide facts about a specific geographic area at a particular time during the war. A ‘fact’ from the Mass Observation Archive might be that someone in that area was keeping a diary at that time. What if we were the only ones requiring attribution, that the app/site was showing a vast amount of information from many sources but has to show ‘information sourced from Mass Observation’ at the bottom of each record just for our small contribution.

Of course, if an app was using just our data, or our data made up a significant proportion of the information, then attribution would be good practice, and it would increase the credibility of the app. But I can’t help feeling the loss or gain here is in the hands of the app developers. People might decide not to use the app, or not to take it seriously, because of a lack of attribution, but that’s the developers’ loss, not ours.

So to me, ‘provenance, traceability and context’ are important to end users, and developers should take note, but I’m not convinced data providers should force developers’ hands on the issue with licence agreements. I’d really welcome anyone’s thoughts on this.

There is another level: the above was between an application and an end user. What about us providing this information to the application? The developers, when they come across the data source, would also want to know ‘provenance, traceability and context’. I’m aware there are ways to assert this in the Linked Data and SPARQL results we will (hopefully) provide. This is something I know nothing about, but I presume there will be a way for us to assert that we (the data providers) are also the people managing the collection the data is about.

(aside: interesting question, what if someone copied the data, and made the same assertion, but made a few changes as well? And is this any different to what could happen to anything else on the web?).

My final thought is that our data, and hence the statements/facts we assert in Linked Data triples, are all about our collection.

This is unlike, say, LD of a Library Catalogue which may state: ‘the publisher of Harry Potter is Bloomsbury’. This ‘fact’ has no direct reference to the Library in question. As it gets used (ie referred to) in the Linked Data linky world the owner may be lost (lack of credit) and the reliability of the source unknown.

However for us, I think a lot of the facts that we state will in some way be about the content held in the MOA. For example the Survey “detailed record of everything panel did between rising and going to bed on the 12th of the Month” was carried out in “September 1937” ( http://specialcollections.lib.sussex.ac.uk/CalmView/Record.aspx?src=DServe.Catalog&id=SxMOA1%2f3%2f9 ).

My point here is that the risk that our data ‘out there’ gets detached from us may be lower than it is for others. Though this may just be naïvety and ignorance.

Chris.
SALDA Project says:

April 11, 2011 at 11:18

[…] to Alexandra and Owen for their thoughts, and please see Chris Keene’s comment also. This issue was never going to be straight forward and any discussion about licencing makes […]
Jane Stevenson says:

April 19, 2011 at 08:03

Hi to all there at SALDA,

I’m still grappling with the licensing issue with the Hub – made more complex by being an aggregator, but still essentially the same issues. Although one issue for us is actually whether we are in a position to licence the data at all!
Some thoughts:

1) Have you done a risk register? I’m thinking of doing this to share with our contributors – Owen was arguing that you have to think of the worst that can happen and the consequences of that, so it seems like a sensible step.

2) If, for example, I think about the content of the administrative/biographical history, I worry that with an open licence others can just take work like this and use it for financial gain – it is the ‘creative’ area of work for archivists – an investment for them. However, I keep coming back to the Linked Data vision and the need to take the risk in order to be part of the fully open data movement.

3) if you are going to go with Linked Data and you restrict what can be done with the data, is it really the right thing to be doing?

4) if you already have machine interfaces then people can grab the data anyway, and I guess they can grab it by less formal means as well (screen scraping)?

5) this is the metadata, not the archive, and its purpose is to promote the archive – is it really appropriate to be too protective?

6) maybe there is a reality here that archivists simply have to change their perspective on control. It’s not just about Linked Data, it’s about opening up data – if the government, the BBC, etc., are doing this, maybe we have to just take the risk?

7) maybe with provenance we have to accept that it is down to users’ requirements – if they want provenance, then application providers are likely to respond to this. We can keep advocating this, as best practice, but we cannot mandate it if users are not concerned about it.

8) If we go with a CC-BY type licence and make attribution a legal requirement, will this hamper use? As Chris says: “What if we were the only ones requiring attribution, that the app/site was showing a vast amount of information from many sources but has to show ‘information sourced from Mass Observation’ at the bottom of each record just for our small contribution.”
Chris Keene says:

April 19, 2011 at 15:21

Hi Jane,

great comments 🙂

Haven’t looked into a Risk Register just for licensing, but it’s something we will consider.

Many years ago I experimented with some tools to rip music CDs into MP3s. At that time iTunes and similar were not really around, so most of the options were quite technical command-driven stuff. I ripped a few CDs using different tools and different settings and noted the file size of the final MP3. It wasn’t very scientific. OK, it wasn’t remotely scientific, but I stuck the results online, and the page got a few hits. I had got my value out of the exercise, now it was online, frankly I couldn’t care what others did with it, if they found it useful, great, if they made millions from it, good for them.

It’s perhaps with the same attitude I, personally, approach this project. The records we have took effort to create. They provide value to us and our users in helping us to manage and discover our collection.

Now we’re putting them out there, hopefully to help others to discover our collection, and so providing a benefit to potential users. And perhaps some of the facts/statements that are produced are useful in their own right, not just as a finding aid for our collection, but potentially useful to the public at large. Great.

And just maybe someone will find commercial use from it. Do we care? Should we? We’ve got our use from the data (as noted above), and our use from opening up the collection as LD (again as noted above), if others can find other uses, good for them. It doesn’t diminish the value we get from the metadata. I think of it like a newspaper I’ve purchased, I’ve read it, have no more need for it, but I can leave it on the train in case someone else wants it and it in no way makes a difference to me.

So in terms of point 2, personally, I just don’t see why others finding use for our data (commercial or otherwise) should be a problem. In fact, in this age where we are trying to show our Impact and engagement with the wider community, I would see it as an absolute boon!

Of course, the situation is much more complex when you are dealing with data from many organisations and I don’t for a second envy your position on having to make a decision on the best licence for the Hub Data. If there is anything we can help with (in our all-quite-new-to-this capacity) then we’re more than happy to. Ultimately I guess there might need to be a little of what you suggest in point 6 in our community. 🙂

Chris