Archive for the ‘Uncategorized’ Category

Cost/benefits of the open data approach

Monday, July 18th, 2011

We have been asked to assess how much it has cost us in terms of time and resources to make our data openly available, so here goes.

Our approach to the project was to have a dedicated project manager (me) working 0.5 FTE, using the skills of Pete Johnston for the transformation to Linked Data and the skills of Chris Keene (Technical Development Manager for the Library) when required. This meant we were all dedicated to our tasks and that someone was on top of the administration part of the project, as well as researching the licence and talking/presenting to groups and stakeholders whilst the technical transformation was taking place. This was a good use of time and resources and provided a bridge between the two sides.

We made a decision early on that we did not have time within the project allocation to restructure the MOA data prior to transformation as we would have liked, but we did work through 75% of it, expanding name and organisation abbreviations to allow ways into the data. If we had restructured the data within the CALM database, putting dates in the date field and separating out title and description, this would have added at least another month to the project. It perhaps would have meant less tweaking to the stylesheet that Pete made for the LOCAH project, but all worked out in the end as we approached it from a different angle, using lookup lists of keywords and people (see earlier blog posts here and here).

Benefits

The benefits of open data are harder to quantify. We are excited by the potential uses of our data outside of the archive searchroom, and one of the reasons we have used the ODC-PDDL is so that we can be as open as possible and see what happens. The success of this project also means that open data is on the agenda in the Library (see Chris’s blog post).

Benefits for the Keep: cataloguing guidelines

I have reported back to stakeholders from the Keep as we need to look into how we can share our data and provide resource discovery of all our collections for visitors to the Keep. Having had a close look at our catalogue data for the project, we are able to provide recommendations that will hopefully make it easier to export, share and transfer our data to existing or new systems. We have created some in-house cataloguing guidelines, and the following guides were produced by me and a colleague, Adam Harwood, who is currently cataloguing the University of Sussex Collection.

  • CALM_ISADG_Collection level This document maps the required ISAD(G) fields to the CALM fields, with guidelines on how to populate the fields. We have also included the fields required for export to EAD using the Archives Hub report in CALM.

Our priority in this area is to concentrate on our existing collection level descriptions and any new catalogue component records that we create. We will share these guidelines with colleagues from the Keep in the next few months.

Following in our footsteps

Wednesday, July 6th, 2011

Question: If others wanted to take a similar approach to your project, what advice would you give them?

Our advice at the start would be:

1. Get your data ready. We are working on our catalogue data to make it more structured so that we can be ready to export to other formats and make it more portable. Regardless of whether it becomes Linked Data in the future, we are getting ourselves ready. This is also probably the most time-consuming aspect. From personal experience, once you start looking at your catalogue data, you’ll find lots of things that you want to change or are missing or don’t make sense so the work starts to grow…

2. Are you in a position to licence your data? We chose the catalogue data of the Mass Observation Archive as we were confident of its provenance, so we could make it fully open and available under ODC-PDDL. This will hopefully allow the greatest flexibility for people wanting to use the data and fits with the ethos of the project and the JISC Discovery strand.

3. Find out about other similar projects! We at SALDA realise the value of these blog posts to anyone wanting to do a similar project to SALDA. We followed in the footsteps of the LOCAH project and were able to use their stylesheet and experience in transforming archival data into Linked Data. We are working with Pete Johnston from Eduserv, whose knowledge and experience is invaluable. You can see his contribution to the blog here.

4. Find examples of Linked Data in use, in human-readable format, so that you can show stakeholders, colleagues and friends what it is that you are on about. I use the BBC wildlife pages and how they link to Animal Diversity Web.

Report from stakeholder meeting

Wednesday, June 15th, 2011

On 31st May, we held a meeting for stakeholders and other interested parties, to talk about the SALDA project and its impact on future developments. Attending the meeting were:

Karen Watson SALDA Project Manager, University of Sussex
Fiona Courage Special Collections Manager, University of Sussex
Jane Harvell Head of Academic Services, University of Sussex
Chris Keene Technical Development Manager, University of Sussex
Richard Fisher Business Analyst – ICT Services, East Sussex County Council
Elizabeth Hughes County Archivist, East Sussex Record Office
Jenny Hand Knowledge and Information Manager, Royal Pavilion and Museums

The Keep is only two years away so naturally the focus is on finding systems that support all the partners and enable researchers to search and use the collections. Transforming all our data into Linked Data would enable cross searching and enhanced resource discovery, but there is an issue of time. However, there was a general feeling in the room that the data preparation we have identified, which will allow mapping and easier transference, will make us ready for other solutions or to embark on a linked data exercise in the future.

Data potential

It was good to discuss the different ways that our collective data could be used outside of the reading room. There is potential for mobile apps, hack days and contributions to other projects – Brighton and Hove Museums already contribute to Culture Grid. SALDA has enabled us to see this potential with the added benefit that any enhancements and changes that we make to our catalogue data will improve resource discovery for researchers within our existing systems.

I’ve said before that a lot of what we are talking about in terms of changes to our data is quite basic stuff (date in the date field etc.) and a colleague pointed out that libraries had these conversations years ago. It is the diverse and complicated nature of archives that means that a one-size-fits-all approach is difficult to achieve, so we are looking at the minimum we need to do, with our strategic goal being that we want to be ready to export / map / transfer our data and make it as portable as possible.

Local data for local people

Monday, June 6th, 2011

This is to add to Pete’s post on the transformation of our data. The SALDA project is really searching for a framework or a set of tools to enable us to transform our other archive collections into Linked Data. What we have discovered so far is that there is a model that we can apply to the data, based on the LOCAH model, but there is some local tweaking that needs to be done due to the structure of our data.

Prior to 2009, our catalogue data was in HTML lists on our website or printed lists in our reading room. We imported the catalogues into our CALM database in summer 2009 and most information went into the title field. This meant that when we then exported the data to EAD there were no separate fields for date or description. I then revealed to Pete, with my head in my hands, that we don’t use access points, and this is what the LOCAH process was based around. Pete was optimistic in his outlook, saying that there were good points about our data to focus on.

  • it was consistent in that it was all from one data provider
  • it was consistent in the format of the date and where it appeared in the data (albeit not in the date field)
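Because the date sits in a consistent place, it can in principle be pulled out of the title field mechanically. The following is only a rough sketch of that idea, not the stylesheet Pete used: it assumes the date is a year or year range at the very end of the title (as in "Clothes in Dance Halls 1938-40"), and the function name is made up for illustration.

```python
import re

# Assumption: the date is a four-digit year, optionally followed by a
# hyphen and a two- or four-digit end year, at the very end of the title.
TRAILING_DATE = re.compile(
    r"^(?P<title>.*?)[\s,]+(?P<date>\d{4}(?:-\d{2,4})?)$"
)


def split_title_and_date(raw_title):
    """Split a combined title into (title, date); date is None if not found."""
    cleaned = raw_title.strip()
    match = TRAILING_DATE.match(cleaned)
    if match is None:
        # No trailing date in the assumed format: leave the title untouched.
        return cleaned, None
    return match.group("title"), match.group("date")


print(split_title_and_date("Clothes in Dance Halls 1938-40"))
print(split_title_and_date("Topic Collections"))
```

Titles that do not end in a date are returned unchanged, so anything the pattern misses can be picked up by hand.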

We decided then to think about other ways into the data. I provided Pete with 28 names out of the data in authorised form using National Register of Archives rules. I was able to confirm that these were definitely those people, so when it says “Churchill” in the data, it is:

Churchill, Sir Winston Leonard Spencer (1874-1965)

Knight, prime minister and historian

not Churchill Insurance, Churchill College, etc.

I also provided 100 or so keywords that appeared in the data, covering subjects from air raids to sex and including places, organisations (Labour Party, Communist Party), events (the Coronation in 1953) and wider concepts like class, family, education and death.
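As a rough illustration of how a lookup list like this gives a way into the data, here is a minimal sketch in Python. It is not the actual transformation (that was done with a stylesheet); the dictionary and function names are hypothetical, and only the Churchill entry is taken from this post.

```python
import re

# Hypothetical lookup list: the form found in the catalogue text mapped to
# the authorised form of the name. Only this one entry comes from the post.
NAME_LOOKUP = {
    "Churchill": "Churchill, Sir Winston Leonard Spencer (1874-1965)",
}


def find_names(text, lookup=NAME_LOOKUP):
    """Return the authorised form of every lookup name that appears in text."""
    found = []
    for surface_form, authorised_form in lookup.items():
        # \b keeps "Churchill" from matching inside a longer word.
        if re.search(r"\b" + re.escape(surface_form) + r"\b", text):
            found.append(authorised_form)
    return found


print(find_names("Reactions to Churchill broadcast"))
```

This only works because each name was checked against the data first: the lookup asserts that every "Churchill" in this collection is the same person, which would not be safe for an unchecked dataset.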

Future proofing our data

Realising the limitations of our data as it stands in our archival management system has made the team at Sussex really look at how we catalogue things. We need to future proof our data so that we can export or transform it, or map it across to other systems, more easily. We are compiling cataloguing guidelines to ensure that all our collection level records are ISAD(G) compatible and that certain fields are always populated in our component records. This is not a small change and it will take a long time to modify 67,000 existing records. This has been an unexpected by-product of the SALDA project and one that we can’t ignore.

Musings about URIs

Wednesday, April 20th, 2011

Choosing a base for our URIs. Easy right? The task was recently allocated to me. Should take all of 5 minutes and then I can sit back and sip my coffee at a job well done. Simples.

Annoyingly, not quite yet.

First thing: The URIs will resolve to an actual web server. We’ve got loads of servers, hostnames and aliases (CNAMEs), but which to use? We need a server and hostname that will be stable and permanent. In this rapidly changing world of changing services and consolidation of servers (and a move towards hosted services), what’s best to use?

Two potential base URI options:

  • data.lib.sussex.ac.uk
  • www.sussex.ac.uk/library/

The former was my immediate first choice, it fits in with the common naming practice ‘data.organisation.tld’ (admittedly with ‘lib’ in the middle, I don’t think we are ready to roll out an institutional data service just yet).

The latter was a consideration as it built on an already known and trusted URI on an institutionally embedded service: our University website (and corresponding infrastructure). Both URL and service are going to be around for the foreseeable future. What’s more, they don’t require the Library to maintain any additional infrastructure. However, this didn’t fit in with the common convention and might clash with other Library URLs. And there’s a risk: if the University moved to a new Content Management System it might break our URIs, especially if the CMS required full control of the ‘www.sussex.ac.uk’ namespace. Plus, it just doesn’t look cool.

So current thinking is http://data.lib.sussex.ac.uk/ – We can run it off a server here in the Library, which runs Apache and a number of other undemanding web services (wikis etc.). This does require the Library to maintain it, which, to be blunt, might be an issue if I leave. But there is nothing stopping us working with our IT Services and moving data.lib.sussex.ac.uk to a centrally run (or even third party hosted) server in the future.

Second issue…

Do we need to create a ‘Mass Observation’ name space under http://data.lib.sussex.ac.uk/ e.g. http://data.lib.sussex.ac.uk/moa/ ?

In a nutshell, keeping it as http://data.lib.sussex.ac.uk/ keeps a simple URI, and allows us to merge other datasets into the same ‘pool’ (I don’t think pool is part of the Linked Data vocabulary but never mind).

However the risk is that should we wish to create more Linked Data sets in the future, whether for the Library Catalogue, the Institutional Repository or other Special Collections, how can we be sure the various identifiers, names and reference numbers will not clash between the different datasets? Will a Library Catalogue and Archive metadata be strange bedfellows?

I’ve been discussing this with Pete Johnston from Eduserv who has provided a lot of advice and things to consider. An example which came up in our discussions was:

http://data.lib.sussex.ac.uk/id/person/nra/churchillsirwinstonleonardspencer1874-1965knightprimeministerandhistorian

Would it not be desirable to have the above as a URI which could provide a description (of Winston Churchill) and links to both the MOA, other archives and mentions in the Library Catalogue, all from one URI ID?
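For illustration, the path in the example URI looks like the authorised name and its epithet with everything except letters, digits and hyphens removed, all in lower case. That rule is inferred from this one example and is an assumption, not a documented convention; the function name below is hypothetical.

```python
import re

PERSON_BASE = "http://data.lib.sussex.ac.uk/id/person/nra/"


def person_uri(authorised_name, epithet=""):
    """Build a person URI from an authorised name and optional epithet.

    Assumed rule: lower-case the text and keep only a-z, 0-9 and hyphens.
    """
    slug = re.sub(r"[^a-z0-9-]", "", (authorised_name + epithet).lower())
    return PERSON_BASE + slug


print(person_uri(
    "Churchill, Sir Winston Leonard Spencer (1874-1965)",
    "Knight, prime minister and historian",
))
# http://data.lib.sussex.ac.uk/id/person/nra/churchillsirwinstonleonardspencer1874-1965knightprimeministerandhistorian
```

Because the URI is derived only from the authorised form, the same person mentioned in the MOA, another archive or the Library Catalogue would get the same identifier, which is the appeal of the single name space.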

My ignorance in this area is high, but my understanding is that these URIs will probably serve up information from elsewhere (i.e. our hopefully soon-to-be Talis Platform Store) and present it. Will having one name space confuse things, as the URI will need to fetch data from potentially various stores and sources to present to the requester (human, computer or otherwise)?

Perhaps another option is to keep to one namespace, but separate it out into collections further ‘down’ e.g.

http://data.lib.sussex.ac.uk/id/document/moa/1234

http://data.lib.sussex.ac.uk/id/document/anotherarchive/1234

Is that an option, or will it break conventions?
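To make the idea concrete, here is a small sketch of minting document URIs with a collection segment, following the two example URIs above. The function name is hypothetical, and "anotherarchive" is just the placeholder used in the example.

```python
from urllib.parse import quote

BASE = "http://data.lib.sussex.ac.uk"


def document_uri(collection, reference):
    """Build a document URI that keeps each collection in its own path segment."""
    # quote() percent-encodes characters that are unsafe in a URI path segment.
    return "{}/id/document/{}/{}".format(
        BASE, quote(collection, safe=""), quote(str(reference), safe="")
    )


moa = document_uri("moa", 1234)
other = document_uri("anotherarchive", 1234)
print(moa)
print(other)
```

The same reference number in two collections yields two different URIs, so identifiers cannot clash, while everything still lives under the one data.lib.sussex.ac.uk name space.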

I should say at this point that we are using Designing URI Sets for the UK Public Sector as a guide for creating URIs and are trying to stick to their guidelines as much as possible.

However: I am torn between the grand, right(?), more technical nirvana of one name space and the less risky approach of keeping the MOA data in its own name space silo (you can’t have an open data blog post without the s word).

The problem ultimately is that I am still so very new to this I find it hard to think about what the issues may be, or what the right and wrong approaches are.

So, I welcome your expertise, thoughts and insights. What would you recommend? What are we not thinking about that we should? What problems are we making for ourselves down the line? What is the right approach? And (building on that last question) what is the right approach considering our somewhat limited resources and time?

So ladies and gentlemen, your thoughts please? Please.

Licencing part 2

Monday, April 11th, 2011

Thanks to Alexandra and Owen for their thoughts, and please see Chris Keene’s comment also. This issue was never going to be straightforward and any discussion about licencing makes people edgy. Perhaps not the licence itself but the idea that someone could use someone else’s work without asking or crediting them. As Alexandra says, this is possibly a separate issue, but it seems that by using CC-BY licences people are hedging their bets – you can use it but you have to say where you got it from, which is perfectly reasonable. There is also uncertainty about whether they have the right to licence the data in the first place.

I’ve discussed this with Fiona and she makes the point that in an academic context, we tell people every day how to reference the materials they are using and so attribution of data is in our very core. Making our catalogue records available as Linked Open Data with no insistence on attribution is contrary to what we do every day. However, naïve as it may sound, though we don’t insist that people attribute in this case, that’s not to say that they won’t.

We know we are diving in there and taking a risk; we don’t know how the data could be used in the future and what impact that will have. But someone has got to take that risk. We are confident that we are able to licence the data for use in the first place and we want to take the most open road. We don’t do it lightly.

It is perhaps the conflict between archivists and developers. As archivists we are naturally cautious and, as I said earlier, make attribution a key part of our work. Developers/technicians are much more used to putting things out there as open source – I’m assuming – would any developers like to comment?

Licencing our data

Tuesday, April 5th, 2011

We have decided to use an Open Data Commons Public Domain Dedication and Licence (PDDL) to licence our data once it is open and on the Talis Platform.

Key points of PDDL

  1. Recommended by JISC for collections of factual data
  2. Goal is to eliminate restrictions on the use of data so it can be used for any purpose, including commercial, and in combination with other data
  3. There is no requirement to attribute the source of the data
  4. The Licence makes the work – in our case the catalogue records of the Mass Observation Archive – permanently available to the public for any use of any kind.
    The line above in bold is the scary bit but also the main point of getting the data out there and we are lucky to be sure of its ownership and copyright.

Why we chose PDDL

PDDL is the standard for collections of non-personal factual data, which is what the catalogue records of the Mass Observation Archive are. The assumption is that we own the rights to this data, as the original creators were employees of the University of Sussex, so we are free to licence this data.

JISC guards against putting variants in licences for special requirements, for example no use of images, as “The introduction of variant terms into a ‘Creative Commons-like licence’ from a single institution may require those potential beneficiaries to pay for legal advice in order to understand the implications of the variation. The value of seeing and understanding a single licence across the web is lost, as every minor variation encountered increases the likelihood that the different licences will conflict when combined in some third party use case” (JISC rights and licencing). I don’t believe a variant is necessary as it is a collection of factual data.

The Mass Observation Archive

Monday, March 28th, 2011

We thought it would be useful to say a little bit here about what the Mass Observation Archive is to provide some context to the SALDA Project (and archivists love context).

The Mass Observation Archive specialises in material about everyday life in Britain. It contains papers generated by the original Mass Observation social research organisation (1937 to early 1950s), and newer material collected continuously since 1981. The Archive is in the care of the University of Sussex and is a charitable trust. We are working on the catalogue data from the early phase which encompasses the Second World War.

Mass Observation started in 1937 as a reaction to the abdication of Edward VIII. There is a history of Mass Observation available on the Mass Observation website. It is important to stress that it is the catalogue data we are making available, not the documents themselves.

This is an example of a Mass Observation Archive catalogue record using the web interface to our archival management system (CALM), which is called CALMView. A less blurry version is available here.

You can see that the hierarchy is represented above the item description, so “Observations made in the Locarno, Streatham between December 1938 and April 1939” is in File: Clothes in Dance Halls 1938-40, Subseries: Observations and interviews 1938-40, Series: TC18 Personal Appearance and Clothes, Section: Topic Collections, Collection: Mass Observation Archive.

Linked Data – What’s the point of that again?

Tuesday, March 22nd, 2011

A controversial title, but it neatly summarizes the feeling in the SALDA Project camp at the moment. Fiona Courage (Special Collections Manager and Curator of the Mass Observation Archive) and I are coming from an archive / library background and are very focused on the end result of the project, which should be a set of Linked Open Data comprising our Mass Observation Archive catalogue records. But we’re not really sure what the benefit of this is, so I wanted to ask the RDTF community and beyond for help. We know it is early days for the project and we’re likely to look back at this blog post in a few months’ time and answer our own question, but right now we seem to have hit a bit of a brick wall.

What we are realising is that, unlike a lot of projects with archives that I have done in the past, it is perhaps not the end result that is most important but the journey to get there. What we hope to achieve with SALDA is the skills and knowledge to make our catalogues Linked Data, and to use those skills and that knowledge to inform decisions about whether it would be beneficial to make all our data Linked Data. Our journey so far has taught us that we need to refine and review our catalogue data within our archival management system anyway, and the changes we are making now in preparation are opening up the data to enable better search results already. When the Mass Observation Archive was catalogued in the 1970s and 1980s abbreviations seem to have been all the rage, so Communist Party became CP and Conscientious Objectors became CO, to give a couple of examples. This was fine in a printed finding aid under the heading of Communist Party, and also fine on an HTML page that users scrolled through. It is not fine for keyword searching. This may seem very basic stuff, but I’m sure there are lots of archives out there that have records that make sense to the archivist and in the printed list, but will not be retrieved via a search engine. If resource discovery is our aim, then making our information clear and accessible is key. Following on from this basic idea of “finding what you search for”, is Linked Data a step on from this? Finding what you search for and a bit more?
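As a rough sketch of the abbreviation problem, the snippet below expands the two examples given above so that a keyword search for the full term would find the record. It is illustrative only, not how the changes were made in CALM: the names are hypothetical and a real list would be much longer.

```python
import re

# The two expansions mentioned in this post; a real list would be longer.
ABBREVIATIONS = {
    "CP": "Communist Party",
    "CO": "Conscientious Objectors",
}

# Match abbreviations only as whole, upper-case words.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(abbr) for abbr in ABBREVIATIONS) + r")\b"
)


def expand_abbreviations(title):
    """Replace each known abbreviation in a title with its full form."""
    return _PATTERN.sub(lambda match: ABBREVIATIONS[match.group(1)], title)


print(expand_abbreviations("CP meeting, Bolton"))
```

An abbreviation can mean different things in different contexts, so output like this would still need checking by someone who knows the collection rather than being applied blindly.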

CALM records into Encoded Archival Description (EAD)

Monday, March 7th, 2011

I am working on getting our catalogue data ready for export from our archival management system CALM. We are using the Archives Hub EAD 2000 report which exists in CALM. The following fields are now in our collection level record:

Language “Eng”

Creator Name “Mass Observation Archive”

EHFD publisher “University of Sussex Library”

Country Code “GB”

Origination “Mass Observation”

Repository Code “181”

Guidelines for required fields and common problems with the EAD report are available from the Archives Hub here. For the future, we will need to add these fields to all our collection level records to make them EAD ready.
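When it comes to adding these fields to all our collection level records, a simple check for missing values could help. The sketch below assumes each record is available as a Python dictionary keyed by the field labels listed above; that representation, and the function name, are assumptions for illustration and not how CALM exposes its data.

```python
# Field labels as listed in this post; the real CALM field names may differ.
REQUIRED_FIELDS = [
    "Language",
    "Creator Name",
    "EHFD publisher",
    "Country Code",
    "Origination",
    "Repository Code",
]


def missing_fields(record):
    """Return the required fields that are absent or blank in a record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


moa_record = {
    "Language": "Eng",
    "Creator Name": "Mass Observation Archive",
    "EHFD publisher": "University of Sussex Library",
    "Country Code": "GB",
    "Origination": "Mass Observation",
    "Repository Code": "181",
}
print(missing_fields(moa_record))
```

An empty list means the record has every field populated; anything else is a list of what still needs filling in before the EAD report is run.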

A quirk with this transfer to EAD is that it is a report, not an export so you cannot highlight a selection of records (called a hitlist in CALM). The Mass Observation Archive is over 23,000 records and is causing CALM to freeze. Very quick and helpful advice from the CALM helpdesk led us to turn off the server and then run the report which seems to work. This method is less good for the rest of the Special Collections staff in the office who need to use CALM and our users who access it through the web interface, so I am rationing my EAD tests to less busy times.