by Adam Harwood
It sounds like something from CSI: New York. And it’s something that I, an Archivist, have been doing for the last few months. No dusting off old manuscripts for me – digital forensics is my bread and butter. The reality, unfortunately, is not as exciting as it sounds, but maybe you, my library and archive colleagues, might be interested in this young yet burgeoning aspect of digital preservation.
On my desk currently sits a normal-looking laptop computer, but boot it up and you’ll be looking at an unfamiliar screen that is the first step in preserving all of Special Collections’ digital collections. I call it the digital forensics machine, and we will use it to transfer digital records off physical storage media like external hard drives and USB sticks and into a digital repository. Our digital repository doesn’t exist at the moment, but we can prepare our records now so that they are ready to transfer when we do get one. I’ll explain what a digital repository is in another post, where I’ll also explain what digital preservation is. For the moment I want to describe the digital forensics process and explain why we need to do it in the first place.
When Special Collections is given a digital record, we will most likely receive it on a USB stick, external hard drive or CD/DVD. We’ve got quite a few of these sitting in The Keep’s strong rooms right now. And that is where they stay. We create a catalogue record for each one, much like we would for a physical item, and if a researcher wants to see it, the only option they have is to come into The Keep, where we could plug it into a university laptop so they could browse the files on the storage device. The problem with this is that modern operating systems like Windows 10 can, by default, change the information stored with the files on a storage device when you plug it in and access them. Strange but true. Unknown to most users, the operating system will update the files on the external device to record the last time and date they were accessed. Let’s imagine this happening to a physical record. Let’s say that each time a researcher read a handwritten letter from, say, Virginia Woolf to Leonard Woolf, they wrote the time and date they read it onto the letter itself. This is the equivalent of what currently happens by default with modern operating systems and the files they access!
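To make the "last accessed" idea concrete, here is a minimal sketch in Python of how a program can read the timestamps a file system keeps alongside a file. The function name `describe_timestamps` and the throwaway demonstration file are purely illustrative assumptions and are not part of the forensics machine's software.

```python
import os
import tempfile
from datetime import datetime, timezone


def describe_timestamps(path):
    """Return the last-accessed and last-modified times recorded for a file."""
    info = os.stat(path)  # reads the file's metadata, not its contents
    return {
        "accessed": datetime.fromtimestamp(info.st_atime, tz=timezone.utc),
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
    }


# Demonstration on a throwaway file (hypothetical example data).
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"example record")
    demo_path = handle.name

# Set both timestamps by hand to 1 January 2000, 00:00 UTC.
os.utime(demo_path, (946684800, 946684800))
stamps = describe_timestamps(demo_path)
print(stamps["accessed"].isoformat())
os.remove(demo_path)
```

The "accessed" value is the one at risk: it is the date that may be overwritten when a file on the device is opened on an ordinary computer.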
We don’t want this. We are an archive and we’re interested in the last time the creator of a record accessed it, not when the last researcher saw it. So we have a bit of a problem: if we plugged all our hard drives into a computer, the computer would change the last accessed times and dates of the files. The information we are interested in – when each file was last accessed by its creator – would be lost. This is why we need digital forensics. The digital forensics machine runs an operating system that won’t change the last access dates of files that it reads. It also allows us to make an exact copy of the whole of the storage device that the files are stored on – including old deleted files and blank space. We then do our archiving work on this copy, so we know the original files on the hard disk will always remain intact. This copy is known as a ‘disk image’.
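The core idea of a disk image, copying every byte rather than just the visible files, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `make_image` is a hypothetical helper, an ordinary file stands in for the storage device, and the sketch does nothing to protect access dates, which remains the job of the forensic operating system or a write blocker.

```python
import os
import tempfile

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time so large sources never fill memory


def make_image(source_path, image_path):
    """Copy every byte from source_path into image_path; return the byte count."""
    copied = 0
    with open(source_path, "rb") as source, open(image_path, "wb") as image:
        while True:
            chunk = source.read(CHUNK_SIZE)
            if not chunk:  # an empty read means the end of the source
                break
            image.write(chunk)
            copied += len(chunk)
    return copied


# Stand-in for a storage device: a file holding some data followed by blank space.
work_dir = tempfile.mkdtemp()
fake_device = os.path.join(work_dir, "fake_device.bin")
with open(fake_device, "wb") as handle:
    handle.write(b"record data" + b"\x00" * 4096)

image_file = os.path.join(work_dir, "device.img")
size = make_image(fake_device, image_file)
print(size)
```

Because the copy is made byte for byte, the blank space is carried across along with the data, which is what distinguishes an image from an ordinary file copy.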
We can run a virus check on this disk image and then put it in ‘quarantine’ for two weeks. We do this because virus scanners need to be updated regularly with definitions of the latest viruses out there. We wait two weeks for the virus checker to be updated and then run it again on the disk image. This way we can be much more confident that we have detected any viruses. This process is equivalent to a physical record going through quarantine in a freezer for two weeks to kill off any mould or bugs that might be present.
The forensics machine then allows us to extract information from the files en masse. It can tell us how many files of each format there are, and it will take a snapshot of the file directory. It also records all of these activities in a separate file, so we will always know exactly what has been done to the files – a key factor in maintaining the integrity of any digital record.
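As a rough illustration of this kind of bulk survey, the Python sketch below walks a folder, counts files by extension and builds a sorted listing of everything it finds. The name `survey` and the sample files are assumptions made for the example, and counting by file extension is only a crude proxy for real format identification.

```python
import os
import tempfile
from collections import Counter


def survey(root):
    """Walk a directory tree; return (counts per extension, sorted relative paths)."""
    counts = Counter()
    listing = []
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            # Lower-case the extension so .TXT and .txt are counted together.
            extension = os.path.splitext(name)[1].lower() or "(none)"
            counts[extension] += 1
            listing.append(os.path.relpath(os.path.join(folder, name), root))
    return counts, sorted(listing)


# Build a small throwaway folder to survey (hypothetical example files).
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "letters"))
for parts in (("notes.txt",), ("letters", "draft.TXT"), ("letters", "scan.tiff")):
    with open(os.path.join(root, *parts), "w") as handle:
        handle.write("example")

format_counts, snapshot = survey(root)
print(dict(format_counts))
print(snapshot)
```

Writing both results out to a dated text file would give a simple record of what was found and when.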
The machine will also generate what is called a ‘checksum hash’. This is a 32-character alphanumeric signature that is used to verify the integrity of a file. If a change is made to a file then the checksum will also change. The idea is that whenever a file is transferred between systems or storage devices, we can check that there hasn’t been a faulty transfer by comparing checksums before and after the transfer. This ensures that we will always know if there has been some kind of disk error, non-malicious meddling or, god forbid, malicious meddling with a digital file. If there has, we will still have the original file on the original storage device, or a copy of the disk image, to go back to.
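Here is a minimal sketch of a checksum in Python. It assumes the MD5 algorithm, because an MD5 checksum written in hexadecimal is exactly 32 characters long; the post does not say which algorithm the forensics machine actually uses, and the helper name `checksum` and the sample text are made up for the example.

```python
import hashlib


def checksum(data):
    """Return the MD5 checksum of some bytes as 32 hexadecimal characters."""
    return hashlib.md5(data).hexdigest()


# Two versions of a record that differ by a single full stop.
original = checksum(b"A letter to Leonard")
altered = checksum(b"A letter to Leonard.")

print(len(original))        # length of the signature
print(original == altered)  # do the two signatures match?
```

Even a one-character change produces a different checksum, which is what makes a before-and-after comparison useful.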
Once we have all this information we will transfer the files we want off the digital forensics machine and into the digital repository. The digital repository will compare checksums as part of the transfer process. We don’t have a digital repository at the moment, so for now I’m running a little app on my machine that compares checksums during transfer to the G drive. I’ve also got a device called a ‘write blocker’ that I plug into my machine, which prevents my Windows 7 operating system from changing the last accessed dates. All these processes have been adapted from the legal profession – hence the use of the term digital forensics.
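A checksum comparison during transfer might look something like the Python sketch below. It is not the app mentioned above; `file_checksum` and `transfer_and_verify` are hypothetical names, MD5 is assumed as the algorithm, and temporary files stand in for the real source and destination.

```python
import hashlib
import os
import shutil
import tempfile


def file_checksum(path, chunk_size=1024 * 1024):
    """Compute a file's MD5 checksum, reading in chunks to cope with large files."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_and_verify(source, destination):
    """Copy a file, then confirm the copy's checksum matches the original's."""
    before = file_checksum(source)
    shutil.copyfile(source, destination)
    after = file_checksum(destination)
    if before != after:
        raise IOError(f"Checksum mismatch for {source}: {before} != {after}")
    return after


# Demonstration with throwaway files (hypothetical example data).
work_dir = tempfile.mkdtemp()
source_file = os.path.join(work_dir, "record.txt")
with open(source_file, "wb") as handle:
    handle.write(b"example record")

copied_file = os.path.join(work_dir, "record_copy.txt")
verified = transfer_and_verify(source_file, copied_file)
print(verified)
```

If the two checksums ever differed, the function would stop with an error rather than let a faulty copy pass unnoticed.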
So now you know a little bit about what I get up to sat in content delivery, apparently not doing anything really library-related. I’ve compiled this process in collaboration with an academic in the humanities lab, and I hope that in the future it will allow us to preserve some of the research outputs created by humanities scholars, whether deposited in Special Collections or in our Research Data Repository. The process will form just the first part of our end-to-end digital preservation process, which will end with making our digital files available through our online catalogue. I’ll write about this in a future blog post.