We have some confidential data for our research. Currently, we use an encrypted hard drive for storing the data, and any researcher who needs the data copies it off the drive.

However, we have no way of knowing where copies of the data exist at any given point in time; it is generally up to the researchers to manage their own copies. Also, the external hard drive needs to be locked up, and it may or may not be available when needed because someone took it or the person managing it was out of the office.

EDITED FOR CLARITY:
I'm looking for a way to share the data with the section over a local area network (even in unencrypted form), with the caveat that users have to log into the file server using their institution emails. Once they download a file, it is logged somewhere that person X took a copy of file Y.

Once they have taken the copy, we don't police them about it. It is more self-reporting: once they've used the file and deleted it, they log back in and sign off somewhere that they deleted their copy. Once in a while, we (or the server) check whether any files have not been reported back and email the person to confirm that they have in fact deleted their copy.
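The check-out / sign-off / reminder cycle described above can be sketched as a small ledger. This is a minimal illustration, not an existing tool; the table and function names are all hypothetical:

```python
# Sketch of a check-out / check-in ledger for data copies.
# All names here are hypothetical, invented for illustration.
import sqlite3

def open_ledger(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS checkouts (
        user TEXT, file TEXT,
        taken_at TEXT DEFAULT CURRENT_TIMESTAMP,
        deleted_at TEXT)""")
    return db

def check_out(db, user, file):
    """Record that `user` downloaded a copy of `file`."""
    db.execute("INSERT INTO checkouts (user, file) VALUES (?, ?)",
               (user, file))

def sign_off(db, user, file):
    """Self-report that the copy has been deleted."""
    db.execute("""UPDATE checkouts SET deleted_at = CURRENT_TIMESTAMP
                  WHERE user = ? AND file = ? AND deleted_at IS NULL""",
               (user, file))

def outstanding(db):
    """Copies not yet signed off -- the candidates for a reminder email."""
    return db.execute("""SELECT user, file FROM checkouts
                         WHERE deleted_at IS NULL""").fetchall()
```

A periodic job would then call `outstanding()` and send the reminder emails from that list.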

I should reiterate that this is not a matter of trust but more of a check-in / check-out thing.

I know I can probably do this by hosting a web server on the machine holding the files; users would have to log in to access the web page, which could then track which files they take. But I have a gut feeling that there must be a more elegant way of doing this than files over HTTP, and that something out there probably already does something similar.

Also, any thoughts on reliability? Can I have the data backed up automatically or synced without using a cloud service that may potentially leak (or with a private cloud service that is reasonably secure and recommended for this application)?

2 Answers

Logging who accesses data is possible, but tracking what they do with it once they have accessed it is near impossible. This is essentially the basic DRM problem: on one level, you want to give legitimate users access to the data, but on another, you want to control what they do with that access. Unless you have complete control over the systems they use, this is pretty much impossible. To some extent, you need to trust your researchers. You can help them do the right thing, but based on what you have outlined, I don't think there is any way to guarantee that copies of the data are destroyed and that no copies of those copies were made.

There are various approaches you can use, but they all come with significant overhead, impact usability, and can be very expensive. For example, you could provide individually encrypted copies of the data, individual decryption keys, and a formal process for researchers to obtain copies. The researcher would then need their key to decrypt the data before viewing it. However, this may greatly complicate their ability to use the data with various research/analysis systems; it requires a private key management system and additional support for those who have problems using their keys, lose their keys, and so on. If you provide the sort of flexibility that allows researchers to feed the data into other applications for analysis, you are also providing them with the means to copy it. Once they can do that, you have no way of knowing when, or if, all copies have been destroyed. Even knowing with any certainty that data has been destroyed can be tricky, especially for those without a good understanding of storage: many users still believe that once a file is deleted, the data is gone.

Obtaining secure backup of your data is less of an issue. There are many solutions that provide various levels of security for data backup. There is no requirement to decrypt your data to have it backed up, and provided you use a reputable service provider, the risks are probably no greater than those you already face by having the data available on the network. Is the external drive locked away in a secure location, such as a safe, when not in use? If not, you likely have a higher risk of someone just walking out with it in their pocket!

I suspect you really need to do a good risk analysis and work out what your top risks are. How sensitive is your data in reality? What is the potential damage should the data be compromised? What is the potential benefit to unauthorized people accessing the data? Are you protecting against active attack or against more accidental loss, such as a researcher leaving their laptop on the bus? Is it possible to modify the data so that sensitive information is reduced without impacting research requirements?

I have worked in areas involving medical research where we had different classes of data. The less sensitive data had all information that could be used to personally identify an individual removed, replaced with some other key, or randomized. The specific medical details still represented real-world data but could not be directly linked to a specific individual. The 'master' data contained all sensitive information but required additional process and approvals before it could be used for research.

Thanks for the detailed answer. I think I should probably edit my question because I seem to be giving the wrong signals. Trusting the users isn't really the issue; we trust them 100%. This is supposed to be more of a reminder service, like we send out an email saying 'Hey, you took that file out last month, remember to delete it so we can cross it off the list'. This is just for cleanup's sake, so there isn't data lying around that nobody is going to use.
– Saad Farooq, Nov 8 '12 at 23:54

The problem seems a bit unclear to me. If I understand the request correctly, you want to keep track of the copies made from the central store, but the end goal is not mentioned. I don't know of any software able to do this, and it would be a pretty daunting task (it's quite hard to stop someone booting into Linux off a USB drive and making a copy, thus evading any tracking software).

If the goal is protecting the data from outsiders, and you trust the researchers, the following should provide "good enough" security, while remaining usable:

The central store is encrypted. Only the person responsible for it has the key.

Any copies made are to be encrypted at rest (e.g. TrueCrypt), no exceptions, with a different key known to the researcher.

Encryption keys for copies must meet reasonable standards, and be guarded well.

Thus any loss of hardware does not also result in a loss of data confidentiality.
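To illustrate the "encrypted copies with per-researcher keys" step, here is a minimal file-level sketch using the third-party Python `cryptography` package. This is a stand-in for a TrueCrypt-style container, and it is an assumption that file-level encryption fits your workflow:

```python
# Toy illustration of a per-researcher encrypted copy at rest.
# Requires the third-party `cryptography` package (pip install cryptography);
# a real deployment would more likely use full-volume/container encryption.
from cryptography.fernet import Fernet

def make_encrypted_copy(data: bytes):
    """Encrypt one copy with a fresh key known only to the researcher."""
    key = Fernet.generate_key()         # the researcher keeps this key
    token = Fernet(key).encrypt(data)   # ciphertext is safe if hardware is lost
    return key, token

def read_copy(key: bytes, token: bytes) -> bytes:
    """Decrypt the copy; fails without the right key."""
    return Fernet(key).decrypt(token)
```

If the laptop holding `token` is lost, the data stays confidential as long as the key was stored separately.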

If the goal is protecting the data from rogue researchers, it can't be done. There are too many ways for an attacker who has the hardware and decryption keys to make copies of the data without being discovered.

Wuala and SpiderOak claim zero-knowledge file storage (the keys reside only on the client and are never sent to the provider). I think Cubby makes a similar claim (see their DirectSync option). Me.ga (successor to the now-defunct Megaupload) also claims that level of security, but won't launch until mid-Jan 2013.

Thanks for the answer. The protection isn't against rogue researchers; it's against involuntary distribution. We only want to know who has the file. The idea of doing this over the LAN is precisely so that the file is not served via a plain file share but through some file-serving mechanism that logs who took it. Again, this is not about policing them; it's about being able to check off once a copy has been deleted, or to send reminders to each person to delete their copies once the research is archived. I hope this makes it clearer.
– Saad Farooq, Nov 2 '12 at 3:57

The easiest way is still for everyone to get it on request from a central authority, who can then record the copy. Otherwise, you would have to build something that checks through, say, web server or Samba logs and matches each request to a person.
– scuzzy-delta, Nov 4 '12 at 21:03
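The log-matching route in that comment could start from something like the sketch below. It assumes an Apache-style access log with the authenticated username in the third field; the exact log format is an assumption about your server's configuration, so the regex would need adjusting:

```python
# Sketch: match authenticated download requests to people by parsing an
# Apache "common log format" access log. The format is an assumption;
# adjust the regex to whatever your server actually writes.
import re

# host ident authuser [date] "request" status bytes
LOG_LINE = re.compile(
    r'\S+ \S+ (?P<user>\S+) \[[^\]]+\] "GET (?P<path>\S+)[^"]*" 200 \S+')

def downloads(log_lines):
    """Yield (user, path) for each successful, authenticated GET request."""
    for line in log_lines:
        m = LOG_LINE.match(line)
        if m and m.group("user") != "-":
            yield m.group("user"), m.group("path")
```

Feeding the resulting (user, path) pairs into a ledger would give you the "person X took file Y" record the question asks for.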

Hmm, I didn't think it would be that hard. Our uni Linux systems need us to log in; I assumed setting up a system to log who copies what would be trivial after that.
– Saad Farooq, Nov 5 '12 at 0:58