Research Data Security

The security of information at Princeton University is directed by the Data Governance Steering Committee, which oversees the actions of the Privacy Policy Committee, the Information Security Policy Committee, and the Data Management Advisory Group. Researchers in the biomedical as well as social and behavioral sciences are expected to be proactive in designing and performing research to ensure that the dignity, welfare, and privacy of individual research subjects are protected and that information about an individual remains confidential. The protection of research data is a fundamental responsibility, rooted in regulatory and ethical principles, and should be upheld by all data stewards.

The Research Data Security Guidelines pertain to researchers and research team members who obtain, access, or generate research data, regardless of whether or not the data are associated with funding. These guidelines help Princeton University researchers understand the sensitivity of the data they are collecting and develop appropriate data protection plans, know the appropriate media and places to store data, understand how and when to dispose of data, prepare their research data for public use, keep research data secure while traveling, and know what to do in the event of theft, loss, or unauthorized use of confidential research data. These guidelines can also be used as part of the data management planning process in conjunction with other tools such as the DMPTool to help meet federal funding agency requirements and prepare research data for public use.

Investigator Responsibilities

Anyone who conducts research with human subjects at Princeton University has a responsibility to protect the data collected and used for their research. This is especially important when the data (a) contain personal identifiers or enough detailed information that the identity of participating human subjects can be inferred, (b) contain information that is highly sensitive, or (c) are covered by a restricted use agreement. The guidelines below are intended to help researchers understand when and how to use the most effective and efficient methods for storing and analyzing confidential research data so that those data are adequately protected from theft, loss or unauthorized use.

As a general practice, researchers working with human subjects should avoid collecting personally identifiable information (PII) whenever possible. Perhaps the best way to protect a research subject’s identity is by not knowing that identity in the first place. However, in many cases, the collection of PII is necessary for carrying out a research project.

There are many ways in which PII arises in the normal course of conducting research. If subjects sign informed consent agreements, their signatures are identifying information that must be securely stored. If subjects are awarded a prize or paid for their participation in a study, the researcher needs enough identifying information to enable delivery of the payment or prize. In some cases, researchers may need to merge data from different sources (e.g., survey responses and biological data), a step that can only be carried out with some form of personal identifier. Likewise, longitudinal studies usually require storage of detailed personal identifiers so that subjects can be contacted for subsequent interviews over long periods of time.

What is Personally Identifiable Information (PII)?

PII is defined as information that is uniquely associated with an individual person. The HIPAA privacy rules identify 18 items (such as name, mailing address, email address, Social Security number, etc.) that are considered forms of PII. While the list is extensive, it is not necessarily exhaustive.

Inferring the Identity of Research Subjects

It is sometimes possible to infer the identity of someone participating in a research study even when the data for the study do not contain any explicit identifiers such as those listed above. For example, by cross-referencing certain variables such as state of residence, occupation, education, age, sex, and race, it might be possible to infer the identity of a research subject. As such, the absence of personal identifiers from a research data set does not obviate the need for secure storage and protection. Similarly, when research data sets are being made available for public use, the data need to be stripped of all personal identifiers and coded in a manner that does not allow anyone to infer the identity of a subject. This is often a difficult task because the identity of individuals can be inferred by using data sets from multiple sources. The proliferation of public use datasets and publicly available records has increased the odds of being able to infer someone's identity by merging multiple data sources, a phenomenon known as the mosaic effect. Researchers who produce or share anonymous public use data files need to consider whether the data they are using or releasing could be used in combination with other publicly available data to infer individual identities. Researchers are encouraged to consult the Institutional Review Board to determine if their proposed research involves human subjects.
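The risk described above can be checked mechanically before releasing a file. The sketch below, using entirely hypothetical records and variable names, counts how many records share each combination of quasi-identifiers; any combination that appears only once marks a record that could potentially be singled out by linking against an outside source.

```python
from collections import Counter

# Hypothetical de-identified records: no names, but several
# quasi-identifiers (state, occupation, age band, sex).
records = [
    {"state": "NJ", "occupation": "teacher",   "age_band": "30-39", "sex": "F"},
    {"state": "NJ", "occupation": "teacher",   "age_band": "30-39", "sex": "F"},
    {"state": "NJ", "occupation": "astronaut", "age_band": "50-59", "sex": "M"},
    {"state": "PA", "occupation": "teacher",   "age_band": "30-39", "sex": "F"},
]

QUASI_IDENTIFIERS = ("state", "occupation", "age_band", "sex")

def smallest_group_size(rows, keys=QUASI_IDENTIFIERS):
    """Return the size of the rarest quasi-identifier combination.

    A result of 1 means at least one record is unique on these
    variables and might be re-identifiable through linkage with
    other publicly available data (the mosaic effect)."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return min(counts.values())

if smallest_group_size(records) < 2:
    print("Warning: at least one record is unique on the chosen "
          "quasi-identifiers and may be re-identifiable.")
```

In this toy example the lone astronaut record is unique, so the check fires. Which variables count as quasi-identifiers, and what minimum group size is acceptable, are judgment calls to make with the IRB before release.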

Highly Sensitive Data

Research data are considered highly sensitive when there is a heightened risk that disclosure may result in embarrassment or harm to the research subject. Data on topics such as sexual behavior, illegal drug use, criminal behavior, crime victimization or mental health are considered highly sensitive. Information that could have adverse consequences for subjects or damage their financial standing, employability, insurability, or reputation should be adequately protected from public disclosure, theft, loss or unauthorized use, especially if it includes PII.

Restricted Use Agreements

Many researchers at Princeton University receive data from outside agencies or institutions that are subject to restricted use agreements (also called data sharing agreements). These are legal contracts that impose restrictions on the researchers' use of the data and sometimes include detailed procedures for secure storage, restricted access, and analysis of the data. As part of the agreement, certain government agencies may also visit the researcher (or "licensee") to conduct a compliance audit. In other cases, restricted use agreements may simply prevent public release of the data or sale of the data to a third party. Where an agreement does not specify data security procedures, researchers must still consider the need to keep their data secure so that the potential for harm to any individuals or organizations is minimized. When faced with two sets of data security requirements (e.g., one from the Princeton University IRB and one from a restricted use agreement), the researcher should always default to the requirements with the higher standards for data protection.

PII Data from Open Public Records

Research that uses open public records containing PII (e.g., voter registration files, telephone directories, occupational license registries, property tax records, firearms registries, criminal records) may not meet the regulatory definition of research involving human subjects. However, researchers are advised to use caution when dealing with public records data that contain sensitive information. Merging and publishing sensitive information from publicly available records has the potential to embarrass or harm the individuals described in the records, even though the information is already public. Researchers are encouraged to consult the Institutional Review Board to determine if their proposed research involves human subjects and whether the risk of harm has been adequately minimized.

Public Use Data Files

Public use data files are files from which all PII has been removed and the data are coded in such a way as to make identification of research subjects extremely unlikely. Research that uses public use data sets containing no PII may not meet the regulatory definition of research involving human subjects. However, some restricted use agreements require local IRB review. As such, researchers are encouraged to consult the Institutional Review Board to determine if their proposed research requires IRB review.

The Difference Between Anonymous and Confidential

It is important to understand the differences between the terms anonymous and confidential as they are used in different phases of a research study.

When subjects are recruited for a research project, their involvement can be described as anonymous if it is impossible for anyone (even the researcher) to know whether or not those individuals participated in the study. For example, participation in an online survey that cannot be linked in any way to the individual would be considered anonymous. However, when participation is confidential, the research team knows that a particular individual has participated in the research and is obligated to protect that information from disclosure to others outside of the team, except as clearly noted in the consent document. Thus, in this example, if study participants sign a consent form, the consent form documents the subject's participation in the study and must be treated as a confidential document, even if there is no way to connect the research data to the subject's identity.

In terms of the research data that are produced by a study, those data are anonymous if no one, not even the researcher, can connect the information back to the individual who provided it. The data do not contain any PII and it is not possible to infer the identity of anyone in the study.

When data are confidential, there continues to be a link between the data and the identity of the individual who provided it. The link usually takes the form of a study ID number that is common to both the de-identified data and the corresponding list of names or other types of PII. The research team is obligated to protect both the PII and the links from unintended disclosure according to the terms of the protocol approved by the IRB and the terms of the informed consent document. To protect against accidental disclosure, the subject's name or other identifiers should be stored separately from their research data and replaced with a unique code to create a new identity for the subject.
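The separation described above can be sketched in a few lines. The example below is a minimal illustration, not a prescribed procedure, and all field names and records are hypothetical: it splits each record into a key file (name mapped to a random study ID) and a de-identified data file, which must then be stored separately, with the key file under stricter protection.

```python
import secrets

def pseudonymize(rows, id_field="name"):
    """Split records into (key_map, deidentified_rows).

    key_map links each subject's identifier to a random study ID.
    It should be stored separately from, and more securely than,
    the de-identified research data it unlocks."""
    key_map = {}
    deidentified = []
    for row in rows:
        # Reuse the same study ID if this subject appears again.
        study_id = key_map.setdefault(row[id_field],
                                      "S" + secrets.token_hex(4))
        clean = {k: v for k, v in row.items() if k != id_field}
        clean["study_id"] = study_id
        deidentified.append(clean)
    return key_map, deidentified

# Hypothetical subject records containing PII.
subjects = [
    {"name": "Jane Doe", "response": "yes"},
    {"name": "John Roe", "response": "no"},
]
keys, data = pseudonymize(subjects)
# `keys` belongs in the secure key file; `data` is what the
# research team analyzes day to day.
```

Using `secrets` rather than sequential numbers avoids study IDs that leak enrollment order, which can itself be identifying in small studies.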