The guidance below is intended to help you minimize disclosure risk when sharing data collected from human participants. If you use any of the following techniques to anonymize your data, please include this information in your documentation and README file. For transparency, it should be clear how the dataset was modified to protect study participants.
Before proceeding, please note that not all human participant data needs to be de-identified, or stripped of direct and indirect identifiers. Please review your consent form and prepare your data to share only what participants have agreed to share. If you are unsure whether you need to de-identify your data, please see the Portage help guide Can I share my Data? and consult with your institution’s Research Ethics Board. For help selecting a repository for your data, please see Portage’s Recommended Repositories for COVID-19 Research Data guide or consult with librarians at your institution to see if further support is available.
For help understanding any terms used in this document, please see Portage’s Glossary of Terms for Sensitive Data Used for Research Purposes. You may also wish to review Portage’s Human Participant Research Data Risk Matrix and Research Data Management Language for Informed Consent for more information.
You can also download this document as a PDF from Portage’s Zenodo Community.
- Identify and Remove Direct Identifiers
- Identify and Evaluate Indirect or Quasi-Identifiers based on Perceived Risk and Utility
- Considerations for Qualitative Data De-identification
- Brief Considerations for Social Media, Medical Images, and Genomics Data
- Appendix 1: Code for Checking K-Anonymity
- Appendix 2: Free de-identification software packages
- Appendix 3: Fee-based services for de-identification
Identify and Remove Direct Identifiers
Direct identifiers are those which place study participants at immediate risk of being re-identified. Unless explicit consent was received from study participants, they must be removed from any published version of your dataset. The following list is based on various sources, including guidance from major international funding agencies, the US Health Insurance Portability and Accountability Act (HIPAA) and the British Medical Journal. See Preparing raw clinical data for publication: guidance for journal editors, authors, and peer reviewers and List of 18 items considered under HIPAA to be identifiers.
Direct identifiers are:
- Names or initials, as well as names of relatives or household members
- Addresses, and small area geographic identifiers such as postal codes / zip codes
- Telephone numbers
- Electronic identifiers such as web addresses, email addresses, social media handles, or IP addresses of individual computers
- Unique identifying numbers such as hospital IDs, Social Insurance Numbers, clinical trial record numbers, account numbers, certificate or license numbers
- Exact dates relating to individually-linked events such as birth or marriage, date of hospital admission or discharge, or date of a medical procedure
- Multimedia data: unaltered photographs, audio, or videos of individuals
- Biometric identifiers including finger or voice prints, and iris or retinal images
- Human genomic data, unless risk was explained and consent to share data or consent for secondary use of data was received from study participants
- Age information for individuals over 89 years old
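Two items in the list above (exact dates and ages over 89) concern how precise a value is rather than whether it is present at all. As a minimal sketch only, and assuming your consent form and Research Ethics Board permit sharing a generalized value, the Python below keeps just the year of a date and reports ages of 90 and over as a single group. The field names and the "90+" label are hypothetical.

```python
from datetime import date

AGE_CAP = 90  # assumption: ages over 89 are reported as one open-ended "90+" group


def generalize_date(exact: date) -> int:
    """Keep only the year of an individually linked event date."""
    return exact.year


def top_code_age(age: int, cap: int = AGE_CAP) -> str:
    """Report ages at or above the cap as a single open-ended category."""
    return f"{cap}+" if age >= cap else str(age)


# Hypothetical record, for illustration only.
record = {"admitted": date(2020, 4, 17), "age": 93}
shared = {
    "admitted_year": generalize_date(record["admitted"]),
    "age": top_code_age(record["age"]),
}
print(shared)  # {'admitted_year': 2020, 'age': '90+'}
```

Whether a year alone is coarse enough depends on the study population; in a small or well-known group, even a generalized date may need to be treated as a quasi-identifier.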
How do I remove this information?
Removing direct identifiers from your data is relatively straightforward. You may either record this personal information in a separate document, spreadsheet, or database and link this to the other data points via a series of codes that can be removed before publishing, or choose to delete the identifying data points entirely at the end of the project. Refer to your consent forms to determine how to proceed. If you are unsure whether data can simply be unlinked or if it must be destroyed, consult your local Research Ethics Board.
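The two options described above (keeping identifiers in a separate linked file, or removing the link before publishing) can be sketched as follows. This is an illustration under stated assumptions, not a prescribed workflow: the column names, the sample roster, and the choice of random hexadecimal codes are all hypothetical.

```python
import secrets

DIRECT_IDENTIFIERS = ["name", "email", "phone"]  # hypothetical column names


def split_identifiers(rows, identifier_fields):
    """Return (linking_key, coded_data) for a list of dict records."""
    linking_key, coded_data = [], []
    used_codes = set()
    for row in rows:
        # Random codes avoid leaking collection order; retry on a rare collision.
        code = secrets.token_hex(4)
        while code in used_codes:
            code = secrets.token_hex(4)
        used_codes.add(code)
        linking_key.append({"code": code, **{f: row[f] for f in identifier_fields}})
        coded_data.append(
            {"code": code, **{k: v for k, v in row.items() if k not in identifier_fields}}
        )
    return linking_key, coded_data


def unlink(coded_data):
    """Drop the study code so published rows can no longer be joined to the key."""
    return [{k: v for k, v in row.items() if k != "code"} for row in coded_data]


# Hypothetical roster, for illustration only.
roster = [
    {"name": "A. Example", "email": "a@example.org", "phone": "555-0100", "score": 7},
    {"name": "B. Example", "email": "b@example.org", "phone": "555-0101", "score": 4},
]
linking_key, coded_data = split_identifiers(roster, DIRECT_IDENTIFIERS)
published = unlink(coded_data)
print(published)  # [{'score': 7}, {'score': 4}]
```

The linking key would be stored separately from the coded data, under whatever access controls your consent form and Research Ethics Board require, or destroyed if that is what participants were promised.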
Identify and Evaluate Indirect or Quasi-Identifiers based on Perceived Risk and Utility
Indirect or quasi-identifiers are characteristics (such as demographic information) relating to individuals that could be linked with other data sources to violate the confidentiality of individuals. Quasi-identifiers may not be identifying on their own but can be disclosive in combination. For instance, identifying a participant’s home community size within an overall limited geographic study area may allow someone to infer that participant’s location more precisely. A variable should be considered a quasi-identifier if someone could plausibly match that variable to information from another source. See the International Household Survey Network Anonymization Principles and the Information and Privacy Commissioner Ontario De-identification Guidelines for Structured Data.
A list of potential quasi-identifiers:
- Geographic identifiers (census geography, town name, urban/rural indicator) of home, place of birth, place of treatment, place of schooling, or other geography linked to individuals
- Sex / gender identity, orientation
- Ethnic background, race, visible minority, or Indigenous status
- Immigration status
- Membership in organizations
- Use of specific social networks or services
- Socioeconomic data, such as occupation or place of work, income, or education
- Household and family composition, marital status, number of children / pregnancies
- Criminal records and other information that may link to public records
- Generalized dates linked to individuals, e.g. age, graduation year, immigration year
- Some full-sentence responses
- Note: These must be checked individually. For instance, the comment “The library should be open longer” is not identifying; however, a comment like “As chair of a research group that uses the library,…” is potentially identifying.
- Some medical information (e.g. permanent disabilities or rare medical conditions) may be identifying; temporary illness or injury is less likely to be so. The test is whether this is information that can be found elsewhere and therefore could be used to re-identify the person.
How do I figure out what combination of quasi-identifiers are a problem?
1. Observe the possible combinations
A good first step may be to look at the demographic variables in the dataset and consider describing an individual to a friend using only the values of those variables. Is there any likelihood that the person would be recognizable? For example, “I’m thinking of a person living in Toronto who is female, married, has a university degree, is between the ages of 40 and 55, and has an income of between 60 and 75 thousand dollars.” Even if there is only one such person in the dataset, this is likely not enough information to create risk UNLESS contextual information about the dataset narrows things down further. For instance, if your data is limited to a specific, narrow group of individuals, such as the referees for the Ontario Hockey Association, the list of quasi-identifiers given above may be enough to uniquely identify an individual. Quasi-identifiers need to be evaluated in the context of what is known or what may be reasonably inferred about the survey population.
2. Assess these combinations mathematically
K-anonymity is a mathematical approach to demonstrating that a dataset has been anonymized. Here k is an integer, selected by the researcher, that sets the minimum number of records that must share the same values across all quasi-identifiers. Within your dataset, a set of records that share the same values on every quasi-identifier is called an equivalence class. To achieve k-anonymity, every equivalence class must contain at least k records (e.g., at least 3 or 5), so that it is not possible to distinguish one record from the other records in its equivalence class on the basis of the quasi-identifiers. For example, if you choose a k value of 5, each record in your dataset must share its exact combination of quasi-identifier values with at least 4 other records in order to achieve k-anonymity.
K-anonymity only works to precisely estimate risk if a dataset is a complete sample of some population. K-anonymity considerably overestimates risk in the case of a dataset that is a subsample of a population. When determining the appropriate k value to use, consider:
- A lower k value of 3 may be sufficient in datasets that contain small samples from a large population.
- A higher (or more conservative) k value should be used if a dataset is a complete sample of a population.
Keep in mind that a dataset that is a complete sample of a known population may have additional risk factors. Imagine that all the respondents in a particular equivalence class answered a question the same way – you would know how each person in the survey belonging to that equivalence class answered the question. Respondents to surveys are generally told that their responses will be kept confidential, not merely that no one will know which line of data contains their specific answers. A k-anonymous dataset that is a complete sample may not fulfill that promise.
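The situation described above, in which everyone in an equivalence class gave the same answer, can be checked for directly. The Python sketch below groups records by their quasi-identifiers and reports any class in which a chosen sensitive variable takes only one value. The variable names and sample records are hypothetical. Note that it treats a class containing a single record as homogeneous as well.

```python
from collections import defaultdict


def homogeneous_classes(rows, quasi_identifiers, sensitive):
    """Return {class key: shared answer} for classes whose members all answered alike."""
    answers = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        answers[key].add(row[sensitive])
    return {key: next(iter(vals)) for key, vals in answers.items() if len(vals) == 1}


# Hypothetical survey records, for illustration only.
survey = [
    {"region": "North", "age_band": "40-49", "smokes": "yes"},
    {"region": "North", "age_band": "40-49", "smokes": "yes"},
    {"region": "North", "age_band": "40-49", "smokes": "yes"},
    {"region": "South", "age_band": "40-49", "smokes": "yes"},
    {"region": "South", "age_band": "40-49", "smokes": "no"},
    {"region": "South", "age_band": "40-49", "smokes": "yes"},
]
print(homogeneous_classes(survey, ["region", "age_band"], "smokes"))
# {('North', '40-49'): 'yes'}
```

Both classes here contain three records, so the data would pass a k = 3 check, yet anyone known to be in the North, 40-49 group is revealed as a smoker.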
The code in Appendix 1 can be used with your preferred statistical software package to create equivalence classes based on the quasi-identifiers in the dataset and to list them by size. If any equivalence class has fewer members than the value of k you selected, use the data reduction techniques below to further reduce dataset risk.
For more on k-anonymity, see International Household Survey Network (IHSN)’s Measuring the Disclosure Risk.
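For readers who do not use Stata or R, the same check can be sketched in Python with only the standard library. It counts the records in each equivalence class and lists the classes with fewer than k members, smallest first. The quasi-identifier names and sample records are hypothetical.

```python
from collections import Counter


def equivalence_class_sizes(rows, quasi_identifiers):
    """Count records sharing each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def classes_below_k(rows, quasi_identifiers, k):
    """Return (class key, size) pairs for classes smaller than k, smallest first."""
    sizes = equivalence_class_sizes(rows, quasi_identifiers)
    return sorted(
        ((key, n) for key, n in sizes.items() if n < k),
        key=lambda item: item[1],
    )


# Hypothetical records, for illustration only.
people = [
    {"sex": "F", "age_band": "40-49"},
    {"sex": "F", "age_band": "40-49"},
    {"sex": "F", "age_band": "40-49"},
    {"sex": "F", "age_band": "20-29"},
    {"sex": "F", "age_band": "20-29"},
    {"sex": "M", "age_band": "20-29"},
]
print(classes_below_k(people, ["sex", "age_band"], k=3))
# [(('M', '20-29'), 1), (('F', '20-29'), 2)]
```

An empty list means the dataset is k-anonymous for the chosen k and the chosen set of quasi-identifiers. It says nothing about variables left out of that set.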
3. Use data reduction techniques to address dataset risk
Univariate frequencies and bivariate crosstabs can be used to identify small categories of quasi-identifiers. (‘Small’ is relative; as a first pass, groups smaller than 5% of the dataset or containing fewer than 20 cases could be considered.) Data reduction techniques can be used to mitigate risk once you have identified these small groups. There are three simple types of data reduction you may wish to consider:
- The simplest is to completely drop risky variables from the dataset. This is an option for variables with relatively high risk that are not considered to be of high research value. (For example, in some datasets geography may be considered relatively less important than ethnicity or language.)
- The second is global re-coding, or aggregating the observed values into a defined set of classes, such as transforming a variable with years of age into a variable of ten-year age categories, or top-coding a high income category to “$100,000 and above”.
- A third option for unusual cases is to use local suppression. For example, a very young married respondent might have their marital status set to ‘missing’ as an alternative to globally re-coding the otherwise non-risky age variable into a larger group.
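The small-group check and the last two techniques above can be sketched in Python as follows. The thresholds follow the first-pass figures given above (fewer than 20 cases, or under 5% of the dataset). The field names, the band width, and the income cap are assumptions for illustration.

```python
from collections import Counter


def small_categories(rows, field, min_count=20, min_share=0.05):
    """Return {value: count} for categories that are small by either threshold."""
    counts = Counter(row[field] for row in rows)
    total = len(rows)
    return {v: n for v, n in counts.items() if n < min_count or n / total < min_share}


def age_band(age, width=10):
    """Global re-coding: replace an age in years with a fixed-width band."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def top_code_income(income, cap=100_000):
    """Global re-coding: collapse all incomes at or above the cap into one class."""
    return f"${cap:,} and above" if income >= cap else f"${income:,}"


def suppress(rows, field, is_risky):
    """Local suppression: set `field` to None (missing) only in flagged records."""
    return [{**row, field: None} if is_risky(row) else dict(row) for row in rows]


# Hypothetical records, for illustration only.
respondents = [
    {"age": 17, "marital_status": "married"},
    {"age": 34, "marital_status": "married"},
]
masked = suppress(
    respondents,
    "marital_status",
    lambda r: r["age"] < 18 and r["marital_status"] == "married",
)
print(age_band(47))              # 40-49
print(top_code_income(250_000))  # $100,000 and above
print(masked[0])                 # {'age': 17, 'marital_status': None}
```

After any re-coding or suppression, it is worth re-running the equivalence class check, since a change to one variable alters every class that variable belongs to.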
How do I assess the sensitivity of non-identifying variables in a dataset?
Non-identifying information includes survey responses and measurements that are not likely to be recognizable as coming from specific individuals. Examples include opinions, rankings, scales, or temporary measures such as resting heart rate after meditation or the number of times an individual ate breakfast in a week.
It is possible for non-identifying information to be highly sensitive as well. Information that could be used to stigmatize or discriminate against an individual, such as a criminal record, sexual practices, illicit drug use, mental health and psychological well being, and other sensitive medical information all increase the risk of the dataset and should be considered when deciding whether to release the data at all. You may wish to remove or modify these variables to create a less sensitive version of the data.
Considerations for Qualitative Data De-identification
Qualitative data describes qualities or characteristics that can be observed, but not necessarily measured. This type of data is collected through interviews, surveys, or observations, and may be in the form of transcripts, notes, video and audio recordings, images, and text documents. As with quantitative data, direct identifiers may appear in the form of names, date and place of birth, other locations, and even photos. These direct identifiers can be used along with indirect or quasi-identifiers, such as medical, education, financial, and employment information, to trace or determine an individual’s identity.
The process for removing identifying information in a video recording, audio interview, or oral transcript is very different from that used to de-identify a medical record. For one, it is harder to do programmatically. Extremely detailed field notes or audiovisual information often require someone to read or watch the content thoroughly.
- Avoid asking for identifying information in the first place.
- It is easier to edit the information at the point of capture than it is to remove information after it has been recorded.
- If you require identifying information at the research stage, try to capture it within the first few minutes of an interview or recording, so that it is easy to edit it out quickly. Alternatively, transcribe the information in a separate document that can be removed from a person’s file.
- Make de-identification a part of the process of informed consent.
- Ensure that study participants are aware of your planned use of the data, and the fact that their information may be anonymized to protect them. Make it clear in your consent forms how extensively the data will be de-identified (i.e., what elements will be replaced or removed). While direct identifiers may be eliminated (name, address, birthday, etc.), there may be other subtle clues to their identities that remain within the recording or transcript.
- Agree in advance with participants which type of identifying information can be revealed in an interview. (For example, the participant may not wish to mention an employer’s name). This is easier than removing information after the fact.
- Keep in mind that not all data needs to be de-identified or anonymized. In some circumstances, you may be recording deeply personal accounts and should be mindful of a participant’s right to have their story told in their own words. Some participants may have a personal interest in staying identified.
- Use pseudonyms and change identifying details to protect anonymity.
- If changing the person’s name, location of residence, or occupation can be done without compromising the dataset, this can help to protect their anonymity. Be advised that this could influence the utility of a dataset as it may alter a future researcher’s perception of the interviewee’s socio-economic status or behaviour.
- If necessary, remove blocks of sensitive text or edit out portions within audio-visual data.
- Some portion of the research may need to be redacted. Be wary of using search and replace techniques as it is easy to replace the wrong piece of information.
- Voices in audio recordings may need to be masked by altering pitches.
- Faces in visual data may need to be pixelated.
- Restrict access.
- This is not preferred, but some datasets will not remain useful if all identifiers are removed. It may be possible to allow researchers seeking secondary access to request that queries be performed by the original research team, who can then share results if they are non-disclosive or can be appropriately de-identified.
For more information, see the UK Data Service’s Guide to Anonymisation of Qualitative Data.
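The caution above about search and replace can be made concrete. A plain text substitution of a name such as "Ann" would also alter "Annual". The Python sketch below matches whole words only and reports how many replacements were made for each name, so the counts can be checked against a manual read of the transcript. The names, pseudonyms, and sentence are invented for illustration, and a script like this supplements, rather than replaces, reading the text.

```python
import re


def apply_pseudonyms(text, pseudonyms):
    """Replace whole-word matches only; return the new text and per-name counts."""
    counts = {}
    for real, fake in pseudonyms.items():
        # \b anchors the match at word boundaries, so "Ann" does not match "Annual".
        pattern = re.compile(rf"\b{re.escape(real)}\b")
        text, counts[real] = pattern.subn(fake, text)
    return text, counts


# Invented transcript line, for illustration only.
transcript = "Ann presented the Annual Report to Lee, and Ann stayed late."
clean, counts = apply_pseudonyms(transcript, {"Ann": "Maria", "Lee": "Sam"})
print(clean)   # Maria presented the Annual Report to Sam, and Maria stayed late.
print(counts)  # {'Ann': 2, 'Lee': 1}
```

Replacements are applied one name at a time, so a pseudonym should not be the same as another participant's real name. Nicknames, misspellings, and indirect references will not be caught.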
Brief Considerations for Social Media, Medical Images, and Genomics Data
1. Data collected from social media or social networking platforms (e.g., Twitter, Facebook)
Here is a series of questions to consider before you deposit social media data:
- Could the topic you are studying be considered sensitive?
- Could your data lead to stigmatization of, or discrimination against, the content author?
- Is the study population vulnerable?
- What expectation of privacy might the individual users of this platform have?
- Is it possible or reasonable to obtain informed consent?
- Can or should the data be anonymized?
For example, Twitter allows the content author to maintain control over their tweets. As part of Twitter’s policies, only numeric Tweet IDs and User IDs should be redistributed. If you have weighed the questions above and decide to deposit your dataset, the Tweets must first be ‘dehydrated’ (distilled down to just the Tweet ID) using a tool such as DocNow’s twarc. Any secondary use of the data would then require an end-user to “rehydrate” the Tweet IDs using the Twitter REST API or an external tool such as DocNow’s Hydrator. Content will not be returned for tweets that have since been deleted.
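To show what "dehydrating" amounts to, the Python sketch below reduces a file of collected tweets (one JSON object per line) to a list of Tweet IDs. It is not twarc and does not reproduce its behaviour. It assumes each tweet object carries an `id_str` or `id` field, and the sample IDs are made up. In practice a maintained tool such as the one named above is preferable.

```python
import json


def dehydrate(jsonl_lines):
    """Yield only the Tweet ID from each line of tweet JSON."""
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # Assumption: the ID is stored as `id_str`, with a numeric `id` as fallback.
        yield str(tweet.get("id_str") or tweet["id"])


# Made-up tweet objects, for illustration only.
collected = [
    '{"id_str": "1000000000000000001", "text": "first example", "user": {"id_str": "42"}}',
    "",
    '{"id": 1000000000000000002, "text": "second example"}',
]
tweet_ids = list(dehydrate(collected))
print(tweet_ids)  # ['1000000000000000001', '1000000000000000002']
```

Only the ID list would be deposited; the text, user details, and any other fields stay out of the shared file.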
The following resources provide more in-depth guidance:
- Zeffiro and Brodeur, Social Media Research Data Ethics and Management (slides from a workshop presented at McMaster University).
- Ryerson University Research Ethics Board’s Guidelines for Research Involving Social Media.
- Mannheimer and Hull, Sharing Selves: Developing an Ethical Framework for Curating Social Media Data.
- North Carolina State University’s Social Media Archives Toolkit, which contains guidance on the legal and ethical implications of sharing social media data, and an annotated bibliography with further resources.
2. Medical Images
Before you archive medical images, remove any direct identifiers you do not have explicit consent to share, such as name, patient ID, and exact dates from the image header or embedded metadata, and black out any pixels in the image that contain identifying information. Neuroimages must also be defaced using a tool such as PyDeface. Some repositories may be able to assist you or recommend tools for defacing. For example, the International Neuroimaging Data-Sharing Initiative (INDI) can help researchers who plan to share their data on the INDI platform.
The following resources provide more guidance for de-identifying DICOM files:
- The Cancer Imaging Archive (TCIA) De-identification Overview.
- See specifically “Table 1 – DICOM Tags Modified or Removed at the source site” for a list of DICOM tags deemed to be unsafe.
- The Radiological Society of North America (RSNA) International Covid-19 Open Radiology Database (RICORD) De-identification Protocol.
- The DICOM standard itself provides important guidance for de-identifying header information. Specifically, DICOM Part 15: Security and System Management Profiles, Appendix E: Attribute Confidentiality Profiles may be useful.
- These profiles attempt to balance the need to protect privacy with the need to retain information so the data remain useful.
- If it is necessary to retain identifiers, your REB application will have ideally referenced the profile you intend to use, and your consent form should clearly state what information will be shared.
De-identification of DICOM files may be done programmatically, using software to strip identifiers from the header.
- TCIA recommends the Clinical Trial Processor (CTP) software developed by RSNA.
- RSNA’s Covid-19 Open Radiology Database (RICORD) recommends another RSNA tool called Anonymizer, and has published instructions on how to install and use it. Anonymizer implements RICORD’s de-identification protocol.
- There are many other non-commercial options available, such as the DicomCleaner™ tool.
- As with all de-identification software, results may be variable, and you should confirm that identifying information was removed before you share your images. Note that:
- Vendors or end-users may not have always used DICOM elements in a way that conforms to the standard.
- Private elements or private tags may have been used to store personal information, and the use of these tags may not be well-defined in the vendor documentation.
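The logic of header stripping, including the caution about private tags, can be illustrated in Python without any imaging library. The sketch below is a toy model only: the header is a plain dictionary keyed by (group, element) tag numbers rather than a real DICOM file, and the short list of unsafe tags is an assumption chosen for the example, not a complete confidentiality profile. It relies on the DICOM convention that private elements use an odd group number, and it removes all of them, which is a deliberately conservative choice.

```python
# Illustrative subset only; consult the profiles listed above for a full list.
UNSAFE_TAGS = {
    (0x0010, 0x0010),  # PatientName
    (0x0010, 0x0020),  # PatientID
    (0x0010, 0x0030),  # PatientBirthDate
    (0x0008, 0x0020),  # StudyDate
}


def is_private(tag):
    """DICOM private elements use an odd group number."""
    return tag[0] % 2 == 1


def strip_header(header, unsafe_tags=UNSAFE_TAGS):
    """Return (kept elements, removed tags) for a dict-based stand-in header."""
    kept, removed = {}, []
    for tag, value in header.items():
        if tag in unsafe_tags or is_private(tag):
            removed.append(tag)
        else:
            kept[tag] = value
    return kept, removed


# Hypothetical header values, for illustration only.
header = {
    (0x0008, 0x0060): "MR",            # Modality
    (0x0008, 0x0020): "20200417",      # StudyDate
    (0x0010, 0x0010): "EXAMPLE^ANNE",  # PatientName
    (0x0029, 0x1010): "vendor notes",  # private element (odd group)
}
kept, removed = strip_header(header)
print(kept)          # {(8, 96): 'MR'}
print(len(removed))  # 3
```

Reviewing the removed list, and spot-checking the kept elements, is one way to confirm that a tool did what was expected before images are shared.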
3. Genomics data, and other biomedical samples
Because each person’s DNA sequence is unique, human biological materials can never be truly anonymous. Before you archive or biobank these data, please review your consent form. Ideally the consent process will have:
- provided participants with information about how their data will be used, analyzed, stored and shared,
- identified what information will be stored alongside the data,
- communicated what level of privacy or confidentiality a participant may expect, and who may have access to the data,
- indicated whether the data/samples will be stored in Canada or outside of Canada,
- acknowledged whether there is a possibility that the data will be used for commercial purposes,
- clearly explained the risks of disclosure.
Further information is available in TCPS 2 (2018), Chapter 12: Human Biological Materials Including Materials Related to Human Reproduction (sections A and D specifically), and Chapter 13: Human Genetic Research. See also Thorogood (2018) Canada: will privacy rules continue to favour open science?
The NIH Privacy in Genomics webpage provides a concise overview of some of the benefits and risks of sharing genetic information. For an example of how genetic information was used to identify study participants, see Identifying Personal Genomes by Surname Inference, or a summary of the study in the 2013 Nature editorial on Genetic privacy. For further information on ethics and consent in genomics, see the Global Alliance for Genomics and Health Regulatory & Ethics Toolkit resources, such as Data Privacy and Security Policy and Consent Policy.
Appendix 1: Code for Checking K-Anonymity
— Stata —
* Stata code for checking k-anonymity
* Kristi Thompson, May 2020
* create the equivalence groups
egen equivalence_group= group(var1 var2 var3 var4 var5)
* create a variable to count cases in each equivalence group
* (bysort sorts the data first; a plain "by" fails if the data are not sorted)
bysort equivalence_group: gen equivalence_size = _N
* list the ID numbers of equivalence groups containing fewer than 3 cases (k = 3)
tab equivalence_group if equivalence_size < 3, sort
* list the values of the quasi-identifiers for each small equivalence class.
* (replace 1 with an ID number from the table above)
list var1 var2 var3 var4 var5 if equivalence_group == 1
— R —
# R code for checking k-anonymity
# Carolyn Sullivan and Kristi Thompson, May 2020
# Install plyr, a useful data manipulation package (only needed once).
install.packages("plyr")
# Load the library.
library(plyr)
datafile <- "location of the data file - csv format"
# Read the csv file.
df <- read.csv(datafile)
# Figure out what equivalence classes there are, and how many cases in each equivalence class.
# ddply stores the count for each class in a column named V1.
dfunique <- ddply(df, .(var1, var2, var3, var4, var5), nrow)
# Sort so that the smallest equivalence classes are listed first.
dfunique <- dfunique[order(dfunique$V1),]
Appendix 2: Free de-identification software packages
Many of these tools take a hierarchical approach to de-identifying data, which means that you will need to pre-define possible generalizations for the quasi-identifiers in the dataset, and the program will search for possible solutions and recommend a set of the generalizations to use to best meet anonymization goals. For datasets with a large number of quasi-identifiers, or cases where several datasets with similar quasi-identifiers need to be de-identified, this might be a useful approach. For smaller datasets, it may be more straightforward to work in a statistical package. The software packages included here all have some usability issues, and fairly steep learning curves. Amnesia and the graphical user interface to sdcMicro may be the most user-friendly.
- Amnesia
- This software has both online and desktop versions; however, uploading sensitive data to a third-party web site is not generally recommended. If possible, install the software locally (Windows or Linux only).
- Amnesia supports k-anonymity and km-anonymity (a slightly more flexible approach to anonymity when the number of quasi-identifiers in a dataset is very high, as it allows for combinations up to m quasi-identifiers to appear at least k times in the published data).
- A few limitations: there is not currently a way to specify missing values, and the documentation could be more thorough (for instance, defining hierarchies is not straightforward).
- This software may work best for clinical data, or data which are not survey data.
- sdcMicro
- An R package for statistical disclosure control (microdata anonymization). This software can read many data types (e.g., csv, sav, dta, sas7bdat, xlsx) and can be used in Windows, Linux or Mac operating systems. Implements muArgus code.
- A graphical user interface is available, and there is a vignette with guidance called ‘Using the interactive GUI – sdcApp’ linked from the sdcMicro landing page in the CRAN repository.
- Please be aware that large datasets take time to load, and computation time for large or complex datasets may be lengthy.
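The km-anonymity idea mentioned above for Amnesia can be illustrated with a short Python sketch. It reads the description literally: every combination of values across any set of up to m quasi-identifiers should appear in at least k records. This is an illustrative check under that reading, not Amnesia's own algorithm, and the variable names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations, product


def km_violations(rows, quasi_identifiers, k, m):
    """Return (fields, values, count) for each combination of up to m
    quasi-identifiers whose values occur in fewer than k records."""
    violations = []
    for size in range(1, m + 1):
        for fields in combinations(quasi_identifiers, size):
            counts = Counter(tuple(row[f] for f in fields) for row in rows)
            violations.extend(
                (fields, values, n) for values, n in counts.items() if n < k
            )
    return violations


# Hypothetical data: one record for every combination of three two-valued variables.
QI = ["sex", "age_band", "region"]
records = [
    dict(zip(QI, combo))
    for combo in product(["F", "M"], ["40-49", "50-59"], ["North", "South"])
]
print(len(km_violations(records, QI, k=2, m=2)))  # 0
print(len(km_violations(records, QI, k=2, m=3)))  # 8
```

In this example every pair of values is shared by two records, so the data pass with m = 2, but each full three-variable combination is unique, so the stricter check (m = 3, equivalent here to ordinary k-anonymity with k = 2) flags all eight records.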
Other tools that may be useful
- Open source anonymization tool for use in Windows, Linux, and Mac. Provides support for SQL databases, xlsx and csv files, and has a graphical user interface.
- Supports various privacy models including k-anonymity, and variants ℓ-diversity, t-closeness, β-Likeness, and more.
- Allows end-users to categorize, top and bottom code, generalize, and transform data in more complex ways.
- Large datasets take time to load, and computation time for large or complex datasets may be lengthy.
- Software to apply Statistical Disclosure Control techniques. The program takes a hierarchical approach to de-identifying data.
- JAR file should be executable in Windows or Mac OS.
- A tester found that getting data loaded and correctly defined was a bit of a challenge and advised that the program could use better documentation on setting up hierarchies.
- The University of Texas at Dallas Anonymization Toolbox
- The toolbox currently supports 6 different anonymization methods and 3 privacy definitions, including k-anonymity, ℓ-diversity, and t-closeness.
- Algorithms can either be applied directly to a dataset or can be used as library functions inside other applications.
- This is a set of Java routines. Data curators who prefer to do their statistical programming in Java might find it useful.
Appendix 3: Fee-based services for de-identification
A few fee-based services that researchers may opt to use for de-identification are included below:
- d-wise (American & European offices)
- Offering free anonymization services to anyone working on a COVID-19 vaccine.
- Inter-university Consortium for Political and Social Research (ICPSR) (Archive headquartered at University of Michigan)
- If you wish ICPSR to conduct disclosure analysis of your data, you will need to purchase the Professional Curation package. Cost is based on the number of variables and complexity of the data. Contact ICPSR Acquisitions at email@example.com for additional information (information obtained from Open ICPSR FAQ under Pricing and Sensitive Data sections).
- Privacy Analytics (Ottawa-based company)
- Privacy Analytics can review datasets as part of their Data Privacy Validation Services.
- Methodology based on the HIPAA Expert Determination De-identification Standard.
- To find out more about their services, please fill in the form at the bottom of their “Certification” webpage.