Archive and share data

Content

Why archive and share your data?
What data to keep?
Best practices when preparing data for archiving and sharing
Preparing sensitive data for sharing

Why archive and share your data?

A growing number of granting agencies and journals require data deposit. In addition, depositing your data in an appropriate repository safeguards it for reuse by your team or other researchers. It also supports the validation or replication of your research.

Here is what Emad Shihab, Associate Professor of Computer Science and Software Engineering at Concordia, has to say about data sharing:

"Sharing datasets for our research has significantly improved its impact and allowed for more transparency. Our most cited and reproduced work is work where we made our datasets publicly available. It requires some work to prepare and support these datasets, but it is well worth it and appreciated by the wider research community!"

What data to keep?

Data can be archived and preserved locally or shared in a public data repository. Note that archiving can be costly and there may not be enough space to archive everything. Researchers should carefully identify which data to preserve. Consider the following:

Do the data support published research?
Are the data likely to be reused?
Are the data unique or historically significant?
Are there funder or institutional requirements?
Are the data difficult to reproduce?
Are there any ethical issues to consider?
Are the data in support of a patent application?

Examples of data that should be kept, by discipline (from Stanford University).

Best practices when preparing data for archiving and sharing

File formats	Choose file formats suited to long-term storage, preferably non-proprietary, to guard against software obsolescence. More information
Documentation	Add documentation alongside your data to make the data understandable and reusable. More information
Ownership and privacy	If sharing data, make sure that (1) you or your organization own the data (more information on data licenses) and (2) all ethical requirements are followed (more information on ethics).
Data integrity	If keeping a local copy, avoid bit rot through refreshment (copy the data to a new drive every 2-5 years) and replication (maintain three copies of the data, on two forms of storage, with one copy in an external location).
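
Refreshment and replication are easier to trust when each copy can be checked against the original. The sketch below is one possible way to do that, not a method prescribed by this guide: it records a SHA-256 checksum for every file in a dataset folder and later reports files in a copy that are missing or no longer match. The function names and the folder layout are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(manifest: dict[str, str], copy_root: Path) -> list[str]:
    """List files that are missing from a copy or whose checksum differs."""
    current = build_manifest(copy_root)
    return [name for name, digest in manifest.items() if current.get(name) != digest]


if __name__ == "__main__":
    # Hypothetical folders: the working dataset and one of its replicas.
    original = Path("dataset")
    replica = Path("backup/dataset")
    if original.is_dir() and replica.is_dir():
        manifest = build_manifest(original)
        print(changed_files(manifest, replica))
```

Building the manifest when the data are first archived, and re-running the comparison each time a copy is refreshed, would flag silent corruption before the damaged copy replaces a good one.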

Preparing sensitive data for sharing

Consent forms are key to data sharing:
- Some data cannot be shared for legal or ethical reasons. However, if sharing the dataset is required, ensure that this has been stated in consent forms and cleared with the Research Ethics Unit. Find out more.
De-identification allows sharing of sensitive data:
- De-identification is the process used to remove identifying data. Identifiers can be direct, which point directly to an individual, or indirect, which point to an individual when combined with other data.
  - Examples of direct and indirect identifiers
- De-identification guidance:
  - De-identification guide (Portage)
  - De-identification guidelines for structured data (Information and Privacy Commissioner of Ontario)
  - Methods of data de-identification:
    - Anonymization (removing identifiers altogether)
    - Pseudonymization (replacing identifiers with pseudonyms or other identifiers)
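
The difference between the two methods above can be illustrated with a short sketch. It is a simplified example under stated assumptions, not a complete de-identification procedure: the column names are hypothetical, the keyed hash (HMAC-SHA-256) is only one of several ways to generate pseudonyms, and indirect identifiers such as an age group are left untouched, so the output may still need review against the guidance listed above.

```python
import hashlib
import hmac

# Hypothetical column names, chosen only for this illustration.
DIRECT_IDENTIFIERS = {"name", "email"}


def anonymize(record: dict) -> dict:
    """Anonymization: remove direct identifiers altogether."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}


def pseudonymize(record: dict, secret_key: bytes) -> dict:
    """Pseudonymization: replace each direct identifier with a keyed-hash pseudonym."""
    result = {}
    for key, value in record.items():
        if key in DIRECT_IDENTIFIERS:
            digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            result[key] = digest.hexdigest()[:16]  # shortened for readability
        else:
            result[key] = value
    return result


if __name__ == "__main__":
    participant = {
        "name": "Jane Doe",
        "email": "jane@example.org",
        "age_group": "30-39",
        "score": 42,
    }
    print(anonymize(participant))
    print(pseudonymize(participant, secret_key=b"store-this-key-apart-from-the-data"))
```

With the same key, a given identifier always maps to the same pseudonym, so records for one participant can still be linked across files. The key would have to be stored separately from the shared dataset, since anyone holding it could recompute the pseudonyms for known individuals.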
Protecting sensitive species data:
- Guidance exists on how to make this type of data available without exposing species to harm. Find out more

See also:

Can I share my data? Decision tree (Portage)
Data Deposit & Access section of the Human Participant Research Data Risk Matrix (p. 8) (Portage)
Anonymisation: managing data protection risk code of practice (UK Information Commissioner's Office).
Anonymisation: Guide from the UK Data Service
McGill Data Anonymization Workshop Series 2023: recordings and slides from a workshop series providing theoretical and practical knowledge of data anonymization and de-identification of sensitive data, to promote and facilitate data deposit and data sharing.