Building of a Large Scale De-Identified Biomedical Database in Qatar-Principles and Challenges
- Publisher: Hamad bin Khalifa University Press (HBKU Press)
- Source: Qatar Foundation Annual Research Conference Proceedings, Volume 2016, Issue 1, Mar 2016, HBPP3324
Abstract
Background
Electronic Medical Records (EMRs) hold diverse clinical information about large populations. When this information is coupled with genetic data, it has the potential to reveal unprecedented associations between genes and diseases. Incorporating these discoveries into healthcare practice offers the hope of improving healthcare through personalized treatments. The Qatar National Genome project aims to achieve this vision by building a warehouse of genome sequencing information linked to de-identified EMR data. The warehouse should facilitate access to research data while protecting patients’ privacy and confidentiality through responsible data de-identification and data sharing mechanisms.
This abstract discusses the privacy and governance challenges encountered during the construction and deployment of the data warehouse. To simplify the presentation, we divide the data management lifecycle into four stages and discuss the challenges at each stage separately: 1) initial data collection, 2) data storage, 3) data sharing (utilization), and 4) dissemination of research findings to the community.
Data collection
The data for the Qatari genome project is sought from the community. Thus, it is important to consult with the population to establish the basic principles for data collection and research oversight. To achieve that, a community engagement model should be defined. The model should establish:
1. An advocacy technique for advertising the project to the community and raising the number of individuals who are aware of it. The technique should strive to reach different elements within the society, communicate risks and benefits clearly, and establish methods for recurrent evaluation of the community’s attitudes toward and understanding of the project.
2. A recruitment strategy for establishing the enrollment criteria and enrollment process:
a. The enrollment criteria define the basis for enrollment (should it be disease based or volunteer based) and the acceptable age for volunteers, and
b. The enrollment process defines the scope of subjects’ consent (opt in/out or informed consent) and warrants a clear boundary between research and clinical practice.
3. The extent of institutional review board (IRB) and community oversight. Given the potential impact of the project on the community, oversight of the program by the community and the IRB should be discussed and established. The scope includes oversight of data repositories, of research studies, and of any changes to the protocol (data use agreements, communications, etc.).
Data storage
Foundational documents in modern research ethics stress the importance of reducing harm to participants and maximizing benefits to society. Re-identification of participants is one form of harm that can be inadvertently or deliberately inflicted. Personal information derived from EMRs and/or genomic data can be used against the participants to limit insurance coverage, to guide employment decisions, or to impose social stigma. To minimize the risk of harm, the research platform should store de-identified clinical and biobank data while retaining the link between both data sources (the de-identified EMR data and the biobank data). This can be achieved by applying the following two operations:
1. The first operation (known as pseudonymization) identifies one or more stable and unique identifiers (such as Qatari IDs) that are included in both data sources and replaces them with a unique random ID (or pseudonym).
2. The second operation removes all uniquely identifying information (such as names, record numbers, and emails) from the structured data and masks all unique identifiers in the unstructured data (such as doctors’ notes). To perform this step properly, we need to determine which information is uniquely identifying in the Qatari setting. Due to the relatively small population size in Qatar, some ordinary attributes might prove to be very informative. For example, an age of 87 or above and certain professions, such as lawyer, might uniquely identify a participant.
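The second operation can be illustrated with a minimal Python sketch. It is not the project's implementation: the field names (`name`, `record_number`, `email`, `age`, `profession`, `notes`), the age cap, and the threshold `k` are assumptions chosen for illustration. The sketch drops direct identifiers, top-codes high ages, masks e-mail addresses in free text, and then counts how many records share each combination of the remaining attributes (a k-anonymity style check, which the abstract does not name) to flag combinations that might single out a participant.

```python
import re
from collections import Counter

# Hypothetical field names; a real EMR schema would differ.
DIRECT_IDENTIFIERS = {"name", "record_number", "email"}
QUASI_IDENTIFIERS = ("age_group", "profession")

# Simple e-mail pattern; masking names in notes would need more work.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def deidentify(record, age_cap=87):
    """Return a copy of `record` with direct identifiers removed."""
    clean = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Top-code high ages so very old participants do not stand out.
    age = clean.pop("age")
    clean["age_group"] = f"{age_cap}+" if age >= age_cap else str(age)
    # Mask e-mail addresses that appear in unstructured notes.
    clean["notes"] = EMAIL_PATTERN.sub("[EMAIL]", clean.get("notes", ""))
    return clean


def rare_combinations(records, k=2):
    """Return quasi-identifier combinations shared by fewer than k records."""
    counts = Counter(tuple(r[q] for q in QUASI_IDENTIFIERS) for r in records)
    return {combo for combo, n in counts.items() if n < k}
```

Any combination returned by `rare_combinations` would be a candidate for further generalization or suppression before the data is stored.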
Multiple aspects need to be considered when designing the pseudonymization operation. These include:
1. Ensuring that each subject is assigned the same random ID (pseudonym) across the different data sources. This consistency will ensure that data belonging to a particular subject will be mapped to one record.
2. Deciding whether the pseudonymization process should be reversible. Reversible systems allow the identity of subjects to be recovered through a process called de-pseudonymization. They are used when communication with patients is a foreseen possibility.
3. Specifying a secure de-pseudonymization mechanism if communication with participants is foreseen. The mechanism should define (i) the cases in which re-identification can occur, (ii) the bodies that can initiate re-identification requests, (iii) the bodies that rule on and regulate these requests, and (iv) the actual re-identification mechanism.
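The three aspects above can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the class `PseudonymVault`, its method names, and the idea of a set of authorized bodies with an `approved` flag are hypothetical, and a real system would keep the mapping in secured, audited storage rather than in memory. The vault issues one random pseudonym per identifier, returns the same pseudonym on every later call so that records from different sources link up, and reverses the mapping only for an authorized body with an approved request.

```python
import secrets


class PseudonymVault:
    """Maps stable identifiers to random pseudonyms and, if authorized, back."""

    def __init__(self, authorized_bodies):
        self._forward = {}  # identifier -> pseudonym
        self._reverse = {}  # pseudonym -> identifier
        self._authorized = set(authorized_bodies)

    def pseudonym_for(self, identifier):
        """Return the same random pseudonym every time for one identifier."""
        if identifier not in self._forward:
            pseudonym = secrets.token_hex(16)
            while pseudonym in self._reverse:  # guard against a collision
                pseudonym = secrets.token_hex(16)
            self._forward[identifier] = pseudonym
            self._reverse[pseudonym] = identifier
        return self._forward[identifier]

    def reidentify(self, pseudonym, requesting_body, approved):
        """De-pseudonymize only for an authorized body with an approved request."""
        if requesting_body not in self._authorized or not approved:
            raise PermissionError("re-identification request denied")
        return self._reverse[pseudonym]


# Usage with made-up identifiers: both sources receive the same pseudonym.
vault = PseudonymVault(authorized_bodies={"oversight_committee"})
emr_row = {"pid": vault.pseudonym_for("QID-0001"), "diagnosis": "example"}
biobank_row = {"pid": vault.pseudonym_for("QID-0001"), "sample": "example"}
```

An irreversible variant would simply discard the reverse mapping, at the cost of never being able to contact participants.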
Data sharing
After the removal of uniquely identifying information, the resulting data is said to be de-identified, but not anonymized. Access to (non-anonymized) biomedical data collected in Qatar is governed by the QSCH “guidelines, regulations and policies for research involving human subjects”. A critical part of defining a data access protocol is therefore to:
1. Identify and understand the data access procedures and requirements set by QSCH,
2. Identify and understand the data access preferences of the Qatari community (through surveys, meetings with community representatives, etc.), and
3. Incorporate the gathered policies and requirements, along with the collected consents, into the design of the data-access platform.
Note that access to the research data platform has to be provided to all research institutes within Qatar, such as Hamad Medical Corporation, Qatar Biomedical Research Institute, Qatar Computing Research Institute, Weill Cornell Medical College in Qatar, Qatar University, Hamad bin Khalifa University, Sidra Medical and Research Center and other research institutions. Moreover, the data warehouse is viewed as a platform for worldwide collaborative research projects. With such a broad mandate, a principal requirement is the capacity to foster timely research and discoveries. Data application processes and approvals should be smooth and should not delay project initiation significantly. This cannot be realized using traditional “IRB-based” data-sharing systems. Thus, there will eventually be a need to (fully or partially) automate the data access process. In other words, we need to design a system that automatically matches data access requests with access decisions.
In general, access decisions could be provided at multiple access levels. For example, in some cases the requested data could be exported to the investigator’s premises, while in other cases secure remote access can be imposed. The granted access level should be commensurate with the risk posed by the data request. For example, a request for highly sensitive data (such as HIV data) from an investigator affiliated with a well-established Qatari research institute is inherently less risky than a request for the same dataset from an investigator affiliated with an institution outside Qatar. Thus, the second request should receive more access limitations than the first.
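A minimal Python sketch of such automated matching follows. Everything in it is an assumption made for illustration: the risk scores, the category names, the thresholds, and the third `REVIEW` level are hypothetical, and real values would have to come from QSCH policy, community input, and the collected consents. The sketch adds a data sensitivity score to a requester affiliation score and maps the total to an access level, so that riskier requests receive tighter access.

```python
from enum import Enum


class Access(Enum):
    EXPORT = "export to investigator premises"
    REMOTE = "secure remote access only"
    REVIEW = "refer to manual review"


# Hypothetical scores; higher means riskier.
SENSITIVITY = {"low": 0, "moderate": 1, "high": 2}
AFFILIATION = {"qatar_institute": 0, "external_institute": 1, "unknown": 2}


def decide_access(sensitivity, affiliation):
    """Map a request's combined risk score to an access level."""
    risk = SENSITIVITY[sensitivity] + AFFILIATION[affiliation]
    if risk <= 1:
        return Access.EXPORT
    if risk == 2:
        return Access.REMOTE
    return Access.REVIEW


# The same highly sensitive dataset, requested from inside and outside Qatar.
local_decision = decide_access("high", "qatar_institute")
external_decision = decide_access("high", "external_institute")
```

Under these assumed scores, the local request is limited to secure remote access while the external request is referred to manual review, mirroring the example in which the second request receives more limitations than the first.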
Dissemination of findings
Prior work demonstrated that in order to affirm the value of research participation and contribute to public education, it is important to have a mechanism for disseminating research findings to the public. This will keep the community aware of how their participation is facilitating research and improving knowledge in the biomedical field.
The mechanism should also tackle the issue of disseminating specific research findings to specific participants. One of the main challenges in that regard is to define when a finding is considered scientifically valid and when it is considered valuable information for the recipient.