Behind the Mask: Balancing Data Privacy and Record Linkage Performance
In a typical day, our personal data courses through countless computer systems, and health care is no exception. Sensitive personal information and personally identifiable information (PII) are key parts of systems that improve the efficiency and accuracy of medical care, but they also carry significant privacy concerns. Beyond the growing risk of identity theft is the fact that such data contains information on health conditions that we may not want other people knowing. Protecting privacy means controlling what data can be accessed and who can access it.
“These protections involve tradeoffs between providing too much access and too little. Too much raises privacy risks, and too little can make data systems less useful,” said Hye-Chung Kum, PhD, associate professor in the Department of Health Policy and Management in the Texas A&M School of Public Health. “Maintaining privacy and confidentiality of data while having sufficient information for meaningful use requires a well-orchestrated system.”
One example is record linking, which connects pieces of information about a person from disparate sources. Sometimes called patient matching, record linkage is critical for following care processes over time, such as counting how many people are readmitted after discharge, to inform policy and clinical care decisions. However, little is known about how privacy protections affect record linking and how that knowledge could inform better safeguards.
Kum, who has a joint appointment in the Department of Computer Science and Engineering in the Texas A&M College of Engineering, joined with Eric Ragan, PhD, assistant professor in the Department of Visualization, and students in the Department of Computer Science and Engineering. The team set out to determine whether it is possible to limit data access without harming record linking accuracy, and how much data can be hidden before linking efforts begin to suffer. This study, part of a Patient-Centered Outcomes Research Institute award, recently received an Honourable Mention Award at the 2018 ACM CHI Conference on Human Factors in Computing Systems. In it, Kum and Ragan experimented with a computer interface that hid various amounts of PII from view.
Record linking can be a challenging process because data from disparate sources do not always share common identifying keys such as identification numbers. Automated record linking software can use other pieces of information, such as names, addresses and birth dates, to join records, but this process typically needs human intervention to iron out kinks in the matching. People with common names, fathers and sons with the same name and people with the same birthday can cause problems for automated processes, as can incorrectly entered data, such as transposed numbers in addresses or birth dates.
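To make the matching problem concrete, here is a minimal sketch of how an automated linker might compare two records and set aside uncertain pairs for human review. It is not the software used in the study: the field names (`name`, `dob`), the 0.85 name-similarity threshold and the three labels are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def is_transposition(a: str, b: str) -> bool:
    """True if b equals a with exactly one pair of adjacent characters swapped."""
    if len(a) != len(b) or a == b:
        return False
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    return (
        len(diffs) == 2
        and diffs[1] == diffs[0] + 1
        and a[diffs[0]] == b[diffs[1]]
        and a[diffs[1]] == b[diffs[0]]
    )


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify_pair(rec_a: dict, rec_b: dict, name_threshold: float = 0.85) -> str:
    """Label a pair of records as 'match', 'review' or 'non-match'.

    The rules and the threshold are illustrative assumptions, not a
    validated linkage model.
    """
    score = name_similarity(rec_a["name"], rec_b["name"])
    same_dob = rec_a["dob"] == rec_b["dob"]
    swapped_dob = is_transposition(rec_a["dob"], rec_b["dob"])

    if score == 1.0 and same_dob:
        return "match"
    # Close names with an equal or digit-swapped birth date are ambiguous:
    # they could be a typo or two different people, so a person should look.
    if score >= name_threshold and (same_dob or swapped_dob):
        return "review"
    return "non-match"


# Mock records, invented for this example.
a = {"name": "John Smith", "dob": "1984-03-12"}
b = {"name": "John Smith", "dob": "1948-03-12"}  # birth year digits swapped
c = {"name": "Jon Smith", "dob": "1984-03-12"}   # name spelled differently
d = {"name": "Mary Jones", "dob": "1990-07-01"}

for other in (a, b, c, d):
    print(other["name"], other["dob"], classify_pair(a, other))
```

The pairs labeled `review` are the ones where a rule alone cannot tell a data entry error from two different people, which is where human judgment comes in.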
The research team generated mock data with many of these issues and had 104 participants review data through a computer interface and answer questions related to record linkage. The interface they built hid varying degrees of PII and used symbols to indicate possible data problems such as mismatched names or transposed birth years. The data were shown unaltered (baseline), with all information visible along with symbols to indicate data issues (full), with moderate amounts of PII visible (moderate), with small amounts of data revealed (low) and with all information concealed (masked). The three categories with hidden data also used symbols.
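The idea of hiding PII while still signaling data problems can be sketched in a few lines. The example below is only an assumption about how such a display could work, not the interface built for the study: it masks every character on which two values agree, reveals only the positions that differ, and attaches a simple symbol describing the kind of discrepancy. The function names and the symbols (`=`, `~`, `!=`) are hypothetical.

```python
def mask_pair(a: str, b: str, mask: str = "*") -> tuple:
    """Hide characters that agree in both values; show only positions that differ."""
    width = max(len(a), len(b))
    # Pad the shorter value so the two can be compared position by position.
    a, b = a.ljust(width), b.ljust(width)
    shown_a = "".join(ca if ca != cb else mask for ca, cb in zip(a, b))
    shown_b = "".join(cb if ca != cb else mask for ca, cb in zip(a, b))
    return shown_a, shown_b


def issue_symbol(a: str, b: str) -> str:
    """Return a symbol describing how two field values relate.

    '='  identical values
    '~'  same characters in a different order (possible transposition)
    '!=' otherwise different
    """
    if a == b:
        return "="
    if sorted(a) == sorted(b):
        return "~"
    return "!="


# Mock birth dates, invented for this example.
left, right = "1984-03-12", "1948-03-12"
print(mask_pair(left, right), issue_symbol(left, right))
# prints ('**84******', '**48******') ~
```

In this toy display a reviewer sees only the two swapped digits and a hint that the values may be a transposition, while the rest of the birth date stays hidden.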
Kum and Ragan found that participants in the full and baseline scenarios identified record linking issues with 85 percent accuracy. In the moderate case, where participants were shown only 30 percent of the full data, accuracy was almost identical, showing that balancing performance and privacy protection is possible. The low (7 percent of full data) and masked (fully de-identified data) cases had lower accuracy, indicating that there is a point where concealing data is detrimental to the linking process. However, at around 75 percent accuracy, even fully de-identified data viewed through the proposed interface would allow a significant improvement over automated processes and would be useful in applications where all PII must be hidden by law.
These findings indicate that it is possible to balance data system performance and accuracy while protecting privacy and that a carefully designed interface that can hide information and point out data issues can help. How such efforts will work with other types of information and with processes like showing probability information or cleaning up data are avenues to explore in future research. Different information types and data processing work call for different measures.
“This research points the way toward concrete privacy protection mechanisms via accountable, transparent use of data for all of us as our PII zips around countless computers and networks,” Kum said.