C. Marc Taylor is vice chair of the UK Research Integrity Office, a charity with a wide remit in support of the good conduct of research in Britain’s universities. Taylor drew on a range of experiences to discuss some of the difficult questions he’s encountered as people start to apply ethical principles to research involving bulk data – practical questions as much as theoretical ones. Though he is not an ethicist, he has long experience of health research practice: he is chair of ISRCTN, a register of clinical trials, and a member of the Confidentiality Advisory Group at the Health Research Authority. Until 2011, he was head of R&D systems and governance in the Department of Health’s Research and Development Directorate.

For any given dataset, whether it is apparently anonymous or has been gathered with consent, Taylor’s first question is: which ethical standard should apply?

Datasets pose a number of ethical and legal challenges. They may be personal, identifiable, sensitive, or confidential; they may have the potential to cause embarrassment, offence, or harm to the subjects; or they may be some combination of these. The category of data that is both sensitive and identifiable is bigger than most people think.

Ethical principles overlap but don’t always coincide with the legal tests. The ethical position is, however, fairly well reflected in the General Data Protection Regulation, which will take force in the EU – including the UK – in May 2018. It is also reflected in the data protection bill going through the UK Parliament at the end of 2017. This body of law provides special protection for numerous categories: ethnic origin, political opinions, religious and philosophical beliefs, genetic and biometric data, health, sex life, sexual orientation, and so on. All of these merit particular care according to the regulation, because they could make the data subject vulnerable. The same thought process runs through the tradition of medical ethics, which was designed to protect individuals and which calls for an assessment of the balance between public benefit and the potential to infringe upon people’s rights and well-being.

Many complications arise. For example, a dataset may be accessible now, but it may have been gathered in – and diverted from – a setting where individuals thought they were speaking in confidence to someone who wouldn’t share what they said without permission. Such issues become quite problematic when a researcher tries to find new things by linking several large datasets together. Deciding whether the provenance is respectable in such cases is essential and requires expertise. There’s an added complication in deciding whose rights we are trying to protect: just the 60 million people in the UK, or also the nationals of other countries whose data protection law might allow transfers.

Legally, the question of identifiability is quite demanding. Provisions in Recital 26 of the GDPR, the Human Rights Act, and other legislation focus on whether pseudonymisation – a long-established and widely used technique – can be broken, either by finding the key or by matching the dataset against another that could make the data subjects re-identifiable.
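To make the two failure modes concrete, here is a minimal illustrative sketch in Python. All names and records are hypothetical: direct identifiers are replaced with a keyed hash (one common form of pseudonymisation), yet a simple linkage attack on the remaining quasi-identifiers re-identifies a subject without ever touching the key.

```python
import hmac
import hashlib

# Hypothetical secret held by the data controller; anyone who finds it
# can reverse the pseudonymisation by hashing candidate identifiers.
SECRET_KEY = b"held-by-the-data-controller"

def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()

# A "pseudonymised" research dataset: the direct identifier is gone, but
# quasi-identifiers (postcode, birth year) are retained for analysis.
research_rows = [
    {"pid": pseudonymise("patient-001"), "postcode": "CV4 7AL", "birth_year": 1971},
]

# A second, public dataset containing the same quasi-identifiers.
public_rows = [
    {"name": "A. Smith", "postcode": "CV4 7AL", "birth_year": 1971},
]

# Linkage attack: match on quasi-identifiers alone -- no key required.
matches = [
    (p["name"], r["pid"])
    for r in research_rows
    for p in public_rows
    if (p["postcode"], p["birth_year"]) == (r["postcode"], r["birth_year"])
]
print(matches)  # the pseudonymised record is linked back to "A. Smith"
```

The point of the sketch is that removing names is not the same as anonymisation: as long as enough quasi-identifiers survive, matching against another dataset can make the subjects re-identifiable, which is exactly the test Recital 26 asks data controllers to consider.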

In medical research, it’s clear that it’s not enough to want to use a dataset for a beneficial purpose; it has to be looked after carefully to ensure that the researcher does not, in effect, do the groundwork that enables someone else to put the data to a purpose that’s not in the interests of the original data subjects. The point of drawing on large, open source datasets is to be able to link them together to find new insights, but the law demands careful attention to deciding what baseline security arrangements to make and what governance processes to put in place.

Another difficult question: how do we distinguish between datasets that were always intended to be for research purposes and those gathered for a different reason; and then ensure that the latter have appropriate consent for the new use? Guideline 12 of the Council for International Organisations of Medical Sciences’ ethical guidelines for health-related research involving humans attempts to answer this.

As an example – a scenario that seems deceptively straightforward – say you have a dataset that has consent for some known uses and you have a way of going back to the individuals, albeit with difficulty, to find out whether that consent covered your new intended use of the data. Can you judge whether it would be within their reasonable expectations to match up that data with details about – for example – their Facebook usage?

Even when the answer seems to be yes, the situation can be problematic; difficulties arise when researchers have been too precise about consent in the past. This is a problem that has frequently sent inappropriate-looking cases to the Confidentiality Advisory Group. In a typical case, a researcher wants to link research datasets in order to do large-scale epidemiology – but the datasets were created long ago with firm promises that the data would never be shared outside a certain number of named organisations. Nowadays, in the health sector the best linkage is often done at NHS Digital, which has the right to do it under the law. However, in longitudinal studies the consent may well have been given before anyone had ever heard of NHS Digital, even under its previous legal title, the Health and Social Care Information Centre. The result is a struggle to figure out whether the researchers have the permission – or the spirit thereof – of the original data subjects to link the data in a way never thought of 20 years ago.

And those are the problems that are well-trodden ground. There are many guidelines about looking after research datasets and the governance you need in order to be able to figure out whether you can use the dataset for a given purpose, including ethical review of a research database.

A more complicated issue is the question of new uses of the data. Research datasets were often gathered for a completely different purpose than the one now in view. In that case, you know you don’t have consent and you are falling back on the reasons to agree that reusing that data is in the public interest and your use will not infringe people’s rights and welfare.

There are a number of questions to ask if you’re going to start linking a known dataset with a number of others that are not properly curated. NHS Digital – which got its new name after the scandal over its sharing of data with insurance companies for commercial purposes – has undergone reforms that now make it a reasonably safe custodian for linkage. But outside of that setting it’s hard to be as confident that people are thinking carefully about the provenance of datasets and the legitimacy of linking them.

An example is the Department for Education, which holds a lot of information about pupils and the workforce. Many of the DfE’s safeguards were invented before the world of big data was imagined. The relevant regulations – the Education (Individual Pupil Information) (Prescribed Persons) (England) Regulations – were laid in 2009. The presumption then was that data subjects can be protected by being specific about the purposes of gathering their data and by creating a list of trusted bodies with whom the datasets can be shared. The possibility of deep linkage, which might expose information about those children and their families, was never contemplated. Linking social security data with information about educational attainment to investigate social disadvantage, or comparing attainment data with information about sexually transmitted diseases, would produce useful insights – but the individuals concerned would feel deeply offended and regard it as a massive breach of privacy if the information were misused.

The last of Taylor’s difficult questions surrounded the nature of informed consent. Researchers are on shaky ground if they rely on the kind of consent collected by apps and websites when they ask users to tick a box indicating general consent to their data policies. In that situation, users believe that they are giving their data for a specific purpose. GDPR tries to solve this by demanding that consent be given by an explicit or affirmative action demonstrating that the user has understood what the data will be used for; ensuring that it is then used for that purpose is the responsibility of a specific data controller. This is in contrast to the approach we find in the ethical statements of good practice mentioned above. These carry over the medical model into other areas and dictate that we should follow a process ensuring that somebody has adequately understood the material facts and has decided – or refused – without coercion, undue influence, or deception to go along with the research activity in order to secure the benefits it offers to the data subject or others. In that context, consent is viewed as a continuing process, not a discrete event.

There is already a lot of discussion about how this should be applied to data. This is typically on the presumption that it’s not going to be as intrusive as experimentally cutting a body open with a knife or trying out some new medicine. Nonetheless, it’s potentially quite intrusive. The Health Research Authority talks about a proportionate approach balancing risks and benefits, but it still emphasises strongly that there should be succinct, relevant, truthful information available to the person to unpack the risks and benefits they’re signing up to when they consent to a particular use.

An important aspect – and one that’s harder to grasp for people who have been through further or higher education – is that there is a presumption in law that pretty much anybody can understand the available choices and is free to make a judgement, even one that’s not really in their best interests, provided they are given a proper explanation. The law places a responsibility on the person offering the choice to put the issues in language an ordinary person can understand. That’s quite difficult to do with data mining. The research could lead to something useful, but it may also combine individual information in a way that feels like a massive invasion of privacy.

In health and social care, we have a managed national system of research ethics committees whose job it is to look at these issues for their sector. This system provides mechanisms for the committees to discuss with one another whether they’re applying similar standards. But Taylor doesn’t understand how to apply these tests reliably to a situation in which the scope for mining big data is moving so fast; or what system can be relied on in other sectors of life. The harms are quite widespread. People may think of them as being related primarily to their health data, but there are many other sensitivities that could damage some individuals’ interests.

Taylor felt that existing alternatives to consent are not particularly helpful in this context. The ethical and legal exceptions for overriding confidentiality include:

  • Vital interests. This exception typically applies to the security services, and Taylor wondered what ethical test they apply and how these issues look to anyone viewing them from outside this area of the law.
  • Some uses in the public interest require datasets covering 100% of the population. There are exceptions from consent for cancer and other disease registries and public health measures. These exceptions don’t help solve issues regarding extensive linkage with wider datasets.
  • Mental capacity.

Taylor’s main issue was therefore the need to inform people better and more consistently about what’s being done with their personal information in the course of research on large data sets. The general responsibility to do so already exists, and the law will define particular responsibilities where it allows an exception to consent.

However, Taylor can’t find a consensus on the kinds of explanation that should be offered to the general public, where to put it, who’s responsible for coming up with an understandable version, and even (for the more internet-savvy) how to find good explanations of the technical and organisational measures that are intended to protect anonymised data sets from being disclosive. Researchers need to be seen to be virtuous as well as actually being careful.

Taylor’s conclusion is that research on bulk data is running ahead of the public understanding and that it’s a challenge to know how to extend established ethical thinking so it builds public trust that the research is well governed. His interest has been to understand how to have practical and proportionate arrangements that the public can understand, because the big issue is loss of trust.

A questioner asked how to implement accountability for what is being done with data, particularly with respect to the security agencies. Taylor argued that although data science work is fairly impenetrable, other fields with the same problem have rules requiring scrutiny by independent experts to ensure well-informed challenge. A commenter added that in this case the public can’t see when data mining stops being proportionate. Taylor suggested that a key issue is that the law is far from accessible to an average person: at present it’s so convoluted and impenetrable that there is no way to tell whether a problem is being dealt with responsibly. It would be more reassuring if the judges who oversee cases such as analysis for security purposes were recognised for their experience of data protection and the confidentiality issues they have to deal with.

A commenter noted that an important part of accountability is a named individual who takes responsibility when something goes wrong. In addition, the UK has a well-developed concept of professional codes and professional self-regulation, but currently there is no obvious profession of “data scientist”, though there may be in a few years. The commenter felt it would help if the code of conduct for data science is not based on a European directive, which no one will read; instead, it should be framed as professional responsibility with the penalty of being struck off if someone does the wrong thing. This is something people already understand. Taylor compared the current situation with data mining to the “leaky regulatory hose” that led to the Grenfell fire and its aftermath, which sees everyone blaming each other.

This talk was presented at the November 24, 2017 RISCS meeting, “Ethics, Cybersecurity, and Data Science”, hosted by the Interdisciplinary Ethics Research Group at the University of Warwick.

Wendy M. Grossman

Freelance writer specializing in computers, freedom, and privacy. For RISCS, I write blog posts and meeting and talk summaries.