Trust Is Not a Contract

The UK Biobank breach is a 500,000-person case study in the difference between a contract and an architecture, and a warning for how the European Health Data Space gets built

What the UK Biobank breach means for the future of health data sharing in Europe

Last week, de-identified health data relating to UK Biobank's 500,000 volunteers was advertised for sale on Alibaba. Three academic institutions, all approved researchers operating under signed contracts, appear to have been the source of the exposed data after gaining legitimate access to the dataset. The listings were removed, and no purchases are believed to have been made. The institutions had their access revoked. And the UK Technology Minister stood up in the House of Commons to explain what had gone wrong.

The instinctive reaction is to treat this as a cybersecurity story. Rogue researchers, lax enforcement, a platform that allowed bulk downloads when it shouldn't have. All true. But focusing on the mechanics of the breach misses the deeper problem, which is that this was not simply a failure of security. It was a failure of architecture.

The 198th time

Professor Luc Rocher at the Oxford Internet Institute pointed out that this was the 198th known exposure of UK Biobank data since last summer. Not the first. The 198th. Researchers have repeatedly and accidentally uploaded datasets to code-sharing platforms. Some of those files are still available online today. The Guardian demonstrated last month that a single participant could be identified from just two easily known facts, despite the data being nominally de-identified.

When asked directly, UK Biobank's chief executive Sir Rory Collins acknowledged it was "impossible" to entirely rule out that people could be identified using de-identified data and other information. That admission, from the custodian of the dataset itself, is more significant than any external legal opinion. It confirms that de-identification is not a guarantee. It is a reduction of risk, not an elimination of it.

This is a distinction that matters and is often misunderstood. De-identification removes obvious identifiers such as names, addresses, and NHS numbers. Anonymisation goes further, rendering it practically impossible to identify an individual even when combining the data with other available information. The GDPR already recognises the difference. Pseudonymised data, which can still be attributed to a person with the help of additional information, is still personal data. Truly anonymised data falls outside the regulation entirely. But the practical governance of secondary use tends to treat de-identification as if it were a finish line rather than a speed bump. In an environment where health, demographic, social, and leaked datasets can be combined, the risk is not one field in isolation. It is the mosaic created by many fields together. The UK Biobank breach shows what happens when the finish-line assumption meets reality.
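The mosaic effect can be made concrete with a few lines of Python. What follows is a minimal sketch, not a description of how UK Biobank or anyone else assesses risk: the records, the field names, and the helper functions `smallest_group` and `unique_records` are all invented for illustration. It counts how many records share each combination of fields, which is the idea behind k-anonymity.

```python
from collections import Counter

# Hypothetical toy records. No names, addresses, or NHS numbers appear,
# so the data would count as "de-identified".
RECORDS = [
    {"birth_year": 1956, "sex": "F", "procedure_month": "2011-03"},
    {"birth_year": 1956, "sex": "F", "procedure_month": "2014-07"},
    {"birth_year": 1956, "sex": "M", "procedure_month": "2011-03"},
    {"birth_year": 1956, "sex": "M", "procedure_month": "2011-03"},
    {"birth_year": 1961, "sex": "F", "procedure_month": "2011-03"},
    {"birth_year": 1961, "sex": "F", "procedure_month": "2011-03"},
]


def _group_sizes(records, fields):
    """Count how many records share each combination of the given fields."""
    return Counter(tuple(r[f] for f in fields) for r in records)


def smallest_group(records, fields):
    """Size of the smallest group sharing the same field values (the k in k-anonymity)."""
    return min(_group_sizes(records, fields).values())


def unique_records(records, fields):
    """Records that are the only one with their combination of field values."""
    sizes = _group_sizes(records, fields)
    return [r for r in records if sizes[tuple(r[f] for f in fields)] == 1]


# Two fields: every record hides in a group of at least two.
print(smallest_group(RECORDS, ["birth_year", "sex"]))
# Add one more field and some records become one of a kind.
print(smallest_group(RECORDS, ["birth_year", "sex", "procedure_month"]))
print(len(unique_records(RECORDS, ["birth_year", "sex", "procedure_month"])))
```

No single field here identifies anyone. The combination does, and anyone who already knows those facts about a person can pick out the matching row.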

UK Biobank's response each time has followed the same pattern. Reassure participants that the data was de-identified. Note that no one has been "unwillingly identified." Promise better training and tighter controls. But the fundamental question has never been about training or controls. It has been about whether the architecture itself permits the thing you are trying to prevent.

UK Biobank operated on a model where approved researchers could download bulk datasets to their own machines. It moved to a cloud-based research platform in 2024, but legacy access persisted, and even the new platform still allowed researchers to export derived data. The result was predictable. If data can be downloaded, data will be downloaded. If it can be listed for sale, eventually it will be. The contract said don't. The architecture said you can. The architecture won.

What this means for the European Health Data Space

The timing could not be worse for the European Health Data Space. The EHDS regulation, which creates a framework for the secondary use of health data across EU member states, entered into force in March 2025 and is now moving from legislation into implementation. Member states are standing up Health Data Access Bodies. Data permit frameworks are being designed. Secure processing environments are being specified. One of the central operational assumptions is that de-identified health data can be made available safely across borders, institutions, and jurisdictions through pseudonymisation, contractual controls, and approved access.

The UK Biobank breach challenges every layer of that assumption.

Contractual controls are necessary but insufficient. UK Biobank had contracts, mandatory training, and compliance requirements, and more than 22,000 researchers in over 60 countries were working with its data. Three institutions appear to have breached those agreements. The EHDS will face the same challenge at continental scale. Data permits issued by Health Data Access Bodies will carry legal obligations. But legal obligations only work after the fact. They tell you what to do once something has gone wrong. They do not prevent the thing from going wrong in the first place.

Secure processing environments need to be genuinely secure. The EHDS regulation envisions that secondary use of health data should happen within secure processing environments where researchers can analyse data without extracting it. That is the right instinct. But the UK Biobank experience shows that "secure environment" can mean very different things in practice. If the environment allows data to be downloaded to a local machine, it is not a secure processing environment. It is a data distribution platform with access controls. The difference matters. The EHDS must define secure processing architecturally, not just procedurally.
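One way to picture an architectural definition is as a test on what the environment can technically do, rather than on what its users have promised. The sketch below is an assumption-laden illustration, not anything drawn from the EHDS text: the `EnvironmentCapabilities` fields and the `classify` function are hypothetical names chosen to mirror the failure modes described in this article.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EnvironmentCapabilities:
    """Hypothetical description of what an environment technically allows."""
    bulk_download: bool        # can row-level data be copied to a local machine?
    legacy_access_paths: bool  # do older download routes still work?
    unchecked_export: bool     # can outputs leave without review?


def classify(env):
    """Label an environment by its open egress paths, not by its contracts."""
    open_paths = [name for name, allowed in asdict(env).items() if allowed]
    if open_paths:
        return "data distribution platform with access controls", open_paths
    return "secure processing environment", open_paths


# A cloud platform that still lets derived data out without review.
label, paths = classify(EnvironmentCapabilities(
    bulk_download=False, legacy_access_paths=True, unchecked_export=True))
print(label, paths)
```

The point of the design is that training records and signed agreements do not appear anywhere in the input. Only capabilities do.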

The case for federated infrastructure

There is an alternative architecture that addresses these problems. In a federated model, the data never leaves the data controller's infrastructure. The computation travels to the data, not the other way around. Researchers submit queries or models. The infrastructure runs them. Results are returned through an output checking process that prevents raw data from leaving the environment. No bulk downloads. No local copies. No datasets on researcher laptops waiting to be accidentally uploaded to GitHub or deliberately listed on Alibaba.
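The flow described above can be sketched in a few lines. This is a toy model under stated assumptions, not a real Trusted Research Environment: `DataNode`, `APPROVED_OPERATIONS`, `check_output`, and the threshold `MIN_CELL_COUNT` are all hypothetical, and small-count suppression stands in for what would in practice be a much fuller output checking process. In a real deployment the boundary would be a network and process boundary, not a Python class.

```python
from collections import Counter

MIN_CELL_COUNT = 5  # hypothetical disclosure threshold for this sketch


def count_by(rows, field):
    """Aggregate query: number of participants per value of a field."""
    return dict(Counter(row[field] for row in rows))


# Only operations on this allow-list can run. There is no "export rows" entry.
APPROVED_OPERATIONS = {"count_by": count_by}


def check_output(counts):
    """Output check: suppress any group small enough to point at individuals."""
    return {group: (n if n >= MIN_CELL_COUNT else None)
            for group, n in counts.items()}


class DataNode:
    """Stands in for the data controller's infrastructure."""

    def __init__(self, rows):
        self._rows = rows  # participant-level data stays on the node

    def run(self, operation, field):
        if operation not in APPROVED_OPERATIONS:
            raise PermissionError(f"operation not approved: {operation}")
        raw_result = APPROVED_OPERATIONS[operation](self._rows, field)
        return check_output(raw_result)  # only checked aggregates are returned


# The researcher sends a query and gets back a checked aggregate.
node = DataNode([{"smoker": "no"}] * 12 + [{"smoker": "yes"}] * 3)
result = node.run("count_by", "smoker")
print(result)  # {'no': 12, 'yes': None}
```

A request for anything outside the allow-list, such as a row-level export, raises `PermissionError` instead of relying on the researcher to have read the permit.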

This is not a theoretical concept. Federated Trusted Research Environments are already being deployed in national health programmes. Done properly, this kind of architecture aligns with the Five Safes framework: safe people, safe projects, safe settings, safe data, and safe outputs. A platform that permits bulk extraction, uncontrolled local copies, or weak output checking may look like a secure environment on paper, but in practice it fails the spirit of the Five Safes.

But secure architecture cannot be a research prison. If the environment is slow, restrictive, or missing the tools researchers need, users will look for workarounds. The challenge for EHDS is not just to stop data leaving the room. It is to make the room good enough that researchers do not need to leave it.

The EHDS has an opportunity to get this right. The regulation already points in the direction of secure processing. But "secure processing" needs to be defined architecturally, not just contractually. A data permit that says "you may not extract data" is a contract. A system where raw participant-level data extraction is technically blocked is an architecture. A contract is not an architecture. The UK Biobank breach is a 500,000-person case study in the difference between the two.

The custodianship question

Behind all of this sits a more fundamental question about who should be the custodian of health data and what custodianship actually means.

UK Biobank is a charity. It collected data from willing volunteers with explicit consent for medical research. It operated in good faith. It is not the villain in this story. But the breach reveals that custodianship is not just about intent. It is about infrastructure, governance, and the architectural decisions that determine what is possible within a system, not just what is permitted.

As the EHDS takes shape across Europe, every member state will need to make decisions about data custodianship. Who holds the data. Where it is processed. How access is mediated. Whether researchers interact with data in situ or download copies to their own environments. These are not bureaucratic questions. They are architectural ones. And the architecture you choose determines the trust you can actually deliver, as opposed to the trust you merely promise.

The UK Biobank breach did not happen because people were untrustworthy. It happened because the system was designed in a way that assumed trust would be enough. Trust is essential. But trust is not a contract. And a contract is not an architecture.

The organisations that get this right will be the ones that build systems where the question of whether a researcher might misuse data is made irrelevant by the fact that the data never leaves the room.