Anonymization by decentralization? The case of COVID-19 contact tracing apps

In the debate on contact tracing apps used in the fight against COVID-19 (COVID-19 apps), an increasing number of experts are resorting to the notion of “decentralization” as an essential component of “privacy preserving” software. Although the centralized versus decentralized dichotomy is largely technical, it also has significant implications under EU data protection law. While both the European Commission (EC) and the European Data Protection Board (EDPB) consider the decentralized approach to be “more in line with the [data] minimization principle”, they have not per se rejected the idea of a centralized solution. This contribution provides a brief overview of the (technical) origins of the centralized versus decentralized debate, the role of decentralization in so-called privacy preserving technologies and some of its alleged benefits under the General Data Protection Regulation (GDPR).

From manual to digital contact tracing

Contact tracing, along with other measures such as social distancing and quarantine, has long been used to control the spread of infectious diseases. Traditionally it was performed manually, i.e. by interviewing patients diagnosed positive in order to assemble an interaction graph. It is now increasingly carried out with the support of digital tools, ranging from the use of geolocation data to the analysis of mobile traffic information. In the COVID-19 crisis, many European countries have resorted to Bluetooth Low Energy (BLE) technology, which is arguably less privacy-invasive than the use of location data.

BLE-based COVID-19 apps rely on the emission and reception of ephemeral identifiers (EphIDs). Broadly speaking, they work as follows. When two individuals cross each other’s path, both apps (i) broadcast their own EphIDs and (ii) record the EphIDs of nearby app users. If an app user becomes infected with COVID-19, he/she can provide the operator of the app with information about the fact that he/she is infected and about his/her recent encounters. That information is then used to (i) calculate the risk that someone has been infected following an encounter with an infected user and (ii) should that risk reach a certain threshold, inform that person of the procedure to follow.
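To make this broadcast-record-notify cycle concrete, the following is a minimal, purely illustrative Python sketch. All names (Encounter, ContactTracingApp, RISK_THRESHOLD) and the simplistic duration-times-proximity risk formula are assumptions made for illustration only; actual apps rely on calibrated epidemiological models and BLE signal attenuation data.

    from dataclasses import dataclass, field

    RISK_THRESHOLD = 1.0  # hypothetical cut-off above which a user is notified

    @dataclass
    class Encounter:
        ephid: bytes         # ephemeral identifier observed over BLE
        duration_min: float  # how long the two devices stayed in range
        proximity: float     # illustrative closeness estimate between 0 and 1

    @dataclass
    class ContactTracingApp:
        own_ephids: list = field(default_factory=list)  # (i) identifiers we broadcast
        observed: list = field(default_factory=list)    # (ii) identifiers we received

        def broadcast(self, ephid: bytes):
            # Emit an ephemeral identifier over BLE and remember it locally.
            self.own_ephids.append(ephid)

        def record(self, encounter: Encounter):
            # Store an EphID picked up from a nearby device.
            self.observed.append(encounter)

        def risk_score(self, infected_ephids: set) -> float:
            # Crude, illustrative risk estimate: weight matching encounters
            # by duration and proximity.
            return sum(e.duration_min * e.proximity
                       for e in self.observed if e.ephid in infected_ephids)

        def check_and_notify(self, infected_ephids: set) -> bool:
            # Notify the user if the aggregated risk crosses the threshold.
            return self.risk_score(infected_ephids) >= RISK_THRESHOLD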

Decentralized versus centralized software systems

In the context of BLE contact tracing apps, the key concern lies in the distinction between centralized and decentralized software systems. From a physical perspective, centralized systems generally rely on a single server as a core distribution center in the communication with the end-users. From a functional perspective, they often imply that each entity involved in the software system bears a specific, distinct role (e.g. when streaming a movie on Netflix, Netflix’s central server provides the video; the user consumes it). This also means that, if the central server is compromised, so is the entire network (i.e. there is a single point of failure). Decentralized systems eliminate the need for a central entity and rely on multiple servers or end-user devices (“peers”) to cooperate and undertake a given task. Each of those entities is tasked with a (nearly) identical role (e.g. in the case of peer-to-peer file sharing, each entity participates in both the upload and download of files). Decentralized systems are by no means a novel concept. There is a vast array of use-cases implementing them, with blockchain technology probably being the most popular one. Decentralization can be used as a form of Privacy Enhancing Technology (PET). As pointed out by De Filippi, most privacy preserving decentralized systems focus on ensuring either the confidentiality and anonymity of personal data (such as “TOR”) or user control over his/her personal data (such as the MIT project Enigma).

In the context of COVID-19 apps, and as outlined in the DP-3T protocol, the distinction between decentralized and centralized digital contact tracing lies in the following aspects. First, the pseudo-random EphIDs of each app user are generated by the user’s phone on the basis of his/her own secret key, rather than by the backend server on the basis of a permanent pseudo-identifier unique to each app user. This prevents the operator of the backend server from being able to revert every EphID the user’s phone created back to its permanent identifier and thus from (potentially) associating every observation with an identifiable (although not identified) individual. Second, the data that an infected user shares with the backend server are limited to the EphIDs he/she broadcast during the infectious time window, rather than the EphIDs observed during that time frame. Third, the calculation of the risk score occurs on the user’s phone, rather than on the backend server.
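These three design choices can be illustrated with a short, simplified Python sketch. The HMAC-based derivation of EphIDs from a locally stored secret key is a stand-in chosen for readability; the actual DP-3T specification uses its own key schedule and cryptographic constructions.

    import hashlib
    import hmac
    import secrets

    class DecentralizedClient:
        def __init__(self):
            self.secret_key = secrets.token_bytes(32)  # generated and kept on the phone
            self.observed_ephids = set()               # EphIDs heard over BLE; never uploaded

        def ephid(self, epoch: int) -> bytes:
            # (1) EphIDs are derived locally from the user's own secret key,
            #     not assigned by the backend from a permanent identifier.
            return hmac.new(self.secret_key, epoch.to_bytes(4, "big"),
                            hashlib.sha256).digest()[:16]

        def upload_if_infected(self, infectious_epochs: range) -> list:
            # (2) Only the EphIDs the user broadcast during the infectious
            #     window are shared with the backend server.
            return [self.ephid(e) for e in infectious_epochs]

        def local_risk_check(self, published_infected_ephids: set) -> bool:
            # (3) Matching against the published list of infected EphIDs,
            #     and hence the risk calculation, happens on the phone itself.
            return bool(self.observed_ephids & published_infected_ephids)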

The implications of decentralization (and centralization) under the GDPR

In certain instances, decentralized systems could facilitate compliance with the principles of data minimization and purpose limitation. These principles find their technical equivalent in so-called “privacy by design strategies” (i.e. strategies aimed at embedding data protection principles in the design of the technology) to “minimize” and “separate” personal data processing. The first strategy aims to minimize the amount of personal data processed; the second aims to make it harder to “combine or correlate data”. As illustrated by the DP-3T consortium (an international group of academics that developed the open source DP-3T protocol for decentralized, privacy-preserving contact tracing), one way to show how a decentralized approach can facilitate compliance with these principles is to compare the (potentially personal) data to which the backend server has access under the two scenarios.

Since the backend server can normally not link the infected person’s EphIDs to an identified or identifiable natural person, the DP-3T members cautiously qualify the data processed by the backend server in a decentralized solution as “nearly anonymous” (see below). By contrast, they consider the data processed by the backend server in a centralized scenario as pseudonymous, as the server would be able to link the app users’ EphIDs to identifiable individuals on the basis of their permanent identifier. The decentralized approach shows that it is possible to achieve the primary purpose of the app (i.e. to notify contacts at risk and give guidance on next steps) by providing the backend server with less (personal) data (i.e. only the EphIDs broadcast by infected users, as opposed to the observed EphIDs and each app user’s permanent identifier as well). Moreover, in a decentralized scenario, computations on personal data are decoupled as much as possible from the need to transfer these data from their original source (the user’s phone) to a third party (the backend server), since the risk calculation occurs on the user’s phone rather than on the backend server. The decentralized approach hence appears to be more in line with the principle of data minimization.
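One way to visualise this difference in the data reaching the backend server is to compare what an infected user’s upload would (roughly) contain under each design. The field names in the sketch below are hypothetical and only mirror the comparison made above; they are not taken from any actual specification.

    from typing import List, TypedDict

    class CentralizedUpload(TypedDict):
        # Hypothetical payload a centralized backend would receive and hold:
        permanent_id: str              # pseudo-identifier unique to the uploading user
        observed_ephids: List[bytes]   # EphIDs of the people the infected user encountered

    class DecentralizedUpload(TypedDict):
        # Hypothetical payload a decentralized backend receives:
        broadcast_ephids: List[bytes]  # only the infected user's own EphIDs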

The fact that the backend server holds (potentially personal) data of all app users (i.e. the permanent identifiers) could, if the backend server is not trusted, also increase the likelihood of such data being re-purposed without the data subject’s knowledge. This would violate the principle of purpose limitation. The authors of the DP-3T protocol indeed explain that, by design, the centralized solution allows the backend server to reconstruct the network of people with whom the infected user interacted. Moreover, the DP-3T members point out that, as every newly infected user uploads his/her contact history and the volume of contact histories to which the backend server has access grows, the server could learn information about interactions between non-infected users. In a decentralized approach, by comparison, “the system does not reveal any information about the interaction between two users to any entity other than the two users themselves”. Some have argued that centralized solutions offer the advantage of enabling epidemiologists to gain more insight into the spread of the disease. From a data protection perspective, however, gaining insights into the spread of the disease is a different purpose than notifying people at risk of infection. Such a purpose does not necessarily justify sharing an infected user’s interaction data with the backend server, which is what happens under a centralized solution. The only people who would arguably need access to such data to learn about the spread of the disease are epidemiologists. The DP-3T protocol, which gives users who are at risk the option to voluntarily share their contact events with infected users, shows that a decentralized solution can also support the option to gain insights into the spread of the disease.

Decentralization as a safeguard, not a way out

The GDPR materially applies to the “processing of personal data wholly or partly by automated means […]” (emphasis added) (article 2.1 GDPR). As illustrated above, decentralization could limit the type and quantity of personal data processed, but it cannot, by itself, change the nature of such data from personal to anonymous, the latter falling outside the scope of the GDPR. This is especially true in light of the broad and zero-risk definition of personal data that has often been put forward by data protection authorities (see notably the non-binding Opinion 05/2014 of the Article 29 Working Party, now the EDPB, on Anonymization Techniques). Some statements of that Opinion, which have so far not been explicitly retracted, indeed indicate that, in order for data to be anonymous, the anonymization should “irreversibly prevent (re-)identification of the data subject”. Under this approach, if anybody is in theory able to re-identify the data subject, the data would be considered personal, regardless of the effort (in terms of e.g. costs, time and expertise) such re-identification would require. It must be acknowledged that such a zero-risk approach does not appear to be in line with the text of the GDPR and the most recent and authoritative European Court of Justice (ECJ) judgment on the matter, Breyer. Recital 26 of the GDPR indeed states that, to determine whether a person is identifiable, “account should be taken of all the means reasonably likely to be used […]” (emphasis added), taking into account objective factors such as the costs and amount of time required for identification, the available technology at the time of processing and technological developments. In Breyer, the ECJ specified that if re-identification requires the use of illegal means, these would not qualify as reasonably likely to be used. Despite the text of the GDPR and the ECJ’s Breyer judgment arguably leaning towards a more risk-based approach to personal data, there is still a great amount of uncertainty as to which approach should be followed. This inevitably affects the data protection analysis of the COVID-19 app under consideration.

As conceded by the DP-3T consortium, under the decentralized solution the infected user’s EphIDs are not completely immune to re-identification attacks. For instance, as clarified in the Data Protection Impact Assessment of the DP-3T protocol, the backend server would be able to re-identify the infected user by storing and processing traffic information about the upload. It follows that, if one takes the above-mentioned zero-risk approach to personal data, the backend server could be deemed to process personal (albeit pseudonymous) data when receiving and sending the infected users’ EphIDs. According to the DP-3T members, this is unlikely to be the case under a risk-based approach to personal data, because the actions the backend server would have to take in order to re-identify the infected user would contravene the DP-3T protocol and data protection laws. While this assessment strictly follows the lines of Breyer, it nonetheless limits its reasoning mainly to the lawfulness of the re-identification means, rather than their likelihood in terms of other factors such as the costs and time required to re-identify the user.

In their privacy and security risk evaluation of digital proximity tracing systems, the DP-3T members report that, under a centralized solution, infected users would be vulnerable to the same re-identification attacks as in a decentralized scenario. At the same time, they nonetheless seem to imply that the data processed by the backend server would qualify as personal data even under a risk-based approach. When pointing out that, in centralized solutions, the backend server could (be used to) trace the location of any app user over time, the DP-3T members argue that the backend server could identify each app user by reverting the EphIDs to a user’s permanent identifier and combining this identifier with other datasets, such as a “registered smart travel card or CCTV footage”. However, in light of the lawfulness criterion established in Breyer, it could be argued that re-identification by the backend server is likely to be unlawful also under a centralized solution. A meaningful conclusion as to which solution provides more anonymity to the data processed by the backend server is therefore likely to depend on an analysis of whether re-identification by the backend server is more likely to happen under a centralized or a decentralized solution because of the lower time, costs and/or expertise required. Since, in a centralized approach, the permanent identifiers are already by themselves unique to each user over time, whereas in a decentralized solution the EphIDs can become unique to a user only if the server gains access to a user’s phone, one could argue that it would probably be less time-consuming and cheaper for the server to re-identify a user (by tracing his/her location) in a centralized approach. This, however, calls for a case-by-case analysis of each implementation scenario, in close collaboration between technical and legal experts.

Conclusion

The aforementioned analysis indicates that decentralized privacy preservation can, in certain instances, facilitate compliance with the principles of data minimization and purpose limitation, as required by the GDPR. Decentralization does not, however, by itself change the nature of the data processed from personal into anonymous. Although the ultimate data protection impact assessment of a technology can only be made in a concrete factual (technical and organizational) setting, it can be useful to bear these considerations in mind when developing and deploying decentralized privacy preserving technologies.

***