From the DSA to Media Data Space: the possible solutions for the access to platforms’ data to tackle disinformation
The ability of independent researchers to access online platforms’ data is a precondition for effective platform governance, for independent oversight and for understanding how these platforms work. In particular, granular access to online platforms’ data allows researchers to carry out public interest research into platforms’ takedown decisions (see e.g. Lumen Database), ads libraries (see here) and recommender systems (see here). Access to social media platforms’ data is also key in tackling mis- and disinformation (see here). In essence, researchers need meaningful access to platforms’ data to identify organized disinformation campaigns, to understand users’ engagement with disinformation and to identify how platforms enable, facilitate, or amplify disinformation (e.g. via optimization rules or micro-targeting; see e.g. Citizen Browser). This evidence-based work – such as the recent WSJ Facebook files – critically depends on access to platform data. Yet, whereas the amount of platforms’ data is constantly growing, it has become increasingly difficult for researchers to access that data. After a short introduction to the status quo, this blog post looks into the possible solutions to enhance access to platforms’ data for research on disinformation: Article 31 of the DSA, the strengthened Code of Practice on Disinformation, an Art. 40 GDPR Code of Conduct and a Media Data Space.
The ‘post-APIcalypse’ access to platforms’ data
The lack of sufficient access to platforms’ data by researchers is due to various reasons: the privatization of data ecosystems, platforms’ lack of incentives to reveal what kind of user data they hold and how they use it, corporate secrecy about platforms’ algorithmic practices, and data protection concerns (see also here and here). As a result, researchers’ access to platforms’ data is currently governed mainly by contractual agreements, platforms’ own terms of service and public application programming interfaces (APIs). API access can be restricted or eliminated at any time and for any reason. In the aftermath of the Cambridge Analytica scandal, platforms cut back their APIs even further, which researchers have described as ‘the APIcalypse’ and a ‘post-API age’. Other self-regulatory platform initiatives such as Social Science One have been widely criticized for delays, significant limitations and ‘extremely limited scientific value’. The UN Special Rapporteur on the promotion and protection of the right to freedom of opinion and expression stressed a lack of transparency and access to data as ‘the major failings of companies across almost all the concerns in relation to disinformation and misinformation’.
The GDPR says no
Somewhat ironically, some ‘big tech’ platforms are now presenting themselves as defenders of privacy, citing the General Data Protection Regulation (GDPR) as a key obstacle that prevents them from sharing their data with researchers. The most recent example is Facebook’s shutdown of the accounts of researchers affiliated with the Cybersecurity for Democracy project at New York University (NYU), who obtained access to Facebook’s data by asking users to install a browser add-on called Ad Observer and voluntarily share anonymized data about the ads they see. Facebook claimed that it had to stop NYU’s ‘unauthorized scraping’, which violated the platform’s terms of service, to ‘protect people’s privacy’ in line with the Federal Trade Commission (FTC)’s Order. However, in a strongly worded letter, the FTC called this claim ‘inaccurate’ and an ‘insinuation’. It has been argued that the FTC order restricts how Facebook shares user information but does not preclude users from voluntarily sharing information about their experiences on the platform, including through ‘data donation’ initiatives such as browser extensions.
This is not to say that data protection concerns and users’ right to privacy are a non-issue. On the contrary, web scraping raises major legal and ethical concerns (see here) and platform datasets can be abused for harmful purposes (see the Clearview or MegaFace cases). It is true that the GDPR lacks clarity regarding whether and how platforms might share data with researchers (see here, here or here). The argument that a browser add-on also collects data about Facebook users who did not install it requires a separate analysis.
However, the GDPR is being weaponized by some platforms to prevent good-faith research in the public interest. As bluntly put by the EDPS, it would appear that the reluctance to give access to platforms’ data is motivated not so much by data protection concerns as by ‘the absence of business incentive to invest effort in disclosing or being transparent about the volume and nature of data they control.’ One thing is clear: the GDPR does not a priori prohibit the sharing of personal data by platforms with researchers. Data access can be granted in a privacy-preserving way. Hopefully, the long-awaited European Data Protection Board’s (EDPB) guidelines on the processing of personal data for scientific research purposes (due in 2021) will soon offer some clarification on how that can be done. The bottom line is, as Vermeulen sums up, ‘the fact that it requires an argument is in itself a barrier to hand over data. This uncertainty is undesirable for both researchers and the platforms.’
For these reasons, there is a clear need for a legally binding data access framework that provides independent researchers with access to a range of different types of platform data. Let’s unpack what the EU policy and regulatory initiatives have in store.
Solution #1: Article 31 DSA
At first glance, Article 31 of the Digital Services Act proposal (DSA) is all one could hope for: it provides a specific provision on data access. A closer look, however, reveals the following drawbacks (for a detailed analysis see here). First, Article 31 DSA does not provide for direct access for researchers. It is only upon a reasoned request from the Digital Services Coordinator or the European Commission (EC) that the very large online platforms (VLOPs) shall provide access to data. Second, access is limited to so-called ‘vetted researchers’. The various conditions for qualifying as a ‘vetted researcher’ ultimately narrow down its scope to university-affiliated academics. As the EU DisinfoLab points out, the disinformation community has grown beyond the realm of the university and now includes a variety of different actors: journalists, fact-checkers, digital forensics experts, open-source investigators and others. They are all excluded from the ‘vetted researchers’ definition, and hence from access to data. Third, ‘vetted researchers’ may only use platforms’ data for purposes of research into ‘systemic risks’ as defined in Article 26 DSA. Arguably, this category is broad and non-exhaustive and includes catch-all concepts such as the right to privacy and freedom of expression. Tellingly, however, Article 26 DSA mentions ‘dissemination of illegal content’ and ‘intentional manipulation of the service’ but does not refer to mis- or disinformation as a ‘systemic risk’. One can hope that research into disinformation techniques such as attention hacking, information laundering or cross-platform migrations will also fall under this definition.
Putting all the eggs in the DSA basket is therefore a risky bet. Even if Article 31 DSA gets amended according to civil society calls (see here and here), the regulation will enter into force no sooner than in 2-3 years. The adoption of the delegated act foreseen in Article 31(5) DSA specifying the conditions under which data access can take place, including technical and procedural aspects, may take even more time. Let’s see what can happen in the meantime.
Solution #2: The Strengthened Code of Practice on Disinformation
The centrepiece of EU disinformation efforts has been the self-regulatory Code of Practice on Disinformation, in force since October 2018. The EC’s Assessment of the Code of Practice in September 2020 found the lack of access to data allowing for independent research on emerging trends and threats posed by online disinformation to be ‘a fundamental shortcoming’ of the Code. The data access required to detect and analyse disinformation was seen as ‘episodic and arbitrary,’ and did ‘not respond to the full range of research needs.’ In July 2021, the Assembly of the signatories of the Code of Practice kicked off the process of drafting the strengthened Code of Practice on Disinformation in line with the EC Guidance published on 26 May 2021.
According to the EC Guidance, relevant signatories should commit to co-creating ‘a robust framework for access to data for research purposes’ which will offer ‘transparent, open and non-discriminatory, proportionate and justified’ conditions for access to data for research purposes. The framework should comprise two parallel data access regimes: i) continuous, real-time, stable and harmonised access to anonymised, aggregate or otherwise non-personal data through APIs or other open and accessible technical solutions; ii) access to ‘data requiring additional scrutiny including personal data’, which should at least allow academic researchers to have access to the datasets necessary to understand the sources, vectors, methods and propagation patterns of the disinformation phenomenon. To that end, platforms and the research community should together define the conditions applicable for access to these datasets, which ‘in principle’ should be standardized and uniform across platforms. The Guidance shows that the EC has learned its lesson: it wants to end platforms’ cherry-picking of whom they collaborate with on the basis of a series of multi-bilateral arrangements. The Code should also be complemented by ‘a robust monitoring system’: the signatories should commit to and report on concrete service-level indicators in order to evaluate the quantity and granularity of data made available, the number of research organisations having access to platforms’ data, as well as the amount of resources made available.
However promising it all sounds, the main characteristic of the Code remains unchanged: participation in the Code and subscription to its commitments remain voluntary. That said, the EC’s end game is to evolve the Code of Practice towards a Code of Conduct as foreseen in Article 35 DSA. The monitoring and assessment of the achievement of the objectives of such a Code would then fall to the European Board for Digital Services (Article 35(5) DSA). In the meantime, the EC hopes that the strengthened Code of Practice will create a framework that, in the interim before the DSA’s adoption, will already reinforce the accountability of online platforms. A first draft of the strengthened Code of Practice is expected this autumn.
Solution #3: GDPR Article 40 Code of Conduct
On 30 August 2021, the European Digital Media Observatory (EDMO) launched a Working Group on Access to Platform Data to work towards the creation of a Code of Conduct on access to platform data under Article 40 GDPR. EDMO believes that an Article 40 Code of Conduct could ‘clarify how platforms may provide access to data to independent researchers in a GDPR-compliant manner’. Under Article 40 GDPR, Codes of Conduct may also establish a monitoring body to oversee the implementation of the Code. Such codes must also be approved by a relevant data protection authority. According to the EDPB Guidelines 1/2019 on Codes of Conduct and Monitoring Bodies, an Article 40 Code of Conduct should lay out how the GDPR might be put into practice in a ‘specific, practical and precise manner’ and provide ‘unambiguous, concrete, attainable and enforceable (testable)’ standards and rules. In short, a Code should avoid being ‘overly legalistic’ and should instead provide ‘concrete case scenarios’ and ‘specific examples of best practice’. To this end, twelve members of the Working Group from academia, civil society (Access Now, Future of Privacy Forum), and ‘big tech’ (Twitter, Facebook, Google) will work together to: (i) identify the legal basis and key questions regarding how to provide privacy-compliant access to data and (ii) understand what clarity and guidance on the reach of the GDPR is required to address these questions.
EDMO’s initiative seems like a promising step towards offering researchers a clearer route to data access that is overseen and enforced by an independent monitoring body. And although the drafting of the Code will take time, bringing the key stakeholders around the same table is an accomplishment in its own right.
Solution #4: Media Data Space
In December 2020, the EC published the Media and Audiovisual Action Plan (MAAP) alongside the European Democracy Action Plan. With the MAAP, the EC’s ambition is ‘to accelerate the recovery, transformation and resilience of the media industry’ and to ‘set up integrated industrial policy for news media’. As part of this strategy, the EC wants to create a ‘European media data space’ to support media companies ‘in sharing data and developing innovative solutions’. The EC envisages the following benefits of sharing media data. First, access to audience data, content metadata and other types of user behaviour data would allow European media companies to create personalised content and promotion. Second, it would help EU news publishers to pool their content and customer data to produce news targeting their own national audiences. Third, it would provide insight to services aiming to increase the findability of media content across borders. A media data space therefore has a clear economic goal: to empower European media companies, which are at a competitive disadvantage vis-a-vis the ‘big tech’ platforms, ‘to make better decisions and deploy more advanced solutions based on insights gleaned from data’.
However, one can imagine a wider societal and political impact of such a media data space. That depends on who will have access, and to what data. According to the Data Strategy, data spaces can be used for multiple purposes (including both research and non-research purposes). The creation of a shared media data space open to researchers and fact-checkers could possibly facilitate research on platforms and disinformation phenomena. Of course, one aspect to consider is how the cross-context use of personal data, including the use of sensitive personal data for other purposes, would fit with the GDPR and the upcoming Data Governance Act (DGA) proposal.
Access to platforms’ data has many angles: from the lack of incentives and platforms’ perceived risks of what the data they share will reveal, through the lack of standards for data quality, to the potential for misuse of personal data. It is also, to some extent, a trust-building exercise to ensure a proper balance between the interests of data subjects and the shared interest of society as a whole in independent oversight and effective accountability of online platforms. A ‘quick fix’ to all of these is neither possible nor necessarily wanted. But with the recent initiatives we are finally moving from discussing data access in an abstract way towards constructive, concrete and – at least in some cases – legally binding solutions.