ICO Launches Consultation Series on Generative AI and Data Protection

The Information Commissioner’s Office (ICO), the UK’s data protection authority and enforcer of the UK General Data Protection Regulation (UK GDPR), has announced a series of consultations on how aspects of data protection law should apply to the development and use of generative AI models. The ICO is one of the key sectoral regulators the government has charged with ensuring, within their respective remits, that AI suppliers, deployers, and users adhere to the five principles outlined in the government’s white paper. (See our analysis of the white paper.)

This article explains the ICO guidance on establishing a lawful basis for collecting online personal data (or web scraping) for the purpose of training generative AI models, which will be of interest to developers and users of generative AI models, particularly those that wish to help shape the ICO guidance by responding to the consultation.

According to the ICO guidance, published on January 15:

“Generative AI refers to AI models that can create new content e.g. text, computer code, audio, music, images, and videos. Typically these models are trained on extensive datasets, which allows them to exhibit a broad range of general-purpose capabilities.

The impact of generative AI could be transformative for people and businesses if organisations develop and deploy it responsibly with the trust of the people whose data it is built on. We are moving fast to address any risks and enable organisations and the public to reap the benefits of generative AI.”

The ICO guidance responds to a number of requests for clarification made by innovators in the AI field. These questions include the following:

What is the appropriate lawful basis for training generative AI models?
How does the purpose limitation principle play out in the context of generative AI development and deployment?
What are the expectations around complying with the accuracy principle?
What are the expectations in terms of complying with data subject rights?

The ICO intends to share a series of proposed chapters interpreting the UK GDPR and Part 2 of the Data Protection Act 2018 in relation to these questions. The ICO is seeking the views of stakeholders with an interest in generative AI, to help inform its positions.

The first chapter of the ICO consultation concerns the lawful basis for web scraping to train generative AI models. This chapter sets out the background and explains the collection of training data as part of the first stage of the generative AI lifecycle, which can be divided into five steps:

Data collection (i.e., “data scraping” from publicly accessible sources)
Data pre-processing
Training and model improvement
Fine tuning
Deployment

The consultation provides that training data may be collected from a variety of online sources, such as blogs, social media, forum discussions, product reviews, and personal websites, and may include images, video, text, and contact details. To the extent such training data includes personal data, it must be processed in accordance with applicable data protection laws. In order to collect personal data for generative AI model training purposes, developers must ensure their processing is not in breach of any laws (such as intellectual property laws) and that they have a lawful basis for processing.

The first limb (that the processing must not breach any other laws) may be difficult to satisfy for developers of generative AI platforms that engage in web-based data scraping in the UK and where the AI is being developed for commercial purposes. This is because, in contrast to the EU, the UK has not implemented a general exception to copyright infringement for text and data mining; the UK exception is limited to non-commercial research. UK government proposals to create a broader exception to copyright infringement for data scraping activities in the UK were dropped in February 2023 following criticism from the creative industries. In March 2023, the UK Intellectual Property Office (UKIPO) was tasked with producing a code of practice that “will provide guidance to support AI firms to access copyrighted work as an input to their models, whilst ensuring there are protections (e.g., labelling) on generated output to support right holders of copyrighted work.” The government also tasked the UKIPO with taking forward the recommendations in relation to enforcement, in what it believes is an approach that will “ensure the creation of a balanced and pragmatic code of practice that will enable both sectors to grow in partnership.” Though the UK government set a deadline of summer 2023 for this code of practice, timelines have already slipped significantly. The government expects parties to enter into the final code of practice on a voluntary basis. If the code of practice is not adopted or agreement is not reached, legislation could be considered, a step advocated by a House of Commons interim report on the governance of AI published at the end of August 2023.

Under the UK GDPR, every processing activity must have a lawful basis, that is, a ground for processing that data protection law recognizes as lawful. In some contexts, users of data can rely on consent, compliance with a legal obligation, performance of a contract, or other bases. The ICO is of the view that the only available lawful basis for collection of personal data in the context of training data will be legitimate interests. In order to establish legitimate interests as a lawful basis, developers must pass a “three-part test,” demonstrating that:

The purpose of the processing is legitimate.
The processing is necessary for the designated purpose.
The individual’s interests do not override the interest being pursued.

The consultation draws on existing ICO guidance on the legitimate-interest basis for processing. Controllers must identify a specific interest, which may be a business interest or wider social interest. Within these two broad categories, there will be a broad range of more granular legitimate interests.

The necessity test is a factual assessment to establish whether web scraping is necessary to achieve the interest stated in the legitimate purpose test. The ICO recognizes that generative AI training requires large data sets that can only be collected using large-scale data scraping.

The data sets may contain data about large numbers of people. These individuals’ interests must be weighed against those of the entity using the data, which requires an assessment of the likely impact that the processing would have on the individuals. Collecting people’s personal data through web scraping is “invisible processing,” of which affected individuals are unlikely to be aware. Invisible processing makes it more difficult for individuals to retain control over their data or exercise their rights (known as “upstream risks and harms”). Generative AI models can also be used to generate inaccurate information about people, resulting in distress or harm, or may be used by hackers as a social engineering tool to generate phishing communications tailored in more sophisticated ways to individuals to perpetrate fraud (known as “downstream risks and harms”).

The ICO identifies a number of risk mitigations that may help generative AI developers pass the third part of the legitimate interests test (i.e., that the individual’s interests are not overridden). Where a generative AI developer deploys the model on its own platform and has relied on the public interest of wider society as its legitimate interest (i.e., part one of the three-part test), it should be able to control and demonstrate whether the generative AI model is actually used for the stated wider benefit. The developer should also be able to assess risks to individuals in the development and post-deployment phases, and implement measures to address such risks. The development phase of a generative AI model includes the collection and pre-processing of the training data, training and improvement of the model, and fine tuning of the model. The model can then be deployed for its intended use.

The ICO describes two different approaches that generative AI developers take in offering their models commercially. Some developers make their models available to a third party through an application programming interface (API). The third party does not have its own copy of the generative AI model on its servers or in its cloud. Instead, the developer continues to host the model, and the third party must query the model through the API. The ICO describes this as a “closed source” approach. Alternatively, a developer may permit third parties to host their own copies of the model on their own servers or in their clouds. The ICO terms this an “open source” approach.

A closed source approach would enable the developer to take steps to ensure that the third party’s use of the model aligns with the objective the developer has identified as the legitimate interest (i.e., part one of the three-part test). For instance, the developer might limit queries that could result in a risk of harm to individuals and monitor the third party’s use of the model. The developer could also add protective measures contractually.

In contrast, in an open source approach, a generative AI model may be implemented in unlimited ways, such that it would be impossible for developers to restrict or monitor how the model is used, and hence its impact on individuals. The developer is unlikely to have any knowledge of whether the broad societal aims asserted as its legitimate interest for processing people’s personal data have been achieved.

The ICO concludes that training generative AI models using web-scraped data may be feasible if developers adhere to their legal responsibilities and can provide evidence that they have done so. The ICO’s existing, detailed guidance on legitimate interests should be familiar to developers of generative AI models. The key compliance challenge will be identifying a specific legitimate interest and balancing this against individuals’ interests.

The ICO invites comments on its consultation series, and the deadline for expressing views on the first proposed chapter is March 1. Developers concerned that the ICO’s final position may affect their business model should give serious consideration to participating in the consultation.

© Arnold & Porter Kaye Scholer LLP 2024 All Rights Reserved. This Advisory is intended to be a general summary of the law and does not constitute legal advice. You should consult with counsel to determine applicable legal requirements in a specific fact situation.

Key Contacts

James Castro-Edwards

Counsel

London

Email +44 (0)20 7786 6198

Dr. Beatriz San Martin

Partner

London

Email +44 (0)20 7786 2940
