The Privacy-Copyright Collision: How AI’s Hunger for Data is Forcing Courts to Rewrite Both Bodies of Law Simultaneously

AI training pipelines sit at the intersection of two legal regimes designed independently and never intended to interact. When training data contains personal information and expressive content — as most internet data does — data protection law and copyright law collide. Courts in the US, EU, UK, and Germany are resolving that collision in real time, and the answers are still contradictory.

When a large language model is trained on internet data, it consumes two things simultaneously: the words, images, and music that copyright law protects as the expression of their creators, and the personal details — names, email addresses, identifying information, private disclosures — that data protection law protects as the information of the individuals those details describe. In the early years of the AI boom, the industry’s approach to this dual problem was largely to ignore it. That period is over.

Courts and regulators in at least four jurisdictions are now actively adjudicating whether AI training as currently practised (scraping enormous volumes of internet content, ingesting it into model parameters, and generating outputs that may reproduce fragments of it) complies with the law as written. The answers are contradictory. Two US federal judges have found that training is a transformative fair use. A third has found that it infringes copyright. A German court ruled in November 2025 that a commercial AI company cannot invoke the text and data mining exception for permanent memorisation of song lyrics. A UK court dismissed Getty Images’ copyright claim but upheld a narrower trademark argument. The EU is simultaneously trying to clarify the GDPR’s basis for AI training, while the Digital Omnibus proposal, designed to do exactly that, is being stripped apart by member states concerned about its scope.

The collision between copyright and privacy was always going to happen. These two legal frameworks were designed to solve different problems and were never intended to interact. What nobody fully anticipated was the speed at which the collision would arrive, or the extent to which the courts and regulators handling it are working in isolation from each other, making rulings in one framework that create direct contradictions with the other.

The Structural Problem

In separate eras, lawmakers designed copyright law and data protection law for distinct purposes, employing separate enforcement mechanisms and philosophical foundations. Copyright emerged to incentivise the creation of expressive works by granting creators a temporary monopoly over their reproduction and distribution. Data protection law emerged to protect individuals from the misuse of information about them by entities with greater power to collect and exploit it. The two regimes overlap at the margins — a database of personal information, for example, can carry both copyright protection (as a structured compilation) and data protection obligations — but they were not designed to be applied simultaneously to the same act.

Copyright law: protects expression

Subject matter: the text, image, music, or code as created
Infringement: reproduction without authorisation
Defence: fair use or a TDM exception (jurisdiction-dependent)
Enforcement: rights holders enforce; damages can be substantial
Does not depend on whether the content identifies a person

Data protection law: protects persons

Subject matter: personal information about identifiable individuals
Violation: processing without a lawful basis
Lawful bases include consent, legitimate interest, and legal obligation
Enforcement: regulators enforce; fines can reach 4% of global turnover
Does not depend on whether the content has expressive value

Applied to an AI training dataset, these two frameworks generate independent and non-overlapping compliance obligations. An AI developer that scrapes a news article needs copyright clearance for the text itself — but the personal data contained within the article (the names of people quoted, the journalist’s byline, any identifying details about individuals described) triggers a completely separate set of obligations under the GDPR or its equivalents. As the legal analysis firm Measured Collective summarised concisely in March 2026, a valid copyright licence, or a qualification for a text and data mining exception, still requires independently addressing data protection requirements. Compliance with one framework provides no shelter under the other.
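The independence of the two regimes can be made concrete with a small sketch. The record type, field names, and gap labels below are hypothetical, invented purely for illustration; they do not reflect any regulator's checklist or any company's actual pipeline. The point is only that the two checks are evaluated separately and that passing one never offsets the other.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TrainingItem:
    """One scraped document with illustrative (hypothetical) compliance flags."""
    source: str
    copyright_cleared: bool        # a licence is held, or an exception such as TDM applies
    contains_personal_data: bool   # names, bylines, identifying details, and so on
    lawful_basis: Optional[str]    # e.g. "legitimate_interest"; None if undocumented


def compliance_gaps(item: TrainingItem) -> List[str]:
    """Return the obligations still outstanding for this item.

    The two checks are independent: clearing one does not clear the other.
    """
    gaps = []
    if not item.copyright_cleared:
        gaps.append("copyright")
    if item.contains_personal_data and item.lawful_basis is None:
        gaps.append("data_protection")
    return gaps


# A licensed news article that quotes named individuals, with no documented
# lawful basis for processing their personal data.
licensed_article = TrainingItem(
    source="news-article",
    copyright_cleared=True,
    contains_personal_data=True,
    lawful_basis=None,
)
print(compliance_gaps(licensed_article))  # ['data_protection']
```

In this toy model, the copyright licence leaves the data protection gap untouched, which mirrors the point made above: the article is cleared as expression but not as personal data.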

Fair Use in Flux

The United States does not have a comprehensive federal data protection law comparable to the GDPR. The primary legal battleground for AI training in the US is therefore copyright, and the key doctrine is fair use — the statutory defence that permits reproduction of copyrighted material without authorisation when the use is transformative, does not harm the market for the original, and satisfies the other factors of the four-part fair use test.

Key US AI copyright cases: Status as of May 2026

The New York Times v. OpenAI & Microsoft

Most-watched case. The court ordered OpenAI to produce 20 million ChatGPT output logs in January 2026. Summary judgment on fair use not expected before summer 2026. No trial date set. Pending.

Bartz v. Anthropic

June 2025: the court granted a partial dismissal in favour of Anthropic, finding training transformative under fair use. First major ruling in Anthropic’s favour on training-stage claims. AI wins.

Authors Guild MDL v. Meta

June 2025: the court found Meta’s LLaMA training to be a transformative fair use, and held that pirated training sources do not defeat the fair use defence. Diverged from Bartz on market harm. AI wins.

GEMA v. OpenAI (Germany)

November 2025: Munich Regional Court finds ChatGPT permanently memorised song lyrics, which fall outside the TDM exception. First European court ruling that AI training infringes copyright. The rights holder wins.

Getty Images v. Stability AI (UK)

November 2025: High Court dismisses primary copyright claim (model does not physically “contain” copies), upholds limited trademark infringement. Getty abandoned the copyright case mid-trial. Mixed/narrow.

Disney & Universal v. Midjourney

Filed 2025. First major visual-media plaintiff in the AI image generation space. Allegations of copyright infringement in training data and output generation. Litigation ongoing. Pending.

Of the three US federal judges who have ruled on the fair use question to date, two have sided with AI companies, and one has sided with rights holders. That majority is less solid than the tally suggests: the two pro-AI rulings both found training transformative, but diverged on whether training on pirated source material defeats the fair use defence (the Meta case held it did not; the Bartz court did not reach this question squarely). None of the rulings has settled the underlying doctrine.

A development that received less headline attention than the fair use rulings is likely to prove more consequential in the near term: the use of discovery as a litigation weapon. In January 2026, Judge Stein ordered OpenAI to produce the entire 20-million-entry sample of ChatGPT output logs in the New York Times case. In doing so, he rejected OpenAI’s argument that producing user conversations would violate privacy rights, reasoning that users had voluntarily submitted their communications to OpenAI. The ruling establishes that user interactions with AI chatbots may be discoverable in copyright disputes, a precedent with implications well beyond the Times case.

The GDPR Chokepoint

The European legal landscape is more complex than that of the US, because AI developers operating in Europe face three independent regulatory regimes simultaneously: the GDPR (for personal data), the EU Copyright Directive’s text and data mining exception (for expressive content), and the AI Act (for the systems themselves). None of these regimes was designed with the others in mind. All three apply.

The GDPR’s core problem for AI training is the consent model. The regulation requires a lawful basis for every processing activity involving personal data. The most intuitive basis — consent — is impractical at the scale of internet training data, because obtaining valid, specific consent from every individual whose personal information appears in a web crawl is technically impossible. The industry has therefore relied primarily on “legitimate interest” under Article 6(1)(f) of the GDPR: a balancing test that asks whether the controller’s interests in processing the data override the data subject’s interests or rights.

“GDPR compliance with training-stage consent is only one layer in a complex stack of regulatory issues. A well-documented GDPR position remains a key tool — but it does not resolve copyright, database rights, post-training litigation risk, or deployment obligations.”

Skadden, Arps, Slate, Meagher & Flom — analysis of CNIL legitimate interest guidance, June 2025

France’s CNIL published guidance in 2025 clarifying that legitimate interest can serve as a lawful basis for training on publicly available data, but only subject to conditions. The UK’s ICO has taken a similarly conditional line, announcing in February 2026 a formal investigation into Grok/xAI to examine whether personal data was lawfully processed for AI training, with a specific focus on harmful sexualised content, including content involving children. The investigation signals that the ICO is prepared to move from guidance to enforcement when AI training data practices fall below data protection standards.

Meanwhile, the European Commission’s Digital Omnibus proposal of November 2025 attempted to introduce an explicit legitimate interest basis for AI training under the GDPR, reducing uncertainty by writing the answer into the text of the regulation. By March 2026, a leaked Council compromise draft had removed the proposed redefinition of “personal data” entirely, illustrating how politically contested any attempt to amend the GDPR in favour of AI training has become among member states.

The TDM Exception’s Hidden Limit

The EU Copyright Directive’s text and data mining exception was designed to permit the automated analysis of large volumes of text and data for pattern recognition, statistical inference, and knowledge discovery — without requiring copyright clearance for each source work. It was enacted with machine learning and natural language processing in mind, and was widely interpreted as authorising AI training.

The GEMA v. OpenAI ruling in November 2025 drew a line that the industry had not anticipated. The Munich Regional Court did not reject the TDM exception as inapplicable to AI training in principle. Instead, it distinguished two types of use: analysis, which extracts abstract information (syntactic rules, semantic relationships, general patterns) and falls within the TDM exception, and permanent memorisation of the works themselves, which does not. The court found that ChatGPT had permanently memorised GEMA-represented song lyrics, as evidenced by its ability to reproduce them on demand, and that this reproduction-capable memorisation fell outside the scope of the exception.

The GEMA ruling’s doctrinal significance

The Munich court’s reasoning has implications beyond music and beyond Germany. If the TDM exception does not cover training that results in permanent memorisation of protected works, then every large language model trained on copyrighted text may carry copyright exposure — not for the training process itself, but for the model’s ongoing capacity to reproduce fragments of that training data.

The critical evidential question becomes: can the model reproduce specific protected content on demand? If yes, the rights holder’s argument follows the GEMA logic: the model memorised the work, and that memorisation constitutes infringement. If no — if the model has genuinely processed the content into abstract representations that cannot produce recognisable reproductions — the TDM defence may hold. This frames the legal dispute as fundamentally an empirical question about what the model actually contains, which is itself contested scientific territory.
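One way to picture the empirical side of that question is a verbatim-overlap probe: compare a model's output against a protected text and measure the longest run of words they share. The sketch below uses Python's standard `difflib` module. The function names, the invented sample strings, and above all the eight-word threshold are illustrative assumptions; no court has adopted a numerical test of this kind, and real memorisation studies are considerably more involved.

```python
from difflib import SequenceMatcher


def longest_verbatim_run(protected: str, output: str) -> str:
    """Return the longest word sequence that appears verbatim in both texts."""
    a = protected.split()
    b = output.split()
    # autojunk=False disables difflib's heuristic that ignores very common items.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return " ".join(a[match.a:match.a + match.size])


def looks_memorised(protected: str, output: str, min_words: int = 8) -> bool:
    """Flag an output whose longest shared run reaches min_words.

    The threshold is an arbitrary illustration, not a legal standard.
    """
    return len(longest_verbatim_run(protected, output).split()) >= min_words


# Invented sample text standing in for a protected work and a model output.
protected = "the quick brown fox jumps over the lazy dog near the quiet river bank"
output = "Sure, here it is: the quick brown fox jumps over the lazy dog near the town"

print(longest_verbatim_run(protected, output))
print(looks_memorised(protected, output))
```

A paraphrase such as "A fast brown fox leaps above a sleepy dog" shares only a two-word run with the sample text and would not be flagged, which is the kind of distinction between abstraction and reproduction that the evidential dispute turns on.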

The contrasting Kneschke v. LAION ruling from September 2024, which found in favour of the LAION non-profit, illustrates the commercial/non-commercial dimension: LAION’s open-source, scientific-purpose use of images was permitted under TDM; OpenAI’s commercial deployment was not. The line may run between those who extract information and those who package it for profit.

Japan’s Distinctive Position

Japan’s approach to the intersection of copyright and data protection in AI training is distinctive and deserves specific treatment. Article 30-4 of Japan’s Copyright Act has, since 2019, included an explicit provision permitting the use of copyrighted works for machine learning and information analysis, without the copyright holder’s consent and without regard to whether the purpose is commercial. This is broader than the EU’s TDM exception in one critical respect: it does not require that the use be non-commercial or for research purposes. A company training a commercial AI model on Japanese copyrighted works can invoke Article 30-4 as a statutory defence.

This provision has made Japan the most copyright-permissive major economy for AI training and attracted significant attention from global AI developers. But the Agency for Cultural Affairs’ 2024 guidance drew new lines around the exception: specifically, it clarified that Article 30-4 does not authorise the reproduction of training data in outputs in a way that substantially substitutes for the original. The memorisation-versus-analysis distinction that the Munich court applied under EU law has a Japanese equivalent, and litigation testing its boundaries is expected.

On the data protection side, Japan’s APPI amendment, approved by the Cabinet in April 2026, explicitly creates a pathway for using personal data for AI development without individual consent — provided the data is pseudonymised and documented safeguards are in place. This makes Japan’s data protection regime, in this specific context, more accommodating of AI training than both the GDPR (which requires a legitimate interest balancing test) and the UK GDPR (where enforcement action is already underway).
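For readers unfamiliar with the term, pseudonymisation generally means replacing direct identifiers with tokens that cannot be reversed without separately held information. The sketch below shows one common engineering approach, keyed hashing of email addresses with HMAC-SHA256. It is an illustration only: the function name, the token format, and the simplified email pattern are assumptions, and nothing here should be read as a statement of what the APPI amendment or any regulator actually requires.

```python
import hashlib
import hmac
import re

# Simplified pattern for illustration; real-world email matching is messier.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymise_emails(text: str, secret_key: bytes) -> str:
    """Replace each email address with a keyed-hash token.

    The same address always maps to the same token under the same key,
    so records stay linkable, but the address cannot be recovered from
    the token without the key.
    """
    def replace(match: "re.Match[str]") -> str:
        normalised = match.group(0).lower().encode("utf-8")
        digest = hmac.new(secret_key, normalised, hashlib.sha256).hexdigest()
        return f"<email:{digest[:12]}>"

    return EMAIL_PATTERN.sub(replace, text)


sample = "Contact Alice at alice@example.com or ALICE@example.com."
print(pseudonymise_emails(sample, b"demo-key-kept-separately"))
```

The key is the sensitive part of this design: whoever holds it can re-link tokens to individuals, which is why pseudonymised data is usually still treated as personal data under the GDPR and why documented safeguards around key custody matter.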

The Path Forward

The legal collision between copyright and privacy in AI training is not a problem that courts can resolve on their own, at least not in any timeframe that is useful to the industry or to the individuals whose rights are in question. Even in the US, where litigation is most advanced, Morrison Foerster’s analysis of the 2025 year-end position concluded that 2026 is unlikely to bring final answers to the core copyright questions. The New York Times case, the most consequential of the pending cases, will not reach trial before late 2026 at the earliest.

Three paths to resolution are emerging. The first is licensing: major AI developers are negotiating content licensing agreements with publishers, news organisations, and rights management bodies that provide a copyright framework for training data. These agreements resolve the copyright question for the contracted parties but do not resolve the broader doctrinal question, and they do not address the data protection obligations that apply independently of copyright status.

The second is legislation: the UK’s Data (Use and Access) Act 2025 requires an economic impact assessment on copyright works in AI development, due before Parliament by March 2026. The EU’s Digital Omnibus attempted to amend the GDPR directly for AI training. Japan’s APPI amendment is already in the legislative process. These interventions are jurisdiction-specific and do not resolve the multi-jurisdictional compliance challenge facing any AI company that trains on global internet data.

The third path — the one that litigation is slowly, expensively forcing — is doctrinal clarification through judicial decisions that become binding precedent. That process takes years. In the interim, AI companies are operating under genuine legal uncertainty across every major jurisdiction simultaneously — and the individuals whose personal information and expressive content provided the training data have rights under two separate frameworks that the companies may or may not be satisfying, depending on which court you ask.

Key Takeaways

Copyright and data protection law apply independently and simultaneously to AI training data. A valid TDM exception or copyright licence does not satisfy GDPR obligations, and GDPR compliance does not resolve copyright exposure.
In the US, two of three federal fair use rulings on AI training have favoured AI companies. The New York Times v. OpenAI case, with 20 million output logs now subject to discovery, is the most consequential pending decision.
The Munich Regional Court’s November 2025 ruling in GEMA v. OpenAI draws a critical distinction between data mining (permitted under TDM) and permanent memorisation of protected works that enables output reproduction (not permitted). Because the TDM exception derives from the EU Copyright Directive, the reasoning is potentially relevant in other EU member states, although a regional court’s ruling does not bind courts elsewhere.
The UK ICO’s February 2026 investigation into Grok/xAI signals enforcement intent. The ICO has declined to adopt a permissive interpretation of UK GDPR for AI training without adequate safeguards.
Japan’s APPI amendment (April 2026) creates the most permissive statutory pathway for AI training data use without consent among major data protection regimes — but Japan’s copyright law framework is evolving, and the Article 30-4 exception has limits.
The EU Digital Omnibus proposal to amend the GDPR for AI training is politically contested. The proposed redefinition of “personal data” was removed from the Council compromise drafts in March 2026. The reform may not proceed in its original form.
Companies building AI on global training data face multi-jurisdictional exposure where the applicable rules are currently contradictory. Licensing agreements with major rights holders, documented DPIAs, and purpose-limited data governance are the minimum compliance position while doctrine develops.