Copyright vs fair use in AI: key 2025 court case insights
2025 has been a big year for copyright law and the fair use doctrine. Neudata Consultant Jessica Li Gebert dissects three court cases to explain why they have produced two different outcomes.
Jul 18, 2025

To date, the US District Court for the District of Delaware and the US District Court for the Northern District of California have clarified the applicability and limits of the fair use doctrine in using copyrighted materials to develop AI models and applications. But more questions remain.
In Reuters (Westlaw) v. ROSS, the court rejected ROSS’s fair use defence and found in Reuters’ favour. In Bartz v. Anthropic and Kadrey v. Meta, the courts ruled for Anthropic and Meta on the basis of fair use.
All three cases involved data owners suing AI developers over the use of copyrighted materials to train AI models. Yet we are seeing two different outcomes. Why is that the case? What does it mean to data owners and content creators looking to formally license their data to AI model makers? What other uncertainties remain?
In this post, Neudata Consultant Jessica Li Gebert dissects these three cases to answer the questions above.
The spirit of the copyright law and the fair use doctrine
First, let’s contextualise the lawsuits with the Copyright Act of 1976 - the primary legal basis of copyright law in the US.
The spirit of the Copyright Act is to promote innovation and progress by
- granting creators and inventors exclusive rights to commercialise / otherwise use their works; and
- providing a legal recourse to seek remedies should their rights be infringed.
But ironically, stringent copyright enforcement can at times stifle innovation. To the Android users out there, did you know the customisability of your phones traces back to an act of copying? When Google first developed the Android system, it copied portions of Oracle’s Java API code. After a decade of litigation, the US Supreme Court ruled for Google in 2021 on the basis of the fair use doctrine.
The doctrine permits the use of copyrighted works without first obtaining permission from the copyright owner based on four factors (applied by the judges presiding over the three AI copyright infringement cases):
- Purpose and character of the use, e.g. commercial or nonprofit educational purposes, and transformative use (transformativeness is generally understood as a derivative work that transcends the original, or one meant for a different or novel use)
- Nature of the copyrighted work, i.e. how creative / original / innovative the work is and how closely it aligns with the Copyright Act’s intent
- Amount and substantiality of the portion used in relation to the copyrighted work as a whole
- Effect of the use upon the potential market or value of the copyrighted work
With that in mind, let’s delve into the cases.
Background and case summary
Lawsuit 1: Thomson Reuters and West Publishing v. ROSS Intelligence
Interim outcome: In favour of Thomson Reuters / West Publishing; Rejected ROSS's fair use argument
Status: 2020 - present, ongoing
- Partial summary judgment granted Feb 11, 2025
- ROSS appealing as of Apr 2025
Background
ROSS is a legaltech startup developing an AI-powered legal research platform.
To train its AI platform, ROSS sought to license legal headnotes from Westlaw, a legal research database wholly owned by Thomson Reuters. Westlaw headnotes are proprietary, in-house summaries of the key points of law in each case; they are widely used by legal professionals.
Reuters declined as ROSS is a direct competitor to Westlaw.
Consequently, ROSS contracted LegalEase, a legal support services provider, to collect Bulk Memos (i.e. notes) from lawyers with instructions to refer to Westlaw headnotes but not directly copy/paste.
Reuters sued ROSS for copyright infringement. ROSS claimed fair use.
Data use case: For AI training (input)
Court: District Court, D. Delaware (Judge Stephanos Bibas)
Fair use assessment details
- Purpose and character of use: Favours Reuters. ROSS’s use is commercial. ROSS’s use is not transformative, citing Warhol v Goldsmith: “If an original work and a secondary use share the same or highly similar purposes, and the second use is of a commercial nature, the first factor is likely to weigh against fair use, absent some other justification for copying.”
- Nature of copyrighted work: Favours ROSS. Westlaw’s headnotes require originality but are “not that creative”.
- Amount and substantiality: Favours ROSS.
- Effect on market / value of original work: Favours Reuters. Judge Bibas cited factor 4 as the most important factor in determining fair use. Walking back his previous opinion, he deemed ROSS’s AI platform a competitor to Westlaw; and regardless of whether Reuters uses Westlaw data to train its own AI, ROSS’s unlicensed use affects the potential market for AI training data, which is sufficient for this factor to go to Reuters.
Lawsuit 2: Bartz v. Anthropic
Interim outcome: In favour of Anthropic on fair use argument
Status: Aug 2024 - present, ongoing
- Summary judgment on fair use granted Jun 23, 2025
- The piracy claim (over the pirated library copies) remains open
Background
Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson are a group of authors.
Anthropic has created and maintains a digital “research library” of copyrighted books. It built this library by (1) copying books from LibGen, PiLiMi, and Books3 - free websites known for pirating books - and (2) legally buying physical books and scanning them into a machine-readable digital format to train its foundation model, Claude.
Bartz et al sued Anthropic for copyright infringement over (1) making unauthorised copies of their books and (2) using their books to train Claude. Anthropic claimed fair use.
Data use case: For AI training (input)
Court: District Court, N.D. California (Judge William Haskell Alsup)
Fair use assessment details
- Purpose and character of use: Favours Anthropic. Training an LLM is akin to human learning, i.e. Anthropic’s use is educational. Second, training Claude is transformative. Third, changing a book’s format for storage purposes is fair use. The situation may have been different had the plaintiffs pursued copyright infringement over Claude’s output, as opposed to its input.
- Nature of copyrighted work: Favours Bartz et al.
- Amount and substantiality: Favours Anthropic. The copying was extensive - the books were used to train Claude precisely because they are complete works - but that extent was reasonably necessary for training Claude.
- Effect on market / value of original work: Favours Anthropic. Training LLMs has not resulted in competing works being generated, nor in the kind of creative displacement by substitute works that the Copyright Act is concerned with. Training an LLM is akin to teaching schoolchildren to write well, which does not result in an “explosion of competing works”. The situation would be different if the LLM had output (generated) exact copies of the plaintiffs’ books or produced knock-offs.
Lawsuit 3: Kadrey v. Meta
Interim outcome: In favour of Meta on fair use argument
Status: Jul 2023 - present, ongoing
- Partial summary judgment granted Jun 25, 2025
Background
Richard Kadrey et al are a group of 13 authors - mostly famous fiction authors.
To train its foundation model, Llama, Meta considered paying $100m to legally license training data but found it challenging because book copyrights are typically owned by individual authors rather than publishers.
Instead, Meta collected an extensive amount of training data from legitimate sources (including Common Crawl, Wikipedia, GitHub, ArXiv, Stack Exchange, and Project Gutenberg) and illicit shadow libraries (including Books3, LibGen, Z-Library). Meta downloaded >650 copyrighted books owned by the plaintiffs.
The plaintiffs sued Meta for copyright infringement, violation of the DMCA and other claims. Meta claimed fair use.
Data use case: For AI training (input)
Court: District Court, N.D. California (Judge Vince Girdhari Chhabria)
Fair use assessment details
- Purpose and character of use: Favours Meta. Llama is free to license but is ultimately developed for commercial purposes, from which Meta expects to generate $460B-$1.2T in revenue over the next 10 years. Training an LLM is a transformative use. Llama can perform a wide range of text functions, i.e. it is an innovative tool.
- Nature of copyrighted work: Favours Kadrey et al.
- Amount and substantiality: Favours Meta. The copying was extensive but reasonably necessary to train LLMs.
- Effect on market / value of original work: Favours Meta. Llama does not and cannot generate meaningful portions of, or reproduce, the plaintiffs’ original works, as testified by both the plaintiffs’ expert and Meta, i.e. it does not compete with them. Llama and other LLMs could “write” books, but they are unlikely to displace or dilute the market for human-written books, i.e. they are not substitutes. The plaintiffs’ opportunity cost (or harm) in future book licensing is not cognisable because Meta’s use is transformative.
So… why are the fair use judgments in three similar lawsuits different?
Well, the operative word here is transformative.
Anthropic’s Claude and Meta’s Llama are both LLMs, aka foundation models with general-purpose functionalities, upon which specialised applications are built. For instance, Slack AI and Notion AI are built on Claude, and Grok is rumoured to be built on Llama.
Meanwhile, ROSS is a specialised AI legal research tool. It has two parts - its legal retrieval engine is built on IBM Watson’s natural language understanding (NLU) capability; its general text summarisation and generation are powered by OpenAI’s GPT. Basically, ROSS does not transform how legal research is done. It merely speeds it up by automating parts of the research process using an underlying technology. This underlying technology - NLU and LLM - is what actually changes the way we process information.
What does that mean to data providers and content creators?
Foundation model makers and specialised application developers need data! And they will obtain it.
As a data owner or content creator, if you have data they need and have published it online or in hard copy, it’s highly likely they already have a copy, or will eventually scrape or scan your data.
So, what should you do?
If you wait to play defence, pursuing a copyright infringement claim can be extremely difficult, as demonstrated in the three lawsuits above. Foundation model makers are the biggest data takers, but they are also backed by powerful tech companies with near-infinite resources to engage in legal action, and they are protected by the fair use doctrine. Meanwhile, specialised application developers may take less of your data and are less likely to be protected, but they’re nonetheless costly to litigate against.
Instead, you could formally monetise, or sell, your data to avoid having it taken.
I like to think of a successful data business as a 3-step recurring process:
- Secure: if you’re publishing your data online, chances are AI bots are already scraping it. Your best defence is a log-in requirement: as the Ninth Circuit ruled in the landmark hiQ Labs v. LinkedIn, any online content not locked behind a log-in or paywall is likely fair game to web scrapers (i.e. scraping it is not considered unauthorised access under the CFAA). But that’s not feasible for those who need to maintain an open-access website for commercial reasons. Instead, you may want to explore other technical deterrents (a minimal sketch of one follows this list) and the emerging idea of an internet tollgate that charges web scrapers/crawlers for data taken, such as Cloudflare’s newly unveiled Pay-Per-Crawl and the startup TollBit.
- Monetise: having secured your data products, it’s time to formalise your data licensing strategy. Your ideal customer profile depends on the use cases of your data assets. Historically, the most developed data markets are advertising, corporate BI, and alternative data (financial information). AI is the latest emerging market.
- Generate: finally, you’ll need to continuously create fresh, relevant and useful data to stay competitive. While some models are now trained on synthetic or pseudo-labeled data, these approaches carry significant risks, such as error propagation, embedded biases, limited generalisation and a lack of realism. As a result, data generated through real human activities, like applying domain expertise, engaging in problem-solving, or interacting with others, will remain in high demand, even when supported by AI-assisted methods of data collection and processing.
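To make the “Secure” step above concrete, here is a minimal sketch of two common technical deterrents: a robots.txt that asks AI crawlers not to take your content, and a server-side user-agent check that can refuse or meter requests. The crawler names and helper functions below are illustrative assumptions, not a definitive list or implementation; verify current crawler tokens against each operator’s documentation.

```python
# Minimal sketch of two basic deterrents for AI crawlers.
# The user-agent tokens below are commonly published crawler names
# (e.g. OpenAI's GPTBot, Common Crawl's CCBot); treat this list as
# illustrative and keep it up to date yourself.

AI_CRAWLER_AGENTS = [
    "GPTBot",           # OpenAI's web crawler
    "ClaudeBot",        # Anthropic's web crawler
    "CCBot",            # Common Crawl
    "Google-Extended",  # Google's AI-training opt-out token
]

def build_robots_txt(agents=AI_CRAWLER_AGENTS) -> str:
    """Generate robots.txt rules asking each listed crawler to stay out entirely."""
    blocks = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    return "\n\n".join(blocks) + "\n"

def is_ai_crawler(user_agent_header: str, agents=AI_CRAWLER_AGENTS) -> bool:
    """Return True if a request's User-Agent header matches a known AI crawler.

    Unlike robots.txt, a server-side check lets you actively refuse (HTTP 403)
    or meter the request, which is the idea behind pay-per-crawl tollgates.
    """
    ua = user_agent_header.lower()
    return any(agent.lower() in ua for agent in agents)

if __name__ == "__main__":
    print(build_robots_txt())
    print(is_ai_crawler("Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"))
```

Note that robots.txt is only a polite request that well-behaved crawlers choose to honour; the server-side check (or a commercial tollgate like those mentioned above) is what actually enforces access.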
To evaluate the external potential of your data, contact consulting@neudata.co.
Nevertheless, uncertainties lie ahead
As Judge Alsup noted multiple times in his order in Bartz v. Anthropic, that case dealt only with inputting copyrighted materials to train AI. What about the outputs generated by AIs trained on copyrighted materials?
In a way, Reuters v. ROSS touched upon this issue - as a specialised application, ROSS’s output competes directly with Westlaw’s product offering. As such, ROSS was denied a fair use defence and was found liable for copyright infringement. Now that ROSS is appealing the decision, let’s see what happens next.
We shall also watch for the outcomes of Getty Images v. Stability AI and Andersen v. Stability AI, Midjourney, DeviantArt, where the very same question is being litigated.
Ultimately, AI is a complex topic with wide-ranging impacts, and the case law surrounding its development and deployment is equally complex. Legal developments tend to lag behind technological advancements, but based on what we have seen so far, I think the courts are moving swiftly and reasonably to provide more legal clarity and guidance. We just have to be patient in the meantime.