AI Fair Use Landmark: Authors v. Anthropic

Artificial intelligence is rapidly transforming how we interact with information, but its hunger for high-quality data has created a collision course with copyright law. This conflict takes center stage in the recent case of Bartz et al. v. Anthropic PBC, where acclaimed authors challenged Anthropic’s use of their books to train the company’s popular Claude AI models.

On June 23, 2025, Judge William Alsup of the U.S. District Court for the Northern District of California issued a detailed order addressing whether Anthropic’s conduct qualifies as “fair use” under Section 107 of the Copyright Act. The decision provides a rare, in-depth examination of how a leading AI company sourced its training data and the legal risks that accompany such practices. A trial on the remaining issues is still to come.

The fair use doctrine is a legal principle under U.S. copyright law that allows limited use of copyrighted material without permission from a copyright owner. Fair use provides for the legal, unlicensed citation or incorporation of copyrighted material in another author’s work under specific circumstances, including commentary, criticism, parody, news reporting, research and scholarship. Without the fair use doctrine, all copying of any amount for any purpose would be a violation of copyright.

Courts evaluate fair use claims using four factors spelled out in Section 107 of the Copyright Act: the purpose and character of the use (including whether it’s commercial or educational), the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the potential market for or value of the copyrighted work. The doctrine is intentionally open-ended: it contains no quantitative limits or absolutes, instead providing a sliding scale of considerations. The only way to get a definitive answer on whether a particular use qualifies as fair use is to have it resolved in federal court.

The court order reveals that Anthropic, founded by former OpenAI employees, set out to create a “central library of all the books in the world” to serve as a permanent and searchable resource for AI research and model training. Anthropic employed a two-pronged approach combining pirated digital books with bulk scanning operations.

Anthropic downloaded millions of full-text books from notorious pirated sources, including the Books3 dataset, Library Genesis (LibGen) and Pirate Library Mirror (PiLiMi). The company knew these were unauthorized copies, a fact openly discussed in internal communications.

Rather than license digital copies, Anthropic spent millions purchasing used and new print books in bulk, then systematically removed the bindings, scanned every page and discarded the physical copies. The company cataloged and stored the resulting digital files indefinitely with the stated goal of retaining this library “forever” for both AI training and general research purposes.

Anthropic’s internal documents clearly established that books, particularly those by professional authors, were valued for their creative expression and high-quality writing. The company believed training on these works would make Claude’s outputs more compelling, accurate and “editor-approved.”

Despite having alternatives such as commissioning original content or licensing existing works, Anthropic chose not to pursue these options, citing cost and convenience concerns.

The court order meticulously details the multi-stage copying process applied to each book. The process began with an initial copy transferred from the central library to a working training set. Next came cleaning, which involved removing headers, footers and extraneous text to create a new “cleaned” copy. The third stage was tokenization, converting text into machine-readable tokens and resulting in additional copies. Finally came iterative training, with repeated access and processing of tokenized copies during model training.
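The four stages the order describes can be sketched as a toy pipeline, with each stage producing a new copy of the work. This is a minimal illustration of the staged process the court outlines; all function names, the header/footer markers, and the whitespace tokenizer are illustrative assumptions, not Anthropic’s actual tooling.

```python
def clean(raw_pages):
    """Stage 2: strip headers, footers and page numbers, yielding a new 'cleaned' copy."""
    return "\n".join(
        line
        for page in raw_pages
        for line in page.splitlines()
        # Illustrative markers; real cleaning heuristics are far more involved.
        if not line.startswith(("HEADER:", "FOOTER:", "Page "))
    )

def tokenize(text):
    """Stage 3: convert text into machine-readable tokens (yet more copies).
    Real systems use subword tokenizers; a whitespace split is a stand-in."""
    return text.split()

# Stage 1: copy the book from the central library into a working training set.
library_copy = ["HEADER: My Novel\nIt was a dark and stormy night.\nPage 1"]
training_copy = list(library_copy)

cleaned = clean(training_copy)   # Stage 2
tokens = tokenize(cleaned)       # Stage 3
# Stage 4 (iterative training) would re-read `tokens` on every training pass.
print(tokens)
```

Each stage materializes a fresh copy of the same expressive text, which is why the total copy count balloons across the pipeline.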

The scale was so extensive that Anthropic admitted it would be “impractical even to estimate” the total number of copies made.

The central question is whether this extensive, unauthorized copying constitutes “fair use” under Section 107 of the Copyright Act. The court’s order highlights several critical factors that weigh against fair use protection.

The commercial nature of Anthropic’s use was directly tied to a billion-dollar product. The scale and permanence of the copying was systematic, large-scale and intended to create a permanent research library. Anthropic had ready access to alternatives, as the company could have licensed works or created original training data but chose not to. Perhaps most importantly, the company’s business case depended on the expressive, creative qualities of the books, not merely factual information.

The court’s findings establish the foundation for a potentially precedent-setting ruling on fair use boundaries in the generative AI era. If Anthropic’s conduct is deemed not to constitute fair use, the decision could reshape how AI companies source and use copyrighted materials and force a reckoning regarding the need for licensing or alternative data strategies.

For authors and publishers, the case serves as a bellwether for how their rights will be protected as AI systems become increasingly sophisticated. For the tech industry, it warns that shortcuts in data acquisition carry serious legal consequences.
