, were found in the datasets used to train LLaMA. The complaint mentions The Pile in particular, a dataset created by the research group EleutherAI.
The suit quotes EleutherAI's own description of its dataset as using Bibliotik, one of several "shadow libraries" the suit condemns: "Bibliotik consists of a mix of fiction and nonfiction books [...] We included Bibliotik because books are invaluable for long-range context modelling research and coherent storytelling."
The suit then explains: "These shadow libraries have long been of interest to the AI-training community because of the large quantity of copyrighted material they host. For that reason, these shadow libraries are also flagrantly illegal." The authors' representatives, lawyers Matthew Butterick and Joseph Saveri, write on their litigation website: "Much of the material in the training datasets used by OpenAI and Meta comes from copyrighted works—including books written by Plaintiffs—that were copied by OpenAI and Meta without consent, without credit, and without compensation."