Technology @beehaw.org Arthur Besse @lemmy.ml 1 yr. ago

Sarah Silverman and other authors are suing OpenAI and Meta for copyright infringement, alleging that they're training their LLMs on books via Library Genesis and Z-Library

www.thedailybeast.com Sarah Silverman Sues ChatGPT Creator for Copyright Infringement

“If a user prompts ChatGPT to summarize a copyrighted book, it will do so,” the suit claims.

It seems very improbable that they scraped a pirate website with forced registration and tight daily download limits (10 books a day max?) to get content that's often mislabeled and not presented in a homogeneous way.

It's probably just using the excerpt from Amazon (which, with paid API access, is much easier to get) as a prompt and building on it.
- There have been ongoing suspicions that pirated content was used to train popular LLMs, simply because popular datasets used for training LLMs do include such content. The Washington Post did an article about it.
  
  Google's C4 dataset, used for research, included illegal websites. What remains to be seen is whether it was cleaned up before training Bard as we know it today. OpenAI has revealed nothing about its dataset.
- The sources for those websites are all being archived as a huge torrent. You don't have to download every single book one by one if you are interested in all of them...
