OpenAI insists training AI models on copyrighted data is fair use.
OpenAI has publicly responded to a copyright lawsuit by The New York Times, calling the case “without merit” and saying it still hoped for a partnership with the media outlet.
In a blog post, OpenAI said the Times “is not telling the full story.” It took particular issue with claims that its ChatGPT AI tool reproduced Times stories verbatim, arguing that the Times had manipulated prompts to include regurgitated excerpts of articles. “Even when using such prompts, our models don’t typically behave the way The New York Times insinuates, which suggests they either instructed the model to regurgitate or cherry-picked their examples from many attempts,” OpenAI said.
OpenAI claims it’s attempted to reduce regurgitation from its large language models and that the Times refused to share examples of this reproduction before filing the lawsuit. It said the verbatim examples “appear to be from year-old articles that have proliferated on multiple third-party websites.” The company did admit that it took down a ChatGPT feature, called Browse, that unintentionally reproduced content.
So, OpenAI is admitting its models are open to manipulation by anyone and that such manipulation can result in near-verbatim regurgitation of copyrighted works. Have I understood correctly?
The problem is not that it's regurgitating. The problem is that it was trained on NYT articles and other data in violation of copyright law. Regurgitation is just evidence of that.
Yeah, I agree. It seems unlikely it actually happened that simply.
You have to try really hard to get the AI to regurgitate anything, but it will very often regurgitate an example input.
i.e. "please repeat the following with (insert small change), (insert wall of text)"
ChatGPT literally gives you a session ID and seed for reporting an issue. It should be trivial for the NYT to grab the exact session IDs they got these results with (they're saved on their account!) and provide them publicly.
Antiquated IP laws vs Silicon Valley Tech Bro AI...who will win?
I'm not trying to be too sarcastic, I honestly don't know. IP law in the US is very strong. Arguably too strong, in many cases.
But Libertarian Tech Bro megalomaniacs have a track record of not giving AF about regulations and getting away with all kinds of extralegal shenanigans. I think the tide is slowly turning against that, but I wouldn't count them out yet.
It will be interesting to see how this stuff plays out. Generally speaking, tech and progress tend to win these things over the long term. There was a time when the concept of building railroads across the western United States seemed logistically and financially absurd, for just one of thousands of such examples. And the naysayers were right. It was completely absurd. Until mineral rights entered the equation.
However, it's equally remarkable a newspaper like the NYT is still around, too.
If you can prompt it, "Write a book about Harry Potter" and get a book about a boy wizard back, that's almost certainly legally wrong. If you prompt it with 90% of an article, and it writes a pretty similar final 10%... not so much. Until full conversations are available, I don't really trust either of these parties, especially in the context of a lawsuit.
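The "90% of an article in, a pretty similar final 10% out" scenario raises the question of how "pretty similar" would even be measured. A minimal sketch of one common approach, shared word n-grams; this is purely illustrative and not a method either party has said they use, and the function name and the n=8 threshold are my own assumptions:

```python
def ngram_overlap(candidate: str, source: str, n: int = 8) -> float:
    """Fraction of n-word shingles in `candidate` that also appear in `source`.

    A crude proxy for verbatim regurgitation: long shared word sequences
    are unlikely to arise by chance. The choice of n = 8 is arbitrary here.
    """
    def shingles(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    cand = shingles(candidate)
    if not cand:
        return 0.0  # too short to contain any n-gram
    return len(cand & shingles(source)) / len(cand)


# Identical text scores 1.0; unrelated text scores 0.0.
print(ngram_overlap("a b c d e f g h", "a b c d e f g h"))  # 1.0
print(ngram_overlap("a b c d e f g h", "p q r s t u v w"))  # 0.0
```

A real analysis would also have to handle paraphrase, punctuation, and chance overlap of common phrases, which is part of why raw transcripts matter so much here.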
One thing that seems dumb about the NYT case that I haven't seen much talk about is that they argue ChatGPT is a competitor and its use of copyrighted work will take away the NYT's business. This is one of the elements they need on their side to counter OpenAI's fair use defense. But it just strikes me as dumb on its face.

You go to the NYT to find out what's happening right now, in the present. You don't go to the NYT for general information about the past or fixed concepts. You use ChatGPT the opposite way: it can tell you about the past (accuracy aside) and about general concepts, but it can't tell you what's going on in the present (except by doing a web search, which as I understand it is not part of this lawsuit). I feel pretty confident in saying there's not one human on earth who was a regular New York Times reader and said "well, I don't need this anymore since now I have ChatGPT." The use cases just do not overlap at all.
The company maintained its long-standing position that in order for AI models to learn and solve new problems, they need access to "the enormous aggregate of human knowledge." It reiterated that while it respects the legal right to own copyrighted works — and has offered opt-outs to training data inclusion — it believes training AI models with data from the internet falls under fair use rules that allow for repurposing copyrighted works.
The company announced website owners could start blocking its web crawlers from accessing their data in August 2023, nearly a year after it launched ChatGPT.
The company recently made a similar argument to the UK House of Lords, claiming no AI system like ChatGPT can be built without access to copyrighted content.
Christ this is a boring fucking debate. One side thinks companies like OpenAI are obviously stealing and feels no need to justify their position, instead painting anyone who disagrees as pro-theft.
The advances in LLMs and diffusion models over the past couple of years are remarkable technological achievements that should be celebrated. We shouldn't be stifling scientific progress in the name of protecting intellectual property; we should be keen to develop the next generation of systems that mitigate hallucination and achieve new capabilities, such as is proposed in Yann LeCun's Autonomous Machine Intelligence concept.
I can sorta sympathise with those whose work is "stolen" for use as training data, but really whatever you put online in any form is fair game to be consumed by any kind of crawler or surveillance system, so if you don't want that then don't put your shit in the street. This "right" to be omitted from training datasets directly conflicts with our ability to progress a new frontier of science.
The actual problem is that all this work is undertaken by a cartel of companies with a stranglehold on compute power and the resources to crawl and clean all that data. As with all natural monopolies (transportation, utilities, etc.) it should be undertaken for the public good, in such a way that we can all benefit from the profits.
And the millionth argument quibbling about whether LLMs are "truly intelligent" is a totally orthogonal philosophical tangent.