OpenAI strikes Reddit deal to train its AI on your posts

Reddit’s deal with OpenAI will plug its posts into “ChatGPT and new products”

cross-posted from: https://lemmy.world/post/15479755
Each time this pops up, there is a rush of people saying to delete or edit your comments.
They have a database of your comments and all your edits. It's easy to see when you mass delete or edit them. Anything done past a certain point in time, especially all at once, is automatically reverted.
By deleting and editing, you are taking the data away from scrapers, which makes the dataset Reddit is selling more unique and therefore more valuable.
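The detection described above is straightforward if every revision is stored. A minimal sketch of the idea, using a hypothetical revision log (the field names and threshold are illustrative, not Reddit's actual schema or logic):

```python
from collections import Counter
from datetime import datetime

# Hypothetical revision records: (comment_id, edited_at, new_body).
revisions = [
    ("c1", datetime(2024, 5, 16, 12, 0), "[deleted]"),
    ("c2", datetime(2024, 5, 16, 12, 1), "[deleted]"),
    ("c3", datetime(2024, 5, 16, 12, 1), "[deleted]"),
    ("c4", datetime(2023, 1, 3, 9, 30), "typo fix"),
]

def flag_mass_edits(revisions, threshold=3):
    """Flag any hour in which a user edited or deleted an unusual number
    of comments at once -- the kind of burst that is trivial to spot
    (and revert) when every prior revision is retained."""
    per_hour = Counter(ts.replace(minute=0, second=0) for _, ts, _ in revisions)
    return {hour for hour, count in per_hour.items() if count >= threshold}

print(flag_mass_edits(revisions))  # the three-in-one-hour burst is flagged
```

The point is that with full revision history, reverting a flagged burst is just restoring the previous stored body for each affected comment.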
I mean, that's completely illegal at least in places like Germany, where people have the right to be forgotten, but unfortunately you're still right. They already committed the biggest heist in human history and got away with it. I guess NFT grifters only got punished because they dared to also steal from some rich people, while Altman and his cronies are smart enough to only steal from the other 99.9%. Once they have your data, you can't request it back anymore, because the worst that can happen to them is a slap on the wrist, which is just the cost of being in the fastest-growing business of our times. In other words: world's fucked and shit sucks.
I just did a bit of poking around on the subject of the "right to be forgotten" and it's legally complex. Data without personally identifying information, and data that's been anonymized through statistical analysis (which LLM training is a form of) aren't covered.
Surely the use of user-deleted content as training data carries the same liabilities as reinstating it on the live site? I've checked my old content and it hasn't been reinstated. I'd assume such a dataset would inherently contain personal data protected by the right to erasure under GDPR, otherwise they'd use it for both purposes. If that is correct, regardless of how they filtered it, the data would be risky to use.
Perhaps the cumulative action of disenfranchised users could still devalue any dataset built from a future checkpoint, or reduce average post quality and thus the site's popularity over time (if we assume content that users delete en masse was useful, which I think is fair).
I think you need to make a special request to get the level of deletion that GDPR requires. I'm not certain; I just remember other users specifically saying you need to send them an email before they have to comply.
I also wouldn't be surprised if their dataset is mostly stripped of usernames to get around GDPR, though I'm no expert.
All that to say I'd be very very surprised if they deleted comments in their dataset.
Very valid point about devaluing the user experience, though, especially when you take Google searches into account. I'm sure Reddit results have already fallen off compared to a year ago, when Reddit would pop up half the time no matter what you searched.
Why would that be? It's not the same.
And what liabilities would there be for reinstating it on the live site, for that matter? Have there been any lawsuits?