ChatGPT provides false information about people, and OpenAI can’t correct it

It’s clear that companies are currently unable to make chatbots like ChatGPT comply with EU law when processing data about individuals. If a system cannot produce accurate and transparent results, it cannot be used to generate data about individuals. The technology has to follow the legal requirements, not the other way around.
ChatGPT is not an information repository.
The correct answer to this problem is not "we can't correct it"; it is "this class of task is completely out of scope for ChatGPT, and we will do everything we can to make sure users understand that". Unfortunately, OpenAI knows damn well this is how the public perceives and uses its product and seems happy to let this misconception persist.
We do need laws to curb this, but it's really more a marketing issue than a technological issue. The underlying technology is amazing; the applications built around it are mostly garbage. What we have here is a hype trainwreck.
Yet, LLMs are trained on data - an information repository. They are capable of accessing and recalling the contents of that information repository, and relaying information from that repository to an end user. It may not be an information repository functionally, but it legally seems to have the capabilities to be classified as one. (I am neither a lawyer nor a programmer, and I am not in the EU.)
The software breaks the law, and the people who built it knew that this was likely the case. It was developed as a research project, which has very different legal requirements from a consumer product. They might not outright ban the software, but they might issue some hefty fines, etc. Banning a product is not the only recourse of the courts.
This is not correct based on my understanding of LLMs, but I am certainly not an expert. As I understand it, it's basically a statistics exercise in how they determine what order to put words into. They don't 'look stuff up' in their training data. They probably don't even have access to their training data once the model is complete. These models are trained on terabytes of data but are small enough to fit in memory, so it's impossible for them to still have access to all that. But it wouldn't matter if they did, because that's not how they work.
They don't recall information from a repository; the repository is translated into a set of topic-based, weighted probabilities of which words come next.
Those probabilities are then used to reconstruct a best guess at which words come next when generating strings of language.
It's not recall; it's a form of "free" association that is quite tightly bound to the context, topic, and weightings of the training data.
That process is not precise, and it is more likely to produce average-sounding answers and sentences than precise ones.
It's not recall; it's really convincing lies.
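To make the "probabilities, not recall" point concrete, here's a toy sketch. It is a word-pair (bigram) counter, which is vastly simpler than a real LLM and not how ChatGPT is actually built; the names, the three-sentence corpus, and the helper functions are all made up for illustration. It only shows the general idea that a model which keeps next-word probabilities, and not the sentences themselves, can rate a false statement as more likely than the true one it was trained on.

```python
import random
from collections import Counter, defaultdict

# Hypothetical "training data": three true statements about made-up people.
CORPUS = [
    "alice was born in paris",
    "bob was born in london",
    "carol was born in london",
]

def train(sentences):
    """Count which word follows which. The sentences themselves are not kept."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    # Turn raw counts into next-word probabilities.
    return {
        prev: {w: c / sum(nxts.values()) for w, c in nxts.items()}
        for prev, nxts in counts.items()
    }

def sentence_probability(model, sentence):
    """Probability the model assigns to producing exactly this sentence."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, nxt in zip(words, words[1:]):
        p *= model.get(prev, {}).get(nxt, 0.0)
    return p

def generate(model, rng, max_words=20):
    """Sample one word at a time from the next-word probabilities."""
    word, out = "<s>", []
    for _ in range(max_words):
        choices = model[word]
        word = rng.choices(list(choices), weights=list(choices.values()))[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

model = train(CORPUS)
p_true = sentence_probability(model, "alice was born in paris")    # 1/9
p_false = sentence_probability(model, "alice was born in london")  # 2/9
print(f"true claim:  {p_true:.3f}")
print(f"false claim: {p_false:.3f}")
print("sample:", generate(model, random.Random(0)))
```

In this toy setup the false sentence comes out twice as likely as the true one, simply because "london" follows "in" more often in the training text. Nothing was looked up, and there is no stored record about "alice" that could be corrected afterwards.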
To clarify, I mean to say that users should not consider it an information repository, because it does not function as one, by design. Whether it should be classified as such under the law is another matter, one on which I do not have enough knowledge to comment. I do think OpenAI is presenting ChatGPT inappropriately, and I hope they will be held accountable for that.
I'm sure in the future we will see true databases built on the same technology (and they will be awesome, if implemented properly). But that's not what ChatGPT is (or, as far as I know, any other existing LLM-based application). Any information it is able to "recall" is almost a coincidence of how it was trained. You can sort of think of it like lossy compression. The LLM gets all of its information from its training set, but it is not designed to retain any specific information from the training set in full. In cases where it does, that usually means one of two things: