Because accuracy requires making a reasonable distinction between truth and fiction, and that requires context, meaning, understanding. Hell, actual humans aren't that great at this task. This isn't a small problem; I don't think you solve it without creating AGI.
There's going to be an entire generation of people growing up with this and "learning" this way. It's like every tech company got together and agreed to kill any chance of smart kids.
I'm not an expert, but it has something to do with full words vs. partial words. It also can't play Wordle because it doesn't have a proper concept of individual letters in that way; it's trained to handle only full words.
They don't even handle full words. It's just arbitrary groups of characters (including spaces and other stuff like apostrophes, AFAIK) represented to the software as indexes into a list. It literally has no clue what language even is; it's a glorified calculator that happens to work on words.
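You can actually see this for yourself with OpenAI's open-source tiktoken library (just a sketch; other models use different tokenizers, and the exact IDs are arbitrary):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the tokenizer used by several OpenAI chat models.
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("strawberry")
print(ids)  # a short list of integers; the values themselves mean nothing

# Each ID maps back to an arbitrary chunk of bytes, not a word or a letter.
for token_id in ids:
    print(token_id, enc.decode_single_token_bytes(token_id))
```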
It can't see the tokens it puts out; you'd need additional passes over the output for it to get this right. That's computationally expensive, so I'm pretty sure it didn't happen here.
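A "second pass" can be as dumb as ordinary string code checking the model's claim. A toy sketch (the function and the sample output are made up for illustration):

```python
def violates_constraint(answer: str, suffix: str) -> bool:
    """Check whether a claimed word actually ends with the required suffix."""
    return not answer.strip().lower().endswith(suffix)

# Hypothetical model output for "name a fruit ending in 'um'":
model_answer = "mango"
if violates_constraint(model_answer, "um"):
    print(f"'{model_answer}' fails the check; reject it or ask the model to retry")
```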
LLMs have three major components: a massive table of "relatedness" (embeddings capturing how closely related the meanings of tokens are), a transformer (attention, which figures out which of the previous tokens carry the most contextual weight), and statistical modeling (the likelihood of the next token, like what your cell phone's autocomplete does).
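If it helps, here's a toy numpy sketch of those three pieces, with random numbers standing in for trained weights (real models are vastly bigger and the "output layer" here is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1) "Relatedness": each token ID maps to a vector; similar meanings -> similar vectors.
vocab = ["cat", "dog", "car"]
embeddings = rng.normal(size=(3, 4))  # toy 4-dimensional embedding table

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(embeddings[0], embeddings[1]))  # "relatedness" score (random here)

# 2) Attention: score each context token against the current one, softmax the scores.
query = embeddings[0]                            # pretend "cat" is the current token
scores = embeddings @ query
weights = np.exp(scores) / np.exp(scores).sum()  # which context tokens matter most

# 3) Next-token statistics: turn a context summary into a probability distribution.
context = weights @ embeddings                   # weighted summary of the context
logits = embeddings @ context                    # toy stand-in for the output layer
probs = np.exp(logits) / np.exp(logits).sum()
print(dict(zip(vocab, probs.round(3))))          # sample or argmax to pick the next token
```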
LLMs don't have any real capability to understand spelling unless it's something they've been specifically trained on, like "color" vs. "colour", which is discussed in many training texts.
"Fruits ending in 'um' " or "Australian towns beginning with 'T' " aren't talked about in the training data enough to build a strong enough relatedness database for, so it's incapable of answering those sorts of questions.
OK, I feel like there have been more than enough articles explaining that these things don't understand logic. Seriously. Misunderstanding their capabilities at this point is getting old. It's time to start making stupid painful.
Reminds me of how the "1,800 gallons for one burger" statistic uses annual rainfall to calculate that number, as if that rain were captured, stored, and used from our kitchen sinks.
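For reference, the number usually traces to a per-pound-of-beef water-footprint figure, and most of it is rain ("green water"). Rough arithmetic, using the commonly cited Mekonnen & Hoekstra global averages (treat the exact percentages as approximate):

```python
LITERS_PER_KG_BEEF = 15400   # total water footprint of beef, global average (approx.)
GREEN_WATER_SHARE = 0.94     # rough fraction that is rainfall on pasture/feed crops
KG_PER_LB = 0.4536
LITERS_PER_GALLON = 3.785

gallons_per_lb = LITERS_PER_KG_BEEF * KG_PER_LB / LITERS_PER_GALLON
print(round(gallons_per_lb))                      # ~1846 -- the famous "1800 gallons"
print(round(gallons_per_lb * GREEN_WATER_SHARE))  # ~1735 of it is rain falling anyway
```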