The conventional wisdom, well captured recently by Ethan Mollick, is that LLMs are advancing exponentially. A few days ago, in a very popular blog post, Mollick claimed that “the current best estimates of the rate of improvement in Large Language models show capabilities doubling every 5 to 14 months”...
I think this article does a good job of asking the question "what are we really measuring when we talk about LLM accuracy?" If you judge an LLM by its hallucinations, its ability to analyze images, its ability to critically analyze text, and so on, you're going to see low scores for all LLMs.
The only metric an LLM should be expected to excel at is "did it generate human-readable and contextually relevant text?" I think we've all forgotten the humble origins of "AI" chat bots. They often struggled to generate more than a few sentences of relevant text, and they often made syntactic errors. Modern LLMs solved these issues quite well. They can produce long-form content that is coherent and syntactically error-free.
However, the content comes with no guarantee of being accurate or critically meaningful. While it often is, LLMs are certainly capable of half-assed answers that dodge difficult questions. They are approaching 95% "accuracy" if you think of them as good human text fakers, and they are pretty impressive at that. But people keep expecting them to do their math homework, analyze contracts, and generate perfectly valid content. They simply aren't built to do that. We work really hard just to keep them from hallucinating as much as they do.
I think the desperation to see these things become essentially indistinguishable from humans is causing us to lose sight of the real progress that's been made. We're probably going to hit a wall with this method. But this breakthrough has made AI a viable technology for a lot of jobs, so it's definitely a breakthrough. I just think either infinitely larger models (for which we can't seem to generate enough training data) or fundamentally new models will be required to leap to the next level.
> But people keep expecting them to do their math homework, analyze contracts, and generate perfectly valid content
People expect that because that's how they're marketed. The problem is the uncontrolled hype around AI these days, to the point of a financial bubble: companies are investing a lot of time and money now based on the promise that AI will save them time and money in the future. AI has become a cult. The author of the article does a good job of setting the right expectations.
I guess ChatGPT 4 has wised up. I'm curious now. Will try it.
Edit: Yup, you're right. It says "bro, you cray cray." But if I tell it that it's a recent math model, then it will say "Well, I guess in that model it's 7, but that's not standard."