There’s already more than enough training data out there. The important thing that remains is to filter it so it doesn’t also include humanity’s stupidest data.
That and make the algorithms smarter so they are resistant to hallucination and misinformation - that’s not a data problem, it’s an architecture problem.
Well, it's established wisdom that the dataset size needs to scale with the number of model parameters. Roughly in proportion, IIRC, not quadratically; the commonly quoted rule of thumb is about 20 training tokens per parameter. If you don't have that much data the training basically won't work; it will overfit or just not progress.
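For a rough sense of scale, here's a back-of-the-envelope sketch. It assumes the often-quoted heuristic of about 20 training tokens per parameter from compute-optimal scaling studies; that ratio and the `tokens_needed` helper are my own assumptions for illustration, not figures from the article.

```python
# Assumed rule of thumb (not from the article): compute-optimal training
# uses on the order of 20 tokens per model parameter.
TOKENS_PER_PARAM = 20


def tokens_needed(n_params: int, tokens_per_param: int = TOKENS_PER_PARAM) -> int:
    """Estimate the training tokens wanted for a model with n_params parameters."""
    return n_params * tokens_per_param


# Hypothetical model sizes, chosen only to show how the requirement grows.
for n_params in (1_000_000_000, 70_000_000_000, 1_000_000_000_000):
    print(f"{n_params:>17,} params -> ~{tokens_needed(n_params):,} tokens")
```

Under that assumption the data requirement grows in step with model size, so a 10x bigger model wants roughly 10x more tokens.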
Chess bots (like Stockfish) are trained on game samples, with the goal of predicting what search path to keep looking at and which moves will result in a win. You get game samples by playing the game, so it made sense to have Stockfish play itself, since the input was always still generated by the rules of chess.
If a classifier or predictive model creates its own data without tying it to the rules and methods of reality, it's going to become increasingly divorced from reality. If I had to guess, that's what the guy in the article is referencing when talking about "sanitizing" the data. Some problems, like chess, are really easy. Mimicking human speech? Probably not.
Yeah, because the human developers know the rules of chess, so it's easy to generate or verify perfect-quality games at massive scale. Natural language can't be tackled like that; certainly not yet, probably not ever. Many have tried and failed to parse natural language algorithmically, but at the end of the day it seems to rely heavily on loose conventions and endless shared experiences. So, you need content from the wild, or you're basically letting the AI mark its own homework.
I work in AI. Synthetic data is very common, and lots of companies use it. It's also very common in academia, as it's an easy way to get data. It can range from totally fabricated examples to data produced by techniques like machine translation, which transforms existing data from one language into another.
When they say "AI generated", it's probably just using one of the APIs the LLM orchestrates.
In other news, the world's wealthiest people are running out of money after burning through the entire planet. Sources say one of the world's multi-billionaires purchased a law firm that was in bed with the RIAA roughly 10-15 years ago when music piracy was supposedly costing more money than the GDP of all the peoples of the world, combined. "The Owners" (as they have recently rebranded) have decided to collect on this unpaid debt from every living soul, and from all the multinational companies who have been long-established as having no living souls whatsoever. A nameless, faceless, pitiless representative was quoted as saying: "Resistance... is futile. Your life, as it has been, is over. From this time forward, you will service... us."
Is that just a wild assumption, or...? One phenomenon that has already been witnessed with AI is that it does in fact get worse if it trains on its own output.
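You can see a toy version of that effect without any real AI. The sketch below is only an illustration under my own assumptions (a one-dimensional Gaussian standing in for a "model", a hypothetical `simulate_self_training` helper, small sample sizes); it is not how any actual LLM is trained. Each generation is fitted only to samples drawn from the previous generation's fit, never to fresh real data.

```python
import random
import statistics


def simulate_self_training(generations=500, sample_size=10, seed=0):
    """Repeatedly refit a Gaussian to samples drawn from its own previous fit."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0           # generation 0: the "real" distribution
    spread_history = [sigma]
    for _ in range(generations):
        # The model generates its own training data...
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # ...and the next model is fitted to that data alone.
        mu = statistics.fmean(samples)
        sigma = statistics.stdev(samples)
        spread_history.append(sigma)
    return spread_history


history = simulate_self_training()
print(f"spread at generation 0: {history[0]:.3f}")
print(f"spread at generation {len(history) - 1}: {history[-1]:.3e}")
```

With small samples the estimated spread tends to drift toward zero over many generations, so the "model" gradually forgets the variety in the original data. Mixing fresh real samples back in at each step is the obvious thing to try if you want to see the drift slow down.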
You’re rooting for a revolutionary new technology to fail rather than get better
As long as the oligarchs who run and own these AI systems are at the helm, yes, I'm rooting for it to fail. Better is in the eye of the beholder. Because come on, we all know better is going to be defined as better for the oligarchs, not you or me.
There's nothing 'revolutionary' about a mass theft machine until EVERYONE IT'S STEALING FROM is getting paid out of the thieves' pockets for what was stolen from them; and the people that run it make no profit from it. Til then, it's just business as usual out of the west's necrocapitalists; and your business makes me vomit.
Then again, there is another obvious solution to this manufactured problem: AI companies could simply stop trying to create bigger and better models, given that aside from the training data shortage, they also use tons of electricity and expensive computing chips that require the mining of rare-earth minerals.
It's always been a boondoggle...
But there has to be something investors don't understand that they'll dump billions into.
Might as well stop producing new GPUs entirely. Video games, video editing, shit, basically anything done on a computer is a waste of electricity and rare-earth minerals.
We don’t even need search engines, let’s go back to libraries and paper books!
As long as it’s not housing or food, we don’t need it. Let’s go full fucking anprim because anything else isn’t required to survive and is a waste of resources.
Easy -- their methods aren't sufficient to begin with. No amount of training data would be enough. But perhaps they can develop new methods with what they've learned.
While the article makes a big deal about a lack of data and even hints at synthetic data as an option, the truth is that synthetic data is already being used and is apparently just as good for training. It's a misleading article, especially the headline, designed to stir up the AI haters.
They seem to be experimenting with that for sure, but they need to ensure the quality of the model doesn't degrade, as per the source article:
Anthropic’s chief scientist, Jared Kaplan, said some types of synthetic data can be helpful. Anthropic said it used “data we generate internally” to inform its latest versions of its Claude models. OpenAI also is exploring synthetic data generation, the spokeswoman said.
Imo we've clearly hit a limit with vertical scaling of data. We need some kind of breakthrough on better ways to process what data we've got if we want to continue making meaningful progress.