To me this seems obvious: the models are trained on GitHub as a whole. Most code on GitHub is either insecure or was written without needing to be secure.
I'm already getting pull requests from juniors trying to sneak in AI-generated code without actually reading it.
Most code on GitHub is either insecure or was written without needing to be secure.
That is a bit of a stretch, imho. There are myriad open source projects hosted on GitHub that do need to be secure in the context where they are used. I am curious how you came to that conclusion.
I’m already getting pull requests from juniors trying to sneak in AI-generated code without actually reading it.
That is worrisome, though. I assume these people had some background/education in the field before they were hired?
On the first point: there are a lot of very valid projects like the ones you mention, but there are way, way more things like CS201 projects hosted for review. For LLM training I do wonder if they assigned weights by project quality, but I doubt it. On the second point I was trying to make: even then, there's probably a lot of good code that simply doesn't have to be security aware. A login flow for a local game may be very simple, just enough to access your character, and the developer chose a naive way to do it knowing it was never going to be exposed. But to an LLM it's just "here's a login flow", and how would it know it was never intended for prod?
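To make that concrete, here's a hypothetical sketch (names and data invented for illustration) of the kind of "fine for a toy, terrible as training data" login flow I mean:

```python
# Hypothetical example: a naive login flow that is adequate for a local,
# single-player game but insecure by any production standard.
# Passwords are stored in plaintext and compared with ==, with no hashing,
# no constant-time comparison, and no rate limiting. To a model scraping
# code, though, this just looks like "here's how a login flow works".

USERS = {"alice": "hunter2", "bob": "password1"}  # plaintext on purpose: local toy

def naive_login(username: str, password: str) -> bool:
    # Plain equality check; fine for picking a character locally,
    # never acceptable for anything network-facing.
    return USERS.get(username) == password

print(naive_login("alice", "hunter2"))  # True
print(naive_login("alice", "wrong"))    # False
```

Nothing in that snippet signals "do not imitate me in production", which is exactly the problem.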
On the second point: absolutely. I don't think it's intentional; it's misplaced trust in the system mixed with the naive hopes of a jr dev, which, hey, we've all been through. Jr: "Hey, it works! Awesome, task done!" Sr: "Yeah, but does it work well? Does it work for our use case? Will it scale when we hit it with 100k users?"
I wish I could double-upvote this for the use of "Betteridge's law of headlines". Once because I rarely see that referenced and again because I had forgotten what the adage was called.
Quoting the abstract (I added emphasis and paragraphs for readability):
AI code assistants have emerged as powerful tools that can aid in the software development life-cycle and can improve developer productivity. Unfortunately, such assistants have also been found to produce insecure code in lab environments, raising significant concerns about their usage in practice.

In this paper, we conduct a user study to examine how users interact with AI code assistants to solve a variety of security related tasks.

Overall, we find that participants who had access to an AI assistant wrote significantly less secure code than those without access to an assistant. Participants with access to an AI assistant were also more likely to believe they wrote secure code, suggesting that such tools may lead users to be overconfident about security flaws in their code.

To better inform the design of future AI-based code assistants, we release our user-study apparatus and anonymized data to researchers seeking to build on our work at this link.
Caveat, quoting from Section 7.2, Limitations:
One important limitation of our results is that our participant group consisted mainly of university students which likely do not represent the population that is most likely to use AI assistants (e.g. software developers) regularly.