AI and the software supply chain: Application security just got a whole lot more complicated


As artificial intelligence (AI) captivates the hearts and minds of business and technology executives eager to generate rapid gains from generative AI, security leaders are scrambling. Seemingly overnight, they’re being called to assess a whole new set of risks from a technology that is in its infancy.


Many are being called on to develop policies for emerging use cases of large language models (LLMs) before most technologists even fully understand how they work. And veteran security pros are ruminating on long-term game plans for managing risks in an AI-dominated landscape.

Software supply chain security issues are now familiar to most application security and DevSecOps pros. But AI is upping the demand on security teams that are already stretched thin, adding new risks that center on how AI researchers and developers source the AI models and training data they use to build their systems.

Here’s what your team needs to know about securing the software supply chain as AI wins the hearts and minds of business leaders.


Understanding AI and the software supply chain: Code, models — and data

AI adoption brings a lot of new security issues — and new twists on old issues — for cybersecurity pros to ponder, said Patrick Hall, data scientist and co-founder of BNH.AI, a boutique law firm focused on minimizing the legal and technical risks of AI.

“AI security risks aren’t always the same as traditional cybersecurity risks, but AI does bring a lot of familiar risk elements together.”
—Patrick Hall

Hall said that in addition to AI transparency and AI bias, things such as intellectual property, data security, and supply chain security are all on the table for security teams now.

“They all get blended together in a pretty difficult risk milkshake. These are all some very weighty issues that we’ll be addressing over the next few years.”
—Patrick Hall

Software supply chain security in particular will grow even more complex as a result of expanded AI deployments. Just as with other kinds of modern software development, AI systems are composed of a wide range of open-source components. Making things even more complicated is that, in addition to software components, AI systems are reliant on open-source AI models and open-source training data, said Hyrum Anderson, distinguished ML engineer at Robust Intelligence.

“Supply chain security is going to be a huge concern for AI risk management in the coming years. It’s software, models, and data.”
—Hyrum Anderson

AI development is a science that is still deeply rooted in the experimental and collegial ethos of the research community. AI systems tend to be quickly evolving through a lot of third-party relationships and collaboration that cuts across organizations, with everyone sharing and reusing models and data freely. This creates difficulties not only in dependency mapping, but also in change management and version control, said Chris Anley, chief scientist for NCC Group.

“The involvement of third parties is really important in terms of AI security, as well as supply chain issues. It all depends on the use case, but there are a lot of risks out there from the supply chain and third parties.” 
—Chris Anley

Over the last couple of months, a number of reports have begun to scratch the surface of how deep AI software supply chain risks can go. A June report from Rezilion, which examined 50 of the most popular generative AI projects on GitHub, found that when evaluated with the Open Source Security Foundation (OpenSSF) Scorecard, the projects' average security-posture score did not exceed 4.6 out of 10. Factors considered in scoring included trust-boundary risks, data-management risks, and inherent model risks.

In another recent report, from Endor Labs, researchers noted that AI APIs alone pose a considerable amount of risk in the software development ecosystem. ChatGPT's API, for example, is used in 900 npm and PyPI packages across a range of "problem domains," the report found. It stated that the top 100 open-source AI projects on GitHub have an average of 208 direct and transitive dependencies, and that one in five of these AI projects has more than 500 dependencies.

Most telling, 52% of AI project repositories reference known vulnerable dependencies in their manifest files. 
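Teams that want to check their own manifests against findings like these can query a public vulnerability database. Below is a minimal, hypothetical sketch (not from the Endor report) that parses a pinned `requirements.txt` and builds query payloads for the OSV.dev vulnerability API; the package names and versions shown are placeholders.

```python
import json

def osv_queries_from_requirements(requirements_text: str) -> list[dict]:
    """Build OSV.dev query payloads from a pinned requirements.txt.

    Each payload can be POSTed to https://api.osv.dev/v1/query to check
    a pinned PyPI dependency for known vulnerabilities.
    """
    queries = []
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line or "==" not in line:
            continue  # only exact pins can be checked by version
        name, version = (part.strip() for part in line.split("==", 1))
        queries.append({
            "package": {"name": name, "ecosystem": "PyPI"},
            "version": version,
        })
    return queries

# Hypothetical manifest from an AI project
reqs = """\
transformers==4.30.0  # model loading
numpy==1.24.3
"""
payloads = osv_queries_from_requirements(reqs)
print(json.dumps(payloads, indent=2))
```

In practice a dedicated scanner such as `pip-audit` or the OSV-Scanner CLI does this end to end, including transitive dependencies, which the sketch above does not resolve.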

As Anderson and his co-author, Ram Shankar Siva Kumar, emphasize in their book, Not with a Bug, but with a Sticker: Attacks on Machine Learning Systems and What to Do About Them:

“Its reliance on an open source ecosystem makes AI systems especially susceptible to supply chain attacks. Bugs in AI software are existing vulnerabilities inherited from the traditional space.” 

Securing the data supply chain: Data poisoning at scale

The early research from Rezilion and Endor offers glimpses of the more traditional, code-related supply chain vulnerabilities in popular open-source AI projects, both in the software and in the models, which typically also carry executable code. But what of the data supply chain? This is where things get particularly tricky for AI defense.

Securing the data supply chain for AI is going to pose problems that can't necessarily be solved with a software bill of materials or many of the other solutions currently developed for software supply chain security. Some of the biggest risks to AI resilience and robustness — whether from adversarial AI attacks or just plain malfunction — stem from the choice of training data, which shapes the assumptions an AI model makes on its way to producing meaningful decisions or output, said Steve Benton, vice president and general manager of threat intelligence at Anomali.

All AI and ML systems are first "built" (the platform and infrastructure they run on) and then go through a period of training and learning to acquire the knowledge they need to operate in the problem space they were designed for.

“That training takes significant effort, and the training data itself acquires huge value, as it is the means by which any clone/replacement would also need to be trained. If this training data is compromised, the AI/ML can be poisoned so that it can no longer be trusted to produce the right answers.”
—Steve Benton

Data poisoning attacks are one of the top threats to AI that keep executives up at night, Anderson and Siva Kumar write in their book. Anderson explained that for many systems, it is a near-trivial task to poison enough of a dataset to throw off an AI model.

Elaborating on research that was published earlier this year and that he cowrote with a number of luminaries from Google, NVIDIA, and ETH Zurich, Anderson said:

“You only have to control 0.1% of the data to effectively poison some of the most popular models in the world.” 

That study, titled "Poisoning Web-Scale Training Datasets Is Practical," will get a deep-dive treatment next month at Black Hat USA from Will Pearce, one of its co-authors and the AI red team lead at NVIDIA. There, the researchers will show that gaining control over that small amount of data is easier than most people would imagine.

Many models depend on open-source and shared data, much of it scraped from the Internet to reach the scale needed to train AI models effectively. One of the key takeaways from the research: An attacker can feasibly poison deep learning models and LLMs at web scale with an investment of as little as $60.


AI and supply chain security: A whole new challenge

As security teams try to wrap their heads around AI security, data supply chain security issues will mirror what the software world has already started facing with software supply chain issues. But proving data provenance is going to be an entirely new challenge — requiring new research, new best practices, and new technology.

For now, Anderson said, one of the biggest things that the security industry can do is help drive awareness that AI researchers and developers should be publishing cryptographic hashes of their training data alongside their models to help with integrity checking.
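The integrity check Anderson describes can be sketched in a few lines: the publisher hashes each training-data file and ships the hashes as a manifest alongside the model, and consumers recompute the hashes before training. This is an illustrative example, not a reference implementation; the file names and manifest shape are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large training shards fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_manifest(manifest: dict[str, str], data_dir: Path) -> list[str]:
    """Return the names of files whose current hash no longer matches the manifest."""
    return [
        name for name, expected in manifest.items()
        if sha256_of_file(data_dir / name) != expected
    ]

# Demo: publish a manifest, then simulate the shard being tampered with.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    shard = data_dir / "shard-000.jsonl"  # hypothetical training shard
    shard.write_bytes(b'{"text": "example"}\n')
    manifest = {shard.name: sha256_of_file(shard)}  # published with the model
    shard.write_bytes(b'{"text": "poisoned"}\n')    # tampering after publication
    mismatches = verify_manifest(manifest, data_dir)
```

Any file listed in `mismatches` has changed since the hashes were published and should not be used for training until the discrepancy is explained.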

*** This is a Security Bloggers Network syndicated blog from ReversingLabs Blog authored by Ericka Chickowski. Read the original post at:

Ericka Chickowski

An award-winning freelance writer, Ericka Chickowski covers information technology and business innovation. Her perspectives on business and technology have appeared in dozens of trade and consumer magazines, including Entrepreneur, Consumers Digest, Channel Insider, CIO Insight, Dark Reading and InformationWeek. She's made it her specialty to explain in plain English how technology trends affect real people.
