Generative AI has recently become one of the major trends in our society, disrupting industries and revolutionizing the way we approach a wide range of tasks. From creating realistic images (Midjourney, DALL-E, Stable Diffusion) to generating text (ChatGPT, Google Sparrow) and even code (GitHub Copilot, Codex), generative AI is changing the game, and the cybersecurity sector is no exception. With its ability to generate new data, discover patterns, and find solutions, generative AI poses new security challenges that need to be addressed before cybercriminals can take advantage of them. By harnessing the power of generative AI, organizations can stay ahead of the curve and proactively address the constantly changing landscape of cyber threats. The future impact of generative AI on cybersecurity is undeniable, and it will begin shaping the industry soon.
However, if the cybersecurity community lags behind in adopting and adapting to generative AI, it may result in a significant disadvantage. Cyber attackers may leverage generative AI to create sophisticated and targeted attacks that are difficult to detect and defend against. As a result, organizations that fail to keep up with these advancements in cybersecurity may face an increased risk of security breaches and other cyber threats. Therefore, it is important for the cybersecurity community to stay updated with the latest advancements and adopt new technologies and strategies to counter emerging threats.
Near-Future Impact of Generative AI on Cybersecurity
Rise in Social Engineering Attacks
Polymorphic Malware and Behavioral Detection
Challenges with Applying AI to Cybersecurity
About Symmetry Systems and DataGuard
Introduction
This position paper explores the potential impact of recent advancements in the field of generative AI on cybersecurity. We focus on how the development of large-scale generative models in natural language processing (NLP) and computer vision (CV) can be adapted to enhance security measures. As the advantages of incorporating machine learning (ML) into cybersecurity strategies are well established, we will not delve into that aspect further in this discussion. Instead, we aim to examine the innovative ways in which cutting-edge ML technology can benefit cybersecurity efforts and what new challenges it may pose to security experts.
Rather than presenting a comprehensive list of all possible impacts, we emphasize only the most significant impact areas for generative AI in cybersecurity. While new applications of generative models, including in the security field, may emerge, we refrain from speculation. The effect of generative AI depends on its rate of advancement: if it continues to grow at the pace it has over the past five years, the impact will be greater than we can currently anticipate.
Near-Future Impact of Generative AI on Cybersecurity
Software 2.0
Unlike traditional software development, where developers must explicitly program every step of a software system, Software 2.0 enables machines to generate their own algorithms based on patterns they discover in data. Developers of Software 2.0 need only define a task objective and, optionally, auxiliary subtasks to accelerate learning. AlphaZero, AlphaGo, learned database indexes, speech synthesis, and image classification all illustrate the Software 2.0 paradigm.
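To make the contrast concrete, here is a minimal sketch (our own illustration, not taken from any of the systems above) in PyTorch: the developer supplies only example behavior and a task objective, and an optimizer discovers the implementation.

```python
# A minimal illustration of the Software 2.0 idea (hypothetical example):
# instead of hand-coding the mapping from inputs to outputs, we specify only an
# objective (a loss function) and let an optimizer search the "program space"
# (here, the weights of a small neural network).
import torch
import torch.nn as nn

# In Software 1.0 the rule would be hard-coded; here we provide only examples of
# the desired behavior plus a developer-defined task objective.
x = torch.linspace(-3, 3, 256).unsqueeze(1)
y = torch.sin(x)                      # the behavior the learned "program" should exhibit

model = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
objective = nn.MSELoss()              # the task objective
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(500):
    optimizer.zero_grad()
    loss = objective(model(x), y)     # how far the learned program is from the goal
    loss.backward()
    optimizer.step()                  # the implementation is refined automatically
```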
The Software 2.0 revolution is expected to bring significant changes to the software engineering and CloudOps fields in the coming years. Today, both software developers and cloud engineers have to build software or infrastructure systems from the ground up, which is time-consuming and often error-prone. The concept of low-code or no-code represents an initial phase of Software 2.0. According to Gartner, Inc., the low-code application platforms market is expected to grow 25% to reach nearly $10 billion in 2023.
The use of powerful AI models and optimization techniques will greatly impact this industry. The focus will shift from implementation to defining objectives, with AI handling the implementation details. This subtle difference between the two approaches will greatly change the technology landscape, requiring humans to focus on high-level goals instead of low-level implementation details. AI will be able to update the implementation as the environment changes: cloud budget changes, the release of new hardware, migration to new software frameworks, updates to network and security standards, and so on. As a result, we will always have the most up-to-date software artifacts, such as current security and cloud configurations or the most performant code. This approach will also help organizations avoid accumulating legacy systems, which usually pose a significant security risk.
Cloud IAM. Symmetry Systems is a leader in applying AI to cloud security: we developed IAMAX, a proprietary AI-based framework that generates provably secure cloud IAM configurations for our customers (Kazdagli et al., 2022). IAMAX focuses on the Identity and Access Management (IAM) component of the cloud. IAM is a crucial aspect of cloud security that involves granting cloud identities access to resources through IAM policies. To provide a high level of security, IAM policies need to strictly adhere to the principle of least privilege, i.e., limit a cloud identity's access to the resources it requires on a daily basis. As explained in the IAMAX blog, properly configuring IAM policies is challenging, leading to insecure policies in real-world scenarios. IAMAX automatically generates optimal IAM policies according to an organization's security needs and continues to monitor and update them as necessary.
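For illustration only, the hedged sketch below shows the flavor of least-privilege policy generation. It is not the IAMAX algorithm: the `observed_events` list is hypothetical, and a real system must also reason about future needs, policy size limits, and risk. The sketch simply collapses the actions an identity was actually observed using into a minimal AWS IAM policy document.

```python
# A simplified, hypothetical illustration (not the IAMAX algorithm): derive a
# least-privilege IAM policy for one identity from the (action, resource) pairs
# it was actually observed using, e.g., extracted from audit logs.
import json
from collections import defaultdict

observed_events = [
    ("s3:GetObject", "arn:aws:s3:::billing-reports/*"),
    ("s3:GetObject", "arn:aws:s3:::billing-reports/*"),
    ("dynamodb:Query", "arn:aws:dynamodb:us-east-1:111122223333:table/invoices"),
]

# Group observed actions by resource so each statement grants only what was used.
actions_by_resource = defaultdict(set)
for action, resource in observed_events:
    actions_by_resource[resource].add(action)

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": sorted(actions), "Resource": resource}
        for resource, actions in actions_by_resource.items()
    ],
}
print(json.dumps(policy, indent=2))
```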
Enhanced Data Confidentiality
Data governance and privacy. A data-driven approach to security often conflicts with data governance and privacy because it requires training ML algorithms on vast amounts of sensitive data. Organizations are typically reluctant to share their data due to privacy concerns, as it may contain customer information or reveal confidential company information that could be exploited by cybercriminals.
Generative models trained using the Federated Learning (FL) approach (Google blog, Augenstein et al., 2020) can address these privacy issues. FL is a recently developed approach to training ML models that avoids moving sensitive data to centralized storage. Specifically, FL trains local ML models within the cloud environments of individual organizations and periodically aggregates model parameters on a centralized server (or updates the parameters of individual models directly when an asynchronous update mechanism is used). This approach preserves the privacy of the data and eliminates the need for costly data transfers to a centralized data repository. Once trained, a generative model can be used for a variety of data analysis tasks, such as anomaly detection, synthetic data generation, anonymized data exploration, visualization, and computing descriptive statistics. By decoupling the actual data from the analytics tools via the generative model abstraction, we can preserve the privacy of organizations' internal data.
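A minimal federated-averaging sketch of this idea follows. It is our simplification of the FL setup described above: real deployments add client sampling, secure aggregation, and differential privacy, and the model and per-organization data loaders are assumed to be supplied elsewhere.

```python
# A minimal FedAvg-style sketch: each organization trains a local copy of the
# model on its own data, and only model parameters are shared and averaged;
# the raw data never leaves the organization.
import copy
import torch
import torch.nn as nn

def local_update(global_model, local_loader, epochs=1, lr=1e-3):
    model = copy.deepcopy(global_model)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in local_loader:          # data stays inside the organization
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
    return model.state_dict()

def federated_round(global_model, org_loaders):
    local_states = [local_update(global_model, loader) for loader in org_loaders]
    averaged = {
        name: torch.stack([state[name].float() for state in local_states]).mean(dim=0)
        for name in local_states[0]
    }
    global_model.load_state_dict(averaged)  # only parameters reach the central server
    return global_model
```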
Data Loss Prevention
Data Loss Prevention (DLP) systems are designed to prevent unauthorized disclosure of sensitive information by monitoring, detecting, and blocking the transmission of sensitive data. DLP systems use a variety of techniques, such as content analysis, fingerprinting, and user behavior monitoring to identify and classify sensitive data. They also enforce policies and rules that dictate how sensitive data can be transmitted or stored. DLP systems are used by organizations to protect their sensitive data from insider threats, external attacks, and accidental disclosure. They are an essential tool for maintaining data privacy, compliance, and security in today’s digital age.
DLP systems are usually heterogeneous in nature: they employ both ML-based methods, such as Named Entity Recognition (NER), and various heuristics expressed as context-free grammars or regular expressions. NER algorithms are optimized on natural language datasets that usually provide broad context around named entity candidates. However, DLP systems are applied in settings where the context around named entities may be missing (e.g., logs, machine-formatted files, etc.). Moreover, machine-generated data may cause spurious matches against heuristic rules.
Large language models (LLMs) can improve the accuracy of DLP through advanced language modeling and few-shot learning capabilities. Few-shot learning refers to the ability to apply a machine learning model to a novel task using only a small number of examples as demonstrations, allowing rapid adaptation to new contexts. LLMs also remove the need to manually develop error-prone heuristics. Although LLMs are computationally expensive, they can be integrated into the final stages of the DLP pipeline, where the nature of a NER candidate still remains unclear (similar to the staged ranking used in recommender systems; Covington et al., 2016).
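The hedged sketch below illustrates such a staged pipeline. The regular expression, the few-shot prompt, and the `call_llm` wrapper are hypothetical placeholders rather than a real DLP product or LLM API; the point is only that cheap heuristics run first and the LLM is consulted last, on ambiguous candidates.

```python
# A sketch of a hybrid DLP flow: cheap heuristics propose candidates, and only
# ambiguous ones are escalated to an LLM with a few-shot prompt.
import re

SSN_CANDIDATE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")   # cheap first-pass heuristic

FEW_SHOT_PROMPT = """Decide whether the highlighted value is a Social Security
number or benign machine-generated data. Answer SENSITIVE or BENIGN.

Text: "order id 123-45-6789 generated by batch job" -> BENIGN
Text: "employee SSN: 078-05-1120 on file" -> SENSITIVE
Text: "{snippet}" -> """

def call_llm(prompt: str) -> str:
    # hypothetical wrapper around whatever LLM endpoint is available
    raise NotImplementedError("plug in your own LLM client here")

def classify_candidates(document: str):
    findings = []
    for match in SSN_CANDIDATE.finditer(document):
        # pass a narrow context window to the (expensive) LLM only for matches
        snippet = document[max(0, match.start() - 40): match.end() + 40]
        verdict = call_llm(FEW_SHOT_PROMPT.format(snippet=snippet)).strip()
        if verdict == "SENSITIVE":
            findings.append((match.group(), snippet))
    return findings
```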
Rise in Social Engineering Attacks
Advanced generative models, especially text models such as GPT (Brown et al., 2020; Radford et al., 2018), will bring social engineering attacks to the next level. Cybercriminals could exploit the few-shot learning capabilities of language models to craft sophisticated versions of phishing and spear phishing attacks, as well as to generate large volumes of misinformation for distribution to the general public.
Advanced generative models, especially text models such as GPT, have the potential to bring social engineering attacks to a whole new level, which is likely to lead to the unintentional exposure of companies' internal data. Our data security posture management (DSPM) solution protects your organization if it inadvertently falls victim to a sophisticated attack.
Spear phishing. Cybercriminals can utilize large language models (e.g., GPT) to streamline spear phishing attacks, which deceive corporate victims into opening emails they believe are from trustworthy sources. Such emails can be used for malicious purposes, such as delivering malware, blackmailing, or initiating fraudulent fund transfers. The advantage of GPT models in this context is their ability to mimic a specific writing style, making the emails appear more trustworthy. This allows even individuals with limited proficiency in common business languages such as English, Spanish, and French to create convincing spear phishing attacks. Two similar spear phishing attacks have happened recently, though they involved using "deep voice" technology alongside carefully crafted email communication ($35M Bank Heist and Voice Deepfake).
Misinformation. Generative text models can have a significant impact on public opinion, target specific individuals, and influence local and national politics and elections by producing and automating large amounts of social media content designed for specific audiences (Data Misuse and Disinformation and Russia's 2016 Meddling). The sophistication of large language models allows for the creation of convincing propaganda for mass digital manipulation campaigns regardless of the perpetrator's command of the target language, particularly English.
Polymorphic Malware and Behavioral Detection
Cybercriminals are well positioned to use text and code generation models to produce malware. Advanced malware can pose a significant risk to data security and privacy, and it can be a significant factor in the unintentional exposure of sensitive data to the public. Researchers from CyberArk have provided a proof of concept of how to use ChatGPT to inject shellcode into explorer.exe on Windows (Polymorphic Malware with ChatGPT). HYAS researchers went even further and built BlackMamba, a proof-of-concept polymorphic malware. BlackMamba exploits a large language model to synthesize polymorphic keylogger functionality on the fly and dynamically modifies benign code at runtime, all without any command-and-control infrastructure to deliver or verify the malicious keylogger functionality.
Fine-tuning a state-of-the-art code generation model on a malware database is likely to further improve these already impressive results. Such generative models lower the barrier to entry into the malware development business; people are no longer required to know intricate OS-level details to design a working piece of malicious software. Moreover, ML can keep generating semantically equivalent malware mutants at no cost, thus achieving the effect of polymorphic malware. It can also generate test cases to make sure that the generated malware mutants operate as expected.
Automated generation of endpoint malware may not be as dangerous as many experts believe because it relies on executing a sequence of “unusual” API/syscalls that can be detected. However, polymorphic malware will render signature-based malware detection inefficient—the only way to deal with it is to employ behavioral ML-based detectors (Kazdagli et al., 2016).
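As one illustration of what such a behavioral detector can look like (our own sketch, not the method of Kazdagli et al., 2016), the snippet below featurizes processes by their system-call bigram frequencies and flags statistical outliers with scikit-learn's IsolationForest; the traces are hypothetical.

```python
# A toy behavioral detector: learn what benign system-call bigram profiles look
# like, then score new traces by how anomalous their profile is.
from collections import Counter
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction import DictVectorizer

def bigram_counts(trace):
    return Counter("->".join(pair) for pair in zip(trace, trace[1:]))

benign_traces = [
    ["open", "read", "read", "close"],
    ["open", "read", "write", "close"],
]  # in practice: thousands of traces collected from known-good endpoints

vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform([bigram_counts(t) for t in benign_traces])
detector = IsolationForest(contamination="auto", random_state=0).fit(X_train)

suspicious = ["open", "mmap", "mprotect", "write", "connect"]  # hypothetical trace
score = detector.decision_function(vectorizer.transform([bigram_counts(suspicious)]))
print("anomaly score (lower is more suspicious):", score[0])
```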
The situation looks much worse in the case of malware that exploits misconfigurations (e.g., cloud-based attacks). Such attacks often escalate privileges or move laterally by exploiting over-permissioned configurations rather than by issuing any "unusual" API calls. This puts extra pressure on designing context-aware behavioral detectors that can reliably distinguish between intended user activity and malicious actions. Such detectors will certainly benefit from the development of robust AI methods and the principled incorporation of domain-specific knowledge.
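A toy example of such a context-aware check, under our own simplifying assumptions (hypothetical identities and audit events, no real cloud API), baselines each identity's historically used permissions and flags the first exercise of a permission that was granted but never used:

```python
# Context-aware check: instead of hunting for "unusual" API calls, compare each
# audit event against the identity's historical permission usage.
from collections import defaultdict

baseline = defaultdict(set)   # identity -> (action, resource) pairs seen in a learning window

def learn(events):
    for e in events:
        baseline[e["identity"]].add((e["action"], e["resource"]))

def score_event(event):
    novel = (event["action"], event["resource"]) not in baseline[event["identity"]]
    # a production detector would also weigh context: time of day, source network,
    # sensitivity of the resource, and the behavior of peer identities
    return {"identity": event["identity"], "novel_permission_use": novel}

learn([{"identity": "ci-role", "action": "s3:GetObject", "resource": "build-artifacts"}])
print(score_event({"identity": "ci-role", "action": "iam:PassRole", "resource": "admin-role"}))
```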
Challenges with Applying AI to Cybersecurity
Quest for Robust AI
Compared to more subjective domains such as image and text generation, cybersecurity imposes strict requirements on the robustness of AI-based solutions. Robust AI is a term used to describe artificial intelligence systems that are capable of functioning reliably and consistently in a wide range of environments and situations. Robust AI is expected to apply its knowledge systematically and to adapt its understanding from one context to another, just as a human would (Marcus, 2020). The absence of proven robustness techniques, especially for modern deep learning-based AI, is a hindrance to the widespread adoption of machine learning-based systems.
The lack of robustness makes AI solutions brittle and overly sensitive to distributional changes in the input data. As a result, such solutions become vulnerable to evasion attacks (Oprea et al., 2023, Goodfellow et al., 2015, Ilyas et al., 2019), which aim to make small alterations to the input data that are undetectable to humans, such as modifying image pixels, adding noise to audio, or adding irrelevant text, in order to deceive the machine learning system. In the cybersecurity domain, these attacks can easily modify input data, such as adding extraneous log entries when analyzing logs or disguising malware with junk assembly instructions. Evasion attacks on machine learning algorithms can be seen as the reincarnation of traditional memory attacks (e.g., buffer overflow, return-oriented programming, etc.). The need for robust AI can be regarded as similar to the need for traditional software verification.
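For readers unfamiliar with evasion attacks, the following minimal sketch in the spirit of the fast gradient sign method (Goodfellow et al., 2015) shows how a small, sign-of-gradient perturbation is constructed; the trained model, inputs, and labels are assumed to exist, and the example is purely illustrative.

```python
# FGSM-style perturbation: nudge each input coordinate slightly in the direction
# that increases the classifier's loss, which can flip a non-robust model's
# decision while remaining nearly imperceptible.
import torch
import torch.nn as nn

def fgsm_perturb(model, x, y, epsilon=0.03):
    x = x.clone().detach().requires_grad_(True)
    loss = nn.CrossEntropyLoss()(model(x), y)
    loss.backward()
    return (x + epsilon * x.grad.sign()).detach()

# usage (hypothetical objects): x_adv = fgsm_perturb(trained_model, images, labels)
```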
External Knowledge Base
AI algorithms rely on the data they are trained on to make generalizations and draw conclusions. Variety in the training data determines the extent to which the final system can generalize and identify patterns. However, the field of security is particularly complex and involves many nuances that are well known to security experts but are difficult to grasp through data alone.
The application of AI to cybersecurity will benefit from the development of efficient methods that can integrate domain-specific knowledge into AI algorithms. This will lead to more efficient training methods that require fewer samples and reduce the risk of bias in the models due to limited training data. One way of adding domain knowledge is to directly modify the objective function (Hu et al., 2018; Joshi et al., 2020).
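As a hedged illustration of this idea (not the specific formulations of the cited papers), the sketch below augments a standard classification loss with a penalty for violating a hypothetical expert rule: events that perform privileged actions should never be scored as confidently benign.

```python
# Folding domain knowledge into the objective: task loss plus a penalty whenever
# predictions violate a rule that security experts already trust.
import torch
import torch.nn as nn

def rule_penalty(probs, is_privileged_action):
    # hypothetical rule: penalize confident "benign" (class 0) predictions for
    # events that perform privileged actions
    return (probs[:, 0] * is_privileged_action.float()).mean()

def combined_loss(logits, labels, is_privileged_action, lam=0.5):
    task_loss = nn.CrossEntropyLoss()(logits, labels)
    probs = torch.softmax(logits, dim=1)
    return task_loss + lam * rule_penalty(probs, is_privileged_action)
```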
Neurosymbolic AI may be especially fruitful in the security domain due to the abundance of symbolic information. It aims to combine the strengths of two main AI paradigms: neural networks and symbolic reasoning. The neural network paradigm is based on data-driven, pattern recognition methods, while symbolic reasoning relies on a priori knowledge encoded in logical rules. Neurosymbolic research seeks to integrate these two approaches to achieve a hybrid AI system that can handle both structured and unstructured data, perform common sense reasoning, and support human-like explanations of its decision-making process.
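A toy sketch of the flavor of such a hybrid follows; the rules, features, and stand-in neural score are all hypothetical. The point is only that the symbolic layer contributes a priori knowledge and human-readable explanations (the rules that fired) alongside the data-driven score.

```python
# A toy neurosymbolic-style decision: combine a neural risk score with explicit
# symbolic rules, and report which rules contributed to the verdict.
def neural_risk_score(features):
    # stand-in for a trained model's output probability in [0, 1]
    return min(1.0, sum(features.values()) / 10.0)

RULES = [
    ("root login from unknown host", lambda e: e["user"] == "root" and not e["known_host"]),
    ("large data egress off-hours", lambda e: e["bytes_out"] > 1e9 and e["off_hours"]),
]

def decide(event, features, threshold=0.7):
    score = neural_risk_score(features)
    fired = [name for name, rule in RULES if rule(event)]
    verdict = "alert" if (score > threshold or fired) else "allow"
    return {"verdict": verdict, "neural_score": score, "rules_fired": fired}

event = {"user": "root", "known_host": False, "bytes_out": 10, "off_hours": False}
print(decide(event, {"failed_logins": 3.0, "geo_velocity": 1.5}))
```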
Conclusion
Generative AI has emerged as a powerful force of change, disrupting and transforming various industries, including the cybersecurity sector. Its ability to create new data, identify patterns, and generate solutions offers immense potential for enhancing security measures. However, as with any new technology, there are also new risks and challenges that need to be addressed. By embracing generative AI and actively working to address its security challenges, organizations can stay ahead of cyber threats and leverage this technology to their advantage. With the undeniable impact that generative AI is already having on cybersecurity, we can expect to see continued evolution and innovation in the industry in the coming years.
About Symmetry Systems and DataGuard
Symmetry Systems helps organizations protect their most mission-critical and sensitive information with cutting-edge technology that secures data from the data out, not the perimeter in. Symmetry's data security solution, DataGuard, helps organizations understand:
- What data do you have?
- Who or what can access the data?
- How is the data being used?
Organizations today rely on massive volumes of data to operate, and with all that data come security challenges. As the pioneer behind Data Security Posture Management (DSPM), Symmetry Systems is redefining what a data-centric approach to cybersecurity looks like—security from the data out, not the perimeter in—using a deployment approach that integrates Zero Trust and can examine data across environments at the same time: the cloud, on-prem, or both.