Good morning, and good afternoon to those of you in other parts of the world. Apologies for running a bit late this morning. There always seems to be a technical glitch, but I'm glad we're here now.
So, good morning once again. I’m Claude Mandy. I’m the Chief Evangelist at Symmetry Systems, and I’m here to give you insight into what we presented first at the RSA Conference this year to our unconference crowd: the state of data and AI security. It was well received, and we’ve recently published it for broader consumption. So I’m pleased to take this opportunity to give you some insights from it and work through it.
As the Chief Evangelist, I get to do a lot of fun stuff. I’m a former Gartner analyst and a former CISO, and presenting research like this and the insights from it is something I really enjoy. So I’m glad I have the opportunity to do that at Symmetry.
Let’s dive into it.
First, I'm going to talk a little bit about our methodology and the approach we've taken. This is a data-driven study. It's not a survey where we ask people for their opinions or their perceptions of how they've performed; it's based on data from over 50 organizational deployments of our platform across multiple industries. The vast majority are in the technology space, crossing over into fintech, medtech, and other segments. We also have a large proportion in financial services and insurance, plus some interesting companies from biotech and even the federal sector.
It is important to note that we have some air-gapped deployments where we have no visibility into, or access to, the data within the environment, so we used only the non-air-gapped deployments. Even then, we anonymized everything as we extracted it, so the study is built purely on high-level metadata about the findings. This underscores our dedication to data privacy and security more generally.
At a high level, this study shows that even when you're looking at AI or anything else, two things are undeniable. Data is the new endpoint: it's what we need to protect, it's what organizations are focused on gathering, and unfortunately, it's what threats are focused on as well. And on the other side, identity is the new perimeter. Our findings broadly fall into these identity and data categories, with both then applied to AI.
This flows into our secret sauce, which I'll spend just a couple of minutes describing. When we look at identity on one side and data on the other, a couple of pieces of connective tissue form between them. The first is understanding what that data is, including classification and related aspects, really focusing on the data itself. Then, on the identity side, connecting identities to those data objects is about how the data is used: the operations people are performing on it and the permissions they have. These go hand in hand. You can't perform an operation unless you have permission, but even failed operations give you insight into what people are trying to do from a threat perspective.
When we look at data security posture at Symmetry, our secret sauce is identifying all three of these elements: classification, permissions, and operations, and pulling them together to answer simple questions like "who has access to my data?" While we couldn't include all of that information in our study, such as the percentage of contractors and the types of vendors with access, the insights from our platform are powerful.
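To make that concrete, here is a minimal sketch of how those three elements might be joined to answer "who has access to my data?" The record types, field names, and sample data are hypothetical illustrations, not Symmetry's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical records for illustration only, not Symmetry's actual schema.

@dataclass
class DataStore:
    name: str
    classifications: set = field(default_factory=set)  # e.g. {"PII", "PHI"}

@dataclass
class Grant:
    identity: str  # human or non-human identity
    store: str     # data store name
    actions: set   # permitted operations, e.g. {"read", "write"}

def who_has_access(stores, grants, classification="PII"):
    """Identities whose permissions reach any store holding the classification."""
    sensitive = {s.name for s in stores if classification in s.classifications}
    return sorted({g.identity for g in grants if g.store in sensitive})

stores = [DataStore("customers-db", {"PII"}), DataStore("dev-scratch")]
grants = [Grant("alice", "customers-db", {"read"}),
          Grant("ci-role", "dev-scratch", {"read", "write"})]
print(who_has_access(stores, grants))  # ['alice']
```

In practice this runs over a permissions graph rather than flat lists, with observed operations layered on top to show who actually uses that access, but the shape of the question is the same.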
Without further ado, I’m going to jump straight into discussing some of our key findings. This is the highlight slide. I’m not going to talk about them all in detail. Some are pretty obvious when you look at them logically, but some stand out immediately.
The top two I would focus on to start with are:
There is sensitive data out there. The data stores we come across include relational databases, unstructured data within S3 buckets, large data lakes, and corporate collaboration services like OneDrive and Google Drive. About 10% of these contain some form of sensitive data.
Every organization we analyzed had at least one account without MFA enabled that also had console or other interactive access to its environment. That usually means access to some of those data stores containing sensitive data. (A quick way to check this yourself on AWS is sketched just below.)
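As a minimal illustration of how you might check this yourself on AWS (assuming boto3 and credentials allowed to call iam:GenerateCredentialReport; other clouds have analogous reports), this sketch flags console users without MFA:

```python
import csv, io, time
import boto3  # AWS SDK for Python

iam = boto3.client("iam")

# The credential report is generated asynchronously; poll until it's ready.
while iam.generate_credential_report()["State"] != "COMPLETE":
    time.sleep(2)

report = iam.get_credential_report()["Content"].decode("utf-8")
for row in csv.DictReader(io.StringIO(report)):
    # Console (interactive) access without MFA is the risky combination.
    if row["password_enabled"] == "true" and row["mfa_active"] == "false":
        print(f"Console user without MFA: {row['user']}")
```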
With recent news about multiple breaches and discussions about organizations not enabling MFA, this finding is quite relevant. But it’s not the most insightful item we’ve found. I’ll focus more on the information we discovered when drilling down into it.
All of the organizations had a lot of personal information, but of different types. We found more than 20 types of personal information, ranging from sexual orientation and gender to medical conditions and illnesses, which is relevant for HIPAA compliance. Every organization had names, email addresses, and phone numbers, which are personally identifiable information. This poses a challenge: that data is useful for social-engineering campaigns, yet it is not typically a focus from a security perspective.
It's interesting to drill down into what we need to secure and how diverse that is. The proportion of data stores containing sensitive data varies by industry. Data brokers, whose business is selling data, have a lot of it. In dev environments, by contrast, we found sensitive data in less than 1% of data stores, and in corporate collaboration services the figure is 3%. With a huge number of potential data objects to scan, the scale of the problem we face is significant.
The image on the left shows permissions and how they flow across environments. You can hardly distinguish individual environments because there is no clustering; effectively it is one environment, because permissions allow access across environment boundaries. Our research showed that compromising only three accounts or roles could give access to all the data in an environment. The image on the right shows the cleaned-up permissions, with better segmentation and clear separation of the dev, test, and production environments.
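That "three accounts reach everything" observation is essentially a set-cover question over the permissions graph. Here is a rough sketch of the idea using a greedy heuristic; the identity-to-store reachability map is hypothetical input you would derive from your own permissions data.

```python
def min_identities_covering(reach):
    """Greedy set cover: pick identities until their combined reach spans all stores.
    `reach` maps identity -> set of data stores its permissions can access."""
    all_stores = set().union(*reach.values())
    covered, chosen = set(), []
    while covered != all_stores:
        # Pick the identity that adds the most not-yet-covered stores.
        best = max(reach, key=lambda i: len(reach[i] - covered))
        if not reach[best] - covered:
            break  # remaining stores are unreachable from any identity
        chosen.append(best)
        covered |= reach[best]
    return chosen

reach = {
    "admin-role":  {"prod-db", "data-lake", "logs"},
    "ci-role":     {"dev-db", "artifacts"},
    "backup-role": {"prod-db", "archive"},
}
print(min_identities_covering(reach))  # ['admin-role', 'ci-role', 'backup-role']
```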
Secrets, credentials, and keys, such as API keys and AWS access keys, are another focus area. We found these in data stores outside of secret managers. AWS access keys were the most common, found even in GCP and OneDrive. This highlights the uncontrolled proliferation of keys and the need for organizational focus to solve this issue.
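As an illustration of the pattern matching involved: AWS access key IDs have a well-known format (an "AKIA" prefix followed by 16 characters), so the core of a scan looks something like the sketch below. A real scanner covers many more secret formats and file types; this is just the idea.

```python
import re
from pathlib import Path

# AWS access key IDs follow a documented format: "AKIA" + 16 uppercase letters/digits.
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

def scan_for_keys(root):
    """Walk a directory tree and report files containing AWS access key IDs."""
    for path in Path(root).rglob("*"):
        if path.is_file():
            try:
                text = path.read_text(errors="ignore")
            except OSError:
                continue  # unreadable file; skip it
            for match in AWS_KEY_ID.finditer(text):
                print(f"{path}: possible AWS access key ID {match.group()[:8]}...")

scan_for_keys(".")
```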
IP addresses can also be sensitive, especially when stored in logs. Organizations must ensure that IPs are used for the right purposes and are accessible only to authorized individuals.
Dormant data is a growing concern. About 60% of data stores had no operations performed on them in the last 90 days, a 500% increase from 12 months ago. This includes legal archival data and OneDrive data belonging to former employees. The security risk lies in permissions that never change, which is something we help customers address.
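The dormancy test itself is simple once you have, per data store, a timestamp for its most recent operation (derived from audit logs such as CloudTrail or their equivalents). A minimal sketch with hypothetical input:

```python
from datetime import datetime, timedelta, timezone

DORMANCY_WINDOW = timedelta(days=90)

def dormant_stores(last_operation, now=None):
    """Return stores with no recorded operation inside the dormancy window.
    `last_operation` maps store name -> timestamp of its most recent operation,
    with None meaning no operation was ever recorded."""
    now = now or datetime.now(timezone.utc)
    return [store for store, ts in last_operation.items()
            if ts is None or now - ts > DORMANCY_WINDOW]

last_operation = {
    "prod-db":       datetime.now(timezone.utc) - timedelta(days=3),
    "legal-archive": datetime.now(timezone.utc) - timedelta(days=400),
    "old-onedrive":  None,
}
print(dormant_stores(last_operation))  # ['legal-archive', 'old-onedrive']
```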
Dormant identities are another issue: 25% of identities, both human and non-human, performed no operations in the last 90 days, and the number of dormant identities is growing by 122% annually. Compromised credentials tied to these dormant accounts pose a significant risk.
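For AWS specifically, one way to approximate this for human identities is IAM's last-used data; roles and other non-human identities need their own checks. A minimal sketch, assuming boto3 and read-only IAM permissions:

```python
from datetime import datetime, timedelta, timezone
import boto3  # AWS SDK for Python

cutoff = datetime.now(timezone.utc) - timedelta(days=90)
iam = boto3.client("iam")

for page in iam.get_paginator("list_users").paginate():
    for user in page["Users"]:
        # PasswordLastUsed is absent if the user never signed in to the console.
        last_used = user.get("PasswordLastUsed")
        # Also check access keys, which cover programmatic activity.
        for key in iam.list_access_keys(UserName=user["UserName"])["AccessKeyMetadata"]:
            info = iam.get_access_key_last_used(AccessKeyId=key["AccessKeyId"])
            key_used = info["AccessKeyLastUsed"].get("LastUsedDate")
            if key_used and (last_used is None or key_used > last_used):
                last_used = key_used
        if last_used is None or last_used < cutoff:
            print(f"Dormant identity: {user['UserName']}")
```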
Let’s pause here to see if there are any questions on the environment before we deep dive into our copilot view. Please feel free to add comments on the live feed.
Diving into Copilot, this layout represents a simple organization with only ten people. The red lines show unintended access to sensitive data. Copilot doesn't introduce new access; it surfaces existing access much more quickly. Our CEO, Mohit, humorously noted that it solves the shortage of pen testers, since every identity in your org is now AI-powered.
Most organizations with Microsoft 365 are planning to enable Copilot. OneDrive and SharePoint hold millions of files, a small percentage of which both allow anonymous or organization-wide access and contain sensitive data. Tackling that exposure proactively helps reduce DLP incidents and shift DLP left.
But we need the right controls around it to make sure it's only used for the intended purposes, so that restricted internal access provides proper governance. When sharing sensitive content, like IP addresses or logs, with other people, we need to ensure it's for the right purposes and that it carries the proper classification when shared externally.
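Before we move on to unstructured data: one way to find those overly broad sharing links in Microsoft 365 is the Microsoft Graph permissions endpoint, which reports each sharing link's scope. A minimal sketch, with placeholder drive and item IDs and a placeholder token; a real implementation would enumerate items and handle paging:

```python
import requests  # third-party HTTP client

GRAPH = "https://graph.microsoft.com/v1.0"
TOKEN = "<access-token-with-Files.Read.All>"  # placeholder: obtain via your auth flow
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def risky_sharing_links(drive_id, item_id):
    """Report sharing links on a drive item scoped anonymous or organization-wide."""
    url = f"{GRAPH}/drives/{drive_id}/items/{item_id}/permissions"
    resp = requests.get(url, headers=HEADERS)
    resp.raise_for_status()
    for perm in resp.json().get("value", []):
        link = perm.get("link")  # present only for sharing-link permissions
        if link and link.get("scope") in ("anonymous", "organization"):
            print(f"{item_id}: {link['scope']}-scoped {link.get('type')} link")

# risky_sharing_links("<drive-id>", "<item-id>")  # fill in real IDs to run
```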
The same thinking applies when we look at unstructured data types, which have their own specific needs. For example, 32% of the data consists of images and similar files, such as PDFs, JPEGs, and PNGs. That large proportion raises a challenge: do you need to classify every single file? Doing it quickly is computationally intensive and cost-prohibitive, especially for a file that only one person saved and has access to. Do we really need to classify it if no one else has access?
These are some of the considerations we build into this combination of data-centric and identity-first data security posture management. We use it to implement targeted classification: focusing on whether we actually need to classify those files, and then determining whether we need to remediate them.
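A minimal sketch of what that targeted-classification decision might look like; the one-reader heuristic and the record shape are hypothetical simplifications of the idea described above:

```python
def needs_classification(file_record):
    """Decide whether a file is worth classifying, based on who can reach it.
    Files only their owner can reach buy little risk reduction when classified,
    and classification at scale is computationally expensive."""
    return len(file_record["accessible_to"]) > 1

inventory = [
    {"name": "draft.png",   "accessible_to": {"alice"}},
    {"name": "payroll.pdf", "accessible_to": {"alice", "hr-group", "copilot"}},
]
print([f["name"] for f in inventory if needs_classification(f)])  # ['payroll.pdf']
```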
Hopefully, this has given you an insightful view into our State of Data and AI Security report. There's obviously more in the report itself, and I'll post the link to download it after this call. I hope this was a good use of your time. If you have any questions, please reach out to me on LinkedIn.
Thank you so much, and if there are any other questions, feel free to post them on the feed.