In my blog post, “Demystifying Machine Learning: Making Informed Security Decisions,” I discussed a framework for evaluating Machine Learning claims. Now let’s see how to apply it.
I’ve included below a blurb from the website or data sheet of a fictitious security company called Acme Security. While the company is fictitious, the content is derived from looking at similar material from various security companies (including my own):
Acme’s Cognitive Correlation Threat Engine identifies and blocks attacks as they are happening. Our patented machine learning algorithm models threat across thousands of dimensions, analyzing and learning from massive amounts of data across multiple data sources. Our enhanced solution uses the latest Deep Learning techniques to detect, block and predict attacks with virtually no false positives.
If you are used to reading security marketing content, this blurb may seem familiar...and you likely skimmed over it without really gathering any insight into the underlying solution. But, by applying the framework described earlier, you can start dissecting the specific claims of Acme Security.
- “models threat across thousands of dimensions” - The term “dimensions” is often used as a synonym for features. The claim here is that the Acme ML solution is based on a large number of features that best separate threat from legitimate traffic.
- “learning from massive amounts of data” - Acme is claiming that it has a large training set that it used to train its ML solution.
- “Deep Learning techniques” - Deep Learning is a specific (and recently in vogue) family of techniques that leverages advances in hardware to extend the long-standing approach of neural networks.
- “virtually no false positives” - A high-level reference to the accuracy of the Acme solution.
Once you’ve done this type of mapping, you’re in a much better position to evaluate claims and ask questions. The claims associated with the Machine Learning components often break down into the following areas:
1) Features
Claim: “We model threats using hundreds/thousands/millions of features/dimensions.”
While the number of features is relevant, what’s more important is how well they model what you’re trying to identify...or what you’re trying to differentiate between, for example, legitimate and malicious activity. In the space of Email Security, it’s common to use content-related features (e.g. URLs with IP addresses, disparity between the From: and link domains, presence of specific words like “Locked” and “Verify”) to identify phishing attacks. Unfortunately, these features don’t do a good job of modeling the recent spate of Business Email Compromise attacks that have triggered fraudulent wire transfers and the loss of W-2 information. These attack messages generally contain no URLs and none of the common phishing trigger words. The features used by traditional email solutions aren’t durable in the face of changing criminal behavior.
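To make this concrete, here’s a minimal sketch (in Python) of the traditional content-based features described above. The function, feature names, and trigger-word list are illustrative assumptions, not any vendor’s actual implementation - and note that a BEC message with no URLs and none of these words sails right through it.

```python
import re
from urllib.parse import urlparse

# Illustrative trigger words; real phishing lexicons are far larger.
TRIGGER_WORDS = {"locked", "verify", "suspended", "urgent"}

def content_features(from_domain, body):
    """Extract simple content-based phishing features from a message body."""
    urls = re.findall(r'https?://[^\s"<>]+', body)
    link_domains = {urlparse(u).hostname or "" for u in urls}
    words = set(re.findall(r"[a-z]+", body.lower()))
    return {
        # URL whose host is a raw IP address, e.g. http://203.0.113.5/login
        "has_ip_url": any(
            re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", d) for d in link_domains
        ),
        # Any link domain that doesn't match the From: domain
        "from_link_mismatch": any(
            d and not d.endswith(from_domain) for d in link_domains
        ),
        # Count of common phishing trigger words in the body
        "trigger_words": len(words & TRIGGER_WORDS),
    }

# content_features("mybank.com", "Your account is Locked. Verify at http://203.0.113.5/x")
# -> {'has_ip_url': True, 'from_link_mismatch': True, 'trigger_words': 2}
```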
With Agari Enterprise Protect, we’ve focused our feature selection on best modeling legitimate email traffic. We’ve found that time- and volume-based behavioral features can be used to identify which servers legitimately send messages for a given domain...and which ones are trying to spoof messages for that domain. These features lead to a more accurate and durable model, one that withstands changes in the social engineering or malware content used by criminals.
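Here’s a rough sketch of what time- and volume-based behavioral features could look like; the event schema and feature names are hypothetical, not Agari’s actual model.

```python
from collections import defaultdict
from datetime import datetime, timezone

def sender_behavior_features(events, domain):
    """Aggregate time- and volume-based behavior per sending server.

    events: iterable of (timestamp, from_domain, sending_server) tuples,
    e.g. parsed from mail logs; timestamps are assumed to be
    timezone-aware datetimes. All names here are illustrative.
    """
    first_seen, volume = {}, defaultdict(int)
    for ts, d, server in events:
        if d != domain:
            continue
        volume[server] += 1
        first_seen[server] = min(first_seen.get(server, ts), ts)
    total = sum(volume.values()) or 1
    now = datetime.now(timezone.utc)
    return {
        server: {
            # How long this server has been sending for the domain
            "days_seen": (now - first_seen[server]).days,
            # Share of the domain's total observed volume
            "volume_share": volume[server] / total,
        }
        for server in volume
    }
```

A server with a long sending history and a steady share of the domain’s volume looks legitimate; a brand-new server suddenly emitting messages for that domain stands out, regardless of what the message content says.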
Faced with a claim of a large number of features, the area to explore is how durable the model is in the face of new and evolving threats.
2) Training Set
Claim: “We have the largest footprint/most sources of data.”
Almost every vendor will talk about the size and uniqueness of their dataset. But having access to a lot of data doesn’t necessarily mean that you can use it to train a Machine Learning solution. The key element of a good training set is clean labeling - you need to accurately know which examples in your data correspond to the classes of good and bad. Furthermore, you need to have a reasonable ratio of both the good and the bad to accurately model the differences.
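As a minimal sketch of those two requirements - clean labels and a reasonable class ratio - here’s what a pre-training sanity check might look like with scikit-learn; the 100:1 threshold is an arbitrary illustration, not a standard.

```python
from collections import Counter

from sklearn.linear_model import LogisticRegression

def fit_with_label_check(X, y, max_ratio=100):
    """Sanity-check class balance before fitting a weighted baseline model.

    X is a feature matrix, y the (hopefully cleanly labeled) classes,
    e.g. 'good'/'bad'; all names here are illustrative.
    """
    counts = Counter(y)
    majority, minority = max(counts.values()), min(counts.values())
    if majority / minority > max_ratio:
        print(f"Warning: class ratio {majority}:{minority} -- "
              "the model may barely see the minority class.")
    # class_weight='balanced' reweights examples inversely to class frequency
    return LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```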
At Agari, we’ve been lucky to have worked in the area of Email Authentication and helped some of the largest senders of email on the planet definitively identify their legitimate email streams. The result is a highly accurate and massive training set of both good and bad traffic with high representation of legitimate and phishing messages.
Faced with the claim of access to a massive dataset, the questions to ask are how much of it is actually used for training and how accurate the labels are.
3) Algorithm
Claim: “We use [substitute the latest machine learning algorithm/approach].”
The algorithm a Machine Learning solution uses to train its model is critical to its ability to differentiate between the different classes of examples. But it turns out that the correct selection and tuning of an ML algorithm is highly dependent on the data in the training set and the features used. Some data and features “separate” using simple techniques; others require more complex ones.
Data Scientists and Marketers gravitate to the shiny object, so it’s not uncommon for solutions to use a sophisticated algorithm even when it isn’t necessary. The Agari Data Science team generally starts with the basic toolkit - algorithms like Logistic Regression and Support Vector Machines - before breaking out the big guns. It’s surprising how often we’ve found that the simpler approaches work as well as, or better than, the latest and greatest.
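A minimal sketch of that workflow, assuming scikit-learn and a synthetic stand-in dataset (so the exact numbers are meaningless - the comparison pattern is the point):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for a labeled security dataset.
X, y = make_classification(n_samples=5000, n_features=40, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
    "gradient_boosting": GradientBoostingClassifier(),  # one of the "big guns"
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean cross-validated accuracy {scores.mean():.3f}")
```

If the simple baselines match the big guns on your data, the extra complexity buys you nothing but a harder model to maintain and explain.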
The bottom line is that the ML algorithm used is often inside baseball and, in itself, isn’t an indicator of the quality of a solution. So if a vendor touts its underlying ML algorithm, place very little importance on that fact alone.
4) Accuracy
Claim: “We have no False Positives/False Negatives.”
Of all the elements of an ML-based product, understanding the accuracy of the solution is critically important to evaluating whether it makes sense for your environment. It’s also the area where many security vendors are the least clear about the performance of their products.
It turns out that it’s very easy to create a solution that has zero False Positives - all the solution has to do is declare that all traffic is safe. The problem with such a solution, of course, is that it lets through every attack as well. Conversely, a zero False Negative solution can be built by declaring that all traffic is malicious - you’ll catch all the bad stuff, but you’ll also block all of the good.
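To see how trivially those headline numbers can be achieved, both degenerate “solutions” fit in a few lines:

```python
def allow_everything(message):
    # Zero false positives: no legitimate message is ever flagged...
    return "safe"       # ...because nothing is ever flagged: 100% false negatives.

def block_everything(message):
    # Zero false negatives: every attack is caught...
    return "malicious"  # ...along with every legitimate message: 100% false positives.
```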
The point is that False Positive and False Negative rates are closely related, and an ML-based solution needs to be tuned to balance these, and many other, performance metrics. A common evaluation tool is the Receiver Operating Characteristic (ROC) curve - a way of performing a cost/benefit analysis on tuning a solution to trade off the False Positive and False Negative rates. It’s unlikely that a security vendor will show you an ROC curve for the solution in question, and you don’t have to be an expert at reading one. But a security vendor should be able to have an intelligent conversation about accuracy metrics.
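Here’s a minimal sketch, using scikit-learn and synthetic data, of the cost/benefit analysis an ROC curve supports: score a model, trace the curve, and pick the operating point that maximizes detection within a False Positive budget. The 1% budget is an arbitrary illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for labeled traffic (1 = malicious).
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]

# fpr = false positive rate; tpr = true positive rate (1 - false negative rate)
fpr, tpr, thresholds = roc_curve(y_te, scores)
print(f"Area under the ROC curve: {auc(fpr, tpr):.3f}")

# Cost/benefit: best detection rate within a 1% false positive budget
budget = 0.01
i = np.where(fpr <= budget)[0][-1]
print(f"At FPR {fpr[i]:.2%}, threshold {thresholds[i]:.2f} "
      f"catches {tpr[i]:.1%} of malicious traffic")
```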
So, when faced with a claim of Zero False Positives or False Negatives, the approach to take is to show the security vendor the door.
Machine Learning, at its core, is relatively simple - it’s an algorithmic approach that builds a model from a set of features and a training set of examples in order to make accurate, data-driven predictions or decisions on new examples. Unfortunately, many security vendors purposefully introduce complexity into the description of their Machine Learning-based solutions, either because they think doing so suggests a higher value for their product or because they are masking some underlying weakness. Hopefully the framework introduced in the previous blog post - the components of Features, Training Set, Algorithm and Accuracy - can help you better evaluate, question, and decide which Machine Learning-based security solutions to use.