12 Days of HaXmas: Applying Machine Learning to Security Problems

Published at 2016-01-04 17:49:12

This post is the eleventh in the series, "12 Days of HaXmas," by Suchin Gururangan, Bob Rudis, and the Rapid7 Data Science Team.

Anomaly detection (i.e. identifying "badness") and remediation is a hard and expensive process, fraught with false alarms and rabbit holes. The security community is keenly interested in developing and using data-driven tools to filter out noise and automatically detect malicious activity in large networks. While machine learning offers more flexibility than static, rule-based techniques, it is not a silver bullet. In this post, we will cover obstacles in applying machine learning to security and some ways to avoid them.

It's All About the Data

One core concept in machine learning is that the utility of the algorithms being used is only as strong as the datasets being used. What does this mean when applying machine learning techniques to cybersecurity? This is a bit of an oversimplification, but we generally do one of two things with machine learning: group a bunch of things together into unlabeled groups (unsupervised learning), or identify new things as being part of already known/labeled groups (classification). Both actions are based on the features associated with each data element.

In security, we really want to be able to identify (or classify) a "thing" as good or bad. To do that, the first thing we need is labeled data. At its core, this classification process is two-fold: first, we train a model on known data and then test it on unknown samples (a minimal code sketch of this workflow appears below). In particular, adaptable models require a continuous flow of labeled data to train with. Unfortunately, the creation of such labeled data is the most expensive and time-consuming piece of the data science process. The data we have is usually messy, incomplete, and inconsistent. While there are many tools to experiment with different algorithms and their parameters, there are few tools to help one develop clean, comprehensive datasets. Oftentimes this means asking practitioners with deep domain expertise to help label existing data elements, which is a very expensive process. You can also try to purchase "good" data, but this can be hard to come by in the context of security (and may go stale very quickly). You can also try to use a combination of unsupervised and supervised learning called, unsurprisingly, semi-supervised learning [https://en.wikipedia.org/wiki/Semi-supervised_learning].

"The creation of labeled data is the most expensive and time-consuming piece of the data science process."

Regardless of your approach, it's likely you'll spend a good deal of time, effort, and/or money in your quest for labeled data.

The Need for Unbiased Data

Bias in training data can hamper a model's ability to discern between output classes. In the security context, data bias can be interpreted in two ways. First, attack methodologies are becoming more dynamic than ever before. If a predictive model is trained on known patterns and vulnerabilities (i.e. using features from malware that is file-system resident), it may not necessarily detect an unprecedented attack that does not conform to those trends (i.e. it misses features from malware that is only memory resident).

Bias can sneak up on you, as well. You may assume you can use the Alexa listings to, say, obtain a list of benign domains, but that assumption may turn out to be a bad idea since there is no guarantee that those sites are clean. Getting good ground truth in security is hard.
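To make the train-then-test workflow described above concrete, here is a minimal sketch using scikit-learn on a synthetic, already-labeled dataset standing in for, say, URL features labeled benign or malicious. The library, dataset, and model choice are illustrative assumptions rather than anything prescribed by the original post.

```python
# Minimal sketch of the train-then-test classification workflow (illustrative only).
# The synthetic features stand in for, e.g., URL features labeled benign (0) or malicious (1).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# A synthetic "labeled" dataset: 5,000 samples, 20 features, ~10% malicious.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

# First, train a model on known (labeled) data...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# ...then test it on samples it has not seen before.
print(classification_report(y_test, model.predict(X_test)))
```

Note that every row in this toy dataset arrives pre-labeled; in practice, producing those labels is exactly the expensive part discussed above.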
Data bias also comes in the form of class representation. To understand class representation bias, one can look to a core foundation of statistics: Bayes' theorem. Bayes' theorem describes the probability of event A given event B:

P(A|B) = P(B|A) P(A) / P(B)

Expanding the probability P(B) for the set of two mutually exclusive outcomes, we arrive at the following equation:

P(B) = P(B|A) P(A) + P(B|¬A) P(¬A)

Combining the above equations, we arrive at the following alternative statement of Bayes' theorem:

P(A|B) = P(B|A) P(A) / (P(B|A) P(A) + P(B|¬A) P(¬A))

What does this have to do with security? Let's apply this theorem to a concrete problem to show the emergent issues of training predictive models on biased data. Suppose company X has 1,000 employees, and a security vendor has deployed an intrusion detection system (IDS) alerting company X when it detects a malicious URL sent to an employee's inbox. Suppose there are 10 malicious URLs sent to employees of company X per day. Finally, suppose the IDS analyzes 10,000 incoming URLs to company X per day. We'll use:

I to denote an incident (i.e. an incoming malicious URL),
¬I to denote a non-incident (i.e. an incoming benign URL),
A to denote an alarm (i.e. the IDS classifies an incoming URL as malicious), and
¬A to denote a non-alarm (the IDS classifies a URL as benign).

That means P(A|I) is the IDS hit (detection) rate and P(A|¬I) is its false alarm rate.

What's the probability that an alarm is associated with a real incident? Or, how much can we trust the IDS under these conditions? Using Bayes' theorem from above, we know:

P(I|A) = P(A|I) P(I) / P(A)

We don't have to use the shorthand version, though:

P(I|A) = P(A|I) P(I) / (P(A|I) P(I) + P(A|¬I) P(¬I))

Now let's calculate the probability of an incident occurring (and not occurring), P(incident) and P(non-incident), given the parameters of the IDS problem we defined above:

P(I) = 10 / 10,000 = 0.001
P(¬I) = 1 - P(I) = 0.999

These probabilities emphasize the bias present in the distribution of analyzed URLs. The IDS has little sense of what makes up an incident, as it is trained on very few examples of it. Plugging the probabilities into the equation above, we find that:

P(I|A) = P(A|I) × 0.001 / (P(A|I) × 0.001 + P(A|¬I) × 0.999)

To have fair confidence in an IDS under these biased conditions, we must have not only an unrealistically high hit rate, but also an unrealistically low false positive rate. That is, for an IDS to be 80 percent accurate, even with a best-case scenario of a 100 percent hit rate, the IDS's false alarm rate must be 4 × 10^-4. In other words, only 4 out of 10,000 alarms can be false positives to achieve this accuracy.

Visualizing Accuracy

One way to actually "see" this is with a chart designed to visually depict the accuracy of our classifier, called a receiver operating characteristic (ROC) curve:

[Figure: ROC curve, from "Proper Use of ROC Curves in Intrusion/Anomaly Detection"]

As we train, test, and use a model, we want the ratio of true positives to false positives to be better than chance and also accurate enough to make it worthwhile to use (in whatever context that happens to be).
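The base-rate arithmetic above is easy to sanity-check in code. The short sketch below simply evaluates the expanded form of Bayes' theorem; the 1 percent false alarm rate plugged in at the end is an assumed value for illustration, not a figure from the post.

```python
# Sketch: P(incident | alarm) from the expanded form of Bayes' theorem.
def p_incident_given_alarm(p_incident, hit_rate, false_alarm_rate):
    """P(I|A) = P(A|I)P(I) / (P(A|I)P(I) + P(A|~I)P(~I))."""
    p_non_incident = 1.0 - p_incident
    numerator = hit_rate * p_incident
    return numerator / (numerator + false_alarm_rate * p_non_incident)

p_incident = 10 / 10_000  # 10 malicious URLs out of 10,000 analyzed per day

# Even with a perfect hit rate, an assumed 1% false alarm rate leaves
# P(I|A) of roughly 0.09, i.e. about 9 out of 10 alarms would be false.
print(p_incident_given_alarm(p_incident, hit_rate=1.00, false_alarm_rate=0.01))
```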
In the real world, detection hit rates are much lower and false alarm rates are much higher. Thus, class representation bias in the security context can make machine learning algorithms inaccurate and untrustworthy. When models are trained on only a few examples of one class but many examples of another, the bar for fair accuracy is extremely high, and in some cases unachievable. Predictive algorithms run the risk of being "the boy who cried wolf" – annoying and prone to desensitizing security professionals to incident alerts[2]. The last thing you want to do is create a fancy new system that only exacerbates the problem that was identified at the core of the Target/Home Depot breaches.

"When models are trained on only a few examples of one class but many examples of another, the bar for fair accuracy is extremely high, and in some cases unachievable."

Avoiding the Pitfalls

Security data scientists can avoid these obstacles with a few measures:

Train models with large and balanced data that are representative of all output classes. Take balanced subsamples of your data if necessary, and use available techniques to get an understanding of the efficacy of your data sets (a sketch of one way to handle class imbalance appears after this list).
Focus on getting a plethora of labeled data. Amazon's Mechanical Turk, used by many researchers outside of security, is one useful tool for this. Look at open-sourced data, and encourage data-gathering expeditions.
Encourage security expertise on the team. Domain expertise is crucial to the performance of machine learning algorithms applied in the security space. To keep up with the changing threat landscape, one must have security experience.
Incorporate unsupervised methods into the solution of the data science problem. Focus on organization, presentation, visualization, and filtering of data - not just prediction. Check out this handy tutorial on self-taught learning by Stanford.
Weigh the tradeoff of accuracy (i.e. getting all the "guesses" right) vs. coverage. You can think of this in terms of a Bloom filter: in the case of search, it's more important that all the matching elements are returned, even if that means some incorrect elements are also returned. Depending on the application of your classification algorithm, you may be able to make similar tradeoffs (see the sketch following the conclusion below).

Machine learning has the potential to revolutionize how we detect and respond to malicious activity in our networks. It can weed out signal from noise to help incident responders focus on what's truly important and help administrators discover patterns in network activity never seen before. However, when delving into applying these algorithms to security, we must be aware of the caveats of the approach so that we may overcome them.
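As a rough illustration of two of the measures above (balanced training data and the accuracy-versus-coverage tradeoff), the sketch below re-weights a heavily imbalanced synthetic dataset during training and then moves the decision threshold to trade precision against recall. The data, model, and threshold values are assumptions for demonstration purposes only.

```python
# Sketch: handling class imbalance and trading precision (accuracy of "guesses")
# against coverage (recall) by adjusting the decision threshold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with only ~1% "malicious" examples to mimic class imbalance.
X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" re-weights samples so the rare class is not ignored,
# one stand-in for training on balanced (sub)samples.
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

scores = model.predict_proba(X_te)[:, 1]
for threshold in (0.5, 0.9):
    preds = (scores >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_te, preds, zero_division=0):.2f}, "
          f"recall={recall_score(y_te, preds):.2f}")
# A lower threshold favors coverage (recall); a higher one favors precision.
```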

Source: rapid7.com
