Before you jump right into data-driven security analytics, it is important to ensure you
at least have a basic familiarity with the two most prominent languages featured in nearly all
of the scenarios: Python and R. Although there are an abundance of tools available
for data analysis, we feel these two provide virtually all the features necessary
to help you go from data to discovery with the least amount of impedance.
For people with an existing programming background, getting up to speed
with Python should be pretty straightforward, and you can expect
to be fairly proficient within three to six months.
Your code may not be “pythonic” (that is, utilizing the features, capabilities,
and syntax of the language in the most effective way), but you will be able to “get useful stuff done”.
Python’s flexibility and extensibility have been especially appealing to the scientific, academic, and industrial communities, resulting in broad and rapid adoption in those fields since the early 2000s.
For those who are new to statistical languages, becoming proficient in R may pose more of a challenge.
R was created by statisticians. If you can commit to working through R’s syntax and package nuances, you should be able to make good progress in three to six months as well. R makes it remarkably simple to run extensive statistical analyses on your data
and then generate informative and appealing visualizations with just a few lines of code.
Both Python and R provide an interactive execution shell or IDE environment that has enough basic functionality for general needs.
Why do we need both?
There are times when the flexibility of a general-purpose programming language comes in very handy, which is when you use Python.
There are other times when three lines of R code will do something that may take 30 or more lines of Python code to accomplish.
Your ultimate goal is to provide insightful, accurate, and visually appealing analyses
as quickly as possible.
Knowing which tool to use for which job is a critical skill you must develop to be as effective and efficient as possible.
By having both tools in your toolbox, you should be able to tackle most, if not all, of the tasks that come your way.
If you do find yourself in a situation where you need functionality you don’t have, both R and Python have vibrant communities that are eager to provide assistance and even help
in development of new functions or modules to fit emerging needs.
Before going off and playing with data, we need to develop a solid research question first.
In this section, we will work on an example problem given to you by the manager of the Security Operations Center (SOC).
It seems the SOC analysts are becoming inundated with “trivial” alerts ever since a new data set
of indicators was introduced into the SIEM (Security Information and Event Management system).
They have asked for your help in reducing the number
of “trivial” alerts without sacrificing visibility.
This is a good problem to tackle through data analysis, and we should be able to form a solid,
practical question to ask after we perform some exploratory data analysis
and hopefully arrive at an answer.
Let’s work through a security data analytics example using AlienVault’s IP Reputation Database,
a freely available data set that catalogs various types of “badness”
across the internet, keyed by IP address.
First, apply your security domain knowledge to understand the data set.
From the field descriptions (see the link here), note the following:

- Data fields such as Reliability, Risk, and x are integers.
- IP, Type, Country, Locale, and Coords are character strings.
- The IP address is stored in dotted-quad notation, not as a hostname or in decimal format.
- Each record is associated with a unique IP address, so there are 258,626 IP addresses.
- Each IP address has been geo-located into a latitude and longitude pair in the Coords field, but the pair is stored in a single field separated by a comma. You will have to parse and ETL it further for analysis.

Now that you have a general idea of the variables and how they look, it is time to bring
your security domain expertise into the mix to explore and discover what is interesting about the data.
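The Coords parsing step mentioned above can be sketched in Python with pandas. The rows below are made-up illustrative values shaped like the fields described, not entries from the real database:

```python
import pandas as pd

# A couple of illustrative rows shaped like the fields described above
# (the values are invented, not taken from the real reputation database).
av = pd.DataFrame({
    "IP": ["203.0.113.10", "198.51.100.7"],
    "Reliability": [4, 2],
    "Risk": [2, 2],
    "Type": ["Malicious Host", "Malware Domain"],
    "Country": ["CN", "US"],
    "Coords": ["39.9289,116.3883", "38.0,-97.0"],
})

# Split the comma-separated Coords field into numeric Latitude/Longitude columns
av[["Latitude", "Longitude"]] = (
    av["Coords"].str.split(",", expand=True).astype(float)
)
print(av[["IP", "Latitude", "Longitude"]])
```

Once the coordinates are numeric columns, they are ready for mapping or any other downstream analysis.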
A good first exploratory step is to look at the basic “descriptive statistics,”
or five-number summary, of the variables in a data set:

- Minimum and maximum values; taking the difference of these gives you the range (max − min)
- Median (the value at the middle of the data set)
- First and third quartiles (the 25th and 75th percentiles; you can think of these as the medians of the first and last halves of the data, respectively)
- Mean (the sum of all values divided by the number of values)
As you look at these results, note that the Reliability column spreads
across the documented potential range of [1…10].
But the Risk column – which AlienVault says has a documented potential range
of [1…10] – only has a spread of [1…7].
You can also see that both Reliability and Risk appear to center on a value of 2.
Both Python and R have built-in functions
to calculate these descriptive statistics: summary() in R and, for pandas data frames in Python, describe().
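In Python with pandas, for example, the describe() method produces these statistics in one call. The values below are hypothetical stand-ins for the real Risk column:

```python
import pandas as pd

# Hypothetical Risk values standing in for the real column
risk = pd.Series([2, 2, 2, 3, 4, 2, 7, 1, 2, 3])

# describe() returns count, mean, std, min, 25%, 50% (median), 75%, and max
stats = risk.describe()
print(stats)
print("range:", stats["max"] - stats["min"])
```

The 25%, 50%, and 75% entries are the quartiles from the five-number summary above.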
The data distribution graph has the potential to provide a whole new perspective,
oftentimes giving insights that numbers alone cannot reveal.
We start with a simple bar chart to get a very quick visual overview
of the Country, Risk and Reliability factors.
First, consider the distribution summary by Country.
It shows that China and the US together account for almost 50% of the malicious nodes in the list.
Second, consider the distribution summary by Risk.
Looking at the Risk distribution graph, you can see that the risk level of most
of the nodes is negligible (that is, so low that they can be disregarded).
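A minimal sketch of this kind of country-level tally in Python with pandas follows; the country codes are invented to mimic that concentration, not real records:

```python
import pandas as pd

# Hypothetical country codes mimicking the China/US concentration noted above
countries = pd.Series(["CN", "CN", "CN", "US", "US", "RU", "DE", "TR", "FR", "NL"])

counts = countries.value_counts()                    # per-country node counts
top2 = (counts / counts.sum()).loc[["CN", "US"]].sum()
print(counts)
print(f"CN + US share: {top2:.0%}")
```

A quick bar chart of the same tally is one line more: `counts.plot.bar()`.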
There are other elements that stand out in this data, though,
foremost being that practically no endpoints fall in categories 1, 5, 6, or 7, and none in the rest
of the defined possible range [8,10].
This might be a sign that it is worth digging a bit deeper,
as the anomaly is significant evidence of bias in the data set.
Third, consider the distribution summary by Reliability.
From its distribution graph, the Reliability rating of the nodes appears to be a
bit skewed (that is, the distribution is extended to one side
of the mean or central tendency).
The values are mostly clustered at levels 2 and 4, with few ratings above level 4.
The fact that it completely skips a reliability rating
of 3 should raise some questions in your mind.
It could indicate a flaw in the system that assigns the rating,
or it could be that you are looking at (at least) two distinct data sets combined.
Either way, that large quantity of 2s and 4s and low quantity of 3s might be a sign
that you should investigate further.
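One quick way to surface such gaps numerically is a simple frequency count; the ratings below are hypothetical values exhibiting the same gap at 3:

```python
import pandas as pd

# Hypothetical Reliability ratings with the gap at 3 described above
rel = pd.Series([2, 2, 4, 2, 4, 2, 2, 4, 2, 6])

counts = rel.value_counts().sort_index()   # frequency of each rating observed
missing = [r for r in range(1, 11) if r not in counts.index]
print(counts)
print("ratings never used:", missing)
```

Any documented rating level that appears in `missing` is a candidate for the kind of follow-up questioning described above.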
Consider both the problem and the primary use case for the AlienVault reputation data:
importing it into a SIEM or Intrusion Detection System/Intrusion Prevention System (IDS/IPS)
to alert incident response team members or to log/block malicious activity.
How can this quick overview of the reputation data influence the configuration of the SIEM
in this setting to ensure that the fewest “trivial” alerts are generated?
So the practical question is: “Which nodes
from the Reputation database represent a potentially real threat?”
Some form of triage and prioritization must occur, and it is a far better approach
to base that triage and prioritization on statistical analysis of data and evidence rather
than on a “gut call” or solely on “expert opinion.” It is possible to see
which nodes should get your attention by comparing the Risk and Reliability factors.
To do this, you use a contingency table, which is a tabular view
of the multivariate frequency distribution of specific variables.
In other words, a contingency table helps show relationships between two variables.
After building a contingency table, you can take both a numeric and graphical look at the results
to see where AlienVault nodes “cluster”.
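A contingency table like this can be built in Python with pandas’ crosstab; the Risk/Reliability pairs below are invented to mirror the clustering at [2,2]:

```python
import pandas as pd

# Hypothetical Risk/Reliability pairs clustered around (2, 2) as noted above
df = pd.DataFrame({
    "Risk":        [2, 2, 2, 2, 3, 2, 4, 2, 3, 2],
    "Reliability": [2, 2, 4, 2, 2, 2, 6, 4, 2, 2],
})

# Contingency table: rows are Risk levels, columns are Reliability levels,
# cells are how many nodes fall at each (Risk, Reliability) combination
ct = pd.crosstab(df["Risk"], df["Reliability"])
print(ct)
```

The largest cell in the table is the cluster you would then inspect graphically.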
It is very apparent that the values in this data set are concentrated around [2,2],
where Risk = 2 and Reliability = 2. Now turn your attention to the Type variable (such
as Malicious Host, Malware Domain, Malware IP, etc.)
to see if you can establish a relationship with the Risk and Reliability ratings.
In the figure, you can see the Malware Domain type has risk ratings limited to 2s and 3s,
and the reliability is focused around 2.
You can also start to see the patterns in the other categories as well.
Now you can remove the “Malware Domain” type from the next iteration
of analysis due to its low risk scores.
It also looks like Malware Distribution does not contribute any risk,
so you can filter that type out as well.
After filtering out the two types “Malware Domain” and “Malware Distribution,”
you’ve reduced the list to less than 6% of the original and have homed in fairly well
on the nodes you really should care about.
If you want to further reduce the scope,
you could filter by various combinations of Reliability and/or Risk.
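The type and rating filters described above can be sketched with pandas boolean indexing; the records and thresholds here are hypothetical, chosen only to illustrate the mechanics:

```python
import pandas as pd

# Hypothetical records; the "Malware Domain" / "Malware Distribution" rows
# are the types being filtered out, per the analysis above
av = pd.DataFrame({
    "Type": ["Malicious Host", "Malware Domain", "Malware IP",
             "Malware Distribution", "Malicious Host", "Malware Domain"],
    "Risk":        [4, 2, 5, 2, 6, 3],
    "Reliability": [5, 2, 6, 2, 7, 2],
})

# Drop the low-risk types identified during exploration
drop = {"Malware Domain", "Malware Distribution"}
focused = av[~av["Type"].isin(drop)]

# Optionally tighten further on combinations of Risk and/or Reliability
high = focused[(focused["Risk"] >= 4) & (focused["Reliability"] >= 6)]
print(len(av), "->", len(focused), "->", len(high))
```

Each additional filter shrinks the watch list further, which is exactly the triage the SOC manager asked for.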
This rather simple exploratory analysis through parsing and slicing doesn’t show
which variables are most important.
It simply helps you understand the relationships and the frequency with which they occur.
source: Tong Sun,
Adjunct Professor of Computing Security
Rochester Institute of Technology