Common Descriptive Analytics for Security Data 4: Correlation and Regression Analysis

Understanding the explicit relationship between attributes helps analysts uncover hidden patterns in data.
Correlation analysis is a method of statistical evaluation used to study the strength of a relationship between two attributes in a data set.
A positive correlation is a relationship between two or more attributes whereby their values increase or decrease together.
Conversely, a negative correlation is an inverse relationship,
whereby when one attribute increases, the other decreases, and vice versa.
If there is no consistent linear pattern in the change between attributes,
they are said to be uncorrelated.
Hence, correlation analysis helps analysts quickly spot two factors that may be related and uncover hidden patterns in data.
Please note that correlation is a descriptive statistical measure versus an inferential one, meaning that you can only describe the data set you are studying and cannot use the outcome to generalize a statement to a larger group or make predictions based on the outcome.
It is also important to remember that correlation is just showing some existence of a relationship between attributes, with no implication of “causation”.
For example, imagine that a hypothetical analyst looked at the relationship between security incidents and the number of security operations staff, and reported:
“There is a strong positive correlation, so hiring more SOC analysts causes security incidents”.
In reality, the two data series merely trend together, with nothing else implied.
Perhaps organizations with more incidents hire more SOC analysts, or after hiring more analysts, organizations discover more incidents.
Perhaps the two are both a product of something else entirely:
larger organizations are targeted more often, and so have both more incidents and more analysts.
Therefore, when you calculate relationships like correlation,
you have to be careful to keep it in context.
Two statistical concepts underpin correlation analysis: covariance and the correlation coefficient.
Covariance – refers to the tendency of one set of values to move with another set.
Correlation coefficient – normalizes the covariance to a scale ranging
from -1 (the sets move in perfectly opposite directions) to +1 (the sets move perfectly together),
where 0 indicates that the two data sets have no apparent relationship.
We won’t go into the math behind these statistical concepts in this course.
Again, note that a highly positive correlation between two data sets does not imply a cause-and-effect relationship.
It simply means that the two sets move together.
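To make the two concepts concrete, here is a minimal sketch in Python that computes a covariance and normalizes it into a Pearson correlation coefficient. The monthly alert and incident counts are hypothetical placeholders:

```python
import numpy as np

alerts = np.array([120, 95, 140, 180, 160, 130], dtype=float)   # hypothetical monthly alert counts
incidents = np.array([8, 6, 9, 14, 12, 9], dtype=float)         # hypothetical confirmed incidents

# Covariance: the tendency of the two series to move together.
cov = np.cov(alerts, incidents)[0, 1]

# Pearson correlation: the covariance normalized to the [-1, +1] scale.
r = cov / (alerts.std(ddof=1) * incidents.std(ddof=1))

print(f"covariance  = {cov:.2f}")
print(f"correlation = {r:.2f}")   # matches np.corrcoef(alerts, incidents)[0, 1]
```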
Like most elements of statistics or any complex discipline,
there are many methods available to perform various tasks.
This is also true when calculating correlation between two attributes.
The Pearson correlation method is widely used given that it works with data on an interval or ratio scale, without requiring that both attributes be measured on the same scale.
If you have ordinal or ranked data, you should use one of two other methods –
Spearman’s or Kendall’s rank correlation – instead.
We won’t delve into these algorithms in detail, but you should have a solid understanding
of the uses and limits of each before applying correlation in your own analysis.
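As a quick illustration of how the three methods are invoked, here is a hedged sketch using scipy.stats on hypothetical vulnerability data:

```python
from scipy import stats

# Hypothetical data: vulnerability severity vs. days taken to patch.
cvss_scores   = [2.1, 4.3, 5.0, 6.8, 7.5, 9.0, 9.8]
days_to_patch = [45, 40, 30, 21, 14, 7, 3]

r_pearson, _  = stats.pearsonr(cvss_scores, days_to_patch)    # interval/ratio data
r_spearman, _ = stats.spearmanr(cvss_scores, days_to_patch)   # rank-based
r_kendall, _  = stats.kendalltau(cvss_scores, days_to_patch)  # rank-based

print(f"Pearson:  {r_pearson:.2f}")
print(f"Spearman: {r_spearman:.2f}")
print(f"Kendall:  {r_kendall:.2f}")
```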

Consider an example of seven correlation scatter plots, ranging from perfect positive correlation (coefficient = 1),
through low positive correlation (0.5) and no correlation (0),
down to low, high, and finally perfect negative correlation (-0.5, -0.9, and -1, respectively).
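You can reproduce plots like these yourself by mixing a shared signal with independent noise. The sketch below (the specific coefficients are illustrative) generates data whose expected Pearson correlation equals each chosen value:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.standard_normal(300)

fig, axes = plt.subplots(1, 6, figsize=(18, 3))
for ax, rho in zip(axes, [1.0, 0.5, 0.0, -0.5, -0.9, -1.0]):
    # Mixing rho parts of x with sqrt(1 - rho^2) parts of independent noise
    # yields an expected Pearson correlation of exactly rho.
    y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(300)
    ax.scatter(x, y, s=8)
    ax.set_title(f"r = {rho}")
plt.show()
```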
Knowing how well two data sets correlate is a fine way
of understanding the relationship between two attributes.
Security attributes, however, do not just come in pairs.
Most analysts want to explore relationships between more
than one pair of attributes at a time.
For this, we need a correlation matrix.
Correlation matrices provide a structured and compact way
to analyze many pairs of attributes at once.
A correlation matrix is a simple grid.
For every attribute, the grid contains a row of cells holding that attribute’s
correlation with every other attribute.
Here we use an example data set that a security consultancy gathered
to track each client engagement’s defect statistics.
The data set has the following attributes: assessment contract value, assessment duration,
assessment effort, number of defects found, mean risk score, mean impact score, BAR score,
number of full-time equivalents (FTEs) performing the assessment,
and defects found per FTE.
We would like to understand whether any of these attributes relate with each other in any way,
to uncover patterns and detect potential sources of bias.
To do this, we can construct a correlation matrix with a cell entry for each attribute arrayed vertically down the left side of the grid; the same entries are repeated in a horizontal array across the bottom.
See this sample correlation matrix.
Note that the darker shaded text highlights the strongly correlated attributes;
the lightly shaded text, strong negative correlations.
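In practice you rarely build such a grid by hand. Here is a minimal sketch using pandas, with a few hypothetical engagement records standing in for the consultancy’s real data:

```python
import pandas as pd

# Hypothetical engagement records (placeholders, not the actual data set).
df = pd.DataFrame({
    "contract_value": [50_000, 75_000, 120_000, 90_000, 60_000, 105_000],
    "duration_days":  [10, 15, 25, 20, 12, 22],
    "effort_hours":   [160, 240, 400, 320, 200, 350],
    "findings":       [12, 30, 18, 25, 9, 21],
    "num_ftes":       [2, 2, 3, 2, 1, 3],
})

# One call computes the Pearson correlation for every pair of attributes.
corr_matrix = df.corr(method="pearson")
print(corr_matrix.round(2))

# In a notebook, a shaded view makes the strong cells easy to spot:
# corr_matrix.style.background_gradient(cmap="coolwarm")
```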
Notice how certain attributes correlate with others.
Duration, effort, and contract value strongly correlate, as one would expect.
The number of findings strongly correlates with the BAR score.
However, there are two “surprising” observations. First, there is a very small correlation (0.11)
between the number of findings and engagement duration.
This might suggest that the client applications exhibit a wide range of defects and levels of code quality.
An alternate explanation exists: the lack of correlation could simply mean that longer testing yields no extra benefit in terms of the number of defects found.
Second, there appears to be no correlation (0.09) between the number of consultants on an engagement and the number of defects found.
This implies that there may not be extra benefit to a large assessment team;
a small team with a core of highly skilled consultants should be sufficient.
If two variables are correlated, you can create an equation that will predict one
of the variables if the other is known.
This equation is known as a regression equation.
Regression analysis assumes that there is a relationship between two variables
and the relationship can be represented as a straight line (linear regression)
or a curve (non-linear regression).
Regression is a method of finding an equation describing the best-fitting line or curve for a set of data.
How do we define the “best-fitting” line or curve when there are many possible lines or curves?
The answer is the one that minimizes prediction error against the actual data
(e.g., the Root Mean Square Error, or RMSE).
Note that regression is not descriptive analytics but inferential analysis
(though still not direct causal inference): it estimates how different
observable inputs contribute to an observable output.
In the previous example, you might want to estimate how the “number of findings” and
“number of FTEs” contribute to the engagement “duration”.
With regression analysis, not only can you estimate the significance of each attribute,
you can also estimate how strong that contribution is, as the sketch below illustrates.
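Here is one way such an estimate might look, sketched with statsmodels on hypothetical engagement records (the values are placeholders, not the consultancy’s actual data):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical engagement records.
findings = np.array([12, 30, 18, 25, 9, 22, 15, 28])
ftes     = np.array([2, 3, 2, 3, 1, 2, 2, 3])
duration = np.array([10, 22, 14, 20, 7, 17, 12, 21])   # days

# Ordinary least squares: duration modeled as a function of the two inputs.
X = sm.add_constant(np.column_stack([findings, ftes]))  # prepend intercept column
model = sm.OLS(duration, X).fit()

print(model.params)    # intercept, then one coefficient per attribute (contribution strength)
print(model.pvalues)   # significance of each attribute's contribution
```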
The output of regression analysis is a formula or function f: given specific inputs X,
you can estimate or predict what the output Y will be.
A classic example of this is the relationship between height and weight.
It is relatively intuitive that taller people weigh more, but if you add other variables
such as sex, age, and so on, you can estimate an expected value
and establish an expected range for a person’s weight.
Regression analysis is a powerful tool for estimating and comparing observed outputs.
What are the similarities and differences between correlation and regression?
Similarities: the standard linear regression coefficient can be computed
using the same underlying quantities as Pearson’s correlation,
although the two coefficients’ meanings are different.
Both can describe linear or non-linear relationships:
while correlation typically refers to linear relationships, it can capture other forms
of dependence, such as polynomial or truly nonlinear relationships.
Neither simple linear regression nor correlation answers questions of causality directly.
The differences between correlation and regression:
The regression equation Y = f(X) can be used to make predictions of Y based on the value
of X, whereas correlation quantifies the degree to which two attributes are related
but does not fit a line through the data.
With correlation, it doesn’t matter which of the two variables you call
“X” and which you call “Y”.
You’ll get the same correlation coefficient if you swap the two.
With linear regression, the decision of which variable you call “X” and which you call
“Y” matters a lot, as you’ll get a different best-fit line if you swap the two.
The line that best predicts Y from X is not the same as the line that predicts X
from Y (unless you have perfect data with no scatter).
The correlation coefficient indicates the extent to which two variables move together,
while the regression coefficient indicates the impact of a unit change in the known variable
X on the estimated variable Y.
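A quick numeric check makes this asymmetry concrete. In the toy sketch below, the correlation coefficient is identical in both directions, but the Y-on-X and X-on-Y regression slopes disagree:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.3])

# Correlation is symmetric: swapping the variables changes nothing.
print(np.corrcoef(x, y)[0, 1], np.corrcoef(y, x)[0, 1])   # same value

# Regression is not: the Y-on-X slope and the inverted X-on-Y slope differ.
slope_y_on_x = np.polyfit(x, y, 1)[0]       # line predicting Y from X
slope_x_on_y = np.polyfit(y, x, 1)[0]       # line predicting X from Y

print(slope_y_on_x, 1 / slope_x_on_y)       # equal only when |r| = 1 (no scatter)
```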

Reference: Tong Sun,
Adjunct Professor of Computing Security
Rochester Institute of Technology

Author: McPeters Joseph

Joseph McPeters is a Security Researcher. He specializes in network and web application penetration testing. Contact: admin@incidentsecurity.com
