Big data and security intelligence are two popular topics in cyber security.
We are collecting more and more data from our infrastructure, and increasingly
from our applications (including web applications, mobile apps,
and traditional enterprise applications) and connected devices.
This vast amount of data, from diverse
sources and in various modalities, is increasingly hard to understand.
Terms like MapReduce, Hadoop, Spark, Elasticsearch, data science,
machine learning, etc. are part of many discussions.
Now let’s dissect the security data analytics pipeline.
Since we are entering the age of data in cyber security, the challenge is no longer
where to get data from, but what to do with it.
The first part is about data collection and Extraction, Transformation and Loading (ETL).
Data ETL is used to migrate a collection of data from flat files or databases
to another database (structured or semi-structured).
It involves data fusion, data cleaning, and format transformation.
Extract is the process of reading or retrieving data from a data source.
For instance, many security data sets share common fields, such as IP addresses,
geographical longitude and latitude coordinates, clickstream URLs, etc.
These fields need to be extracted, represented, and handled properly. Transform
is the process of cleaning and converting the extracted data from its previous form,
and combining the data from different sources into the form it needs to be in so that
it can be placed into another database.
Transformation occurs by using rules, lookup tables, or explicit programming.
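As a minimal sketch of the Extract and Transform steps described above, here is hypothetical Python that pulls an IP address out of a raw log line and normalizes an action code through a lookup table. The log format, field names, and lookup values are invented for illustration only:

```python
import re

# Hypothetical raw log lines; the format and field names are illustrative only.
raw_logs = [
    "2024-01-15 10:02:11 src=192.168.1.10 action=ALLOW",
    "2024-01-15 10:02:12 src=10.0.0.7 action=deny",
]

# Lookup table used during the Transform step to normalize vendor action codes.
ACTION_LOOKUP = {"allow": "permitted", "deny": "blocked"}

# Extract step: a rule (regex) that pulls the source IP field out of each line.
IP_PATTERN = re.compile(r"src=(\d{1,3}(?:\.\d{1,3}){3})")

def transform(line):
    """Extract the source IP and normalize the action via the lookup table."""
    ip_match = IP_PATTERN.search(line)
    action = line.rsplit("action=", 1)[-1].lower()
    return {
        "src_ip": ip_match.group(1) if ip_match else None,
        "action": ACTION_LOOKUP.get(action, "unknown"),
    }

records = [transform(line) for line in raw_logs]
```

The rule-based regex and the lookup table correspond directly to the "rules or lookup tables" mentioned above; explicit programming would replace them with custom parsing logic.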
Security data analytics often involves fusing data from different sources
to understand the association among different factors.
Let’s use IP address as an example.
Some information security practitioners may think of IP addresses as simply the strings
used with ping, nmap, or other commands. But to perform security-oriented analyses
of your system and network data, you must understand as much as you can
about security domain data elements.
IP addresses are perhaps the most fundamental of these elements.
Certain transformation steps are needed to process IP addresses, such as:
– Converting IP addresses to/from a 32-bit integer value.
This is important because the integer representation saves both space and time,
and some calculations are a bit easier with it than with the dotted-decimal form.
– Segmenting and grouping IP addresses. Internally, you might separate hosts
by functionality or sensitivity based
on the logical grouping mechanism your organization uses. This
step can help you analyze malicious activity in different segments of your
network or different groupings of your computing devices.
– Locating IP addresses, that is, mapping IP addresses
to individual devices that have unique MAC addresses.
On a broader scale, there are also ways to tie an IP address that lives on the
Internet to a geographical location with varying degrees of accuracy.
This data fusion can make it easy to visualize security data on a map.
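The integer-conversion and subnet-grouping steps above can be sketched with Python's standard ipaddress module; the /24 prefix is just one example of a grouping boundary an organization might choose:

```python
import ipaddress

def ip_to_int(ip: str) -> int:
    """Dotted-decimal string -> 32-bit integer (saves space and enables range math)."""
    return int(ipaddress.IPv4Address(ip))

def int_to_ip(value: int) -> str:
    """32-bit integer -> dotted-decimal string."""
    return str(ipaddress.IPv4Address(value))

def subnet_of(ip: str, prefix: int = 24) -> str:
    """Group an address by its /prefix network, e.g. for per-segment analysis."""
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))

# e.g. ip_to_int("10.0.0.1") -> 167772161
# e.g. subnet_of("192.168.1.77") -> "192.168.1.0/24"
```

With addresses as integers, range checks and sorting become simple arithmetic, and grouping by subnet collapses millions of addresses into a handful of segments for analysis.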
Finally, Load is the process of writing the data into the target database,
which can later be queried or retrieved
by business intelligence applications for decision making.
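A minimal illustration of the Load step, writing already-transformed records into SQLite; the schema, table name, and values are hypothetical, and a real pipeline would use a persistent database rather than an in-memory one:

```python
import sqlite3

# Hypothetical target schema; column names are illustrative only.
conn = sqlite3.connect(":memory:")  # stand-in for a persistent target database
conn.execute(
    "CREATE TABLE events (src_ip TEXT, src_ip_int INTEGER, action TEXT)"
)

# Records produced by earlier Extract/Transform steps (values invented).
transformed = [
    ("192.168.1.10", 3232235786, "permitted"),
    ("10.0.0.7", 167772167, "blocked"),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", transformed)
conn.commit()

# The loaded table can now be queried by BI / reporting tools.
blocked = conn.execute(
    "SELECT COUNT(*) FROM events WHERE action = 'blocked'"
).fetchone()[0]
```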
After you have collected the data and loaded it via ETL into a target database,
the next step is to explore and understand the data, discover hypotheses,
and find what is interesting about the data.
Data exploration is the practice of using visual and
quantitative methods to understand
and summarize a data set without making any assumptions about its contents.
It is a crucial step to take before diving into machine learning,
because it provides the context needed to develop an appropriate model
for the problem at hand
and to correctly interpret its results.
Data exploration methods include:
– Summary statistics and univariate visualization (e.g., histograms)
– Bivariate visualization (e.g., box plots and distribution graphs)
– Multivariate visualizations, to understand interactions
between different attributes
– Dimensionality reduction, to identify the fields in the data
that account for the most variance between observations and to allow for the processing
of a reduced volume of data
– Clustering of similar observations in the data set
into differentiated groupings; by collapsing the data
into a few small data points, patterns of behavior can be more easily identified
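A few of these exploration methods, such as summary statistics, a crude univariate histogram, and a simple outlier check, can be sketched in plain Python; the per-host connection counts below are made up for illustration:

```python
import statistics
from collections import Counter

# Hypothetical daily connection counts per host; values are illustrative only.
conn_counts = [12, 15, 11, 14, 13, 250, 12, 16, 14, 13]

# Summary statistics: a first look at central tendency and spread.
mean = statistics.mean(conn_counts)      # pulled upward by the extreme value
median = statistics.median(conn_counts)  # robust to it
stdev = statistics.stdev(conn_counts)

# A crude univariate "histogram": bucket counts into bins of width 50.
histogram = Counter((c // 50) * 50 for c in conn_counts)

# The gap between mean and median already hints at an anomaly; flag any
# observation far above the mean as a candidate outlier.
outliers = [c for c in conn_counts if c > mean + 2 * stdev]
```

Even this crude pass surfaces the anomalous host (the 250-connection spike) before any model is trained, which is exactly the kind of context data exploration is meant to provide.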
Once we have confirmed the questions to be answered or the hypotheses to be tested,
we move to the next phase: applying complex statistical modeling techniques
and advanced machine learning methods to drill into and mine the data
and uncover the hidden patterns underneath it.
Advanced predictive analytics for anomaly/fraud/spam detection include:
– Complex statistical modeling: regression analysis (such as logistic regression,
polynomial regression, stepwise regression, ridge regression, lasso,
and elastic net regression), principal component analysis, factor analysis,
and Bayesian modeling
– Machine learning approaches: decision trees and random forests,
support vector machines, naive Bayes, and neural networks
Many advanced machine learning algorithms can also use the
findings or outcomes discovered during data exploration as features
in building the classifier or automated detector.
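As one small illustration of these approaches, here is a toy naive Bayes classifier built from scratch over labeled log messages. The training data and tokens are invented, and a real detector would use a proper library and far more data; the point is only to show the class-conditional token counting and smoothed log-posterior scoring at the heart of the method:

```python
import math
from collections import Counter

# Toy labeled events; features are tokens from hypothetical log messages.
train = [
    ("failed login root", "malicious"),
    ("failed login admin", "malicious"),
    ("successful login alice", "benign"),
    ("successful logout alice", "benign"),
]

# Count token frequencies per class: the core statistics of naive Bayes.
class_counts = Counter(label for _, label in train)
token_counts = {"malicious": Counter(), "benign": Counter()}
for text, label in train:
    token_counts[label].update(text.split())

vocab = {t for counts in token_counts.values() for t in counts}

def predict(text):
    """Return the class with the higher log posterior, with add-one smoothing."""
    scores = {}
    for label in class_counts:
        total = sum(token_counts[label].values())
        score = math.log(class_counts[label] / len(train))  # class prior
        for token in text.split():
            score += math.log(
                (token_counts[label][token] + 1) / (total + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)
```

For instance, a message sharing tokens with the "failed login" examples scores higher under the malicious class even if it contains an unseen username, which mirrors how features surfaced during exploration feed the automated detector.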
Now, after all the hard work of data munging, digging, mining, and synthesizing,
both exploratory and advanced analytics produce findings and insights
from the vast amount of security intelligence data.
The key step here is communicating your findings
and insights to others, especially decision makers.
Communication is also critical in risk management.
Security is such a complex topic that it defies easy description.
The sheer scope and volume of available data overwhelm practitioners and laypeople alike.
It is quite challenging for most information security professionals to show security either literally or figuratively.
Dashboards & Reports. Here is a quote from Stephen Few, the godfather of information dashboard design: “A dashboard is a visual display
of the most important information needed to achieve one or more objectives
that has been consolidated on a single computer screen so it can be monitored at a glance.”
In essence, a dashboard provides a single-screen opportunity
to present the most critical and relevant information in the most concise
and effective way possible, enabling the viewer
to quickly understand the elements being described and, if necessary,
make the most appropriate decisions.
We will not go over the details of how to design dashboards or reports
for risk communication in this course. But
we want to emphasize that presenting your analytics insights and findings well
is what allows data-driven analytics to reduce the uncertainty
in managing cyber security risks:
for example, in understanding the threat landscape, discovering vulnerabilities,
quantifying the impact on critical assets, and measuring control effectiveness.
In addition to feeding
the insights into dashboards and reports, these analytics-driven insights
and findings can also be fed automatically into business rules
for automated detection,
combined with real-time threat intelligence streams, and integrated with
the organization’s intelligent response, such as incident response, disaster
recovery, and business continuity planning.
In summary, the security data analytics pipeline consists of the following:
– Data collection and ETL, which combine, clean, and prepare
the data for exploratory and advanced analytics
– Data exploration, to discover and explore the data’s characteristics and hypotheses
– Advanced analytics, a heavy-lifting phase that applies sophisticated statistical
modeling techniques and trains, evaluates, and tests machine learning models
The discovered insights and knowledge are then fed into dashboards and reports
to communicate security risks and facilitate the risk management program,
and also fed into automated decision and action tools
for detection, response, and recovery.
Source: Tong Sun,
Adjunct Professor of Computing Security
Rochester Institute of Technology