Common Descriptive Analytics for Security Data 1: Summary Statistics

Most of the security metrics introduced in previous units are raw data or simple observations.
To make them suitable for decision-making dashboards and for triggering actions, they need to be contextualized, processed, fused with other relevant data, and analyzed.
Here is a set of common descriptive analytics methods for security data:

  • Grouping and aggregation through summary statistics, e.g. average, median,
    standard deviation
  • Time series analysis
  • Cross-sectional analysis
  • Quartile analysis
  • Correlation matrices.

We will describe the core concepts of each method: what they mean and how they are constructed.
The techniques themselves are generic and not specific to security.
In addition, we will provide some guidance in the form of use cases and example data sets.

In descriptive analytics, summary statistics are used to summarize, group or aggregate a set of observations, in order to communicate the largest amount
of information as simply as possible.
The arithmetic mean serves as a standard aggregation technique.
It is the sum of a collection of numbers divided by the number of numbers in the collection.
While the arithmetic mean is often used to report “central tendencies” (in other words, to characterize what is “typical”), it is not a robust statistic,
meaning that it is greatly influenced by outliers (values that are much larger
or smaller than most of the values).
In many cases, the “typical” value is better characterized by the median than
by the arithmetic mean.
The median of a data set is the number that separates the top 50% of elements
from the bottom 50% when the elements are sorted in order.
In other words, a data set’s median indicates the point where half
of the elements lie above it and half below.
Like the arithmetic mean, medians help aggregate data sets into smaller sets by summarizing a range of records.
However, medians offer more insight for understanding the 80/20 rule
for performance-related data, particularly if outliers are present.
For example, consider a password-auditing tool that determines the number of seconds needed to crack user account credentials.
If the mean time to crack 1000 passwords was 1344 seconds but the median was 822 seconds,
that would tend to suggest that outliers (a few very hard-to-crack passwords) distorted the average
and therefore painted a rosier picture.
In short, medians offer some significant advantages over means,
particularly with respect to measuring performance.
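To make the mean-versus-median comparison concrete, here is a minimal Python sketch. The crack times are made-up values, chosen only so that the mean and median match the 1344 and 822 seconds quoted above. The variable names and the 4000-second cutoff used to drop the outlier are illustrative assumptions.

```python
import statistics

# Hypothetical crack times in seconds (made-up values, not from any real audit).
# Six accounts fall in a narrow band; one very strong password is an outlier.
crack_times = [610, 700, 790, 822, 850, 900, 4736]

mean_time = statistics.mean(crack_times)      # pulled upward by the outlier
median_time = statistics.median(crack_times)  # middle value of the sorted list

# Drop the single outlier (arbitrary cutoff, for illustration) and recompute
# to see how much each statistic moves.
without_outlier = [t for t in crack_times if t < 4000]
mean_trimmed = statistics.mean(without_outlier)
median_trimmed = statistics.median(without_outlier)

print(f"with outlier:    mean={mean_time:.1f}  median={median_time:.1f}")
print(f"without outlier: mean={mean_trimmed:.1f}  median={median_trimmed:.1f}")
```

With these values, removing one record moves the mean from 1344 to about 778.7 seconds, while the median only moves from 822 to 806. This is the sense in which the median is the more robust statistic.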
Variance and standard deviation are also summary statistics.
Variance is the average of the squared differences from the mean.
Standard deviation (represented as “sigma”, σ) is a measure of how spread out the numbers are;
it is the square root of the variance.
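The two definitions can be followed step by step in a short Python sketch. The input values are made up for illustration (they could stand for something like failed logins per hour). The calculation divides by the number of observations, which matches the "average of the squared differences" definition and is usually called the population variance. The sample variance, which divides by n - 1, gives a slightly larger number.

```python
import math
import statistics

# Hypothetical observations (made-up values for illustration).
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Step 1: arithmetic mean.
mean = sum(values) / len(values)

# Step 2: variance = average of the squared differences from the mean.
variance = sum((x - mean) ** 2 for x in values) / len(values)

# Step 3: standard deviation = square root of the variance.
std_dev = math.sqrt(variance)

print(f"mean={mean}, variance={variance}, std_dev={std_dev}")

# The standard library offers the same population statistics directly.
print(statistics.pvariance(values), statistics.pstdev(values))
```

For these values the mean is 5, the variance is 4 and the standard deviation is 2, so most observations lie within about 2 units of the mean.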
These summary statistics can be easily calculated in Excel,
R or Python using built-in shortcut functions or a statistics library.

The box plot (or box-and-whisker diagram) is a standardized way of displaying the distribution of data based on the five-number summary: minimum, lower quartile, median, upper quartile, and maximum.
In the simplest box plot the central rectangle spans the lower quartile
to the upper quartile (the interquartile range or IQR).
A segment inside the rectangle shows the median and “whiskers” above and below the box show the locations of the minimum and maximum.
This simplest possible box plot displays the full range of variation (from min to max), the likely range of variation (the IQR), and a typical value (the median).
Not uncommonly, real datasets will display surprisingly high maximums or surprisingly low minimums, called outliers.
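The five-number summary behind a box plot can be sketched in Python as follows. The data are made-up crack times in seconds, and `five_number_summary` is a hypothetical helper name. Two choices here are assumptions not taken from the text: quartiles are computed with the "inclusive" interpolation method of `statistics.quantiles` (other methods give slightly different quartiles), and outliers are flagged with the common convention of 1.5 times the IQR beyond the quartiles.

```python
import statistics

def five_number_summary(data):
    """Return min, lower quartile, median, upper quartile and max of data."""
    ordered = sorted(data)
    # Quartile method is an assumption; "inclusive" treats data as the full population.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {"min": ordered[0], "q1": q1, "median": q2, "q3": q3, "max": ordered[-1]}

# Hypothetical crack times in seconds (made-up values for illustration).
crack_times = [610, 700, 790, 822, 850, 900, 4736]

summary = five_number_summary(crack_times)
iqr = summary["q3"] - summary["q1"]  # interquartile range: the box in the plot

# Common convention (an assumption here): flag values more than 1.5 * IQR
# outside the box as outliers.
lower_fence = summary["q1"] - 1.5 * iqr
upper_fence = summary["q3"] + 1.5 * iqr
outliers = [t for t in crack_times if t < lower_fence or t > upper_fence]

print(summary)
print(f"IQR={iqr}, outliers={outliers}")
```

For this sample the box spans 745 to 875 seconds with the median at 822, and the single value of 4736 seconds falls far outside the upper fence, so it would be drawn as an outlier rather than as the end of a whisker in plots that follow this convention.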

Reference: Tong Sun,
Adjunct Professor of Computing Security
Rochester Institute of Technology

Author: McPeters Joseph

Joseph McPeters is a Security Researcher. He specializes in network and web application penetration testing. Contact: admin@incidentsecurity.com
