A Brief History of Learning from Data

Prior to the 20th century, the use of data and statistics was still relatively undeveloped.
Much of the scientific research of the day used basic descriptive statistics
as evidence to validate hypotheses.
The inability to draw clear conclusions from noisy data (and almost all real data is more or less noisy) turned many scientific debates into arguments about opinions of the data rather than the data itself.
One such fierce debate in the 19th century was between two medical professionals
who argued (both with data) over the cause of cholera, a bacterial infection that was often fatal.
The cholera outbreak in London in 1849 was especially brutal,
claiming more than 14,000 lives in a single year.
The cause of the illness was unknown at that time and two competing theories from two researchers emerged.
One well-respected and established epidemiologist argued that cholera was caused
by air pollution created by decomposing and unsanitary matter.
Dr. John Snow, also a successful epidemiologist, put forth the theory that cholera was spread by consuming water that was contaminated by a “special animal poison” (this was before the germ theory of disease was widely accepted).
The two debated for years. Dr. John Snow was passionate, vocal, and relentless in proving his theory.
It’s said he even collected data by going door to door during the 1854 cholera outbreak in the Soho district.

It was from that outbreak and his collected data
that he made his now famous map (shown in the figure).
The hand-drawn map of the Soho District included little tick marks at the addresses where cholera deaths were reported.
Overlaying the locations of the water pumps
where residents got their drinking water showed a rather obvious clustering of deaths
around the water pump on Broad Street.
Swayed by his map and his passionate pleas, the city allowed the pump handle to be removed, and the epidemic in that region subsided.

The representative case for 20th-century data analysis is the work of Ronald Fisher (a professor of genetics), whose many revolutionary contributions to statistics pioneered the use of the statistical model in scientific inference.
An agricultural research station north of London began conducting experiments
on the effects of fertilizer on crop yield.
In 1919, they hired a brilliant young statistician named Ronald Fisher to pore
through more than 70 years of data and help them understand it.
Fisher quickly ran into a challenge with the data being confounded, and he found it difficult
to isolate the effect of the fertilizer from other effects, such as weather or soil quality.
This challenge would lead Fisher toward discoveries
that would forever change not just the world of statistics,
but almost every scientific field in the 20th century.
What Fisher discovered is that if an experiment was designed correctly,
the various effects could not just be separated, but their individual influence could also be measured and calculated.
With a properly designed experiment, he was able to isolate the effects of weather, soil quality,
and other factors so he could compare the effects of various fertilizer mixtures.
This work was not limited to agriculture; the same techniques are still used widely
in everything from medical trials to archaeological dig sites.
Fisher’s work and the work of his peers helped revolutionize science in the 20th century.
No longer could scientists simply collect and present their data as evidence of their claim
as they had in the 18th and 19th centuries. They now had the tools to design robust experiments
and techniques to model how the variables affected their experiment and observations.
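
Fisher's point about experimental design can be illustrated with a small simulation. The sketch below is not Fisher's own analysis: the fertilizer and soil effects are made-up numbers, and names such as `fertilizer_effect`, `block_effect`, and the field labels are hypothetical. It compares a confounded layout, where each fertilizer is tried on a different field, with a randomized block layout, where every fertilizer is grown in every field.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

# Made-up "true" effects, for illustration only.
fertilizer_effect = {"A": 0.0, "B": 2.0, "C": 5.0}  # what we want to measure
block_effect = {"field1": 0.0, "field2": 10.0, "field3": 20.0}  # soil quality

def plot_yield(field, fertilizer):
    """Simulated yield: baseline + soil + fertilizer + small random noise."""
    noise = random.gauss(0, 0.2)  # stands in for weather and measurement error
    return 50.0 + block_effect[field] + fertilizer_effect[fertilizer] + noise

def effects_vs_a(observations):
    """Mean yield of each fertilizer minus the mean yield of fertilizer A."""
    means = {
        f: mean(y for fert, y in observations if fert == f)
        for f in fertilizer_effect
    }
    return {f: means[f] - means["A"] for f in means}

# Confounded layout: each fertilizer is used on exactly one field, so soil
# quality and fertilizer cannot be told apart.
confounded = [
    (fert, plot_yield(field, fert))
    for field, fert in zip(block_effect, fertilizer_effect)
]

# Randomized block layout: every fertilizer appears once in every field,
# in a random plot order, so the soil effect cancels out of the comparison.
blocked = []
for field in block_effect:
    order = list(fertilizer_effect)
    random.shuffle(order)
    for fert in order:
        blocked.append((fert, plot_yield(field, fert)))

confounded_estimate = effects_vs_a(confounded)
blocked_estimate = effects_vs_a(blocked)
print("true effects vs A:  ", fertilizer_effect)
print("confounded estimate:", confounded_estimate)
print("blocked estimate:   ", blocked_estimate)
```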

In the 21st century, human history is characterized by the shift from an economy based on industrial production
to one based on information and computerization, a period also called the Information Age.
Web 2.0, social media, mobile devices, cloud computing,
and IoT technologies have created an ever-increasing quantity of complex, noisy,
diverse, high-speed, and high-frequency data.
New trends and approaches have been born from the practical problems created
by the Information Age. Novel data visualization techniques have been created
for the purpose of describing and exploring the data. In 2001, Leo Breiman,
a statistician who focused on machine learning algorithms, described a new culture
of data analysis that does not focus on defining a data model of nature
but instead derives an algorithmic model from nature.
This new culture has evolved within computer science
and engineering largely outside (or perhaps alongside) traditional statistics.
The revolutionary idea is that models should be judged on their predictive accuracy instead of being validated with traditional statistical tests (which are not without value, by the way).
Since this new approach demands training, validation, and test data sets
and an iterative training/validation process to improve accuracy, it is not well suited to small data sets, but it works remarkably well with modern data sets.
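
The workflow described above can be sketched in a few lines. This is a minimal illustration, not Breiman's method: the data are synthetic, the model is a simple k-nearest-neighbor classifier written from scratch, and names such as `make_point` and `candidate_ks` are assumptions made for the example. The training set is what the model learns from, the validation set is used to choose between candidate models, and the test set is touched only once to report predictive accuracy.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

def make_point():
    """Synthetic example: label is 1 when x > 0.6, flipped 10% of the time."""
    x = random.random()
    label = int(x > 0.6)
    if random.random() < 0.1:  # label noise
        label = 1 - label
    return x, label

data = [make_point() for _ in range(1000)]
train, valid, test = data[:600], data[600:800], data[800:]

def knn_predict(train_rows, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(2 * votes > k)

def accuracy(train_rows, eval_rows, k):
    """Fraction of eval_rows whose label the k-NN model predicts correctly."""
    hits = sum(knn_predict(train_rows, x, k) == y for x, y in eval_rows)
    return hits / len(eval_rows)

# Iterate over candidate models and keep the one that predicts the
# validation set best; the test set plays no part in this choice.
candidate_ks = [1, 5, 15, 45]  # odd values avoid tied votes
best_k = max(candidate_ks, key=lambda k: accuracy(train, valid, k))

# Judge the chosen model once, on data it has never seen.
test_accuracy = accuracy(train, test, best_k)
print(f"best k = {best_k}, test accuracy = {test_accuracy:.2f}")
```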

There are several main differences between data analysis in the modern Information Age and in the agricultural fields of the 20th century.
First, there is a large difference in the available sample size.
Second, for many environments and industries, a properly designed experiment is unlikely, if not completely impossible.
You cannot divide your networks into control and test groups, nor would you want
to test the efficacy of a web application firewall
by only protecting a portion of a critical application.

In cybersecurity today, the big challenge is protecting against the millions of new malware variants that are launched daily.
Many security solutions are incapable of detecting this fast-evolving malware
because they rely on manually tuned heuristics for creating handcrafted signatures.
This process is time-consuming and reactive, leaving organizations vulnerable
until the new signature is released.
Cybersecurity solutions that apply classical machine learning rely on
manually selected features, which are then fed into classical machine learning models to classify the file as malicious or benign.
But despite improvements in the rate and pace of detection, they still fall short.
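
The gap between the two approaches can be shown with a toy example. Everything in it is hypothetical: the byte strings stand in for files, the signature is the simplest possible kind (an exact SHA-256 hash), the two features (byte entropy and a count of suspicious API names) are picked by hand for illustration, and `toy_classifier` uses fixed weights as a stand-in for a model that would normally be trained on labeled samples. The exact-match signature misses a variant that differs by a single byte, while the feature-based check still flags it.

```python
import hashlib
import math
from collections import Counter

# Hypothetical samples: byte strings standing in for files.
known_malware = b"MZ" + bytes(range(256)) * 4 + b"CreateRemoteThread VirtualAllocEx"
variant = known_malware + b"\x00"  # trivially modified variant
benign = b"hello world " * 100

def sha256(data):
    return hashlib.sha256(data).hexdigest()

# Signature approach: an analyst adds the hash of each known sample.
signatures = {sha256(known_malware)}

def signature_match(data):
    """True only when the file is byte-for-byte identical to a known sample."""
    return sha256(data) in signatures

# Feature approach: hand-selected features fed to a classifier.
SUSPICIOUS_API_NAMES = (b"CreateRemoteThread", b"VirtualAllocEx")

def byte_entropy(data):
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def extract_features(data):
    api_hits = sum(name in data for name in SUSPICIOUS_API_NAMES)
    return [byte_entropy(data), api_hits]

def toy_classifier(features):
    """Stand-in for a trained model: fixed, hand-picked weights (assumption)."""
    entropy, api_hits = features
    score = 0.5 * entropy + 2.0 * api_hits - 6.0
    return score > 0  # True means "malicious"

for name, sample in [("known", known_malware), ("variant", variant), ("benign", benign)]:
    print(name, signature_match(sample), toy_classifier(extract_features(sample)))
```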

Deep learning is the next step in artificial intelligence.
It is built on artificial neural networks, so called because they are “inspired”
by the brain’s ability to identify objects.
Similar to the way our brain is fed with raw data from our sensory inputs
and learns the high-level features on its own, in deep learning, raw data is fed through a deep neural network, which then learns on its own
to identify the objects on which it is trained.

Recent advancements in deep learning have enabled technologies that exhibit amazing results across applications such as object, facial, and speech recognition.
When applied to cybersecurity, it takes milliseconds to feed a raw data file through the deep neural network and obtain a detection verdict with a high accuracy rate.
This ability to detect a never-before-seen malware variant enables not only extremely accurate detection but also real-time prevention, because the moment a malicious file is detected, it is already blocked.

source: Tong Sun,
Adjunct Professor of Computing Security
Rochester Institute of Technology

Author: McPeters Joseph

Joseph McPeters is a Security Researcher. He specializes in network and web application penetration testing. Contact: admin@incidentsecurity.com
