For this part of the exercise, I look at two IP addresses and calculate similarity using Euclidean distance and Pearson correlation. I created a small dataset in the form of a nested dictionary. I did the calculations manually, but Python’s pandas can work the numbers easily. I calculate the distance of Lisa from Kirk by isolating 1.1.1.1 and 2.2.2.2 and plotting those on a graph. I do this for each combination of people and each combination of IP addresses. I even find people who are very similar and one who is not as similar. This model can help in understanding clusters and identifying baseline conversations between people and the IP addresses they visit. Somehow it all makes sense to me.
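Here is a minimal sketch of the two similarity measures in plain Python, assuming the nested dictionary maps each person to a count of visits per IP address. The visit counts, the third person (Mona), the third IP address, and the function names are all made up for illustration; they are not the dataset I actually used.

```python
from itertools import combinations
from math import sqrt

# Hypothetical data: person -> {ip_address: number of visits}.
# All counts are invented for this sketch.
visits = {
    "Lisa": {"1.1.1.1": 5, "2.2.2.2": 3, "3.3.3.3": 1},
    "Kirk": {"1.1.1.1": 4, "2.2.2.2": 1, "3.3.3.3": 2},
    "Mona": {"1.1.1.1": 1, "2.2.2.2": 5, "3.3.3.3": 4},
}


def shared_keys(data, a, b):
    """IP addresses that both people have visited."""
    return sorted(set(data[a]) & set(data[b]))


def euclidean_distance(data, a, b):
    """Straight-line distance between two people; 0 means identical."""
    keys = shared_keys(data, a, b)
    return sqrt(sum((data[a][k] - data[b][k]) ** 2 for k in keys))


def pearson(data, a, b):
    """Pearson correlation in [-1, 1]; returns 0.0 when it is undefined."""
    keys = shared_keys(data, a, b)
    n = len(keys)
    if n < 2:
        return 0.0
    xs = [data[a][k] for k in keys]
    ys = [data[b][k] for k in keys]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Compare every pair of people.
    for a, b in combinations(visits, 2):
        print(a, b, euclidean_distance(visits, a, b), pearson(visits, a, b))
```

With these invented numbers, Lisa comes out closer to Kirk than to Mona on both measures. Swapping the roles of the keys (IP address first, then person) would give the same comparison between IP addresses instead of people.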

Organizations have created thousands of models and have a solid understanding of the business and its priorities. They plan to use predictive analytics and statistical techniques from modeling, machine learning, data mining, and game theory that analyze current and historical facts to make predictions about future events. These reports require heavy interaction among the BI, visualization, and Infosec teams to produce real, validated results.

My journey into Data Mining

Following up on my recent blog post on Infosec and Big Data, I decided to write more about my thought process and my investigation of how the Infosec industry is going to change. The next few posts will detail some lessons learned while building a variety of algorithms for a Big Data system.

I’ve learned that data mining involves choosing between two paths: look at data to explain the past, or use data to predict the future. SIEM vendors have created a variety of algorithms to identify attacks. Most of these algorithms are easy to reproduce for simple attacks like DoS and DDoS. Detection for any attack that floods, brute-forces, or breaks a threshold is easy to build. Twitter, Facebook, Netflix, Google and “put-your-social-here” have figured out how to data mine machine logs.
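To show how simple threshold detection can be, here is a minimal sketch that counts events per source IP in fixed time windows and flags any source that exceeds a limit. The event format (timestamp in seconds, source IP), the function name, and the default window and threshold are assumptions I made for illustration, not what any particular SIEM does.

```python
from collections import Counter


def flag_noisy_sources(events, window_seconds=60, threshold=100):
    """Return source IPs with more than `threshold` events in any one window.

    `events` is an iterable of (timestamp_seconds, source_ip) pairs.
    Windows are fixed buckets of `window_seconds`, not sliding windows.
    """
    counts = Counter(
        (src, int(ts // window_seconds)) for ts, src in events
    )
    return sorted({src for (src, _), n in counts.items() if n > threshold})


if __name__ == "__main__":
    # Hypothetical traffic: one chatty source and one quiet source.
    sample = [(i * 0.1, "10.0.0.1") for i in range(200)]
    sample += [(i * 10, "10.0.0.2") for i in range(5)]
    print(flag_noisy_sources(sample))
```

A fixed bucket can miss a burst that straddles two windows, so a real detector would more likely use a sliding window; the fixed version just keeps the idea visible.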

Data Mining -> Future -> Modelling

Data Mining -> Past -> Exploration

From my own experience, I have found it trivial to create a few examples that “data mine the past”. There are literally thousands of examples of how to data mine social media. Everyone and their aardvarks have used a tool to “data mine” and visualize past data. I’m finding it less trivial to create “predictive models of the future”. Below is a graphic I found that explains the connection of data mining to other concepts.

Making your own “predictive models of the future”

Predictive modelling is also known as “machine learning” or “pattern recognition”. Many systems use one model and give the user one answer. These one-to-one models are formula-based. The first challenge I ran into was deciding which tool to model in: R? Python? Weka? Mahout? Next came finding a variety of data sets, cleaning them, and loading them into the tool of my choice.

NOTE: All solutions must use Hadoop. Googling “predictive model marketplace” didn’t help much. Why isn’t there a place on the web where people can freely share predictive models?

The next step is choosing the model I will experiment with. In predictive modelling, there are four choices: 1. Classification 2. Regression 3. Clustering 4. Association Rules.

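Data Mining -> Future -> Modelling -> Classification

Data Mining -> Future -> Modelling -> Regression

Data Mining -> Future -> Modelling -> Clustering

Data Mining -> Future -> Modelling -> Association Rules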

I have identified three predictive algorithms to start with for discovering network attacks: k-means clustering, k-nearest neighbour, and association rules. I will create a simple algorithm for each. A variety of papers can be found simply by searching Google for each algorithm name combined with “network attacks”. There is quite a bit of math and theory around these techniques that I am not familiar with. Nevertheless, I will try to explain and create trivial predictive algorithms in my next series of blog posts.
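As a taste of what “trivial” could look like, here is a minimal sketch of k-nearest neighbour in plain Python: a new observation gets the label that is most common among its k closest training points. The two features (packets per second, distinct ports contacted), the labels, and every number below are invented for illustration, and the function name is my own.

```python
from collections import Counter
from math import dist  # Euclidean distance, available since Python 3.8


def knn_classify(training, point, k=3):
    """Label `point` by majority vote among its k nearest training rows.

    `training` is a list of (features, label) pairs, where `features`
    is a tuple of numbers with the same length as `point`.
    """
    neighbours = sorted(training, key=lambda row: dist(row[0], point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical training data: (packets_per_second, distinct_ports).
    training = [
        ((10, 2), "normal"),
        ((12, 3), "normal"),
        ((8, 1), "normal"),
        ((900, 40), "attack"),
        ((1100, 55), "attack"),
        ((950, 60), "attack"),
    ]
    print(knn_classify(training, (1000, 50)))
    print(knn_classify(training, (11, 2)))
```

Because the features here are on very different scales, a more careful version would normalise them first so that packets per second does not dominate the distance.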