Project Overview:
The large volume of honeypot logs makes data analysis and interpretation difficult. To reduce experts' workload and the complexity of data analytics, this GSoC idea is to automatically build an attack community graph for eliciting attack approaches and describing attacker intentions.

The GSoC idea will be divided into three stages. The first stage constructs an attack graph by extracting relationships among criminals, victims, and malicious servers from honeypot logs. For this project, I will use dionaea, glaspot, and kippo logs as the first-level raw data. To derive second-level analysis data from the first-level data, the Cuckoo sandbox developed by Claudio Guarnieri, the PHP sandbox by Lukas Rist, Thug by Angelo Dellaera, Hale by Patrik Lantz, and fast-flux detection will be applied for advanced data collection and analysis. After completing data collection and processing, I will extract relationships from those data to build the attack graph.
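The relationship-extraction step of the first stage can be sketched with networkx (the graph library used later in this project). The record fields below (`attacker_ip`, `victim_ip`, `malware_url`) are hypothetical placeholders for whatever fields the real dionaea/glaspot/kippo events actually carry:

```python
# Sketch only: turning honeypot log records into an attack graph.
# Field names and sample values are illustrative, not the real log schema.
import networkx as nx

records = [
    {"attacker_ip": "203.0.113.7", "victim_ip": "198.51.100.2",
     "malware_url": "http://malhost.example/a.exe"},
    {"attacker_ip": "203.0.113.7", "victim_ip": "198.51.100.9",
     "malware_url": "http://malhost.example/a.exe"},
]

G = nx.DiGraph()
for r in records:
    # Tag each node with its role so later stages can distinguish
    # criminals, victims, and malicious servers.
    G.add_node(r["attacker_ip"], kind="attacker")
    G.add_node(r["victim_ip"], kind="victim")
    G.add_node(r["malware_url"], kind="malicious_server")
    # Edges carry the relationship type extracted from the logs.
    G.add_edge(r["attacker_ip"], r["victim_ip"], rel="attacks")
    G.add_edge(r["victim_ip"], r["malware_url"], rel="downloads_from")

print(G.number_of_nodes(), G.number_of_edges())
```

Because nodes are keyed by IP/URL, repeated appearances of the same attacker or malware host collapse into a single node automatically.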
The second stage applies a centrality mechanism to group the graph into individual attack-approach compartments. By evaluating the relative centrality of the different compartments, the attack community graph will be constructed by connecting high-density attack-approach compartments. I will also map each attack-approach compartment to its attack behavior intentions. The second deliverable is a Python package that expresses the attack community graph and its intentions.
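A minimal sketch of the centrality idea, assuming an attack graph like the one built in stage one. Here networkx betweenness centrality ranks nodes, and connected components stand in for the attack-approach compartments; the real grouping mechanism is still to be decided, so this is illustrative only:

```python
# Sketch: rank "attack approach compartments" by their most central node.
# a1/a2 are attackers, v* victims, m* malicious servers (toy data).
import networkx as nx

G = nx.Graph()
G.add_edges_from([("a1", "v1"), ("a1", "v2"), ("v2", "m1"),
                  ("a2", "v3"), ("v3", "m2")])

# Betweenness centrality scores how often a node lies on shortest paths.
centrality = nx.betweenness_centrality(G)

# Connected components play the role of compartments in this sketch.
compartments = [G.subgraph(c) for c in nx.connected_components(G)]

# Rank compartments by the centrality of their most central node.
ranked = sorted(compartments,
                key=lambda sg: max(centrality[n] for n in sg),
                reverse=True)
for sg in ranked:
    print(sorted(sg.nodes()))
```

The same ranking could later drive which compartments get connected into the final attack community graph.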

The Honeynet Project uses hpfeeds by Mark Schloesser as a generic authenticated data-feed protocol to collect honeypot data from around the world. Ben Reardon used Splunk to do data analysis and visualization. The third stage is to develop an app that presents the attack community graph and integrates it into the Splunk platform as the final deliverable.

Project Plan:

April 23rd - May20th: Community Bonding Period

- Prepare development and testing environments

- Learn how to develop a Splunk App

- Study social network graph drawing tools and libraries

- Read papers on social network centrality algorithms

May 21st : GSoC 2012 coding officially starts

May 21st - May 28th:

- Decide which honeypot logs to use.
- Modify and integrate the current hpfeeds client to collect the instances (the first-level data set) and store them on the testing site
- Format, index, and create Splunk searches to extract relationships for graph construction.

May 29th - June 15th:

- Process the first-level data further to produce the second-level data
- Develop a Python module for attack social network construction.
- Integrate into the Splunk App and show the first draft graph, which only shows nodes and the edges between them.

June 19th - July 7th:

- Discuss whether we should do graph reduction based on a centrality algorithm to decrease visualization complexity
- Integrate into the Splunk App and show the second version of the graph, which can show the attack compartments
- Evaluate current project results and scope. Adjust project scope and deliverables.

June 17th
Done last week:
- Designed and implemented data vectors for graphing and centrality calculation from Splunk-indexed data
(Testing environment at http://114.35.193.28:8000/; code integrates with Splunk4HPfeeds from Frank)
- Centrality calculation coding; still debugging

Planned for next week:
- Debug the centrality calculation code and integrate it into Splunk4HPfeeds
- Use networkx to draw the graph
- Draw the graph by applying the centrality calculation

June 24th
Done last week:
- Centrality calculation debugging and integration into Splunk4HPfeeds (still debugging)
(I cannot feed hpfeeds logs into Splunk directly. The solution is to store the hpfeeds logs in a file, then have Splunk monitor that file.)
- Used networkx to draw the graph. The code is in the graph_1 folder, using the glastopf_events and glastopf_sandbox logs

Planned for next week:
- Debug the centrality calculation code and integrate it into Splunk4HPfeeds
- Use networkx to draw version 2 of the graph
- Draw the graph by applying the centrality calculation

Blocked issue:
- Splunk changed its internal framework in the newest version. It took a lot of time to read the documentation.
- The version 1 graph data format was complicated and heavily duplicated. Re-designing the data format makes the graph simpler and more efficient.

July 1st
Done last week:
- Centrality calculation debugging and integration into Splunk4HPfeeds (still debugging)
(I cannot feed hpfeeds logs into Splunk directly. The solution is to store the hpfeeds logs in a file, then have Splunk monitor that file.)
- Used networkx to draw version 2 of the graph. The code is in the graph_2 folder, integrating the glastopf_events, glastopf_sandbox, glastopf_files, thug_events, and thug_files logs. The data format was re-designed based on relationships and node types.

Planned for next week:
- Debug the centrality calculation code and integrate it into Splunk4HPfeeds
- Use networkx to draw version 2 of the graph
- Draw the graph by applying the centrality calculation

Blocked issue:
- My final exams are this week. Sorry for delaying the progress.

Planned for next week:
- Study how to use D3 for dynamic graphs
- Study how to use Splunk as a substitute for a traditional DB structure

July 22nd
Done last week:
- Studied how to use D3 for dynamic graphs
- Studied how to use Splunk as a substitute for a traditional DB structure

Planned for next week:
- Implement D3 on Splunk to show the social graph
- Implement DB select and search using Splunk to extract the data presented on the graph.

July 22nd
Done last week:
(1) http://140.116.163.148:8000/en-US/app/HpfeedsHoneyGraph/HpfeedsHoneyGraph
Displays landing_site -> hopping_site -> malware downloading, showing information on mouseover.
- This graph does not display in the center of the screen; I am still debugging this.
- As you can see, the graph has too many single nodes. The problem is that "http://www.aaa.com/sjdksd" and "http://www.aaa.com/weuwie" are shown as two separate nodes. I would like to discuss how to simplify the graph.
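One possible simplification, still to be discussed: collapse URLs that share a host into a single hostname node, so the two www.aaa.com URLs above become one node. A minimal sketch with the standard library:

```python
# Sketch: merge URL nodes that share a hostname. Sample URLs mirror the
# www.aaa.com example above; the third URL is a made-up contrast case.
from urllib.parse import urlparse

urls = ["http://www.aaa.com/sjdksd",
        "http://www.aaa.com/weuwie",
        "http://www.bbb.com/x"]

merged = {}
for url in urls:
    host = urlparse(url).netloc          # "www.aaa.com", "www.bbb.com", ...
    merged.setdefault(host, []).append(url)

# Each key becomes one graph node; the URL list behind it can feed the
# mouseover information panel.
print(merged)
```

This would shrink the single-node clutter at the cost of losing per-path detail in the node itself.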

(2) http://140.116.163.148:8000/en-US/app/HpfeedsHoneyGraph/ThugFilesGraph
- This graph is the unchanged graph. After discussing with my mentor, Chris, I found a big mistake in the force-collapsible graph in D3: it cannot display unique nodes.
- With Malware A --> google.com and Malware B --> google.com, google.com is shown as two different nodes in the graph. Therefore, I took two days to change the data format and merge sub-trees so that unique nodes are displayed.
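The sub-tree merge can be sketched as converting the edge list into a node/link format keyed by node name, so each object appears exactly once. The `source`/`target`-index shape below follows D3's force-layout convention; the edge data is the google.com example from above:

```python
# Sketch: de-duplicate tree leaves into unique nodes for a D3 force layout.
edges = [("Malware A", "google.com"), ("Malware B", "google.com")]

nodes, index = [], {}
for src, dst in edges:
    for name in (src, dst):
        if name not in index:            # first sighting: allocate a node
            index[name] = len(nodes)
            nodes.append({"name": name})

# Links reference node positions, so "google.com" is stored only once.
links = [{"source": index[s], "target": index[t]} for s, t in edges]
print(nodes, links)
```

Serializing `{"nodes": nodes, "links": links}` to JSON gives D3 a graph where google.com is a single shared node instead of two.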

(3) http://140.116.163.148:8000/en-US/app/HpfeedsHoneyGraph/ThugFilesUnique
This graph extracts malicious hostnames from cuckoo reports --> runs Pffdetect to detect fast-flux IPs --> does passive DNS lookups to find more corresponding domains and IPs.
- This graph takes objects as unique nodes, links their relationships, and shows their information on mouseover.

Planned for next week:
(1) Add a time search bar to the graph.
(2) Add GeoIP information to the graph
(3) Use the REST API instead of pure Python for running Splunk searches
(4) Use the Splunk App framework to pass information between the apps.