
Google Summer of Code 2011 Project Ideas

Please note that GSoC 2011 has now successfully completed. This content is being retained for reference only.

This page contains a list of potential project ideas that we are keen to develop during GSoC 2011 (we also have some additional project ideas currently undergoing internal review, which will be added here too once project deliverables and available mentors have been confirmed). You can also find our previous GSoC 2009 project ideas here and previous GSoC 2010 project ideas here too, if you are looking for inspiration.

We are always also interested in hearing any ideas for additional relevant honeynet-related R&D projects (although remember that to qualify for receiving GSoC funding from Google your project deliverables need to fit into GSoC's 3-month project timescales!). If you have a suitable and interesting project, we'll always try to find the right resources to mentor it and support you. We are also always looking for volunteers who are enthusiastic and interested in getting involved in honeynet R&D.

Each sponsored GSoC 2011 project will have one or more mentors available to provide a guaranteed contact point to students, plus one or more technical advisors to help applicants with the technical direction and delivery of the project (often the original author of a tool or its current maintainer, and usually someone recognised as an international expert in their particular field). Our Google Summer of Code organisational administrators will also be available to all sponsored GSoC students for general advice and logistical support. We'll also provide supporting hosted svn/trac/mailman/IRC/etc project infrastructure, if required.

So unsurprisingly a number of our suggested potential project ideas fall into these research areas.

However, we are also interested in receiving project proposals and tool updates/new tool developments outside these research focus areas too, so hopefully this provides potential students with a wide variety of exciting topics to contribute to and be engaged with this summer.

Capture-HPC is a high-interaction client honeypot that is capable of seeking out and identifying client-side attacks. It identifies these attacks by driving a vulnerable client to open a file or interact with a potentially malicious server. As it processes the data, Capture-HPC monitors the system for unauthorized state changes that indicate a successful attack has occurred. It is regularly used in surveys of malicious websites that launch drive-by-download attacks.
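As a rough illustration of the monitoring idea above, the sketch below filters observed state-change events against an exclusion list of known-benign patterns and flags whatever remains. The event fields, the rule format and the sample patterns are all assumptions made for this example; they are not Capture-HPC's actual exclusion list syntax.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class StateChange:
    kind: str     # hypothetical categories: "file", "registry", "process"
    process: str  # process that caused the change
    target: str   # path or registry key that was touched


# Hypothetical exclusion rules: (kind, process regex, target regex).
EXCLUSIONS = [
    ("file", r".*\\iexplore\.exe$", r".*\\Temporary Internet Files\\.*"),
    ("registry", r".*\\iexplore\.exe$",
     r"HKCU\\Software\\Microsoft\\Internet Explorer\\.*"),
]


def is_excluded(event, rules=EXCLUSIONS):
    """Return True if the event matches any known-benign rule."""
    return any(
        event.kind == kind
        and re.fullmatch(proc_re, event.process, re.IGNORECASE)
        and re.fullmatch(target_re, event.target, re.IGNORECASE)
        for kind, proc_re, target_re in rules
    )


def suspicious_changes(events, rules=EXCLUSIONS):
    """Keep only the state changes that no exclusion rule explains."""
    return [e for e in events if not is_excluded(e, rules)]


events = [
    StateChange("file", r"C:\Program Files\Internet Explorer\iexplore.exe",
                r"C:\Users\test\Temporary Internet Files\cache.dat"),
    StateChange("process", r"C:\Program Files\Internet Explorer\iexplore.exe",
                r"C:\Users\test\evil.exe"),
]
flagged = suspicious_changes(events)
```

Here the cache write is excluded as normal browser behaviour, while the unexpected process creation is left in `flagged` for an analyst to review.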

Capture-HPC has been widely used and has been described as the state-of-the-art high-interaction client honeypot system in many academic papers, but it has several areas that can be improved:

It does not contain a fine-grained attack detection mechanism, i.e. Capture-HPC cannot tell us which vulnerability is being exploited, or is likely being exploited.

Capture-HPC can only detect attacks that are successful. If, for instance, a web page attacks an ActiveX component that is not installed on the system, Capture-HPC is unable to detect it.

The community would benefit from more regularly updated Capture-HPC Exclusion Lists, particularly for modern browser versions (and greater community sharing).

The goal of this project is to address these issues by extending Capture-HPC to monitor a web page’s interaction with ActiveX controls installed on the system. This will allow Capture-HPC to identify which vulnerability is being exploited.

Furthermore, ideally Capture-HPC should be able to emulate ActiveX controls that are not presently installed on the system. This project will investigate this in the second part of the project.

We believe that continuing to improve Capture-HPC will encourage more automated analysis of malicious websites, helping to detect new generations of client focused attacks and further improve web browser security and safety for Internet users.

Honeynet Project members have developed a number of leading open source low interaction honeypot solutions that are used to automatically record data about network based malware attacks, such as Nepenthes, HoneyTrap and Dionaea (developed during GSoC 2009/2010). We have a number of active international sensor deployments to collect malware globally and are in the process of rolling out a larger low interaction sensor network called HonEeeBox, which was initially based on Nepenthes in 2009/2010 but is about to be upgraded to Dionaea in Q2 2011.

The goal of this project would be to implement a rich web based user interface and management reporting tool to allow analysts to easily explore large amounts of network attack and malware data. Typical tasks will be to view attack rates per sensor, search for high level trends (growth of a particular malware strain over time, attacks from a certain location on a particular day, etc) or drill down into the geographic detail of individual attacks. End users of the system will be the operators of malware collection sensors or interested analysts within the security community or the Honeynet Project.

As input, the system will take reasonably simple CSV type data from low interaction malware sensors (such as timestamp, source IP, attack type, attacker IP address, MD5sum, etc), delivered in the form of an HTTP POST or via XMPP. This data is then automatically enriched by submitting the malware binary samples to multiple sandbox and antivirus engines for analysis (both public and private). The output from this post processing analysis is usually returned as XML or text after a short period, by HTTP or email. We also perform IP geo-location and ASN resolution against IP addresses to provide more information about sources, including latitude and longitude for spatial mapping.
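To make the input side more concrete, here is a minimal sketch of parsing one CSV record from a sensor into a typed event, validating the IP address and MD5 digest on the way in. The column order, the field names and the use of Unix timestamps are assumptions for illustration only; the real sensor feed format may differ.

```python
import csv
import io
import ipaddress
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Union

# Assumed column order for this sketch, not a documented format.
FIELDS = ["timestamp", "sensor_id", "source_ip", "attack_type", "md5"]


@dataclass
class AttackEvent:
    timestamp: datetime
    sensor_id: str
    source_ip: Union[ipaddress.IPv4Address, ipaddress.IPv6Address]
    attack_type: str
    md5: str


def parse_events(csv_text):
    """Yield one AttackEvent per CSV row, rejecting malformed values."""
    reader = csv.DictReader(io.StringIO(csv_text), fieldnames=FIELDS)
    for row in reader:
        md5 = row["md5"].strip().lower()
        if len(md5) != 32 or any(c not in "0123456789abcdef" for c in md5):
            raise ValueError(f"bad MD5 digest: {md5!r}")
        yield AttackEvent(
            timestamp=datetime.fromtimestamp(int(row["timestamp"]),
                                             tz=timezone.utc),
            sensor_id=row["sensor_id"].strip(),
            # ip_address() raises ValueError for anything that is not IPv4/IPv6.
            source_ip=ipaddress.ip_address(row["source_ip"].strip()),
            attack_type=row["attack_type"].strip(),
            md5=md5,
        )


sample = "1301616000,sensor-01,192.0.2.10,smb,d41d8cd98f00b204e9800998ecf8427e\n"
events = list(parse_events(sample))
```

Validating at the point of ingestion keeps malformed sensor submissions out of the later enrichment and reporting stages.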

This data will be persisted in a database (probably PostgreSQL due to its native IP data types and support for CIDR ranges), processed and then presented via a new web interface to multiple distributed analysts. This interesting project and attack/malware data set provides many potential data analysis, information presentation and information security data visualisation options for interested GSoC students.
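One of the typical reporting tasks, attack rates per sensor, can be prototyped with a single aggregate query. The sketch below uses Python's built-in sqlite3 module with an in-memory database purely so that it runs without a server; it is a stand-in for the PostgreSQL database proposed above, and the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attacks ("
    " ts TEXT NOT NULL,"         # ISO 8601 timestamp in UTC
    " sensor_id TEXT NOT NULL,"
    " source_ip TEXT NOT NULL)"  # PostgreSQL could use its inet type here
)
conn.executemany(
    "INSERT INTO attacks (ts, sensor_id, source_ip) VALUES (?, ?, ?)",
    [
        ("2011-04-01T00:05:00", "sensor-01", "192.0.2.10"),
        ("2011-04-01T13:30:00", "sensor-01", "198.51.100.7"),
        ("2011-04-02T08:00:00", "sensor-01", "192.0.2.10"),
        ("2011-04-01T09:15:00", "sensor-02", "203.0.113.4"),
    ],
)

# Attacks per sensor per day; the first 10 characters of ts are the date.
daily_rates = conn.execute(
    "SELECT sensor_id, substr(ts, 1, 10) AS day, COUNT(*) AS attacks"
    " FROM attacks"
    " GROUP BY sensor_id, substr(ts, 1, 10)"
    " ORDER BY sensor_id, day"
).fetchall()
```

The same query shape should carry over to PostgreSQL with a proper timestamp column and a date truncation function in place of `substr`.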

We have a number of prototype reporting interface examples available internally (see HonEeeBox or Ore for some examples), but we would like to develop a new system from scratch that exactly meets our particular requirements.

Background reading and design inspiration might be found by looking at how leading network security and antivirus vendors or open source groups currently present similar information, or by applying skills you bring to the project from your personal experiences and specialisms. Successful students will also be lucky enough to have access to a number of the leading subject matter experts in this field as technical advisors.

We believe that this project is important to the community as it will help researchers to more easily understand the types of attacks routinely occurring on the Internet today.

Updated 03/04/11:
You can find an internal presentation about HonEeeBox by David from our annual workshop in Paris, during the last week of March 2011, here, which provides some additional background information about the project so far (and discusses the future too).

Finally, if you want data, you can find two sample SQLite databases containing historical Dionaea data for use in DB and UI prototyping here, courtesy of Markus (the principal author of Dionaea during GSoC 2009/2010). We'll also soon be able to provide an authenticated publish/subscribe feed of live data (with anonymized sensor source IP addresses) too. Details to follow.

Project Description: The value of the rich sets of data generated by the various Honeynet technologies becomes most evident when researchers can use visualization to assist in the analysis of the datasets. Prior to visualizing datasets there are significant challenges to overcome including dealing with uncertainty, missing data, data integrity, data anonymization, data formatting, data manipulation, and data normalization. Once these challenges have been overcome, there remain additional challenges associated with finding the best visualization to identify the information hidden in the dataset and presenting that information in an interactive format that facilitates additional analysis. There are numerous research projects that can be addressed in this realm. Current priorities include the following:

Improved automated parameterized preparation of datasets for visualization in existing tools (such as those available through SecViz.org)

Creation of new and innovative dashboard approaches that combine data visualization tools and techniques to improve analysis and broaden perspective.

Evolution of tools or creation of new tools to better visualize time series data.

Project scope could include an in-depth approach to solving one of the problems above through new techniques and tools, or a more general approach that addresses some or all of the above for a specific Honeynet dataset, such as the data from our HonEeeBox project described above.

Project Description: There are various sources of data relating to malicious activity on the Internet. These can be collected in many formats and from many sources, including the various collection tools provided by the Honeynet Project. Generally the data is collected as text files or as a database, in some non-standard format. The elements of data within the data set vary widely; however, in most cases there is a key set of metrics that are always related to a specific time and geolocation. The time of the event is generally logged by the data collection tool, and the geolocation can be determined with varying degrees of accuracy using various geolocation tools. Analysis of geo-time series data helps us understand a very basic and fundamental question which is not always understood - "Where and when does malware come from, and where does it attack?"

The goal of this project is to define a basic framework for time series geomapping that consists of the following elements:

A configuration file that defines the elements to be displayed, colours etc.

A parsing engine that reads a configuration file and parses the data into a common format.

A rendering engine (processing and/or processing.js)

The output will be in the form of a map which displays the chosen metric in a latitude-oriented mesh structure, and also in a heatmap style. The latitude mesh/heatmap for each metric of interest can be toggled on/off via radio buttons, tabs or a filter. This map is rendered either in real time or at a speed dictated by the viewer at runtime, allowing the metric to be examined in terms of magnitude and geospace. The speed can be varied as required.
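Before anything can be rendered, the parsed events have to be grouped by time and by map cell. The sketch below shows one possible way to do that data preparation step: it assigns each event to a time bucket and a latitude/longitude grid cell and counts events per cell, producing one "frame" per bucket that a renderer could draw as a heatmap. The input tuple layout, the 5 degree cell size and the one hour bucket are assumptions for this example, and the rendering itself (for instance in Processing or processing.js) is not covered.

```python
import math
from collections import Counter, defaultdict


def grid_cell(lat, lon, cell_deg=5.0):
    """Return the south-west corner of the grid cell containing a point."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError(f"coordinate out of range: {lat}, {lon}")
    return (math.floor(lat / cell_deg) * cell_deg,
            math.floor(lon / cell_deg) * cell_deg)


def heat_frames(events, bucket_seconds=3600, cell_deg=5.0):
    """Group (epoch_seconds, lat, lon) events into per-bucket cell counts."""
    frames = defaultdict(Counter)
    for epoch, lat, lon in events:
        bucket = int(epoch) - int(epoch) % bucket_seconds  # bucket start time
        frames[bucket][grid_cell(lat, lon, cell_deg)] += 1
    return dict(sorted(frames.items()))


events = [
    (1301616000, 48.85, 2.35),    # two events near Paris in the same hour
    (1301616900, 48.90, 2.40),
    (1301619600, -33.9, 151.2),   # one event near Sydney an hour later
]
frames = heat_frames(events)
```

Playing the frames back in order, at a speed chosen by the viewer, gives the animated view described above; the count in each cell is the magnitude to colour.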

Mobile security is an emerging area with little visibility. Handset patch management is a major concern and in the last year exploits have emerged for both Android and the iPhone/iPad iOS, along with the appearance of the first mobile device botnets. This project is intended to be the first step in providing dynamic instrumentation for mobile malware analysis on the Android platform, as well as for use in future Android honeypots. The goal is to create a sandbox built on the Android platform to monitor and log events that an analyst could use to determine if something malicious has occurred and what a particular application is doing while running on the mobile device. Some key steps in building this system would be to:

Identify, in the Android source code, the loading and the execution of an APK:

loading of an APK

execution of the classes .dex

Permissions

Mobile phone information

Tracing method calls (android API)

Saving network traffic

Modify the previous parts identified, in order to :

enable classical debug : step by step, breakpoints ... (with JDWP ?)

use an external program to debug code (baksmali, androguard); it must be free software

change information (IMSI, IMEI ?) about mobile phone

save all information (files, network ...)

So, the project requires the modification of internal parts of the Android source code (like the Dalvik VM and Zygote) to build a modified system that allows the dynamic analysis of a particular application.

In this project we can perhaps modify the android emulator to make a link between the running application and an external program to debug applications.

There is also the TaintDroid approach, which was presented in a paper at OSDI '10 and is open source.

This work could build on some previous prototyping work by Honeynet Project members, but it's more focused on low level analysis :

Chengyu Song (CN) implemented a simple syscall tracing prototype, based on the Android emulator, which is very similar to Qebek. It may not be very useful for Java based applications, but as Google has begun accepting native applications, it could be helpful in the future.

Eugene Teo (SG) has a proof-of-concept ptrace application-level interception on Android which can let you do simple things like sandboxing and firewalling the browser. It is for a controlled environment and probably not very useful in a real-world scenario. If you have difficulty compiling C++ code or getting ptrace/ARM code to work, he would be happy to share some experiences.

We have used AspectJ to instrument Android applications to add detection of application-level events. We are probably going to shift focus to the Dalvik bytecode level instead, given that source will not be available and converting stack-based VM machine code to register-based VM machine code will be challenging.

We could perhaps customize Droid ROMs or try to build a droid Honeywall.

Mobile security is an emerging area with little visibility. Handset patch management is a major concern and in the last year exploits have emerged for both Android and the iPhone/iPad iOS, along with the appearance of the first mobile device botnets. This project would develop a honeyclient crawler to retrieve Android packages and feed them into an automated analysis engine to attempt to automatically detect rogue software on the official and unofficial Android package sites.

Relevant work:

This work could build on some previous prototyping work by Honeynet Project members:

Chengyu Song (CN) implemented a simple syscall tracing prototype, based on the Android emulator, which is very similar to Qebek. It may not be very useful for Java based applications, but as Google has begun accepting native applications, it could be helpful in the future.

Eugene Teo (SG) has a proof-of-concept ptrace application-level interception on Android which can let you do simple things like sandboxing and firewalling the browser. It is for a controlled environment and probably not very useful in a real-world scenario. If you have difficulty compiling C++ code or getting ptrace/ARM code to work, he would be happy to share some experiences.

Ryan Smith (US) used AspectJ to instrument Android applications to add detection of application-level events. He is probably going to shift focus to the Dalvik bytecode level instead, given that source will not be available and converting stack-based VM machine code to register-based VM machine code will be challenging.

This project can probably be split into 2-3 parts (which are complementary):

1) the crawler (and classification?). The crawler must be able to get APK files from both the official Android market and unofficial markets. We think that the crawler might be used with a classification system to organize all files and more ...

2) the analysis honeypot. The honeypot can be a modified Android emulator (to configure characteristics) with a modified virtual machine to log every interesting piece of information (and provide a way to communicate with an external program outside of the emulator). It will also have a detector to more easily catch suspicious APKs (for example, in the recent DroidDream malware the "sqlite.db" was in fact another APK).
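A detector for the "APK hidden inside an APK" trick mentioned above could start from a very simple check: an APK is a ZIP archive, so any bundled file that is itself a ZIP containing a manifest or Dalvik bytecode deserves a closer look, whatever its file name says. The sketch below is one possible heuristic under that assumption, using only the Python standard library; the function names are hypothetical and the sample archive is built in memory purely for demonstration.

```python
import io
import zipfile

ZIP_MAGIC = b"PK\x03\x04"  # local file header signature of a non-empty ZIP


def looks_like_apk(data):
    """Heuristic: a ZIP archive that carries a manifest or Dalvik bytecode."""
    if not data.startswith(ZIP_MAGIC):
        return False
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as inner:
            names = set(inner.namelist())
    except zipfile.BadZipFile:
        return False
    return "AndroidManifest.xml" in names or "classes.dex" in names


def find_embedded_apks(apk_bytes):
    """Return names of entries inside an APK that are themselves APK-like."""
    hits = []
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as outer:
        for info in outer.infolist():
            if looks_like_apk(outer.read(info)):
                hits.append(info.filename)
    return hits


def _make_zip(files):
    """Build a ZIP archive in memory from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    return buf.getvalue()


# Demonstration data: an outer package hiding a second package as "sqlite.db".
payload = _make_zip({"AndroidManifest.xml": b"<manifest/>",
                     "classes.dex": b"dex\n035\x00"})
sample = _make_zip({"AndroidManifest.xml": b"<manifest/>",
                    "classes.dex": b"dex\n035\x00",
                    "assets/sqlite.db": payload,
                    "res/raw/readme.txt": b"hello"})
found = find_embedded_apks(sample)
```

A check like this only catches unobfuscated payloads; encrypted or encoded payloads would need other detection techniques.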

While there are many options that would be appropriate for mobile malware analysis, a first thought for a standalone project is to build an IDE-like GUI for visually viewing the reversed smali code for Android malware. Components might include:

Graph-based UI displaying control flow of the code

Links from Graph View to Source View

Function/Object->Method List / Filter

Strings List / Filter

Flow in / out from a given point.

Function and variable renaming

Annotations, and notes
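As a starting point for the method and strings lists above, the sketch below indexes a piece of disassembled smali text: it records each `.method` declaration and each `const-string` literal together with the method it appears in. It is deliberately simplistic (line-based regular expressions, no real parser), the function name is hypothetical, and the sample smali is made up for the example.

```python
import re

METHOD_RE = re.compile(r"^\s*\.method\s+(?P<sig>.+)$")
STRING_RE = re.compile(
    r'^\s*const-string(?:/jumbo)?\s+\w+,\s*"(?P<text>.*)"\s*$')


def index_smali(source):
    """Return (methods, strings) found in one smali file's text.

    methods: list of (line_number, method_signature)
    strings: list of (line_number, enclosing_method, literal)
    """
    methods, strings = [], []
    current = None
    for lineno, line in enumerate(source.splitlines(), start=1):
        m = METHOD_RE.match(line)
        if m:
            # The signature is the last token, after access modifiers.
            current = m.group("sig").split()[-1]
            methods.append((lineno, current))
            continue
        if line.strip() == ".end method":
            current = None
            continue
        s = STRING_RE.match(line)
        if s:
            strings.append((lineno, current, s.group("text")))
    return methods, strings


SAMPLE = """\
.class public Lcom/example/Demo;
.super Ljava/lang/Object;

.method public static send()V
    .registers 2
    const-string v0, "http://example.invalid/gate"
    const-string v1, "imei"
    return-void
.end method
"""
methods, strings = index_smali(SAMPLE)
```

Keeping the line number with every entry is what would let a GUI link a string or method in a list view back to its place in the source view.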

Students may choose to operate directly on Dalvik bytecode (.dex and .odex), or on the intermediate language Smali, used to represent Dalvik bytecode in an assembly-like syntax. Projects such as Androguard, APKtool, and Smali/Baksmali may be useful in providing basic Dalvik bytecode analysis, although a successful proposal will provide additional features beyond what is provided in these tools.

Students may also use tools like IDA (freeware only), Eclipse, yEd, or JGraphT to provide the visualization and UI, but they should provide significant code for analytic features apart from displaying the graph representation. Some of these features are listed above; they should give the analyst deeper insight into the application and allow them to quickly search and move around within the sample being analyzed.

Creativity and uniqueness are encouraged; however, proposals should be clear on what they expect to produce, and include a plan to achieve that within the short timeline of a single summer.

Related work:

We have code samples for various mobile malware such as Geinimi and SMS.Trojan. Example analysis:

Right now Ryan Smith has nearly 4000 nonclassified "off market" free apps, and a handful of the recent known malicious packages. He has some analytics of permissions requested in the manifest across the 4000 apps, and is currently working on static analysis of the decompiled smali files.
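The kind of permission analytics described above can be approximated in a few lines once each manifest is available as text XML. The sketch below counts how often each permission is requested across a set of manifests. It assumes the manifests have already been decoded from Android's binary XML format (for example by a tool such as apktool or androguard); the sample manifests are invented, and this is not Ryan's actual analysis code.

```python
import xml.etree.ElementTree as ET
from collections import Counter

ANDROID_NS = "http://schemas.android.com/apk/res/android"


def requested_permissions(manifest_xml):
    """Return the sorted set of permissions one decoded manifest requests."""
    root = ET.fromstring(manifest_xml)
    name_attr = f"{{{ANDROID_NS}}}name"  # ElementTree's {namespace}attr form
    return sorted({el.get(name_attr)
                   for el in root.iter("uses-permission")
                   if el.get(name_attr)})


def permission_frequencies(manifests):
    """Count, across many apps, how many apps request each permission."""
    counts = Counter()
    for xml_text in manifests:
        counts.update(requested_permissions(xml_text))
    return counts


APP_A = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.a">
  <uses-permission android:name="android.permission.INTERNET"/>
  <uses-permission android:name="android.permission.SEND_SMS"/>
</manifest>
"""
APP_B = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.b">
  <uses-permission android:name="android.permission.INTERNET"/>
</manifest>
"""
counts = permission_frequencies([APP_A, APP_B])
```

Comparing such frequencies between known-malicious packages and the wider corpus is one simple way to spot unusual permission combinations.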

We have an existing tool called androguard (LGPL, full Python) that can already be used to work with APK files. The tool manages class/dex files correctly and adds an analysis module to easily search fields, methods, strings and variables. Background posts about DroidDream analysis:

Project Description: Many of today's most advanced attacks now happen at the web application layer with dire consequences: from web defacements and joining web servers into botnets to blackhat SEO and using web servers to deliver drive-by-download and scareware attacks against end users. Capturing web application attacks is a challenging undertaking as there is a wide range of vulnerability types that are being attacked. Glastopf (http://glastopf.org, http://www.honeynet.org/papers/KYT_glastopf) is a web application honeypot developed during previous GSoCs which emulates thousands of vulnerabilities to gather data from attacks targeting web applications. The principle behind it is very simple: reply with the correct response to the attacker exploiting the web application.
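The "reply with the correct response" principle can be illustrated with a toy classifier that inspects the query string of an incoming request, guesses which vulnerability type is being probed, and returns a plausible canned response. This is only a sketch of the general idea under simple assumptions (two attack types, fixed regular expressions, hard-coded replies); it is not Glastopf's code, and all function names are hypothetical.

```python
import re
from urllib.parse import parse_qsl, urlsplit

RFI_RE = re.compile(r"^(?:https?|ftp)://", re.IGNORECASE)  # remote inclusion
LFI_RE = re.compile(r"(?:\.\./)+")                          # path traversal


def classify_request(request_uri):
    """Return (attack_kind, parameter, value) for the first suspicious param."""
    query = urlsplit(request_uri).query
    for name, value in parse_qsl(query, keep_blank_values=True):
        if RFI_RE.match(value):
            return ("rfi", name, value)
        if LFI_RE.search(value):
            return ("lfi", name, value)
    return ("unknown", None, None)


def emulate_response(request_uri):
    """Return a (status, body) pair that looks like a vulnerable application."""
    kind, _param, value = classify_request(request_uri)
    if kind == "rfi":
        # A real honeypot would fetch and analyse the remote file; this
        # sketch only pretends that the inclusion succeeded.
        return "200 OK", "<html><body>included</body></html>"
    if kind == "lfi" and value.endswith("etc/passwd"):
        return "200 OK", "root:x:0:0:root:/root:/bin/sh\n"
    return "404 Not Found", ""


rfi = classify_request("/index.php?page=http://evil.invalid/shell.txt?")
lfi_status, lfi_body = emulate_response("/view.php?file=../../../../etc/passwd")
```

The static nature of such pattern-and-reply rules is exactly the limitation that a more dynamic emulation approach would aim to remove.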

The goal of this project is to extend Glastopf to further enhance its capabilities to capture web application attacks. Possible extensions include:

Share dorks using a centralized approach in combination with the HPFeed framework. This can be considered a new way to share data between malware intelligence collection and analysis tools.

Support for using the PHP interpreter for vulnerability emulation. Currently Glastopf uses a simple pattern based string replacing method which is very static. A dynamic approach would be beneficial to enhance Glastopf’s ability to analyze the PHP sent by the attacker to craft an appropriate response.

Instead of the dork list, generate a web site from a set of dorks to hide the honeypot's presence. It would be worth checking if automatically generated content provides more than just hiding the honeypot's presence.

Add generation of proper responses for SQL injections, XSS, CSRF, XML entity injection, HTML injection and code execution. This could indicate if we are able to provoke further actions of those types of attacks.

Add data analysis capabilities to Glastopf. (This could also be merged into the data visualization project.)

Automated PHP analysis using a PHP sandbox and web server botnet monitoring using HALE (which was a previous GSoC 2010 tool). This could lead into some exciting botnet research in this area.

Project Description: Malware is the raw material associated with many cybercrime-related activities. Many Honeynet Project and security community members could benefit from a fully open sourced sandbox/sandnet solution to either locally analyse the malware they have collected, send malware samples to a central analysis platform, or be a node in an analysis cluster architecture helping the community. Various public sandboxes exist (Threatexpert, Anubis, CWSandbox ...) and some chapters have their own solutions, whether barebones or virtualised, but all of these may lack a standard analysis model and some tools to extract critical information, even though they may all complement each other.

Cuckoobox, a lightweight solution that performs automated dynamic analysis of provided Windows binaries, was developed as part of GSoC 2010 (more information is available at http://www.cuckoobox.org/index.php). It is able to return comprehensive reports on key API calls and network activity.

The goal of this GSoC 2011 project is to extend Cuckoo. Possible avenues are extending the set of APIs that are currently monitored and implementing database reporting.

Project Description: Virtualization provides new opportunities to observe the activity of a
running system in a very non-intrusive way. Different actions can be
observed without requiring changes to be made or software to be
installed in the host being monitored, which is desirable because such
changes may be noticed by or even interfere with an object under
investigation. Use-cases for such systems range from malware analysis
tools (e.g. sandboxes), high-interaction honeypots (i.e., monitoring
the actions of intruders before, during, and after breaking into a
system), to system health monitoring of the operating system. The
project involves the development of a good framework for Virtual
Machine Introspection which will both allow for initial monitoring
capability, and provide a foundation upon which future work can be
based.

During this project the student will work on generic VM introspection
for the most important actions related to the use-cases presented
above. The main goal of the introspection is to be flexible and
extendable with respect to the virtualization technology and operating
system.

For the virtualization technology, we, as a community, want to start
with at least two different types of open source virtualization
technologies (e.g., Xen and KVM) in order to implement the necessary
abstraction layer. The layer allows for the addition of other
technologies in later projects. On the OS level (i.e., the “Guest” VM
being monitored), the introspection needs to be designed in a way such
that it can be used for different operating systems (e.g., Windows and
Linux) and versions of the same OS (e.g., Windows XP and Windows 7).
The necessary adjustments must be as minimal as possible in order to
allow for easy extensibility as new operating system targets are
added. Parts of the adjustments can be automated. This is a first step
in the project in order to save the student time ;)

The technical component aims at monitoring different actions that
occur inside the operating system. This includes (but is not limited
to) monitoring:

Files and I/O

Registry (for Windows)

Process operations (creation, termination, …)
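One way to picture the abstraction layer described above is as two independent interfaces: one that hides the virtualization technology (how guest memory is read) and one that hides the guest operating system (how that memory is interpreted). The sketch below shows that separation with entirely hypothetical class and method names, and uses fake in-memory implementations so that it runs without a hypervisor. Detecting process creation by comparing successive snapshots is a simplification chosen for the example, not a proposed design.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class GuestEvent:
    kind: str     # "file", "registry" or "process"
    action: str   # e.g. "created", "terminated"
    subject: str  # name of the affected object


class Hypervisor(ABC):
    """Technology-specific access to guest memory (e.g. Xen or KVM)."""

    @abstractmethod
    def read_memory(self, address: int, length: int) -> bytes: ...


class GuestProfile(ABC):
    """OS-specific knowledge of how to interpret guest memory."""

    @abstractmethod
    def list_processes(self, hv: Hypervisor) -> list: ...


class Introspector:
    """Combines any Hypervisor with any GuestProfile."""

    def __init__(self, hv: Hypervisor, profile: GuestProfile):
        self.hv, self.profile = hv, profile
        self._known = set()

    def poll(self):
        """Report process changes since the previous poll."""
        current = set(self.profile.list_processes(self.hv))
        events = [GuestEvent("process", "created", n)
                  for n in sorted(current - self._known)]
        events += [GuestEvent("process", "terminated", n)
                   for n in sorted(self._known - current)]
        self._known = current
        return events


class FakeHypervisor(Hypervisor):
    """Test double backed by a bytes object instead of a real VM."""

    def __init__(self, memory: bytes):
        self.memory = memory

    def read_memory(self, address, length):
        return self.memory[address:address + length]


class FakeProfile(GuestProfile):
    """Pretends the process list is NUL-separated names at address 0."""

    def list_processes(self, hv):
        raw = hv.read_memory(0, 64)
        return [name.decode() for name in raw.split(b"\x00") if name]


hv = FakeHypervisor(b"init\x00sshd\x00")
vmi = Introspector(hv, FakeProfile())
first = vmi.poll()
hv.memory = b"init\x00malware\x00"
second = vmi.poll()
```

With this split, supporting a new hypervisor or a new guest OS version means adding one class rather than touching the monitoring logic.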

A stretch goal and helpful tool for both debugging and forensics is
the extraction of memory. A full system memory dump can be very
helpful for later analysis of the state of an operating system after
an attack or system crash. The information about memory from within a
process is very valuable for malware analysis and threat
investigation.

There are some tools and libraries on which this work can be based
(although it is not required). These include Qebek, pyQemu, and VIX,
which are all Virtual Machine Introspection tools developed by
Honeynet Project members.

Project Description: IPv6 and Unicode DNS present challenges for our current honeypots and data analysis tools. With the imminent potential exhaustion of IPv4 address space and the recent adoption of Unicode (non-English character) DNS names, many commonly used network analysis and honeypot tools have yet to support or adapt to these fundamental changes, leaving security researchers and network operators with considerable gaps in their potential armoury of incident response tools. This project will extend the functionality of selected tools (e.g., Honeyd and Honeywall) to provide IPv6 support.
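Two small normalisation steps show the kind of changes tools need in order to handle both address families and internationalised names consistently. The sketch below, using only the Python standard library, canonicalises peer addresses (including IPv4-mapped IPv6 addresses) and converts a Unicode hostname to the ASCII form used in DNS. The function names are hypothetical, and note that Python's built-in `idna` codec implements the older IDNA 2003 rules.

```python
import ipaddress


def normalise_peer(address):
    """Return (ip_version, canonical_text) for an IPv4 or IPv6 address.

    IPv4-mapped IPv6 addresses such as ::ffff:192.0.2.1 are unwrapped so
    that the same attacker is not logged under two different keys.
    """
    ip = ipaddress.ip_address(address)
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return ip.version, ip.compressed


def dns_ascii_name(hostname):
    """Convert a Unicode hostname to its ASCII-compatible (punycode) form."""
    return hostname.encode("idna").decode("ascii")


mapped = normalise_peer("::ffff:192.0.2.1")
native = normalise_peer("2001:0db8:0000:0000:0000:0000:0000:0001")
ascii_name = dns_ascii_name("bücher.example")
```

Storing addresses and names in one canonical form makes later searching, correlation and geolocation lookups far less error-prone.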

- continued development of low interaction honeypots with true protocol awareness and the ability to programmatically detect and extract payloads from previously unknown malicious shellcodes at close to typical WAN line speeds. Comparisons of the effectiveness of such systems.
- extending the capabilities of VoIP honeypots, to more accurately emulate vulnerable PABX/VoIP systems and capture and track VoIP attacks against telephony infrastructures
- expanding web application honeypots that detect common attacks against web applications (SQL Injection, SSI / RFI / LFI, etc), learn evolving attack patterns and honeypot discovery techniques (such as tracking the search engine queries used to locate the honeypot) and dynamically adapt themselves to better collect new attacks and their payloads.
- hybrid honeypots that dynamically decide if an attack should be handled by a low or high interaction honeypot, possibly with network traffic replay, to improve scalability and analysis accuracy

Project Description: Libemu is a small library written in C offering basic x86 emulation and shellcode detection. Libemu translates shellcode instructions into the function calls that the shellcode performs, so an analyst can quickly discern the actions of the shellcode and answer questions such as whether the shellcode is downloading a program or executing a process. Libemu is the de facto standard when it comes to analyzing shellcode in an automated way, but emulation performed in software is quite slow, meaning that Libemu can't easily scale for use in high performance environments.

The goal of this project is to replace Libemu's emulation components with a virtualization based approach, since far greater performance can be achieved through hardware accelerated virtualization than through software emulation.

Network sinkholes are servers/software that can respond to high volumes of incoming traffic and can be used to mitigate malware, such as by redirecting all botnet command and control (C&C) traffic to an ISP operated sinkhole server rather than the actual malicious targets. Successful sinkhole systems need to be able to respond to a range of fairly arbitrary network protocols and unexpected inputs, respond in a basic but realistic fashion and support a high volume of concurrent connections whilst at the same time logging all connection activity. Various high end commercial systems are available but no open source implementations are currently available to assist malware researchers and network operators.

Ideally it would be possible to simply perform 'apt-get install honeynet-sinkhole', set some config options and have a fully blown 'scalable' sinkhole type solution in place, although kernel tuning and event-based systems aimed at supporting high rates of concurrent connections (such as XMPP systems based on Erlang, etc) may also be worth investigating.

Ideally the following protocols should be initially supported

* IRC
* HTTP
* FTP
* SQL/OSQL/MSSQL
* DNS

And the solution should provide an extensible, modular approach to allow easy addition of new protocol support.
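The modular, event-based design described above could look roughly like the following sketch: each protocol is a small function that turns received bytes into a reply, and a generic asyncio connection handler takes care of logging and I/O. This is an assumption-laden illustration using Python's standard asyncio library; the handler names, the canned replies and the single-read behaviour are all simplifications, and only two of the listed protocols are shown.

```python
import asyncio
import logging

log = logging.getLogger("sinkhole")


def http_reply(data: bytes) -> bytes:
    """Answer any HTTP request with an empty 200 response."""
    return b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n"


def irc_reply(data: bytes) -> bytes:
    """Answer PING with PONG, anything else with a generic welcome."""
    if data.upper().startswith(b"PING"):
        return b"PONG" + data[4:].rstrip(b"\r\n") + b"\r\n"
    return b":sinkhole 001 guest :Welcome\r\n"


# New protocols are added by registering another reply function here.
HANDLERS = {"http": http_reply, "irc": irc_reply}


def make_connection_handler(protocol):
    """Wrap a protocol's reply function in a logging asyncio callback."""
    reply = HANDLERS[protocol]

    async def handle(reader, writer):
        peer = writer.get_extra_info("peername")
        data = await reader.read(4096)
        log.info("%s connection from %s: %r", protocol, peer, data[:200])
        writer.write(reply(data))
        await writer.drain()
        writer.close()
        await writer.wait_closed()

    return handle


async def serve(ports, host="127.0.0.1"):
    """Listen on every {protocol: port} pair until cancelled."""
    servers = []
    for protocol, port in ports.items():
        servers.append(await asyncio.start_server(
            make_connection_handler(protocol), host, port))
    await asyncio.gather(*(s.serve_forever() for s in servers))
```

It could be started with something like `asyncio.run(serve({"http": 8080, "irc": 6667}))`; a production sinkhole would also need per-connection timeouts, rate limits and durable logging.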

We can potentially provide some fairly high volume real world data sources for testing, if required.

Project 15 - Extending Wireshark Analysis
Mentor name: Guillaume Arcas (FR)
Backup mentor: Hugo Gonzalez (MX), François Hamelin (FR)
Type: Extending an existing tool
Goal at the end of the project: Up to 3 new Wireshark plugins
Description:
Honeynet Project members often use the Wireshark network analysis application to study packet capture (PCAP) data generated by honeynets or malware analysis. However, there are some missing capabilities that, if added, would significantly improve our security incident analysis capabilities.

This project would build and release up to three new Wireshark plugins:

* WireShnork: a Wireshark plugin that would support applying Snort IDS rules and signatures against pcap files. This would be useful for network forensics, allowing analysts to automatically colorise packets that match a particular Snort IDS signature.

* WireSpade: as above for WireShnork, but using the Spade plugin statistical analysis pre-processor for the Snort IDS.

* WireViz: an HTML Visualization plugin for Wireshark that would render the contents of a web page contained within a TCP stream inside a pcap file as it appeared in the original browser, including images and javascript components. This will not replace the existing "Follow TCP Stream" function but extend and enrich it.

Description:
This project included further development of the VoIP module for the honeypot Dionaea (which was developed during GSoC 2009 and GSoC 2010). The VoIP protocol used is SIP since it is the current de facto standard for VoIP. In contrast to some other VoIP honeypots, this module doesn't connect to an external VoIP registrar/server. It simply waits for incoming SIP messages (e.g., OPTIONS or even INVITE), logs all data as honeypot incidents and/or binary data dumps (RTP traffic), and reacts accordingly, for example, by creating a SIP session including an RTP audio channel. As sophisticated exploits within the SIP payload are currently rare, the honeypot module doesn't pass any code to Dionaea's code emulation engine. This is a potential area for future investigation if such malicious messages are detected.
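To make the described behaviour more tangible, the sketch below parses an incoming SIP request and builds the kind of 200 OK reply a honeypot might send to an OPTIONS probe, echoing the Via, From, To, Call-ID and CSeq headers as SIP responses are expected to. It is an illustrative sketch only, not the Dionaea module's code: the function names and the fixed To tag are made up, and compact header forms, multi-line headers and RTP handling are all ignored.

```python
ECHOED = ("via", "from", "to", "call-id", "cseq")  # headers copied to reply


def parse_sip_request(raw: bytes):
    """Split a SIP request into (method, uri, version, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8", errors="replace").split("\r\n")
    method, uri, version = lines[0].split(" ", 2)
    headers = []
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers.append((name.strip(), value.strip()))
    return method, uri, version, headers, body


def build_ok_response(raw: bytes) -> bytes:
    """Build a minimal 200 OK response for the given request."""
    method, _uri, _version, headers, _body = parse_sip_request(raw)
    out = ["SIP/2.0 200 OK"]
    for name, value in headers:
        if name.lower() in ECHOED:
            if name.lower() == "to" and "tag=" not in value:
                value += ";tag=honeypot01"  # hypothetical fixed local tag
            out.append(f"{name}: {value}")
    if method == "OPTIONS":
        out.append("Allow: INVITE, ACK, CANCEL, OPTIONS, BYE")
    out.append("Content-Length: 0")
    return ("\r\n".join(out) + "\r\n\r\n").encode("utf-8")


REQUEST = (
    b"OPTIONS sip:100@192.0.2.5 SIP/2.0\r\n"
    b"Via: SIP/2.0/UDP 198.51.100.7:5060;branch=z9hG4bK-1234\r\n"
    b'From: "scanner" <sip:100@198.51.100.7>;tag=abc\r\n'
    b'To: "scanner" <sip:100@198.51.100.7>\r\n'
    b"Call-ID: 42@198.51.100.7\r\n"
    b"CSeq: 1 OPTIONS\r\n"
    b"Max-Forwards: 70\r\n"
    b"Content-Length: 0\r\n\r\n"
)
response = build_ok_response(REQUEST)
```

Every request would also be logged as a honeypot incident before replying; scanner-specific quirks are where most of the realism work lies.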

The current Dionaea SIP module from GSoC 2010 does not yet fully support some SIP scanning tools, such as sipvicious (svmap.py, svcrack.py, svwar.py), smap, sip-scan, inviteflood, the recent Metasploit VoIP-related modules (scanner/sip/enumerator, voip/sip_invite_spoof, etc.) or the VoIP tools in BackTrack 4. Fully implementing support for these common SIP scanning tools will significantly increase the realism of the Dionaea SIP module and increase the likelihood of detecting more SIP based malicious activity.

There are today a few botnets specifically targeting SIP. The new SIP module would make it easier to track these botnets, perform fingerprinting and help alert the responsible hosts.

The next area in SIP security will be exploiting bugs in different vendors' products. Microsoft Lync uses SIP over TCP and will be a high value target for attackers. Often these installations are connected to the PSTN to make outbound calls. Fraud costs have already run into the millions of dollars.