Project 10 - HoneyProxy - HTTP(S) Traffic Investigation

Project Overview:
The project's goal is to create a full-featured man-in-the-middle HTTP(S) traffic analyzer. HoneyProxy should allow both real-time and log analysis while offering extensive visualization capabilities. To accomplish that, we want to develop HoneyProxy as a browser-based application with a logging core written in Python.

Project Plan:

April 23rd - May 20th: Community Bonding Period

May 21st: GSoC 2012 coding officially starts

May 21st - May 28th: Figure out how to add SSL interception, create a working prototype.

May 29th - June 5th: Finish logging core, including HTTP and HTTPS support. So early? Yes! Most of the logic lives in the frontend/GUI.

June 5th - June 25th: Basic version of the GUI should be present; add abstraction layer for content visualization.

During the last week, I read through the mitmproxy [1] source and docs to understand the project's structure.

I managed to create a working prototype employing libmproxy, the mitmproxy proxy core, for SSL interception. We now have a basic prototype that handles the SSL and interception work (fully mitmproxy-compatible), to which we can add custom code for handling the traffic.

I stripped down the libmproxy configuration options to make HoneyProxy a little bit more lightweight (e.g. no traffic replay).

The prototype already includes a Twisted WebSocket interface to communicate with the GUI.

June 1st

Fixed several bugs - proper termination on ctrl-c and dumping to an outfile work now.

Added authentication to the WebSocket communication. Although we're still a long way from production use, I think it's a good idea to keep these things in mind from the beginning.
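As a sketch of what such an authentication check can look like (the key handling here is illustrative, not HoneyProxy's actual implementation):

```python
import hmac
import os

# Hypothetical sketch of session-key authentication for the WebSocket
# channel; HoneyProxy's real scheme is not shown in this post.
def make_auth_key():
    """Generate a random per-session key to hand to the GUI on startup."""
    return os.urandom(16).hex()

def check_auth(presented_key, session_key):
    """Constant-time comparison to avoid leaking the key via timing."""
    return hmac.compare_digest(presented_key, session_key)
```

The GUI would present this key with every WebSocket request; anything without a valid key is rejected.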

Added a basic WebSocket call for getting finished requests. This way, we can fetch all finished requests at any time. Performance is still poor; it seems to be an issue with the socket.

June 7th

There wasn't much to see or play with until now, because I basically worked on the proxy server to build a solid basis for the GUI. Beginning with this week, I'd like to change that - I started working on the browser side to get some visible results: the initial version of the GUI now shows all traffic as dumped JSON. If a new request comes in, it will be added automatically. :-)

All traffic gets saved in a Backbone.js [2] collection in the browser now. Open the Chrome dev tools console and play with the data, if you want. Some examples:

You can copy the URL and close the GUI at any time. If you open it up again, the GUI fetches all handled traffic from the proxy server and saves it in the traffic object in JavaScript again.

June 14th

API and HTML ports are now configurable...

Found a few bugs in libmproxy and merged them back into the original project. FOSS is great :)

You can now send messages to specific GUI clients in the proxy server. Useful for sending all handled requests on a new session.

Major: The GUI is powered by a Backbone View now (Backbone.Marionette.CollectionView to be specific). It now shows a fully dynamic table containing all requests. If we modify our Backbone collection of flows (flow = req + resp), the table gets updated automatically. The Network Table is highly inspired by the Chrome dev tools and will be our first (and probably main) visualization view. Of course it is not ready yet, but here's a first screenshot. The task for the next weeks will be adding a detail view for specific content types and enriching the table with more data from the traffic.

It looks like WebSocket speed could be a bottleneck when reading big chunks of handled traffic from disk. I replaced the original version with an experimental implementation using Autobahn.WebSocket. This adds a dependency to the project, but it seems to be much faster. Maybe we need to do more work here.

June 20th

Investigated WebSocket performance further and talked extensively with the developer of AutobahnWS. It seems WebSocket text message performance doesn't get above 1 MB/s on a smaller machine. It may be a good idea to lazy-load the response content via a separate JSON interface; such a solution can be plugged in easily if necessary.

Added abstraction layer for content visualization. While this is mainly a requirement for the content preview mode, we are already able to show an icon corresponding to the content type in the main view.
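The idea behind the abstraction layer can be illustrated with a simple content-type dispatch (the icon names here are made up for illustration; this is not HoneyProxy's actual code):

```python
# Map exact MIME types or major types ("image", "text", ...) to icons.
ICONS = {
    "text/html": "icon-document",
    "image": "icon-image",
    "application/javascript": "icon-script",
}

def icon_for(content_type):
    """Pick an icon for a Content-Type header value."""
    # Strip parameters such as "; charset=utf-8"
    mime = content_type.split(";")[0].strip().lower()
    if mime in ICONS:
        return ICONS[mime]
    # Fall back to the major type, e.g. "image/png" -> "image"
    major = mime.split("/")[0]
    return ICONS.get(major, "icon-generic")
```

New visualizers can then register themselves for a MIME type without the main view knowing about them.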

Next step is to add the content preview mode for HTTP Headers and Request content.

June 27th

Added table sort for the traffic table. As this might be a performance bottleneck, HoneyProxy now uses the Google Closure Library as a well-tested and high-performance table sorter. It might be faster to sort the Backbone collection and reconstruct the view, but the current solution is already very fast and much easier to implement.

HoneyProxy now has a proper readme file including instructions on how to set up a demo instance.

Added a SplitPane for the content preview, including an intelligent resizer. We discussed whether we should use iframes or JS dialogs instead. However, the SplitPane seems to be the best solution in terms of UX (Chrome Dev Tools, BURP...). As we want to allow the inspection of several files in parallel, I added a popout feature that opens an iframe with the content.

Fixed a few neat bugs.

Major: Added Preview Mode and Header View. HoneyProxy now looks like a full traffic inspector! There are still some problems that should be worked out, but I estimate that we should reach beta status very soon. Yay!

July 4th

Updated HoneyProxy to support the latest mitmproxy release. We now have support for transparent mode on Linux!

Added an HTTP server for serving content - you can now download both request and response payloads from the server. As this might expose sensitive data, downloads are only possible with the GUI authentication key, which HoneyProxy takes care of automatically.

Added support for JS Flows. (minor)

Next: We are currently preparing for a BETA release due next week (see the milestone on GitHub).

The SplitPane supports horizontal alignment now.

July 11th

Added a form data / payload field. You can now download & inspect the request content. This was a real blocker for testing a release candidate; we now distinguish between raw payload and form data in a similar way to the Chrome Dev Tools.

Added a RAW panel for inspecting the HTTP flow in plaintext. We currently cheat a little on the HTTP version (it is always reported as 1.1), but we will get access to this data with the next stable release of mitmproxy.
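The RAW panel essentially reassembles a flow into its plaintext wire format. A minimal sketch of that reassembly (assuming headers as a list of pairs; this is not HoneyProxy's actual data model):

```python
def raw_request(method, path, headers, body=b"", http_version="HTTP/1.1"):
    """Reassemble a request into plaintext HTTP.

    Sketch only. As noted above, the version is currently
    hard-coded to HTTP/1.1 in HoneyProxy as well.
    """
    lines = ["%s %s %s" % (method, path, http_version)]
    lines += ["%s: %s" % (k, v) for k, v in headers]
    # Request line and headers, a blank line, then the raw body bytes.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("latin-1") + body
```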

HoneyProxy now has a proper buildfile.

Updated README to include proper installation instructions.

A preview version is ready for testing! We will post an announcement this week.

HoneyProxy now uses google-code-prettify for syntax highlighting. As performance is a big deal here again, we decided to limit highlighting to smaller files. If you want to test it out, check out the beta, it's included.

There are also some other minor changes in the beta version - the currently selected row is now highlighted, images have their own icon and the certificate directory bug is fixed.

(Everything covered below this point is not included in the beta anymore)

HoneyProxy now comes with its own copy of netlib to stay independent of the latest stable version of mitmproxy.

Added a proper templating system to HoneyProxy. This causes more network requests, but well - it's localhost.

Added information about the server certificate to the GUI. It's placed in the former "Headers" tab which is now called "Details".

Added Proof-of-Concept version of our DirDumper. An explanation + usage instructions will be added when it's ready.

July 24th

Finished working on the DirDumper. HoneyProxy now comes with a --dump-dir ./dumpdir/ option. When active, HoneyProxy stores all response bodies in a folder structure. E.g. a request to www.foo.com/bar/baz.zip places baz.zip in ./dumpdir/www.foo.com/bar/. Another example usage would be dumping all response bodies from a saved traffic flow: honeyproxy.py -r traffic -n --no-gui --dump-dir ./dir/. The DirDumper is also exposed in the web interface, which shows the directory structure.

[tl;dr: technical description of the DirDumper follows] The implementation of the DirDumper is a little bit problematic, as both foo.com/bar and foo.com/bar/baz can be valid files at the same time. However, we cannot create both a folder and a file called "bar" in the same directory. A possible approach would be using folders for everything and placing __resource__ files in them. While this would be a much more consistent structure, it doesn't represent the file system very well. As this view is for visualization purposes only, we took the approach of appending [dir] to conflicting folders. Another issue we ran into was that path length is strictly limited, with no real Python library addressing these per-OS limits. Our current solution truncates both file and directory names if they get too long.
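The conflict rule above can be sketched roughly like this (a simplified illustration, not the real DirDumper, which also truncates over-long path components):

```python
import os

def dump_path(dump_root, host, url_path):
    """Map a request URL to a path under the dump directory.

    If a path segment is needed as a directory but a *file* with that
    name already exists, the directory gets a "[dir]" suffix so both
    can coexist, mirroring the approach described above.
    """
    parts = [p for p in url_path.split("/") if p]
    if not parts:
        parts = ["index"]
    *dirs, filename = parts
    current = os.path.join(dump_root, host)
    for d in dirs:
        candidate = os.path.join(current, d)
        if os.path.isfile(candidate):
            candidate += "[dir]"
        current = candidate
    return os.path.join(current, filename)
```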

Also noteworthy are some neat bugs which are now fixed in both HoneyProxy and mitmproxy. In particular, reading flows with the -r option now works on Windows and is much more robust in terms of failure handling.

The GUI has been migrated to a new authentication architecture using HTTP Basic Auth. The main reason for this has been the problem of ever-expanding configuration options. HoneyProxy can now safely load the config file as JSON rather than storing it in the URL hash.
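With HTTP Basic Auth, a client only needs to attach a standard Authorization header. A quick sketch (the user/key values are illustrative; HoneyProxy's real credential handling is automatic and not shown here):

```python
import base64

def basic_auth_header(user, key):
    """Build an HTTP Basic Auth header from a user and key."""
    token = base64.b64encode(("%s:%s" % (user, key)).encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}
```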

August 4th

Exam time! I was really busy with university and Guillaume was on holiday, so we decided to pause development for a week. In the next two weeks, I will be able to work full-time on HoneyProxy to compensate for that.

August 11th

Working full-time, lots of news! :)

Added a quick start tutorial in the GUI if no flows have been recorded yet. It shows configuration instructions for first-time users.

Tried to port the traffic table and the main layout to ExtJS. This took a lot of work, but ExtJS turned out not to be flexible enough for our remote search.

Ported parts of the JS to RequireJS modules. This might be useful in the future, but we cannot modularize that much currently. Abandoned this for more valuable ideas.

Internal rewrite of the Flow JS model. Request and response are now properly separated and the code is much cleaner! :)

Fixed parsing of Content-Disposition headers.
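For reference, a minimal Content-Disposition parser along these lines (a sketch only; real-world headers also need proper quoted-string and RFC 5987 filename* handling):

```python
def parse_content_disposition(value):
    """Split a Content-Disposition value into (disposition, params)."""
    parts = [p.strip() for p in value.split(";")]
    disposition = parts[0].lower()
    params = {}
    for p in parts[1:]:
        if "=" in p:
            k, v = p.split("=", 1)
            # Strip surrounding quotes from values like filename="x.pdf"
            params[k.strip().lower()] = v.strip().strip('"')
    return disposition, params
```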

Major: Added search capability. HoneyProxy now comes with a search field supporting regular search, regular expressions and inverse search (think of a not() operator)! We also decided to add an option to skip file content for performance. Filters (search queries) also apply to new incoming requests. Technically, the whole process turned out to be pretty complicated, as we need to decode the request twice (Content-Encoding and Content-Type charset) with proper error handling for all those misconfigured web servers out there.
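The two-stage decode and the three search modes can be sketched like this (a simplified illustration of the idea, not HoneyProxy's actual filter code):

```python
import gzip
import re

def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Decode twice: Content-Encoding first, then the charset,
    staying lenient for misconfigured servers."""
    if content_encoding == "gzip":
        try:
            raw = gzip.decompress(raw)
        except OSError:
            pass  # mislabelled encoding: fall back to the raw bytes
    return raw.decode(charset, errors="replace")

def matches(text, query, use_regex=False, invert=False):
    """Regular, regex, or inverse matching on decoded text."""
    if use_regex:
        hit = re.search(query, text) is not None
    else:
        hit = query in text
    return hit != invert  # invert acts like a not() operator
```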

Major: Added capability to highlight requests that match specified criteria. This works similar to the search feature and adds colorful dots to the matching requests.

Refactored the main layout class; HoneyProxy now uses Dojo for the general layout. This is already a huge improvement over the last version, and we plan to move the remaining components (detail tabs) over to Dojo after GSoC.

Improved User Interface - HoneyProxy now looks way more professional and has proper highlight indicators! :)

Updated mitmproxy and HoneyProxy to use argparse. We now have full support for configuration files :)
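One way argparse supports config files is its fromfile_prefix_chars feature, where "@honeyproxy.conf" on the command line expands to the arguments listed in that file. A sketch under that assumption (the option names are illustrative, not HoneyProxy's real flags):

```python
import argparse

# Hypothetical option set for illustration only.
parser = argparse.ArgumentParser(prog="honeyproxy.py",
                                 fromfile_prefix_chars="@")
parser.add_argument("--api-port", type=int, default=8000)
parser.add_argument("--dump-dir", default=None)

# Command-line arguments override the defaults; with
# fromfile_prefix_chars, "@file" pulls arguments from a file instead.
opts = parser.parse_args(["--api-port", "9999"])
```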