Optional Google OAuth2 authentication to allow you to delete or edit your uploads

A simple JSON interface makes it easy to post and download text from other sites

To download text via JSON make a GET request to http://texthole.arkem.org/download/[message_id]

The reply will be a JSON dictionary with the following keys:

body: the text body of the message

editable: whether the requestor can modify the text

creation: creation time of the text

expiry: unix timestamp of the expiry of the text

message_id: the message id of this text

status: True if the request succeeded

error: if status is false, more details appear here

Note: Only status and message_id fields are guaranteed
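
The download flow above can be sketched in a few lines of Python (the message id in the usage example is made up, and the helper names are mine):

```python
import json
import urllib.request

def download_text(message_id):
    """Fetch a text from TextHole and return the parsed JSON reply."""
    url = "http://texthole.arkem.org/download/%s" % message_id
    with urllib.request.urlopen(url) as response:
        return parse_reply(response.read().decode("utf-8"))

def parse_reply(raw):
    """Parse a reply; only the status and message_id keys are guaranteed."""
    reply = json.loads(raw)
    if not reply.get("status"):
        raise RuntimeError(reply.get("error", "unknown error"))
    return reply

# reply = download_text("abc123")  # hypothetical message id
# print(reply["body"])
```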

To upload text via JSON make a POST request to http://texthole.arkem.org/upload providing a JSON dictionary via the data form field.

Possible actions are:

New message: The body key is required

Delete: The delete key is required

Edit message: The body and overwrite keys are required

Request dictionary keys:

body: the text body of the new/modified message

delete: the message id of the message to delete

overwrite: the message id of the message to edit

authenticated: if set, attribute the new message to the user

expiry: number of seconds (max 1yr) the text is to be valid for

Notes: Either body or delete is required. Overwrite and delete require a valid cookie to be sent with the request.
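
An upload can be sketched like this in Python (again, the helper names are mine; the point is that the JSON dictionary travels in the data form field):

```python
import json
import urllib.parse
import urllib.request

def encode_request(payload):
    """Form-encode the request: the JSON dictionary goes in the 'data' field."""
    return urllib.parse.urlencode({"data": json.dumps(payload)}).encode("ascii")

def upload_text(payload):
    """POST a request dictionary to TextHole and return the parsed reply."""
    request = urllib.request.Request(
        "http://texthole.arkem.org/upload", data=encode_request(payload))
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

# A new message needs only the body key; an edit would add overwrite.
# upload_text({"body": "hello world", "expiry": 3600})
```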

Reply dictionary keys:

message_id: the message id of the new/edited/deleted text

status: True if the request succeeded

error: if status is false, more details appear here

expiry: unix timestamp of the expiry of the text

user: username of the owner of the text ("None" for anonymous)

TextHole is missing the following features (maybe coming soon):

An index of available texts

Text search

A javascript client library to make it even easier to integrate with TextHole

A way to authenticate via the JSON library

Please play around with TextHole and send me any bugs or ideas that you find. Please remember that everything in TextHole is public: I can see it, and so can everyone else. Finally, please don't use TextHole for evil.

Friday, September 7, 2012

VirusTotal, the online service that scans uploaded files against dozens of AV engines, has been acquired by Google. Here's the announcement. I think this is great: I'm a big fan of VirusTotal and I am looking forward to what Google and VT can come up with together.

Monday, July 30, 2012

Tavis Ormandy is at it again, this time offhandedly revealing a drive-by code execution vulnerability in Ubisoft's Uplay platform. A malicious website could cause the Uplay browser plugin to execute arbitrary commands on the victim's computer. The attack takes advantage of a feature that allows a visited website to launch a Ubisoft game but does not check that the command that the website issues corresponds to a legitimate game. The issue has been patched in an emergency update from Ubisoft.

Sunday, June 17, 2012

I've been playing John Robertson's YouTube choose-your-own-adventure game The Dark Room and I've been having a great time. However, I need a little help navigating the room (you see, it's dark in there) and so I wrote a program to do a little cartography and create a map of the game.

An abbreviated map of The Dark Room (you can make a complete one with ytvidmap.py)

The map shows the videos that comprise The Dark Room (abbreviated here for space and to limit the spoilers) with the size of each node proportional to the number of views the video has and the colour signifying the number of outbound links from the video. The map was generated by ytvidmap.py from my ytmap repository and is created by processing the YouTube annotations. Sadly, the annotations aren't available from the YouTube GData API, so I process them with regular expressions. The map is a huge boon in navigating The Dark Room but does not make escaping trivial (it's like John anticipated this kind of analysis).
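
The annotation scraping can be sketched roughly like this (the regex and the sample format here are my assumptions, not ytvidmap.py's actual code):

```python
import re

def linked_video_ids(annotation_xml):
    """Extract the outbound video links from raw annotation XML with a
    regular expression (YouTube video ids are 11 chars of [A-Za-z0-9_-])."""
    ids = re.findall(r'watch\?v=([A-Za-z0-9_-]{11})', annotation_xml)
    return sorted(set(ids))  # deduplicate repeated links
```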

After creating ytvidmap.py I realised that this approach could also be used to help me discover YouTube content by seeing who my favourite filmmakers and musicians linked to, and in turn who they linked to. So I created ytusermap.py and started by plotting the people in Lindsey Stirling's YouTube video social network and ended up with a giant mess of relationships that quickly got out of control. After adjusting my scripts to build in some limits I ended up with this diagram of her closest neighbours.

The results aren't as pretty and you miss out on some of pyflag's analytical tools (such as filtering streams by user agents). On the other hand you do gain the ability to filter your output by tcpdump style filter strings and with a little bit of pcap preprocessing from tshark you can perform almost all the same comparisons.

plotpcap requires the python modules dpkt, pcap (from pypcap) and matplotlib. I used the versions available from the Ubuntu 10.04 repository but other versions are probably good too.

Here's some output generated from the same example data as the last post:

IPID versus Packet Number (note that without stream highlighting it gets a bit hard to read)

Monday, February 20, 2012

I've been a bit busy recently as I'm preparing to move across the world to the US to work at a small Internet company in the SF Bay Area. In the meantime, though, my current employer has been kind enough to let me contribute back some of the code we have written for the pyflag project (the link goes to my github page, which has a fork of the project, as the upstream site pyflag.net is down right now). Update: An alternate version (without the feature described below) is available on Google Code.

The new features centre on identifying computers that are all lumped together behind a network address translation (NAT) gateway. The idea is that if you can identify the computers behind the NAT gateway you can attribute traffic to a specific system rather than only to the network itself. The implementation is a set of visualisation tools in pyflag that allow you to plot certain packet header fields against packet numbers or time.

Here's an example:

IPID field plotted against PCAP packet number

The plot takes the IP Identification (IPID) field from the IP header and plots it sequentially against the PCAP packet number (pyflag also supports plotting against time). It looks like a big mess but you can see some lines and maybe some patterns in there. The IPID field is used to associate fragmented packets together for reassembly and it is generally left untouched by NAT gateways. Usefully, different networking stacks have different strategies for picking IPID values.
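
Pulling the field out of a raw packet is a one-liner; a minimal stdlib sketch (pyflag's actual implementation differs):

```python
import struct

def ip_id(ipv4_header):
    """Return the 16-bit Identification field of a raw IPv4 header.
    It sits at byte offsets 4-5, in network byte order."""
    return struct.unpack("!H", ipv4_header[4:6])[0]
```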

In my anecdotal (non-scientifically determined) experience:

Windows machines start at 0 when the computer is booted and increment the value for each packet sent up to 2^16, then start again. In some cases it seems to wrap at 2^15, which to me suggests a signed integer problem, but I haven't conclusively figured out which versions it happens on. Additionally, I've read (but not seen) that some versions of Windows send the field in host order rather than network byte order.

Linux machines pick a random number for the start of the connection and then increment the value for each subsequent packet of the connection. I've heard (but again not seen) that packets with the Don't Fragment bit set get their IPID set to 0 on Linux.

BSD machines (including Mac OS X) pick a random number for every packet.
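
Those anecdotal strategies suggest a crude classifier. This is a sketch of the idea only (the 90% threshold and delta tolerance are arbitrary choices of mine, not pyflag code):

```python
def classify_ipid_sequence(ipids, tolerance=16):
    """Guess the IPID strategy from a sequence of values seen from one host.
    Small forward deltas (mod 2**16) look like a Windows/Linux-style
    counter; large, scattered deltas look like BSD-style random ids."""
    deltas = [(b - a) % 2 ** 16 for a, b in zip(ipids, ipids[1:])]
    small = sum(1 for d in deltas if d <= tolerance)
    return "incrementing" if small >= 0.9 * len(deltas) else "random"
```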

So looking back at our example we can see a haze of small lines and also a couple of longer lines which suggests that we might be looking at one or more Linux boxes along with one or more Windows boxes. To test this theory I looked for any user-agent strings in web traffic and found the following:

User-Agent strings present in the sample PCAP file

Based on those user agent strings it looks like there is at least one Ubuntu system and one Windows system. Also of note is the presence of Java user agent strings as well as Transmission (the Ubuntu Bittorrent client).

If we revisit our previous IPID plot and tell pyflag to colour all the Chrome/Windows user agent string related streams blue we get the following:

IP ID versus PCAP number with Chrome on Windows streams highlighted

From this it becomes clear that there are two distinct lines of IPID growth, which implies that behind this NAT gateway are two Windows systems, one of which was active for longer and even sent enough packets that the IPID value wrapped. Knowing the shape of these lines means that you can associate other traffic (perhaps traffic with no distinguishing application layer features, such as encrypted streams) with a specific computer and any metadata gleaned from other application protocols (like HTTP).

To make this even clearer, there's another header field to consider, this time in the TCP header. TCP has an optional header field called the timestamp value (defined by RFC 1323), which is used to measure packet round trip times. By default Windows systems omit this value while most other systems include it (I've read that Windows can be configured to send timestamps and that in some cases it will use timestamps if the client connecting to it uses them). This means that if we exclude packets that have a TCP timestamp we should be left with all the Windows traffic (assuming we exclude non-TCP traffic as well).

IPID versus PCAP number for Chrome user-agents, minus packets that have a TCP timestamp

After excluding packets with the TCP timestamp option set most of the background packets have been excluded. The remaining packets that don't fall on the lines are likely parser failures or packets generated by a Linux box that do not have a timestamp value for one reason or another (more investigation is required).
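
Detecting the option means walking the TCP options bytes looking for kind 8; a stdlib-only sketch of that walk (pyflag has its own packet parser):

```python
def has_tcp_timestamp(options):
    """Return True if raw TCP options bytes contain the timestamp option
    (kind 8, RFC 1323)."""
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:            # end of option list
            break
        if kind == 1:            # NOP: single byte, no length field
            i += 1
            continue
        if i + 1 >= len(options):
            break                # truncated options
        length = options[i + 1]
        if kind == 8:
            return True
        if length < 2:
            break                # malformed length
        i += length
    return False
```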

So we're convinced that there are two Windows systems on the network and some yet-to-be-determined number of Linux systems. If we change our filter to highlight Firefox on Linux and then plot IPID we get something that looks like this:

IPID versus PCAP number for Firefox sessions on Linux

The things to note here are that the IPID values change dramatically between connections, that in general HTTP traffic seems to be in the minority of the non-Windows traffic, and finally that we're no closer to determining how many Linux systems are present. However, if we consider the TCP timestamp field for a moment we learn that it's generally determined as something like:

timestamp = (wallclock - boottime) * clock_rate

The interesting part in this case is that wallclock - boottime should be unique among the hosts that use the TCP timestamp option and it should increment in a predictable fashion. So if we graph the TCP timestamp value of packets versus their PCAP number we get:

Again we can see that the Firefox traffic accounts for only a minority of packets, and we also see that there are two distinct lines for the first half of the plot. These two lines suggest that there are two Linux systems, and the line fragment at the end probably represents a reboot of one of the systems (and not wrapping, because the timestamp values are 32 bit numbers and the values we see are around 2^18 at their highest) or the appearance of a new one.
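
The line counting can also be done programmatically: subtracting tsval/clock_rate from the capture time estimates each host's boot time, and packets from the same host cluster around the same value. A sketch of that idea (the 100 Hz clock rate and 5 second tolerance are assumptions of mine; real stacks vary):

```python
def count_hosts_by_boot_time(samples, hz=100, tolerance=5.0):
    """Estimate the number of distinct hosts from (capture_time, tsval)
    pairs by clustering the implied boot times (capture_time - tsval/hz)."""
    offsets = sorted(t - ts / hz for t, ts in samples)
    hosts, last = 0, None
    for offset in offsets:
        if last is None or offset - last > tolerance:
            hosts += 1  # offset far from the previous cluster: new host
        last = offset
    return hosts
```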

So at this point I'm convinced that there are two Linux systems and two Windows systems, that most of the Windows packets are HTTP traffic (using Chrome), and that while there is HTTP traffic it accounts for only a small amount of the Linux-related packets. For the remainder of the Linux traffic I'd guess that at least one of the systems is transferring files using BitTorrent, based on the Transmission user-agent that was present before. Maybe if we plot the traffic with the Transmission user-agent we'll be able to tell which computers were running BitTorrent:

TCP Timestamp versus PCAP Packet Number for the user-agent "Transmission"

At first this looks good: the line with the lower timestamp values is associated with Transmission and the higher one is not. Unfortunately this plot is ambiguous, because the third line section is also associated with Transmission traffic and that line could easily belong to the top line section (after a reboot). If instead we ask pyflag to generate a table with only traffic that is not to or from ports 80 or 53 (to eliminate HTTP and DNS), we're left with a lot of connections between high ports transferring lots of encrypted-looking data to our NAT gateway address, which fits the hypothesis of BitTorrent traffic. When we plot the timestamp values again and highlight any packet from our not-HTTP/not-DNS table we get the following:

At this point, once I combine this plot with some analysis of the ports and stream sizes seen, I'm reasonably confident that both of the observed Linux hosts are downloading files over BitTorrent. I'm equally convinced that the Windows systems are not using BitTorrent, or at least that there isn't a significant level of BitTorrent traffic from them during this packet capture.
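
The not-HTTP/not-DNS filter itself is trivial to state; a sketch of the predicate (pyflag expresses this as a table filter rather than Python):

```python
def not_http_or_dns(src_port, dst_port):
    """Keep only flows that touch neither port 80 (HTTP) nor port 53 (DNS)."""
    return not ({80, 53} & {src_port, dst_port})
```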

The above little demo is contrived, but I have found that this kind of analysis can be really useful in characterising the use of a network. This example was constructed from 5 virtual machines: 2 running Windows XP, 2 running Ubuntu 10.04, and a NAT gateway running Ubuntu 10.04 and using iptables/netfilter to do the NATing. Also, just in case you were wondering, the Windows machines were watching YouTube (in particular nyan cat and techno viking) while the Ubuntu systems were each using BitTorrent to download Ubuntu images (12.04 alpha for different architectures).

Future Work

Spring cleaning of the pyflag source (it's a little annoying to build and use right now)

More options on what to graph (maybe a system for generically plotting table information)

Ability to choose what to highlight based off the reverse side of a stream

Now that I've got the links handy I thought I'd also point at Michael Cohen's work. Michael is one of the authors of pyflag (project lead is probably a better description), and it's his ideas that led to the implementation of IPID processing in pyflag.

Monday, January 2, 2012

I was going through the pages of an old defunct blog of mine and I saw this image and thought that I would repost it for old times' sake. This is one of my proudest computer gaming moments of all time (from October 2009).