The crawler website has moved to a new host and a new domain name. crawler.trillinux.org will now redirect to crawler.doxu.org.

The crawler itself will be hosted on the same machine as the website and the database. This will allow more interactive use of the data that wasn’t possible before.

Previously the crawler and database were being hosted by Datz on one of his personal machines. The website was being generated on my personal machine every 5 minutes by querying the database hosted on Datz’s machine and a static copy was then uploaded to the web host. Because of the relatively slow upload speed of residential internet connections this process was slower than it needed to be.

2013 05 02

I updated the G2 Network Map. It now contains data from February 2009 to January 2012.

In total the crawler has collected 2.9 million unique hub IP addresses and 105 million unique leaf IP addresses.

The code I wrote for creating the map is published on github at IPv4 Heatmap source code. If you have a large collection of IP addresses that you want to visualize then let me know and I can help you out.

2012 01 13

So it’s been nearly a year since the last post. For the most part the crawler website has stayed the same with only the occasional bug fix as necessary.

However I have continued to experiment with new ideas and I am taking advantage of the more powerful computer to open up new possibilities. One such experiment is the G2 Network Map. Two years ago I created a small time-lapse video showing the hubs on the network. I continued with that theme by creating a much higher resolution map of the network. Rather than a time-lapse video, it is an overlay of all hubs and leaves that have been seen in the last 2.5 years.

Double click to zoom into an area of the map. Click and drag to pan around. In the upper right you can choose which image to look at; right now the choice is between all hubs or all leaves seen in the last 2.5 years.

The color shows the number of IPs seen in a particular block. The scale runs from black and blue (fewest IPs) to orange and red (most IPs), with yellow, green, and the other colors falling somewhere in between.

IP blocks are labeled based on who owns those IPs. RIPE (Europe), ARIN (North America), APNIC (Asia), LACNIC (Latin America and the Caribbean), and AfriNIC (Africa) are the Regional Internet Registries. They have been allocated large blocks of IPs which they then distribute to organizations in their geographic regions. If you zoom in further you can see blocks allocated to individual companies. I labeled all of the individual company blocks by hand, mostly focusing on areas where there was a lot of activity on the G2 network: mainly Europe in the lower left, but also some of LACNIC in the lower right and a few ARIN blocks in the upper left. If you see a white box around an area but the name doesn't show up, zoom in a few more levels and the name should appear. Along with the name, the IP block itself is also shown.

To make the network map interactive I took advantage of a piece of software called OpenLayers. Normally it is used for showing geographic maps. That is why on the initial view half of the page is missing. Geographic maps are done in a 2:1 ratio and the IP map is 1:1.

To create the images I used the same ipv4-heatmap that was used to create the images for the video, but this time I rendered them at extremely high resolution. Because web browsers don't make it easy to view images that are 16000×16000, I needed an alternative method, which is where OpenLayers came in. The way OpenLayers works is that it loads small images called tiles, which are pieces of the bigger image. I used a set of perl scripts and ImageMagick to chop up the high resolution images into tiles that can then be loaded by your browser as necessary when you pan around and zoom in.
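The chopping itself was done with perl scripts and ImageMagick, so the following is only a rough Python sketch of the tile arithmetic involved: at zoom level z the image is divided into 2^z × 2^z tiles of 256×256 pixels, and each tile corresponds to an ImageMagick-style crop region. The function name and power-of-two sizing are my own, not taken from the actual scripts.

```python
TILE = 256  # standard tile size used by slippy-map viewers like OpenLayers

def tiles_for_zoom(zoom):
    """Yield (x, y, geometry) for each 256x256 tile at a zoom level.

    geometry is an ImageMagick-style WxH+X+Y crop string for the
    zoom-level image, which is (2**zoom * 256) pixels on a side.
    """
    n = 2 ** zoom  # tiles per side at this zoom level
    for y in range(n):
        for x in range(n):
            yield x, y, "%dx%d+%d+%d" % (TILE, TILE, x * TILE, y * TILE)
```

Each (x, y, geometry) triple would then drive one crop operation, with the result saved under a path like z/x/y so the viewer can request exactly the tiles it needs.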

I hope you like it. Let me know if you have any suggestions.

2011 06 07

I purchased a new computer recently to replace several of my old ones. Two computers in particular were responsible for retrieving graph data and plotting graph data respectively. The new system has a slightly different Munin configuration because I switched from Debian to Ubuntu. So the graph colors will be different. Also some of the long term graphs may be blank for up to a day until they get regenerated but no data was lost.

This new system will open up new possibilities for doing more processor and memory intensive analysis of the crawler data. Stay tuned for cool new features.

2010 07 10

Since the middle of August the crawler has been recording the time when hubs join and leave the network. This allows certain time-based trends to be seen. The hub is identified by its IP address. One way to visualize a set of IP addresses is with the Hilbert curve, which was made popular by xkcd. The tool I used is called ipv4-heatmap. By generating a heatmap every 2 hours and then playing the images in order, a time-lapse video is produced. This video shows when users in different parts of the world are online depending on the time of day and the day of the week.
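The key idea behind that layout is the Hilbert curve's distance-to-coordinate mapping, which keeps numerically adjacent addresses spatially adjacent on the map. Here is a minimal Python version of that standard mapping; the function names are illustrative, and ipv4-heatmap itself works at much higher resolution than the 16×16 grid of /8 blocks shown here.

```python
def hilbert_d2xy(order, d):
    """Map distance d along a Hilbert curve filling a 2**order x 2**order
    grid to (x, y) cell coordinates. Standard iterative algorithm."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:           # rotate the quadrant as needed
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def octet_to_cell(first_octet):
    """Place a /8 block (by first octet) on a 16x16 Hilbert grid."""
    return hilbert_d2xy(4, first_octet)
```

Because consecutive distances map to neighboring cells, an allocation like a whole /8 always shows up as one contiguous square-ish region rather than being scattered across the image.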

The world map shows two different data sets. Red circles represent where hubs say they are located; the size of the circle indicates how many hubs report being at that location. Green circles represent how many hubs are in each country, based on mapping their IP addresses to countries with MaxMind's GeoIP. The green circle is drawn either roughly in the center of the country it represents or at the country's capital. The country location data was collected from Freebase.com.

The network size page shows graphically how the network size is changing. Each data point shows the net change in the number of hubs or leaves since the previous network measurement. So for example if one data point reads +200 hubs, then the network gained a net 200 hubs between that measurement and the last one.
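The calculation behind those data points is just the difference between successive measurements; a minimal sketch in Python (the function name is mine):

```python
def size_deltas(counts):
    """Turn successive network-size measurements into the signed
    changes plotted on the network size page."""
    return [later - earlier for earlier, later in zip(counts, counts[1:])]
```

For example, measurements of 1000, 1200, and 1150 hubs would plot as +200 followed by -50.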

This also is the debut of pChart usage on the crawler website. It makes creating beautiful and useful graphs easy. I look forward to using it more in the future.

2009 10 03

The network size is now featured on the front page once again. When the new crawler was implemented that statistic had to be dropped because it was too resource-intensive to calculate given how the new crawler worked. But now that issue has been resolved.

Some background

The number of leaves on the network isn't a good measure of the number of users on the network, because most users connect to 2 or more hubs and are therefore counted multiple times. So the leaf count is approximately double the real network size. The unique leaves statistic that has been brought back counts the number of unique IP addresses on the network and so is much more accurate.
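The deduplication amounts to collecting leaf IPs into a set so each address is counted once no matter how many hubs report it. A minimal sketch, with illustrative names of my own:

```python
def unique_leaf_count(hub_reports):
    """hub_reports maps each hub to the leaf IPs it reported.
    A leaf connected to two hubs appears in two reports but is
    only counted once."""
    seen = set()
    for leaves in hub_reports.values():
        seen.update(leaves)
    return len(seen)
```

With two hubs that share one leaf, the raw leaf count would be 4 while the unique count is 3, which is why the raw figure overstates the network size.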

Website updates

A few minor styling changes were made to the website.

2009 05 25

Not much is going on with the crawler right now. I’ve been pretty busy lately and haven’t had any time to spend on improving the crawler. However there were a few subtle updates to many of the webpages. More detailed descriptions were added to many pages.

The uptime graphs page saw the most changes. Hubs that have been up for longer than 3 days will no longer be included on the graphs. This was done because the graphs were not very useful when the X-axis had to cover hundreds of hours of uptime. New graphs were also added that show the uptimes by country for several of the top countries.

2009 02 12

I’m going to explain how crawlers work. There are three main tasks that a crawler has to take care of:

1. Find new hosts to crawl.

2. Request data from a host being crawled.

3. Display the gathered data to the user.

This design lends itself well to being distributed. Several host crawlers (those that perform task 2) can all be working in parallel and independently. All the host crawlers need is a coordinator (the one that performs task 1) to feed them lists of hosts that aren’t duplicated. The host crawlers then send their responses back to the coordinator which finds new hosts from the responses and then stores the responses. Lastly the aggregator or statistics generator (the one that performs task 3) periodically runs through all the data collected and creates useful ways for the user to view this information.
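As a toy illustration of that division of labor, here is a minimal in-memory coordinator in Python. The class and method names are mine, and the real thing is distributed across machines rather than a single object, but it shows the core duty: hand out each host at most once, and fold newly discovered hosts from crawler responses back into the queue.

```python
from collections import deque

class Coordinator:
    """Toy coordinator (task 1): feeds host crawlers a deduplicated
    stream of hosts and absorbs the hosts found in their responses."""

    def __init__(self, seed_hosts):
        self.seen = set()        # every host ever queued, for dedup
        self.queue = deque()     # hosts waiting to be crawled
        self.responses = {}      # host -> data its crawl returned
        for host in seed_hosts:
            self.add(host)

    def add(self, host):
        """Queue a host unless it has already been handed out."""
        if host not in self.seen:
            self.seen.add(host)
            self.queue.append(host)

    def next_host(self):
        """Give a host crawler its next target, or None if idle."""
        return self.queue.popleft() if self.queue else None

    def report(self, host, discovered_hosts):
        """Store a crawl response and mine it for new hosts."""
        self.responses[host] = discovered_hosts
        for h in discovered_hosts:
            self.add(h)
```

Several host crawlers can call next_host() and report() concurrently (with locking added), while the aggregator periodically reads the stored responses to generate statistics.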

That’s how a crawler works in general terms. But the fun begins when a crawler has to actually be built. One of the most important decisions is how to store the data collected and how to store and distribute the list of hosts that need to be crawled next.

Relational Database Approach

This is the approach that g2paranha takes. It’s an easy and straightforward way to store data for anyone trained on these traditional databases. But the data that a crawler needs to store is for the most part not relational except for the links between hosts. Another problem with relational databases is that many of them lock all of the data whenever data is being written or read. This creates a huge bottleneck in a distributed environment where lots of both of these operations are being performed. So while it is easy to implement it may not be an optimal solution. On the positive side the extremely powerful SQL language is available for extracting statistics from the data.
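To make the trade-off concrete, here is a hypothetical schema sketch using Python's built-in sqlite3 module. These are not g2paranha's actual tables; the point is that once the data is in tables, aggregate statistics are a single SQL query away.

```python
import sqlite3

# Illustrative schema only -- column and table names are my own.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hosts (
        ip        TEXT PRIMARY KEY,
        vendor    TEXT,
        leaves    INTEGER,
        last_seen INTEGER
    );
    -- the links between hosts are the one genuinely relational part
    CREATE TABLE links (
        hub_ip  TEXT,
        peer_ip TEXT
    );
""")
conn.executemany("INSERT INTO hosts VALUES (?, ?, ?, ?)",
                 [("1.2.3.4", "RAZA", 210, 0),
                  ("5.6.7.8", "RAZA", 180, 0),
                  ("9.9.9.9", "GTKG", 40, 0)])

# e.g. hub and leaf counts per vendor, in one statement
rows = conn.execute(
    "SELECT vendor, COUNT(*), SUM(leaves) FROM hosts "
    "GROUP BY vendor ORDER BY vendor"
).fetchall()
```

The downside described above shows up under write-heavy crawling: many crawlers inserting while the statistics generator reads can serialize on the database's locks.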

Non-relational Database Approach

This is the direction I have been heading in for the crawler. It seems to be a good fit for the type of data that the crawler needs to store, but I have very little experience in this area. So far my only forays into this field have been with CouchDB, which is still in heavy development. CouchDB looks promising but I haven’t had much luck with getting it to work.

So if anyone has experience in non-relational databases or in creating distributed crawlers I’d like to hear from you.

2008 11 01

My focus lately has been on hub uptimes. There is a new page showing hub uptime distribution graphs. It gives a visual representation of some of the categories on the uptimes page. The overall hub uptime distribution graph also features two vertical lines. The red line shows where the average hub uptime is and the green line shows where the median hub uptime is. Eventually all of the graphs will have this extra information.

The other major addition is to the uptimes page. The second table of information is new and expands on the information in the first table. The new table shows for each grouping/category:

average

median

minimum uptime

maximum uptime

the total number of hubs that fit this category

the number of hubs below the average for this category

the number of hubs above the average for this category

the ratio of hubs under the average and over the average
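The columns above can be sketched in Python roughly like this; the function and key names are my own, not the crawler's.

```python
from statistics import mean, median

def uptime_summary(uptimes):
    """Compute the second uptimes table's columns for one category
    of hubs, given their uptimes (e.g. in hours)."""
    avg = mean(uptimes)
    below = sum(1 for u in uptimes if u < avg)
    above = sum(1 for u in uptimes if u > avg)
    return {
        "average": avg,
        "median": median(uptimes),
        "minimum": min(uptimes),
        "maximum": max(uptimes),
        "total": len(uptimes),
        "below_average": below,
        "above_average": above,
        # ratio of hubs under the average to hubs over it
        "below_over_above": below / above if above else float("inf"),
    }
```

Note how a few long-lived hubs pull the average well above the median, which is exactly why the table shows both.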

On the vendors page I added the GDNA data back. But for now no hubs will show up as GDNA. GnucDNA does not send any vendor code at all, so the current crawler just shows them as UNKN. The old crawler looked at the hub’s User-Agent to figure out if it was GnucDNA and then set the vendor code appropriately. So while new data won’t be logged the same way, the trend of GDNA can at least be seen again on the yearly vendor graph.

Lastly, there is an experimental feature that shows where each hub is on a geographic map if they provided that information in their profile. Of course most users do not reveal this information and some lie about their location, but it can still be a lot of fun. You get to see that the G2 network has users all over the world and they’re all interconnected. This mapping feature is experimental because it requires a fast computer in order to work smoothly, so give it a try but understand that it may not work well for everyone. In my experience Chrome did really well while Firefox and IE did poorly. If you already have Chrome installed you might give this feature a try in that browser even if you don’t use it for anything else.

Green lines are used to show hubs that are connected together. Clicking on a marker will reveal additional information about that hub:

username if provided

name and version of the software they are using

number of hubs and leaves they are connected to

the actual country they are in based on their IP address and MaxMind’s GeoIP