This is another quick post. I have been working on this small framework for a while now and I decided to publish the code before I completely finished.

The hadoop-dns-mining framework enables large scale DNS lookups using Hadoop. For example, if you had access to zone files from COM, NET, ORG, etc (all free and publicly available), you could take each domain in these files and use this framework to resolve the domains for various record types (A, AAAA, MX, TXT, NS, etc). After resolving the domains, you can run them through an enrichment process to add city, county, lat, long, ASN, and AS Name (via Maxmind’s DBs). And, you could scale out this collection and processing effort using Hadoop, say, running over EC2.

There are some interesting applications of this type of system, like using the existing zone files names to brute force “generating” zone files for TLDs that do not publish them (most ccTLDs do not). For example, like this company has done: http://viewdns.info/data/

For more details check out my github repo. The README covers DNS collection and geo enrichment. I have code checked in that will store this data in Accumulo using a few different storage/access patterns, but more explanation will come later as I have time.

Abstract While playing around with the Nmap Scripting Engine (NSE) we discovered an amazing number of open embedded devices on the Internet. Many of them are based on Linux and allow login to standard BusyBox with empty or default credentials. We used these devices to build a distributed port scanner to scan all IPv4 addresses. These scans include service probes for the most common ports, ICMP ping, reverse DNS and SYN scans. We analyzed some of the data to get an estimation of the IP address usage.

It is a write up about performing an Internet scale port scan using thousands of compromised busybox embedded devices/linux servers.

While this is wildly unethical, and almost certainly illegal, the results of this study are pretty interesting and it is more interesting that the author decided to post all his code and data (~9TB uncompressed, 1.5 TB Compressed) online for free downloads.

The author also posted some interactive web apps that allow exploration of this data set:

It is definitely interesting to see how more and more network/security data is being collected and made available freely on the Internet. I am undecided whether this helps security or hurts security longterm. It definitely makes the situation worse i the short term.