Saturday, September 14, 2013

Masscan: the entire Internet in 3 minutes

Masscan is the fastest port scanner, more than 10 times faster than any other port scanner. As the screenshot shows, it can transmit 25 million packets/second, which is fast enough to scan the entire Internet in just under 3 minutes. The system doing this is just a typical quad-core desktop processor. The only unusual part of the system is the dual-port 10-gbps Ethernet card (most computers have only 1-gbps Ethernet).
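The 3-minute figure is simple arithmetic. Here is a quick sketch, assuming one probe per IPv4 address (2^32 of them, ignoring any excluded ranges) and a steady transmit rate. The function name is only for illustration:

```python
def scan_seconds(packets_per_second: float, targets: int = 2**32) -> float:
    """Time needed to send one probe to each target at a steady rate."""
    return targets / packets_per_second

seconds = scan_seconds(25_000_000)
print(f"{seconds:.0f} seconds = {seconds / 60:.2f} minutes")  # 172 seconds = 2.86 minutes
```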

Masscan is a typical "async/syn-cookie" scanner like 'scanrand', 'unicornscan', and 'ZMap'. The distinctive benefits of masscan are:

speed

better randomization

flexibility

compatibility

These are described in more detail below.

Speed

I gave a talk earlier this year describing the "C10M Problem", and how to design software for Internet scale. Masscan demonstrates some of these principles:

10-gbps Ethernet hardware supports a maximum of about 15 million packets-per-second per port (14.88 million, to be precise). There's nothing particularly difficult about reaching this theoretical limit with special-purpose software. Using a dual-port card, Masscan should therefore be able to reach 30 million packets-per-second. That it can only do 25 million is because my software hasn't been optimized.

The problem with the Linux kernel is that it's general-purpose, and not optimized for any particular task. Thus, the limit going through the kernel is around 2 million packets-per-second.
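The roughly 15 million packets-per-second ceiling comes from standard Ethernet framing: the smallest frame is 64 bytes, and each frame also costs 8 bytes of preamble and a 12-byte inter-frame gap on the wire. A small sketch of that calculation (the constant and function names are mine):

```python
# On-the-wire cost of a minimum-size Ethernet frame, in bytes.
PREAMBLE_AND_SFD = 8   # preamble plus start-of-frame delimiter
MIN_FRAME = 64         # smallest legal frame, including the CRC
INTERFRAME_GAP = 12    # mandatory idle time between frames

def max_packets_per_second(link_bits_per_second: float) -> float:
    """Upper bound on minimum-size frames per second for a given link speed."""
    wire_bits = (PREAMBLE_AND_SFD + MIN_FRAME + INTERFRAME_GAP) * 8  # 672 bits
    return link_bits_per_second / wire_bits

one_port = max_packets_per_second(10e9)
print(f"one port:  {one_port / 1e6:.2f} million packets/second")      # 14.88
print(f"two ports: {2 * one_port / 1e6:.2f} million packets/second")  # 29.76
```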

Randomization

A major feature of async port scanners is how they randomize the order of probes. Those other scanners (scanrand, unicornscan, ZMap) use poor randomization.

My choice is to "encrypt" a monotonically increasing index. Encryption has the two properties I need. The first is that it completely randomizes the output. The second is that it maintains a one-to-one correspondence between the input and output.

The problem with crypto, and the reason nobody else uses it, is that since it's based on binary operations like 'xor', it forces the target range into something that is an exact power of 2.

My solution to this problem is to replace the binary operations with non-binary equivalents. I use an algorithm that looks like DES (one of the first crypto standards), but I replace 'xor' with a mathematically equivalent 'modulus' operation.

Thus, instead of scanning a range of 128 or 256 addresses (ranges that are exact powers of 2), I can scan a range of 113 addresses in random order.
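To make the idea concrete, here is a minimal sketch of a Feistel-style cipher that uses modular addition in place of 'xor', so it can permute a range of any size. This is not Masscan's actual code. The round function (a BLAKE2 hash), the number of rounds, the "cycle-walking" step that retries when a result lands outside the range, and all names are my assumptions for illustration:

```python
import hashlib
import math

def round_fn(seed: int, rnd: int, value: int) -> int:
    """Stand-in round function: any deterministic mixing function works here."""
    data = f"{seed}:{rnd}:{value}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def make_shuffler(n: int, seed: int = 0, rounds: int = 4):
    """Return a function mapping each index in [0, n) to a unique value in [0, n)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Pick a and b with a * b >= n; the cipher permutes [0, a * b).
    a = math.isqrt(n)
    if a * a < n:
        a += 1
    b = a

    def encrypt(x: int) -> int:
        left, right = x % a, x // a      # left in [0, a), right in [0, b)
        for rnd in range(rounds):
            if rnd % 2 == 0:
                # Modular addition plays the role that xor plays in DES.
                left = (left + round_fn(seed, rnd, right)) % a
            else:
                right = (right + round_fn(seed, rnd, left)) % b
        return right * a + left

    def shuffle(index: int) -> int:
        if not 0 <= index < n:
            raise ValueError("index out of range")
        x = encrypt(index)
        while x >= n:                    # cycle-walk back into [0, n)
            x = encrypt(x)
        return x

    return shuffle

shuffle = make_shuffler(113, seed=42)
order = [shuffle(i) for i in range(113)]
assert sorted(order) == list(range(113))  # every address appears exactly once
```

Each round changes only one half, by an amount that depends on the other half, so every round can be undone. That is what preserves the one-to-one correspondence between input and output.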

Flexibility

Because I've got this "encrypted monotonically increasing index", I have enormous flexibility in the scanner:

have thousands of target ranges of any size

scan many (or all) ports instead of a single port

scan UDP, TCP, and ICMP simultaneously

pause a scan, saving only the current index, and restart

seed the scan, so that it generates a different sequence of random numbers
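How a single index can drive all of this might be sketched as follows. The decomposition used here (the index modulo the address count selects the address, the quotient selects the port) and the names `build_picker` and `pick` are assumptions for illustration, not necessarily how Masscan does it. In a real scan the index would first pass through the randomizing function; it is used directly here to keep the sketch short.

```python
import ipaddress
from bisect import bisect_right
from itertools import accumulate

def build_picker(ranges, ports):
    """Map one flat index onto (address, port) across many ranges and ports."""
    nets = [ipaddress.ip_network(r) for r in ranges]
    sizes = [net.num_addresses for net in nets]
    ends = list(accumulate(sizes))       # cumulative address counts
    total_ips = ends[-1]

    def pick(index: int):
        ip_idx, port_idx = index % total_ips, index // total_ips
        k = bisect_right(ends, ip_idx)   # which range holds this address
        offset = ip_idx - (ends[k] - sizes[k])
        return nets[k][offset], ports[port_idx]

    return pick, total_ips * len(ports)

pick, total = build_picker(["10.0.0.0/30", "192.168.1.0/31"], [80, 443])
targets = [pick(i) for i in range(total)]
assert len(set(targets)) == total        # each (address, port) pair exactly once
```

Because the whole scan is a walk from 0 to `total`, pausing only requires remembering the next index, and resuming is just `range(saved_index, total)`.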

Compatibility

Among the operating systems Masscan runs on, only Linux allows me to bypass the kernel. On the other systems it is dramatically slower, only around 300,000 packets/second.

Masscan is broadly compatible with nmap, the most popular port scanner. It supports many of the same command-line options, and produces similar (XML) output. Nmap isn't asynchronous, so this changes the results: Nmap produces a complete record for every computer it scans, whereas Masscan produces smaller records for each port it finds.

Conclusion

My tool is still pretty rough, but I'm talking about it because I'm tired of people telling me to look at "ZMap", which has gotten a lot of press in the last month for being fast. I'm sure their tool is great, and more polished than mine, but ZMap is 10 times slower than Masscan. Indeed, it's Linux, not ZMap, that deserves the credit for speed: in the last two years, Linux has dramatically increased its packet transmission rate to about 2 million packets/second. Somebody ought to re-benchmark "scanrand" and "unicornscan"; I'd bet they'd be close or equal to ZMap in speed.

Replicating results

The above screenshot of 25 million packets-per-second is unusual, so I thought I'd document it in a fashion that can easily be replicated. Pretty much any system with an Intel 10gbps Ethernet card and PF_RING should get similar numbers, but here is my exact system:

The network adapters must be connected to something, like to a hub, or point-to-point to another computer. I connected them to a second nearly identical computer.

The PF_RING driver was installed with the option creating 2 RSS queues/channels per port: insmod ./ixgbe.ko RSS=2,2,2,2

The default configuration file /etc/masscan/masscan.conf has the following contents:

excludefile = /etc/masscan/exclude.txt

adapter[0] = dna0@0

router-mac[0] = 88-77-66-55-44-33

adapter-ip[0] = 192.168.1.2

adapter-mac[0] = 00-11-22-33-44-55

adapter[1] = dna0@1

router-mac[1] = 88-77-66-55-44-33

adapter-ip[1] = 192.168.1.2

adapter-mac[1] = 00-11-22-33-44-55

adapter[2] = dna1@0

router-mac[2] = 88-77-66-55-44-33

adapter-ip[2] = 192.168.1.2

adapter-mac[2] = 00-11-22-33-44-55

adapter[3] = dna1@1

router-mac[3] = 88-77-66-55-44-33

adapter-ip[3] = 192.168.1.2

adapter-mac[3] = 00-11-22-33-44-55

The first thing this file does is specify 3880 "exclude" ranges/addresses. The software has to search this large list for every packet in order to avoid the excluded ranges, which is one of the slowest parts of the system. The more ranges, the slower the lookup. If I don't have this large exclude list, I get 29.76 million packets-per-second -- which is the precise theoretical maximum of dual-port 10-gbps Ethernet. (In other words, it'd be intellectually dishonest to run a benchmark without a large number of excluded ranges.)
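One common way to keep such a per-packet lookup cheap is to sort and merge the ranges once, then binary-search them for each address. Whether Masscan does exactly this is not stated here. The sketch below is an assumption, with hypothetical function names, using Python's standard ipaddress and bisect modules:

```python
import ipaddress
from bisect import bisect_right

def build_exclude(ranges):
    """Sort and merge CIDR ranges into parallel lists of start/end integers."""
    nets = [ipaddress.ip_network(r) for r in ranges]
    spans = sorted((int(n.network_address), int(n.broadcast_address)) for n in nets)
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)  # overlapping or adjacent
        else:
            merged.append([start, end])
    return [s for s, _ in merged], [e for _, e in merged]

def is_excluded(addr: str, starts, ends) -> bool:
    """Binary search: find the last range starting at or before addr."""
    value = int(ipaddress.ip_address(addr))
    k = bisect_right(starts, value) - 1
    return k >= 0 and value <= ends[k]

starts, ends = build_exclude(["10.0.0.0/8", "192.168.0.0/16", "10.1.0.0/16"])
print(is_excluded("10.2.3.4", starts, ends))   # True
print(is_excluded("11.0.0.0", starts, ends))   # False
```

With this layout the cost per packet grows with the logarithm of the number of ranges, so a list of several thousand entries takes about a dozen comparisons per lookup.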

The second part of this file configures the network adapters. Normally under Linux, these adapters would appear as eth3 and eth4. But since we've replaced the standard driver with the PF_RING DNA driver, they now appear as dna0 and dna1. Also, since we've created two queues per network adapter, this creates the four virtual adapters dna0@0, dna0@1, dna1@0, and dna1@1.

Now that we've configured the network adapters, I used the following options on the command-line:

0.0.0.0/0: This tells the scanner to scan the entire Internet.

-p80: This tells the scanner to look for port 80 (http).

--max-rate 30000000: This tells the scanner to try to transmit at 30 million packets/second.

--pfring: This tells the scanner to use the PF_RING drivers.

Note that things get finicky at 10-gbps. I have a second nearly identical system, with an Ivy Bridge processor instead of Sandy Bridge, but somewhere things are getting throttled to 8 million packets/second (1 thread or 4 threads -- the total limit is the same). My point is that even though I've tried to document the exact configuration as closely as possible, I can't even replicate the result on both of my own systems. It may take some debugging.

Running this at full speed tends to melt network equipment (switches, routers, firewalls, IPS), so I advise some caution. Or, if you want something that'll stress test network equipment, this is a great tool for that purpose!

If you need help replicating this, drop me a tweet at @ErrataRob.

Real world

The problem is that I don't have a 10-gbps network to test on. My ISP lets me go up to 100,000 packets/second as long as I deal with the abuse complaints, but that's only around 44-mbps. If anybody has a fast Internet connection they'd like to run this on, please contact me (such as on Twitter @ErrataRob). Scanning the Internet for port 80 doesn't generate a lot of abuse complaints, though at 10 million packets-per-second, you'll be rapidly hitting any Class A.

You can also take a look at Quarkslab's recent libleeloo, which deals with aggregation and randomization of IPv4 address ranges: https://github.com/quarkslab/libleeloo/. You might also be interested in a blog post about the math behind it: http://www.quarkslab.com/en-blog+read+45

It's interesting to read about the address randomization approach, because I just ran into nearly the same problem in the context of benchmarking a key-value store. I ended up using a similar approach to randomize the ordering of a large set of database operations. I originally wanted to use a block cipher, but gave up on it because of the power-of-two problem and switched to linear feedback shift registers; now I might reconsider that. It sounds like if you can use a real block cipher then you must be getting excellent randomness out of it.