Developing a Port Scanner in Python

Hey Hey,

I asked if anyone would be interested in a tutorial on how I built PyScan and after a few answers to that question I've decided to come through and write this tutorial. It actually works out quite nicely, I have to do a write up on the purpose and design of the scanner, so that my replacements can take over the project (a slightly modified version from that which I released here) and maintain it if they run into problems.

Now before we start the tutorial, I'm going to give you a little bit of background on what drove me to develop this, and why I designed it the way I did. The college at which I work has had many problems with viruses (I prefer virii, but for the benefit of most people reading this tutorial, I'll go with the spelling you most commonly use) running rampant on residence networks. These viruses for the most part have been Nachi, Welchia and MyDoom, all viruses which open remote ports. Unfortunately, residence students bring in their own computers, and we really have no control. Our only ability is to shut down the switch port they are connected to, which meant if we caught them remotely we could terminate their connection and they would have to call us to get it turned back on and we could then clean their machine of the infection and patch it (where possible).

This project had been underway for some time before I started here. Several times a day, someone would walk over to a machine, plug it into one of the 13 residence VLANs, and scan manually. They would then record all the PCs with open ports and move over to another VLAN. To me this was a waste of time, so Network Services set up a trunk port, I devised an automated scanning system, and together we came up with an effective way to scan. The result was this scanner (as the back-end anyways).

After I finish this document, I'll begin writing the front-end for the software. Because I wanted this fully automated, I wanted to see the results of the scans from anywhere; my solution to this problem was to write a flag that would output a full HTML page. We maintain a text file of common virus ports, so we wanted to use this for the scans. My solution was to make reading from a file the default and require a flag to scan with ports from the command line. Another problem was the amount of time it took to scan one host at a time. For this problem I investigated threading, and ended up with a program that scans every IP address simultaneously. Now, I wanted simplicity because those maintaining this will not be programmers, and if they are they won't be coding in Python. Python has two threading modules, the older, low-level thread and the newer threading. So for simplicity, and because I'm new to threading, I used the older module. It allows me to create threads, however the program won't wait for the threads; it'll finish and just terminate them (more on this later), so I added a flag to allow the program to sleep long enough to wait for the threads to finish.

The last main question is, "Why Python?". So many consider it to be an outdated scripting language, instead of a powerful programming language. They are very wrong. Python is as powerful and versatile as C and C++. It is portable because it is an interpreted language. This was a primary requirement as we have primarily Windows boxes, but we have a few Linux machines, one of which will hopefully be on another trunk port soon.

I have written several tutorials on Python, so I won't explain the pure basics of the language, if it's covered in another of my tutorials I'm going to hope you'll read them. I will explain everything, just not the intimate details of things I have outlined in other posts. And now, on with the tutorial.

I am going to ignore comments and work with just the code. For your reference, ''' (three single quotes) marks the beginning and the end of a multi-line string, which I use as a block comment in Python. If you have any questions about the comments (I am going to include the source at the end of this tutorial) please feel free to post them here, or send me a PM. I am also not going to be following the code in the order it is displayed, but rather in the order in which it is executed.

This is the first chunk of code that you encounter, and it is also the first chunk of code that is executed. The import command is similar to the #include directive from C/C++. There are two ways of using the import command. You can import the library as I have (import socket) or you could import commands from the library (from socket import *). The asterisk in the second command imports all of the library's commands into your program's namespace, which can cause problems if two modules have a command with the same name. It is usually preferable to import the library, or to import specific commands (from thread import start_new_thread) (more details on the commands later). This would only import the start_new_thread command, allowing you to access it directly: start_new_thread(function to start). However, as I said, the preferred method is to import the library (import thread), and then when calling commands you reference the library that the command comes from (thread.start_new_thread(function to start)).

I have imported several libraries, they are:

socket - Provides socket functionality to the program.
sys - Provides a series of system related commands (in this case sys.exit() will be used).
getopt - Provides for retrieval of arguments from the command line, and provides a series of functions to handle those arguments.
string - Provides string handling/manipulation functions.
time - Provides time and date related functions (in this case the sleep command).
array - Provides a series of commands and functions for handling/manipulating arrays.

The last section of this code is a try and except statement. Since this is the first time that whitespace is used, I will mention how incredibly important it is to use whitespace in Python. Braces don't exist in this language like they do in others to group statements; the only way to group them is with whitespace. A single tab is used to include code. If you have an if statement followed by three lines of code (all tabbed in), when the statement evaluates true, the three tabbed lines will be executed. Now on to the try and except. The code will try to import the thread library, however as it is being phased out (replaced with the newer threading library), newer installs don't include it, however they all include dummy_thread, which provides the same functionality. If the program can't import thread it will receive an ImportError. The except line tells it to run the command import dummy_thread and consider it to be the library thread (as thread) if it can't import the real thread library. Try and except is a great method of handling errors as they arise, and there is a whole series of standard exceptions that are generated by Python. A list of these exceptions can be found in the documentation at http://www.python.org/doc/current/ap...xceptions.html.
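The import block itself is not reproduced in this post, so here is only a minimal sketch of the fallback pattern described above, written for a current Python 3 interpreter rather than the version this tutorial was written against. It assumes the old thread module is reachable under its Python 3 name, _thread, and keeps the alias thread so that later calls look like the ones in this tutorial. The string and array imports are left out because the sketch does not need them.

```python
import socket
import sys
import getopt
import time

# Try the old module name first. On Python 3 this raises ImportError,
# so fall back to the renamed low-level module and keep the old alias.
try:
    import thread
except ImportError:
    import _thread as thread

# Either way, the rest of the program can call thread.start_new_thread().
print(hasattr(thread, "start_new_thread"))
```

The point of the pattern is that the rest of the program never needs to know which of the two imports succeeded.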

if __name__ == "__main__": is very similar to the int main() { line in C. Everything found after that (in C between the {} and in Python after the : and preceded by whitespace) is where the program proper starts running. Again I have used a try and except statement. I try to call the main function (the body of my code), and I pass it all the arguments that follow the program name. argv is a list of the command line arguments. argv[0] would be the name of the program I am running, which is why I don't pass it on. The : after the 1 indicates that I want to pass all the remaining arguments 1, 2, 3.... 100... however many exist. As you can see, argv is located in the sys library. I've allowed for two exceptions. The first is KeyboardInterrupt (this would be ctrl+c on most systems, ctrl+break, or whatever your interrupt is set to). Because the code will be "trying" to execute the main function for the rest of its run, if I enter a KeyboardInterrupt at any time during the execution of this program, that message will be displayed. And yes, to print a line to the screen in Python it is simply print followed by the text in double quotes. For more on the basics of the print command you can view my first Python tutorial, Python Introduction. For information on the formatting options available with the print command, which are similar to the C/C++ standard, you can view my third Python tutorial, Python Introduction #3. I have not used any formatting options in this program (again, I was going for simplicity), but they are very interesting to look into and use. The last except clause catches any other exception (no specific exception is listed). It will simply call the usage() function and then cleanly exit the program. Since the usage() function is called quite frequently, we will look at it before we delve into the main() function.
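As a rough illustration of the entry point described above (not the original PyScan code, which is not shown here), the sketch below wraps a stand-in main() in the same two handlers. The run() helper, the message text, and the usage line are all hypothetical, and it catches Exception rather than using a bare except so that the keyboard interrupt stays in its own branch.

```python
import sys

def usage():
    # Hypothetical usage text, standing in for the real usage() function.
    print("usage: pyscan.py [-h] [-i ip address]")

def main(argv):
    # Stand-in body; the real main() parses the options and runs the scan.
    print("arguments after the program name:", argv)

def run(argv):
    try:
        main(argv)
    except KeyboardInterrupt:
        # Raised when the user presses ctrl+c while main() is running.
        print("Scan interrupted by user.")
        return 1
    except Exception:
        # Any other error: show the usage text and report failure.
        usage()
        return 1
    return 0

if __name__ == "__main__":
    # sys.argv[0] is the program name, so pass everything after it.
    run(sys.argv[1:])
```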

**NOTE** I have removed certain characters from within the printed strings in order to shorten the lines to make this tutorial more readable.

Again you can see the evidence of whitespace, the usage() function is defined using the def command and then everything included in the usage() function is tabbed over underneath it and after a colon. Again, I ask that you remember that I have removed commands, simply to shorten the total length of the tutorial. This function is fairly basic, it calls the banner() function (we'll discuss it next) and then prints the functionality of the program in a fairly common fashion. The banner() function that it calls is also fairly basic, so we'll take a look at it right now.

It is simply a single line that prints the version and my name. The reason I have done this as a separate function is that I also call it after the program exits and during runtime. It saves me a few keystrokes compared with typing it out every time. Again you can see that we defined the function quite easily using the def command. If we were passing arguments they would be located between the (), as you will see later on.
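Since the two functions are not shown in this post, the following is only a sketch of what they might look like. The option letters and their meanings come from the descriptions later in this tutorial; the version string, file name, and wording of each line are placeholders, not the real PyScan text.

```python
VERSION = "0.0"  # placeholder, not the real PyScan version number

def banner():
    # A single line identifying the program.
    print("PyScan v" + VERSION)

def usage():
    # Print the banner first, then one line per command line option.
    banner()
    print("Usage: pyscan.py [options]")
    print("  -h            show this help text and exit")
    print("  -t <seconds>  socket timeout")
    print("  -s <seconds>  time to sleep while the threads finish")
    print("  -w            write the results as HTML")
    print("  -p <ports>    comma separated ports instead of the port file")
    print("  -i <ip>       scan a single IP address")
    print("  -n <ip>       scan an entire class C")
    print("  -u            UDP scan instead of TCP")

usage()
```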

Now for the large part of the code, which I am going to split into smaller chunks. The main() function is the heart and soul of the program. It ties the rest of the program together.

Code:

def main(argv) :

Again we define the function, however this time we tell it that the argument it receives (remember up top we passed it sys.argv[1:]) will be assigned to the variable argv. Had I used from sys import *, the name argv would already refer to sys.argv at the top level of the program, and my parameter of the same name would hide it inside main(). That works, but it is confusing, and it is one more reason to prefer a plain import sys.

This section of code is the sign of a sloppy programmer :). I didn't want to be passing large numbers of variables between functions (I know it's more effective and memory efficient, but at the time I just wanted to get the program written... I'll worry about efficiency on the next rewrite), so as a result I simply defined global variables. As in most languages, a variable assigned inside a function is only visible in that function, unless you specify that you want it to be global using the global keyword. After that I simply define a few default values. This is in case the user doesn't specify them on the command line; I don't want the values to be blank (this will cause a program crash). You can also see that I've called the float() function, which is available in the default command set of Python. What this function does is take a string/int and, if possible, convert it to a floating point value. The reason for this is that a few of our commands require floating point values further down the road.
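A small sketch of the global-variable approach described above. The variable names follow the ones used later in this tutorial, but the default values are made up for illustration, and set_timeout() and describe() are hypothetical helpers, not functions from PyScan.

```python
# Hypothetical defaults; the real values are not given in this post.
timeout = float(3)
timetosleep = float(10)
htmlout = "0"
tcpscan = "1"
classcscan = "0"

def set_timeout(value):
    # Without the global statement, the assignment below would create
    # a new local variable and leave the module-level timeout untouched.
    global timeout
    timeout = float(value)

def describe():
    # Reading a global needs no declaration, only assigning to one does.
    return "timeout=%s htmlout=%s" % (timeout, htmlout)

set_timeout("5")
print(describe())
```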

This first if statement is a little redundant, however it solved an issue I was having. It simply checks to see if the command line was properly received, and if it wasn't, it calls the usage() function and then exits. The next command is another try and except statement. Certain libraries include their own exceptions that can be used, and getopt is one such library. The getopt function is being called from the getopt library (I find it funny that most libraries contain a command of the same name). What this command does is read all the command line options (remember when we called main() and passed it argv[1:], and then it took that in as a single variable called argv?) from the variable argv and assign them to the variable opts. Python is big on what is called a tuple (several values grouped into a single value is the easiest way to describe a tuple). getopt.getopt will go through argv (first argument) and look for any of the following options: h, t, w, p, i, n, u, s, which it is told to look for in the function call (second argument). As you can see, certain options are followed by a colon (:). This tells getopt.getopt that there will be a value following the option. For example, from our usage() function, -h calls the usage() function and doesn't need a value passed to it, however -t is the timeout we wish to set and it takes a value in seconds, therefore t is followed by a : and h isn't. I hope I'm making this clear, just let me know if you need clarification. Now, I mentioned tuples because getopt.getopt returns its results as tuples: each option it finds becomes an (option, argument) pair, where the first portion is the option (-h, -t, -w, etc.) and the second portion is the argument attached to the option (if it was followed by a :); if there's no argument, the second portion is an empty string. The list of these pairs is assigned to the variable opts. The args variable, which is not used in this program after this point, is a list of the arguments that are left over after we finish with the option list. I could have used it as a more effective way to specify the IP address, something I may investigate at the next rewrite.
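To make the return value concrete, here is a short sketch of a getopt.getopt call. The option string is my reconstruction from the descriptions in this tutorial (a colon after every option that takes a value), so treat it as an assumption rather than the exact string PyScan uses; parse() is a hypothetical wrapper.

```python
import getopt

# Assumed option string: t, p, i, n and s take a value; h, w and u do not.
OPTION_STRING = "ht:wp:i:n:us:"

def parse(argv):
    try:
        opts, args = getopt.getopt(argv, OPTION_STRING)
    except getopt.GetoptError as err:
        # Raised for unknown options or for a missing required value.
        print("option error:", err)
        return None
    return opts, args

opts, args = parse(["-t", "5", "-i", "192.168.1.10", "-w", "extra"])
for opt, arg in opts:
    # Each entry is an (option, argument) pair; arg is "" for plain flags.
    print(opt, repr(arg))
print("left over:", args)
```

Note that the option names come back with their leading dash, which is why the tests later in main() compare against '-h', '-w', and so on.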

Code:

for opt,arg in opts :

This next statement of code is a long one, however I will break it up as best I can. Here we investigate the for statement. The variable opts is holding a list containing the information it received from the getopt.getopt command. What we want to do here is cycle through that list (which contains values in the tuple format), and split up each tuple into two separate variables. We also want to execute the truth-testing once for every option. The perfect solution is the for statement. for opt,arg in opts: is a very simple line of code that basically says execute this once for every entry in the list opts and assign the values of the tuple to the variables opt and arg, so that opt will contain the option (-h, -t, -w, etc.) and arg will contain either the value (if the option was followed by a :) or a blank string if it had no following argument. For a simpler example of the for statement visit my second Python tutorial, Python Introduction #2. Now we are cycling through and truth-testing to see which options are set and assigning variables based on that. I will explain the if, elif, else statement here and then examine each test individually. Again, with the if statement whitespace is essential, but the if statement itself is fairly simple: if <variable> == <string> : Remember that all loops and condition tests end with a :, and that you do a value comparison with == not = (this is used for value assignment). I have explained if statements more in depth in my first Python tutorial, Python Introduction. Now let's investigate what happens when truths are discovered.

Code:

if opt == '-h' :
    usage()
    sys.exit()

If the user has specified -h at the command line, we get a result we have seen before: the usage() function is called and then the program exits. This is a standard help response.

I mentioned previously that I ran into problems with the program exiting before the threads had exited. To counteract this (and because I'm too lazy to use the more advanced threading model), I decided to just make the program sleep before exiting to give the threads time to close. If this truth test passes, the program tries to convert the value passed with the option to a floating point. If it cannot do this, the program will display its banner, tell you the format the time to sleep must be in, and exit cleanly. If it hasn't exited and the value was a valid number, then the timetosleep variable is set to the user specified value.
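The code for this option is not reproduced here, so this is just a sketch of the conversion step described above. parse_seconds() is a hypothetical helper; in the program as described, a failed conversion prints the banner and an explanation and then exits, where this sketch simply returns None for the caller to act on.

```python
def parse_seconds(arg):
    # float() raises ValueError when the text is not a valid number.
    try:
        return float(arg)
    except ValueError:
        return None

timetosleep = parse_seconds("2.5")
if timetosleep is None:
    print("The time to sleep must be a number of seconds, e.g. 30 or 2.5")
else:
    print("Will sleep for", timetosleep, "seconds before exiting")
```

The same helper would serve for the timeout option, since both take a number of seconds.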

Along with sockets comes the option of a socket timeout. This was required because if you scan someone with a firewall and there's no default timeout value, the scan will take anywhere from 10 to 20 times longer than normal. This lets the user specify the timeout value and basically acts exactly the same as the previous truth. I have introduced int() in this statement; it should be float(), however I was lazy and sloppy. It works the same way as float(), however it converts the value to an integer rather than a floating point value.

Code:

elif opt == '-w' :
    htmlout = "1"

You will see the results of this later on. This is for HTML output. It simply enables it if the user has specified this option.

**NOTE** I have modified some of the printed strings in an attempt to make this more readable. Please see the source for the complete text.

There is a fair bit of code behind this truth. This flag will tell the program to read ports from the command line in a CSV format. First I tell it that it is reading from the console rather than from the portlist by setting consoleports equal to 1. Next I have a try and except statement which makes use of the string library. string.split is fairly straightforward. It will take a variable and a separator character, and separate the string into a list (very similar to an array) which contains each section of the string as a separate entry. 10,20,30, if split in the same format as the above command and directed to the variable examplelist, would result in examplelist[0] = 10, examplelist[1] = 20 and examplelist[2] = 30. If it fails to split the string, then it will exit in a fashion you are already familiar with. If the program succeeds it will then open a file in write mode. The file pointer will be editportlist, and the command file('consoleports.txt', 'w') tells it to open the consoleports.txt file in write mode. For more information on File I/O check out my third tutorial, Python Tutorial #3. The next line calls the len() function, which is fairly universal in programming languages. It returns the length of something; since I gave it the list itself, and not one of the strings stored in the list, it will return the number of strings the list is storing. I can then use a for statement (we know all about those, don't we) and run it once for the range of 0 to the last entry of the list. I will try to write the file using the write command (yes, the file functions are all built into Python). The write command is called on the file pointer and takes the string to write. If I get an error, I tell the user there was an error writing the file, however I continue running with what information I have. You can also see I'm writing \n to each line. This is the escape sequence for a line feed. For more information on escape sequences you can view my second tutorial, Python Introduction #2. After I have written the information to the file, I close the pointer to the file.
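The steps above can be sketched roughly as follows. This is a Python 3 rendering under stated assumptions, not the original code: it uses the str.split method and the built-in open() in place of string.split and file(), and write_console_ports() and the temporary directory are hypothetical, added so the example can run anywhere.

```python
import os
import tempfile

def write_console_ports(csv_ports, path):
    # "135,445,3127" becomes ["135", "445", "3127"].
    portlist = csv_ports.split(",")
    # The with statement closes the file even if a write fails.
    with open(path, "w") as editportlist:
        for x in range(len(portlist)):
            # One port per line; "\n" is the line feed escape sequence.
            editportlist.write(portlist[x] + "\n")
    return portlist

# Write into a temporary directory so the demo leaves the current one alone.
demo_path = os.path.join(tempfile.mkdtemp(), "consoleports.txt")
print(write_console_ports("135,445,3127", demo_path))
```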

I have grouped these two statements together because they are the heart and soul of the scanner. They decide if the scanner will scan a single IP address (i) or an entire class C (n). The value passed to them is the IP address. I could have used args here, so that it would be a flag without an argument, however, like I said before, simplicity.

Code:

elif opt == '-u' :
    tcpscan = "0"

The last truth test, and the one that still doesn't fully function. This will switch the scan over to a UDP scan, however the results are skewed when you run this.

I'm going to look at the next part of code in chunks, however at times I will reference above explanations. The Class C and Single IP code has many similarities and uses some of the same functions, I don't want to waste your time, or mine, explaining those functions twice.

Code:

if classcscan == "0" :
    txtportlist()

The first thing that happens is I call a function I created called txtportlist(). In this section of code I have chosen a single IP scan. Let's take a look at the txtportlist() function before moving on.

Here I simply look to see if the user has specified ports, or if I'm using the default list. I determine this with a simple if statement. Both of the if statements simply define a file pointer, the only difference is the file that is used. As you can see I don't have a second argument in these file calls, that is because by default python will read the file, however I could have added , 'r' into the brackets. I then read the lines using the filepointer and appending .readlines(). I add these lines to a list and then close the pointer. I then use the length command to determine the number of lines in the list. Both of the resulting variables are global and can be seen by the entire program.
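A sketch of the port list loader described above. Unlike the function in this tutorial, which stores its results in global variables, this hypothetical version takes the file names as parameters and returns the list and its length, purely so that it can be tried in isolation. The default file name and the port numbers in the demo are placeholders.

```python
import os
import tempfile

def txtportlist(consoleports, default_file, console_file="consoleports.txt"):
    # Pick the file written from the command line ports, or the default list.
    if consoleports == "1":
        path = console_file
    else:
        path = default_file
    # Mode "r" (read) is the default, so it can be left out.
    with open(path) as portfile:
        portlist = portfile.readlines()
    return portlist, len(portlist)

# Build a small stand-in port file for the demo.
demo_file = os.path.join(tempfile.mkdtemp(), "portlist.txt")
with open(demo_file, "w") as out:
    out.write("135\n445\n3127\n")

portlist, totalports = txtportlist("0", demo_file)
print(totalports, [int(line) for line in portlist])
```

readlines() keeps the trailing line feed on every entry; int() ignores surrounding whitespace, so the conversion used later in the scan still works.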

Code:

if tcpscan == "1" :
    if htmlout == "0" :
        tcpipscan()

If the scan is TCP based, and I'm not generating html output, I simply call the tcpipscan() function. Let's investigate it next.

This is a very simplistic function. It uses the variables generated in the txtportlist() function to create a for statement that will execute once for each port we wish to connect to. We store the port number in the variable port by reading it from one of the lines of the list and converting it to an integer with the int() function (we work down the list by using the value of the for statement counter). We then create a socket to use. I have explained sockets briefly before. We are using the IPv4 address family (AF_INET), and we are using TCP (SOCK_STREAM) instead of UDP (SOCK_DGRAM) or raw sockets (which are also supported). For more information on creating a socket check out my first Python tutorial, Python Introduction, and my Python Honey Pot Tutorial. Next we modify the timeout for the socket (remember it was set using -t) and then we see if we can connect. We use a truth statement with the not condition and the function scansocket.connect_ex(). This function will attempt to connect and return an error code rather than raising an exception: 0 means the connection succeeded, which is why the test uses not. It takes a tuple (which is why we have two sets of brackets) which contains the IP address and the port to connect to. If the port is open, then we print the results and close the socket. The scan then loops to the next port.
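The loop described above might look roughly like this. It is a sketch rather than the original function: this tcpipscan() takes the address, port list and timeout as parameters and returns the open ports as a list instead of relying on globals, and it closes the socket whether or not the connection succeeded.

```python
import socket

def tcpipscan(ip, portlist, timeout):
    openports = []
    for x in range(len(portlist)):
        # Entries may be lines read from a file, e.g. "135\n".
        port = int(portlist[x])
        scansocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        scansocket.settimeout(timeout)
        # connect_ex() returns 0 on success and an error code otherwise,
        # so "not" is true only when the port accepted the connection.
        if not scansocket.connect_ex((ip, port)):
            print(ip, "has port", port, "open")
            openports.append(port)
        scansocket.close()
    return openports
```

Calling tcpipscan("127.0.0.1", ["135\n", "445\n"], 1.0) would check those two ports on the local machine. Only scan hosts you are authorized to test.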

If we are looking for html output we call the htmltcpipscan() function. I will not be explaining this function again as it is identical to the tcpipscan() function except that it adds html tags around the strings.

I have grouped the next two options (UDP Scan and UDP HTML Output Scan) together because they also do not need explanation. The exact same code is executed as above, however we define the socket type as being SOCK_DGRAM (UDP) instead of SOCK_STREAM (TCP).

Code:

elif classcscan == "1" :
    txtportlist()

Now we are in the Class C scans, again the txtportlist() function is the exact same as above, and you can scroll up to view it.

I have grouped plain text and HTML output together because I will only be explaining the plain text output. Again the two functions are exactly the same. What we do in this case is use our string.split function to cut the IP address into its separate octets. Then we start a for statement from 1 to 254 (the host we wish to scan... since it's designed for class C we are ignoring the network and broadcast addresses). This program assumes there are no subnets, because it was designed for our network and I know how the addressing is done. Inside the for statement we call thread.start_new_thread, which lets us start a new thread with one of our own functions. For those of you that don't know what threading is, it's basically letting one part of the program run in the background while you continue on with the rest of it. Unix people will find it similar in spirit to fork(), or as my Unix prof used to like to say, the parent forks the child (lame groan at the sexual innuendo). thread.start_new_thread requires two arguments... the function you are calling and a tuple of the arguments to pass to that function. That is why here I switched over to passing variables to a function, and this also came as a rewrite to the original code so I had already decided to make the changeover. Let's take a look at the threadedtcpscan now.
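The address handling described above can be sketched on its own, without any threads. class_c_hosts() is a hypothetical helper that uses the str.split method in place of string.split; like the program, it assumes a plain class C with no subnetting.

```python
def class_c_hosts(network_ip):
    # "192.168.1.0" becomes ["192", "168", "1", "0"].
    octets = network_ip.split(".")
    prefix = octets[0] + "." + octets[1] + "." + octets[2]
    hosts = []
    # range(1, 255) yields 1 to 254, skipping the network (0)
    # and broadcast (255) addresses.
    for x in range(1, 255):
        hosts.append(prefix + "." + str(x))
    return hosts

hosts = class_c_hosts("192.168.1.0")
print(len(hosts), hosts[0], hosts[-1])
```

In the scanner itself, each of these addresses would be handed to a new thread rather than collected in a list.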

The first thing I do is create an empty list to store all the open ports I find. Since I'm scanning all the IP addresses at the same time, I don't want to just list the ports randomly; I want them grouped by host. I then create a counter and define my IP address, using the first three octets from the list, and the for statement's counter (x) as the last octet. This ensures that I cover every host and I don't repeat any. The next part is all the same: I use a for statement, and define the ports just as I did in the tcpipscan() function. I also create the socket in the same way. The changes come when I test if I can connect or not. If I can connect, I add the port to my newly created list using the .append function. I also increase my counter by one to keep track of how many ports I've found; I close the socket and repeat this process until I'm out of ports. If I didn't find any ports the thread simply exits, however if I found ports it'll print the IP address followed by a list of the ports it found, using a for statement and the for statement's counter (x) to specify where it is in the list. I also print more LFs so that it's formatted a little nicer.
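Below is a sketch of the per-host worker with one deliberate change from what this tutorial describes: it uses the newer threading module and join() to wait for the workers, instead of thread.start_new_thread followed by a sleep. The function names, the results dictionary and the lock are all my own additions for illustration.

```python
import socket
import threading

def threadedtcpscan(ip, ports, timeout, results, lock):
    openports = []
    for port in ports:
        scansocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        scansocket.settimeout(timeout)
        if not scansocket.connect_ex((ip, port)):
            openports.append(port)
        scansocket.close()
    # Only hosts with at least one open port are recorded, grouped by host.
    if openports:
        with lock:
            results[ip] = openports

def scan_hosts(hosts, ports, timeout=1.0):
    results = {}
    lock = threading.Lock()
    workers = []
    for ip in hosts:
        worker = threading.Thread(
            target=threadedtcpscan, args=(ip, ports, timeout, results, lock))
        worker.start()
        workers.append(worker)
    # join() blocks until each thread is done, so no sleep is needed.
    for worker in workers:
        worker.join()
    return results
```

For a whole class C, hosts would be the 254 addresses built from the first three octets. Waiting with join() removes the need to guess how long to sleep, which is the main reason to prefer it over the older module.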

**NOTE** I have modified the whitespace in this one example to improve readability of the post, please see the source code for the exact whitespacing.

Since the UDP Scan doesn't work properly, I never bothered to thread it. As you can see it executes like a normal UDP scan, however it uses the for statement to cover all the hosts. This is a slow method, but since it doesn't work and for that reason is unused, I see no reason why it would have to run faster.

Well that's about it people, that's all it took to develop PyScan, not really a lot when you think about it. I have plans of fixing UDP, and making it possible to scan UDP and TCP ports in a single session, and I will always keep the community up to date on the progress of the project. This helped me kill a few hours at work anyways; the time flew by while I was writing it. I have spell checked and proofread, but as always things are missed. Send me PMs with problems (I'm sure there'll be at least one). Also let me know about any questions, comments, concerns or flames you have with this. I'm always looking for ways to improve. Because of the length of this program, I left out the step by step process and concentrated on just the code. If you are wondering how to code in Python, I refer you back to any of my previous tutorials.

To give you guys a bit of background on my python experience, I've been coding for almost 11 months, I wrote my first tutorial around the same week I started learning Python. It's a great language available from http://www.python.org and I would highly recommend everyone investigate it, even if it's just a lil bit, because it's a lot of fun. Look over my previous tutorials for steps on creating and running code, and as always have fun.

Peace,
HT

IT Blog: .:Computer Defense:.
PnCHd (Pronounced Pinched): Acronym - Point 'n Click Hacked. As in: "That website was pinched" or "The skiddie pinched my computer because I forgot to patch".

Anyways... Python is an interpreted language, not a compiled language... some people call it a scripting language. It works like Perl does: you don't compile anything, the same code will work on any platform, you simply have to have Python installed. The installation process takes maybe 15 minutes (including downloading) and it's completely free. Most Linux distributions come with Python already installed, so it's just a matter of python <file name>, however you can add the line #!/usr/bin/python (or wherever python may be located on your Linux install) to the start of your Python script. This will tell the command interpreter (your shell) to execute the rest of the code using the python executable. This means that, once the script is marked executable, you could just type the name of the command or ./name (if it's not in your path).

Hopefully this clears things up for you,

Peace,
HT


You can compile Python apps to run on machines without the Python interpreter. It's always worked for me anyway, but then again I have only done simple programs. Anyway, the program I used for doing it was py2exe http://starship.python.net/crew/theller/py2exe/ . The problem is that when you do this it makes the exe rather large; if I remember correctly it produces an exe and a dll on Windows machines that you can then distribute. Worth a try as it is free, but you have been warned, it won't take much code before you are running out of space on a floppy, that's for sure.

Thanks for clearing that up HTRegz. I already installed it and tried a couple simple programs. I'll also try that link UKnetSec but it's probably not necessary because I'll mostly be writing stuff for myself. Thanks though.

udp fix

Very cool tutorial. Interesting thing to note is that this program found ports that nmap didn't on a friend's computer. Hmmm... firewalls are getting smarter, eh?
Well, the reason why UDP doesn't work is because of how different it is from TCP. UDP is connectionless and leaves it to the layers above the transport layer (layer 4 of the OSI model) to make sure the packets are delivered, whereas TCP takes care of the connection itself. Hence, when you call scansock.connect() on a UDP socket it doesn't really do anything on the wire. I'm currently looking for ways to determine if a UDP port is alive. I'll let you know what I find out, kay?

Basically, what's supposed to happen with a closed UDP port is that the target sends back an ICMP error packet (port unreachable). On a connected UDP socket that error shows up when you call recv, so if an error comes back the port is obviously closed; if nothing comes back before the timeout, it is treated as open (or filtered). If you end up getting a message that over 1000 UDP ports are open, then there is a firewall.
So maybe you may want to instead put all of the ports into a list, then run some truth tests on the contents to check for this possibility. (NOTE: I multiplied timeout by 10 to ensure that it doesn't time out before a message is sent back, so this code will be really slow. Further testing with open and closed UDP ports should let you figure out the timeout formula.)
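The code that went with this reply is not reproduced in the thread, so here is only a hedged sketch of the idea in Python 3. udp_probe() and its return strings are my own; it assumes the operating system reports the ICMP port unreachable message as a connection error on a connected UDP socket, which is typical but can vary by platform and can be suppressed by firewalls or ICMP rate limiting.

```python
import socket

def udp_probe(ip, port, timeout):
    scansock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    scansock.settimeout(timeout)
    try:
        # For UDP, connect() only records the peer; nothing is sent yet.
        scansock.connect((ip, port))
        scansock.send(b"\x00")
        # A reply means something is listening and answered the probe.
        scansock.recv(1024)
        return "open"
    except socket.timeout:
        # Silence: either a service that ignored the probe, or a firewall.
        return "open|filtered"
    except ConnectionError:
        # The ICMP error from a closed port surfaces here.
        return "closed"
    finally:
        scansock.close()
```

Because silence is ambiguous, a host that shows every probed port as open|filtered is more likely sitting behind a firewall than running hundreds of services.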

Not very often that you see a thread bumped and useful information added to it... It's a nice change. That was something I was unaware of (the ICMP sent for closed UDP), so thanks for the info there... it would have helped on a test a couple of days ago...

Anyways I've been discussing with some friends reopening this and turning it into a viable scanning solution... plus it'll give me something to play with... so I'll add your fix, comment my code and start adding additional functionality..

Peace,
HT
