Due: December 8, 2008

Background

Our textbook focuses on comparatively low-level aspects of mobile networking. A higher-level issue is "location aware" computing, in which a mobile user gets different results depending on where he or she is currently located. For example, a Google search for "coffee" ought to return different results depending on where you are.

Location awareness is a high-level issue, but it relies on some low-level mechanism for estimating the computer's current location. There are a variety of approaches, some quite sophisticated, such as GPS. One comparatively low-tech approach is to use the set of WiFi access point beacons that are received, and the signal strength with which each of them is received, as a way of estimating position.

Objective

You and a partner will develop a program that can analyze WiFi beacon signal strengths and determine which of ten locations they are likely to come from. Five of the locations are on the third floor of Olin Hall and five are from the Campus Center (both floors).

You will develop your program based on four samples of beacon signal strengths that I captured in each of the ten locations at four somewhat different times. Each of these forty samples contains signal strengths for however many access points were received. I will tell you the location (and time) for each sample.

I will then later give you ten new samples, from the same ten locations but recorded on a different day. The second batch of ten will be numbered rather than labeled with locations. Your goal will be to use your program to estimate each of these unknown locations.

Ideally, your program would pin down the specific locations. If the evidence is equivocal, your program may need to output that the data is from one of a small list of locations (such as either of two). Hopefully you will be able to do better than merely getting the building correct.

In an ideal world, you wouldn't use human analysis of the known-location samples to design your classification program. Instead, you would use a "machine learning" program that would analyze the known-location samples and based on them infer what the relevant distinguishing characteristics are. That way, we could easily go beyond ten locations, by just "training" your program with some more known-location samples. Another way to think of this is that your program, when it is presented with a batch of unknown-location data, will be determining which known location's data is most similar to it. (The "learning" might consist of as little as squirreling the known locations' data away for later comparison with the unknown-location data.) Because of this possibility of using a learning approach, we will refer to the initial known-location data as the "training data" and the later unknown-location data as the "testing data".
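As one illustration of the "most similar known location" idea, here is a minimal nearest-neighbor sketch in Python. It is only one of many possible approaches, and it rests on assumptions that are mine rather than part of the assignment: each sample is represented as a dictionary from access point address to signal strength, an access point that was not heard is treated as having strength 0 (the constant MISSING), and similarity is measured by Euclidean distance. The names distance and classify are hypothetical.

```python
import math

# ASSUMPTION: strength substituted for an access point that does not
# appear in a sample. Zero is a guess, not something the data specifies.
MISSING = 0


def distance(sample_a, sample_b):
    """Euclidean distance between two {access point: strength} samples."""
    access_points = set(sample_a) | set(sample_b)
    return math.sqrt(sum(
        (sample_a.get(ap, MISSING) - sample_b.get(ap, MISSING)) ** 2
        for ap in access_points
    ))


def classify(unknown, training):
    """Return the location of the training sample closest to `unknown`.

    `training` is a list of (location, sample) pairs; "learning" here is
    nothing more than keeping that list around for later comparison.
    """
    best_location, _ = min(
        training, key=lambda pair: distance(unknown, pair[1])
    )
    return best_location
```

With this structure, adding an eleventh location would only require appending more (location, sample) pairs to the training list. Whether Euclidean distance and a MISSING value of 0 are good choices for this data is something you would need to judge from the training samples.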

If you are less ambitious (but more in the mood for tedious work), you could analyze the training data by hand and use your own human judgment to write a classification program. The classification program would automate the analysis of testing data. That is, you could feed it one of the ten unknown collections of signal strengths and it would print out its best guess (or guesses) as to the location.
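Whichever way you build the classifier, printing "guesses" rather than a single guess can be handled mechanically. The sketch below assumes your classifier produces one mismatch score per location, where a lower score means a better match. The function name shortlist and the margin parameter are hypothetical, and the margin value would need to be tuned against the training data.

```python
def shortlist(scores, margin=1.0):
    """Return every location whose score is within `margin` of the best.

    `scores` maps location name to a mismatch score (lower is better).
    The result is ordered from best match to worst.
    """
    best = min(scores.values())
    close = [loc for loc, score in scores.items() if score <= best + margin]
    return sorted(close, key=lambda loc: scores[loc])
```

When one location is a clear winner the list has a single entry. When two locations score almost the same, both are reported instead of arbitrarily picking one.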

You could even earn most of the points for this assignment without writing any program at all. By analyzing the training data and using your human judgment, write down a precise specification of how the classification program (if it were written) would classify a batch of testing data. Then, when I give you the testing data, you can manually classify it, not by exercising any human judgment, but rather by rigidly applying the criteria you developed from the training data.

Hopefully you will gain some appreciation for how a machine learning system would learn the WiFi beacon signatures of locations, even if you do part or all of the process by hand.

Data gathering

I gathered all the data using my MacBook, which has a WiFi card of the specific model AirPort Extreme (identifying codes 0x168C, 0x87; firmware version 1.4.8.0). I used KisMAC (release "trunk r319") to passively capture the IEEE 802.11 beacon frames, including in particular the signal strengths as reported by the AirPort Extreme card. These signal strengths are said to be in dB, but without any specification of what the reference power level is. A simple interpretation would be that larger numbers mean stronger signal.

I set KisMAC to scan all 11 WiFi channels, hopping between them every 0.25 seconds. Each time I collected data, I allowed KisMAC to keep scanning for about 30 seconds. Each time I returned to a location, I tried to position myself in approximately the same spot, though of course there was a bit of variation.

I exported the data from KisMAC in MacStumbler format, a simple textual file format in which each line corresponds to one access point that was detected. The most important two columns in the output are the second and fourth. The second column is the access point's wireless MAC address, i.e., its BSSID. The fourth column is the maximum signal strength detected for that access point on any of the channels during the scanning period. (Because WiFi channels overlap with their neighbors, the beacon is typically picked up more weakly on neighboring channels.)
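To make the column description concrete, here is a rough Python sketch that pulls the second and fourth columns out of a file's lines. It assumes the columns are separated by whitespace and that no column contains embedded spaces. That assumption would fail for a network name with a space in it, so check it against the actual files. The function name parse_macstumbler is hypothetical.

```python
import re

# Matches a MAC address such as 00:11:22:AA:BB:CC.
MAC_RE = re.compile(r"^(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$")


def parse_macstumbler(lines):
    """Return {MAC address: maximum signal strength} for the given lines.

    ASSUMPTION: columns are whitespace-separated with no embedded spaces.
    Lines that do not fit (headers, blank lines) are silently skipped.
    """
    readings = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or not MAC_RE.match(fields[1]):
            continue
        try:
            strength = int(fields[3])   # fourth column: maximum signal
        except ValueError:
            continue
        readings[fields[1].upper()] = strength  # second column: MAC address
    return readings
```

You could call it as parse_macstumbler(open(filename)) for whichever file you want to examine. The result is a dictionary with one entry per access point.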

Training data

Each of these files in MacStumbler format has a name that encodes the location and time at which it was gathered. Those from "outside" a room are from the hallway, not outside the building.

As a convenience for you, I've also packaged these all together into a single zip file, training.zip. If you unzip it, you will get a directory called training that contains the above forty files.

Testing data

The following testing data was collected on 2008-11-20 in the same MacStumbler format as the training data. Once you have classified this testing data, you can look at this key so you can assess how well your classifier performed. (Don't peek!)

As with the training data, I've also packaged these all together into a single zip file, unknowns.zip. If you unzip it, you will get a directory called unknowns that contains the above ten files.

Data manipulation

You may find it useful to pre-process the data before doing any manual analysis on it or using it as input to an analysis program. For example, you might want to select out a subset of the data, change the format of the selected data, and sort the selected data into some particular order. All of these tasks can be automated using general-purpose tools available on our Linux systems. If you have a data-manipulation goal in mind, but don't know how to achieve that goal using our tools, I'd be happy to consult with you, doing some situation-specific teaching about the tools.
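As an example of the select, reformat, and sort steps, the following Python sketch keeps only the address and strength columns, optionally drops weak signals, and lists the strongest access points first. It is one possible approach, not the only one, and standard Linux text tools could do the same job. It assumes whitespace-separated columns with the MAC address second and the signal strength fourth. The function name strongest_first and the min_strength parameter are hypothetical.

```python
def strongest_first(lines, min_strength=None):
    """Return "MAC<TAB>strength" strings, sorted strongest signal first.

    ASSUMPTION: whitespace-separated columns, MAC address in column 2 and
    signal strength in column 4. Lines that do not fit are skipped.
    """
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            strength = int(fields[3])
        except ValueError:
            continue
        if min_strength is not None and strength < min_strength:
            continue  # select: drop signals weaker than the threshold
        rows.append((fields[1], strength))
    # sort: strongest first, ties broken by MAC address for stable output
    rows.sort(key=lambda row: (-row[1], row[0]))
    # reformat: two tab-separated columns
    return [f"{mac}\t{strength}" for mac, strength in rows]
```

Output in this form is easy to compare by eye across the four samples from one location, which may help whether you are hand-crafting rules or checking a program's behavior.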

Report

Your report should be written for an audience that is familiar with the WiFi concepts covered in our textbook but is not familiar with the assignment. As a simplification, you can also assume familiarity with campus geography. (A better scientific report would include a campus map and the relevant floor plans.)

You should explain in general terms what you did, the specific approach you took, and the results you achieved. If you wrote a program (whether using a machine-learning approach or a hand-crafted classifier) you should include it. If you developed classification rules but no program, you should include them. Assess how successful your classifier was on the testing data and indicate any problem areas that should be the focus of future work.