Parsing Wake County School System Attendance Assignment Site With F#

As a follow-up to this post, I turned my attention to parsing the Wake County Public School Assignment Site. If you are not familiar, large school districts in America have a concept of ‘nodes’, where a child is assigned to a school pyramid (elementary, middle, and high school) based on their home address. This gives the school assignment tremendous weight, because a house’s value is directly tied to how “good” (real or perceived) its assigned school pyramid is. WCPSS has a site here where you can enter your address and find out the school pyramid.

Since there is no public API or even a publicly available dataset, I decided to see if I could screen scrape the site. The first challenge is that you need to navigate through two pages to get to your answer. Here is the Fiddler trace:

The first mistake you will notice is that they are using PHP. The second is that they are using the same URI and parameterizing the requests via the form values:
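Since both requests hit the same endpoint, the only thing that distinguishes them is the POST body. A minimal sketch of the first request using FSharp.Data’s Http module; the URI and form field names here are hypothetical placeholders, not the site’s actual ones:

```fsharp
open FSharp.Data

// Both pages POST to the same PHP endpoint; only the form values differ.
// The URI and the field names below are illustrative placeholders.
let lookupUri = "http://lookup.wcpss.example/index.php"

let firstPageHtml =
    Http.RequestString(lookupUri,
        body = FormValues [ "StreetNumber", "1600"
                            "StreetName",   "Main St" ])
```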

Finally, their third mistake is that the pages come back in an inconsistent way, making the DOM traversal more challenging.

Undaunted, I fired up Visual Studio. Because there are two pages that need to be used, I imported both of them as the models for the HtmlTypeProvider.
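With each page’s HTML saved locally, the provider setup might look like this (the sample file names are illustrative, not from the original post):

```fsharp
open FSharp.Data

// Point the HTML type provider at a saved sample of each page so the
// shapes of both documents are available at compile time.
type AddressSearchPage = HtmlProvider<"AddressSearchSample.html">
type SchoolResultsPage = HtmlProvider<"SchoolResultsSample.html">
```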

I then pulled out the form fields from the request and placed them into some values. The code so far:
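A sketch of those bindings; the field names are hypothetical stand-ins for the ones captured in the Fiddler trace:

```fsharp
// Hypothetical form keys lifted from the captured request
let streetNumber = "1600"
let streetName   = "Main St"

let searchFormData =
    [ "StreetNumber", streetNumber
      "StreetName",   streetName ]
```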

Skipping the first page, I decided to make a request and see if I could get the school information out of the DOM. It worked well enough, but you can see the immediate problem: the page’s structure varies, so just targeting the nth element of the table will not work.
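For illustration, the fragile positional approach looks something like this, using FSharp.Data’s HtmlDocument API. It runs, but it breaks whenever the table gains or loses rows:

```fsharp
open FSharp.Data

// Grab a fixed row of the first table by position. This only works when the
// page happens to render the same number of rows, which it does not.
let nthRowText (doc: HtmlDocument) (n: int) =
    doc.Descendants ["table"]
    |> Seq.tryHead
    |> Option.map (fun table ->
        table.Descendants ["tr"]
        |> Seq.item n              // throws when the row is not there
        |> fun row -> row.InnerText())
```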

I decided to take the dog for a walk, and that time away from the keyboard was very helpful because I realized that although the table is not consistent, I don’t need it to be for my purposes. All I need are the school names for a given address. What I need to do is remove all of the noise and just find the rows of the table with useful data:
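A sketch of that filter, assuming the useful rows can be recognized by the school-level words in their text (the marker strings are my guess, not from the original):

```fsharp
open FSharp.Data

// Keep only the rows whose text looks like a school assignment,
// regardless of where they sit in the table.
let schoolRows (doc: HtmlDocument) =
    let markers = [ "Elementary"; "Middle"; "High" ]
    doc.Descendants ["tr"]
    |> Seq.map (fun row -> row.InnerText().Trim())
    |> Seq.filter (fun text -> markers |> List.exists (fun m -> text.Contains m))
    |> Seq.toList
```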

So, working backwards, I need to parse the first page to get the CatchmentCode for an address, build the second page’s form data, and then parse the results. Parsing the first page for the CatchmentCode was very straightforward:
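That parse might look like the following; I’m assuming the code travels as a hidden input named "CatchmentCode", which is a guess at the page’s markup rather than something confirmed from the original:

```fsharp
open FSharp.Data

// Parse the first page's HTML and pull out the catchment code.
// The input name "CatchmentCode" is an assumption about the markup.
let tryCatchmentCode (html: string) =
    let doc = HtmlDocument.Parse html
    doc.Descendants ["input"]
    |> Seq.tryFind (fun node -> node.AttributeValue "name" = "CatchmentCode")
    |> Option.map (fun node -> node.AttributeValue "value")
```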