Diving into the data — ChiPy Blog #2

In my last blog post for this project, I hoped to have the analysis of the Chicago parking ticket data wrapped up by this point. Alas, I was so fresh, naive, and optimistic. I should know by now: things never turn out the way you plan.

Working with data takes time. You find you have ideas that are dead ends, or that your data is incomplete for your purposes. And when you’re getting the hang of a new coding language at the same time, progress sometimes looks more like what you tried and failed to do than what you have successfully pulled off. With that introduction, I’ll dive right into where my Pythonic efforts have taken me and how each little stop gets me closer to achieving some presentable results.

Whoa, looks like most people get on the boot list at least partly because of having expired plates or registration. Good start, but from here my questions led me down a rabbit hole. I wanted to know whether the ticketing process always followed the city protocol of:

(1) 2–3 unpaid violations

(2) A notice that they intend to find your car and boot it

(3) 21 days after the first notice, if the tickets are unpaid, the car is actually booted

(4) If the tickets remain unpaid 24 hours after a boot, then the car is towed.

The problem was…there was no towing data available. So all I had to go on were the five values of notice_level:

“VIOL,” which means a notice of violation was sent

“SEIZ” indicates the vehicle is on the city’s boot list

“DETR” indicates a hearing officer found the vehicle owner liable for the citation

“FINL” indicates the unpaid ticket was sent to collections

“DLS” means the city intends to seek a license suspension.
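To keep these codes straight while exploring, here is a minimal sketch that maps each code to its meaning and tallies how often each one shows up. The `tickets` DataFrame below is a made-up stand-in for the real sample; only the `notice_level` column name and the code meanings come from the description above.

```python
import pandas as pd

# Meanings of the notice_level codes, as described above.
NOTICE_LEVELS = {
    "VIOL": "notice of violation sent",
    "SEIZ": "vehicle is on the boot list",
    "DETR": "hearing officer found the owner liable",
    "FINL": "unpaid ticket sent to collections",
    "DLS": "city intends to seek a license suspension",
}

# Toy stand-in for the ticket sample (made-up rows).
tickets = pd.DataFrame(
    {"notice_level": ["VIOL", "SEIZ", "SEIZ", "FINL", "DETR", "SEIZ"]}
)

# Count tickets per notice level, then attach the plain-English meaning.
level_counts = tickets["notice_level"].value_counts()
summary = level_counts.to_frame(name="count")
summary["meaning"] = summary.index.map(NOTICE_LEVELS)
print(summary)
```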

I was curious about instances in which cars were towed directly instead of being put on the boot list first, but without the towing data, there was no way I could follow cars as they moved from “SEIZ” status to towed status.

But alas, still I tried.

What is “final determination” and where is it hiding in this data?

The issue was that I couldn’t determine whether the “final determination status” that made someone eligible for the boot list meant “FINL” or “DETR”. I settled on “FINL” and tried to narrow the data set down to license plate numbers that had either a “SEIZ” or a “FINL” status.

# license plate numbers from every ticket that has a boot warrant ("SEIZ")
tickets_SEIZ = tickets["license_plate_number"][tickets.notice_level == "SEIZ"]
# and from every ticket that is "FINL", i.e. sent to collections
tickets_FINL = tickets["license_plate_number"][tickets.notice_level == "FINL"]

Then I checked out how many license plates each status had.

len(tickets_SEIZ)
len(tickets_FINL)

SEIZ had about 9000 plates, while FINL had about 4000. That told me that FINL wasn’t the criterion I thought it was. If tickets had to reach FINL (sent to collections) before a vehicle could land on the boot list, I would expect at least as many FINL records as SEIZ records, and this sample showed far fewer.
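One caveat worth checking: `len()` on a filtered column counts ticket rows, not distinct plates, so a plate with several SEIZ tickets is counted several times. The sketch below shows the difference on made-up data, using `nunique()` for the distinct count. The rows are invented for illustration; only the two column names come from the code above.

```python
import pandas as pd

# Made-up rows: plate "a1" and plate "c3" each have two SEIZ tickets.
tickets = pd.DataFrame({
    "license_plate_number": ["a1", "a1", "b2", "c3", "c3", "c3"],
    "notice_level": ["SEIZ", "SEIZ", "FINL", "SEIZ", "FINL", "SEIZ"],
})

seiz_plates = tickets.loc[
    tickets["notice_level"] == "SEIZ", "license_plate_number"
]

seiz_rows = len(seiz_plates)          # number of SEIZ tickets
seiz_unique = seiz_plates.nunique()   # number of distinct plates with a SEIZ ticket
print(seiz_rows, seiz_unique)
```

If the two numbers differ a lot in the real sample, comparing raw `len()` results across notice levels says more about ticket volume than about how many vehicles are involved.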

To explore further, I wanted to see how many violations each license plate got for either FINL or SEIZ. The following two lines of code return a list of license plates with a count of how many times they appeared.

# this returns the number of FINL violations per license plate number
count_FINL = tickets_FINL.value_counts()
# this returns the number of SEIZ violations per license plate number
count_SEIZ = tickets_SEIZ.value_counts()
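As an alternative to two separate `value_counts()` calls, `pd.crosstab` can put the per-plate counts for every notice level side by side, which makes it easier to see which plates have both FINL and SEIZ tickets. This is a sketch on invented rows, not the code used for the analysis above.

```python
import pandas as pd

# Made-up rows for illustration.
tickets = pd.DataFrame({
    "license_plate_number": ["a1", "a1", "a1", "b2", "b2", "c3"],
    "notice_level": ["FINL", "FINL", "SEIZ", "SEIZ", "VIOL", "FINL"],
})

# One row per plate, one column per notice level, cells are ticket counts.
per_plate = pd.crosstab(tickets["license_plate_number"], tickets["notice_level"])

# Plates that have at least one FINL ticket and at least one SEIZ ticket.
finl_and_seiz = per_plate[(per_plate["FINL"] > 0) & (per_plate["SEIZ"] > 0)]
print(per_plate)
print(finl_and_seiz)
```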

FINL vs. DETR

Since FINL didn’t seem to be making sense, I ran the whole thing over again, but this time comparing DETR and SEIZ. DETR returned about 6000 license plates, which was more promising than FINL, but still not what I was looking for. If you need 2–3 violations to get to SEIZ, I’d expect at least twice as many DETR records as SEIZ records. I wasn’t seeing that here, so I suspected that either neither DETR nor FINL was what I needed to track the journey to the boot list, or the random sample wasn’t random enough for my purposes.

I was ready to give up, but just for fun, I made a dataframe with just the SEIZ and DETR records:
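One way to build a DataFrame like this is with `isin`. The sketch below uses made-up rows and is not necessarily the exact code used for the analysis; it also counts which plates carry both statuses.

```python
import pandas as pd

# Made-up rows for illustration.
tickets = pd.DataFrame({
    "license_plate_number": ["a1", "a1", "b2", "c3", "c3"],
    "notice_level": ["DETR", "SEIZ", "VIOL", "SEIZ", "FINL"],
})

# Keep only the rows whose notice level is SEIZ or DETR.
seiz_detr = tickets[tickets["notice_level"].isin(["SEIZ", "DETR"])]

# For each plate, count how many of the two statuses it has (1 or 2).
levels_per_plate = seiz_detr.groupby("license_plate_number")["notice_level"].nunique()
plates_with_both = levels_per_plate[levels_per_plate == 2].index.tolist()
print(seiz_detr)
print(plates_with_both)
```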

Fun with anonymized data

Another thing I realized is that the anonymized, hashed license plate numbers may not be completely unique, which would make it impossible to follow individual vehicles through their boot list journey. What made me suspect this was one license plate number that showed up with 1383 boot list notifications in the data set. Since this was a 50,000 row sample from only one year, 2015, it seemed impossible for one vehicle to get that many notifications in such a short period of time.

Once I took a look at “scofflaw,” I discovered that it probably served as a catch-all for miscellaneous tickets: the data showed many different car makes all listed under the same plate number. Since no other license plate number had that many violations, I could keep that one in mind as a catch-all and exclude it from my future analyses.
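A check like this can be sketched by counting distinct car makes per plate and dropping any plate with implausibly many. The column name `vehicle_make` and the threshold `MAX_MAKES` are assumptions made for illustration, and the rows are invented.

```python
import pandas as pd

# Made-up rows; "x9" plays the role of the catch-all plate.
tickets = pd.DataFrame({
    "license_plate_number": ["x9", "x9", "x9", "x9", "a1", "a1", "b2"],
    "vehicle_make": ["FORD", "HOND", "TOYT", "CHEV", "FORD", "FORD", "HOND"],
})

# Tickets and distinct makes per plate.
per_plate = tickets.groupby("license_plate_number").agg(
    ticket_count=("vehicle_make", "size"),
    distinct_makes=("vehicle_make", "nunique"),
)

MAX_MAKES = 2  # hypothetical threshold: a real plate rarely has many makes
suspect_plates = per_plate[per_plate["distinct_makes"] > MAX_MAKES].index

# Exclude the suspected catch-all plates from further analysis.
cleaned = tickets[~tickets["license_plate_number"].isin(suspect_plates)]
print(per_plate)
```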

What’s the gain from all this?

I learned that trying to follow a handful of vehicles with a high number of violations probably wasn’t going to work out too well with a random sample. What would work better for this would be to grab a complete year or other time span of data and try to see how statuses change over that period of time. Since this data is a random sample from a year, I can’t truly follow things over time since the sample is incomplete.

I also learned that I didn’t pick a good way to track how vehicles end up on the boot list. Neither the DETR nor the FINL status seems to match up with the “final determination” status that gets you a SEIZ status after 2–3 chances. Sadly, it doesn’t seem like comparing the notice_level statuses is the way to go to follow a vehicle over time.

My goals!

Further build my skills in pandas. Since I learned R first, it really felt like a “language barrier” to translate the data manipulation tactics I learned in the tidyverse into pandas. I coped with the help of my mentor, Ed, Google, StackOverflow, The Quick Python Book (that mentees got at the last Project Night!), the Pandas Cookbook, and Cheatsheet. I’d like to get to the point where I’m using these as references rather than examples.

Instead of looking at SEIZ in terms of its relation to other notice levels, take a look at SEIZ in relation to the total amount of payments that a person has made after getting boot status, plus the current amount due when they land on the boot list. “Follow the money” to see how it relates to getting a boot. I plan to work on this in the sample data and then explore it in the more complete data set after I find something interesting.
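A first pass at “follow the money” might look like the sketch below: total what each boot-listed plate has paid and what it still owes. The column names `total_payments` and `current_amount_due` are assumptions about the data set, and the rows are invented.

```python
import pandas as pd

# Made-up rows; dollar amounts are invented.
tickets = pd.DataFrame({
    "license_plate_number": ["a1", "a1", "b2", "c3"],
    "notice_level": ["SEIZ", "SEIZ", "VIOL", "SEIZ"],
    "total_payments": [50.0, 0.0, 60.0, 100.0],
    "current_amount_due": [0.0, 244.0, 0.0, 122.0],
})

# Only tickets that put a vehicle on the boot list.
seiz = tickets[tickets["notice_level"] == "SEIZ"]

# Per plate: total paid and total still owed across its SEIZ tickets.
money = seiz.groupby("license_plate_number")[
    ["total_payments", "current_amount_due"]
].sum()

# Share of the combined bill that has been paid so far.
money["share_paid"] = money["total_payments"] / (
    money["total_payments"] + money["current_amount_due"]
)
print(money)
```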

Take a look at the most common violations that land people on the boot list and see how they compare to the cash amount for each violation. Are people getting on the boot list more often because they get one or more tickets with really high fines that they can’t afford to pay? How does the boot list stack up with big ticket items vs. smaller ticket items?
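To compare violation frequency with fine size, one option is to group the boot-list tickets by violation and look at the count next to the median fine. The column names `violation_description` and `fine_level1_amount` are assumptions about the data set, and the rows are invented.

```python
import pandas as pd

# Made-up rows; fines are invented.
tickets = pd.DataFrame({
    "notice_level": ["SEIZ", "SEIZ", "SEIZ", "VIOL", "SEIZ"],
    "violation_description": [
        "EXPIRED PLATES", "EXPIRED PLATES", "NO CITY STICKER",
        "EXPIRED METER", "EXPIRED METER",
    ],
    "fine_level1_amount": [60, 60, 200, 50, 50],
})

seiz = tickets[tickets["notice_level"] == "SEIZ"]

# Per violation: how many boot-list tickets, and the typical fine.
by_violation = (
    seiz.groupby("violation_description")
    .agg(
        ticket_count=("fine_level1_amount", "size"),
        median_fine=("fine_level1_amount", "median"),
    )
    .sort_values("ticket_count", ascending=False)
)
print(by_violation)
```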

Pull a year from the main dataset and try to follow the boot journey for a complete year. Since the main dataset is impossibly large (trying to load it actually crashed both Ed’s computer and mine), Ed walked me through some code he wrote to read in the file and save it by year or by row. The code for reading this huge file in by year (plus some of my comments on what the lines do) is below.

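"""
program to read the parking_tickets.csv file and break it into
smaller files by year.

Assumes:
    headings are first line in file.
    year is beginning of second element in observation list.
    observations are already in consecutive date order.
"""
fin = open('data//ticket_sample_data.csv', 'r')
line = fin.readline()   # read first line in file
headings = line         # save first line as headings
file_year = 'tbd'       # creates variable for yet-to-be-created output files.
fout = None             # no output file is open yet
# change file_year to the year you want to pull from the file.
while line:             # loop through and process observations
    line = fin.readline()
    if line:
        line_as_list = line.split(sep=',')
        obs_year = (line_as_list[1][0:4])
        if (obs_year == file_year):
            fout.write(line)
        else:
            # The listing is cut off after this "else:". The rest is an assumed
            # completion based on the docstring; the output file name is hypothetical.
            if fout:
                fout.close()
            file_year = obs_year
            fout = open('data//tickets_' + file_year + '.csv', 'w')
            fout.write(headings)
            fout.write(line)
if fout:
    fout.close()
fin.close()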
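Since building pandas skills is one of the goals above, here is a sketch of the same split-by-year idea using `pd.read_csv` with `chunksize`, which reads the file a few rows at a time instead of loading it all at once. This is a different technique from the line-by-line reader above, not a copy of it. The column name `issue_date`, the output file names, and the tiny temporary CSV are all assumptions made so the example runs on its own.

```python
import os
import tempfile

import pandas as pd

# Build a tiny stand-in CSV in a temporary folder (made-up rows).
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "tickets.csv")
with open(source, "w") as f:
    f.write(
        "ticket_number,issue_date,notice_level\n"
        "1,2014-12-31 23:50:00,VIOL\n"
        "2,2015-01-01 08:00:00,SEIZ\n"
        "3,2015-06-15 12:30:00,DETR\n"
        "4,2016-02-02 09:15:00,FINL\n"
    )

rows_per_year = {}
# chunksize=2 is only for the demo; a real run would use something much larger.
for chunk in pd.read_csv(source, chunksize=2, dtype=str):
    years = chunk["issue_date"].str[:4]  # first four characters are the year
    for year, rows in chunk.groupby(years):
        out_path = os.path.join(workdir, f"tickets_{year}.csv")
        first_write = not os.path.exists(out_path)
        # Append to the year's file, writing the header only once.
        rows.to_csv(out_path, mode="a", header=first_write, index=False)
        rows_per_year[year] = rows_per_year.get(year, 0) + len(rows)

print(rows_per_year)
```

Because each chunk is grouped by year before writing, this version does not depend on the rows being in date order, at the cost of reopening the output files for every chunk.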

I’m excited that a bit of automating things with Python has now slipped into this data analysis project too. Sometimes you need to do a little bit of everything to make the thing work.

Now that I have a better idea of the limitations and scope of the data, I can work with what I have and come out with analyses and findings that I can visualize. I’m aiming to wrap up analysis and start building a visualization of my findings for the web in a few weeks.