Down in the Muck of Data

I’ve been back in the ULP dataset. I was working in it about three
weeks ago and set it aside while I went down to Boston to work on
other projects. I had some spare time this weekend, though, and
decided to open it back up.

I’d foundered on cleaning up and encoding the counties that I talked
about in this
post. Both
states and counties were saved as FIPS codes, but the data have
plenty of typos. This shouldn’t surprise us. All of these records were
typed in by a person, and there are more than 440,000 of them. Even a
99.5% accuracy rate would leave more than 2,000 corrupted
entries. Thus cleaning, which always takes the same dreary form:

Tabulate the recorded values

Flag any values that shouldn’t appear, given your list of acceptable
values

Try to resolve what the anomalous value should have been, if
possible

Set any remaining anomalous values to missing

Encode the sanitized list of characters into a factor variable

This is simplest with states, which are only supposed to have around
50 values. You know that 37 is a legitimate value because it’s a
valid FIPS code for a state (North Carolina, FWIW). By contrast, &7
is an error, because there is no FIPS code with an ampersand. And 3
is ambiguous–it could be 03, 30, or something else
entirely. Finally, a number like 81 is outside the range of FIPS
values. In all of these cases, I would set the state value of those
records to missing. So, I tabulate the state variable, find all of the
typos, and do that:
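
Something along these lines, in sketch form; ulp and state_fips are
invented names for the data frame and the state column, and I'm assuming
the codes are stored as two-character strings:

```r
# Sketch of the routine: tabulate, flag, blank out, and encode.

# 1. Tabulate the recorded values and eyeball them
table(ulp$state_fips, useNA = "ifany")

# 2. List the acceptable values: two-digit state FIPS codes
#    (50 states plus DC; add territories if they turn up in the data)
valid_states <- sprintf("%02d", c(1:2, 4:6, 8:13, 15:42, 44:51, 53:56))

# 3. Set anything not on that list to missing
ulp$state_fips[!ulp$state_fips %in% valid_states] <- NA

# 4. Encode the sanitized values as a factor
ulp$state_fips <- factor(ulp$state_fips, levels = valid_states)
```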

It’s more complicated for counties because the list of invalid county
FIPS codes varies by state. Thus 007 is invalid in Delaware (which
only has three counties) but fine in New Jersey. Most county FIPS
codes are odd, but when you add a county after the fact (I’m looking
at you, La Paz County, Arizona!) or have an independent city
(Baltimore, St. Louis, too much of Virginia), they follow different
schemes. There’s no real way around this other than to tabulate the
values by state, go through each state, and flag unacceptable
ones. This is what a laptop and a lot of documentaries that you can
listen to more than watch are for…

Eventually I created a named list of lists, then used a nested loop to
visit each state and set its individual list of invalid values to
missing. The savvy among you will recognize this as a dictionary
bodged into R:
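
In sketch form, with only a couple of states filled in and every name
invented for illustration (ulp, state_fips, county_fips):

```r
# The "dictionary": a named list keyed by state FIPS code, each element
# holding county codes that cannot occur in that state.
bad_counties <- list(
  "10" = c("007", "009"),  # Delaware's counties stop at 005
  "34" = c("043", "045")   # New Jersey's stop at 041
)

# Visit each state and blank out its invalid county codes
for (st in names(bad_counties)) {
  for (cty in bad_counties[[st]]) {
    idx <- which(ulp$state_fips == st & ulp$county_fips == cty)
    ulp$county_fips[idx] <- NA
  }
}
```

The which() is there so that records with an already-missing state code
don't trip up the assignment.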

The limits of cleaning

Let’s note right here how often judgment enters into this process. We
don’t talk about it enough. Most datasets have problems–typos,
missing data, machine or language-encoding hiccups, shifts in coding
schemes. The analyst has to make decisions about how to deal with
these, and historically those decisions were almost never
documented. To this day, no one wants to write up every single design
choice they made. That is why it’s important to share your data and
code; it’s also why it’s important to make lists like these,
however tedious they seem. If I were to go through my data in Excel or
the like, changing errors as I saw them, there would be no record of
what I’d done, short of analyzing two datasets side by side. Even
then, different rules could produce similar results in some
cases. Don’t just clean your data; whenever possible, clean your data
with code.

Having finally cleaned up the counties and managed to encode them, I
moved on to the complainant and respondent variables. Per the
codebook, these are four-character strings:

Notice that there’s no apparent translation for how to interpret those
four characters! Elsewhere in this codebook they refer to another PDF
that lists union names, of which I have a copy:

It seems that 036 corresponds to the Retail Clerks, per the
codebook’s example, so that’s good. But what does the 7 at the start
mean? And sure, I can imagine they use 000 to mean employer–or does
that mean employer association, while R means employer? And what
code indicates an individual?

Here, I’m running up against the limits of the available
documentation. For my research, I mostly care whether it’s the union
or the employer filing the complaint, and I can probably back it out
of these data, but it’s worth noting that I cannot proceed from here
without making some judgment calls. For the moment, I’m going to focus
on the last three characters in these variables, and come back to the
first ones.
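
Splitting the variable apart is the easy part; complainant, comp_prefix,
and comp_org below are my names, not the codebook's:

```r
# Peel the four-character code into its first character and the
# three-character organization code
ulp$comp_prefix <- substr(ulp$complainant, 1, 1)
ulp$comp_org    <- substr(ulp$complainant, 2, 4)
```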

Suspicious error patterns

With that in mind, I tabulated the last three characters of the
complainant variable, and started the same procedure as for states
and counties. Eventually I compiled this list:

Yes, I’m aware how long that list is–I had to type it. But typing
has its uses; you notice when you’re hitting some keys more than
others. For example, there are a lot of errors that include the first
few letters of the alphabet. That might just be an encoding issue. But
there are also a lot of errors that include keys from the right-hand
side of the home row. In particular, the letters “J”, “K”, and “L”
show up a lot more than you’d expect from random error, with “M”, “N”,
“O”, and “P” somewhere close behind.
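
The same pattern falls out of a quick count of the stray characters
(again using the invented comp_org column from above):

```r
# Count every non-digit character across the three-character codes to
# see which stray keys show up most often
chars <- unlist(strsplit(ulp$comp_org, ""))
sort(table(chars[!chars %in% as.character(0:9)]), decreasing = TRUE)
```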

That’s weird, right? Erroneous “Q”s, “A”s, and “Z”s show up a lot,
because they’re right next to the Shift and Tab keys. “O” and “P”
maybe, since they’re below the “9” and “0” keys. But “J”? It’s
right in the middle of the letter keys.

Then it hit me: these weren’t typed on a modern keyboard. These data
come from punched cards, and those would have been typed on a
specialized keypunch machine. I have no idea what those keyboards
look like. To Google!

I have reason to suspect that the NLRB and AFL-CIO were using IBM
mainframes. Even if I didn’t, those were the most common type. With
the rise of System/360, the most common keypunch machine was the IBM
29:

Bingo. The IBM 29 didn’t have a separate numeral row. Instead there
was a telephone-style numeric keypad on the right-hand side of the
keyboard. Since punched cards didn’t use upper- and lowercase, you had
“number-shift” and “letter-shift” keys (the “numerisch” and
“alpha” keys here). If you stuttered on a “4”, you’d enter a “J”,
and so on. That accounts for a lot of these typos. It even suggests
where the ampersands might be coming from–a missed “3”!
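
If that's the mechanism, some of these typos can be repaired rather than
discarded. A sketch, using only the pairings I'm reasonably sure of (J, K,
and L sharing keys with 4, 5, and 6); anything beyond that should be
checked against an 029 keyboard diagram first:

```r
# Swap the letters that share keys with 4, 5, and 6 back to those digits.
# These codes are supposed to be all-numeric, so a stray J, K, or L is
# presumably a missed number-shift rather than a real value.
ulp$comp_org <- chartr("JKL", "456", ulp$comp_org)
```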

Delving deeper

This doesn’t solve all problems, of course. It still isn’t clear where
all the “A”s, “B”s, and “C”s are coming from, nor what that first
character in the complainant and respondent variables is supposed to
represent. But whenever I find something like this–a connection
between a seemingly random string of digits and the mechanical or
human processes that generated it–I feel like I’m coming to
understand my data better.

The next and biggest mystery, at least on this front, is where all the
curly-brace characters are coming from! Bear in mind that, because
these were punched cards from an old IBM mainframe, they were encoded
in some variant of EBCDIC
instead of ASCII, and EBCDIC
didn’t have curly-brace characters. This suggests some sort of
encoding-translation error, but parsing that may be a bridge too far,
even for me.
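
If I ever do chase it down, the first step would probably be to check
which EBCDIC byte a modern translator renders as a curly brace. A sketch,
with the caveat that the encoding names iconv() accepts vary by platform,
and IBM037 is only one guess at the code page involved:

```r
# Which byte would "{" correspond to under one common EBCDIC code page?
# Whether "IBM037" (or "CP037", etc.) is recognized depends on the iconv
# library behind your R build.
iconv("{", from = "ASCII", to = "IBM037", toRaw = TRUE)
```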