EEOC EEO-1 Survey code

I have done research using longitudinal data on establishment
workforce composition. These data come from the Equal Employment
Opportunity Commission (EEOC). Since 1966, to monitor compliance with
the Civil Rights Act, the EEOC has required any private-sector
establishment with more than 100 employees (50, if they do significant
federal contract work) to file an EEO-1 establishment survey. These
forms have some information about the firm itself, such as its
location and industry. The key data are a matrix of occupations and
race/sex tuples. Employers enter counts of employees in each relevant
cell. Thus an EEO-1 form will show you how many Hispanic women are
employed as Technicians, for example. (See a sample form
here.)
The EEOC does not separately verify these reports but does reserve the
right to audit them. Prior studies have concluded that the EEO-1
surveys are among the best data we have on workforce composition over
time.
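
To make the shape of the form concrete, here is a minimal Python sketch of one establishment-year as a table of counts keyed by occupation, race/ethnicity, and sex. The category labels, the counts, and the `count` helper are all made up for illustration; the real form defines its own categories, and the EEOC's files are not laid out this way.

```python
from collections import Counter

# One hypothetical EEO-1 filing: counts keyed by (occupation, race, sex).
# Labels and numbers are invented for illustration only.
form = Counter({
    ("Technicians", "Hispanic", "Female"): 12,
    ("Technicians", "Hispanic", "Male"): 9,
    ("Technicians", "White", "Female"): 20,
    ("Professionals", "Black", "Male"): 7,
})


def count(form, occupation=None, race=None, sex=None):
    """Sum the cells that match whichever of the three keys are given."""
    total = 0
    for (occ, r, s), n in form.items():
        if occupation is not None and occ != occupation:
            continue
        if race is not None and r != race:
            continue
        if sex is not None and s != sex:
            continue
        total += n
    return total


# A single cell of the matrix, and a margin over race and sex.
hispanic_women_technicians = count(form, "Technicians", "Hispanic", "Female")
all_technicians = count(form, occupation="Technicians")
```

Leaving an argument out sums over that dimension, which is how row and column totals fall out of the same structure.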

The EEOC thinks that it is important for researchers to work with
their data, in part to see whether and how they are achieving their
mission. It’s a rare and refreshing stance for a government agency to
have! Access comes through the Intergovernmental Personnel Act. Thus I
have at times been an employee on secondment to the EEOC, at a salary
of $0 per year. This gives me access to the EEOC’s data but also puts
me under the same privacy and data-protection requirements as any of
the agency’s employees.

tl;dr: Researchers cannot just share EEOC data with one another. This is good and bad. Obviously it would be great if these
data could be shared with no restrictions. Given the state of the law,
though, this isn’t going to happen. And if one were to try to get
access to EEOC data through a channel like the Freedom of Information
Act, whatever data the agency could release would be heavily
redacted. This would make linking EEOC data to other data sources
virtually impossible. Overall, then, the agency’s access program is a
net positive.

Nonetheless, this process slows down research with these data. In
particular, initial conversion of the EEOC data can be a bear. The
EEOC complies with data requests by sending researchers
SAS7BDAT-format files. Never heard of those, you say? You are not
alone! This is the “SAS 7 Binary Data” format, which is kind of
old-school even if you do use SAS. Never mind NumPy, R, or Stata.

In principle, you can convert SAS files to other formats using
software like Stat/Transfer. In practice, several fields in these
files are compressed, using a janky, proprietary, obsolete SAS
compression algorithm. Stat/Transfer throws errors when converting
these, and the number of errors grows quickly on the older files.
There is also no reason to think that any of these errors are random.

My solution, the last time I got raw data from the agency, was to
write some SAS code that iterates over the files and writes a CSV for
every year. (Many thanks to Simona Abis,
lately of INSEAD, who basically walked me through this on very short
notice!) Then I have a Stata do-file that cleans up these CSV
files. The end result is a longitudinal dataset covering 1971-2014,
save for the years therein where the EEOC does not have digital data
(1974, 1976, and 1977).
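
The stacking step can be sketched roughly as follows. This is not the actual pipeline, which uses SAS for the export and a Stata do-file for the cleaning; it is a Python illustration of combining one CSV per year into a single long file while skipping the years with no digital data. The file-name pattern `eeo1_{year}.csv`, the function name `stack_years`, and the added `year` column are assumptions made for the example.

```python
import csv
from pathlib import Path

# Years with no digital data, per the text above.
MISSING_YEARS = {1974, 1976, 1977}
YEARS = [y for y in range(1971, 2015) if y not in MISSING_YEARS]


def stack_years(csv_dir, out_path, pattern="eeo1_{year}.csv"):
    """Append each yearly CSV to one long CSV, adding a 'year' column.

    The file-name pattern is hypothetical. The header is taken from the
    first row read; columns absent from that header are ignored. Returns
    the list of years for which a file was found.
    """
    found = []
    writer = None
    with open(out_path, "w", newline="") as out:
        for year in YEARS:
            path = Path(csv_dir) / pattern.format(year=year)
            if not path.exists():
                continue
            with open(path, newline="") as f:
                for row in csv.DictReader(f):
                    row["year"] = year
                    if writer is None:
                        writer = csv.DictWriter(
                            out, fieldnames=list(row), extrasaction="ignore"
                        )
                        writer.writeheader()
                    writer.writerow(row)
            found.append(year)
    return found
```

Calling `stack_years("csv", "panel.csv")` would write the stacked file and return the years it found, which makes it easy to check that the gaps fall where expected.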

This file could be built differently. For example, I did not want to
faff with encoding county information on the oldest (1966) file, and
so have not included that cleaning in the script. However, I think it
would be vastly easier for another researcher to modify my script
than to write the dang thing from scratch. Through such fits and
starts does science proceed! This file is also the starting point I
use for analyses in a couple of projects, like this
article on
establishment-level racial employment segregation, and for
procedures like geocoding the establishments, so it is useful to
preserve this version of the script.