Tutorial: Using SQLite to analyze an education data set

Tutorial: Using SQLite to analyze an education data set

In this tutorial, we’re going to do some example queries of the 2013-14 Civil Rights Data Collection from the U.S. Department of Education Office for Civil Rights. To get started, download the CRDC_NC.csv file from our Google Drive.

The full CRDC dataset for the United States takes a very long time to open, so we’re going to just start with the N.C. dataset (even isolating N.C., this CSV file is still 12.6 MB). But if you start with the full data set, this is the code to export the N.C. data as a new table:

/*create table for just NC*/
/*CRDC_NC is the name I chose for my table, you can call it what you’d like*/
CREATE TABLE CRDC_NC AS
SELECT *
FROM CRDC2013_14
WHERE LEA_STATE = 'NC'

Once you’ve imported the CSV file into the database management system of your choice (we’re using SQLite DB Manager), you can start querying the dataset. There are quite a few fields in this dataset, so what I do is open up the field list in a Google Sheet to keep for reference beside my DB Manager.

Let’s say, for instance, that you wanted to see whether black students who received more than one out-of-school suspension are overrepresented. In order to determine this, we’ll need to figure out four ratios for students in preschool:

And then compare that to:

Then we’ll have to figure out how that translates into the fields in our dataset. In this case, we have all of the data points broken down by race/ethnicity and gender, so we’ll have to add the male and female gender fields. For the first comparison, the fields we’re going to use are:

And compare that to:

We’re also going to want some comparison numbers to make sure that our calculations are relatively in the correct range. Here we have an analysis of the same dataset from 2011-2012, and here we have some data from N.C. Health and Human Services. From these datasets, we know:

In 2011-12, black children represented 18 percent of preschool enrollment, but 48 percent of preschool children receiving more than one out-of-school suspension.

In 2011-12, white children represented 43 percent of preschool enrollment but 26 percent of preschool children receiving more than one out-of-school suspension.

The total number of regulated child care facilities in N.C. in 2013 was 252,434.

Now, we’re ready to start our SQLite queries. If you have never been exposed to SQL before, go to this W3 schools guide. These are the relevant pages for this post:

We’ll start by determining the total number of students enrolled in preschool.

NOTE: You’ll notice in all of my examples that I’ve used a CAST() function (SQL functions are capitalized by convention, but not rule. It will work the same whether you use “CAST()” or “cast()” but uppercasing the functions just makes them easier to read.) — this is because my particular database manager imported all of my fields as text. In order to be able to compare my fields numerically, I had to cast them as integers. The format for a CAST() function is as follows:

CAST(expression AS type-name)

If your fields were imported with the correct data types, you do not need to do a cast! In that case, your query would look like this:

/*DENOM: total students enrolled in preschool*/
SELECT SUM(TOT_PSENR_M + TOT_PSENR_F)
FROM CRDC_NC
WHERE TOT_PSENR_M > 0
OR TOT_PSENR_F > 0

However, my query includes the CAST() functions:

/*DENOM: total students enrolled in preschool*/
SELECT CAST(SUM(TOT_PSENR_M + TOT_PSENR_F) AS INT)
FROM CRDC_NC
WHERE CAST(TOT_PSENR_M AS INT) > 0 OR
CAST(TOT_PSENR_F AS INT) > 0

This query will return this to our console area in DB Browser for SQLite:

Now we can calculate our second denominator: the total number of students who have received more than one out-of-school suspension.

/*total preschool students receiving > 1 out of school suspension*/
SELECT SUM(TOT_PSDISC_MULTOOS_M + TOT_PSDISC_MULTOOS_F)
FROM CRDC_NC
WHERE CAST(TOT_PSDISC_MULTOOS_M AS INT) > 0 OR
CAST(TOT_PSDISC_MULTOOS_F AS INT) > 0

This query returns 82 students.

Now that we have our denominators, we can find our numerators. Notice in these queries the use of OR in place of AND. Using AND will remove some of our results by narrowing the query, especially when we get to the discipline queries, so be careful.

The queries are similar to the one above (again, you can remove the CAST functions if you don’t need them):

/*NUMER1: total black students enrolled in preschool*/
SELECT CAST(SUM(SCH_PSENR_BL_M + SCH_PSENR_BL_F) AS INT)
FROM CRDC_NC
WHERE CAST(SCH_PSENR_BL_M AS INT) > 0 OR
CAST(SCH_PSENR_BL_F AS INT) > 0

/*NUMER2: total white students enrolled in preschool*/
SELECT CAST(SUM(SCH_PSENR_WH_M + SCH_PSENR_BL_F) AS INT)
FROM CRDC_NC
WHERE CAST(SCH_PSENR_WH_M AS INT) > 0 OR
CAST(SCH_PSENR_WH_F AS INT) > 0

/*NUMER3: total black preschool students receiving >1 out of school suspension*/
SELECT CAST(SUM(SCH_PSDISC_MULTOOS_BL_M + SCH_PSDISC_MULTOOS_BL_F) AS INT) AS TOTAL
FROM CRDC_NC
WHERE CAST(SCH_PSDISC_MULTOOS_BL_M AS INT) > 0 OR
CAST(SCH_PSDISC_MULTOOS_BL_F AS INT) > 0

/*NUMER4: total white preschool students receiving >1 out of school suspension*/
SELECT CAST(SUM(SCH_PSDISC_MULTOOS_WH_M + SCH_PSDISC_MULTOOS_WH_F) AS INT) AS TOTAL
FROM CRDC_NC
WHERE CAST(SCH_PSDISC_MULTOOS_WH_M AS INT) > 0 OR
CAST(SCH_PSDISC_MULTOOS_WH_F AS INT) > 0

For NUMER1 we get 6950, for NUMER2 we get 7850, for NUMER3 we get 52, and for NUMER4 we get 18. Note that these NUMER1+NUMER2 and NUMER3+NUMER4 will not summate to their respective denominators because we are lacking the numbers for other racial/ethnic groups.

Now that we have all of our numerators and denominators, we can write them down and calculate our ratios.

So we have determined:

In 2013-14, black children represented 29 percent of preschool enrollment, but 63 percent of preschool children receiving more than one out-of-school suspension.

In 2013-14, white children represented 32 percent of preschool enrollment but 22 percent of preschool children receiving more than one out-of-school suspension.