Gender Pay Gap Analysis

Lauren Renaud

December 15, 2015

Project Scope:

Is there a significant difference in income between men and women? Does the difference vary depending on other factors (e.g., education, profession, criminal history, marriage status, etc.)?

Data Summary

To analyze this question, we’ll use the National Longitudinal Survey of Youth, 1997 cohort data set. This dataset comprises about 9000 youths who were first interviewed in 1997 and then re-interviewed in the following years. The dataset seeks to produce a longitudinal study of respondents’ transition from their teenage to adult years. This gives us an opportunity to look at how income and gender intersect with other factors, particularly from the teenage years.

Note: The most recent survey data is from 2011. Any references to “last year” refer to the year prior to the survey, 2010.

First we should know the count of respondents, broken down by gender. We have 4385 females and 4599 males in the survey.

Next we’ll look at the mean income from last year, broken down by gender.

Gender    Income
female  29997.82
male    37911.57

We can already see that the average income from last year is lower for women than for men, by $7913.75; put another way, women make 79.13% of what men make, on average. We’ll go into more detail about the statistical significance of the difference later.
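The two gap figures follow directly from the group means. A minimal Python sketch of the arithmetic, using the means reported above (the variable names are illustrative and not taken from the original analysis):

```python
# Mean income from last year, as reported in the summary table.
mean_female = 29997.82
mean_male = 37911.57

# Absolute gap in dollars, and women's mean as a share of men's mean.
abs_gap = mean_male - mean_female
female_share_pct = 100 * mean_female / mean_male

print(f"absolute gap: ${abs_gap:.2f}")
print(f"women's mean as % of men's: {female_share_pct:.2f}%")
```

The same two formulas produce the absolute and percent columns in the later tables by race and industry.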

If we look at boxplots of income broken down by gender, we see that the interquartile range for women sits lower than that of men, in addition to the lower mean. The outliers for the top-earning women, however, reach the same level as the outliers for the top-earning men.

Note: At this point in the data summary we are excluding the top coded values. The rationale and further analysis regarding top coded values will be explained later in the report.

Now that we’ve looked at the data and observed a difference, we can run a t-test to find the statistical significance of this difference.

At a 95% confidence level, we find a p-value that rounds to 0, indicating that the difference in the means of male and female income is very unlikely to be attributable to random chance.
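The report does not say which variant of the t-test was run. As a rough illustration, here is a sketch of Welch's unequal-variance t-test written with only the Python standard library. The function name `welch_t_test`, the normal approximation to the p-value, and the sample incomes are all assumptions made for this example; they are not the survey data or the original code.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t_test(a, b):
    """Return (t, df, approx_p) for Welch's unequal-variance t-test."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb  # squared standard errors
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    # Normal approximation to the two-sided p-value; reasonable only when
    # both groups are large, as they are in this survey.
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, df, p


# Made-up incomes for illustration only; NOT the survey data.
female = [21000, 25000, 28000, 30000, 33000, 36000]
male = [26000, 31000, 35000, 39000, 42000, 48000]
t, df, p = welch_t_test(female, male)
print(f"t = {t:.3f}, df = {df:.1f}, approx p = {p:.4f}")
```

With thousands of respondents per group and a gap of this size, the test statistic is large enough that the p-value is smaller than the displayed precision, which is why it appears as 0.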

So, to begin, we can say that yes, there is a significant difference in income between men and women. We will now consider the impact of other factors.

Race

The mean income from last year, broken down by gender and race, gives us a starting point for exploring other factors that may contribute to the wage gap. This table displays average income by gender for each race in the survey, followed by the absolute difference and the percentage difference (women’s mean income as a percentage of men’s). We can see that the gender wage gap exists for all racial categories in the survey, though by varying amounts. Looking at the boxplots, there appears to be less of a difference between the income means by gender for Black respondents than for other races. There’s a large difference in the means for mixed race people, but there are also only 83 respondents coded as mixed race, making up only 0.92% of the survey. This low sample size makes it difficult to make inferences about this group.

Race        Female      Male  Abs Diff  % Diff
hispanic  26314.59  34099.99  7785.391   77.17
mixed     30814.29  38714.29  7900.000   79.59
white     30928.24  36671.09  5742.849   84.34
black     25493.25  28109.43  2616.181   90.69

Industry

Now we can also look at mean income by gender and industry. Again we see women making less, on average, than men across most of the categories.

The mean income for women is actually greater than that for men in acs special codes. As with the breakdown by race, though, it is worth noting that acs special codes accounts for only 10 respondents, which may mean that a single outlier or a small number of outliers is skewing this figure.

Industry                  Female      Male    Abs Diff  % Diff
mining                  29000.00  51600.00   22600.000   56.20
agr forest fish         21946.15  38904.76   16958.608   56.41
active military         30000.00  52684.21   22684.211   56.94
utilities               34725.56  51880.95   17155.397   66.93
construction            27704.76  34723.43    7018.664   79.79
retail trade            23575.64  29240.15    5664.508   80.63
other public services   23123.93  28496.95    5373.027   81.15
public admin            39580.64  47602.89    8022.247   83.15
transport warehouse     30202.03  36137.34    5935.314   83.58
entertain accom food    20168.33  23842.41    3674.081   84.59
fin insure real estate  35392.19  41608.47    6216.283   85.06
manufacturing           33301.85  37805.65    4503.798   88.09
edu health social       30128.16  33902.60    3774.441   88.87
professional            30122.45  32840.23    2717.785   91.72
wholesale trade         32056.84  34919.99    2863.149   91.80
info comm               37027.68  38044.50    1016.823   97.33
acs special codes       36500.00  21333.33  -15166.667  171.09

If we look at boxplots of the distribution of income by gender and industry, we can make some other important observations. It appears that the sample of female active duty military is very small, and if we look at the data we find it is actually a single respondent. The lower quartile for men in mining and utilities is above the upper quartile for women in those industries, while for agr forest fish the quartiles don’t even overlap. The differences seem smaller for professional, wholesale trade, entertain accom food, and edu health social.
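The quartile comparisons read off the boxplots can also be checked numerically. The sketch below is one way to do that with the Python standard library; the helper names `iqr_bounds` and `iqr_overlap` and the sample incomes are hypothetical, and plotting software may use a slightly different quartile definition than `statistics.quantiles`.

```python
from statistics import quantiles


def iqr_bounds(values):
    """Return (lower quartile, upper quartile) of a list of numbers."""
    q1, _median, q3 = quantiles(values, n=4)
    return q1, q3


def iqr_overlap(a, b):
    """True if the interquartile ranges of a and b overlap."""
    a_q1, a_q3 = iqr_bounds(a)
    b_q1, b_q3 = iqr_bounds(b)
    return a_q1 <= b_q3 and b_q1 <= a_q3


# Made-up incomes for one industry; NOT the survey data.
women = [18000, 20000, 22000, 24000, 26000, 28000, 30000]
men = [35000, 38000, 41000, 44000, 47000, 50000, 53000]
print(iqr_overlap(women, men))
```

A result of `False` corresponds to the situation described for mining and utilities, where the men's lower quartile lies above the women's upper quartile.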

Methodology

Missing Values

When bringing in and initially coding the data, I excluded missing values from numeric variables. We could make some assumptions about someone who, for example, did not know their income from last year, but it is very difficult to act on those assumptions when computing numeric values.

Unfortunately, we were missing last year’s income data for 40.98% of respondents. While that is a large percentage of our dataset, it still leaves 5302 respondents, which is a large sample size. The same goes for industry, where we were missing 31.37%, but still have 6166 answers to analyze.
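These missing-value percentages can be reproduced from the respondent counts given in the data summary. A small Python sketch of that check (the helper name `pct_missing` is illustrative):

```python
# Respondent counts reported in the data summary.
n_female, n_male = 4385, 4599
n_total = n_female + n_male  # 8984 respondents overall

# Respondents with a usable value for each variable, as reported above.
n_with_income = 5302
n_with_industry = 6166


def pct_missing(n_present, n_all):
    """Percentage of respondents without a usable value."""
    return 100 * (n_all - n_present) / n_all


print(f"income missing:   {pct_missing(n_with_income, n_total):.2f}%")
print(f"industry missing: {pct_missing(n_with_industry, n_total):.2f}%")
```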

This does limit the analysis, but for the most part the number of missing values was not too great.

For categorical variables, responses such as valid skip and non-interview were coded as NA, while in most cases refusal and don’t know were kept as their own categories. The refusal and don’t know values were ignored for variables where they made up a small share of responses, but were analyzed further where they made up a more significant proportion of responses.
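The recoding rule described above can be sketched as a small lookup. The negative codes below are an assumption based on the usual NLSY convention for non-substantive answers; the report does not list the codes actually used, and the function and label map are hypothetical.

```python
# Assumed NLSY-style negative codes for non-substantive answers; the exact
# codes used in the original analysis are not stated in this report.
REFUSAL, DONT_KNOW, VALID_SKIP, NON_INTERVIEW = -1, -2, -4, -5


def recode_categorical(code, labels):
    """Map a raw survey code to a category label, or None for NA."""
    if code in (VALID_SKIP, NON_INTERVIEW):
        return None  # treated as NA and dropped from the analysis
    if code == REFUSAL:
        return "refusal"
    if code == DONT_KNOW:
        return "don't know"
    return labels.get(code)  # codes without a label also become NA


# Hypothetical label map for a yes/no question.
labels = {0: "no", 1: "yes"}
raw = [1, 0, -1, -4, -2, -5, 1]
print([recode_categorical(c, labels) for c in raw])
```

Keeping refusal and don't know as explicit categories makes it possible to count them first and decide per variable whether they are frequent enough to analyze.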

Topcoded Values

For the most part, I removed topcoded values.

One instance where it made a difference was in looking at average income of men and women by industry. The averages by industry were displayed above in the data summary. The table below displays the industry, then mean female and male salary and the absolute difference and percent difference for means, all excluding topcoded values, followed by the absolute and percent differences if you include the topcoded values. The final column finds the difference in percentage points between the means that included the topcoded values and those that did not. This table is sorted by the final column.

Industry                  Female      Male  Excl. Diff  Excl. % Diff  Incl. Diff  Incl. % Diff  Differences
info comm               37027.68  38044.50    1016.823         97.33   -3979.351        106.22        -8.89
active military         30000.00  52684.21   22684.211         56.94   21433.626         61.83        -4.89
utilities               34725.56  51880.95   17155.397         66.93   11326.222         70.98        -4.05
fin insure real estate  35392.19  41608.47    6216.283         85.06    5987.251         85.22        -0.16
agr forest fish         21946.15  38904.76   16958.608         56.41   16958.608         56.41         0.00
acs special codes       36500.00  21333.33  -15166.667        171.09  -15166.667        171.09         0.00
wholesale trade         32056.84  34919.99    2863.149         91.80    3723.681         91.23         0.57
professional            30122.45  32840.23    2717.785         91.72    3534.962         90.79         0.93
transport warehouse     30202.03  36137.34    5935.314         83.58    6830.363         82.44         1.14
entertain accom food    20168.33  23842.41    3674.081         84.59    6803.854         82.92         1.67
manufacturing           33301.85  37805.65    4503.798         88.09    4229.323         86.13         1.96
edu health social       30128.16  33902.60    3774.441         88.87    4674.081         86.82         2.05
other public services   23123.93  28496.95    5373.027         81.15    6256.523         78.71         2.44
public admin            39580.64  47602.89    8022.247         83.15    9830.407         80.53         2.62
retail trade            23575.64  29240.15    5664.508         80.63    6005.371         77.06         3.57
mining                  29000.00  51600.00   22600.000         56.20   27350.100         52.31         3.89
construction            27704.76  34723.43    7018.664         79.79   14555.482         71.76         8.03
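The final column of this table is the percent figure that excludes topcoded values minus the percent figure that includes them, sorted in ascending order. A short Python sketch using four rows copied from the table (the variable names are illustrative):

```python
# (industry, % excluding topcoded values, % including topcoded values),
# copied from a few rows of the table above.
rows = [
    ("construction", 79.79, 71.76),
    ("info comm", 97.33, 106.22),
    ("agr forest fish", 56.41, 56.41),
    ("mining", 56.20, 52.31),
]

# Difference in percentage points, then sort ascending as in the table.
with_diff = sorted(
    ((name, round(excl - incl, 2)) for name, excl, incl in rows),
    key=lambda item: item[1],
)
for name, diff in with_diff:
    print(f"{name:<16} {diff:6.2f}")
```

A negative value means that including the topcoded values narrows the gap for that industry; a positive value means it widens the gap.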

Focusing only on the Differences column, we can see that some industries (info comm and construction in particular, at -8.89 and 8.03 percentage points respectively, followed by active military, utilities, and mining) change considerably depending on whether the topcoded values are included or excluded. The next question is how much these differences matter to our analysis.

Industry      Count  % Respondents
construction    430           4.79
info comm       149           1.66
mining           45           0.50
utilities        38           0.42

Construction workers make up 4.79% of the respondents, which is a fair amount, while each of the other industries accounts for only a small portion of our sample.

Unexpected Variables That Had No Connection & Other Relationships

I had expected the income difference by gender to vary with drug use, but it did not differ much.

I also thought there might be a difference in income by gender based on household income growing up: that wealthier households would set men up to be wealthier to a greater extent than women. However, it appears that greater household income as a teenager means greater income as an adult, while the difference by gender stays about steady, as seen in the graph below. To do this analysis I had to exclude some low, negative household income values that I think may have been erroneously entered.