Sports statistics create a great opportunity to measure the relationship between productivity and income. The data is much more detailed than that typically available to economists. The basketball data set collected by Kahn and Sherer is very rich and allows us to test a number of hypotheses.

Suppose that we want to find out the role of race in determining salaries. A simple-minded way of doing this is to run the following regression:

ls SAL c RACE

where RACE is 1 if white; 0 otherwise.

The results suggest that there is no discrimination against black basketball players since the coefficient of RACE is negative, implying that whites make less than blacks. (Please note that I sometimes use black and white as shorthand for the preferred African-American and European-American.) While simple income comparisons (between ethnic backgrounds or genders) are commonly done, they are methodologically wrong, since one needs to control for productivity. In this case, productivity means how many baskets and rebounds each player makes. The work by Kahn and Sherer provides guidelines on proper econometric methodology.
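To see why a regression on a single dummy variable is nothing more than a comparison of group means, here is a minimal Python sketch. The salary figures are made up for illustration and are not from the Kahn and Sherer data; the helper name `simple_ols` is hypothetical.

```python
def simple_ols(x, y):
    """Return (intercept, slope) from least squares of y on a constant and x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical salaries (in $1000s); RACE is 1 if white, 0 otherwise.
race = [1, 1, 0, 0, 0]
sal = [300, 340, 350, 400, 450]

intercept, slope = simple_ols(race, sal)

# Group means computed directly, without any regression.
mean_white = sum(s for r, s in zip(race, sal) if r == 1) / race.count(1)
mean_black = sum(s for r, s in zip(race, sal) if r == 0) / race.count(0)

print("intercept:", intercept, "mean of RACE = 0 group:", mean_black)
print("slope:", slope, "difference in means:", mean_white - mean_black)
```

The slope on RACE equals the difference in average salaries and the intercept equals the average for the RACE = 0 group, so the regression says nothing about productivity.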

The Kahn and Sherer article, like most of the articles chosen for study in this course, is an exemplary model of research. Its results are convincing for a variety of reasons: (1) There is not one, but several related studies employing different data, all of which confirm in different ways the basic ideas. (2) The authors undertook various formulations of the econometric model and the effect of RACE was robust to the alternative formulations. (3) the authors have chosen a good data set -- the performance variables are relatively close to the ideal. (4) the authors are aware of the possible biases inherent in the data and account for them.

The purpose of this course is to get you to think for yourself and develop critical understanding. You will not just replicate someone else's work (including mine). In this spirit, one should always critically assess others' work and try to improve on it. With regard to Kahn and Sherer's study, I believe that there is room for improvement in their choice of variables. In choosing variables one should think carefully. One does not just throw in variables which seem to make sense. One chooses the formulation that makes the most sense. Furthermore one needs to carefully consider the data.

I start with the last point first. In this study income is a function of performance. If we do not include bonuses for playoff games, then income does not depend on this year's performance but rather on previous years' performance. That is, salary contracts are made before the start of the season and depend on previous years' performance with the preceding year's performance being most influential (unless there was a multi-year contract). Ideally we would have salary as a function of lagged performance. In this data set, we are given the total points over all seasons. Thus this data set implicitly assumes that performance is the same each year. Such an assumption is incorrect. But that is what we have to work with.

In this study Kahn and Sherer use logs so that, in the original formulation, the variables are multiplied. Suppose that one thought that salary (SAL) should be a function of total offensive rebounds (OFFREB) in a year. Then one might want to have either OFFREB per year as a summary or break it down into constituent parts OFFREB PER MINUTE * AVERAGE MINUTES PER GAME PLAYED* GAMES PER YEAR. The authors have these last two variables denoted by MINS and GAMES respectively, but they have OFFREB per game not per minute. Given MINS and GAMES, it makes more sense to have offensive rebounds per minute than per game.

Also note that POINTS is career points scored. It should be in the same units as OFFREB (either per game as the authors did or per minute as I have suggested).

I believe that the interesting variable is average minutes played by year, MINPYEAR, rather than its constituent parts, GAMES * MINS. Therefore MINPYEAR should be substituted since the constituent parts give no clue as to worth, and we should save on degrees of freedom when there is no cost in doing so.

Also, I think that the variables should be per minute rather than per game (then minutes instead of games) since per game conflates productivity per minute and number of minutes per game and the variable games may not vary as much as minutes played per game. Also the negatives are more meaningful per minute. Someone who plays only a few minutes per game will have fewer fouls per game than someone who plays a lot of minutes per game; a measurement of fouls per game would make it look like the more fouls, the higher the pay.
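The point about fouls can be illustrated with a small Python sketch. The two players and their season totals are hypothetical, chosen only to show how the per game and per minute measures can rank players in opposite order.

```python
# Hypothetical season totals for two players; not taken from the data set.
players = {
    "starter": {"fouls": 240, "minutes": 2800, "games": 80},
    "reserve": {"fouls": 120, "minutes": 800, "games": 80},
}

for name, p in players.items():
    # Per game conflates foul-proneness with playing time.
    p["fouls_per_game"] = p["fouls"] / p["games"]
    # Per minute isolates foul-proneness.
    p["fouls_per_minute"] = p["fouls"] / p["minutes"]
    print(name, p["fouls_per_game"], round(p["fouls_per_minute"], 3))
```

The starter commits more fouls per game only because he is on the court longer; per minute he fouls less often than the reserve.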

We want to capture the negatives and one of the negatives is missing shots. The authors use career field goal percentages (fraction made) but this is already embodied in total points. Again one might want to think of this as a formula. Instead of total field goal points, the authors should have used field goal points attempted per minute times field goal percentage. But better yet, instead of having FTPCT and FGPCT the authors should have had FTMISSED and FGMISSED (field goals missed per minute and free throws missed per minute). Once again, the negatives are in the same unit of account as the positives.

I am somewhat skeptical about the use of CENTER and FORWARD. If players in these positions are better, they should be captured in the other variables such as OFFREB or ASSISTS. To also include CENTER would then be double counting. I do not see CENTER and FORWARD as proxies for other unmeasured variables, but those who know more about basketball may disagree and want to include them. While the authors do not use height, some students wanted to include height because taller players would be more productive, other things being equal. However, we already have these measures of productivity (for example, rebounds) and therefore one should not include height.

STEALS and BLOCKS are such rare events that I doubt that they would add to someone's salary. Now they might be a proxy for other skills, but the rarity of observations suggests little confidence in the coefficients. I might be inclined to drop them from the equation.(1)

I would also be inclined to drop DRAFTNO since most of the other variables should be a good predictor of the number. If I were to keep it, it would be as a residual from the predicted DRAFTNO when the independent variables are the above productivity numbers (See section B2).
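The residual idea can be sketched in Python as follows. For brevity the sketch uses a single productivity measure rather than the full list, the numbers are invented, and `simple_ols` is a hypothetical helper.

```python
def simple_ols(x, y):
    """Return (intercept, slope) from least squares of y on a constant and x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical points per minute and draft numbers for five players.
pointpm = [0.30, 0.45, 0.50, 0.60, 0.65]
draftno = [40, 25, 30, 10, 5]

# Step 1: predict DRAFTNO from the productivity measure.
a, b = simple_ols(pointpm, draftno)

# Step 2: keep only the part of DRAFTNO that productivity does not explain.
draft_resid = [d - (a + b * p) for p, d in zip(pointpm, draftno)]

# By construction the residual is uncorrelated with the productivity measure.
mean_p = sum(pointpm) / len(pointpm)
cov = sum((p - mean_p) * r for p, r in zip(pointpm, draft_resid))
print(draft_resid, cov)
```

Because the residual is orthogonal to the productivity variable, including it in the salary equation would pick up only what scouts saw beyond the measured statistics.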

There are two kinds of approaches to econometrics--throwing everything into the soup (hoping that the econometrics will clarify the relationships) and carefully choosing the key ingredients (so that we know what we are eating intellectually). I prefer the latter approach. Hence I do not want both rebounds and height in my equations.

The authors also use several variables concerning the characteristics of the local area, including RACEMSA, POPSMA, INCSMA. I am skeptical that these variables would be relevant. My skepticism does depend on how I characterize the market for basketball players. I believe that players are in competition with one another. To illustrate, suppose that there are two black players of equal skill and one player plays in a heavily white city and the other in a heavily black city and fans are prejudiced in favor of their own race. The team owner in the heavily black city will not pay more for the black player since he could get the other black player from the white city for less. Hence racial bias will not appear as variations in black pay across cities.

Now theory is a good guide to setting up equations and choosing variables, but ultimately theory needs to be confronted with data. These variables could be left in and we could let the data show whether Wittman is right. My own taste is to not do this regarding the variables under discussion. In general, I like to limit the number of questionable variables thrown into the equation. If it is a central issue, then I will keep such variables in, even if questionable, since that is the question. Here I feel that these other variables are not as central to the question I am trying to answer (is there discrimination, not whether fans are the source of discrimination) and I will choose to not include them.

HOMEATT is also a questionable variable. If the players draw the crowds because of their personalities or whatever beyond the wins implied by WINPCT or scoring, then it may be OK. But it may have nothing to do with the present players, or it may already be embodied in the other variables, and therefore be useless. I would be inclined not to use it.

In a nutshell, SEASONS, GAMES, CENTER, FORWARD, FTPCT and FGPCT would be dropped, FGMISSED and FTMISSED would be added, and all measures of productivity would be per minute. I would also drop RACEMSA, POPSMA and INCSMA.

Using the variables we have identified in the last section (and dropping those that I found objectionable), a priori (before looking at the data) my choices of independent variables are:

POINTPM = (2*TLFGM + TRIPTM + TLFTM)/TLMINS

It is useful to consider the equation for POINTPM in greater detail. Total field goals made (TLFGM) includes both 2 pointers and 3 pointers, while free throws are worth 1 point. Therefore a triple pointer gets 2 points for being a field goal plus 1 point for being a triple pointer, which adds up to 3.

OFFREBPM = OFFREB / TLMINS

DEFREBPM = DEFREB / TLMINS

ASSISTPM = ASSISTS / TLMINS

PFOULSM = PFOULS / TLMINS

MISFGPM = (TLFGA - TLFGM) / TLMINS

MISFTPM = (TLFTA - TLFTM) / TLMINS

Note: All these variables can be generated by putting "genr" before the equation.
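For readers without the regression package at hand, the same definitions can be sketched in Python. The function name `per_minute_vars` and the career totals for the example player are hypothetical; the formulas are the ones listed above.

```python
def per_minute_vars(p):
    """Compute the per-minute variables from a dict of career totals."""
    mins = p["TLMINS"]
    if mins == 0:
        raise ValueError("TLMINS is zero; per-minute variables are undefined")
    return {
        # Each field goal counts 2, each triple adds 1 more, each free throw counts 1.
        "POINTPM": (2 * p["TLFGM"] + p["TRIPTM"] + p["TLFTM"]) / mins,
        "OFFREBPM": p["OFFREB"] / mins,
        "DEFREBPM": p["DEFREB"] / mins,
        "ASSISTPM": p["ASSISTS"] / mins,
        "PFOULSM": p["PFOULS"] / mins,
        "MISFGPM": (p["TLFGA"] - p["TLFGM"]) / mins,
        "MISFTPM": (p["TLFTA"] - p["TLFTM"]) / mins,
    }

# Hypothetical career totals for one player.
player = {
    "TLFGM": 1000, "TRIPTM": 50, "TLFGA": 2100,
    "TLFTM": 400, "TLFTA": 500,
    "OFFREB": 300, "DEFREB": 700, "ASSISTS": 600,
    "PFOULS": 500, "TLMINS": 10000,
}

v = per_minute_vars(player)

# Cross-check the points formula by counting 2-pointers and 3-pointers separately.
two_pointers = player["TLFGM"] - player["TRIPTM"]
points_check = 2 * two_pointers + 3 * player["TRIPTM"] + player["TLFTM"]
print(v["POINTPM"], points_check / player["TLMINS"])
```

The cross-check confirms that 2*TLFGM + TRIPTM + TLFTM gives the same total as counting two pointers, three pointers and free throws separately.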

Since SEASONS has a zero in it, we first must change the smpl to exclude that observation. Luckily, there were no values of TLMINS that were zero:

The R-square of .56 is very high for cross section, especially considering the fact that the independent variables are not the same type of thing as the dependent variable. If one ran consumption against income, both are in dollars and consumption is a large part of income so a high R-square would not be surprising. In time series money might be regressed against money lagged. Again a high R-square would not be surprising. But here the high results are not guaranteed by the formulation of the data.

The F-statistic, 31.6, is large and significant.

More importantly, almost all of the coefficients have the correct sign giving us considerable confidence in the results. The more points per minute, offensive rebounds per minute, defensive rebounds per minute, assists per minute and minutes played, the higher the salary; the more fouls per minute and missed field goals per minute, the lower the salary. The only wrong sign is associated with missed free throws. It should be negative, but it is positive although not at all significant (0.90 probability). According to these results, being white is worth an extra $108,078 a year. The result is very significant (0.003 as a one tail test). Also according to the results an extra point per minute is worth $1,332,533 (remember this is based on data for 1985-86, when salaries were considerably lower).

While the regression results are very supportive, one multiple regression is not conclusive. One should check whether the results are robust to alternative formulations, and other studies based on other data sets should be undertaken. I will now briefly discuss two alternative specifications based on the same data set.

In the regression just discussed, the independent variables had an additive effect. I chose this because I felt that points and rebounds are additive in their effect on salary, not multiplicative (although minutes and points per minute are clearly multiplicative). Also a linear equation is easier to interpret. However in many empirical studies, it is common to assume a multiplicative effect between the independent variables (equivalently, that the variables are additive in their logs). Therefore, I took logs of all the variables considered in the previous multiple regression. Note that WHITE = log(RACE + 1). This is because log(0) is undefined while log(1) = 0.
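A short Python sketch shows what the transformed dummy looks like and how its coefficient would be read. It assumes natural logs and a dependent variable of log(SAL); the coefficient value `b` is hypothetical and is not a result from the regression.

```python
import math

# WHITE = log(RACE + 1) takes only two values.
race_values = [0, 1]
white = [math.log(r + 1) for r in race_values]
print(white)  # 0 for blacks, log(2) for whites

# Hypothetical coefficient on WHITE in the log-log equation.
b = 0.2

# Moving RACE from 0 to 1 raises log(SAL) by b * log(2), so the
# white-to-black salary ratio, other things equal, is exp(b * log 2) = 2**b.
ratio = math.exp(b * math.log(2))
print(ratio, 2 ** b)
```

So with this coding the coefficient on WHITE is not itself the percentage differential; the implied salary ratio is 2 raised to the power of the coefficient.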

The regression results are a bit different than our earlier formulation. In general the coefficients are smaller, and the standard errors higher. LPFOULSM is only significant at the 10% level. However, in some ways the model suggests a better fit: the intercept is positive, the R-square is 0.6978, and LMISFTPM is negative. In any event, it remains true that whites again make more than blacks.(2)

One student suggested a totally different formulation. The measured variables may not capture the true productivity of a basketball player. Sports professionals may be able to better assess productivity than students doing a multiple regression. Therefore the student suggested an equation somewhat similar to the following:

SAL = A + B (TEAMSAL - SAL) + C ALLPRO/SEASONS + D DRAFTNO + E RACE

Because SAL is both the dependent and independent variable in this equation we must group SAL on the left of the equation:
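The rearrangement can be sketched from the equation above. This is only the algebra implied by that equation, assuming B is not equal to -1; it is not a reproduction of the printed output.

```latex
\begin{aligned}
\mathrm{SAL} &= A + B\,(\mathrm{TEAMSAL} - \mathrm{SAL}) + C\,\frac{\mathrm{ALLPRO}}{\mathrm{SEASONS}} + D\,\mathrm{DRAFTNO} + E\,\mathrm{RACE} \\
(1+B)\,\mathrm{SAL} &= A + B\,\mathrm{TEAMSAL} + C\,\frac{\mathrm{ALLPRO}}{\mathrm{SEASONS}} + D\,\mathrm{DRAFTNO} + E\,\mathrm{RACE} \\
\mathrm{SAL} &= \frac{A}{1+B} + \frac{B}{1+B}\,\mathrm{TEAMSAL} + \frac{C}{1+B}\,\frac{\mathrm{ALLPRO}}{\mathrm{SEASONS}} + \frac{D}{1+B}\,\mathrm{DRAFTNO} + \frac{E}{1+B}\,\mathrm{RACE}
\end{aligned}
```

As long as 1 + B is positive, each coefficient in the last line has the same sign as its counterpart in the first, so the sign predictions carry over.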

DRAFTNO should be negative since a higher DRAFTNO means a later pick. SAL is subtracted from TEAMSAL so SAL is not partially regressed against itself. Note that the sign of E depends on the racism of sportswriters and basketball scouts relative to the racism occurring in salaries. For example, suppose that sportswriters tended to choose whites for ALLPRO and that they overrated whites more than owners of teams overpaid whites. Then the coefficient of RACE would be negative since payment to whites would be less than thought justified by sportswriters (even though owners tended to slightly overpay white players). Still my a priori expectation is that the coefficient of RACE will be positive.

As can be seen, the coefficients are in the predicted direction, but the coefficient of RACE is insignificant (.357 as a one tail test). Once again the R square is quite high and the equation as a whole is very significant.

Note that before I ran the regression, I decided not to include ALLSTAR. This is because I felt that ALLSTAR and ALLPRO would be highly correlated, creating multicollinearity problems. The regression results, LS ALLPRO C ALLSTAR, suggest that I was right to be concerned.

One also needs to be aware of the potential biases that might arise when variables are only imperfect proxies. Consider the variable SAL -- 1985-1986 pro compensation. As the authors note, SAL does not include non-salary compensation such as bonuses. So what we might think of as yearly income may not be the same as the actual variable chosen. Suppose that SAL underestimates yearly income, that is, E[u] < 0. Then the assumptions justifying the use of least squares are violated and the least squares estimate of the intercept term is biased downwards from the true intercept. Suppose that the non-measured compensation is likely to be greater for whites (which the authors argue is the case, but their argument is not that compelling; there is also little reason to believe that the reverse is true). Then the least squares assumption regarding independence between the error term and the variable RACE does not hold, and the least squares estimate of the coefficient on RACE (1 for white) is biased downward from the true relationship. Now if bonuses are not correlated with RACE, then the estimated coefficient of RACE is not biased but its variance is larger than otherwise.
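The direction of these biases can be checked with a small deterministic Python sketch. All figures are invented: true yearly income, the size of the unmeasured bonuses, and the assumption that bonuses are larger for whites are hypothetical, and `simple_ols` is an illustrative helper.

```python
def simple_ols(x, y):
    """Return (intercept, slope) from least squares of y on a constant and x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

race = [1, 1, 1, 0, 0, 0]

# Hypothetical true yearly income (in $1000s), including bonuses.
income = [440, 450, 460, 390, 400, 410]

# Hypothetical unmeasured bonuses, assumed larger for whites.
bonus = [30 if r == 1 else 10 for r in race]

# Measured salary leaves the bonuses out.
sal = [y - b for y, b in zip(income, bonus)]

true_intercept, true_gap = simple_ols(race, income)
meas_intercept, meas_gap = simple_ols(race, sal)
print("true:", true_intercept, true_gap)
print("measured:", meas_intercept, meas_gap)
```

Under these assumptions both the intercept and the RACE coefficient estimated from measured salary fall below their true values, and the shortfall in the RACE coefficient equals the difference in average bonuses between the two groups.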

This may have two components: more white players and playing them more often than justified. This is the key to discrimination in competitive markets -- segregation. One set of firms discriminate and the other non-discriminatory firms gain by reverse discrimination.

The researcher needs to know economics in order to test for discrimination since salaries are part of labor markets. Also it is virtually impossible to test the a priori hypothesis of no discrimination (since statistical tests are designed to reject, not accept).

As stated in earlier lectures, one purpose of this course is to make you into unrelenting empiricists so that whenever you hear a "factual" statement you ask the following: (1) how in principle the statement could be tested if any data were freely available and (2) how the statement can actually be tested given existing data.

To illustrate from my own personal experience, when I saw the movie, "White Men Can't Jump," I immediately thought of some hypothetical tests. One could ask for a random sample of black and white men (or black and white pro basketball players) to jump and record how high their feet got off the ground or how high their hands reached (controlling for the person's height). But more exciting from the viewpoint of today's lecture, we have data to indirectly test the hypothesis.

Consider the following data:

genr REB = OFFREB + DEFREB

genr REBMIN = REB/TLMINS

Note that REBMIN is only an imperfect measure of jumping ability since getting rebounds also depends on being in the right place at the right time. In econometrics we often have to make use of imperfect proxies. On the other hand, some might say that part of being a good jumper is being at the right place at the right time.

Our a priori expectations are that the coefficient of RACE is negative and the coefficient of HEIGHT is positive. The results are very strong. Both coefficients have the right sign and are highly significant (0.003 and 0.0000, respectively). The R-square is 70%.
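The regression command itself is not reproduced in the text, but a regression of REBMIN on a constant, RACE and HEIGHT can be sketched in plain Python by solving the normal equations. The data below are hypothetical and were generated from coefficients chosen in advance (negative on RACE, positive on HEIGHT), so the sketch only checks that the arithmetic recovers them; it says nothing about the actual results. HEIGHT is assumed to be in inches, and the function name `ols` is illustrative.

```python
def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    a = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        # Eliminate this column from every other row.
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

# Hypothetical players: RACE (1 = white) and HEIGHT in inches.
race = [1, 0, 1, 0, 1, 0]
height = [76, 78, 80, 82, 84, 79]

# Hypothetical rebounds per minute built from known coefficients, no noise.
rebmin = [-0.2 + 0.005 * h - 0.02 * r for r, h in zip(race, height)]

X = [[1.0, r, h] for r, h in zip(race, height)]
const, b_race, b_height = ols(X, rebmin)
print(const, b_race, b_height)
```

With real data the fit would not be exact, and one would also want standard errors before speaking of significance.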

In the movie, the white player was not able to do a dunk shot but he was very good at shooting from a distance. Unfortunately, the data collected by Kahn and Sherer does not have statistics on dunk shots. However, other data may provide clues to jumping. Two point goals are shot close to the hoop, while 3 point goals and free throws are shot from farther away and are less likely to involve jumping.

In this formulation the coefficient of HEIGHT should be positive and the coefficient of RACE should be negative. The results are only mildly confirming. The signs are in the correct direction, but the levels of significance are 0.134 and 0.305. The R-square is 0.008.

There is no one correct way of defining variables and setting up equations. I have combined several variables into one dependent variable measure (TLS1). The above equation looks for comparative advantage, not absolute advantage (a black could be twice as good as a white player in two point field goals and three times as good in three pointers, and hence would look comparatively worse using the measure I have invented).

Another possibility is to control for overall basketball ability, perhaps measured by minutes played in a season. One might then use one of the two following equations:

genr TLS2 = TLMINS/SEASONS

ls TLFGM c TLS2 HEIGHT RACE

For a copy of the printout see the full (paper) copy

or,

genr TLS3 = TLFGM/TLFGA

For a copy of the printout see the full (paper) copy

On the other hand, the statement about white men not being able to jump may be a statement about basketball ability in general and measured on an absolute scale. In this way we would not want to control for ability in general since the statement would imply that blacks had a higher ability in general. Total points per minute might be regressed against RACE and height.

genr POINTS = 2*(TLFGM - TRIPTM) + 3*TRIPTM + TLFTM

genr POINTPM = POINTS/TLMINS

ls POINTPM c RACE HEIGHT

For a copy of the printout see the full (paper) copy

But here we know the answer already: since blacks make up 75% of the National Basketball Association players but only 11% of the population, blacks are on average better players than whites.

Which of these equations is best? Obviously, it depends on the question you are trying to ask. But one can also judge the question. The last equation is boring because we know the general answer already. Equation 1 answers the initial question most directly, but it is in the same spirit as equation 5. It is a judgment call, but my feeling is that equation 2 (where the dependent variable is TLS1) is best. It asks whether blacks play a different type of game than whites, not whether they are better. I think that this is a more interesting question with a more interesting answer since the answer is not so obvious. Equations 3 and 4 ask similar questions to 2, but not with such a direct and clear measure.

This data set also contains information about college performance. For example, CFGM stands for field goals made in college. One could predict draft number based on college performance (The better the college performance, the lower the draft number). Unfortunately, colleges play in different quality leagues so the numbers are not that meaningful (I do very well against my 8 year old). So if possible, one would want to have a proxy for quality of competition (FFOUR is a possibility).

Even with the rudimentary skills taught in this course, I believe that students are capable of producing publishable research (in secondary journals) if they ask the right questions. I know virtually nothing about statistical studies of sports, but I suspect that the following question has not been answered previously with econometric tools and if cleverly done, might be publishable: What is the relation between draft choice and eventual performance? A rudimentary stab at this question might look at the following equation:

genr POINTPS = POINTS/SEASONS

genr REBOUNDS = (OFFREB + DEFREB)/SEASONS

genr ASSISTPS = ASSISTS/SEASONS

ls DRAFTNO c POINTPS REBOUNDS ASSISTPS

For a copy of the printout see the full (paper) copy

A more sophisticated study and a better data set would account for the fact that some draft choices are no longer playing (a real bad choice if they were drafted recently). Alternatively, one might confine the study to the first 2 or 3 years after the draft. One should always be aware of missing data and how it might alter the observed empirical results.

Now that there are free agents, draft choice is not as important as in the past. One could test whether there is declining care in choice by seeing whether R squared has declined over time.

I do not want to spend a great deal of time on this issue. I just wanted to suggest that there are lots of questions that can be answered with the data sets provided in this course.

--total college player of the year awards plus times named to first or second All-America Team

CFGA      I4 (F4.0)   total college field goals attempted
CFGM      I4 (F4.0)   total college field goals made
CFTA      I3 (F3.0)   total college free throws attempted
CFTM      I3 (F3.0)   total college free throws made
CGAMES    I3 (F3.0)   total college games
CHAMP     I2 (F2.0)   number of pro championship teams played on
CMINS     I4 (F4.0)   total college minutes
CONF      I2 (F2.0)   field not used
CREB      I4 (F4.0)   total college rebounds
CSEA      I1 (F1.0)   total college seasons
CTRPA     I3 (F3.0)   total college three point goals attempted
CTRPM     I2 (F2.0)   total college three point goals made
DEFREB    I5 (F5.0)   total pro defensive rebounds
DISQUAL   I2 (F2.0)   number of times disqualified
DRAFTNO   I3 (F3.0)   college draft number
EARLY     I1 (F1.0)   dummy variable for leaving college early
FFOUR     I1 (F1.0)   number of trips to final four (college)
GPLAY     I3 (F3.0)   number pro playoff games played
HEIGHTI   I2 (F2.0)   inches to be added onto HEIGHTF
HEIGHTF   I1 (F1.0)   height in feet, e.g. 6 or 7
NOTCOL    I1 (F1.0)   dummy variable for not attending college
OFFREB    I4 (F4.0)   total pro offensive rebounds
PFOULS    I4 (F4.0)   total pro fouls committed
PLAYID    I3 (F3.0)   player ID number
POSITION  I1 (F1.0)   position (1 or 5 = center; 2, 4 or 7 = forward; 3 or 6 = guard)
PRODEF    I2 (F2.0)   number of times 1st or 2nd all-defensive team
RACE      I1 (F1.0)   race, 1 = white, 0 = black
SAL       I7 (F7.0)   1985-6 pro compensation
SEASONS   I2 (F2.0)   total pro seasons
STEALS    I4 (F4.0)   total pro steals
TEAM      I2 (F2.0)   NBA team (in alphabetical order: e.g. 1 = Atlanta, 2 = Boston, etc.)
TEAMCH    I2 (F2.0)   number of pro team changes
TLFGA     I5 (F5.0)   total pro field goals attempted
TLFGM     I5 (F5.0)   total pro field goals made
TLFTA     I5 (F5.0)   total pro free throws attempted
TLFTM     I5 (F5.0)   total pro free throws made
TLGAMES   I4 (F4.0)   total pro (NBA or ABA) games played
TLMINS    I5 (F5.0)   total pro minutes played
TRIPTA    I3 (F3.0)   total pro three point goals attempted
TRIPTM    I3 (F3.0)   total pro three point goals made
WEIGHT    I3 (F3.0)   weight in pounds
YPLAY     I2 (F2.0)   number of years in the pro playoffs

The following variables refer to the player's 1985-86 team

ARENA     I5 (F5.0)   arena capacity
COL83     F7.4        1983 SMSA cost of living index
HOMEAT    I6 (F6.0)   previous season's home attendance
INCOME    I5 (F5.0)   1983 SMSA per capita income in dollars
MAX       I4 (F4.2)   maximum ticket price in dollars
MIN       I4 (F4.2)   minimum ticket price in dollars
POPCIT    I5 (F5.1)   1980 city population (divided by 10000)
POPCMA    I5 (F5.1)   1980 Consolidated Metropolitan Area population (divided by 10000)
POPMSA    I5 (F5.1)   1980 Standard Metropolitan Statistical Area population (divided by 10000)
RACECIT   I3 (F3.1)   percent of 1980 population in the city that was black
RACECMA   I3 (F3.1)   percent of 1980 population in the Consolidated Metropolitan Area that was black
RACEMSA   I3 (F3.1)   percent of 1980 population in the Standard Metropolitan Statistical Area that was black
TEAMSAL   I8 (F8.0)   total team salary
TOTAT     I7 (F7.0)   previous season's total attendance (home plus away)
WINPCT    I3 (F3.3)   previous season's winning percentage

Notes:

(1) If a player only played a few minutes in a season, then our confidence in his output per minute variables would be reduced. In such a situation, weighted least squares should be used.

(2) The R-squares of the two equations cannot be directly compared since one is measuring percent explanation of the variation in SAL and the other percent explanation of the variation in LOG(SAL).