Thursday, June 26, 2014

In a column last week, Greg Mankiw said "According to a recent study, if your income is at the 98th percentile of the income distribution — that is, you earn more than 98 percent of the population — the best guess is that your children, when they are adults, will be in the 65th percentile." Of course, you wouldn't expect those children to do as well as their parents--there's not much room to rise and a lot of room to fall--but that was a bigger decline than I would have expected, so I decided to take a closer look. The study refers to people born in 1980-82, and "when they are adults" means age 30. Thirty-year-olds generally earn less than middle-aged people, but the authors of the study say that relative positions have pretty much stabilized by then--that is, we'd see about the same pattern if we came back 20 years later. Here is the pattern for people whose parents were in the 60th percentile.

The large number of people in the second percentile (actually about 6% of all 30-year-olds) had zero income. That makes the figure hard to read, so here it is again, showing just the lower part of the y-axis.

The most common destination is in the low 70s. The chances of rising above that level drop off pretty sharply. But overall, the differences are pretty small: you could say that people from the 60th percentile are about equally likely to end up at any point in the distribution.

Here is the 80th percentile. It's a similar basic pattern, although the chances of winding up near the top are higher and the chances of ending near the bottom are lower.

Here are people whose parents were in the 98th percentile. This looks different--the higher the ranking, the better your chances of getting there. The most likely destination is the 99th percentile--its bar is even higher than the spike at zero earnings.

I haven't calculated the mean percentile--I'll look at this more later--but this seems to give a different picture than Mankiw's summary.

Friday, June 20, 2014

I sent a letter to the editors of the Proceedings of the National Academy of Sciences summarizing the point I made in my post of June 12. They declined to publish it (I hope that's because they had already accepted one or more letters making the same point), but I have posted it on my UConn web page.

It occurred to me that the hurricane study omitted the two hurricanes that caused the largest number of deaths (Audrey in 1957 and Katrina in 2005) because the models couldn't fit them--basically, they had too many deaths to be plausible under the distributions that they used. But both hurricanes had female names, so they should be counted as some kind of favorable evidence for their hypothesis. What sort of models could accommodate all of the hurricanes? There are two reasonable approaches:

1. Sophisticated: The Cox proportional hazards model is widely used for duration data--time until some event. Count data is like duration data in the sense that there is only one possible direction of change--just as a person who's turned 90 can't go back and die at 89, a hurricane that's killed 90 can't go back and wind up killing only 89. So the Cox model can reasonably be applied to count data, although I don't think I've ever seen that done. The model is useful because it makes minimal assumptions about the distribution--it essentially just tries to predict which cases will rank higher than others. If you add Katrina and Audrey to the data set (I scored them both as highly feminine names), the estimated effect of feminine name is .031 with a standard error of .032, which is nowhere near statistical significance.
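To make the idea concrete, here is a minimal sketch of the partial-likelihood calculation, using made-up numbers (the death counts and femininity scores below are illustrative, not the actual data set). A real analysis would use a packaged Cox routine with one of the standard tie corrections, since count data have many ties.

```python
import math

# Hypothetical data: "durations" are death counts, and femin is a made-up
# femininity score (1 = most masculine, 11 = most feminine).
deaths = [0, 2, 5, 20, 90, 200]
femin = [1.0, 9.0, 3.0, 8.0, 10.0, 2.0]

def cox_partial_loglik(beta, t, x):
    """Cox partial log-likelihood: each case's contribution compares its
    covariate value to the "risk set" of cases with counts at least as
    large--exactly the ranking idea described above."""
    ll = 0.0
    for i, ti in enumerate(t):
        risk = [x[j] for j in range(len(t)) if t[j] >= ti]
        ll += beta * x[i] - math.log(sum(math.exp(beta * xj) for xj in risk))
    return ll
```

At beta = 0 each case simply contributes minus the log of its risk-set size; fitting the model amounts to maximizing this function over beta.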

2. Simple: take the logarithm of (deaths+0.25). You need to add the small constant because many hurricanes caused zero deaths and the logarithm of zero is undefined; the exact value doesn't matter much. Then do an ordinary linear regression with the log of (deaths+0.25) as the dependent variable and the log of damage and hurricane name as the predictors. The estimated effect of a feminine name is .024 with a standard error of .043, again nowhere near significance. The residuals from the model are approximately normally distributed, meaning that the estimate and standard error are trustworthy.
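A sketch of this simple approach, using simulated stand-ins for the real hurricane file (the generating equation and variable names below are made up for illustration; with the real data you would read deaths, damage, and the name score from the published spreadsheet):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 92 storms (the size of the actual data set), with log
# damage, a femininity score, and death counts that include many zeros.
n = 92
log_damage = rng.normal(7.0, 2.0, n)
femin = rng.uniform(1.0, 11.0, n)
deaths = np.floor(np.exp(0.58 * log_damage - 1.9 + rng.normal(0.0, 1.0, n)))

# Ordinary least squares of log(deaths + 0.25) on log damage and name score
y = np.log(deaths + 0.25)
X = np.column_stack([np.ones(n), log_damage, femin])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Since the data were generated with no name effect, the coefficient on the femininity score should come out small relative to its standard error.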

The interaction between name and damage is not statistically significant either way.

I thought it might be useful to show why the issue matters so much in this case. Suppose you fit a model with just one independent variable, the logarithm of damage. This is a reasonable thing to do because neither masculinity/femininity of name nor minimum pressure is statistically significant if you include them. The equation is:
predicted log(deaths)= -1.909+.580*log(damage)

What is being predicted in a negative binomial regression is not the number of deaths, but the logarithm of the expected number of deaths. So to translate it into the number of deaths:
deaths=.148*(damage**0.58) [e**-1.909=.148]
Fortunately, you don't even have to do the algebra--you just need to know that exp(log(x))=x, and give a command like pred=exp(-1.909+.580*ldam) and your statistics program will calculate the predicted number. Then you can see how predicted values are related to damage.
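For instance, a minimal back-transformation function (assuming the fitted coefficients above):

```python
import math

def predicted_deaths(damage):
    """Back-transform the fitted equation
    log(deaths) = -1.909 + 0.580 * log(damage),
    which is the same as deaths = 0.148 * damage**0.580."""
    return math.exp(-1.909 + 0.580 * math.log(damage))
```

At damage = 1, this returns exp(-1.909), about 0.148, matching the constant in the hand-derived form.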

The general numbers are reasonable. The highest predicted number of deaths is about 100; the highest actual number in the data set was about 200, but of course some hurricanes are going to cause more deaths than predicted.

Calculating predicted values for the model with normalized damage is more complicated, because minimum pressure has a statistically significant effect. But if you set minimum pressure and masculinity-femininity of names at their means, you get
predicted log(deaths)=1.95+.0000809*damage
or
deaths=7.02 e**(.0000809*damage)

The resulting figure:

This is hard to read, since the predicted values for the few largest hurricanes are so much bigger than the predicted values for all the others--and also a lot bigger than the actual numbers of deaths. If you use a log scale for both axes, you get:

Here you can see another odd thing. The predicted values barely increase as you go from the hurricanes that did the least damage to the medium ones. As a result, the predicted deaths for the hurricanes that did the least damage are almost all too high--in fact, the 19 hurricanes that did the least property damage all caused fewer deaths than they were "supposed" to. In contrast, most medium hurricanes caused more deaths than predicted, and the few largest hurricanes caused fewer deaths than predicted.
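To see numerically why the exponential-in-dollars form behaves this way, here is a quick comparison of the two fitted equations (the damage values are hypothetical; I'm assuming damage is in the paper's normalized-dollar units, where the largest storms run into the tens of thousands):

```python
import math

def pred_exp(damage):
    # Exponential-in-dollars form: deaths = 7.02 * e**(.0000809 * damage)
    return 7.02 * math.exp(0.0000809 * damage)

def pred_log(damage):
    # Log-damage form from the simpler model: deaths = 0.148 * damage**0.58
    return 0.148 * damage ** 0.580

# Small, medium, and very large storms (illustrative values)
for damage in (100.0, 5000.0, 75000.0):
    print(damage, round(pred_exp(damage), 1), round(pred_log(damage), 1))
```

The exponential form barely moves between the small and medium storms and then explodes for the largest ones, which is the pattern described above.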

Damage is such an important predictor of deaths that it's not enough to sort of control for it--you need to control for it correctly. If you don't do that, nothing you do from that point on will give you the right results.

Thursday, June 12, 2014

I was looking at Andrew Gelman's blog yesterday and saw a post on a study saying that hurricanes with female names caused more deaths than hurricanes with male names. The study came out a couple of weeks ago; I seem to recall hearing some news reports and assuming there was probably something wrong with it, but I didn't give it any more thought. This morning I looked at the New York Times and saw that Nicholas Kristof gave about half of his column to uncritically recounting the claims of the study. Then I looked a little more and saw that there had been a lot of news coverage, and a lot of critical commentary. But the criticisms seemed sort of peripheral, or raised questions without really identifying a specific flaw. So I read the original paper, downloaded the data, and did some analysis. Their claim does not stand up, and here is my attempt to explain why not. It's possible that someone beat me to it (in fact, I hope someone did, since the problem was so basic), but given the nature of the internet the more places it appears the better.

A statistical model uses one or more variables to predict the value of another variable--here, the total deaths resulting from the hurricane. The "deviance" is a measure of how much of the variation in deaths is left unexplained. So the goal is to get a small deviance using a small number of predictors.

Here are the deviance and number of predictors in two of their models:

The models using the logarithm rather than the original variable had much lower deviance. Adding the two interactions to the model with the logarithm reduced the deviance by 2.2, but the usual standard is that adding two predictors has to reduce the deviance by at least 6 to qualify as evidence that there's anything there (ie a reduction of less than 6 is not "statistically significant"). So the best model has a deviance of 97.5 and three predictors. In that model, the estimated effect of the "femaleness" of the name (which they treat as a matter of degree) is .024, with a standard error of .036, which is not statistically significant, or even close to it.
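The "at least 6" figure is just the 5% critical value of a chi-square distribution with 2 degrees of freedom, which can be computed in closed form because a chi-square with 2 df is an exponential distribution with mean 2:

```python
import math

# 95th percentile of chi-square(2 df): solve 1 - exp(-x/2) = 0.95 for x
critical = -2 * math.log(0.05)
print(critical)  # about 5.99
```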

So the flaw was that they controlled for the dollar value of damage when they should have controlled for the logarithm of damage. With the right control, there is no evidence that the gender of the name makes any difference.

Notes: 1. The paper and data were published in the Proceedings of the National Academy of Sciences.

Wednesday, June 11, 2014

After his upset victory over Eric Cantor in the Republican primary, David Brat (currently a professor of economics) was asked about whether he supported raising the minimum wage. He replied "I don't have a well-crafted response on that." Justin Wolfers says that "admitting to being uncertain on an issue . . . [is] a mark of intellectual honesty": "assessing the evidence on the effects of the minimum wage is a tricky business" and economists are divided on the issue.

But Brat didn't say he was a typical economist--he said he was "a free-market guy," and "anything that distorts the free market, I'm against." He also didn't say that he was uncertain, but that he "didn't have a well-crafted response."

His dilemma is that if you start with the position that a free market is best, there's a strong case against having any minimum wage; however, there's also strong public support for the minimum wage. Many surveys going back to the 1940s have asked whether the minimum wage should be increased, and almost all have found solid majorities in favor. I wondered if any had asked if there should be a minimum wage at all. I found only one, back in 1948. It asked "Do you think we should have national laws... providing that employers all over the country have to pay a certain minimum wage to employees... or do you think we should have such laws but they should be handled by the states, or do you think we shouldn't have any laws of this kind at all?" 46% said national, 31% said state, and only 13% said none (11% didn't know). Then I looked for surveys that gave people the option of saying that the minimum wage should be reduced. There was one from 1987 that asked if the minimum wage was too high, too low, or about right. About two percent said it was too high (there were two different forms of the question--in one, 1% thought it was too high, in the other 3% did).

Since 1995, the following question has been asked pretty frequently: "Do you think the federal government [has become so large and powerful that it] poses an immediate threat to the rights and freedoms of ordinary citizens, or not?" Sometimes the words in brackets were included. A graph of the percent agreeing:

The red line is questions including the "large and powerful" language. Agreement is consistently higher when those words are added, but either way there seems to have been a decline from the mid-1990s to about 2000, and then a rise. Unfortunately the few questions from the 1960s and 1970s are worded differently, so it's hard to come to any definite conclusions about long-term trends.

Wednesday, June 4, 2014

As discussed in my last post, the Southern states have moved towards the Republicans over the last century. But there has been a lot of movement even when you set that aside. After taking out the shifts associated with the South, I followed the same strategy of estimating an election score and a state score. The product of those two predicts Democratic share of the two-party vote (relative to the national average), for each state in each election. The five highest state scores are Utah, Idaho, Wyoming, Arizona and Nebraska. The five lowest (ie most negative) are Vermont, Alabama, Massachusetts, Maine, and Rhode Island. The complete rankings appear at the end of this post. These scores can be understood in conjunction with the election scores shown in this figure:

Basically, positive scores mean states that have moved towards the Republicans, and negative ones mean states that have moved towards the Democrats. The general identity of the states at the extremes isn't surprising (the negative score for Alabama occurs because it had a pretty substantial Republican vote in the 1920s and 1930s, unlike other deep South states--in effect, it hasn't moved as far towards the Republicans because it didn't have as far to move). What did surprise me is how steady the shift has been. In fact, the method was not designed to produce a trend, just to show variation. Only one election stands out as unusual (1948), and it seems to have been simply an exception rather than the beginning of something. Aside from 1948, there hasn't been much election-to-election variation. That suggests that the roots of the change are more "sociological" (ie related to long-term social or economic trends) than "political" (ie related to party ideologies or election issues).

Tuesday, June 3, 2014

I have had several posts observing that the identity of "red" and "blue" states has changed a lot over the years. Of course, every state is unique, but I have been trying to reduce the changes to a small number of basic patterns. An obvious one is that the South has moved towards the Republicans. However, that raises the question of what counts as the South. I estimated a model in which the Democratic share of the vote in each state relative to the Democratic share nationally for the presidential elections between 1916 and 2012 was the product of an election score (how Democratic the South was) and a state score (how "Southern" a state is). The eleven Confederate states and six border states (Oklahoma, Missouri, Kentucky, W. Virginia, Maryland, and Delaware) were allowed to have their own individual scores--all other states were put in one group.
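In matrix terms, this election-score-times-state-score model is a rank-1 approximation of the state-by-election matrix, and the least-squares version can be sketched with a singular value decomposition. The matrix below is simulated (the state and election scores generating it are made up), since I'm only illustrating the method:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stand-in for the real data: 17 individually-scored states
# (11 Confederate + 6 border) by 25 elections (1916-2012). Entries are
# Democratic share relative to the national average, generated from a
# known rank-1 structure plus a little noise.
state_true = rng.normal(0.0, 1.0, 17)
elect_true = np.linspace(0.3, -0.3, 25)
votes = np.outer(state_true, elect_true) + rng.normal(0.0, 0.01, (17, 25))

# The best least-squares rank-1 fit is the leading SVD component:
u, s, vt = np.linalg.svd(votes)
state_score = u[:, 0] * np.sqrt(s[0])
elect_score = vt[0] * np.sqrt(s[0])
fit = np.outer(state_score, elect_score)
```

Only the product of the two scores is identified--flipping the sign of both leaves the fit unchanged--so which direction counts as "more Southern" or "more Democratic" is a normalization choice.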

Here is a histogram of the state scores, which are pretty much as you would expect.

Here is a plot of the election scores by year:

Of course, the basic trend is that the South has moved towards the Republicans over the century. The more interesting thing is that the movement was pretty much complete by 1972. The South moved back a bit towards the Democrats in 1976 and 1980 (when Jimmy Carter was running), but apart from that there's been little variation since 1972. I suspect that Southern whites have moved a bit more towards the Republicans during that time, but that's been offset by increased turnout among blacks.

About Me

I am a professor of sociology at the University of Connecticut, and editor of the journal Comparative Sociology. My book, Hypothesis Testing and Model Selection in the Social Sciences, was published by The Guilford Press in April 2016.