Monday, April 21, 2014

If you've seen the new Captain America movie, you might have noticed that statistics (and data mining more generally) feature prominently in the film. I can't imagine a more remarkable shift in the perception of statistics, which has historically been described as "dull" or "boring" (a view that is at odds -- pun intended -- with that of any practicing statistician, past or present). In fact, in a 1998 talk the statistician C.F. Jeff Wu even argued that "statistics" should be replaced with the phrase "data science," in part to remove the negative connotations associated with data analysis and statistical theory!

Yet now more and more people are realizing that statistics is "hot," as exemplified in the following clip, in which Scarlett Johansson suggests that the superhero Captain America go on a date with -- yes! -- a statistician:

And of course, if a movie trailer isn't enough to convince you that the public perception of statistics has been shifting, I refer you to the Chief Economist for Google, Hal Varian, who has been saying (correctly) for years that statistics is the "sexy" dream job of the 2010s:

# referred to as "my.py"), then you can run it in R as follows:
system("python C:\\Users\\Name\\Desktop\\my.py")
# or alternatively (the r prefix keeps Python from reading the backslashes as escapes):
system('python -c "import sys; sys.path.append(r\'C:\\Users\\Name\\Desktop\'); import my;"')

Wednesday, April 09, 2014

For several weeks I've been working on examining tweets in R. Here's one approach to analyzing Twitter feeds:

# see how many unique Twitter accounts are in the sample
length(unique(df$screenName))

# Create a new column of random numbers in place of the usernames and redraw the plots
# find out how many random numbers we need
n <- ...
# generate a vector of random numbers to replace the names (four digits just for convenience)
randuser <- ...
# match up a random number to a username
screenName <- ...
randuser <- ...
# Now merge the random numbers with the rest of the Twitter data, and match up the correct
# random numbers with multiple instances of the usernames:
rand.df <- ...
# determine the frequency of tweets per account
counts <- ...
# create an ordered data frame for further manipulation and plotting
countsSort <- ... count = sort(counts, decreasing = TRUE), row.names = NULL)

# create a subset of those who tweeted at least 5 times or more
countsSortSubset <- ... 0)

## extract counts of how many tweets from each account were retweeted
# (1) clean the twitter messages by removing odd characters
rand.df$text <- ...
# (2) remove @ symbol from user names
trim <- ...
# (3) pull out who the message is to
rand.df$to <- ...
# (4) extract who has been retweeted
rand.df$rt <- ... trim(str_match(tweet,"^RT (@[[:alnum:]_]*)")[2]))
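Because parts of the snippet above did not survive, here is a minimal, self-contained sketch of the same three ideas (anonymizing usernames, counting tweets per account, and pulling out who was retweeted). It is not the original code: the toy data frame, the helper extract_rt, and the use of base R's regexec in place of str_match are all assumptions made for illustration.

```r
# Hypothetical toy data standing in for the real Twitter data frame `df`
df <- data.frame(
  screenName = c("alice", "bob", "alice", "carol", "bob", "alice"),
  text = c("RT @bob: nice plot", "@alice thanks!", "new post on R",
           "RT @alice: new post on R", "RT @alice_2: hello", "plain tweet"),
  stringsAsFactors = FALSE
)

# Draw one distinct four-digit ID per account
# (sampling without replacement avoids two accounts sharing an ID)
set.seed(1)
users <- unique(df$screenName)
ids <- data.frame(screenName = users,
                  randuser = sample(1000:9999, length(users)),
                  stringsAsFactors = FALSE)

# Attach the ID to every tweet from the matching account
rand.df <- merge(ids, df, by = "screenName")

# Tweets per anonymized account, most active first
counts <- sort(table(rand.df$randuser), decreasing = TRUE)
countsSort <- data.frame(user = names(counts), count = as.vector(counts))

# Hypothetical helper: return the retweeted handle (without "@"), or NA
extract_rt <- function(tweet) {
  m <- regmatches(tweet, regexec("^RT @([[:alnum:]_]*)", tweet))[[1]]
  if (length(m) == 2) m[2] else NA_character_
}
rand.df$rt <- vapply(rand.df$text, extract_rt, character(1), USE.NAMES = FALSE)
```

Sampling the IDs without replacement is a deliberate choice here: independently drawn random numbers could collide and silently merge two accounts into one.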

Friday, September 13, 2013

The political scientists Gary King and Maya Sen have just posted an excellent working paper clearly outlining the major problems facing higher education: economic, political, sociological. The main thrust is that, although only 30% of the American population obtains a four-year college degree (thus leaving an untapped 70% who could finish college degrees), the higher education system is facing major constraints due to limited budgets and major technological advances. For example, online sites such as Khan Academy are effectively competing with universities, and for-profit universities are growing at a high rate. I'd add to their list the potential for big data analysis to displace the role of experts; consider the effect of sabermetrics on baseball journalists, or of data mining algorithms on marketers, as possible canaries in the coal mine for academics. Regardless, King and Sen's paper is a much-needed beginning of a discussion about the future of higher education in the wake of profound social changes. After all, it was only a decade ago that Time and Newsweek were major cultural institutions in American life.

Thursday, May 31, 2012

Zombies are now a common topic of discussion. In fact, the data we have available from Google Trends (for the phrase "zombie attack") strongly suggest an increasing risk of zombification across the world:

However, academic research on zombies is limited (i.e., non-existent), mainly because of the lack of high-quality data. I refer those interested in studying zombies to Andrew Gelman's paper (co-written, apparently, by the great zombie film director George Romero) on how to measure zombie outbreaks via indirect survey techniques. You can find his article here. Even if you're not interested in zombies, his paper offers some good ideas on how to sample difficult-to-reach populations more generally.

Thursday, May 10, 2012

I highly recommend Anthony Damico's excellent two-minute videos on programming in R. You can find the full list of 90+ videos here. This is the first of the series, which tells you how to download and install R:

More generally, Anthony's video collection is another reminder of the immense sociological benefits that come from sharing educational materials and expert knowledge in the style of the Khan Academy.

Tuesday, May 08, 2012

The Consortium for the Advancement of Undergraduate Statistics Education is hosting a global online conference titled "eCOTS: Electronic Conference on Teaching Statistics." You can view the full program here. It costs only $15 to register and participate in the online conference. For at least the past five years I've thought that conferences are obsolete in many respects, so I'm delighted to see this conference developed. Without a physical venue, with its food, beverages, and equipment, not to mention lodging and transportation, the cost of attendance is much lower, enabling more people to learn and contribute to knowledge production. (Of course, we'll still want some conferences for face-to-face socialization!)

Sunday, May 06, 2012

I've been using both R and Stata for over four years, but as of last week I've become an R convert. For several years I had conducted statistical analyses in R (since many complex models can only be programmed in R), but I used Stata before and after the analyses. In essence I'd merge and clean data sets in Stata, call R from Stata for the statistical analyses, export R objects into Stata, and then use Stata's graphics utilities to display the results. This setup quickly unraveled last month when I began merging and recoding data in R, which is greatly aided by John Fox's fantastic "car" package.

The problem is that if you want to do Bayesian analysis or graph modeled coefficients (or work with complex data structures more generally), then R is much easier than Stata due to the object-oriented programming environment. It's unbelievably liberating to be able to save vectors, matrices, data frames, and so on from multiple data sources and manipulations in the same conceptual space. Additionally, R has fantastic graphics capabilities (3-D plots, rotating hyperplanes, social network graphs, and so on), offers excellent tools for analyzing and displaying so-called big data (for example, check out the "tabplot" package), and is (frankly) a fun, intuitive programming language. If you need additional reasons to be an R convert, keep in mind that R is completely free, open-source, and extensible, with over 5,300 statistical packages (as of April 2012).

About Me

I'm a Ph.D. candidate in sociology at Harvard University and a Doctoral Fellow in Inequality and Social Policy at the Harvard Kennedy School. My research focuses on the quantitative analysis of culture. I have written about culture and politics, network analysis, and poverty. You can reach me at efosse@fas.harvard.edu.