Bad Hessian: Computational social science blog (http://badhessian.org)

Report back on the ASA Datathon
Mon, 20 Oct 2014
http://badhessian.org/2014/10/report-back-on-the-asa-datathon/

[Note: I do realize that this event was nearly two months ago. I have no one to blame but the academic job market.]

On August 15 and 16, we held the first annual ASA Datathon at the D-Lab at Berkeley. Nearly 25 people from academia, industry, and government participated in the 24-hour hack session. The datathon focused on open city data and methods, with questions surrounding issues such as gentrification, transit, and urban change.

Two of our sponsors kicked off the event by giving some useful presentations on open city data and visualization tools. Mike Rosengarten from OpenGov presented on OpenGov’s incredibly detailed and descriptive tools for exploring municipal revenues and budgets. And Matt Sundquist from plot.ly showed off the platform’s interactive interface which works across multiple programming environments.

Fueled by caffeine and great food, six teams hacked away through the night and presented their work on the 16th at the Hilton San Francisco. Our excellent panel of judges picked the three presentations that stood out the most:

Honorable mention: Spurious Correlations

The Spurious Correlations team developed a statistical definition for gentrification and attempted to identify which zip codes had been gentrified under that definition. Curious about who does the gentrifying, they asked whether artists act as "middle gentrifiers." While this seemed to hold in Minneapolis, it didn't in San Francisco.

Second place: Team Vélo

Team Vélo, as the name implies, was interested in bike thefts in San Francisco and crime in general. They used SFPD data to rate crime risk in each neighborhood and tried to understand which factors may be influencing crime rates, including racial diversity, income, and self-employment.

First place: Best Buddies Bus Brigade

Lastly, our first place winners asked “Does SF public transportation underserve those in low-income communities or without cars?” Using San Francisco transit data, they developed a visualization tool to investigate bus load and how this changes by location, conditional on things like car ownership.

You can check out all the presentations at the datathon’s GitHub page.

A Brief Introduction to Plotly
Thu, 28 Aug 2014
http://badhessian.org/2014/08/a-brief-introduction-to-plotly/

This is a guest post by Matt Sundquist. Matt studied philosophy at Harvard and is a Co-founder at Plotly. He previously worked for Facebook's Privacy Team, has been a Fulbright Scholar in Argentina and a Student Fellow of the Harvard Law School Program on the Legal Profession, and wrote about the Supreme Court for SCOTUSblog.com.

Emailing code, data, graphs, files, and folders around is painful (see below). Discussing all these different objects and translating between languages, versions, and file types makes it worse. We’re working on a project called Plotly aimed at solving this problem. The goal is to be a platform for delightful, web-based, language-agnostic plotting and collaboration. In this post, we’ll show how it works for ggplot2 and R.

A first Plotly ggplot2 plot

Let’s make a plot from the ggplot2 cheatsheet. You can copy and paste this code or sign up for Plotly and get your own key. It’s free, you own your data, and you control your privacy (the setup is quite like GitHub).

By adding the final line of code, I get the same plot drawn in the browser. It’s here: https://plot.ly/~MattSundquist/1899, and also shown in an iframe below. If you re-make this plot, you’ll see that we’ve styled it in Plotly’s GUI. Beyond editing, sharing, and exporting, we can also add a fit. The plot is interactive and drawn with D3.js, a popular JavaScript visualization library. You can zoom by clicking and dragging, pan, and see hover text by mousing over the plot.
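The original code isn't preserved in this text. As a language-agnostic sketch of what any Plotly client ultimately transmits, a figure is just JSON: a list of data traces plus a layout (the trace values below are made up for illustration):

```python
# Offline sketch: build a Plotly-style figure specification by hand.
# A figure is plain JSON, which is why wrappers in R, Python, Julia,
# and MATLAB can all produce and exchange the same object.
import json

figure = {
    "data": [
        {"type": "scatter", "mode": "markers",
         "x": [1, 2, 3, 4], "y": [10, 15, 13, 17], "name": "sample"},
    ],
    "layout": {"title": "A first Plotly plot", "xaxis": {"title": "x"}},
}

payload = json.dumps(figure)  # this is what an API call would transmit
```

Because the figure is a plain data structure, "translating" a plot between languages amounts to serializing and deserializing this object.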

Here is how we added a fit and can edit the figure:

Your Rosetta Stone for translating figures

When you share a plot or add collaborators, you’re sharing an object that contains your data, plot, comments, revisions, and the code to re-make the plot from a few languages. The plot is also added to your profile. I like Wired writer Rhett Allain’s profile: https://plot.ly/~RhettAllain.

You can export the figure from the GUI, via an API call, or with a URL. You can also access and share the script to make the exact same plot in different languages, and embed the plot in an iframe, Notebook (see this plot in an IPython Notebook), or webpage like we’ve done for the above plot.

https://plot.ly/~MattSundquist/1899.svg

https://plot.ly/~MattSundquist/1899.png

https://plot.ly/~MattSundquist/1899.pdf

https://plot.ly/~MattSundquist/1899.py

https://plot.ly/~MattSundquist/1899.r

https://plot.ly/~MattSundquist/1899.m

https://plot.ly/~MattSundquist/1899.jl

https://plot.ly/~MattSundquist/1899.json

https://plot.ly/~MattSundquist/1899.embed
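The pattern behind that list is uniform enough to capture in a tiny helper. This is URL construction only, no network access; the plot ID is the one from the example above:

```python
# Each export format is the same plot ID with a different extension.
BASE = "https://plot.ly/~MattSundquist/1899"

def export_url(fmt):
    """Return the export URL for a given format, e.g. 'png' or 'json'."""
    return "{}.{}".format(BASE, fmt)

urls = [export_url(f) for f in ("svg", "png", "pdf", "py", "r", "json")]
```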

To add or edit data in the figure, we can upload or copy and paste data in the GUI, or append data using R.

That routine is possible from other languages and with any plot. You can share figures and data between a GUI, Python, R, MATLAB, Julia, Excel, Dropbox, Google Drive, and SAS files.

Three final thoughts

Why did we build wrappers? Well, we originally set out to build our own syntax. You can use our syntax, which gives you access to the entirety of Plotly’s graphing library. However, we quickly heard from folks that it would be more convenient to be able to translate their figures to the web from libraries they were already using.

Thus, Plotly has APIs for R, Julia, Python, MATLAB, and Node.js; supports LaTeX; and has figure converters for sharing plots from ggplot2, matplotlib, and Igor Pro. You can also translate figures from Seaborn, prettyplotlib, and ggplot for Python, as shown in this IPython Notebook. Then if you’d like to you can use our native syntax or the GUI to edit or make 3D graphs and streaming graphs.

We’ve tried to keep the graphing library flexible. So while Plotly doesn’t natively support network visualizations (see what we support below), you can make them with MATLAB and Julia, as Benjamin Lind recently demonstrated on this blog. The same is true with maps. If you hit a wall, have feedback, or have questions, let us know. We’re at feedback at plot dot ly and @plotlygraphs.

Where to party with Bad Hessians at #asa14
Wed, 13 Aug 2014
http://badhessian.org/2014/08/where-to-party-with-bad-hessians-at-asa14/

The past two years we’ve had our own Bad Hessian shindig, to much win and excitement. This year we’re going to leech off other events and call them our own.

The first will be the after party to the ASA Datathon. We don’t actually have a place for this yet, but judging will take place on Saturday, August 16, 6:30-8:30 PM in the Hilton Union Square, Fourth Floor, Rooms 3-4. So block out 8:30-onwards for Bad Hessian party times.

Six of One Plot, Half-Dozen of the Other
http://badhessian.org/2014/07/six-of-one-plot-half-dozen-of-the-other/

For those of you who have WordPress blogs with the Jetpack Stats module installed, you’re intimately familiar with this chart. There’s nothing particularly special about it, other than that you don’t usually see bar charts with the bars superimposed.

I wanted to see what it would take to replicate this chart in R, Python, and Julia. Here’s what I found (download the data).

R: ggplot2

Although I prefer other languages these days for my analytics work, there’s a certain inertia/nostalgia: when I think of making charts, I think of ggplot2 and R. Creating the above chart is pretty straightforward, though I didn’t quite replicate it, as I couldn’t figure out how to keep my custom legend from drawing the diagonal bar thing.

The R Cookbook talks about a hack to remove the diagonal lines from legends, so I don’t feel too bad about not getting it. I also couldn’t figure out how to force ggplot2 to give me the horizontal line at 10000. If anyone in the R community knows how to fix these, let me know!

(Pythonistas: I’m aware of the ggplot port by Yhat; the functionality I used in my R code is still on the TODO list, so I didn’t pursue plotting with ggplot in Python.)

R: Base Graphics

Of course, not everyone finds ggplot2 easy to understand, as it requires a different way of thinking about coding than most ‘base’ R functions. To that end, there are the base graphics built into R, which produced this plot: While I was able to nearly replicate the WordPress chart (except that the dark bars should be slightly narrower than the lighter ones), the base R syntax is horrid. The abbreviations for plotting arguments are indefensible, the center and width keywords seem to shift the range of the x-axis instead of changing the actual bar width, and in general, plotting with base R was the worst experience of the six libraries I evaluated.

Python: matplotlib

In the past year or so, there’s been quite a lot of activity towards improving the graphics capabilities in Python. Historically, there’s been a lot of teeth-gnashing about matplotlib being too low-level and hard to work with, but with enough effort, the results are quite pleasant. Unlike with ggplot2 and base R, I was able to replicate all the features of the WordPress plot:
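As a sketch of the technique (not the author's exact script, and with made-up numbers rather than the Jetpack data): draw the wide, light bars first, then the narrower, dark bars at the same x positions, and add the horizontal reference line at 10,000 that stumped ggplot2 above.

```python
# Superimposed bars in matplotlib: two bar() calls at the same x positions,
# with the second series narrower and darker so it sits "inside" the first.
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
views = [9000, 11000, 10500, 12000, 9500, 13000]      # made-up data
visitors = [3000, 3600, 3400, 4100, 3100, 4500]       # made-up data
x = list(range(len(months)))

fig, ax = plt.subplots()
ax.bar(x, views, width=0.8, color="#bcd6e8", label="Views")        # wide, light
ax.bar(x, visitors, width=0.5, color="#3282bd", label="Visitors")  # narrow, dark
ax.axhline(10000, color="gray", linewidth=0.8)  # horizontal reference line
ax.set_xticks(x)
ax.set_xticklabels(months)
ax.legend(loc="upper left", ncol=2, frameon=False)  # horizontal legend
fig.savefig("wordpress_style.png")
```

The key is that nothing special is needed for the superimposition: bars drawn later simply paint over bars drawn earlier.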

Python: Seaborn

One of the aforementioned improvements to matplotlib is Seaborn, which promises a higher-level means of plotting data than matplotlib, as well as new plotting functionality common in statistics and research. Re-creating this plot wastes Seaborn’s additional functionality, and in fact I found it more difficult to make this plot with Seaborn than with matplotlib.

To replicate the plot, I ended up hacking a solution together using both Seaborn functionality and matplotlib in order to be able to set bar width and to create the legend, which defeats the purpose of using Seaborn in the first place.

Julia: Gadfly

In the Julia community, Gadfly is clearly the standard for plotting. With support for d3.js, PNG, PS, and PDF output, Gadfly is built to work with many popular back-end environments. I was able to replicate everything about the WordPress graph except the legend. While Gadfly took a line or two more than base R in terms of fewest lines of code, I find the Gadfly syntax significantly more pleasant to work with.

Julia: Plot.ly

Plot.ly is an interesting ‘competitor’ in this challenge, as it’s not a language-specific package per se. Rather, Plot.ly is a means of specifying plots using JSON, with lightweight Julia/Python/MATLAB/R wrappers. I was able to replicate nearly everything about the WordPress plot, with three exceptions: there’s no line at 10,000, the legend is vertical instead of horizontal, and I couldn’t figure out how to set the two bar widths separately.

And The Winner Is…matplotlib?!

If you had told me at the beginning of this exercise that matplotlib (and by extension, Seaborn) would be the only library in which I could replicate all the features of the WordPress graph, I wouldn’t have believed it. And yet, here we are. ggplot2 was certainly very close, and I’m certain that someone knows how to fix the diagonal line issue. I suspect I could submit an issue to Gadfly.jl to get custom legends supported (and, for that matter, ask Plot.ly for horizontal legends), so in the future there could be feature parity across these libraries as well.

I hope we all agree there’s no hope for Base Graphics in R besides quick throwaway plots.

In the end, the best thing I can say from this exercise is that the analytics community is fortunate to have so many talented people working to provide these amazing visualization libraries. This graph was rather pedestrian in nature, so I didn’t even scratch the surface of what these libraries can do. And beyond the six libraries I chose, there are others I didn’t evaluate, including prettyplotlib (Python), Bokeh (Python), Vincent (Python), rCharts (R), ggvis (R), Winston (Julia), ASCII Plots (Julia), and probably more that I’m not even aware of! All free and open-source, and miles ahead of the terrible-looking Microsoft graphics in Excel and PowerPoint.

Testing the Springsteen Conjecture: Exploring the “Post-authentic musical world” with big, messy internet data
Tue, 01 Jul 2014
http://badhessian.org/2014/07/testing-the-springsteen-conjecture-exploring-the-post-authentic-musical-world-with-big-messy-internet-data/

This is a guest post by Monica Lee and Dan Silver. Monica is a Doctoral Candidate in Sociology and Harper Dissertation Fellow at the University of Chicago. Dan is an Assistant Professor of Sociology at the University of Toronto. He received his PhD from the Committee on Social Thought at the University of Chicago.

For the past few months, we’ve been doing some research on musical genres and musical unconventionality. We’re presenting it at a conference soon and hope to get some initial feedback on the work.

This project is inspired by the Boss, rock legend Bruce Springsteen. During his keynote speech at the 2012 South-by-Southwest Music Festival in Austin, TX, Springsteen reflected on the potentially changing role of genre classifications for musicians. In Springsteen’s youth, “there wasn’t much music to play. When I picked up the guitar, there was only ten years of Rock history to draw on.” Now, “no one really hardly agrees on anything in pop anymore.” That American popular music lacks a center is evident in a massive proliferation of genre classifications.

While precisely delineating differences between styles like “swamp pop” and “melodic death metal” might suggest a growing concern among musicians and fans to align themselves with specific genre categories, Springsteen suggests a different possibility: that the increasing number of genres, sub-genres, and sub-sub-genres frees musicians from worrying about fitting into any given set of genre expectations:

“We live in a post–authentic world. And today authenticity is a house of mirrors. It’s all just what you’re bringing when the lights go down. It’s your teachers, your influences, your personal history; and at the end of the day, it’s the power and purpose of your music that still matters.”

With so many classifications and no common center, musicians need not play by the rules laid down by genre categories. What matters instead is “the genesis and power of creativity, the power of the songwriter, or let’s say, composer, or just creator.” A creative musician can join rap with steel guitar, electronic beats with acoustic guitar, without thereby becoming an inauthentic rap, country, electronica, or folk musician. One is instead a good rap-country-electronica-folk musician, or, as Springsteen would probably have it, just a good musician.

This purported shift in the norms surrounding musical creation drew attention from both the music industry and cultural sociologists alike. A couple years ago Springsteen’s provocative statement got a comment on orgtheory.

Do we truly live in a “post-authentic” musical world, where norms and conventions around musical creation have weakened and individual creativity is born to run free (pun!)? Our work starts with this question, and then goes on to pursue a couple further ones about how musical unconventionality relates to band popularity and how unconventionality may be geographically concentrated.

RESEARCH QUESTIONS:

1) To what extent are genres organized into discrete scenes, and to what extent are they relatively unbounded, at the national level and in certain metropolitan regions?

2) To what extent are “musically unconventional” bands more or less popular than more conventional bands?

3) To what extent are bands in certain metropolitan areas more “musically unconventional” than bands in others?

DATA: We examine data from ~3.2 million bands’ MySpace.com pages from 2007, which were scraped at the time by the University of Chicago Cultural Policy Center. Good overviews of the data can be found here, here, here, here, and here. Kevin Stolarick has also worked with us on these data, in particular in constructing the popularity index and in matching MySpace pages to metro areas. Sure, MySpace is no longer fashionable, but you might recall that in 2007 it was booming, especially for musicians, and nearly every band/musician you could name—whether hugely commercially popular or obscure and local—had a MySpace page. The sample is restricted to bands in the U.S., and we must also acknowledge that in all likelihood (though we don’t have data on it) the data skews toward younger musicians, with older and less internet-savvy musicians left out. Internet data is huge, but incomplete and messy. So when we clean the data and eliminate cases with missing data, we end up with 1,337,454 bands. Still a more than decent N, of course, but it’s sad that the initially bigger N is so fleeting. And although our thoughts are still inconclusive about whether an N bigger than statistically necessary offers benefits that outweigh the (computational and data-manipulating) costs of performing an analysis, we will stick with the ~1.3 million and not sample further from it, since we are (like many of you) stricken with N-Envy.

We begin from the information available on publicly accessible MySpace pages: genre, page views, page plays, fans, and geographical location. Our analysis involves transforming this information into more theoretically interesting variables. Over the course of the post, you’ll see that we use genre identifications to create a musical unconventionality measure, we use fans, views, and plays to create a composite band popularity measure, and we match band-provided location information to Metropolitan Statistical Areas as drawn by the U.S. Census (and can we add: matching that was a blast).

We grant that these days, there has indeed been a proliferation of words to describe musical genres. Indeed, MySpace users could choose up to three genres from 125 different options, meaning that they had 333,375 different ways to describe their distinctive style. This sort of freedom to represent oneself in so many different ways is a major inspiration behind The Boss’s speech. But how many of these possibilities do bands actually use? Does the expansion of genre classifications truly represent a free mixing of musical styles? Our results suggest that there still are significant boundaries among musical scenes. Mixing musical styles seems restricted instead of free.
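As an aside, the 333,375 figure checks out if you count the distinct outcomes of up to three picks (order ignored, with repeated picks collapsing to fewer distinct genres), which is equivalent to counting size-3 multisets from 125 options:

```python
# Verify the combinatorics: 3 distinct genres, or 2 distinct
# (one pick repeated), or a single genre.
from math import comb

total = comb(125, 3) + 125 * 124 + 125
assert total == comb(127, 3)  # size-3 multisets from 125 options
```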

We came to this conclusion by examining the relationship between genres as a network, and then examining modularity in that network. This way, we can find out whether genres clump up, i.e., whether each genre is consistently more likely to be paired with some genres than with others.

H1: We will find statistically significant modularity in the genre network; genres are consistently paired with a small number of others.

H0: There will be little modularity in the network; genres are equally paired with all other genres.

If we find significant modularity, it means that while there may be more words than ever available to describe musical styles, genre boundaries have not simply dissolved. Genres are not isolated; they are anchored in higher-order musical communities (“scenes”), islands of musical inbreeding, each of which is still quite distinct from the next. That musicians evidently respect the boundaries of such communities suggests that collective musical norms continue to operate strongly. That is, genre boundaries have perhaps shifted or come to encompass more sub-categories, but they are still relevant and strong.

We make the genre network by mapping the band-provided (self-identified) co-listings of genres. That is, bands identify themselves with up to three genres, and two genres are considered “related” each time a band lists them together.
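The co-listing step just described can be sketched with a toy corpus (the band lists below are hypothetical, not the MySpace data): each band contributes one tie for every within-band genre pair, and the tallies double as the c_jk edge weights and c_jj genre totals used later.

```python
# Build the genre co-listing network from band-level genre lists.
from collections import Counter
from itertools import combinations

bands = [
    ["Rock", "Pop", "Alternative"],
    ["Rap", "Hip Hop", "R&B"],
    ["Rock", "Alternative", "Experimental"],
    ["Rap", "Hip Hop"],
]

genre_totals = Counter()   # c_jj: how often each genre is selected overall
pair_counts = Counter()    # c_jk: how often two genres are co-listed

for genres in bands:
    genre_totals.update(genres)
    for j, k in combinations(sorted(genres), 2):  # sort for a canonical key
        pair_counts[(j, k)] += 1
```

At full scale the same tallies define a weighted, undirected genre network whose edges are the pair counts.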

We use Greedy Modularity Maximization to locate genre clusters, and we test the statistical significance of those clusters with a Wilcoxon rank-sum test, comparing the number of in-edges for bands in each cluster vs. their out-edges. We run the modularity maximization progressively, stopping when moving to a smaller, more specific “genre island” is no longer statistically significant. The graph below displays the 17 statistically significant clusters that result (all at p<.001). Click around. An even more fun full-page version is available here. The second view scales edge widths to edge weights; a larger version is here. And a table listing the genre clusters is below both graphs:

As modularity in a complex network is difficult to display in a simple visual, this is a modified version of the network created by plotting each of the clusters and its three strongest out-edges.
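For readers who want the mechanics: the quantity the clustering step maximizes is Newman-style modularity Q, sketched here for a toy weighted graph (the graph is hypothetical, not the genre data). Q > 0 means clusters hold more internal edge weight than a degree-preserving random graph would give them.

```python
# Modularity Q for an undirected weighted graph under a given partition.
def modularity(edges, community):
    """edges: {(u, v): weight} with each undirected edge listed once;
    community: {node: cluster label}. Assumes no self-loops."""
    two_m = 2.0 * sum(edges.values())
    degree = {}
    for (u, v), w in edges.items():
        degree[u] = degree.get(u, 0.0) + w
        degree[v] = degree.get(v, 0.0) + w
    q = 0.0
    for i in degree:                      # sum over ordered node pairs
        for j in degree:
            if community[i] != community[j]:
                continue
            a_ij = edges.get((i, j), 0.0) + edges.get((j, i), 0.0)
            q += (a_ij - degree[i] * degree[j] / two_m) / two_m
    return q

# Two triangles joined by a single bridge edge: an obvious 2-cluster graph.
edges = {(0, 1): 1, (1, 2): 1, (0, 2): 1,
         (3, 4): 1, (4, 5): 1, (3, 5): 1, (2, 3): 1}
triangles = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(edges, triangles)  # positive: the triangles are real clusters
```

Greedy maximization repeatedly merges or splits clusters to increase this Q; putting every node in one community yields Q = 0, the null baseline.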

We can see that the genre clusters pass the eye test reasonably well. Genres that we would imagine going together generally do, and we might also be surprised by how subdivided certain higher-order genres seem to be (e.g. electronic music). We can also see how some scenes overlap with one another more than others. For instance, the hardest core metal genres (“dark”) are relatively isolated, though they link through their neighbouring cluster (“sad and angry”) to the “popular” cluster (via the metal-rock connection) and to “black and brown” (via the hardcore-rap-hip hop connection). Moreover, it seems that a few genres—Rock, Hip Hop, Acoustic—do much of the work of binding the musical universe. Play around with the picture a bit and let us know what else you see.

So having found strong modularity, our research does not support The Boss’s proclamations of a post-authentic musical world. As much as traditional musical genres have been subdivided to death, those subdivisions still cohere together, creating distinct scenes rather than encouraging the free mixing of musical styles. This work is ongoing, and one can imagine many further types of analyses. One item on our agenda is to examine the extent to which certain genres have lost their integrity, remain distinct and intact, or serve as “bridges” that forge connections between differentiated musical scenes. This is already somewhat displayed in the visual, but warrants further attention. Suggestions for other directions are most welcome.

This is already a lot. But we’ve done much more. Recall that genre is not the only information we have; we also know about popularity, and can ask some questions about that. Such as:

Question 2: Are popular bands less likely to be musically unconventional?

The short answer is that the most popular bands are fairly unconventional, but extremely unconventional bands tend to be very unpopular.

Hypotheses:

H1: There is a negative correlation between band popularity and band’s musical conventionality: popular bands are less likely to be musically unconventional.

H0: There is no correlation between popularity and conventionality: popular bands are just as likely as unpopular bands to be musically unconventional.

The idea is that popular music is stylistically common music; most people don’t like exotic things (by definition), so music that is stylistically more familiar will be more popular.

Measuring musical unconventionality

But before we get to that, we need to first explain how we measured musical unconventionality. Here, we are inspired by Lizardo (2013)’s measure of “effective cultural omnivorousness.” As he put it, this measure “uses [an] audience overlap matrix to penalize [choosing] genres that are themselves strongly connected to one another (e.g. have high audience overlap). Conversely, [choosing] genres which are not strongly connected to one another (e.g. belong to relatively distinct audience clusters) [is] assigned a higher score.”

We diverge from Lizardo (and use a simpler version of his idea) in that, instead of making a band’s unconventionality score cumulative, adding to its score as more genre choices are made, we take the mean of each band’s genre-pair unconventionality. So in effect, each genre pairing receives an unconventionality score, and a band’s unconventionality score is the mean over the genre pairs it chooses. It is necessary here to take means instead of sums because bands have either 1 or 3 pairings, so the sheer number of genres listed (regardless of how unusual any pairing is) can have a larger effect on the score than actually having an unconventional genre combination (if, say, only two genres are named). That is, you would get a higher score for saying your band is “Pop-Rock-Alternative” than for saying that it is “Grindcore-Ghettotech-None.” Taking the mean of genre-pair novelty scores avoids this problem. So a band’s unconventionality U is given by:

U = (1/n) Σ_(j,k) [ 1 − c_jk / √(c_jj · c_kk) ]

where n is the number of genre pairings selected, c_jk is the number of times in the data set that genres j and k are paired, and c_jj and c_kk are the total number of times bands in the sample selected genres j and k, respectively.

To make our conception of “conventionality” and “unconventionality” more concrete, a very conventional combination of genres would be ones that lie in one of the clusters explored above. For example: “rap/hip-hop/R&B” or “Rock/pop/alternative.” Unconventional pairings would be ones that bridge the clusters. A somewhat unconventional one might be “Rock/Alternative/Experimental” or “Punk/Thrash/Rock.” A very unconventional triad would be “Shoegaze/Hard House/Rockabilly” or “Ambient/Hardcore/Opera.”
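The measure can be sketched as follows. The mean-over-pairs step is from the text; the specific pair score used here, 1 − c_jk / √(c_jj · c_kk), is our illustrative assumption in the spirit of Lizardo's overlap penalty, not necessarily the authors' exact formula, and the counts are made up:

```python
# Band unconventionality as the MEAN of genre-pair scores, where a pair
# scores high when it is rarely co-listed relative to each genre's
# overall frequency. (Pair-score form is an illustrative assumption.)
import math
from itertools import combinations

pair_counts = {("Pop", "Rock"): 50, ("Opera", "Rock"): 1}   # c_jk (toy data)
genre_totals = {"Rock": 100, "Pop": 60, "Opera": 5}         # c_jj (toy data)

def pair_score(j, k):
    c_jk = pair_counts.get(tuple(sorted((j, k))), 0)
    return 1 - c_jk / math.sqrt(genre_totals[j] * genre_totals[k])

def band_unconventionality(genres):
    pairs = list(combinations(genres, 2))  # 1 pair for two genres, 3 for three
    return sum(pair_score(j, k) for j, k in pairs) / len(pairs)

common = band_unconventionality(["Pop", "Rock"])    # well-worn pairing
rare = band_unconventionality(["Opera", "Rock"])    # scene-crossing pairing
```

Because the score is a mean, a band is not rewarded merely for listing three genres instead of two, which is exactly the problem the text describes.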

Constructing a “popularity” measure

We also need to measure popularity. MySpace pages give us information about the number of fans, number of page views, and number of times people have played the music posted to its page. We can combine these to get an overall index of each band’s popularity.

We did this two ways, a weighted approach and a z-score approach. These represent different ways of making sure that each of the three components is given equal weight in the popularity score despite their differences in numerical magnitude (i.e. bands will get many more page views than people signing on as fans).

The weighted approach is given by:

Popularity = plays + 1.75*views + 20*fans

The z-score approach is given by:

PopularityZ = zPlays.log + zViews.log + zFans.log

Where we use logs of each component variable to normalize their distributions.
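The z-score index can be sketched as follows (band counts are made up; the standardization uses the standard library's statistics module):

```python
# Popularity index: log each component to tame the skew, standardize
# within component, then sum the three z-scores per band.
import math
import statistics

bands = {
    "A": {"plays": 120000, "views": 45000, "fans": 900},
    "B": {"plays": 800, "views": 1500, "fans": 20},
    "C": {"plays": 15000, "views": 9000, "fans": 150},
}

def zscores(values):
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mu) / sd for v in values]

names = list(bands)
components = {}
for key in ("plays", "views", "fans"):
    logged = [math.log(bands[n][key]) for n in names]  # normalize distribution
    components[key] = zscores(logged)

popularity = {n: sum(components[k][i] for k in components)
              for i, n in enumerate(names)}
```

Standardizing each logged component before summing is what guarantees the equal weighting the text describes, regardless of the components' raw magnitudes.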

We have run analyses using both versions of “popularity,” and they return basically the same results, so we’ll only present the results using the z-score approach to save space.

But we have a problem with our other variable, musical unconventionality: no matter what transformation we applied, it could not be coerced into a normal distribution.

So a Spearman rank correlation may be more appropriate than a linear regression, even though it would be really nice to make use of all this continuous data. We find a negative correlation (ρ = -0.038) significant at the .01 level (p < 2.2e-16). Of course, significance is not hard to achieve given the large sample size, but by the same token correlation coefficients are also typically much smaller in large datasets. This suggests that less popular bands are on average somewhat more unconventional than popular bands, but the difference is not large.
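Since Spearman's ρ is just Pearson correlation computed on (tie-averaged) ranks, the check can be sketched without any stats library:

```python
# Spearman's rho from scratch: rank both variables, then take the
# Pearson correlation of the ranks.
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                      # extend over a run of ties
        avg = (i + j) / 2 + 1           # average rank for the tied run
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def spearman(x, y):
    return pearson(ranks(x), ranks(y))
```

Because it only uses ranks, the coefficient is unaffected by the skewed, non-normal distribution of the unconventionality scores.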

Perhaps more informative would be simply plotting the relationship between popularity and unconventionality. This is done below overlaid on a density plot of musical unconventionality scores (pink blob).

We learn two things from this plot: (1) There is a fairly linear, strong, positive relationship between popularity and unconventionality up to about the 80th percentile of unconventionality. In fact, the most popular bands in the sample are fairly innovative. But that changes very quickly at a clear inflection point. (2) Extremely unconventional bands tend to be very unpopular.

Again ideas for pursuing the analysis further would be most welcome.

A MySpace public profile gives us one more key piece of information: a band’s location. That enables us to ask our third question:

Question 3: Are bands in certain metro areas more musically unconventional than bands in others?

And if so, what are the characteristics of metropolitan areas that are the most and least musically conventional?

H1: There is significant difference among different metro areas’ levels of musical conventionality.

H0: There is no significant difference among metro areas’ levels of musical conventionality.

We find that there are significant differences among metro areas. It appears that college towns have the most scene-crossing, while racially diverse metros anchor the main streams of American popular music.

As a first step, we used mixed models to determine what percentage of the variation in individual bands’ unconventionality comes from metro vs. band differences. We found that about 2.6% comes from the metro. This is again statistically significant, but again small, and again raises the question of what “small” and “big” mean in this context.

Next we treat metro conventionality/unconventionality as a phenomenon in its own right. That is, we turn from properties of individual bands to aggregate characteristics of metro areas. To measure a metropolitan area’s conventionality, we take the median of the scores for all bands that reside in that area. We ran a Kruskal-Wallis test to see whether the medians were significantly different and found that they were (χ² = 1337453, df = 331, p < 2.2e-16).

Here is a map of musical un/conventionality across 332 metro areas. Blue dots mark the most unconventional metros, while yellow ones mark the most conventional. The first version is unlabeled, while the second points out some of the least and most musically un/conventional metros. As you can see, there is a vague geographic pattern whereby the southeast and much of the east coast show the most conventionality, whereas unconventionality tends to be found in the north and the west.

But it is also clear from the maps that musical un/conventionality does not follow a neat geographical pattern. So we run a number of analyses to see whether musical unconventionality correlates with metro-level demographic variables.

An advantage of aggregating to metro areas is that we can make use of the extensive information we know about them from the U.S. Census Bureau. We draw on two main sources. First is the decennial census, which tells us about demographics. We use total population, percent African-American, percent Hispanic, percent college students, and median household income. Second is Zip Code Business Patterns, which can tell us about a metro’s organizational make-up. For our analysis we focused on organizations related to the music industry: specifically we made a recording industry per capita variable (the total “Record Production,” “Music Publishers,” “Sound recording studios,” and “Other Sound Recording Studios” per person), and also a radio stations per capita variable. These are both based on NAICS codes. Though one might consider many other variables (and we did explore others), given that our N drops precipitously when we move to metros (from three million to three hundred) we try to stick to a small number to avoid collinearity problems. To continue to pursue the relationship between popularity and conventionality now at the aggregate level, we also include the metro median of our band popularity index.

Here are simple bivariate correlations between these variables and a metro area’s median unconventionality score.

Correlation with Metro Median Unconventionality (Sig., 2-tailed):

Total Population                               -.200**   <.001
Radio Stations Per Capita                       .244**   <.001
Recording Industry Organizations Per Capita     .105      .056
Percent African American                       -.678**   <.001
Percent College Students                        .233**   <.001
Median Household Income                        -.049      .37
Percent Hispanic                                .092      .095
Metro Median Popularity                         .294**   <.001

** Correlation is significant at the 0.01 level (2-tailed).

And here are results from a simple multivariate OLS regression, showing standardized Beta coefficients and p-values.
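As a generic illustration of how standardized beta coefficients are obtained (z-score every variable, then fit OLS with no intercept), here is a minimal numpy sketch on synthetic data, not our metro data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for two metro-level predictors and an
# outcome (say, median unconventionality); invented data only.
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 0.5 * x1 - 0.3 * x2 + rng.normal(scale=0.5, size=n)

def zscore(a):
    return (a - a.mean()) / a.std()

# Regressing z-scored y on z-scored predictors yields standardized
# betas directly (no intercept is needed: all variables have mean zero).
X = np.column_stack([zscore(x1), zscore(x2)])
betas, *_ = np.linalg.lstsq(X, zscore(y), rcond=None)
print(betas.round(3))
```

The recovered betas are the effect sizes in standard-deviation units, which is what makes coefficients comparable across predictors measured on very different scales (population counts vs. percentages).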

The takeaway is that the most unconventional metros tend to have lots of college students, radio stations, and a strong recording industry presence. By contrast, America’s major popular music clusters are anchored in higher-income, more racially diverse (African-American and Hispanic) metros. Interestingly, musically unconventional metros also tend to be home to relatively popular bands. So in contrast to what we saw at the individual band level, at the aggregate level popularity and unconventionality seem to go together. This raises all sorts of interesting questions, and the interplay between individual band and aggregate metro characteristics is an area we would like to pursue further. Note, finally, that total population is insignificant in the regression. This too is interesting: one might have thought that size itself would breed unconventionality (cf. Simmel’s “Metropolis and Mental Life”), but this does not seem to be the case. It also provides some evidence against the possible criticism that we’re getting higher unconventionality scores for smaller metros only because smaller populations tend to yield extreme values (cf. Gelman).

And now we return to where we started, with the Boss’s speech. We found little support for his idea that American popular musicians freely roam across genres; they mostly operate according to what seem to be strong conventions about which genres go together. But not everywhere to the same degree. Indeed, if we conjure a picture of the Austin SxSW crowd, it looks a lot like the regression results: college students, radio stations, and the record industry. While America in general may not conform to the Springsteen Hypothesis, in some contexts it comes closer than others, and Austin is probably one of them. Considered in this way, we might take Springsteen’s speech less as a general proposition and more as a specific expression of the expectations he and his audience have about the nature of musical creativity, one which is by no means universally shared. In other words, he was preaching to the choir.

Certainly these last ideas are speculative, but we hope to pursue them along a number of fronts. We mentioned some above: multi-level analysis that simultaneously investigates metro properties, band properties, and their interactions; diving deeper into the positions of specific genres within the network structure; investigating other variables. Another idea is to (somehow) transform the seventeen clusters into variables, so we can determine their relative strength across (geographic) space. And we might try matching the data to other geographies, such as counties.

But for now, that is all. We would love to hear any comments you may have. Thanks.

ASA Datathon: Big Cities, Big Data (sign up by August 1)

We’re on from 1pm August 15 through 1pm the 16th at Berkeley’s D-Lab. Public presentations and judging will take place at one of the ASA conference hotels, the Hilton Union Square, Room 3-4, Fourth Floor, from 6:30-8:15pm on August 16th.

Signing up will give us a better idea of who will be at the event and how many folks we can expect to feed and caffeinate. We’re also going to give teams a week to get to know each other before the event, so signing up will allow us to make sure everyone gets the same amount of time to work.

If you’re interested, you are invited. We don’t discriminate against particular methodologies or backgrounds. We hope to have social scientists, data scientists, computer scientists, municipal staffers, start-up employees, grad students, and data hackers of all stripes – quantitative, qualitative, and the methodologically agnostic.

Our title implies an interest in “big cities,” but honestly, we’re more interested in real estate and housing data. Because a majority of the population lives in cities, cities will likely be important focal points in many of the projects that come out of the datathon, but we’re hoping some teams focus on rural areas, too. Questions that we’ve considered include:

– How are home buyers different now compared to home buyers ten years ago? Can the recession explain any of these differences? If so, would we expect a home-buying rebound or did the recession combine with other trends (increasing amount of student loan debt) to cause a permanent change in home buying patterns?

– Who buys homes in rural areas? Are there halos of second-home buying around major cities? Around major airports? What kind of impact does this have on rural economies?

– Are there specific industries that drive housing patterns? For instance, the tech industry is under fire in San Francisco right now for accelerating gentrification. Is this historically accurate? How does it compare to industries like the financial sector influencing prices in the New York metro area? Are these stories about single industries influencing real estate ecosystems oversimplifying more complicated patterns?

– How does access to natural resources – and proximity to natural disasters – shape purchasing decisions, if at all? In other words, is there evidence that buyers take natural risks into their value considerations?

These are just some questions we’ve tossed around among ourselves. We’re sure our participants will come up with other great questions that use real estate and/or housing data.

We can’t wait to see what happens in August.

Crowdsourced Season 6 Drag Race predictions (May 21, 2014)

With Season 6 of RuPaul’s Drag Race in the books and the new queen crowned, it’s time to reflect on how our pre-season forecasts did. In February I posted a wiki survey asking who would win this season, before the first episode had aired. I posted it to reddit’s r/rupaulsdragrace, Twitter, and Facebook, and it generated an impressive 15,632 votes across 435 unique user sessions, which means the average survey taker made a little under 36 pairwise comparisons.

The plot below shows the results. The x-axis is the score assigned by the All Our Ideas statistical model, which can be interpreted as follows: if “idea” 1 (or, in this case, queen 1) is pitted at random against idea 2, the score is the chance that idea 1 will win. The color shows how close the wiki survey came to the actual rank: the paler the dot, the closer. Bluer dots mean the wiki survey overestimated the queen, while redder dots mean it underestimated her.
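All Our Ideas fits a statistical model to the pairwise votes; as a naive illustration of the underlying idea, one can compute each queen’s raw share of comparisons won. The votes below are invented, and this simple win-rate is a stand-in for the model’s actual Bayesian scoring:

```python
from collections import defaultdict

# Hypothetical pairwise votes as (winner, loser); not the real
# wiki-survey data.
votes = [
    ("Courtney", "Milk"), ("Courtney", "Bianca"),
    ("Bianca", "Milk"), ("Bianca", "Courtney"),
    ("Milk", "Joslyn"), ("Courtney", "Joslyn"),
]

wins = defaultdict(int)
appearances = defaultdict(int)
for winner, loser in votes:
    wins[winner] += 1
    appearances[winner] += 1
    appearances[loser] += 1

# Naive score: share of comparisons each queen won.
scores = {q: wins[q] / appearances[q] for q in appearances}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```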

So how did the wiki survey do? Not terrible. Courtney Act was a clear frontrunner and had a lot of star power to carry her to the end. Bianca was a close second in the wiki survey and ultimately outshone her in the finale; the two are relatively close to each other in score. This was actually the first season in which two queens never had to lipsync. BenDeLaCreme is ranked third in the survey, although she came in fifth. Little surprise she was voted Miss Congeniality.

After that, it gets interesting. Milk was ranked fourth by the survey, but came in ninth on the show. I’m thinking her quirkiness may have given folks the impression that she could go much further than she actually did. Adore, one of the top three, comes in fifth on the survey, rather close to her friend Laganja.

April Carrion and Kelly Mantle were expected to go far, but got the chop relatively early on. Darienne was a dark horse in this competition, ending up in fourth place when pre-season fans thought she’d be middling.

Lastly, Joslyn and Trinity are the biggest success stories of season 6. They had a surprising amount of staying power when folks thought they wouldn’t make it out of the first month.

So what can we learn from this? Well, for one, for a more or less staged reality show, I’m somewhat impressed by how well these rankings came out. Unlike wiki surveys used for sports forecasting, we have no prior information carrying over from season to season (unless you count something like “drag lineages,” e.g. Laganja is Alyssa Edwards’s drag daughter). All information comes from the domain expertise of drag aficionados. Courtney and Bianca were already widely regarded drag stars in their own right before the competition. Although this didn’t seem to be the case in other seasons, it looks like there was a strong Matthew effect at work this time. Is this the new normal as more well-known queens start competing?

It is time to get rid of the E in GDELT (May 15, 2014)

This is a guest post by Neal Caren. He is an Associate Professor of Sociology at the University of North Carolina, Chapel Hill. He studies social movements and the media.

In the first story, “Kidnapping of Girls in Nigeria Is Part of a Worsening Problem,” FiveThirtyEight’s Mona Chalabi writes:

The recent mass abduction of schoolgirls took place April 15; the database records 151 kidnappings on that day and 215 the next.

To investigate the source of this claim, I downloaded the daily GDELT files for those days and pulled all the kidnappings (CAMEO code 181) that mentioned Nigeria. GDELT provides the story URLs: each GDELT event is associated with a URL, although one article can produce more than one GDELT event.

I’ve listed the URLs below. Some of the links are dead, and I haven’t looked at all of the stories yet, but, as far as I can tell, every single story that is about a specific kidnapping is about the same event. You can get a sense of this by just looking at the words in the URLs for those two days. For example, 89 of the URLs contain the word “schoolgirl” and 32 contain “Boko Haram.” It looks like instead of 366 different kidnappings, there were many, many stories about one kidnapping.

Something very strange is happening with the way the stories are parsed and then aggregated. I suspect that this is because when reports differ on any detail, each report is counted as a different event. Events are coded on 57 attributes, each of which has multiple possible values, and it appears that events are only considered duplicates when they match on all attributes. Given the vagueness of events and variation in reporting style, a well-covered, evolving event like the Boko Haram kidnapping is likely to be covered in multiple ways with varying degrees of specificity, leading to hundreds of “events” from a single incident.
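One hedge against this inflation would be to collapse reports on a coarse key rather than on all 57 attributes. A sketch, with invented rows whose fields only loosely mimic GDELT’s daily files:

```python
# Invented report rows; not real GDELT records. Each row stands in
# for one coded report of an abduction.
reports = [
    {"day": "20140415", "cameo": "181", "lat": 11.84, "lon": 13.16},
    {"day": "20140415", "cameo": "181", "lat": 11.82, "lon": 13.17},
    {"day": "20140415", "cameo": "181", "lat": 9.06,  "lon": 7.49},
    {"day": "20140416", "cameo": "181", "lat": 11.84, "lon": 13.16},
]

def coarse_key(r):
    # Match on day, CAMEO code, and coordinates rounded to one
    # decimal (~11 km), so slightly different geocodings of the
    # same incident collapse together.
    return (r["day"], r["cameo"], round(r["lat"], 1), round(r["lon"], 1))

events = {coarse_key(r) for r in reports}
print(len(reports), "reports ->", len(events), "candidate events")
```

Even this crude key collapses near-duplicate geocodings, though it would still split a single multi-day, multi-location story like the Chibok kidnapping into several “events.”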

Plotting these “events” on a map only magnifies the errors: there are 41 unique latitude/longitude pairs listed to describe the same abduction.

At a minimum, GDELT should stop calling itself an “event” database and call itself a “report” database. People still need to be very careful about using the data, but defaulting to writing that there were 366 reports about kidnapping in Nigeria over these two days is much more accurate than saying there were 366 kidnappings.

In case you were wondering, GDELT lists 296 abductions associated with Nigeria that happened yesterday (May 14th, 2014) in 42 different locations. Almost all of the articles are about the Boko Haram schoolgirl kidnapping, and the rest are entirely miscoded, like the Heritage blog post about how the IRS is targeting the Tea Party.

Creating Network Diagrams in Plotly from Julia (May 12, 2014)

I’ve been using R for years and absolutely love it, warts and all, but it’s been hard to ignore some of the publicity the Julia language has been receiving. To put it succinctly, Julia promises both speed and intuitive use to meet contemporary data challenges. As soon as I started dabbling in it about six months ago I was sold. It’s a very nice language. After I had understood most of the language’s syntax, I found myself thinking, “But can it do networks?”

As it stands, there’s currently one library, Graphs.jl, that addresses networks. Relative to R’s network packages, Graphs.jl has a very slick way of storing network data. Wasserman and Faust (1994) define a network as a set of actors and the set of relationships between them, and Graphs.jl reminds users of this conceptualization. Unlike R, Julia requires that users actively consider data types, so naturally, each actor in a network is represented by an object of type “KeyVertex” or “ExVertex.” The difference between these two vertex types is whether or not actors need attributes and labels: should an actor have a label and attributes, this information is stored within each ExVertex object in the network. A relationship between two actors is likewise represented by an object of type “Edge” or “ExEdge” (again, differing by whether or not the edge can store attributes). Creating an edge object requires the user to specify the sending and receiving vertices, expressed as vertex objects. In short, each network is represented by an array of vertices of a particular type and an array of edges of a particular type.
Once the user has decided upon the types of vertices in her network, she must create the vertex objects, include them in her network or “graph,” create edge objects that pair vertex objects, and then include those edges in the graph as well. After the user has created a graph object, she can easily retrieve the vertex and edge objects along with any attributes they might have.

After I create a network data object, I typically want to visualize it immediately. Unfortunately, for now the functionality in Graphs.jl is quite limited compared to igraph and the statnet suite in R. As it stands, there are no network visualization functions and just one layout function (random). Because Graphs.jl is relatively young and the developers are busy contributing to many of Julia’s other libraries, this limitation is to be expected. Nevertheless, I’ve been itching to make a network plot in Julia, so I decided to write up a function.

Network plotting functions need to accept a few traditional arguments. First, they obviously need to accept a network/graph object. Second, they need the ability to express vertex and edge attributes through varying colors, shapes, and sizes. Third, edge directionality needs to be indicated, should it exist. As a matter of personal preference, I find that opacity can be a useful visualization tool and that intensity-tapered, curved edges are a good way to portray directionality. The defaults should also not be fugly.

Julia has a few visualization tools, and for this project I decided to render the networks in plot.ly. Plot.ly has a few benefits over the other packages in Julia: it stores the visualizations and data online, it allows collaboration, and the platform is also accessible through R, MATLAB, and Python. If network diagrams are basically many line and scatter plots, then, in theory, plot.ly should have no trouble rendering network visualizations.
To demonstrate, let’s do a “Hello, world” with Padgett’s Florentine family marriage network. Here, marriage is an undirected relationship, the labels are meaningful, and we’re omitting the other attributes. For the data examples here, I’ve cheated a bit and calculated the vertex layout in igraph for R, but the rest was done in Julia. After starting Julia, you’ll need to load the data and plotting function, enter your own plot.ly user name and API key, and lastly run the function.

Going to that URL, you should see a plot that looks like this one.

Directed networks need a bit more patience. Directed relationships are conventionally represented in social network diagrams using arrows. On the downside, at the moment I haven’t found a way to use arrows in plot.ly. On the upside, the plotting script will indicate directionality with intensity-tapered, curved edges, a representation that is easier to interpret than arrows. On the downside (again), this method is much more computationally expensive. The parameter of interest here is the “gradient”: each edge contains a number of small line plots equal to gradient. By default, these line plots go from wider, darker, and more opaque to thinner, lighter, and more transparent as they leave the source vertex and approach the target vertex. Setting the gradient parameter to a higher value will improve the quality of the visual results, though it will take longer to load in your browser; setting it to a lower value will do just the opposite. For the directed network, let’s look at Coleman’s (1964) high school interaction (“friendship”) network.
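The tapering logic just described can be sketched independently of plot.ly: split each edge into `gradient` segments and linearly interpolate width and opacity from source to target. This is a simplified straight-line version (the actual script also curves the edges), written in Python rather than Julia purely for illustration:

```python
def tapered_segments(src, dst, gradient=10,
                     width=(4.0, 0.5), opacity=(0.9, 0.1)):
    """Split the straight line from src to dst into `gradient`
    segments, each with linearly interpolated width and opacity,
    so the edge fades as it approaches the target vertex."""
    (x0, y0), (x1, y1) = src, dst
    segs = []
    for i in range(gradient):
        t0, t1 = i / gradient, (i + 1) / gradient
        frac = i / max(gradient - 1, 1)  # 0 at source, 1 at target
        segs.append({
            "x": (x0 + t0 * (x1 - x0), x0 + t1 * (x1 - x0)),
            "y": (y0 + t0 * (y1 - y0), y0 + t1 * (y1 - y0)),
            "width": width[0] + frac * (width[1] - width[0]),
            "opacity": opacity[0] + frac * (opacity[1] - opacity[0]),
        })
    return segs

# Five segments running left to right, fat and opaque at the
# source, thin and faint at the target.
segs = tapered_segments((0.0, 0.0), (1.0, 0.0), gradient=5)
print(len(segs), segs[0]["width"], segs[-1]["width"])
```

Each segment becomes one small line trace, which is why a higher gradient looks smoother but costs proportionally more traces per edge.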

Right now, the script has a few limitations. First, the graph coordinates must be saved as vertex attributes “x” and “y,” and I’ve had to calculate these outside of Julia. Likewise, the vertices must be of type ExVertex, as the type KeyVertex cannot store attributes. Second, curved edges for undirected graphs have not yet been implemented. Lastly, I’m exploring different ideas to get the speed up for directed graphs. If anyone has any suggestions or feedback, I’d greatly appreciate it! Helper functions / Data / Plotting function

New Paper: Developing a System for the Automated Coding of Protest Event Data (April 15, 2014)

Sadly, we haven’t posted in a while. My own excuse is that I’ve been working a lot on a dissertation chapter. I’m presenting this work at the Young Scholars in Social Movements conference at Notre Dame at the beginning of May and have just finished a rather rough draft of that chapter. The abstract:

Scholars and policy makers recognize the need for better and timelier data about contentious collective action, both the peaceful protests that are understood as part of democracy and the violent events that are threats to it. News media provide the only consistent source of information available outside government intelligence agencies and are thus the focus of all scholarly efforts to improve collective action data. Human coding of news sources is time-consuming and thus can never be timely, and it is necessarily limited to a small number of sources, a small time interval, or a limited set of protest “issues” as captured by particular keywords. There have been a number of attempts to address this need through machine coding of electronic versions of news media, but approaches so far remain less than optimal. The goal of this paper is to outline the steps needed to build, test, and validate an open-source system for coding protest events from any electronically available news source, using advances from natural language processing and machine learning. Such a system should increase the speed and reduce the labor costs associated with identifying and coding collective actions in news sources, thus increasing the timeliness of protest data and reducing biases due to excessive reliance on too few news sources. The system will also be open, available for replication, and extendable by future social movement researchers and social and computational scientists.

This is very much a work in progress. There are some tasks that I know immediately need to be done: improving evaluation for the closed-ended coding task, incorporating the open-ended coding, and clarifying the methods. From those of you who do event data work, I would love feedback. Also, if you can think of a witty, Googleable name for the system, I’d love to hear that too.