Seven-inch heels, natural language processing, and sociology

The following is a guest post from Trey Causey, a long-time reader of codeandculture and a grad student at Washington who does a lot of work with web scraping. We got to discussing a dubious finding and at my request he graciously wrote up his thoughts into a guest post.

| Trey |

Recently, Gabriel pointed me to a piece in Ad Age (and original press release) about IBM researchers correlating the conversations of fashion bloggers with the state of the economy (make sure you file away the accompanying graph for the next time you teach data visualization). Trevor Davis, a “consumer-products expert” with IBM, claimed that as economic conditions improve, the average height of high heels mentioned by these bloggers decreases. Similarly, as economic conditions worsen, the average height would increase. As Gabriel pointed out, these findings seemed to lack any sort of face validity — how likely does it seem that, at any level of economic performance, the average high heel is seven inches tall (even among fashionistas)? I’ll return to the specific problems posed by what I’ll call the “seven-inch heel problem” in a moment, but first some background on the methods that most likely went into this study.

Much of this work is being met by what I perceive to be reflexive criticism (as in automatic, rather than in the more sociological sense) from within the academy. The Golder and Macy piece in particular received sharp criticism in the comments on orgtheory, labeled variously “empiricism gone awry”, non-representative, and even fallacious (and in which yours truly was labeled “cavalier”). Some of this criticism is warranted, although as is often the case with new methods and data sources, much of the criticism seems rooted in misunderstanding. I suspect part of this is the surprisingly long-lived skepticism of scholarly work on “the internet” which, with the rise of Facebook and Twitter, seems to have been reinvigorated.

However, sociologists are doing themselves a disservice by seeing this work as research on the internet qua internet. Incredible amounts of textual and relational data are out there for the analyzing — and we all know if there’s one thing social scientists love, it’s original data. And these data are not limited to blog posts, status updates, and tweets. Newspapers, legislation, historical archives, and more are rapidly being digitized, providing pristine territory for analysis. Political scientists are warming to the approach, as evidenced by none other than the inimitable Gary King and his own start-up Crimson Hexagon, which performs sentiment analysis on social media using software developed for a piece in AJPS. Political Analysis, the top-ranked journal in political science and the methodological showcase for the discipline, devoted an entire issue in 2008 to the “text-as-data” approach. Additionally, a group of historians and literary scholars have adopted these methods, dubbing the new subfield the “digital humanities.”

Sociologists of culture and diffusion have already warmed to many of these ideas, but the potential for other subfields is significant and largely unrealized. Social movement scholars could find ways to empirically identify frames in wider public discourse. Sociologists of stratification have access to thousands of public- and private-sector reports, the texts of employment legislation, and more to analyze. Race, ethnicity, and immigration researchers can model changing symbolic boundaries across time and space. The real mistake, in my view, is dismissing these methods as an end in and of themselves rather than as a tool for exploring important and interesting sociological questions. Although many of the studies hitting the mass media seem more “proof of concept” than “test of theory,” this is changing; sociologists will not want to be left behind. Below, I will outline the basics of some of these methods and then return to the seven-inch heels problem.

The use of simple scripts or programs to scrape data from the web or Twitter has been featured several times on this blog. The data that I collected for my dissertation were crawled and then scraped from multiple English and Arabic news outlets that post their archives online, including Al Ahram, Al Masry Al Youm, Al Jazeera, and Asharq al Awsat. The actual scrapers are written in Python using the Scrapy framework.
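For readers who have not seen one, the "scrape" half of crawl-and-scrape boils down to pulling the article text out of a page's HTML and discarding the navigation and other boilerplate. The sketch below is only an illustration: it uses Python's standard-library `html.parser` rather than Scrapy, the sample page is invented, and the assumption that article text lives in `<p>` tags will not hold for every outlet.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of every <p> element in an HTML page."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraphs
        self._buffer = None    # text chunks of the <p> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        # Text outside a <p> (menus, footers) is ignored.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Join the chunks and normalize runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None


def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical stand-in for a downloaded archive page.
SAMPLE_PAGE = """
<html><body>
  <div class="nav"><a href="/">Home</a> | <a href="/archive">Archive</a></div>
  <h1>Protesters gather downtown</h1>
  <p>Thousands of protesters gathered in the
     <b>central square</b> on Friday.</p>
  <p>Officials declined to comment.</p>
  <div class="footer">Copyright notice</div>
</body></html>
"""

paragraphs = extract_paragraphs(SAMPLE_PAGE)
for p in paragraphs:
    print(p)
```

A framework such as Scrapy adds the crawling, scheduling, and retry logic around this step, which is most of the work on a real archive.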

Obtaining the data is the first and least interesting step (to sociologists). Using the scraped data, I am creating chains of topic models (specifically using Latent Dirichlet Allocation) to model latent discursive patterns in the media from the years leading up to the so-called “Arab Spring.” In doing so, I am trying to identify the convergence and divergence in discourse across and within sources to understand how contemporary actors were making sense of their social, political, and economic contexts prior to a major social upheaval. Estimating common knowledge prior to contentious political events is often problematic due to hindsight biases, because of the problems of conducting surveys in non-democracies, and for the obvious reason that we usually don’t know when a major social upheaval is about to happen even if we may know which places may be more susceptible.

Topic modeling is a method that will look familiar in its generalities to anyone who has seen a cluster analysis. Essentially, topic models use unstructured text (i.e., text without labeled fields from a database or from a forced-choice survey) to model the underlying topical components that make up a document or set of documents. For instance, one modeled topic might be composed of the words “protest”, “revolution”, “dictator”, and “tahrir”. The model attempts to find the words that have the highest probability of being found with one another and the lowest probability of being found with other words. The generated topics are devoid of meaning, however, without theoretically informed interpretation. This is analogous to survey researchers who perform cluster or factor analyses to find items that “hang together” and then attempt to figure out what the latent construct is that links them.
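To give a feel for the mechanics, here is a minimal sketch of Latent Dirichlet Allocation fit by collapsed Gibbs sampling. Several things in it are assumptions made for illustration: Gibbs sampling is only one of several ways to estimate the model, the four-document corpus is invented (the economic vocabulary especially), and the values of `alpha`, `beta`, and `n_iter` are arbitrary. Real work would use a maintained library on a far larger corpus.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Count tables the sampler maintains.
    doc_topic = [[0] * n_topics for _ in docs]         # tokens in doc d assigned to topic k
    topic_word = [Counter() for _ in range(n_topics)]  # times word w is assigned to topic k
    topic_total = [0] * n_topics                       # tokens assigned to topic k overall

    # Start by assigning every token to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove this token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # P(topic | all other assignments) is proportional to
                # (how much doc d uses topic t) * (how much topic t uses word w).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions (each row sums to 1).
    theta = [
        [(row[t] + alpha) / (sum(row) + n_topics * alpha) for t in range(n_topics)]
        for row in doc_topic
    ]
    # The most frequent words currently assigned to each topic.
    top_words = [[w for w, c in tw.most_common(3) if c > 0] for tw in topic_word]
    return theta, top_words, topic_total


# Invented toy corpus, already tokenized.
docs = [
    ["protest", "revolution", "dictator", "tahrir", "protest"],
    ["revolution", "tahrir", "protest", "dictator"],
    ["market", "prices", "inflation", "exports", "market"],
    ["inflation", "prices", "exports", "market"],
]

theta, top_words, topic_total = lda_gibbs(docs, n_topics=2)
for k, words in enumerate(top_words):
    print("topic", k, words)
```

Note that the output is just lists of words and proportions. Deciding that a topic is "about" contention or about the economy is the analyst's job, which is the point made above about interpretation.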

Collections of documents (a corpus) are usually represented as a document-term matrix, where each row is a document and the columns are all of the words that appear in your set of documents (the vocabulary). The contents of the individual cells are the per-document word frequencies. This produces a very sparse matrix, so some pre-processing is usually performed to reduce the dimensionality. The majority of all documents from any source are filled with words that convey little to no information — prepositions, articles, common adjectives, etc. (see Zipf’s law). Words that appear in every document or in a very small number of documents provide little explanatory power and are usually removed. The texts are often pre-processed using tools such as the Natural Language Toolkit for Python or RTextTools (which is developed in part here at the University of Washington) to remove these words and punctuation. Further, words are often “stemmed” or “lemmatized” so that the number of words with common suffixes and prefixes but with similar meanings is reduced. For example, “run”, “runner”, “running”, and “runs” might all be reduced to “run”.
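The pipeline just described can be sketched in a few lines. This is a toy version under stated assumptions: the stopword list is a tiny hand-picked set, `crude_stem` is a hypothetical suffix-stripper written only for this example (a real project would use a proper stemmer or lemmatizer from a toolkit such as NLTK), and the three documents are invented.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real lists run to hundreds of words.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def crude_stem(word):
    """Strip one common suffix. Deliberately naive, for illustration only."""
    for suffix in ("ing", "er", "s"):
        if suffix == "s" and word.endswith("ss"):
            continue  # leave words like "press" alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant left behind ("runn" -> "run").
            if len(stem) >= 4 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


def build_dtm(docs):
    """Return (vocabulary, matrix) with one row of term counts per document."""
    counts = []
    for text in docs:
        tokens = re.findall(r"[a-z]+", text.lower())  # drops punctuation and digits
        stems = [crude_stem(t) for t in tokens if t not in STOPWORDS]
        counts.append(Counter(stems))
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


docs = [
    "The runner runs in the park",
    "Running in the park is a protest",
    "The protests are in the square",
]

vocab, matrix = build_dtm(docs)
print(vocab)   # ['park', 'protest', 'run', 'square']
for row in matrix:
    print(row)  # [1, 0, 2, 0] then [1, 1, 1, 0] then [0, 1, 0, 1]
```

Even in this toy case a third of the cells are zero. With a realistic vocabulary of tens of thousands of terms, nearly all of them are, which is why the matrix is usually stored in a sparse format.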

This approach is known as a “bag-of-words” approach in that the order and context of the words are assumed to be unimportant (obviously, a contentious assumption, but perhaps that is a debate for another blog). Researchers who are uncomfortable with this assumption can use n-grams, groupings of two or more words, rather than single words. However, as the n increases, the number of possible combinations and the accompanying computing power required grow rapidly. You may be familiar with the Google Ngram Viewer. Most of the models are extendable to other languages and are indifferent to the actual content of the text, although obviously the researcher needs to be able to read and make sense of the output.
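Extracting n-grams is simple; the cost is in how many distinct ones there can be. The sketch below shows both. The example sentence and the vocabulary size of 10,000 are assumptions chosen for illustration.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "seven inch heels are back this season".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams[:2])   # [('seven', 'inch'), ('inch', 'heels')]
print(trigrams[0])   # ('seven', 'inch', 'heels')

# Upper bound on distinct n-grams for a hypothetical 10,000-word vocabulary:
# each extra word in the window multiplies the possibilities by 10,000.
vocab_size = 10_000
possible = {n: vocab_size ** n for n in (1, 2, 3)}
print(possible)      # {1: 10000, 2: 100000000, 3: 1000000000000}
```

Most of those combinations never occur in practice, but the number of columns in the document-term matrix still grows much faster than the amount of text does.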

Other methods require different assumptions. If you are interested in parts of speech, a part-of-speech tagger is required, which assumes that the document is fairly coherent and not riddled with typos. Tracking exact or near-exact phrases is difficult as well, as evidenced by the formidable team of computer scientists working on MemeTracker. The number of possible variations on even a short phrase quickly becomes unwieldy and requires substantial computational resources — which brings us back to the seven-inch heels.

Although IBM now develops the oft-maligned SPSS, they also produced Watson. This is why the total lack of validity of the fashion blogging results is surprising. If one were seriously going to track the height of heels mentioned and attempt to correlate it with economic conditions, then in order to have any confidence that one has captured a non-biased sample of mentions, at least two steps would be necessary:

Identifying possible combinations of size metrics and words for heels: seven-inch heels, seven inch heels, seven inch high heels, seven-inch high-heels, seven inch platforms, and so on. This is further complicated by the fact that many text processing algorithms will treat “seven-inch” as one word.

Dealing with the problem of punctuational abbreviations for these metrics: 7″ heels, 7″ high heels, 7 and a 1/2 inch heels, etc. Since punctuation is usually stripped out, it would be necessary to leave it in, but then how to distinguish quotation marks that appear as size abbreviations from those that appear in other contexts?

Do we include all of these variations with “pumps?” Is there something systematic such as age, location, etc. about individuals that refer to “pumps” rather than “heels?”

Are there words or descriptions for heels that I’m not even aware of? Probably.

None of these is an insurmountable problem and I have no doubt that IBM researchers have easy access to substantial computing power. However, each of them requires careful thought prior to and following data collection; the combination of them together quickly complicates matters. Since IBM is unlikely to reveal their methods, though, I have serious doubts as to the validity of their findings.
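To make the bookkeeping concrete, here is a minimal sketch of a single regular expression that handles the variants listed above. It is emphatically not IBM's method, which is unknown. The choices in it are assumptions made for illustration: number words only up to ten, a quotation mark counted as "inches" only when it directly follows the number, and "pumps" and "platforms" treated as equivalent to "heels".

```python
import re

# Assumption: only the number words one through ten are recognized.
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

# A digit string (optionally decimal) or a number word.
_number = r"(?P<whole>\d+(?:\.\d+)?|" + "|".join(NUMBER_WORDS) + r")"
# Optional spelled-out fraction, as in "7 and a 1/2".
_fraction = r"(?:\s+and\s+a\s+(?P<num>\d+)/(?P<den>\d+))?"
# Unit: a straight quote, double prime, or curly quote directly after the
# number, or the word "inch"/"inches" after a space or hyphen.
_unit = r'(?:["\u2033\u201d]|[\s-]+inch(?:es)?)'
# Shoe word, optionally preceded by "high" ("high heels", "high-heels").
_shoe = r"[\s-]+(?:high[\s-]+)?(?:heels?|platforms?|pumps?)\b"

HEEL_PATTERN = re.compile(r"\b" + _number + _fraction + _unit + _shoe, re.IGNORECASE)


def heel_heights(text):
    """Return every heel height (in inches) mentioned in text."""
    heights = []
    for m in HEEL_PATTERN.finditer(text):
        whole = m.group("whole").lower()
        value = float(NUMBER_WORDS[whole]) if whole in NUMBER_WORDS else float(whole)
        if m.group("num") and int(m.group("den")) != 0:
            value += int(m.group("num")) / int(m.group("den"))
        heights.append(value)
    return heights


mentions = [
    "seven-inch heels",
    "seven inch heels",
    "seven inch high heels",
    "seven-inch high-heels",
    "seven inch platforms",
    "7\u2033 heels",
    "7\u2033 high heels",
    "7 and a 1/2 inch heels",
]
for text in mentions:
    print(text, heel_heights(text))

# Numbers near the word "heels" that are not heights should not match.
print(heel_heights("I own 12 pairs of heels"))  # []
```

Every one of those assumptions is a place where systematic error can creep in, and this pattern still knows nothing about stilettos, wedges, centimeters, or heights written as ranges.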

As any content analyst can tell you, text is a truly unique data source as it is intentional language and is one of the few sources of observational data for which the observation process is totally unobtrusive. In some cases, the authors are no longer alive! Much of the available online text of interest to social scientists was not produced for scholarly inquiry and was not generated from survey responses. However, the sheer volume of the text requires some (but not much!) technical sophistication to acquire and make sense of and, like any other method, these analyses can produce results that are essentially meaningless. Just as your statistics package of choice will output meaningless regression results from just about any data you feed into it, automated and semi-automated text analysis produces its own share of seven-inch heels.

11 Comments

Coding error of quantity, age, or other numbers as heights. This could introduce substantial outliers. For instance “Imelda Marcos had 1000 pairs of heels” or “I wasn’t allowed to wear heels until I was 16”

Ignoring references to shoes where (low) height is implicit but not stated in inches. For example “I just bought a very cute pair of flats” probably wouldn’t be coded as information about heel height, even though it arguably should be.

I see the real take home message as being that you really need to look for outliers and spot check the data. I am not going to believe this IBM data until I see (a) a histogram for each period and (b) the actual websites that apparently described women walking around on stilts in the first half of 2009.

Also disappointed that all the press coverage of this finding took it at face value and nobody stopped to say, “wait a minute, I don’t remember the spring 2009 collections having any shoes that resembled a ballerina standing en pointe.”

Trey, it may be helpful to tell readers who aren’t familiar with Gary King’s recent work a little more about how the content analysis is done, since it could fit well with block modeling. King’s method (in short) is to have multiple coders analyze an initial set of documents and then replicate what the average coder would do in that situation. Of course, hiring multiple coders usually isn’t viable for us grad students…

3. Trey | November 21, 2011 at 1:38 pm

Good points. Yes, King calls his method “semi-automated” content analysis in that a training set is hand coded. Yes, hiring multiple coders is not viable for us grad students, but Amazon Mechanical Turk certainly is, depending on the kinds of documents and the intricacy of the coding scheme you are using. I know you can also set up a random sample of documents to be served to multiple users to establish inter-coder reliability as well. However, those tasks are suited more for “positive/negative/neutral” tasks than any kind of fine-grained analysis.

4. Trey | November 21, 2011 at 12:13 pm

Thanks for the opportunity to guest blog!

Your points are all exactly right on. In fact, there aren’t even outliers in the traditional sense if your units are not comparable. Just as survey researchers have spent decades trying to solve reliability and validity problems, content analysts have spent years working out the same issues. Unfortunately, most of the “flashy” work in this area ignores a lot of that work which likely tarnishes sociologists’ impressions of the methods.

5. Trevor Davis | November 21, 2011 at 1:04 pm

The analysis is of 12 key influencers we identified out of a corpus of >1bn (non porn – we have a way to remove it that works well) discussions identified since 2008 – we used k-means to identify topic clusters, then looked for highly networked individuals within those clusters, then tested their predictive ability by reading a sample of posts and looking at what happened later (e.g. next fashion season). The 7 inch reference in the release is to the median of what those individuals were posting as predictions (not actual observed heights). For example, “I saw several articles about 7 inch heels this week – does that mean we get those next season”. We extracted the heel height entities from the text by a mix of very, very large numbers of RegEx statements we have been developing since 2004, and a little NLP. Processing horsepower was moderately high as I used a server farm in one of our research labs.

Thanks for the clarification. Limiting it to 12 particular bloggers would definitely solve the issue of getting things that don’t belong in the universe (eg, porn). Of course it introduces its own set of issues, but those are the trade-offs.

Also, using median would deal w any outliers. (I was confused by references to “average” in the press write-ups).

Your mention of “key influencers” makes me wonder about your thoughts on this as compared to Dodds and Watts JCR 2007 challenging the opinion leader hypothesis (or more precisely, delineating as a scope condition that you have to have a preferential attachment structure). Some of my current work suggests a further scope condition that opinion leadership is also very fragile to the introduction of diffusion mechanisms other than network contagion (i.e., it could be washed out by advertising or bestseller lists).

7. Charles Seguin | November 21, 2011 at 1:20 pm

Trey,

Great post. I’m going to have to take a look at that AJPS issue and the Scrapy module. Your dissertation sounds absolutely fascinating.

Mostly just wanted to say thanks for bringing these things to my attention.

It seems like another potential problem with these data is the multiple testing issue. As you point out, it has to be in a “test of theory” framework. Although this may or may not apply to the 7 inch heels phenomenon, given the massive amount of data, one could probably find no end of statistically significant relationships between just about anything.

8. Trey | November 21, 2011 at 1:27 pm

Trevor, thanks for responding with clarification! Your point about the “very, very large numbers” of regular expressions in development for seven years underscores some of my points above about the (not insurmountable) challenges in doing this kind of work. It still seems a bit curious that seven inches could be the median of mentioned heel heights — half of the mentions are equal to or larger than seven inches?

As far as influencers go, his substantive focus is a bit different, but you may be interested in sociologist Duncan Watts’s work (he is now at Yahoo! Research) on the difficulty in reliably identifying them.

Finally, one of the points that I am trying to make about these methods is that they are often presented as “wow, isn’t that interesting?” and the explanation is left for the reader’s imagination. However, if we have no plausible theoretical reason for why such an association would be present, it’s not clear that it is meaningful sociologically.

9. Trevor Davis | November 21, 2011 at 2:11 pm

The identification challenge is the big one, I agree. Thanks for pointing out where Watts has gone too – I had lost sight of him. For influencers we use a mixture of approaches – in particular we have adapted von Hippel’s work on lead users to characterise a different kind of key influencer from those usually described in the literature (more involved in their subject than trying to shape it).

Back to the 7 inches. The influencers we watched tend to be drawn to glamour and high fashion. They watch catwalks for example and so their view is in some ways distorted. However, for me that is beside the point as I was interested in how their predictive views changed with time against a backdrop of economic ups and downs.
