Learn more about using open source R for big data analysis, predictive modeling, data science and more from the staff of Revolution Analytics.

May 16, 2013

Social Network Analysis at New Frontiers in Computing 2013

by Joseph Rickert

This past Saturday, the New Frontiers in Computing Conference (NFIC 2013), held at Stanford University, explored the theme Social Network Analysis: It’s Who You Know. The speakers were a well-chosen, eclectic lot who covered a remarkable array of issues in less than a full day. Ian Hersey, former CTO of Attensity, spoke on Lessons from Large-Scale Social Analytics. Michael Wu, Chief Scientist of Lithium Technologies, provided an introduction to social network analysis and very gamely conducted a live experiment building a social network of attendee tweets during the conference. Rong Yan, the Engineering Manager for Ads Relevance and Quality at Facebook, spoke about machine learning insights. Zahan Malkani, an engineer at Facebook, presented “Dog”, the yet-to-be-released social media programming language. Shivakumar Vaithyanathan, Chief Scientist for Text Analytics at the IBM Almaden Research Center, spoke about text analytics built around IBM’s Annotation Query Language (AQL). Laura Jacob, a FactSet engineer and president of the IEEE’s Society on the Social Implications of Technology, spoke about “Context Collapse”, a fundamental cause of the damaging “oversharing” trap that so many Facebook and Twitter users fall into. Finally, John Rehling, Senior Research Scientist at Reputation.com, “cleaned up” with an alarming discussion of the mind-boggling hazards we all face in just using the Internet.

Although most of the talks were obviously enhanced versions of corporate presentations, there was nothing superficial about the day. Collectively, the presentations and panel discussions provided a comprehensive, multidimensional look at the technologies, issues and challenges associated with social networks. Most refreshingly, the day was largely hype free: no beating the drum for big data or promoting unreasonable expectations for Hadoop. The presenters all seemed to be pretty much in agreement about the current best practices in technology. Hadoop, for example, was characterized as the place for massive amounts of persistent data, but not a suitable platform for ingesting social media data, where low latency is of paramount importance. And Rong Yan pointed out that although Facebook is a big Hadoop shop, they do not use MapReduce for analyses that require status sharing among processors distributed across the cluster. R came up at various times during the discussions in a matter-of-fact way. Rong pointed out, for example, that for data stored in Hadoop clusters, Pig or Hive will typically be used to aggregate the data, at which point it is no longer big data. After that, R, MATLAB or SQL might be used for analysis. He indicated that most business questions can be answered with relatively small data sets. When it really is necessary to work with a large data set, the analysis is likely to be done in C++. At one point Shivakumar casually remarked that AQL syntax looks a lot like R.

A technical highlight of the day was Michael Wu’s introduction to social network analysis (SNA). With the help of an open source plug-in to Excel, he was able to start from first principles and work up to explaining some fairly sophisticated performance metrics for social network graphs, such as eigenvector centrality. Basically, this is the notion of giving high scores to nodes that are connected to nodes that are themselves central within the network. (For a very nice explanation of this idea and pointers to the source papers, have a look at the PLoS paper by Gabriele Lohmann et al.)
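To make the idea concrete, here is a minimal sketch of eigenvector centrality computed by power iteration. It is written in plain Python for portability; it is not what Michael showed in Excel and not the igraph implementation. The function name, its parameters, and the five-node toy network are all assumptions invented for this illustration.

```python
def eigenvector_centrality(adjacency, max_iter=200, tol=1e-10):
    """Power iteration on an undirected graph given as {node: set_of_neighbors}.

    Hypothetical helper for illustration only. Scores are rescaled so that
    the most central node has a score of 1.
    """
    scores = {node: 1.0 for node in adjacency}
    for _ in range(max_iter):
        # Each node keeps its own score and adds the scores of its neighbors.
        # Keeping the own score (iterating with A + I instead of A) leaves the
        # eigenvectors unchanged but avoids oscillation on bipartite graphs.
        updated = {
            node: scores[node] + sum(scores[nbr] for nbr in neighbors)
            for node, neighbors in adjacency.items()
        }
        top = max(updated.values())
        updated = {node: value / top for node, value in updated.items()}
        change = max(abs(updated[node] - scores[node]) for node in adjacency)
        scores = updated
        if change < tol:
            break
    return scores


# Toy network (made up): "a" is a hub, "d" bridges the hub to "e".
toy_network = {
    "a": {"b", "c", "d"},
    "b": {"a"},
    "c": {"a"},
    "d": {"a", "e"},
    "e": {"d"},
}

scores = eigenvector_centrality(toy_network)
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{node}: {score:.3f}")
```

In this toy network, "b", "c" and "d" are all attached to the hub, but "d" ends up with a higher score than "b" or "c" because it has a second connection. That is the "connected to central nodes" effect in miniature: a node's score depends on the scores of its neighbors, not just on how many neighbors it has.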

Michael gave a remarkably clear presentation, and although he did not use R, he could have. For anyone with an interest in getting started with SNA, I recommend the 2010 Social Network Analysis Labs in R written by McFarland, Messing and Nowak. The labs use functions from the igraph package and data from the NetData package to provide a challenging introductory SNA course. The first plot (from the 4th lab) shows a network graph of student interactions using the studentnets.S641 data set.

This next plot shows the eigenvector centrality score for each student.

The most fascinating and distressing presentations and discussions happened in the section on Privacy Implications for SNA. Laura Jacob started things off here by providing some social theory background for the problem of inadvertently oversharing on social media sites. Frequently this sort of thing happens when the imagined audience for a tweet, message or photo turns out not to be the actual audience. This “context collapse” results from the tension between the individual’s attempt to establish some level of privacy and the social media site’s desire to obtain information. Laura explained that social media sites know that if they put you in a certain context, you are more likely to share information that is appropriate for that context. However, unless you are really careful about the privacy settings, the actual context might include a wider audience than intended. At some level, participating in social media is like continually reliving that part of your wedding day where you worked very hard to limit the conversation between your new in-laws at Table 1 and your Vegas party friends seated at Table 12. For more on the theory, take a look at Laura’s suggested reading list of (Goffman 1959) and (Marwick 2010).

In the final presentation of the day, John Rehling took the attendees through the “Spectrum of Social Distance”: self < younger self < family < friend < acquaintance < enemy. He recounted a number of cases where reputations were tarnished and irrevocable damage was done by people closer than family, and then pointed out that in the future we can expect to live in a world where individually innocuous bits of information will be assembled to form damaging information.

This very brief summary of the conference does not do justice to any of the presenters, but I will end here with Ian Hersey’s list of ongoing challenges for SNA:

The growth in the volume of data (10% increase per month)

Data Quality Assurance

Rich natural language processing in many languages across many domains
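To put the first item in perspective, here is a quick back-of-the-envelope calculation of what a 10% monthly increase compounds to. This is only a sketch: it assumes the 10% rate holds steady month after month, which the talk did not claim, and the function names are made up for this illustration.

```python
import math

MONTHLY_GROWTH = 0.10  # the 10% per month figure from the list above


def growth_factor(months, rate=MONTHLY_GROWTH):
    """Factor by which the volume grows after `months` of compounding."""
    return (1.0 + rate) ** months


def doubling_time_months(rate=MONTHLY_GROWTH):
    """Number of months for the volume to double at a constant monthly rate."""
    return math.log(2.0) / math.log(1.0 + rate)


print(f"Growth over one year: {growth_factor(12):.2f}x")
print(f"Doubling time: {doubling_time_months():.1f} months")
```

Under that assumption, the data volume roughly triples in a year and doubles in a little over seven months.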