With the proliferation of data, the increasing availability of rather simple tools to analyze it, a growing number of people able to use these tools, and the availability of low-cost publication platforms (e.g. blogs), the potential to democratize certain aspects of scientific processes – such as empirical data analysis – seems tremendous. This might give rise to the idea that everyone who can use these tools (such as Python) and publish the results of their analysis (e.g. via blog posts) can now participate in knowledge production.

An opportunity for data analysis by the masses: If true, the potential of such a development would be enormous: by increasing the number of people who participate in scientific processes, we could increase the coverage of interesting phenomena to explore, research activity would not be constrained to areas funded by large institutional bodies, and in general more research could get done.

At the same time, this would represent a fundamental shift in the way science has operated until now, as people formerly not part of traditional scientific processes (and not trained in scientific knowledge production) move into new territory and participate in new processes. In order to understand this shift, we need to understand the modi operandi of scientific knowledge production in the past.

Different modes of knowledge production: There are many ways to look at scientific knowledge production. A very influential distinction has been made by Gibbons et al. [GLN97], who argue that we have to differentiate between “Mode 1” and “Mode 2” knowledge production.


Mode 1 refers to traditional knowledge production processes, focusing on hierarchical mechanisms and processes executed by a set of homogeneous actors from a common disciplinary background. An example would be the ivory tower view of a university, where a scientist or group of scientists with homogeneous backgrounds work on disciplinary problems. This mode is increasingly being replaced by Mode 2 knowledge production, which is socially distributed, organizationally diverse, application-oriented, and trans-disciplinary [GLN97, NSG03]. An example would be a network of university partners with different disciplinary backgrounds collaborating on an application-oriented problem with other stakeholders from e.g. industry or other public institutions.

Mode 3 knowledge production: The proliferation of data, tools and people able to make use of them might give rise to what I might call Mode 3 knowledge production, which could be self-organized, context-focused, and driven by individuals not primarily trained in scientific processes. An example would be an interested user (or group of users) of a social network platform who looks at data that might explain some online social network phenomenon they feel is worth exploring. Another might be a group of patients performing self experiments or experiments with n=1 in order to explore the cause of personal symptoms or health concerns. These groups might embed the discussion of their findings into community conversations and social sensemaking processes.

While this idea looks appealing on the surface, there are a number of issues. For example: Mode 1 and Mode 2 knowledge production differ in terms of organization, but both follow the scientific method in terms of basic mechanisms and values. It is as yet unclear whether an emerging Mode 3 would adhere to the scientific method as well. Being able to use analysis tools to look at data does not necessarily mean that whatever kind of analysis follows from that contributes to scientific processes in meaningful ways.

The scientific method: So what is the scientific method, i.e. what are some of the standards, ethics and practices that Mode 1 and Mode 2 knowledge production follow, which a potential Mode 3 knowledge production would have to adopt as well? Answers can be found in the philosophy of science, which has long been thinking about the nature of science and scientific processes. This is an entire field that cannot be adequately described here – the Hempel–Oppenheim model would just be one of many examples.

However, typical qualities of scientific processes would include, but are not limited to: the ability to reproduce results including a proper description of methods and means of data collection, sharing of data, the quality of hypotheses (w.r.t. falsifiability, explanatory power, understandability, etc), the relation to state-of-the-art research including proper citations of existing literature, critical reflections about the validity of findings, as well as the quality of interpretations and whether they follow from the data.

Do blog posts follow the scientific method? While there is nothing that prevents research published via blog posts from following the scientific method, more often than not blog posts – even data-oriented ones – fail to meet these most basic requirements. For example, from a data visualization published via a blog post it does not necessarily become clear where the data is from, how the data has been collected, which methods have been applied, whether the results are reproducible, whether the data used will be shared, how the analysis relates to the state-of-the-art of scientific knowledge or whether there is an agreement that the conclusions presented follow from the data.

This is not surprising. In scientific articles, peer review is the most common (but certainly not infallible) instrument to check whether submitted research follows the scientific method. In blog posts and similar user-generated media, there are currently no established social or other mechanisms enforcing the scientific method, which often makes their results – while potentially interesting – less useful from a scientific perspective. In addition, it is typically impossible for a researcher to ignore a reviewer’s comment (as an editor will decide based on reviewers’ comments whether to publish an article or not), while it is usually easy for a blogger to delete an unwanted comment.

Conclusion: Whether a third mode of knowledge production will ultimately emerge is unclear. While the democratization of data analysis will no doubt continue to expand, it will depend on the masses of amateurs and bloggers adopting principles based on the scientific method, on the masses of scientists participating in and enforcing the scientific method in blog conversations, or both. It will probably not depend on the technicalities of the publishing medium – whether blog posts or not.

References:

M. Gibbons, C. Limoges, and H. Nowotny. The new production of knowledge: the dynamics of science and research in contemporary societies. Sage, 1997.

H. Nowotny, P. Scott, and M. Gibbons. ‘Mode 2’ revisited: The new production of knowledge. Minerva, 41(3), 2003.

I’ve made it a hobby to ask this question of professors I meet at conferences in my field. The answers I have collected in these conversations manifest an astonishing variety of underlying research philosophies and ideologies. Here’s a list of the answers I have received so far; the labels in brackets are mine, and they might be misleading, deceptive or misrepresent the original intent of the answer given.

When he is offered a position in industry or academia that assumes a PhD (the American view)

When he has convinced his corresponding research (sub-)community that the work he has been doing is worthy of a PhD (the sociologist’s / psychologist’s view)

When he has expanded the state of knowledge by a significant amount / When he added new knowledge to the existing body of knowledge about the world (the epistemological view)

When he has built something truly new, interesting, elegant and/or complex (the engineer’s view)

When he has reached his personal intellectual maximum, i.e. the maximum intellectual capacity that he is capable of acquiring (the subjective view)

When he is able to explain the results of his work in one sentence (the communication view)

When he has published n papers (the bureaucrat’s view)

I am amazed that there is little repetition in the answers that I get. What is your answer? Add it to the comments.

Next week, my PhD student Claudia Wagner will present results from one of our recent studies on the susceptibility of users in online social networks at the #MSM2012 workshop at WWW’2012 conference in Lyon, France.

In our paper (downloadsocialbots.pdf), we analyze data from the Socialbot Challenge 2011 organized by T. Hwang and the WebEcologyProject, in which a set of Twitter users was targeted by three teams who implemented socialbots and released them “into the wild” (i.e. deployed them on Twitter). The objective for each team was to elicit certain responses from target users, such as @replies or follows. Our work on this dataset aimed to understand and model the factors that make users susceptible to such attacks.

Our results indicate that even very active Twitter users, who might be expected to develop certain skills and competencies for using social media, are prone to attacks. The work presented in this paper increases our understanding about vulnerabilities of online social networks, and represents a stepping stone towards more sophisticated measures for protecting users from socialbot attacks in online social network environments.

The Figure below depicts the network of users and socialbots in our dataset (a set of users who were targeted by social bots during the Socialbot challenge), how they link to each other, and highlights those users who were susceptible to the attacks (green and orange nodes).

Susceptibility of users on Twitter who were targeted by socialbots during the Socialbot challenge 2011 (organized by T. Hwang and the WebEcologyProject). Each node represents a Twitter user: red nodes represent socialbots (total of 3), blue nodes represent users who did not interact with social bots, green nodes represent users who have interacted with at least one social bot, orange nodes represent users who have interacted with all social bots. Dashed edges represent social links between users which existed prior to the challenge, solid edges represent social links that were created during the challenge. Large nodes have a high follower/followee ratio (more popular users), small nodes have a low follower/followee ratio (less popular users). Network visualization generated by my student Simon Kendler.

Here’s the abstract of our paper:

Abstract: Social bots are automatic or semi-automatic computer programs that mimic humans and/or human behavior in online social networks. Social bots can attack users (targets) in online social networks to pursue a variety of latent goals, such as to spread information or to influence targets. Without a deep understanding of the nature of such attacks or the susceptibility of users, the potential of social media as an instrument for facilitating discourse or democratic processes is in jeopardy. In this paper, we study data from the Social Bot Challenge 2011 – an experiment conducted by the WebEcologyProject during 2011 – in which three teams implemented a number of social bots that aimed to influence user behavior on Twitter. Using this data, we aim to develop models to (i) identify susceptible users among a set of targets and (ii) predict users’ level of susceptibility. We explore the predictiveness of three different groups of features (network, behavioral and linguistic features) for these tasks. Our results suggest that susceptible users tend to use Twitter for a conversational purpose and tend to be more open and social since they communicate with many different users, use more social words and show more affection than non-susceptible users.
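
Task (i) from the abstract can be illustrated with a minimal sketch. The feature names, weights and threshold rule below are hypothetical stand-ins for the network, behavioral and linguistic feature groups the paper explores – they are not the paper’s actual model:

```python
# Hypothetical sketch: classify a target user as susceptible or not
# from a weighted combination of illustrative feature values.
def susceptibility_features(user):
    """Compute toy stand-ins for the three feature groups."""
    return {
        # behavioral: conversational use of Twitter
        "replies_per_tweet": user["replies"] / max(user["tweets"], 1),
        # network: how many distinct users the target communicates with
        "distinct_contacts": user["contacts"],
        # linguistic: share of "social" words in the target's tweets
        "social_word_rate": user["social_words"] / max(user["words"], 1),
    }

def predict_susceptible(user, weights, threshold=1.0):
    """Weighted-sum classifier: score above threshold -> susceptible."""
    feats = susceptibility_features(user)
    score = sum(weights[name] * value for name, value in feats.items())
    return score >= threshold
```

A real model would learn the weights from the challenge data; here they would simply be supplied by hand.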

Here’s the full reference, including a link to the article (socialbots.pdf):

Reference and PDF Download: C. Wagner, S. Mitter, C. Körner and M. Strohmaier. When social bots attack: Modeling susceptibility of users in online social networks. In Proceedings of the 2nd Workshop on Making Sense of Microposts (MSM’2012), held in conjunction with the 21st World Wide Web Conference (WWW’2012), Lyon, France, 2012. (downloadsocialbots.pdf)

The ACM Hypertext and Social Media conference is a premium venue for high quality peer-reviewed research on hypertext theory, systems and applications. It is concerned with all aspects of modern hypertext research including social media, semantic web, dynamic and computed hypertext and hypermedia as well as narrative systems and applications. The ACM Hypertext and Social Media 2012 conference will focus on exploring, studying and shaping relationships between four important dimensions of links in hypertextual systems and the World Wide Web: people, data, resources and stories.

The Library of Babel is a theoretical library that holds the sum of all books that can be written with (i) a given set of symbols and (ii) a given page limit. According to Wikipedia, the Library of Babel is based on a short story by the author and librarian Jorge Luis Borges (1899–1986). Its idea is simple: the library holds all books that can be produced by every combinatorially possible sequence of symbols up to a certain book length. In Borges’s case, the Library is immensely large, since it contains all possible books up to 410 pages. The American Scientist calculates:

… each book has 410 pages, with 40 lines of 80 characters on each page. Thus a book consists of 410 [pages] × 40 [lines] × 80 [characters] = 1,312,000 symbols. There are 25 choices for each of these symbols, and so the library’s collection consists of 25^1,312,000 books.

But what is the size of a Library of Twitter, i.e. the size of the set of all theoretically possible tweets? It should be (i) much smaller and (ii) much easier to calculate due to the particular structure of tweets. Here’s a brief back-of-the-envelope calculation:

Given the 140 character limit of tweets, and assuming an English vocabulary of 26 symbols expanded by basic syntactical elements such as periods (.), commas (,), spaces ( ), at signs (@), hash signs (#) and a few others, we end up with all combinatorially possible sequences of 140 characters over a vocabulary of maybe 50 symbols. Based on these (conservative) assumptions, the Library of Twitter holds at least 50^140 tweets.

In other words, the size of the Library of Twitter is at least 7.17 × 10^237 [1] or:

While this number seems impressive, it pales in comparison to the size of the Library of Babel (which is 1.956 × 10^1,834,097). As with the Library of Babel, most of the Library of Twitter’s contents would be nonsensical. But on the upside, the library would also contain all tweets ever written in the past and all theoretically possible tweets to be written in the future. Thus, 50^140 is an upper bound on the information that can be conveyed in 140 characters given a vocabulary of 50 symbols [2]. This first approximate upper bound should be informative for future studies of Twitter, to answer questions such as: How many of the theoretically possible tweets have already been written – or in other words – how much is there left to write before we run out of (sensical) combinatorial options?
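
The numbers above can be checked in a few lines; the only assumptions are the ones already stated (140 characters over 50 symbols for Twitter, 1,312,000 characters over 25 symbols for Babel):

```python
import math

# Library of Twitter: all length-140 sequences over a 50-symbol alphabet.
tweet_count = 50 ** 140
print(len(str(tweet_count)))  # 238 decimal digits, i.e. on the order of 10^237

# Library of Babel: 25^1,312,000 is too large to materialize as an integer
# string, but its order of magnitude follows from logarithms alone.
babel_digits = math.floor(1_312_000 * math.log10(25)) + 1
print(babel_digits)  # 1,834,098 digits, i.e. on the order of 10^1,834,097
```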

I’ll leave it to somebody else to calculate the number of bits and hard drives necessary to store, mine and search the Library of Twitter.

[1] all numbers calculated with WolframAlpha
[2] It is obvious that larger assumed vocabularies would significantly increase the size of the library.

Mechanical Turk has received some bad press recently (this is one example). It has been pointed out that Mechanical Turk can be used to do evil, which got me interested in seeing if and how it can do any good (or at least: something creative). This has led to the post here, and resulted in the following poem – collaboratively produced by independent workers on Mechanical Turk.

In the daily life of a Mechanical Turk

In the daily life of a Mechanical Turk,
Never have I quite finished my work,

For I return and refresh and come back for more
In quest of a yet higher score

Now and then my eyes may tire
If I said they didn’t, I’d be a liar

Though I am spent, it’s hard to stop
Even when I’m ready to drop

My available HITs are waiting for me
Occasionally I’d rather go and watch TV

The structure of the poem is fully algorithmically determined. It has been written collaboratively by a crowd of Mechanical Turkers interacting with each other only through HITs. Before designing the poem algorithm, I did some research on the structure and different types of poems, which led me to acrostics.

“An acrostic (Greek: ákros “top”; stíchos “verse”) is a poem or other form of writing in which the first letter, syllable or word of each line, paragraph or other recurring feature in the text spells out a word or a message.” (wikipedia)

In my poem algorithm, I’ve constrained the first letter of each sentence in the poem, thereby forming an acrostic. As an additional constraint, I required the poem to consist of pairs of sentences that rhyme (similar to a Limerick).

While I determined (i.e. programmed) the structure of the poem, the content was completely produced by Mechanical Turkers. The only input provided was the title, which also acts as the first sentence of the poem. Each rhyming pair of sentences was written by two different Turkers, i.e. the output of one Turker was used as input for another. The total price of the poem was 1.804 USD. The poem was built incrementally; each subsequent Turker had access to the output of all previous Turkers. All tasks were requested at least three times, and selection among alternatives was done by me, although it could easily have been done by Turkers themselves. In total, the contributions of 7 different Turkers were used in the poem above (while many more worked on the HITs).

With that, I initialized the poem algorithm with the acrostic “Infinite Monkey” and the title “In the daily life of a Mechanical Turk” and ran it on Mechanical Turk. The result can be seen above.
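
For illustration, the algorithm can be sketched as follows. Here, ask_turker is a stub standing in for posting a HIT and collecting a worker’s answer; the real runs additionally requested each task several times and selected among the alternatives:

```python
def ask_turker(task, initial_letter):
    """Stub: a real version would post a HIT on Mechanical Turk
    and return the text a worker submits for it."""
    return initial_letter + "... (line written by a worker)"

def acrostic_poem(acrostic, title):
    """Build an acrostic poem of rhyming pairs, one pair per two workers."""
    letters = acrostic.replace(" ", "")
    lines = [title]  # the title doubles as the poem's first line
    # The title consumes the acrostic's first letter; the remaining
    # letters are filled in rhyming pairs, with every worker seeing
    # all lines written so far (incremental construction).
    rest = letters[1:]
    for i in range(0, len(rest) - 1, 2):
        first = ask_turker("continue the poem: " + " / ".join(lines), rest[i])
        second = ask_turker("write a line that rhymes with: " + first, rest[i + 1])
        lines += [first, second]
    return lines
```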

The Infinite Monkey acrostic refers to the Infinite Monkey Theorem:

“The Infinite Monkey Theorem states that a monkey hitting keys at random on a typewriter keyboard for an infinite amount of time will almost surely type a given text, such as the complete works of William Shakespeare.” (Wikipedia)

That’s what we are trying to test here, though in a less statistical and more informed manner. Instead of producing all possible poems, we are interested in producing constrained yet plausible poems efficiently (i.e. in very few iterations).

Which leads to a variation of the Infinite Monkey Theorem that I’d like to propose here:

The Finite Turker Theorem states that a finite (yet potentially large) number of independent writers (here: Mechanical Turkers) will almost surely produce a poem that is creative, enjoyable and mostly indistinguishable from a single author poem.

With the Finite Turker Theorem, and marketplaces such as Mechanical Turk, it might be possible to outsource creative work – such as poem writing – to a large set of workers without much penalty in terms of beauty or enjoyability. Algorithms such as the one above can constrain and influence the resulting poems, giving greater control over the outcome of creative processes (which sounds like an oxymoron).

Because HITs were requested multiple times, there were several rejects that did not make it into the final poem, but which show some of the difficulties as well as the creative potential of programmed poems, including:

…
For I return and refresh and come back for more
Info, my pimp: [I’m] a Dolores Labs penny whore
…

Conclusion: It has been suggested that the primary use of Mechanical Turk is the execution of simple, easily replaceable and often spam-related work. This little experiment suggests that Mechanical Turk can serve richer purposes, by tapping into the creative energy of an underestimated, underutilized but also (currently) underpaid work force.

I am happy to announce that my research group at TU Graz has launched Bulltweetbingo!, a game-with-a-purpose based on Twitter, today. The game is already live and available at http://bingo.tugraz.at. For an introduction to the idea of Buzzword Bingo, please see the following IBM commercial (Youtube video).

IBM Innovation Buzzword Bingo (Youtube)

Rather than playing buzzword bingo while listening to a talk, the idea of Bulltweetbingo! is to play Buzzword Bingo with the people you follow on Twitter. All people you follow on Twitter automatically participate in the game by tweeting. A Bulltweetbingo game terminates (i.e. hits “Bingo!”) if the people you follow on Twitter use a particular combination of the defined buzzwords in their tweets. We intend to use the data provided by each game in our research on analyzing the semantics of short messages on systems such as Twitter or Facebook. Each game provides information about the relevance and topics of tweets for a particular person as well as some information on the topics of tweets that a person expects to receive in the future.
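
The bingo mechanic itself is easy to sketch. The card layout, the substring-matching rule and the n × n card size below are assumptions for illustration, not the actual implementation of the game:

```python
# Illustrative sketch: buzzwords fill an n x n card, incoming tweets mark
# matching cells, and a completed row, column or diagonal means "Bingo!".
def make_card(buzzwords, n=5):
    """Lay out n*n buzzwords as an n x n card (row-major order)."""
    assert len(buzzwords) == n * n
    return [buzzwords[i * n:(i + 1) * n] for i in range(n)]

def mark(card, marked, tweet):
    """Mark every cell whose buzzword occurs in the tweet (case-insensitive)."""
    text = tweet.lower()
    for r, row in enumerate(card):
        for c, word in enumerate(row):
            if word.lower() in text:
                marked.add((r, c))

def has_bingo(marked, n=5):
    """Check all rows, columns and both diagonals for a complete line."""
    rows = any(all((r, c) in marked for c in range(n)) for r in range(n))
    cols = any(all((r, c) in marked for r in range(n)) for c in range(n))
    diag = all((i, i) in marked for i in range(n)) or \
           all((i, n - 1 - i) in marked for i in range(n))
    return rows or cols or diag
```

A game loop would feed each new tweet from the followed users into mark and stop as soon as has_bingo returns True.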

I’m copy’n pasting some more information about the game that we have made available on the game website (about the project).

Bulltweetbingo!
Playing a game of bingo with people you follow on Twitter.

A team of researchers from Graz University of Technology, Austria has developed one of the first games-with-a-purpose that is exclusively based on Twitter.

The goal of this project is to annotate and to better understand the short messages posted to so-called social awareness streams such as Twitter or Facebook. Using this data, the researchers aim to improve the ability of computers to effectively organize and make sense out of the sea of short messages available today.

Dr. Markus Strohmaier, Assistant Professor at the Knowledge Management Institute at Graz University of Technology, Austria explains: “While social awareness streams such as Twitter or Facebook have experienced significant popularity over the last few years, we know little about how to best understand, search and organize the information that is contained in them.”

To tackle this problem, the researchers have developed a game of Buzzword Bingo that users can play with people they follow on Twitter.

“With each game users play on our website, we will collect data that helps us develop more effective algorithms for better understanding this new kind of data” Dr. Markus Strohmaier says, “and in addition to that, we simply hope users would enjoy playing a game of Bingo on Twitter. Each game is unique and exciting in a sense that users generally don’t know what tweets people will publish during the course of a bingo game”.

The researchers have launched the site bulltweetbingo! and ask users to sign up and to play a game of Bingo with the people they follow on Twitter. Twitter users can sign up at http://bingo.tugraz.at.

The game was implemented by one of my talented students, Simon Walk – Make sure to hire him if you need a complex web project to be realized quickly and effectively!

About me

Markus Strohmaier, Full Professor of Web Science at the Faculty of Computer Science at the University of Koblenz-Landau (Germany) and Scientific Director at GESIS – the Leibniz Institute for the Social Sciences (Germany).

My research focuses on the World Wide Web, my interests include social computation, agents, online production systems and crowdsourcing.