A Twitter Analog to PageRank

A few weeks ago, there was a flame war about Twitter authority, and I was all too eager to throw fuel on the pyre. But now that the blogosphere has calmed down a bit, I’d like to propose a ranking measure that I think might work. My apologies if it isn’t original. In fact, if you’ve seen it elsewhere, please point me to it.

Let me start with the assumptions about the model:

Influence(X) = Expected number of people who will read a tweet that X tweets, including all retweets of that tweet. For simplicity, we assume that, if a person reads the same message twice (because of retweets), both readings count.

If X is a member of Followers(Y), then there is a 1/||Following(X)|| probability that X will read a tweet posted by Y, where Following(X) is the set of people that X follows.

If X reads a tweet from Y, there’s a constant probability p that X will retweet it.

This model is obviously simplistic in all three assumptions. But I think it’s a reasonable first cut. In particular, it accounts for the inflation that occurs from people who follow in the hopes of reciprocity. There’s less value in being followed by someone who follows a lot of people, because that person is less likely to read your messages or retweet them.

Of course, there’s room for adding more realism to this model, but I hope it is at least close enough to the truth to be interesting.

From this model, it’s easy to measure someone’s influence recursively, assuming that we know the constant retweet probability p:
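Written out as a reconstruction from the three assumptions above (the shorthand I(X) for Influence(X) is introduced here for compactness), the recursion would read:

```latex
I(X) \;=\; \sum_{Y \,\in\, \mathrm{Followers}(X)} \frac{1 + p \cdot I(Y)}{\lVert \mathrm{Following}(Y) \rVert}
```

Each follower Y reads the tweet with probability 1/||Following(Y)||, which accounts for the 1 in the numerator, and then retweets it with probability p, passing it on to an expected I(Y) further readers.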

The recursion is infinite over a graph with directed cycles, but rapidly converges as high powers of p approach zero. I would think this measure wouldn’t be hard to compute to a reasonable accuracy.
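As a rough illustration of how such a computation might go, here is a minimal sketch that iterates the recursion to a fixed point. It assumes each follower Y of X contributes (1 + p * Influence(Y)) / ||Following(Y)||, which is one reading of the three assumptions above. The function name, the dictionary representation of the graph, the tolerance, and the default value of p are all placeholders chosen for the example, not values from the model.

```python
from collections import defaultdict


def influence_scores(following, p=0.05, tol=1e-9, max_iter=1000):
    """Iterate the influence recursion until the scores stop changing.

    following: dict mapping each user to the set of users they follow
               (hypothetical input format).
    p:         constant retweet probability; 0.05 is an arbitrary placeholder.
    """
    # Invert the graph: followers[x] is the set of users who follow x.
    followers = defaultdict(set)
    users = set(following)
    for y, followed in following.items():
        for x in followed:
            followers[x].add(y)
            users.add(x)
    if not users:
        return {}

    scores = {u: 0.0 for u in users}
    for _ in range(max_iter):
        # Each follower y reads with probability 1/len(following[y]) and
        # then passes the tweet on to p * scores[y] expected further readers.
        updated = {
            x: sum((1.0 + p * scores[y]) / len(following[y]) for y in followers[x])
            for x in users
        }
        delta = max(abs(updated[u] - scores[u]) for u in users)
        scores = updated
        if delta < tol:
            break
    return scores


# Toy graph: alice and bob follow each other; carol follows both of them.
toy = {"alice": {"bob"}, "bob": {"alice"}, "carol": {"alice", "bob"}}
toy_scores = influence_scores(toy, p=0.5)
```

With p = 0.5 the toy graph can be solved by hand: alice and bob each satisfy I = 1.5 + 0.5 * I, so both score 3, while carol, who has no followers, scores 0.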

This measure strikes me as a PageRank for Twitter or any system with similar properties. There’s more room for nuance, but I at least find this approach more plausible than the ones I’ve seen. It also strikes me as hard to game, since it isn’t counting retweets, and it’s hard to add much influence through followers who don’t have any influence themselves.

What do folks think? Has anyone tried this? If not, is there anyone who’d like to try hacking an application to compute it? Either way, please let me know!

77 responses so far

An excellent starting point. Since I’ve been following the talk around Tweet Rank the last couple of weeks the missing component (IMHO) that you address is that “There’s less value in being followed by someone who follows a lot of people, because that person is less likely to read your messages or retweet them.”

As you say it represents the attention scarcity. In social networks more is not necessarily better and often has the opposite effect.

This looks like a standard type of heuristic ranking to me? The magic of PageRank is its use of the stationary distribution (first eigenvector) of the link graph. As an aside, I hate attributing this technique to Google... people have been using the stationary distribution to analyze complex systems long before Larry Page plucked it out of a textbook.

Off the cuff, I’d say the retweet matrix is far too sparse to really measure the dynamics of information sharing on Twitter.

I’d put forth that measuring click-through of shared links (bit.ly, TinyURL, etc.) is a more implicit measure. Of the 50+ people I follow, retweeting is not too common as a percentage of posts… but my bit.ly analytics show that people often click on the links I share.

One could probably add all the various #XXXX and other ‘social command line’ stuff on twitter.

Hmm TwitterRank smells like an area with real meat behind the first bite 😉

I like it. I ran a simulation on a toy social graph and the results are pretty much what you’d expect. For my small example, the score was fairly sensitive to the effect of being followed by people who follow many other people. It seems to converge very quickly too, as you say.

One thing that was interesting was that users who are only followed by people with no followers, will have a score of zero. This properly reflects their lack of retweetability, so maybe you should call it RetweetRank.

Neal, you’re right that it’s sloppy for me to use “PageRank” as shorthand for stationary distribution on the link graph. My apologies.

I should also note that I intend “retweet” in a general sense. I don’t see much of a practical difference between citing a post, retweeting it, or in some cases even replying to it in a way that draws attention to it. So I kept the model simple, but I intend the concept generally.

Jason, I’m psyched that you’re already looking at this empirically! The zero-follower case is a nice reality check. And, just to clarify the sensitivity you observed, do you mean that a follower who follows many other people adds almost nothing to a person’s rank?

Correct, they add very little to your rank. Sorry for the lack of clarity.

Click stats might be a great thing to add to this somehow, but I don’t see how you could possibly implement that. I’m not sure every (or even most) url shorteners publish those stats. If you implement that unevenly, it will certainly bias your results. And then occasionally twitter doesn’t auto-shorten, and you’d have no way to get to those unless you could harvest the clicks from all the ten million twitter clients and proxies out there. And then there’s URLs that aren’t turned into links.. 😛

Maybe instead, just track the number of links published. Someone who publishes too many would have to be penalized, since followers will only click on so many.

I do like that my model can’t be gamed by clicking or even by sending messages. Perhaps that’s disguising a bug as a feature–it models expected influence rather than trying to measure it empirically. But I see that as similar to PageRank, which uses a stationary distribution of the link graph rather than the click stream. Of course, it wouldn’t be a bad idea to validate the model against reality. And there’s still the question of how to pick the retweet probability p.

Sorry, this comment is probably a bit orthogonal to the type of discussion that you’re seeking here, but… I have to ask.. why do we need a Twitter analogy to PageRank?

Isn’t one of the problems of PageRank that it emphasizes authority and influence a bit too much, to the detriment of exploratory search? It’s basically a “find me what’s popular” mechanism, rather than a “find me something interesting that I might not have found in any other way” mechanism.

Why seek to duplicate that sort of bias in the Twitter world? Wouldn’t it be more interesting to do exploratory Twitter search, and come across less influential, less connected voices, but perhaps who have something more interesting to say?

Well, to start off, what I’m trying to measure here is people (or AIs that have Twitter accounts), not messages. And I’m trying to measure influence, which is a bit more subtle than popularity. The question I’m trying to answer is: if X says something, what will be the expected impact?

What do I care about measuring influence? Here are a couple of reasons:

– I’d like to be able to measure my own influence, since one of my goals is to increase the leverage associated with my ideas. If I were a company, the same would apply to measuring brand capital. In my own case, I’d like to be able to check my balance of reputation capital.

– I’d like to know who the influencers are so I can monitor them and in some cases court them. Of course I’ll have other criteria about the people and their areas of expertise. But the ability to explore and the ability to sort by influence are complementary. For example, I’d love to know who are the most influential people tweeting about information seeking.

For me, these are the practical applications. I’m sure that others would find it interesting for different reasons, perhaps even simply as an interesting research problem in social networks.

Quick comments.
“a follower who follows many other people adds almost nothing to a person’s rank”
You might want to take into account that Twitter clients (like TweetDeck and others) might allow a different model: following for following’s sake versus following and engaging (by creating filter views). I’m not sure how prevalent that is, but it is something people have also been actively requesting directly from Twitter as a feature. Just a thought.

I can imagine an extension to the model where, even though Y follows n people, Y doesn’t follow all of them equally. In that case, the inside of the sum shouldn’t be weighted uniformly by 1/||Following(Y)||; rather, the weights should reflect how Y allocates attention among the members of Following(Y). Of course, that can only be done in practice if Y is able and willing to publicize this allocation.
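One way this extension might be written down, as a sketch only: the weight symbol w below is introduced here and does not appear in the original model, and I(X) is shorthand for Influence(X).

```latex
I(X) \;=\; \sum_{Y \,\in\, \mathrm{Followers}(X)} w_{Y \to X}\,\bigl(1 + p \cdot I(Y)\bigr),
\qquad
\sum_{Z \,\in\, \mathrm{Following}(Y)} w_{Y \to Z} \;=\; 1
```

Setting every weight to 1/||Following(Y)|| recovers the uniform case, so the simple model is the special case in which Y spreads attention evenly.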

I am one of those people Daniel alludes to who use filters, though it is a practice some might disagree with. While I follow 250+ people, I really only keep up with what 50-75 are tweeting. The others I do occasionally read but with considerably lower frequency. Twalala is great for this purpose, btw..

Indeed, I’ve noticed that some people who follow a thousand people nonetheless seem to notice my tweets with frequency that belies my simple proposed model. The modification I proposed in response to Daniela should handle this case. E.g., for Jason, those 50-75 people might each get 1% of the weight, with the rest divided among the remainder. I don’t know how you’d compute the weights in practice, but I think it’s at least the right model in theory.

One approximation might be to count the number of people you reference or reply to. That will miss a lot of where your attention is going and it may discourage interaction since people will be penalized for communicating. So forget I said it.

I think this is a reasonable first pass but assuming p is constant across posters is limiting and, as the nature of posters changes, will skew the results. Suppose a Hollywood A-list celebrity starts twittering; my intuition is that s/he would get a lot of followers but fewer retweets. If that holds true, someone ‘twitter famous’ like timoreilly might have significantly fewer followers but significantly more impact.

The first-order retweet probability for any given poster can be empirically determined, though it changes the problem from O(number of links in the graph) to O(number of links + number of posters * number of messages per poster).

I readily concede that assuming p is constant across posters is limiting. But I am stumped on how best to remedy the problem. The obvious approach of inferring p from behavior seems to invite gaming, assuming that such a measure was adopted and people cared about their influence scores. I feel I’ve addressed the follower inflation problem, but I don’t know how to address the retweet inflation problem, so I’ve avoided considering actual behavior in the model.

But I do wonder about your Hollywood vs. “Twitter famous” celebrity example. I’d be curious if there are differences among their followers that would show up in their follower subgraphs. Is it wishful thinking to imagine that the average person following @timoreilly is more influential than the average person following @britneyspears?

>> Is it wishful thinking to imagine that the average person following @timoreilly
>> is more influential than the average person following @britneyspears?

Not what you intended to mean, but I couldn’t help wonder how many pop music fans desperately want Britney’s observations on the power of real-time enterprise? Or for that matter, how many IT executives are purchasing make-up based on Tim O’Reilly’s endorsements?

> But I do wonder about your Hollywood vs. “Twitter famous” celebrity example. I’d be curious if there are differences among their followers that would show up in their follower subgraphs.

There’s going to be a lot less closure in the graphs of celebrities and entities (e.g., @ruwtbot or @zappos), where closure means “triangles” in that user’s 1-neighborhood. The triangles will also have a different spectrum: an organic person has a lot more 2-2-2 triangles (all three follow each other) reinforced by @, RT and fave links.
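To make the closure idea concrete, here is a small sketch that counts the triangles in a user’s 1-neighborhood in which all three people follow each other. It is an illustration, not the commenter’s actual analysis: the function name and the dictionary representation of the follow graph are assumptions, and the @, RT and fave links are ignored.

```python
from itertools import combinations


def mutual_triangles(following, user):
    """Count triangles through `user` in which all three users follow each other.

    following: dict mapping each user to the set of users they follow
               (hypothetical input format).
    """
    def mutual(a, b):
        # True when a follows b and b follows a.
        return b in following.get(a, set()) and a in following.get(b, set())

    # Keep only the neighbors who follow `user` back.
    neighbors = sorted(v for v in following.get(user, set()) if mutual(user, v))
    # A triangle closes when two such neighbors also follow each other.
    return sum(1 for a, b in combinations(neighbors, 2) if mutual(a, b))


# Toy graph: a, b and c all follow each other; a also follows d, who follows no one.
graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": set()}
triangles_for_a = mutual_triangles(graph, "a")
```

In the toy graph, only b and c follow a back and they also follow each other, so a sits in exactly one fully mutual triangle.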

Tim O’Reilly’s cluster will probably have a stronger topical signature than Britney’s (a “word cloud” of his 2-neighborhood will have a sharper distribution than hers). He also probably has many more conversation threads (reply-reply-reply), although that’s so messy to measure I’m not looking at it.

But I guarantee you THE REAL SHAQ is more influential by far.

… Incidentally if anyone’s interested in collaborating on some factor analysis / bayesian classification with this please email me flip at infochimps org

Interesting. It’s certainly intuitive (almost tautological) that the graphs of followers of mass-appeal celebrities will be sparser than those of people who have a more targeted appeal.

But that doesn’t answer the question of whose followers themselves follow more people. Maybe there’s no correlation.

In any case, I’m excited about your analysis (sorry the comment was initially swallowed by Akismet’s spam filter). It also seems to capture attention scarcity, but based on how the network is dynamically used rather than how it is statically configured.

Ouch .. I think my head hurts after reading through this very interesting conversation but I have a question (or two).

If I understand this correctly, you are trying to come up with a ranking system for a person’s influence within the TwitterSphere. What I am curious about is whether an individual’s ‘rank’ of influence would be more than that of a company or even a celebrity?

If so how would you be able to differentiate between accounts that are actual valid people using Twitter for more than just self-promotion and from companies announcing products in contrast to companies who actually use Twitter to interact on a full time basis.

Would a company utilizing Twitter as a ‘help desk’ interaction have more influence than a celebrity or say someone like myself or yourself?

It would be great to have meta-data about users to know if they are people vs. companies, computer scientists vs. basketball players, etc. But I see this sort of information as orthogonal to their influence. I don’t think it is worthwhile to compare Tim O’Reilly with Shaquille O’Neal: it is more likely a question of whether people care more about Web 2.0 or Basketball 7’1″.

Granted, people might not choose to describe who they are accurately. But hopefully the fakers would get culled out by their low influence once people discovered they were fakers–open fakers like Fake Steve Jobs being the exception.

Thanks for the link! I’ve seen something similar before, but not as a GreaseMonkey script. It would be interesting to go one step further than looking at common followers, e.g., using some sort of feature reduction based on sets of people that are often followed as a group or even analyzing the content of their tweets.

[…] with modeling authority and influence in social networks, a problem in which I take a deep personal interest. Another inferred attributes of social network users based on those of other users in their […]

[…] To me, it’s Google’s responsibility to intervene. The company that expresses algorithmic prowess on so many complex patterns should have no trouble in doing so with blog engagement. The raw numbers displayed in feedburner chicklets are no more reliable than the 1990s hit counters which allowed unscrupulous webmasters to “start off” with high numbers in order to mislead the readers that a site was popular. Perhaps we need a pagerank for Twitter/Friendfeed followers. […]

[…] What criteria should we pursue to help people find useful information in microblog collections? Surely time plays a role here. Topical relevance is a likely suspect, as are various types of reputation factors such as TunkRank (and here). […]

[…] you might have, and there are even some people thinking seriously about how to measure that kind of influence. Another question that David raised was if there were wiser investments to be made. A figure that […]

[…] spamming techniques. Then, there was another approach by Daniel Tunkelang, first posted on his blog thenoisychannel, which got implemented as TunkRank. Unfortunately, at the time of this writing, I could not test […]

Hi Dan. I know this is quite a belated follow-up to your post, but I was wondering: as I(X) is the expected number of people who read something by X, how does the method cope with cases where, after convergence, I(X) is greater than N, where N is the number of nodes in the network? In your implementation, do you impose a threshold N for each element of I per iteration?
Thanks and kudos for Tunkrank!

Yannis, it doesn’t try to. While it’s theoretically possible for the model to exceed this threshold, it won’t happen with realistic parameter values. Indeed, the model doesn’t even try to prevent cycles, nor does it avoid double-counting if a person reads the same message twice (because of retweets).
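A back-of-the-envelope check on the size of the scores, assuming the recursion takes the form in which each follower Y of X contributes (1 + p * I(Y)) / ||Following(Y)||, and that the scores have converged to finite non-negative values. Summing over all N users, each Y that follows at least one person appears once for every account it follows, so the denominators cancel:

```latex
\sum_{X} I(X)
\;=\; \sum_{Y \,:\, \mathrm{Following}(Y) \neq \emptyset} \bigl(1 + p \cdot I(Y)\bigr)
\;\le\; N + p \sum_{Y} I(Y)
\quad\Longrightarrow\quad
\sum_{X} I(X) \;\le\; \frac{N}{1 - p}
```

Under these assumptions the average score is at most 1/(1 - p). An individual score is bounded only by the total, N/(1 - p), which fits the point that exceeding N is possible in theory.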

Lots of room to make the model more complex / realistic. But I figured it was best to keep it simple and understandable, at least to start off.

I find this to be a very interesting experiment, though I do not tweet myself. Apologies for the delayed comment. My question is whether “influence” is really captured by simply reading something (or calculating readers). For someone who read and retweeted, it seems like your tweet may have “influenced” them a bit more than a reader. Perhaps calculating readership is difficult enough that there is no benefit in going beyond that. Would like to hear your thoughts.