We know that most ReTweets contain a link, but there are hundreds of different URL shortening services available to help you save space with that link. I analyzed my database of over 30 million ReTweets and compared them to over 2 million random Tweets to find which shorteners are the most (and least) ReTweetable.

I calculated how much more or less often each URL shortening service appeared in ReTweets than it did in normal Tweets and presented this value as a percentage. For instance, in my data 9.28% more ReTweets than random Tweets used bit.ly. I took into account the fact that ReTweets tend to contain more links than average Tweets and normalized the occurrence values.
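As a rough illustration of that calculation, here is a minimal sketch. It assumes the normalization means comparing each shortener's share of all links within each sample, which may not be exactly how the numbers above were produced; the function name and the counts are made up for the example.

```python
def relative_retweetability(rt_count, rt_links, rand_count, rand_links):
    """Percent difference between a shortener's share of links in ReTweets
    and its share of links in random Tweets.

    Dividing by the number of links in each sample (an assumed normalization)
    keeps the result from being inflated just because ReTweets contain more
    links overall.
    """
    rt_share = rt_count / rt_links
    rand_share = rand_count / rand_links
    return (rt_share - rand_share) / rand_share * 100.0


# Hypothetical counts, for illustration only (not taken from the real data).
example = relative_retweetability(rt_count=600, rt_links=1000,
                                  rand_count=500, rand_links=1000)
print(f"{example:+.2f}% relative to random Tweets")
```

A positive value means the shortener shows up more often in ReTweets than its share of random Tweets would predict; a negative value means it shows up less often.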

The short, post-Twitter shorteners bit.ly, ow.ly and is.gd were all more ReTweetable than the older, longer tinyurl.

I’ve looked at the 20 words and phrases that tend to get the most ReTweets, but what about the flip-side of that coin? What about the words that are least likely to get you ReTweets?

I used my database of over 30 million ReTweets, and compared it to a sample of over 2 million random Tweets and found the common words that occurred far more often in non-ReTweets. The percentages below represent the relative Un-ReTweetability of the 20 least ReTweetable words.

Some Highlights from the List

There are a number of “-ing” verbs, including “going,” “watching” and “listening,” which reinforces my understanding that answers to the “What are you doing?” question don’t get very many ReTweets.

The presence of “sleep,” “bed,” “night,” and “tired” indicates that people often Tweet “goodnight” style messages, but generally don’t ReTweet them.

The relatively informal nature of many of the words on the list, including “lol,” “gonna,” and “hey,” shows that simple or slang conversation is not ReTweetable.

The lesson learned here is that if you’re trying to get more ReTweets, don’t just engage in idle chit-chat or Tweet about mundane activities.

140 characters doesn’t leave much room for extraneous letters, numbers or symbols, so you might think that punctuation would be sparse in Tweets. But I compared a random sample of over 1 million “normal” Tweets to a sample of over 10 million ReTweets and found that 85.86% of Tweets contain some form of punctuation, and an overwhelming 97.55% of ReTweets do as well.

Of course, the prevailing ReTweet format includes a colon to better display the original Tweet, but even when ignoring this form of punctuation, ReTweets still contain more punctuation than non-ReTweets (93.42% to 83.78%).
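A comparison like this can be sketched in a few lines. The set of marks below is an assumption (the post does not list which characters were counted), and it deliberately leaves out Twitter syntax such as @ and #.

```python
# Assumed set of punctuation marks; the exact set used for the numbers above
# is not given, so this is a guess for illustration.
MARKS = set(".,;:!?-'\"()")


def punctuation_share(tweets, ignore=""):
    """Return the percentage of tweets containing at least one mark,
    optionally ignoring some marks (for example the colon)."""
    marks = MARKS - set(ignore)
    if not tweets:
        return 0.0
    hits = sum(1 for tweet in tweets if any(ch in marks for ch in tweet))
    return hits / len(tweets) * 100.0


# Tiny made-up sample, just to show the call.
sample = ["RT @user: great post", "going to bed", "Wow, nice"]
print(punctuation_share(sample))              # all marks counted
print(punctuation_share(sample, ignore=":"))  # colon ignored
```

Running the same function over a ReTweet sample and a random sample, with and without the colon, gives the two pairs of percentages being compared.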

I then analyzed the frequency of specific types of punctuation and found that hyphens, periods and colons are the most ReTweetable punctuation, occurring far more commonly in ReTweets than in regular Tweets, while the rarest mark, the semicolon, is the only unReTweetable punctuation mark.

I study ReTweets because I believe they offer an unprecedented window into how people spread ideas. And while Twitter may be threatening to mangle them, I think they’re still the most important innovation to come from microblogging yet.

I gave this presentation, or a version of it, at a few conferences this summer and since then I’ve done a bunch more analysis. So I added all my new data to the slideshow, included a video interview with me after Social Media Camp and uploaded it to SlideShare for your viewing pleasure.

If you’ve read this blog, you know that ReTweets are one of my favorite topics. For a ton of reasons I think that they’re not only one of the most important developments to come from Twitter, but from social media in general.

How ReTweets Work Now

As you probably know, ReTweets were designed by the community, for the community, and currently look like this:

RT: @username Really Awesome Tweet

Granted, the “RT @username” prefix takes up some space, but that minor annoyance is more than made up for by the benefit users get from a Tweet clearly labeled as being ReTweeted from @username originally. When you see a ReTweet in your timeline it has the avatar of the person who did the ReTweeting, so you know who spread it to you and from whom they got it.

ReTweeters could add their own commentary (and lend social proof with their name and avatar), Twitter client developers could add one-click ReTweet functions and analysts (like me and Microsoft) could gather ReTweets and study them.

How Twitter Aims to Break ReTweets

In a stunningly disappointing move, Twitter has threatened to completely eviscerate most of the value out of ReTweets by “formalizing” a feeble version of a format that was already well understood and functional for all users involved.

Twitter plans to add a button to the Twitter web client that says “Retweet” that will allow you to send the same exact Tweet, with no editing, to your followers. Your followers will see the original poster’s avatar and name, even if they’re not following them, and the only indication they’ll see that it is a ReTweet will be a small line of light gray text underneath it.

I follow people because I trust and enjoy their point of view. I don’t necessarily trust the POV of people I don’t follow, so using the original poster’s picture and name in my timeline destroys any social proof the ReTweeter may have lent the Tweet.

Most active Twitter users use third party desktop and mobile clients to Tweet, and there is no way of telling how those developers will indicate ReTweets in this new format just yet. The Tweets will not contain the “RT @username” prefix. There will no longer be a commonly understood format. Scanning my friends’ timeline is how I use Twitter, and I suspect how many of you do too. The new ReTweet format will make that much harder.

If more than one of my followers ReTweet the same Tweet, the screenshots seem to indicate that the ReTweet won’t appear more than once in my timeline; it will simply be updated to say “ReTweeted by @user1 and @user2…” The problem here is that if @user1 ReTweets at 1pm and @user2 does it at 2pm, that Tweet will have been buried in my timeline and I won’t see it again.

The new version of ReTweets will come with a few new API calls. They’re calling your friends’ timeline by a new name so they can deprecate the old one (which worked fine). They’re going to allow you to see ReTweets you’ve posted (not sure why), and ReTweets your followers have posted (which you could already do). The only kind of cool API call is the one that will allow you to see the 20 most recent updates that are ReTweets of your Tweets; problem is, you can only get yours. You can’t see the most recent ReTweets of other people’s content, and you can’t check for ReTweets of a specific Tweet.

How ReTweets Should be Adopted

The idea of a button, next to the reply button, is great; that absolutely should be implemented. But clicking that button should do the same thing that TweetDeck does: copy the Tweet into the text area, add “RT @username” and let me edit before sending.

An API call should be added so that 3rd party clients could signal to Twitter that a Tweet is a ReTweet of a specific update. The new API calls are otherwise fine, but there should also be a call to get all ReTweets of a specific Tweet.

My advice? Use the HashTag: #SaveReTweets to start making some noise about this, and keep using the old “RT @username” format.

Not all viral content sharing on Twitter happens in ReTweets, so when I designed my viral Tweeting survey, I included three similar questions:

1) What types of content do you ReTweet?
2) What types of content do you Tweet about?
3) What types of content do you Tweet links to?

They’re all pretty similar, but there are obvious differences in each. For instance, people are OK Tweeting about their own opinions, but are unlikely to Tweet links to or ReTweet other people’s opinions.

But here’s where the data gets really interesting: the graph below shows the differences in answers by the gender of the respondent. 10.5% more women than men say they ReTweet entertainment content, while 32.2% more men Tweet about their opinions.

I’ve done a bunch of research into the characteristics of ReTweets in an effort to understand what makes them viral. ReTweets are the first entirely observable and analyzable viral content spreading mechanism in the history of mankind and as such they offer an unparalleled window into what makes humans spread ideas.

Over the past few weeks I’ve begun delving into much deeper analysis than I have in the past with more advanced tools and a much larger dataset. At present I have a database of over 10 million ReTweets and I’ve gained access to Twitter’s new streaming API which allows me to build a very large (10 million and growing) random sample of all tweets (not just ReTweets).

In re-visiting a data point that I looked at 6 months ago (this time with a larger data set), I found that in a random sample of normal (non-ReTweet) Tweets, 18.96% contained a link, whereas 3 times that many ReTweets (56.69%) included a link.

Then I tested the assumption that simplicity is a vital component of ReTweets (as it has been observed in other viral-content types) and I found that random Tweets have 1.58 syllables per word on average, while ReTweets had an average of 1.62 syllables per word. Longer, higher syllable-count words are typically more complex, indicating that ReTweets may be more complex than their less viral counterparts.
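For readers who want to try a syllables-per-word measurement themselves, here is a minimal sketch. It uses a common vowel-group heuristic, which is an assumption and only approximates real syllable counts; the post does not say which counting method produced the figures above.

```python
import re


def count_syllables(word):
    """Approximate syllables by counting runs of vowels (heuristic)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing silent "e" as non-syllabic, but keep endings like "-le".
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def syllables_per_word(text):
    """Average approximate syllable count over the words in a text."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    return sum(count_syllables(w) for w in words) / len(words)


print(syllables_per_word("make the tweet simple"))
```

Heuristics like this miscount some words, so small differences between two samples are only meaningful when both are measured the same way.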

Comparing two different types of reading grade level analysis revealed that ReTweets, in general, are less “readable” and require a higher level of education to understand. A Flesch-Kincaid test gave ReTweets a reading grade level of 6.47 years of education, while random Tweets only required 6.04 years. The similar SMOG test (Simple Measure of Gobbledygook) indicated that ReTweets required 6.13 years of schooling, with random Tweets only needing 5.88 years.
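Both tests are simple formulas over word, sentence and syllable counts. The sketch below uses the standard published Flesch-Kincaid grade level and SMOG formulas; how sentences and syllables were counted in Tweets for the numbers above is not stated, so the inputs here are left as plain counts.

```python
import math


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def smog_grade(polysyllables, sentences):
    """SMOG grade, where polysyllables is the number of words
    with three or more syllables."""
    return 1.0430 * math.sqrt(polysyllables * 30 / sentences) + 3.1291


# Hypothetical counts, for illustration only.
print(round(flesch_kincaid_grade(words=100, sentences=10, syllables=150), 2))
print(round(smog_grade(polysyllables=9, sentences=30), 4))
```

Both formulas reward longer words, and Flesch-Kincaid also rewards longer sentences, with a higher grade, which is why a higher syllables-per-word average pushes the grade level up.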

Another characteristic commonly found in viral content is novelty; that is, the “newness” of the ideas and information presented. I created a measure of novelty by counting how many other times each word in my sample sets occurred. In the random Tweet sample, each word was found an average of 89.19 other times, while in the ReTweet sample each word was only found 16.37 other times. This shows us that while simplicity may not be very important to ReTweetability, novelty certainly is.
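A minimal sketch of that kind of novelty measure follows. It assumes simple whitespace tokenization and averaging over every word occurrence; the actual tokenization and averaging behind the numbers above are not described.

```python
from collections import Counter


def mean_other_occurrences(tweets):
    """For each word occurrence, count how many other times the same word
    appears in the sample, then average over all occurrences."""
    counts = Counter(word for tweet in tweets for word in tweet.lower().split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # A word seen c times contributes c occurrences, each with c - 1 others.
    return sum(c * (c - 1) for c in counts.values()) / total


# Tiny made-up sample: "a" appears twice, "b" and "c" once each.
print(mean_other_occurrences(["a b", "a c"]))
```

A lower average means the words in a sample are repeated less often. A raw count like this grows with sample size, so it is most meaningful when the two samples being compared are the same size.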

Part of speech (POS) tagging is an analysis technique in which an algorithm is used to label each word in a piece of content as a specific part of speech (noun, verb, adjective, etc.). The graph below shows what percentages of words in each sample were labeled as a specific part-of-speech. It lists only the most interesting parts from the much larger list of POS tags.

Interesting points from this data include the noun and 3rd-person heaviness of ReTweets, indicating a subject matter and headline type nature.

First up is the more “Freudian” Regressive Imagery Dictionary (RID). This coding scheme is designed to measure the amount and type of three categories of content: primordial (the unconscious way you think, like in dreams); conceptual (logical and rational thought); and emotional.

Significantly more primordial content has been found in the poetry of poets who exhibit signs of psychopathology than in that of poets who exhibit no such signs (Martindale, 1975).

The first RID graph shows that ReTweets contain less primordial and emotional content than random Tweets and more conceptual content.

Looking at specific RID codes, we see that social and instrumental (constructive words like build and create) behavior are ReTweetable, while abstract thought and sensation-based words are not.

The last analysis I performed used LIWC (pronounced “Luke”). This is a lexicon similar to RID, but based in more reviewed and accepted research and refined over 15 years. LIWC measures the cognitive and emotional properties of a person based on the words they use.

In order to provide an efficient and effective method for studying the various emotional, cognitive, and structural components present in individuals’ verbal and written speech samples, we originally developed a text analysis application called Linguistic Inquiry and Word Count, or LIWC.

LIWC analysis shows that Tweets about work, religion, money and media/celebrities are more ReTweetable than Tweets about negative emotions, sensations, swear words and self-reference.
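Dictionary-based coding schemes like RID and LIWC boil down to counting how many words in a text fall into each category. Here is a minimal sketch of that idea. The tiny lexicon is entirely made up for illustration; it is not the real RID or LIWC dictionary, and the real systems have their own matching rules.

```python
import re
from collections import Counter

# Hypothetical toy lexicon, NOT the actual RID or LIWC word lists.
TOY_LEXICON = {
    "work": {"job", "office", "meeting"},
    "money": {"cash", "price", "pay"},
    "self": {"i", "me", "my"},
}


def category_percentages(text, lexicon):
    """Return, per category, the percentage of words in the text
    that appear in that category's word list."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {category: 0.0 for category in lexicon}
    hits = Counter()
    for word in words:
        for category, vocabulary in lexicon.items():
            if word in vocabulary:
                hits[category] += 1
    return {category: hits[category] / len(words) * 100.0
            for category in lexicon}


print(category_percentages("I pay my office rent", TOY_LEXICON))
```

Running the same counts over a ReTweet sample and a random sample, then comparing the percentages category by category, is the general shape of the comparison described above.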

I think the most powerful potential feature of a system like TweetPsych is its ability to match people based on their cognitive processes, so I’ve added two features to the still beta TweetPsych.

People That Think Like You

When you generate a profile for yourself or someone else, TweetPsych will also show you a list of 5 users who it believes share similar psychological characteristics. This matching is not done topically, therefore the other users you’re presented with may not Tweet about the same things as you.

These users come only from the list of users that the system has analyzed so far, so the results will get better as it analyzes more accounts. Starting this week, I am automatically profiling accounts, beginning with a few prioritized lists (including the most ReTweeted and most followed users), to help build a large dataset for comparison.
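To give a feel for how non-topical matching could work in general, here is a minimal sketch that ranks stored profiles by cosine similarity to a target profile. This is an assumption for illustration only: the similarity measure, the profile format and the names below are hypothetical and are not a description of how TweetPsych actually does its matching.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two profiles stored as
    {category: score} dictionaries."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_profiles(target, profiles, k=5):
    """Return the names of the k stored profiles most similar to target."""
    ranked = sorted(profiles.items(),
                    key=lambda item: cosine_similarity(target, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical profiles, for illustration only.
stored = {
    "alice": {"work": 2.0},
    "bob": {"money": 3.0},
    "carol": {"work": 1.0, "money": 1.0},
}
print(closest_profiles({"work": 1.0, "money": 0.0}, stored, k=2))
```

Because the comparison runs on category scores rather than on the words themselves, two accounts can match closely without sharing any topics.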

Site Profiling

The second feature I added this weekend is site profiling. When you enter a URL TweetPsych will create a psychological profile of the content on that page and match it against its database of user profiles, returning the 50 closest matches.

Again, this matching is not done on a topical basis, meaning the users presented might not tweet about the same subjects the page is about. The goal is to help you find users that may be mentally aligned with the psycho-graphic profile of the web page you provided.

And just to reiterate, TweetPsych is still beta stuff and I’m aware there are issues, specifically around explaining and presenting the features in a more understandable way, but my first priorities were making the system stable under the huge traffic load (and my host MediaTemple has been awesome helping me) and fleshing out the potential power of the technology. I’m very open to new feature suggestions as I continue working on TweetPsych.

I am contemplating the possibility of releasing an API but I’m still thinking about how to handle the possibly high server resource demands. What features would you like to see in an API?

This weekend I was playing with a bunch of different linguistic analysis methods to better understand ReTweets, and while I uncovered a ton of cool new data which I’ll be sharing a little later this week, I also came upon an idea I think is pretty awesome, probably groundbreaking, and definitely worth Twittering about.

Communication is a window into a person’s mind, and the way a person talks can tell you a lot about how they think. Linguists have developed two methods for decoding the written word into a meaningful profile of a person’s cognitive processes.

One method is called the Regressive Imagery Dictionary (RID). This coding scheme is designed to measure the amount and type of three categories of content: primordial (the unconscious way you think, like in dreams), conceptual (logical and rational thought) and emotional.

Significantly more primordial content has been found in the poetry of poets who exhibit signs of psychopathology than in that of poets who exhibit no such signs (Martindale, 1975). There is also more primordial content in the fantasy stories of creative as opposed to uncreative subjects (Martindale & Dailey, 1996), in psychoanalytic sessions marked by therapeutic “work” as opposed to those marked by resistance and defensiveness (Reynes, Martindale & Dahl, 1984), and in sentences containing verbal tics as opposed to asymptomatic sentences (Martindale, 1977). A cross-cultural study of folktales from forty-five preliterate societies revealed, as predicted from the “primitive mentality” hypothesis of Lévy-Bruhl (1910) and Werner (1948), that amount of primary process content in folktales is negatively related to the degree of sociocultural complexity of the societies that produced them (Martindale, 1976). Martindale and Fischer (1977) found that psilocybin (a drug that has about the same effect as LSD) increases the amount of primordial content in written stories. Marijuana has a similar effect (West et al., 1983). Research has also revealed more primordial content in verbal productions of younger children as compared with older children (West, Martindale, & Sutton-Smith, 1985) and of schizophrenic subjects as compared with control subjects (West & Martindale, 1988).

The other method is Linguistic Inquiry and Word Count (LIWC). In development for over 15 years, the LIWC measures the cognitive and emotional properties of a person based on the words they use.

In order to provide an efficient and effective method for studying the various emotional, cognitive, and structural components present in individuals’ verbal and written speech samples, we originally developed a text analysis application called Linguistic Inquiry and Word Count, or LIWC.

I’ve combined these two systems with a Porter stemming algorithm and my own Twitter analysis infrastructure to create TweetPsych.com.

TweetPsych uses the LIWC and RID to build a psychological profile of a person based on the content of their Tweets. It compares the content of a user’s Tweets to a baseline reading I’ve built by analyzing an ever-expanding group of over 1.5 million random Tweets, then highlights areas where the user stands out.
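The baseline comparison can be pictured with a small sketch like the one below. It assumes profiles are per-category percentages and flags a category when the user's rate differs from the baseline by some ratio; the threshold and the function name are hypothetical, and the real rule TweetPsych uses is not described here.

```python
def standout_categories(user, baseline, threshold=1.5):
    """Return {category: ratio} for categories where the user's rate is at
    least `threshold` times the baseline, or at most 1/threshold of it."""
    standouts = {}
    for category, base_rate in baseline.items():
        if base_rate <= 0:
            continue  # no meaningful ratio without a positive baseline
        ratio = user.get(category, 0.0) / base_rate
        if ratio >= threshold or ratio <= 1 / threshold:
            standouts[category] = ratio
    return standouts


# Hypothetical category percentages, for illustration only.
baseline = {"work": 2.0, "money": 1.0, "self": 5.0}
user = {"work": 4.0, "money": 1.1, "self": 2.0}
print(standout_categories(user, baseline))
```

A ratio above 1 means the user leans on that category more than the baseline does, and a ratio below 1 means less.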

The service analyzes your last 1000 Tweets; as such, it works best on users who have posted more than 1000 updates. It is also better suited for running analyses on accounts that are operated by a single user and use Twitter in a conversational manner, rather than simply a content distribution platform. It takes a few moments to analyze an account the first time, but subsequent views of a profile will load faster.

I’ve tried to translate the codes that come from the two linguistic systems into more meaningful explanations, but I may have missed a few. I will continue to expand these definitions, while also refining the system and algorithm to better analyze Twitter-specific content.

I think the possibilities of a system like this are enormous, from matching like-minded users to identifying users that exhibit certain useful or desirable traits. I’d love to hear your thoughts on where this could be improved or where I could take this technology next.

The internet has accelerated social communications and memetics more than it has fundamentally changed them (though it has altered some of the selection pressures on individual memes, namely around memory retention and expression). It has also, through mechanisms like Twitter and specifically ReTweets, made the exchange of cultural units much more open to quantitative analysis and testing. Through the keyhole of ReTweeting I believe it is possible to get a glimpse of the answers to the larger question of why and how humans spread information in a way that was never before possible.

I’ve found myself telling the Snow Crash story a lot recently to explain what I see as the true power of what I call viral marketing science. Here are two versions of it.

Being that I come at this opportunity from a marketing background, I look to this analysis to build a framework for repeatably creating contagious memes, so this presentation from PubCon Austin aims to do just that for ReTweets.

Dan Zarrella

Dan Zarrella is the award-winning social media scientist and author of four books: “The Science of Marketing,” “Zarrella’s Hierarchy of Contagiousness,” “The Social Media Marketing Book” and “The Facebook Marketing Book.”