You may want to think about the problem some more. Retweeting is more of a pattern-matching problem, since there are no hard and fast rules for a retweet. Consequently, it is likely that only part of the original tweet will be available, so hashing won't work... See the answer below on using a text indexer.
–
jottos, May 2 '09 at 19:35

@jottos For this purpose I'd assume all tweets beginning with RT are retweets, and that covers 90% of the right ones. Practically sufficient. I am going to have to "clean" the tweet of all @words, RTs, etc., so hashing could be possible.
–
Lakshman Prasad, May 2 '09 at 20:31

7 Answers

You are trying to hash a string, right? Built-in types can be hashed right away: just call hash("some string") and you get an int. It's the same function Python uses for dictionaries, so it is probably the best choice.
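A minimal sketch of that suggestion, with one caveat: since Python 3.3, str hashes are salted per process by default (controlled by PYTHONHASHSEED), so the value from hash() is consistent within one run but generally not across runs. That matters if the hash is going to be stored in a database. The stable_hash helper below is a hypothetical name, and SHA-256 is just one reasonable choice of stable digest.

```python
import hashlib

tweet = "RT @someone: some string"

# Built-in hash: the same function dicts use. Consistent within one process,
# but salted per process for str in Python 3.3+ unless PYTHONHASHSEED is set.
in_process = hash(tweet)

def stable_hash(text):
    """Hypothetical helper: a digest that is identical on every run and machine."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

digest = stable_hash(tweet)
```

If the value only ever lives in memory (for example as a dict key), the built-in hash() is fine; for anything persisted, a hashlib digest is probably the safer option.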

Doesn't that produce a 32-bit value, though? I think this application needs more collision resistance than that, since he's planning to discard the message and rely only on the hash. With 32-bit values you'd expect a collision within 65k tweets, which is like half an hour of Stephen Fry.
–
Steve Jessop, May 3 '09 at 22:38

Well, it would be computationally expensive to compare a given 140-character string with thousands of such strings. I figured querying the DB with count(hash) is simpler and more efficient. Correct me if I am wrong.
–
Lakshman Prasad, May 2 '09 at 20:20
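The count(hash) idea from the comment above could look roughly like this. It is only a sketch: it uses an in-memory SQLite database from the standard library, and the table name, column name, and helper functions are all made up for illustration.

```python
import hashlib
import sqlite3

def tweet_hash(text):
    # Stable digest, so the stored value is the same across runs.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (hash TEXT NOT NULL)")
# The index is what makes the COUNT lookup cheap as the table grows.
conn.execute("CREATE INDEX idx_tweets_hash ON tweets (hash)")

def record(text):
    conn.execute("INSERT INTO tweets (hash) VALUES (?)", (tweet_hash(text),))

def times_seen(text):
    row = conn.execute(
        "SELECT COUNT(*) FROM tweets WHERE hash = ?", (tweet_hash(text),)
    ).fetchone()
    return row[0]

record("hello world")
record("hello world")
record("something else")
```

Here times_seen("hello world") counts the two identical rows. This only finds exact duplicates of whatever text is hashed, so any cleaning of the tweet has to happen before tweet_hash is called.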

If you always sort your tweets and use binary search it could be doable. If your database is really huge, use radix search. (Linear run-time, how cool is that?)
–
Georg Schölly, May 3 '09 at 18:47

Retweets are frequently non-identical. A hash would be oblivious to this unless you run some kind of "normalizer" first.
–
pchap10k, May 4 '09 at 11:50

I am not familiar with Python (sorry, Ruby guy typing here); however, you could try a few things.

Assumptions:
You will likely be storing hundreds of thousands of Tweets over time, so comparing one hash against "every record" in the table will be inefficient. Also, RTs are not always carbon copies of the original tweet. After all, the original author's name is usually included and takes up some of the 140 character limit. So perhaps you could use a solution that matches more accurately than a "dumb" hash?

Tagging & Indexing

Tag and index the component parts of the message in a standard way. This could include treating hashtags (#....), at-mentions (@....) and URL strings as "tags". After removing noise words and punctuation, you could also treat the remaining words as tags.
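The tagging step could be sketched like this in Python. The regular expressions and the tiny noise-word list are assumptions for illustration only; a real system would want a proper stop-word list and more careful URL handling.

```python
import re

# Hypothetical, deliberately tiny noise-word list; a real one would be longer.
NOISE_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "rt", "via"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
WORD_RE = re.compile(r"[a-z0-9']+")

def extract_tags(tweet):
    """Return the set of tags: URLs, @mentions, #hashtags and remaining words."""
    urls = URL_RE.findall(tweet)             # keep case: short-URL paths are case-sensitive
    text = URL_RE.sub(" ", tweet).lower()
    mentions = MENTION_RE.findall(text)
    text = MENTION_RE.sub(" ", text)
    hashtags = HASHTAG_RE.findall(text)
    text = HASHTAG_RE.sub(" ", text)
    words = [w for w in WORD_RE.findall(text) if w not in NOISE_WORDS]
    return set(urls) | set(mentions) | set(hashtags) | set(words)

tags = extract_tags("RT @alice: The new #python release is out! http://example.com/x")
```

For that sample tweet, the tags are the URL, "@alice", "#python", and the words "new", "release" and "out".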

Fast Searching

Databases are terrible at finding multiple group membership very quickly (I'll assume you're using either MySQL or PostgreSQL, which are terrible at this). Instead, try one of the free text engines like Sphinx Search. They are very, very fast at resolving multiple group membership (i.e. checking if keywords are present).

Using Sphinx or similar, we search on all of the "tags" we extracted. This will probably return a smallish result set of "potential original Tweets". Then compare them one by one using a similarity-matching algorithm (here is one in Python: http://code.google.com/p/pylevenshtein/).
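To make the two-stage idea concrete, here is a rough, self-contained sketch. It is not Sphinx and not pylevenshtein: a plain in-memory inverted index stands in for the text engine, and difflib.SequenceMatcher from the standard library stands in for a Levenshtein ratio. The TweetIndex class, the keywords helper, and the thresholds are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def keywords(text):
    # Crude stand-in for tag extraction: lowercase words of more than 3 characters.
    words = (w.strip(".,!?:;\"'()") for w in text.lower().split())
    return {w for w in words if len(w) > 3}

class TweetIndex:
    def __init__(self):
        self.tweets = []                     # tweet id -> text
        self.postings = defaultdict(set)     # keyword -> ids of tweets containing it

    def add(self, text):
        tweet_id = len(self.tweets)
        self.tweets.append(text)
        for word in keywords(text):
            self.postings[word].add(tweet_id)
        return tweet_id

    def candidates(self, text, min_shared=2):
        # Stage 1: cheap keyword lookup to get a small candidate set.
        counts = defaultdict(int)
        for word in keywords(text):
            for tweet_id in self.postings.get(word, ()):
                counts[tweet_id] += 1
        return [i for i, n in counts.items() if n >= min_shared]

    def best_match(self, text, threshold=0.6):
        # Stage 2: expensive similarity comparison, candidates only.
        best_id, best_ratio = None, 0.0
        for tweet_id in self.candidates(text):
            ratio = SequenceMatcher(
                None, text.lower(), self.tweets[tweet_id].lower()
            ).ratio()
            if ratio > best_ratio:
                best_id, best_ratio = tweet_id, ratio
        return (best_id, best_ratio) if best_ratio >= threshold else (None, best_ratio)

index = TweetIndex()
original = index.add("Python 3.1 release candidate is available for download")
index.add("Making pancakes for breakfast this morning")
match_id, score = index.best_match(
    "RT @bob: Python 3.1 release candidate is available for download"
)
```

The point of the split is that the similarity comparison only runs against the handful of tweets sharing keywords with the incoming one, not against every stored tweet.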

Of course, I have to "clean the tweet" of all the @words and punctuation. But rather than tagging and grouping, wouldn't it be simpler to generate some unique value that I can query the database with, as count(hash)?
–
Lakshman Prasad, May 2 '09 at 20:29

Have you analyzed a sample of RTs and confirmed they are mostly identical? If you can rely on this, a hash will be simpler. But my swift wild-arsed guess is that maybe 10-20% of RTs are non-identical to the original. If you need high accuracy, then get a meaningful random sample (1000-10000) of Tweets that look like RTs (i.e. start with "RT @....", "via @....", "Retweet @...." or "@... said") and measure how closely they match the original. If accuracy is not so important, save time and just hash it. I had an idea for fast hash lookups too, so I'll put that below. :D
–
pchap10k, May 3 '09 at 2:34

There are a few issues here. First, RTs are not always identical. Some people add a comment. Others change the URL for tracking. Others add in the person that they are RT'ing (which may or may not be the originator).

So if you are going to hash the tweet, you need to boil it down to the meat of the tweet, and only hash that. Good luck.
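One way to "boil it down" before hashing might look like the sketch below. The normalization rules here (drop URLs, @mentions, RT/via markers and punctuation, then lowercase) are assumptions, not a tested recipe, and the function names are made up. Note that it is lossy: it also strips a legitimate "via" in the body, and a retweet with an added comment will still hash differently.

```python
import hashlib
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
RT_MARKER_RE = re.compile(r"\b(?:rt|via|retweet)\b", re.IGNORECASE)
NON_WORD_RE = re.compile(r"[^a-z0-9#\s]")

def normalize(tweet):
    """Strip the parts that typically differ between a tweet and its retweet."""
    text = URL_RE.sub(" ", tweet)          # tracking URLs often change
    text = MENTION_RE.sub(" ", text)       # credited users vary
    text = RT_MARKER_RE.sub(" ", text)     # "RT", "via", "retweet" markers
    text = NON_WORD_RE.sub(" ", text.lower())
    return " ".join(text.split())          # collapse whitespace

def tweet_key(tweet):
    return hashlib.sha256(normalize(tweet).encode("utf-8")).hexdigest()

original = "Check out the new #python release! http://bit.ly/abc"
retweet = "RT @alice: Check out the new #python release! http://bit.ly/xyz"
```

With these rules, original and retweet normalize to the same string and therefore share a key, even though the raw texts differ in the prefix and the URL.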

Above, someone mentioned that with 32 bits, you will start having collisions at about 65K tweets. Of course, you could have collisions on tweet #2. But I think the author of that comment was confused, since 2^16 = ~65K, but 2^32 = ~4.3 billion. So you have a little more room there.
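To put numbers on this, the standard birthday approximation can be computed directly. Assuming the hash is uniformly distributed over N = 2^32 values, the chance of at least one collision among n items is roughly 1 - exp(-n(n-1)/(2N)). On that estimate a collision is already fairly likely around 2^16 = 65,536 items, which is where a 65K figure would come from, even though the hash space itself has about 4.3 billion values. The function names are just for illustration.

```python
import math

def collision_probability(n, bits):
    """Birthday approximation: chance of any collision among n uniform hashes."""
    space = 2 ** bits
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * space))

def items_for_probability(p, bits):
    """Approximate number of items at which the collision chance reaches p."""
    space = 2 ** bits
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p)))

p_65k = collision_probability(65536, 32)     # roughly 0.39
n_half = items_for_probability(0.5, 32)      # roughly 77,000
```

So with a 32-bit hash, a collision becomes more likely than not somewhere around 77,000 stored tweets, under the uniformity assumption.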

A better algorithm might be to try to derive the "unique" parts of the tweet, and fingerprint it. It's not a hash, it's a fingerprint of a few key words that define uniqueness.

Well, tweets are only 140 characters long, so you could even store the entire tweet in the database...

but if you really want to "hash" them somehow, a simple way would be to just take the sum of the ASCII values of all the characters in the tweet:

sum(ord(c) for c in tweet)

Of course, whenever you have a match of hashes, you should check the tweets themselves for sameness, because the probability of finding two tweets that give the same "sum-hash" is probably non-negligible.
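A quick sketch of why that check matters: the sum ignores character order, so any rearrangement of the same characters collides, and so do different characters that happen to add up to the same total. For a 140-character ASCII tweet the sum can be at most 140 * 127 = 17,780, so there are fewer than 18,000 possible values. The is_same_tweet helper is a hypothetical name.

```python
def sum_hash(tweet):
    return sum(ord(c) for c in tweet)

def is_same_tweet(x, y):
    # Cheap hash comparison first, then confirm on the actual text.
    return sum_hash(x) == sum_hash(y) and x == y

a = "listen to this"
b = "silent to this"    # same characters in a different order
c = "ad"                # 97 + 100
d = "bc"                # 98 + 99

max_ascii_sum = 140 * 127
```

Here a and b share a sum-hash, as do c and d, while is_same_tweet correctly reports both pairs as different.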