How to do a distributed Twitter (MSM)

Dave Winer recently made a call for an open source twitter shell, which he suggests be perhaps done with a javascript framework to let any site act like twitter.com. Many people are interested in this sort of suggestion, because while the folks at twitter.com are generally well loved and felt to be good actors, many people fear that no publishing system that becomes important should be controlled by just one company.

For success, such a system would need to be as easy to use and set up as twitter for users, and pretty easy to set up for server operators. One thing it can’t do so easily, alas, is use a simple single namespace the way twitter does. A distributed system probably has to make names be domains, like E-mail addresses. That almost surely means something longer than twitter names and no use of the @name syntax popular in Twitter to refer to users. On the one hand, almost everybody already has a domain-based ID, i.e. their E-mail address. On the other hand, most people are afraid to use this ID in public where it might attract spam. It’s a shame, but many might well prefer to get a different ID from their E-mail, or of course to use one at twitter, which would now look like user@twitter.com to the outside world instead of @user within twitter.

Naming problems aside, the denizens of the internet are certainly up to building a publish/subscribe based short message multicasting service, which is what twitter is, to use terms much older than the company. I might propose the name MSM (Multicast Short Message) for the technology.

Indeed, many of these technologies already exist. Jabber (XMPP) already offers such facilities, and was in fact used by Twitter in its early days, though the tools did not scale to doing it all on one site via XMPP. It does have a bit of baggage which makes the XML messages larger than needed, and its main publish/subscribe mechanism is for presence, not messages, though it can readily be adapted. There have been many other pubsub methods produced, and whole companies based on them. A new Google-backed open source tool called pubsubhubbub is one example.

Any site could then present a unified interface on top of the distributed infrastructure. Just as search engine companies show us one index of the distributed web, they would also spider and mass subscribe to all the public MSMs to let you search, filter and see them in different ways. Twitter.com would be just one company doing that. No doubt Google and Yahoo would soon join. Many companies would also sign up to host your tweeting (need a new name for that, too, “massum” and “massing” perhaps) for you so that the public or your followers could see your history, and you could have a server to talk to via your client or a web client. Again, twitter.com would be the immediate leader if it decided to participate in, rather than fight, the distributed MSM system.

Also important would be efforts to make all this more efficient, since millions of people do this today. It is likely that even though you might want to use a name at your existing domain (say b@4brad.com) you might decide you don’t want to host an MSM server, so the domain owner (that’s me, but more often would be an ISP or company) would just set up a DNS record to redirect to another company which would run the server. Again, that might well be twitter.com if it chooses to do this. Chances are there might be only a few score servers which handle the vast bulk of users, with the rest running their own private servers. If this is done, even though a message needs to be multicast to a million followers, it might be discovered (and cached) that the vast majority of those million followers are really on a much smaller number of servers, and as such the number of messages to be sent out could be quite manageable.
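To make the fan-out point concrete, here is a minimal sketch in Python of grouping followers by the server that hosts them, so that one copy of a message goes to each server rather than to each follower. The function name, the sample addresses and the domain-to-server table are all made up for illustration; in a real system that table would come from the cached DNS redirects described above.

```python
from collections import defaultdict

def group_followers_by_server(followers, server_for_domain):
    """Group follower addresses (user@domain) by the server hosting them.

    server_for_domain is a hypothetical cache mapping a domain to the
    server it has redirected its MSM hosting to. A domain with no entry
    is assumed to run its own server.
    """
    by_server = defaultdict(list)
    for address in followers:
        _user, _sep, domain = address.rpartition("@")
        server = server_for_domain.get(domain, domain)
        by_server[server].append(address)
    return dict(by_server)

# Illustrative data only: three of four followers share one hosting server.
followers = [
    "alice@example.org",
    "bob@example.org",
    "b@4brad.com",
    "carol@example.net",
]
server_for_domain = {
    "example.org": "bighost.example",
    "example.net": "bighost.example",
}

deliveries = group_followers_by_server(followers, server_for_domain)
# One outgoing message per key of deliveries, not one per follower.
```

With the sample data, four followers collapse into two deliveries. With a million followers spread over a few score servers, the saving would be far larger.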

It might also make sense to consider IP multicasting. IP multicasting is one of the internet’s most useful tools but it is very rarely used, because it usually does not make it through NATs and firewalls. However, it is generally available to servers at main network hubs. It allows you to “multicast” a packet to subscribers, and they will all be sent it, but each packet only travels once down any given internet pipe, even if there are several receivers waiting for it at the end of that pipe, or other pipes branched off that pipe. (Yes, sometimes the internet really is a series of tubes.)

Since a fair number of sites would both represent many users and also want the full firehose of public data, they might do well to use multicasting to be efficient. Note that multicasting is not a 100% reliable service, so it requires an extra protocol where packets have serial numbers so you can see if you miss a packet, and use regular TCP methods to request those lost packets, assuring you miss nothing. Getting a few MSMs slightly out of order over the course of seconds is no big deal. Sites without a lot to say would not multicast, but might ask another site which does to do it — it’s not a big burden.
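The serial number idea can be sketched in a few lines of Python. This is only an illustration of the bookkeeping a receiver would need, under the assumption that each sender numbers its packets with consecutive integers; the class and method names are invented, and the actual re-request over TCP is left out.

```python
class GapDetector:
    """Track serial numbers from one sender and report missed packets."""

    def __init__(self):
        self.next_expected = None  # serial we expect to see next
        self.missing = set()       # serials skipped and not yet recovered

    def receive(self, serial):
        """Record a received serial; return serials newly found missing."""
        if self.next_expected is None:
            # First packet seen from this sender: start counting here.
            self.next_expected = serial + 1
            return []
        if serial in self.missing:
            # A late or re-requested packet fills an earlier gap.
            self.missing.discard(serial)
            return []
        if serial < self.next_expected:
            # Duplicate of something already received; ignore it.
            return []
        # Everything between the expected serial and this one was skipped.
        newly_missing = list(range(self.next_expected, serial))
        self.missing.update(newly_missing)
        self.next_expected = serial + 1
        return newly_missing
```

A receiver would call receive() for each multicast packet and ask the sender, over an ordinary TCP connection, for whatever serials come back. Since slightly out-of-order delivery is acceptable, it could wait a moment before asking, in case the packet is merely late.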

So you, or your MSM client, would connect to your server of choice, and there it would see your feed. When you wanted to see things about other people, and arrange to follow them, there would need to be a protocol to generate the right web address of the server which handles that user, and can show you their history, their lists, or other new features related to them. For search you might use your own server or another you trust, or your own server might embed Google search or any similar provider in your page, so it looks and acts like Twitter today when you see it. To make all this work, there has to be a way that given a name in the system (like b@4brad.com) you can turn it into a URL which would deliver something very much like what you see at www.twitter.com/username today. Again, the domain records for 4brad.com would need to contain a record pointing to that web server, and your browser would then be directed to whateverserver.com/msm/b@4brad.com or similar. Another approach would be an HTTP based system like webfinger which depends on a web server at the domain. A client program would follow the same redirect, but would probably speak a client API to the server rather than just fetch a web page.
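As a rough sketch of that name-to-URL step in Python: the /msm/ path follows the example above, but the function name is invented, and lookup_server is a stand-in for whatever DNS record or webfinger-style query a real system would define. None of this is an existing protocol.

```python
from urllib.parse import quote

def profile_url(address, lookup_server):
    """Turn an MSM name like user@domain into the URL of its profile page.

    lookup_server is a hypothetical callable that takes a domain and
    returns the host name of the server handling MSM for that domain.
    """
    user, sep, domain = address.rpartition("@")
    if not sep or not user or not domain:
        raise ValueError(f"not a user@domain name: {address!r}")
    server = lookup_server(domain)
    # Keep '@' readable in the path; escape anything else that is unsafe.
    return f"https://{server}/msm/{quote(address, safe='@')}"

# Illustrative lookup table standing in for a DNS query.
redirects = {"4brad.com": "whateverserver.com"}
url = profile_url("b@4brad.com", redirects.__getitem__)
```

A browser would be sent to that URL to see a page, while a client program could take the same server name and talk to an API there instead.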

All this is doable. Is twitter important enough that the will would exist to create it?

Of course, a publish/subscribe multicast server would actually serve a number of other uses, particularly because it would not be limited to 140 characters. It would also be a good way to distribute new blog posts to readers in a push method, and do mailing lists and newsgroups and many other things. It’s something the web has cobbled together many times from other technologies, and is even similar to some ways of interpreting USENET. As such the work would have many other benefits.

Update: Status.net (which powers identi.ca) is one attempt in this direction, though I need to study it more to learn how it handles namespace. Silly me for assuming that because Dave’s blog post on the subject did not refer to some of these projects that there wasn’t yet much progress in the area!