An infrequently updated collection of comments on random subjects.

April 24, 2006

Dave Winer: Show me that mathematical proof!

Dave, even endless repetition of a false statement won't make it true...

Pito Salas of BlogBridge recently wrote that they have implemented "Reading List Pings." This brings to OPML the same push-based, bandwidth-saving technology pioneered and proven in the realm of RSS/Atom syndication. As Pito says: "As more and more people both publish and subscribe to OPML Reading Lists, the polling that we are all doing isn't going to scale."

Almost on cue, Dave Winer slams the idea of Reading List Pings while implying that Pito and others are "doomsayers." Winer claims that he can produce a "mathematical proof" that polling does, in fact, scale. In the past, Dave has made similar statements repeatedly while arguing against push-based notification, but he is no more correct today than he's ever been. No valid proof can be concocted that shows that polling is more efficient than push-based "notifications." It simply can't be done. No amount of name-calling or accusing folk of being "doomsayers" can change the simple fact that polling has serious and unsolvable scaling problems.

Dave is certainly correct when he says that "HTTP has [more than one] very efficient mechanism for software to determine if a resource has changed." In fact, the most efficient of those mechanisms in use today is the one that I originally defined on this blog: RFC3229+feed. However, even the best polling methods do not result in a system that scales as well as a push-based "notification" system that only sends updates to subscribers if and when there are changes.
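To make the ETag mechanism concrete, here is a minimal sketch of a conditional poll. It is an in-memory simulation, not a real HTTP client: `FeedServer` and `poll` are hypothetical names invented for illustration, and the ETag is assumed to be a hash of the feed body. It does not model the delta encoding that RFC3229+feed adds on top.

```python
import hashlib


class FeedServer:
    """Hypothetical in-memory stand-in for an HTTP server hosting one feed."""

    def __init__(self, body):
        self.body = body

    def get(self, if_none_match=None):
        # Assumption: the ETag is a hash of the body, so it changes with the feed.
        etag = hashlib.sha1(self.body.encode("utf-8")).hexdigest()
        if if_none_match == etag:
            return 304, etag, ""       # Not Modified: no body is sent
        return 200, etag, self.body    # changed or first request: full feed is sent


def poll(server, cached_etag):
    """One conditional poll; returns (status, etag, body bytes transferred)."""
    status, etag, body = server.get(if_none_match=cached_etag)
    return status, etag, len(body.encode("utf-8"))


server = FeedServer("<rss>entry 1</rss>")
first = poll(server, None)             # nothing cached yet
second = poll(server, first[1])        # feed unchanged since the first poll
server.body = "<rss>entry 1, entry 2</rss>"
third = poll(server, second[1])        # feed changed
print(first[0], second[0], third[0])   # prints: 200 304 200
```

Note that the second poll transfers no body, but it still costs a full request and response. That per-poll cost is what a push system avoids when nothing has changed.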

Dave claims that sending "a ping to every subscriber... isn't practical because of firewalls and NATs." He is simply wrong. For several years now, PubSub.com has operated a system that efficiently sends "pings" through firewalls and NATs with no problem. We rely on the JEP-0060 extensions to the open Jabber/XMPP Instant Messaging protocol to create cheap continuous connections from inside the firewall/NAT boundary. There are many other methods available for dealing with the firewall/NAT issues. For instance, users of KnowNow's products or any of the implementations of mod_pubsub accomplish much the same as PubSub (with less efficiency...:-)) by using persistent HTTP connections.

Dave claims that if you rely on a "central authority" (like PubSub...) "nothing is more efficient ... than eTags, nor as widely implemented, nor as utterly optimized." While Winer is certainly correct in saying that "eTag" based polling systems are widely implemented, he is completely wrong in saying that "nothing is more efficient" and that nothing is "as utterly optimized." The RFC3229+feed mechanism, which relies in part on eTags, is certainly just about as optimized as a feed polling method can be, but it is still vastly less efficient, less optimized and less scalable than a well-designed push-based method. I know this because, in addition to having defined the very efficient RFC3229+feed mechanism, I've also personally directed, at PubSub, the construction of not only one of the blogosphere's largest feed polling systems but also the blogosphere's largest push-based notification system.

Dave's lack of understanding of the issues related to scaling can be seen in the history of the weblogs.com site that he struggled to build and maintain for so long. That site takes "pings" from blogs and then consolidates them into tremendous "change lists" which must be polled. Essentially, this site converts an efficient push-based update notification system (pinging) into an inefficient polling-based system. Weblogs.com, as Dave built it, didn't even support common methods like eTags or RFC3229+feed to improve polling efficiency and scaling. The result was that it simply didn't scale and was frequently incapable of providing the service levels that people expected. Only now that Verisign has taken over the site and dedicated much better engineering staff and much more hardware has the weblogs.com service begun to be somewhat useful again. However, since weblogs.com is still based on the terribly inefficient polling of change lists that Dave supports, it is still a far cry from being what it might be.

It is true that mechanisms exist in HTTP to make polling more efficient and more scalable. However, those mechanisms do not permit polling to be more efficient than push notifications. Repeating false statements doesn't make them true. If Dave continues to claim that he actually has a "mathematical proof" that polling can scale better than a push-based system, he should present it. If he doesn't demonstrate his proof, then his claims are no more credible than Bush's claims to "secret" evidence of WMDs in Iraq or Joseph McCarthy's claims to hold a list of communists in the State Department...

On the "push... NAT... Jabber" issues, it would be interesting to see more about using keep-alive and push over "Comet" [http://ajaxian.com/archives/comet-a-new-approach-to-ajax-applications]. Then implementations of the "Observer" pattern [http://en.wikipedia.org/wiki/Observer_pattern] might be transparent even when HTTP is the only protocol available to the remote observer objects.

FYI, the scaling issues on weblogs.com were on the ping-handling side, not in handling the requests from people polling changes.xml.

Also, you're right, the term "doomsayer" was not a good thing to say, and I apologize for any hurt it might have caused. However, you do a lot of name-calling in this piece, so much so that I'm reluctant to point to it, so as not to encourage this level of discourse. If you have something to say about this, can you do it without being so disrespectful about it?

Sorry Bob, but the whole RSS ping thing is a failure. Both Technorati and PubSub rely on it and fail to index the vast majority of posts because the pings are simply dropped. My own evidence shows that PubSub responds to less than 10% of pings.

I've documented this countless times on The RSS Blog.
http://www.kbcafe.com/rss/?guid=20060409191336

Blogosphere search engines that use polling in addition to pings (IceRocket and Google Blog Search) are reliable and index almost all of the content of the feeds in their index. Blogosphere search engines that rely solely on pings are not, and they index only a small percentage of the actual content.

This is most evident this month with PubSub where you've experienced massive outages for days on end. A few weeks ago, I talked with Salim about this and he confirmed the ping infrastructure wasn't working.

Notification:
To receive notification, the listener must have a socket pre-allocated. Suppose the listener receives 100 notifications simultaneously. The listener will fork or spawn a new thread, create a new socket and repeat that 100 times. Each thread/process must be processed. At some level N of simultaneous or near-simultaneous notifications, the listening system will be swamped: N may be 100, 1000, or 1 million, but at some level the system _will_ necessarily fail due to lack of resources (memory or CPU). Alternatively, notifying systems may time out. The listening system has few choices: either die or discard notifications.

Polling:
Polling can be done at the leisure of the polling system. The number of connections Q can be limited to whatever the polling system can handle. There is zero, nada, NIL chance of the system being swamped, since at any time it is polling at most Q sites.

Summary:
While a polling system will be updated more slowly, it will scale linearly:
time to process M sites ~= M x (time to process 1 site)

Since the laws of probability _guarantee_ that a system based on notifications will eventually encounter a situation when it is saturated by notifications, using a notification system guarantees eventual failure.
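The bounded-polling argument above can be sketched as follows. This is only an illustration under stated assumptions: `Q`, `M` and `poll_site` are hypothetical, and a short sleep stands in for the network request. The point shown is that a worker pool of size Q never has more than Q polls in flight, however many sites there are.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

Q = 4    # hypothetical cap on simultaneous connections
M = 20   # hypothetical number of sites to poll

lock = threading.Lock()
stats = {"active": 0, "peak": 0}


def poll_site(site_id):
    """Stand-in for polling one site; records how many polls run at once."""
    with lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.01)  # placeholder for the real HTTP request
    with lock:
        stats["active"] -= 1
    return site_id


# The pool size is the only thing that limits concurrency here.
with ThreadPoolExecutor(max_workers=Q) as pool:
    results = list(pool.map(poll_site, range(M)))

print(len(results), stats["peak"] <= Q)
```

The total work still grows with M, which matches the linear estimate in the summary: the cap bounds load at any instant, not the time needed to get through every site.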

It *is* remarkable how successful polling-based syndication has been to date, and says a lot for the design of the web's key specification, HTTP. Saying a software system "doesn't scale" without any kind of qualification is pretty meaningless - anything can be scaled by throwing more iron at it. But there's no getting around Bob's basic point - push is inherently more efficient than polling.

Randy, two cases of (presumed) failure of the ping approach don't mean the technique is flawed. I believe there is a lot of potential, and it would be a shame for it to be neglected.

G. Roper, I'm afraid that analysis is a non-starter. It's just as reasonable to limit the number of connections at a push-based receiver as it is for a polling system to decide not to poll beyond its capabilities. Or if you prefer, "Since the laws of probability _guarantee_ that a system based on polling will eventually encounter a situation when it is saturated by subscriptions, using a polling system guarantees eventual failure."
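A rough sketch of what limiting a push-based receiver could look like, assuming a fixed-capacity inbox: notifications beyond the capacity are refused rather than swamping the system, and the sender is expected to retry later. `make_receiver` and `capacity` are hypothetical names, not taken from any real ping service.

```python
import queue


def make_receiver(capacity):
    """Build a receiver that holds at most `capacity` pending notifications."""
    inbox = queue.Queue(maxsize=capacity)

    def receive(notification):
        try:
            inbox.put_nowait(notification)
            return True    # accepted for later processing
        except queue.Full:
            return False   # refused: the receiver is at its chosen limit

    return inbox, receive


inbox, receive = make_receiver(capacity=3)
accepted = [receive(f"ping-{i}") for i in range(5)]
print(accepted)  # prints: [True, True, True, False, False]
```

This is the same trade-off a polling system makes when it chooses not to poll beyond its capabilities: load stays bounded, at the cost of delayed or refused updates.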

Ok, that's a bit flippant. But polling can never be more efficient than push. Wanna proof? Consider a one-entry feed. To operate without dropping any entries a polling system would have to ensure its polling frequency is high enough that the window between polls is narrower than the *minimum* time between new entries. Over time, the number of bits that are transferred will be the sum of a value proportional to the polling frequency multiplied by the number of bits transferred for each 'miss' (and any transport overhead, presumably constant), plus the total number of information bits (and any transport overhead). The number of bits transferred in a push system will simply be of the order of the total number of information bits. So the amount of data that has to be transferred in a polling system will be *at least* as many as in a push system.
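As a worked illustration of that accounting (all numbers below are made-up assumptions, not measurements): count the bits moved in one day for a feed that gets 10 new entries, polled once a minute, where each transfer carries a fixed transport overhead and a "miss" carries no payload.

```python
def push_bits(entries, entry_bits, overhead_bits):
    """Bits moved by push: one transfer per new entry."""
    return entries * (entry_bits + overhead_bits)


def polling_bits(polls, entries, entry_bits, overhead_bits, miss_bits):
    """Bits moved by polling: every poll that finds nothing new is a 'miss'."""
    misses = polls - entries  # assumes polls >= entries, so no entry is dropped
    return misses * (miss_bits + overhead_bits) + push_bits(
        entries, entry_bits, overhead_bits
    )


# Hypothetical figures: 1440 polls/day, 10 entries/day,
# 8000 bits per entry, 2000 bits of overhead per transfer, empty miss body.
push = push_bits(entries=10, entry_bits=8000, overhead_bits=2000)
poll = polling_bits(polls=1440, entries=10, entry_bits=8000,
                    overhead_bits=2000, miss_bits=0)
print(push, poll)  # prints: 100000 2960000
```

With these assumed figures the misses dominate the total. Polling only matches push in the limit where every poll happens to find a new entry.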

In practice the polling window is expanded considerably by allowing multiple entries in each document. So fewer calls are needed. But this comes at a price - when there are new entries all the already-received data in the feed gets transferred again (unless as Bob suggests, you only pass deltas).

It's been a long time since I read it, but I believe there are some of the relevant sums in Rohit Khare's dissertation (no coincidence that he's also the guy behind mod_pubsub and KnowNow).

There are trade-offs between the approaches, I suspect further down the road we'll see more interesting hybrids. What we won't see is a mathematical proof from Dave.

The 'mathematical proof' rests on the assumption that a receiver has to maintain persistent connections in order to be notified. But under the same assumptions that G. Roper applied to the polling case, receiving can be done at the receiver's own PACE or CAPACITY. SMTP is the most popular push-based notification system, and it still scales pretty well almost 30 years after its introduction. The problem for the push model is how to make a receiver's well-known endpoint visible outside NATs and firewalls. Once this problem is dealt with, the receiver can take notifications via REST, XML-RPC, SOAP, SMTP or XMPP without problem at ITS own PACE. The remaining problem is how pushers schedule their updates to achieve minimum cost, maximum efficiency, and shortest latency for notification delivery.

Oh yeah, what I forgot to mention was things get worse when you start chaining polling systems - the simplest case being synthesized feeds (aggregate & republish). They have to cache otherwise the polling window will have to be reduced at every stage to prevent missing entries. The "Reading List" case is slightly different, but I suspect that's unlikely to scale indefinitely unless you're prepared to accept either a certain proportion of lost data, or significant redundancy in caching. This isn't "doomsaying", such costs might be worth it, and in practice such systems as a whole may turn out to be more useful than their push-based counterparts. But there's no sense in pretending the scalability issues don't exist.

Hoy - Randy. I gotta argue with your quote. I don't remember ever saying the 'ping infrastructure wasn't working'. More likely I said something about spings or the fact that some blogs don't yet ping or something like that. The ping infrastructure is the foundation of anything to do with syndication and is constantly evolving. It took roughly 10 years each for SMTP and HTTP to fully iron themselves out - it'll also take a while for this new wave to do so. However, once it's fully there, a whole new class of applications and services gets enabled, which is what's so exciting.

Danny, your proof left out the reason the blogosphere ping has failed.
http://en.wikipedia.org/wiki/Sping

[Bob Wyman responds: Randy, the "blogosphere ping" has NOT failed. Sure, we receive lots of spam pings, but we can recognize most of them for what they are and we filter them out. All pings we receive at PubSub are verified before we forward them to the FeedMesh or other subscribers. (By "verify" I mean that we verify that the ping corresponds to an actual change in a feed.) If nothing else, spam pings are a great indicator of who the spammers are! The vast majority of sites that ping "too fast" or that send "fake pings" (pings for feeds that have not changed) are run by spammers. So, spammers who are spingers just draw attention to themselves and will end up being blocked.]
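One way to picture "verifying" a ping is sketched below. This is a guess at the general idea, not PubSub's actual implementation: `PingVerifier` and its `fetch` callable are hypothetical, and a feed is assumed to have changed whenever a hash of its content differs from the last one seen.

```python
import hashlib


class PingVerifier:
    """Accept a ping only if the feed's content differs from the last copy seen."""

    def __init__(self, fetch):
        self.fetch = fetch      # hypothetical callable: url -> feed text
        self.last_digest = {}   # url -> digest of the last content seen

    def verify(self, url):
        digest = hashlib.sha256(self.fetch(url).encode("utf-8")).hexdigest()
        changed = self.last_digest.get(url) != digest
        self.last_digest[url] = digest
        return changed


feeds = {"http://example.org/feed": "<rss>entry 1</rss>"}
verifier = PingVerifier(fetch=lambda url: feeds[url])

first = verifier.verify("http://example.org/feed")   # first sighting counts as a change
fake = verifier.verify("http://example.org/feed")    # nothing changed: a "fake ping"
feeds["http://example.org/feed"] = "<rss>entry 1, entry 2</rss>"
real = verifier.verify("http://example.org/feed")    # content changed
print(first, fake, real)  # prints: True False True
```

A count of failed verifications per sender would then be one possible signal for spotting sites that ping "too fast" or send fake pings.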

Bob, if the spam pings are so easily filtered, then why are most of my referrers in PubSub full of splogs?

Here's a particularly bad day just a week ago.
http://www.pubsub.com/site_inlinks.php?site=kbcafe.com&linktype=in&date=20060417

[Bob Wyman responds:
Randy, spam pings and spam are two often related yet independent problems. Not all spammers produce spam pings -- some are quite proper in the way that they ping to notify us of updates to their spam... Not all spam pings are produced by spammers -- some otherwise respectable bloggers generate vast quantities of repeated pings, fake pings, etc.
We have a variety of methods to detect and filter ping spam and we have a different set of methods to detect and filter spam itself. Success in one of these two areas helps in the other but doesn't "solve" the other problem.]

Bob Wyman's original post stated:
"Winer claims that he can produce a "mathematical proof" that polling does, in fact, scale."

I showed that polling scales. I didn't argue that polling was "more efficient than push". Nor does my argument claim that polling systems cannot be saturated.

I forgot to mention the use of caching on the WWW, which improves the efficiency of HTTP and polling. I am uncertain whether or how these caches work in the case of push technologies. Perhaps someone else can tell us?

Caching does not make polling more efficient than pushing. A network of N nodes has to transmit at least N-1 times from one node to another to spread a piece of information from the original node to every node in the network. What Danny pointed out is that, by applying the Nyquist–Shannon–Kotelnikov sampling theorem, a polling system has to poll at a frequency twice that of a publisher's frequency of change to catch every update from the publisher. Therefore any polling system has to consume at least 200% of the bandwidth that a push system does, if you expect the polling system to behave exactly as a push system.

I don't think that caching applies to push systems as it does to polling systems. Instead, a push system needs some QoS mechanism to avoid saturation and a routing strategy to relay messages among nodes.

Bob, you keep telling us this and that works. Of course, we're not on your end, we have no choice but to believe you. What we see is that PubSub and Technorati go weeks and months without indexing blogs that are updated regularly. Somewhere in there is a disconnect. You'll have to tell us what's broken, because something is.