I wrote this article last week, during the height of the edits to the Comet article that stripped it of all useful content. I decided not to post it, but after seeing the interest on reddit I reconsidered. So here it is.

Wikipedia rejects self-published sources as inherently biased, inconsequential, and ultimately non-notable. This is undoubtedly a product of Wikipedia’s early days, when the site struggled (and to some extent still struggles) to gain legitimacy in the hard-copy world of authors, journals, and publishers. I actually know little about the founding of Wikipedia, or even much about Wikipedia culture beyond what I’ve witnessed in a recent disagreement about the validity of the term Comet on Wikipedia. I suspect, though, that the founders of Wikipedia figured they could gain the legitimacy they craved by adopting the same guidelines and rules as the hard-copy world. They would be as rigorous as paper-based encyclopedias, and eventually everyone would accept Wikipedia content as authoritative.

I think that’s a reasonable approach. Indeed, many students reference Wikipedia in their academic work. When I’m at work and need to know the meaning of some concept, I almost always check Wikipedia. Since August 2007 alone, the site has increased its daily reach by 50% (alexa.com). But is that because I and others have read Wikipedia’s guidelines and trust that the same rigor applied to Britannica has also been applied to each word we read online? Well, no. As I said, I knew very little about that rigor until recently.

The answer is search engines. When I read a tech article somewhere that talked about Bubble Sort, I wondered when the algorithm was first discovered. I googled the term and, lo and behold, the first link was Wikipedia. The same is true for a vast number of search results on Google. *People didn’t start using Wikipedia because they were impressed with its high standards in sourcing material; rather, the site kept coming up first in Google searches.*

A major advantage of the internet, after all, is that it is trivial to disseminate information: there’s no third party you have to convince in order to put your thoughts online. What makes this so ironic is that Wikipedia is itself a self-published source. They didn’t ask Britannica to publish their content; they just published it online. The reason Wikipedia gained so much legitimacy is not its guidelines, it’s its Google PageRank. Because other sources on the internet linked to their website, they gained authority on all manner of subjects. PageRank, after all, is the modern measure of authority.

Now consider that Wikipedia rejects blog entries as non-authoritative. The hypocrisy is clear: Wikipedia is a self-published source, and is *only* successful thanks to the modern measure of legitimacy known as Google PageRank. Yet, in spite of their success, by policy they stand in stark opposition to this modern process of content dissemination that we call the internet. All content on Wikipedia must be validated through outdated mechanisms, such as academic journals or large publishing conglomerates.

Unfortunately, the world of technology is abandoning these outdated venues because of the difficulty and extremely drawn-out turnaround time of publishing new information through them. When a popular open source project comes up with a new innovation, they blog about it. Someone else might come along and write a book about it, but much of the time the author is only peripherally related to the topic and isn’t the innovator, or even an expert on the subject matter! The worst part, though, is that there is no way to publish a book in less than a year, and generally it takes much longer.

The result is a collision on topics like Comet, as evidenced by the recent controversy. While there are dozens of trade-conference presentations about Comet this year, many white papers, and two upcoming books, they are all far less authoritative than, say, Alex Russell’s blog, or Comet Daily, a professional blog about Comet. The Wikipedia article is therefore crippled by a) the months-long turnaround for content to travel from a blog to an “authoritative source,” and b) a loss of clarity, because we can only cite authors who aren’t on the cutting edge but are instead rehashing last year’s blogs in book or conference form.

Is there a solution to this problem? Maybe some combination of editors to summarize sources and PageRank to determine which content has actual authority. I don’t really know, but I do know that wholesale deletions are *not* the answer; they’re a childish reaction, at best, on the part of Wikipedia editors. Maybe someone else knows the answer?

I just ran into my first problem with CPython’s GIL. I’ve heard many people complain about it, but I always figured, “What’s the big deal?” After all, you can just create a multi-process application if you want to take advantage of multiple processors. In fact, the GIL is all that saves my Core Duo laptop when I play Galcon, since the game utilizes 100% of whatever processor it’s given. When I play on my Pentium M the computer becomes unusable; on the Core Duo I’ve still got a spare processor lying around that Python can’t touch.

But as for the problem at hand: I wrote a simple SCGI daemon using pyevent. In case you haven’t heard of pyevent, it’s a Python wrapper for the libevent network IO library. Libevent is fast and scales very well. Anyway, I wrote the daemon using pyevent and it worked great. But the app I was writing needed database access, so I figured I’d create a thread pool and dispatch jobs via a worker queue. I coded it up and gave it a try.
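The thread-pool-plus-work-queue setup I had in mind looks roughly like this (a minimal sketch; the names `worker` and `jobs` are mine, and the lambda stands in for a real database call):

```python
import queue
import threading

def worker(jobs):
    """Pull jobs off the shared queue and run them until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            break
        job()  # e.g. a database query wrapped in a callable

jobs = queue.Queue()
pool = [threading.Thread(target=worker, args=(jobs,)) for _ in range(4)]
for t in pool:
    t.start()

# the IO loop dispatches work by putting callables on the queue
results = []
jobs.put(lambda: results.append("queried"))

# shut down: one sentinel per worker, then wait for them to finish
for _ in pool:
    jobs.put(None)
for t in pool:
    t.join()
```

In the real daemon the callables would run database queries and hand results back to the event loop, but the dispatch mechanics are the same.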

Instead of the great async-IO, threaded-dispatch application I expected, only the main thread actually ran. This seemed strange to me because the other threads all started up; they just never got any CPU time. More perplexing was that after I shut down the main thread, the other threads suddenly started running, dispatching jobs from the work queue. The whole application shut down when they were done.

After a couple of discouraging hours, I found that I could get approximately the behaviour I wanted by setting a timeout event in pyevent so that time.sleep would be called from the pyevent code:

    def idle():
        time.sleep(0.01)
        event.timeout(0.02, idle)

This works, but it isn’t ideal. Every 0.02 seconds the IO thread gives way to the db threads for 0.01 seconds, which means I’ve manually set a priority between the two. I actually have no idea which will end up using more CPU, so it will be very hard to choose a balance that works.

I plan on updating the code to use a second process for the database access. It’s probably a better idea anyway, because then I’ll be taking advantage of both processors.

Now I understand at least one way in which the GIL can be annoying. I started looking into updating the pyevent code to release the GIL, but I don’t know enough about the inner workings of libevent or Python at this point to make the appropriate modifications. Perhaps someone else has some insight.