Saturday, 29 March 2008

We've been hearing for a while about where Wikipedia's traffic comes from, but here are some new stats from Heather Hopkins at Hitwise on where traffic goes to after visiting Wikipedia. Hopkins had produced some similar stats back in October 2006, and it's interesting to compare the results.

Wikipedia gets plenty of traffic from Google (consistently around half) and indeed other search engines, but what's interesting is that nearly one in ten users go back to Google after visiting Wikipedia, making it the number one downstream destination. Yahoo! is also a popular post-Wikipedia destination.

It was nice to see that Wiktionary and the Wikimedia Commons both make it into the top twenty sites visited by users leaving Wikipedia.

Hopkins also presents a graph illustrating destinations broken down by Hitwise's categories. More than a third of outbound traffic is to sites in the "computers and internet" category, and around a fifth to sites in the "entertainment" category, which probably ties in with the demographics of Wikipedia readers, and the general popularity of pop culture, internet and computing articles on Wikipedia.

Hopkins makes another interesting point on the categories, that large portions of the traffic in each category are to "authority" sites:

"Among Entertainment websites, IMDB and YouTube are authorities. Among Shopping and Classifieds it's Amazon and eBay. Among Music websites it's All Music Guide For Sports it's ESPN. For Finance it's Yahoo! Finance. For Health & Medical it's WebMD and United States National Library of Medicine."

Similarly, Doug Caverly at WebProNews states that the substantial proportion of traffic returning to search engines after visiting Wikipedia "probably indicates that folks are continuing their research elsewhere", and this ties in well with Hopkins' observation about the strong representation of reference sites.

All of this suggests that Wikipedia is being used the way that it is really meant to be used: as a first reference, as a starting point for further research.

Monday, 17 March 2008

Henrik's traffic statistics viewer, a visual interface to the raw data gathered by Domas Mituzas' wikistats page view counter, has generated plenty of interest among the Wikimedia community recently. Last week Kelly Martin, discussing the list of most viewed pages, wondered how many page views are of protected content; that thought piqued my interest, so I decided to dust off the old database and calculator and try to put a number to that question.

The data comes from the most viewed articles list covering the period from 1 February 2008 to 23 February 2008. I've used that data, and data on protection histories from the English Wikipedia site, to come up with some stats on page protection and page views. There are some limitations: I don't have gigabytes of bandwidth available, so some of the stats (on page views in particular) are estimates, and protection logs turn out to be pretty difficult to parse, so I've focused on collecting duration information rather than information on the type of protection (full protection, semi-protection etc). Maybe that could be the focus of a future study.

There were 9956 pages in the most viewed list for February 1 to February 23 2008. Excluding special pages, images and non-content pages, there were 9674 content pages (articles and portals) in the list. Interestingly, only 3617 of these pages have ever been protected, although each page that has been protected at least once has, on average, been under protection nearly three times.

Protection statistics

Only 1223 (12.6%, about an eighth) of the pages were edit protected at some point during the sample period, 902 of those for the entire period (a further 92 were move protected only at some point, 69 of those for the entire period). Each page that had some period of protection was protected for, on average, 82.9% of the time (just under 20 days), though if the pages protected for the whole period are excluded, the average period spent protected was only 34.8% of the time (just over eight days).

The following graph shows the distribution of the portion of the sample period that pages spent protected, rounded down to the nearest ten percent:

The shortest period of protection during the period was for Vicki Iseman, protected on 21 February by Stifle, who thought better of it and unprotected just 38 seconds later.

Among the most viewed list for February, the page that has been protected the longest is Swastika, which has been move protected continuously since 1 May 2005 (more than 1050 days). The page that has been edit protected the longest is Marilyn Manson (band), which has been semi-protected since 5 January 2006 (more than 800 days).

Interestingly, the average length of a period of edit protection across these articles (through their entire history) is around 46 days and 16 hours, whereas the average length of a period of move protection is lower, at 41 days 14 hours. I had expected the average bout of move protection to last longer, although almost all edit protections do include move protections.

The next graph shows the distribution of protection lengths across the history of these pages, for periods of protection up to 100 days in length (the full graph goes up to just over 800 days):Note the large spikes in the distribution at seven and fourteen days, the smaller spike at twenty-one days and the bump from twenty-eight to thirty-one days, corresponding to protections of four weeks or one month duration (MediaWiki uses calendar months, so one month's protection starting January will be 31 days long, whereas one month's protection starting September will be 30 days long).

The final graph shows the average length of protection periods (orange) and the number of protection periods applied (green) in each month, over the last four years:At least on these generally popular articles, protection got really popular towards the end of 2006 into the beginning of 2007, and again a year later. However, it seems that protection lengths peaked around the middle of 2007 and have been in decline since then.

Protection and pageviews

What really matters here though is the pageviews. The 9674 content pages in the most viewed list were viewed a total of 805,569,269 times over the relevant period. The 1223 pages that were edit protected for at least part of the period were viewed a total of 270,057,550 times (33.5%), with approximately 247 million of these pageviews coming while the pages were protected.

This is a really substantial number of pageviews, however, this number includes the Main Page, which alone accounts for more than 114 million of those pageviews. Leaving the Main Page out of the equation gives a healthier figure of around 133 million views to protected pages during the relevant period (and remember, this is only counting pages on the most viewed list).

Conclusions

Although only one in eight of the pages in the most viewed list were protected at some point during the relevant period, they tended to be higher-profile ones, accounting for one third of the page views. The pages that were protected at some point tended to be protected alot of the time, three-quarters of them for the entire sample period. This certainly fits with what many people have already suspected, that a small pool of high-profile articles attract plenty of attention in the form of both page protection and page views.

It will be interesting to do some more analysis on the history of page protection. Based on just this small sample, it seems that average protection lengths are trending downwards, which could well be something to do with the advent of timed protection. Hopefully I'll have some more insights to come.

Sunday, 16 March 2008

So much for claims that Jimmy Wales uses his influence to alter content for his friends: according to Facebook's Compare People application, Jimmy may be the best listener, the best scientist and the most fun to hang out with for a day, but he's nowhere to be seen in the list of people most likely to do a favour for me.