Wikipedia, we have a Google refresh problem!

About Isobel Gorman
Isobel Gorman is an alumna of the MA in New Media at the University of Amsterdam. She also studied at the National College of Art & Design (NCAD) in Dublin, Ireland, where she graduated with a BA in Visual Communications and History of Art and Design in 2001. She received First Class Honours for her graduating thesis ‘Second Skins’, which dealt with the rising phenomenon of multiplayer online gaming. An extract of this thesis was published in the NCAD anthology ‘Thought Lines 6’.
Isobel currently runs a digital design studio and lectures at the Dublin Institute of Design.

What did the web crawler say to Wikipedia? I’ll update you later!

So the question is … just how much later?! I’m wondering because there seems to be a discrepancy between how fast Wikipedia bots and editors shoot down vandals and how fast Google refreshes its search results to reflect the correction.

Today I was doing some research on new game technology and, in particular, on a famous game developer who is showcasing it. I typed my query into Google and up popped a link to the developer’s Wikipedia page. Except the page information wasn’t quite what I’d expected to see in a search engine.

Note: I have removed the developer’s identity for fear of aiding the vandal in his/her smear campaign. The screenshot demonstrates the nature of the problem.

Wow!! This must be so frustrating for those extremely on-the-ball editors at Wikipedia. Imagine shooting down the bad guys only to be taunted by fragments of their evil deed in the inaccessible cached space of search engines. Not to mention the prolonged embarrassment of the individual who has been the target of the vandalism.

So where does the problem lie? With the search engine’s refresh rate. Search engines use web crawling agents to read pages on the World Wide Web and extract keywords. Google also stores a cached version of most pages, which can be viewed even if the original page is down.

Content on the Web is very dynamic by nature, as information is continually modified, added and deleted, so the data that search engines store becomes outdated very quickly. To limit these discrepancies, web pages are rescanned periodically. This can be once a day, once a week or once a month, depending on the information provided by the site administrator or on the search engine’s own statistics. Between two scans there is a gap in which we can still see the old content, already removed from the original page, and can still find a link to it using keywords based on that old content. In this case, if I typed in the three inappropriate words the vandal used, the developer in question appeared first on the list (and, funnily enough, above “sexy stripper”).
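To make that gap concrete, here is a minimal sketch of the arithmetic. It assumes a crawler that revisits a page at a fixed interval, which is a simplification (real search engines adapt their schedules), and the function names and timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def next_crawl_at_or_after(last_crawl, interval, moment):
    """Return the first scheduled crawl time that is not earlier than `moment`."""
    if moment <= last_crawl:
        return last_crawl
    # Integer ceiling of (moment - last_crawl) / interval, without floats.
    steps = -((last_crawl - moment) // interval)
    return last_crawl + steps * interval


def stale_after_revert(last_crawl, interval, vandalized_at, reverted_at):
    """How long the index keeps showing vandalized text after the revert.

    Returns zero if no crawl happened while the vandalism was live.
    """
    snapshot = next_crawl_at_or_after(last_crawl, interval, vandalized_at)
    if snapshot >= reverted_at:
        # The crawler never saw the vandalized version.
        return timedelta(0)
    refresh = next_crawl_at_or_after(last_crawl, interval, reverted_at)
    return refresh - reverted_at


# Hypothetical timeline: a weekly crawl lands in a five-minute vandalism window.
last_crawl = datetime(2010, 10, 7, 12, 0)
vandalized_at = datetime(2010, 10, 14, 11, 58)
reverted_at = datetime(2010, 10, 14, 12, 3)

gap = stale_after_revert(last_crawl, timedelta(days=7), vandalized_at, reverted_at)
print(gap)
```

Under these assumptions, vandalism that lived on the page for only five minutes stays visible in the search result for almost a full week, simply because the crawler happened to pass by at the wrong moment.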

So can this problem be solved? Is there a way for Wikipedia editors or users to mark a page to be refreshed sooner? That’s a question for Google (in this case) and other search engines, as this feature seems to be missing. As of writing, the issue has persisted for two days and the search result still has not refreshed to reflect the changes.

Hi there. I happen to be a contractor for the Wikimedia Foundation, but I am just speaking for myself, not in any official capacity for the Foundation.

I’ve thought about the same issue for some time. There are ways to partially solve this problem — there are standards for alerting Google, and other search engines, to recently changed pages.
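The comment above does not name a particular standard. One that fits the description is the Sitemaps protocol, in which a site publishes an XML file listing its URLs together with a last-modified time. The sketch below assumes that protocol; the `build_sitemap` helper and the example URL are hypothetical, and search engines treat the file as a hint, not as a command to recrawl.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace defined by the Sitemaps protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap XML string from (url, last_modified) pairs.

    `last_modified` must be a timezone-aware datetime.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, modified in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # W3C datetime format in UTC, e.g. 2010-10-14T12:03:00+00:00
        stamp = modified.astimezone(timezone.utc).isoformat(timespec="seconds")
        ET.SubElement(url, "lastmod").text = stamp
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical entry: a page whose vandalism was just reverted.
sitemap_xml = build_sitemap([
    ("https://example.org/wiki/Some_Page",
     datetime(2010, 10, 14, 12, 3, tzinfo=timezone.utc)),
])
print(sitemap_xml)
```

Listing only recently reverted pages with a fresh `lastmod` would be one way to flag them, though how quickly any search engine acts on it is up to the search engine.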

That said, it is not clear to me that this is where the Foundation or the community should spend its (very limited) time and resources.

There are some side benefits to a more direct pathway from Wikipedia to search engines, in that it might lead to greater *coverage* of Wikipedia pages in Google, which is actually more interesting in my opinion.

It is interesting to see that in the article’s revision history a lot of vandalism has been reverted. All vandalism/reversions were on the 14th of October, thus very recently. It might be interesting to see how long Google keeps this vandalism in cache. Keep us posted :)

@Neil: I completely agree with you that Wikipedia’s limited time could be better spent elsewhere. Of course, regular content edits should just be left for the next update on search engines. However, when an entry contains a personal attack on an individual’s character and leaves behind a visible remnant in the search engine, this is not good for either the person’s reputation or Wikipedia’s image. Surely these cases can easily be marked for early refresh.

I would just like to clarify: when I made the comment about limited time & resources, I didn’t want to sound callous about the damage this can cause to someone’s reputation. The thing is, the problem occurs on non-Wikimedia websites, and it is as yet unclear that any amount of effort on our part could fix it to the degree that, say, the vandalism is gone in less than a day.

But I bet we can improve the current situation.

P.S. One wonders how many Wikipedia articles legitimately contain the word “dickbag”.

Simple. The Foundation needs to get off the dime and implement pending changes or flagged revisions, or anything else they want to call it which presents only approved versions of biographies and other target articles. Although some oppose this, many editors have weighed in in favor of doing this, and all we get are trials and endless discussion. I thought Wikipedia is “not a democracy” – it’s time to “be bold” and get on with it already. Tvoz

@Tvoz: The problem is that the community has a lot of resistance to pending changes. As much as I want it to happen, boldness only goes so far when you don’t have a clear consensus, and a major change in software is not something that an objecting user can just revert.

Talking about the relationship between Google and us, can someone help me with http://snurl.com/wpsottolinkmeta ? I read a few complaints from Italian readers about google.it showcasing the Youporn article when looking for “wikipedia”.