Thursday, October 25, 2012

After running the R script for the week 8 rankings, the first thing that struck me was the disparity in node size between the AFC on the left side of our graph and the NFC on the right.

Two weeks ago we wrote that the NFC West has been dominant so far this year. The NFC West has the best combined record, and its aggregate point differential puts the other divisions to shame. However, it is not just the West division: the entire NFC has dominated, out-performing the AFC at every turn. CBS Sports rates the NFC as head and shoulders above the AFC this year.

Our ranking system is based on Google's PageRank algorithm; it is explained in some detail in past posts. A directed graph is created to represent the current year's season. Each team is represented by a node in the graph. For every game played, a directed edge is created from the loser pointing to the winner, weighted by the margin of victory.

In the PageRank model, each link from a webpage i to a webpage j causes webpage i to give some of its own PageRank to webpage j. This is often characterized as webpage i voting for webpage j. In our system the losing team essentially votes for the winning team with a number of votes equal to the margin of victory. Last week the Giants beat the Redskins 27 to 23, so a directed edge from the Redskins to the Giants with a weight of 4 was created in the graph.
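The edge construction described above can be sketched as follows. This is an illustrative Python snippet, not the actual script (which is written in R); the game result is the one from the post.

```python
# Turn game results into the weighted, directed "voting" edges
# described above: loser -> winner, weighted by margin of victory.
games = [
    # (winner, winner_score, loser, loser_score)
    ("Giants", 27, "Redskins", 23),
]

edges = {}  # (loser, winner) -> margin of victory
for winner, ws, loser, ls in games:
    edges[(loser, winner)] = ws - ls  # the loser "votes" for the winner

print(edges[("Redskins", "Giants")])  # 4
```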

The season so far can be visualized in the following graph.

The PageRank algorithm is run and all of the votes from losing teams are tallied. Each node is given a final ranking, which is represented by the size of the node in the graph. This algorithm does a much better job of taking strength of schedule into account than many other ranking systems, which are essentially based on win-loss ratios. Barring injuries or other problems, it is a good guess that Houston will be representing the AFC once the playoffs are complete. The real question is which team from the NFC will rise to the surface to take them on in the Super Bowl.
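For readers curious about the mechanics, here is a minimal, self-contained sketch of weighted PageRank via power iteration over (loser, winner, margin) edges. The teams and scores below are made up for illustration; they are not actual 2012 results, and the real rankings come from our R script.

```python
# Weighted PageRank by power iteration over "loser votes for winner" edges.
def pagerank(edges, alpha=0.85, iters=100):
    """edges: list of (loser, winner, margin) tuples."""
    teams = sorted({t for loser, winner, _ in edges for t in (loser, winner)})
    n = len(teams)
    out_weight = {t: 0.0 for t in teams}       # total margin of each team's losses
    for loser, _, margin in edges:
        out_weight[loser] += margin
    rank = {t: 1.0 / n for t in teams}
    for _ in range(iters):
        new = {t: (1 - alpha) / n for t in teams}
        for loser, winner, margin in edges:
            # the loser passes rank to the winner, proportional to the margin
            new[winner] += alpha * rank[loser] * margin / out_weight[loser]
        # undefeated teams (no outgoing edges) spread their rank evenly
        dangling = sum(rank[t] for t in teams if out_weight[t] == 0)
        for t in teams:
            new[t] += alpha * dangling / n
        rank = new
    return rank

# Illustrative chain of results: each loser points at the team that beat it.
games = [("Redskins", "Giants", 4), ("Giants", "Eagles", 2), ("Eagles", "Cowboys", 3)]
scores = pagerank(games)
best = max(scores, key=scores.get)
print(best)  # Cowboys: the only undefeated team in this toy chain
```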

Thursday, October 11, 2012

It is now five weeks into the 2012 season, and things are starting to come into focus. The topic of many online discussions is this year's performance of the NFC West compared to last year's. The NFC West is one of the best-performing divisions so far this year, which is a far cry from last year. The division is certainly doing well in our ranking system.

Our ranking system is based on Google's PageRank algorithm. It is explained in some detail in past posts. A directed graph is created to represent the current year's season. Each team is represented by a node in the graph. For every game played, a directed edge is created from the loser pointing to the winner, weighted by the margin of victory.

In the PageRank model, each link from a webpage i to a webpage j causes webpage i to give some of its own PageRank to webpage j. This is often characterized as webpage i voting for webpage j. In our system the losing team essentially votes for the winning team with a number of votes equal to the margin of victory. Last week the Falcons beat the Redskins 24 to 17, so a directed edge from the Redskins to the Falcons with a weight of 7 was created in the graph.

The season so far can be visualized in the following graph.

The PageRank algorithm is run and all of the votes from losing teams are tallied. Each node is given a final ranking, which is represented by the size of the node in the graph. Used in this fashion, the PageRank algorithm has the nice effect of accounting for strength of schedule. This should be of interest to Houston Texans fans: the majority of NFL power-ranking sites currently have Houston ranked number one, yet a glance at the schedule for the past five weeks shows that Houston has had a pretty easy season so far. They have played well, and this week's game against Green Bay should be a good one.
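The strength-of-schedule effect can be seen in a toy example. Below, teams A and D (made-up names) are both 1-0, but A beat a team that itself has a win, while D beat a winless team. A few rounds of un-normalized vote propagation (a deliberate simplification of full PageRank) show A pulling ahead of D.

```python
# Toy illustration: rank flows transitively, so beating a team that has
# itself won games earns more credit than beating a winless team.
rank = {t: 1.0 for t in "ABCDE"}
losses = {"B": "A", "C": "B", "E": "D"}  # loser -> winner, equal margins
for _ in range(2):
    new = {t: 0.15 for t in rank}        # damping-style baseline
    for loser, winner in losses.items():
        new[winner] += 0.85 * rank[loser]  # loser passes rank to its conqueror
    rank = new
print(rank["A"] > rank["D"])  # True: A's win came against a stronger opponent
```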

Wednesday, October 10, 2012

In our current research, the WS-DL group has observed leakage in archived sites. Leakage occurs when archived resources include current content. I enjoy referring to such occurrences as "zombie" resources (which is appropriate given the upcoming Halloween holiday). That is to say, these resources are expected to be archived ("dead") but still reach into the current Web.

In the examples below, this reach into the live Web is caused by URIs contained in JavaScript not being rewritten to be relative to the Web archive; the page in the archive is not pulling from the past archived content but is "reaching out" (zombie-style) from the archive to the live Web.

We provide two examples with humorous juxtaposition of past and present content. Because of JavaScript, rendering a page from the past will include advertisements from the present Web.

When we observe the HTTP requests made while loading the mementos, there is evidence of reach into the current Web. We stored all of the HTTP headers from the archive in a text file for analysis. The requests should all be for other archive.org resources; however, we see requests for live-Web resources:
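A hypothetical sketch of the kind of filtering involved: given a list of request URIs (the function name and sample URIs below are assumptions for illustration, not the actual data from our study), flag any whose host is not part of the archive.

```python
# Flag request URIs that "escape" the archive to the live Web.
from urllib.parse import urlparse

def find_leaks(request_uris):
    """Return URIs whose hostname does not end in archive.org."""
    leaks = []
    for uri in request_uris:
        host = urlparse(uri).hostname
        if host and not host.endswith("archive.org"):
            leaks.append(uri)
    return leaks

requests = [
    "http://web.archive.org/web/2008/http://www.cnn.com/",  # in-archive
    "http://ads.cnn.com/js/ad.js",                          # live-Web "zombie"
]
print(find_leaks(requests))  # ['http://ads.cnn.com/js/ad.js']
```

A real analysis would need a stricter host check (this naive suffix test is only a sketch), but it conveys the idea: any request leaving the archive's domain is a candidate leak.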

During our investigation of these zombie resources, we observed that this leakage of live content into archived resources is not consistent. We noticed that some versions of some browsers would not produce the leakage; this is potentially due to the browsers' different methods of handling JavaScript and Ajax calls. In our experience, older browsers have a higher percentage of leakage, while the newer browsers demonstrate the leakage less frequently.

The CNN and IMDB mementos mentioned above were rendered in Mozilla Firefox version 3.6.3. Below are two examples of our CNN and IMDB mementos rendered in Mozilla Firefox 15.0.1. Note that the examples below attempt to load the advertisements but produce a "Not Found In Archive" message.

CNN.com memento rendered in a newer browser with no leakage.

IMDB.com memento rendered in a newer browser with no leakage.

When analyzing the headers with these new browsers, we get fewer requests for live content. More importantly, we get different requests than we saw in the other browsers:

These mementos still attempted to load the wrong resources, albeit unsuccessfully. Essentially, these mementos are shown as incomplete instead of incorrect (and without our humorous results). The exact relationship between browsers, mementos, and zombie resources will require additional investigation before we can establish a cause of and solution for these leakages.

The "Popular on Facebook" section of the page has activity from two of my "friends." The page that was shared was the 10 questions for Obama to answer page, which was published on October 1, 2012 and is shown below. It should be obvious that my "friends" should not have been able to share a page that had not yet been published (2012-09-09 occurs before 2012-10-01). So, the WebCite page allows live-Web leakage into the cnn.com memento.

Live cnn.com resource

Such occurrences of leakage and zombie resources are not uncommon in today's archives. Current Web technologies such as JavaScript make a pure, unchanging capture difficult in the modern Web. However, it is useful for us as Web users and Web scientists to understand that zombies do exist in our archives.