So for grins I implemented pagerank on Wikipedia (this was actually last month). I figured I’d share my results (and code, although it’s kind of a hack job) in case anyone was curious how it turned out.

There are two folders: pagecount/ contains the code to download the November page view data and aggregate it. pagerank/ contains all the code to parse the Wikipedia enwiki dump and run pagerank on the result.

If you run:

gcc -O2 sparsematrix.c pagerank.c main.c -lpthread -lm -o pagerank

inside pagerank/, you will get an executable called “pagerank”, which will run the pagerank algorithm on an arbitrary network specified in the file “network”. There’s a small test network in the file testdata, which is the one Dr. Cline showed us in class (in the slides). It’ll output the pagerank data as a vector, one element per line, into the file vector-out.

The file toolchain.sh shows how I ran it. On the Wikipedia data you need about 20GB of disk space and 16GB of RAM to run everything comfortably. It ran in about 2-3 hours.

The results file has three columns: the name of the article, the expected rank according to pagerank, and the actual rank based on page view data from November 2014.

The pagerank algorithm works by multiplying the Markov matrix by a vector 1,000 times. I’m not sure if that’s enough, but the eigenvector wasn’t changing by a measurable amount (I measured the angle between successive iterations), so I assumed it had converged.

Okay, now that the technical stuff is over, some pictures. I made a plot where X is the expected rank from pagerank and Y is the actual rank in November, then made a density chart out of it. In this image, red means more points and blue means fewer:

Odd little graphic (the Z axis is the log of the number of points in each cell). You would expect it to be square, except that a lot of pages didn’t have empirical page view data attached to them, so they had to be cut out. The scale is 512 = 16,000,000: each axis is divided into 512 bins, so one bin covers roughly 31,000 ranks.

Ideally, if the pagerank algorithm perfectly predicted the actual page ranks, then the graph would be white except for a bright red line along Y=X.

We don’t see that, exactly. We see a sort of plume in the lower left corner, which in general follows a line. This plume shows that pagerank in general got the higher-ranked pages right, and in general didn’t guess that unpopular pages would be popular or vice versa.

Over towards the right, we see a few bands of lighter blue. These are presumably because certain pages on Wikipedia aren’t linked to often, and aren’t visited very often either (it’s not hard to think of an example or two). I can imagine there are clusters of pages like this – perhaps bridges in Iran, or gymnastics competitions in China. These clusters would form those vertical bands.

Anyway, here’s a zoomed-in image of the plume:

As you can see, it’s slightly more likely for a page that’s popular in reality to get ranked lowly by pagerank than for a page pagerank expects to be popular to get ranked lowly in reality. (Remember, lower ranks are more popular.)

I would imagine a lot of the error comes from Wikipedia’s traffic being driven primarily by what’s happening in the news, rather than by the network effects that drive traffic on the rest of the internet.

Anyway, I guess the result of all this is that pagerank actually works. It may still be magic, but it’s magic that actually works. You also get a working sparse-matrix pagerank program out of it, if you ever need one.

Someone will consume about 94 pounds of provisions over the course of 10 days. Give or take.

Hypothetically, if you had to ship a friend to Italy, it would take 10 days by UPS, and the box would weigh roughly 224 pounds. If they fit inside a 3 foot cube box, which would be a bit cramped but survivable, then the billable weight is 282 pounds. As long as you stay under that, the trip from Austin, TX, USA to Salerno, Italy would cost $553 and take about 8 days.

That’s assuming the person in question is worth 1 dollar. The EPA values a human life at $5.5 million, but UPS won’t insure that much.

If I value my shippable friend at $1 million, the price rises dramatically to $8,996. An airline ticket from the US to Rome costs about $1,500, so the break-even point is when the person in question is worth about $120 thousand; at that point the air freight would also cost roughly $1,500.

I’m in a computer architecture class at UT. This is fine and dandy. Recently we got a homework/lab, and some people started noticing that one of the reference functions didn’t perform correctly on all machines. Here’s the offending code:
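Reconstructed from memory, it was essentially this (variable names approximate):

```c
/* Compute the smallest and largest n-bit two's-complement values,
   then range-check x against them. */
int test_fitsBits(int x, int n)
{
    int min_n = -(1 << (n - 1));      /* e.g. n = 4: -8 */
    int max_n = (1 << (n - 1)) - 1;   /* e.g. n = 4:  7 */
    return x >= min_n && x <= max_n;
}
```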

This function is supposed to return 1 if the number x can be represented as a two’s-complement number with n bits, and 0 otherwise. For instance, test_fitsBits(3, 3) = 1 and test_fitsBits(-3, 2) = 0.

Can you spot the error? It took me an hour to determine what the error was and then another hour to determine that the error wasn’t a compiler error. If you enjoy hunting for bugs like you enjoy a good murder mystery novella, read on. Otherwise, you might want to stop here, because this is the tale of how I learned that two’s complement is not required by the C spec.

So I’m getting errors on Ubuntu when I compile with clang 3.5 using “-O -Wall”, but not when I use gcc 4.8 on the same machine. So there’s gotta be some sort of optimization thing going on.

Oh, but check this out. From tests.c, here is what GCC’s compiled fitsBits function ends up doing:

So it’s checking to see if 1>=-2147483648 and getting 1, which is good. Then it checks to see if 1<=2147483647, but it’s getting 0. What gives?

Alright, back to the assembly output. Some differences between them:

LLVM’s check is setting %al if %eax > %edi.

GCC sets %al if %edi <= %eax.

So let’s say n is 32. Then %cl is 31. %eax becomes 1 shifted 31 times, or 0x80 00 00 00. This is where trouble begins – keen eyes will notice that this is a large negative number. INT_MIN, to be exact. GCC subtracts 1 from this number to get 0x7F FF FF FF, which is a large positive number – in fact, all 32 bit numbers are less than this. However, LLVM optimized away the -1 and did the comparison directly against a large negative number.

This leads us to the simplest reproducible manifestation of the bug: Subtracting 1 from a number, and checking to see if it’s <= something. Like so:
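Something like this (a sketch, not the exact datalab expression):

```c
/* With x == INT_MIN, (x - 1) overflows -- undefined behavior -- so the
   compiler may rewrite "x - 1 <= limit" however it pleases. At -O0 you
   will likely see the wrapped-around answer; at -O2 you may not. */
int minus_one_fits(int x, int limit)
{
    return x - 1 <= limit;   /* UB when x == INT_MIN */
}
```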

Quick digression: C99 only guarantees up to 63 nesting levels of parentheses within a full expression or declarator, and only 63 significant initial characters in an internal identifier or macro name. So if you have two macros, each named 64 a’s followed by a “1” or a “2”, respectively, then according to the C99 specification they have the same name. Also, you can’t count on more than 1023 members in a single structure or union.

Un-digress. Note that the C99 minimum widths are: ints are at least 16 bits, longs are at least 32 bits, and long longs are at least 64 bits.

Also note that unsigned overflow is defined to wrap modulo 2^N. We knew that.
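That one you can actually rely on:

```c
#include <limits.h>

/* Unsigned arithmetic is defined to wrap modulo 2^N -- no UB here. */
unsigned int add_u(unsigned int a, unsigned int b)
{
    return a + b;   /* UINT_MAX + 1 is exactly 0, by definition */
}
```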

If you have a copy of ISO/IEC 9899:1999, you can follow along. Note in section 6.2.6.2(2) that two’s complement is not required for representing integers. If we go back to section 5.2.4.2.1(1) we see that the limits for int are -32767 to 32767, where a 16-bit two’s complement representation would be -32768 to 32767.

According to the definition of addition (6.5.6(5)), “the result of the binary + operator is the sum of the operands.” Thank you, I didn’t know that before I read that. I feel enlightened now.

But then I found my search button, and went to 3.4.3(3), an example of undefined behavior. They say that “an example of undefined behavior is the behavior on integer overflow.”

Go back to 6.2.5(9), which says that “a computation involving unsigned operands can never overflow.”

Down at the very bottom, section H.2.2(1) states that “The signed C integer types […] are compatible with LIA-1. […] An implementation that defined signed integer types as also being modulo need not detect integer overflow.” (emphasis added)

Alright, so what’s the takeaway? C99 doesn’t define how signed integer overflow works. So all that stuff we learned in class may or may not apply, and you just need to know when it does and when it doesn’t. Not that anyone’s read this far.

As we can clearly see, 10% of resolutions make up 75% of images on Flickr.

Also, the average size of a Flickr image is 1.7MB. They recently announced that they have 6 billion images, for a total of 10PB of images. 10PB of images would use 1,700 WD60EFRX 6TB drives, which retail for $300 each (and yes, you can buy 6TB drives off the shelf). The total cost would be $510,000 for a single set of drives, not counting the computers to run them, and different image resolutions, and replica sets, and so forth. Google did a study in 2007 which found that hard drives which are constantly spinning have an expected lifespan of around 5 or 6 years (you have to do some extrapolating to get that, and it’s an estimate). That comes out to 24 expected failures per month, or $7,200 per month in replacements.
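Spelled out, the back-of-envelope math above is just:

```c
/* Back-of-envelope helpers for the numbers above (rough 2014 figures). */

double total_petabytes(double images, double avg_bytes)
{
    return images * avg_bytes / 1e15;                  /* ~10.2 PB  */
}

double drives_needed(double images, double avg_bytes, double drive_tb)
{
    return images * avg_bytes / (drive_tb * 1e12);     /* ~1700     */
}

double monthly_replacement_usd(double drives, double lifespan_months,
                               double drive_cost)
{
    return drives / lifespan_months * drive_cost;      /* ~$7,200/mo */
}
```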

We can assume that Flickr has around 10 datacenters, as a sort of fermi approximation (and CDN caching, etc). If the computers to run the drives cost about as much as the drives themselves (which seems reasonable), then we get a total bill of $10 million to set up all of the datacenters, plus $72,000 per month in replacement drives, plus $1 million/yr in staff and $1 million/yr in housing.

Amazon S3 sells storage at $0.0275/GB/month for customers storing over 5PB. That comes out to $280,500 per month, or $3.4 million per year. Slightly cheaper than the in-house datacenters.

What is the point of all this? Compression. Every 1% of compression that they can eke out of their images saves them at least $27,000 per year, and that’s a number that’s only growing (unless Flickr folds). So if you can manage to make a JPEG 3% smaller, you could make a hundred grand a year for the rest of your life off the savings.

Flickr is but a small fish in a big pond. According to Quora, Facebook currently has 90 billion pictures, 15x the size of Flickr, and adding 6 billion pictures per month. Someone is uploading all of Flickr to Facebook every month. Using our numbers from above, Facebook spends $5 million per month on new hard drives to put pictures on, not counting replacements. A 1% improvement saves them $50,000 per month.

Note that the .def.js file is just a NodeJS file, like any other. It gets called with the global PERF_UTILS_PATH variable set, which defines the path to the perf utils file. This file contains several useful utilities, such as mkpoisson. You’ll see how that’s useful later.

Instances

var ClientInstance = function() {
    this.client_id = "";
};

This defines what the persistent per-instance state is. perf-rest works by maintaining a whole bunch of instances, and moving them between different states according to rules that we’ll define below. One way to think of this is sort of like cookies in a web browser: your browser may make multiple requests to a web site, but the cookies stay roughly constant. Here, I’ve defined a variable “client_id”, a unique ID which the server gives to us.

Requests

This is where we start defining the different possible requests that can be made to the REST server in question. Note that this is not a complete request: it is only a base request which other requests build upon, much like in an object-oriented language (I realize that my audience is going to be more and more Java-oriented, sadly, so you can think of it like Java, where classes can subclass other classes).

Whoa! I was just talking about subclassing requests! And here we are. An “authenticated” request has all of the fields of a “base” request.

It also has the oncall function defined. This function is run when the request is run, and is given the instance state. That state is used to create a header. After the oncall function runs, the headers are merged with the headers defined in the base request. Headers are the only thing that gets merged; everything else is overwritten. For instance, if the authenticated request had specified a hostname, it would overwrite the one in its parent.
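To make that concrete, a base/authenticated pair might look like this (a sketch: the hostnames and paths are made up, and the "parent" field name for the subclassing hook is my guess):

```javascript
var base_request = {
    hostname: "api.example.com",              // hypothetical server
    headers: { "Accept": "application/json" }
};

var authenticated_request = {
    parent: base_request,                     // "subclasses" the base request
    path: "/v1/profile",                      // hypothetical endpoint
    // Run when the request fires; receives the instance state and
    // produces headers, which get merged with the base's headers.
    oncall: function (instance) {
        return { headers: { "X-Client-Id": instance.client_id } };
    }
};
```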

Notice the new onfinish function. This function is called with the parsed response from the request. Note that it’s parsed as JSON: If the remote server doesn’t respond in JSON, then that’s so sad for you. File an issue on Github and I’ll fix it.

The onfinish function, as you can see here, can be used to update state from the server’s response.
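For example (again a sketch; the argument order to onfinish is my assumption):

```javascript
var login_request = {
    path: "/v1/login",                        // hypothetical endpoint
    // Called with the instance and the JSON-parsed response body;
    // here we stash the server-assigned ID back into the instance state.
    onfinish: function (instance, response) {
        instance.client_id = response.client_id;
    }
};
```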

States

There are three special states: preinit, init, and exit. You will see each of these.

Preinit specifies the delay between when the program starts and when the init state is called. Note here the mkpoisson we noticed before. “delay” can be either a number or a function: If it’s a number, then that exact literal number of seconds is used as the delay. If it’s a function, then that function is called and the return value is used as the number of seconds in the delay. What mkpoisson(x) does is return a function which returns a poisson random number centered on x. That is, by setting delay to be “mkpoisson(60.0)”, we have said that the delay for preinit is a poisson-distributed random number centered on 60.0.
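For the curious, mkpoisson is only a few lines to write yourself; this is my sketch of the idea (Knuth's method), not the actual perf-utils source:

```javascript
// Returns a zero-argument function producing Poisson-distributed
// random integers with mean lambda (Knuth's multiplication method:
// count how many uniform draws it takes for the product to drop
// below e^-lambda).
function mkpoisson(lambda) {
    return function () {
        var L = Math.exp(-lambda);
        var k = 0, p = 1;
        do {
            k += 1;
            p *= Math.random();
        } while (p > L);
        return k - 1;
    };
}
```

Knuth's method takes O(lambda) work per sample, which is fine for delays on the order of a minute.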

Init is the second of the three special states. When an instance enters the “init” state it is said to be “active” or “alive”, and it appears in the output’s instance (“client”) count.

That’s the only special thing about init. Otherwise, it’s just a normal state. “request” specifies which of the above requests gets used. The request gets sent once, when the state is entered. If you don’t want to send a request, use the placeholder “noop”.

“transition” defines which state we transition to after the request completes. It is a list of probabilities, delays, and destinations. One of these objects is picked from the list according to the probability distribution defined by the “prob” fields. Note that the “prob” fields must sum to 1, or else undefined behavior occurs.

Once one of the paths is selected, a delay is inserted according to the “delay” field. Obviously, in this instance, the “delay” field is omitted, so the delay is taken to be zero (that is, an immediate transfer to the next state). The “destination” field specifies the name of the state to move to.

Here, we can see the delay field in use. After the request “perform_expensive_action” is performed, there is a 95% chance that, after a delay (a poisson-distributed random number centered at 30.0), the instance will move to the “delay_action” state. Which, by happenstance, is this state. So it’s a loop. With 5% probability, the instance immediately moves to the “exit” state. While fairly self-explanatory, I feel I should spell it out: when the exit state is reached, the instance is no longer included in the output’s count of instances/clients, and nothing more happens to it.
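Putting the pieces together, the states table for the loop just described might read like this (a sketch, with mkpoisson stubbed out so it stands alone; the request names are the hypothetical ones used above):

```javascript
// Stub so this sketch is self-contained; the real mkpoisson comes
// from the perf utils file and returns random delays.
function mkpoisson(x) { return function () { return x; }; }

var states = {
    preinit: {
        delay: mkpoisson(60.0)          // startup delay, ~60 seconds
    },
    init: {
        request: "login",               // fires once on entering the state
        transition: [
            { prob: 1.0, destination: "delay_action" }
        ]
    },
    delay_action: {
        request: "perform_expensive_action",
        transition: [
            // 95%: wait ~30s, then loop back to this same state
            { prob: 0.95, delay: mkpoisson(30.0), destination: "delay_action" },
            // 5%: no delay field means zero delay -- exit immediately
            { prob: 0.05, destination: "exit" }
        ]
    }
};
```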

This exports the requests, states, instance definition, and the number of clients.

Conclusion

Here I’ve described in pretty good detail my perf-rest project’s definition files. The actual usage of the program is described on the GitHub page. If you have any questions or comments, please feel free to email me or track me down or file an issue with the GitHub issue tracker. Also, pull requests are cool. Or requests for support. Or just drop a line and say you used it.

Recently, a group of researchers at CMU developed a program which uses off-the-shelf 3d models to edit images in 3d. I’m terrible at explaining, so here’s a link to the thing: (also so I don’t lose it…)