Introduction

So I once got it into my head that I needed to figure out how to make a render farm as cheaply as possible. This was somewhat idle, because I think I knew, even then (a year or so ago), that I wouldn't have time to put a real render farm to good use. However, the idea stuck with me, and now I'm doing some data visualization on a Friday night.

Making Of

I should quickly pay homage to my predecessor: this guy over at forre.st did some similar analysis on storage, and partially inspired my efforts here.

Back when I first had this idea, I decided that I would try to scrape Newegg.com for the data I needed. Like a n00b, I failed to look at existing scraping options (like scraperwiki) and went straight for the throat with urllib and BeautifulSoup. Eventually, I figured out that I needed to keep my scraped data somewhere, and decided on keeping it in an SQLite database and accessing the data with SQLAlchemy. I figured that, since I had access to a MySQL database on my server, I could easily switch to that later if I put SQLAlchemy in between.

Something striking is that the earliest GitHub commit for that particular project was made almost exactly a year before this page first went public, at the beginning of June 8th. Took me long enough to get this out the door.

At some point, I became fed up with trying to wrestle Google's chart API into displaying my data in a nice manner, and decided that by golly, if Google couldn't do a scatterplot right, then I would. That was another GitHub repo, called Chartalastical! (or Chartalastical-bang). If you've been reading around my site, you would probably agree that I seem to have gotten better at naming things. Maybe not.

After some time, the project lost steam, and I dropped it in favor of not getting an F at school. After said school ended for the summer, and in between looking frantically for a job, I decided that I should finish something; something easy. And well, this was an easy enough project. I had just never finished it.

Rooting around my repo, I had several thoughts:

Using my own plotting solution wasn't going to cut it, and the likes of Protovis just ate my lunch, and then threw it in my face. Also, Google's chart API was more mature at that point, and actually usable.

However, I wouldn't even have to scrape, because this guy figured out that Newegg had a JSON API publicly available. Of course, it wasn't documented anywhere, but he did a fantastic job of figuring it out. Now, data was only an HTTP request and an import json away!
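To give a feel for how little code "an HTTP request and an import json" amounts to, here is a minimal sketch in modern Python 3 (not the 2.x code from back then). The function name fetch_json and the opener parameter are my own inventions for illustration, and the URL in the usage note is a placeholder, not the real Newegg endpoint.

```python
import json
from urllib.request import urlopen


def fetch_json(url, opener=urlopen):
    """Fetch a URL and decode the response body as JSON.

    `opener` is a hypothetical hook that defaults to urllib's urlopen;
    passing a stand-in makes it possible to try the function offline.
    """
    with opener(url) as response:
        # Assumes the server sends UTF-8 encoded JSON.
        return json.loads(response.read().decode("utf-8"))
```

Calling something like fetch_json("http://example.invalid/cpus.json") would return ordinary Python lists and dicts, ready for wrangling.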

And then I found this other guy, Paul Tarjan, who had provided his own JSON of CPU data pulled from Newegg. In addition to gathering CPU data into a nice list, he also pulled data from CPUBenchmark.com, paired each CPU with its score, and made the entire JSON blob public.

Man, life was good. So I grabbed some data from Paul, sat down with the very useful Protovis Primer, and got to wrangling. What resulted (in 2 days!) is below.

Data - Just CPU

This data only includes CPU prices and data: a fuller analysis would include the price of the rest of the supporting system, since that's a non-trivial amount of money. If I have time (ha!) or motivation (HA!) I'll either try adding in constant/proportional amounts to the price tag, or fetch actual Newegg data (instead of using Paul's data) and try to do some figuring of which RAM/mobo one would use with each CPU.

For each plot, the blue dots are Intel CPUs, and the red dots are AMD CPUs. Deeper color means a higher performance score from CPUBenchmark.com, and the size corresponds to the number of cores in the CPU. If you hover your mouse over a dot, a tooltip will pop up showing that CPU's name. Clicking the dot will send you to the Newegg.com product page. Sorry guys, haven't figured out how to make it open in a new tab yet...
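The encoding described above boils down to a small mapping from a CPU record to dot attributes. The plots themselves were made with Protovis (JavaScript), so the following Python is only a sketch of the idea. The field names (vendor, score, cores), the opacity range, and the size factor are all assumptions of mine, not values from the actual plots.

```python
def dot_style(cpu, max_score):
    """Map one CPU record to the visual attributes of its dot.

    Assumes `cpu` is a dict with hypothetical keys "vendor", "score"
    and "cores"; `max_score` is the highest score in the data set.
    """
    # Blue for Intel, red for AMD (anything non-Intel is treated as AMD).
    hue = "blue" if cpu["vendor"] == "Intel" else "red"
    # Deeper color for higher scores: opacity runs from 0.2 up to 1.0.
    opacity = 0.2 + 0.8 * (cpu["score"] / max_score)
    # Dot size grows with the number of cores (arbitrary factor of 20).
    size = 20 * cpu["cores"]
    return {"color": hue, "opacity": round(opacity, 2), "size": size}
```

A quad-core Intel chip scoring half of the maximum would come out as a blue dot of size 80 at 0.6 opacity.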

Performance vs. Price

This first plot is pretty straightforward. At the time of this writing, the i7 CPUs from Intel lead in performance and price.

Performance vs. Cores

I use a jittering technique that I read about in Visualizing Data by Cleveland, which is a very good guide to, well, visualizing data. So yes, that Phenom does not in fact have 6.2 cores. Sorry to disappoint.
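For the curious, jittering just means nudging each point by a small random amount so that dots sharing the same core count don't pile up on top of each other. A minimal sketch, assuming uniform noise and a width of 0.3 (both my choices for illustration; the plots here may use different values):

```python
import random


def jitter(value, width=0.3, rng=random):
    """Return `value` plus uniform noise drawn from [-width, +width]."""
    return value + rng.uniform(-width, width)


# Seeded generator so the same layout comes out on every run.
rng = random.Random(0)
core_counts = [2, 4, 4, 6, 6, 6]  # made-up core counts for illustration
jittered = [jitter(c, rng=rng) for c in core_counts]
```

Keeping the width well under 0.5 means a jittered hex-core can never wander far enough to be mistaken for a neighboring core count.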

At the time of this writing, hex-core i7s haven't been picked up into Paul's data set, so the only hex-cores are AMD. Interestingly, AMD and Intel each seem to have a linear relationship between performance and core count, with a much steeper slope for Intel.

Cost/Core vs. Price

This plot is somewhat counterintuitive, because better chips are closer to the bottom (if you care only for core count). At the time of this writing, it is striking just how linear each core-progression is.

Cost/Performance Point vs. Price

Substituting performance points for core count results in a tad more chaotic plot. Again, this plot is counterintuitively upside down with regard to which chip is "better". At the time of this writing, the Opterons are throwing off the data set, so it's harder to see interesting trends.

Performance Points/Dollar vs. Price

Flipping the evaluating function, now higher is better. At the time of this writing, cheap quad-cores from AMD are winning.
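The evaluating functions behind this plot and the previous two are plain ratios. Here is a sketch of them, assuming each CPU is a dict with hypothetical keys price, cores, and score; the sample record is made up for illustration and is not a real chip.

```python
def cost_per_core(cpu):
    """Dollars paid per core: lower is better."""
    return cpu["price"] / cpu["cores"]


def cost_per_point(cpu):
    """Dollars paid per benchmark point: lower is better."""
    return cpu["price"] / cpu["score"]


def points_per_dollar(cpu):
    """Benchmark points bought per dollar: higher is better."""
    return cpu["score"] / cpu["price"]


# Made-up record, for illustration only.
example = {"price": 200.0, "cores": 4, "score": 5000}
```

Since points_per_dollar is just the reciprocal of cost_per_point, flipping the function turns the plot right side up without changing how the chips rank against each other.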

Performance Point/Dollar vs. Cores

Same evaluating function as last time, but grouped into cores.
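Grouping into cores amounts to bucketing the points-per-dollar values by core count. A rough sketch, again assuming hypothetical price, cores, and score keys on each record:

```python
from collections import defaultdict


def group_by_cores(cpus):
    """Bucket each CPU's points-per-dollar value under its core count."""
    groups = defaultdict(list)
    for cpu in cpus:
        groups[cpu["cores"]].append(cpu["score"] / cpu["price"])
    return dict(groups)
```

Each key of the result is a core count, and each value is the list of points-per-dollar figures that land in that column of the plot.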

Cores/Dollar vs. Cores

Now, replacing performance points with cores, just for a twist.

Data - CPU Extension

Too lazy now, might update later with MOAR VISUALIZATIONS

Other Stuff

Like my predecessor Paul, I'm going to offer my JSON up to everyone. I update it daily, because updating more often is just kind of pointless.

I'll also offer up my update code, which takes Paul's JSON and keeps only what I need. It's a Python script that doesn't require any strange libraries, tested with 2.7 on Ubuntu 11.04. If you're on 2.4, or somewhere else where you don't have access to the Python json library (introduced in 2.6), then you'll need simplejson. MIT Licensed, because really, it's a couple of lines of code.
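This is not the actual script, but the general shape of "take a JSON file and keep only what's needed" looks something like the sketch below. It is written for Python 3, assumes the source is a top-level JSON list of records, and the field names in WANTED are hypothetical stand-ins rather than the real keys in Paul's data.

```python
import json

# Hypothetical field names; the real data may use different keys.
WANTED = ("name", "price", "cores", "score")


def trim(records, wanted=WANTED):
    """Keep only the wanted keys, skipping records that lack any of them."""
    trimmed = []
    for record in records:
        if all(key in record for key in wanted):
            trimmed.append({key: record[key] for key in wanted})
    return trimmed


def update(source_text):
    """Parse a JSON list of records and return the trimmed list as JSON."""
    return json.dumps(trim(json.loads(source_text)))
```

Skipping incomplete records, rather than filling in defaults, keeps half-described chips from showing up as bogus dots in the plots.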