Putting a PC under your TV to play your pictures, movies, and all that jazz, sounds like an excellent idea until you actually try and implement it. I’ve been using a full blown PC since 2004 to record TV and stream media from my server….but it’s been hell.

Problem #1: Choosing the right hardware
Back in 2004, I thought that putting a PC under the TV to play media would be a fantastic idea. I started to plan a bunch of parts and on paper, it looked great. Being slightly more clueless than I am today, I didn’t do much research….which led to me buying a P4 LGA775 processor (the hottest of all Intel processors) and a set of loud components.

The TV tuner card (a Hauppauge PVR-MCE-350) was also very average. It’s an analogue card, but the quality of video it outputs is significantly worse than plugging it directly into the TV (via the same connector). Fast forward to today and I still use the same card. It works OK, but in the world of digital, the card needs updating.

This leads us to the next problem: working with a FOXTEL box. In 2004, FOXTEL digital was introduced. Ironically, the set top box provided by FOXTEL does not output any digital video (only audio). FOXTEL boxes have only recently started outputting digital video, with the introduction of FOXTEL HD channels. That brings up a new problem. It’s impossible to create a “purely digital” media center PC…where there’s digital input and digital output. Why? Well firstly, HDMI is protected by HDCP (and any capture cards that bypass HDCP are “illegal” or need heavy licensing fees). Secondly, it takes a beefy PC to capture and view HD video at the same time. Currently, the only way to capture HD video from FOXTEL would be via analogue component inputs. Capture cards that do this are reasonably expensive…$300+. The alternative is CableCard, a technology that has never taken off despite being amazing and around for several years.

There’s also the problem of IR blasting. This isn’t FTA Digital/Analog, where you tune into a frequency depending on a channel. The capture card sees FOXTEL as a single channel, so every time you want to change the channel you must blast infrared out to the FOXTEL box. I’ve found that this can often be inaccurate and laggy.

Problem #2: Choosing the software
Now the software. Generally, you’d use Windows Media Center. It comes with Windows. For a simplistic Media Center, it works pretty well. However, it’s hard to extend and skin. Windows 7 media center is particularly average when it comes to UI.

The alternatives? MythTV. XBMC. MediaPortal. MythTV is out, fuck Linux. XBMC is out, no capture card support. This leaves MediaPortal, an open source media center that’s highly extensible and skinnable. MediaPortal, by default, looks like balls but it has a ton of cool features. Adding the StreamedMP skin and a couple of plugins makes MediaPortal look and feel like XBMC. It has excellent UIs for TV Shows and Movies (with a bit of work).

The downside? It’s buggy as fuck. Firstly, the StreamedMP skin doesn’t support 4:3. I guess this isn’t a big deal for most, alas, the parents refuse to venture into a world of new technology. This means you have to do a crapton of skin modifying to fit things on the screen. Secondly, I had to go ahead and check out the source from SVN to determine what was hanging MediaPortal for 2 minutes. It turns out MediaPortal set no timeout of its own for devices like card readers. If there’s nothing in the card reader, there’d be an extended I/O timeout which would hang MediaPortal. It’s an easy source fix, but there are bugs like this that just make MediaPortal very unfriendly (without a helluva lot of work).
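To give an idea of what that kind of fix looks like, here is a minimal sketch in Python. It is not MediaPortal’s code (MediaPortal isn’t written in Python) and every name in it is made up for illustration. The idea is simply to run the slow device check in a worker thread and give up after a short, bounded wait instead of letting the whole UI sit there.

```python
import concurrent.futures
import time


def probe_with_timeout(probe, timeout_s):
    """Run a blocking probe() in a worker thread.

    Returns the probe's result, or None if it takes longer than timeout_s.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(probe)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return None
    finally:
        # Don't wait for a stuck probe; the worker finishes in the background.
        executor.shutdown(wait=False)


def slow_empty_card_reader():
    # Hypothetical stand-in for an empty card reader that blocks on I/O.
    time.sleep(0.3)
    return "ready"


def fast_drive():
    # Hypothetical stand-in for a drive that answers straight away.
    return "ready"


if __name__ == "__main__":
    print(probe_with_timeout(fast_drive, 1.0))
    print(probe_with_timeout(slow_empty_card_reader, 0.05))
```

The caller treats `None` as “device not ready, skip it” and moves on, which is all the UI really needs to know.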

There’s also a problem with timeshifting, a feature that allows you to pause and rewind TV. With FTA, this is great. With FOXTEL, this is absolute pain. The way timeshifting works is that the output of the capture card is saved to a temporary file before it is viewed on the screen. This means there’s usually a 3-4 second delay. This delay makes using onscreen components for the FOXTEL box (such as TV guide) impossible. Unfortunately, neither MediaPortal nor WMC allows you to disable timeshifting.
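A toy model makes the delay easier to see. The sketch below is Python and purely illustrative: the class, its methods, and the in-memory list standing in for the temporary file are all assumptions, not how MediaPortal or WMC actually implement it. The viewer only ever reads chunks that were captured a little while ago, which is exactly why pause and rewind work and why live menus feel laggy.

```python
class TimeshiftBuffer:
    """Toy timeshift buffer: capture appends chunks, the viewer reads behind."""

    def __init__(self, delay_chunks):
        self.delay_chunks = delay_chunks  # how far playback trails capture
        self.chunks = []                  # stand-in for the temporary file
        self.read_pos = 0
        self.paused = False

    def write(self, chunk):
        # Capture keeps writing even while playback is paused.
        self.chunks.append(chunk)

    def read(self):
        """Return the next chunk to display, or None if nothing is due yet."""
        if self.paused:
            return None
        if len(self.chunks) - self.read_pos <= self.delay_chunks:
            return None
        chunk = self.chunks[self.read_pos]
        self.read_pos += 1
        return chunk

    def rewind(self, n):
        # Step playback back by n chunks, but never before the start.
        self.read_pos = max(0, self.read_pos - n)


if __name__ == "__main__":
    buf = TimeshiftBuffer(delay_chunks=3)
    for frame in range(6):
        buf.write(frame)
        print("captured", frame, "displayed", buf.read())
```

Setting `delay_chunks` to 0 would be the equivalent of turning timeshifting off, which is the option neither program gives you.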

Finally, EPG (programme guide) data is nearly impossible to get for FOXTEL. There are sources such as OzTiVo, but the data is often incomplete. In the end, I ended up writing a parser for YourTV (which has relatively complete FOXTEL data). It’s a very highly customized solution mainly because there is no “raw” data feed (such as XML). I have to parse HTML pages to get metadata. Each day it downloads approximately 50-75MB of data (when parsed it’s only ~4MB, 600KB zipped)…which means I have to run this on my overseas VPS. Basically, the stars have to align to get any sort of consistent EPG data.
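For a rough idea of what scraping listings out of HTML involves, here is a small Python sketch using only the standard library. The markup in `SAMPLE_HTML` is invented for the example and is not YourTV’s real page structure, and the class and function names are hypothetical too. A real parser has to be written against whatever the site actually serves, and breaks whenever that changes.

```python
from html.parser import HTMLParser

# Made-up listing markup, only for illustration.
SAMPLE_HTML = """
<ul class="listing">
  <li><span class="start">18:00</span> <span class="title">News</span></li>
  <li><span class="start">18:30</span> <span class="title">Footy &amp; Friends</span></li>
</ul>
"""


class ProgrammeParser(HTMLParser):
    """Collects one {'start', 'title'} dict per <li> in the sample markup."""

    def __init__(self):
        super().__init__()
        self.programmes = []
        self._field = None   # which span we are currently inside, if any
        self._current = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("start", "title"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            # Text can arrive in pieces, so accumulate it.
            self._current[self._field] = self._current.get(self._field, "") + data

    def handle_endtag(self, tag):
        if tag == "span" and self._field:
            self._field = None
        elif tag == "li" and self._current:
            entry = {key: value.strip() for key, value in self._current.items()}
            self.programmes.append(entry)
            self._current = {}


def parse_listing(html):
    parser = ProgrammeParser()
    parser.feed(html)
    parser.close()
    return parser.programmes


if __name__ == "__main__":
    for programme in parse_listing(SAMPLE_HTML):
        print(programme["start"], programme["title"])
```

Most of the downloaded bytes are page furniture that gets thrown away, which is why the parsed data ends up so much smaller than the download.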

So, in conclusion?
a) If you want a purely digital media center: you’re going to have to use FTA Digital. You can use analogue component for FOXTEL but it’ll cost you.
b) EPG data for FTA is great! FOXTEL not so much.
c) FTA works great for timeshifting. You’re probably going to want to connect FOXTEL directly to your TV and capture card, so that when you need to record, you can.
d) Buy and research silent, cool and compatible hardware components.
e) 4:3 is out. 16:9 is here. Don’t even bother if you have an old TV.
f) Be prepared to pull your hair out.

You can probably tell this is just a spill of my brain. I haven’t gone into much detail…but I’ve finally finished the rebuild project of my media center.