(Yes, yes, I absolutely should have done complete offsite backups. Unfortunately, all my backups were on the server itself. So save the lecture; you're 100% absolutely right, but that doesn't help me at the moment. Let's stay focused on the question here!)

I am beginning the slow, painful process of recovering the website from web crawler caches.

There are a few automated tools for recovering a website from internet web spider (Yahoo, Bing, Google, etc.) caches, like Warrick, but I had poor results using it:

My IP address was quickly banned from Google for using it

I get lots of 500 and 503 errors and "waiting 5 minutes…"

Ultimately, I can recover the text content faster by hand

I've had much better luck working from a list of all blog posts, clicking through to the Google cache and saving each one as an individual HTML file. While there are a lot of blog posts, there aren't that many, and I figure I deserve some self-flagellation for not having a better backup strategy. The important thing is that this approach works: I can pull the page text out of the Internet caches, and based on what I've done so far, I'm confident I can recover all the lost blog post text and comments.

However, the images that go with each blog post are proving…more difficult.

Any general tips for recovering website pages from Internet caches, and in particular, places to recover archived images from website pages?

(And, again, please, no backup lectures. You're totally, completely, utterly right! But being right isn't solving my immediate problem… Unless you have a time machine…)

42 Answers

Here's my wild stab in the dark: configure your web server to return 304 for every image request, then crowd-source the recovery by posting a list of URLs somewhere and asking on the podcast for all your readers to load each URL and harvest any images that load from their local caches. (This can only work after you restore the HTML pages themselves, complete with the <img ...> tags, which your question seems to imply that you will be able to do.)

This is basically a fancy way of saying, "get it from your readers' web browser caches." You have many readers and podcast listeners, so you can effectively mobilize a large number of people who are likely to have viewed your web site recently. But manually finding and extracting images from various web browsers' caches is difficult, and the entire approach works best if it's easy enough that many people will try it and be successful. Thus the 304 approach. All it requires of readers is that they click on a series of links and drag off any images that do load in their web browser (or right-click and save-as, etc.) and then email them to you or upload them to a central location you set up, or whatever. The main drawback of this approach is that web browser caches don't go back that far in time. But it only takes one reader who happened to load a post from 2006 in the past few days to rescue even a very old image. With a big enough audience, anything is possible.
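
To make the mechanics concrete, here is a minimal sketch of such a stand-in server, using only Python's standard library. The port, handler name, and extension list are illustrative, and it assumes you can briefly run something like this in place of the real web server:

```
# Minimal sketch: answer every image request with 304 Not Modified so that
# visiting browsers fall back to whatever copy is still in their local cache.
# Assumptions: this briefly stands in for the real server; the port, class
# name, and extension list are illustrative.
from http.server import BaseHTTPRequestHandler, HTTPServer

IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif")

class NotModifiedForImages(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.lower().endswith(IMAGE_SUFFIXES):
            # 304 Not Modified: the browser shows its own cached copy, if any.
            self.send_response(304)
        else:
            # Everything that is not an image gets a plain 404 for now.
            self.send_response(404)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), NotModifiedForImages).serve_forever()
```

Readers who still hold a cached copy will see the image render; everyone else sees a broken image, which is exactly the signal you want.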

I think you could crawl your static files for the image tags and copy all of those into one giant page of images, instead of having everybody click each link. The diovo.com implementation looks very impressive, hope it works out for you.
– phloopy, Dec 15 '09 at 6:00
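
If you go the one-giant-page route the comment suggests, the scraping side is small. A rough sketch, assuming the recovered posts are saved as .html files under a local recovered/ directory (the directory and output names are made up):

```
# Rough sketch of the comment's idea: scan the recovered HTML files for
# <img> tags and emit one big page that references every image, so readers
# only have to load a single URL. Paths below are placeholders.
import pathlib
import re

img_src = re.compile(r'<img[^>]+src=["\']([^"\']+)["\']', re.IGNORECASE)

sources = []
for page in pathlib.Path("recovered").glob("*.html"):
    sources.extend(img_src.findall(page.read_text(errors="ignore")))

with open("all-images.html", "w") as out:
    out.write("<html><body>\n")
    for src in dict.fromkeys(sources):  # de-duplicate while keeping order
        out.write(f'<img src="{src}">\n')
    out.write("</body></html>\n")
```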

Go to Google Image search and type site:codinghorror.com; you can at least find the thumbnailed versions of all of your images. That doesn't recover the originals by itself, but it gives you a starting point for retrieving those thousands of images.

Configure the web server to return 304 for every image request. A 304 means the file has not been modified, so the browser will serve the file from its own cache if it still has a copy. (credit: this SuperUser answer)

In every page in the website, add a small script to capture the image data and send it to the server.
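
The capture script itself has to run in the browser (for example, drawing each <img> onto a canvas and POSTing the result back), but the receiving side can be tiny. A sketch of that half, assuming the in-page script sends base64-encoded image data as JSON; the URL path, field names, and port are invented:

```
# Sketch of the server half of the "capture and send back" step. Assumption:
# an in-page script POSTs JSON like {"name": "foo.png", "data": "<base64>"}
# to /upload-image; the path, field names, and port are invented.
import base64
import json
import pathlib
from http.server import BaseHTTPRequestHandler, HTTPServer

class ImageUploadHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/upload-image":
            self.send_response(404)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        # Keep only the base file name so uploads cannot escape the directory.
        name = pathlib.Path(payload["name"]).name
        pathlib.Path("uploads", name).write_bytes(base64.b64decode(payload["data"]))
        self.send_response(204)  # nothing useful to send back
        self.end_headers()

if __name__ == "__main__":
    pathlib.Path("uploads").mkdir(exist_ok=True)
    HTTPServer(("", 8080), ImageUploadHandler).serve_forever()
```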

Querying archive.org for images from codinghorror.com will get you everything it has archived; that turned up 3878 images for me, some of which are duplicates. It won't be complete, but it's a good start nonetheless.
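
If you would rather script that harvest than click through the Wayback Machine by hand, its CDX index can be queried directly. A rough sketch, assuming the current Wayback CDX API and the "id_" raw-snapshot suffix; the output directory is arbitrary:

```
# Sketch: list archived codinghorror.com image captures via the Wayback CDX
# index, then download the raw bytes of each one. Assumptions: the CDX API at
# /cdx/search/cdx and the "id_" (unmodified content) snapshot suffix.
import json
import pathlib
import urllib.parse
import urllib.request

CDX = ("http://web.archive.org/cdx/search/cdx"
       "?url=codinghorror.com&matchType=domain"
       "&filter=mimetype:image.*&collapse=urlkey"
       "&fl=timestamp,original&output=json")

rows = json.loads(urllib.request.urlopen(CDX).read())
pathlib.Path("wayback-images").mkdir(exist_ok=True)

for timestamp, original in rows[1:]:  # the first row is just the field names
    snapshot = f"http://web.archive.org/web/{timestamp}id_/{original}"
    name = pathlib.Path(urllib.parse.urlparse(original).path).name or "unnamed"
    try:
        data = urllib.request.urlopen(snapshot).read()
        pathlib.Path("wayback-images", name).write_bytes(data)
    except OSError:
        pass  # skip captures that no longer resolve
```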

For the remaining images, you can take the thumbnails from a search engine cache and do a reverse lookup on them at http://www.tineye.com/. You give it the thumbnail, and it returns previews of, and pointers to, closely matching images found on the web.

+1 on the dd recommendation if (1) the raw disk is available somewhere, and (2) the images were stored as ordinary files. Then you can use a forensic 'data-carving' tool to, for example, pull out every credible byte range that appears to be a JPG/PNG/GIF. I've recovered 95%+ of the photos from a wiped iPhone this way.

The open source tool 'foremost' and its successor 'scalpel' can be used for this.

In the past I've used http://www.archive.org/ to pull up cached images. It's kind of hit or miss, but it has worked for me.
Also, when trying to recover stock photos I'd used on an old site, www.tineye.com is great when I only have the thumbnails and need the full-size images.

This is probably not the easiest or most foolproof solution, but services like Evernote typically save both the text and the images when a page is stored inside the application - maybe some helpful readers who saved your articles could pull out the images and send them back to you?

I've had great experiences with archive.org. Even if you aren't able to extract all of your blog posts from the site, they keep periodic snapshots:

This way you can check each page and see the blog posts you made. With the names of all the posts you can easily find them in Google's cache if archive.org doesn't have them. Archive.org tries to keep images, Google's cache will have images, and I haven't emptied my browser cache recently, so I can help you with the more recent blog posts :)

About five years ago, an early incarnation of an external hard drive on which I was storing all my digital photos failed badly. I made an image of the hard drive using dd and wrote a rudimentary tool to recover anything that looked like a JPEG image. Got most of my photos out of that.

So, the question is, can you get a copy of the virtual machine disk image which held the images?
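
If you can, a carver in that rudimentary spirit really is very little code. A naive sketch that scans a raw image for JPEG start/end markers; it assumes unfragmented files and a disk image small enough to read into memory ("disk.img" is a placeholder):

```
# Naive JPEG carver in the spirit described above: find every start-of-image
# marker, grab the bytes up to the next end-of-image marker, and dump them to
# a file. Assumptions: files are contiguous (no fragmentation) and the disk
# image fits in memory; "disk.img" is a placeholder path.
import pathlib

SOI = b"\xff\xd8\xff"  # JPEG start-of-image marker (plus the next marker byte)
EOI = b"\xff\xd9"      # JPEG end-of-image marker

data = pathlib.Path("disk.img").read_bytes()
pathlib.Path("carved").mkdir(exist_ok=True)

count = 0
start = data.find(SOI)
while start != -1:
    end = data.find(EOI, start)
    if end == -1:
        break
    pathlib.Path("carved", f"{count:05d}.jpg").write_bytes(data[start:end + 2])
    count += 1
    start = data.find(SOI, end + 2)

print(f"carved {count} candidate JPEGs")
```

Expect some false positives and truncated thumbnails; tools like foremost and scalpel (mentioned in another answer here) do the same thing far more carefully.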

I suggest the combination of archive.org and a request anonymizer like Tor. I suggest the anonymizer because that way each of your requests will come from a random IP address and location, so you can avoid being banned by archive.org (as Google already did) for an unusually high number of requests.

The Wayback Machine will have some. Google's cache and similar caches will have some.

One of the most effective things you'll be able to do is to email the original posters, asking for help.

I do actually have some infrastructural recommendations, for after this is all cleaned up. The fundamental problem isn't actually backups, it's lack of site replication and lack of auditing. If you email me at the private email field's contents, later, when you're sort of back on your feet, I'd love to discuss the matter with you.

If you're hoping to try to scrape users' caches, you may want to set the server to respond 304 Not Modified to all conditional-GET ('If-Modified-Since' or 'If-None-Match') requests, which browsers use to revalidate their cached material.

If your initial caching headers on static content like images were pretty liberal -- allowing things to be cached for days or months -- you could keep getting revalidation requests for a while. Set a cookie on those requests, and appeal to those users to run a script against their cache to extract the images they still have.
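
Spelled out, the server-side behaviour is just: if the request carries a cache validator, answer 304 and tag the visitor. A minimal sketch (the cookie name and port are invented):

```
# Sketch of the conditional-GET variant: only requests carrying a validator
# (If-Modified-Since / If-None-Match) get a 304, and each of those visitors is
# tagged with a cookie so they can be appealed to later. The cookie name and
# port are invented.
from http.server import BaseHTTPRequestHandler, HTTPServer

class RevalidationHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if "If-Modified-Since" in self.headers or "If-None-Match" in self.headers:
            self.send_response(304)  # "your cached copy is still good"
            self.send_header("Set-Cookie", "has_cached_images=1; Path=/")
        else:
            self.send_response(404)  # nothing to offer cold caches yet
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), RevalidationHandler).serve_forever()
```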

Beware, though: the moment you start putting up any textual content with inline resources that aren't yet present, you could be wiping out those cached versions as revalidators hit 404s.

At the risk of pointing out the obvious, try mining your own computer's backups for the images. I know my backup strategy is haphazard enough that I have multiple copies of a lot of files hanging around on external drives, burned discs, and in zip/tar files. Good luck!

How much is this data worth to you? If it's worth a significant sum (thousands of dollars), then consider asking your hosting provider for the hard drive used to store your website's data (in the case of data loss due to hardware failure). You can then take the drive to Ontrack or some other data recovery service to see what you can get off it. This might be tricky to negotiate, since other people's unrecovered data may be on the drive as well, but if you really care about it you can probably work something out.

Very sorry to hear this; I'm annoyed for you, especially given the timing. A couple of weeks ago I wanted an offline copy of a few of your posts and ran HTTrack against your entire site, but I had to go out and stopped it.

If the host is half decent (and I'm guessing you are a good customer), I would ask them either to send you the hard drives (they should be using RAID) or to do some recovery themselves.

Whilst this may not be a fast process, I did it with one host for a client and was able to recover entire databases intact (basically, the host attempted an upgrade of the control panel they were using and messed it up, but nothing was overwritten).