Help scrape Google Video before it’s gone forever!

Update: Google About-Turn

Google have capitulated to feedback and decided to keep Google Video alive and migrate the videos to YouTube. By that point Archive Team had already saved 40% of the content and were well on track to save it all.

Google Video will be shutting down within the next few weeks, and for some stupid reason, they’re not just transferring the videos to YouTube (as Google owns both) – instead they’re just pulling the plug and it’s all going to be lost. To fix this rubbish state of affairs, Archive Team are in a race to scrape as much Google Video content as they can before the viewing deadline (29/04/2011) and the download deadline (13/05/2011) – and you can help! Archive.org have kindly donated 100TB for storage, but first we need to index the videos and scrape them.

If you have a lot of bandwidth you can help scrape the videos themselves, but even if you don’t, you can help with the indexing effort by running a simple Linux script that’s light on resources and bandwidth, and just leaving it running!

Why save it?

YouTube has a 15 minute video length limit – and Google Video doesn’t. This means there are large amounts of video that might be on Google Video and nowhere else, so when they’re gone – they’re gone. A lot of this might not be fantastic material – but a lot of it will be unique and the only copy on the Net. There are documentaries, films, and all sorts of good stuff, and even personal video blogs will be a snapshot of the times we live in. In short, it’s stuff that we, as a species, should not throw away.

It’s like the BBC scrapping their archives when they didn’t want to pay to save them – we won’t know what’s been lost until it’s gone, and by then it’ll be too late. So let’s not let that happen, eh?

How you can help if you have ~200GB bandwidth/storage or more: Scrape the videos

Note: This will only work on Linux machines with X running – you can’t run it on headless servers due to phantomjs requirements. Instructions are for Ubuntu 10.10 or later and might need a little modification if you’re running an older or non-Debian-based distro.

Get and build phantomjs (a headless web browser) by doing the following:

In a terminal, navigate to the folder where you extracted the google_video_related file (above) and run the following command to help scrape Google Video:

while :; do ./related.sh; done

Simply leave the script running, and head on over to #ggtesting on EFnet (IRC) if you need any assistance or the script has any issues (p.s. stop the script with Ctrl+C if it misbehaves, since Ctrl+Z only suspends it – though mine’s been running for about 7 hours solid with no complaints so I doubt you’ll have any).
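If you’d like a bit more control over the loop, here is a rough sketch of a wrapper that does the same job. It is not part of the Archive Team tooling: the run_loop function, the RETRY_DELAY variable and the "stop" file are all names made up for this example. It re-runs a command, pauses briefly after a failed run so a broken script doesn’t spin flat out, and ends cleanly when a file called stop appears in the current folder.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the Archive Team scripts.
# Usage: run_loop MAX_RUNS COMMAND [ARGS...]
#   MAX_RUNS=0 means "keep going until told to stop".
run_loop() {
    local max_runs=$1
    shift
    local runs=0
    while [ "$max_runs" -eq 0 ] || [ "$runs" -lt "$max_runs" ]; do
        # Creating a file named "stop" in this folder ends the loop cleanly.
        if [ -e stop ]; then
            break
        fi
        # Run the command; if it fails, wait a little before the next attempt.
        "$@" || sleep "${RETRY_DELAY:-5}"
        runs=$((runs + 1))
    done
    # Report how many runs were attempted.
    echo "$runs"
}
```

With the function loaded into your shell, run_loop 0 ./related.sh behaves like the one-liner, and running touch stop from another terminal in the same folder stops it after the current run finishes.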

The script scrapes each page for related videos and sends them off to an archiveteam server. It takes very little processing and bandwidth on your end (a couple of kb/sec, if that) and seems to work just fine.
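To give a feel for the "find the related videos" half of that, here is a very rough sketch. It is not what related.sh actually does (that drives phantomjs and reports to the Archive Team server, neither of which is reproduced here). It assumes that video links carry a docid= parameter followed by a number, and extract_docids is a name invented for this example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: print the unique video IDs found in HTML on stdin.
# Assumes links look like "videoplay?docid=<number>" (an assumption).
extract_docids() {
    grep -oE 'docid=-?[0-9]+' |   # pull out every docid=... occurrence
        sed 's/^docid=//' |       # keep just the number
        LC_ALL=C sort -u          # drop duplicates, stable byte ordering
}
```

You would feed it a saved page, for example extract_docids < page.html, and get one ID per line with duplicates removed.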

Every little helps

I’m sure anything you can do to pitch in will be appreciated by Archive Team, the Internets, your future self, your kids, your kids’ kids, your kids’ kids’ kids… you get the picture ;)