Tuesday, December 13, 2005

AWSP offers shell access at Alexa?

I thought Amazon Mechanical Turk was one of the strangest things I'd seen in a while, but Amazon is weirding me out again with their new Alexa Web Search Platform (AWSP).

AWSP is supposed to be a developer framework for building new things on top of the crawl and index data available from Alexa. As part of this package, it appears that AWSP offers SSH access to the Alexa cluster, where you can write and run arbitrary C code.

This is either incredibly bold or absurdly foolish. On the one hand, this could be a useful platform for some developers, a utility computing server farm where you can rent machines by the CPU hour and access the incredible Web data available from Alexa. On the other hand, arbitrary C code can do arbitrary things: nicely accessing the data it is supposed to, or evilly cracking the machine, fondling other people's data, and launching attacks on other servers.

You have to hand it to Amazon. They've been doing an amazing job thinking outside the box lately. But, sometimes, the box is there for a reason.

Update: In the comments, a couple of people are arguing that these accounts appear to be isolated in virtual machines and that I may be overstating the risk. They might be right; perhaps I am being too paranoid, especially given that there are easier targets out there.

Search web services? Interesting if you can't get one of your subsidiaries to innovate fast enough. Not Amazon... then you probably don't care.

And any issues like the one you pointed out (why would you want to share so much with Amazon?) are just dismissed by insiders, even though those ultimately prove to be the reason these things (see Passport) don't catch on.

Amazon is rapidly becoming Microsoft. About 87% of what Microsoft produces never makes it out the door. Another 20% makes it out the door but ultimately doesn't catch on (everything except Xbox, Office and the OS). All the money's made in the last 10%.

Greg - on the crawl data costs, I'm thinking that for many vertical search applications, you'd do an initial selection on the index metadata, and avoid touching the full crawl data until you know what you want to build a new index on.

I haven't read far enough to tell if that's possible, or whether you'd just have to move each ARC file across anyway.
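
To make the idea concrete, here is a rough sketch of that two-stage selection. It does not use the AWSP API; the `IndexRecord` fields, the `select_arc_files` helper, and the sample data are all made up for illustration. The only assumption is that the index metadata tells you which archive file each page lives in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexRecord:
    """Hypothetical index metadata row; field names are illustrative only."""
    url: str
    language: str
    arc_file: str  # name of the archive file assumed to hold the full page


def select_arc_files(records, predicate):
    """Stage 1: filter on the cheap metadata and return the set of
    archive files that contain at least one matching page."""
    return {r.arc_file for r in records if predicate(r)}


# Made-up sample metadata: four pages spread across three archive files.
records = [
    IndexRecord("http://example.com/a", "en", "arc-0001"),
    IndexRecord("http://example.org/b", "de", "arc-0001"),
    IndexRecord("http://example.net/c", "de", "arc-0002"),
    IndexRecord("http://example.com/d", "en", "arc-0003"),
]

# Stage 2 would fetch only these files instead of the whole crawl.
needed = select_arc_files(records, lambda r: r.language == "de")
all_files = {r.arc_file for r in records}
print(f"would touch {len(needed)} of {len(all_files)} archive files")
```

The catch is visible even in this toy: the saving depends on matching pages being clustered in a few archive files. If they are scattered across most of them, you end up moving nearly everything anyway.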