Not so long ago, it was both difficult and expensive to perform massive distributed processing using a large cluster of machines. Mainly because:

1. It was difficult to get funding to acquire this 'large cluster of machines'. Once acquired, the cluster was difficult to manage (power, cooling, maintenance), and there was always the fear of what would happen if the experiment failed and how one would recover the investment already made.

2. Even once the cluster was acquired and running, there were technical problems. Running massively distributed tasks, storing and accessing large datasets, and parallelizing work were all hard, and job scheduling was error-prone. If nodes failed, detecting the failure was difficult and recovery was expensive. Tracking jobs and their status was often ignored because it quickly became complicated as the number of machines in the cluster grew.

Hence, it was difficult to innovate and solve real-world problems like these:

Social Networking Company: Analyze social, demographic, and market data

Phone Company: Locate all customers who have called in a given area

Large Retail Chain: Find out what items a particular customer bought last month, or recall a certain product and notify the customers who bought it

Surveillance Company: Transcode video accumulated over several years

Pharma Company: Locate people who were prescribed a certain drug

Just a few years ago, it was difficult. But now, it is easy.

The Open Source Hadoop framework has given developers the power to do some pretty extraordinary things.

Hadoop gives developers an opportunity to focus on their idea and its implementation instead of worrying about the software-level "muck" associated with distributed processing (#2 above). It handles job scheduling, automatic parallelization, and job/status tracking by itself while developers focus on their Map and Reduce implementations. It makes it possible to process large datasets by splitting a dataset into manageable chunks, spreading the chunks across a fleet of machines, launching jobs wherever the data physically resides, and, at the end, aggregating the job output into a final result.
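
To make the Map and Reduce part concrete, here is a minimal word-count sketch written as Hadoop Streaming scripts in Python. The file names (mapper.py, reducer.py) and the word-count task itself are just illustrative assumptions on my part; any pair of executables that read from stdin and write tab-separated key/value pairs to stdout will do.

```python
#!/usr/bin/env python
# mapper.py - emit a "word <tab> 1" pair for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)
```

```python
#!/usr/bin/env python
# reducer.py - sum the counts for each word; Hadoop sorts the map output
# by key, so all of the lines for a given word arrive together
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

Hadoop handles everything else described above: splitting the input into chunks, scheduling map and reduce tasks across the cluster, sorting and shuffling the intermediate pairs, retrying failed tasks, and collecting the final output.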

But if I am a startup, a university with minimal funding, or a self-employed individual who would like to test distributed processing on a large cluster with 1000+ nodes, can I afford it? Or even if I am a well-funded company (think "enterprise") with a lot of free cash flow, will management approve the budget for my experiment? Every organization has a person who says "no". Will I be able to fight the battle with those people? Should I even fight the battle (of logistics)? Will I be able to get an environment to experiment with large datasets (think "weather data simulation" or "genome comparisons")?

Cloud Computing makes this a reality (solving #1 above). Click a button and get a server. Flick a switch and store terabytes of data geographically distributed. Click a button and dispose of temporary resources.

Posts like this and this inspired me to write this post. Amazon Web Services is leveling the playing field for experimentation, innovation, and competition. Users can iterate on their ideas quickly; if an idea works, bingo! If it does not, shut down your "droplet" in the cloud, move on to the next idea, and start a new "droplet" whenever you are ready.

I would say:

The Open Source Hadoop framework on Amazon EC2/S3 has given every developer the power to do some pretty extraordinary things.

Every day, I hear new stories about running Hadoop on EC2. For example, The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw TIFF image data (stored in S3) into 1.1 million finished PDFs in the space of 24 hours, at a computation cost of just $240. Hadoop on EC2 not only makes massive distributed processing easy, it also makes it headache-free.

Whether it is startups, university classrooms at UCSB, BYU, and Stanford, or even enterprise companies, it's just amazing to see every new story about Hadoop being used on Amazon EC2/S3 in innovative ways.

That's what I love about Amazon Web Services - anyone with just a credit card can afford to think about massive distributed computing, compete with the rest, and come out on top.

--Jinesh

P.S. The real power and potential of Hadoop on Amazon EC2 will be realized when I see Hadoop-on-demand with Condor spawning EC2 instances on the fly when I need them (or when the situation demands them) and shutting them down when I don't. Has anybody tried that yet?

Orglex is focused on providing information services to industry professionals in vertical markets such as health insurance, clinical trials, and venture capital. Their service has three facets: aggregated content specific to the industry, community and networking hubs for the industry, and a targeted recruiting platform.

Orglex uses EC2 and S3 to build and maintain domain-specific ontologies for each vertical market. In contrast to the usual top-down models hand-built by domain experts, their system extracts clues from the information itself and uses them in a scalable, bottom-up fashion.

The process of building and refining the algorithms is iterative in nature. The scalable nature of EC2 allows them to tune and re-run their algorithms as needed, without a dedicated compute cluster. Nik told me that the ability to scale up and down has really driven down the cost of experimentation and has allowed them to get to market quickly with a high-quality product.

We have simplified the process of requesting additional EC2 instances. You no longer need to call me at home or send a box of dog biscuits to Rufus.

You can now make a request by simply filling out the Request to Increase the Amazon EC2 Instance Limit form. We'll need to know a little bit about you and about your application and the number of instances that you need, and we'll take care of the rest.

As always, if you are doing something cool with EC2, we really want to hear about it! Write a blog post that we can link to, or simply send us an email at awseditor@amazon.com.

If there was ever any doubt about the power each of us has, this week proved that one person can make a real difference. I am midway through a two-week trip to New Zealand and Australia, and writing this post from New Zealand. The person I'm talking about is Nick Jones; let me explain how this evangelism trip came about, and along the way I'll talk a bit about what I found once here.

How the Trip Came About

Amazon's own Jeff Barr came up with an idea that has changed the course of evangelism, at least here at Amazon Web Services. We have a wiki at evangelists.wetpaint.com that allows community members to request that we come to them, rather than some centralized process where we decide who "should" hear about Amazon Web Services. And so in this case Nick posted a request that Amazon send a Web Services evangelist down under. I replied to Nick to say "sure, but not just for one meeting". It must have been a challenge to arrange; check out the wiki page for this trip and you'll see just how dense the schedule is. Nick wasn't responsible for every meeting; however, a large percentage of the meetings in both New Zealand and Australia were due to his efforts.

The Result

There was lots of opportunity on this trip to meet with the academic/research community (Nick works at the University of Auckland), government agencies, startups, and individual developers. It's amazing what you learn, especially when others set the agenda. I am going to describe just a few highlights, which will shortchange others who reinforced the same points; but given the number of meetings it's the only approach possible.

New Zealand is a long way from the traditional tech centers, and a single undersea cable serves the country (although a second one is on the way). The result is that Internet access is expensive, with a wholesale cost of $0.03/MB to communicate with North America. So the research community makes use of KAREN, a network funded by the NZ government that eliminates that transit fee, as long as the other end has a peering agreement. None of this seems to affect the local startup scene, though, as I'll describe shortly.

Every city seemed to have a take-charge person. In Christchurch there were two: Robin Harrington took the lead at the University of Canterbury, and Christopher Sawtell led the charge for the Linux group. Robin set up a series of sessions with researchers and faculty on campus. It's always exciting to see people think about the potential and cost savings that these new Web service offerings afford, and I was able to learn more about the university and what their needs are. The campus is on a very large piece of land, yet the actual buildings are compact, so there is lots of very lush green space. Kiwis are definitely into "green", in both the garden and the environmental sense.

As mentioned, the other Christchurch leaders were long-time officers of the local Linux user group. They went well out of their way to accommodate my schedule and arrange a meeting on a night other than their usual one. Then they even invited me out for dinner at a Chinese restaurant. Great place to eat! We met on the university campus; you know it's a comp sci department when the name on the lab door says "Crypt 2".

The Kiwi research community has access to the highest number of supercomputers per capita in the world. These were used for at least part of the rendering of Lord of the Rings, a fact that many techies say “thank you” for.

Wellington has a vibrant Web community and seems to be a hotbed of tech startups. The original intent was that I'd present to a few local startups. The event kept growing on its own until Catalyst Consulting stepped in and agreed to host it. Then it got bigger yet, presenting venue challenges... Don Christie from Catalyst posted a blog entry about the meeting, where I presented to a group of well over 100 people (I believe it was closer to 150) in a packed incubation center. Wow, what energy! What the folks in the room didn't realize was that from the balcony outside the meeting I could see the neighborhood where I lived briefly many years ago (in the background of this photo). What a distraction! Another blog post by a different attendee is here.

In Hamilton I met with one of New Zealand's largest Web design firms. They have all sorts of innovation in their reference list, not the least of which was setting themselves up as an Internet registrar. Like so many others, they were enthusiastic and excited about the potential of Web-Scale Computing. At this point I also switched to renting a car - a combination of destinations in suburban areas and a late-night travel schedule to Auckland made it the practical choice. The rental vehicle reminded me that New Zealand drives on the other side of the road, and that I should too...

Finally, Auckland is a more traditional business community, but it is still full of tech startups, and I had an opportunity to meet with some of them as well. In both Wellington and Auckland I realized how hands-on the government is about promoting their software industry as an export. The folks at NZTE (New Zealand Trade and Enterprise) were impressive; unlike a typical government agency, these staff members come from the software industry and have a very realistic view of the world. There are plenty of success stories in New Zealand's software industry that don't involve government agencies, of course; however, being promoted as an export industry definitely provides lift.

I finally met Nick on Thursday.

Who wants to be next? Nick and the rest of the New Zealand community set the bar...

Earlier this month we published two new stories. You can read about how Digital Chalk used three different services to build a system for creating, editing, and hosting training videos.

You can also read about how Sonian Networks (previously blogged here) used the same services to create a highly scalable system for archiving and indexing corporate email and other internally generated content.

We've got more stories in the works, so please check the Success Stories part of our site from time to time.

My friends at Bungee Labs have rolled out the newest release of Bungee Connect, their browser-based application development and hosting platform.

They have also released a library that makes it really easy to call Amazon SimpleDB. The library wraps all of the SimpleDB SOAP calls and handles all of the authentication as well. Per their recent blog post, all you need to do to get started is enter your AWS developer credentials. You can read about the library here. As I noted in an earlier blog post, you can also access Amazon FPS from Bungee Connect with ease.

Bungee Connect is the development component of Bungee's Platform-as-a-Service model. Without leaving your desk (or your web browser) you can design, build, and deploy a complex application. The application might involve calling SOAP or REST web services, mashing up data from multiple local and remote sources, and doing some significant local processing as well. There's no charge to develop an application. Once built and deployed, the developer is billed based on actual usage of the application. There's more on this over at ProgrammableWeb.

You may be reading this and thinking that it sounds cool, only to realize that you don't yet have access to Amazon SimpleDB. We are adding new users to the SimpleDB beta just as fast as possible. If you are not yet on the waiting list, go here and click the Sign Up for Web Service button near the top right of the page. Before you do that, make absolutely sure that you have attached a credit card to your AWS account. If you are already using another for-pay service such as Amazon S3 or EC2, you have already done this. About 99.9% of our existing SimpleDB beta testers gained their access in this way.

The other 0.1% were desperate for access and managed to beg their way into the beta using various social engineering tricks. The most common trick is a desperate email to me, following a very predictable pattern:

Paragraph 1 is always something like "Hey Jeff, remember that time we were using a PDP-8 together back in 7th Grade? Man, those were the good old days. I've been meaning to catch up with you for a long time. How's life?"

Paragraph 2 is then "I'm now at a startup, and my life won't be complete without access to SimpleDB. Can you help?"

Believe it or not, I get at least one such email per week. In fact, our limited betas have proven to be very effective at getting reconnected to old friends, which is never a bad thing. Of course, the more clever and more desperate the appeal, the better.

S3Stat is a log analysis tool for Amazon S3. This very helpful tool uses the log files generated by S3, analyzes them using Webalizer, and generates a variety of insightful and colorful reports. I have been using S3Stat on one of my own buckets for the last couple of months and have been pleased with the results. I use an S3 bucket to store the pictures that I post on my personal blog and now I know a lot more about the popularity of each one.

There's a one-month free trial and usage after that costs just $2 per month. Take a look at the pricing plan to learn more. While you are on the site you may want to take a look at their handy list of S3 resources as well.

In order to allow developers to gain a better understanding of the Amazon SimpleDB Query language, we have just posted a pair of tutorials:

In Query 101: Building Amazon SimpleDB Queries, you will learn about the basic principles of the language, including the comparison and set operators and how to use them in simple and range queries. With that as a base, you will then learn about multi-valued queries, which (naturally enough) operate on SimpleDB attributes that have multiple values. Finally, you will learn about multi-predicate queries, using the union and intersection operators to ask more complex questions.
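
As a small taste of what the tutorial covers, here are a few illustrative queries in the bracketed-predicate style that SimpleDB Query uses. The attribute names and values are made up for the example, and this is based on my own recollection of the syntax; the tutorial is the authoritative reference.

```
['year' > '1975' and 'year' < '1985']
['keyword' = 'Book'] intersection ['keyword' = 'Hardcover']
['rating' = '5'] union ['rating' = '4']
```

The first is a simple range query on one attribute. The second uses intersection so that an item's multi-valued 'keyword' attribute can satisfy the two conditions with two different values, and the third uses union to combine the result sets of two predicates.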

In Query 201: Tips & Tricks for Amazon SimpleDB Query, you will learn about lexicographic comparison, querying for numerical data and dates, using negation, tuning your queries using BoxUsage, partitioning your data for best query performance, and efficient retrieval of result sets.
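
The lexicographic-comparison tip deserves a small illustration. Since SimpleDB compares attribute values as strings, a common approach is to zero-pad numbers (adding an offset first if negative values are possible) and to store dates in ISO 8601 form, so that string order matches numeric and chronological order. Here is a minimal sketch in Python; the padding width and helper names are arbitrary choices for the example:

```python
# Encode values so that SimpleDB's string comparison matches the natural order.
from datetime import datetime

PAD_WIDTH = 10  # wide enough for the largest value you expect to store

def encode_int(n, offset=0):
    """Zero-pad (after an optional offset for negative values) so that
    '0000000009' < '0000000042' just as 9 < 42."""
    return str(n + offset).zfill(PAD_WIDTH)

def encode_date(dt):
    """ISO 8601 timestamps sort chronologically when compared as strings."""
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

print(encode_int(42))                              # 0000000042
print(encode_date(datetime(2008, 2, 14, 9, 30)))   # 2008-02-14T09:30:00Z
```

With values stored this way, the range-query style shown in Query 101 works on numbers and dates as well; the tutorial covers the details, including handling negative numbers.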