How Amazon uses big data to prevent warehouse theft

Amazon has become the cloud king, with its Amazon Web Services (AWS) offerings providing cloud-based storage and processing that take much of the cost out of deploying new products and services and developing applications. Netflix, Dropbox and Yelp are all AWS clients, but the most important user might be Amazon itself.

Today at the Web 2.0 Summit, Alyssa Henry, VP of Amazon’s AWS Storage Services, gave one example of how Amazon uses its cloud storage and processing power to handle an issue that is rarely discussed but vital to its overall profitability: combating warehouse theft.

According to Henry, Amazon has more than 1.5 billion items in its retail catalog and more than 200 fulfillment centers around the world. That’s a lot of objects in a lot of places for the online retailer to keep track of. Keeping the most valuable items protected isn’t as easy as just putting the highest-priced products under lock and key. As Henry said, a lower-priced product is sometimes more sought-after by criminals, whether because of limited availability or other factors. There are also questions of how big the cage is, how big the item is, how many items can fit in each cage, and so on.
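The trade-off Henry describes, theft risk on one side and limited cage space on the other, can be illustrated with a small sketch. This is not Amazon's method, which the talk did not detail. It is a simple greedy heuristic, and the `Item` fields, the `theft_risk` score and the `choose_caged_items` function are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Item:
    sku: str           # hypothetical product identifier
    theft_risk: float  # hypothetical score; higher means more likely to be stolen
    volume: float      # space one unit takes up in the cage


def choose_caged_items(items, cage_volume):
    """Greedy heuristic: cage the items with the most theft risk per unit of space."""
    ranked = sorted(items, key=lambda i: i.theft_risk / i.volume, reverse=True)
    chosen, used = [], 0.0
    for item in ranked:
        # Skip anything that no longer fits in the remaining cage space.
        if used + item.volume <= cage_volume:
            chosen.append(item.sku)
            used += item.volume
    return chosen


# Made-up catalog: the bulky, expensive TV loses out to small, sought-after items.
catalog = [
    Item("tv", theft_risk=0.3, volume=8.0),
    Item("game", theft_risk=0.9, volume=0.5),
    Item("cable", theft_risk=0.1, volume=0.2),
    Item("phone", theft_risk=0.8, volume=0.4),
]
caged = choose_caged_items(catalog, cage_volume=1.0)
print(caged)
```

In this made-up catalog the small, high-risk items take the cage space and the large television stays outside, which mirrors the point that price alone does not decide what gets locked up.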

To determine which items are most likely to be stolen, Amazon stores the product catalog data in S3, which receives more than 50 million updates a week. Every 30 minutes the team spins up Amazon compute clusters, crunches the data, and feeds the results back to the warehouse and website. At the center of the service is the new Elastic MapReduce (EMR), a hosted Hadoop framework running on AWS that lets customers spin up the equivalent of a supercomputer for processing big data.

Amazon isn’t the only one using EMR for big data processing: Henry gave Yelp as another example. In its particular use case, Yelp has been using AWS and EMR to improve the autocorrect options in its search function, processing all of the searches that users have done and determining which search option was the “correct” one, based on which option the most users clicked on.
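The idea behind Yelp's approach, picking the "correct" option by majority click, can be sketched in a few lines of Python. This is a minimal single-machine illustration, not Yelp's actual code or an EMR job. The log format, the sample queries and the `most_clicked_corrections` function are assumptions made for the example, and it counts clicks rather than distinct users.

```python
from collections import Counter, defaultdict


def most_clicked_corrections(click_log):
    """Map each raw query to the suggestion that received the most clicks.

    click_log: iterable of (raw_query, clicked_suggestion) pairs (hypothetical format).
    """
    counts = defaultdict(Counter)
    for raw_query, clicked in click_log:
        # Normalize the query so "Piza" and "piza " are counted together.
        counts[raw_query.strip().lower()][clicked] += 1
    return {query: c.most_common(1)[0][0] for query, c in counts.items()}


# Made-up search log for illustration.
log = [
    ("resturant", "restaurant"),
    ("resturant", "restaurant"),
    ("resturant", "restaurateur"),
    ("Piza", "pizza"),
]
corrections = most_clicked_corrections(log)
print(corrections)
```

At Yelp's scale the same counting step would be spread across a cluster, which is the kind of workload a hosted Hadoop service is meant to handle.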