Pinterest’s engineering team has built a self-service platform, combining homegrown, open source and commercial tools, that hooks into Hadoop to orchestrate and process the company’s enormous trove of data, which spans 30 billion pins. The company shared a few details behind the project for the first time on Thursday.

The heart of Pinterest is the ability of its users to create and share personalized boards that display pins — essentially, visual bookmarks to various places on the web — and so the company needed to find a way to obtain insights into all that user activity and provide a more enjoyable experience through related pins, search indexing and image processing. These are all heavy-duty tasks that fall beyond the scope of Hadoop by itself.

Currently, the social scrapbook and visual discovery site has roughly 10 petabytes of data in Amazon S3 and processes about a petabyte of data daily using Hadoop, according to a blog post from Pinterest data engineer Mohammad Shahangian published Thursday. Although Hadoop is great at storing and processing that data, Shahangian wrote, it’s not an easy system to use by itself and it requires some customization with the help of other services to really take advantage of its capabilities.

All the data Pinterest is processing over time

To help Hadoop process data faster, the Pinterest team separates compute from storage: its Hadoop clusters run MapReduce jobs against data kept in Amazon S3 rather than inside the clusters themselves. This is helpful because if one of the clusters goes down or needs a hard reset, no work is lost; all the data remains in S3 instead of being held up in the Hadoop clusters.
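To make that pattern concrete, here is a minimal sketch, assuming boto3 and a transient Amazon EMR cluster; Pinterest itself moved its Hadoop jobs to Qubole, and the bucket names, scripts and instance counts below are hypothetical, not Pinterest’s. The point is that both input and output live in S3, so the cluster can be terminated or replaced without losing anything.

```python
# Sketch of the "compute is disposable, data lives in S3" pattern (hypothetical names).
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="pin-events-daily-aggregation",           # hypothetical job name
    ReleaseLabel="emr-5.36.0",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 20,
        "KeepJobFlowAliveWhenNoSteps": False,       # cluster terminates after the step
    },
    Steps=[{
        "Name": "aggregate-pin-events",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", "s3://example-code/mapper.py,s3://example-code/reducer.py",
                "-mapper", "mapper.py",
                "-reducer", "reducer.py",
                "-input", "s3://example-logs/pin-events/2013-10-01/",
                "-output", "s3://example-warehouse/pin-counts/2013-10-01/",
            ],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Launched transient cluster:", response["JobFlowId"])
```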

How Pinterest uses MapReduce

The Pinterest team also decided to use Hive in conjunction with Hadoop, and found that Hive’s ability to store metadata was helpful: it can catalog all the data produced by each Hadoop job. Hive also lets the team take advantage of some of the functionality of common SQL tools, like the ability to list tables and detail the information within them, which makes for a much simpler interface to work with.

From the blog post:

We orchestrate all our jobs (whether Hive, Cascading, HadoopStreaming or otherwise) in such a way that they keep the HiveMetastore consistent with what data exists on disk. This makes it possible to update data on disk across multiple clusters and workflows without having to worry about any consumer getting partial data.
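A small sketch of both behaviors, assuming PyHive as the client and hypothetical host, table and bucket names (this is not Pinterest’s orchestration code): the metastore answers familiar SQL-style catalog queries, and new data is registered only after it has been fully written to S3, so no consumer sees a partial result.

```python
# Hypothetical metastore interactions, not Pinterest's actual tooling.
from pyhive import hive

conn = hive.Connection(host="hive-server.example.internal", port=10000)
cur = conn.cursor()

# 1. The metastore as a catalog: list tables, then inspect one of them.
cur.execute("SHOW TABLES")
print([name for (name,) in cur.fetchall()])
cur.execute("DESCRIBE FORMATTED pin_events")        # columns, types, S3 location
print(cur.fetchall())

# 2. Keeping the metastore consistent with disk: the job writes a complete
#    partition to a fresh S3 prefix first, then registers it in one statement.
cur.execute(
    "ALTER TABLE pin_events ADD IF NOT EXISTS PARTITION (dt='2013-10-01') "
    "LOCATION 's3://example-warehouse/pin_events/dt=2013-10-01/'"
)
```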

Pinterest also uses Puppet, the open source configuration management tool produced by Puppet Labs (see disclosure), as a way to keep track of its customized system.

From the blog post:

Puppet had one major limitation for our use case: when we add new nodes to our production systems, they simultaneously contact the Puppet master to pull down new configurations and often overwhelm the master node, causing several failure scenarios. To get around this single point of failure, we made Puppet clients “masterless,” by allowing them to pull their configuration from S3 and set up a service that’s responsible for keeping S3 configurations in sync with the Puppet master.
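A rough sketch of that masterless pattern, assuming a bucket that a separate service keeps in sync with the Puppet master (the bucket, prefix and manifest paths here are hypothetical, not Pinterest’s): each node downloads its configuration from S3 and applies it locally, so no node depends on a central master at boot.

```python
# Masterless Puppet sketch: sync manifests from S3, then run `puppet apply` locally.
import os
import subprocess
import boto3

BUCKET = "example-puppet-config"   # kept in sync with the Puppet master by a separate service
PREFIX = "production/"
LOCAL_DIR = "/etc/puppet-masterless"

def sync_config_from_s3():
    """Download the latest Puppet modules and manifests from S3."""
    s3 = boto3.client("s3")
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):                  # skip directory placeholders
                continue
            dest = os.path.join(LOCAL_DIR, key[len(PREFIX):])
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            s3.download_file(BUCKET, key, dest)

def apply_locally():
    """Run Puppet against the synced manifests; no master node is involved."""
    subprocess.run(
        ["puppet", "apply",
         "--modulepath", os.path.join(LOCAL_DIR, "modules"),
         os.path.join(LOCAL_DIR, "manifests", "site.pp")],
        check=True,
    )

if __name__ == "__main__":
    sync_config_from_s3()
    apply_locally()
```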

Also of interest: Pinterest decided to go with the Hadoop-as-a-service startup Qubole to run its Hadoop jobs, because Amazon’s Elastic MapReduce (EMR) had trouble performing once the company grew past a couple hundred nodes. As Shahangian details, Qubole, whose founders are two former Facebook engineers who helped create Hive, had no problem scaling horizontally to thousands of nodes in a single cluster, and its efficiency delivered a 30-to-60 percent increase in throughput compared with EMR.

Eventually, Pinterest plans to experiment with Hadoop 2, which builds cluster resource management into the platform itself (via YARN), something the original Hadoop lacked. For now, though, the company seems satisfied enough with its souped-up version of Hadoop powered by services like Qubole and Puppet.

Disclosure: Puppet Labs is backed by True Ventures, a venture capital firm that is an investor in Gigaom.

But outside of some whiz-bang technical stuff, there are also hundreds of millions of people with access to the internet and a desire to get more information about something as esoteric as PHACE syndrome or as common as depression. Add thousands of medical research papers and descriptions of diseases and treatments available for free via the National Library of Medicine, and you have a business opportunity.

Making meaning from medical research

When Jay Bartot and Derek Streat started Medify in 2010, the goal was to use those free research papers to train machine learning algorithms to deliver intelligible answers to health queries from the masses. Bartot, who had co-founded Farecast, a startup that built a predictive algorithm to tell users the best time to buy airplane tickets, decided to bring his knowledge of prediction to the health world after his own family’s brush with a medical problem.

In May, Medify was purchased by Alliance Health Networks, a Salt Lake City, Utah-based startup that has built a community of 1.5 million people who gather to discuss diseases and medical conditions. Now, with Medify on board and a community of people whose discussions about health are themselves a rich source of data, Alliance Health is seeing how mining unstructured data from professionals and patients alike can help improve health.

Medify used the National Library of Medicine’s resources to build ontologies that “teach” its algorithms to understand medical terminology and treatment plans, and then built a user interface around those algorithms. As IBM has found with Watson, its supercomputer that has found a role helping doctors diagnose illnesses based on symptoms, medicine is a good fit for this type of data mining.
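As a toy illustration of the general idea only (this is not Medify’s system, and the three-entry dictionary below merely stands in for ontologies derived from National Library of Medicine resources), concept tagging can start as simple lookups over research text:

```python
# Toy ontology-based concept tagging; the terms and concept names are invented.
import re

ONTOLOGY = {
    "phace syndrome": "Condition:PHACE_syndrome",
    "hemangioma": "Finding:Hemangioma",
    "propranolol": "Treatment:Propranolol",
}

def tag_concepts(text):
    """Return the canonical medical concepts mentioned in a piece of text."""
    lowered = text.lower()
    found = set()
    for term, concept in ONTOLOGY.items():
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found.add(concept)
    return sorted(found)

abstract = ("Propranolol was effective for infantile hemangioma "
            "in patients with PHACE syndrome.")
print(tag_concepts(abstract))
# ['Condition:PHACE_syndrome', 'Finding:Hemangioma', 'Treatment:Propranolol']
```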

The goal is to take the ontologies learned from Medify and combine its algorithms with what people discuss in Alliance Health’s communities. Then, Alliance can apply new algorithms to see who in the community is offering the best advice, understand how patients influence and inspire each other, and then help pharmaceutical companies and even doctors understand and influence how patients make medical decisions.

Getting value from big data doesn’t have to be a big undertaking.

It’s actually a great example of how big data doesn’t have to be as complicated as feeding expensive clinical research through a supercomputer’s algorithms to help doctors diagnose illnesses. Streat says the company’s data is only in the low-terabyte range, and the team processes it using Amazon Web Services, including EC2 and Elastic MapReduce. They add new data every day and refresh their machine learning algorithms weekly, if not every few days.
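At that scale the refresh cycle can be little more than a scheduled script. The sketch below is hypothetical rather than Alliance Health’s pipeline: it pulls newly added documents from an S3 bucket and refits a basic relevance model, with labels (say, editor-curated relevance judgments) assumed to exist.

```python
# Hypothetical weekly refresh: fetch new documents from S3, refit a simple model.
import boto3
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def load_new_documents(bucket="example-health-corpus", prefix="daily/"):
    """Fetch the text bodies added since the last refresh (names are made up)."""
    s3 = boto3.client("s3")
    docs = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            docs.append(body.decode("utf-8"))
    return docs

def refit(documents, labels):
    """Retrain a relevance classifier; `labels` are assumed to come from curation."""
    vectorizer = TfidfVectorizer(stop_words="english")
    features = vectorizer.fit_transform(documents)
    model = LogisticRegression(max_iter=1000)
    model.fit(features, labels)
    return vectorizer, model
```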

There’s a place for these simpler solutions, and by bringing together a community and providing it with information, Alliance Health might become a company much like Spiceworks is in the IT space — able to both monetize and help a community of niche users in a way that benefits everyone. For example, a company that makes a new diabetes test might pay to sponsor the diabetes channel on Alliance or may even pay to find out who the big influencers are in the forums associated with that channel. If done correctly, users might even welcome sponsored how-tos or better information delivered about a new drug or device.

Streat explained that when it comes to combining medicine and data, there are many efforts around devices and even fancier data sets. However, he’s confident that even with things like expensive clinical data locked behind paywalls, just being able to direct people to better answers and give them a sense of community is a good place to start.

Investors seem to think it’s a decent bet as well. Alliance Health has raised $20 million since its founding in 2006 from investors such as New World Ventures, Physic Ventures, Epic Ventures and Highway 12 Ventures.

Change.org, the popular web platform for promoting social change, passed the 20-million user mark last month and expects to surge past 25 million later this year. And yet it runs not a single server in its offices.

“We run nothing in house. I don’t have a single rack in this place. We don’t even have a PBX,” said Kyle VanderBeek, manager of infrastructure for the 6-year-old San Francisco-based organization. The IT team is similarly lean: four people at the 150-person organization handle loosely defined IT functions, but there is “zero staff” dedicated solely to IT. Tasks such as email are shared by VanderBeek himself, HR and other operations people.

Riding the success of online petitions

The operation zoomed to prominence when the parents of Trayvon Martin, the teenager who was shot by a neighborhood watch volunteer, used the site to launch a petition seeking an investigation. That online petition logged 1,000 signatures per minute, so Change.org knows a little something about scaling. To keep the site up and running, the organization looks outside, using software-as-a-service tools for in-office functions. It runs its backend analytics and business intelligence on Amazon Elastic MapReduce (EMR) and its front-end web analytics on Google Analytics and Mixpanel, said Tim James, software engineer.

The key to success is choosing the cloud services best suited for a particular job and, if things change, being able to move. There’s not a lot of orthodoxy. “Six months ago what we really needed was faster disk speed. At that time AWS [Amazon Web Services] did not offer a pure high-speed I/O system so we talked to a number of vendors and found that hybrid hosting at Rackspace worked best so we moved our primary database hosting there,” VanderBeek said.

The company still uses Amazon for archival storage, to serve up its pages and for Simple Workflow Service (SWF), along with the aforementioned EMR for back-end analytics. “We could have gone all-Rackspace using its CDNs and other offerings but we have 20 million users and are getting more international all the time so we looked at all the CDNs and sticking with Amazon made sense — it has 27 points of presence on every continent. It fits our needs and our audience better than other companies.”

Users: Keep your deployment options open

People who deploy workloads to the cloud have to evaluate just how much they will lock themselves into a given platform’s higher-level services. There is controversy even among AWS users about whether it’s wise to use Amazon DynamoDB or even SWF, because doing so makes it harder to move. “If you’re on DynamoDB or RDS [Relational Database Service] that move is not as easy; it’s more a hidden software stack that they manage themselves that makes a transition off more difficult,” he said.
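One common way to keep that kind of move manageable, sketched below purely for illustration (this is not Change.org’s code, and the table and key names are invented), is to hide the provider-specific pieces behind a narrow interface so that only one adapter knows about DynamoDB:

```python
# Thin storage abstraction: the application codes against KeyValueStore, and only
# the adapter below touches boto3/DynamoDB, which keeps a later migration contained.
from abc import ABC, abstractmethod
from typing import Optional
import boto3

class KeyValueStore(ABC):
    @abstractmethod
    def get(self, key: str) -> Optional[dict]: ...

    @abstractmethod
    def put(self, key: str, value: dict) -> None: ...

class DynamoDBStore(KeyValueStore):
    """DynamoDB adapter; moving providers means replacing only this class."""
    def __init__(self, table_name: str):
        self._table = boto3.resource("dynamodb").Table(table_name)

    def get(self, key: str) -> Optional[dict]:
        return self._table.get_item(Key={"id": key}).get("Item")

    def put(self, key: str, value: dict) -> None:
        self._table.put_item(Item={"id": key, **value})
```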

Change.org engineering staff.

He is not concerned about Change.org’s use of EMR or SWF, however. “Migrating would not be difficult in that we are bound to simple workflow for the control of our processing, but we don’t have to run any of the processing or storage itself in AWS,” he said.

The important thing from the end-user standpoint is to steer clear of religious wars, to keep an open mind and to run continual cost-benefit analyses of different deployment options. It’s also important to keep tabs on new features and functions of the platforms, something that could be a full-time job given the pace at which the cloud providers, especially Amazon, update their options.

And keep in mind that while the cloud is excellent for many jobs, there may be times when on-premises IT is the best option. “I may actually order some servers next week,” VanderBeek said. “When you look at economies of scale, every once in a while you can find a business case to make a capital outlay and save money.”

Amazon has become the cloud king, with its Amazon Web Services (AWS) offerings providing cloud-based storage and processing that takes a lot of the cost out of deploying new products and services and developing applications. Netflix, Dropbox and Yelp are all AWS clients, but the most important user might be Amazon itself.

Today at the Web 2.0 Summit, Alyssa Henry, VP of Amazon’s AWS Storage Services, gave one example of how Amazon uses its cloud storage and processing power to handle an issue that is little thought about but vital to its overall profitability: combating warehouse theft.

According to Henry, Amazon has more than 1.5 billion items in its retail catalog and more than 200 fulfillment centers around the world. That’s a lot of objects in a lot of places for the online retailer to keep track of. Keeping the most valuable items protected isn’t as easy as just putting the highest-priced products under lock and key. As Henry said, sometimes, due to limited availability or other factors, a lower-priced product might actually be more highly sought-after by criminals. There’s also the question of how big the cage is, how big the item is, how many items fit in each cage, and so on.

To determine which items are most likely to be stolen, Amazon stores the product catalog data in S3, which receives more than 50 million updates a week. The team spins up Amazon compute clusters every 30 minutes to crunch the data, and the results are fed back to the warehouses and the website. At the center of the service is Elastic MapReduce, a new hosted Hadoop framework running on AWS that lets customers spin up the equivalent of a supercomputer for processing big data.
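To show the shape of that loop, here is a hedged sketch of a single scoring pass, compressed into one process for readability (the real system runs on Elastic MapReduce). The risk heuristic, file layout and bucket names are invented; Amazon has not published the actual model.

```python
# Hypothetical 30-minute scoring pass: read catalog updates from S3, score theft
# risk per item, write the scores back to S3 for warehouses and the website.
import csv
import io
import boto3

s3 = boto3.client("s3")

def theft_risk(item):
    """Toy heuristic echoing the factors above: price alone is not the whole
    story; scarcity and small size also raise the risk."""
    price = float(item["price"])
    scarcity = 1.0 / max(int(item["units_in_stock"]), 1)
    smallness = 1.0 / max(float(item["volume_liters"]), 0.1)
    return price * 0.2 + scarcity * 50.0 + smallness * 5.0

def run_once(in_key, out_key, bucket="example-catalog-updates"):
    body = s3.get_object(Bucket=bucket, Key=in_key)["Body"].read().decode("utf-8")
    rows = list(csv.DictReader(io.StringIO(body)))
    if not rows:
        return
    for row in rows:
        row["theft_risk"] = f"{theft_risk(row):.2f}"
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    s3.put_object(Bucket=bucket, Key=out_key, Body=out.getvalue().encode("utf-8"))
```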

Amazon isn’t the only one using EMR for big data processing: Henry gave Yelp as another example. In its particular use case, Yelp has been leveraging AWS and EMR to improve the autocorrect options in its search function, processing all of the searches users have done and determining which suggested option was the “correct” one, based on which option the most users clicked.
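A sketch of that counting step might look like the following, written with mrjob, the open source MapReduce library Yelp built for running jobs on EMR. The tab-separated click-log format and field order here are assumptions, not Yelp’s actual code: the job tallies clicks per (raw query, suggested correction) pair so the most-clicked suggestion can be treated as the correct one.

```python
# Hypothetical mrjob job: count clicks per (raw query, clicked correction) pair.
from mrjob.job import MRJob

class CountCorrectionClicks(MRJob):
    def mapper(self, _, line):
        # Assumed log format: "<raw query>\t<correction the user clicked>"
        raw_query, clicked_correction = line.rstrip("\n").split("\t")[:2]
        yield (raw_query, clicked_correction), 1

    def reducer(self, query_correction, counts):
        yield query_correction, sum(counts)

if __name__ == "__main__":
    CountCorrectionClicks.run()
```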