Big Data For Less – Dealing with Large Data Sets on a Startup’s Budget

Building large data applications presents a unique set of technical challenges, because things that work well in a conventional development environment can become incredibly arduous or expensive when applied on a much bigger scale. This talk will cover some of those challenges and potential solutions for each. In particular, the talk is targeted at people who want to leverage lots of data but are also trying to keep their costs under control.

Here are some examples:

Lose your staging/test environment. Normally people have a dev environment, a staging environment, and then production. We roll straight to production (since paying for all that hardware can be costly) and try to live by the mantra “fail early and save state.”

Throw data away as soon as you don’t need it anymore. A lot of the time people like to save old data to do investigations, debug problems (ever hear of root cause analysis?), or use later on. Each month we pull in over 600TB of data during our crawl, pull out the stuff we care about, compress it, and then toss the rest so it never has to hit the disks of our machines.
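A minimal sketch of the extract-then-discard idea, written as a streaming pipeline so that raw pages are never stored in full. The `link:` record format and the function names `extract_links` and `compress_stream` are assumptions made up for this illustration, not the actual crawler code.

```python
import gzip
import io
from typing import Iterable, Iterator


def extract_links(pages: Iterable[str]) -> Iterator[str]:
    """Yield only the fields we care about; everything else is dropped.

    Assumption: each page is text and the interesting lines start with 'link:'.
    """
    for page in pages:
        for line in page.splitlines():
            if line.startswith("link:"):
                yield line[len("link:"):].strip()


def compress_stream(records: Iterable[str]) -> bytes:
    """Gzip records one at a time, so only the compressed output is retained."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for record in records:
            gz.write(record.encode("utf-8") + b"\n")
    return buf.getvalue()


if __name__ == "__main__":
    crawl = [
        "title: a\nlink: http://x.example/\nbody: lots of text we do not keep",
        "link: http://y.example/",
    ]
    blob = compress_stream(extract_links(crawl))
    print(len(blob), "compressed bytes kept")
```

Because both stages are generators, each page can be garbage-collected as soon as its useful lines have been emitted.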

Store compressed data. Search engines have claimed to keep their entire web index in RAM to reduce latency. We not only store our data on disk, we also compress it, which means we have to decompress it before giving it to users. Caches and indexes in our API design keep the resulting latency acceptable.
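One way to picture how a cache can hide decompression cost is the sketch below. The in-memory `COMPRESSED_BLOCKS` dictionary stands in for compressed blocks on disk, and `load_block` and `lookup` are hypothetical names; this is an illustration of the general pattern, not the actual API implementation.

```python
import zlib
from functools import lru_cache

# Hypothetical stand-in for an on-disk store: block id -> compressed bytes.
COMPRESSED_BLOCKS = {
    0: zlib.compress(b"alpha\nbeta\n"),
    1: zlib.compress(b"gamma\ndelta\n"),
}


@lru_cache(maxsize=1024)
def load_block(block_id: int) -> bytes:
    """Decompress a block; repeated requests for a hot block hit the cache."""
    return zlib.decompress(COMPRESSED_BLOCKS[block_id])


def lookup(block_id: int, row: int) -> str:
    """Return one row of a block, paying decompression cost only on a cache miss."""
    return load_block(block_id).split(b"\n")[row].decode("utf-8")


if __name__ == "__main__":
    print(lookup(1, 0))
    print(lookup(1, 1))  # served from the cached, already-decompressed block
    print(load_block.cache_info())
```

The cache size is the knob here: it trades RAM (expensive) against repeated CPU work for decompression.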

Make assumptions about data reliability. In cloud computing many people encourage you to keep multiple copies of your data, so that if a node ever disappears your data is still available. When saving money, one has to accept some chance of ruin (and of data going missing). The mantra of having backups granular and reliable enough that a failure is an unnoticeable blip can get too expensive. We had checkpoints backing up to S3, which we removed to keep costs down. Later, when the company was profitable, we added them back.

Computer time > developer time. People often say it is cheaper to throw hardware at a problem than engineering resources. In our case, that definitely isn’t true. We have had developers rewrite some standard libraries (like integer compression) to squeeze out additional performance and save money.
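For readers unfamiliar with integer compression, the sketch below shows variable-byte (varint) encoding, one common and well-known scheme in which small numbers take fewer bytes. It is offered only as an example of the technique; the talk does not say which scheme was rewritten, and `encode_varints` and `decode_varints` are names invented here.

```python
from typing import Iterable, List


def encode_varints(values: Iterable[int]) -> bytes:
    """Encode non-negative integers, 7 bits per byte, low-order group first.

    The high bit of each byte is set when more bytes follow for that integer.
    """
    out = bytearray()
    for v in values:
        if v < 0:
            raise ValueError("only non-negative integers are supported")
        while v >= 0x80:
            out.append((v & 0x7F) | 0x80)
            v >>= 7
        out.append(v)
    return bytes(out)


def decode_varints(data: bytes) -> List[int]:
    """Inverse of encode_varints."""
    values = []
    current = 0
    shift = 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation bit set: more bytes belong to this integer
        else:
            values.append(current)
            current = 0
            shift = 0
    return values


if __name__ == "__main__":
    ids = [3, 127, 128, 300, 70000]
    packed = encode_varints(ids)
    print(len(packed), "bytes instead of", 4 * len(ids), "as fixed 32-bit integers")
    print(decode_varints(packed))
```

Schemes like this are usually combined with sorting and delta-encoding so that most stored values are small.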

Understand your data compression. With big data, compression can save disk space and network bandwidth (which is important if you are computing across a lot of boxes), but the more you compress, the more CPU time it will cost. You should strive to find the balance where the costs make sense for your data set (for us, that meant writing our own compression libraries for our use case).
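A simple way to explore this trade-off on your own data is to time a general-purpose compressor at several levels. The sketch below uses Python's standard `zlib` module; the helper name `measure_levels` and the sample data are assumptions for illustration, and real measurements should be taken on a representative slice of your own data set.

```python
import time
import zlib
from typing import List, Sequence, Tuple


def measure_levels(data: bytes, levels: Sequence[int] = (1, 6, 9)) -> List[Tuple[int, int, float]]:
    """Return (level, compressed_size_in_bytes, seconds_spent) for each zlib level."""
    results = []
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        results.append((level, len(compressed), elapsed))
    return results


if __name__ == "__main__":
    # Made-up sample: repetitive URL-like text. Replace with real data.
    sample = b"".join(b"http://example.com/page/%d\n" % i for i in range(200000))
    for level, size, seconds in measure_levels(sample):
        ratio = size / len(sample)
        print(f"level={level} size={size} ratio={ratio:.3f} time={seconds:.4f}s")
```

Multiplying the time column by your CPU cost and the size column by your storage and transfer cost gives a rough per-level price to compare.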

Don’t put everything in the cloud. People assume that you can save money with cloud computing, but that isn’t always the case – check your math and build your software to be deployed anywhere.

The talk format will be technical in nature, going through the points above as well as some system design and architecture options that can impact your budget. Attendees will come away with some creative ideas to save money on their “big data” applications and a detailed insight into how some of these items have worked in production.

Kate Matsudaira

SEOmoz

Kate Matsudaira is Vice President of Engineering at SEOmoz, where she is responsible for managing the core technology team. Prior to SEOmoz, she was VP of Engineering at another startup, Delve Networks (acquired by Limelight). At Delve she helped create and monetize a very large distributed system used for online video delivery and video search. Prior to that she worked at other leading technology companies, including Amazon.com, Microsoft, and Sun Microsystems.
Kate has extensive knowledge of building large scale distributed web systems, web services, and search. Kate has a B.S. in Computer Science from Harvey Mudd College, and has completed graduate work at the University of Washington in both Business and Computer Science (M.S.).