The Five Minute Interview – SimpleReach

This article is one in a series of quick-hit interviews with companies using Apache Cassandra and/or DataStax Enterprise for key parts of their business. For this interview, we chatted with Eric Lubow who is CTO of SimpleReach, which is headquartered in New York City.

DataStax: Eric, tell us what you guys are doing at SimpleReach.

Eric: What we’ve done is create a platform that is basically Google Analytics for social media. We help major companies understand how well their websites and content are doing socially.

DataStax: What kind of data challenges does such a thing bring to the table for you?

Eric: So, to do what we do effectively, we have to amass an enormous amount of data. And I’m talking everything from general market data to time-series data about each social vertical, which includes every type of social interaction: a tweet, a Facebook “like”, a Digg, a Reddit up/down vote, you name it.

DataStax: How do Cassandra and DataStax Enterprise help you?

Eric: Cassandra makes it possible for us to ingest all of that high-velocity, high-volume data very quickly, and to store it in a way that helps us analyze data patterns in a time-series fashion.
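A common way to model this kind of workload in Cassandra is to bucket time-series rows by entity and day, so each partition stays bounded and writes spread evenly across the cluster. The sketch below illustrates that bucketing pattern in plain Python; the names (`page_id`, the day-sized bucket) are illustrative assumptions, not SimpleReach's actual schema.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Illustrative stand-in for a Cassandra partition map: each
# (page_id, day) key plays the role of a partition key, and the
# list of (timestamp, kind) tuples plays the role of clustering
# rows ordered by time within that partition.
partitions = defaultdict(list)

def record_interaction(page_id: str, kind: str, ts: datetime) -> None:
    """Append a social interaction under a day-sized bucket key."""
    bucket = (page_id, ts.strftime("%Y-%m-%d"))
    partitions[bucket].append((ts, kind))

def interactions_for_day(page_id: str, day: str):
    """Read one bucket: all interactions for a page on a given day."""
    return sorted(partitions[(page_id, day)])

# Usage: two interactions land in the same daily bucket.
now = datetime(2012, 6, 1, 12, 0, tzinfo=timezone.utc)
record_interaction("article-42", "tweet", now)
record_interaction("article-42", "facebook_like", now)
print(len(interactions_for_day("article-42", "2012-06-01")))  # 2
```

Keeping partitions day-sized means no single row grows without bound as interactions accumulate, while a day's worth of data for one page can still be read in a single partition scan.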

DataStax: What type of configuration do you use?

Eric: Right now we keep about 5 terabytes of live information across one Cassandra cluster of about 12 nodes for real-time data and a DataStax Enterprise cluster of 3 nodes for analytics with Hive and MapReduce, with the two clusters in different data centers. Over the next two months, we predict that the amount of data we have to manage will absolutely explode.

DataStax: What other things can you tell us about your use of Cassandra and DataStax Enterprise?

Eric: We use the right tool for the right job. Cassandra and DataStax Enterprise are parts of our infrastructure, but we also use MongoDB for some things, as well as Redis and a MySQL-based columnar database. Cassandra is our authoritative, primary datastore.

We love Cassandra because it can ingest massive amounts of data. Its ability to accept huge quantities of writes at any node in a cluster and have the data show up everywhere else has become absolutely invaluable to us. Other options, like MongoDB, clearly would not support such a thing.