The Five Minute Interview – Gnip

This article is one in a series of quick-hit interviews with companies using Apache Cassandra and/or DataStax Enterprise for key parts of their business. For this interview, we spoke with Greg Greenstreet, VP of engineering at Gnip.

DataStax: Greg, thanks for the time today. Please give us an overview of what Gnip is all about.

Greg: Primarily, we serve as the most reliable source of social data to the world. That may sound ambitious, but from a practical perspective we front publishers such as Twitter, Tumblr, Facebook, WordPress, and many more. We take the firehoses from those publishers and provide that data to our customers who want to leverage it for their business.

We’ve been in business since 2008, and our customers currently serve social data to 90% of the Fortune 500.

DataStax: What does your infrastructure look like right now?

Greg: We use a combination of on-premises hardware and systems running in cloud providers. From a development perspective, we use Java for a lot of our data processing and Ruby for front-end work.

DataStax: Am I right in saying you guys have a classic big data use case?

Greg: We have both the big data and big bandwidth problems to solve. From a big data perspective, we currently serve out more than 100 billion activities per month to our customers, plus we save all of that data historically. It’s not uncommon for us to digest 20,000 tweets per second, so data comes in very, very fast. And that’s just from one publisher.
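To put those figures in perspective, here is a back-of-the-envelope calculation. It is only a sketch: the 30-day month and the function names are assumptions made for illustration, and the only inputs taken from the interview are the 100 billion activities per month and the 20,000 tweets per second.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_per_second(activities_per_month: float, days_in_month: int = 30) -> float:
    """Average delivery rate implied by a monthly total (assumes a 30-day month)."""
    return activities_per_month / (days_in_month * SECONDS_PER_DAY)


def daily_volume(rate_per_second: float) -> float:
    """Number of items ingested in one day at a sustained per-second rate."""
    return rate_per_second * SECONDS_PER_DAY


if __name__ == "__main__":
    # 100 billion activities per month served to customers (figure from the interview).
    served = average_rate_per_second(100_000_000_000)
    print(f"average activities served per second: {served:,.0f}")

    # 20,000 tweets per second from a single publisher (figure from the interview).
    ingested = daily_volume(20_000)
    print(f"tweets ingested per day at that rate: {ingested:,.0f}")
```

Under these assumptions, the monthly total works out to an average of roughly 38,580 activities served per second, and a sustained 20,000 tweets per second amounts to about 1.7 billion tweets per day.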

We offer both real-time and historical capabilities for all our premium social data publishers. Our core business was built on real-time data delivery, but increasingly there is demand for a historical perspective across our publishers.

The real-time business has high bandwidth requirements, so we like to control the network and compute resources, whereas batch processing can be pushed off efficiently to cloud platforms.

DataStax: What brought you to Cassandra?

Greg: We’re not an analytics company; instead, we serve all the best companies in social media analytics, business intelligence, finance and ad tech. We are more concerned with real-time processing than batch analytics, so Cassandra was a more natural platform for us than others.

As you can imagine, the write load for us is massive. We need our systems to scale horizontally because the data for Twitter alone can triple in size in just one year. As an example, one project that keeps a week’s worth of data online in a rolling window has tens of TB just for that one week.

So we need a real-time, massively scalable architecture, where no one node is a point of failure, that can easily span multiple data centers and cloud availability zones, and that’s Cassandra.

DataStax: Did you start out using Cassandra or something else?

Greg: We began by using a Lucene-based system, but that quickly fell down in the face of the write and read loads we have.

DataStax: What are some example use cases that Cassandra covers for you?

Greg: One big area for us is compliance. For example, if someone deletes a tweet they made a year ago, we’re not allowed to serve that up historically to our customers. Cassandra was the only database that could handle that type of activity on that much data for us.
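The compliance rule described here can be illustrated with a very small sketch. This is not Gnip's implementation: the `filter_compliant` function, the `id` field, and the in-memory set of deleted IDs are all hypothetical, chosen only to show the idea of checking each historical activity against a record of deletions before serving it.

```python
from typing import Any, Dict, Iterable, Iterator, Set


def filter_compliant(
    activities: Iterable[Dict[str, Any]],
    deleted_ids: Set[int],
) -> Iterator[Dict[str, Any]]:
    """Yield only activities whose ID has not been recorded as deleted.

    `activities` and `deleted_ids` are hypothetical stand-ins for a
    historical archive and a deletion log.
    """
    for activity in activities:
        if activity["id"] not in deleted_ids:
            yield activity


if __name__ == "__main__":
    archive = [
        {"id": 1, "text": "first post"},
        {"id": 2, "text": "later deleted by its author"},
        {"id": 3, "text": "still public"},
    ]
    deleted = {2}
    for activity in filter_compliant(archive, deleted):
        print(activity["id"], activity["text"])
```

At the data volumes discussed in the interview, the deletion record would have to live in a store that can absorb a very high write rate and answer lookups quickly, which is the role Greg describes Cassandra playing.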

We started with Cassandra for the compliance system, but since then, we also use Cassandra for many other projects internally. Now we’re using Cassandra to serve the payload of our data; it’s the source of record for us. Cassandra’s also exceptionally good at time series data so we use it wherever time series use cases are involved.
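One common way to organize time series data in a partitioned store is to bucket records by source and time period, so that each partition stays bounded and old periods can be dropped as a unit. The sketch below shows that general idea with plain Python dictionaries. It is an assumption-laden illustration, not Gnip's schema: the daily bucket size, the `TimeSeriesStore` class, and its method names are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def day_bucket(ts: datetime) -> str:
    """Map a timezone-aware timestamp to a UTC day bucket such as '2013-01-10'."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")


class TimeSeriesStore:
    """In-memory stand-in for a store partitioned by (publisher, day)."""

    def __init__(self) -> None:
        self._partitions = defaultdict(list)

    def write(self, publisher: str, ts: datetime, payload: str) -> None:
        # All records for one publisher and one day land in the same partition.
        self._partitions[(publisher, day_bucket(ts))].append((ts, payload))

    def read_day(self, publisher: str, day: str) -> list:
        # Return one partition's records ordered by timestamp.
        rows = self._partitions.get((publisher, day), [])
        return sorted(rows, key=lambda row: row[0])

    def expire_older_than(self, now: datetime, keep_days: int) -> int:
        """Drop whole partitions older than the rolling window; return how many were dropped."""
        cutoff = day_bucket(now - timedelta(days=keep_days))
        # ISO-formatted dates compare correctly as strings.
        stale = [key for key in self._partitions if key[1] < cutoff]
        for key in stale:
            del self._partitions[key]
        return len(stale)


if __name__ == "__main__":
    store = TimeSeriesStore()
    now = datetime(2013, 1, 10, 12, 0, tzinfo=timezone.utc)
    store.write("twitter", now, "recent activity")
    store.write("twitter", now - timedelta(days=9), "old activity")
    print("partitions dropped:", store.expire_older_than(now, keep_days=7))
    print(store.read_day("twitter", day_bucket(now)))
```

Keeping a week of data online in a rolling window, as mentioned earlier in the interview, then amounts to expiring buckets that fall outside the window.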

DataStax: Do you use other databases besides Cassandra?

Greg: We still use some legacy relational databases for small application support and Redis in some areas, but it’s primarily Cassandra for big data storage and retrieval.

DataStax: What advice would you give to help people get started with Cassandra?

Greg: I’d say the primary thing to know up front is how to size the data on the nodes and determine the cluster configuration you need to support your expected I/O traffic and data volumes. Knowing how to grow the cluster efficiently and handle the various maintenance tasks is very important, especially if you’re dealing with many TB of data like we are. If you’re going to err in one direction, it’s better to oversize your cluster than to undersize it.
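A first-pass sizing estimate along these lines can be sketched as simple arithmetic. Everything in the example is an assumption for illustration: the `nodes_needed` function, the replication factor, the per-node capacity, and the 50% headroom are hypothetical placeholders, not recommendations from Gnip or DataStax. Real sizing also has to account for I/O throughput, which this sketch ignores.

```python
import math


def nodes_needed(
    raw_data_tb: float,
    replication_factor: int,
    usable_tb_per_node: float,
    headroom: float = 0.5,
) -> int:
    """Estimate node count from data volume alone.

    headroom is the fraction of each node's disk deliberately left free
    (hypothetical default of 50%), reflecting the advice to oversize.
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    # Every row is stored replication_factor times across the cluster.
    total_tb = raw_data_tb * replication_factor
    # Only part of each node's disk is planned for data.
    effective_tb_per_node = usable_tb_per_node * (1 - headroom)
    # Never plan for fewer nodes than replicas.
    return max(replication_factor, math.ceil(total_tb / effective_tb_per_node))


if __name__ == "__main__":
    # Hypothetical inputs: 30 TB of raw data, 3 replicas, 2 TB usable per node.
    print(nodes_needed(raw_data_tb=30, replication_factor=3, usable_tb_per_node=2.0))
```

Raising the headroom value errs toward a larger cluster, which matches the direction Greg suggests erring in.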

DataStax: Greg, thanks for sharing what you guys are doing with Cassandra.