
zulily is a flash sales company. We post a product on the site, and poof… it’s gone in 72 hours. Online ads for those products come and go just as fast, which doesn’t leave us much time to manually evaluate the performance of the ads and take corrective action if needed. To optimize our ad spend, we need to know in real time how each ad is doing, and that is exactly what we engineered.

While we track multiple metrics to measure the impact of an ad, I am going to focus on one that provides a good representation of the system architecture. This is an engineering blog after all!

The metric in question is Cost per Total Activation, or CpTA for short. The formula is simple: divide the total cost of the ad by the number of customer activations. We call the numerator in this formula “spend” and refer to the denominator as an “activation”. For example, if an ad costs zulily $100 between midnight and 15:45 PST on January 31 and results in 20 activations, the CpTA for this ad as of 15:45 PST is $100/20 = $5.

Here’s how zulily collects this metric in real time. For the sake of simplicity, I will skip the archiving processes that are sprinkled on top of the architecture below.

The source of the spend for the metric is an advertiser API, e.g. Facebook. We’ve implemented a Spend Producer (in reference to the Producer-Consumer model) that queries the API every 15 minutes for live ads and pushes the spend into MongoDB. Each spend record has a tracking code that uniquely identifies the ad.
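To make the flow concrete, here is a minimal sketch of such a producer loop. The fetchLiveAdSpend() method is a hypothetical stand-in for the real advertiser API call, and the connection string, database, and field names are illustrative as well:

```java
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.time.Instant;
import java.util.List;

public class SpendProducer {
    // Hypothetical shape of one spend record returned by the advertiser API.
    record SpendRecord(String trackingCode, double spend) {}

    // Stand-in for the real advertiser API call (e.g. the Facebook Ads API).
    static List<SpendRecord> fetchLiveAdSpend() {
        return List.of(new SpendRecord("FB-12345", 100.0));
    }

    public static void main(String[] args) throws InterruptedException {
        MongoCollection<Document> spend = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("ads").getCollection("spend");

        while (true) {
            for (SpendRecord r : fetchLiveAdSpend()) {
                // The tracking code uniquely identifies the ad.
                spend.insertOne(new Document("trackingCode", r.trackingCode())
                        .append("spend", r.spend())
                        .append("fetchedAt", Instant.now().toString()));
            }
            Thread.sleep(15 * 60 * 1000L); // poll the API every 15 minutes
        }
    }
}
```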

The source for the activations is a Kafka stream of purchase orders that customers place with zulily. We consume these orders and throw them into an AWS Kinesis stream. This gives us the ability to process and archive the orders without putting extra strain on Kafka. It’s important to note that relevant orders carry the ad’s tracking code, just like the spend. That’s the link that glues spend and activations together.
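A bare-bones version of that Kafka-to-Kinesis bridge might look like the following; the broker address, topic, and stream names are assumptions:

```java
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class OrderBridge {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092"); // assumed broker address
        props.put("group.id", "order-bridge");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("purchase-orders")); // assumed topic
            AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

            while (true) {
                ConsumerRecords<String, String> orders = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> order : orders) {
                    // Forward each order (which carries the ad's tracking code) to Kinesis.
                    kinesis.putRecord("purchase-orders-stream", // assumed stream name
                            ByteBuffer.wrap(order.value().getBytes(StandardCharsets.UTF_8)),
                            order.key() != null ? order.key() : "default");
                }
            }
        }
    }
}
```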

The Activation Evaluator application examines each purchase and determines whether it is an activation. To do that, it looks up the customer’s previous purchase in a MongoDB collection, keyed by the customer Id on the purchase order. If the most recent transaction is non-existent or older than X days, the purchase is an activation. The Activation Evaluator then updates the customer record with the date of the new purchase. To make sure we don’t drop any data if the Activation Evaluator runs into issues, we don’t move the checkpoint in the Kinesis stream until the write to Mongo is confirmed.
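The decision boils down to one lookup and one upsert. A minimal sketch, assuming last-purchase dates live in a customers collection and using 90 days as a purely illustrative value of X:

```java
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

import java.time.Duration;
import java.time.Instant;

public class ActivationEvaluator {
    static final Duration X = Duration.ofDays(90); // illustrative value of X

    final MongoCollection<Document> customers =
            MongoClients.create("mongodb://localhost:27017")
                    .getDatabase("ads").getCollection("customers");

    boolean isActivation(String customerId, Instant purchasedAt) {
        Document prev = customers.find(Filters.eq("customerId", customerId)).first();
        // Activation if there is no previous purchase, or it is older than X days.
        boolean activation = prev == null
                || Instant.parse(prev.getString("lastPurchaseAt"))
                          .isBefore(purchasedAt.minus(X));

        // Record the new purchase date. The Kinesis checkpoint is advanced only
        // after this write is acknowledged, so a failure here means the order is
        // re-processed later rather than lost.
        customers.updateOne(Filters.eq("customerId", customerId),
                Updates.set("lastPurchaseAt", purchasedAt.toString()),
                new UpdateOptions().upsert(true));
        return activation;
    }
}
```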

The Activation Evaluator sends evaluated purchases into another Kinesis stream. Chaining Kinesis streams is a pretty common pattern for AWS applications, as it allows for separation of concerns and makes the whole system more resilient to failures of individual components.

The Activation Calculator reads the evaluated purchases from the second Kinesis stream and captures them in Mongo. We index the data by tracking code and timestamp, and voila, a simple count() will return the number of activations for a specified period.
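Assuming the evaluated purchases land in an activations collection with trackingCode and timestamp fields (the names are illustrative), that count looks like this:

```java
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

import java.time.Instant;

public class ActivationCalculator {
    // Counts activations for one ad within a time window. With a compound index
    // on (trackingCode, timestamp), this is a cheap index scan.
    static long countActivations(MongoCollection<Document> activations,
                                 String trackingCode, Instant from, Instant to) {
        return activations.countDocuments(Filters.and(
                Filters.eq("trackingCode", trackingCode),
                Filters.gte("timestamp", from.toString()),
                Filters.lt("timestamp", to.toString())));
    }
}
```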

The last step in the process is to take the spend and divide it by the number of activations. Done.
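The division itself is a one-liner; the only wrinkle worth sketching is guarding against ads that have spend but no activations yet:

```java
public class Cpta {
    // CpTA = total spend / number of activations over the same window.
    // Returns NaN when there are no activations yet, rather than dividing by zero.
    static double cpta(double totalSpend, long activations) {
        return activations == 0 ? Double.NaN : totalSpend / activations;
    }

    public static void main(String[] args) {
        System.out.println(cpta(100.0, 20)); // prints 5.0, matching the earlier example
    }
}
```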

With this architecture, zulily measures a key advertising performance metric every 15 minutes and uses it to pause poorly performing ads. The metric also serves as an input to various Machine Learning models, but more on those in a future blog post… Stay tuned!!

Clickstream is the largest dataset at zulily. As mentioned in zulily’s big data platform, we store all our data in Google Cloud Storage and use a Hadoop cluster in Google Compute Engine to process it. Our data has been growing at a significant pace, which required us to reevaluate our design, and the fastest-growing dataset of all was clickstream.

We had to decide whether to drastically increase the size of the cluster or try something new. We came across Cloud Dataflow and started evaluating it as an alternative.

Why did we move clickstream processing to Cloud Dataflow?

Instead of increasing the size of the cluster to manage peak clickstream loads, Cloud Dataflow gave us an on-demand cluster that could dynamically grow or shrink, reducing costs.

Clickstream data was growing very fast, and the ability to isolate its processing from other operational data processing would help the whole system in the long run.

Clickstream data processing did not require the complex correlations and joins that are a must in some of our other datasets – this made the transition easy.

The programming model was extremely simple, and the transition was easy from a development perspective.

Most importantly, we have always thought of our data processing clusters as transient entities, so dropping in another engine, in this case Cloud Dataflow, was not a big deal even though it was different from Hadoop. Our data would still reside in GCS, and we could still use the clickstream data from the Hadoop cluster where required.

High Level Flow

One of the core principles of our clickstream system is that it be self-serviceable. In short, teams can start recording new types of events at any time just by registering a schema with our service. Clickstream events are pushed to our data platform through our real-time system (more on that later) and through log shipping to GCS. From there, the Dataflow pipeline does the following:

Create multiple outputs from the data in the logs, based on event types and schemas, using PCollectionTuple (sketched in the code after these steps).

Apply ParDo transformations to the data to optimize it for querying by business users (e.g. transform all dates to PST).

Store the data back into GCS, which is our central data store, both for use by other Hadoop processes and for loading into BigQuery.
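Condensed into one pipeline, the steps above look roughly like this. It is a sketch in the Apache Beam Java SDK (the successor to the original Dataflow SDK); the event routing, the PST conversion stub, and all tags and paths are illustrative:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

public class ClickstreamPipeline {
    // One output tag per event type; two are shown for illustration.
    static final TupleTag<String> PAGE_VIEWS = new TupleTag<String>() {};
    static final TupleTag<String> CLICKS = new TupleTag<String>() {};

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // 1. Read the raw clickstream logs shipped to GCS.
        PCollection<String> logs =
                p.apply(TextIO.read().from("gs://zulily-logs/clickstream/*"));

        // 2. Split the logs into one output per event type via PCollectionTuple.
        PCollectionTuple byType = logs.apply(ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void process(ProcessContext c) {
                if (c.element().contains("\"event\":\"page_view\"")) { // illustrative routing
                    c.output(c.element());          // main output: page views
                } else {
                    c.output(CLICKS, c.element());  // side output: clicks
                }
            }
        }).withOutputTags(PAGE_VIEWS, TupleTagList.of(CLICKS)));

        // 3. Normalize each output for business users, e.g. rewrite dates to PST
        //    (the conversion itself is stubbed out here).
        PCollection<String> pageViews = byType.get(PAGE_VIEWS)
                .apply(ParDo.of(new DoFn<String, String>() {
                    @ProcessElement
                    public void process(ProcessContext c) {
                        c.output(toPst(c.element()));
                    }
                }));

        // 4. Write the results back to GCS for Hadoop jobs and BigQuery loads.
        pageViews.apply(TextIO.write().to("gs://zulily-data/clickstream/page_views"));

        p.run();
    }

    static String toPst(String line) {
        return line; // stand-in for the real date conversion
    }
}
```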

Loading into BigQuery is not done directly from Dataflow; instead we use the bq load command, which we found to be much more performant.
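For reference, a typical bq load invocation for this pattern might look like the following; the dataset, table, GCS path, and schema file are all assumptions:

```
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  clickstream.page_views \
  "gs://zulily-data/clickstream/page_views*.json" \
  ./page_views_schema.json
```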

The next step for us is to identify other scenarios that are a good fit to transition to Cloud Dataflow: like clickstream, these will involve high volumes of unstructured data that do not require heavy correlation with other datasets.