Seattle's Best Tech Team

Main menu

Post navigation

Hi, I’m Mark — a mathematician turned a coffeeshop owner turned an actuary turned a software engineer. When I discovered my passion for software development, I quit my day job, went through an immersive development bootcamp and got recruited by Zulily’s Member Engagement Platform team. This post is about my first six months here.

The amount of collective knowledge and experience around me was stunning.

I was expected to work directly with the Marketing team to clarify requirements. No middlemen.

Iteration was king!

Things that I am still learning with the help from my team:

Form validation in React;

Designing and building high-scale distributed systems;

Tuning AWS services to match our use cases;

On-call rotations (the team owns several tier-1 services);

And many more…

There is one recurring theme in all these experiences. No matter how new, huge, or unfamiliar a task may seem at first, it’s just a matter of breaking it down methodically into manageable chunks that can then be understood, built, and tested. This process can be time consuming, and I feel lucky that my team understands that. It’s clear that engineers at Zulily are encouraged to take the time and build products that last.

One of the things that I find truly amazing here is the breadth of work that is considered ‘full stack’. DevOps, Big Data, Micro Services, and React client apps are just some of the areas in which I have been able to expand my knowledge in the last several months. It may have been overwhelming at first, but when you are among teammates that have vast expertise, acquiring these new skills became an exciting part of my daily routine at Zulily.

It’s hard to compare what I know now to what I knew six months ago —my understanding has expended in breadth and depth — and I’m excited to see what the next six months will bring.

Share:

Like this:

The
appeal of serverless architecture sometimes feels like a ­­silver bullet: Offload
your application code to your favorite cloud providers’ serverless offering and
enjoy the benefits of extreme scalability, rapid development, and pay-for-use
pricing. At this point in time most cloud providers’ solutions and tooling are
well defined, and you can get your code executing in a serverless fashion with
little effort.

The
waters start to get a little murkier when considering making an entire
application serverless while maintaining the coding standards and
maintainability of traditional server-based architecture. Serverless has a
tendency to become ownerless (aka hard to debug and painful to work on) over
time without the proper structure, tooling and visibility. Eventually these
applications suffer due to code fragility and slow development time, despite
how easy and fast they were to prototype.

On
the pricing team at Zulily, our core services are written in Java using the
Dropwizard framework. We host on AWS using Docker and Kubernetes and deploy
using Gitlab’s CICD framework. For most of our use cases this setup works well,
but it’s not perfect for everything. We do feel confident though that this
setup provides the structure and tools that help in avoiding the ownerless trap
and this setup helps us achieve our goals of well-engineered service
development. While this list of goals will differ between teams and projects,
for our purposes it can be summarized below:

Goals:

Testability
(Tests & Local debugging & Multiple environments)

Observability
(Logging & Monitoring)

Developer
Experience (DRY & Code Reviews & Speed & Maintainability)

Deployments

Performance
(Latency & Scalability)

We
are currently redesigning our competitive shopping pipeline and thought this
would be a good use case to experiment with a serverless design. For our use
case we were most interested in its ability to handle unknown and intermittent
load and were also curious to see if it improved our speed of development and
overall maintainability. We think we have been able to leverage the benefits of
a serverless architecture while avoiding some of the pitfalls. In this post,
I’ll provide an overview of the choices we made and lessons we learned.

OUR STACK IN A NUTSHELL AND HOW IT WORKS

There
are many serverless frameworks, some are geared towards being platform agnostic,
others focus on providing better UIs, some make deployment a breeze, etc… We
found most of these frameworks were abstraction layers on top of AWS/GCP/Azure
and since we are primarily an AWS-leaning shop at Zulily, we decided to stick
with their native tools where we could, knowing that later we could swap out
components if necessary. Some of these tools were already familiar to the team
and we wanted to be able to take advantage of any new releases by AWS without
waiting for an abstraction layer to implement said features.

Basic architecture of serverless stack:

Below
is a breakdown of the major components/tools we are using in our serverless
stack:

API Gateway:

What it is: Managed
HTTP service that acts as a layer (gateway) in front of your actual application
code.

How we use it: API gateway
handles all the incoming traffic and outgoing responses to our service. We use
it for parameter validation, request/response transformation, throttling, load
balancing, security and documentation. This allows us to keep our actual
application code simple while we offload most of the boilerplate API logic and
request handling to API Gateway. API gateway also acts as a façade layer where
we can provide clients with a single service and proxy requests to multiple
backends. This is useful for supporting legacy Java endpoints while the system
is migrated and updated.

Lambda:

What it is: Lambda is a serverless compute service that runs your code in
response to events and automatically manages the underlying compute resources.

How we use it: We use Lambdas
to run our application code using the API gateway Lambda proxy integration. Our
Lambdas are written in Nodejs to reduce cold start time and memory footprint
(compared to Java). We have a 1:1 relationship between an API endpoint and Lambda
to ensure we can fine tune the memory/runtime per Lambda and utilize API Gateway
to its fullest extent, for example, to bounce malformed requests at the gateway
level and prevent Lambda execution. We also use Lambda layers, a feature
introduced by AWS in 2018, to share common code between our endpoints while
keeping the actual application logic isolated. This keeps our deployment size
smaller per lambda and we don’t have to replicate shared code.

SAM (Cloudformation):

What it is: AWS SAM is
an open-source framework that you can use to build serverless applications on
AWS. It is an extension of Cloudformation which lets you bundle and define
serverless AWS resources in template form, create stacks in AWS, and enable
permissions. It also provides a set of command line tools for running
serverless applications locally.

How we use it: This is the
glue for our whole system. We use SAM to describe and define our entire stack
in config files, run the application locally, and deploy to integration and
production environments. This really ties everything together and without this
tool we would be managing all the different components separately. Our API
gateway, Lambdas (+ layers) and permissions are described and provisioned in
yaml files.

How we use it: We use
DynamoDB for our ingestion pipeline getting data into our system. Dynamo was
the initial reason we chose to experiment with serverless for this project.
Your serverless application is limited by your database’s ability to scale
fluidly. In other words, it doesn’t matter if your application can handle large
spikes in traffic if your database layer is going to return capacity errors in
response with this increase in load. Our ingestion pipeline spikes daily from
~5 to ~700 read or write capacity units per second and is scheduled to
increase, so we wanted to make sure throwing additional batches of reads or
writes to our datastore from our serverless API wasn’t going to cause a
bottleneck during a spike. In addition to being able to scale fluidly, Dynamo
has also simplified our Lambda–datastore interaction as we don’t have the
complexity overhead of trying to open and share database connections between
lambdas because Dynamo’s low-level API uses HTTP(S). This is not to say it’s
impossible to use a different database (lots of examples out there of this),
but its arguably simpler to use something like Dynamo.

Normal spikey ingestion traffic for our service with dynamo
scaling

Other tools: Gitlab CICD, Cloudwatch, Elk stack, Grafana:

These
are fairly common tools, so we won’t go into detail about what they are.

How we use it: We use
Gitlab CICD to deploy all of our other services, so we wanted to keep that the
same for our serverless application. Our Gitlab runner uses a Docker image that
has AWS SAM for building, testing, deploying (in stages) and rolling back our
serverless application. Our team already uses the Elk stack and Grafana for
logging, visualization and alerting. For this service all of our logs from API
Gateway, Lambda and Dynamo get picked up in Cloudwatch. We use Cloudwatch as a
data source in Grafana and have a utility that migrates our Cloudwatch logs to
Logstash. That way we can keep our normal monitoring and logging systems
without having to go to separate tools just for this project.

DID WE ADDRESS OUR GOALS?

So
now that we have laid out our basic architecture and tooling: How well does
this system address our goals mentioned at the start of this post? Below is a
breakdown for how these have been addressed and a (yes, very subjective) grade
for each.

Testability (Tests & Local debugging & Multiple environments) de

Overall Grade:A-

Positives: The tooling
for serverless has come a long way and impressed us in this category. One of
the most useful features is the ability to start your serverless app locally by
running

sam local start-api <OPTIONS>

This
starts a local http server based on your AWS::Serverless::Api specification in
your template.yaml. When you make a request to your local server, it reads the
corresponding CodeUri property of your lambda (AWS::Serverless::Function) and
starts a docker container (same image AWS runs deployed lambdas) to run your
lambda locally in conjunction with the request. We were able to write unit
tests for all the lambdas and lambda layer code as well as deploy to specific
integ/prod environments (discussed more below). There are additional console
and CLI tools for triggering API gateway endpoints and lambdas directly.

Negatives: Most of the
negatives for this category are nitpicks. Unit testing with layers took some
tweaking and feels a little clumsy and sam
local start-api doesn’t exactly mimic the deployed instance in how it
handles errors and parameter validation. In addition, requests to the local
instance were slow because it starts a docker container locally every time an
endpoint is requested.

Observability (Logging
& Monitoring)

Overall Grade: B

Positives: At the end of
the day this works pretty well and mostly mimics our other services where we
have our logs searchable in Kibana and our data visible in Grafana. The
integration with Cloudwatch is pretty seamless and really easy to get started
with.

Negatives: Observability
in the serverless world is still tough. It tends to be distributed across lots
of components and can be frustrating to track down where things unraveled. With
the rapid development of tooling in this category, we don’t see this being a
long-term problem. One tool that we have not tried, but could bump this grade
up, is AWS X-ray which is a distributed tracing system for AWS components.

Developer Experience (DRY & Code Reviews & Speed & Maintainability)

Overall Grade: A-

Positives: Developer
experience was good, bordering on great. Each endpoint is encapsulated in its
own small codebase which makes adding new code or working on existing code
really easy. Lambda layers have solved a
lot of the DRY issues. We share
response, error and database libraries between the lambdas as well as NPM
modules. All of our lambda code gets reviewed and deployed like our normal
services.

Negatives. In our view
the two biggest downsides have been unwieldy configuration and immature
documentation. While configuration has its benefits, and it’s great to be able
to see the entire infrastructure of your application in code, this can be tough
to jump into and SAM/Cloudformation has a learning curve. Documentation is solid
but could be better. Part of the issue is the rapid pace of feature releases
and some confusion on best practices.

Deployments

Overall Grade: A

Positives: Deployments
are awesome with SAM/Cloudformation. From our Gitlab runner we execute:

which creates a change set (shows effects of
proposed deployment) and then creates the stack in AWS. This will create/update
all the resources, IAM permissions and associations defined in template.yaml.
This is really fast, a normal build and deployment of all of our Lambdas, API
Gateway endpoints, permissions etc… takes ~90 seconds (not including running
our tests).

Negatives: Unclear best
practices for deploying to multiple environments. One approach is to use
different API stages in API gateway and have your other resources versioned for
those stages. Or (the way we chose) is to have completely different stacks for
different environments and pass a DeployEnv variable into our Cloudformation
scripts.

Performance (Latency & Scalability)

Overall Grade: A-

Positives: To test our performance and
latency before going live we proxied all of our requests in parallel to both our
existing (Dropwizard) API and new serverless API. It is important to consider
the differences between the systems (data stores, caches etc..), but after some
tweaking we were able to achieve equivalent P99, P95 and P50 response times
between the two systems. In addition to the ability to massively scale out
instances in response to traffic, the beauty of the serverless API is that you
can fine tune performance on a function (endpoint) basis. CPU share is
proportionally allocated depending on overall memory of each Lambda, so
increasing the memory of each Lambda has the ability to directly decrease
latency. When we first started routing traffic in parallel to our serverless
API we noticed some of the bulk API endpoints were not performing as well as we
had hoped. Instead of increasing the overall CPU/memory of the entire
deployment, we just had to do this for the slow Lambdas.

Negatives: A well-known and often discussed
issue with serverless is dealing with cold start times. This refers to the
latency increase when an initial instance is brought up by your cloud provider
to process a request to your serverless app. In practice this is on the scale
of sometimes several hundred ms latency added on cold start for our endpoints.
Once spun up, instances won’t get torn down immediately and subsequent requests
(if routed to this same instance) won’t suffer from that initial cold start
time. There are several strategies to avoid these cold starts like
“pre-warming” your API with dummy requests when you know traffic is about to
spike. You can also configure your application to run with more memory which
persist for longer between requests (+ increase CPU). Both of these strategies
cost money, so the goal is to try to find the right balance for your service.

Example of scaling (at ~14:40) an arbitrary
computation-heavy lambda from 128mb to 512mb in response times

Overall latency with spikes over a ~3 hour period (most lambdas configured at 128mb or 256mb). Spikes re either cold start times or calls that involve several hundred queries (bulk APIs) or a combo. Note* this performance is tweaked to be equivalent to the p99 p90 and p50 times of the old service.

WRAPPING UP

At
the end of the day this setup has been a fun experiment and for the most part
satisfied our requirements. Now that we have set this up once, we will
definitely use it again for suitable projects in the future. A few things we
didn’t touch on in this post are pricing and security, two very important goals
for any project. Pricing is tricky to get an accurate comparison of because it
depends on traffic, function size, function execution time and many other
configurable options along the entire stack. It is worth mentioning, though,
that if you have large and consistent traffic to your APIs, it is highly
unlikely that serverless will be as cost effective as rolling you own (large)
instance. That being said, if you want a low maintenance service that can scale
on demand to handle spikey, intermittent traffic, it is definitely worth
considering. Security considerations are also highly dependent on your use
case. For us, this is an internal facing application so strategies like
associating an internal firewall, limited resource IAM permissions and basic
gateway token validation sufficed. For public facing applications there are a
variety of security strategies to consider that aren’t mentioned in this post. Overall, we would argue that the
tools are at a point where it does not feel like you are sacrificing in terms
of development standards by choosing a serverless setup. Still not a silver
bullet, but a great tool to have in the toolbox.

Share:

Like this:

Introduction

Here at Zulily, we offer thousands of new products to our customers at a great value every day. These products are available for about 72 hours; to inform existing and potential customers about our ever-changing offerings, the Marketing team launches new ads daily for these offerings on Facebook.

To get the biggest impact, we only run the best-performing ads. When done manually, choosing the best ads is time-consuming and doesn’t scale. Moreover, the optimization lags behind the continuously changing spend and customer activation data, which means wasted marketing dollars. Our solution to this problem is an automated, real-time ad pause mechanism powered by Machine Learning.

Predicting CpTA

Marketing uses various metrics to measure ad efficiency. One of them is CpTA or Cost per Total Activation (see this blog post for a deeper dive on how we calculate this metric). Lower CpTA means spending less money to get new customers so lower is better.

To pause ads with high CpTA, we trained a Machine Learning model to predict the next-hour CpTA using the historical performance data we have for ads running on Facebook. If the model predicts that the next-hour CpTA of an ad will exceed a certain threshold, that ad will be paused automatically. The marketing team is empowered to change the threshold at any time.

Ad Pause Service

We host the next-hour CpTA model as a service and have other wrapper microservices deployed to gather and pass along the real-time predictor data to the model. These predictors include both relatively static attributes about the ad and dynamic data such as the ad’s performance for the last hour. This microservice architecture allows us to iterate quickly when doing model improvements and allows for tight monitoring of the entire pipeline.

The end-to-end flow works as follows. We receive spend data from Facebook for every ad hourly. We combine that with activation and revenue data from the Zulily web site and mobile apps to calculate the current CpTA. Then we use CpTA threshold values set by Marketing and our next-hour CpTA prediction to evaluate and act on the ad. This automatic flow helps manage the large number of continuously changing ads.

Results and Conclusion

The automatic ad pause system has increased our efficiency through the Facebook channel and gave Marketing more time to do what they do best: getting people excited about fresh and unique products offered by Zulily. Stay tuned for our next post where we take a deeper dive into the ML models.

Share:

Like this:

At zulily, we take pride in helping our customers discover amazing products at tremendous value. We rely heavily on highly curated images and videos to narrate these stories. These highly curated images, in particular, form the bulk of the content on our site and apps.

Two of our popular events from weekend of Oct 20th 2018

Today, we will talk about how zulily leverages serverless technologies for serving optimized images on the fly on a variety of devices with varying resolutions and screen sizes

In any given month, we serve over 23 billion image requests using our CDN partners. This results in over a Petabyte per month of data transferred to our users around the globe.

We use AWS Simple Storage Service (S3) bucket as the origin for all our images. Our in-house studio and merchandising teams upload rich images using internal tools to S3 buckets. As you can imagine, these images are pretty huge in terms of file size. Downloading and displaying these images as-is would result in sub-optimal experience for our customers and waste of bandwidth.

Share:

Like this:

Remember Bart Simpson’s punishment for being bad? He had to write the same thing on the chalkboard over and over again, and he absolutely hated it! We as humans hate repetitive actions, and that’s why we invented computers – to help us optimize our time to do more interesting work.

At zulily, our Marketing Specialists previously published ads to Facebook individually. However, they quickly realized that creating ads manually was limiting to the scale they could reach in their work: acquiring new customers and retaining existing shoppers. So in partnership with the marketing team, we worked together to build a solution that would help the team use resources efficiently.

At first, we focused on automating individual tasks. For instance, we wrote a tool that Marketing used to stitch images into a video ad. That was cool and saved some time but still didn’t necessarily allow us to operate at scale.

Now, we are finally at the point where the entire process runs end-to-end efficiently, and we are able to publish hundreds of ads per day, up from a handful.

Here’s how we engineered it.

The Architecture

Sales Events

Sales Events is an internal system at zulily that stores the data about all sales events we run; typically, we launch 100+ sales each day that could include 9,000 products that last three days. Each event includes links to appropriate products and product images. The system exposes the data through a REST API.

Evaluate an Event

This component holds the business logic that allows us to pick events that we want to advertise, using a rules-based system uniquely built for our high-velocity business. We implemented the component as an Airflow DAG that hits the Sales Events system multiple times a day for new events to evaluate. When a decision to advertise is made, the component triggers the next step.

Make Creatives

In this crucial next step, our zulily-built tool creates a video advertisement, which is uploaded to AWS S3 as an MP4 file. These creatives also include metadata used to match Creatives with Placements downstream.

Product Sort

A sales event at zulily could easily have dozens if not hundreds of products. We have a Machine Learning model that uses a proprietary algorithm to rank products for a given event. The Product Sort is available through a REST API, and we use it to optimize creative assets.

Match Creatives to Placements

A creative is a visual item that needs to be published so that a potential shopper on Facebook can see it. That end result advertisement that is seen by the potential shopper is described by a Placement. A Placement defines where on Facebook the ad will go and who the audience should be for the ad. We match creatives with placements using Match Filters defined by Marketing Specialists.

Define Match Filters

Match Filters allow Marketing Specialists to define rules that will pick a Placement for a new Creative.

These rules are based on the metadata of Creatives: “If a Creative has a tag X with the value Y, match it to the Placement Z.”

MongoDB

Once we match a Creative with one or more Placements, we persist the result in MongoDB. We use the schemaless database technology rather than a SQL database because we want to be able to extend the schema of Creatives and Placements without having to update table definitions. MongoDB (version 3.6 and above) also gives us a change stream, which is essentially a log of changes happening to a collection. We rely on this feature to automatically kick off the next step.

Publish Ads to Facebook

Once the ad definition is ready, and the new object is pushed to the MongoDB collection, we publish the ad to Facebook through a REST API. Along the way, the process automatically picks up videos to S3 and uploads them to Facebook. Upon a successful publish, the process marks the Ad as synced in the MongoDB collection.

Additional Technical Details

While this post is fairly high level, we want to share a few important technical details about the architecture that can be instructive for engineers interested in building something similar.

Self-healing. We run our services on Kubernetes, which means that the service auto-recovers. This is key in an environment where we only have a limited time (in our case, typically three days) to advertise an event.

Retry logic. Whenever you work with an external API, you want to have some retry logic to minimize downtime due to external issues. We use exponential retry, but every use case is different. If the number of re-tries is exhausted, we write the event to a Dead Letter Queue so it can be processed later.

Event-driven architecture. In addition to MongoDB change streams, we also rely on message services such as AWS Kinesis and SQS (alternatives such as Kafka and RabbitMQ are readily available if you are not in AWS). This allows us to de-couple individual components of the system to achieve a stable and reliable design.

Data for the users. While it’s not shown directly on the diagram, the system publishes business data it generates (Creatives, Placements, and Ads) to zulily’s analytics solution where it can be easily accessed by Marketing Specialists. If your users can access the data easily, it’ll make validations quicker, help build trust in the system and ultimately allow for more time to do more interesting work – not just troubleshooting.

Share:

Like this:

Here at zulily, we maintain a wide variety of database solutions. In this post, we’ll focus on two that we use for two differing needs in Ad Tech: MongoDB for our real-time needs and Google’s BigQuery for our long-term archival & analytics needs. In many cases we end up with data that needs to be represented and maintained in both databases, which causes issues due to their conflicting needs and abilities. While previously we had solved this issue by creating manual batch jobs for whatever collections we wanted represented in BigQuery, it resulted in delays of data appearing in BigQuery and issues with maintaining multiple mostly-similar-but-slightly-different jobs. As such, we’ve developed a new service, called mongo_bigquery_archiver, that automates the process of archiving data to BigQuery with only minimal configuration.

For the configuration of the archiver, we have a collection, ‘bigquery_archiver_settings,’ established on the Mongo server whose contents we want to back up, which is under a custom database ‘config’ on that server. This collection maintains one document for each collection that we want to back up to BigQuery, which serves both as configuration and an up-to-date reference for the status of that backup process. Starting out, the configuration is simple:

{
"mongo_database" : The Mongo database that houses the collection we want to archive.
"mongo_collection" : The Mongo collection we want to archive.
"bigquery_table" : The BigQuery table we want to create to store this data. Note that the table doesn't need to already exist! The archiver can create them on the fly.
"backfill_full_collection" : Whether to upload existing documents in the Mongo collection, or only upload future ones. This field is useful for collections that may have some junk test data in them starting out, or for quickly testing it for onboarding purposes.
"dryrun": Whether to actually begin uploading to BigQuery or just perform initial checks. This is mostly useful for the schema creation feature discussed below.
}

And that’s it! When that configuration is added to the collection, if the archiver process is active, it’ll automatically pick up the new configuration and create a subprocess to begin archiving that Mongo collection. The first step is determining the schema to use when creating the BigQuery table. If one already exists or is (optionally) specified in the configuration document manually, it’ll use that. Otherwise, it’ll begin the process of determining the maximum viable schema. By iterating through the entire collection, it analyzes each document to determine what fields exist across the superset of all present documents, coming up with all fields that maintain consistent types across every document and/or exist in some documents and not others, treating ones with the missing fields as null values. In addition, it analyzes subdocuments in the same way, creating RECORD field definitions that similarly load all the valid fields, recurring as necessary based on the depth of the collection’s documents. When complete, it stores the generated maximum viable schema in the configuration document for the user to review and modify as needed, in case there’s extraneous fields that, while possible to upload to BigQuery, would just result in useless overhead. It creates a BigQuery table based on this generated schema, and moves on to the next step.

Next, it uses a new feature added in Mongo 3.6, change streams. A change stream is like a subscription to the OpLog on the Mongo server for the collection in question – whenever an operation comes in that modifies the collection, the subscriber is notified about it. From here, we maintain a subscription for the watched collections, and whenever an update comes in, we process that update to get the current state of the document in Mongo, without querying again since we can configure the change stream to also give us the latest version of the document. By filtering down using the generated schema from before, we can upload the change to the BigQuery table via the BigQuery streaming API, along with the kind of operation it is – create, replace, update, or delete. In case of a delete, we blank all fields but the _id field and log that as a DELETE operation in the table. This table now represents the entire history of operations on the collection in question and can be used to track the state of the collection across time.

Since a full-history table makes it somewhat cumbersome to get the most current version of the data, the archiver automatically creates views on top of all generated tables, using the following query:

What this query does is determine the latest operation performed on each unique document ID, which is guaranteed to be unique in the Mongo collection. If the latest operation is a DELETE, it doesn’t show any entries for that particular ID in the final query results, otherwise it shows the status of the document as of the latest modification. This results in a view of each unique document exactly as of its latest update. With the minimal latency of the BigQuery streaming API, changes are reflected in BigQuery within seconds of their creation in Mongo, allowing for sophisticated real-time analytics via BigQuery. While BigQuery does not have official SLAs about the performance of the streaming API, we consistently see results uploaded within 3-4 seconds at most via query results. The preview mechanism via the BigQuery UI does not accurately reflect it, but querying the table via a SELECT statement properly shows the results.

An example of the architecture. Note that normal user access
always hits Mongo and never goes to BigQuery, making the process more efficient.

Thanks to the Archiver, we’ve been able to leverage the strengths of both Mongo and BigQuery in our rapidly-modifying datasets while not having to actively maintain two disparate data loading processes for each system.

Like this:

Introduction

When I got started managing software development projects the standard methodology in use was the Waterfall, in which you attempted to understand and document the entire project lifecycle before anyone wrote a single line of code. While this may have offered some benefit for coordinating large multi-team projects, more often than not it resulted in missed deadlines, inflexibility, and delivering systems that didn’t fully achieve their business goals. Agile has since emerged as the dominant project methodology, addressing many of the Waterfall’s shortcomings. Agile recognizes that it’s unlikely we’ll know everything up front, and as such is built on an iterative foundation. This allows us to learn as we go, regularly incorporate stakeholder feedback, and to avoid failure.

For Data Science and Machine Learning (DS/ML) projects, I’d argue that an iterative approach is a necessary, but not sufficient for a successful project outcome. DS/ML projects are different. And these differences can fly below the traditional project manager’s radar until they pop up late in the schedule and deliver a nasty bite. In this blog post I’ll point out some of the key differences I’ve seen between a traditional software development project and a DS/ML project, and how you can protect your team and your stakeholders from these hidden dangers.

Project Goal

At a high level DS/ML projects typically seek to do one of three things: 1) Explain; 2) Predict; or 3) Find Hidden Structure. In the first two we are predicting something, but with different aims. When we are tasked with explanation we use ‘transparent’ models, meaning they show us directly how they arrive at a prediction. If our model is sufficiently accurate, we can make inferences about what drives the outcome we’re trying to predict. If our goal is simply getting the best possible prediction we can also use ‘black box’ models. These models are generally more complex and don’t provide an easy way to see how they make their predictions, but they can be more accurate than transparent models. When the goal is to find hidden structure, we are interested in creating groups of like entities such as customers, stores, or products, and then working with those groups rather than the individual entities. Regardless of the immediate goal, in all three cases we’re using DS/ML to help allocate scarce organizational resources more effectively.

When we want to explain or predict we need to explicitly define an outcome or behavior of interest. Most retail organizations, for example, are interested in retaining good customers. A common question is “How long will a given customer remain active?” One way to answer this is to build a churn model that attempts to estimate how likely a customer is to stop doing business with us. We now have a defined behavior of interest: customer churn, so we’re ready to start building a model, right? Wrong. A Data Scientist will need a much more precise definition of churn or risk building a model that won’t get used. Which group of customers are we targeting here? Those who just made their first purchase? Those who have been loyal customers for years? Those who haven’t bought anything from us in the last 6 months? High value customers whose activity seems to be tapering off recently? Each of these definitions of ‘at risk’ creates a different population of customers to model, leaving varying amounts of observed behavior at our disposal.

By definition we know less about new customers than long time loyal customers, so a model built for the former use case will probably not generalize well to the latter group. Once we’ve defined the population of customers under consideration, it’s really important to compute the ‘size of the prize’.

Size of the Prize

Say we build a churn model that makes predictions that are 100% accurate (if this were to happen, we’ve likely made a modeling mistake – more on that later), and we apply that model to the customer audience we defined to be at risk of churn. What’s the maximum potential ROI? Since our model can correctly distinguish those customers who will churn from those that won’t, we’d only consider offering retention incentives to those truly at risk of leaving. Maybe we decide to give them a discount on their next purchase. How effective have such interventions been in the past at retaining similar customers?

If a similar discount has historically resulted in 2% of the target audience making an incremental purchase, how many of todays at risk customers would we retain? If you assume that each retained customer’s new order will be comparable to their average historical order amount, and then back out the cost of the discount, how much is left? In some cases, even under ideal conditions you may find that you’re targeting a fairly narrow slice of customers to begin with, and the maximum ROI isn’t enough to move forward with model-based targeting for a retention campaign. It’s much better to know this before you’ve sunk time and effort in to building a model that won’t get used. Instead, maybe you could build a model to explain why customers leave in the first place and try to address the root causes.

Definition of Success

Many times there is a difference in the way you’ll assess the accuracy of a model and how stakeholders will measure the success of the project. DS/ML practitioners use a variety of model performance metrics, many of which are quite technical in nature. These can be very different from the organizational KPIs that a stakeholder will use to judge a project’s success. You need to make sure that success from a model accuracy standpoint will also move the KPI needle in the right direction.

Models are built to help us make better decisions in a specific organizational context. If we’re tasked with improving the decision making in an existing process, we need to understand all the things that are taken into account when making that decision today. For example, if we build a model that makes recommendations for products based on previous purchases, but fail to consider current inventory, we may recommend something we can’t deliver. If our business is seasonal in nature, we may be spot on with our product recommendation and have plenty on hand, but suggest something that’s seasonally inappropriate.

Then there is the technical context to consider. If the goal will be making a recommendation in real time in an e-commerce environment, such as when an item is added to a shopping cart, you’ve got to deliver that recommendation quickly and without adding any friction to the checkout process. That means you’ll need to be ready to quickly supply this recommendation for any customer at any time. Models are usually built or ‘trained’ offline, and that process can take some time. A trained model is then fed new data and will output a prediction. This last step is commonly called ‘scoring’. Scoring can be orders of magnitude faster than training. But keep in mind that some algorithms require much more training time than others. Even if your scoring code can keep up with your busiest bursts of customer traffic, if you want to train frequently – perhaps daily so your recommendations take recent customer activity into account – the data acquisition, preparation, and training cycle may not be able to keep up.

Some algorithms need more than data, they also require supplying values for ‘tuning parameters’. Think of these as ‘knobs’ that have to be dialed in to get the best performance. The optimal settings for these knobs will differ from project to project, and can vary over time for the same model. A behavioral shift in your customer base, seasonality, or a change in your product portfolio can all require that a model be retrained and retuned. These are all factors that can affect the quality of your model’s recommendations.

Once you have clearly defined the desired outcome, the target audience, the definition of success from both the KPI and model accuracy perspectives, and how the model will be deployed you’ve eliminated some of the major reasons models get built but don’t get used.

Model Inputs and Experimentation

In traditional software development, the inputs and outputs are usually very well defined. Whether the output is a report, an on-line shopping experience, an automatic inventory replenishment system, or an automobile’s cruise control, we usually enter into the project with a solid understanding of the inputs and outputs. The desired software just needs to consume the inputs and produce the outputs. That is certainly not to trivialize these types of projects; they can be far more difficult and complex than building a predictive model.

But DS/ML projects often differ in one key respect: while we’ll usually know (or will quickly define) what the desired output should be, many times the required inputs are unknown at the beginning of project. It’s also possible that the data needed to make the desired predictions can’t be acquired quickly enough or is of such poor quality that the project is infeasible. Unfortunately these outcomes are not often apparent until the Data Scientist has had a chance to explore the data and try some preliminary things out.

Our stakeholders can be of immense help when it comes to identifying candidate model inputs. Data that’s used to inform current decision making processes and execute existing business rules can be a rich source of predictive ‘signal’. Sometimes our current business processes rely on information that would be difficult to obtain (residing in multiple spreadsheets maintained by different groups) or almost impossible to access (tribal knowledge or intuition). Many algorithms can give us some notion of which data items they find useful to make predictions (strong signal), which play more of a supporting role (weak signal) or are otherwise uninformative (noise). Inputs containing strong signal are easy to identify, but the distinction between weak signal and noise is not always obvious.

Ghost Patterns

If we’re lucky we’ll have some good leads on possible model inputs. If we’re not so lucky we’ll have to start from scratch. There are real costs to including uninformative or redundant inputs. Besides the operational costs of acquiring, preparing, and managing inputs, not being selective about what goes into an algorithm can cause some models to learn ‘patterns’ in the training data that turn out to be spurious.

Say I asked you to predict a student’s math grade. Here’s your training data: Amy gets an ‘A’, Bobby gets a ‘B’, and Cindy gets a ‘C’. Now make a prediction: What grade does David get? If that’s all the information you had, you might be inclined to guess ‘D’, despite how shaky that seems. You’d probably be even less inclined to hazard a guess if I asked about Mary’s grade. The more data we put into a model the greater the chance that some data items completely unrelated to the outcome just happen to have a pattern in the training data that looks useful. When you try to make predictions with a data set that your model didn’t see during training, that spurious pattern won’t be there and model performance will suffer.

To figure out which candidate model inputs are useful to keep and which should be dropped from further consideration, the DS/ML practitioner must form hypotheses about and conduct experiments on the data. And there’s no guarantee that there will be reliable enough signal in the data to build a model that will meet your stakeholder’s definition of success.

Time Travel

We hope to form long, mutually beneficial relationships with our customers. If we earn our customers’ repeat business, we accumulate data and learn things about them over time. It’s common to use historical data to train a model from one period of time and then test its accuracy on data from a subsequent period. This reasonably simulates what would happen if I build a model on today’s data and use the model to predict what will happen tomorrow. Looking into the past like this is not without risk though. When we reach back into a historical data set like this, we need to be careful to avoid considering data that arrived after the point in business process at which we want to make a prediction.

I once built a model to predict how likely it was that a customer would make their first purchase between two points in time in their tenure. To test how accurate my model was I needed to compare it to real outcomes, so that meant using historical data. When I went to gather the historical data I accidentally included a data item that captured information about the customer’s first purchase – something I wouldn’t know at the point in time at which I wanted to make the prediction. If a customer had already made their first purchase at the time I wanted make the prediction, they wouldn’t be part of the target population to begin with.

The first indication that I had accidentally let my model cheat by peeking into the future was that it was 100% accurate when tested on a new data set. That generally doesn’t happen, at least to me, and least of all on the first version of a model I build. When I examined the model to see which inputs had a strong signal, the data item with information from the future stood out like a sore thumb. In this case my mistake was obvious, so I simply removed the data item that was ‘leaking’ information from the future and kept going. This particular information leak was easy to detect and correct, but that’s not always the case. And this issue is something that can bite even stakeholders during the conceptualization phases of projects, especially when trying to use DS/ML to improve decision making in longer running business processes.

Business processes that run over days, weeks, or longer typically fill in more and more data as they progress. When we’re reporting or doing other analysis on historical data, it can be easy to lose sight of the fact that not all of that data showed up at the same time. If you want to use a DS/ML capability to improve a long running business process, you need to be mindful of when the data actually becomes available. If not, there’s a real risk of proposing something that sounds awesome but is just not feasible.

Data availability and timing issues can also crop up in business processes that need to take quick action on new information. Just because data appears in an operational system in a timely fashion, that data still has to be readied for model scoring and fed to the scoring code. This pre-scoring data preparation process in some cases can be computationally intensive and may have its own input data requirements. Once the data is prepared and delivered to the scoring process, the scoring step itself is typically quick and efficient.

Unified Modeling Language (UML) Sequence & Timing Diagrams are useful tools for figuring out how long the end to end process might take might take. It’s wise to get a ballpark estimate on this before jumping into model building.

Deploying Models as a Service

Paraphrasing Josh Wills, a Data Scientist is a better programmer than most statisticians, and a better statistician than most programmers. That said, you probably still want software engineers building applications and machine learning engineers building modeling and scoring pipelines. There are two main strategies an application can use to obtain predictions from a model: The model can be directly integrated into the application’s code base or the application can call a service (API) to get model predictions. This choice can have a huge impact on architecture and success of an ML/DS project.

Integrating a predictive model directly into an application may seem tempting – no need to stand up and maintain a separate service, and the end-to-end implementation of making and acting on a prediction is in the same code base, which can simplify troubleshooting and debugging. But there are downsides. Application and model development are typically done on different cadences and by different teams. An integrated model means more a complicated deployment and testing process and can put application support engineers in the awkward position of having to troubleshoot code developed by another team with a different set of skills. Integrated models can’t easily be exposed to other applications or monitoring processes and can cause application feature bloat if there’s a desire to include a capability to A/B test competing models.

Using a service to host the scoring code gets around these issues, but also impacts the overall system architecture. Model inputs need to be made available behind the API. At first blush, this may seem like a disadvantage – more work and more moving parts. But the process that collects and prepares data for scoring will often need to operate in the background anyway, independent of normal application flow.

Exposing model predictions as a service has a number of advantages. Most importantly, it allows teams work more independently and focus on improving the parts of the system that best align with their skill sets. A/B testing of two or more models can be implemented behind the API without touching the application. Having the scoring code run in its own environment also makes it easier to scale. You’ll want to log some or all of the predictions, along with their inputs, for off-line analysis and long-term prediction performance monitoring. Being able to identify the cases where the model’s predictions are the least accurate can be incredibly valuable when looking to improve model performance.

If a latter revision of the model needs to incorporate additional data or prepare existing data in a different way, that work doesn’t need to be prioritized with the application development team. Imagine that you’ve integrated a model directly into an application, but the scoring code needs to be sped up to maintain a good user experience. Your options are much more limited than if you’ve deployed the scoring code as a service. How about adding in that A/B testing capability? Or logging predictions for off line analysis? Even just deploying a retuned version of an existing model will require cross-team coordination.

Modeling within a Service Based Architecture

The model scoring API is the contract between the DS/ML & application development teams. Changing what’s passed into this service or what’s returned back to the application (the API’s ‘signature’) is a dead giveaway that the division of responsibilities on either side of the API was not completely thought through. That is a serious risk to project success. For teams to work independently on subsystems, that contract cannot change. A change to an API signature will require both service producers and service consumers to make changes and will often result in a tighter coupling between the systems – one of the problems we’re trying to avoid in the first place with a service-based approach. And always keep the number of things going into and coming out that API to a bare minimum. The less the API client and server know about each other the better.

Application development teams may be uneasy about relying on such an opaque service. The more opaque a service is the less insight the application team has into a core component of their system. It may be tempting to include a lot of diagnostic information in the API’s response payload. Don’t do it. Instead, focus your efforts on persisting this diagnostic information behind the API. It’s fine to return a token or tracing id that can be used to retrieve this information through another channel at a later point in time, just keep your API signature as clean as possible.

As previously discussed, DS/ML projects are inherently iterative in nature and often require substantial experimentation. At the outset we don’t know exactly which data items will be useful as model inputs. This presents a problem for a service-based architecture. You want to encourage the Data Scientist to build the best model they can, so they’ll need to run a lot of little experiments, each of which could change what the model needs as inputs. So Machine Learning engineers will need to wait a bit until model input requirements settle down enough to the point where they can start to build data acquisition and processing pipelines. But there’s a catch: Waiting too long before building out the API unnecessarily extends timelines.

So how do we solve this? One idea is to work at the data source rather than data item level. The Data Scientist should quickly be able to narrow down the possible source systems from which data will need to acquired, and not long after that know which tables or data structures in those sources are of interest. One useful idiom from the Data Warehousing world is “Touch It, Take It”. This means that if today you know you’ll need a handful of data items from a given table, it’s better to grab everything in that table the first time rather than cherry-pick your first set and then having to open up the integration code each time you identify the need for an additional column. Sure, you’ll be acquiring some data prospectively, but you’ll also be maintaining a complete set of valuable candidate predictors. You’ll thank yourself when building the next version of the model, or different model in that same domain, because the input data will already be available.

Once the Data Scientist has identified the desired tables or data structures, you’ll have a good idea of the universe of data that could potentially be made available to the scoring code behind the API. This is the time to nail down the leanest API signature you can. A customer id goes in and a yes / no decision and a tracing token comes out. That’s it. Once you’ve got a minimal signature defined freeze it – at least until the first version of the model is in production.

Business Rules First

Predictive models are often used to improve on an existing business process built on business rules. If improving a business rule based process, consider the business rules as version zero of the model. They establish the baseline level of performance against which the model will be judged. Consider powering the first version of the API with these business rules. The earlier the end-to-end path can be exercised the better, and having a business rule based version of the model available as a fallback can provide a quick way to roll back the first model without rolling back the architecture.

In Closing

In this post I’ve tried to highlight some of the more important differences I’ve experienced between a traditional software development project and a DS/ML project. I was fortunate enough to be on the lookout for a few of these, but most only came to light in hindsight. DS/ML projects have enough inherent uncertainty; hopefully you’ll be able to use some of the information in this post to avoid some of these pitfalls in your next DS/ML project.

In Marketing Tech, one of our jobs is to tell customers about zulily offers. These days everything and everyone goes mobile, and Mobile Push notifications are a great way to reach customers.

Our team faced a double-sided challenge. Imagine that you have to ferry passengers across a river. There’ll be times when only one or two passengers show up every second or so, but they need to make it across as soon as possible. Under other circumstances, two million passengers will show up at once, all demanding an immediate transfer to the opposite bank.

One way to solve this is to build a big boat and a bunch of small boats and use them as appropriate. While this works, the big boat will sit idle most of the time. If we build the big boat only, we will be able to easily handle the crowds, but it will cost a fortune to service individual passengers. Two million small boats alone won’t work either because they will probably take the entire length of the river.

Fortunately, in the world of software we can solve this challenge by building a boat that scales. Unlike the Lambda architecture with two different code paths for real-time and batch processing, an auto-scaling system offers a single code path that can handle one or one million messages with equal ease.

Let’s take a look at the system architecture diagram.

Campaigns and one-offs are passengers. In the case of a campaign, we have to send potentially millions of notifications in a matter of minutes. One-offs arrive randomly, one at a time.

An AWS Kinesis Stream paired with a Lambda function make a boat that scales. While we do need to provision both to have enough capacity to process the peak loads, we only pay for what we use with Lambda, and Kinesis is dirt-cheap.

We also ensure that the boat doesn’t ferry the same passenger multiple times, which would result in an awful customer experience (just imagine having your phone beep every few minutes). To solve this problem, we built a Frequency Cap service on top of Redis, which gave us a response time under 50ms per message. Before the code attempts to send a notification, it checks with the Frequency Cap service if the send has already been attempted. If it has, the message is skipped. Otherwise, it is marked as “Send Attempted”. It’s important to note that the call to the Frequency Cap API is made before an actual send is attempted. Such a sequence prevents the scenario where we send the message and fail to mark it accordingly due to a system failure.

Another interesting challenge worth explaining is to how we line up millions of passengers to board the boat efficiently. Imagine that they all arrive without a ticket, and the ticketing times vary. Yet, the board departs at an exact time that cannot be changed. We solve for this by ticketing in advance (the Payload Builder EMR service) and gathering passengers in a waiting area (files in S3). At an exact time, we open multiple doors from the waiting area (multithreading in the Kinesis Loader Java service), and the passengers make their way onto the boat (Kinesis Stream). The Step Function AWS service connects the Payload Builder and a Kinesis Loader into a workflow.

In summary, we built a system that can handle one or one million Mobile Push notifications with equal ease. We achieved this by combining batch and streaming architecture patterns and adding a service to prevent duplicate sends. We also did some cool stuff in the Payload Builder service to personalize each notification so check back in a few weeks for a new post on that.

Like this:

Introduction

A/B testing is essential to how we operate a data-driven business at zulily. We use it to assess the impact of new features and programs before we roll them out. This blog post focuses on some of the more practical aspects of A/B testing. It is divided into four parts. It begins with an introduction to A/B testing and how we measure long-term impact. Then, it moves into the A/B splitting mechanism. Next, it turns to Decima, our in-house A/B test analysis platform. Finally, it goes behind the scenes and describes the architecture of Decima.

A/B Testing

A/B Testing Basics

In A/B testing, the classic example is changing the color of a button. Say a button is blue, but a PM comes along with a great idea: What would happen if we make it green instead? The blue button is version A, the current version, the control. The green button is version B, the new version, the test. We want to know: Is the green button as awesome as we think? Is it a better experience for our users? Does it lead to better outcomes for our business? To find out, we run an A/B test. We randomly assign some users to see version A and some to see version B. Then we measure a few key outcome metrics for the users in each group. Finally, we use statistical analysis to compare those metrics between the two groups and determine whether the results are significant.

Statistical significance is a formal way of measuring whether a result is interesting. We know that there is natural variability in our users. Not everyone behaves exactly the same way. So, we want to check if the difference between A and B could just be due to chance. Pretend we ran an A/A test instead. We randomly split the users into two groups, but everyone gets the blue button. There is a range of differences (close to zero) that we could reasonably expect to see. When the results of the A/B test are statistically significant, it means they would be highly unusual to see under an A/A test. In that case, we would conclude that the green button did make a difference.

Figure 1. A/B testing – Split users and assign to version A or B. Measure behavior of each group. Use statistical analysis to compare.

Cumulative Outcome Metrics

To shop on zulily, users have to create an account. Requiring our users to be signed in is great for A/B testing, and for analytics in general. It means we can tie together all of a user’s actions via their account id, even if they switch browsers or devices. This makes it easy to measure long-term behaviors, well beyond a single session. And, since we can measure them, we can A/B test for them.

One of the common outcomes we measure at zulily is purchasing. A short-term outcome would be: How much did this user spend in the session when they saw the blue or green button? A long-term outcome would be: How much did the user spend during the A/B test? Whenever a user sees the control or test experience, we say they were exposed. A user can be exposed repeatedly over the course of a test. We accumulate outcome metrics from the first exposure through the end of the test. By measuring cumulative outcomes, we can better understand long-term impact and not be distracted by novelty effects.

Figure 2. Cumulative outcome metrics – Measure each user’s behaviors from their first exposure forward. Users can be exposed multiple times – focus on the first time. Do not count the user’s behaviors before their first exposure.

Lift

Usually, A/B test analysis measures the difference between version B and version A. For an outcome metric x, the difference between test and control is xB – xA. This difference, especially for cumulative outcomes, can increase over time. Consider the example of spend per exposed user. As the A/B test goes on, both groups keep purchasing and accumulating more spend. Version B is a success if the test group’s spend increases faster than the control’s.

Instead of difference, we measure the lift of B over A. Lift scales the difference by the baseline value. For an outcome metric x, the lift of test over control is (xB – xA) / xA * 100%. We have found that lift for cumulative metrics tends to be stable over time.

Figure 3. Lift over time – Cumulative behaviors increases over time for both A and B, so the difference between them grows too. The lift tends to stay constant, making it a better summary of the results.

Power Analysis

Before starting an A/B test, it is good to ask two questions: What percent of users should get test versus control? and How long will the test need to run? The formal statistical way of answering these questions is a power analysis. First, we need to know what is the smallest difference (or lift) that would be meaningful to the business. This is called the effect size. Second, we need to know how much the outcome metric typically fluctuates. The power analysis calculates the sample size, the number of users needed to detect this size of effect with statistical significance.

There are two components to using the sample size. The split is the fraction of users in test versus control, and this impacts the sample size needed. The time for the test is however long it will take to expose that many users. Since users can come back and be exposed again, the cumulative number exposed will grow more slowly as time goes on. Purely mathematically, the more unbalanced the split (the further from 50-50 in either direction), the longer the test. Likewise, the smaller the effect size, the longer the test.

Size + Time – Practical Considerations

Often the power analysis doesn’t tell the whole story. For example, at zulily we have a strong weekly cycle – people shop differently on weekends from weekdays. We always recommend running A/B tests for at least one week, and ideally in multiples of seven days. Of course, if the results look dramatically negative after the first day or two, it is fine to turn off the test early.

The balance of the split affects the length of the test run, but we also consider the level of risk. If we have a big program with lots of moving parts, we might start with 90% control, 10% test. On the flip side, if we want to make sure an important feature keeps providing lift, we might maintain a holdout with 5% control, 95% test. But, if we have a low risk test, such as a small UI change, a split at 50% control, 50% test will mean shorter testing time.

A/B Split

Goals for the A/B Split

There are three key properties that any splitting strategy should have. First, the users should be randomly assigned to treatments. That way, all other characteristics of the users will be approximately the same for each of the treatment groups. The only difference going in is the treatment, so we can conclude that any differences coming out were caused by the treatment. Second, the treatments for each A/B test should be assigned independently from all other A/B tests. That way, we can run many A/B tests simultaneously and not worry about them interfering with each other’s results. Of course, it wouldn’t make sense to apply two conflicting tests to the same feature at the same time. Third, the split should be reproducible. The same user should always be assigned to the same treatment of a test. The treatment shouldn’t vary randomly from page to page or from visit to visit.

Our Strategy

At zulily, our splitting strategy is to combine the user id with the test name and apply a hash function. The result is a bucket number for that user in that test. We often set up more buckets than treatments. This provides the flexibility to start with a small test group and later increase it by moving some of the buckets from control to test.

Our splitting strategy has all three key properties. First, the hash produces pseudo-random bucketing. Second, by including the test name, the user will get independent buckets for different tests. Third, the bucket is reproducible because the hash function is deterministic.

The hash is very fast to compute, so developers don’t have to worry about the A/B split slowing down their code. To implement a test, at the decision point in the code the developer places a call to our standard test lookup function with the test name and user id. It returns the bucket number and treatment name, so the user can be directed to version A or version B. Behind the scenes, the test lookup function generates a clickstream log with the test name, user id, timestamp, and bucket. We on the Data Science team use the clickstream records to know exactly who was exposed to which test when and which treatment they were assigned.

Audience v. Exposure

There are two main ways to assign users to an A/B test: using an audience or exposure. In an audience-based test, before the test launches we create an audience – a group of users who should be in the test – and randomly split them into control and test. Then we measure all of those users’ behavior for the entire test period. This is straightforward but imprecise. Not everyone in the audience will actually be touched by the A/B test. The results are statistically valid, but it will be more difficult to detect an effect due to the extra noise.

Instead, we prefer exposure-based testing. The user is only assigned to a treatment when they reach the feature being tested. The number of exposed users increases as the test runs. The only users in the analysis are those who could have been impacted by the A/B test, so it is easier to detect a lift. In addition, we only measure the cumulative outcomes starting from each user’s first exposure. This further refines the results by excluding anything a user might have done before they had a chance to be influenced by the test.

Decima UI

A Bit of Roman Mythology

The ancient Romans had a concept of the Three Fates. These were three women who control each mortal’s thread of life. First, Nona spins the thread, then Decima measures it, and finally Morta cuts it when the life is over. We named our A/B test analysis system Decima because it measures all of the live tests at zulily.

Figure 5. Three Fates – In ancient Roman mythology the Three Fates control the thread of life. Decima’s role is to measure it.

Decima UI

The Decima UI is the face of the system to internal users. These include PMs, analysts, developers, or anyone interested in the results of an A/B test. It has two main sections: the navigation and information panel and the results panel. Figure XX shows a screenshot of Decima displaying a demo A/B test.

Figure 6. Decima UI – At zulily, Decima displays the results of A/B tests. The left panel is for navigation and information. The main panel shows the results for each outcome metric.

Navigation + Information

The navigation and information panel is on the left. A/B tests are organized by Namespace or area of the business. Within a namespace, the Experiment drop-down lists the names of all live tests. The Platform drills down to just exposures and outcomes that occurred on that platform or group of platforms (all-apps, mobile-web, etc). The Segmentation drills down to users in a particular segment (new vs existing, US vs international, etc).

The date information shows the analysis_start_date and analysis_end_date. The results are for exposures and outcomes that occurred in this date range, inclusive. The n_days shows the length of the date range. The analysis_run_date shows the timestamp when the results were computed. For live tests, the end date is always yesterday and the run date is always this morning.

Results

The main panel displays the results for each outcome metric. We analyze whether the lift is zero or statistically significantly different from zero. If a lift is significant and positive, it is colored green. If it is significant and negative, it is colored orange. If it is flat, it is left gray. The plot shows the estimated lift and its 95% confidence interval. It is easy to see whether or not the confidence interval contains zero.

The table shows the average (or proportion for a binary outcome), standard deviation, and sample size for each treatment group. Based on the statistical analysis, it shows the estimated lift, confidence interval bounds, and p-value for comparing each test group to the control.

Figure 7. Single metric results – Zoom in one metric in the Decima UI. The plot shows the 95% confidence interval for lift. The table shows summary numbers and statistical results.

Common Metrics

We use a variety of outcome metrics depending on the goal of the new feature being tested. Our core metrics include purchasing and visiting behaviors. Specifically, spend per exposed approximates the impact of the test to our top-line. For each exposed user, we measure the cumulative spend (possibly zero) between their first exposure date and the analysis end date. Then we average this across all users for each treatment group. Spend per exposed can be broken down into two components: chance of purchase and spend per purchaser. Sometimes a test might cause more users to purchase but spend lower amounts, or vice versa. Spend per exposed combines the two to capture the overall impact. Revisit rate measures the impact of the test to repeat engagement. For each exposed user, we count the number of days they came back after their first exposure date. We have found that visit frequency is a strong predictor of future behaviors, months down the road.

Figure 8. Common outcome metrics. Spend per exposed can be broken into chance of purchase and demand per purchaser. Revisit rate is a proxy for long-term behavior.

Decima Architecture

Three Modules of Decima

Decima is comprised of three main modules. Each is named after a famous contributor to the field that corresponds to its role. Codd invented the relational database model, so the codd module assembles the user-level dataset from our data warehouse. Gauss was an influential statistician (the Gaussian or Normal distribution is named after him), so the gauss module performs the statistical analysis. Tufte is considered a pioneer in data visualization, so the tufte module displays the results in the Decima UI. Decima runs in Google Compute Engine (GCE), with a separate Docker container for each module.

Codd

The codd module is in charge of assembling the dataset. It is written in Python. It uses recursive formatting to compose the query out of parameterized query components, filling values for the dates, test name, etc. Then it submits the query to the data warehouse in Google BigQuery and exports the resulting dataset to Google Cloud Storage (GCS).

Figure 9. Codd – The codd module of Decima does data assembly.

Gauss

The gauss module takes care of the statistical analysis. It is written in R. It imports the dataset produced by codd from GCS into a data.table. It loops through the outcome metrics and performs the statistical test for lift for each one using speedglm. It also loops through platforms and segmentations to generate results for the drill downs. Finally, it gathers all the results and writes them out to a file in GCS.

Tufte

The tufte module serves the result visualizations. It is also written in R. It imports the results file produced by gauss from GCS. It creates the tables and plots for each metric in the test using ggplot2. It displays them in an interactive UI using shiny. The UI is hosted in GCE and can be accessed by anyone at zulily.

Decima Meta

The fourth module of Decima is decima-meta. It doesn’t contain any software, just queries and configuration files. The queries are broken down into reusable pieces. For example, the exposure query and outcome metrics query can be mixed and matched. Each query piece has parameters for frequently changed values, such as dates or test ids. The configuration files are written in JSON and there is one per A/B test. They specify all the query pieces and parameters for codd, as well as the outcome metrics for gauss. The idea is: running an A/B test analysis should be as easy as adding a configuration file for it.

About the Author

Julie Michelman is a Data Scientist at zulily. She designs and analyzes A/B tests, utilizing Decima, the in-house A/B test analysis tool she helped build. She also builds machine learning models that are used across the business, including marketing, merchandising, and the recommender system. Julie holds a Master’s in Statistics from the University of Washington.

Like this:

zulily is a flash sales company. We post a product on the site, and puff… it’s gone in 72 hours. Online ads for those products come and go just as fast, which doesn’t leave us much time to manually evaluate the performance of the ads and take corrective actions if needed. To optimize our ad spend, we need to know in real-time how each ad is doing, and this is exactly what we engineered.

While we track multiple metrics to measure impact of an ad, I am going to focus on one that provides a good representation of the system architecture. This is an engineering blog after all!

The metric in question is Cost per Total Activation, or CpTA in short. The formula for the metric is this: divide the total cost of the ad by the number of customer activations. We call the numerator in this formula “spend” and refer to the denominator as an “activation”. For example, if an ad costs zulily $100 between midnight and 15:45 PST on January 31 and results in 20 activations, the CpTA for this ad as of 15:45 PST is $100/20 = $5.

Here’s how zulily collects this metric in real-time. For the sake of simplicity, I will skip archiving processes that are sprinkled on top the architecture below.

The source of the spend for the metric is an advertiser API, e.g. Facebook. We’ve implemented a Spend Producer (in reference to the Producer-Consumer model) that queries the API every 15 minutes for live ads and pushes the spend into a MongoDB. Each spend record has a tracking code that uniquely identifies the ad.

The source for the activations is a Kafka stream of purchase orders that customers place with zulily. We consume these orders and throw them into an AWS Kinesis stream. This gives us the ability to process and archive the orders without causing an extra strain on Kafka. It’s important to note that relevant orders also have the ad’s tracking code, just like the spend. That’s the link that glues spend and activations together.

The Activation Evaluator application examines each purchase and determines if the purchase is an activation. To do that, it looks up the previous purchase in a MongoDB collection for the customer Id on the purchase order. If the most recent transaction is non-existent or older than X days, the purchase is an activation. The Activation Evaluator updates the customer record with the date of the new purchase. To make sure that we don’t drop any data if the Activation Evaluator runs into issues, we don’t move the checkpoint in the Kinesis stream until the write to Mongo is confirmed.

The Activation Evaluator sends evaluated purchases into another Kinesis stream. Chaining up Kinesis stream is a pretty common pattern for AWS applications, as it allows for the separation of concern and makes the whole system more resilient to failure of individual components.

The Activation Calculator reads the evaluated purchases from the second Kinesis stream and captures them in Mongo. We index the data by tracking code and timestamp, and voila, a simple count() will return the number of activations for a specified period.

The last step in the process is to take the Spend and divide it by the activations. Done.

With this architecture, zulily measures a key advertising performance metric every 15 minutes and uses it to pause poorly-performing ads. The metric also serves as an input for various Machine Learning models, but more on those in a future blog post… Stay tuned!!