Seattle's Best Tech Team

About Sudhir Hasbe

Sudhir is an accomplished product management leader with over 16 years of experience building industry-leading products at startup and blue-chip companies. He has a proven record of leading product teams across engineering and marketing to deliver business results through customer insights and innovation.

In 2014, we decided to build our core data platform on Google Cloud Platform, and one of the products critical to that decision was Google BigQuery. We knew the scale at which it enabled us to perform analysis would be critical for our business in the long run. Today we have more than 200 unique users performing analysis on a monthly basis.

Once we started using Google BigQuery at scale, we soon realized our analysts needed better tooling around it. The key requests we started getting were:

Ability to schedule jobs: Analysts needed the ability to run queries at regular intervals to generate data and metrics.

Define workflows of queries: Analysts wanted to run multiple queries in a sequence and share data across them through temp tables.

Simplified data sharing: Finally, it became clear teams needed to share the generated data with other systems, for example downloading it for use in R programs or sending it to another system for processing through Kafka.

zuFlow Overview

zuFlow is zulily's query workflow and scheduling solution for Google BigQuery. There are a few key concepts:

Job: An executable entity that encompasses multiple queries along with a schedule.

Query: A SQL statement that can be executed on Google BigQuery.

Keyword: A variable defined for use in the queries.

Looper: The ability to define loops, like foreach statements.
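To make the concepts above concrete, the pieces of a job could be pictured roughly like this (a hypothetical sketch; the field names are illustrative, not zuFlow's actual config schema):

```python
# Hypothetical sketch of a zuFlow job definition; field names are
# illustrative, not zuFlow's actual schema.
job = {
    "name": "daily_sales_rollup",        # Job: executable entity with a schedule
    "schedule": "0 7 * * *",             # run daily at 7am
    "owner": "analyst@example.com",
    "keywords": {"LOOKBACK_DAYS": "7"},  # Keyword: variable reused in queries
    "looper": {"REGION": ["US", "EU"]},  # Looper: foreach-style loop values
    "queries": [                         # Query: SQL executed on BigQuery
        {
            "sql": "SELECT region, SUM(sales) FROM demand "
                   "WHERE region = '{REGION}' GROUP BY region",
            "output": {"type": "temp_table", "name": "tmp_sales_{REGION}"},
        },
    ],
}
print(job["name"])
```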

High Level Design

zuFlow is a web application that enables users to set up jobs and run them either on demand or on a schedule.

We have integrated the system with an off-the-shelf open-source scheduler called SOS. We plan to migrate this to Airflow in the future.

Flowrunner, written in Python, is the brain of the system. It reads job definitions from the config db, executes the queries, and stores the runtime details back in the db. A few key capabilities it provides:

Concurrency: We manage concurrency to make sure we are not overloading the system.

Retry: In a few scenarios, based on error codes, we retry the queries.

Cleanup: It is responsible for cleaning up after jobs are run, including historical data cleanup.

zuFlow Experience

Job Viewer: Once logged in, you can see your jobs or view all jobs in the system.

Creating a Job: You provide a name, a schedule to run on, and the email address of the owner.

Keywords/variables: You can create keywords to reuse in your queries. This enables analysts to define a parameter and use it in their queries instead of hardcoding values. We also have predefined system keywords for date/time handling, making it easier for users to shard tables. Examples:

DateTime: CURRENT_TIMESTAMP_PT, CURRENT_DATE_PT, CURRENT_MONTH_PT, CURRENT_TIME_PT, LAST_RUN_TIMESTAMP_PT, LAST_RUN_TIMESTAMP, LAST_RUN_DATE_PT
LAST_RUN_DATE_PT, for example, is the BQ-format Pacific date of the last run of the job (it will be CURRENT_DATE_PT on the first run).
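Keyword substitution amounts to simple templating before the query is submitted. A minimal illustration (the resolver below is an assumption for illustration; zuFlow's actual implementation and Pacific-time handling are not shown in this post):

```python
from datetime import datetime

# Illustrative keyword substitution; zuFlow's real resolver and its
# Pacific-time handling are not described in the post, so this is a sketch.
def resolve_keywords(sql, last_run_date=None):
    today = datetime.utcnow().date()
    keywords = {
        "CURRENT_DATE_PT": today.isoformat(),
        # On the first run, LAST_RUN_DATE_PT falls back to CURRENT_DATE_PT
        "LAST_RUN_DATE_PT": (last_run_date or today).isoformat(),
    }
    for name, value in keywords.items():
        sql = sql.replace("{" + name + "}", value)
    return sql

sql = "SELECT * FROM sales WHERE day > '{LAST_RUN_DATE_PT}'"
print(resolve_keywords(sql, last_run_date=datetime(2015, 1, 1).date()))
```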

Looping: Very soon after our first release, we got requests to add loops. This enables users to define a variable and loop through its values.

Query Definition: Now you are ready to write a Google BigQuery query and define where the output will be stored. There are four options:

BQ Table: You provide a BQ table and decide whether to replace it or append to it. You can also define the output table as a temp table, and the system will clean it up after the job completes.

CSV: If you pick CSV, you need to provide a GCS location for the output.

Cloud SQL (MySQL): You can also export to Cloud SQL.

Kafka: You provide a Kafka topic name to publish results as messages.

You can define multiple queries and share data across them through temp tables in BQ.

Job Overview: This shows the full definition of the job.

We have thought about open sourcing the solution. Please let us know if you are interested in this system.

zulily is a daily business: we launch our events every day at 6am PST, and most of our sales come in the first hours after launch. It is critical for our merchants to know what is happening and react to drive more sales. We have significantly improved merchants' ability to drive sales by providing a new real-time home page, so that every day when they come in they can take action based on the current situation.

Historical View:

Historically we had a dashboard for our merchants which was not very useful. It showed them upcoming events and some other info, but when you come in every day you want to know what is happening today, not tomorrow.

New View

We replaced this non-actionable home page with a new real-time version which shows merchants real-time sales for their events, conversion rates, real-time inventory, top-selling styles, and styles projected to sell out. This enables merchants to talk to vendors to get more inventory or to add different products.

Technical Design

To build a real-time home page for merchants, we had to combine real-time clickstream data (unstructured) with real-time sales (structured) and historical event and product data. Bringing these very different types of datasets into a single pipeline and merging/processing them in real time was a challenge.

Real-time Clickstream & Orders

We have built a high-scale real-time collection service called ZIP. It peaks at around 18K to 20K transactions per second each day. Our clickstream and order data is collected through this service. One of the capabilities of ZIP is to publish this data in real time to a Kafka cluster, which enables other systems to access the collected data in near real time.

We will describe other capabilities of this service in a future post.

Historical data:

Our data platform runs on Google Cloud Platform and includes Google Dataproc as our ETL processing platform, which stores processed data in Google BigQuery for analytics and in Google Cloud Storage. All our historical data, including products, events, prices, and orders, is stored in Google BigQuery and Google Cloud Storage.

Spark Streaming Processing

We used Spark Streaming to join the clickstream and order data collected in Kafka with historical data in GCS, using the GCS connector provided by Google. This allowed us to create derivative datasets, like real-time sales, conversion rates, and top sellers, which were stored in AWS Aurora. AWS Aurora is an amazing database for high-scale queries; in a future post we will write up why we chose Aurora over other options.

Data Access through Zata

We then used our ZATA API to access this data from our merch tools to build a great UI for our merchants.

Spark Streaming Details

Reading the data from Kafka (KafkaUtils.createDirectStream)

KafkaUtils is the object with factory methods to create input DStreams and RDDs from records in Apache Kafka topics. createDirectStream skips receivers and ZooKeeper and uses the Kafka simple consumer API to consume messages. This means it needs to track offsets internally.

So at the beginning of each batch, the connector reads partition offsets for each topic from Kafka and uses them to ingest data. To ensure exactly-once semantics, it tracks offset information in Spark Streaming checkpoints.
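The offset bookkeeping can be pictured in plain Python (a conceptual sketch only; in reality Spark Streaming's checkpoint mechanism does this internally, not application code):

```python
# Conceptual sketch of direct-stream offset tracking. Spark Streaming's
# checkpointing does this internally; this only illustrates the idea.
checkpoint = {}  # (topic, partition) -> next offset to read

def ingest_batch(topic, partition, latest_offset):
    """Return the offset range [start, end) this batch should consume."""
    start = checkpoint.get((topic, partition), 0)
    # ... consume records [start, latest_offset) from Kafka here ...
    checkpoint[(topic, partition)] = latest_offset  # commit after processing
    return (start, latest_offset)

print(ingest_batch("clickstream", 0, 100))  # (0, 100)
print(ingest_batch("clickstream", 0, 250))  # (100, 250): resumes where it left off
```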

Reading data from GCS (textFileStream)

This method monitors a Hadoop-compatible filesystem directory for new files and, when it detects a new file, reads it into Spark Streaming. In our case we use GCS, and the streaming job internally uses the GCS connector. We pass the GCS connector as a jar file when invoking the job:

ssc.textFileStream("<GCS bucket path>")

Merge values: combineByKey(createCombiner, mergeValue, mergeCombiners)

In Spark, groupByKey() doesn't do any local aggregation while computing on each partition's data; this is where combineByKey() comes in handy. In combineByKey, values are merged into one value within each partition, and finally the values from each partition are merged into a single value. So combineByKey is an optimization over groupByKey, as we end up sending fewer key-value pairs across the network.

We used combineByKey to calculate aggregations like total sales, average price, and demand. Three functions are passed as arguments to this method:

combineByKey(createCombiner, mergeValue, mergeCombiners)

createCombiner: The first required argument, a function used as the very first aggregation step for each key. It is invoked only once per key.

mergeValue: Tells the combiner what to do when it is given a new value.

mergeCombiners: Called to combine the values of a key across multiple partitions.
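The three functions can be illustrated by computing an average price per key in plain Python (the same semantics as Spark's combineByKey, minus the partitioning; the product data is made up):

```python
# Plain-Python illustration of combineByKey semantics: average price per key.
def create_combiner(price):
    return (price, 1)                 # (running sum, count)

def merge_value(acc, price):
    return (acc[0] + price, acc[1] + 1)

def merge_combiners(a, b):            # combine partial results across partitions
    return (a[0] + b[0], a[1] + b[1])

pairs = [("shoes", 10.0), ("dress", 30.0), ("shoes", 20.0)]  # made-up data
combined = {}
for key, price in pairs:
    if key not in combined:
        combined[key] = create_combiner(price)
    else:
        combined[key] = merge_value(combined[key], price)

averages = {k: total / count for k, (total, count) in combined.items()}
print(averages)  # {'shoes': 15.0, 'dress': 30.0}
```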

Stateful transformations (updateStateByKey())

We required a framework that supported building knowledge from both historical and real-time data, and Spark Streaming provided just that. Using stateful functions like updateStateByKey, which computes a running sum of all sales, we were able to achieve our requirement.

We used updateStateByKey(func) for stateful transformation. For example, suppose you want to keep track of the number of times a customer visited the web page: if customer "123" visited twice in the first hour and visits again in the next hour, the aggregated count at the end of the second hour should be 3 (current batch count plus history). This history state is held in memory and handled by updateStateByKey.

Spark Streaming's checkpoint mechanism takes care of preserving the state of sales history in memory. As an additional recovery point, we stored the state in a database and recovered from the database in case checkpoint files were cleared during new code deployments or configuration changes.

def updateSales(newstate, oldstate):
    # In case of an empty RDD (no new values), keep the previous state
    # If an event/product insert timestamp is older than two days,
    # remove it from memory (the full update logic is elided here)
    ...

soi_batch_agg.updateStateByKey(updateSales)

In our initial blog post about zulily's big data platform, we briefly talked about ZATA (the zulily data access service). Today we want to deep dive into ZATA and explain our thought process and how we built it.

Goals

As a data platform team we had three goals:

Rich data generated by our team shouldn't be limited to analysts. It should be available to systems & applications via a simple and consistent API.

Have the flexibility to change our backend data storage solutions over time without impacting our clients.

Zero development time for incremental data-driven APIs.

ZATA was our solution for achieving these goals. We abstracted our data platform behind a REST-based service layer that our clients could use to fetch datasets. This let us swap out storage layers without any change for our client systems.

Selecting Data Storage solution

There are three different attributes you have to figure out before you pick a storage technology:

Size of data: Is it big data or relatively small? In short, will it fit in MySQL, or do you need to look at solutions like Google BigQuery or AWS Aurora?

Query latency: How fast do you need to respond to queries? Is it milliseconds, or are a few seconds OK, especially for large datasets?

Data type: Is it relational data, key-value pairs, complex JSON documents, or a search pattern?

As an enterprise, we need all combinations of these. The following are choices our team has made over time for different attributes:

Google BigQuery: Great for large datasets (terabytes), but latency is in seconds; supports columnar storage.

AWS Aurora: Great for large datasets (100s of gigabytes) with very low query latency.

PostgresXL: Great for large datasets (100s of gigabytes to terabytes) with amazing performance for aggregation queries, but it is very difficult to manage and still early in its maturity cycle. We eventually moved our datasets to AWS Aurora.

Google Cloud SQL, MySQL, or SQL Server: For small datasets (GBs) with really low latency (milliseconds).

MongoDB or Google Bigtable: Good for large-scale datasets with low-latency document lookup.

Elasticsearch: We use Elasticsearch for search scenarios, both fuzzy and exact match.

The dataset used is product-offering, which is just a view in the Google BigQuery system.

The filter eventStartDate=[2013-11-15,2013-12-01] is transformed to: where eventStartDate between '2013-11-15' and '2013-12-01'.

The output fields requested are eventId, vendorId, productId, grossUnit.

The resulting query for Google BigQuery is:

Select eventId, vendorId, productId, grossUnit from product-offering where eventStartDate between '2013-11-15' and '2013-12-01'

The mapping layer decides which mappings to use and how to transform the HTTP request into something the backend will understand. This would be very different for MongoDB or Google Bigtable.
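The range-filter rewrite described above can be sketched as follows (the field names come from the example; the function itself is a hypothetical illustration, not ZATA's code):

```python
import re

# Hypothetical sketch of the mapping-layer rewrite:
# field=[start,end] in the HTTP request becomes a SQL BETWEEN clause.
def to_sql_filter(param):
    m = re.fullmatch(r"(\w+)=\[([^,\]]+),([^,\]]+)\]", param)
    if not m:
        raise ValueError("unsupported filter: " + param)
    field, start, end = m.groups()
    return "where {} between '{}' and '{}'".format(field, start, end)

print(to_sql_filter("eventStartDate=[2013-11-15,2013-12-01]"))
# where eventStartDate between '2013-11-15' and '2013-12-01'
```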

Execution Layer

The execution layer is responsible for generating queries in the protocol that the storage engine understands. It also executes the queries against the backend and fetches result sets efficiently. Our current implementation supports various protocols: MongoDB, standard JDBC, and HTTP requests for Google BigQuery, Bigtable, and Elasticsearch.

Transform Layer

This layer is responsible for transforming data coming from any of the backend sources and normalizing it. This allows our clients to be agnostic of the storage mechanism in our backend systems. We went with JSON as the schema format, given how prevalent it is amongst services and application developers.

In the previous example from the mapping layer, the response would be the following.

API auto discovery

Our third goal was zero development time for incremental data-driven APIs. We achieved this by creating an auto-discovery service. The job of this service is to regularly poll the backend storage service for changes and automatically add service definitions to the config db. For example, in Google BigQuery or MySQL, once you add a view in a schema called "zata", we automatically add the API to the ZATA service. This way a data engineer can keep adding services for the datasets they created without anyone writing new code.
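The polling loop amounts to diffing the storage system's "zata" schema against the config db. A hypothetical sketch (the function, view names, and config-db shape are all invented for illustration):

```python
# Hypothetical auto-discovery sketch: any view found in the "zata" schema
# becomes an API definition in the config db. Names are invented.
def discover(list_zata_views, config_db):
    known = set(config_db)                 # dataset names already registered
    for view in list_zata_views():
        if view not in known:
            config_db[view] = {"dataset": view, "path": "/data/" + view}
    return config_db

# Simulated backend with two views in the "zata" schema
views = lambda: ["product-offering", "daily-sales"]
config = discover(views, {})
print(sorted(config))  # ['daily-sales', 'product-offering']
```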

API Schema Definition

The schema service enables users to look up all the APIs supported by ZATA and view their schemas to understand what requests they can send. Clients can get the list of available datasets.

So far, the client is not aware of the location and has no knowledge of the storage system, which makes the whole data story more agile. If data is moved from one location to another, or the schema is altered, all downstream systems will be fine, since the access points and the contracts are managed by ZATA.

Storage Service Isolation

As we rolled out ZATA over time, we realized the need for storage service isolation. Having a single service support multiple backend storage solutions with different latency requirements didn’t work very well. The slowest backend tends to slow things down for everyone else.

This forced us to rethink our ZATA deployment strategy. Around the same time, we were experimenting with Docker and using Kubernetes as an orchestration mechanism.

We ended up creating a separate Docker container and Kubernetes service for each of the backend storage solutions. So we now have a zata-bigquery service which handles all BigQuery-specific calls. Similarly, we have zata-mongo, zata-jdbc, and zata-es services. Each of these Kubernetes services can be individually scaled based on anticipated load.

In addition to the individual Kubernetes services, we also created a zata-router service, which is essentially nginx hosted in Docker. The zata-router service accepts incoming HTTP requests for ZATA and, based on the nginx config, routes the traffic to the various Kubernetes services available in the cluster. The nginx config in the zata-router service is dynamically refreshed by the polling service to make new APIs discoverable.

ZATA has enabled us to make our data more accessible across the organization while enabling us to move fast and change storage layer as we scaled up.


Clickstream is the largest dataset at zulily. As mentioned in zulily's big data platform, we store all our data in Google Cloud Storage and use a Hadoop cluster in Google Compute Engine to process it. Our data size has been growing at a significant pace, which required us to reevaluate our design, and the dataset growing fastest was clickstream.

We had to make a decision to either increase the size of cluster drastically or try something new. We came across Cloud Dataflow and started evaluating it as an alternative.

Why did we move clickstream processing to Cloud Dataflow?

Instead of increasing the size of the cluster to manage peak clickstream loads, Cloud Dataflow gave us an on-demand cluster that could dynamically grow or shrink, reducing costs.

Clickstream data was growing very fast, and the ability to isolate it from other operational data processing would help the whole system in the long run.

Clickstream data processing did not require complex correlations and joins, which are a must in some of the other datasets; this made the transition easy.

The programming model was extremely simple, and the transition from a development perspective was easy.

Most importantly, we have always thought of our data processing clusters as transient entities, which meant dropping in another engine, in this case Cloud Dataflow, which was different from Hadoop, was not a big deal. Our data would still reside in GCS, and we could still use clickstream data from the Hadoop cluster where required.

High Level Flow

One of the core principles of our clickstream system is to be self-serviceable. In short, teams can start recording new types of events anytime by just registering a schema with our service. Clickstream events are pushed to our data platform through our real-time system (more on that later) and through log shipping to GCS.

Create multiple outputs from the data in the logs, based on event types and schema, using PCollectionTuple.

Apply ParDo transformations on the data to optimize it for querying by business users (e.g., transform all dates to PST).

Store the data back into GCS, our central data store, for use by other Hadoop processes and for loading into BigQuery.

Loading into BigQuery is not done directly from Dataflow; we use the bq load command, which we found to be much more performant.
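The ParDo step above that normalizes dates to PST is, conceptually, a per-element function. A plain-Python sketch of that transform (the actual pipeline ran on the Dataflow SDK; the fixed -8h offset here is a simplifying assumption that ignores daylight saving):

```python
from datetime import datetime, timedelta

# Sketch of the per-element date normalization done in the ParDo step.
# Fixed -8h offset used for simplicity; real code must handle DST.
def utc_to_pst(event):
    ts = datetime.strptime(event["ts_utc"], "%Y-%m-%dT%H:%M:%S")
    event["ts_pst"] = (ts - timedelta(hours=8)).isoformat()
    return event

e = utc_to_pst({"ts_utc": "2015-06-01T06:30:00"})
print(e["ts_pst"])  # 2015-05-31T22:30:00
```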

The next step for us is to identify other scenarios which are a good fit to transition to Cloud Dataflow, which have similar characteristics to clickstream – high volume, unstructured data not requiring heavy correlation with other data sets.


In July 2014 we started our journey of building a new data platform that would allow us to use big data to drive business decisions. I would like to start with a quote from our 2015 Q2 earnings call that was highlighted in various press outlets, and share how we built a data platform that allows zulily to make decisions that were near impossible before.

“We compared two sets of customers from 2012 that both came in through the same channel, a display ad on a major homepage, but through two different types of ads,” [Darrell] Cavens said. “The first ad featured a globally distributed well-known shoe brand, and the second was a set of uniquely styled but unbranded women’s dresses. When we analyze customers coming in through the two different ad types, the shoe ad had more than twice the rate of customer activations on day one. But 2.5 years later, the spend from customers that came in through the women dresses ad was significantly higher than the shoe ad with the difference increasing over time.” – www.bizjournals.com

Our vision is for every decision, at every level of the organization, to be driven by data. In early 2014 we realized the data platform we had, a combination of a SQL Server database for warehousing primarily structured operational data plus a Hadoop cluster for unstructured data, was too limiting. We started by defining core principles for our new data platform (v3).

Principles for new platform

Break silos of unstructured and structured data: One of the key issues with our old infrastructure was that we had unstructured data in a Hadoop cluster and structured data in SQL Server, which was extremely limiting. Almost every major decision required analytics correlating usage patterns on the site (unstructured) with demand, shipments, or other transactions (structured). We had to come up with a solution that allowed us to correlate these different types of datasets.

High scale at the lowest possible cost: We already had a Hadoop cluster, but our data was growing exponentially, and as storage needs grew we had to keep adding nodes to the cluster, which increased cost even though compute was not growing at the same pace. In short, we were paying at the rate of compute (which is far more expensive) for the growth rate of storage. We had to break this. Additionally, we wanted to build a highly scalable system that would support our growth.

High availability, or extremely low recovery time, with an eye on cost: Simply put, we wanted high availability but didn't want to pay a premium for it.

Enable real-time analytics: The existing data platform had no concept of real-time data processing, so anytime users needed to analyze things in real time they had to be given access to production operational systems, which in turn increased risk. This had to change.

Self service: Our tools and systems/services need to be easy for our users, with no involvement from the data team.

Protect the platform from users: Another challenge in the existing system was the lack of isolation between core data platform components and user-facing services. We wanted to ensure that in the new platform individual users could not bring down the entire system.

Service (data) level agreements (SLAs): We wanted to be able to give SLAs for our datasets to our users, covering data accuracy and freshness. In the past this was a huge issue due to the design of the system and the lack of tools to monitor and report SLAs.

zulily data platform (v3)

The zulily data platform includes multiple tools, platforms, and products. The combination is what makes it possible to solve complex problems at scale.

There are six key broad components in the stack:

Centralized big data storage built on Google Cloud Storage (GCS). This allows us to have all our data in one place (principle 1) and share it across all systems and layers of the stack.

The big data platform is primarily for batch processing of data at scale. We use Hadoop on Google Compute Engine for our big data processing.

All our data is stored in GCS, not HDFS. This enables us to decouple storage from compute and manage cost (principle 2).

Additionally, the Hadoop cluster is transient for us: it holds no data, so if our cluster completely goes away we can build a new one on the fly. We think of our cluster as a transient entity (principle 3).

We have started looking at Google Dataflow for specific scenarios, but more on that soon. Our goal is to make all data available for analytics anywhere from 30 minutes to a few hours later, based on the SLA.

The real-time data platform is a zulily-built stack for high-scale data collection and processing (principle 4). We now collect more than 10K events per second, and that number is growing fast as we find value through our decision portal and other scenarios.

The business data warehouse, built on Google BigQuery, enables us to provide a highly scalable analytics service to hundreds of users in the company.

We push all our data, structured and unstructured, real-time and batch, into Google BigQuery, breaking all data silos (principle 1).

It also enables us to keep our big data platform and data warehouse completely isolated (principle 6), making sure issues in one environment don't impact the other.

Keeping the interface (SQL) the same for analysts between the old and new platforms lowered the barrier for adoption (principle 5).

We also have a high-speed, low-latency data store that hosts part of our business data warehouse data for access through APIs.

The data access and visualization component enables business users to make decisions based on the information available.

The real-time decision portal, zuPulse, gives the business real-time insights.

Tableau is our reporting and analytics visualization tool. Business users use it to create reports for their daily execution. This lets our engineering team focus on core data and high-scale predictive scenarios, and makes reporting self service (principle 5).

ZATA, our data access service, enables us to expose all our data through an API. Any data in the data warehouse or the low-latency data store can be automatically exposed through ZATA. This enables various internal applications at zulily to show critical information to users, including vendors, who can see their historical sales on the vendor portal.

Core data services are the backbone of everything we do, from moving data to the cloud to monitoring and making sure we abide by our SLAs. We continuously add new services and enhance existing ones. The goal of these services is to make developers and analysts across other areas more efficient.

zuSync is our data synchronization service, which enables us to get data from any source system and move it to any destination. A source or destination can be an operational database (SQL Server, MySQL, etc.), a queue (RabbitMQ), Google Cloud Storage (GCS), Google BigQuery, a service (HTTP/TCP/UDP), an FTP location…

zuFlow is our workflow service built on top of Google BigQuery; it allows analysts to define and schedule query chains for batch analytics. It also has the ability to notify users and deliver data in different formats.

zuDataMon is our data monitoring service, which allows us to make sure data is consistent across all systems (principle 7). This helps us catch issues with our services or the data early. Today more than 70% of issues are identified by our monitoring tool rather than reported by users; our goal is to get that number above 90%.

This is a very high level view of what our data@zulily team does. Our goal is to share more often, with a deeper dive on our various pieces of technology, in the coming weeks. Stay tuned…


We are looking forward to meeting everyone attending the scalability meetup at our office. It is going to be a great event, with a good overview of how zulily leverages big data and deep dives into Google BigQuery and Apache Optiq in Hive.

Abstract: zulily, with 4.1 million customers and projected 2014 revenues of over 1 billion dollars, is one of the largest e-commerce companies in the U.S. “Data-driven decision making” is part of our DNA. Growth in the business has triggered exponential growth in data, which required us to redesign our data platform. The zulily data platform is the backbone for all analytics and reporting, along with being the backbone of our data service APIs consumed by various teams in the organization. This session provides a technical deep dive into our data platform and shares key learnings, including our decision to build a Hadoop cluster in the cloud.

Topic: Delivering personalization and recommendations using Hadoop in cloud

Speakers: Steve Reed is a principal engineer at zulily, the author of dropship, and former Geek of the Week. Dylan Carney is a senior software engineer at zulily. They both work on personalization, recommendations and improving your shopping experience.

Abstract: Working on personalization and recommendations at zulily, we have come to lean heavily on on-premise Hadoop clusters to get real work done. Hadoop is a robust and fascinating system, with a myriad of knobs to turn and settings to tune. Knowing the ins and outs of obscure Hadoop properties is crucial for the health and performance of your Hadoop cluster. (To wit: How big is your fsimage? Is your secondary namenode daemon running? Did you know it's not really a secondary namenode at all?)

But what if it didn’t have to be this way? Google Compute Engine (GCE) and other cloud platforms make promises of easier, faster and easier-to-maintain Hadoop installations. Join us as we describe learning from our years of Hadoop use, and give an overview of what we’ve been able to adapt, learn and unlearn while moving to GCE.

Topic: Apache Optiq in Hive

Speaker: Julian Hyde, Principal, Hortonworks

Abstract: Tez is making Hive faster, and now cost-based optimization (CBO) is making it smarter. A new initiative in Hive introduces cost-based optimization for the first time, based on the Optiq framework. Optiq’s lead developer Julian Hyde shows the improvements that CBO is bringing to Hive. For those interested in Hive internals, he gives an overview of the Optiq framework and shows some of the improvements that are coming to future versions of Hive.

Our format is flexible: we usually have two speakers who each talk for ~30 minutes, followed by Q&A and discussion (about 45 minutes per talk), finishing by 8:45.