How to build an Elasticsearch analytics platform in a weekend

That’s a heat map of 3 months of Wayfair North American orders, exported via Logstash from our relational orders database into Elasticsearch, and visualized with Kibana 4. For some time now, ELK, or Elasticsearch / Logstash / Kibana, has been our opensource platform for text-based log messages, complemented by Graphite and Grafana for numeric logging of counts, averages, etc. We have long used ELK for realtime analysis of emergent problems. With the Kibana 4 upgrade, we’re starting to make it a first-class citizen of our analytics suite, which also includes Hadoop, Vertica, Tableau, and cube-style SQL Server. This blog post lays out the details of a hackathon project that served as a POC for that move. The basic idea was to put some data from and about our search systems into Kibana 4.

Our search platform is Solr, so this meant we were going to put Solr-based data into Elasticsearch. What could go wrong? Plenty. To see why, just start typing Solr into google:

Click on through to “Solr vs. elasticsearch” and you’ll find useful comparisons, lively discussion about strengths and weaknesses, bold declarations of imminent dominance or eclipse, trash talk, etc. In the end they are both web service wrappers around inverted indices made with the Lucene Java library. They have more in common than they having dividing them, and they shouldn’t be fighting. Could Solr and ES get along? Could we put Solr data into ES without having ES talk bad about Solr? We would soon find out. Cognizant of the historical tribal disagreement among the Solr and ES communities, we called our effort “Project Forbidden Marriage.”

Our search usage data is stored in Hive/Hadoop, and our order data in MSSQL. Elastic provides a connector for Hive via jdbc to directly connect to Elasticsearch, and we used pymssql and Kafka to send the order data to logstash and on to ES.

Now that we had data flowing, we hit our first roadblock. We wanted to see order history on a timeline based on order completion date, not the time the message is indexed by Elasticsearch or processed through Logstash, which is the default that appears in the @timestamp field that drives default-behavior time series charting in Kibana. So we changed our query to name the column @timestamp in the Python.

Uh…well that didn’t work. OK, plan B – we tried using the date filter in Logstash to replace the @timestamp field data with our order_complete_date field data.

Hm…that didn’t seem to make a difference at all. To make this long story a little shorter, I tried a few other things, even going as far as writing some Ruby in the Logstash config file to remove the @timestamp field entirely, or modify it, or do ANYTHING with it pre-Elasticsearch. Fun fact: you can’t do that. While it’s not documented as such – in my experience and frantic head-banging I’ve found that this field could not be modified without hacking the infrastructure. We have hacked Solr on several occasions, but we really try not to go there if we don’t have to, because it complicates upgrades. We have been making do with vanilla ELK.

All right, plan C. I renamed my date field in my SQL query to event_timestamp, and set that as the timestamp reference field in Kibana.

Victory! Now we just need to adjust the rest of our timeline-based data to send to that field properly, account for the differences in timestamp formatting across our data sets, and we’re good to go. Luckily for us – a date type field in Elasticsearch will accept multiple date formats, assuming you specify them properly.

Now that the time stuff is all taken care of, and we know that will work, we need to normalize our fields to get all of this data to play nice. While Elasticsearch is a schema-less document store database, it is still important that our data be coherent, so we can visualize all of it together. What I mean by this is, Elasticsearch allows you to enter data without a defined schema, but it’s important to keep your field names the same between data sources. Since we want to combine these two non-relational sets, we need to make sure that the common field names match. Luckily, for our purposes, there are really only a few fields that are linked between our different data sources – and that is customer_id, store_name, ip_address, sku, and city. Keeping these named the same between our data sources will allow us to maintain the relationship between data sources; when I query for

sku:abc123

I’d expect to see results from both data sets, rather than having to search for

sku:abc123 OR skus:abc123

It’s definitely worth discussing a few of these field design choices in detail, as it’s ultimately important as to how we visualize this data. The first field that is noteworthy in its design decision is SKU. In Elasticsearch 2.x, multi-value fields have been removed as a defined field type, however the functionality still remains by default. If we send an array of SKUs, we can still read and aggregate on them independently. For example:

{“sku”: “ABC123”}
{“sku”: [“ABC123”,”DEF456”,”GHI789”]}

If we run a query in Kibana for sku:ABC123, two results will return. These will also aggregate as individual skus, and every operation will work in this manner. This winds up being really important, since some records from orders or search history returns one SKU, and some can return up to dozens – having them all accessible allows us to visualize our data properly, rather than aggregating on all individual elements together, we can aggregate the results of an individual element.

Another important field to mention is keyword. This field represents the search string that a user entered when searching on Wayfair.com, and we want to do a few different things with this data. We want to be able to run a full text search, but we also want to aggregate the results. For those who don’t know, aggregating on an analyzed field can be extremely expensive depending on the contents of the field. For example: if we have data that looks like the keyword ”blue area rugs”, we want to see the count for that entire phrase. If it’s an analyzed field, we’ll wind up seeing counts for blue, area, and rugs independently. However, we need an analyzed field to do a full text search. Enter: the multi-field field. The mapping looks like this:

This allows us a lot of flexibility with our searching and aggregation – we can ask questions like “what are the most common search phrases that include ‘blue’?”. So we search for keyword.analyzed:‘blue’, and aggregate on keyword.raw.

Perfect! Full text search without expensive aggregation. The last field worth mentioning is our geo_point field, which we called coords. We decided to go ip based, considering that is a common item between search and order data – so to get geo information we ran these IPs through a few different facilities. From the hive side, we ran IPs through UDFs to spit the coordinates out from an ip database we have. The Python side of things did something similar.

On geo data –

Go zip code based where possible; this is only relevant to a portion of our data, however. IP is used due to it being common among all data sets

Filter out anything with a default location

To point 1 – geolocation data based on IP is inherently unreliable. It’s getting better over time, sure, but even in the best circumstances it can only be so accurate. Looking back, I would only use IP information for search data, and I would use zip code for order data. Zip code is significantly more reliable, in terms of geolocation. And on that note – geolocation data happens to have a default location if they don’t know where your IP geo-locates to. There are two hot spots on this map. One is on top of New York City, where we do actually do a ton of business, and the other is in Kansas, which is a default address for test orders. Those are outliers. If we wanted to clean this up and see more meaningful data, we’d probably exclude the Kansas one, overlay with population data, and get orders per capita.

So now that we have all of this data, what can we do with it? A LOT of information can come off of two tables, come to find out. We were able to use our search and order data independently and together, and were able to display some pieces of data that no one had really seen here at Wayfair before. Also, we accidentally built a product recommendation engine.

You may have noticed that this dashboard won’t work without providing a SKU to search for. As I mentioned previously, we’re sending an array of SKUs to the SKU field when that is applicable – so this search returns any of the associated SKUs to the SKU that was searched for. This allows us to show “customers also viewed” type material – this is the representation in the top left histogram. As far as this screenshot goes, we searched for a SKU that we know to be a barbecue grill. The other SKUs shown in the top left happen to be grills as well. The top right histogram is similar as well – except this is order based. This is our “customers also purchased” type view, and this histogram lists the grill (of course) and grill accessories, like covers, spatulas, and patio furniture. Furthermore, the bottom left visual shows exactly which search terms lead to this item, which is extremely useful data for analysis. The rest of the data here was “just for fun”. Where did people buy this? What is their name (Jennifers like grills!)? Lastly, fraudulent order ratio, and average order total.

Our next item is a holistic search dashboard. It breaks down all of our search data in a general sense. What we have here first is our Keyword Clickthrough panel. This is a simple idea: after someone searched for something, did they click a SKU? If so, that’s a successful search. Further down, we have our most popular search phrases (using keyword.raw) and most popular search terms. Search terms is an aggregation on an analyzed field (oh no!) but totally useful in this case. We can see that our most common search phrase is storage cabinet, but our most popular term is table. Furthermore, we can see search volume by store, and then search success by store, with count. This allows us some flexibility as well – what if we want to see search success and most popular terms on Birch Lane, which is one of our lifestyle brands?

It turns out that wall mirrors are most popular!

Next, we have a bit more of a granular search dashboard. This allows us to search for a search term, product, or otherwise, and view the results in a few ways. First, we have the count of searches by channel on the top. In this example, we’re looking at our friend “storage cabinets”. In the first chart, we have the counts of searches for this phrase by channel – we can see if this phrase was most popular through mobile, desktop, or otherwise. Below that – we have another visualization of our success. The cool thing about this dashboard is we can geo-locate this data as well. It looks like storage cabinets are popular everywhere, but what about something like pineapples?

This changes our geo-location information quite drastically. Apparently people in the Midwest don’t care about pineapples!

Furthermore – we can use this dashboard to determine the SKUs people clicked on after, and which store they were on. If we filter this by store as well, we’ll be able to see popular SKUs in that term for that store, how successful those searches were, and so on and so forth.

Next useful visual: fraudulent orders. The nature of being an online retail store is that people are going to try and scam us, or use stolen credit cards, or whatever. The good news is – our fraud evaluation system catches these. We decided to plot all fraudulent orders on a map, and break it down by state/city, as well as sku and customer ID. We now know where the majority of our fraud comes from, and which SKUs are commonly marked as fraudulent.

Another dashboard we were able to pull out of this data is a detailed order source visualization.

We are able to see order percentage by channel by store on top – turns out Birch Lane receives a ton of mobile traffic, and Wayfair Canada doesn’t! Then we broke down order count by store, and percent of total orders by store. At the bottom, we have revenue information by store and channel, by store, and then the average order cost (this is, of course, blurred, since we are a publicly traded company and all…).

Next up is a detailed order costs breakdown. This takes all of the parts of an order (shipping, tax, SKU quantity, total, etc) and puts them on a heatmap. This allows us to see potential shipping cost discrepancies in an area, and allows us to say “maybe we need a warehouse near <city>”. For the sake of screen space – there’s a few other heat maps on this dashboard, and they all represent a different portion of an orders cost. This gets particularly powerful when filtering by store – we can see exactly where people are or are not spending money by store – and maybe use that data for more marketing!

And then…. the finale. This is possibly the most important dashboard we built, and proves the true power of this engine. We used it to stalk our company founders!

We tracked their searches, orders, and activity. You can see that they’re mostly in Boston, but they travel to New York City once in a while. Click to embiggen, and you can see what they were looking for. The results show bathroom vanities and a high order count. When we showed the results to Steve Conine, he said “oh wow, look at that! I just remodeled my bathroom!”.

This just goes to show how deep this tool can go – we can see a high level view of all of our search traffic, all the way down to finding the behavior of individual customers.