How-to: Count Events Like a Data Scientist

The ability to quickly and accurately count complex events is a legitimate business advantage.

In our work as data scientists, we spend most of our time counting things. It is the foundational skill that is used in data cleansing, reporting, feature engineering, and simple-but-effective machine learning models like Naive Bayes classifiers. Hilary Mason has a quote about the benefits of counting that I love:

Understand that what big data really means is to be able to count things in data sets of any size, rapidly. And the advantage we get from that is not a technical advantage…it’s actually a cognitive advantage.

Learning how to count things quickly and accurately is both incredibly powerful and surprisingly challenging. To illustrate this, let’s walk through a classic problem that shows you how to think about counting the way a data scientist does. In order to make this example as broadly accessible as possible, I’m going to do every step of this analysis via SQL, instead of with a programming language like Python or Java.

Your First Data Science Project: Analyzing Misspelled Queries

Let’s say that you have just been hired as a data scientist to work on a new search engine for mobile applications. Users have been complaining that it is difficult to find the apps that they want to install:

They have a hard time typing the name of the apps on the phone’s keyboard without making a mistake or having autocorrect “fix” the spelling of a word.

Some of the apps have unusual names that don’t have a single obvious spelling.

The most popular apps inspire copycats with slightly different names, and it can be hard to identify the “real” app.

Your job is to analyze the impact of these misspelled or mistyped queries and then come up with some practical ways to help our users find the apps they want. Fortunately, the developers of the search engine have been keeping careful logs of all the queries and app installs that users have done, and these logs are stored in Hadoop and available for querying:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

>DESCRIBE searches;

event_id:bigint

account_id:bigint

query:string

tstamp_sec:bigint

(some other columns)

>DESCRIBE installs;

event_id:bigint

account_id:bigint

app_id:bigint

search_event_id:bigint

(some other columns)

Note that both of the tables define an event_id field, which is a unique identifier for every logged record. Both tables also contain information about the account_id that issued the query, which identifies a particular user of our search engine. The searches table also has information about the query that was issued and the time of day (in seconds) that the query was received, while the installs table contains an app_id field for the application that was installed as well as a search_event_idfield that can link an install event to the search event that generated it.

Let’s start by finding out how often people perform a search without installing an app. One way to do that is to find out how many event_ids in the searches table do not appear in the search_event_id column of the installs table:

1

2

>SELECT SUM(CAST(b.search_event_id ISNULLASINT))asgood_searches

FROM searchesaLEFT JOIN installsbONa.event_id=b.search_event_id;

While this is useful, it isn’t exactly what we want to know, since there can be lots of reasons why a user didn’t install an app after doing a search. They may have just been browsing for apps, or there could have been an app they wanted but it was too expensive for them to buy, or the app they found could have been fairly large and they wanted to wait until they were connected to a WiFi network to download it. We want to be able to distinguish mistyped queries from these other kinds of search events.

The Unreasonable Effectiveness of Counting

The trick is to realize that there isn’t any single event in our logs that identifies a misspelled query; rather, it’s a combination of events that relate to each other in a certain way:

A user types in a query, gets back a list of results, and doesn’t install any of them.

A few seconds later, the user types in a slightly different query, gets back a new list of results, and installs at least one of the apps in the list.

In our logs, that would be a search event that did not have an associated install event, followed by a search event that did have an associated install event. As an added benefit, we can assume that the query in the second search event is a spell correction of the first one, and we can count how often certain query pairs occur in order to build a powerful spell correction model.

It’s relatively easy to understand this pattern based on our own experience as searchers, but it’s more difficult to formulate a high-performance SQL query that will allow us to accurately count these events. The main challenge is that we need to analyze the relationship between multiple search events that are in the same table, and a naive approach to doing this requires a very slow and very resource-intensive self-join. Our job as data scientists is to figure out ways to make counting complex events fast, and if we have a strong command of SQL, one way of doing that is to use analytic SQL functions like LAG and LEAD in order to analyze the relationship between sequential rows in the same table:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

--Createatable that hasacolumn toindicate whetherasearch

--resulted inan install.

>CREATE TABLE search_installs AS

SELECTa.*,b.search_event_id isnotnullasinstalled

FROM searchesaLEFT JOIN installsbONa.event_id=b.search_event_id;

--Analyze sequential searches by user_id.

>CREATE TABLE spell_correction_candidates AS

SELECT query qw,installed iw,tstamp_sec tsw,

LEAD(query)OVERwqr,LEAD(installed)OVERwir,LEAD(tstamp_sec)OVERwtsr

FROM search_installs

WINDOWwAS(PARTITION BY user_id ORDER BY event_id);

--Aggregate queries over the sequential rows under the condition that

--the first query didn'tresult inan install,the second one did,

--andthe two queries occurred close together intime-under10seconds

--apart.

>SELECT qw,qr,count(*)cnt

FROM spell_correction_candidates

WHERE iw=falseANDir=trueANDtsr-tsw<10

GROUP BY qw,qr;

The first query in this sequence is just a variation on our earlier LEFT JOIN that adds an indicator column for whether or not a search resulted in an install or not. We then perform our analytic SQL query using the LEAD function to get the value of the query, installed, and tstamp_sec values from the next row in the sequence that is partitioned by user_id and ordered by event_id. Finally, we do a simple filter and aggregation query over the results of the analytic SQL query to get the counts of how frequently certain query pairs appear as spell correction candidates.

From Data Analyst to Data Scientist

Using the analytic SQL query is a definite improvement in our model, but it still has some limitations that we would like to overcome. Right now, we’re only counting the queries that immediately precede another one as spell correction candidates, but it’s likely that a user often makes multiple incorrect queries before they enter the query that leads to the app they want. What we would really like to do is consider all of the queries that a user issued in some window of time before the successful query as potential spell correction candidates. Unfortunately, expressing this idea with analytic SQL functions is difficult, because it’s not clear how many rows we would need to lag/lead within each partition in order to identify all of the candidate queries.

Right now, the data for each user is stored in separate rows within separate tables, but if we were to restructure the data into a single table, in which each row contained all of the information about a single user — every search and every install they did — then we would have all of the information we needed within each row. The resulting table’s schema would look something like this:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

>DESCRIBE sessions;

account_id:bigint

search_events:array<struct<

event_id:bigint

query:string

tstamp_sec:bigint

...

>>

install_events:array<struct<

event_id:bigint

search_event_id:bigint

app_id:bigint

...

>>

I call these “supernova schemas” because they look like a star schema where the fact tables have collapsed into one of the dimensions. It’s not immediately obvious that the supernova will help us solve our counting problem, because HiveQL does not have extensive built-in functions for working with arrays of structs. At this point, most data scientists would turn to programming languages like Java or Python in order to complete this analysis. But my commitment was to do every step of the analysis in SQL, so I wrote an extension to Hive that allows us to treat the arrays of structs within each row as if they were tiny database tables that can be queried via SQL. The resulting query looks like this:

1

2

3

4

5

6

7

8

9

10

11

12

--Execute the spell correction analysis;

SELECTa.qw,a.qr,count(*)ascnt

FROM sessions

LATERAL VIEW WITHIN(

"SELECT bad.query qw, good.query qr

FROM t1 as bad, t1 as good

WHERE bad.tstamp_sec < good.tstamp_sec

AND good.tstamp_sec - bad.tstamp_sec < 30

AND bad.event_id NOT IN (select search_event_id FROM t2)

AND good.event_id IN (select search_event_id FROM t2)",

search_events,install_events)a

GROUP BYa.qw,a.qr;

The WITHIN function is a Hive table generating function, and we’re using Hive’s LATERAL VIEW syntax to express the aggregation of the results of the function across all of the rows in the table. Each of the arrays that is passed in after the nested SQL query gets assigned a position-based alias, so search_events is t1 and install_events is t2. Inside of the WITHIN function, we’re using the newly incubating Apache Optiq project to execute a self-join on the (tiny) search_events array in order to identify all of the candidate spell correction queries within a 30-second time window.

I decided to call this project Exhibit, and I hope that it is a useful way for data analysts who are making the transition to data science to work effectively with complex, nested records. I talked about building and using supernova schemas to solve other data science problems at MidwestIO and the GraphLab Conference.

Data Science Superpowers

Although this particular analysis focused on analyzing mistyped and misspelled queries, the form of this analysis occurs in lots of different problem domains. For example, you could use these counting techniques to identify usability problems with your website by analyzing visits that are followed by calls to a customer care center. A hospital could analyze treatment records to identify events that lead to certain negative health outcomes in patients like sepsis. Telecommunications providers can analyze sequences of dropped phone calls to identify locations that have poor network coverage.

The freedom to structure data in a way that makes it easy to quickly and accurately count complex events is my favorite data science superpower. As Hilary said, this isn’t simply a technical advantage, it’s a cognitive one: it allows me to understand the way that my customers experience my business, and then use that understanding to find new ways to improve that experience and create value.

One response on “How-to: Count Events Like a Data Scientist”

Josh…
Thanks for the great article and straightforward example of using the Supernova schema.

Question. I have heard you emphasize the “tiny” internal “record within a record” notion of this schema. I’m wondering if there are any implications to consider if that internal record was actually quite large. For example, in the Oil and Gas industry, we have devices that are connected to multiple sensors. If the parent fact table describes a device, and the child arrays are device connected sensor measurements, are there problems with using Exhibit’s “lateral view within” to query those child arrays which might be quite large.