Manipulate and analyze big data with the full expression power of Rax - Relational Algebra with eXtensions.

Sessionization with Temporal Union

A common problem in web analytics is sessionization or session reconstruction: taking a series of user events and splitting the series into a set of sessions [1]. Often, it is done based solely on the timestamps of the events: a session ends after a certain time of user inactivity. As an example, let's take the following click stream:

userId

page

timestamp

0001

"index.html"

2015-05-10 20:13:13

0001

"search.html"

2015-05-10 20:13:20

0001

"search.html"

2015-05-10 20:19:01

0001

"checkout.html"

2015-05-10 20:25:20

0002

"index.html"

2015-05-10 21:03:31

0002

"search.html"

2015-05-10 21:04:20

0001

"index.html"

2015-05-10 24:25:25

0001

"search.html"

2015-05-10 24:25:30

If we assume that a session ends after at least 15 minutes of user inactivity, the sessionized version of this data would look like this:

userId

session_begin

session_end

0001

2015-05-10 20:13:13

2015-05-10 20:25:20

0002

2015-05-10 21:03:31

2015-05-10 21:04:20

0001

2015-05-10 24:25:25

2015-05-10 24:25:30

So how would you solve it in SQL? The solution to a similar problem, Packing date and time intervals, was posted here. Packing of intervals means that you want to merge all intervals that overlap into one contiguous interval problem. We can transform our original sessionization problem into interval-packing problem by turning every click event into a 15-minute interval. If two such intervals overlap, it means that the original click events were less than 15 minutes apart. So packing such intervals, and then substracting 15 minutes from each row in the result, will give us the sessions. Here is an SQL query that does that: