To date most of my online demos and tutorials have used a very simple data set consisting of a single table with 60 rows of fictitious trade data. Over the last few months I have been searching around for a slightly bigger and more challenging data set to use. Fortunately, our BIg Data PMs (led by Marty Gubar) have put together a very rich big data schema as part of their recently launched Big Data Virtual Machine that you can download from our OTN web page, see here: http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html.

The data set is based around an online customer movie application. Here is a little background…

Oracle MoviePlex Demo Application

Oracle MoviePlex is a fictitious on-line movie streaming company. Customers log into Oracle MoviePlex where they are presented with a targeted list of movies based on their past viewing behavior. Because of this personalized experience and reliable and fast performance, customers spend a lot of money with the company and it has become extremely profitable.

As the users watch movies, search for their favorite actors, rate their movies the system records all the clicks in a log file. The contents of that log file looks like this:

In its raw state this information is not very helpful. It needs a lot of processing to convert the various data points into usable information. In many cases companies have been turning to Hadoop and its related Java-based programming language MapReduce to process and convert these types of files into usable information. Most business users will want to summarise this information by customer and/or movie and then merge this information with other data sets. So how can we make access to and analysis of this type of data much easier? As part of this post I am going to compare the Hadoop-Java-MapReduce approach with an alternative approach using 12c SQL. My objective is not to determine which is solution is the best because each approach has its own camp of supporters. Once we have reviewed the two solutions I will put everything into context and make some recommendations…..so let's bring on the code!

Sessionization using Java

Accessing the Avro file

At the moment In the context of Big Data, everything seem to revolve around Hadoop, MapReduce and Java. It is quite natural for a big data developer to extend their basic map reduce processing to include more complicated requirements. In the case of our movie demo there is a lot of processing that needs to be done using the native features of Hadoop to collect and process the weblog file being generated by our application. There is an excellent video prepared by Marty Gubar (Director of Product Management for Big Data) which explains this process. This is part four of a six-part series that explains the movieplex demo: Part 4. Turn Clicks into Value - Flume & Hive. The movie demo lasts about 5 mins and you can watch here: https://www.youtube.com/watch?v=IwrjJUoUwXY.

The steps shown in the movie explain how to access the avro file and then how to clean the data to provide some interesting additional metrics.

Calculating the sessionization data

Creating the sessionization analysis is a little more complicated. In fact, it requires 370 lines of Java code. Here is the Java code we created for doing the sessionization analysis (the code window is scrollable so you can review the very long code sample):

The actual logic for the sessionization analysis is about 100 lines of code as shown here (at s before, the code window is scrollable so you can review the very long code sample):

As you can see from the code listing this requires a strong knowledge of Java and with 370 lines of code, if we decide to change the details of the pattern that we are searching for it is going to be a lengthy process to make the required changes. Making anything changes to the code to reflect changing business requirements is definitely going to be beyond the skills of your average business analyst and/or data scientists and this might negatively impact the level of project agility.

Making life easier and richer with SQL

My personal view is that sessionization analysis is quite a sophisticated requirement and is best achieved using SQL because the requirements and parameters for the analysis will naturally evolve over time as new questions are triggered by existing answers. Therefore, you need an agile approach to analytics that can deal with constant changing requirements during the data discovery phase.

Accessing the Avro file

First step is to create a DIRECTORY object to access the raw data file:

Cleaning the data

Now the next stage is to clean the data and remove the column identifiers such as custId, movieId, genreId etc. To do this processing I simply used the SQL SUBSTR() function. At the same time I decided to breakout the activity ID column to create unique columns for each type of activity. This approach is useful if you are going to be using the data as part of a data mining project because in many cases it useful to uniquely identify specific attributes. In many cases you will want to create a fact table from this process but in this case I have created a view while I do some basic prototyping:CREATE OR REPLACE VIEW VW_SESSION_DATA AS SELECT

SUBSTR(CUST_ID,11) AS CUST_ID, CASE SUBSTR(MOVIE_ID,11) WHEN 'null' THEN null ELSE SUBSTR(MOVIE_ID,11) END AS MOVIE_ID, CASE SUBSTR(GENRE_ID,11) WHEN 'null' THEN null ELSE SUBSTR(GENRE_ID,11) END AS GENRE_ID, TO_DATE(SUBSTR(SESSION_DATE, 9,19), 'YYYY-MM-DD HH24:MI:SS') AS SESS_DATE, CASE SUBSTR(RECOMMENDED_ID,15) WHEN 'null' THEN null ELSE SUBSTR(RECOMMENDED_ID,16,1) END AS REC_ID
, (CASE SUBSTR(ACTIVITY_ID,12,2)
WHEN '1' THEN '1'
WHEN '2}' THEN '2'
WHEN '3}' THEN '3'
WHEN '4}' THEN '4'
WHEN '5}' THEN '5'
WHEN '6}' THEN '6'
WHEN '7}' THEN '5'
WHEN '8}' THEN '8'
WHEN '9}' THEN '9'
WHEN '10' THEN '10'
WHEN '11' THEN '11'
ELSE null END) AS ACTIVITY_ID
, CASE SUBSTR(ACTIVITY_ID,12) WHEN 'null' THEN null ELSE SUBSTR(ACTIVITY_ID,12,1) END as ACTIVITY_ID, CASE SUBSTR(RATING_ID,9) WHEN 'null' THEN null ELSE SUBSTR(RATING_ID,10,1) END as RATING_ID, case SUBSTR(ACTIVITY_ID,12,1) WHEN '1' THEN 'Y' END as act_rate, case SUBSTR(ACTIVITY_ID,12,1) WHEN '2' THEN 'Y' END as act_complete , case SUBSTR(ACTIVITY_ID,12,1) WHEN '3' THEN 'Y' END as act_pause, case SUBSTR(ACTIVITY_ID,12,1) WHEN '4' THEN 'Y' END as act_start , case SUBSTR(ACTIVITY_ID,12,1) WHEN '5' THEN 'Y' END as act_browse , case SUBSTR(ACTIVITY_ID,12,1) WHEN '6' THEN 'Y' END as act_list, case SUBSTR(ACTIVITY_ID,12,1) WHEN '7' THEN 'Y' END as act_search, case SUBSTR(ACTIVITY_ID,12,1) WHEN '8' THEN 'Y' END as act_login, case SUBSTR(ACTIVITY_ID,12,1) WHEN '9' THEN 'Y' END as act_logout, case SUBSTR(ACTIVITY_ID,12,1) WHEN '10' THEN 'Y' END as act_incomplete, case SUBSTR(ACTIVITY_ID,12,1) WHEN '11' THEN 'Y' END as act_purchaseFROM RAW_SESSION_DATA;

Running a query against this view generates our basic data set that we can then feed into our MATCH_RECOGNIZE clause to create the sessionization result set.

Is there a better, simpler way to deal with the avro file? Well yes there is. As part of the Database 12c release programme we will be adding in-database support for JSON data. This will allow us to take the raw avro file and access stand query it directly from within the database without the need to go through the complicated cleaning process! At the moment we have not made any collateral (PPTs etc) on this feature publicly available so I can't provide you with any supporting links that will give you more information. However, once this feature is released I will revisit this little demo to explore how the new JSON feature can be used along side SQL pattern matching.

This statement uses many of the MATCH_RECOGNIZE features that I have discussed in previous posts. In this specific example I have decided to set the duration between events within a session as 5 minutes. That means if the user does nothing for about 5 minutes then I will assume that a new session has started even if the user has not logged out.

Beyond the basic sessionization model there is another interesting pattern that we can search for within our data set. If a user starts watching a movie then we might not get another event being logged for over two hours. Therefore, we could look for instances where the time between events is less than 150 minutes (assuming most movies last around 2 hours) and there is at least one (or more) "start" event is logged and at least one (or more) "complete" event is logged,

Looking at the results what really jumps out is the level of interaction on the site before, during and after the user has watched a movie. Many of the records that are returned show users searching our movie database, going to their lists of favourite movies and rating the movies they have just watched. All this implies that if we can get customers to begin watching a movie there is a huge knock-on effect in terms of how long they remain on the site and the number of "events" they create while on the site. This is the sort of information that is useful to marketing and campaign management teams.

A different view: Calculating the sessionization data for each movie

While developing the above analysis I began to wonder about analysing the data not from a customer perspective but from the point of view of each movie. This would allow me to look at the session activity over time for each movie and understand the usage patterns across movies and genres. For example, are there certain movies or types of movies that are more or less likely to be rated. Is the process of watching a movie the same across all movies in terms of the events that are registered during a session? There are all sorts of new avenues of analysis that could be created from looking at the data in this way.

So how easy would it be to change the focus of the analysis to movies? Well the code took me approximately 30 seconds to write - actually cut & paste from the customer sessionization example, switch the customer id for the movie id and finally removed a few fields. Here is the new code:

The speed at which you can create these additional/alternative result sets is really great! It is now relatively easy to continue this discovery process by analysing the information by movie genre or any of the other related attributes.

Summary

The main take-away from this post is that (hopefully) I shown how easy it is to use SQL for sessionization analysis. Firstly in terms of creating the framework to support the normal "sessionization" transformation process: taking the source log file, extracting the data points and then grouping the information by user. The great news is that processing of the source file will be greatly simplified when the JSON feature is released because it will be possible to query the log file in a much simpler way to extract the required data points. Secondly, SQL provides a much simpler and easier way to support the evolutionary of the discovery process. Changing the focus from customer analysis to movie analysis is both quick and easy. I am not convinced that a Java-based approach to data discovery can offer the same level of simplicity and agility but I will let you be the judge of that last statement….

Comments

Post a Comment

Popular posts from this blog

This post covers one of the new SQL performance enhancements that we incorporated into Database 12c Release 2. All of these enhancements are completely automatic, i.e. transparent to the calling app/developer code/query. These features are enabled by default because who doesn’t want their queries running faster with zero code changes?

So in this post I am going to focus on the new In-Memory “cursor duration” temporary table feature. Let’s start by looking at cursor duration temp tables…Above image courtesy of wikimedia.org
What is a cursor duration temp table?
This is a feature that has been around for quite a long time. Cursor duration temporary tables (CDTs) are used to materialize intermediate results of a query to improve performance or to support complex multi step query execution. The following types of queries commonly use cursor duration temp tables:WITH Clause and parallel recursive WITHGrouping SetsStar TransformationFrequent Item Set CountingXLATE
What happens during the …

….and now it’s here in PDF format as well!The free big data warehousing Must-See guide for OpenWorld 2017 is now available for download in PDF format - click here, and yes it’s completely free. This comprehensive guide covers everything you need to know about this year’s Oracle OpenWorld conference so that when you arrive at Moscone Conference Center you are ready to get the most out of this amazing conference. The guide contains the following information:Page 8 - On-Demand VideosPage 17 - JustifyYour tripPage 19 - KeyPresentersPage 41 - Must See SessionsPage 90 - Useful MapsChapter 1 - Introduction to the must-see guide.Chapter 2 - A guide to the key the highlights from last year’s conference so you can relive the experience or see what you missed. Catch the most important highlights from last year's OpenWorld conference with our on demand video service which covers all the major keynote sessions. Sit back and enjoy the highlights. The second section explains why you need to atte…

MATCH_RECOGNIZE and predicates
At a recent user conference I had a question about when and how predicates are applied when using MATCH_RECOGNIZE so that’s the purpose of this blog post. Will this post cover everything you will ever need to know for this topic? Probably!
Where to start….the first thing to remember is that the table listed in the FROM clause of your SELECT statement acts as the input into the MATCH_RECOGNIZE pattern matching process and this raises the question about how and where are predicates actually applied. I briefly touched on this topic in part 1 of my deep dive series on MATCH_RECOGNIZE: SQL Pattern Matching Deep Dive - Part 1.
In that first post I looked at the position of predicates within the explain plan and their impact on sorting. In this post I am going to use the built in measures (MATCH_NUMBER and CLASSIFIER) to show the impact of applying predicates to the results that are returned.
First, if you need a quick refresher course in how to use the MATCH…

Keith Laker

Keith Laker

Disclaimer

Opinions expressed are entirely my own and do not reflect the position of Oracle or any other corporation. Do NOT take anything written here, unless explicitly mentioned otherwise, to be Oracle policy or reflecting Oracle's support policy.

About Me

I have been working with Oracle data warehouse technology for over 20 years working on a wide variety of data warehouse projects both as a consultant and an onsite support engineer. I am now part of the Data Warehouse Product Management Team where I am responsible for analytical SQL. I am based in the UK at our Manchester office.

A key part of my role is to work with our sales teams to brief our customers on data warehousing and analytical SQL: explaining the wide variety of new and exciting opportunities that our DW and analytical solutions can support.

I regularly deliver sales training for data warehousing and analytical SQL across all our sales regions and provide competitive intelligence support across all the major data warehouse vendors.