Evaluating KSQL has been high on my to-do list ever since it was released back in August. I wanted to experiment with it using an interesting, high velocity, real-time data stream that would allow me to analyse events at the millisecond level, rather than seconds or minutes. Finding such a data source, that is free of charge and not the de facto twitter stream, is tricky. So, after some pondering, I decided that I'd use my Thrustmaster T300RS Steering Wheel/Pedal Set gaming device as a data source,

The idea being that the data would be fed into Kafka, processed in real-time using KSQL and visualised in Grafana.

This is the end to end pipeline that I created...

...and this is the resulting real-time dashboard running alongside a driving game and a log of the messages being sent by the device.

This article will explain how the above real-time dashboard was built using only KSQL...and a custom Kafka producer.

I'd like to point out, that although the device I'm using for testing is unconventional, when considered in the wider context of IoT's, autonomous driving, smart automotives or any device for that matter, it will be clear to see that the low latency, high throughput of Apache Kafka, coupled with Confluent's KSQL, can be a powerful combination.

I'd also like to point out, that this article is not about driving techniques, driving games or telemetry analysis. However, seeing as the data source I'm using is intrinsically tied to those subjects, the concepts will be discussed to add context. I hope you like motorsports!

Writing a Kafka Producer for a T300RS

The T300RS is attached to my Windows PC via a USB cable, so the first challenge was to try and figure out how I could get steering, braking and accelerator inputs pushed to Kafka. Unsurprisingly, a source connector for a "T300RS Steering Wheel and Pedal Set" was not listed on the Kafka Connect web page - a custom producer was the only option.

To access the data being generated by the T300RS, I had 2 options, I could either use an existing Telemetry API from one of my racing games, or I could access it directly using the Windows DirectX API. I didn't want to have to have a game running in the background in order to generate data, so I decided to go down the DirectX route. This way, the data is raw and available, with or without an actual game engine running.

The producer was written using the SharpDX .NET wrapper and Confluent's .NET Kafka Client. The SharpDX directinput API allows you to poll an attached input device (mouse, keyboard, game controllers etc.) and read its buffered data. The buffered data returned within each polling loop is serialized into JSON and sent to Kafka using the .NET Kafka Client library.

A single message is sent to a topic in Kafka called raw_axis_inputs every time the state of one the device's axes changes. The device has several axes, in this article I am only interested in the Wheel, Accelerator, Brake and the X button.

{
"event_id":4300415, // Event ID unique over all axis state changes
"timestamp":1508607521324, // The time of the event
"axis":"Y", // The axis this event belongs to
"value":32873.0 // the current value of the axis
}

This is what a single message looks like. In the above message the Brake axis state was changed, i.e. it moved to a new position with value 32873.

You can see below which inputs map to the each reported axis from the device.

You can tell by looking at the timestamps, it's possible to have multiple events generated within the same millisecond, I was unable to get microsecond precision from the device unfortunately. When axes, "X", "Y" and "RotationZ" are being moved quickly at the same time (a bit like a child driving one of those coin operated car rides you find at the seaside) the device generates approximately 500 events per second.

Creating a Source Stream

Now that we have data streaming to Kafka from the device, it's time to fire up KSQL and start analysing it. The first thing we need to do is create a source stream. The saying "Every River Starts with a Single Drop" is quite fitting here, especially in the context of stream processing. The raw_axis_inputs topic is our "Single Drop" and we need to create a KSQL stream based on top of it.

With the stream created we can we can now query it. I'm using the default auto.offset.reset = latest as I have the luxury of being able to blip the accelerator whenever I want to generate new data, a satisfying feeling indeed.

We now have our source stream created and can start creating some derived streams from it. The first derived stream we are going to create filters out 1 event. When the X button is pressed it emits a value of 128, when it's released it emits a value of 0.

To simplify this input, I'm filtering out the release event. We'll see what the X button is used for later in the article.

From this stream we are going to create 3 further streams, one for the brake, one the accelerator and one for the wheel.

All 3 axes emit values in the range of 0-65535 across their full range. The wheel emits a value of 0 when rotated fully left, a value of 65535 when rotated fully right and 32767 when dead centre. The wheel itself is configured to rotate 900 degrees lock-to-lock, so it would be nice to report its last state change in degrees, rather than from a predetermined integer range. For this we can create a new stream, that includes only messages where the axis = 'X', and the axis values are translated into the range of -450 degrees to 450 degrees. With this new value translation, maximum rotation left now equates to 450 degrees and maximum rotation right equates -450 degrees, 0 is now dead centre.

We now need to create 2 more derived streams to handle the accelerator and the brake pedals. This time, we want to translate the values to the range 0-100. When a pedal is fully depressed it should report a value of 100 and when fully released, a value of 0.

It is a general rule of thumb in motorsports that "Smooth is Fast", the theory being that the less steering, accelerator and braking inputs you can make while still keeping the car on the desired racing line, results in a faster lap time. We can use KSQL to count the number of inputs for each axis over a Hopping Window to try and capture overall smoothness. To do this, we create our first KSQL table.

A KSQL table is basically a view over an existing stream or another table. When a table is created from a stream, it needs to contain an aggregate function and group by clause. It's these aggregates that make a table stateful, with the underpinning stream updating the table's current view in the background. If you create a table based on another table you do not need to specify an aggregate function or group by clause.

The table we created above specifies that data is aggregated over a Hopping Window. The size of the window is 5 seconds and it will advance or hop every 1 second. This means that at any one time, there will be 5 open windows, with new data being directed to each window based on the key and the record's timestamp.

You can see below when we query the stream, that we have 5 open windows per axis, with each window 1 second apart.

This is the resulting graph in Grafana with each axis stacked on top of each other giving a visual representation of the total number of events overall and total per axis. The idea here being that if you can drive a lap with less overall inputs or events then the lap time should be faster.

Calculating Lap Times

To calculate lap times, I needed a way of capturing the time difference between 2 separate events in a stream. Remember that the raw data is coming directly from the device and has no concept of lap, lap data is handled by a game engine.
I needed a way to inject an event into the stream when I crossed the start/finish line of any given race track. To achieve this, I modified the custom producer to increment a counter every time the X button was pressed and added a new field to the JSON message called lap_number.

I then needed to recreate my source stream and my initial derived stream to include this new field

The next step is to calculate the time difference between each "Buttons5" event (the X button). This required 2 new tables. The first table below captures the latest values using the MAX() function from the axis_inputs stream where the axis = 'Buttons5'

This new stream is again based on the axis_inputs stream where the axis = 'Buttons5'. We are joining it to our lap_marker_data table which results in a stream where every row includes the current and previous values at the point in time when the X button was pressed.

A quick query should illustrate this (I've manually added column heading to make it easier to read)

We can now see the time difference, in milliseconds ( LAP_TIME_MS ), between each press of the X button. This data can now be displayed in Grafana.

The data is also being displayed along the top of the dashboard, aligned above the other graphs, as a ticker to help visualize lap boundaries across all axes.

Anomaly Detection

A common use case when performing real-time stream analytics is Anomaly Detection, the act of detecting unexpected events, or outliers, in a stream of incoming data. Let's see what we can do with KSQL in this regard.

Driving Like a Lunatic?

As mentioned previously, Smooth is Fast, so it would be nice to be able to detect some form of erratic driving. When a car oversteers, the rear end of the car starts to rotate around a corner faster than you'd like, to counteract this motion, quick steering inputs are required to correct it. On a smooth lap you will only need a small part of the total range of the steering wheel to safely navigate all corners, when you start oversteering you will need make quick, but wider use of the total range of the wheel to keep the car on the track and prevent crashing.

To try and detect oversteer we need to create another KSQL table, this time based on the steering_inputs stream. This table counts steering events across a very short hopping window. Events are counted only if the rotation exceeds 180 degrees (sharp left rotation) or is less than -180 degrees (sharp right rotation)

This data is plotted on the Y axis (we're talking graphs now) on the "Steering inputs" panel in Grafana. The oversteer metric can be seen in red and will spike when steering input exceeds 180 degrees in either direction.

Braking too Hard?

Another anomaly I'd like to detect is when maximum brake pressure is applied for too long. Much like the brake pedal in a real car, the brake pedal I'm using has a very progressive feel, a fair amount of force from your foot is required to hit maximum pressure. If you do hit maximum pressure, it shouldn't be for long as you will most likely lock the wheels and skid off the race track, very embarrassing indeed.

The first thing to do is to create a table that will store the last time maximum brake pressure was applied. This table is based on the brake_inputs stream and filters where the value = 100

Something worth mentioning is that if I hold my foot on the brake pedal at the maximum pressure for any period of time, only one event is found in the stream. This is because the device only streams data when the state of an axis changes. If I keep my foot still, no new events will appear in the stream. I'll deal with this in a minute.

Next we'll create a new stream based on the brake_inputs stream and join it to our max_brake_power_time table.

For each row in this stream we now have access to all columns in the brake_inputs stream plus a timestamp telling us when max brake power was last reached. With this data we create a new derived column bi.timestamp - mb.last_max_brake_ts AS time_since_max_brake_released which gives a running calculation of the difference between the current record timestamp and the last time maximum brake pressure was applied

For example, when we query the stream we can see that maximum pressure was applied at timestamp 1508772739115 with a value of 100.0. It's the row immediately after this row that we're are interested in 99.90234225 | 1508772740803 | 1508772739115 | 1688.

Remember, that while an axis is held at the same value, 100.0 in this case, no more events will appear in the stream until the value changes again. This is why we are interested in the row preceding the maximum value, this row is telling us how long the value of 100.0 was applied for. In this case the time it was held for was 1688 milliseconds. Notice that on subsequent rows the value increases, but we are not interested in those rows. In order to isolate what we want, we need another table. This new table takes our previously created stream, brake_inputs_with_max_brake_power_time and groups it by the last_max_brake_ts column. For each grouping we then get the MIN(time_since_max_brake_released).

When we query this table, while stepping hard on the brake pedal for a few seconds at a time, we get the information we want. We can see the timestamp for when maximum brake pressure reached and for how long it was sustained.

Here's what the above data looks like when visualised in Grafana. The bottom graph is showing when maximum brake pressure was hit and on for how long it was sustained. I've set a threshold against the graph of 1 second so any extreme braking is clearly identifiable - if you're that hard on the brakes for that long, you're probably going to end up in the scenery.

The Tale of 2 Laps

After putting it all together, it's time to take to the track and see how it looks. This video shows 2 complete laps onboard with the Caterham Seven 620R around Brands Hatch in the UK. The first lap is a relatively smooth one and the second is quite ragged. Notice that the first lap ( lap 68 ) is quicker overall than the second ( lap 69 ). On lap 69, I start to drive more aggressively and oversteer spikes start to appear in the steering input graph. Lap 69 also has significantly more events overall than lap 68 as a result my more exuberant ( slower ) driving style. You'll also notice that maximum brake pressure is reached a couple of times on each lap, but for no longer than the threshold of 1 second on each occurrence.

Summary

KSQL is awesome! Although it's only a developer preview at this point, it's impressive what you can get done with it. As it evolves over time and mirrors more of the functionality of the underlying Streams API it will become even more powerful, lowering the barrier to entry for real-time stream processing further and further. Take a look at the road map to see what may be coming next.

Oh, and I recently discovered on the #KSQL community Slack group, that you can execute KSQL in Embedded Mode right inside your Java code, allowing you to mix the native Streams API with KSQL - very nice indeed !

Few weeks back, while I was enjoying my holidays in the south of Italy, I started receiving notifications about an imminent announcement by Confluent. Reading the highlights almost (...I said almost) made me willing to go immediately back to work and check all the details about it.
The announcement regarded KSQL: a streaming SQL engine for Apache Kafka!

Before going in detail, lets try to clarify the basics: what is KSQL? Why was it introduced and how does it complement Kafka?

What is KSQL?

We have been writing about Kafka several times, including my recent blogs were I was using it as data hub to capture Game of Thrones tweets and store them in BigQuery in order to do sentiment analysis with Tableau. In all our examples Kafka has been used just for data transportation with any necessary transformation happening in the target datastore like BigQuery, with the usage of languages like Python and engines like Spark Streaming or directly in the querying tool like Presto.

KSQL enables something really effective: reading, writing and transforming data in real-time and a scale using a semantic already known by the majority of the community working in the data space, the SQL!

KSQL is now available as developer preview, but the basic operations like joins, aggregations and event-time windowing are already covered.

What Problem is KSQL Solving?

As anticipated before, KSQL solve the main problem of providing a SQL interface over Kafka, without the need of using external languages like Python or Java.
However one could argue that the same problem was solved before by the ETL operations made on the target datastores like Oracle Database or BigQuery. What is the difference then in KSQL approach? What are the benefits?

The main difference in my opinion is the concept of continuous queries: with KSQL transformations are done continuously as new data arrives in the Kafka topic. On the other side transformations done in a database (or big data platforms like BigQuery) are one off and if new data arrives the same transformation has to be executed again.

So what is KSQL good for? Confluent's KSQL introduction blog post provides some use cases like real time analytics, security and anomaly detection, online data integration or general application development. From a generic point of view KSQL is what you should use when transformations, integrations and analytics need to happen on the fly during the data stream. KSQL provides a way of keeping Kafka as unique datahub: no need of taking out data, transforming and re-inserting in Kafka. Every transformation can be done Kafka using SQL!

As mentioned before KSQL is now available on developer preview and the feature/function list is somehow limited compared to more mature SQL products. However in cases where very complex transformations need to happen those can still be solved either via another language like Java or a dedicated ETL (or view) once the data is landed in the destination datastore.

How does KSQL work?

So how does KSQL work under the hood? There are two concepts to keep in mind: streams and tables. A Stream is a sequence of structured data, once an event was introduced into a stream it is immutable, meaning that it can't be updated or deleted. Imagine the number of items pushed or pulled from a storage: "e.g. 200 pieces of ProductA were stocked today, while 100 pieces of ProductB were taken out".
A Table on the other hand represents the current situation based on the events coming from a stream. E.g. what's the overall quantity of stocks for ProductA? Facts in a table are mutable, the quantity of ProductA can be updated or deleted if ProductA is not anymore in stock.

KSQL enables the definition of streams and tables via a simple SQL dialect. Various streams and tables coming from different sources can be joined directly in KSQL enabling data combination and transformation on the fly.

Each stream or table created in KSQL will be stored in a separate topic, allowing the usage of the usual connectors or scripts to extract the informations from it.

KSQL in Action
Starting KSQL

KSQL can work both in standalone and client-server mode with the first one aimed at development and testing scenarios while the second supporting production environments.
With the standalone mode KSQL client and server are hosted on the same machine, in the same JVM. On the other side, in client-server mode, a pool of KSQL server are running on remote machine and the client connects to them over HTTP.

For my test purposes I decided to use the standalone mode, the procedure is well explained in confluent documentation and consist in three steps:

Clone the KSQL repository

Compile the code

Start KSQL using local parameter

./bin/ksql-cli local

Analysing OOW Tweets

I'll use for my example the same Twitter producer created for my Wimbledon post. If you notice I'm not using the Kafka Connect, this is due to KSQL not supporting AVRO formats as of now (remember is still in dev phase?). I had then to rely on the old producer which stored the tweet in JSON format.

For my tests I've been filtering the tweets containing OOW17 and OOW (Oracle Open World 2017), and as mentioned before, those are coming in JSON format and stored in a Kafka topic named rm.oow. The first step is then to create a Stream on top of the topic in order to structure the data before doing any transformation.
The guidelines for the stream definition can be found here, the following is a cutdown version of the code used

Created_At VARCHAR: Created_At is a timestamp, however in the first stream definition I can't apply any date/timestamp conversion. I keep it as VARCHAR which is one of the allowed types (others are BOOLEAN, INTEGER, BIGINT, DOUBLE, VARCHAR, ARRAY<ArrayType> and MAP<VARCHAR, ValueType>).

User VARCHAR: the User field is a JSON nested structure, for the basic stream definition we'll leave it as VARCHAR with further transformations happening later on.

kafka_topic='rm.oow': source declaration

value_format='JSON': data format

Once created the first stream we can then query it in SQL like

select Created_at, text from twitter_raw

with the output being in the form of a continuous flow: as soon as a new tweet arrives its visualized in the console.

The first part I want to fix now is the Created_At field, which was declared as VARCHAR but needs to be mutated into timestamp. I can do it using the function STRINGTOTIMESTAMP with the mask being EEE MMM dd HH:mm:ss ZZZZZ yyyy. This function converts the string to a BIGINT which is the datatype used by Kafka to store timestamps.

Another section of the tweet that needs further parsing is the User, that as per the previous definition returns the whole nested JSON object.

An important thing to notice is that the Created_At is not encoded as BigInt, thus if I execute select Created_At from twitter_fixed I get only the raw number. To translate it to a readable date I can use the STRINGTOTIMESTAMP function passing the column and the data format.

The last part of the stream definition I wanted to fix is the settings of KEY and TIMESTAMP: a KEY is the unique identifier of a message and, if not declared, is auto-generated by Kafka. However the tweet JSON contains the Id which is Twitter's unique identifier, so we should to use it. TIMESTAMP associates the message timestamp with a column in the stream: Created_At should be used. I can defined the two above in the WITH clause of the stream declaration.

When doing a select * from twitter_with_key_and_timestamp we can clearly see that KSQL adds two columns before the others containing TIMESTAMP and KEY and the two are equal to Created_At and Id.

Now I have all the fields correctly parsed as KSQL stream, nice but in my previous blog post I had almost the same for free using Kafka Connect. Now It's time to discover the next step of KSQL: tables!

Let's first create a simple table containing the number of tweets by User_name.

When then executing a simple select * from table we can see the expected result.

Two things to notice:

We see a new row in the console every time there is a new record inserted in the oow topic, the new row contains the updated count of tweets for the screen_name selected

The KEY is automatically generated by KSQL and contains the screen_name

I can retrieve the list of tables define with the show tables command.

It's interesting to notice that the format is automatically set as JSON. The format property, configured via the VALUE_FORMAT parameter, defines how the message is stored in the topic and can either be JSON or DELIMITED.

Windowing

When grouping, KSQL provides three different windowing functions:

Tumbling: Fixed size, non overlapping. The SIZE of the window needs to be specified.

Session: Fixed size, starting from the first entry for a particular Key, it remains active until a new message with the same key happens within the INACTIVITY_GAP which is the parameter to be specified.

I can create simple table definition like the number of tweets by location for each tumbling session with

It's interesting to notice that each entry (e.g. Europe North, Switzerland) is listed at least three times. This is due to the fact that in any point in time there are three overlapping windows (SIZE is 30 seconds and ADVANCE is 10 seconds). The same example can be turn into the session windows by just defining WINDOW SESSION (30 SECONDS).

The windowing is an useful option, especially when combined with HAVING clauses since it gives the option to define metrics for real time analysis.
E.g. I may be interested only items that have been ordered more than 100 times in the last hour, or, in my twitter example in user_locations having a nr_of_tweets greater than 5 in the last 30 minutes.

Joining

So far so good, a nice set of SQL functions on top of data coming from a source (in my case twitter). In the real word however we'll need to mix information coming from disparate sources.... what if I tell you that you can achieve that in a single KSQL statement?

To show an integration example I created a simple topic known_twitters using the kafka-console-producer.

Once started I can type in messages and those will be stored in the known_twitters topic. For this example I'll insert the twitter handle and real name of known people that are talking about OOW. The format will be:

username,real_name

like

FTisiot,Francesco Tisiot
Nephentur,Christian Berg

Once inserted the rows with the producer I'm then able to create a KSQL stream on top of it with the following syntax (note the VALUE_FORMAT='DELIMITED')

This is due to the fact that as of now KSQL supports only joins between a stream and a table, and the stream needs to be specified first in the KSQL query. If I then just swap the two sources in the select statement above:

select user_screen_name from PEOPLE_KNOWN_STREAM a join rm.tweets_by_users b on a.screen_name=b.user_screen_name;

...I get another error

Join type is not supportd yet: INNER

We have to remember that KSQL is still in developer beta phase, a lot of new features will be included before the official release.

adding a LEFT JOIN clause (see bug related) solves the issue and I should be able to see the combined data. However when running

select * from PEOPLE_KNOWN_STREAM left join TWEETS_BY_USERS on screen_name=user_screen_name;

Didn't retrieve any rows. After adding a proper KEY to the stream definition

I was able to retrieve the correct rowset! Again, we are in early stages of KSQL, those fixes will be enhanced or better documented in future releases!

Conclusion

As we saw in this small example, all transformations, summaries and data enrichments were done directly in Kafka with a dialect very easy to learn for anyone already familiar with SQL. All the created streams/tables are stored as Kafka topics thus the standard connectors can be used for sink integration.

As mentioned above KSQL is still in developer preview but the overall idea is very simple and at the same time powerful. If you want to learn more check out the Confluent page and the KSQL github repository!

After last year successful OTN Appreciation Day, it's time again to show our love for a particular feature in any Oracle's tool we use in our work. You may have noted a name change with OTN now becoming ODC: Oracle Developer Community.

What

The feature I want to speak about is OBIEE's Time Hierarchies.
For anybody in the BI business the time dimension(s) are the essence of the intelligence bit: being able to analyze trends, compare current period with previous one, plot year to date or rolling measures are just some of the requirements we get on daily basis.
A time hierarchy definition allows the administrator to set which time levels are exposed, how the rollup/drill down works and how previous/following members of the level are calculated.
Once the hierarchy is defined, all the related calculations are simple as calling a function (e.g. AGO), defining the level of detail necessary (e.g. Month) and the number of items to take into account (e.g. -1).

Time related level based measures - e.g. monthly sum of sales coming from a fact table at daily level

Why

Why do I like time hierarchies? Simple! It's a very clever concept in the RPD, which requires particular knowledge and dedicated attention.

If done wright, once defined, is available in every related table and makes the time comparison formulas easy to understand and to create. If done wrong, errors or slowness in the related analysis can be difficult to spot and improve/fix.

Still time hierarchies are a central piece in every BI implementation, so once understood and implemented correctly give a massive benefit to all developers.

How

We blogged about time dimensions and calculations back in 2007 when OBI was still on version 10! The original functionality is still there and the process to follow is pretty much the same.
Recently was introduced the concept of Logical Sequence Number, a way of speeding up some time series calculations by removing the ranking operations needed to move back (or forth) in history.

I wanted to keep the blog post short, since the time hierarchies information can be found in millions of blog posts. I just wanted the to give few hints to follow when creating a time hierarchy:

It can be created on any data with a predefined order, no need to be a date! you could compare for example a certain product with another in the inventory having the previous code.

The Chronological Key defines the sorting of the level, for example how years, months or dates are ordered. Ordering months alphabetically with a format like YYYY-MM it's correct while using MM-YYYY provides wrong results.

Double check the hierarchies, something like YEAR-> MONTH -> WEEK -> DATE can be incorrect since a week can be split in different months!

Set appropriately the number of elements for each level. This is useful, especially when the hierarchy is complex or pre-aggregated facts, for OBIEE to understand which table to query depending on the level of the analysis.

Setup the Logical Sequence Number. LSNs are useful if you are looking to reduce the impact of the time series processing at a minimum.

If you are looking for very optimal performances for a specific report, e.g. current year vs previous, physicalizing the time series result, previous year, directly in the table alongside with the current year will give what you're looking for.

This was just a quick overview of OBIEE's Time Hierarchies, why are so useful and what you should be looking after when creating them! Hope you found this short post useful.

Follow the #ThanksODC hashtag on Twitter to check which post have been published on the same theme!

The final #GoT episode was transmitted last Sunday, now two years waiting for the next season... How can HBO be so cruel??? And how can I find interesting content for my future blog posts???
At least now European football (not soccer) leagues are back, so TV-side I'm covered!

Going back to serious discussions, Game of Thrones last episode: Yay or Nay? The average sentiment for the episode (taking into account only tweets since Monday) was -0.012: it is negative but represents an improvement when compared to the two previous ones (with episode 6 having the most negative sentiment score).

But... Hey! What is the line on top going in time? The line it's due to the external R call and the fact that is forcing us to include the Tweet Text column in the analysis in order to be evaluated. The evaluation of the sentiment is applied on ATTR(Tweet Text) which means kind of SELECT DISTINCT Tweet_Text in Oracle terms. The line on top is drawn because the same Tweet Text was tweeted across several weeks.

Please notice that the three overall sentiments are close (between 0.01 and 0.10) so, when looking in detail at the distribution of sentiment scores across the episodes we can see that, as expected, are similar.

Zooming to single characters we can see the scatterplot of the last episode, with Jon Snow (or should I say Targaryen?) leading the number of mentions with surprisingly Littlefinger on the second spot and Arya on the third: probably the Baelish dying scene at Winterfell was something highly appreciated by the fans.

On the positive negative feeling almost nothing changed with Arya and the Night King being the negative and positive poles. I've been telling you about change of leadership on the various axes of the scatterplot by visually comparing today's scatterplot with the previous two. However the transition of the character position in the graph can be visualized again on multiple scatterplots.

By creating a scatterplot for each character and assigning to the episodes a different number (E05-1, E06-2, E07-3) I can clearly see how Davos Seaworth for example had a big sentiment variation going very positive in the last episode while Jaime Lanninster was more stable. Zooming into Davos position we can see how the sentiment distribution changed across episodes with the E06 representing the most negative while the E07 has almost all positive tweets.

Looking at the words composing Davos tweets we can immediately spot few thigs:

SIR has a positive sentiment (Sir Davos is how several characters call him) which is driving the overall score in the final episode

The number of tweets mentioning Davos was very small in E06 compared to the other two (we can see the same from the related scatterplot above)

In E07 we see a good number of circles having the same (big) size, possibly is the same text which has been tweeted several times.

To verify the last point we can simply show the Tweet Text along the # of Tweets and discover that almost the same positive Text count for over the 99% of the whole reference to the character.

Emotions

One of the cool functions of the Syuzhet package is named get_nrc_sentiment and allows the extrapolation of emotions from a text based on the NRC emotion lexicon. The function takes a text as input and returns a data frame containing a row for each sentence and a column for emotion or sentiment.
The sentiment can either be positive or negative which we already discussed a lot previously. The emotion is split in eight categories: anger, fear, anticipation, trust, surprise, sadness, joy, and disgust.

We can extract the eight different emotions into eight calculations with the following code

To calculate the Anger Emotion Score we are passing ATTR(Text), the list of Tweet's texts, and taking the output of the anger column of the dataframe. We can do the same for all the other emotions and create separate graphs to show their average across characters for the last episode. In this case I took Disgust, Anger, Fear, Joy and Trust.

We can then clearly see that Bran Stark is the character that has most Disgust associated to. Bron has a special mix of emotions, he's in the top for Anger, Fear and Joy, such a mix can justify the average sentiment which is close to neutral (see scatterplot above). On the Trust side we can clearly see that the North wins with Arya and Sansa on the top, interesting here is to see also Lord Varys.
Looking into BranDisgust detail we can see that is driven by the categorization of the BRAN word as disgusting, probably the dictionary doesn't like cereals.

Scene Emotions

In my previous post I've been talking about the "Game of Couples" and how a single character sentiment score could be impacted by a reference to a second character. For the last episode of the series I wanted to look at different scenes: the main characters I want to analyse are Jon Snow, Littlefinger and Sansa. Specifically I want to understand how people on Twitter reacted to the scenes where the two characters had a big impact: the death of Littlefinger declared by Sansa and the revelation of Jon Targaryen.

The first thing I wanted to check is the Surprise: How are characters categorized by this emotion? We can see Bron on top being driven by the word GOOD in the related tweets.

We can also notice that Petyr score is quite high (0.2590 and 2nd position) while Jon score is pretty low, probably averaged by the huge number of tweets. We can also see that Sansa score is not very high, even if she is the character providing quite a big shock when accusing Littlefinger.

The overall character average surprise doesn't seem to be very relevant, we need to find a way to filter tweets related to those particular scenes: we can do that by including only few keywords in the Tweet Text. Please note we are going to filter words that will create an OR condition. If a tweet contain ANY of the words mentioned, it will be included.

First I wanted to check which are the words in Jon's tweets driving the Surprise sentiment alongside the # of Tweets

However this is only giving us details on which words are classified as Surprise for Jon, nothing really related to the scenes. I can however filter only the tweets with an overall Surprise sentiment for Jon and check which words are mostly associated with them. I also added a filter for Tweets containing the words TARGARYEN OR SON since I assumed those two could be more frequently used describing the scene.

We can clearly see some patterns that are well recognized correctly by the Surprise metric: both Aegon (a reference to Jon's real name) and Aunt (reference to Lyanna or Deanerys?) are in the top 20 and a little bit further right in the graph we can also spot Father. There probably is also some surprise in tweets related to what's going to happen when Jon finds out he's a Targaryen since all keywords are present in the top 20.

When doing a similar analysis on Sansa I wanted to add another metric to the picture: the Average Sentence Emotion Score for all sentences including a word. With this metric we can see how a word (for example AMAZING) changes the average emotion of the sentences where is included. Analysing this metric alone however wouldn't be useful: obviously the words having more impact on emotion are the ones categorized as such in the related dictionary.

I found interesting the following view for Sansa: we see across all the tweets categorized as Surprizing, which are the words most mentioned (Y-axes) and what's the average Surprise emotion value for the sentences were those words were included.

We can spot that MURDER and TREASON were included with a big number of tweets (>500) having an average Surprise score around 2. This seems to indicate that the scene of Sansa convicting Lord Baelish wasn't expected from the fans.

One last graph shows how the character couples (remember the game of couples in my previous post?) have been perceived: the square color defines the average Surprise score while the position in the X-axis confidence (by the # of Tweets).

We can spot that the couple Cercei and Sansa is the one having most Surprise emotion, followed by Cercei and Daenerys. Those two couples may be expected since the single characters had major parts in the last episode. Something unexpected is the couple Sandor Clegane and Brienne, looking in detail, the surprise is driven by a mention to the word MURDER which is included in 57.76% of the Tweets mentioning both.

A last technical note: during the last few weeks I've collected about 700 thousands tweets, the time to analyse them highly depends on the complexity of the query. For simple counts or sums based only on BigQuery data I could obtain replies in few seconds. For other analysis, especially when sentiment or emotion was included, a big portion of the raw dataset was retrieved from BigQuery into Tableau, passed to R with the function results moved back to Tableau to be displayed. Those queries could take minutes to be evaluated.
As written in my previous blog post, the whole process could be speed up only by pre-processing the data and storing the sentiment/emotion in BigQuery alongside with the data.

my series of blog post about Game of Thrones tweet and press analysis with Kafka, BigQuery and Tableau! See you in two years for the analysis of the next season with probably a whole new set of technology!

Another week is gone, another "Game of Thrones" episode watched, only one left until the end of the 7th series.
The "incident" in Spain, with the episode released for few hours on Wednesday screwed all my plans to do a time-wise comparison between episodes across several countries.

I was then forced to think about a new action plan in order avoid disappointing all the fans who enjoyed my previous blog post about the episode 5. What you'll read in today's analysis is based on the same technology as before: Kafka Connect source from Twitter and Sink to BigQuery with Tableau analysis on top.

What I changed in the meantime is the data structure setup: in the previous part there was a BigQuery table rm_got containing #GoT tweets, an Excel table containing Keywords for each character together with the Name and the Family (or House). Finally there was a view on top of BigQuery rm_got table extracting all the words of each tweet in order to analyse their sentiment.
For this week analysis I tried to optimise the dataflow, mainly pushing data into BigQuery, and I added a new part to it: online press reviews analysis!

Optimization

As mentioned during my previous post, the setup described before was miming an analyst workflow, without writing access to datasource. However it was far from optimal performance wise, since there was a cartesian join between two data-sources, meaning that for every query all the dataset was extracted from BigQuery and then joined in memory in Tableau even if filters for a specific character were included.

The first change was pushing the characters Excel data in BigQuery, so at least we could use the same datasource joins instead of relying on Tableau's data-blend. This has the immediate benefit of running joins and filters in the datasource rather than retrieving all data and filtering locally in memory.
Pushing Excel data into BigQuery is really easy and can be done directly in the web GUI, we just need to transform the data in CSV which is one of allowed input data formats.

Still this modification alone doesn't resolve the problem of the cartesian join between characters (stored in rm_characters) and the main rm_got table since also BigQuery native joining conditions don't allow the usage of the CONTAIN function we need to verify that the character Key is contained in the Tweet's Text.
Luckily I already had the rm_words view, used in the previous post, splitting the words contained in the Tweet Text into multiple rows. The view contained the Tweet's Id and could be joined with the characters data with a = condition.

However my over simplistic first implementation of the view was removing only # and @ characters from the Tweet text, leaving all the others punctuation signs in the words as you can see in the image below.

REPLACE(UPPER(TEXT), 'NIGHT KING', 'NIGHTKING'): Since I'm splitting words, I don't want to miss references to the Night King which is composed by two words that even if written separated point the same character.

REGEXP_REPLACE(..,'[^a-zA-Z]',' '): Replaces using regular expression, removing any character apart from the letters A-Z in lower and upper case from the Tweet Text.

The new view definition provides a clean set of words that can finally be joined with the list of characters keys. The last step I did to prepare the data was to create an unique view containing all the fields I was interested for my analysis with the following code:

The view rm_words is used two times: one, as mentioned before, to join the Tweet with the character data and one to show all the words contained in a tweet.

The (SELECT * FROM FLATTEN([DataSet.rm_words],f0__group.word)) subselect is required since word column, contained in rm_words, was a repeated field, that can't be used in joining condition if not flatten.

Please note that the SQL above will still duplicate the Tweet rows, in reality we'll have a row for each word and different character Key contained in the Text itself. Still this is a big improvement from the cartesian join we used in our first attempt.

One last mention to optimizations: currently the sentence and word sentiment is calculated on the fly in Tableau using the SCRIPT_INT function. This means that data is extracted from BigQuery into Tableau, then passed to R (running locally in my pc) which computes the score and then returns it to Tableau. In order to optimize Tableau performance I could pre-compute the scores in R and push them in a BigQuery Table but this would mean a pre-processing step that I wanted to avoid since a real-time analysis was one of my purposes.

Tweet Analysis

With my tidy dataset in place, I can now start the analysis and, as the previous week I can track various KPIs like the mentions by character Family and Name. To filter only current week data I created two parameters Start Date of Analysis and End Date of Analysis

Using those parameters I can filter which days I want to include in my analysis. To apply the filter in the Workbook/Dashboard I created also a column Is Date of Analysis with the following formula

I can now use the Is Date of Analysis column in my Workbooks and filter the Yes value to retain only the selected dates.

I built a dashboard containing few of the analysis mentioned in my previous blog post, in which I can see the overall scatterplot of characters by # of Tweets and Sentence Sentiment and click on one of them to check its details regarding the most common words used and sentence sentiment.

From the scatterplot on top we can see a change of leadership in the # of Tweets with Daenerys overtaking Jon by a good margin, saving him and in the meantime loosing one of the three dragons was a touching moment in the episode. When clicking on Daenerys we can see that the world WHITE is driving also the positive sentiment.

The Night King keep its leadership on the Sentiment positive side. Also in this case the WHITE word being the most used with positive sentiment. On the other side Arya overtook Sansa as character with most negative mentions. When going in detail on The positive/negative words, we can clearly see that STARK (mentioned in previous episode), KILL, WRONG and DEATH are driving the negative sentiment. Interesting is also the word WEAR with negative sentiment (from Google dictionary "damage, erode, or destroy by friction or use.").

A cut down version of the workbook with a limited dataset, visible in the image below, is available in Tableau Public.

Game of Couples

This comparison is all what I promised towards the end of my first post, so I could easily stop here. However as curious person and #GoT fan myself I wanted to know more about the dataset and in particular analyse how character interaction affect sentiment. To do so I had somehow to join characters together if they were mentioned in the same tweet, luckily enough my dataset contained the character mentioned and the list of words of each Tweet. I can reuse the list of words on a left join with the list of characters keys. In this way I have a record for each couple of characters mentioned in a Tweet.

I can then start analysing the Tweets mentioning any couple of characters, with the # of Tweets driving the gradient. As you can see I removed the values where the column and row is equal (e.g. Arya and Arya). The result, as expected, is a symmetric matrix since the # of Tweets mentioning Arya and Sansa is the same as the ones mentioning Sansa and Arya.

We can clearly see that Jon and Daenerys are the most mentioned couple with Sansa and Arya following and in third place Whitewalkers and Bran. This view and the insights we took from it could be problematic to get in cases when the reader is colour blind or has troubles when defining intensity. For those cases a view like the below provides the same information (by only switching the # of Tweets column from Color to Size), however it has the drawback that small squares are hard to see.

The next step in my "couple analysis" is understand sentiment, and how a second character mentioned in the same tweet affects the positive/negative score of a character. The first step I did is showing the same scatterplot as before, but filtered for a single character, in this case Arya.

The graph shows Arya's original position, and how the Sentiment and the # of Tweets change the position when another character is included in the Tweet. We can see that, when mentioned with Daenerys the sentiment is much more positive, while when mentioned with Bran or Littlefinger the sentiment remains almost the same.

This graph it's very easy to read, however it has the limitation of being able to display only one character behaviour at time (in this case Arya). What I wanted is to show the same pattern across all characters in a similar way as when analysing the # of Tweets per couple. To do so I went back to a matrix stile of visualization, setting the colour based on positive (green) or negative (red) sentiment.

As before the matrix is symmetric, and provides us a new set of insights. For example, when analysing Jorah Mormont, we can see that a mention together with Cercei is negative which we can somehow expect due to the nature of the queen. What's strange is that also when Jorah is mentioned with Samwell Tarly there is a negative feeling. Looking deeply in the data we can see that it's due to a unique tweet containing both names with a negative sentiment score.

What's missing in the above visualization is an indication on how "strong" is the relationship between two character based on the # of Tweets where they are mentioned together. We can add this by including the # of Tweets as position of the sentiment square. The more the square is moved towards the right the higher is the # of Tweets mentioning the two characters together.

We can see as before that Jorah and Sam have a negative feeling when mentioned together, but it's not statistically significant because the # of Tweets is very limited (square position completely on the left). Another example is Daenerys and Jon which have a lot of mentions together with a neutral sentiment. As we saw before also the couple Arya and Bran when mentioned together have a negative feeling, with a limited number Tweets mentioning them together. However Bran mentioned with WhiteWalkers has a strong positive sentiment.

It's worth mentioning that the positioning of the dot is based on a uniform scale across the whole matrix. This means that if, like in our case, there is a dominant couple (Daenerys and Jon) mentioned by a different order of magnitude of # of Tweets compared to all other couples, the difference in positioning of all the others dots will be minimal. This could however be solved using a logarithmic scale.

Web Scraping

Warning: all the analysis done in the article including this chapter are performed with automated tools. Due to the nature of the subject (a TV series plenty of deaths, battles and thrilling scenes) the words used to describe a sentence could be automatically classified as positive/negative. This doesn't automatically mean that the opinion of the writer is either positive or negative about the scene/episode/series.

The last part of the analysis I had in mind was about comparing the Tweets sentiment, with the same coming from the episode reviews that I could find online. This latter part relies a lot on the usage of R to scrape the relevant bits from the web-pages, the whole process was:

Search on Google for Beyond the Wall Reviews

Take the top N results

Scrape the review from the webpage

Tokenize the review in sentences

Assign the sentence score using the same method as in Tableau

Tokenize the sentence in words

Upload the data into BigQuery for further analysis

Few bits on the solution I've used to accomplish this since the reviews are coming from different websites with different tags, classes and Ids, I wasn't able to write a general scraper for all websites. However each review webpage I found had the main text divided in multiple <p> tags under a main <div> tag which had an unique Id or class. The R code simply listed the <div> elements, found the one mentioning the correct Id or class and took all the data contained inside the <p> elements. A unique function is called with three parameters: website, Id or class to look for, and SourceName (e.g. Telegraph). The call to the function is like

It will return a dataframe containing one row per <p> tag, together with a mention of the source (Ign in this case).

The rest of the R code tokenizes the strings and the words using the tokenizers package and assigns the related sentiment score with the syuzhet package used in my previous blog post. Finally it creates a JSON file (New Line Delimited) which is one of the input formats accepted by BigQuery.
When the data is in BigQuery, the analysis follows the same approach as before with Tableau connecting directly to BigQuery and using again R for word sentiment scoring.

The overall result in Tableau includes a global Episode sentiment score by Source, the usual scatterplot by character and the same by Source. Each of the visualizations can act as filter for the others.

We can clearly see that AVClub and Indiewire had opposite feelings about the episode. Jon Snow is the most mentioned character with Arya and Sansa overtaking Daenerys.

The AVClub vs Indiewire scoring can be explained by the sencence sentiment categorization. Indiewire had most negative sentences (negative evaluations) while the distribution of AVClub has its peak on the 1 (positive) value.

Checking the words used in the two Sources we can notice as expected a majority of positive for AVClub while Indiewire has the overall counts almost equal.

Going in detail on the words, we can see the positive sentiment of AVClub being driven by ACTION, SENSE, REUNION while Indiewire negative one due to ENEMY, BATTLE, HORROR.

This is the automated overall sentiment analysis, if we read the two articles from Indiewire and AVClub in detail we can see that the overall opinion is not far from the automated score:

From AVClub

On the level of spectacle, “Beyond The Wall” is another series high point, with stellar work ....

From IdieWire

Add to the list “Beyond the Wall,” an episode that didn’t have quite the notable body count that some of those other installments did

To be fair we also need to say that IdieWire article is focused on the war happening and the thrilling scene with the Whitewalkers where words like ENEMY, COLD, BATTLE, DEATH which have a negative sentiment are actually only used to describe the scene and not the feelings related to it.

Character and Review Source Analysis

The last piece of analysis is related to single characters. As mentioned before part of the dashboard built in Tableau included the Character scatterplot and the Source scatterplot. By clicking on a single Character I can easily filter the Source scatterplot, like in this case for Daenerys.

We can see how different Sources have different average sentiment score for the same character, in this case with Mashable being positive while Pastemagazine negative.

Checking the words mentioned we can clearly see a positive sentiment related to PRESENT, AGREED and RIDER for Mashable while the negative sentiment of Pastemagazine is driven by FIGHT, DANGER, LOOSING. As said before just few words of difference describing the same scene can make the difference.

Finally, one last sentence for the very positive sentiment score for Clegor Clegaine: it is partially due to the reference to his nickname, the Mountain, which is used as Key to find references. The mountain is contained in a series of sentences as reference to the place where the group of people guided by Jon Snow are heading in order to find the Whitewalkers. We could easily remove MOUNTAIN from the Keywords to eliminate the mismatch.

We are at the end of the second post about Game of Thrones analysis with Tableau, BigQuery and Kafka. Hope you didn't get bored...see you next week for the final episode of the series! And please avoid waking up with blue eyes!

I don't trust statistics and personally believe that at least 74% of them are wrong.... but I bet nearly 100% of people with any interest in fantasy (or just any) TV shows are watching the 7th series of Game of Thrones (GoT) by HBO.
If you are one of those, join me in the analysis of the latest tweets regarding the subject. Please be also aware that, if you're not on the latest episode, some spoilers may be revealed by this article. My suggestion is then to go back and watch the episodes first and then come back here for the analysis!

If you aren't part of the above group then ¯\_(ツ)_/¯. Still this post contains a lot of details on how to perform analysis on any tweet with Tableau and BigQuery together with Kafka sources and sink configurations. I'll leave to you to find another topic to put all this in practice.

Overall Setup

As described in my previous post on analysing Wimbledon tweets I've used Kafka for the tweet extraction phase. In this case however, instead of querying the data directly in Kafka with Presto, I'm landing the data into a Google BigQuery Table. The last step is optional, since as in last blog I was directly querying Kafka, but in my opinion represents the perfect use case of all technologies: Kafka for streaming and BigQuery for storing and querying data.
The endpoint is represented by Tableau, which has a native connector to BigQuery. The following image represents the complete flow

One thing to notice: at this point in time I'm using a on-premises installation of Kafka which I kept from my previous blog. However since source and target are natively cloud application I could easily move also Kafka in the cloud using for example the recently announced Confluent Kafka as a Service.

A framework for scalably and reliably streaming data between Apache Kafka and other data systems

Using this framework anybody can write a connector to push data from any system (Source Connector) to Kafka or pull data from it (Sink Connector). This is a list of available connectors developed and maintained either from Confluent or from the community. Moreover Kafka Connect provides the benefit of parsing the message body and storing it in Avro format which makes it easier to access and faster to retrieve.

Kafka Source for Twitter

In order to source from Twitter I've been using this connector. The setup is pretty easy: copy the source folder named kafka-connect-twitter-master under $CONFLUENT_HOME/share/java and modify the file TwitterSourceConnector.properties located under the config subfolder in order to include the connection details and the topics.

Once the data is in Kafka, the next step is push it to the selected datastore: BigQuery. I can rely on Kafka Connect also for this task, with the related code written and supported by the community and available in github.

All I had to do is to download the code and change the file kcbq-connector/quickstart/properties/connector.properties

...
topics=rm.got
..
autoCreateTables=true
autoUpdateSchemas=true
...
# The name of the BigQuery project to write to
project=<NAME_OF_THE_BIGQUERY_PROJECT>
# The name of the BigQuery dataset to write to (leave the '.*=' at the beginning, enter your
# dataset after it)
datasets=.*=<NAME_OF_THE_BIGQUERY_DATASET>
# The location of a BigQuery service account JSON key file
keyfile=/home/oracle/Big-Query-Key.json

The changes included:

the topic name to source from Kafka

the project, dataset and Keyfile which are the connection parameters to BigQuery. Note that the Keyfile is automatically generated when creating a BigQuery service.

After verifying the settings, as per Kafka connect instructions, I had to create the tarball of the connector and extract it's contents

The last step is to launch the connector by moving into the kcbq-connector/quickstart/ subfolder and executing

./connector.sh

Note that you may need to specify the CONFLUENT_DIR if the Confluent installation home is not in a sibling directory

export CONFLUENT_DIR=/path/to/confluent

When everything start up without any error a table named rm_got (the name is automatically generated) appears in the BigQuery dataset I defined previously and starts populating.

A side note: I encountered a Java Heap Space error during the run of the BigQuery sink. This was resolved by increasing the heap space setting of the connector via the following call

export KAFKA_HEAP_OPTS="-Xms512m -Xmx1g"

BigQuery

BigQuery, based on Dremel's paper, is Google's proposition for an enterprise cloud datawarehouse which combines speed and scalability with separate pricing for storage and compute. If the cost of storage is common knowledge in the IT world, the compute cost is a fairly new concept. What this means is that the cost of the same query can vary depending on how the data is organized. In Oracle terms, we are used to associating the query cost to the one defined in the explain plan. In BigQuery that concept is translated from "performance cost" to also "financial cost" of a query: the more data a single query has to scan, the higher is the cost for it. This makes the work of optimizing data structures not only visible performance wise but also on the financial side.

For the purpose of the blog post, I had almost 0 settings to configure other than creating a Google Cloud Platform, creating a BigQuery project and a dataset.

During the Project creation phase, a Keyfile is generated and stored locally on the computer. This file contains all the credentials needed to connect to BigQuery from any external application, my suggestion is to store it in a secure place.

Once the data is landed in BigQuery, It's time to analyse it with Tableau!
The Connection is really simple: from Tableau home I just need to select Connect-> To a Server -> Google BigQuery, fill in the connection details and select the project and datasource.

An important feature to set is the Use Legacy SQL checkbox in the datasource definition. Without this setting checked I wasn't able to properly query the BigQuery datasource. This is due to the fact that "Standard SQL" doesn't support nested columns while Legacy SQL (also known as BigQuery SQL) does, for more info check the related tableau website.

Analysing the data

Now it starts the fun part: analysing the data! The integration between Tableau and BigQuery automatically exposes all the columns of the selected tables together with the correctly mapped datatypes, so I can immediately start playing with the dataset without having to worry about datatype conversions or date formats. I can simply include in the analysis the CreatedAt date and the Number of Records measure (named # of Tweets) and display the number of tweets over time.

Now I want to analyse where the tweets are coming from. I can use using the the Place.Country or the Geolocation.Latitude and Geolocation.Longitude fields in the tweet detail. Latitute and Longitude are more detailed while the Country is rolled up at state level, but both solutions have the same problem: they are available only for tweets with geolocation activated.

After adding Place.Country and # of Tweets in the canvas, I can then select the map as visualization. Two columns Latitude (generated) and Longitude (generated) are created on the fly mapping the country locations and the selected visualization is shown.

However as mentioned before, this map shows only a subset of the tweets since the majority of tweets (almost 99%) has no location.

The fields User.Location and User.TimeZone suffer from a different problem: either are null or the possible values are not coming from a predefined list but are left to the creativity of the account owner which can type whatever string. As you can see, it seems we have some tweets coming from directly from Winterfell, Westeros, and interesting enough... Hogwarts!

Checking the most engaged accounts based on User.Name field clearly shows that Daenerys and Jon Snow take the time to tweet between fighting Cercei and the Whitewalkers.

The field User.Lang can be used to identify the language of the User. However, when analysing the raw data, it can be noticed that there are language splits for regional language settings (note en vs en-gb). We can solve the problem by creating a new field User.Lang.Clean taking only the first part of the string with a formula like

IF FIND([User.Lang],'-') =0
THEN [User.Lang]
ELSE
LEFT([User.Lang],FIND([User.Lang],'-')-1)
END

With the interesting result of Italian being the 4th most used language, overtaking portuguese, and showing the high interest in the show in my home country.

Character and House Analysis

Still with me? So far we've done some pretty basic analysis on top of pre-built fields or with little transformations... now it's time to go deep into the tweet's Text field and check what the people are talking about!

The first thing I wanted to do is check mentions about the characters and related houses. The more a house is mentioned, the more should be relevant correct?
The first text analysis I want to perform was Stark vs Targaryen mention war: showing how many tweets were mentioning both, only one or none of two of the main houses. I achieved it with the below IF statement

IF contains(upper([Text]), 'STARK') AND contains(upper([Text]),'TARGARYEN')
THEN 'Both'
ELSEIF contains(upper([Text]), 'STARK')
THEN 'Stark'
ELSEIF contains(upper([Text]), 'TARGARYEN')
THEN 'Targaryen'
ELSE 'None'
END

With the results supporting the house Stark

I can do the same at single character level counting the mentions on separate columns like for Jon Snow

Note the OR condition since I want to count as mentions both the words JON and SNOW since those can uniquely be referred at the same character. Similarly I can create a column counting the mentions to Arya Stark with the following formula

IIF(contains(upper([Text]), 'ARYA'), 1,0)

Note in this case I'm filtering only the name (ARYA) since Stark can be a reference to multiple characters (Sansa, Bran ...). I created several columns like the two above for some characters and displayed them in a histogram ordered by # of Mentions in descending order.

As expected, after looking at the Houses results above, Jon Snow is leading the user mentions with a big margin over the others with Daenerys in second place.

The methods mentioned above however have some big limitations:

I need to create a different column for every character/house I want to analyse

The formula complexity increases if I want to analyse more houses/characters at the same time

My goal would be to have an Excel file, where I set the research Key (like JON and SNOW) together with the related character and house and mash this data with the BigQuery table.

The joining key would be like

CONTAINS([BigQuery].[Text], [Excel].[Key]) >0

Unfortunately Tableau allows only = operators in text joining conditions during data blending making the above syntax impossible to implement. I have now three options:

Give Up: Never if there is still hope!

Move the Excel into a BigQuery table and resolve the problem there by writing a view on top of the data: works but increases the complexity on BigQuery side, plus most Tableau users will not have write access to related datasources.

Find an alternative way of joining the data: If the CONTAINS join is not possible during data-blending phase, I may use it a little bit later...

Warning: the method mentioned below is not the optimal performance wise and should be used carefully since it causes data duplication if not handled properly.

Without the option of using the CONTAINS I had to create a cartesian join during data-blending phase. By using a cartesian join every row in the BigQuery table is repeated for every row in the Excel table. I managed to create a cartesian join by simply put a 1-1 condition in the data-blending section.

I can then apply a filter on the resulting dataset to keep only the BigQuery rows mentioning one (or more) Key from the Excel file with the following formula.

IIF(CONTAINS(UPPER([Text]),[Key]),[Id],NULL)

This formula filters the tweet Id where the Excel's [Key] field is contained in the UPPER([Text]) coming from Twitter. Since there are multiple Keys assigned to the same character/house (see Jon Snow with both keywords JON and SNOW) the aggregation for this column is count distinct which in Tableau is achieved with COUNTD formula.
I can now simply drag the Name from the Excel file and the # of Mentions column with the above formula and aggregation method as count distinct.

The beauty of this solution is that now if I need to do the same graph by house, I don't need to create columns with new formulas, but simply remove the Name field and replace it with Family coming from the Excel file.

Also if I forgot a character or family I simply need to add the relevant rows in the Excel lookup file and reload it, nothing to change in the formulas.

Sentiment Analysis

Another goal I had in mind when analysing GoT data was the sentiment analysis of tweets and the average sentiment associated to a character or house. Doing sentiment analysis in Tableau is not too hard, since we can reuse already existing packages coming from R.

For the Tableau-R integration to work I had to install and execute the RServe package from a workstation where R was already installed and set the connection in Tableau. More details on this configuration can be found in Tableau documentation

Once configured Tableau to call R functions it's time to analyse the sentiment. I used Syuzhet package (previously downloaded) for this purpose. The Sentiment calculation is done by the following formula:

SCRIPT_INT: The method will return an integer score for each Tweet with positives sentiments having positives scores and negative sentiments negative scores

get_sentiment(.arg1,method = 'nrc'): is the function used

ATTR([Text]): the input parameter of the function which is the tweet text

At this point I can see the score associated to every tweet, and since that R package uses dictionaries, I limited my research to tweets in english language (filtering on the column User.Lang.Clean mentioned above by en).

The next step is to average the sentiment by character, seems an easy step but devil is in the details! Tableau takes the output of the SCRIPT_INT call to R as aggregated metric, thus not giving any visual options to re-aggregate! Plus the tweet Text field must be present in the layout for the sentiment to be calculated otherwise the metric results NULL.

Fortunately there are functions, and specifically window functions like WINDOW_AVG allowing a post aggregation based of a formula defining the start and end. The other cool fact is that window function work per partition of the data and the start and end of the window can be defined using the FIRST() and LAST() functions.

We can now create an aggregated version of our Sentiment column with the following formula

WINDOW_AVG(FLOAT([Sentiment]), FIRST(), LAST())

This column will be repeated with the same value for all rows within the same "partition", in this case the character Name.

Be aware that this solution doesn't re-aggregate the data, we'll still see the data by single tweet Text and character Name. However the metric is calculated at total per character so graphs can be displayed.

I wanted to show a Scatter Plot based on the # of Mentions and Sentiment of each character. With the window functions and the defined above it's as easy as dragging the fields in the proper place and select the scatter plot viz.

The default view is not very informative since I can't really associate a character to its position in the chart until I go over the related image. Fortunately Tableau allows the definition of custom shapes and I could easily assign character photos to related names.

If negative mentions for Littlefinger and Cercei was somehow expected, the characters with most negative sentiment are Sansa Stark, probably due to the mysterious letter found by Arya in Baelish room, and Ellaria Sand. On the opposite side we strangely see the Night King and more in general the WhiteWalkers with a very positive sentiment associated to them. Strange, this needs further investigation.

Deep Dive on Whitewalkers and Sansa

I can create a view per Character with associate tweets and sentiment score and filter it for the WhiteWalkers. Looks like there are great expectations for this character in the next episodes (the battle is coming) which are associated with positive sentiments.

When analysing the detail of the number of tweets falling in each sentiment score category it's clear why Sansa and Whitewalkers have such a different sentiment average. Both appear as normal distributions, but the center of the Whitewalkers curve is around 1 (positive sentiment) while for Sansa is between -1 and 0 (negative sentiment).

This explanation however doesn't give me enough information, and want to understand more about what are the most used words included in tweets mentioning WhiteWalkers or Night King.

Warning: the method mentioned above is not the optimal performance wise and should be used carefully since it causes data duplication if not handled properly.

There is no easy way to do so directly in Tableau, even using R since all the functions expect the output size to be 1-1 with the input, like sentiment score and text.
For this purpose I created a view on top of the BigQuery table directly in Tableau using the New Custom SQL option. The SQL used is the following

The SPLIT function divides the Text field in multiple rows one for every word separated by space. This is a very basic split and can of course be enhanced if needed. On top of it the SQL removes references to # and @. Since the view contains the tweet's Id field, this can be used to join this dataset with the main table.

The graph showing the overall words belonging to characters is not really helpful since the amount of words (even if I included only the ones with more than e chars) is too huge to be analysed properly.

When analysing the single words in particular tweets I can clearly see that the Whitewalkers sentiment is driven by words like King, Iron, Throne having a positive sentiment. On the other hand Sansa stark is penalized by words like Kill and Fight probably due to the possible troubles with Arya.

One thing to mention is that the word Stark is classified with a negative sentiment due to the general english dictionary used for the scoring. This affects all the tweets and in particular the average scores of all the characters belonging to the House Stark. A new "GoT" dictionary should be created and used in order to avoid those kind of misinterpretations.

Also when talking about "Game of Thrones", words like Kill or Death can have positive or negative meaning depending on the sentence, a imaginary tweet like

Finally Arya kills Cercei

Should have a positive sentiment for Arya and a negative for Cercei, but this is where automatic techniques of sentiment classification show their limits. Not even a new dictionary could help in this case.

The chart below shows the percentage of words classified with positive (score 1 or 2) or negative (score -1 or -2) for the two selected characters. We can clearly see that Sansa has more negative words than positive as expected while Whitewalkers is on the opposite side.

Furthermore the overall sentiment for the two characters may be explained by the following graph. This shows for every sentence sentiment category (divided in bins Positive, Neutral, Negative), an histogram based on the count of words by single word sentiment. We can clearly see how words with positive sentiment are driving the Positive sentence category (and the opposite).

Finally the last graph shows the words that have mostly impacted the overall positive and negative sentiment for both characters.

We can clearly see that Sansa negative sentiment is due to Stark, Hate and Victim. On the other side Whitewalkers positive sentiment is due to words like King (Night King is the character) and Finally probably due to the battle coming in the next episode. As you can see there are also multiple instances of the King word due to different punctualization preceeding or following the world. I stated above that the BigQuery SQL extracting the words via the SPLIT function was very basic, we can now see why. Little enhancements in the function would aggregate properly the words.

Are you still there? Do you wonder what's left? Well there is a whole set of analysis that can be done on top of this dataset, including checking the sentiment behaviour by time during the live event or comparing this week's dataset with the next episode's one. The latter may happen next week so... Keep in touch!

Last week there was Wimbledon, if you are a fan of Federer, Nadal or Djokovic then it was one of the events not to be missed. I deliberately excluded Andy Murray from the list above since he kicked out my favourite player: Dustin Brown.

Two weeks ago I was at Kscope17 and one of the common themes, which reflected where the industry is going, was the usage of Kafka as central hub for all data pipelines. I wont go in detail on what's the specific role of Kafka and how it accomplishes, You can grab the idea from two slides taken from a recent presentation by Confluent.

One of the key points of all Kafka-related discussions at Kscope was that Kafka is widely used to take data from providers and push it to specific data-stores (like HDFS) that are then queried by analytical tools. However the "parking to data-store" step can sometimes be omitted with analytical tools querying directly Kafka for real-time analytics.

We wrote at the beginning of the year a blog post about doing it with Spark Streaming and Python however that setup was more data-scientist oriented and didn't provide a simple ANSI SQL familiar to the beloved end-users.

As usual, Oracle annouced a new release during Kscope. This year it was Oracle Data Visualization Desktop 12.2.3.0.0 with a bunch of new features covered in my previous blog post.
The enhancement, amongst others, that made my day was the support for JDBC and ODBC drivers. It opened a whole bundle of opportunities to query tools not officially supported by DVD but that expose those type of connectors.

One of the tools that fits in this category is Presto, a distributed query engine belonging to the same family of Impala and Drill commonly referred as sql-on-Hadoop. A big plus of this tool, compared to the other two mentioned above, is that it queries natively Kafka via a dedicated connector.

I found then a way of fitting the two of the main Kscope17 topics, a new sql-on-Hadoop tool and one of my favourite sports (Tennis) in the same blog post: analysing real time Twitter Feeds with Kafka, Presto and Oracle DVD v3. Not bad as idea.... let's check if it works...

Analysing Twitter Feeds

Let's start from the actual fun: analysing the tweets! We can navigate to the Oracle Analytics Store and download some interesting add-ins we'll use: the Auto Refresh plugin that enables the refresh of the DV project, the Heat Map and Circle Pack visualizations and the Term Frequency advanced analytics pack.

Importing the plugin and new visualizations can be done directly in the console as explained in my previous post. In order to be able to use the advanced analytics function we need to unzip the related file and move the .xml file contained in the %INSTALL_DIR%\OracleBI1\bifoundation\advanced_analytics\script_repository. In the Advanced Analytics zip file there is also a .dva project that we can import into DVD (password Admin123) which gives us a hint on how to use the function.

We can now build a DVD Project about the Wimbledon gentleman singular final containing:

A table view showing the latest tweets

A horizontal bar chart showing the number of tweets containing mentions to Federer, Cilic or Both

A circle view showing the most tweeted terms

A heatmap showing tweet locations (only for tweets with an activated localization)

A line chart showing the number of tweets over time

The project is automatically refreshed using the auto-refresh plugin mentioned above. A quick view of the result is provided by the following image.

So far all good and simple! Now it's time to go back and check how the data is collected and queried. Let's start from Step #1: pushing Twitter data to Kafka!

Kafka

We covered Kafka installation and setup in previous blog post, so I'll not repeat this part.
The only piece I want to mention, since gave me troubles, is the advertised.host.name setting: it's a configuration line in /opt/kafka*/config/server.properties that tells Kafka which is the host where it's listening.

If you leave the default localhost and try to push content to a topic from an external machine it will not show up, so as pre-requisite change it to a hostname/IP that can be resolved externally.

The rest of the Kafka setup is the creation of a Twitter producer, I took this Java project as example and changed it to use the latest Kafka release available in Maven. It allowed me to create a Kafka topic named rm.wimbledon storing tweets containing the word Wimbledon.

The same output could be achieved using Kafka Connect and its sink and source for twitter. Kafka Connect has also the benefit of being able to transform the data before landing it in Kafka making the data parsing easier and the storage faster to retrieve. I'll cover the usage of Kafka Connect in a future post, for more informations about it, check this presentation from Robin Moffatt of Confluent.

One final note about Kafka: I run a command to limit the retention to few minutes

This limits the amount of data that is kept in Kafka, providing better performances during query time. This is not always possible in Kafka due to data collection needs and there are other ways of optimizing the query if necessary.

At this point of our project we have a dataflow from Twitter to Kafka, but no known way of querying it with DVD. It's time to introduce the query engine: Presto!

Presto

Presto was developed at Facebook, is in the family of sql-on-Hadoop tools. However, as per Apache Drill, it could be called sql-on-everything since data don't need to reside on an Hadoop system. Presto can query local file systems, MongoDB, Hive, and a big variety of datasources.

As the other sql-on-Hadoop technologies it works with always-on daemons which avoid the latency proper of Hive in starting a MapReduce job. Presto, differently from the others, divides the daemons in two types: the Coordinator and the Worker. A Coordinator is a node that receives the query from the clients, it analyses and plans the execution which is then passed on to Workers to carry out.

In other tools like Impala and Drill every node by default could add as both worker and receiver. The same can also happen in Presto but is not the default and the documentation suggest to dedicate a single machine to only perform coordination tasks for best performance in large cluster (reference to the doc).

The following image, taken from Presto website, explains the flow in case of usage of the Hive metastore as datasource.

Installation

The default Presto installation procedure is pretty simple and can be found in the official documentation. We just need to download the presto-server-0.180.tar.gz tarball and unpack it.

tar -xvf presto-server-0.180.tar.gz

This creates a folder named presto-server-0.180 which is the installation directory, the next step is to create a subfolder named etc which contains the configuration settings.

Then we need to create five configuration files and a folder within the etc folder:

node.environment: configuration specific to each node, enables the configuration of a cluster

jvm.config: options for the Java Virtual Machine

config.properties: specific coordinator/worker settings

log.properties: specifies log levels

catalog: a folder that will contain the data source definition

For a the basic functionality we need the following are the configurations:

Where the coordinator=true tells Presto to function as coordinator, http-server.http.port defines the ports, and discovery.uri is the URI to the Discovery server (in this case the same process).

log.properties

com.facebook.presto=INFO

We can keep the default INFO level, other levels are DEBUG, WARN and ERROR.

catalog

The last step in the configuration is the datasource setting: we need to create a folder named catalog within etc and create a file for each connection we intend to use.

For the purpose of this post we want to connect to the Kafka topic named rm.wimbledon. We need to create a file named kafka.properties within the catalog folder created above. The file contains the following lines

where kafka.nodes points to the Kafka brokers and kafka.table-names defines the list of topics delimited by a ,.

The last bit needed is to start the Presto server by executing

bin/launcher start

We can append the --verbose parameter to debug the installation with logs that can be found in the var/log folder.

Presto Command Line Client

In order to query Presto via command line interface we just need to download the associated client (see official doc) which is in the form of a presto-cli-0.180-executable.jar file. We can now rename the file to presto and make it executable.

mv presto-cli-0.180-executable.jar presto
chmod +x presto

Then we can start the client by executing

./presto --server linuxsrv.local.com:8080 --catalog kafka --schema rm

Remember that the client has a JDK 1.8 as prerequisite, otherwise you will face an error. Once the client is successfully setup, we can start querying Kafka

You could notice that the schema (rm) we're connecting is just the prefix of the rm.wimbledon topic used in kafka. In this way I could potentially store other topics using the same rm prefix and being able to query them all together.

As expected tweets are stored in JSON format, We can now use the Presto JSON functions to extract the relevant informations from it. In the following we're extracting the user.name part of every tweet. Node the LIMIT 10 (common among all the SQL-on-Hadoop technologies) to limit the number of rows returned.

We saw above that it's possible to query with ANSI SQL statements using the Presto JSON function. The next step will be to define a structure on top of the data stored in the Kafka topic to turn raw data in a table format. We can achieve this by writing a topic description file. The file must be in json format and stored under the etc/kafka folder; it is recommended, but not necessary, that the name of the file matches the kafka topic (in our case rm.wimbledon). The file in our case would be the following

As mentioned at the beginning of the article, the overall goal was to analyse Wimbledon twitter feed in real time with Oracle Data Visualization Desktop via JDBC, so let's complete the picture!

JDBC drivers

First step is to download the Presto JDBC drivers version 0.175, I found them in the Maven website. I tried also the 0.180 version downloadable directly from Presto website but I had several errors in the connection.
After downloading we need to copy the driver presto-jdbc-0.175.jar under the %INSTALL_DIR%\lib folder where %INSTALL_DIR% is the Oracle DVD installation folder and start DVD. Then I just need to create a new connection like the following

Note that:

URL: includes also the /kafka postfix, this tells Presto which storage I want to query

Driver Class Name: this setting puzzled me a little bit, I was able to discover the string (with the help of Gianni Ceresa) by concatenating the folder name and the driver class name after unpacking the jar file

** Username/password: those strings can be anything since for the basic test we didn't setup any security on Presto.

The whole JDBC process setting is described in this youtube video provided by Oracle.

We can then define the source by just selecting the columns we want to import and create few additional ones like the Lat and Long parsing from the coordinates column which is in the form [Lat, Long]. The dataset is now ready to be analysed as we saw at the beginning of the article, with the final result being:

Conclusions

As we can see from the above picture the whole process works (phew....), however it has some limitations: there is no pushdown of functions to the source so most of the queries we see against Presto are in the form of

This means that the whole dataset is retrieved every time making this solution far from optimal for big volumes of data. In those cases probably the "parking" to datastore step would be necessary. Another limitation is related to the transformations, the Lat and Long extractions from coordinates field along with other columns transformations are done directly in DVD, meaning that the formula is applied directly in the visualization phase. In the second post we'll see how the source parsing phase and query performances can be enhanced using Kafka Connect, the framework allowing an easy integration between Kafka and other sources or sinks.

One last word: winning Wimbledon eight times, fourteen years after the first victory and five years after the last one it's something impressive! Chapeau mr Federer!

As grandiose a notion as the title may imply, there have been some really promising and powerful moves made in the advancement of smoothly integrating real-time and/or streaming data technologies into most any enterprise reporting and analytics architecture. When used in tandem with functional programming languages like Python, we now have the ability to create enterprise grade data engineering scripts to handle the manipulation and flow of data, large or small, for final consumption in all manner of business applications.

In this cavalcade of coding, we're going to use a combination of Satori, a free data streaming client, and python to stream live world cyber attack activity via an api. We'll consume the records as json, and then use a few choice python libraries to parse, normalize, and insert the records into a mysql database. Finally, we'll hook it all up to Tableau and watch cyber attacks happen in real time with a really cool visualization.

The Specs

For the this exercise, we're going to bite things off a chunk at a time. We're going to utilize a service called Satori, a streaming data source aggregator that will make it easy for us to hook up to any number of streams to work with as we please. In this case, we'll be working with the Live Cyber Attack Threat Map data set. Next, we'll set up our producer code that will do a couple of things. First it will create the API client from which we will be ingesting a constant flow of cyber attack records. Next, we'll take these records and convert them to a data frame using the Pandas library for python. Finally, we will insert them into a MySQL database. This will allow us to use this live feed as a source for Tableau in order to create a geo mapping of countries that are currently being targeted by cyber attacks.

The Data Source

Satori is a new-ish service that aggregates the web's streaming data sources and provides developers with a client and some sample code that they can then use to set up their own live data streams. While your interests may lie in how you can stream your own company's data, it then simply becomes a matter of using python's requests library to get at whatever internal sources you might need. Find more on the requests library here.

Satori has taken a lot of the guess work out of the first step of the process for us, as they provide basic code samples in a number of popular languages to access their streaming service and to generate records. You can find the link to this code in a number of popular languages here. Note that you'll need to install their client and get your own app key. I've added a bit of code at the end to handle the insertion of records, and to continue the flow, should any records produce a warning.

Now that we've set up the streaming code that we'll use to fill our table, we'll need to set up the table in MySQL to hold them all. For this we'll use the SQLAlchemy ORM (object relational mapper). It's a high falutin' term for a tool that simply abstracts SQL commands to be more 'pythonic'; that is, you don't necessarily have to be a SQL expert to create tables in your given database. Admittedly, it can be a bit daunting to get the hang of, but give it a shot. Many developers choose to interact a with a given database either via direct SQL or using an ORM. It's good practice to use a separate python file, in this case settings.py (or some variation thereof), to hold your database connection string in the following format (the addition of the mysqldb tag at the beginning is as a result of the installation of the mysql library you'll need for python), entitled SQLALCHEMY_DATABASE_URI:

'mysql+mysqldb://db_user:pass@db_host/db_name'

Don't forget to sign in to your database to validate success!

Feeding MySQL and Tableau

Now all we need to do is turn on the hose and watch our table fill up. Running producer.py, we can then open a new tab, log in to our database to make sure our table is being populated, and go to work. Create a new connection to your MySQL database (called my db 'hacks') in Tableau and verify that everything is in order once you navigate to the data preview. There are lots of nulls in this data set, but this will simply be a matter of filtering them out on the front end.

Tableau should pick up right away on the geo data in the dataset, as denoted by the little globe icon next to the field.
We can now simply double-click on the corresponding geo data field, in this case we'll be using Country Target, and then the Number of Records field in the Measures area.
I've chosen to use the 'Dark' map theme for this example as it just really jives with the whole cyber attack, international espionage vibe. Note that you'll need to maintain a live connection, via Tableau, to your datasource and refresh at the interval you'd like, if using Tableau Desktop. If you're curious about how to automagically provide for this functionality, a quick google search will come up with some solutions.

Over recent years, bi-modal analytics has gained interest and, dare I say it, a level of notoriety, thanks to Garnter’s repositioning of its Magic Quadrant in 2016. I’m going to swerve the debate, but if you are not up to speed, then I recommend taking a look here first.

Regardless of your chosen stance on the subject, one thing is certain: the ability to provision analytic capabilities in more agile ways and with greater end user flexibility is now widely accepted as an essential part of any modern analytics architecture.

But are there any secrets or clues that could help you in modernising your analytics platform?

What Is Driving the Bi-Modal Shift?

The demand for greater flexibility from our analytics platforms has its roots in the significant evolutions seen in the businesses environment. Specifically, we are operating in/with:

turbulent global economies, leading to a drive to reduce (capex) costs, maximise efficiencies and a need to deal with increased regulation;

broader and larger, more complex and more externalised data sets, which can be tapped into with much reduced latency;

empowered and tech-savvy departmental users, with an increased appetite for analytical decision making, combined with great advances in data discovery and visualisation technologies to satisfy this appetite;

In a nutshell, the rate at which change occurs is continuing to gather pace and so to be an instigator of change (or even just a reactor to it as it happens around you) requires a new approach to analytics and data delivery and execution.

Time to Head Back to the Drawing Board?

Whilst the case for rapid, user-driven analytics is hard to deny, does it mean that our heritage BI and Analytics platforms are obsolete and ready for the scrap heap?

I don’t think so: The need to be able to monitor operational processes, manage business performance and plan for the future have not suddenly disappeared; The need for accurate, reliable and trusted data which can be accessed securely and at scale is as relevant now as it was before. And this means that, despite what some might have us believe, all the essential aspects of the enterprise BI platforms we have spent years architecting, building and growing cannot be simply wiped away.

[Phew!]

Instead, our modern analytics platforms must embrace both ends of the spectrum equally: highly governed, curated and trustworthy data to support business management and control, coupled with highly available, flexible, loosely governed data to support business innovation. In other words, both modes must coexist and function in a relative balance.

The challenge now becomes a very different one: how can we achieve this in an overarching, unified business architecture which supports departmental autonomy, encourages analytical creativity and innovation, whilst minimising inefficiency and friction? Now that is something we can really get our teeth into!

What’s IT All About?

Some questions:

Do you have a myriad of different analytics tools spread across the business which are all being used to fulfil the same ends?

Are you constantly being asked to provide data extracts or have you resorted to cloning your production database and provisioning SQL Developer to your departmental analysts?

Are you routinely being asked to productionise things that you have absolutely no prior knowledge of?

If you can answer Yes to these questions, then you are probably wrestling with an unmanaged or accidental bi-modal architecture.

At Rittman Mead, we have seen several examples of organisations who want to hand greater autonomy to departmental analysts and subject matter experts, so that they can get down and dirty with the data to come up with novel and innovative business ideas. In most of the cases I have observed, this has been driven at a departmental level and instead of IT embracing the movement and leading the charge, results have often been achieved by circumventing IT. Even in the few examples where IT have engaged in the process, the scope of their involvement has normally been focused on the provision of hardware and software, or increasingly, the rental of some cloud resources. It seems to me that the bi-modal shift is often perceived as a threat to traditional IT, that it is somehow the thin end of a wedge leading to full departmental autonomy and no further need for IT! In reality, this has never been (and will never be) the ambition or motivation of departmental initiatives.

In my view, this slow and faltering response from IT represents a massive missed opportunity. More importantly though, it increases the probability that the two modes of operation will be addressed in isolation and this will only ever lead to siloed systems, siloed processes and ultimately, a siloed mentality. The creation of false barriers between IT and business departments can never be a positive thing.

That’s not to say that there won’t be any positive results arising from un-coordinated initiatives, it’s just that unwittingly, they will cause an imbalance in the overall platform: You might deliver an ultra-slick, flexible, departmentally focused discovery lab, but this will encourage the neglect and stagnation of the enterprise platform. Alternatively, you may have a highly accurate, reliable and performant data architecture with tight governance control which creates road-blocks for departmental use cases.

Finding the Right Balance

So, are there any smart steps that you can take if you are looking to build out a bi-modal analytics architecture? Well, here are a few ideas that you should consider as factors in a successful evolution:

1. Appreciate Your Enterprise Data Assets

You’ve spent a lot of time and effort developing and maintaining your data warehouse and defining the metadata so that it can be exposed in an easily understandable and user friendly way. The scope of your enterprise data also provides a common base for the combined data requirements for all of your departmental analysts. Don’t let this valuable asset go to waste! Instead provide a mechanism whereby your departmental analysts can access enterprise data quickly, easily, when needed and as close to the point of consumption as possible. Then, with good quality and commonly accepted data in their hands, give your departmental analysts a level of autonomy and the freedom to cut loose.

2. Understand That Governance Is Not a Dirty Word

In many organisations, data governance is synonymous with red tape, bureaucracy and hurdles to access. This should not be the case. Don’t be fooled into thinking that more agile means less control. As data begins to be multi-purposed, moved around the business, combined with disparate external data sources and used to drive creativity in new and innovative ways, it is essential that the provenance of the enterprise data is known and quantifiable. That way, departmental initiatives will start with a level of intrinsic confidence, arising from the knowledge that the base data has been sourced from a well known, consistent and trusted source. Having this bedrock will increase confidence in your analytical outputs and lead to stronger decisions. It will also drive greater efficiencies when it comes to operationalising the results.

3. Create Interdependencies

Don’t be drawn into thinking “our Mode 1 solution is working well, so let’s put all our focus and investment into our Mode 2 initiatives”. Instead, build out your Mode 2 architecture with as much integration into your existing enterprise platform as possible. The more interdependencies you can develop, the more you will be able to reduce data handling inefficiencies and increase benefits of scale down the line. Furthermore, interdependency will eliminate the risk of creating silos and allowing your enterprise architecture to stagnate, as both modes will have a level of reliance on one another. It will also encourage good data management practice, with data-workers talking in a common and consistent language.

4. Make the Transition Simple

Probably the single most important factor in determining the success of your bi-modal architecture is the quality with which you can transition a Mode 2 model into something operational and production-ready in Mode 1. The more effective this process is, the more likely you are to maximise your opportunities (be it new sales revenue, operating cost etc.) and increase your RoI. The biggest barriers to smoothing this transition will arise when departmental outputs need to be reanalysed, respecified and redesigned so that they can be slotted back into the enterprise platform. If both Mode 1 and Mode 2 activity is achieved with the same tools and software vendors, then you will have a head start…but even if disparate tools are used for the differing purposes, then there are always things that you can do that will help. Firstly, make sure that the owners of the enterprise platform have a level of awareness of departmental initiatives, so that there is a ‘no surprises’ culture…who knows, their experience of the enterprise data could even be exploited to add value to departmental initiatives. Secondly, ensure that departmental outputs can always be traced back to the enterprise data model easily (note: this will come naturally if the other 3 suggestions are followed!). And finally, define a route to production that is not overbearing or cumbersome. Whilst all due diligence should be taken to ensure the production environment is risk-free, creating artificial barriers (such as a quarterly or monthly release cycle) will render a lot of the good work done in Mode 2 useless.

Since I started teaching OBIEE in 2011, I had the pleasure of meeting many fascinating people who work with Business Intelligence.

In talking to my students, I would generally notice three different situations:

Folks were heavy users of OBIEE, and just ready to take their skills to the next level.

They were happily transitioning to OBIEE from a legacy reporting tool, that didn’t have the power that they needed.

There were not-so-good times, like when people were being forced to transition to OBIEE. They felt that they were moving away from their comfort zone and diving into a world of complicated mappings that would first require them to become rocket scientists. They were resistant to change.

It was this more challenging crowd, that mostly sparked my interest for other analytics tools. I received questions like: “Why are we switching to another system? What are the benefits?”

I wanted to have a good answer to these questions. Over the years, different projects have allowed me the opportunity to work with diverse reporting tools. My students’ questions were always in mind: Why? And what are the benefits? So, I always took the time to compare/contrast the differences between OBIEE and these other tools.

I noticed that many of them did a fantastic job at answering the questions needed, and so did OBIEE. It didn’t take me long to have the answer that I needed: the main difference in OBIEE is the RPD!

The RPD is where so much Business Intelligence happens. There, developers spend mind boggling times connecting the data, deriving complex metrics and hierarchies, joining hundreds of tables, and making everything a beautiful drag and drop dream for report writers.

Yes, many other tools will allow us to do magic with metadata, but most of them require this magic to be redefined every time we need a new report, or the report has a different criteria. Yes, the RPD requires a lot of work upfront, but that work is good for years to come. We never lose any of our previous work, we just enhance our model. Overtime, the RPD becomes a giant pool of knowledge for a company and is impressively saved as a file.

For tapping into the RPD metadata, traditionally we have used BI Publisher and OBIEE. They are both very powerful and generally complement each other well. Other tools have become very popular in the past few years. Tableau is an example that quickly won the appreciation of the BI community and has kept consistent leadership in Gartner’s BI Magic quadrant since 2013. With a very slick interface and super fast reporting capability, Tableau introduced less complex methods to create amazing dashboards - and fast! So, what is there not to like? There is really so much TO like!

Going back to the comparing and contrasting, the main thing that Tableau doesn’t offer is… the RPD. It lacks a repository with the ability to save the join definitions, calculations and the overall intelligence that can be used for all future reports.

At Rittman Mead, we’ve been using these tools and appreciate their substantial capabilities, but we really missed the RPD as a data source. We wanted to come up with a solution that would allow our clients to take advantage of the many hours they had likely already put into metadata modeling by creating a seamless transition from OBIEE’s metadata layer to Tableau.

This past week, I was asked to test our new product, called Unify. Wow. Once again, I am so proud of my fellow coworkers. Unify has a simple interface and uses a Tableau web connector to create a direct line to your OBIEE repository for use in Tableau reports, stories and dashboards.

In Unify, we select the subject areas from our RPD presentation layer and choose our tables and columns as needed. Below is a screenshot of Unify using the OBIEE 12c Sample App environment. If you are not familiar with OBIEE 12c, Oracle provides the Sample App - a standalone virtual image with everything that you need to test the product. You can download the SampleApp here: http://www.oracle.com/technetwork/middleware/bi-foundation/obiee-samples-167534.html

We are immediately able to leverage all joins, calculated columns, hierarchies, RPD variables, session variables and that’s not all… our RPD security too! Yes, even row level security is respected when we press the “Unify” button and data is brought back into Tableau. So now, there is no reason to lose years of metadata work because one team prefers to visualize with Tableau instead of OBIEE.

Unify allows us to import only those data needed for the report, as we can utilize ‘in-tool’ filtering, keeping our query sets small, and our performance high.

In sum, Unify unites it all - have your cake and eat it too. No matter which tool you love the most, add them together and you will certainly love them both more.

The ODTUG Kscope17 conference last week in San Antonio was a great event with plenty of very interesting sessions and networking opportunities. Rittman Mead participated during the thursday deep dive BI session and delivered three sessions including a special "fishing" one.

In the meantime Oracle released Data Visualization Desktop 12.2.3.0.0 which was presented in detail during Philippe Lions session and includes a set of new features and enhancements to already existing functionalities. Starting from new datasources, through new visualization options, in this post I'll go in detail on each of them.

The latter two (still in beta) are very relevant since they enable querying any product directly exposing JDBC or ODBC connectors (like Presto) without needing to wait for the official support in the DVD list of sources.

Still in DVD v3 there is no support for JSON or XML files. In my older blog post I wrote how JSON (and XML) can be queried in DVD using Apache Drill, however this solution has Drill installation and knowledge as a prerequisite which is not always achievable in end users environment where self-service BI is happening. I believe future versions of DVD will address this problem by providing full support to both data sources.

Connection to OBIEE

One of the most requested new features is the new interface to connect to OBIEE: until DVD v2 only pre-built OBIEE analysis could be used as sources, with DVD v3 OBIEE Subject Areas are exposed making them accessible. The set of columns and filters can't be retrieved on the fly during the project creation but must be defined upfront during datasource definition. This feature avoids move back and forth from OBIEE to DVD to create an analysis in as datasource, and then use it in DVD.

Another enhancement in the datasource definition is the possibility to change the column delimiter in txt sources, useful if the datasource has an unusual delimiters.

Data Preparation

On the data-preparation side we have two main enhancements: the convert-to-date and the time grain level.
The convert-to-date feature enhances ability for columns to date conversion including the usage of custom parsing strings. Still this feature has some limits like not being able to parse dates like 04-January-2017 where the month name is complete. For this date format a two step approach, reducing the month-name and then converting, is still required.

The second enhancement in the data preparation side is the time grain level and format, those options simplify the extraction of attributes (e.g. Month, Week, Year) from date fields which can now be done visually instead of writing logical SQL.

The Dataflow component in DVD v3 has an improved UI with new column merge and aggregation functionalities which makes the flow creation easier. Its output can now be saved as Oracle database or Hive table eliminating the need of storing all the data locally.

It's worth mentioning that Dataflow is oriented to self-service data management: any parsing or transformation happens on the machine where DVD is installed and its configuration options are limited. If more robust transformations are needed then proper ETL softwares should be used.

New Visualization Options

There are several enhancement on the visualization side, with the first one being the trendlines confidence levels which can be shown, with fixed intervals (90%, 95% or 99%)

Top N and bottom N filtering has been added for each measure columns expanding the traditional "range" one.

Two new visualizations have also been included: waterfall and boxplot are now default visualizations. Boxplots were available as plugin in previous versions, however the five number summary had to be pre-calculated; in DVD v3 the summary is automatically calculated based on the definition of category (x-axis) and item (value within the category).

Other new options in the data visualization area include: the usage of logarithmic scale for graphs, the type of interpolation line to use (straight, curved, stepped ...), and the possibility to duplicate and reorder canvases (useful when creating a BI story).

Console

The latest set of enhancements regard the console: this is a new menu allowing end users to perform task like the upload of a plugin that before were done manually on the file system.

Map Layers: JSON shape files that can be used to render custom maps data.

The process to include a new plugin into DVD v3 is really simple: after downloading it from the store, I just need open DVD's console and upload it. After a restart of the tool, the new plugin is available.

The same applies for Map Layers, while custom R scripts still need to be stored under the advanced_analytics\script_repository subfolder under the main DVD installation folder.

As we saw in this blog post, the new Data Visualization Desktop release includes several enhancement bringing more agility in the data discovery with enhancements both in the connections to new sources (JDBC and ODBC) and standard reporting with OBIEE subject areas now accessible. The new visualizations, the Analytics Store and the plugin management console make the end user workflow extremely easy also when non-default features need to be incorporated. If you are interested in Data Visualization Desktop and want to understand how it can be proficiently used against any data source don't hesitate to contact us!

There are many types of scenarios in which data science could help your business. For example, customer retention, process automation, improving operational efficiency or user experience.

It is not however always initially clear which questions to concentrate on, or how to achieve your aims.

This post presents information about the type of questions you could address using your data and common forms of bias that may be encountered.

Types of Question

Descriptive: Describe the main features of the data, no implied meaning is inferred. This will almost always be the first kind of analysis performed on the data.

Exploratory: Exploring the data to find previously unknown relationships. Some of the found relationships may define future projects.

Inferential: Looking at trends in a small sample of a data set and extrapolating to the entire population. In this type of scenario you would end up with an estimation of the value and an associated error. Inference depends heavily on both the population and the sampling technique.

Predictive: Look at current and historical trends to make predictions about future events. Even if x predicts y, x does not cause y. Accurate predictions are hard to achieve and depend heavily on having the correct predictors in the data set. Arguably more data often leads to better results however, large data sets are not always required.

Causal: To get the real relationship between variables you need to use randomised control trials and measure average effects. i.e. if you change x by this much how does y change. Even though this can be carried out on observed data huge assumptions are required and large errors would be introduced into the results.

Biases in data collection or cleaning

It is very easy to introduce biases into your data or methods if you are not careful.
Here are some of the most frequent:

Selection/sampling bias: If the population selected does not represent the actual population, the results are skewed. This commonly occurs when data is selected subjectively rather than objectively or when non-random data has been selected.

Confirmation bias: Occurs when there is an intentional or unintentional desire to prove a hypothesis, assumption, or opinion.

Outliers: Extreme data values that are significantly out of the normal range of values can completely bias the results of an analysis. If the outliers are not removed in these cases the results of the analysis can be misleading. These outliers are often interesting cases and ought to be investigated separately.

Simpson's Paradox: A trend that is indicated in the data can reverse when the data is split into comprising groups.

Overfitting: Involves an overly complex model which overestimates the effect/relevance of the examples in the training data and/or starts fitting to the noise in the training data.

Underfitting: Occurs when the underlying trend in the data is not found. Could occur if you try to fit a linear model to non linear data or if there is not enough data available to train the model.

Confounding Variables: Two variables may be assumed related when in fact they are both related to an omitted confounding variable. This is why correlation does not imply causation.

Non-Normality: If a distribution is assumed to be normal when it is not the results may be biased and misleading.

Data Dredging: This process involves testing huge numbers of hypotheses about a single data set until the desired outcome is found.

I recently detailed data load possibilities with the tools provided with Essbase under OAChere. Whilst all very usable, my thoughts turned to systems that I have worked on and how the loads currently work, which led to how you might perform incremental and / or automated loads for OAC Essbase.

A few background points:

The OAC front end and EssCS command line tools contain a ‘clear’ option for data, but both are full data clears – there does not seem to be a partial or specifiable ‘clear’ available.

The OAC front end and EssCS command line tools contain a ‘file upload’ function for (amongst other things) data, rules, and MAXL (msh) script files. Whilst the front-end operation has the ability to overwrite existing files, the EssCS Upload facility (which would be used when trying to script a load) seemingly does not – if an attempt is made to upload a file that already exists, an error is shown.

The OAC ‘Job’ facility enables a data load to be conducted with a rules file; the EssCS Dataload function (which would be used when trying to script a load) seemingly does not.

MAXL still exists in OAC, so it is possible to operate at Essbase ‘command level’

Whilst the tools that are in place all work well and are fine for migration or other manual / adhoc activity, I am not sure what the intended practice might be around some ‘real world’ use cases: a couple of things that spring to mind are

Incremental loads

Scheduled loads

Large ASO loads (using buffers)

Incremental Loads

It is arguably possible to perform an incremental load in that

A rules file can be crafted in on-prem and uploaded to OAC (along with a partial datafile)

Loads appear to be conducted in overwrite mode, meaning changed and new records will be handled ok

It is possible that (eg) a ‘current month’ data file could be loaded and reloaded to form an incremental load of sorts. The problem here will come if data is deleted for a particular member combination in the source from one day to the next – with no partial clear (eg, of current month data) seemingly possible, there is no way of clearing redundant values (at least for an ASO cube…for a BSO load, the ‘Clear combinations’ functionality of the load rules file can be used…although that has not yet been tested on this version).

So in the case of an ASO cube, the only option using available tools would be to ensure that ‘contra’ records are added to the incremental load file. This is not ideal, as it is another process to follow in data preparation, and would also add unnecessary zeros to the cube. For these reasons, I would generally look to effect a partial clear of the ‘slice’ being loaded before proceeding with an incremental the load.

The only way I can see of achieving this under OAC would be to take advantage of the fact that MAXL is available and effect the clear using alter database clear data.

I may have missed something, but I see no obvious way of being able to automate this process with the on-board facilities.

Automating the load process

Along with the points listed above, some other facts to be aware of:

It is possible to manually transfer files to OAC using FTP

It is possible to amend the cron scheduler for the oracle user in OAC

Even bearing in mind the above, I should caveat this section by saying getting ‘under the hood’ in this way is possibly not supported or recommended, and should only be undertaken at your own risk.

Having said that…

By taking advantage of the availability of FTP and cron, it should be possible to script a solution that can run unattended, for full and incremental loads. Furthermore, data clears (full or partial) can be included in the same process, as could parallel buffer loading for ASO or any other MAXL-controllable process (within the confines of this version of Essbase).

The OAC environment

A quick look around discloses that the /u01/latency directory is roughly the equivalent of the ../user_projects/epmsystem1/EssbaseServer/essbaseserver1 (or equivalent) directory in an on-prem release in that it contains the /app ‘parent’ directory which in turn contains a subdirectory structure for all application and cube artefacts. Examining this directory for ASOSamp.Basic shows that the uploaded dataload.* files are here, along with all other files listed by the Files screen of OAC:

Note that remote connection is via the opc user, but this can be changed to oracle once connected (by using sudo su – oracle).

As oracle, these files can be manually deleted…doing so means they will no longer be found by the EssCS Listfiles command or the Files screen within OAC (once refreshed). If deleted manually, new versions of the files can be re-uploaded via either of the methods detailed above (whilst an overwrite option exists in the OAC Files facility, there seems to be no such option with the EssCS Upload feature…trying to upload a file that already exists results in an error.

All files are owned by the oracle user, with no access rights at all for the opc user that effects a remote connection via FTP.

Automation: Objectives

The objective of this exercise was to come up with a method that, unattended, would:

Upload received files (data, rules) to OAC from a local source

Put them in the correct OAC directory in a usable format

Invoke a process that runs a pre-load process (eg a clear), a load, and (if necessary a post load process)

Clear up after itself

Automation: The Process

The first job is to handle the upload of files to OAC. This could be achieved via a psftp script that uploads the entire contents of a nominated local directory:

The EssCSUpload,bat script above (which can, of course be added to a local scheduler so that it runs unattended at appointed times) passes a pre-scripted file to psftp to connect and transfer the files. Note that the opc user is used for the connection, and the files are posted to a custom-created directory, CUSTOM_receive (under the existing /u01/latency). The transferred files are also given a global ‘rw’ attribute to assist with later processing

Now the files are in the OAC environment, control is taken up there.

A shell script (DealWithUploads) is added to the oracle home directory:

This copies all the files in the nominated receiving directory to the actual required location – in this case, the main ASOSamp/Basic directory. Note the use of ‘-p’ with the copy command to ensure that attributes (ie, the global ‘rw’) are retained. Once copied, the files are deleted from the receiving directory so that they are not processed again.

Once the files are copied into place, startMAXL is used to invoke a pre-prepared msh script:

as can be seen, this clears the cube and re-imports from the uploaded file using the uploaded rules file. The clear here is a full reset, but a partial clear (in the case of ASO) can be used here instead if required

As with the ‘local’ half of the method, the DealWithUploads.sh script file can be added to the scheduler on OAC: the existing cron entries are already held in the file /u01/app/oracle/tools/home/oracle/crontab.txt; it is a simple exercise to schedule a call to this new custom script.

A routine such as this would need a good degree of refinement and hardening – the file lists for the transfers should be self-building, passwords need to be encrypted, the MAXL script should only be called if required, the posting locations for files should be content/context sensitive, etc – but in terms of feasibility testing the requirements listed above, it was successful.

This approach places additional directories and files in an environment / structure that could be maintained at any time: it is therefore imperative that some form of code control / release mechanism is employed so that it can be replaced in the event of any unexpected / uncontrollable maintenance taking place on the OAC environment that could invalidate or remove it.

Even once hardened, I think there is a considerable weak spot in this approach in that the rules file seemingly has to be crafted in an on-prem environment and uploaded: as I detailed here, even freshly-uploaded, working rules files error when an attempt is made to verify them. For now, I’ll keep looking for an alternative.

Summary

Whilst a lot of the high-level functionality is in place around data loads, often with multiple methods, I think there are a couple of detailed functionality areas that may currently require workarounds – to my mind, the addition of the ability to select & run an msh format ‘preload’ script when running a dataload Job (eg for clears) would be useful, whilst a fully functional rules file editor strikes me as important. The fact that an FTP connection is available at all is a bonus, but because this is as a non-oracle user, it is not possible to put a file in the correct place directly - the EssCS Upload faciity does this of course, but the seeming absence of an overwrite option or an additional Delete option for EssCS) somewhat limits its usefulness at this point. But can you implement a non attended, scheduled load or incremental load routine ? Sure you can.

After my initial quick pass through Essbase under OAC here, this post looks at the data loading options available in more detail. I used the provided sample database ASOSamp.Basic, which first had to be created, as a working example.

Creating ASOSamp

Under the time-honoured on-prem install of Essbase, the sample applications were available as an install option – supplied data has to be loaded separately, but the applications / cubes themselves are installed as part of the process if the option is selected. This is not quite the same under OAC – some are provided in an easily installable format, but they are not immediately available out-of-the-box.

One of the main methods of cube creation in Essbase under OAC is via the Import of a specifically formatted Excel spreadsheet, and it is via the provision of downloadable pre-built ‘template’ spreadsheets that the sample applications are installed in this version.

After accessing the homepage of Essbase on OAC, download the provided cube creation template – this can be found under the ‘Templates’ button on the home page:

Note that in the case of the sample the ASOSamp.Basic database, the data is not in the main template file – it is held in a separate file. This is different to other examples, such as Sample.Basic, where the data provided is held in a dedicated tab in the main spreadsheet. Download both Aggregate Storage Sample and Aggregate Storage Sample Data:

Return to the home page, and click Import. Choose the spreadsheet downloaded as Aggregate Storage Sample (ASO_Sample.xlsx) and click Deploy and Close.

This will effect all of the detail in the spreadsheet – create the application, create the cube, add dimensions / attribute dimensions and members to the outline, etc:

Loading ASOSamp.Basic

Because the data file is separate from the spreadsheet, the next step is to uploaded this to OAC so that it is available for loading: back on the home page, select the newly-created ASOSamp.Basic (note: not ASOSamp.Sample as with on-prem), and click Files:

In the right-hand window, select the downloaded data file ASOSampleData.txt and click the Upload button:

This will upload the file:

Once the file upload is complete, return to the home page. With the newly-created ASOSamp.Basic still selected, click Jobs:

Choose Data Load as the Job Type, and highlight the required Data File:

Click Execute.

A new line will be added to the Job Monitor:

The current status of the job is shown – in this case, ‘in progress’ – and the screen can be refreshed.

Once complete, the Status field will show the completion state of the job, whilst the Job Details icon on the right-hand side provides more detail – in this case, confirming that 311,795 records were successfully loaded, and 0 rejected:

The success of the load is confirmed by a quick look in Smartview:

Note that a rules file was not selected as part of the job – this makes sense when we look at the data file…

...which is familiar-looking: just what we would expect from a EAS export (MAXL: export database), which can of course just be loaded in a similar no-rules-file way in on prem.

Incidentally, this is different to the on-prem approach to ASOSamp.Sample where a ‘flat’, tab-delimited data file is provided for the sample data, along with a rules file that is required for the load:

...although the end-results are the same:

This ‘standard’ load works in overwrite mode – any new values in the file will be added, but any that exist already will be overwritten: running the load again and refreshing the Smartview report results in the same numbers confirms this.

This can be verified further by running with a changed data file: taking a particular node of data for the Units measure…

One of the constituent data values can be changed in a copy of the data file – in this example, one record (it doesn’t matter which for this purpose) can be increased – in this case, ‘1’ has been increased to ‘103’:

The amended file needs to be saved and uploaded to OAC as outlined above, and the load process repeated, this time using the amended file. After a successful load, the aggregated value on the test Smartview report has increased by the same 102:

Loading flat files

So, how might we load the same sort of flat, tab delimited file of the like supplied as the on-prem ASOSamp.Sample data file ?

As above, files can be uploaded to OAC, so putting the dataload.txt data file from the on-prem release into OAC is straightforward. However, as you’d expect, attempting to run this as a load job without a rules file results in an error.

However, it is possible to run an OAC load with a rules file created in an on-prem version: firstly, upload the rules file (in this case, dataload.rul) in the same way as the data file. When setting up the load job, select the data file as normal, but under Scripts select the rules file required:

The job runs successfully, with the ‘Details’ overlay confirming the successful record count.

As with rules files generated by the Import facility, uploaded rules files can also be edited in text mode:

It would seem logical that changing the dataLoadOptions value at line 215 to a value other than OVERWRITE (eg ADD) might be a quick behavioural change for the load that would be easy to effect. However, making this change resulted in verification errors. Noting that the errors related to invalid dimension names, an attempt was made to verify the actual, unchanged rules file as uploaded…which also resulted in the same verification errors. So somewhat curiously, the uploaded on-prem rules file can be successfully used to load a corresponding data file, but (effectively) can’t be edited or amended.

Loading from Spreadsheet Template

The template spreadsheets used to build applications can also contain one or more data tabs. Unlike the OAC Jobs method or EssCS Dataload, the spreadsheet method gives you the option of a rules file AND the ability to Add (rather than overwrite) data:

Within OAC, this is actioned via the ‘Import’ function on the home page:

Note that we are retaining all data, and have the Load Data box checked. Checks confirm the values in the file are added to those already in the cube.

The data can also be uploaded via the Cube Designer in Excel under Cube Designer / Load Data:

Note that unlike running this method under OAC, the rules file (which was created by the initial import as the Data tab existed in the spreadsheet at that point) has to be selected manually.

Once complete, an offer is made to view the Job Status Viewer (which can also be accessed from Cube Designer / View Jobs):

With further detail for each job also being available:

Use facilities to upload files

Given the ability to upload and run both data and rules files, the next logical step would be to script this for automated running. OAC contains a downloadable utility, the Command Line Tool (aka CLI , EssCS) which is a number of interface tools that can be run locally against an OAC instance of Essbase:

Running locally, a successful EssCS login effectively starts a session that then remains open for other EssCS commands until the session is closed with a logout command.

The login syntax suggests the inclusion of the port number in the URL, but I had no success with this…although it worked without the port reference:

As above, the connection is made and is verified by the successful running of another command (eg version), but the logout command produced an error. Despite this, the logout appeared successful – no other EssCS commands worked until a login was re-issued.

With EssCS installed and working, the Listfiles and Upload facilities become available. The function of these tools is pretty obvious from the name. Listfiles should be issued with at least arguments for the application and cube name:

The file type (csc, rul, txt, msh, xls, xlsx, xlsm, xml, zip, csv) can be included as an additional argument…

…although the file types is a fixed list – for example, you don’t seem to be able to use a wild card to pick up all spreadsheet files.

Whilst there is an Upload (and Download) facility, there does not seem to be the means to delete a remote file…which is a bit of an inconvenience, because using Upload to upload a file that already exists results in an error, and there is no overwrite option. The dataload.txt and dataload.rul files previously uploaded via the OAC front end were therefore manually deleted via OAC, and verified using Listfiles.

The files were then uploaded back to OAC using the Upload option of EssCS:

As you would expect, the files will then appear both in a Listfiles command and via OAC:

Note that the file list in OAC does not refresh with a browser page refresh or any ‘sort’ operation: use Refresh under Actions as above.

With the files now re-uploaded, the data can be loaded. EssCS also contains a DataLoad command, but unfortunately there appears to be no means to specify a rules file – meaning it would seem to be confined to overwrite, ‘export data’ style imports only:

A good point here is that the a DataLoad EssCS command makes an entry to the Jobs table, so success / record counts can be confirmed:

There are some minor differences between them, which may affect which you may wish to use for any particular scenario.

Arguably, given the availability of MAXL, there is a further custom method available as the actual data load can be effected that way too. This will be explored further in the next post that will start to consider how these tools might be used for real scenarios.

Our Data Science team includes physicists, mathematicians, industry veterans and data engineers ready to help you take analytics to the next level while providing expert guidance in the process.

Why use it?

Data is cheaper to collect and easier to store than ever before. But collecting the data is not synonymous with getting value from it. Businesses need to do more with the same budget and are starting to look into machine learning to achieve this.

These processes can take off some of the workload, freeing up people's time to work on more demanding tasks. However, many businesses don't know how to get started down this route, or even if they have the data necessary for a predictive model.

R

Our Data science team primarily work using the R programming language. R is an open source language which is supported by a large community.

The functionality of R is extended by many community written packages which implement a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, statistical tests, time-series analysis, classification, clustering as well as packages for data access, cleaning, tidying, analysing and building reports.

All of these packages can be found on the Comprehensive R Archive Network (CRAN), making it easy to get access to new techniques or functionalities without needing to develop them yourself (all the community written packages work together).

R is not only free and extendable, it works well with other technologies and makes it an ideal choice for businesses who want to start looking into advanced analytics. Python is an obvious alternative, and several of our data scientists prefer it. We're happy to use whatever our client's teams are most familiar with.

Experienced programmers will find R syntax easy enough to pick up and will soon be able to implement some form of machine learning. However, for a detailed introduction to R and a closer look at implementing some of the concepts mentioned below we do offer a training course in R.

Our Methodology

Define
Define a Question

Analytics, for all intents and purposes, is a scientific discipline and as such requires a hypothesis to test. That means having a specific question to answer using the data.

Starting this process without a question can lead to biases in the produced result. This is called data dredging - testing huge numbers of hypotheses about a single data set until the desired outcome is found. Many other forms of bias can be introduced accidentally; the most commonly occurring will be outlined in a future blog post.

Once a question is defined, it is also important to understand which aspects of the question you are most interested in. Associated, is the level of uncertainty or error that can be tolerated if the result is to be applied in a business context.

Questions can be grouped into a number of types. Some examples will be outlined in a future blog post.

Define a dataset

The data you expect to be relevant to your question needs to be collated. Maybe supplementary data is needed, or can be added from different databases or web scraping.

This data set then needs to be cleaned and tidied. This involves merging and reshaping the data as well as possibly summarising some variables. For example, removing spaces and non-printing characters from text and converting data types.

The data may be in a raw format, there may be errors in the data collection, or corrupt or missing values that need to be managed. These records can either be removed completely or replaced with reasonable default values, determined by which makes the most sense in this specific situation. If records are removed you need to ensure that no selection biases are being introduced.

All the data should be relevant to the question at hand, anything that isn't can be removed. There may also be external drivers for altering the data, such as privacy issues that require data to be anonymised.

Natural language processing could be implemented for text fields. This takes bodies of text in human readable format such as emails, documents and web page content and processes it into a form that is easier to analyse.

Any changes to the dataset need to be recorded and justified.

Model
Exploratory Analysis

Exploratory data analysis involves summarising the data, investigating the structure, detecting outliers / anomalies as well as identifying patterns and trends. It can be considered as an early part of the model production process or as a preparatory step immediately prior. Exploratory analysis is driven by the data scientist, enabling them to fully understand the data set and make educated decisions; for example the best statistical methods to employ when developing a model.

The relationships between different variables can be understood and correlations found. As the data is explored, different hypotheses could be found that may define future projects.

Visualisations are a fundamental aspect of exploring the relationships in large datasets, allowing the identification of structure in the underlying dataset.

This is also a good time to look at the distribution of your dataset with respect to what you want to predict. This often provides an indication of the types of models or sampling techniques that will work well and lead to accurate predictions.

Variables with very few instances (or those with small variance) may not be beneficial, and in some cases could even be detrimental, increasing computation time and noise. Worse still, if these instances represent an outlier, significant (and unwarranted) value may be placed on these leading to bias and skewed results.

Statistical Modelling/Prediction

The data set is split into two sub groups, "Training" and "Test". The training set is used only in developing or "training" a model, ensuring that the data it is tested on (the test set) is unseen. This means the model is tested in a more realistic context and will help to determine whether the model has overfitted to the training set. i.e. is fitting random noise in addition to any meaningful features.

Taking what was learned from the exploratory analysis phase, an initial model can be developed based on an appropriate application of statistical methods and modeling tools. There are many different types of model that can be applied to the data, the best tends to depend on the complexity of your data and the any relationships that were found in the exploratory analysis phase. During training, the models are evaluated in accordance with an appropriate metric, the improvement of which is the "goal" of the development process. The predictions produced from the trained models when run on the test set will determine the accuracy of the model (i.e. how closely its predictions align with the unseen real data).

A particular type of modelling method, "machine learning" can streamline and improve upon this somewhat laborious process by defining models in such a way that they are able to self optimise, "learning" from past iterations to develop a superior version. Broadly, there are two types, supervised and un-supervised. A supervised machine learning model is given some direction from the data scientist as to the types of methods that it should use and what it is expecting. Unsupervised machine learning on the other hand, as the name suggests, involves giving the model less information to start with and letting it decide for its self what to value, and how to approach the problem. This can help to remove bias and reduce the number of assumptions made but will be more computationally intensive, as the model has a broader scope to investigate. Usually supervised machine learning is employed in a case where the problem and data set are reasonably well understood, and unsupervised machine learning where this is not the case.

Complex predictive modelling algorithms perform feature importance and selection internally while constructing models. These models can also report on the variable importance determined during the model preparation process.

Peer Review

This is an important part of any scientific process, and effectively utilities our broad expertise in modelling at Rittman Mead. This enables us to be sure no biases were introduced that could lead to a misleading prediction and that the accuracy of the models is what could be expected if the model was run on new unseen data. Additional expert views can also lead to alternative potential avenues of investigation being identified as part of an expanded or subsequent study.

Deploy
Report

For a scientific investigation to be credible the results must be reproducible. The reports we produce are written in R markdown and contain all the code required to reproduce the results presented. This also means it can be re-run with new data as long as it is of the same format. A clear and concise description of the investigation from start to finish will be provided to ensure that justification and context is given for all decisions and actions.

Delivery

If the result is of the required accuracy we will deploy a model API enabling customers to start utilising it immediately.
There is always a risk however that the data does not contain the required variables to create predictions with sufficient confidence for use. In these cases, and after the exploratory analysis phase there may be other questions that would be beneficial to investigate. This is also a useful result, enabling us to suggest additional data to collect that may allow a more accurate result should the process be repeated later.

Support

Following delivery we are able to provide a number of support services to ensure that maximum value is extracted from the model on an on-going basis. These include:
- Monitoring performance and accuracy against the observed, actual values over a period of time. Should there be discrepancies between these values arise, these can be used to identify the need for alterations to the model.
- Exploring specific exceptions to the model. There may be cases in which the model consistently performs poorly. Instances like these may not have existed in the training set and the model could be re-trained accordingly. If they were in the training set these could be weighted differently to ensure a better accuracy, or could be represented by a separate model.
- Updates to the model to reflect discrepancies identified through monitoring, changes of circumstance, or the availability of new data.
- Many problems are time dependent and so model performance is expected to degrade, requiring retraining on more up to date data.

Summary

In conclusion our Insights lab has a clearly defined and proven process for data science projects that can be adapted to fit a range of problems.

Contact us to learn how Insights Lab can help your organization get the most from its data, and schedule your consultation today.
Contact us at info@rittmanmead.com

Recently, I watched an amazing keynote presentation from Amanda Cox at OpenVis. Toward the beginning of the presentation, Amanda explained that people tend to feel and interpret things differently. She went on to say that, “There’s this gap between what you say or what you think you’re saying, and what people hear.”

While I found her entire presentation extremely interesting, that statement in particular really made me think. When I view a visualization or report, am I truly understanding what the results are telling me? Personally, when I’m presented a chart or graph I tend to take what I’m seeing as absolute fact, but often there’s a bit of nuance there. When we have a fair amount of variance or uncertainty in our data, what are some effective ways to communicate that to our intended audience?

In this blog I'll demonstrate some examples of how to show uncertainty and variance in Tableau. All of the following visualizations are made using Tableau Public so while I won’t go into all the nitty-gritty detail here, follow this link to download the workbook and reverse engineer the visualizations yourself if you'd like.

First things first, I need some data to explore. If you've ever taken our training you might recall the Gourmet Coffee & Bakery Company (GCBC) data that we use for our courses. Since I’m more interested in demonstrating what we can do with the visualizations and less interested in the actual data itself, this sample dataset will be more than suitable for my needs. I'll begin by pulling the relevant data into Tableau using Unify.

If you haven't already heard about Unify, it allows Tableau to seamlessly connect to OBIEE so that you can take advantage of the subject areas created there. Now that I have some data, let’s look at our average order history by month. To keep things simple, I’ve filtered so that we’re only viewing data for Times Square.

On this simple visualization we can already draw some insights. We can see that the data is cyclical with a peak early in the year around February and another in August. We can also visually see the minimum number of orders in a month appears to be about 360 orders while the maximum is just under 400 orders.

When someone asks to see “average orders by month”, this is generally what people expect to see and depending upon the intended audience a chart like this might be completely acceptable. However, when we display aggregated data we no longer have any visibility into the variance of the underlying data.

If we display the orders at the day level instead of month we can still see the cyclical nature of the data but we also can see additional detail and you’ll notice there’s quite a bit more “noise” to the data. We had a particularly poor day in mid-May of 2014 with under 350 orders. We’ve also had a considerable number of good days during the summer months when we cleared 415 orders.

Depending upon your audience and the dataset, some of these charts might include too much information and be too busy. If the viewer can’t make sense of what you’re putting in front of them there’s no way they’ll be able to discern any meaningful insights from the underlying dataset. Visualizations must be easy to read. One way to provide information about the volatility of the data but with less detail would be to use confidence bands, similar to how one might view stock data. In this example I’ve calculated and displayed a moving average, as well as upper and lower confidence bands using the 3rd standard deviation. Confidence bands show how much uncertainty there is in your data. When the bands are close you can be more confident in your results and expectations.

An additional option is the use of a scatterplot. The awesome thing about a scatterplots is that not only does it allow you to see the variance of your data, but if you play with the size of your shapes and tweak the transparency just right, you also get a sense of density of your dataset because you can visualize where those points lie in relation to each other.

The final example I have for you is to show the distribution of your data using a boxplot. If you’re not familiar with boxplots, the line in the middle of the box is the median. The bottom and top of the box, known as the bottom and top hinge, give you the 25th and 75th percentiles respectively and the whiskers outside out the box show the minimum and maximum values excluding any outliers. Outliers are shown as dots.

I want to take a brief moment to touch on a fairly controversial subject of whether or not to include a zero value in your axes. When you have a non-zero baseline it distorts your data and differences are exaggerated. This can be misleading and might lead your audience into drawing inaccurate conclusions.

For example, a quick Google search revealed this image on Accuweather showing the count of tornados in the U.S. for 2013-2016. At first glance it appears as though there were almost 3 times more tornados in 2015 than in 2013 and 2014, but that would be incorrect.

Philip begins his article with this chart tweeted by the National Review which appears to prove that global temperatures haven’t changed in the last 100 years. As he goes on to explain, this chart is misleading because of the scale used. The y-axis stretches from -10 to 110 degrees making it impossible to see a 2 degree increase over the last 50 years or so.

The general rule of thumb is that you should always start from zero. In fact, when you create a visualization in Tableau, it includes a zero by default. Usually, I agree with this rule and the vast majority of the time I do include a zero, but I don’t believe there can be a hard and fast rule as there will always be an exception. Bar charts are used to communicate absolute values so the size of that bar needs to be proportional to the overall value. I agree that bar charts should extend to zero because if it doesn’t we distort what the data is telling us. With line charts and scatterplots we tend to look at the positioning of the data points relative to each other. Since we’re not as interested in the value of the data, I don’t feel the decision to include a zero or not is as cut and dry.

The issue boils down to what it is you’re trying to communicate with your chart. In this particular case, I’m trying to highlight the uncertainty so the chart needs to draw attention to the range of that uncertainty. For this reason, I have not extended the axes in the above examples to zero. You are free to disagree with me on this, but as long as you’re not intentionally misleading your audience I feel that in instances such as these this rule can be relaxed.

These are only a few examples of the many ways to show uncertainty and variance within your data. Displaying the volatility of the data and giving viewers a level of confidence in the results is immensely powerful. Remember that while we can come up with the most amazing visualizations, if the results are misleading or misinterpreted and users draw inaccurate conclusions, what’s the point?

I wrote a blog post a while ago describing the catalog validation: an automated process performing a consistency check of the catalog and reporting or deleting the inconsistent artifacts.
In the post I stated that catalog validation should be implemented regularly as part of the cleanup routines and provides precious additional information during the pre and post upgrade phases.

However some time later I noted Oracle's support Doc ID 2199938.1 stating that the startup procedure I detailed in the previous blog post is not supported in any OBI release since 12.2.1.1.0. You can imagine my reaction...

The question then became: How do we run the catalog validation since the known procedure is unsupported? The answer is in catalog manager and the related command line call runcat.sh which, in the server installations (like the SampleApp v607p), can be found under $DOMAIN_HOME/bitools/bin.

How Does it Work?

As for most of command line tools, when you don't have a clue on how it works, the best approach is to run with the -help option which provides the list of parameters to pass.

Catalog Manager understands commands in the following areas:
Development To Production
createFolder Creates folder in the catalog
delete Deletes the given path from the catalog
maintenanceMode Puts the catalog into or out of Maintenance Mode (aka ReadOnly)
...
Multi-Tenancy
provisionTenant Provisions tenants into a web catalog
...
Patch Management
tag Tags all XML documents in a catalog with a unique id and common version string
diff Compares two catalogs
inject Injects a single item to a diff file
...
Subject Area Management
clearQueryCache Clears the query cache

Unfortunately none of the options in the list seems to be relevant for catalog validation, but with a close look at the recently updated Doc ID 2199938.1 I could find the parameter to pass: validate.
The full command then looks like

./runcat.sh -cmd validate

In my previous blog I mentioned different types of validation. What type of validation is the default command going to implement? How can I change the behaviour? Again the -help option provides the list of instructions.

-offline: the catalog validation needs to happen offline. Either with services down or on a copy of the live catalog. Running catalog validation on a online catalog is dangerous especially with "Clean" options since could delete content in use.

-folder: the catalog validation can be run only for a subset of the catalog

Also, '-accounts' cannot be 'None'.: some validations are a prerequisite for others to happen

Default is 'Clean': some validations have a "Clean" as default value, meaning that will solve the issue by removing the inconsistent object, this may be inappropriate in some cases.

As written before, the initial catalog validation should be done with all options set on Report since this will give a log file of all inconsistencies without deleting pieces of the catalog that could still be valuable. In order to do so the command to execute is:

runcat.sh output is displayed direcly in the console, I'm redirecting it to a file called cat_validation.log for further analysis.

If, after the initial run with all options to Report you want the catalog validation utility to "fix" the inconsistent objects, just change the desired options to Clean. Please make sure to take a backup of the catalog before since the automatic fix is done by removing the related objects. Moreover ensure that catalog validation is working on a offline catalog. The command itself can work on top on a online catalog but is never a good idea checking a catalog that could potentially be changed while the tool is running.

The output

Let's see few examples of how Catalog Validation spots inconsistent objects. For the purpose of this test I'll work with Oracle's Sampleapp.

Abandoned and inaccessible homes

Running the validation against the Sampleapp catalog provides some "interesting" results: some homes are declared "abandoned": this could be due to the related user not existing anymore in weblogic console, but that's not the case

E10 saw.security.validate.homes Abandoned home /users/weblogic

Looking deeper in the logs we can see that the same user folders are flagged as

User facing object '/users/weblogic' has no user permissions and is inaccessible

Logging in with the user weblogic doesn't allow me to check the "My Folders" in the catalog. When switching to "Admin View" and trying to open "My Folder" I get the following error

As written in the logs looks like the user folder has permission problems. How can we solve this? One option is to use again the runcat.sh command with the forgetAccounts option to remove the inconsistent homes. However this solution deletes all the content related to the user that was stored under the "My Folders".

In order to keep the content we need to overwrite the folder's permission with an administrator account. Unfortunately, when right-clicking on the folder, the "Permission" option is not available.

As a workaround I found that clicking on Properties and then on Set Ownership of this item and all subitems allows you to grant full access to the administrator which is then able to reset the proper user the relevant access privilege.

Once the workaround is implemented the users is able to check his "My Folder" content, however the the errors are still present in catalog validation. The solution is storing the relevant artifacts in another part of the catalog, run runcat.sh with forgetAccounts option and then reimport the objects if needed.

Inconsistent Objects

The main two reasons generating inconsistent objects are:

Invalid XML: The object (analysis or dashboard) XML code is not valid. This can be caused by errors during the write to disk or problems during migrations.

Broken Links: analysis contained in a dashboard or linked from other analysis have been renamed or deleted.

Let's see how catalog validation shows the errors.

Invalid XML

To test this case I created a simple analysis with two columns and then went to the Advanced tab and deliberately removed an > to make the XML invalid.

When trying to applying the change I got the following error which denied me the possibility to save.

Since I really wanted to ruin my analysis I went directly to the file system under $BI_HOME/bidata/service_instances/ssi/metadata/content/catalog/root/shared/$REQUEST_PATH and changed the XML directly there.

After than I run the catalog validation with only the flag items equal to Report and the rest set to None since I'm looking only at invalid XMLs.
The result as expected is:

Which tells me that my analysis notworkinganalysis is invalid with an unterminated start tag, exactly the error I was expecting. Now I have two choices: either fixing the analysis XML manually or rerunning the catalog validation with option Clean which will delete the analysis since it's invalid. As said before there is no automated fix.

I wanted to do a further example on this, instead of removing the >, i removed a quotation mark " to make the analysis invalid

After clicking to Apply OBIEE already tells me that there is something wrong in the analysis. But since it allows me to save and since I feel masochist I saved the analysis.

But... when running the catalog validation as before I end up seeing 0 errors related to my notworkinganalysis.

The answer to Jackie Chan question is that I got 0 errors since in this second case the XML is still valid. Removing a " doesn't make the XML syntax invalid! In order to find and solve that error we would need to use Oracle's Baseline Validation Tool.

Broken Links

To test the broken links case I created the following scenario:

Analysis SourceAnalysis which has navigation action to TargetAnalysis

Dashboard TestDashboard which contains the TargetAnalysis object.

In order to break things I then deleted the TargetAnalysis.

Running catalog validation with the option links to Report. As expected I get a line

Summarizing the broken link validation reports if missing objects are included in the main definition of other objects (as filters or as parts of dashboards) but doesn't seem to report if the missing object is only linked via an action.

Conclusion

My experiments show that catalog validation finds some errors like invalid homes, XML files and broken links which otherwise users would hit at the run-time and that won't make them happy. But there are still some errors which it doesn't log like analysis with wrong column syntax, luckily for most of the cases other tools like the Baseline Validation can spot them easily so use all you have, use as frequently as possible and if you want more details about how it works and how it can be included in the automatic checks for code promotions don't hesitate to contact us!

Not a long time ago a friend of mine spent a significant amount of time trying to find a flat to rent. And according to what he said it wasn't an easy task. It took him a decent time and efforts to find something that is big enough (but not too big) not too far from a workplace, had required features and affordable at the same time. And as a specialist in data analysis, I prefer to think about this task as a data discovery one (yes, when you have a hammer everything looks like a nail). And I decided to see if a data analysis tool can help me understand the rental market better. I'm sure you've already read the name of this post so I can't pretend I'm keeping intrigue. This tool is Tableau 10.3.

The Data

The friend I was talking before was looking for a flat in Moscow, but I think that this market is completely unknown to the most of the readers. And also I'd have to spend a half of time translating everything into English so for this exercise I tookBrighton and Hove data from http://rightmove.co.uk and got a nice JSON Lines file. JSON Lines files are basically the same JSON as we all know but every file has multiple JSONs delimited by a newline.

{json line #1}
{json line #2}
...
{json line #n}

That could be a real problem but luckily Tableau introduced JSON support in Tableau 10.1 and that means I don't have to transform my data to a set of flat tables. Thanks to Tableau developers we may simply open JSON Lines files without any transformations.

Typical property description looks like this:

It has a few major blocks:

Property name - 2 bedroom apartment to rent

Monthly price - £1,250

Description tab:

Letting information - this part is more or less standard and has only a small number of possible values. This part has Property name: Property value structure ('Date available':'Now').

Key features - this part is an unformalized set of features. Every property may have its own unique features. And it is not a key-value list like Letting information, but a simple list of features.

Full description - simply a block of unstructured text.

Nearest stations - shows three nearest train stations (there could be underground stations too if they had it in Brighton).

School checker - this shows 10 closest primary and 10 secondary schools. For this, I found a kind of API which brought me a detailed description of every school.

And finally, the JSON for one property has the following look. In reality, it is one line but just to make it more easy to read I formatted it to a human readable format. And also I deleted most of the schools' info as it is not as important as it is huge.

The full version is here: 6391 lines, I warned you. My dataset is relatively small and has 1114 of such records 117 MB in total.

Just a few things I'd like to highlight. Letting information has only a small number of fixed unique options. I managed to parse them to fields like furnish, letting_type, etc. Key Features list became just an array. We have thousands of various features here and I can't put them to separate fields. Nearest stations list became an array of name and value pairs. My first version of the scrapper put them to a key-value list. Like this:

but this didn't work as intended. I got around one hundred of measures with names Fishbourne, Chichester, Bosham, etc. Not what I need. But that could work well if I had only a small number of important POIs (airports for example) and wanted to know distances to this points. So I changed it to this and it worked well:

When I started this study my knowledge of the UK property rent market was close to this:

And it's possible or even likely that some of my conclusions may be obvious for anyone who is deep in the topic. In this blog, I show how a complete newbie (me) can use Tableau and become less ignorant.

So my very first task was to understand what kind of objects are available for rent, what are their prices and so on. That is the typical task for any new subject area.

As I said before Tableau 10 can work with JSON files natively but the question was if it could work with such a complex JSON as I had. I started a new project and opened my JSON file.

I expected that I will have to somehow simplify it. But in reality after a few seconds of waiting Tableau displayed a full structure of my JSON and all I had to do was selecting branches I need.

After a few more seconds I got a normal Tableau data source.

And this is how it looked like in analysis mode

First Look at the Data

OK, let's get started. The first question is obvious: "What types of property are available for rent?". Well, it seems that name ('2 bedroom apartment to rent') is what I need. I created a table report for this field.

Well, it gives me the first impression of what objects are offered and what my next step should be. First of all the names are ending with "to rent". This just makes strings longer without adding any value. The word "bedroom" also doesn't look important. Ideally, I'd like to parse these strings into fields one of which is # of bedrooms and the second one is Property type. The most obvious action is to try Split function.

Well, it partially worked. This function is smart enough and removed 'to rent' part. But except for this, it gave me nothing. On other datasets (other cities) it gave me much better results but it still wasn't able to read my mind and did what I wanted:

But I spent 15 seconds for this and lost nothing and if it worked I'd saved a lot of time. Anyway, I'm too old to believe in magic and this almost didn't hurt my feelings.

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

Yes, this string literally asks some regular expressions wizardry.

I can easily use REGEXP_EXTRACT_NTH and get what I want. Group 1 is the number of bedrooms and Group 3 is the property type. Groups 2 and 4 are just constant words.

Explanation for my regular expressionI can describe most of the names in the following way: "digitbedroomproperty typeto rent" and the rest are "property typeto rent. So digit and bedroom are optional and property typeto rent are mandatory. The expression is easy and obvious: ([0-9]*)( bedroom )*(.*)( to rent)

Regular expressions are one of my favourite hammers and helped me a lot for this analysis. And after all manipulations, I got a much better view of the data (I skipped some obvious steps like create a crosstab or a count distinct measure to save space for anything more interesting).

And while this result looks pretty simple it gives me the first insight I can't get simply browsing the site. The most offered are 1 and 2 bedroom properties especially flats and apartments. And if a family needs a bigger something with 4 or 5 bedrooms, well I wish them good luck, not many offers to chose from. Also if we talk about living property only we should filter out things like GARAGE, PARKING or LAND.

I think both charts work pretty well. The first one presents a nice view of how flats and apartments outnumber all other types and the second one gives a much better understanding of how many of 2 bedroom properties offered compared to all others.

And while I'm not a big fan of fancy visualisations but if you need something less formal and more eye-catching try Bubbles chart. It's not something I'd recommend for an analysis but may work well for a presentation. Every bubble represents particular property type, colour shows a number of bedrooms and size shows the number of properties.

Going Deeper

The next obvious question is the price. How much do different properties cost? Is any particular one more expensive than average or less? What influences the price?

As a baseline, I'd like to know what is the average property price. And I obviously don't want just one figure for the city-wide price. It's meaningless. Let's start with a bar chart and see what is the range of prices.

Well, we have a lot of options. Flat share costs less than £700 or we may choose a barn for more than £3600. Again a very simple result but I can't get it directly from the site.

The next obvious question is how the number of bedrooms affects the price. Does the price skyrockets with every additional bedroom or maybe more bedrooms mean smaller rooms and price increases not too fast?

Well, this chart gives me the answer but it looks bad. Mostly because a lot of properties types don't have enough variance in room number. Studio flats have only one bedroom by definition and the only converted barn has 7 bedrooms. I'd like to remove types which don't have at least 3 options and see how the price changes. For this, I created a new computed field using fixed keyword. It counts the number of bedroom options by property type.

And then I use it in the filter 'Bedroom # variance' at least 3. Now I have a much more clean view. And I can see that typically more bedrooms mean significantly higher price with a few exceptions. But in fact, these are not actual exceptions just a problem of a small dataset. I can say that increase in # bedrooms certainly means a significant increase in price. And one more insight. Going above 7 bedrooms may actually double the price.

Averages are good but they hide important information of how prices are distributed. For example, six properties priced £1K and one £200 give average £885. And looking at average only may make you think that with £900 you may choose one of 7 options. It's very easy to build a chart to check this. Just create a new calculation called Bins and use in a chart.

With £100 bins I got the following chart. It shows how many properties have price falling to a particular price range. For example, the £1000 bin shows # of properties with prices £1000-£1100.

The distribution looks more or less as expected but the most interesting here is that £1000-£1100 interval seems to be very unpopular. Why? Let's add # of bedrooms to this chart.

£1000 is too expensive for 1 bedroom and studios but too cheap for two. Simple. What else can we do here before moving further? Converting this chart to a running total gives a cool view.

What can this chart tell us? For example, if we look at the orange line (2 bedrooms) we will find that with £1200 we may choose among 277 of 624 properties. With £1400 budget we have 486 of 624. Further £200 increase in budget won't significantly increase the number of possibilities and if the change from £1200 to £1400 almost doubled the number of possibilities, the next £200 give only 63 new options. I don't have a ready to use insight here, but I got a way to estimate a budget for a particular type of property. With budget £X I will be able to choose one of N properties.

Why It Costs What It Costs

OK, now I know a lot of statistics about prices. And my next question is about factors affecting the price. I'd like to understand does a particular property worth what it cost or not. Of course, I won't be able to determine exact price but even hints may be useful.

The first hypothesis I want to check is if a train station near raises the price or it isn't any important. I made a chart very similar to the previous one and it seems that Pareto principle works perfectly here. 80% or properties are closer than 20% of the maximum distance to a station.

But this chart doesn't say anything about the price it just gives me the understanding of how dense train stations are placed. I'd say that most of the properties have a station in 10-15 minutes of walking reach and therefore this should not significantly affect the price. My next chart is a scatter plot for price and distance. Every point is a property and its coordinates on the plot determined by its price and distance to the nearest station. Colour shows # of bedrooms.

I'd say that this chart shows no clear correlation between price and distance. And a more classical line chart shows that.

The maximum price slightly decreases with distance, minimum price on the contrary increases. Average price more or less constant. I think the hypothesis is busted. There is no clear correlation between the distance a tenant have to walk to a station and the price he has to pay. If you want to rent something and the landlord says that the price is high because of a train station near, tell him that there are stations all around and he should find something more interesting.

What about furnishings? Does it cheaper to get an unfurnished property or a landlord will be happy to meet someone who shares his taste?

Unfurnished property is definitely cheaper. And it's interesting that in some cases partly furnished even cheaper than completely unfurnished. But at least for furnished/unfurnished, we can see a clear correlation. When you see a furnished one for the price of unfurnished this may be a good pennyworth.

Another thing I'd like to check. Can we expect I lower price for a property not available immediately? Or is, on the contrary, the best price is offered for already unoccupied properties?

As always start with a general picture. What is the average time of availability by property types?

For most popular types it is about one month and if you have a house you typically publish it two or three months in advance. And what is about the price? One more chart that I like in Tableau. In the nutshell, it is a normal line chart showing an average price by days before property availability. But the thickness of lines shows the number of properties at the same time. So I can see not only the price but reliance too. A thick line means it was formed by many properties and a thin line may be formed by few properties and move up or down significantly then something changes. It would be very interesting to get a historical data and see how much time properties stay free or how long it takes before the price is reduced, but unfortunately, I don't have this data.

And looking at this chart I'd say that there is no statistically significant dependency for price and availability date. Renting a property available in the distant future won't save you money* (*=statistically).

And the last thing I'd like to investigate is the Key features. What do landlords put as the key features of their properties? How do they affect the price?

The list of popular Key features surprised me.

'Unfurnished' looks good to me, it is a really significant part of the deal. But 'Brighton'? For properties in Brighton? '1 Bedroom'. How many bedrooms can '1 bedroom flat to rent' have? Oh, there is a key feature saying '1 bedroom' now I know. But jokes aside. I had to make a lot of cleaning on this data before I could use it. There are six ways to write 'Modern kitchen'. Make everything upper case, then remove quotes, strip spaces and tabs, remove noisy features like 'stylish 1 bedroom apartment' and so on. After this, I got a slightly better list with approximately 3500 features instead of 4500. Note how all variants of writing 'GAS CENTRAL HEATING' now combined into one most popular feature. But there are still too many features. I'm sure that there should be not more than a hundred of them. Even at this screenshot you may see 'Unfurnished' and 'Unfurnished property' features.

When I need a visualisation for this amount of points, bar charts or tables won't play well. My weapon of choice is Scatter plot. Every point is a particular feature, axes are minimum and average prices of it, size is determined by the number of properties declaring to have this feature and the colour is the maximum price. So if a feature is located high on the plot it means that in average it will be expensive to have it. If this feature at the same time located close to the left side even cheap properties may have it. For example, if you want a swimming pool be ready to pay at least £3000 and £7000 in average. And the minimum price for tumble dryer is £3250 but average £3965. The cheapest property with a dryer is more expensive than with a pool, but in average pools are more expensive. That is how this chart works.

The problems of this chart are obvious. It is littered with unique features. Only one property has 4 acres (the point in top right corner). And actually not so many swimming pools are available for rent in Brighton. I filtered it by "# of properties > 25" and here is how prices for the most popular features are distributed.

Central location will cost you at least £100 and £1195 in average and for Great location be ready to pay at least £445 and £1013 in average. Great location seems to be less valuable than the central one.

And now I can see how a particular feature impacts prices. For example 'GAS HEATING'. I made a set with all variants of heating I could find ('GAS CENTRAL HEATING', 'GAS HEAT' and so on). Now I can analyse how this feature impacts properties. And here is how it impacts the price of flats. Blue circles are properties with gas heating and orange are without.

Very interesting in my opinion. The minimum price of properties with gas heating (blue circles) is higher than without. That is expected. But average price for properties without gas heating is higher.

And here are kitchen appliances. For 1 bedroom flats, they increase both minimum and average prices significantly. But for bigger flats minimum price with appliances is higher and average price is lower. Possible this option is important for relatively cheap properties, but its weight is not that big for the bigger ones.