VEDO Focus: How to Generate Solid Datasets from Social Streams?

Your answer to the above question will most likely contain the word “keywords” somewhere, and as you are saying it, you realize that keywords are just about the only levers you can operate at the base level when you define datasets in today’s social data environment.

Whether you use search engines, social listening tools or go straight to the source via DataSift’s platform, you will most likely — at some point — be thinking about keywords.

Well, that is all in the past.

Let me show you something that will make you rethink how you generate datasets, something that will not only save you a lot of time, but also give you entirely new perspectives to play with and let you chart entirely new territory in how you work with data streams to generate datasets and insights.

Let’s begin with two common problems that should be all too familiar to anyone tapping into the rich world of social data:

How do I filter signal from noise when searching for tricky brand names, for example Ford (the automotive manufacturer) or GAP (the apparel brand)?

How do I set a context for my data or how do I define an industry so that I have a sensible benchmark for my brand analysis? For example, how do I know how much mindshare Ford has in the automotive industry debate?

Sound familiar? With keyword-driven datasets, these are all too common problems, and often something we will spend a disproportionate amount of time dealing with.

Below, I will show you how to overcome these problems in a cool new way. You will be leveraging quite a feat of engineering when doing so. At DataSift, we have gone nuts and implemented something huge that runs in real time on our pipeline, something that is a true testament to the notion of big data.

We call it VEDO Focus, and it leverages a massive taxonomy of more than 450,000 categories that are applied through analysis of more than one billion keywords, all in real time!

Step 1: Get the data

First things first, let’s get some data to work with. Open your CSDL editor and type:

focus.content.levels.level3 any "Automotive Vehicles"

That’s it. We’re done.

What we just did was pull out any data that is tagged as “Automotive Vehicles” by the taxonomy (focus), thereby leveraging all of the underlying keywords, matching logic and rule definitions. No more spending hours defining brand names, researching industry terms and building keyword-based exclusions.

Step 2: Reduce noise

OK, the world isn’t perfect and the taxonomy is pretty rich. When we inspect the output in Live Preview, we see a mix of Tumblr and Facebook posts, but we want Tweets. We also see that “Automotive Vehicles” is a wide-reaching term in the taxonomy, as it contains many more specific categories.

First, we restrict our data to English Tweets:

// Limit to Twitter + English social data
interaction.type == "twitter" and (twitter.lang == "en" or twitter.retweet.lang == "en")

Then, since we are interested in the Automotive Industry, we will exclude categories that don’t fit that perspective – for example “Armored Cars” and “Trucks”:
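As a rough sketch only, here is how the conditions discussed so far might be assembled into a single CSDL filter string from Python. The first three clauses are the ones used in this walkthrough. The exclusion clauses are an assumption: the `focus.content.levels.level4` target and the `not (...)` form are guesses at where the more specific categories live and how they would be negated, so check both against the CSDL documentation. The helper name `build_automotive_filter` is hypothetical.

```python
def build_automotive_filter(excluded_categories=()):
    """Return one CSDL filter string built from the conditions in this walkthrough."""
    clauses = [
        # From Step 1: everything Focus tags as "Automotive Vehicles".
        'focus.content.levels.level3 any "Automotive Vehicles"',
        # From Step 2: Twitter only, English tweets and retweets.
        'interaction.type == "twitter"',
        '(twitter.lang == "en" or twitter.retweet.lang == "en")',
    ]
    for category in excluded_categories:
        # ASSUMPTION: level4 as the target for the narrower categories and
        # the negated form are illustrative, not confirmed CSDL.
        clauses.append(f'not (focus.content.levels.level4 any "{category}")')
    # One clause per line, joined with "and", keeps the filter easy to review.
    return "\nand ".join(clauses)


if __name__ == "__main__":
    print(build_automotive_filter(["Armored Cars", "Trucks"]))
```

Emitting one negated clause per category avoids any assumption about how a list of values is written inside a single `any` condition.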

How simple is that? We have the data we need, and it took us a fraction of the time it would normally take to build a similar dataset.

Step 3: Analyze the data

All we have to do now is run our CSDL and look at the output data. You can use any visualization tool; here we will look at our automotive industry data in Tableau.

In addition to the Focus augmentation, we are also using DataSift’s platform to augment the data with demographic information: gender detection, plus VEDO to create age groups.

Next, let’s utilize the fact that Focus not only helps us when filtering, it also provides us with a rich augmentation of all the categories that have been matched on the individual tweets.

So to recap

We used Focus to generate an industry dataset for automotive vehicles, but since Focus runs at the firehose level, every tweet is augmented and enriched with its matching categories. This enables us to pivot the dataset around features and, combined with our demographic data and age groups, quickly surface insight into what is front-of-mind for people (across gender and age) when it comes to automotive features.
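The pivot described above can be sketched in a few lines of Python, for readers who want to reproduce it outside a visualization tool. This is a minimal sketch under stated assumptions: the record layout (`focus_categories`, `gender`, `age_group`) is hypothetical and does not reflect DataSift's actual output schema, and the sample records and category names are made up purely for illustration.

```python
from collections import Counter


def count_categories_by_demographic(interactions):
    """Count how often each Focus category appears per (gender, age group)."""
    counts = Counter()
    for item in interactions:
        # ASSUMPTION: hypothetical field names, not the real output schema.
        gender = item.get("gender", "unknown")
        age_group = item.get("age_group", "unknown")
        # A tweet can match several categories, so it can count more than once.
        for category in item.get("focus_categories", []):
            counts[(category, gender, age_group)] += 1
    return counts


if __name__ == "__main__":
    # Made-up records, for illustration only.
    sample = [
        {"focus_categories": ["Fuel Economy"], "gender": "female", "age_group": "25-34"},
        {"focus_categories": ["Fuel Economy", "Safety"], "gender": "female", "age_group": "25-34"},
        {"focus_categories": ["Safety"], "gender": "male", "age_group": "35-44"},
    ]
    counts = count_categories_by_demographic(sample)
    for key in sorted(counts):
        print(key, counts[key])
```

Keying the counter on the full (category, gender, age group) tuple keeps the result flat, which makes it straightforward to export as rows for a tool such as Tableau.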
