The personal view on the IT world of Johan Louwers, specially focusing on Oracle technology, Linux and UNIX technology, programming languages and all kinds of nice and cool things happening in the IT world.

Sunday, December 16, 2012

Big data Sentiment analysis

A lot of people who are currently working on, or thinking about, big data do directly start talking about how they can implement a certain things. How much data you will have to store and how much data you will have to process to find meaning in big data. A lot of people start talking about how they will need to store things, how they will need to build implementations of Hadoop and the Hadoop Distributed File System to be able to cope with the load of data.

I recently had a discussion with a couple of people and we where thinking about the fact if we could dynamically route taxis in a city based upon big-data and social media feeds you could receive from the internet. The second thing we where discussing was if we could measure the satisfaction of the user by linking his tweets back to the trip he just made in the taxi.

It is true that for receiving and processing the data coming in which needs processing and finally will result in routing taxi's will be enormous. Secondly it is true that you will need a lot of computing power and yes you will most likely need Hadoop like solutions for processing this. However there is a much more subtle part to this what a lot of people are overlooking.

Finding the need for routing one or more taxi's to a part of the city where we expect to be able to find a lot of customers based upon social media and historical data is challenging however there is an ever bigger and even more challenging part and that is in they end. That is the part we brought in while discussing if we could measure if the customer was happy by the things he published on social media just after he made use of the taxi.

Lets say we know a person called Tim entered a taxi and he is having the twitter account @TimTheTaxiCustomer. He is picked up and is dropped of during rushhour a couple of blocks away from his original location. We are monitoring the twitter feed from Tim and now he is sending a tweet with the following text:

"made it to my meeting, wonder what would happen if I had taken the subway"

He also had possibly could send out a tweet with;

"Made it to my meeting, never had this when is used the subway"

Looking at those text for a human it is, with some luck, possible to find out what Tim is meaning and if he is happy with the service. You will have to have some skills however we can conclude that with the first tweet he is somewhat jokingly referring to the subway and insinuating that he would have never made it with the subway. Meaning good service from the taxi company. The second tweet is however indicating that he had never experienced this with the subway which could mean that he had to run to make it to his meeting.

This process is even for humans complex from time to time and is very driven by local slang, local ways of saying things and culture influences. Having a human however read all the messages of all our potential taxi customers is not feasible due to the amount of messages that need to checked and the speed they are produced by. To be able to cope with the load we will need an algorithm to check the messages and to enable the algorithm to keep up with the pace of messages generate we will have to run this on a cluster which makes use of massive parallel computing.

Creating a algorithm capable of understanding human emotions is the field of sentiment alnalysis.
"Sentiment analysis or opinion mining refers to the application of natural language processing, computational linguistics, and text analytics to identify and extract subjective information in source materials.Generally speaking, sentiment analysis aims to determine the attitude
of a speaker or a writer with respect to some topic or the overall
contextual polarity of a document. The attitude may be his or her
judgment or evaluation (see appraisal theory),
affective state (that is to say, the emotional state of the author when
writing), or the intended emotional communication (that is to say, the
emotional effect the author wishes to have on the reader)."

In the below video Michael Karasick is toutching this subject in quite a good way in his IBM research lab video;

IBM Research - Almaden Vice President and Director Michael Karasick
presents a brief overview of "Managing Uncertain Data at Scale." Part of
the 2012 IBM Research Global Technology Outlook, this topic focuses on
the Volume, Velocity, Variety and Veracity of data, and how to harness
and draw insights from it.