Disclaimer:

These are my personal views and are meant for informational purposes only. Please verify the information with professional help or official references before acting on anything in this blog.

BigData

Neologism means the coining or use of new words, and I believe it's one of the challenges faced by IT professionals. Nowadays, we spend our time and energy trying to get our heads around new terms, words, and trends.

Let's take a couple of examples:

Some time back, we had cloud computing. Nowadays, it's Big Data. In my mind, Big Data has been coined to mean the following technologies/techniques in different contexts:

Note: The above image is just for illustration purposes. It does not comprehensively cover every technology that is now called "Big Data". Feel free to point out anything important that you think I missed.

And neologism is a challenge because:

1) Generally, it's a new trend and there is little to no consensus on what exactly it means.

2) It means different things in different contexts.

3) Every person can have their own "interpretation" and no one is wrong.

4) It's a moving target. The definition used today will change in the future, so we always need a "working" definition for these terms.

Now, don't get me wrong: it's fun trying to figure out what it all means and to gauge whether it matters to me and my organization. What do you think? As a person in information technology, do you think that neologism is one of the challenges we face? Consider leaving a reply in the comment section!

"The success of companies like Google, Facebook, Amazon, and Netflix, not to mention Wall Street firms and industries from manufacturing to retail and healthcare, is increasingly driven by better tools for extracting meaning from very large quantities of data," says Tim O'Reilly.

In 2002, The Data Warehousing Institute estimated that data quality problems cost U.S. businesses more than $600 billion a year. And of course, over the past 10 years, this number has probably grown. http://bit.ly/TPT9r3

I am always interested in how such advanced computers are built. In the case of Watson, it's fascinating how technologies such as natural language processing, machine learning, and artificial intelligence, backed by massive compute and storage power, were able to beat two human world champions. As a person interested in analytics and Big Data, I would classify this technology under Big Data and advanced data analytics, where a computer analyzes lots of data to answer a question asked in natural language. It also uses advanced machine learning algorithms. To that end, if you're interested in getting an overview of what went into building Watson, watch this:

If you're as amazed as I am, consider sharing what amazed you about this technology in the comment section.

In this blog post, I’ll document how I solved the error “The type or namespace name ComplexEventProcessing does not exist in the namespace Microsoft”. Here are the steps:

1. I browsed through the other errors and warnings as well. I was also missing assemblies from Reactive Extensions, so I added them first.

2. In my scenario, I had installed StreamInsight 2.0 successfully on my machine, but I had downloaded a sample that needed assemblies from StreamInsight 2.1. Notice the version mismatch here? That was the problem!

3. One of the messages said "Could not locate assembly Microsoft.ComplexEventProcessing version = 21.0.0.0". Notice the version = 21.0.0.0: it suggested that I needed the assemblies from StreamInsight 2.1.

4. So I downloaded “Microsoft® SQL Server® StreamInsight 2.1” and installed it. And it worked!

5. FYI: I found the Microsoft.ComplexEventProcessing assembly on my machine at C:\Windows\Microsoft.NET\assembly\GAC_MSIL\Microsoft.ComplexEventProcessing\*
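If you want to check which versions of the assembly are already on a machine before downloading anything, the lookup from step 5 can be scripted. Here is a minimal Python sketch, not an official tool. It assumes the GAC keeps one subfolder per version named like `v4.0_21.0.0.0__<publicKeyToken>`; the helper names `installed_versions` and `has_version` are my own.

```python
import re
from pathlib import Path

# Assumption: version folders under GAC_MSIL\<AssemblyName> are named like
# "v4.0_21.0.0.0__<publicKeyToken>". The pattern extracts the assembly version.
VERSION_DIR = re.compile(r"^v[\d.]+_(?P<version>\d+(?:\.\d+){3})__")


def installed_versions(assembly_dir):
    """Return the assembly versions found under one GAC assembly folder, sorted numerically."""
    root = Path(assembly_dir)
    if not root.is_dir():
        return []  # folder missing: nothing installed (or not a Windows machine)
    versions = []
    for child in root.iterdir():
        match = VERSION_DIR.match(child.name)
        if child.is_dir() and match:
            versions.append(match.group("version"))
    return sorted(versions, key=lambda v: tuple(int(part) for part in v.split(".")))


def has_version(assembly_dir, required):
    """True if the exact version string (e.g. "21.0.0.0") is installed."""
    return required in installed_versions(assembly_dir)


if __name__ == "__main__":
    gac = r"C:\Windows\Microsoft.NET\assembly\GAC_MSIL\Microsoft.ComplexEventProcessing"
    print("Installed versions:", installed_versions(gac))
    print("Has 21.0.0.0:", has_version(gac, "21.0.0.0"))
```

If `has_version` comes back false for 21.0.0.0, you are most likely in the same situation I was in: the sample targets StreamInsight 2.1 but only an older version is installed.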

That’s about it for this post. I hope it helps someone who is having issues with finding the assembly with the right version number to get started working with StreamInsight.

The Hadoop on Azure JavaScript console has basic graphing functions: bar, line, and pie. I think this is great because it gives you an opportunity to visualize data that's in HDFS directly from the interactive JavaScript console! Here's a screenshot:

In the console, I ran the help("graph") command to see how I can use this function:
Draw a graph of data
graph.bar(data, options) Bar graph
graph.line(data, options) Line graph
graph.pie(data, options) Pie chart

1. Upload Twitter Text Data into Hadoop on Azure cluster

2. Create a Hive Table and load the data uploaded in step 1 to the Hive Table

3. Analyze data in Hive via Excel Add-in

Before we begin, I assume you have access to Hadoop on Azure, have your sample data (don't have any? learn from a blog post), are familiar with the Hadoop ecosystem, and know your way around the Hadoop on Azure Dashboard.

Now, here are the steps involved:

Step 1: Upload Twitter Text Data into Hadoop on Azure cluster

1. Have the data you want to upload ready! I am just going to copy and paste the file from my host machine to the RDP'ed machine. In this case, the machine that I am connecting to is the Hadoop on Azure cluster.

For the purpose of this blog post, I have a text file containing 1,500 tweets:

2. Open web browser > Go to your cluster in Hadoop on Azure

3. RDP into your Hadoop on Azure cluster

4. Copy and paste the file. It's a small data file, so this approach works for now.

Step 2: Create a Hive Table and load the data uploaded in step 1 to the Hive Table

Note that for the purpose of this blog post, I've chosen string as the data type for all fields. The right choice depends on the data that you have. If I were building a real solution, I would spend more time choosing the right data types.
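To make the "everything is a string" choice concrete, here is a small Python sketch that only builds the HiveQL text you could paste into the Hive console. It does not connect to Hive. The table name tweetsampletable comes from this post, but the column names, the tab delimiter, and the file path are placeholders I made up for illustration; use whatever matches your own file.

```python
def build_create_table(table, columns, delimiter="\\t"):
    """Build a CREATE TABLE statement in which every column is a STRING."""
    cols = ",\n  ".join(f"{name} STRING" for name in columns)
    return (
        f"CREATE TABLE {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        "STORED AS TEXTFILE;"
    )


def build_load(table, local_path):
    """Build a LOAD DATA statement that copies a local file into the table."""
    return f"LOAD DATA LOCAL INPATH '{local_path}' OVERWRITE INTO TABLE {table};"


# Hypothetical columns and path, for illustration only.
columns = ["tweet_id", "created_at", "user_name", "tweet_text"]
print(build_create_table("tweetsampletable", columns))
print(build_load("tweetsampletable", "C:/data/tweets.txt"))
```

Keeping every field as STRING means the load won't fail on unexpected values, at the cost of casting later when you want to sort or aggregate on dates and numbers.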

Step 3: Analyze data in Hive via Excel Add-in

1. Switch to Hadoop on Azure Dashboard

2. Go to the Hive Console and run the show tables command to verify that there is a tweetsampletable.

3. Now, if you haven't already, download and install the Hive ODBC Driver from the Downloads section of your Hadoop on Azure Dashboard.

Suppose the text is "Hadoop on Azure sample Hadoop is on Windows Azure Hadoop is on Windows server". Then this is how you can think of what happens to your input when it is processed first by the Map function and then by the Reduce function:

INPUT                          MAP           REDUCE
Hadoop on Azure sample         Hadoop 1      Hadoop 3
                               on 1          on 3
                               Azure 1       Azure 2
                               sample 1      sample 1
Hadoop is on Windows Azure     Hadoop 1      is 2
                               is 1          Windows 2
                               on 1          server 1
                               Windows 1
                               Azure 1
Hadoop is on Windows server    Hadoop 1
                               is 1
                               on 1
                               Windows 1
                               server 1
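The same flow can be mimicked in a few lines of plain Python. This is only a single-machine sketch to illustrate the idea, not how Hadoop actually runs a job. The function names `map_phase`, `shuffle`, and `reduce_phase` are my own, and the shuffle step stands in for the grouping that the framework does for you between Map and Reduce.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit one (word, 1) pair for every word, in input order."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def shuffle(pairs):
    """Group the emitted values by word (done by the framework in real Hadoop)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Reduce: sum the list of 1s collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


lines = [
    "Hadoop on Azure sample",
    "Hadoop is on Windows Azure",
    "Hadoop is on Windows server",
]

mapped = list(map_phase(lines))
counts = reduce_phase(shuffle(mapped))

print(len(mapped))  # 14 (word, 1) pairs, one per word
print(counts)
# {'Hadoop': 3, 'on': 3, 'Azure': 2, 'sample': 1, 'is': 2, 'Windows': 2, 'server': 1}
```

Note that the count is case-sensitive here: "on" and "On" would be treated as two different words unless you lowercase the input in the Map step.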

Conclusion:

In this blog post, we visualized how the MapReduce algorithm operates for a WordCount example.