Cloud Bazaar


Sunday, March 29, 2015

This will be a blog series on everything learned during the course of developing a tool on Azure Stream Analytics (ASA) called Smurf. This introductory post starts with the bits and pieces we need to create this tool.

I will not describe what ASA is or what it does; that is covered well on MSDN. Instead, I will focus on the other tools required to build this project.

What are the components?

Azure Stream Analytics can work on 3 kinds of inputs.

1. Data from SQL Azure

2. Data from Azure Storage

3. Data from Event Hubs.

We will touch on each component as we proceed.

Starting with the Event Hub component

I am starting with the Event Hub component first because of its impressive success story with ASA. Event Hubs work on partitioned data and make it easy for developers to scale up, so the combination of Event Hubs and ASA should indeed make an impressive story.

To build Event Hub support, these are the components we will highlight as we go:

1. Send data to the Event Hub

2. Process data based on user-defined business logic (this seems tricky, let's see how it goes)

3. Receive data from the Event Hub

Send data to Event Hub

Sending data to an Event Hub is pretty much straightforward: you create a client against your hub and hand it the event payload.
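Here is a minimal sketch, assuming the .NET Service Bus SDK (Microsoft.ServiceBus.Messaging); the connection string and hub name below are placeholders:

using System.Text;
using Microsoft.ServiceBus.Messaging;

class EventHubSender
{
    static void Main()
    {
        // Placeholder values; take these from the Azure portal.
        string connectionString = "Endpoint=sb://...;SharedAccessKeyName=...;SharedAccessKey=...";
        string eventHubName = "myeventhub";

        // Create the client and send one event as a UTF-8 JSON payload.
        var client = EventHubClient.CreateFromConnectionString(connectionString, eventHubName);
        client.Send(new EventData(Encoding.UTF8.GetBytes("{\"deviceId\":1,\"value\":42}")));
        client.Close();
    }
}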

Tuesday, December 16, 2014

This article describes how to parallelize trace investigations using Hadoop MapReduce and thereby reduce the time and effort they take.

Current Problem

1. During on-call events, significant time and effort go into identifying the exact issue. There is no mechanism that can predict probable issues.

2. Traces are big data. Files of many gigabytes are scanned to filter out the exact traces. Currently we download traces into a local buffer and scan through them, and parallel execution of multiple filters puts additional load on the system.

3. We have 10-15 filter strings, but executing all of them in one go is not possible in the current setup.

Proposed Solution

We propose using HDInsight to run Hive queries and execute them in parallel with MapReduce. MapReduce is a technique that divides large data into multiple chunks and sends them to mappers. Mappers are executors that each work on a small slice of the data and produce output. The outputs from the different mappers are then combined and condensed into a cumulative result by the reducer.
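To make that concrete, here is a minimal sketch of a streaming-style mapper for our trace scenario, written against Hadoop streaming's stdin/stdout convention; the filter strings are made-up examples:

using System;

class TraceFilterMapper
{
    // Hypothetical filter strings; in practice these would be the 10-15
    // known trace patterns used during investigations.
    static readonly string[] Filters = { "Exception", "Timeout", "QuorumLoss" };

    static void Main()
    {
        // Hadoop streaming hands each mapper its chunk of the input on stdin.
        string line;
        while ((line = Console.ReadLine()) != null)
        {
            foreach (string filter in Filters)
            {
                if (line.Contains(filter))
                {
                    // Emit "filter<TAB>trace line"; the reducer then groups
                    // matches by filter and merges output from all mappers.
                    Console.WriteLine(filter + "\t" + line);
                    break;
                }
            }
        }
    }
}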

We can run Hive QL queries over the winfab traces to apply several of the known filters to the big data in one pass and extract the matching traces in text format.

Before going forward, here is some terminology that will help with the technologies used for this parsing.

What is Hive

Hive is a data warehousing infrastructure based on Hadoop. Hadoop provides massive scale-out and fault tolerance capabilities for data storage and processing (using the map-reduce programming paradigm) on commodity hardware.

Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL and enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL also allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
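For example, assuming the uploaded traces are exposed as a Hive table of raw text lines (the table name, column name and filter strings below are hypothetical), several known filters can be applied in a single pass:

-- Hypothetical external table over the uploaded trace files.
CREATE EXTERNAL TABLE winfab_traces (line STRING)
STORED AS TEXTFILE
LOCATION '/traces/input';

-- Apply multiple known filter strings in one scan of the big data.
SELECT line
FROM winfab_traces
WHERE line LIKE '%Exception%'
   OR line LIKE '%Timeout%'
   OR line LIKE '%QuorumLoss%';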

HDInsight is a Hadoop distribution powered by the cloud. This means HDInsight was architected to handle any amount of data, scaling from terabytes to petabytes on demand. You can spin up any number of nodes at any time.

This could have been done in a shorter span, but considering my laziness, the delay was bound to happen :)

Following is the story line, to the best of my knowledge.

March 11, 2013 6:00pm

One of the members of Microsoft's Most Valuable Professional group saw the Data Visualization section in NetMe and asked me about my experience with NodeXL.

NodeXL is a free, open-source set of libraries that helps you create a graph from a GraphML or GEXF file. You can also create custom graphs using the libraries. It is easy to learn and very effective for data visualization.

The issue was that NodeXL comes with a Microsoft Office Excel dependency. He was looking for an approach to remove this dependency and create a Windows Forms app that would fetch Twitter-related nodes and their children and show them in graphical form.

NodeXL was totally new to me, and creating an app on it seemed challenging. I thought I would give it a try and see how it went.

March 11, 2013 9:00 pm

I learned the basics of what NodeXL does and downloaded its source code, but diving into the 18 projects in the source repository was not only time consuming but also not feasible for me. So I posted a discussion in the NodeXL community forum at http://nodexl.codeplex.com/discussions/436250

I was really pleased by the moderator's quick response to my query. At around 10 pm IST I heard back from him that it was possible and that there was a library set we could download and start using.

March 13, 2013

By now I was fairly confident about using NodeXL. I prepared a small POC in which I created a graph from a sample GraphML file. The GraphML format is used to describe directed, undirected and mixed graphs, and it is based on XML.
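A sample GraphML file, at its most minimal, describes a couple of nodes and an edge between them (the ids here are arbitrary):

<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="undirected">
    <node id="n0"/>
    <node id="n1"/>
    <edge source="n0" target="n1"/>
  </graph>
</graphml>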

Previous NodeXL community discussions helped me a lot here. I am not showing that code, as it can easily be found in the NodeXL community discussion forum. Instead, let me focus on the main problem statement: how to make NodeXL interact with the Twitter API.

At first I thought I would take the community's help on this, so I posted a second thread on the forum.

March 14, 2013

The moderator replied with a complete step-by-step approach to fetching Twitter data with the NodeXL libraries. This was indeed a great help, but at that moment the Azure Mobile Services script I had written for NetMe came to mind, and I thought I would reuse that code.

The code fetched the top 18-20 tweets, with their usernames, for any searched keyword.
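The original was an Azure Mobile Services server-side script; here is a rough C# sketch of the same idea against the old public v1 search endpoint that was still available in early 2013 (the keyword, URL parameters and output handling are illustrative):

using System;
using System.Net;

class TwitterSearch
{
    static void Main()
    {
        // The pre-June-2013 public v1 search endpoint; no OAuth was needed then.
        // "rpp" caps the results per page at roughly the top 20 tweets.
        string keyword = "nodexl";
        string url = "http://search.twitter.com/search.json?q="
                     + Uri.EscapeDataString(keyword) + "&rpp=20";

        using (var client = new WebClient())
        {
            // Returns JSON containing the tweets and their usernames, which can
            // then be turned into NodeXL vertices and edges.
            string json = client.DownloadString(url);
            Console.WriteLine(json);
        }
    }
}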

// NodeXL's graph control is a WPF control. To host it in a Windows Forms
// app, create an ElementHost control and assign the graph control as its child.
elementHost_Graph.Child = nodexl_Graph;

POC

This app has been uploaded to the downloads page at http://netme.cloudapp.net/. To try it, download the zip folder, extract it and run the MSI file. It will be installed on your machine and you will see a NetMe shortcut on your desktop. As this is a POC, I have not focused much on the look and feel of the app.

Once the app is installed, you will be able to search for any keyword and see the corresponding graph. You can move the nodes around as needed.

The main intent of this blog was to highlight how fast and easy it is to code with the NodeXL libraries and create your own data visualization tool in a short span of time. Hope you will like it :)

About Me

Microsoft Certified Professional in Design and Develop in Azure (MS 70-583). Guinness World Record holder for participating in an 18-hour continuous Windows 8 AppFest coding marathon. Currently working in Windows Azure Fabric. Past experience in Windows 8 Metro style apps and Azure Mobile Services; created a BillShare application that is published in the Windows 8 marketplace. Earlier experience on the CTP build of Hadoop on Azure: big data statistics for classifier/clustering and recommendation algorithms, and big data study and development of classification/clustering/recommendation algorithms on the Microsoft Daytona map-reduce framework.