You have heard the news: embracing microservices and containers is essential to making your organisation's IT operations more agile, providing immediate operational benefits. Let us have a look at a few essential concepts to help you start a conversation with your internal IT, developers and vendors.

Microservices

Well, as Martin Fowler put it, "Microservices - yet another new term on the crowded streets of software architecture" - and it is what everybody is talking about.

In simple terms, microservice architecture packages each unit of functionality (mentally and physically) into a service, and you can distribute and scale these services independently. In a traditional monolithic web or enterprise application, if you need to change a simple piece of functionality, you have to rebuild and redeploy the whole application. In a microservice architecture, you can deploy and scale services individually.

Now, this has multiple advantages. You can scale only the services you need, to distribute the load effectively - i.e., if you see that your customers are using your Order service more than the others, you can scale up only the Order service instances, say from 10 to 20. Though none of this is new, the evolution of container technologies accelerated microservice-based systems, and enabled organizations to adopt a very agile, continuous-delivery-based workflow to build and deploy applications faster.

Containers

‘Container’ is probably the most abused term this year, after ‘locker room talk’. In its original sense, as used today in DevOps, the term emerged from LXC (Linux Containers). LXC is an OS-level virtualization method for running multiple isolated Linux systems (containers) on a control host using a single Linux kernel.

What is the difference between LXC and virtual machines? As you are aware, standard virtualization systems (like KVM, VirtualBox, etc.) let you boot full operating systems of different kinds, even non-Linux systems. The main difference is that each virtual machine requires a separate kernel instance to run on - i.e., almost a full standalone OS.

However, multiple LXCs can be deployed on top of the same kernel (ah, Microsoft didn’t see that coming) - so LXCs are much cheaper to create and destroy (from a memory and processor footprint perspective) compared to virtual machines. One Linux container (LXC) can run a single process, and as long as you don’t give root permissions to that process, you can impart some level of security to what your container is running. To be fancy - you can group containers into Pods, and run them on Nodes. (Side note - Microsoft recently announced Windows Containers - have a look.)

Platforms like Docker provide an easy workflow for developers to package their application into a container ‘image’, so that instances of this container can be spun up later in a very easy way.
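As a sketch of the idea (the base image, file paths and image name below are hypothetical, just for illustration), a Dockerfile that packages a small service might look like this:

```dockerfile
# Hypothetical service - base image and paths are illustrative
FROM openjdk:8-jre-alpine
COPY target/order-service.jar /app/order-service.jar
EXPOSE 8080
CMD ["java", "-jar", "/app/order-service.jar"]
```

Running docker build -t myorg/order-service:1.0 . produces an image, and each docker run against that image spins up an independent container instance.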

Pods

In simple terms - containers that need to co-exist on the same kernel/virtual machine/node, along with related runtime information, are grouped together as a Pod. So, a Pod is essentially a group of containers that should co-exist. Typically, an application running in one container can access another container via ‘localhost’, as long as both containers are in the same Pod. Containers within the same Pod will also mostly share the same storage context - much like two applications running in one virtual machine.
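As a sketch (all names and images here are hypothetical), a Kubernetes Pod declaring two containers that co-exist and talk over localhost might look like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: order-pod                      # hypothetical name
spec:
  containers:
  - name: order-service
    image: myorg/order-service:1.0     # hypothetical image
    ports:
    - containerPort: 8080
  - name: order-cache
    image: redis:alpine
    # reachable from order-service at localhost:6379
```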

Volumes

A Volume is an abstraction you can use for storage, which containers can use to read/write data. So, containers of the same Pod can share a ‘Volume’. From a Kubernetes perspective, Volumes are attached to Pods - so even if a container crashes, the files needed to restart the container can be kept in the Volume. But when you remove/delete a Pod, you normally throw away the Volume related to that Pod as well (in simple scenarios).
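To illustrate (again with hypothetical names), a Pod-scoped volume shared by two containers could be declared like this; an emptyDir volume survives container restarts but is thrown away with the Pod:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: writer-reader-pod        # hypothetical name
spec:
  volumes:
  - name: shared-data
    emptyDir: {}                 # lives exactly as long as the Pod does
  containers:
  - name: writer
    image: myorg/writer:1.0      # hypothetical image
    volumeMounts:
    - name: shared-data
      mountPath: /data
  - name: reader
    image: myorg/reader:1.0      # hypothetical image
    volumeMounts:
    - name: shared-data
      mountPath: /data
```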

Nodes

You can consider a Node as a worker machine (either a virtual machine or a bare-metal physical machine). Nodes run Pods, and multiple nodes are managed by one or more master nodes to form a cluster.

Clusters

A cluster is a set of nodes, together with the pods/containers deployed on them. A cluster normally has one or more master nodes that manage the pods/containers deployed on the worker nodes - the master is responsible for ensuring the requested number of container instances is up and running all the time, and also for providing API access to the containers in the cluster.

Cluster Federations

Typically, a cluster runs in a single on-premises data center, or in a single availability zone in the case of cloud providers - now, what if these clusters could be tied to each other and federated? This enables interesting use cases, like the ability to overflow your workloads from one cluster to another. For example, an application can run in a private/on-premises cloud and burst into a public cloud when the demand for compute overflows a specific limit (typically called cloud bursting). The easiest way is to start with Kubernetes Cluster Federation.

Why Does the Microservices Pattern Love Containers?

As containers are easy to spin up and tear down, they became the favorite model for packaging and shipping your microservices. You can create a service, package it into a container - and deploy it independently. Docker became so popular because of its ability to build, package and deploy applications/services using a lightweight container. You can use a Docker image to spin up multiple container instances.

Kubernetes, Docker Swarm, etc. went one step further, allowing you to define and deploy containers at scale, forming a whole cluster of pods. For example, container orchestration engines like Kubernetes will let you specify the whole cluster configuration - including how many containers you need per service/application, and how exactly they should talk to each other.

So, start from here and think about how to be more agile - and re-architect your own enterprise to build and deliver business benefits faster, in an agile way, embracing containerization.

PS - This is a fast-evolving space, and there are a lot of players and platforms in the market. Most of the time, an apples-to-apples comparison is not possible between the tools and platforms. But if you are looking to go one step further, have a look at container platforms like OpenShift (https://www.openshift.com/), Cloud Foundry (https://www.cloudfoundry.org/) etc. Macro-level platforms/orchestration tools like Fabric8 (https://fabric8.io/) are also becoming mature - allowing you to spin up your entire DevOps pipeline as a platform - and optimize and manage everything using a unified user experience.

How many of the decisions you made today were influenced by some kind of algorithm?

When you asked Google Maps to show you the shortest driving route? When you asked Siri to show you a hotel for your breakfast? When you checked Flipboard or Twitter to find recent stories to satisfy your intellectual appetite? Or when you found your date based on those recommendations and profile matching?

Wait - how exactly did you find this article, and why are you reading it now?

Today, most of our decisions are made with the help of apps. In other words, these apps and the related algorithms - maybe somewhere in the cloud, wired to them - influence almost all our decisions, modifying or seriously impacting our behaviour. They influence our thoughts and decisions by making suggestions, deciding what information we see or don't see - even deciding whom we should follow or date. And every day, a lot of people, including me, spend most of their time enriching these algorithms, and building new apps using them, to help (influence) all of us, based on social data, past behaviour and what not.

All hail Google Now, Siri, and Cortana. And those recommendations and ads springing up from everywhere, persuading you. And the algorithms behind all of them.

So, you could theoretically argue that we are at a point in history where a connected human being's nature is seriously influenced and/or modified by a complex nexus of apps and algorithms.

The technological singularity, or simply the singularity, is a hypothetical moment in time when artificial intelligence will have progressed to the point of a greater-than-human intelligence, radically changing civilisation, and perhaps human nature.

This post is not to alert you that SkyNet will take over tomorrow. Nor is the intention to state that this is either 'good' or 'bad' - two mediocre, relative terms.

You still have the choice to switch off, but the persuasion to make things easy by delegating them to an app wins almost every time. And I think that is fine - as long as we carefully exercise our free will to make the final decision.

In my last post, we explored how to create a tiny Roslyn app to compile C# code, to test out some of the new C# 6.0 features. Read it here.

Now go ahead and play with some of the preview features. Try things out. Create a file with some C# 6.0 sugar and compile it with our above app. You can explore the completed language features in this list, from the Roslyn documentation in CodePlex.

Here is some quick code that demos some of the features.

1 - Support for Primary Constructors and Auto-Property Assignments
Primary constructors allow you to specify constructor arguments as part of your class declaration itself. Also, C# now supports assigning initial values to auto-properties. Together, you may use them to initialise classes, as shown below.

Some of the C# 6.0 features are exciting. And you can try them out now, as the new Roslyn preview is out. You can explore the completed language features in this list, from the Roslyn documentation in CodePlex. Some of the ‘Done’ features for C#, based on the documentation there, include:

Here is a quick intro screencast on Big Data and creating map-reduce jobs in C# to distribute the processing of large volumes of data, leveraging Microsoft Azure HDInsight / Hadoop on Azure, based on my Virtual TechDays presentation.

I just checked in a ScriptCs Templating module to integrate Razor and StringTemplate transformations into the ScriptCs workflow. The ScriptCs Templating module can apply a Razor or StringTemplate (ST4) template on top of one or more model files (normally an XML or JSON file), for scenarios like code generation or templating. Example below.

You can specify the output file using the -out switch: scriptcs mytemplate.cst -modules template -- -out result.txt (the parameters after -- are the template module parameters, according to ScriptCs convention).

Rendering a template Using Models

The Template module automagically converts XML files/URLs and JSON files/URLs to dynamic models that can be used from your template. Technically, it creates a C# fluent dynamic object that wraps the XML/JSON.

Quick example: Create a new folder, and create a model.xml file inside that.

Over the last few weekends I’ve spent some time building a simple robot that can be controlled using Kinect. You can see it in action below.
Ever since I read the Cisco paper that mentions the Internet of Things will create a whopping $14.4 trillion at stake, I have revamped my interest in hobby electronics and started hacking with DIY boards like Arduino and Raspberry Pi. That turned out to be fun, and ended up with this robot. This post provides the general steps, and the GitHub code may help you build your own.
Even if you don’t have a Kinect for the controller, you can easily put together a controller using your phone (Windows Phone/Android/iOS), as we are using Bluetooth to communicate between the controller and the robot.

Now, here is a quick-start guide to building your own. In this case, we have an app running on the laptop that takes input from Kinect and pumps commands to the robot via Bluetooth; you could easily build a phone UI as well.
And if you already got the idea, here is the code - you may read further to build the hardware part.

1 – Build familiarity

You need to build some familiarity with Arduino and/or Netduino – In this example I’ll be using Arduino.

Mainly, you need to understand the pins on the Arduino board. You can write simple programs with the Arduino IDE (try the Blink sample, which blinks an LED, under IDE File->Samples). Below is the pin description, from the SparkFun website.

GND (3): Short for ‘Ground’. There are several GND pins on the Arduino, any of which can be used to ground your circuit.

5V (4) & 3.3V (5): The 5V pin supplies 5 volts of power, and the 3.3V pin supplies 3.3 volts of power. Most of the simple components used with the Arduino run happily off of 5 or 3.3 volts. If you’re not sure, take a look at Spark Fun’s datasheet tutorial then look up the datasheet for your part.

Analog (6): The area of pins under the ‘Analog In’ label (A0 through A5 on the UNO) are Analog In pins. These pins can read the signal from an analog sensor (like a temperature sensor) and convert it into a digital value that we can read.

Digital (7): Across from the analog pins are the digital pins (0 through 13 on the UNO). These pins can be used for both digital input (like telling if a button is pushed) and digital output (like powering an LED).

PWM (8): You may have noticed the tilde (~) next to some of the digital pins (3, 5, 6, 9, 10, and 11 on the UNO). These pins act as normal digital pins, but can also be used for something called Pulse-Width Modulation (PWM). We have a tutorial on PWM, but for now, think of these pins as being able to simulate analog output (like fading an LED in and out).

AREF (9): Stands for Analog Reference. Most of the time you can leave this pin alone. It is sometimes used to set an external reference voltage (between 0 and 5 Volts) as the upper limit for the analog input pins.

Protocols

2 – Get The Components

Again, you could find an online store to buy these components. Also, you could try the nearby local electronics store and buy some breadboards, jumper wires (get some male-to-male and female-to-female wires), etc. as well. Here is the list of components you need to build CakeRobot.

A chassis – I used the Dagu Magician Chassis – from SparkFun or Rhydolabz – it comes with two geared motors that can be controlled by our driver board.

An Arduino board with a motor driver – I used the Dagu Mini Motor Driver, bought from Rhydolabz in India. For other countries, you will need to search and find one. Some description of the board can be found here – it also has a special slot to plug in the Dagu Bluetooth shield. You could also use the Micro Magician.

You also need a Micro USB cable to connect your PC to the motor driver to upload code.

3 – Programming the components

You need to spend some hours figuring out how to program each of the components.

To start with, play with the Arduino a bit, connecting LEDs, switches, etc. Then, understand a bit about programming the digital and analog pins. Play with the examples.

Try programming the ultrasonic sensor, if you have one, using your Arduino over serial. If you are using a Ping sensor, check out this.

Try programming the Bluetooth module (the code I used for the distance sensor and Bluetooth module is in my examples below, but it’ll be cool if you can figure things out yourself).

4 – Put the components together

Assemble the Dagu Magician chassis, and place/screw/mount the mini motor driver and Bluetooth module on top of it. Connect the components using jumper wires/plugs as required. A high-level schematic is below.

Here is a low resolution snap of mine, from top.

5 – Coding the Arduino Mini Driver

You can explore the full code in the GitHub repo - however, here are a few pointers. According to the Dagu Arduino Mini Driver spec, the following digital pins can be used to control the motors:

D9 is left motor speed

D7 is left motor direction

D10 is right motor speed

D8 is right motor direction

To make a motor move, first we need to set the direction by doing a digitalWrite of HIGH or LOW (for forward/reverse) to the direction pin. Next, set the motor speed by doing an analogWrite of 0-255 to the speed pin - 0 is stopped and 255 is full throttle.
In the Arduino code, we initiate communication via Bluetooth, accepting commands as strings. For example, speedl 100 will set the left motor speed to 100, and speedr 100 will set the right motor speed to 100. The relevant code is below.
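The command-to-pin mapping is easy to simulate. Here is a Python sketch of the parsing logic (not the actual Arduino C code, which lives in the repo; the dirl/dirr direction commands are my hypothetical naming - only speedl/speedr appear above):

```python
# Pin assignments from the Dagu mini driver spec quoted above
SPEED_PINS = {"speedl": 9, "speedr": 10}    # analogWrite targets, 0-255
DIRECTION_PINS = {"dirl": 7, "dirr": 8}     # digitalWrite targets, HIGH/LOW

def parse_command(command):
    """Parse a command like 'speedl 100' into a (pin, value) pair.

    Speed values are clamped to the analogWrite range 0-255;
    direction values become 1 (HIGH, forward) or 0 (LOW, reverse).
    """
    name, _, arg = command.strip().partition(" ")
    if name in SPEED_PINS:
        return SPEED_PINS[name], max(0, min(255, int(arg)))
    if name in DIRECTION_PINS:
        return DIRECTION_PINS[name], 1 if arg == "HIGH" else 0
    raise ValueError("unknown command: " + command)

# parse_command("speedl 100") yields (9, 100): pin D9, duty 100
```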

Have a look at the full code of the quick Arduino client here. Then, compile and upload the code to your mini driver board.

6 – Coding the Controller & Kinect

Essentially, what we are doing is just tracking the skeletal frame, and calculating the distance of your hand from your hip to provide the direction and speed for the motors. Skeletal tracking details are here.
We are leveraging http://32feet.codeplex.com/ for identifying the Bluetooth shield to send the commands. Please ensure your Bluetooth shield is paired with your PC/laptop/phone - you can normally do that by clicking the Bluetooth icon in the system tray in Windows, and clicking Add Device.

In my last post, we had a look at Interactive Extensions. In this post, we’ll do a recap of Reactive Extensions and LINQ to event streams.

Reactive Extensions have been out in the wild for some time, and I wrote a series about Reactive Extensions a few years back. However, after my last post on Interactive Extensions, I thought we should discuss Reactive Extensions in a bit more detail. Also, in this post we’ll touch on IQbservables – the most mysteriously named thing/interface in the world, maybe after the Higgs boson. Push and pull sequences are everywhere – and now, with devices on one end and the cloud at the other, most data transactions happen via push/pull sequences. Hence, it is essential to grab the basic concepts of the programming models around them.

First Things First

Let us take a step back and discuss IEnumerable and IQueryable first, before discussing the reactive IObservable and IQbservable (Qbservables = Queryable Observables – oh yes, funny name).

IEnumerable<T>

As you may be aware, the IEnumerable model can be viewed as a pull operation. You get an enumerator, and then you iterate the collection by moving forward using MoveNext over a set of items till you reach the final item. Pull models are useful when the environment is requesting data from an external source. To cover some basics - IEnumerable has a GetEnumerator method which returns an enumerator with a MoveNext() method and a Current property. Offline tip - a C# foreach statement can iterate over any dumb thing that can return a GetEnumerator. Anyway, here is what the non-generic version of IEnumerable looks like.

Now, LINQ defines a set of operators as extension methods on top of the generic version of IEnumerable – i.e., IEnumerable<T> – so by leveraging the type inference support for generic methods, you can invoke these methods on any IEnumerable without specifying the type. I.e., you could say someStringArray.Count() instead of someStringArray.Count<String>(). You can explore the Enumerable class to find these static extensions.

The actual query operators in this case (like Where, Count, etc.) with their related expressions are compiled to IL, and they operate in-process, much like any IL code executed by the CLR. From an implementation point of view, the parameter of a LINQ clause like Where is a lambda expression (as you may already know, the from..select syntax is just sugar that gets expanded to extension methods of IEnumerable<T>), and in most cases a delegate like Func<T,..> can represent an expression from an in-memory perspective. But what if you want to apply query operators on items sitting somewhere else? For example, how do you apply LINQ operators on top of a set of data rows stored in a table in a database that may be in the cloud, instead of an in-memory collection that is an IEnumerable<T>? That is exactly what IQueryable<T> is for.

IQueryable<T>

IQueryable<T> is an IEnumerable<T> (it inherits from IEnumerable<T>), and it points to a query expression that can be executed in a remote world. The LINQ operators for querying objects of type IQueryable<T> are defined in the Queryable class, and return Expression<Func<T,..>> when you apply them on an IQueryable<T>, which is a System.Linq.Expressions.Expression (you can read about expression trees here). This will be translated to the remote world (say, a SQL system) via a query provider. So, essentially, a concrete IQueryable implementation points to a query expression and a query provider – it is the job of the query provider to translate the query expression to the query language of the remote world, where it gets executed. From an implementation point of view, the parameters you pass to LINQ operators applied on an IQueryable are assigned to an Expression<Func<T,..>> instead. Expression trees in .NET provide a way to represent code as data, a kind of abstract syntax tree. Later, the query provider will walk through this to construct an equivalent query in the remote world.

For example, in LINQ to Entity Framework or LINQ to SQL, the query provider will convert the expressions to SQL and hand it over to the database server. You can even view the translation to the target query language (SQL) while debugging. In short, the LINQ query operators you apply on an IQueryable will be used to build an expression tree, and this will be translated by the query provider to build and execute a query in a remote world. Read this article if you are not clear about how expression trees are built using Expression<T> from lambdas.

Reactive Extensions

So, now let us get into the anatomy and philosophy of observables.

IObservable <T>

As we discussed, objects of type IEnumerable<T> are pull sequences. But then, in the real world, at times we push things as well – not just pull. (Health alert – when you do both together, make sure you do it safely.) In a lot of scenarios, the push pattern makes a lot of sense – for example, instead of you waiting in a queue infinitely, day and night, with your neighbors in front of the local post office to collect snail mail, the post office agent will just push the mail to your home when it arrives.

Now, one of the cool things about push and pull sequences is that they are duals. This also means IObservable<T> is the dual of IEnumerable<T> – see the code below. So, to keep the story short, the dual interface of IEnumerable, derived using categorical duality, is IObservable. The story goes that some members of Erik’s team (he was with Microsoft then) had a well-deserved temporal megalomaniac hyperactive spike when they discovered this duality. Here is a beautiful paper from Erik on that, if you are interested – a brief summary of Erik’s paper is below.

Now, LINQ operators are cool. They are very expressive, and provide an abstraction to query things. So the crazy guys in the Reactive team thought they should make LINQ work against event streams. Event streams are in fact push sequences, instead of pull sequences. So, they built IObservable. The IObservable fabric lets you write LINQ operators on top of push sequences like event streams, much the same way you query IEnumerable<T>. The LINQ operators for an object of type IObservable<T> are defined in the Observable class. So, how would you implement a LINQ operator, like Where, on an observable to do some filtering? Here is a simple example of the filter operator Where for an IEnumerable and an IObservable (simplified for comparison). In the case of IEnumerable, you dispose the enumerator when you are done with traversing.

Now, look at the IObservable’s Where implementation. In this case, we return an IDisposable handle to the observable, so that we can dispose it to stop the subscription. For filtering, we simply create an inner observable that subscribes to the source and applies our filtering logic, and then another top-level observable that subscribes to the inner observable we created. Now, you can have any concrete implementation of IObservable<T> that wraps an event source, and then you can query that using Where!! Cool. The Observable class in Reactive Extensions has a few helper methods to create observables from events, like FromEvent. Let us create an observable, and query the events now. Fortunately, the Rx team already has the entire implementation of observables and related query operators, so that we don’t end up writing custom query operators like this.
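The push model is easy to sketch outside C# as well. Here is a minimal Python analogue (an illustration of the idea only, not Rx’s implementation): an observable is just something you subscribe a callback to, and a Where-style operator wraps the source with a filtering subscriber.

```python
class Observable:
    """A minimal push sequence: every subscriber gets each pushed value."""
    def __init__(self):
        self._subscribers = []

    def subscribe(self, on_next):
        self._subscribers.append(on_next)

    def push(self, value):
        for on_next in list(self._subscribers):
            on_next(value)

def where(source, predicate):
    """Return an observable that re-pushes only the matching values."""
    filtered = Observable()
    # The inner subscription applies the filtering logic on each push
    source.subscribe(lambda v: predicate(v) and filtered.push(v))
    return filtered

# Usage: filter an event stream, much like LINQ's Where on IEnumerable
events = Observable()
evens = where(events, lambda n: n % 2 == 0)
seen = []
evens.subscribe(seen.append)
for n in range(5):
    events.push(n)
# seen is now [0, 2, 4]
```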

You can do a NuGet install-package Rx-Main to install Rx, and try out this example that shows event filtering.

Obviously, in the above example, we could’ve used Observable.Timer – but I just wanted to show how to wrap an external event source with observables. Similarly, you can wrap your mouse events or WPF events. You can explore more about Rx and observables, and a few applications, here. Let us move on now to IQbservables.

IQbservable<T>

Now, let us focus on IQbservable<T>. IQbservable<T> is the counterpart to IObservable<T>, representing a query on push sequences/event sources as an expression, much like IQueryable<T> is the counterpart of IEnumerable<T>. So, what exactly does this mean?

You can see that it has an Expression property to represent the LINQ to Observable query much like how IQueryable had an Expression to represent the AST of a LINQ query. The IQbservableProvider is responsible for translating the expression to the language of a remote event source (may be a stream server in the cloud).

Conclusion

This post is a very high level summary of Rx Extensions, and here is an awesome talk from Bart De Smet that you cannot miss.

And let me take the liberty of embedding the drawing created by Charles that is a concrete representation of the abstract drawing Bart did in the white board. This is the summary of this post.

We’ll discuss more practical scenarios where Rx and Ix come in so handy in the future – mainly device-to-cloud interaction scenarios, complex event processing, task distribution using IScheduler, etc. – along with some brilliant add-on libraries others are creating on top of Rx. But this one was a quick introduction. Happy coding!!

Recently, while I was giving a C# talk, I realized that a lot of developers are still not familiar with the advantages of some evolving, but very useful, .NET libraries. Hence, I thought about writing a high-level post introducing some of them as part of my Back To Basics series, generally around .NET and JavaScript. In this post we’ll explore Interactive Extensions, a set of extensions initially developed for Reactive Extensions by the Microsoft Rx team.

Recap

Interactive Extensions, at its core, adds a number of new extension methods for IEnumerable<T> – i.e., a number of utility LINQ to Objects query operators. You may have hand-coded some of these utility extension methods somewhere in your helper or utility classes, but now a lot of them are aggregated together by the Rx team. Also, this post assumes you are familiar with the cold IEnumerable model and iterators in C#. Basically, what the C# compiler does is take a yield return statement and generate a class out of it for each iterator. So, in one way, each C# iterator internally holds a state machine. You can examine this using Reflector or something similar, on a method yield-returning an IEnumerator<T>. Or better, there is a cool post from my friend Abhishek Sur here, or this post about the implementation of iterators in C#.

More About Interactive Extensions

Fire up a C# console application, and install the Interactive Extensions package using install-package Ix-Main. You can explore the EnumerableEx class in System.Interactive.dll – now, let us explore some useful extension methods that got added to IEnumerable.

Examining Few Utility Methods In Interactive Extensions

Let us quickly examine few useful Utility methods.

Do

What the simplest version of 'Do' does is pretty interesting. It'll lazily invoke an action on each element in the sequence, as we do the enumeration leveraging the iterator.

And the result is below. Note that the action (in this case, our Console.WriteLine to print the values) is applied only when we enumerate.

Now, the implementation of the simplest version of the Do method is something like this – if you have a quick peek at the Interactive Extensions source code here in CodePlex, you can see how our Do method is actually implemented. Here is a shortened version.
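If it helps, here is the same lazy-action idea expressed as a Python generator (an analogy for illustration, not the C# source):

```python
def do(source, action):
    """Lazily invoke an action on each element as it is enumerated."""
    for item in source:
        action(item)     # the side effect happens at enumeration time
        yield item

logged = []
pipeline = do(range(3), logged.append)   # nothing runs yet - it's lazy
assert logged == []
result = list(pipeline)                  # enumerating triggers the action
# logged == [0, 1, 2] and result == [0, 1, 2]
```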

Scan

Scan takes a sequence and applies an accumulator function to generate a sequence of accumulated values. As an example, let us create a simple sum accumulator that takes a set of numbers and accumulates the sum of each number with the previous result.
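For comparison, Python's itertools.accumulate does essentially what Scan does; a running-sum accumulator looks like this:

```python
from itertools import accumulate
from operator import add

numbers = [1, 2, 3, 4]
# Each element is the current number accumulated with the previous result
running_sums = list(accumulate(numbers, add))
# running_sums == [1, 3, 6, 10]
```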

Conclusion

We just touched the tip of the iceberg, as the objective of this post was to introduce you to Ix. We may discuss this in a bit more depth after covering a few other libraries, including Rx. There is a pretty exciting talk from Bart De Smet here that you should not miss. Ix is especially interesting because of its functional roots. Have a look at the Reactive Extensions repository in CodePlex for more inspiration; it should give you a lot more ideas about a few functional patterns. You may also play with the Ix Providers and Ix Async packages.

Let us help the Stack Exchange guys suggest questions a user can answer, based on his answering history – much like the way Amazon suggests products based on your previous purchase history. If you don’t know what Stack Exchange does – they run a number of Q&A sites, including the massively popular Stack Overflow.

Our objective here is to see how we can analyze the past answers of a user, to predict questions he may answer in the future. Stack Exchange’s current recommendation logic may work better than ours, but that won’t prevent us from helping them, for our own learning purposes.

Behind the Scenes

Here we go – let us get into some “data science” voodoo first. Cool!! Distributed machine learning is mainly used for:

Recommendations - Remember the Amazon Recommendations? Normally used to predict preferences based on history.

Clustering - For tasks like grouping together related documents from a set of documents, or finding like-minded people in a community.

Classification - For identifying which category a new item belongs to. This normally includes training the system first, and then asking the system to classify a new item.

“Big Data” jargon is often used when you need to perform operations on a very large data set. In this article, we’ll be dealing with extracting some data from a large data set, and building a Recommender using our extracted data.

What is a Recommender?

Broadly speaking, we can build a recommender either by

Finding questions that a user may be interested in answering, based on the questions answered by other users like him

Finding other questions that are similar to the questions he answered already.

The first technique is known as user-based recommendation, and the second is known as item-based recommendation.

In the first case, taste can be determined by how many questions you answered in common with that user (the questions both of you answered). For example, think about User1, User2, User3 and User4 answering a few questions Q1, Q2, Q3 and Q4. This diagram shows the questions answered by the users.

Based on the above diagram, User1 and User2 answered Q1, Q2 and Q3. User3 answered Q3 and Q2, but not Q1. Now, to some extent, we can safely assume that User3 will be interested in answering Q1 – because the two users who answered Q2 and Q3 with him also answered Q1. There is some taste matching here, isn’t there? So, if you have an array of {UserId, QuestionId} pairs, it seems that data is enough for us to build a recommender.

The Logic Side

Now, how exactly are we going to build a question recommender? In fact, it is quite simple.

First, we need to find the number of times each pair of questions co-occurs across the available users. Note that this matrix has no relation to individual users. For example, if Q1 and Q2 appear together 2 times (as in the above diagram), the co-occurrence value at {Q1,Q2} will be 2. Here is the co-occurrence matrix (hope I got this right):

Q1 and Q2 co-occur 2 times (User1 and User2 answered both Q1 and Q2)

Q1 and Q3 co-occur 2 times (User1 and User2 answered both Q1 and Q3)

Q2 and Q3 co-occur 3 times (User1, User2 and User3 answered both Q2 and Q3)

And so on.
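As a sanity check, the counts above can be reproduced with a short sketch. Since the original diagram isn’t reproduced here, the sample data below is one user-to-question assignment consistent with the counts in the text (User4’s answers are an assumption):

```python
# Build the co-occurrence matrix from {UserId, QuestionId} pairs using
# plain Python. User1..User3 follow the text; User4's answers (Q3, Q4)
# are an assumption, since the original diagram isn't shown here.
from collections import defaultdict
from itertools import combinations

answers = [
    ("User1", "Q1"), ("User1", "Q2"), ("User1", "Q3"),
    ("User2", "Q1"), ("User2", "Q2"), ("User2", "Q3"),
    ("User3", "Q2"), ("User3", "Q3"),
    ("User4", "Q3"), ("User4", "Q4"),
]

# Group the questions answered by each user.
by_user = defaultdict(set)
for user, question in answers:
    by_user[user].add(question)

# Every unordered pair of questions answered by the same user co-occurs once.
cooccurrence = defaultdict(int)
for questions in by_user.values():
    for q1, q2 in combinations(sorted(questions), 2):
        cooccurrence[(q1, q2)] += 1
        cooccurrence[(q2, q1)] += 1  # keep the matrix symmetric

print(cooccurrence[("Q1", "Q2")])  # 2 -> User1 and User2
print(cooccurrence[("Q2", "Q3")])  # 3 -> User1, User2 and User3
```

In a real job this same pairing-and-counting is exactly what the map reduce stages do, just distributed across the cluster.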

The above matrix just captures how many times each pair of questions was answered together, as discussed above. There is no mapping to users yet. Now, how will we relate this to a user’s preferences? To find out how closely a question ‘matches’ a user, we just need to:

Find out how often that question co-occurs with the questions answered by that user

Eliminate questions already answered by the user.

For the first step, we need to multiply the above matrix with the user’s preference vector.

For example, let us take User3. For User3, the preference vector over the questions [Q1,Q2,Q3,Q4] is [0,1,1,0], because he already answered Q2 and Q3, but not Q1 and Q4. So, let us multiply this with the above co-occurrence matrix. Remember that this is a matrix multiplication / dot product. The result indicates how often each question co-occurs with the questions answered by the user (a weightage).

We can omit Q2 and Q3 from the results, as we know User3 already answered them. Of the remaining questions, Q1 and Q4, Q1 has the higher value (4) and hence the better taste match with User3. Intuitively, this indicates that Q1 co-occurred with the questions User3 already answered (Q2 and Q3) more often than Q4 did – so User3 will be more interested in answering Q1 than Q4. In an actual implementation, note that a user’s taste vector will be sparse (mostly zeros), as each user will have answered only a very limited subset of questions. The advantage of the above logic is that we can compute it with a distributed map reduce model, split across multiple map reduce tasks – constructing the co-occurrence matrix, finding the dot product for each user, and so on.
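Here is a minimal sketch of that computation for User3. The co-occurrence values follow the example above; the Q4 row is an assumption, since the full diagram isn’t shown:

```python
# Multiply the co-occurrence matrix by User3's preference vector, then
# drop the questions he already answered and rank what remains.
questions = ["Q1", "Q2", "Q3", "Q4"]

# Co-occurrence matrix over [Q1, Q2, Q3, Q4]; diagonal entries are unused here.
cooccurrence = [
    [0, 2, 2, 0],  # Q1 row
    [2, 0, 3, 0],  # Q2 row
    [2, 3, 0, 1],  # Q3 row (Q3-Q4 count is an assumption)
    [0, 0, 1, 0],  # Q4 row
]

# User3's preference vector: he answered Q2 and Q3, but not Q1 and Q4.
preference = [0, 1, 1, 0]

# Matrix * vector: how often each question co-occurs with User3's questions.
scores = [sum(row[j] * preference[j] for j in range(len(questions)))
          for row in cooccurrence]

# Keep only unanswered questions, highest weightage first.
candidates = sorted(
    (q for q, p in zip(questions, preference) if p == 0),
    key=lambda q: -scores[questions.index(q)],
)
print(scores)      # [4, 3, 3, 1]
print(candidates)  # ['Q1', 'Q4'] -> recommend Q1 first
```

Q1 scores 4 (it co-occurs twice each with Q2 and Q3), matching the walkthrough above.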

Now, let us start thinking about the implementation.

Implementation

From the implementation point of view,

We need to provision a Hadoop Cluster

We need to download and extract the data to analyze (Stack Overflow data)

Job 1 – Extract the Data - From each line, extract {UserId, QuestionId} for all questions answered by the user.

Job 2 – Build the Recommender - Use the output from the above Map Reduce job to build the recommendation model, where possible items are listed against each user.

Let us roll!!

Step 1 - Provisioning Your Cluster

Now remember, the Stack Exchange data is huge. So, we need a distributed environment to process it. Let us head over to Windows Azure. If you don’t have an account, sign up for the free trial. Now, head over to the preview page, and request the HDInsight (Hadoop on Azure) preview.

Once you have HDInsight available, you can create a Hadoop cluster easily. I’m creating a cluster named stackanalyzer.

Once you have the cluster ready, you’ll see the Connect and Manage buttons in your dashboard (Not shown here). Connect to the head node of your cluster by clicking the ‘Connect’ button, which should open a Remote Desktop Connection to the head node. You may also click the ‘Manage’ button to open your web based management dashboard. (If you want, you can read more about HD Insight here)

Step 2 - Understanding the Data

What we are interested in is the Posts XML file. Each line represents either a question or an answer. If it is a question, PostTypeId=1, and if it is an answer, PostTypeId=2. For an answer, ParentId is the Id of the question it answers, and OwnerUserId identifies the user who wrote the answer.

So, we need to extract {OwnerUserId, ParentId} for all posts where PostTypeId=2 (answers), which gives us the {User, Question} pairs. The Mahout Recommender Job we’ll be using later will take this data, and will build the recommendation results.

Now, extracting this data is itself a big task, considering the size of the Posts file. For the Cooking site it is not so large – but if you are analyzing the entire Stack Overflow, the Posts file can run into GBs. So, for the extraction itself, let us leverage Hadoop and write a custom Map Reduce job.

Step 3 - Extracting The Data We Need From the Dump (User, Question)

To extract the data, we’ll leverage Hadoop to distribute the work. Let us write a simple Mapper. As mentioned earlier, we need to extract {OwnerUserId, ParentId} for all posts with PostTypeId=2, because the input for the Recommender Job we’ll run later is {user, item}. First, load Posts.xml into HDFS. You may use the hadoop fs command to copy the local file to the specified input path.
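For example, the copy command can be built like this (a sketch – the HDFS path /input/Cooking/Posts.xml is an assumption; match it to the input folder your job configuration uses):

```python
# Sketch: build the "hadoop fs -copyFromLocal" command line that copies
# the local Posts.xml into HDFS. The target path is an assumption here.
def hdfs_copy_command(local_path, hdfs_path):
    return ["hadoop", "fs", "-copyFromLocal", local_path, hdfs_path]

cmd = hdfs_copy_command("Posts.xml", "/input/Cooking/Posts.xml")
print(" ".join(cmd))
# On the cluster head node you would run it, e.g. subprocess.run(cmd, check=True)
```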

Now, time to write a custom mapper to extract the data for us. We’ll be using the Hadoop On Azure .NET SDK to write our Map Reduce job. Note that we are specifying the input folder and output folder in the configuration section. Fire up Visual Studio, and create a C# console application. If you remember from my previous articles, hadoop fs <yourcommand> is used to access the HDFS file system, and it’ll help if you know some basic *nix commands like ls, cat, etc.
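The C# mapper code isn’t reproduced in this excerpt. As an illustrative sketch (not the article’s original code), here is an equivalent Hadoop Streaming mapper in Python that emits {OwnerUserId, ParentId} for every answer row in Posts.xml:

```python
# Sketch of a Hadoop Streaming mapper. Each line of Posts.xml is a
# <row ... /> element; for answers (PostTypeId="2") we emit
# "OwnerUserId,ParentId" so the recommender gets {user, item} pairs.
import re
import sys

ATTR = re.compile(r'(\w+)="([^"]*)"')

def map_line(line):
    """Return 'OwnerUserId,ParentId' for an answer row, or None otherwise."""
    attrs = dict(ATTR.findall(line))
    if attrs.get("PostTypeId") == "2" and "OwnerUserId" in attrs and "ParentId" in attrs:
        return attrs["OwnerUserId"] + "," + attrs["ParentId"]
    return None

if __name__ == "__main__" and not sys.stdin.isatty():
    # Hadoop Streaming pipes the input split through stdin.
    for line in sys.stdin:
        out = map_line(line)
        if out is not None:
            print(out)
```

The .NET SDK mapper in the article does the same per-line extraction; only the hosting mechanism differs.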

Note: See my earlier posts regarding the first bits of HDInsight to understand more about Map Reduce Model and Hadoop on Azure

Now, compile and run the above program. ExecuteJob will upload the required map reduce binaries to blob storage on your cluster, build the required command line, and initiate a Hadoop Streaming job that runs our Mappers on the cluster, with input from the Posts file we stored earlier in the input folder (see my previous posts to understand how to do this manually). Our console application will submit the job to the cloud and wait for the result. You can inspect the job by clicking the Hadoop Map Reduce status tracker from the desktop shortcut on the head node.

If everything goes well, you’ll see the results like this.

As you can see above, you can find the output in the /output/Cooking folder. If you RDP to your cluster’s head node and check the output folder now, you should see the files created by our Map Reduce job.

And as expected, the files contain the extracted data, which represents {UserId, QuestionId} – for all questions answered by a user. If you want, you can load the data from HDFS to Hive, and then view it with Microsoft Excel using the ODBC driver for Hive. See my previous articles.

Step 4 – Build the Recommender and Generate Recommendations

As a next step, we need to build the co-occurrence matrix and run a recommender job, to convert our {UserId, QuestionId} data into recommendations. Fortunately, we don’t need to write a Map Reduce job for this – we can leverage the Mahout library along with Hadoop. Read about Mahout here.

RDP to the head node of your cluster, as we need to install Mahout. Download the latest version of Mahout (0.7 as of this writing), and copy it to the c:\app\dist folder on the head node of your cluster.

Mahout’s Recommender Job has support for multiple algorithms to build recommendations – in this case, we’ll be using SIMILARITY_COOCCURRENCE. The Algorithms page of the Mahout website has a lot more information about recommendation, clustering and classification algorithms. We’ll be using the files we have in the /output/Cooking folder to build our recommendations.

Time to run the Recommender job. Create a users.txt file, place in it the IDs of the users you need recommendations for, and copy it to HDFS.

Now, the following command should start the Recommendation job. Remember, we’ll use the output files from our above Map Reduce job as input to the Recommender. This will generate output in the /recommend/ folder for all users specified in the users.txt file. You can use the –numRecommendations switch to specify the number of recommendations you need for each user. If there is a preference relation between a user and an item (like the number of times a user played a song), you could provide the recommender’s input data set as {user,item,preferencevalue} – in this case, we are omitting the preference weightage.
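Since the exact command line isn’t shown above, here is a hedged sketch of what a Mahout 0.7 RecommenderJob invocation looks like, built as a command list. The jar location (under c:\app\dist) and the HDFS paths are assumptions for this cluster; the switches come from Mahout 0.7’s RecommenderJob:

```python
# Sketch of the Mahout 0.7 RecommenderJob command line. Paths are assumed.
def recommender_command(input_path, output_path, users_file, num_recommendations=10):
    return [
        "hadoop", "jar", r"c:\app\dist\mahout-distribution-0.7\mahout-core-0.7-job.jar",
        "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob",
        "--input", input_path,
        "--output", output_path,
        "--usersFile", users_file,
        "--similarityClassname", "SIMILARITY_COOCCURRENCE",
        "--booleanData", "true",  # our {user,item} data has no preference values
        "--numRecommendations", str(num_recommendations),
    ]

cmd = recommender_command("/output/Cooking", "/recommend", "/data/users.txt")
print(" ".join(cmd))
# On the head node you would run it, e.g. subprocess.run(cmd, check=True)
```

The --booleanData switch tells Mahout we are working with plain {user, item} pairs, with the preference weightage omitted as discussed above.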

After the job is finished, examine the /recommend/ folder, and try printing the content of the generated file. You should see the top recommendations for the user IDs you listed in users.txt.
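Each line of the Mahout output typically holds a user ID, a tab, and a bracketed list of item:score pairs. A small sketch for reading one line (the sample uses the user and question IDs mentioned below, with assumed scores):

```python
# Sketch: parse one line of the Mahout recommender output, formatted as
# "userID<TAB>[item1:score1,item2:score2,...]".
def parse_recommendation(line):
    user, items = line.strip().split("\t")
    pairs = [p for p in items.strip("[]").split(",") if p]
    recs = [(int(p.split(":")[0]), float(p.split(":")[1])) for p in pairs]
    return int(user), recs

user, recs = parse_recommendation("1393\t[6419:4.0,16897:4.0]")
print(user, [question for question, _ in recs])  # 1393 [6419, 16897]
```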

So, the recommendation engine thinks User 1393 may answer questions 6419, 16897, etc., if we suggest them to him. You could experiment with other similarity classes like SIMILARITY_LOGLIKELIHOOD, SIMILARITY_PEARSON_CORRELATION, etc. to find the best results. Iterate and optimize till you are happy.

For a thought experiment, here is another exercise: examine the Stack Exchange data set, and figure out how you might build a recommender to show ‘You may also like’ questions based on the questions a user has favorited.

Conclusion

In this example, we did a lot of manual work to upload the required input files to HDFS and to trigger the Recommender job. In fact, you could automate this entire workflow leveraging the Hadoop On Azure SDK. But that is for another post, stay tuned. Real-life analysis has much more to it, including writing mappers/reducers for extracting and dumping data to HDFS, automating the creation of Hive tables, performing operations using HiveQL or Pig, etc. However, we just examined the steps involved in doing something meaningful with Azure, Hadoop and Mahout.

You may also access this data in your Mobile App or ASP.NET Web application, either by using Sqoop to export this to SQL Server, or by loading it to a Hive table as I explained earlier. Happy Coding and Machine Learning!! Also, if you are interested in scenarios where you could tie your existing applications with HD Insight to build end to end workflows, get in touch with me.