When you need it!


Today I struggled for 15 minutes before figuring out how to SSH to a Linux server from a Mac (my first time, as a previous PuTTY user on Windows). Apparently the Mac had some recent changes regarding its OpenSSH implementation; I'm not sure though. All you gotta do is follow these simple steps (if you don't know a quicker way) in your Mac terminal:
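In essence it boils down to one command (the username and server address below are placeholders for your own):

ssh your_username@your_server_address

Type yes when asked to trust the host key the first time, enter your password when prompted, and you are in.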

So, you want to know what Hadoop is and how Python can be used to perform a simple task using Hadoop Streaming?

HADOOP is a software framework that is mainly used to leverage distributed systems in an intelligent manner and to perform operations on big datasets efficiently, without making the user worry about node failures (failure of one among the 'n' machines performing your task). Hadoop has various components:

Hadoop Common – contains libraries and utilities needed by other Hadoop modules

Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster

Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications

Hadoop MapReduce – a programming model for large-scale data processing

I will walk you through the Hadoop MapReduce component. For further information on MapReduce you can Google it; for now, I will present a brief introduction.

What is MapReduce?

To know it truly you have to understand your problem first. So, what kind of problems can MapReduce solve?

Counting occurrences of digits in a list of numbers

Counting prime numbers in a list

Counting the number of sentences in a text

Computing the average of 10 million numbers in a database

Listing the names of all people belonging to a particular sales region

Do you think these are trivial problems? Yes, they appear to be, but what if you have millions of records and the time to process the results matters to you? Thinking again? You'll get your answer.
Not just time: a task has multiple dimensions, and MapReduce, if implemented efficiently, can help you overcome the risks associated with processing that much data.

Okay, enough of what and why! Now ask me how!!!

A MapReduce 'system' consists of two major kinds of functions, namely the Map function and the Reduce function (not necessarily with those names, but with the pre-decided intention of acting as the Map and Reduce functions). So, how do we quickly solve the simple problem of counting a million numbers from a list and displaying their sum? This, let me tell you, is a very long operation even though it is simple. (For a complex MapReduce program using not Hadoop but the mincemeat module, please go through this.)

In this particular example the Map function(s) go through the list of numbers and create a list of key-value pairs of the format {m, 1} for every number m that occurs during the scan. The Reduce function takes the list from the Map function(s) and forms a list of key-value pairs of the form {m, list(1+)}, where 1+ means one or more occurrences of 1.

The complicated expression above is nothing but the number m encountered in the scan(s) by the Map function(s), with the 1's in the value appearing as many times as the number was encountered. So it basically means {m, number of times m was encountered in the Map phase}.

The next step is to aggregate the 1's in the value for every m. This gives {m, sum(1's)}. The task is almost done now: all we have to do is display each number and the corresponding sum of 1's as its count. But wait, you still don't see why this is a big deal, right? Anybody can do this. But hey! The Map functions aren't there just to take all your load and process it alone. Nope! There are in fact many instances of your Map function working in parallel on different machines in your cluster (if one exists; otherwise just multithread your program to create multiple instances of Map, but why should you when you have distributed systems?). Every Map function running simultaneously works on a different chunk of your big list, hence sharing the task and reducing processing time. See! What if you have a big cluster and many machines running multiple instances of your Map function to process your list? It's simple: your work gets done in no time!!! Similarly, the Reduce functions can also run on multiple machines, but generally after sorting (your mincemeat or Hadoop program will first sort the m's and distribute distinct m's to different Reduce functions on different machines). So even the aggregation gets quicker, and you are ready with your output to impress your boss!

A brief outline of what happened to the list of numbers is as follows:

Map functions counted every occurrence of every number m

Map functions stored every number m in the form {m,1} – one such pair per occurrence of m

Reduce functions collected all such {m,1} pairs

Reduce functions converted all such pairs into {m,sum(1's)} – only one pair per number m

Reduce functions finally displayed the pairs or passed them to the main function to display or process

In part two of this tutorial I will explain how to install Hadoop and write the same program in Python using the Hadoop framework.

For a similar program in mincemeat, please go through:

import mincemeat
import sys
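# A minimal sketch of how such a mincemeat program can look, using the
# number-counting example from above (the sample data and the password
# are illustrative assumptions, not original values):

data = [3, 7, 3, 1, 7, 7]                  # the "big list" of numbers
datasource = dict(enumerate([data]))       # one chunk; split into more chunks for real lists

def mapfn(key, numbers):
    for m in numbers:
        yield m, 1                         # emit {m, 1} for every number encountered

def reducefn(key, ones):
    return sum(ones)                       # aggregate the 1's into a count for m

password = sys.argv[1] if len(sys.argv) > 1 else 'changeme'

s = mincemeat.Server()
s.datasource = datasource
s.mapfn = mapfn
s.reducefn = reducefn
results = s.run_server(password=password)  # clients connect with: python mincemeat.py -p changeme localhost
print(results)                             # e.g. {1: 1, 3: 2, 7: 3}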

Nope, there's no harm in running and playing around in Git. Branches? Nope, not an issue! They are easy to master. All you need to keep in mind while thinking of branching in Git are:

HEAD pointer

Commit history (a linked chain of commits)

A few commands

To understand branching we should first understand how Git stores data. It is stored in the form of snapshots: a chain of commits from the first commit to the most recent one.
So, think of a commit as a trigger. As we make a commit, the following events occur:

A new commit object is created

Previous commit object points to the new commit object

HEAD pointer points to the branch with the new commit object

What is HEAD? It is just a special pointer in Git that points to the current branch. So, if you made a commit, say, on branch master, HEAD points to master. But if you run the command git checkout -b testing, Git will create a new branch testing and move HEAD to point to the new branch, which initially points to the same latest commit object as master. Similar stuff happens when you check out an existing branch. Now, if you commit from testing, HEAD will still point to this branch, but the branch moves forward in the project. This means the previous commit object, which both master and testing pointed to, now points to a new commit object that belongs to testing and is pointed to by HEAD, while master still points to the previous commit. Evidently, this doesn't affect master in any way. But now, if we run the commands:

git checkout master

git add changed_file

git commit -m "Just committed"

a new diverging commit object is created. This means the earlier commit object now points to two different commit objects belonging to two different branches, master and testing, while HEAD points to the branch we most recently checked out, in this case master. The image below illustrates this situation for the two branches master and testing.

Now I am quite sure that you are more comfortable with Git branching than you were 10 minutes earlier. To view the current status of the branch pointers and HEAD, just type a command such as:
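git log --oneline --decorate --all

This prints the commit chain one line per commit, with the branch names and HEAD shown next to the commit objects they point to (the --all flag includes every branch, not just the current one).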

Every time you're asked to submit that crucial programming assignment, what comes to your mind after completing the program? Uh, not a movie. It's Git, I know! So, that's exactly how you proceed towards storing your lovely little program forever, so that you can hug your code every time you miss your ex, right?

Git is a terrific way to store not just code but almost every kind of file you can think of that can be stored online. It employs an altogether different approach to controlling versions of your files. While other similar tools like SVN are centralized revision control systems, Git is a distributed revision control system (DVCS). So, every person having access to your repository can clone your code, maintain a local copy of exactly the same data as on GitHub, and make changes locally with full control. So, in some catastrophic situation, god forbid, some client who cloned your code from Peru can help you recover your data!!!

There can be nothing better than Git’s own website but I am here to help you skip some contrived details and dive right into basic usage of Git.

Basic Commands

git init – Initializes a local Git repository in your current local directory
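From there, a typical first sequence might look like this (the file name and repository URL are placeholders):

git init
git add my_program.py
git commit -m "First commit"
git remote add origin https://github.com/you/your-repo.git
git push -u origin master

git add stages your changes, git commit records them locally, and git push publishes them to the remote repository.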

While on my way to completing my program in Information Systems at the University of Cincinnati, I stumbled upon this very interesting assignment in my Cloud Computing course, offered by the computer science department. It was a simple password breaker for strings of 4 or fewer characters: it attacks a given hex string of 32 or fewer characters and provides the strings that are in its VALUE BUCKET. For example, we have the sample execution:

Attacking d077f…

{'found': ['cat', 'gkf9']}

In the aforementioned example we are attacking the first 5 characters of a 32-digit hashed hex string, where values collide (more than one string can hash to the same prefix). That's another topic of interest that I will discuss later.

The program uses the mincemeat.py module from https://github.com/bigsandy/mincemeatpy. This is a Python 2.7.x MapReduce library that can distribute map and reduce tasks to clients to make tasks faster. In my upcoming posts I will write about MapReduce and Hadoop.

Logic

Generate all possible strings of size 1 to 4 using (0-9) and (a-z)

This can be done in various ways, such as using pre-built libraries, or with some fresh logic: first generate the two-character strings, then loop over them, appending the same two-character strings to each to get the four-character strings. Once ready, we can choose any series from the list starting with a fixed value from 0 to z, say 0000 to 0zzz, and take their last three characters as the three-character additions to our main list. Once done, we can append the two-character strings to the main list and, finally, the one-character strings. This way we have one list covering all possible strings of 1 to 4 characters over {0-9, a-z} in any combination.
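For instance, a compact way to do this with itertools (an alternative to the manual appending described above; names here are illustrative):

import itertools
import string

ALPHABET = string.digits + string.ascii_lowercase   # 0-9 followed by a-z

def candidates(max_len=4):
    for n in range(1, max_len + 1):                 # lengths 1 through 4
        for combo in itertools.product(ALPHABET, repeat=n):
            yield ''.join(combo)

bigdata = list(candidates())
print(len(bigdata))   # 36 + 36**2 + 36**3 + 36**4 = 1727604 strings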

Build grains using modulus technique and send to map function.

From the original list 'bigdata' we find its length as len(bigdata) and find all its factors. Once found, we can think of the possible number of clients that will execute the map functions and divide the list accordingly into a dictionary of chunks, say {'0': 'list-chunk1', '1': 'list-chunk2', …}, and build from this a datasource dictionary to be served to the clients.

Since the map and reduce functions cannot use global variables from the parent program, we have to pass the input hashed hex string in the datasource itself, by the simple technique of nesting the data inside the dictionary. So, instead of {'0': 'list-chunk1', '1': 'list-chunk2', …}, the datasource looks like {'0': ('d077f', 'list-chunk1'), '1': ('d077f', 'list-chunk2'), …}, where every key-value pair is sent to a separate map function on a (possibly) different client. This can be unwrapped in the map function to obtain the hashed hex string 'd077f' and the chunk whose strings have to be hashed one by one to check whether the first five characters of each hash match 'd077f' (example).
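A small sketch of this chunking step (TARGET and NUM_CLIENTS are assumptions for illustration, and 'bigdata' is the list built earlier):

TARGET = 'd077f'                               # the hex prefix under attack (example)
NUM_CLIENTS = 4                                # assumed number of map clients

chunk_size = -(-len(bigdata) // NUM_CLIENTS)   # ceiling division, so nothing is dropped
datasource = {}
for i in range(NUM_CLIENTS):
    part = bigdata[i * chunk_size:(i + 1) * chunk_size]
    datasource[str(i)] = (TARGET, part)        # each value carries the target and one chunk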

Send output from map to reduce function

If a match occurs, send the hashed query string 'd077f' (example) and the values that hash to it to the reduce function.

Send output from reduce function to the parent program

If the map functions send matches, capture the results and aggregate all of them into a single list, for example {'d077f': ['cat', 'wtf']}, and send it to the parent program.

Capture reduce functions output

Once the parent function receives data from the reduce function, the data can be displayed.
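A hedged sketch of what the map and reduce functions can look like, assuming the 32-character hex string is an MD5 digest and reusing the datasource built above (the password is illustrative):

def mapfn(key, value):
    import hashlib                        # imports must live inside mapfn, since
    target, words = value                 # mincemeat ships the function to remote clients
    for w in words:
        if hashlib.md5(w).hexdigest().startswith(target):
            yield target, w               # a candidate whose hash matches the prefix

def reducefn(key, values):
    return values                         # collect every match for the attacked prefix

s = mincemeat.Server()
s.datasource = datasource
s.mapfn = mapfn
s.reducefn = reducefn
results = s.run_server(password='changeme')
print({'found': results.get('d077f', [])})   # e.g. {'found': ['cat', 'gkf9']}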

Introduction to Virtualization

Wikipedia says that “Virtualization is the creation of a virtual (rather than actual) version of something, such as an operating system, a server, a storage device or network resources.”

Here, the context will mainly be OS virtualization. The three basic layers of a virtual system are:

Guest System – The system that the user sees

Hypervisor (VMM, or virtualization layer in general) – The enabler of virtualization, which is generally software

Host System – The machine on which we run the VMM, so as to enable it to host a guest system

Hypervisors or Virtual Machine Managers

There are two types of hypervisors: Type 1 and Type 2. Type 1 hypervisors are VMMs that interact directly with the hardware of the host and have no mediator in between; an example is Citrix Xen. Type 2 hypervisors are the ones that are pure software and work with the hardware indirectly, via a host operating system; an example is Parallels for Mac. (Don't confuse these with software/hardware virtualization.)

A hypervisor (VMM) has three parts:

Dispatcher – Routes Instructions to the Hardware

Allocator – Allocates resources to the VM

Interpreter – Interprets the instructions and does whatever’s necessary

So, you wanna know what a DHT, or Distributed Hash Table, is? OK, cool!

For you to understand a DHT you have to know what a key-value pair is. It is a way of storing data in an easily referenceable manner. Data is stored as lists of key-value pairs. Every list has only unique keys, and for every key there is a value, which could be anything. For example, for every roll number in a class there is a student's name. A student's name may be the same as another student's, but the roll number can't be. This is the idea. So as not to run out of keys, in the tech world keys are generally long strings of, say, 160 bits. So there can be 2^160 different keys, which is quite a big number.

In a Distributed Hash Table, various nodes or machines participate in sharing data. They share the keyspace of 2^160 possible keys: different keys are distributed among different machines, and each machine acts as the owner of its keys. Any data to be stored under a key is sent to the machine that owns that key and stored there. This happens after the query 'who has the key?' moves from one node to the next in the system until it reaches the destination. This is an interesting process, and further details can be found on the Wikipedia page about DHTs.
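A toy sketch of key ownership (real DHTs like Chord use consistent hashing and hop-by-hop routing; this only shows the partitioning idea, and the node names are made up):

import hashlib

NODES = ['node-A', 'node-B', 'node-C']            # hypothetical machines in the system

def owner(key):
    # Hash the key to a 160-bit identifier with SHA-1, then map it to a node.
    key_id = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return NODES[key_id % len(NODES)]             # naive partitioning of the keyspace

print(owner('roll-42'))                           # the machine that stores this key's value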