Monday, December 21, 2015

I am writing this blog post to highlight cool resources for learning programming in R. R is a widely used language among data scientists and engineers who build analytical components in big data solutions.

Sunday, December 20, 2015

In this blog post I am sharing the basic concepts of Apache Spark for developers. My goal is to educate developers and engineers with no big data experience on what Apache Spark is and why this platform matters when working with Hadoop big data solutions.

The target reader should have at least some experience building business applications or products (desktop, web, or mobile) in an object-oriented language such as C++, Java, or C#.

What is Apache Spark?
Apache Spark is a distributed computation framework for big data. It is an open source platform for processing large amounts of data in Hadoop ecosystem solutions.

Because it is a distributed platform, there are important concepts to solidify such as:

1) Every Spark application contains a driver program: the main entry point that runs the application's main function and launches various parallel operations on a cluster.

2) Spark provides the resilient distributed dataset (RDD), a collection of data elements partitioned across the nodes of a cluster so it can be operated on in parallel.

3) We can persist an RDD in memory, allowing it to be reused efficiently across parallel operations.

4) RDDs automatically recover from node failures.

Tip: In practice, an RDD often starts from a file in HDFS or any other Hadoop-supported file system.

5) Spark supports shared variables in parallel operations. There are two types: broadcast variables and accumulators.

You can write Spark applications in Scala, Python and Java programming languages.

To connect to Spark, you need a SparkContext object, which in turn requires a SparkConf (Spark configuration) object. The configuration object contains information about your application.

Spark ships with Scala and Python shells where you can write and execute code against the Spark cluster nodes.
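The concepts above fit together in a small program. Below is a hedged PySpark sketch (the app name, master setting, and data are all illustrative; plain Python stands in when pyspark is not installed):

```python
# Hedged sketch of the concepts above: a driver program builds a
# SparkContext from a SparkConf, parallelizes data into a partitioned RDD,
# persists it in memory, and uses a broadcast variable in a parallel map.
# App name, master setting, and data are illustrative.
try:
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("IntroApp").setMaster("local[2]")
    sc = SparkContext(conf=conf)

    rdd = sc.parallelize([1, 2, 3, 4], numSlices=2)  # RDD split into 2 partitions
    rdd.persist()                                    # keep it in memory for reuse

    factor = sc.broadcast(10)                        # read-only shared variable
    scaled = rdd.map(lambda x: x * factor.value).collect()

    sc.stop()
except ImportError:
    # pyspark not installed: mimic the same computation in plain Python
    scaled = [x * 10 for x in [1, 2, 3, 4]]
```

Note that in the PySpark shell a SparkContext named sc already exists, so there you would skip the SparkConf/SparkContext lines.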

Monday, December 14, 2015

I am writing this blog post to share some important Apache Spark topics and foundational points for those getting started.

Apache Spark is an open-source parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications.

Spark processing engine is built for speed, ease of use, and sophisticated analytics. Spark's in-memory computation capabilities make it a good choice for iterative algorithms in machine learning and graph computations.
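As a hedged illustration of the in-memory point above: iterative algorithms re-read the same dataset on every pass, so caching it avoids recomputing or re-reading it from disk. The dataset and the simple mean-finding iteration below are made up, and plain Python stands in when pyspark is unavailable:

```python
# Hedged sketch: iterative algorithms reuse the same data each pass, which
# is where Spark's in-memory caching pays off. The dataset and iteration
# rule are illustrative, not a real ML algorithm.
try:
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "IterativeSketch")
    points = sc.parallelize([1.0, 2.0, 3.0, 4.0]).cache()  # reused every pass

    guess = 0.0
    for _ in range(10):
        # step the guess halfway toward the mean of the cached dataset
        guess += 0.5 * points.map(lambda p: p - guess).mean()

    sc.stop()
except ImportError:
    # pyspark not installed: the same iteration in plain Python
    points = [1.0, 2.0, 3.0, 4.0]
    guess = 0.0
    for _ in range(10):
        guess += 0.5 * (sum(p - guess for p in points) / len(points))
```

Either way, the guess converges toward the mean of the dataset; with caching, only the first pass pays the cost of materializing the data.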

You can write applications in Python, Scala, and R on Spark clusters. HDInsight ships with out-of-the-box notebooks (tools/dev IDEs) that allow data scientists to write Spark programs using:

1) Python, using Jupyter notebooks (Jupyter also supports R, Julia, and Scala)

Saturday, December 12, 2015

I had a chance to attend one of the New York City JavaScript events (JS Open NYC), hosted at the Microsoft NYC office. I had the opportunity to chat with dozens of open source front-end developers about web compatibility and interoperability in modern web design and development.

While talking with the event attendees, I introduced new scanning tools that every PM, BA, or front-end engineer can use to test websites for compatibility issues.

Microsoft developed tools to scan and test your website for free; below is the website where you can find all the available tools.

On this website, you have four tools to use. I will go through each one of them in this blog post.

1) Quick Scan tool: (my favorite one for technical analysis)
The best tool for a quick scan of your website: it points out out-of-date libraries, layout issues, and other things to change to make your site compatible with most modern browsers.

I also had a discussion about Chakra, the core engine for Microsoft Edge. Microsoft announced that the core of the engine (ChakraCore) is going open source and will be available on GitHub in January 2016. Check out the official announcement from the Microsoft Edge Dev Team:

Wednesday, December 09, 2015

I'd like to share some useful resources for getting started with Apache Storm in this blog post.
Apache Storm is a distributed real-time computation system that allows engineers to process streams of data at scale.

Apache Storm is one of the major Hadoop ecosystem components; engineers use it to process the sources of data flowing into the Hadoop ecosystem.

Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

Imagine you need to process an endless source of data (such as a Facebook news feed or a Twitter feed), transform this large volume of information, and then store it in Hadoop. In this case, you build a Storm application by specifying a topology: defining the sources of information (spouts) and how to process each chunk of data (bolts).

Every Storm application contains a topology: a set of spouts and bolts, plus a specification of how they are wired together.
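The spout/bolt idea can be sketched in plain Python. This is a conceptual illustration, not real Storm code: in a real topology the spouts and bolts run distributed across the cluster, and all the names and sample data here are made up.

```python
# Conceptual sketch of a Storm-style topology in plain Python (not real Storm):
# a spout emits a stream of tuples and bolts transform or aggregate them.

def tweet_spout():
    """Spout: a source of tuples (here, a few fake status updates)."""
    for text in ["spark is fast", "storm is fast", "hbase stores data"]:
        yield text

def split_bolt(stream):
    """Bolt: split each status update into words."""
    for text in stream:
        for word in text.split():
            yield word

def count_bolt(stream):
    """Bolt: keep a running count per word (continuous computation)."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wiring the "topology": spout -> split bolt -> count bolt
counts = count_bolt(split_bolt(tweet_spout()))
```

In real Storm the wiring is declared in a topology definition, and each spout or bolt can run as many parallel tasks across the cluster.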

I have compiled some useful resources for getting started with Apache Storm:

Thursday, December 03, 2015

I have published a video on how to work with HBase tables in an HDInsight HBase cluster. The video is a walk-through of the basics of CRUD operations in HBase.

The video covers the following topics:
1) How to connect to the hbase shell tool.
2) How to create tables in HBase.
3) How to select, insert, and update records in HBase.
4) Understanding the create, put, delete, and deleteall commands in HBase.

The video uses a basic "Order" table structure as an example and executes all the above operations against it.
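As a sketch of those commands, a CRUD session in the hbase shell against a hypothetical "Order" table might look like this (the column family, row keys, and values are illustrative, not the ones from the video):

```
create 'Order', 'details'                          # table with one column family
put 'Order', 'order1', 'details:item', 'laptop'    # insert a cell
put 'Order', 'order1', 'details:item', 'tablet'    # update = put a new version
get 'Order', 'order1'                              # read one row
scan 'Order'                                       # read all rows
delete 'Order', 'order1', 'details:item'           # delete one cell
deleteall 'Order', 'order1'                        # delete the whole row
disable 'Order'
drop 'Order'                                       # a table must be disabled before drop
```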

Tuesday, December 01, 2015

I have published a new channel9 video about HBase Introduction in Azure.

This video covers an introduction to HBase in Azure: what HDInsight clusters are, which cluster types are available, and what Microsoft Azure offers in terms of Hadoop ecosystem components. The video focuses on the HDInsight HBase cluster type, the need for HBase in the Hadoop ecosystem for storing NoSQL data, and the available tools (such as the hbase shell) and commands for manipulating data in HBase tables.

The video also covers the column family concept for engineers who come from an RDBMS background.

This video helps any engineer with no Hadoop experience understand the role of HBase in Hadoop and big data applications.