~ Knowing is not enough; we must apply. Willing is not enough; we must do.

Monthly Archives: August 2014

A picture is worth a thousand words, so there is no need to repeat the title. The point is that similarity is the core of many machine learning algorithms. Before talking about your mathematical models, go understand your business and problems. Lead the model with your insights (or priors, in machine learning terms). Don’t be led by the uninterpretable numbers of black box models.

MapReduce is a good tool for offline, ad-hoc analytics, which often involves multiple successive jobs. A single MapReduce job essentially performs a group-by aggregation in a massively parallel way. However, its programming model is very low level. Custom code has to be written for even simple operations like projection and filtering. It is even more tedious and verbose to implement common relational operators such as join. Several efforts have been devoted to simplifying the development of MapReduce programs by providing high-level DSLs that can be translated to native MapReduce code. Unlike many other projects that bring SQL to Hadoop, Pig provides a procedural (data flow) programming language, Pig Latin, because it was designed for experienced programmers. Continue reading →

In the previous post, we discussed MapReduce. Although it is great for large-scale data processing, it is not friendly to iterative algorithms or interactive analytics, because the data have to be reloaded for each iteration, or materialized and replicated on the distributed file system between successive jobs. Apache Spark is designed to solve this problem by means of in-memory computing. The overall framework and parallel computing model of Spark are similar to those of MapReduce, but with an important innovation: the resilient distributed dataset (RDD). Continue reading →
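The cost of reloading data on every iteration can be illustrated with a small pure-Python sketch. It does not use Spark or Hadoop: `CountingSource` is a hypothetical stand-in for a dataset on a distributed file system, and `iterate` is an assumed toy algorithm (gradient descent toward the mean) that either reloads the data in each pass or keeps it in memory.

```python
class CountingSource:
    """Hypothetical stand-in for a dataset stored on a distributed file system."""

    def __init__(self, values):
        self._values = list(values)
        self.loads = 0  # how many times the data was read from "storage"

    def load(self):
        self.loads += 1
        return list(self._values)


def iterate(source, iterations, cache, step=0.5):
    """Toy iterative algorithm: gradient descent toward the mean of the data."""
    cached = source.load() if cache else None  # load once if caching in memory
    estimate = 0.0
    for _ in range(iterations):
        # Without caching, every iteration goes back to storage.
        data = cached if cache else source.load()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= step * gradient
    return estimate


reloading_source = CountingSource([1.0, 2.0, 3.0, 4.0])
cached_source = CountingSource([1.0, 2.0, 3.0, 4.0])

reloaded_result = iterate(reloading_source, iterations=20, cache=False)
cached_result = iterate(cached_source, iterations=20, cache=True)
```

Both runs compute the same estimate, but the first reads the data once per iteration while the second reads it only once. That difference is the overhead in-memory computing aims to remove.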

MapReduce has been hailed as a revolutionary platform for large-scale data processing. Many database vendors, including both traditional relational DBMS companies and new NoSQL providers, want to ride the wave and thus provide connectors for Hadoop so that their databases can be used as an input source and/or an output destination. Many of them claim that you can now crunch numbers from their databases with MapReduce. Sounds lovely, right? But do NOT do it! Continue reading →

In the Distributed NoSQL series, I reviewed several popular open-source NoSQL solutions. With big data at hand, we will want to crunch numbers from them. Of course, given the large data size, we have to use a distributed parallel computing framework. In this series, I will go through several such frameworks. Naturally, MapReduce is our first topic, as it started so-called big data analytics. Continue reading →

The 2011 CHRO Survey by Cornell University found that the top five issues on the CEO’s agenda for HR are:

| Issue | Europe % | U.S. % |
|---|---|---|
| Talent | 93 | 92 |
| Cost Control | 19 | 19 |
| Succession Planning | 29 | 19 |
| Employee Engagement | 10 | 18 |
| Culture | 20 | 17 |

This was based on the responses from 172 U.S. and 44 European chief human resource officers (CHROs) of the largest companies. As shown, more than 90% of CHROs identified talent as driving the CEO’s agenda for HR, where talent is interpreted as the attraction, development, and retention of employees in the talent pipeline.

It is no secret that big corporations are embracing big data, which everyone believes to be of great value. When we talk about big data, BIG takes over our brains, whether we are thinking of data, infrastructure, or applications. As a result, small businesses are being left out of this new gold rush. Who will connect small and big? I believe that big data for small business is a huge but overlooked business opportunity.

As I explained in What’s Big Data, big data is neither about big nor about data. It is about proactively learning and understanding our customers, their needs, behaviors, experience, and trends in near real-time and 24×7. Since big data is about customers and competency, why can’t small businesses leverage it? With cheap cloud computing and open source software, the barrier to entering big data is getting lower every day.

However, small businesses don’t have an army (infrastructure engineers, data engineers, data scientists, and application developers) to implement big data. This is a big challenge but also a big opportunity for vendors that can integrate various technologies into an easy-to-use, maintenance-free, one-stop solution. There is a gold mine in front of us. Let’s start digging!