Notice something here? ADP moves a lot more money than PayPal, but makes less revenue on money movement (excluding the revenue from its other HCM services). It has a smaller market cap too. Why? Well, ADP is in the business of solution shops and value-adding processes, while PayPal is a facilitated network.

There are three general types of business models: solution shops, value-adding process businesses, and facilitated networks. Solution shops are institutions structured to diagnose and recommend solutions to unstructured problems. Almost always, solution shops charge their clients on a fee-for-service basis. Value-adding process businesses transform inputs of resources into outputs of higher value. Because value-adding process organizations tend to do their work in repetitive ways, the capability to deliver value tends to be embedded in processes and equipment. Facilitated networks operate systems in which customers buy and sell, and deliver and receive things from other participants. Much of consumer banking is a network business in which customers make deposits and withdrawals from a collective pool.

When onboarding clients, ADP operates as a solution shop: a heavy team executes a time-consuming and highly customized process for each major client. Once a client is on board, ADP performs the repetitive payroll service by computer every pay cycle. In return, clients pay ADP service fees. No matter how big or small the paycheck is, whether for the CEO or for the average Joe, the service fee is the same. In contrast, PayPal is a facilitated network whose service fee is proportional to the transaction size, just like credit card services.

Given its dominance in the payroll business, it is very challenging for ADP to achieve high growth in this area by grabbing more market share. But high growth is still possible by changing the business model, following the analysis above. That is, ADP should become a facilitated network, more specifically a bank!

It sounds ridiculous, but ADP has a unique advantage that could make it a great bank: managing risk well. The open secret is its massive payroll and HR data. By knowing incomes in advance, along with work history, performance metrics, time management data, etc., ADP can greatly reduce risk with data science. The other great news is that there is a huge market. Many households make a go of it week to week, paycheck to paycheck, expense to expense. In fact, 63% of Americans don't have enough savings to cover a $500 emergency. Often they have to pay a very high borrowing rate to meet a small financial need. With good risk management based on its data, ADP could potentially help at a much lower rate. It is a win-win for everyone.

There are a lot of NoSQL databases out there. We have used or tried out many of them, and we love a lot of the cool features they offer. However, we also face many unique challenges in a highly regulated HCM SaaS business. So we kept looking for a unicorn database to meet our requirements. Unfortunately, none of the existing solutions fully addresses all of our challenges. So we asked ourselves two years ago if we could build our own solution. That is how the Unicorn database was born. Unicorn is built on top of BigTable-like storage engines such as Cassandra, HBase, or Accumulo. With different storage engines, we can achieve different strategies for consistency, replication, etc. Beyond the plain abstraction of the BigTable data model, Unicorn provides an easy-to-use document data model and a MongoDB-like API. Moreover, Unicorn supports directed property multigraphs, and documents can simply be vertices in a graph. With the built-in document and graph data models, developers can focus on business logic rather than tedious key-value pair manipulation. Of course, developers are still free to use key-value pairs for flexibility in special cases.
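To make the layering concrete, here is a toy Python sketch of the idea: a document model built over a BigTable-like store (row key, column family, qualifier, value), with documents doubling as graph vertices. This is not Unicorn's actual API; every name below is hypothetical and the dict is a stand-in for a real storage engine.

```python
# Toy sketch: a document model layered over a BigTable-like store
# (row key -> column family -> qualifier -> value).
# NOT Unicorn's real API; all names here are hypothetical.

class ToyDocStore:
    def __init__(self):
        # In-memory stand-in for a BigTable-like storage engine.
        self.rows = {}  # row key -> {family: {qualifier: value}}

    def _row(self, key):
        return self.rows.setdefault(key, {"doc": {}, "graph": {}})

    def put_doc(self, key, doc):
        # Flatten the document's fields into the "doc" column family.
        self._row(key)["doc"].update(doc)

    def get_doc(self, key):
        return dict(self.rows[key]["doc"])

    def add_edge(self, src, label, dst):
        # The document is also a vertex; edges live in a "graph" family.
        self._row(src)["graph"][(label, dst)] = True

    def neighbors(self, src, label):
        return [dst for (l, dst) in self.rows[src]["graph"] if l == label]


store = ToyDocStore()
store.put_doc("alice", {"name": "Alice", "role": "engineer"})
store.add_edge("alice", "follows", "bob")
print(store.get_doc("alice"))         # the document comes back whole
print(store.neighbors("alice", "follows"))  # ['bob']
```

The point of the sketch is the separation of concerns: the storage engine only sees rows and columns, while the application works with documents and edges.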

During the past two years, we have learned a lot and made many improvements, which resulted in Unicorn 2.0, which we are excited to open source to the community. Continue reading →

That's right, you didn't read the title wrong. In most people's minds, Hadoop has been almost a synonym for Big Data. Adding the magic word to your resume means more opportunities and higher pay. How can its future possibly be misty? Let's get things clear together. Continue reading →

Back in graduate school, I worked on the so-called small sample size problem. In particular, I was working on linear discriminant analysis (LDA). For high-dimensional data (e.g. images, gene expression, etc.), the within-class scatter matrix is singular when the number of samples is smaller than the dimensionality. Therefore LDA cannot be applied directly. You may think that we don't have such small sample size problems anymore in the era of Big Data. Well, the challenge is deeper than it looks. Continue reading →
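A quick numerical illustration of why LDA breaks down in this regime, using NumPy with arbitrary toy numbers: with n samples in d dimensions, the within-class scatter matrix has rank at most n minus the number of classes, so it is singular whenever n < d and cannot be inverted as classical LDA requires.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_per_class = 50, 10              # 50 dimensions, 10 samples per class

X1 = rng.normal(size=(n_per_class, d))        # class 1
X2 = rng.normal(size=(n_per_class, d)) + 1.0  # class 2, shifted mean

def scatter(X):
    # One class's contribution to the within-class scatter:
    # sum of outer products of the centered samples.
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc

Sw = scatter(X1) + scatter(X2)       # within-class scatter, d x d
rank = np.linalg.matrix_rank(Sw)
print(rank, d)  # rank is at most 2 * (n_per_class - 1) = 18 < 50
```

Since the rank can be at most 18 while the matrix is 50 x 50, Sw is singular and its inverse, which LDA's solution needs, does not exist.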

In YARN, the ResourceManager is a single point of failure (SPOF). Multiple ResourceManager instances can be brought up for fault tolerance, but only one instance is Active. When the Active goes down or becomes unresponsive, another ResourceManager has to be elected as the new Active. Such a leader election problem is common in distributed systems with an active/standby design. YARN relies on ZooKeeper for electing the new Active. In fact, distributed systems also face other common problems such as naming services, configuration management, synchronization, group membership, etc. ZooKeeper is a highly reliable distributed coordination service for all these use cases. Higher-order constructs, e.g. barriers, message queues, locks, two-phase commit, and leader election, can also be implemented with ZooKeeper. In the rest of the book, we will find that many distributed services depend on ZooKeeper, which is exactly its goal: implement the coordination service once, and well, and share it among many distributed applications. Continue reading →
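The standard ZooKeeper leader election recipe works like this: every candidate creates an ephemeral sequential znode under an election path, and whoever holds the smallest sequence number is the leader; when that session dies, its ephemeral znode vanishes and the next-smallest takes over. The Python below simulates just that logic, with a dict standing in for the znode tree (real code would talk to a ZooKeeper server through a client library):

```python
# Simulation of ZooKeeper's leader-election recipe. A dict stands in
# for the znode tree; no real ZooKeeper server is involved.

class Election:
    def __init__(self):
        self.seq = 0
        self.znodes = {}  # znode path -> candidate name

    def volunteer(self, candidate):
        # Like creating an ephemeral sequential znode /election/n_NNN.
        path = f"/election/n_{self.seq:010d}"
        self.seq += 1
        self.znodes[path] = candidate
        return path

    def leader(self):
        # The candidate holding the smallest sequence number leads.
        return self.znodes[min(self.znodes)]

    def crash(self, path):
        # An ephemeral znode disappears when its session dies,
        # which implicitly triggers a new election.
        del self.znodes[path]


e = Election()
p1 = e.volunteer("rm1")   # first ResourceManager instance
p2 = e.volunteer("rm2")   # standby instance
print(e.leader())         # rm1
e.crash(p1)               # rm1's session dies; its znode is gone
print(e.leader())         # rm2 becomes the new Active
```

The attraction of the recipe is that failover is automatic: no node has to detect the crash explicitly, because the ephemeral znode's disappearance is the detection.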

An inner join operation combines two data sets, A and B, to produce a third one containing all record pairs from A and B with matching attribute values. The sort-merge join algorithm and the hash join algorithm are two common ways to implement the join operation in a parallel data flow environment. In a sort-merge join, both A and B are sorted by the join attribute and then compared in sorted order; the matching pairs are inserted into the output stream. A hash join first builds a hash table from the smaller data set with the join attribute as the hash key. We then scan the larger data set and find the relevant rows from the smaller data set by probing the hash table. Continue reading →
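Both algorithms fit in a few lines of Python. In this sketch each record is a tuple and the join attribute is assumed to be its first field; a real engine would of course stream and partition the data rather than hold it in lists.

```python
def sort_merge_join(A, B, key=lambda r: r[0]):
    # Sort both inputs by the join attribute, then walk them in order.
    A, B = sorted(A, key=key), sorted(B, key=key)
    out, i, j = [], 0, 0
    while i < len(A) and j < len(B):
        ka, kb = key(A[i]), key(B[j])
        if ka < kb:
            i += 1
        elif ka > kb:
            j += 1
        else:
            # Emit every B record sharing this key with A[i].
            j2 = j
            while j2 < len(B) and key(B[j2]) == ka:
                out.append((A[i], B[j2]))
                j2 += 1
            i += 1
    return out

def hash_join(A, B, key=lambda r: r[0]):
    # Build a hash table on the smaller input, probe with the larger.
    small, large, swapped = (A, B, False) if len(A) <= len(B) else (B, A, True)
    table = {}
    for r in small:
        table.setdefault(key(r), []).append(r)
    out = []
    for r in large:
        for m in table.get(key(r), []):
            # Keep output pairs ordered as (record from A, record from B).
            out.append((r, m) if swapped else (m, r))
    return out
```

For any inputs the two functions produce the same set of pairs; the trade-off is sorting cost versus the memory to hold the smaller side's hash table.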

Distributed parallel computing is not new. Supercomputers have been using MPI for years for complex numerical computing. Although MPI provides a comprehensive API for data transfer and synchronization, it is not very suitable for big data. Due to the large data size and the shared-nothing architecture needed for scalability, data distribution and I/O are critical to big data analytics, while MPI almost ignores them. On the other hand, many big data analytics tasks are conceptually straightforward and do not need very complicated communication and synchronization mechanisms. Based on these observations, Google invented MapReduce to deal with the issues of how to parallelize the computation, distribute the data, and handle failures. Continue reading →
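The classic word count shows how little the programmer writes under this model: a map function emitting key-value pairs and a reduce function aggregating each group. The shuffle below is a local dict, but in a real cluster it is the framework that groups by key across machines and handles distribution and failures.

```python
from collections import defaultdict

def map_fn(line):
    # Map: emit (word, 1) for every word in the input line.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: aggregate all counts emitted for one word.
    return (word, sum(counts))

def mapreduce(lines):
    # Shuffle: group map output by key, as the framework would
    # do across the cluster.
    groups = defaultdict(list)
    for line in lines:
        for k, v in map_fn(line):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())


print(mapreduce(["big data is big", "data flows"]))
# {'big': 2, 'data': 2, 'is': 1, 'flows': 1}
```

Everything outside map_fn and reduce_fn is the framework's job, which is exactly the division of labor MapReduce was designed around.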

Today, most data are generated and stored outside of Hadoop, e.g. in relational databases, plain files, etc. Therefore, data ingestion is the first step in utilizing the power of Hadoop. Various utilities have been developed to move data into Hadoop. Continue reading →