~ See through noise

Monthly Archives: September 2014

Both HCM and CRM are about relationships. We simply want to know our employees as well as we know our customers. Moreover, it makes a lot of sense to connect the two to shorten the communication paths between the inside and the outside world.

Data is a corporate asset. But it is first of all a debt. The costs of acquisition, hardware, software, operations, and talent are very high. Without the right management, we are unlikely to extract value from data effectively. To make big data a success, we need the discipline to manage data as a valuable resource. Data management is much broader than database management. It is a systematic process of capturing, delivering, operating, protecting, enhancing, and disposing of data cost-effectively, which needs continual reinforcement of plans, policies, programs, and practices.

The ultimate goal of data management is to increase the value proposition of data. It requires serious and careful consideration and should start with a data strategy that defines a roadmap to meet the business needs in a data-driven approach. Every chief data officer should ask themselves the following questions:

What problem are we trying to solve? What value can big data bring? Big data is hot, and many corporations are embracing it. However, big data for its own sake is clearly wrong. Others’ use cases do not have to be yours. To glean value from big data, a deep understanding of your business and of the problems to solve is essential.

Who holds the data, who owns the data, and who can access the data? Data governance is a set of processes that ensures that important data assets are formally managed throughout the enterprise. Through data governance, we expect data stewards and data custodians to exercise positive control over the data. Data custodians are responsible for the safe custody, transport, and storage of the data, while data stewards are responsible for the management of data elements, both the content and the metadata.

What data do we need? It may seem obvious, but it is often answered simply with “I do not know” or “Everything”, which indicates a lack of understanding of business practices. Whenever this happens, we should go back and answer the first question again.

How to acquire the data? Data may be collected from internal systems of record, log files, surveys, or third parties. The transactional systems may be revised to collect the data needed for analytics.

Where to store the data and how long to keep it? Because of its variety, today’s data may be stored in various databases (relational or NoSQL), data warehouses, Hadoop, etc. Database management now goes well beyond relational database administration. Because big data is also fast data, it is impractical to keep all of the data forever. Careful thought is needed to determine the lifespan of data.
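
As a rough illustration of the lifespan question, here is a minimal Python sketch of a per-category retention check. The category names, the retention periods, and the helper functions are all hypothetical; a real policy would come out of the data strategy and governance process described above.

```python
from datetime import date, timedelta

# Hypothetical retention periods in days per data category (assumed values).
RETENTION_DAYS = {
    "clickstream_log": 90,
    "transaction": 365 * 7,
    "survey_response": 365 * 2,
}


def is_expired(category, created, today):
    """Return True when a record has outlived its category's retention period."""
    return today - created > timedelta(days=RETENTION_DAYS[category])


def partition_by_lifespan(records, today):
    """Split (category, created) records into those to keep and those to dispose of."""
    keep, dispose = [], []
    for category, created in records:
        target = dispose if is_expired(category, created, today) else keep
        target.append((category, created))
    return keep, dispose
```

Calling `partition_by_lifespan(records, date.today())` on a list of `(category, created)` pairs returns the two lists; the disposal step itself (archiving, deleting, anonymizing) is deliberately left out.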

How to ensure the data quality? Junk in, junk out. Without data quality, big data won’t bring any value to the business. With the advent of big data, data quality management is both more important and more challenging than ever.
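
To make "junk in, junk out" concrete, the following Python sketch counts three common kinds of junk in a batch of records: exact duplicates, missing required fields, and values that fail a validation rule. The function name, the field names, and the rules are assumptions chosen for illustration, not a prescribed quality framework.

```python
def quality_report(rows, required, validators):
    """Count duplicate rows, rows missing required fields, and rows with invalid values."""
    report = {"rows": 0, "duplicates": 0, "missing": 0, "invalid": 0}
    seen = set()
    for row in rows:
        report["rows"] += 1
        # A row is a duplicate when every field matches an earlier row.
        key = tuple(sorted(row.items()))
        if key in seen:
            report["duplicates"] += 1
            continue
        seen.add(key)
        # Required fields must be present and non-empty.
        if any(row.get(field) in (None, "") for field in required):
            report["missing"] += 1
            continue
        # Each validator is a predicate applied to one field's value.
        for field, check in validators.items():
            value = row.get(field)
            if value is not None and not check(value):
                report["invalid"] += 1
                break
    return report
```

For example, with `required=["id", "email"]` and `validators={"age": lambda v: 0 <= v <= 130}`, a row with an empty email is counted as missing and a row with a negative age as invalid.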

How to analyze and visualize the data? A large number of mathematical models are available for analyzing data, but simply applying them does not necessarily produce actionable insights. Before talking about your mathematical models, go and understand your business and problems. Lead the model with your insights (priors, in machine learning terms) rather than being led by the uninterpretable numbers of black-box models. Visualization is also extremely helpful for exploring data and presenting analytic results, since a picture is worth a thousand words.

How to manage the complexity? Big data is extremely complicated. To manage the complexity and improve data management practices, we need to develop an accountability framework that encourages desirable behavior and is tailored to the organization’s business strategies, strengths, and priorities.

We have reviewed Apache Hive and Cloudera Impala, which are great for ad hoc analysis of big data. Today, Facebook’s Hive data warehouse holds 300 PB of data with an incoming daily rate of about 600 TB! It is amazing, but it doesn’t mean that most analytics is on that scale (even for Facebook). In fact, queries usually focus on a particular subset or time window and touch only a small number of columns of a table.
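
One way to picture such a query is the toy Python sketch below: the table is stored column by column, and a sum over a date window reads only the two columns it needs. The table, its column names, and its values are invented for illustration and are not taken from any of the systems mentioned.

```python
from datetime import date

# A toy column-oriented table: each column is stored as its own list.
# Column names and values are invented for illustration.
events = {
    "day": [date(2014, 9, 1), date(2014, 9, 2), date(2014, 9, 2), date(2014, 9, 3)],
    "user_id": [101, 102, 101, 103],
    "bytes": [500, 1200, 300, 800],
    "agent": ["a", "b", "a", "c"],  # never read by the query below
}


def sum_in_window(table, value_col, start, end):
    """Sum one column over an inclusive date window, reading only two columns."""
    days, values = table["day"], table[value_col]
    return sum(v for d, v in zip(days, values) if start <= d <= end)
```

A call such as `sum_in_window(events, "bytes", date(2014, 9, 2), date(2014, 9, 2))` never touches `user_id` or `agent`, however wide the table grows.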

In the previous post, we discussed Apache Hive, which first brought SQL to Hadoop. There are actually several SQL-on-Hadoop solutions competing with Hive head-to-head. Today, we will look into Google BigQuery, Cloudera Impala, and Apache Drill, which all trace their roots to Google Dremel, a system designed for interactive analysis of web-scale datasets. In a nutshell, they are native massively parallel processing query engines on read-only data.

In the previous post, we discussed Apache Pig, which provides a data flow DSL, Pig Latin, to ease MapReduce programming. Although many statements in Pig Latin look just like SQL clauses, it is a procedural programming language. Today we will discuss Apache Hive, which first brought SQL to Hadoop. Similar to Pig, Hive translates queries in its own dialect of SQL (HiveQL) into a directed acyclic graph of MapReduce (or Tez since 0.13) jobs. However, the difference between Pig and Hive is not only procedural vs. declarative. Pig is a relatively thin layer on top of MapReduce for offline analytics, whereas Hive is geared toward data warehousing. With the recent Stinger initiative and its 100x performance improvement, Hive moves closer to interactive analytics.
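
To give a feel for what translating a declarative query into MapReduce stages means, here is a toy Python sketch of how a query like `SELECT dept, COUNT(*) FROM employees GROUP BY dept` could be expressed as a map, a shuffle, and a reduce step. The table, the column name, and the three functions are hypothetical, and this is not Hive's actual execution plan.

```python
from collections import defaultdict


def map_phase(rows):
    """Emit one (dept, 1) pair per input row."""
    for row in rows:
        yield row["dept"], 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped values to get a row count per key."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_dept(rows):
    """Run the three stages in order on an in-memory list of rows."""
    return reduce_phase(shuffle(map_phase(rows)))
```

The declarative query states only what result is wanted; the staged functions spell out how it is computed, which is the part a procedural data flow makes the programmer write by hand.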