A View Inside the World of Big-Data

A PLAIN WHITE PAPER

Overview

With the growth of the internet and social media, the term “Big Data” has been promoted as something entirely new to the industry. In fact, Big Data is nothing new; for years organizations have been collecting information about their portfolios, processes, and other organizational property. Large companies in the insurance, healthcare, and finance industries capture data at a very low level of granularity, and that data gets stored in a variety of places (e.g., databases, documents, emails, and websites). What is different today is the huge amount of unstructured data in the form of blogs, tweets, and comments that needs to be captured, mined, analyzed, and even combined with structured enterprise data. This has led the industry to a revolutionary opportunity: combining information in ways that enable companies to look at their business differently. Large commercial companies such as Wal-Mart and Target use social media to profile customers and target their marketing to maximize sales. Government agencies can capture live data feeds and combine them with their structured data. For example, a government housing agency could combine a live weather feed with its structured property data, providing near real-time monitoring and alerting for natural disaster events that may impact its financial investments.

Today, Big Data refers to massive amounts of structured, unstructured, and semi-structured data, commonly characterized by the four V’s: Volume, Velocity, Variability, and Variety. Volume is the easy one: it refers to massive data volumes, often described as exceeding the physical limits of vertical scalability. Velocity is the speed at which data arrives at an organization; an RSS feed, for example, can accumulate massive amounts of data at a very fast rate. Variability means that the same data can take on many different meanings depending on the context in which it was captured. Finally, Variety refers to the many different data formats in the industry and the challenge of handling those formats and making meaning out of the data.

How do organizations handle Big Data?

With massive amounts of data being generated in today’s society, organizations have concerns about ingesting the data, storing the large volumes, building analytics, and providing visualization, because few organizations are used to handling these volumes. The days of measuring volumes in terabytes and petabytes are fading; the industry now speaks of yottabytes, where a yottabyte is a unit of information equal to one septillion bytes (one quadrillion gigabytes). Traditional data warehousing methods cannot efficiently move and store these large volumes, nor can they efficiently process unstructured data. This is where the MapReduce programming model comes into play, handling the processing of large amounts of data.

MapReduce is a programming model and library that lets a user write jobs that can be easily split and processed across a cluster of machines. Each job is divided into two parts: a Map and a Reduce. The Map function takes an input, splits it into sub-parts, and sends the sub-parts to different machines for processing. The Reduce function takes all the sub-parts and combines them to produce an answer. In more detail, the library takes inputs, such as a list of rows, and splits the rows across different machines for processing. The result is a list of intermediate key/value pairs; the MapReduce library then groups all intermediate values associated with the same intermediate key and passes them to the Reduce function. The Reduce function ingests each intermediate key and its values and merges the values together to form a single value per key. Here is a simple example of how it works:
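The Map, grouping, and Reduce steps described above can be illustrated with a minimal word-count sketch in Python. This is a single-process teaching example, not a distributed implementation; in a real cluster the map and reduce calls would run on different machines, with the library performing the grouping (shuffle) in between.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: split each input line into (word, 1) intermediate pairs."""
    pairs = []
    for line in lines:
        for word in line.lower().split():
            pairs.append((word, 1))
    return pairs

def shuffle(pairs):
    """Group intermediate values by key, as the MapReduce library would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: merge the values for each key into a single value."""
    return {key: sum(values) for key, values in grouped.items()}

lines = ["big data is big", "data is everywhere"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

Because the intermediate pairs are independent, the map work can be spread across any number of machines, and each reducer only needs the values for the keys it owns.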

As you can see, the MapReduce programming model can significantly improve processing against large volumes of data. The grouped result set improves query response times because duplicate records no longer need to be scanned and the data sets are significantly smaller. While MapReduce processing can break data down to its smallest forms for storage and become very complex, this was a very simple example.

Big Data Storage

Large amounts of data are typically stored in what is referred to as a Very Large Database (VLDB). The major differentiator from traditional databases is that VLDBs use Massively Parallel Processing (MPP). MPP is the coordinated processing of a program by multiple processors that independently work on different parts of the program, each using its own memory and operating system. The processors communicate with each other through a messaging interface. Configuring an MPP system is complex, requiring significant thought about how to partition a common database among many processors and how to distribute work assignments across them. These databases are commonly referred to as “shared nothing” or “loosely coupled” systems. While there are many VLDB platforms, some of the most common include Hadoop, Teradata, EMC Greenplum, and Oracle Exadata, each with its own architecture and unique storage methods.

For example, Hadoop is an open source platform administered by the Apache Software Foundation that forms a cluster of inexpensive, shared-nothing machines. The software includes the Hadoop Distributed File System (HDFS), which splits user data across the servers in the cluster and uses replication to ensure that node failures do not cause data loss. HDFS is self-healing: it detects and then compensates for system or hardware problems on any server. The benefit of Hadoop is that it can run very large scale data processing jobs with very high performance using the MapReduce programming model. Below is a typical Hadoop architecture:
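The split-and-replicate behavior of HDFS can be sketched in a few lines of Python. The tiny block size and round-robin placement below are simplifications for illustration (real HDFS uses large blocks and rack-aware placement); the replication factor of three matches the HDFS default.

```python
REPLICATION = 3   # HDFS defaults to three replicas per block
BLOCK_SIZE = 8    # bytes here, for illustration; HDFS blocks are far larger
NODES = ["node-a", "node-b", "node-c", "node-d"]

def split_blocks(data: bytes):
    """Split file contents into fixed-size blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def place_replicas(blocks):
    """Assign each block to REPLICATION distinct nodes (round-robin here;
    real HDFS placement is rack-aware)."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [NODES[(i + r) % len(NODES)] for r in range(REPLICATION)]
    return placement

blocks = split_blocks(b"abcdefghijklmnopqrstu")  # 21 bytes -> 3 blocks
placement = place_replicas(blocks)
# Losing any single node still leaves at least 2 replicas of every block,
# which is what lets the cluster "self-heal" by re-replicating in the background.
```

When a node fails, the NameNode notices the lost replicas and schedules copies from the surviving nodes, which is the self-healing behavior described above.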

Other platforms such as Teradata and Greenplum, which also use MPP and MapReduce, have integrated analytics engines in which the processing happens within the database, making them another unique offering for Big Data. With the analytics co-located with the data, there is no need to move massive datasets to a separate analytics engine, a transfer that could significantly degrade performance; instead, analytic processing performance significantly improves.

With technologies in place to handle the ingestion, storage, and processing of Big Data, the industry still needs to address several issues before realizing its full potential. As we capture more and more data, data policies must be applied.

These policy issues are increasingly important in the areas of security, privacy, intellectual property, and liability. The most common concern is the security of sensitive personal information that should be kept private. Data breaches can expose consumers’ personal information, corporate confidential information, and even national security information if captured by the wrong hands. Therefore, Big Data software and appliance vendors such as Oracle, with its Exadata platform, build mechanisms into their technologies to mitigate the risk of breaches. For example, Exadata includes encryption/masking, access control, auditing/tracking, and monitoring/blocking capabilities. This suite of tools isolates each data application behind a firewall-like shield to prevent the use of a compromised administrator account to steal data, controls privileged user access to application data to prevent insider attacks, and monitors database activity for SQL injection. These protection mechanisms enable organizations to customize their own security parameters at any level of the application, keeping the risk of vulnerability low.
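To make the masking idea concrete, here is a minimal sketch of field-level data masking in Python. This is not Oracle’s actual masking implementation, just an illustration of the general technique: hide most of a sensitive value while keeping enough of it for legitimate matching and display.

```python
def mask_ssn(ssn: str) -> str:
    """Mask all but the last four digits of a Social Security number.

    Illustrative only; production masking tools handle many field types,
    formats, and compliance rules.
    """
    digits = [c for c in ssn if c.isdigit()]
    return "*" * (len(digits) - 4) + "".join(digits[-4:])

print(mask_ssn("123-45-6789"))  # *****6789
```

Masking like this lets analysts and lower environments work with realistic data without ever holding the raw sensitive values.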

Other policy issues driven by governance concern the privacy of data, intellectual property, and liability. In the healthcare industry, access to people’s health records can deliver significant human benefits by pinpointing the right medical treatment for an individual. Similarly, a person’s financial records can help match consumers with the right financial instruments. If this data were made available for research and use by industries such as healthcare and finance, the opportunity would be massive, but it would come with more underlying issues. This leads to the questions of intellectual property and liability, where many legal issues could arise if the governance of this information is not handled properly. Questions will arise such as, “Who owns the dataset, and what rights come with the data?” and “Who is responsible if the use of the data leads to a negative consequence?” These questions will have to be answered before use to protect both the owner and the consumer of the data.

Big Data Analytics & Visualization

With Hadoop, Greenplum, Teradata, and Exadata among the players who have modeled data storage and processing into stable, consistent products, we now need analytical processing and visualization. As the Big Data industry continues to evolve, the need for analytics will grow exponentially, and some vendors, such as Teradata and Oracle, have addressed it. Teradata’s nCluster includes analytics in the data tier of the database. This provides a huge advantage because database features such as fault tolerance and workload management are applied equally to data management and to the analytical applications. The product uses SQL as well as most standard programming languages, such as R, Java, and C++, for analytic processing, and ships with SAS in the base product. Teradata’s “Next Generation Analytics” include trend analysis, fraud detection, network security analysis, consumer behavior, and portfolio analytics. Another flavor of the Big Data analytics toolset is Oracle Exalytics, an in-memory system designed for high performance analysis, planning, and modeling to support Business Intelligence and Enterprise Performance Management applications. The product includes Oracle Business Intelligence for visualization and ad hoc reporting, the Oracle TimesTen In-Memory Database for rapid query response, and Oracle Essbase for multi-dimensional analysis. It uses columnar compression to reduce the memory footprint and speed up query response times, and its analytic engine can run directly on compressed data, eliminating the need to uncompress, compute analytics, and then compress again for storage. These options suit organizations that need to query large data volumes and get results back in a timely manner, irrespective of the language (e.g., NoSQL, SQL, or Java).
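The idea of running analytics directly on compressed data can be illustrated with a run-length-encoded column in Python. This is not Exalytics’ actual compression scheme, just a sketch of the general principle: when the encoding preserves structure, an aggregate can be computed on the compressed form without ever decompressing.

```python
def rle_encode(column):
    """Run-length encode a column into [value, run_length] pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_sum(runs):
    """Sum a numeric column directly on its compressed form."""
    return sum(value * length for value, length in runs)

col = [5, 5, 5, 2, 2, 9]
runs = rle_encode(col)
print(runs)           # [[5, 3], [2, 2], [9, 1]]
print(rle_sum(runs))  # 28
```

The sum touches three runs instead of six values; on real columnar data with long runs, the savings in memory traffic are what make in-memory analytics fast.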

Conclusion

Organizations looking to develop a Big Data strategy should define a roadmap that addresses the demand for large volumes of data arriving in different formats. The ability to address and decide on the following areas will help ensure the success of the strategy:

Technologies to be used for storing, processing, analyzing, and visualizing large amounts of data.

Organizational changes in the way you look at Big Data, and the ability to staff people who understand Big Data and how to use it.

Access to the data that will be used to foster opportunities or drive new business analysis techniques.

At Unissant, we understand the challenges and complexities involving technology, business, and the value of moving to a Big Data solution. We help our customers weigh the benefits and drawbacks of a Big Data environment through continuous dialog and shared experience. Big Data is not necessary for every organization, and with our experience implementing these environments, we are looked upon as a trusted advisor by clients considering Big Data initiatives. We have spent many years developing proprietary frameworks around Data Governance, Data Quality, Master Data Management, Business Intelligence, Information Security, Data Classification, and Metadata. Unissant uses the latest Big Data technologies and teams with major vendors such as Oracle, Greenplum, Teradata, and Karmasphere to provide our customers the “right” environment for their business needs.

About the Author

Will Rohde is an avid enthusiast of Big Data and Cloud initiatives, with a passion for exploring new technologies. He has over 18 years of experience architecting and implementing Information Management, BI, and Data Warehousing solutions. In his free time, he enjoys continuing a 15-year pastime of competing in doubles pro beach volleyball.

Unissant is an innovative software development and consulting company that manages complex initiatives, solves data challenges, and transforms business. Unissant brings technical excellence and program/project execution best practices that exceed the expectations of our clients in the Banking and Finance, Health and Life Sciences, National Security, and Federal/Civilian sectors.