Also, are you clear on structured and unstructured data, graph databases, in-memory databases, distributed databases, privacy restrictions, security challenges and the emerging field of Internet of Things?

Once you understand and decide on everything above, you are ready to reap Big Data benefits! Isn’t this confusing, daunting and downright scary? I have only taken a small subset of tools and technologies available in the market to make my point. But the world of Big Data is rich with an array of complementary, competing and confusing technologies. It is difficult for a typical business to fully grasp these technologies, conduct research, align them with their requirements, implement and finally start using them.

The confusion is akin to Chaos described in Greek Mythology.

Theogony’s version of the genesis of the Greek gods begins with just Chaos – a collection of random, unordered things or a void. From Chaos came Gaea (Earth), Eros (Desire), Uranus (Heaven) and Pontus (Sea). Uranus and Gaea’s union produced 12 Titans, including the youngest, Cronus.

Uranus feared his children overthrowing him, so he pushed them back into the womb of Gaea. To retaliate, Gaea gave Cronus a sickle. Cronus maimed his father then rescued his 11 siblings. Cronus and Rhea gave birth to several children who came to be known as Olympian Gods.

Like his father, Cronus came to fear his children overthrowing him and likewise started to swallow his children. To save her youngest child, Zeus, Rhea tricked her husband and sent Zeus to Crete for protection.

As prophesied, when he reached manhood, Zeus rescued his five elder siblings and successfully waged war against his father and overthrew Cronus, casting him and the other Titans into the depths of the Underworld. From then on, Olympian gods ruled the world!

I want to contrast this classic story with the rise of the complex Big Data environment that includes several versions and combinations of platforms and hundreds of tools.

How do we make sense of this Chaos? There is a wide spectrum of solutions to address this from DIY (Do It Yourself) to Pre-engineered packaged solutions. Both extremes have their own pitfalls.

Many DIY deployments start with software like Hadoop installed on basic servers with direct-attached storage. As data and analytics activities grow, the storage and computation resources become constrained. Many Hadoop clusters come into existence based on the budgeting restrictions and politics of an organization. As a result, the management of many environments becomes expensive and inefficient.

Many pre-engineered solutions require that the existing infrastructure and solutions in an organization are tossed aside and replaced with the new solution. This is impractical on several fronts – many organizations have made considerable investments in the existing infrastructure and also implemented several useful applications with their customizations. On top of this, there is no assurance from the new packaged solutions that they will work and scale per the requirements of the organization.

Organizations are in a fix. On one hand, they need assistance with the cost and maintenance of the existing DIY solution, but on the other hand they are reluctant to write off existing investments to enable pre-engineered solutions.

What is the way out?

The solution is Dell EMC consulting’s Elastic Data Platform (EDP) approach. It augments the existing infrastructure in an organization to provide an elastic and scalable architecture. It includes containerized compute nodes, decoupled storage and automated provisioning.

Elastic Data Platform: Definition

The EDP solution approach employs containers (Docker) to provide the compute power needed, along with a Dell EMC Isilon storage cluster with HDFS. This is all managed using software from BlueData, which provides the ability to spin up instant clusters for Hadoop, Spark, and other Big Data tools running in Docker containers. This allows users to quickly create new containerized compute nodes using predefined templates, and then access their data via HDFS on the Isilon system. With a containerized compute environment, users can quickly and easily provision new Hadoop systems or add compute nodes to existing systems, limited only by the available physical resources.

By consolidating the storage requirement onto an Isilon storage cluster, the redundant storage needed is reduced from a 3x replication factor to a 20% overhead. Consolidation further enables the sharing of data across systems and extends enterprise-level features such as snapshots, multi-site replication, and automated tiering that moves data to the appropriate storage tier as the data intensity changes over time.
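To put the storage claim in numbers, here is a minimal arithmetic sketch. The 100 TB data set is a hypothetical figure chosen for illustration; the 3x and 20% factors are the ones quoted above, and actual overhead will depend on how the storage cluster is configured.

```python
def raw_capacity_tb(usable_tb: float, overhead_factor: float) -> float:
    """Raw capacity needed to hold `usable_tb` of data at a given overhead factor."""
    return usable_tb * overhead_factor


usable_tb = 100.0  # hypothetical data set size, for illustration only

# With 3x replication, every block is stored three times.
replicated = raw_capacity_tb(usable_tb, 3.0)

# Consolidated storage with roughly 20% protection overhead (figure quoted above).
consolidated = raw_capacity_tb(usable_tb, 1.2)

# Fraction of raw capacity saved by moving from 3x replication to 20% overhead.
savings = 1 - consolidated / replicated

print(f"3x replication: {replicated:.0f} TB raw")
print(f"20% overhead:   {consolidated:.0f} TB raw")
print(f"Raw capacity saved: {savings:.0%}")
```

Under these assumptions the same 100 TB needs 300 TB of raw disk with replication but about 120 TB on consolidated storage.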

After implementing a containerized Big Data environment with BlueData, EDP deploys a centralized security access policy engine (provided by BlueTalon). It then creates and deploys a common set of security policies via enforcement points to all of the applications accessing the data. This ensures the definition and enforcement of a consistent set of rules across all data platforms, ensuring data governance and compliance, by only allowing users or applications access to the data to which they are entitled.

The result is a secured, easy-to-use, and elastic platform for Big Data with a flexible compute layer and a consolidated storage layer that delivers performance, management, and cost efficiencies unattainable using traditional Hadoop systems.

Elastic Data Platform: Principles

There are 5 key principles used to guide the deployment of the Elastic Data Platform:

Tailored Work Environments: isolate environments between users to ensure data integrity and reliable compute performance tailored with a variety of tools for many different workloads – assuring quality of service.

Scalability: ensure compute environment performs elastically and scales horizontally to efficiently meet business demands and deliver high quality of service.

Elastic Data Platform: Solution Details

Separating Compute and Storage

Although decoupled storage is not required with the Elastic Data Platform, once the data set becomes large, Dell EMC’s Isilon solution offers a compelling ROI and ease of use with scalability. Isilon provides several capabilities that extend the value of the Elastic Data Platform:

Separation of the storage allows for independent scaling from the compute environment

When deploying a cluster, BlueData enables the allocation of the compute clusters while Isilon HDFS provides the underlying storage for the compute clusters. Clusters can be deployed using standard profiles based on the end users’ requirements (e.g. a cluster could have high compute resources with large memory and CPU and have an average throughput storage requirement).

Decoupling storage and isolating compute gives the organization an efficient and cost-effective way to scale the solution, providing dedicated environments suited to the various users and workloads coming from the business.

Tenants within BlueData are logical groupings that can be defined by organization (e.g. Customer Service, Sales, etc.). Each tenant has dedicated resources (CPU, memory, etc.) that can then be allocated to clusters. Clusters also have their own set of dedicated resources, drawn from the tenant resource pool.
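The tenant and cluster relationship can be sketched as a simple quota model. This is an illustrative toy, not BlueData's API: the class names, fields, and numbers below are all hypothetical, and a real deployment would track more resource types.

```python
from dataclasses import dataclass, field


@dataclass
class Resources:
    cpu_cores: int
    memory_gb: int

    def fits_within(self, other: "Resources") -> bool:
        """True if this request can be satisfied by `other`."""
        return self.cpu_cores <= other.cpu_cores and self.memory_gb <= other.memory_gb


@dataclass
class Tenant:
    name: str
    quota: Resources                              # dedicated pool for this tenant
    clusters: dict = field(default_factory=dict)  # cluster name -> reserved Resources

    def available(self) -> Resources:
        """Resources in the tenant pool not yet reserved by any cluster."""
        used_cpu = sum(c.cpu_cores for c in self.clusters.values())
        used_mem = sum(c.memory_gb for c in self.clusters.values())
        return Resources(self.quota.cpu_cores - used_cpu, self.quota.memory_gb - used_mem)

    def create_cluster(self, cluster_name: str, request: Resources) -> bool:
        """Reserve resources for a cluster; refuse if the name is taken or the pool is short."""
        if cluster_name in self.clusters or not request.fits_within(self.available()):
            return False
        self.clusters[cluster_name] = request
        return True

    def delete_cluster(self, cluster_name: str) -> None:
        """Return a cluster's resources to the tenant pool."""
        self.clusters.pop(cluster_name, None)


sales = Tenant("Sales", quota=Resources(cpu_cores=64, memory_gb=256))
print(sales.create_cluster("spark-adhoc", Resources(32, 128)))  # True
print(sales.create_cluster("hadoop-etl", Resources(48, 64)))    # False: only 32 cores left
sales.delete_cluster("spark-adhoc")
print(sales.create_cluster("hadoop-etl", Resources(48, 64)))    # True
```

The point of the sketch is that a cluster can only ever draw from its own tenant's pool, so one group's workload cannot starve another's.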

Applications that are containerized via Docker can be made part of the BlueData App store and can be customized by the organization. Those application images are made available to deploy as clusters with various “Flavors” (i.e. different configurations of compute, memory, and storage.)

The data residing on HDFS is partitioned based on rules and definable policies. The physical deployment of the Isilon storage allows for tiering, and the placement of the data blocks on the physical storage is maintained by definable policies in Isilon to optimize performance, scale, and cost.

Isilon is configured to generate read-only snapshots of directories within HDFS based on definable policies.

Users gain access to the data through DataTaps in BlueData. These DataTaps are associated with Tenants and are mapped (aka “mounted”) to directories in Isilon. DataTaps can be specified as Read-only or Read/Write, and are configured to connect to both the Isilon Snapshots and the writeable scratch-pad space.

Once users have finished their work (based on informing the administrators that they are finished, or based on their environment time being up), the system frees the temporary space on Isilon and adjusts the size of the compute environment so that those resources can be made available to other users.

Centralized Security Policy Enforcement

The difficulty many organizations face with multiple users who access multiple environments, with multiple data analysis tools and data management systems, is the consistent creation and enforcement of data access policies. Often, these systems have different, inconsistent authorization methods. For example, a Hadoop cluster may be Kerberized, but the MongoDB cluster may not be, and the Google BigQuery engine would have its own internal engine. This means that administrators must create policies for each data platform and independently update them every time there is a change. In addition, if there are multiple Hadoop clusters or distributions, the administrator must define and manage the data access of each one independently and risk inconsistency across the systems.

The solution is to leverage a centralized security policy creation and enforcement engine, such as BlueTalon. In this engine, the administrator simply creates policies once by defining the access rules (i.e. Allow, Deny, and Mask) for each of the different roles and attributes for the users accessing the system. Then, distributed enforcement points are deployed to each of the data systems that enforce the centralized policies against the data. This greatly simplifies the overall Big Data environment and allows for greater scalability while maintaining governance and compliance without impacting user experience or performance.
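The "define once, enforce everywhere" idea can be illustrated with a small sketch. It is not BlueTalon's API: the roles, column names, policy table, and the default-deny rule are assumptions made up for this example.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    DENY = "deny"
    MASK = "mask"


# Central policy store: (role, column) -> action, defined once by the administrator.
# Roles and columns here are hypothetical.
POLICIES = {
    ("analyst", "customer_name"): Action.ALLOW,
    ("analyst", "ssn"): Action.MASK,
    ("support", "customer_name"): Action.ALLOW,
}


def decide(role: str, column: str) -> Action:
    """Look up the central policy; anything not explicitly listed is denied (an assumption)."""
    return POLICIES.get((role, column), Action.DENY)


def enforce(role: str, row: dict) -> dict:
    """Enforcement point: apply the central decisions to one record before returning it."""
    result = {}
    for column, value in row.items():
        action = decide(role, column)
        if action is Action.ALLOW:
            result[column] = value
        elif action is Action.MASK:
            result[column] = "****"
        # DENY: the column is omitted entirely
    return result


record = {"customer_name": "Ada", "ssn": "123-45-6789"}
print(enforce("analyst", record))  # {'customer_name': 'Ada', 'ssn': '****'}
print(enforce("support", record))  # {'customer_name': 'Ada'}
```

Every data system would call the same `decide` logic through its own enforcement point, so a policy change made in one place takes effect everywhere.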

Elastic Data Platform: Conclusion

The Elastic Data Platform approach from Dell EMC is a powerful and flexible solution to help organizations get the most out of their existing Big Data investments while providing scalability, elasticity, and compliance to support the ever-growing needs of the business. Based on key principles, the approach provides the ease and speed of provisioning that the business needs and the simplicity of deployment and cost sensitivity that IT requires, while ensuring that everything follows the governance and compliance rules required by the organization.

Organizations are ready to move from Chaos to the Olympian Gods. With the Elastic Data Platform any organization can embrace a better solution without worrying about existing investments and valuable customizations.

Some might wonder what the best time might be to begin their journey to the Olympian Gods. Well, there is a Chinese saying that the best time to plant a tree was twenty years ago! But the next best time to plant a tree is now!

Big Data Challenge Part I – The Digital and Analog Chasm

https://infocus.dellemc.com/anil_inamdar/big-data-challenge-part-i-the-digital-and-analog-chasm/

Tue, 01 Dec 2015

I am sharing my thoughts on Big Data challenges that an organization might face. I have divided the challenges into three broad parts. This blog will discuss the first challenge, Part I – The Digital and Analog Chasm.

We live in a digital world, or do we?

You must be wondering what the meaning of the above statement is. Let me start with a simple experiment.

Tell me what you infer from the following string of numbers:

1207194109112001

This is what they call “data”. What can you infer, anything?

Let me help you out a bit. How about now:

12071941 09112001

Still nothing? Let me try again. How about now:

12/07/1941 and 09/11/2001

Cool. You see that these are dates. But can you remember them? Is there anything significant about these dates? Let me make these dates unforgettable:

Attack on Pearl Harbor (World War II): 12/07/1941

Attack on World Trade Center: 09/11/2001

The long string of numbers had no meaning at first, but with a few modifications, it creates an association to something meaningful. The string suddenly became unforgettable! That’s the way the human brain works.
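The steps from raw digits to meaning can also be sketched in code. This is only an illustration of the experiment above: the month/day/year format is an assumption about how the digits should be read, and the labels are the ones used in this post.

```python
from datetime import datetime

raw = "1207194109112001"

# Step 1: split the 16-digit string into two 8-digit chunks.
chunks = [raw[i:i + 8] for i in range(0, len(raw), 8)]

# Step 2: interpret each chunk as a month/day/year date (assumed format).
dates = [datetime.strptime(chunk, "%m%d%Y").date() for chunk in chunks]

# Step 3: attach the context that makes the dates memorable.
events = {
    dates[0]: "Attack on Pearl Harbor",
    dates[1]: "Attack on World Trade Center",
}

for day, label in events.items():
    print(f"{day:%m/%d/%Y}: {label}")
```

Notice that the program can do the first two steps mechanically, but the third step, the association that makes the data meaningful, has to be supplied by a person.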

Wasn’t that cool? I borrowed this example from Joshua Foer’s interesting book on the human brain and memory – “Moonwalking with Einstein”. The human brain is a miracle in terms of its memory, reasoning and decision-making capabilities, but it is incapable of deciphering large and unrelated datasets. It can recognize patterns, it can find connections and it can interpret new things in old “schemas”. I will explain each in a moment.

Along with the propensity for pattern recognition, finding connections and interpreting “schema”, visual senses make up to 90% of our perception. The size, color and shapes of objects appeal to our subconscious visual ability. Successful visuals can also represent large amounts of data in a small area. When it comes to the brain analyzing data, there is certainly truth to the saying that a “picture is worth a thousand words” (and many thousands of numbers!).

As you can see, the ways the human brain interprets data (patterns, connections, schema and visual perception) are all analog in nature! Let me spend a few minutes describing each.

Patterns

Our ability to recognize patterns and deduce meaning from them is hardwired into us. Even early in our evolution, this was crucial for navigating and surviving a dangerous world. Our brain tells us that a pattern that is thin, long and moving in zigzag fashion might be a snake. It could be a rope moving in the wind, but the brain does not have time to evaluate and always errs on the safe side. Michael Shermer, author of “The Believing Brain”, describes human brains as “evolved pattern-recognition machines that connect the dots and create meaning out of the patterns we think we see in nature”.

Our ability to recognize patterns and find meaning helps us decipher a large amount of data, but at the same time it has many shortcomings that one must be cognizant of. Our pattern recognition ability can help us solve a crime, find a new market or identify a challenge with a process. At the same time, it may force us to put people in groups based on religion or race, see “Mother Mary” in a piece of bread, spot a human face on Mars, and so on. Following is a list of some of the common examples of visual pattern recognition that we have come to know:

A red car must go fast!

Orange sign means danger

Shape of the dog defines its demeanor.

Clothes may define a profession, and so on.

In the example in the beginning, we were able to recognize the pattern of dates and also interpret them as the two worst attacks on the US.

Connections

Our brain is also good at connecting two or more seemingly unrelated things and creating insight that may have value. A famous example of a useful connection is the invention of the printing press by Johannes Gutenberg. The idea of printing itself was not new. The Chinese had been experimenting with block printing for centuries. What Gutenberg did that revolutionized the world was to join the two ideas of the wine press and the coin punch (in reality he combined several ideas but I am keeping it simple). The coin punch was used to leave an image on a small area while the wine press was used to apply force over a large area to crush grapes. Gutenberg imagined small coin punches arranged in a pattern and pressed by a wine press, and the “movable type” printing press was born.

There are several business examples that illustrate connections. A modern example is an insightful executive at an insurance company who realized that many motorcycle owners are middle-aged men with safe driving records who ride their bikes sparingly, on a few occasions. There was no point in combining them with traditional high-risk motorcycle riders. The human brain is pretty good at “connecting” middle age (seemingly calm demeanor), a safe car driving record and occasional bike riding into a new “segment” that the company can go after with reduced premiums.

Schema and Visual Perception

Another useful brain characteristic to know is “schema”. It is easier for human beings to understand something new in terms of something known. Let me give you an example from Chip and Dan Heath’s delightful book “Made to Stick”. They give an interesting example of defining a “pomelo”. One could describe a pomelo as a “large citrus fruit with a thick, but soft rind”, or as “a pomelo is basically a super-sized grapefruit with a very thick and soft rind.” When you hear the first explanation that it is a large citrus fruit, you are still struggling to picture what a pomelo is but as soon as you hear that it is a super-sized grapefruit, you have a good mental picture. In this case we used the grapefruit “schema” to describe pomelo.

Visual Perception

Human beings have been collecting and analyzing data for centuries. From Egyptian hieroglyphs to the current blogs, human beings have recorded their thoughts, observations, implications, interpretations and anything and everything under the sun. And, since the beginning of time, we have continued to analyze what was written, be it the Bible, Hammurabi’s code, Machiavelli’s Prince, gold prices, the stock market and so on. Storing and analyzing information (data) is almost second nature to us. But something happened around the turn of the millennium.

The ability to “collect” data has increased exponentially while the ability to “analyze” data has remained more or less the same. It is true that many statistical algorithms, analytical models and visualization techniques have emerged that help us translate huge data into “patterns”, “connections” and “schemas”, but the fact remains that data collection has gone “digital” while the ability to interpret it is still “analog”.

This is what I call the Digital and Analog Chasm.

It’s true that we live in a digital world. Technology is taking over all aspects of business and all sorts of businesses are becoming digital. In 1996 only 1% of the world’s data was digital; everything else was analog. But by 2007 digital data had skyrocketed to 94% and analog was left at a small 6% of the total volume.

What happened?

See the graph below and you will see that something happened around the turn of the millennium. Any guesses?

It’s not difficult to guess. Before the year 2000 most of the data was analog – hard to believe but it’s true. We used to read books on paper, listen to music on cassettes and watch movies on video tapes. TV and cable signals were transmitted in analog, and our photographs were on analog film and printed on paper. Sure, there were computers storing data in rows and columns, but compared to the amount of analog data it was nothing. Let’s not forget that there were no smartphones, Twitter, Facebook and the whole social media.

But soon after the year 2000, things started to change. The iPod became a sensation and music became digital. Digital cameras became popular and our photographs became 0’s and 1’s. Digital camcorders were introduced and soon movies and videos became digital. TV and cable transmission became digital. Multimedia was now stored on connected hard drives rather than on bookshelves. Kindle was introduced and it revolutionized the publishing business. Suddenly all our books became digital. Google became the premier search engine, generating petabytes of web logs. A Harvard dropout launched Facebook. Twitter was introduced and several social media sites exploded all over the net. People started writing blogs. The world got truly connected. The proliferation of devices that are constantly collecting data and are connected to the internet has been growing steadily for the last several years. We collectively call them the Internet of Things (IoT). See the picture below:

Everything is logged and ready for interpretation. There are various logs – application logs, clickstream data, call logs, system logs, audit logs, sensor logs, blogs, and social media exchanges. These logs are in different formats – audio, video, text, numbers, binary and more. The data is generated at a tremendous speed. A study was done to estimate the data generated from the beginning of time through 2003. The estimated size of this was around 5 exabytes. We now generate that much data every 2 days! See the graph below that highlights the digitization of all data.
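To get a feel for that pace, here is a back-of-the-envelope calculation. It simply takes the two figures quoted above (5 exabytes, every 2 days) at face value and assumes the decimal definition of an exabyte; it is not an independent measurement.

```python
EXABYTE = 10**18  # bytes, decimal definition (an assumption)

baseline_bytes = 5 * EXABYTE  # estimate quoted above for all data through 2003
period_days = 2               # the same volume is said to be generated every 2 days

# Convert the quoted pace into per-day, per-second and per-year rates.
per_day = baseline_bytes / period_days
per_second = per_day / (24 * 60 * 60)
per_year_eb = per_day * 365 / EXABYTE

print(f"Per day:    {per_day / EXABYTE:.1f} EB")
print(f"Per second: {per_second / 10**12:.1f} TB")
print(f"Per year:   {per_year_eb:.1f} EB")
```

At that pace, a single year produces well over a hundred times the volume that the study attributed to all of history through 2003.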

The exponential growth in generating and storing digital data has not been matched by our ability to access and analyze data. The analytical methods, visualization techniques and our hardwired ability to recognize patterns, find “connections” and interpret “schema” have remained more or less the same. Sure, new tools and techniques will try to narrow the gap, but our evolutionary limitations of pattern matching will always be the weakest link.

So how do you bridge the Digital and Analog Chasm? It definitely takes more than statistical techniques and promising visualization tools. Agreed, Big Data is a digital battle, but it will be won the analog way! Yes, our ability to store data is critical, but our ability to analyze and visualize is equally important. I will share solutions to overcome the chasm in my upcoming blogs.