Confessions of an Oracle Database Junkie - Arup Nanda
The opinions expressed here are mine and mine alone. They may not necessarily reflect those of my employers and customers, past or present. The comments left by reviewers are theirs alone and may not reflect my opinion, implied or otherwise. None of the advice is warranted to be free of errors and omissions. Please use at your own risk and only after thorough testing in your environment.


Friday, June 07, 2013

Demystifying Big Data for Oracle Professionals

Ever wonder what Big Data actually is, especially if you are already an
Oracle Database professional? Or do you get lost in the jargon warfare
that spews out terms like Hadoop, Map/Reduce and HDFS? In this post I
will attempt to explain these terms from the perspective of a
traditional database practitioner who is suddenly thrown into the Big
Data waters. I was inspired to write it by the recent incident of the
NSA scooping up Verizon call data and the privacy concerns that
followed.

[Image: Big Data books]
It's not news anymore that people are reacting in different ways to
the revelation that the National Security Agency accessed the phone
records of Verizon customers - cell phones and land lines - to learn
who called whom. But one thing is common: all these reactions are
pretty intense. While the debate lingers on whether the government
overstepped its boundaries in accessing the records of private
citizens or whether it was perfectly justified given the threats the
country faces right now - and that debate will go on for some time - I
had a different thought of my own. I was wondering about the
technological aspects of the case. The phone records must be massive.
What kind of tools and technologies must the folks in the "black
room", a.k.a. Room 641A, have used to collate and synthesize those
records into meaningful intelligence - the very thing the NSA is
supposed to gather and act upon? Whatever else it may be, it certainly
piques the interest from an engineering point of view.

And that brings us to the aspect of computation involving massive
amounts of data. It's one thing to slice and dice a finite, reasonably
sized dataset; but phone records - especially when collated with other
records such as financial and travel records to identify criminal or
pseudo-criminal activities - are a different matter, and traditional
databases such as Oracle and DB2 likely will not scale well. The
challenge is not just in the realm of romanticized espionage; it's
very much a concern for purely un-romantic corporations which, among
other things, want to track customer behavior to fine-tune their
service and product offerings. Well, that's espionage of a slightly
different kind. The sources of data are different - website logs and
Facebook feeds as opposed to phone records - but the challenge is the
same: how to synthesize enormous amounts of seemingly unrelated data
into meaningful intelligence. This is the challenge of "Big Data".

Meet the V's

Several years ago, Yahoo! faced this very issue - how to measure the
attractiveness of the webpages it puts on its portal and analyze the
pattern of user clicks, so it could attract advertisers to place
relevant ads on those pages. Google had a similar challenge: indexing
the entire World Wide Web on its servers so that it could present
search results very, very quickly. Both of these problems had aspects
that others probably had not faced before. Here are the relatively
unique characteristics of this data, which have come to be known as
the "Three V's of Big Data".

Volume - the sheer mass of the data makes it difficult, if not
impossible, to sort through

Velocity - the data is highly transient. Website logs, for example,
are relevant only for the time period they cover; a different period
produces different data

Variety - the data is not pre-defined and not quite structured - at
least not the way we think of structure when we think of relational
databases

Both companies realized they were not going to address these
challenges with traditional relational databases, at least not at the
scale they wanted. So they developed tools and technologies to address
these very concerns. They took a page from the super-computing
paradigm of divide and conquer. Instead of processing the dataset as a
whole, they divided it into smaller chunks to be processed by
hundreds, even thousands, of small servers. This approach solved three
basic, crippling problems:

There was no need to use large servers, which typically cost a lot
more than small servers

There was built-in data redundancy, since the data was replicated
between the small servers

Most important, it could scale well - very well - simply by adding
more of those small servers

This is the fundamental concept that gave rise to Hadoop. But before
we cover that, we need to learn about another important concept.
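
To make the divide-and-conquer idea concrete, here is a minimal Python
sketch of my own (the data and function names are invented purely for
illustration, not taken from any real system): the dataset is dealt
out into chunks, each chunk is processed independently - as a separate
small server would - and the partial results are combined at the end.

# Toy illustration of divide and conquer: split the work, process each
# chunk independently (as separate small servers would), then combine.
def split_into_chunks(records, num_chunks):
    """Deal the records out round-robin into num_chunks buckets."""
    chunks = [[] for _ in range(num_chunks)]
    for i, rec in enumerate(records):
        chunks[i % num_chunks].append(rec)
    return chunks

def process_chunk(chunk):
    """Work done by one small server: count calls made per phone number."""
    counts = {}
    for caller, _callee in chunk:
        counts[caller] = counts.get(caller, 0) + 1
    return counts

def combine(partial_results):
    """Merge the per-server counts into one final result."""
    total = {}
    for partial in partial_results:
        for caller, n in partial.items():
            total[caller] = total.get(caller, 0) + n
    return total

call_records = [("555-0101", "555-0202"), ("555-0101", "555-0303"),
                ("555-0404", "555-0101")]
chunks = split_into_chunks(call_records, num_chunks=3)
print(combine([process_chunk(c) for c in chunks]))
# {'555-0101': 2, '555-0404': 1}

Each chunk could also be copied to a second server, which is where the
built-in redundancy mentioned in the list above comes from.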

Name=Value Pairs

A typical relational database works by logically arranging the data
into rows and columns. Here is an example. You decide on a table
design to hold your customers, named simply CUSTOMERS. It has the
columns CUST_ID, NAME, ADDRESS and PHONE. Later, your organization
decides to provide some incentives to the spouses as well, so you add
another column - SPOUSE.

Everything is well until you discover that customer 1 and the spouse
are divorced and there is a new spouse now. However, the company
decides to keep the names of ex-spouses as well, for marketing
analytics. Like a good relational designer, you decide to break SPOUSE
away from the main table and create a new table - SPOUSES - which is a
child of CUSTOMERS, joined by CUST_ID. This requires massive code and
database changes, but you survive. Later you have the same issue with
addresses (people have different addresses - work, home, vacation,
etc.) and phone numbers (cell phone, home phone, work phone,
assistant's phone, etc.). So you decide to break them into separate
tables as well. Again, code and database changes. And the changes do
not stop there. You have to add more tables to record hobbies,
associates, weights, dates of birth - the list is endless. Everything
you record requires a database change and a code change. Worse, not
all the tables will be populated for every customer. In your company's
quest to build a 360-degree view of the customer, you collect some
information; but there is no guarantee that every data point will be
gathered, so you are left with sparse tables. Now, suddenly, someone
says yet another attribute is required for the customer - professional
associations. So off you go - build yet another table, followed by
change control and code changes to incorporate it.

If you look at the scenario above, you will find that the real issue
is trying to force a structure around a dataset that is inherently
unstructured - akin to fitting a square peg into a round hole. The
lack of structure is what makes the data agile and useful; but that
same lack of structure is also what makes it a poor fit for a
relational database that demands structure. This is the primary issue
if you want to capture social media data - Twitter feeds, Facebook
updates, LinkedIn updates and Pinterest posts. It's impossible to
predict in advance, at least accurately, the exact information you
will see in them. So putting a structure around the data storage makes
life difficult for everyone - the DBAs constantly need to alter the
structures and the developers/designers constantly wait for the
structure to be in the form they want - slowing down the capture and
analysis of data.

So, what is the solution? Think about how we - human beings - process
information. Do we parse information in the form of rows in some
table? Hardly. We process and store information by association. For
instance, let's say I have a friend John. I probably have nuggets of
information like this:

Last Name = Smith

Lives at = 13 Main St, Anytown, USA

Age = 40

Birth Day = June 7th

Wife = Jane

Child = Jill

Jill goes to school = Top Notch Academy

Jill is in Grade = 3

... and so on. Suppose I meet another person - Martha - who tells me
that her child also goes to Grade 3 in Top Notch Academy. My brain
probably goes through a sequence like this:

Search for "Top Notch Academy"

Found it. It's Jill

Search for Jill.

Found it. She is child of John

Who is John's wife?

Found it. It's Jane.

Where do John and Jill live? ...

And finally, after all this processing is over, I say to Martha as
part of the conversation: "What a coincidence! Jill - the daughter of
my friends John and Jane Smith - goes there as well. Do you know
them?" "Yes, I do," replies Martha. "In fact they are in the same
class, that of Mrs. Gillen-Heller."

Immediately my brain processed this new piece of information and
filed the data as:

Jill's Teacher = Mrs. Gillen-Heller

Jane's Friend = Martha

Martha's Child Goes to = ...

Months later, I meet with Jane and mention to her that I met Martha
whose child went to Mrs. Gillen-Heller's class - the same one as
Jill. "Glad you met Martha,", Jane says. "Oh, Jill is no longer in
that class. Now she is in Mr. Fallmeister's class."

Aha! My brain probably stored that information as:

Jill's former teacher = Mrs. Gillen-Heller

And it updated the already stored information:

Jill's Teacher = Mr. Fallmeister

This is called storing data as name=value pairs. You see, I stored the
information as a pair: a property and its value. As more information
comes in, I keep adding more and more pairs. When I need to retrieve
information, I just look up the proper property and, by association,
get all the data I need. Storing data as name=value pairs gives me
enormous flexibility to record all kinds of information
without modifying any data structures I may currently have.
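
As a minimal sketch of the same idea in Python (the structure and
names here are my own, purely for illustration, not any particular
NoSQL product's API), each entity is just a bag of name=value pairs,
so new attributes and updates require no schema change:

# Facts about people stored as name=value (key=value) pairs.
# There is no predefined schema; any attribute can be added at any time.
facts = {
    "John":   {"Last Name": "Smith", "Wife": "Jane", "Child": "Jill"},
    "Jill":   {"School": "Top Notch Academy", "Grade": 3,
               "Teacher": "Mrs. Gillen-Heller"},
    "Martha": {"Child Goes To": "Top Notch Academy"},
}

# New information arrives: just add more pairs, no ALTER TABLE needed.
facts["Jane"] = {"Friend": "Martha"}

# Updated information: keep the old value under a new name, then overwrite.
facts["Jill"]["Former Teacher"] = facts["Jill"]["Teacher"]
facts["Jill"]["Teacher"] = "Mr. Fallmeister"

# Retrieval works by association: who attends "Top Notch Academy"?
students = [name for name, attrs in facts.items()
            if attrs.get("School") == "Top Notch Academy"]
print(students)   # ['Jill']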

This is also how Big Data is tamed for processing. Since the data
coming off Twitter, Facebook, LinkedIn, Pinterest, etc. is impossible
to categorize in advance, it is practically impossible to put it all
in relational format. Therefore name=value pair storage is the logical
step in compiling and collating the data. The name is also known as
the "key", so the model is sometimes called the key-value pair model.
The value doesn't have to have a datatype; in fact it's probably a
BLOB, so anything can go in there - booking amounts, birth dates,
comments, XML documents, pictures, audio, even movies. It provides
immense flexibility in capturing information that is inherently
unstructured.

NoSQL Database

Now that you know about name=value pairs, the next logical question is
- how do we store them? Our idea of a database is typically colored by
our long-standing association with relational databases, making them
almost synonymous with the concept of a database. Before relational
databases existed, even as a concept, big machines called mainframes
ruled the earth, and the databases inside them were stored in
hierarchical format. One such hierarchical database from IBM was
IMS/DB. Later, when relational databases were up and coming, another
type of database - called a network database - was developed to
compete against them. An example of that category was IDMS (now owned
by Computer Associates), developed for mainframes. The point is,
relational databases were not the answer to all the questions then;
and it is clear that they are not now either.

This led to the development of a different type of database technology
based on the key-value model. Relational database systems are queried
with the SQL language, which I am sure is familiar to almost anyone
reading this blog post. SQL is a set-oriented language - it operates
on sets of data. In the key-value pair model, however, that does not
work anymore. Therefore these key-value databases are usually known as
NoSQL databases, to separate them from their relational, SQL-based
counterparts. Since their introduction, some NoSQL databases have
actually added SQL support, which is why "NoSQL" is no longer a
strictly accurate term; it is therefore sometimes read as "Not only
SQL". The point is that their structure is not relational, and how
exactly the data is stored is usually left to the implementer. Some
examples are MongoDB, Dynamo and Bigtable (from Google).

I would stress here that almost any type of non-relational database
can be classified as NoSQL, not just the name=value pair models. For
instance, an object database such as ObjectStore is also NoSQL. But
for this blog post, I am considering only key-value pair databases as
NoSQL.

Map/Reduce

Let's summarize what we have learned so far:

The key-value pair model in databases offers flexibility in data
storage without the need for a predefined table structure

The data can be distributed across many machines, where it is
independently processed and then collated.

When the system gets a large chunk of data, e.g. a Facebook feed, the
first task is to break it down into these keys and their corresponding
values. Later, the values may be collated into a summary result. The
process of dividing the raw data into meaningful key-value pairs is
known as "mapping". Combining the values to form summaries, or simply
eliminating the noise from the data to extract meaningful information,
is known as "reducing". For instance, you might see both "Name" and
"Customer Name" among the keys; they mean the same thing, so you
reduce them to a single key - "Name". The two operations are almost
always used together; hence the combined operation is known as
Map/Reduce.

Here is a very rudimentary but practical example of Map/Reduce.
Suppose you get Facebook feeds and you are expected to find the total
number of likes for your company's recent post. The Facebook feed
comes in the form of a massive dataset. The first task is to divide it
among many servers - the principle described earlier to make the
process scale well. Once the dataset is divided, each machine runs
some code to extract and collate the information and then presents the
result to a central coordinator for the final collation. Here is
pseudo-code for the process each server runs on its subset of the
data:

begin
  while (there_are_remaining_posts) loop
    get post
    extract status of "like" for the specific post
    if status = "like" then
      like_count := like_count + 1
    else
      no_like_count := no_like_count + 1
    end if
  end loop
end

Let's name this program counter(). counter() runs on all the servers,
which are called nodes. As shown in the figure, there are three nodes.
The raw dataset is divided into three sub-datasets, which are then fed
to the three nodes, one each. A copy of each sub-dataset is kept on
another server as well; that takes care of redundancy. Each node
performs its computation and sends its results to an intermediate
result set, where they are collated.

[Figure: Map/Reduce Processing]
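
To make the flow concrete, here is a small, runnable Python sketch of
my own (a toy simulation, not actual Hadoop code): each "node" maps
its sub-dataset into key-value pairs, and a coordinator reduces the
intermediate pairs into the final count of likes.

# Toy simulation of the Map/Reduce flow described above.
# The data, node count and function names are made up for illustration.
posts = [  # a "Facebook feed": (post_id, action) pairs
    ("post-42", "like"), ("post-42", "comment"), ("post-42", "like"),
    ("post-42", "like"), ("post-99", "like"), ("post-42", "share"),
]

def map_phase(sub_dataset):
    """Run on each node: emit a ("like", 1) pair for every like on our post."""
    return [("like", 1) for post_id, action in sub_dataset
            if post_id == "post-42" and action == "like"]

def reduce_phase(intermediate_pairs):
    """Run by the coordinator: sum the values for each key."""
    totals = {}
    for key, value in intermediate_pairs:
        totals[key] = totals.get(key, 0) + value
    return totals

# Divide the raw dataset among three "nodes".
nodes = [posts[0::3], posts[1::3], posts[2::3]]

# Each node maps its own chunk; the coordinator collates and reduces.
intermediate = [pair for chunk in nodes for pair in map_phase(chunk)]
print(reduce_phase(intermediate))   # {'like': 3}

The map step and the reduce step are exactly the two pieces of logic
you would supply yourself to a framework such as Hadoop, described
later in this post.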

How does this help? In many ways. Let's see:

(1) First, since the data is stored in chunks and a copy of each chunk
is stored on a different node, there is built-in redundancy. There is
no need to separately protect the data being fed, since a copy is
available elsewhere.
(2) Second, since the data is available elsewhere, if a node fails,
other nodes simply pick up the slack. There is no need to reshuffle
the data or restart the job.
(3) Third, since the nodes all perform their tasks independently, when
the data size grows, all you have to do is add a new node. The data
will then be divided four ways instead of three, and so will the
processing load.

This is very similar to parallel query processing in Oracle Database,
with PQ servers being analogous to nodes.

There are two very important points to note here:

(1) The subset of data a node gets does not need to be visible to all
the other nodes. Each node gets its own set of data to process, and a
copy of that subset is maintained on a different node - making
simultaneous access to the data unnecessary. This means you can keep
the data on local storage rather than on expensive SANs. That not only
brings the cost down significantly but may also perform better, thanks
to local access. As the cost of solid state devices and flash-based
storage plummets, the storage cost per unit of performance will get
even better.
(2) The nodes need not be super fast. A relatively simple
commodity-class server is enough for the processing, as opposed to a
large server. Servers are typically priced by class; e.g., an
enterprise-class server with 32 CPUs is probably roughly equivalent in
performance to eight 4-CPU blades, but it costs way more than eight
times the cost of a blade. This model takes advantage of cheaper
computers by scaling horizontally, not vertically.

Hadoop

Now that you know how processing data in parallel using a concept
called Map/Reduce lets you throw compute-intensive applications at
large amounts of data, you may well wonder - there are a lot of moving
parts to be taken care of just to enable this process. In a monolithic
server environment you just kick off multiple copies of a program; the
operating system does the job of scheduling those programs on the
available CPUs, taking them off the CPU (context switching) to roll in
another process, preventing processes from corrupting each other's
memory, and so on. When the processes run on multiple computers,
something has to do all of that coordination. For instance, in this
model you have to ensure that the jobs are split between the nodes
reasonably equally, that the dataset is split equitably, that the
queues for feeding data to and getting data back from the Map/Reduce
jobs are properly maintained, that jobs fail over in case of node
failure, and so on. In short, you need an operating system of
operating systems to manage all these nodes as if they were one
monolithic processor.

What if these operating procedures were already defined for you? That
would make things really easy, wouldn't it? You could then focus on
what you are good at - developing the procedures to slice and dice the
data and derive intelligence from it. This "procedure", or framework,
is available, and it is called Hadoop. It's an open source offering,
similar to Mozilla and Linux; no single company has exclusive
ownership of it. However, many companies have adopted it and evolved
it into their own offerings, similar to Linux distributions such as
Red Hat, SuSE and Oracle Enterprise Linux. Some of those companies are
Cloudera, Hortonworks, IBM, etc. Oracle does not have its own Hadoop
distribution; instead it licenses Cloudera's for its Big Data
Appliance. The Hadoop framework runs on all the nodes of the cluster
and acts as the coordinator.

A very important point to note here is that Hadoop is just a
framework, not the actual program that performs Map/Reduce. Compare
that to the operating system analogy: an OS like Windows does not
offer a spreadsheet. You need to either develop one or buy an
off-the-shelf product such as Excel to get that functionality.
Similarly, Hadoop offers a platform to run the Map/Reduce programs
that you develop; you put the logic of what you "map" and how you
"reduce" in your own code.

Remember another important advantage of this model mentioned earlier -
the ability to replicate data between multiple nodes so that the
failure of a single node does not cause the processing to be
abandoned. This is offered through a new type of filesystem called the
Hadoop Distributed File System (HDFS). HDFS, which is a distributed
(not a clustered) filesystem, by default keeps three copies of the
data on three different nodes - two on the same rack and the third on
a different rack. The nodes communicate with each other using an
HDFS-specific protocol built on TCP/IP. The nodes are aware of the
data present on the other nodes, which is precisely what allows the
Hadoop job scheduler to divide the work among the nodes. Oh, by the
way, HDFS is not absolutely required for Hadoop; but as you can see,
HDFS is how Hadoop knows which node has what data for smart job
scheduling. Without it, the division of labor will not be as
efficient.
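
Purely as an illustration of the placement policy just described (this
is my own toy sketch, not Hadoop's actual code), the "two copies on
one rack, third on a different rack" rule could look like this:

# Toy sketch of HDFS-style replica placement: three copies of each block,
# two on one rack and the third on a different rack. Node and rack names
# are invented; real HDFS placement logic is considerably more involved.
import random

nodes = {                                  # node -> the rack it sits in
    "node1": "rackA", "node2": "rackA",
    "node3": "rackB", "node4": "rackB",
}

def place_replicas(block_id):
    """Return three nodes for a block: two on one rack, one on another."""
    first = random.choice(list(nodes))
    same_rack = [n for n in nodes if nodes[n] == nodes[first] and n != first]
    other_rack = [n for n in nodes if nodes[n] != nodes[first]]
    return [first, random.choice(same_rack), random.choice(other_rack)]

print(place_replicas("block-0001"))        # e.g. ['node3', 'node4', 'node2']

Because every node knows where each block lives, the scheduler can
send the computation to a node that already holds the data - the smart
job scheduling mentioned above.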

Hive

Now that you have seen how Hadoop fills a major void for computations
on massive datasets, you can't help but see its significance for data
warehouses, where massive datasets - and jobs that churn through them
- are common. However, there is a little challenge. Remember the NoSQL
databases mentioned earlier? Many of them do not support SQL; to get
the data out you have to write a program using the APIs the vendor
supplies. This may reek of the COBOL programs of yesteryear, where you
had to write a program to get data out, making the whole process
inefficient and highly programmer-driven (although it did shore up job
security, especially during the Y2K transition). The inefficiency of
that approach gave rise to fourth-generation languages like SQL, which
brought the power of queries to common users and took it away from
programmers. In other words, it brought the data and its users closer,
significantly reducing the role of the middleman. In data warehouses
this was especially valuable, because power users issue queries based
on the results of their previous queries. It is like a conversation -
ask a question, get the answer, formulate the next question - and so
on. If that conversation depended on writing programs, it could never
be effective.

With that in mind, consider the implications of the lack of SQL in
these databases, which are otherwise highly suitable for data
warehouses. The requirement to write a program to get the data every
time would take us straight back to the COBOL days. Well, not to
worry: the Hadoop ecosystem has another product, Hive, which offers a
SQL-like language called HiveQL. Just as users can query relational
databases with SQL very quickly, HiveQL allows users to get data for
analytical processing directly. It was initially developed at
Facebook.

Comparing to Oracle RAC

When we talk about clustering in Hadoop, you can't help but wonder -
isn't the same functionality provided by Oracle Real Application
Clusters? Well, the short answer is a big, resounding NO. Oracle RAC
combines the power of multiple nodes in a cluster which communicate
with one another and transfer data (cache fusion); but that's where
the similarity ends. The biggest difference is the way the datasets
are accessed. In RAC, the datasets are common, i.e. they must be
visible to all the nodes of the cluster. In Hadoop, the datasets are
specific to individual nodes, which allows them to be local. The
filesystems in RAC can't be local; they have to be clustered or
globally available, either through a clustered filesystem, shared
volumes, a clustered volume manager (such as ASM) or NFS mounting. In
Hadoop the local files are replicated to other nodes, which means
there is no reason to create RAID-level protection at the storage
layer. ASM does provide software-level mirroring, which may sound
similar to Hadoop's replication; but remember, ASM's mirrors are not
node-specific. The mirrored copies in ASM must be visible to all the
nodes. There is a preferred-read concept in ASM, but that simply means
a specific node reads from a particular mirror copy. The mirror copies
can't be local; they must be globally visible.

Besides, Oracle RAC is for a relational database; a Hadoop cluster is
not one. Traits such as transactional integrity and multiple
concurrent writes are innate features of any modern relational
database system. In Hadoop, these things are not present and not
really necessary. So while both manage data across many machines, a
comparison may not be fair to either Hadoop or RAC. They are like
apples and tomatoes. (A tomato is technically a fruit, in case you are
wondering about the analogy.)

The Players

So, who are the players in this new branch of data processing? The
major ones are shown below with a small description each. This list is
by no means exhaustive; it is simply a collection of companies and
products I have studied.

Cloudera - they have their own distribution called, what else,
Cloudera's Distribution for Hadoop (CDH). But perhaps the most
impressive offering from them is Impala - a real-time, SQL-like
interface to query data in a Hadoop cluster.

Hortonworks - many of the folks who founded this company came from
Google and Yahoo!, where they built or added to the building blocks of
Hadoop.

IBM - they have a suite called BigInsights, which includes their
distribution of Hadoop. This is one of the very few companies that
offer both the hardware and the software. The most impressive feature
from IBM is a product called Streams, which can mine data from an
unstructured stream such as Facebook in real time and send alerts and
data feeds to other systems.

Dell - like IBM, they also have a hardware/software combination
running a licensed version of Cloudera's distribution, along with
Pentaho Business Analytics.

EMC

MapR

Conclusion

If the buzzing of the buzzwords surrounding any new technology annoys
you, and all you can find is tons of websites on the topic but no
small, consolidated compilation of terms, you are just like me. I was
frustrated by the lack of information in digestible form on these
buzzwords, which are too important to ignore but would take too much
time to understand fully. This is my small effort to bridge that gap
and get you going on your quest for more information. If you have
hundreds or thousands of questions after reading this, I will
congratulate myself - that is precisely my objective. How does HiveQL
differ from SQL, or how are Map/Reduce jobs written? These are the
questions that should be flying around in your mind now. My future
posts will cover them, along with topics like HBase, Zookeeper, etc.,
to unravel the mysteries around a technology that is going to be
commonplace in the very near future.

Welcome aboard. I wish you luck in learning. As always, your
feedback will be highly appreciated.