to analyze Instagram posts
and Twitter hashtags to better
understand their customers'
buying habits. The problem is that
these types of unstructured data
don't fit into a standard relational
database. Entering specific aspects
of one of these data types, such
as Twitter hashtags, may provide
some information, but doing so
also substantially reduces the
richness of the format. Worse,
it removes flexibility for future
analysis, taking time and costing
money while limiting potential
future benefit. Concerns about
these issues have led to the
emergence of data lakes.
A data lake is a data repository
that stores raw data of multiple
types, side by side, each in its
native format. Data lakes enable
organizations to leverage synergies
among different types of data.
"The flexibility of data lakes and
modern databases gives us the
ability to augment traditional
relational analytics with things
like text analytics that can provide
context and more meaning for
the data," says Linton Ward,
Distinguished Engineer for
OpenPOWER Solutions. "I can
store data that I may not know
the immediate use of, so I can do
exploration in the future."
Data lakes are different from
data warehouses, which are databases designed
for structured data. The data that goes into a
data warehouse must be cleansed, formatted and
managed. Data lakes ingest data in its raw form.
It's important to note that one doesn't replace the
other. Rather, data lakes are increasingly used to
enhance data warehouses. "The data lake is kind
of that central place (I call it the data plain) that
augments the traditional data warehouse," says
Ward. "It may also be augmented by specialized data
stores, but it's the central place conceptually of
the data plain, serving the analytics that support the
digital transformation of our clients."
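
The difference between the two approaches can be sketched in a few lines of Python. This is a minimal illustration, not a description of any IBM or Hadoop product: the record layout, the sample payloads and the function names (RAW_LAKE, load_into_warehouse, explore_lake) are all hypothetical. The warehouse-style loader keeps only the fields a fixed table expects, while the lake-style reader leaves the raw payload untouched and parses it only when a question is asked.

```python
import json

# Hypothetical raw records, stored side by side in their native formats,
# the way a data lake might hold them. All values are made up.
RAW_LAKE = [
    {"source": "twitter", "format": "json",
     "payload": '{"user": "a", "text": "Love my new #sneakers #running"}'},
    {"source": "orders", "format": "csv",
     "payload": "order_id,customer,amount\n1001,a,59.90"},
]


def load_into_warehouse(records):
    """Warehouse-style load: extract only the fields a fixed table expects.

    Everything else in the post (here, the surrounding text) is discarded.
    """
    rows = []
    for rec in records:
        if rec["source"] == "twitter":
            tweet = json.loads(rec["payload"])
            hashtags = [w for w in tweet["text"].split() if w.startswith("#")]
            rows.append({"user": tweet["user"], "hashtags": hashtags})
    return rows


def explore_lake(records, source):
    """Lake-style read: parse the untouched raw payload at query time."""
    for rec in records:
        if rec["source"] == source and rec["format"] == "json":
            yield json.loads(rec["payload"])


if __name__ == "__main__":
    print(load_into_warehouse(RAW_LAKE))
    print(list(explore_lake(RAW_LAKE, "twitter")))
```

In this sketch the warehouse rows hold the hashtags but no longer hold the wording of the post, so a later text-analytics question could not be answered from them. The lake copy still can answer it, which is the flexibility Ward describes.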
Infrastructure Matters
Because they are simpler than data warehouses,
data lakes have long been considered
lower-cost solutions that can be implemented with
commodity hardware infrastructure. In the era of big
data, however, those assumptions no longer hold.
Hadoop, the open-source distributed computing
framework used to manage big data workloads, is
growing increasingly sophisticated. Driven by the
open-source community, including IBM and its partner
Hortonworks, Hadoop has the ability to run processes
in different data and execution environments,
depending on the demands of the workload.
Old-school one-size-fits-all data lakes are no longer
appropriate for the task at hand. Servers and storage
need to be flexible enough to be optimized for the job.
"All nodes in a data lake are not necessarily
performing the same role, so being able to mix and
match, and scale up portions of your data lake to
meet the needs of your workload is a big benefit,"
says Steve Roberts, offering manager for Big Data
on IBM Power Systems. "You might use high-speed
nodes for data ingest, for example. With real-time
analytics becoming a larger and
larger part of every use case,
being able to choose where you
run your machine learning, deep
learning workloads is important."
A collection of x86 boxes
cannot perform to the level
needed, which increasingly
presents problems for enterprises.
"CXOs trying to figure out
their cognitive data flows and
strategies face a big challenge,
which is commodity hardware.
We are talking about a general
piece of hardware," says Dylan
Boday, offering manager for
cognitive infrastructure. "They
are told to use it in every aspect of
their cognitive journey, and it just
doesn't work."
To support data lakes in the
modern cognitive environment,
IBM released the LC921 and
LC922. Both feature the
IBM POWER9* processor,
an open hardware platform
designed specifically for the
cognitive computing era (see
"By the Numbers," below).
In a side-by-side comparison
with commodity servers, the
Power Systems LC922 server
delivered 2x the price-performance.
Clients can configure the LC921
with up to 40 TB of storage and
the LC922 with up to 120 TB. Each
can hold up to 2 TB of RAM.
By the Numbers
The LC921 and LC922 servers are Linux* technology-based machines designed specifically
for data-lake applications.
POWER9 processor
Up to 40 TB (LC921) or 120 TB (LC922) of storage
PCIe 4.0/CAPI 2.0 for high-speed I/O
Up to 40 cores (LC921) or 44 cores (LC922)
Up to 2 TB of RAM
Red Hat or Ubuntu Linux
20 // JULY 2018 ibmsystemsmag.com