Tuesday, May 1, 2018

Understanding the SMACK stack for big data

Just as the LAMP stack revolutionized servers and
web hosting, the SMACK stack has made big data applications viable and
easier to develop. Want to come up to speed? Here are the basics.

Just
as LAMP made it easy to create server applications, SMACK is making it
simple (or at least simpler) to build big data programs. SMACK's role is
to provide fast access to big data. In other
words, developers can create big data applications without reinventing
the wheel.
We don't discuss the LAMP stack much anymore. Once a
buzzword for describing the technology underlying server and web hosting
projects, LAMP (Linux, Apache, MySQL, and PHP/Python/Perl) was a
shortcut way to refer to the pieces used for online infrastructure. The
details may change—MariaDB in place of MySQL, Nginx for Apache, and so
on—but the fundamental infrastructure doesn't.
It
had a major impact. LAMP, with its combination of operating system, web
front end, transactional data store, and server-side programming,
enabled the birth of Web 2.0. Nowadays, LAMP doesn’t get a lot of dedicated attention because it’s taken for granted.
The
premise was, and still is, a good one. A well-known set of technologies
designed to easily integrate with one another can be a reliable
starting point for creating larger, complex applications. While each
component is powerful in its own right, together they become more so.
And thus today, Spark, Mesos, Akka, Cassandra, and Kafka (SMACK) has become the foundation for big data applications.
Among the technology influences driving SMACK adoption is the demand for real-time big data analysis. Apache Hadoop architectures, usually including Hadoop Distributed File System, MapReduce, and YARN,
work well for batch or offline jobs, where data is captured and
processed periodically, but they're inadequate for real-time analysis.
SMACK is a registered trademark of By the Bay, but the code of its components is open source software.

SMACK history

Most community tech initiatives begin with a pioneer and a lead innovator. In 2014, Apple engineer Helena Edelson wrote KillrWeather
to show how easy it would be to integrate big data streaming and
processing into a single pipeline. Edelson’s efforts got the attention
of other San Francisco big data developers, some of whom organized tech conferences.
This
quickly transformed into a movement. The programmers behind each component
met in 2015 at a pair of West Coast developer conferences, where they defined the SMACK stack by doing and teaching. Among the interested parties was Mesosphere, a container and big data company, which has certainly contributed to popularizing SMACK.
Immediately after those conferences, Mesosphere announced its Mesosphere Infinity product. This pulled together the SMACK stack programs into a whole, with the aid of Cisco.
Mesosphere
Infinity's purpose was to create "an ideal environment for handling all
sorts of data processing needs—from nightly batch-processing tasks to
real-time ingestion of sensor data, and from business intelligence to
hard-core data science."
The SMACK stack quickly gained in
popularity. It's currently employed in multiple big data pipeline
architectures for data stream processing.

SMACK components

As
with LAMP, a developer or system administrator is not wedded to SMACK's
main programs. You can replace individual components, just as some
original LAMP users swapped out MySQL for MariaDB or Perl for Python.
For instance, a SMACK developer can replace Mesos as the cluster
scheduler with Apache YARN or use Apache Flink in place of Spark for batch and stream processing. But, as with LAMP, SMACK is a useful starting point, with established processes, documentation, and predictable toolsets.
Here are SMACK's basic pieces.
Apache Mesos
is SMACK's foundation. Mesos, a distributed systems kernel, abstracts
CPU, memory, storage, and other computational resources away from
physical or virtual machines. On Mesos, you build fault-tolerant and
elastic distributed systems. Mesos runs applications within its cluster.
It also provides a highly available platform. In the event of a system
failure, Mesos relocates applications to different cluster nodes.
The Mesos kernel provides the SMACK applications (and other big data applications, such as Hadoop)
with the APIs they need for resource management and scheduling across
data center, cloud, and container platforms. While many SMACK
implementations use Mesosphere's Mesos Data Center Operating System (DC/OS) distribution, SMACK works with any version of Mesos or, with some elbow grease, other distributed systems.
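Mesos's core idea is two-level scheduling: the Mesos master offers each framework a slice of cluster resources, and the framework decides which tasks to launch on them. Here is a minimal, hypothetical Python sketch of that handshake (node names and the accept-first-fit policy are illustrative, not Mesos's actual placement logic):

```python
# Hypothetical sketch of Mesos-style two-level scheduling:
# the master makes resource offers; a framework accepts one that fits.

offers = [
    {"node": "agent-1", "cpus": 2, "mem_gb": 32},
    {"node": "agent-2", "cpus": 2, "mem_gb": 32},
]

def schedule(task_cpus, task_mem_gb, offers):
    """Framework-side logic: accept the first offer that fits the task."""
    for offer in offers:
        if offer["cpus"] >= task_cpus and offer["mem_gb"] >= task_mem_gb:
            return offer["node"]  # launch the task on this agent
    return None  # decline all offers and wait for the next round

print(schedule(1, 16, offers))  # agent-1
print(schedule(4, 64, offers))  # None: nothing fits, offers declined
```

The key point is the division of labor: Mesos only advertises capacity; the framework keeps its own scheduling logic, which is what lets Spark, Cassandra, and Kafka share one cluster.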
Next on the stack is Akka. Akka both brings data into a SMACK stack and sends it out to end-user applications.
The
Akka toolkit aims to help developers build highly concurrent,
distributed, and resilient message-driven applications for Java and Scala. It uses the actor model as its abstraction level to provide a platform to build scalable, resilient, and responsive applications.
The
actor model is a conceptual model to work with concurrent computation.
It defines general rules for how the system’s components should behave
and interact. The best-known language using this abstraction is Erlang.
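The actor model is easy to illustrate outside Akka. Below is a minimal, hypothetical sketch in Python rather than Akka's real Scala/Java API: each actor owns a mailbox and private state, and the only way to interact with it is to send it a message.

```python
import queue

class Actor:
    """Toy actor: private state plus a mailbox of messages.

    Real Akka actors run concurrently under the toolkit's scheduler;
    here the mailbox is drained synchronously to keep the sketch
    self-contained.
    """
    def __init__(self):
        self.mailbox = queue.Queue()

    def send(self, message):
        # The only way to interact with an actor: enqueue a message.
        self.mailbox.put(message)

    def run(self):
        # Process messages one at a time, in arrival order.
        while not self.mailbox.empty():
            self.receive(self.mailbox.get())

class CounterActor(Actor):
    def __init__(self):
        super().__init__()
        self.count = 0  # state no other actor can touch directly

    def receive(self, message):
        if message == "increment":
            self.count += 1

counter = CounterActor()
for _ in range(3):
    counter.send("increment")
counter.run()
print(counter.count)  # 3
```

Because no one ever touches `count` directly, there is nothing to lock; concurrency safety falls out of the message-passing discipline itself.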
With Akka, all interactions work in a distributed environment; actors communicate purely by asynchronous message passing.
Apache Kafka is a distributed, partitioned, replicated commit log service. In SMACK, Kafka provides the messaging system functionality.
In
a larger sense, Kafka decouples data pipelines and organizes data
streams. With Kafka, data messages are byte arrays, which you can use to
store objects in many formats, such as Apache Avro, JSON,
and String. Kafka treats each set of data messages as a log—that is, an
ordered set of messages. SMACK uses Kafka as a messaging system between
its other programs.
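Kafka's core abstraction, the ordered log, can be sketched in a few lines. This is a hypothetical in-memory stand-in for a real broker, not Kafka's API: each topic is an append-only log of byte-array messages, and every message gets a monotonically increasing offset that consumers read from.

```python
class CommitLog:
    """In-memory stand-in for a Kafka topic: an append-only,
    offset-addressed log of byte-array messages."""
    def __init__(self):
        self.messages = []

    def append(self, payload: bytes) -> int:
        # Producers append; the returned offset identifies the message.
        self.messages.append(payload)
        return len(self.messages) - 1

    def read(self, offset: int):
        # Consumers read forward from whatever offset they last committed.
        return self.messages[offset:]

topic = CommitLog()
topic.append(b'{"sensor": 1, "temp": 21.5}')  # e.g. JSON encoded as bytes
topic.append(b'{"sensor": 2, "temp": 19.0}')

# A consumer that has committed offset 1 sees only the newer message.
print(topic.read(1))  # [b'{"sensor": 2, "temp": 19.0}']
```

Because each consumer tracks its own offset, the same log can feed Spark, Cassandra, and Akka independently, which is exactly the decoupling role Kafka plays in SMACK.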
In SMACK, data is kept in Apache Cassandra, a well-known distributed NoSQL
database for managing large amounts of structured data across multiple
servers, relied on by many high-availability applications.
Cassandra can handle huge quantities of data across multiple storage
devices and vast numbers of concurrent users and operations per second.
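Cassandra's scaling model rests on partitioning rows across nodes by a partition key. The following hypothetical Python sketch shows the idea only; a real cluster places partitions on a consistent-hashing token ring and replicates each one:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def node_for(partition_key: str) -> str:
    # Hash the partition key and map it onto a node. Cassandra's real
    # placement uses a token ring plus a replication factor, but the
    # principle is the same: the key alone determines the location.
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Rows with the same partition key always land on the same node,
# so a lookup by key never needs to scan the whole cluster.
print(node_for("sensor-42") == node_for("sensor-42"))  # True
```

Adding capacity means adding nodes and letting keys rehash across them, which is why Cassandra scales writes roughly linearly with cluster size.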
The job of actually analyzing the data goes to Apache Spark.
This fast and general-purpose big data processing engine enables you to
combine SQL, streaming, and complex analytics. It also provides
high-level APIs for Java, Scala, Python, and R, along with an optimized
engine that supports general execution graphs.
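Much of Spark's speed comes from chaining transformations lazily over in-memory data and only executing them when a result is demanded. Here is a rough, hypothetical Python analogue of that RDD-style API (real Spark distributes these stages across the cluster; this runs on one list):

```python
class Dataset:
    """Toy stand-in for a Spark RDD: transformations are recorded
    lazily and only run when an action (collect) is called."""
    def __init__(self, data):
        self.data = list(data)
        self.stages = []  # pending transformations, not yet executed

    def map(self, fn):
        self.stages.append(("map", fn))
        return self

    def filter(self, pred):
        self.stages.append(("filter", pred))
        return self

    def collect(self):
        # The action: run every recorded stage over the data in order.
        out = self.data
        for kind, fn in self.stages:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

result = (Dataset([1, 2, 3, 4])
          .map(lambda x: x * 10)
          .filter(lambda x: x > 15)
          .collect())
print(result)  # [20, 30, 40]
```

Deferring execution this way lets the engine see the whole pipeline before running it, which is what enables Spark's cross-stage optimization and in-memory reuse.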

Running through the SMACK pipeline

The
smart bit, of course, is how all those pieces form a big data pipeline.
There are many ways to install a SMACK stack using your choice of
clouds, Linux distributions, and DevOps tools. Follow along with me as I create one to illustrate the process.
I
start my SMACK stack by setting up a Mesos-based cluster. For SMACK,
you need a minimum of three nodes, with two CPUs each and 32 GB of RAM.
You can set this up on most clouds using any supported Linux
distribution.
Next, I set up the Cassandra database from within Mesos or a Mesos distribution such as DC/OS.
That done, I set up Kafka inside Mesos.
Then I get Spark up and running in cluster mode. This way, when a task requires Spark, instances are automatically spun up on available resources.
That's the basic framework.
But
wait—the purpose here is to process data! That means I need to get data
into the stack. For that, I install Akka. This program reads in
data—data ingestion—from the chosen data sources.
As the data
comes in from the outside world, Akka passes it on to Kafka. Kafka, in
turn, streams the data to Akka, Spark, and Cassandra. Cassandra stores
the data, while Spark analyzes it. All the while, Mesos is orchestrating
all the components and managing system requirements. Once the data is
stored and analyzed, you can query it, using Spark for further analysis
with the Spark Cassandra Connector. You can then use Akka to move the data and analytic results from Cassandra to the end user.
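The flow above can be condensed into a toy Python sketch. Everything here is a hypothetical in-memory stand-in (in a real deployment Akka, Kafka, Cassandra, and Spark are separate services that Mesos schedules); the point is the shape of the pipeline: ingest, stream, store, analyze.

```python
# Akka's role: ingest events from the outside world.
ingested = [{"sensor": i, "temp": 20 + i} for i in range(3)]

# Kafka's role: an ordered message stream the other pieces consume.
message_log = []
for event in ingested:
    message_log.append(event)

# Cassandra's role: durable, keyed storage of the streamed events.
database = {}
for event in message_log:
    database[event["sensor"]] = event

# Spark's role: analysis over the stored/streamed data.
average_temp = sum(e["temp"] for e in message_log) / len(message_log)
print(average_temp)  # 21.0
```

Swapping any stand-in for its real counterpart changes the plumbing but not the shape: each component consumes the previous one's output and never reaches around it.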
This is just an overview. For a more in-depth example, see The SMACK stack – hands on!

Who needs SMACK

Before you start to build a SMACK stack, ask whether it's the right tool for the job.
The
first question to ask is whether you need big data analysis in real
time. If you don't, Hadoop-based batch approaches can serve you well. As
Patrick McFadin, chief evangelist for Apache Cassandra at DataStax, explains in an interview, "Hadoop fits in the 'slow data' space,
where the size, scope, and completeness of the data you are looking at
is more important than the speed of the response. For example, a data
lake consisting of large amounts of stored data would fall under this."
How much faster than Hadoop is SMACK's analysis engine, Spark? According to Natalino Busa, head of data science at Teradata, "Spark's multistage in-memory primitives provide performance up to 100 times faster for certain applications."
Busa argues that by allowing user programs to load data into a
cluster's memory and query it repeatedly, Spark works well with machine
learning algorithms.
But when you do need fast big data, SMACK can deliver great performance. Achim Nierbeck, a senior IT consultant for Codecentric AG, explains, "Our requirements contained the ability to process approximately 130,000 messages per second.
Those messages needed to be stored in Cassandra and also be
accessible via a front end for real-time visualization." With 15
Cassandra nodes on a fast Amazon Web Services-based Mesos cluster,
Nierbeck says, "processing 520K [messages per second] was easily
achieved."
Another major business win is that SMACK enables you
to get the most from your hardware. As McFadin says, Mesos capabilities
“allow potentially conflicting workloads to act in isolation from each
other, ensuring more efficient use of infrastructure.”
Finally,
SMACK provides a complete, open source toolkit for addressing real-time
big data problems. Like LAMP, it provides all the tools needed for
developers to create applications without getting bogged down in the
details of integrating a new stack.
Today, most people still don't know what SMACK is. Tomorrow, expect it to become a commonplace set of tools.