Users look to real-time streaming to speed up big data analytics

NEW YORK — For a growing number of organizations, there’s no time like the present to process and analyze the information flowing into their big data systems. And IT vendors are increasingly releasing technologies that facilitate the real-time streaming analytics process.

Comcast Corp. is among the real-time vanguard. The TV and movie conglomerate is on the verge of expanding a Hadoop cluster used by its data science team from 300 compute nodes to 480. In addition, Comcast plans to upgrade the system to include Apache Kudu, an open source data store designed for use in real-time analytics applications involving streaming data that’s updated frequently.

“For us, the update ability is a very big thing,” said Kiran Muglurmath, executive director of data science and big data analytics at the Philadelphia-based company. The Hadoop cluster, set up earlier this year, already contains more than a petabyte of information — for example, data collected from set-top boxes on the TV viewing activities of Comcast customers and the operations of the boxes themselves. But Muglurmath’s team needs to keep the data as up-to-date as possible for effective analysis, which means updating individual records via table scans as new information comes in.
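Keeping a petabyte-scale table current as new records arrive is essentially an upsert problem: a store like Kudu can overwrite an existing row in place rather than rewriting files, which is what makes the "update ability" cheap. A minimal sketch of upsert semantics, using an in-memory dictionary and invented set-top-box field names (not Comcast's actual schema):

```python
def upsert(store, record):
    """Insert the record, or overwrite the existing row with the same key."""
    store[record["box_id"]] = record

store = {}
upsert(store, {"box_id": "stb-001", "channel": 7, "status": "on"})
# A fresh reading for the same box updates the row rather than appending a copy.
upsert(store, {"box_id": "stb-001", "channel": 12, "status": "on"})

assert len(store) == 1                      # still one row per box
assert store["stb-001"]["channel"] == 12    # latest value wins
```

In an append-only file system like HDFS, achieving the same effect means scanning and rewriting data, which is the cost Comcast is trying to avoid.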

Sridhar Alla, director of big data architecture at Comcast, said doing so takes “an immense amount of time” in the Hadoop Distributed File System (HDFS) and its companion HBase database — too long to be feasible at petabyte scale. Kudu, on the other hand, has significantly accelerated the process in a proof-of-concept project over the past three months. In one test, for example, it scanned more than two million rows of data per second. “It’s writing the data as fast as the disks can handle,” Alla said during a session at Strata + Hadoop World 2016 here this week.

Real-time waiting game comes to an end

The Kudu technology was created last year by Hadoop vendor Cloudera Inc. and then open sourced. The Apache Software Foundation last week released Kudu 1.0.0, the first production version — a step that Comcast was waiting for before going live with its Kudu deployment.

The expansion of the Cloudera-based Hadoop cluster should be completed by the end of October, Muglurmath said after the conference session. Kudu will be configured on all of the compute nodes along with HDFS, which will continue to be used to store other types of data. The data science team also plans to use Impala, a SQL-on-Hadoop query engine developed by Cloudera, to join together data from HDFS and Kudu for analysis.
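Conceptually, the Impala step joins "cold" historical rows stored in HDFS with "hot" frequently updated rows in Kudu on a shared key. A rough in-memory illustration of that cross-store join, with hypothetical table contents and column names:

```python
# Historical viewing summaries (HDFS-style, append-only data).
hdfs_rows = [
    {"box_id": "stb-001", "total_hours": 410.5},
    {"box_id": "stb-002", "total_hours": 88.0},
]
# Frequently updated current state (Kudu-style, updatable data).
kudu_rows = [
    {"box_id": "stb-001", "status": "on"},
    {"box_id": "stb-002", "status": "off"},
]

def join_on(key, left, right):
    """Inner join two row lists on a shared key column."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

joined = join_on("box_id", hdfs_rows, kudu_rows)
```

In production the join is expressed in SQL and executed by Impala across both storage engines; the point of the sketch is only that one query can combine rows from the two stores.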

Dell EMC, the data storage unit of IT vendor Dell Technologies, is also going down the real-time streaming path to support its internal analytics efforts.

“You couldn’t just throw all the data in Hadoop and say ‘Go at it.’ It’s a different thing to take real-time data and do actionable analytics on it.”
— Darryl Smith, chief data platform architect, Dell EMC

The IT team is using the Spark processing engine and other data ingestion tools to funnel real-time data on interactions with customers into a combination of databases — Cassandra, GemFire, MemSQL and PostgreSQL. Automated algorithms are then run against the data to generate up-to-the-minute customer experience scores that help guide Dell EMC’s sales force in selling tech-support subscription renewals, said Darryl Smith, chief data platform architect at the Hopkinton, Mass.-based organization.
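An "up-to-the-minute" score of this kind typically weights recent interactions more heavily than old ones. As a toy sketch only — the event types, weights and decay formula here are invented for illustration, not Dell EMC's actual algorithm — a recency-weighted score might look like:

```python
# Hypothetical per-event weights: support trouble lowers the score,
# resolutions raise it.
WEIGHTS = {"ticket_opened": -5.0, "ticket_resolved": 3.0, "outage": -10.0}

def experience_score(events, now, half_life_days=30.0):
    """Start at 100 and apply each event, discounted by its age."""
    score = 100.0
    for event in events:
        age = now - event["day"]
        decay = 0.5 ** (age / half_life_days)  # older events count for less
        score += WEIGHTS.get(event["type"], 0.0) * decay
    return round(score, 1)

score = experience_score(
    [{"type": "ticket_opened", "day": 0}, {"type": "ticket_resolved", "day": 5}],
    now=10,
)
```

Because the inputs stream in continuously, such a score can be recomputed every time a new interaction lands, which is what makes it usable by sales reps in real time.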

The customer interaction data is also fed into a Hadoop data lake, but that’s for longer-term customer profiling and trend analysis. For the customer scoring application, “you couldn’t just throw all the data in Hadoop and say ‘Go at it’ [to the sales reps],” Smith said. “It’s a different thing to take real-time data and do actionable analytics on it.”

That does mean the same data is being processed and stored in different locations within Dell EMC’s big data architecture, but Smith doesn’t see that as a bad thing. “And it’s not just because I work for a storage company,” he joked. “If you’re going to get value out of the data, you’re going to need to store it in multiple places, because you’re going to consume it in different ways.”

One of the real-time streaming processes adopted by Dell EMC uses the open source Kafka message queueing tool to push data into MemSQL, an in-memory database designed for real-time applications. Vendor MemSQL Inc. this week released a version 5.5 update that incorporates the Kafka connectivity into a feature for creating data pipelines with exactly-once semantics — meaning that data transmissions are processed only once, with guaranteed delivery and no data loss along the way. Smith said such a guarantee is “absolutely critical” for the kind of real-time analytics Dell EMC is looking to do.
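The essence of exactly-once processing is that a message redelivered after a retry or failure must not be applied twice. A minimal sketch of the idea — tracking applied offsets so duplicates are skipped — which illustrates the guarantee conceptually rather than MemSQL's actual pipeline internals:

```python
class ExactlyOnceConsumer:
    """Applies each message offset at most once, even under redelivery."""

    def __init__(self):
        self.applied_offsets = set()
        self.total = 0

    def handle(self, offset, amount):
        if offset in self.applied_offsets:
            return  # duplicate delivery: already applied, ignore it
        self.applied_offsets.add(offset)
        self.total += amount

consumer = ExactlyOnceConsumer()
# Offset 1 arrives twice (e.g., the producer retried), but counts once.
for offset, amount in [(0, 10), (1, 5), (1, 5), (2, 7)]:
    consumer.handle(offset, amount)

assert consumer.total == 22
```

In a real deployment the set of applied offsets must itself be persisted atomically with the data, which is the hard part the MemSQL pipeline feature handles.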

Living with some real-time data loss

Guaranteed data delivery isn’t a necessity for eBay Inc., though. The online auction and e-commerce company uses Pulsar, an open source stream processing and analytics technology it created, to analyze data on user activities in order to drive personalization of the eBay website for individual visitors. In creating and expanding the real-time architecture over the past three years, eBay’s IT team decided it didn’t have to spend extra development money to build a delivery guarantee into the data pipeline.

“For our use cases, we can afford to lose a little bit of the data,” said Tony Ng, director of engineering for user behavior analytics and other data services at eBay. But Ng’s team does have to keep on its toes as data flows in. For example, one of the goals is to detect bots on the site and separate out the activity data they generate so it doesn’t skew the personalization process for real users. That requires frequent updates to the bot-detection rules built into eBay’s analytics algorithms, Ng said.
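Rule-based bot filtering of this sort amounts to diverting any event that matches a current rule so it never reaches the personalization pipeline. A simplified sketch — the rules and event fields here are hypothetical, and eBay's actual detection logic is far more sophisticated — where keeping the rule list as plain data makes the frequent updates Ng describes straightforward:

```python
# Current rule set; updating bot detection means swapping entries here.
bot_rules = [
    lambda e: e["requests_per_min"] > 300,         # inhuman request rate
    lambda e: e["user_agent"].startswith("curl"),  # scripted client
]

def split_traffic(events, rules):
    """Divert events matching any rule so they don't skew personalization."""
    humans, bots = [], []
    for event in events:
        (bots if any(rule(event) for rule in rules) else humans).append(event)
    return humans, bots

events = [
    {"user": "a", "requests_per_min": 12, "user_agent": "Mozilla/5.0"},
    {"user": "b", "requests_per_min": 900, "user_agent": "Mozilla/5.0"},
    {"user": "c", "requests_per_min": 20, "user_agent": "curl/7.47"},
]
humans, bots = split_traffic(events, bot_rules)
```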

The San Jose, Calif., company’s real-time streaming setup also includes Kafka as a transport mechanism, plus several other open source technologies — Storm, Kylin and Druid — for processing and storing data. Ng noted that the streaming operations are a lot different from the batch data loading eBay does into its Hadoop clusters and Teradata data warehouse for other analytics uses.

“There are some constraints on how much processing you can do on the data,” he said. It is eventually cleaned up and consolidated in batch mode for downstream analytics applications — “but the things that need to be real time, we want to keep real time.”

Putting together a real-time data streaming and analytics architecture is a complicated process in and of itself, said Mark Madsen, president of data management and analytics consultancy Third Nature Inc. in Portland, Ore.

Users can also tap a variety of other streaming technologies — for example, Spark’s Spark Streaming module and Apache Flink, an upstart alternative to Spark that was released in a commercial version this month by lead developer Data Artisans GmbH. But a lot of assembly is typically required to combine different tools into a functional platform. “It’s a build-to-order problem,” Madsen said. “[Individual IT vendors] carve out a piece of the problem, but it’s hard for them to carve out the whole problem.”