captured and handled -- continues to gain attention, as seen at Strata + Hadoop World 2016 in San Jose, Calif., where streaming data tools and technologies were prominent.

The event showed that Hadoop-based big data implementations increasingly include the Apache Kafka messaging system and the Spark data processing engine's Spark Streaming module. Both technologies have become more prevalent, as developers build data pipelines that go beyond original batch-oriented Hadoop designs and approach real-time streaming analytics capabilities.

Kafka and Spark Streaming are often used together, with the former acting as a publish-and-subscribe messaging queue that feeds the latter. Once in Spark, the streamed data is then processed in parallel, sometimes for use in automated analytics applications. And in the bountiful Hadoop ecosystem, there also are other emerging contenders, including open source technologies such as Storm, Samza and Flink.
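The hand-off described above, in which a publish-and-subscribe queue feeds a downstream processor, can be sketched in a few lines of plain Python. This is only an illustration of the pattern: the `Topic` class and its `publish` and `poll` methods are hypothetical names and are not Kafka's API.

```python
from collections import defaultdict


class Topic:
    """Hypothetical in-memory stand-in for a pub-sub topic (not Kafka's API)."""

    def __init__(self):
        self._log = []                     # append-only message log
        self._offsets = defaultdict(int)   # next unread position per subscriber

    def publish(self, message):
        """Append a message; publishers never wait for consumers."""
        self._log.append(message)

    def poll(self, subscriber):
        """Return the messages this subscriber has not yet read."""
        start = self._offsets[subscriber]
        self._offsets[subscriber] = len(self._log)
        return self._log[start:]
```

Because each subscriber tracks its own position in the log, several consumers can read the same stream independently, which is what lets a stream processor sit downstream of the queue without slowing the producers.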

Streaming's impact on the overall big data space could be notable. Market researcher Wikibon, for example, recently predicted that by 2022, the global market for unified streaming analytics technology will account for 16% of all big data spending, or about $11.5 billion.

Feeding the data pipeline

The streams Spark can handle may not be as swift as those in, say, stock trading applications, which have long been a hotbed of submillisecond data streaming. But for Internet clickstreams and other common data flows, Spark Streaming's microbatching architecture may be enough to meet stream processing needs, according to some Strata + Hadoop World attendees.
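To make the microbatching idea concrete, here is a minimal sketch in plain Python, assuming events arrive as (timestamp in seconds, payload) pairs. The function name `micro_batches`, the `interval_s` parameter and the sample clickstream are illustrative assumptions, not Spark Streaming's API.

```python
from collections import defaultdict


def micro_batches(events, interval_s):
    """Group (timestamp_seconds, payload) events into fixed-length batches."""
    batches = defaultdict(list)
    for ts, payload in events:
        # Every event falls into the window that starts at a multiple of interval_s.
        window_start = int(ts // interval_s) * interval_s
        batches[window_start].append(payload)
    # Return batches in time order so each can be processed as a small batch job.
    return [(start, batches[start]) for start in sorted(batches)]


# Hypothetical clickstream: four page views over roughly three seconds.
clicks = [(0.2, "/home"), (0.9, "/cart"), (1.4, "/home"), (3.1, "/pay")]
for start, batch in micro_batches(clicks, interval_s=1):
    print(start, batch)
```

Because a batch is processed only after its window closes, results can lag by up to the batch interval. That is acceptable for clickstreams, but it is one reason the approach may not suit submillisecond trading workloads.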

"For 90% of what people want to do, Spark Streaming will fit the purpose," said Mohammad Quraishi, a senior IT principal for big data analytics at medical insurer Cigna Corp., in Bloomfield, Conn.

Cigna uses Spark Streaming along with Kafka as part of a larger Hadoop system, based on Cloudera's distribution of the big data framework. Quraishi, who spoke at the conference, said Kafka enables the insurer to create a "speed layer" for data. "It completes the Lambda architecture for us," he said, referring to an often-discussed architectural approach for managing big data that supports both batch and real-time processing.
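As a rough sketch of the Lambda idea mentioned here, a query can be answered by combining a precomputed batch view with a small view of events that arrived since the last batch run. The function and variable names below are hypothetical and do not describe Cigna's system.

```python
def merged_view(batch_view, speed_view):
    """Combine precomputed batch counts with counts from recent events."""
    merged = dict(batch_view)  # copy so the batch view is left untouched
    for key, count in speed_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged


# Hypothetical example: counts from the nightly batch plus today's stream.
batch_view = {"claims": 1000, "logins": 5000}
speed_view = {"logins": 42, "signups": 3}
print(merged_view(batch_view, speed_view))
```

In this sketch the batch layer supplies complete but stale totals, while the speed layer covers only the recent gap and can be discarded once the next batch run catches up.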

"A data pipeline is really important. Having a high-throughput, low-latency pub-sub messaging engine like Kafka will help simplify how you handle data beyond that," Quraishi said.

The Kafka and Spark Streaming combo appeared in other conference presentations, as well. Applications ranged from a real-time fraud detection system created at Netherlands-based financial services company ING Group to a streaming system for handling sensor data transmitted from railroad cars that's in progress at industrial conglomerate Siemens AG, which is based in Munich, Germany.

According to Yvonne Quacken, a senior big data architect and engineer at Siemens, who also spoke at the Strata conference, the need for fast, easy and flexible data streams is high. "Today, we lose time just trying to load data," she said.

"In general, we're hoping to optimize our processes. What we're doing now will enable more in the future," said Quacken, whose team is working with data warehouse and big data vendor Teradata to connect Kafka, Spark and other open source software to handle incoming data from the Internet of Things.

What's driving the data train?

What's driving innovation in streaming data analytics today is the generally faster cadence of digital business overall, according to independent analyst and industry observer Thomas Dinsmore. That cadence is often driven by the growth of Web and mobile applications, Dinsmore said.

He noted that streaming analytics has a long lineage dating back to the development of complex event processing (CEP) technologies by Tibco Software and other vendors. But CEP systems were fairly expensive to implement, Dinsmore said, "so penetration was limited to the strongest use cases, like high-velocity trading and capital markets applications."

Dinsmore said the new generation of open source data streaming tools -- he cited Apex, Flink, Samza, Spark Streaming and Storm among notable frameworks -- "offers the potential to lower costs dramatically, which opens new use cases."
