Big Analytics Roundup (September 28, 2015)

Strata+Hadoop World NYC is upon us. Andrew Brust opines that there will be three themes at Strata this year: (1) Spark “versus” Hadoop; (2) streaming goes mainstream; (3) data governance matters. My take:

“Spark versus Hadoop” is controversy for the sake of people who like controversy. Spark works with Hadoop, and Spark works with other platforms, or by itself. Use cases will determine the best platform.

We’ve been hearing that streaming is mainstream for something like ten years now. There are a half-dozen commercial products in the space, plus multiple open source frameworks.

Data governance is a soporific.

Due to the spate of Spark stories, this week’s roundup has four sections: Spark, SQL, Machine Learning and Streaming. The top story is Databricks’ Spark survey, which provoked a flurry of analysis.

Spark

2015 Spark Survey

Databricks released results of its 2015 Spark Survey, available here (registration required); an infographic is here. The “report” is a somewhat informative mashup of survey findings, plus other information, such as the headcount from Spark Summits. (Spoiler: it’s increasing.) On the Databricks blog, Matei Zaharia, Patrick Wendell and Denny Lee summarize key points. Additional analysis here, here, here, here, here, here, here and here.

Analysts, loving controversy, note that Spark users slightly prefer standalone configurations over Spark-on-YARN (i.e., co-located in Hadoop). Andrew Oliver, for example, commenting on Cloudera’s One Platform announcement earlier this month, argues that Databricks is actively marketing against Spark-on-YARN, citing results of this survey. But if you compare these results to the Typesafe/Databricks Spark survey published in January, you will note that respondents to the 2015 survey are slightly less likely to run Spark in a standalone cluster than respondents to the earlier survey.

Other analysts, like Tony Baer, note that 11% of respondents run Spark on Mesos, hinting darkly that since the AMPLab team developed both Spark and Mesos, there must be some sort of conspiracy against Hadoop. But in the earlier survey, 26% of respondents said they run on Mesos, so if someone is organizing a secret cabal to compete against Spark-on-YARN, it’s not working out too well.

The biggest news in the survey is the rapid growth of users who use the Python API, from 22% to 58%, and the corresponding decline among those who use Scala or Java. The SQL and R interfaces are too new to compare to the previous survey, but it’s worth noting that in 2015 more respondents use the SQL interface than the Java interface.

Spark as a Service

Google announces Cloud Dataproc, a managed Spark and Hadoop service, currently available in beta. Key benefits claimed: cheap, fast, integrated with the other Google Cloud Platform services, easy to manage, simple and familiar. Google claims that they can set up or tear down a cluster in ninety seconds or less. Billing is by the minute, which is cool. Stories here, here, here, here, here, here, here, here, here, here, here, here, here, here, and here.

On the MapR blog, the ubiquitous Jim Scott explains why Spark is a great companion to Hadoop.

In IT Jungle, Alex Woodie wonders what IBM’s embrace of Spark means for the product line IBM now brands as “i-series” and everyone else calls “AS-400”. His answer: nothing, IBM has no plans to put Spark on these tired old boxes.

Writing for American Banker, Tom Groenfeldt interviews Tom Davenport, several vendors (Rob Thomas of IBM, David Wallace of SAS and Abhi Mehta of Tresata) and one banker. Tom Davenport says that bankers use different things, touts Teradata; Rob Thomas talks about IBM’s Spark initiative; David Wallace says that banks use SAS, and the one banker talks about using Accenture. From this muddle, Mr. Groenfeldt concludes that banks are turning to Spark.

In an article titled Retail Gains with Distributed Systems, Daniel Gutierrez talks about Hadoop and Spark, but provides no actual examples of retailers using these platforms.

SQL

On YouTube, a disembodied voice representing Syntelli Solutions offers you a Test Drive using Drill and Spotfire on AWS.

Impala

Cloudera benchmarks Impala with TPC-DS queries and concludes that the maximum concurrency achievable with good performance increases with the size of the cluster. This does not seem surprising at all; more nodes in the cluster means more horsepower.

ClearStory Data announces a triumph of branding (“Intelligent Data Harmonization”) and a few new features in a muddled press release.

Machine Learning

Graphlab/Dato

Carlos Guestrin announces that Dato is a big believer in open source software, which will make you feel good when you pay the subscription fees on Dato’s commercial software. Dato has released its SFrame columnar data frame to open source under a BSD license. SFrames are like Pandas or R data frames, with some additional features useful to data scientists, like out-of-core operations (on data larger than memory) and support for wide datasets.

No doubt SFrames are cool, but the key challenge for companies in this space is to figure out how to make analytics work with mainstream data formats. Any advantages of a new format are offset by the time and cost needed to ingest and export the data.

H2O/H2O.ai

At the Moscow Data Fest, H2O argues that machine learning is the new SQL.

Streaming

TIBCO’s Kai Waehner presents a nice overview of stream processing frameworks and products. Not surprisingly, he likes TIBCO StreamBase, but the deck nicely summarizes differences between the commercial and open source options.