Big Analytics Roundup (June 22, 2015)

Anmol Rajpurohit writes KDnuggets’ play-by-play for Day One and Day Two.

My preliminary report is here; a full report will follow when slides from the sessions are available.

Spark will be one of several technologies featured at the inaugural In-Memory Computing Summit to be held in SFO June 29-30.

On KDnuggets, an interesting story from Gregory Piatetsky-Shapiro and Shashank Iyer. The authors measure association among analytics tools using responses to their recent poll. The strongest associations among the top 10 tools:

Spark and Hadoop

Python and Spark

Excel and SQL

Among the top 20 tools, the top associations are unsurprising:

SAS Enterprise Miner and Base SAS (cannot use the former without the latter)

IBM SPSS Modeler and IBM SPSS Statistics

The low associations are also interesting, if unsurprising:

Alteryx with everything else except Tableau

IBM SPSS with awk/gawk, scikit-learn and Spark

KNIME and Base SAS/Enterprise Miner

RapidMiner and Base SAS/Enterprise Miner

KNIME and RapidMiner are clearly positioned as low-cost SAS alternatives among relatively sophisticated analysts, while the Alteryx/Tableau combo is an entry-level offering for business users.
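For readers who want to try this kind of analysis on their own survey data, here is a minimal sketch of one common association measure, lift: the share of respondents who use both tools divided by the product of the shares who use each. Whether this matches the authors' exact calculation is an assumption on my part, and the responses below are made up for illustration; they are not the KDnuggets poll data. The function name `pairwise_lift` is likewise hypothetical.

```python
from itertools import combinations


def pairwise_lift(responses):
    """Return {(tool_a, tool_b): lift} for every pair of tools in the responses.

    Each response is the set of tools one respondent reports using.
    Lift above 1 means the pair co-occurs more often than independence
    would predict; lift below 1 means less often.
    """
    n = len(responses)
    tools = sorted(set().union(*responses))
    # Share of respondents who report using each tool.
    share = {t: sum(t in r for r in responses) / n for t in tools}
    lifts = {}
    for a, b in combinations(tools, 2):
        # Share of respondents who report using both tools.
        both = sum(a in r and b in r for r in responses) / n
        lifts[(a, b)] = both / (share[a] * share[b])
    return lifts


# Hypothetical responses, invented for this sketch.
responses = [
    {"Spark", "Hadoop", "Python"},
    {"Spark", "Hadoop"},
    {"Excel", "SQL"},
    {"Excel", "SQL", "Python"},
    {"Python", "Spark"},
    {"Excel"},
]

# Print pairs from strongest to weakest association.
for pair, lift in sorted(pairwise_lift(responses).items(), key=lambda kv: -kv[1]):
    print(pair, round(lift, 2))
```

A pair that never co-occurs gets a lift of zero, which is roughly how the low associations listed above would show up in a calculation like this.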

Analyst Reactions: Spark Summit

Doug Henschen wonders if Databricks will be eclipsed by IBM’s entry, citing IBM’s intent to offer Spark on its cloud platform BlueMix. He fails to note that (1) Databricks Cloud is more than a vanilla Spark service; (2) Databricks already competes with a Spark Service from AWS; and (3) BlueMix is an ankle-biter.

Tony Baer uses Andrew Brust’s blog to flatten a straw man, arguing that Spark isn’t going to replace Hadoop — a position that no serious person has suggested or implied. Even Spark diehards believe that there are use cases where MapReduce/HDFS makes sense.

Joe Panettieri trolls readers, asking whether Spark can live up to “Big Data, Real Time Analytics Hype.” From the evidence he presents, the answer is “yes”.

Amazon Web Services

Amazon Web Services announces support for Apache Spark on its Amazon EMR service. Stories here and here. Note that AWS has offered Spark on EC2 for some time, so headlines like “AWS jumps on Spark bandwagon” are misleading.

On the AWS blog, Jeff Smith of Intent Media relates his company’s success with Spark.

Apache Kafka

On the Cloudera Vision blog, Jay Kreps of Confluent contributes the second part of his two-parter on using Kafka for real-time data streams. Part One is here.