It won't be cheap. Recall that the 900TB 4U for $60k was for spinning drives. Given that the 16TB SSDs go for nearly $12k a pop, the 32TB drive slated for later this year would cost at least twice that, and likely much more initially. Even at $24k for each 32TB SSD, this 1U of 1PB SSD would take 32 drives and set you back about $768k, or roughly $800k.
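The back-of-the-envelope arithmetic above can be checked with a short sketch. The prices and capacities are the ones quoted in this post. Treating 1PB as 1,000TB and rounding up to whole drives are assumptions of mine, as are the function and constant names.

```python
import math

# Figures quoted in the post. The $24k price for a 32TB SSD is the post's
# own lower-bound guess, not a published price.
SSD_PRICE_USD = 24_000
SSD_CAPACITY_TB = 32
HDD_CHASSIS_PRICE_USD = 60_000   # the 900TB 4U of spinning drives
HDD_CHASSIS_CAPACITY_TB = 900
TARGET_CAPACITY_TB = 1_000       # assumption: 1PB taken as 1,000TB


def drives_needed(target_tb: float, drive_tb: float) -> int:
    """Whole number of drives required to reach at least target_tb."""
    return math.ceil(target_tb / drive_tb)


def cost_per_tb(price_usd: float, capacity_tb: float) -> float:
    """Price divided by capacity, in dollars per terabyte."""
    return price_usd / capacity_tb


n_drives = drives_needed(TARGET_CAPACITY_TB, SSD_CAPACITY_TB)
total_ssd_cost = n_drives * SSD_PRICE_USD
ssd_per_tb = cost_per_tb(SSD_PRICE_USD, SSD_CAPACITY_TB)
hdd_per_tb = cost_per_tb(HDD_CHASSIS_PRICE_USD, HDD_CHASSIS_CAPACITY_TB)

print(f"{n_drives} drives, ${total_ssd_cost:,} total")
print(f"SSD ${ssd_per_tb:.0f}/TB vs. spinning ${hdd_per_tb:.0f}/TB")
```

Under these assumptions the SSD build comes to 32 drives and $768,000, which is where the "roughly $800k" figure comes from. Per terabyte, that is about eleven times the cost of the spinning-drive chassis.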

Neo4j, a leader in connected data, announced that it has released the preview version of Cypher for Apache Spark (CAPS) language toolkit.

[...] Until now, data scientists have been using Spark and query tools like GraphX to define extensions to their graphs. Once identified, they would then re-implement and deploy that work within their applications. Now, with Cypher for Apache Spark, these scientists can iterate easier and connect adjacent data sources to their graph applications much more quickly.

[...] This announcement builds on Neo4j’s unveiling of openCypher in October 2015, as an effort to push the whole graph industry forward by tapping into the open source community and making Cypher’s evolution an open exercise while avoiding redundant research.

Wednesday, June 7, 2017

Spark Summit 2017 was all about Deep Learning. Databricks, which has long offered deep learning with GPUs on its commercial cloud service, announced that it is open sourcing a deep learning library, Deep Learning Pipelines, which seems to lack GPU support. Similarly, Intel open sourced its own deep learning library, BigDL, also without GPU support, because Intel is pushing its FPGA-juiced Xeons for accelerated BLAS for machine learning (which I first blogged about three years ago).

The big announcement on the second (non-training) day of the Summit was that Databricks created a serverless version of its commercial cloud service. This should, at least theoretically, significantly reduce the cost for companies making Spark available to their data scientists, thus (finally) offering a compelling alternative to trying to run Zeppelin, Jupyter, or Spark Shell on-premises.

A year out from Spark Summit 2016, I was surprised to hear about so many real-world uses of GraphX. The only thing I personally heard about GraphFrames was from a Databricks presentation. GraphFrames does still seem to be the future, but even that is not crystal clear, as Ion Stoica in the second day's Fireside Chat touted Tegra for (finally) mutable graphs, which is based on GraphX rather than GraphFrames. (I first blogged about Tegra in my review of last year's Spark Summit.)

There was more natural language processing (NLP) at the Summit than ever before. At the Fireside Chat, Ben Lorica pushed hard on Ion Stoica and Matei Zaharia to incorporate NLP into the Apache Spark distribution. My favorite keynote was by Riot Games on language-agnostic (English, Chinese, Japanese -- it didn't care) detection of abusive language in chat messages. And, of course, my own presentation was on NLP.

Finally, Structured Streaming got officially labeled as production-ready, meaning Spark Streaming is eventually destined for the deprecation graveyard. There was a demo of 10ms latency, to compete with Storm and Flink. No more micro-batches!

Headless server

And change zeppelin.server.addr to be either the IP address or the domain name of this server. This is to allow outside connections.

Proxy

Zeppelin seems to need npm from node.js, which in turn needs to know your proxy settings. To get around this, install node.js yourself (instead of relying on what is built in to Zeppelin) and execute npm config to set its proxy settings. Below are the instructions for installing node.js onto RedHat-type Linux distributions (CentOS, Oracle Linux, etc.). See nodejs.org for other OSes.

Wednesday, January 4, 2017

As I noted in my May 14, 2016 blog post, Spark Structured Streaming, which brings the ability to stream a data source into a DataFrame and query it with SQL in real-time, was announced with much fanfare (along with Spark 2.0) at Spark Summit 2016, but notably absent at the time was its support for Kafka.