Neo4j Blog

Cypher – the SQL for Graphs – Is Now Available for Apache Spark

In case you missed it at GraphConnect New York: The Neo4j team has announced the public alpha release of Cypher for Apache Spark™.

We’ve been building Cypher for Apache Spark for over a year now and have donated it to the openCypher project under an Apache 2.0 license, allowing external contributors to join at this early juncture of an alpha release. Find the current language toolkit on GitHub.

Making Cypher More Accessible to Data Scientists

Cypher for Apache Spark will allow big data analysts and data scientists to incorporate graph querying into their workflows, making it easier to leverage graph algorithms and dramatically broadening how they reveal data connections.

Until now, the full power of graph pattern matching has been unavailable to data scientists using Spark, whether for analysis or for data-wrangling pipelines. Now, with Cypher for Apache Spark, data scientists can iterate more easily and connect adjacent data sources to their graph applications much more quickly.

As graph-powered applications and analytic projects gain success, big data teams are looking to connect more of their data and personnel into this work. This is happening at places like eBay for recommendations via conversational commerce, Telia for smart home services, and Comcast for smart home content recommendations.

Cypher for Apache Spark: A Closer Look

Cypher for Apache Spark enables the execution of Cypher queries on property graphs stored in an Apache Spark cluster in the same way that SparkSQL allows for the querying of tabular data. The system provides both the ability to run Cypher queries as well as a more programmatic API for working with graphs inspired by the API of Apache Spark.
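As a sketch of what this looks like in practice: the snippet below follows the shape of the alpha's published examples, but the package path, `CAPSSession.local()`, `readFrom`, and `cypher` are assumptions about an early API and may differ in later releases.

```scala
import org.opencypher.caps.api.spark.CAPSSession

// A sketch of running Cypher on a Spark-backed property graph,
// analogous to running SQL on tables with SparkSQL.
implicit val session: CAPSSession = CAPSSession.local()

// persons and friendships are placeholders for node and relationship
// inputs, which would normally come from Spark data or an external source.
val graph = session.readFrom(persons, friendships)

// Run a Cypher query over the graph; the result exposes a tabular view.
val result = graph.cypher("MATCH (p:Person) RETURN p.name")
result.print
```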

Cypher for Apache Spark is the first implementation of Cypher with support for working with multiple named graphs and query composition. Cypher queries can access multiple graphs, dynamically construct new graphs, and return such graphs as part of the query result.

Furthermore, both the tabular and graph results of a Cypher query may be passed on as input to a follow-up query. This enables complex data processing pipelines across multiple heterogeneous data sources to be constructed incrementally.
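A hypothetical sketch of such a multi-graph, composable query, assuming a `CAPSSession` named `session`. The `FROM GRAPH`, `CONSTRUCT`, and `RETURN GRAPH` clauses follow the multiple-graphs constructs proposed in openCypher; the alpha's exact keywords may differ.

```scala
// Match friends in one named graph, their purchases in another,
// construct a new recommendation graph, and return it as the result.
val recommendations = session.cypher(
  """|FROM GRAPH social
     |MATCH (p:Person)-[:FRIEND]->(f:Person)
     |FROM GRAPH purchases
     |MATCH (f)-[:BOUGHT]->(prod:Product)
     |CONSTRUCT
     |  CREATE (p)-[:SHOULD_BUY]->(prod)
     |RETURN GRAPH
     |""".stripMargin)
```

The returned graph could itself be the input to a follow-up query, which is what makes incremental pipeline construction possible.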

Cypher for Apache Spark provides an extensible API for integrating additional data sources for loading and storing graphs. Initially, Cypher for Apache Spark will support loading graphs from HDFS (CSV, Parquet), the file system, session local storage, and via the Bolt protocol (i.e., from Neo4j). In the future, we plan to integrate further technologies at both the data source and API levels.
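Conceptually, that might look like the following, assuming a `CAPSSession` named `session`; the method name `graphAt` is a hypothetical placeholder for the alpha's data-source API.

```scala
// Hypothetical sketch: resolve graphs by URI, with the scheme selecting
// the data source implementation.
val fromHdfs  = session.graphAt("hdfs:///graphs/products")    // CSV or Parquet scans
val fromNeo4j = session.graphAt("bolt://neo4j-host:7687")     // via the Bolt protocol
```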

Cypher for Apache Spark is also the first open source implementation of Cypher in a distributed memory / big data environment outside of academia. Property graphs are represented as a set of scan tables that each correspond to all nodes with a certain label or all relationships with a certain type.
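To make the scan-table representation concrete, here is a minimal sketch using plain Spark DataFrames. The column names and the two-table layout are illustrative, not the alpha's exact schema.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// One scan table holds all :Person nodes: an id column plus one column
// per property.
val personScan = Seq((0L, "Alice", 42L), (1L, "Bob", 23L))
  .toDF("id", "name", "age")

// A second scan table holds all :KNOWS relationships: its own id,
// source and target node ids, and relationship properties.
val knowsScan = Seq((0L, 0L, 1L, 2016L))
  .toDF("id", "source", "target", "since")
```

Pattern matching then becomes joins over these tables, which is what lets Cypher execution ride on Spark's existing distributed query machinery.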

Conclusion

We at Neo4j are proud to be contributing Cypher for Apache Spark to the openCypher project to make the “SQL for Graphs” available on Spark and to the wider community. This is an early alpha release, and we will help further develop and refine Cypher for Apache Spark leading up to the first public 1.0 release next year.

Until then, we look forward to your feedback and contributions. The data industry is recognizing the true power of graph technology, and we’re happy to be building the de facto graph query language alongside our amazing community.

New to the world of graph technology?
Click below to get your free copy of the O’Reilly Graph Databases book and discover how to harness the power of graph database technology.

About the Author

Philip Rathle, VP of Products

Philip Rathle has a passion for building great products that help users solve tomorrow’s challenges. He spent the first decade of his career building information solutions for some of the world’s largest companies: first with Accenture, then with Tanning Technology, one of the world’s top database consultancies of the time, as a solution architect focusing on data warehousing and BI strategy.

4 Comments

Normally, IDEs can offer plugins for the Cypher language based on its BNF. The example here shows it as a string in the Scala snippet. Is there a way to get such IDE highlighting with that Zeppelin setup?

Spark SQL is replete with rich features, and we can convert between RDDs, DataFrames, and Datasets. Alongside, we have GraphX, with inter-conversion there too. What features does Cypher have that aren’t in Spark SQL or GraphX? I would like to see the use cases.

CAPS doesn’t provide Cypher capabilities over CSV; you can only create graphs by mapping nodes to each other.
The Spark graph is also immutable, so you can’t apply Cypher queries to change it.
And user-defined functions such as “apoc” don’t work with CAPS.