
Working with Azure Cosmos DB Tables

Azure Cosmos DB provides the Table API for applications that need a key-value store with a flexible schema, predictable performance, and global distribution. The Azure Table storage SDKs and REST APIs can be used to work with Azure Cosmos DB. Azure Cosmos DB also supports throughput-optimized tables (informally called "premium tables"), currently in public preview.

To connect Apache Spark to the Azure Cosmos DB Table API, you can use the Spark Connector for Azure Cosmos DB as follows.
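The connection snippet itself is not reproduced here, so below is a minimal PySpark sketch of what it typically looks like with the azure-cosmosdb-spark connector. The account name, database, and container are hypothetical, and the connector JAR must be on the Spark classpath:

```python
# Hypothetical connection settings for the azure-cosmosdb-spark connector.
read_config = {
    "Endpoint": "https://tychostation.documents.azure.com:443/",  # hypothetical account
    "Masterkey": "<your-master-key>",
    "Database": "flights",     # hypothetical database name
    "Collection": "airports",  # hypothetical container name
}

def load_airports(spark):
    """Load the container into a Spark DataFrame via the connector
    (requires the azure-cosmosdb-spark JAR on the classpath)."""
    return (spark.read
            .format("com.microsoft.azure.cosmosdb.spark")
            .options(**read_config)
            .load())
```

The same read pattern applies whether the container is used through the Table, Graph, or SQL (Document) API.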

Here, we have created two Spark DataFrames – one for the airport data (the vertices) and one for the flight delay data (the edges). The graph we have stored in Azure Cosmos DB can be visually depicted as in the figure below, where the vertices are the blue circles representing the airports and the edges are the black lines representing the flights between those cities. In this example, the originating city for those flights (edges) is Seattle (the blue circle at the top left of the map, from which all the edges originate).
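As a rough illustration of the two schemas, here are hypothetical sample rows shaped like the vertex (airport) and edge (flight delay) data, plus a helper that would turn them into DataFrames; in the post itself, the real data is read through the connector:

```python
# Hypothetical sample rows mirroring the airport (vertex) and delay (edge) schemas.
airports = [
    {"iata": "SEA", "city": "Seattle", "state": "WA"},
    {"iata": "ORD", "city": "Chicago", "state": "IL"},
]
delays = [
    {"src": "SEA", "dst": "ORD", "delay": 12},  # delay in minutes; negative = early
]

def to_dataframes(spark):
    """Build the vertex and edge DataFrames used by the graph queries."""
    return spark.createDataFrame(airports), spark.createDataFrame(delays)
```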

Figure: D3.js visualization of the airports (blue circles) and the edges (black lines), which are the flights between the cities.

Benefit of Integrating Cosmos DB Graphs with Spark

One of the key benefits of working with Azure Cosmos DB graphs and the Spark connector is that Gremlin queries and Spark DataFrame queries (as well as other Spark queries) can be executed against the same data container (be it a graph, a table, or a collection of documents). For example, below are some simple Gremlin Groovy queries against this flights graph stored in our Azure Cosmos DB graph.

The preceding code connects to the tychostation graph (tychostation.graphs.azure.com) to run the following Gremlin Groovy queries:

Using graph traversal and groupCount(), determines the number of flights originating from Seattle to each of the listed destination cities (e.g., there are 1,088 flights from Seattle to Chicago in this dataset).

Using graph traversal, determines the total delay (in minutes) of the 90 flights from Seattle to Reno (i.e., 963 minutes of delay).
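The two traversals above can be mirrored in plain Python over the edge rows to make the aggregation explicit. The sample rows below are hypothetical; the counts quoted in the post come from the full dataset:

```python
from collections import Counter

# Hypothetical sample of flight edges (src/dst are IATA codes, delay in minutes).
edges = [
    {"src": "SEA", "dst": "ORD", "delay": 10},
    {"src": "SEA", "dst": "ORD", "delay": -3},
    {"src": "SEA", "dst": "RNO", "delay": 25},
]

# Equivalent of the groupCount() traversal: flights out of Seattle per destination.
flights_per_dest = Counter(e["dst"] for e in edges if e["src"] == "SEA")

# Equivalent of summing the delay over the SEA -> RNO edges.
total_sea_rno_delay = sum(e["delay"] for e in edges
                          if e["src"] == "SEA" and e["dst"] == "RNO")
```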

With the Spark connector using the same tychostation graph, we can also run our own Spark DataFrame queries. Following up from the preceding Spark connector code snippet, let’s run our Spark SQL queries – in this case we’re using the HDInsight Jupyter notebook service.

Top 5 destination cities departing from Seattle by total early-arrival time (the delay < 0 filter keeps flights that arrived ahead of schedule)

%%sql
select a.city, sum(f.delay) as TotalDelay
from delays f
join airports a
on a.iata = f.dst
where f.src = 'SEA' and f.delay < 0
group by a.city
order by sum(f.delay) limit 5

Calculate median delays (negative values, i.e., early arrivals) by destination city for flights departing from Seattle

%%sql
select a.city, percentile_approx(f.delay, 0.5) as median_delay
from delays f
join airports a
on a.iata = f.dst
where f.src = 'SEA' and f.delay < 0
group by a.city
order by median_delay
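Note that percentile_approx(f.delay, 0.5) returns an approximate median, which scales to large distributed datasets. On a small in-memory sample, the exact equivalent is straightforward in plain Python (the city/delay values below are hypothetical):

```python
from statistics import median
from collections import defaultdict

# Hypothetical (city, delay) pairs for flights out of Seattle with delay < 0.
rows = [("Chicago", -3), ("Chicago", -7), ("Reno", -1), ("Reno", -12), ("Reno", -5)]

# Group the delays by destination city, then take the exact median per city.
delays_by_city = defaultdict(list)
for city, delay in rows:
    delays_by_city[city].append(delay)

median_delay = {city: median(vals) for city, vals in delays_by_city.items()}
```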

With Azure Cosmos DB, you can run both Apache TinkerPop Gremlin queries and Apache Spark DataFrame queries against the same graph.

Working with Azure Cosmos DB Document data model

Whether you are using the Azure Cosmos DB graph, table, or documents, from the perspective of the Spark Connector for Azure Cosmos DB, the code is the same! Ultimately, the template to connect to any of these data models is noted below:
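A sketch of that template is shown here; the placeholder values are hypothetical, and only the Database and Collection entries change across the table, graph, and document data models:

```python
# Hypothetical connection template; the same shape works for any data model.
connection_template = {
    "Endpoint": "https://<account>.documents.azure.com:443/",
    "Masterkey": "<master-key>",
    "Database": "<database>",
    "Collection": "<table | graph | document collection>",
}
```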

And, with the Spark Connector for Azure Cosmos DB, data is parallelized between the Spark worker nodes and the Azure Cosmos DB data partitions. Therefore, whether your data is stored as tables, graphs, or documents, you get the performance, scalability, throughput, and consistency backed by Azure Cosmos DB when solving your machine learning and data science problems with Apache Spark.

Next Steps

In this blog post, we’ve looked at how the Spark Connector for Azure Cosmos DB can seamlessly interact with multiple data models supported by Azure Cosmos DB. Apache Spark with Azure Cosmos DB enables both ad hoc, interactive queries on big data and advanced analytics, data science, machine learning, and artificial intelligence. Azure Cosmos DB can be used for capturing data that is collected incrementally from various sources across the globe. This includes social analytics, time series, game or application telemetry, retail catalogs, up-to-date trends and counters, and audit log systems. Spark can then be used for running advanced analytics and AI algorithms at scale on top of the data living in Azure Cosmos DB. With Azure Cosmos DB being the industry’s first globally distributed, multi-model database service, the Spark connector for Azure Cosmos DB can work with tables, graphs, and document data models – with more to come!