Friday, October 20, 2017

Visualize SpatialHadoop indexes

I have received several requests asking for help in building visualizations for SpatialHadoop indexes. In many of my papers, posters, and presentations, I display a visualization of spatial indexes like the one shown below.

A Quad-tree-based index for a 400 GB dataset that represents the world road network extracted from OpenStreetMap.

There are actually several ways to visualize these indexes and the good news is that all of them are fairly simple. You can choose between them based on your needs.

Prerequisites

You need SpatialHadoop installed and running to build the indexes that we are going to visualize. I assume that you already have a spatial index constructed using SpatialHadoop and you only need to visualize it. For more details on how to set up SpatialHadoop and use it to build distributed indexes for big spatial data, please check the SpatialHadoop website and Wiki pages.

Using QGIS

The most straightforward way to visualize your index is through QGIS. As a by-product of the SpatialHadoop index command, a WKT file is generated that describes the shape of the index along with some additional information such as the size of each partition. By loading this small file into QGIS, you can interactively explore the index. Here are more detailed steps.

Build an index in SpatialHadoop using the 'index' command.

In the index directory, you will find a file with the extension '.wkt'. Copy that file to the local machine using the 'hdfs dfs -get' command. For example:

hdfs dfs -get cemetery.str/_str.wkt .

Start QGIS and use the "Add Delimited Text Layer" button.

Use the "Browse" button and choose the wkt file.

Usually, QGIS can automatically detect the format of the file. In case you need to manually set the options, please do the following:

Choose the "Tab" delimiter.

Check the box "First record has field names"

Choose "Well Known Text (WKT)" geometry definition.

Choose "Boundaries" as a Geometry field.

Press the OK button to import the file.

Set the correct Coordinate Reference System according to your data format. Usually, you can choose "WGS 84".

The index partitions are displayed in QGIS as in the following picture.

In QGIS, you can select any partition to find all the details about it such as the corresponding file name, size in bytes, or the total number of records. You can also interactively zoom in and out or color the partitions based on their attributes. The drawback of this method is that you do not see the raw data. While you can load the original file in QGIS as well, it will be too slow if the input file is large.
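If you want to check the file before importing it, the same settings used in the QGIS dialog (tab delimiter, field names in the first record, WKT geometry in the "Boundaries" field) are enough to read it with a few lines of Python. The following is only a minimal sketch under those assumptions. The helper names `read_partitions` and `envelope` are mine, not part of SpatialHadoop, and the sketch does not assume any column name other than "Boundaries".

```python
import csv
import re

# One "x y" coordinate pair inside a WKT geometry string.
NUMBER = r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"
COORD = re.compile(rf"({NUMBER})\s+({NUMBER})")


def read_partitions(path, geometry_field="Boundaries"):
    """Read a tab-delimited file whose first record holds the field names."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        fields = reader.fieldnames or []
        if geometry_field not in fields:
            raise ValueError(f"no {geometry_field!r} field; found {fields}")
        return list(reader)


def envelope(wkt):
    """Return (min_x, min_y, max_x, max_y) over all coordinates in a WKT string."""
    points = [(float(x), float(y)) for x, y in COORD.findall(wkt)]
    if not points:
        raise ValueError("no coordinates found in geometry")
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)
```

Calling `read_partitions("_str.wkt")` returns one dictionary per partition, and `envelope(row["Boundaries"])` gives the bounding box of that partition. If the call raises an error about a missing "Boundaries" field, the delimiter or header setting is probably the one to adjust in QGIS as well.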

Using HadoopViz

HadoopViz is the visualization component of SpatialHadoop. You can use HadoopViz to render two separate images, one for the data and one for the index, and then overlay them on top of each other. Please follow the steps below assuming the index is in HDFS under the directory 'cemetery.str'.

To plot the data, issue the following command:

shadoop gplot cemetery.str cemetery.png shape:osm

The command will produce the image 'cemetery.png', such as the one below.

Issue the following command to plot the index:

shadoop gplot cemetery.str/_master.str cemetery_index.png shape:edu.umn.cs.spatialHadoop.indexing.Partition

The output will look similar to the image below.

All you need to do after this is to overlay the two images on top of each other to get the final picture shown below.
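Any image editor can do the overlay. If you prefer to script it, here is a minimal sketch using the Pillow library. It rests on several assumptions that you should verify for your own output: Pillow is installed, both PNG files are available on the local machine, both have the same pixel dimensions and cover the same extent, and the index image has a transparent background. The function name `overlay` and the output file name in the usage note are my own choices.

```python
from PIL import Image


def overlay(data_path, index_path, out_path):
    """Draw the index image on top of the data image and save the result."""
    data = Image.open(data_path).convert("RGBA")
    index = Image.open(index_path).convert("RGBA")
    if data.size != index.size:
        raise ValueError(f"image sizes differ: {data.size} vs {index.size}")
    # Assumption: the index image is transparent outside the partition
    # borders, so the data stays visible everywhere else.
    Image.alpha_composite(data, index).save(out_path)
```

For the example above, the call would be `overlay("cemetery.png", "cemetery_index.png", "cemetery_overlay.png")`. If the index image turns out to have an opaque background, alpha compositing would hide the data entirely, and a different blend would be needed.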

While you will not be able to interactively zoom in and out in this picture, this method can easily scale to very large data. The reason is that the gplot function runs as a MapReduce program in SpatialHadoop and is able to scale to terabytes or more depending on your cluster size.