If we assume that a county with more GDELT events is somehow more politically relevant, we may be interested
in seeing which counties have the highest density of GDELT events.

Since GDELT is a data set with point geometries, we do not immediately know which county each event belongs to. To resolve this,
we will use a second data set, a FIPS code shapefile that provides the boundary and FIPS code of each county, and join each GDELT
point with the county that contains it.

In this case, the number of counties is particularly small (approximately 3,000 records), so we can make our query much
more efficient by “broadcasting” the counties. “Broadcast” here means a Spark broadcast variable.
In a traditional Spark SQL join, data is shuffled between the executors based on the partitioners of the RDDs,
and since in our case the join key is a geometric field, there is no built-in Spark partitioner that can distribute the data effectively.
The resulting movement of data across nodes is expensive, so we can attain a performance boost by sending (broadcasting)
our entire small data set to each of the nodes once. This ensures that the executors have all the data needed to compute
the join, and no additional shuffling is needed.

Spark provides a means of executing this “broadcast join”, though it should only be used when the data being broadcast
is small enough to fit in the memory of the executors.
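
In Spark SQL this is the broadcast hint from org.apache.spark.sql.functions. A minimal sketch, where the DataFrame names and join condition are placeholders:

    import org.apache.spark.sql.functions.broadcast

    // Ship the small DataFrame to every executor instead of shuffling the large one
    val result = largeDF.join(broadcast(smallDF), joinCondition)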

For this tutorial, we will assume that you have already ingested the two data sets into the data store of your choosing.
Following this tutorial without having created the necessary tables will lead to errors.

To start working with Spark, we will need a Spark Session initialized, and to apply GeoMesa’s geospatial User Defined
Types (UDTs) and User Defined Functions (UDFs) to our data in Spark, we will need to initialize our SparkSQL extensions.
This functionality requires having the appropriate GeoMesa Spark runtime jar on the classpath when running your Spark job.
GeoMesa provides Spark runtime jars for Accumulo, HBase, and FileSystem data stores. For example, the following would start an
interactive Spark REPL with all dependencies needed for running Spark with GeoMesa version 2.0.0 on an Accumulo data store.
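
A sketch of that invocation; the jar path is an assumption, so adjust it to wherever the runtime jar lives on your machine:

    $ spark-shell --jars /path/to/geomesa-accumulo-spark-runtime_2.11-2.0.0.jar \
        --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
        --conf spark.kryo.registrator=org.locationtech.geomesa.spark.GeoMesaSparkKryoRegistrator

Inside the REPL, GeoMesa’s geospatial types and functions can then be registered on the session provided by the shell:

    // Register GeoMesa's geometry UDTs and spatial UDFs on the existing SparkSession
    import org.locationtech.geomesa.spark.jts._
    spark.withJTS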

Note the withJTS, which registers GeoMesa’s UDTs and UDFs, and the two config options which tell Spark to
use GeoMesa’s custom Kryo serializer and registrator to handle serialization of Simple Features. These configuration options can
also be set in the conf/spark-defaults.conf configuration file.
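
For reference, the equivalent entries in conf/spark-defaults.conf would look like:

    spark.serializer        org.apache.spark.serializer.KryoSerializer
    spark.kryo.registrator  org.locationtech.geomesa.spark.GeoMesaSparkKryoRegistrator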

The above parameters assume Accumulo as the backing data store, but the rest of the tutorial is independent of which
data store is used. Other supported data stores may be used by simply adapting the above parameters appropriately.

Then we make use of Spark’s DataFrameReader and our SpatialRDDProvider to create a DataFrame with geospatial
types.
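
A sketch of those reads, assuming both feature types were ingested into the same Accumulo catalog; the connection parameters and the feature type names (“gdelt” and “fips”) are placeholders for whatever you used during ingest:

    // Connection parameters for the underlying data store (Accumulo shown; all values are placeholders)
    val dsParams = Map(
      "instanceId" -> "myInstance",
      "zookeepers" -> "zoo1,zoo2,zoo3",
      "user"       -> "user",
      "password"   -> "password",
      "tableName"  -> "myCatalog")

    // Load each feature type as a DataFrame with geospatial (JTS) column types
    val gdeltDF = spark.read
      .format("geomesa")
      .options(dsParams)
      .option("geomesa.feature", "gdelt")
      .load()

    val fipsDF = spark.read
      .format("geomesa")
      .options(dsParams)
      .option("geomesa.feature", "fips")
      .load()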

Depending on the scale of the data in our data store, and how specific our questions are, we may want to narrow the result
before joining. For example, if we only wanted GDELT events within a one-week span, we could filter the DataFrame as
follows:
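
A sketch of such a filter, assuming the GDELT feature type has a dtg timestamp attribute (the column name and date range are placeholders):

    import spark.implicits._   // enables the $"column" syntax

    // Keep only events from a single week
    val gdeltWeekDF = gdeltDF
      .filter($"dtg" >= "2018-01-01" && $"dtg" < "2018-01-08")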

Now we’re ready to join the two data sets. This is where we will make use of our geospatial UDFs. st_contains takes
two geometries as input and returns whether the second geometry lies within the first. For more documentation
and a full list of the UDFs provided by GeoMesa see SparkSQL Functions.
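
A sketch of the join, assuming the county geometry column is named the_geom and the GDELT point column is named geom (both depend on the ingested schemas); st_contains is used here as a DataFrame column function from the earlier geomesa jts import:

    import org.apache.spark.sql.functions.broadcast

    // Broadcast the small county DataFrame and keep every event that falls inside a county polygon
    val joinedDF = gdeltWeekDF.join(
      broadcast(fipsDF),
      st_contains($"the_geom", $"geom"))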

Now we have a DataFrame where each GDELT event is paired with the US county where it occurred.
To turn this into meaningful statistics about the distribution of GDELT events in the US, we
can do a GROUP BY operation and use some of SparkSQL’s aggregate functions.
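
A sketch of that aggregation; the FIPS and event-id column names are assumptions about the ingested schemas:

    import org.apache.spark.sql.functions.countDistinct

    // Count the distinct GDELT events that landed in each county
    val countiesDF = joinedDF
      .groupBy($"STATE_FIPS", $"CNTY_FIPS")
      .agg(countDistinct($"globalEventId").as("count"))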

The above query groups the data by FIPS code (which is split into a state code and a county code)
and counts the number of distinct GDELT events in each one. The result can be used to generate a visualization
of the event density in each county, which we will see in the next section.

If the result can fit in memory, it can then be collected on the driver and written to a file. If not, each executor can
write to a distributed file system like HDFS.

    // geojsonDF is assumed to hold one GeoJSON Feature string per row
    val geoJsonString = geojsonDF.collect.mkString("[", ",", "]")
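
A sketch of the distributed alternative, again assuming geojsonDF holds one GeoJSON feature string per row (the output path is a placeholder):

    // Each executor writes its own partition of GeoJSON strings directly to HDFS
    geojsonDF.write.text("hdfs:///tmp/gdelt-county-aggregate")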

Once we have our data exported as GeoJSON, we can create a Leaflet map, which is an interactive
map in JavaScript that can be embedded into a web page.

Loading and parsing the JSON is simple. In this case we are wrapping the file load in an XMLHttpRequest callback function
for compatibility with a notebook like Jupyter or Zeppelin. If the GeoJSON was exported to a file named aggregate.geojson,
then the following JavaScript will load that file into a Leaflet map.
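
A minimal sketch of that loader, assuming a Leaflet map object named map already exists on the page and that the globals used by the helper functions below (aggFeature, numBins, bins, colorRange) have been defined:

    // Fetch the exported GeoJSON and add it to the Leaflet map once it arrives
    var xhr = new XMLHttpRequest();
    xhr.open("GET", "aggregate.geojson", true);
    xhr.onreadystatechange = function() {
      if (xhr.readyState === 4 && xhr.status === 200) {
        var features = JSON.parse(xhr.responseText);
        createBins(features);            // compute histogram bins for coloring
        L.geoJSON(features, {
          style: function(feature) {
            return { fillColor: getColor(feature.properties[aggFeature]), fillOpacity: 0.7, weight: 1 };
          },
          onEachFeature: decorate        // attach a popup of each feature's properties
        }).addTo(map);
      }
    };
    xhr.send();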

This does make use of a few helper functions for setting the color and popup content of each item on the map:

    // Create the bins of the histogram, allows for coloring features by value
    function createBins(json) {
      var min = Number.MAX_SAFE_INTEGER;
      var max = 0;
      json.forEach(function(feature) {
        let aggValue = Number(feature.properties[aggFeature]);
        if (aggValue < min) min = aggValue;
        if (aggValue > max) max = aggValue;
      });
      var interval = (max - min) / numBins;
      for (var i = 0; i < numBins; i++) {
        bins.push(i * interval);
      }
    }

    // Get the fill color based on which bin a value is in
    function getColor(value) {
      var fillColor = colorRange[numBins];
      for (var x = 0; x < numBins; x++) {
        if (Number(value) < bins[x]) {
          fillColor = colorRange[x];
          break;
        }
      }
      return fillColor;
    }

    // Decorate a feature with a popup of its properties
    function decorate(feature, layer) {
      feature.properties.popupContent = Object.entries(feature.properties).join("<br/>").toString();
      layer.bindPopup(feature.properties.popupContent);
    }