The Geodocker project makes it straightforward to get started with spatio-temporal analysis using GeoMesa, Accumulo, and Spark on Amazon Web Services (AWS). This guide describes how to bootstrap a GeoMesa Accumulo cluster using Amazon Elastic MapReduce (EMR) and Docker, then ingest and query some sample data. It assumes you already have an AWS account and an EC2 key pair. To set up the AWS command line tools, follow the instructions in the AWS online documentation.

Use the following command to bootstrap an EMR cluster. Change KeyName to the EC2 key pair you intend to use for this cluster. You can also change the instance types to a size appropriate for your use case; if you do, adjust the Accumulo memory settings to match. For instance, on a high-memory instance type you will want to increase TSERVER_XMX, which controls the amount of heap space allocated to the JVM running the Accumulo Tablet Server. You should then set TSERVER_CACHE_DATA_SIZE and TSERVER_CACHE_INDEX_SIZE to appropriate fractions of TSERVER_XMX so that the caches take advantage of the larger heap.
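As a rough illustration of sizing the caches relative to the heap, the sketch below derives the three settings from a heap size in gigabytes. The helper name `tserver_settings` and the 60% / 20% split are assumptions chosen for illustration, not GeoMesa or Accumulo recommendations; the only firm constraint taken from the text above is that both caches are fractions of TSERVER_XMX.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive Accumulo cache sizes from a tablet server heap size (in GB).
# The 60% (data) / 20% (index) split is an illustrative assumption; tune it for your workload.
tserver_settings() {
  local heap_gb=$1
  local data_gb=$(( heap_gb * 60 / 100 ))
  local index_gb=$(( heap_gb * 20 / 100 ))
  echo "TSERVER_XMX=${heap_gb}G"
  echo "TSERVER_CACHE_DATA_SIZE=${data_gb}G"
  echo "TSERVER_CACHE_INDEX_SIZE=${index_gb}G"
}

# Example: settings for a 10 GB tablet server heap.
tserver_settings 10
```

The split deliberately leaves part of the heap unallocated so the tablet server has room for its other in-memory structures.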

After executing that command, you can monitor the state of the EMR bootstrap process in the Management Console. Find the cluster by the name specified in the aws emr command and click through to its details page. Under the Hardware section, you can find the master node and its IP address. Copy the IP address and then run the following command.

$ ssh -i /path/to/key ec2-user@<ip_address>
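The console lookup can also be scripted. The following is a minimal sketch, assuming the AWS CLI is configured and you know the cluster id that `aws emr create-cluster` returned; the function names are hypothetical and `j-XXXXXXXXXXXXX` is a placeholder.

```shell
#!/usr/bin/env bash
# Block until EMR reports the cluster as running.
wait_for_cluster() {
  aws emr wait cluster-running --cluster-id "$1"
}

# Print the public DNS name of the master node.
master_address() {
  aws emr describe-cluster --cluster-id "$1" \
    --query 'Cluster.MasterPublicDnsName' --output text
}

# Example (j-XXXXXXXXXXXXX is a placeholder cluster id):
#   wait_for_cluster j-XXXXXXXXXXXXX
#   ssh -i /path/to/key "ec2-user@$(master_address j-XXXXXXXXXXXXX)"
```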

The ssh command should log you into the master node of the EMR cluster you just started. You can see a list of the running Docker containers by running the following command:
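A minimal sketch, assuming the standard Docker CLI is available on the master node and is invoked through `sudo`. A one-off listing is simply `sudo docker ps`; because the containers only appear once bootstrapping has finished, the hypothetical helper below retries until at least one container is reported.

```shell
#!/usr/bin/env bash
# DOCKER can be overridden; `sudo docker` is an assumption about how Docker is invoked on the node.
DOCKER=${DOCKER:-"sudo docker"}

# Print the names of running containers, retrying until at least one is up.
# Usage: wait_for_containers [attempts] [seconds_between_attempts]
wait_for_containers() {
  local attempts=${1:-30} delay=${2:-10} names i
  for (( i = 1; i <= attempts; i++ )); do
    names=$($DOCKER ps --format '{{.Names}}' 2>/dev/null)
    if [ -n "$names" ]; then
      echo "$names"
      return 0
    fi
    sleep "$delay"
  done
  echo "no containers running after $attempts attempts" >&2
  return 1
}

# Example: poll every 10 seconds, up to 30 times.
#   wait_for_containers 30 10
```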

Leave enough time for the machine to finish bootstrapping before running the command to list the Docker containers.

GeoMesa ships with predefined data models for many open spatio-temporal data sets, such as GDELT. To ingest the most recent 7 days of GDELT from Amazon’s public S3 bucket:
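One way to build the list of input files is sketched below. It assumes GNU `date` and that the daily event exports are laid out as `s3a://gdelt-open-data/events/YYYYMMDD.export.csv`; the function name `gdelt_files` is hypothetical. The ingest call in the trailing comment is likewise an assumption: the container name, catalog name, and credentials depend on how your cluster was bootstrapped.

```shell
#!/usr/bin/env bash
# Print the S3 URIs of the daily GDELT event exports for the last N days, oldest first.
# Assumes GNU date and the bucket layout s3a://gdelt-open-data/events/YYYYMMDD.export.csv.
gdelt_files() {
  local days=${1:-7} i
  for (( i = days; i >= 1; i-- )); do
    echo "s3a://gdelt-open-data/events/$(date -d "$i days ago" +%Y%m%d).export.csv"
  done
}

# Hypothetical ingest call; container name, catalog, and credentials are assumptions:
#   sudo docker exec accumulo-master geomesa ingest -c geomesa.gdelt -C gdelt -f gdelt -s gdelt \
#     -u root -p secret $(gdelt_files 7)
```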

You can also register GDELT as a layer in the provided GeoServer, which runs on port 9090 of the master node. You can access it at http://<ip_address>:9090/geoserver, where <ip_address> is the address you looked up before sshing into the master node. To register a GeoMesa layer, you’ll first need to know the internal URL of the Zookeeper instance. Run the following command:
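A minimal sketch of one way to look this up, assuming the Zookeeper address is recorded in the `instance.zookeeper.host` property of Accumulo's `accumulo-site.xml`. The helper name `zookeeper_host`, the container name, and the config path in the usage comment are assumptions.

```shell
#!/usr/bin/env bash
# Read an accumulo-site.xml on stdin and print the value of instance.zookeeper.host.
zookeeper_host() {
  grep -A 2 '<name>instance.zookeeper.host</name>' \
    | sed -n 's:.*<value>\(.*\)</value>.*:\1:p' \
    | head -n 1
}

# Hypothetical usage; the container name and config path are assumptions:
#   sudo docker exec accumulo-master cat /opt/accumulo/conf/accumulo-site.xml | zookeeper_host
```

The printed value is what you would enter as the Zookeeper address when configuring the store in GeoServer.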

Save the store and publish the gdelt layer. Set the “Native Bounding Box” and the “Lat Lon Bounding Box” to
-180,-90,180,90. Save the layer. Then, navigate to the preview page at
http://<ip_address>:9090/geoserver/cite/wms?service=WMS&version=1.1.0&request=GetMap&layers=cite:gdelt&styles=&bbox=-180,-90,180.0,90&width=768&height=356&srs=EPSG:4326&format=application/openlayers.
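The preview URL is an ordinary WMS GetMap request, so it can also be built and fetched from a script. The sketch below assembles the same request for a given host; the function name `gdelt_wms_url` is hypothetical, and `203.0.113.10` is a placeholder address.

```shell
#!/usr/bin/env bash
# Build a WMS GetMap URL for the cite:gdelt layer covering the whole world.
# Usage: gdelt_wms_url <host> [format]   (format defaults to the OpenLayers preview)
gdelt_wms_url() {
  local host=$1 format=${2:-application/openlayers}
  printf 'http://%s:9090/geoserver/cite/wms?service=WMS&version=1.1.0&request=GetMap' "$host"
  printf '&layers=cite:gdelt&styles=&bbox=-180,-90,180,90&width=768&height=356'
  printf '&srs=EPSG:4326&format=%s\n' "$format"
}

# Example: save a static PNG rendering instead of opening the interactive preview.
#   curl -o gdelt.png "$(gdelt_wms_url 203.0.113.10 image/png)"
```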

Your bootstrapped spatial analytic environment has an instance of Jupyter notebook configured to analyze data in GeoMesa using SparkSQL and to visualize the results using Leaflet maps and Vegas (Vega-Lite) charts. To start, navigate to http://<ip_address>:8888/ where <ip_address> is the publicly accessible IP address of the master node. You will see a sample GDELT analysis notebook.

Click the GDELT Analysis notebook. Edit the zookeeper value in the first cell, setting it to the Zookeeper address you looked up earlier. Then select Cell -> Run All from the menu bar to execute all the cells in the notebook. Scroll through the sample and you will see some map and chart visualizations at the bottom.