Using Amazon EMR with Greenplum Database installed on AWS

Amazon Elastic MapReduce (EMR) is a managed cluster platform that can run big data
frameworks, such as Apache Hadoop and Apache Spark, on Amazon Web Services (AWS) to process
and analyze data. For a Greenplum Database system that is installed on AWS, you can define
Greenplum Database external tables that use the gphdfs protocol to access files
in HDFS on an Amazon EMR instance.
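As a rough illustration of such an external table definition, the following Python sketch assembles a CREATE EXTERNAL TABLE statement with a gphdfs location. The helper name, the table and column names, the host name, the file path, and port 8020 are all hypothetical placeholders; substitute the NameNode address and file path of your own EMR instance.

```python
def gphdfs_external_table_ddl(table, columns, namenode_host, namenode_port,
                              hdfs_path, delimiter=","):
    """Build a CREATE EXTERNAL TABLE statement that reads a text file via gphdfs.

    columns is a list of (name, sql_type) pairs. All values are placeholders
    supplied by the caller; nothing here contacts a database or HDFS.
    """
    column_sql = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    # gphdfs locations take the form gphdfs://host:port/path
    location = f"gphdfs://{namenode_host}:{namenode_port}/{hdfs_path.lstrip('/')}"
    return (
        f"CREATE EXTERNAL TABLE {table} ({column_sql})\n"
        f"LOCATION ('{location}')\n"
        f"FORMAT 'TEXT' (DELIMITER '{delimiter}');"
    )


if __name__ == "__main__":
    # Hypothetical table, host, port, and path, for illustration only.
    print(gphdfs_external_table_ddl(
        "ext_sales",
        [("id", "int"), ("amount", "numeric")],
        "emr-master.example.internal",
        8020,
        "/data/sales.txt",
    ))
```

The generated statement can then be run with psql or any other Greenplum Database client once the gphdfs protocol is installed and the EMR instance is reachable.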

In addition to the steps described in One-time HDFS Protocol Installation, you must also ensure
that Greenplum Database can access the EMR instance. If your Greenplum Database system is
running on an Amazon Elastic Compute Cloud (EC2) instance, you must configure both the
Greenplum Database system and the EMR security group.

Configure communication between Greenplum Database and the EMR instance Hadoop data
nodes. Open a TCP/IP port so that Greenplum Database segment hosts can
communicate with the EMR instance Hadoop data nodes.

For example, open port 50010 in the AWS security manager.
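After the port is opened, one way to confirm that a segment host can reach a data node is a plain TCP connection test. The following is a minimal sketch using only the Python standard library. The function name and the host name in the usage example are hypothetical, and a successful connection only shows that the port is reachable, not that gphdfs is configured correctly.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical data node address; run this from a Greenplum segment host.
    host = "emr-datanode.example.internal"
    print(host, "port 50010 reachable:", port_is_reachable(host, 50010))
```

Run the check from each segment host, because the security group rule must allow traffic from all of them and not only from the master host.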

This table lists EMR and Hadoop version information that can be used to configure
Greenplum Database.