MapReduce Executor

The MapReduce executor starts a MapReduce job in HDFS or MapR FS each time it receives an event record. Use the MapReduce executor as part of an event stream.

You can use the MapReduce executor to start a custom job, such as a validation job that
compares the number of records in files. You can also use the MapReduce executor to
start a predefined job that converts Avro files to Parquet.

You can use the executor in any logical way, such as running
MapReduce jobs after the Hadoop FS or MapR FS destination closes files.
For example, you might use the executor to convert an Avro file to Parquet after the
Hadoop FS destination closes a file as part of the Data Drift
Synchronization Solution for Hive.

Note that the MapReduce executor starts jobs in an external system. It does not monitor the job or wait for it to complete. The executor becomes available for additional processing as soon as it successfully submits the job.

When you configure the MapReduce executor, you specify connection information and job
details. For the Avro to Parquet job, you specify details such as the output file
directory and optional compression codec. For other types of jobs, you enter the
key-value pairs to use.

When necessary, you can enable Kerberos authentication and specify a MapReduce user. You
can also use MapReduce configuration files and add other MapReduce configuration
properties as needed.

You can also configure the executor to generate events for
another event stream. For more information about dataflow
triggers and the event framework, see Dataflow Triggers Overview.

Related Event Generating Stages

Use the MapReduce executor in the event stream of a
pipeline. The MapReduce executor is meant to start MapReduce jobs after output files are
written.

Use the MapReduce executor to perform post-processing for files written by the following
destinations:

Hadoop FS destination

MapR FS destination

MapReduce Jobs

The MapReduce executor can run an Avro to Parquet job or any other MapReduce job that you configure. When configuring your own MapReduce job, you enter the key-value pairs needed for the job.

You can use expressions in key-value pairs.
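For example, a custom job might resolve its input path from the event record by using an expression in a value. The property names below are illustrative, not properties that the executor requires:

input.path = ${record:value('/filepath')}
output.path = /user/sdc/output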

Avro to Parquet Job

The MapReduce executor includes an Avro to Parquet job to convert Avro files to Parquet.

The Avro to Parquet job processes an Avro file after a destination closes it. That is, a
destination finishes writing an Avro file and generates an event record. The event
record contains information about the file, including the name and location of the file.

When the MapReduce executor receives the event record, it starts a MapReduce job that
converts the Avro file to Parquet. By default, it uses the file name and location in the
"filepath" field of the event record. The executor uses the following expression by
default:

${record:value('/filepath')}

You can change the expression as needed.

When creating the Parquet file, the MapReduce executor creates the file in a user-defined
directory. It uses the existing name for the file.
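For example, if the closed file is /data/avro/sdc-1462.avro and you define /data/parquet as the output directory, the job writes the resulting Parquet file to /data/parquet using the existing file name. The paths shown are illustrative.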

Event Generation

The MapReduce executor can generate events that
you can use in an event stream. When you enable event generation, the executor generates
events each time it starts a MapReduce job.

MapReduce executor events can be used in any logical way. For example:

With the Email executor to send a custom email
after receiving an event.

Event Records

Event records generated by the MapReduce executor have the following event-related record header attributes. Record header attributes are stored as String values:

Record Header Attribute

Description

sdc.event.type

Event type. Uses the following type:

job-created - Generated when the executor creates and starts a MapReduce job.

sdc.event.version

An integer that indicates the version of the event record type.

sdc.event.creation_timestamp

Epoch timestamp when the stage created the event.

Event records generated by the MapReduce executor have the following fields:

Event Field Name

Description

tracking-url

Tracking URL for the MapReduce job.

job-id

Job ID of the MapReduce job.
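For example, an event record for a newly started job might contain the following information. All values are illustrative and follow standard YARN formats:

Header attributes:
sdc.event.type = job-created
sdc.event.version = 1
sdc.event.creation_timestamp = 1462405014171

Fields:
tracking-url = http://resource-manager-host:8088/proxy/application_1462405011721_0001/
job-id = job_1462405011721_0001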

Kerberos Authentication

You can use Kerberos authentication to connect to Hadoop services such as HDFS or YARN. When you use Kerberos authentication, Data Collector uses the Kerberos principal and keytab to authenticate. Without Kerberos, Data Collector uses the user account that started it to connect by default.

The Kerberos principal and keytab are defined in the Data Collector
configuration file, $SDC_CONF/sdc.properties. To use Kerberos
authentication, configure all Kerberos properties in the Data Collector
configuration file, and then enable Kerberos in the MapReduce executor.
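For example, the Kerberos properties in $SDC_CONF/sdc.properties might be configured as follows. The property names are the standard kerberos.client properties; the principal and keytab values are illustrative:

kerberos.client.enabled=true
kerberos.client.principal=sdc/_HOST@EXAMPLE.COM
kerberos.client.keytab=/etc/sdc/sdc.keytab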

Using a MapReduce User

Data Collector can use either the currently logged in Data Collector user or a user configured in the executor to submit jobs.

You can set a Data Collector configuration property that requires using the currently logged in Data Collector user. When this property is not set, you can specify a user in the executor. For more information about Hadoop impersonation and the Data Collector property, see Hadoop Impersonation Mode.

Note that the executor uses a different user account to connect. By default, Data Collector uses the user account that started it to connect to external systems. When using Kerberos, Data Collector uses the Kerberos principal.

To configure a user in the executor, perform the following tasks:

On the external system, configure the Data Collector user as a proxy user and authorize the Data Collector user to impersonate the MapReduce user, as shown in the example after these steps.

For more information, see the
MapReduce documentation.

In the MapReduce executor, on the MapReduce tab,
configure the MapReduce User property.
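For example, if the sdc user account runs Data Collector, you might add the following proxy user properties to core-site.xml on the cluster to authorize impersonation. The user name and wildcard scope are illustrative; restrict hosts and groups as appropriate for your cluster:

<property>
  <name>hadoop.proxyuser.sdc.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.sdc.groups</name>
  <value>*</value>
</property>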

Configuring a MapReduce Executor

Configure a MapReduce executor to start MapReduce jobs each time the executor receives an event record.

In the Properties panel, on the General tab, configure the
following properties:

General Property

Description

Name

Stage name.

Description

Optional description.

Stage Library

Library version that you want to use.

Produce Events

Generates event records when events occur. Use for event
handling.

Required Fields

Fields that must include data for the record to be passed into the stage.

Tip: You might include fields that the stage uses.

Records that do not include all required fields are processed based on the error handling configured for the pipeline.

Preconditions

Conditions that must evaluate to TRUE to allow a record to enter the stage for
processing. Click Add to create additional preconditions.

Records that do not meet all preconditions are processed based on the error handling configured for the stage.

On the MapReduce tab, configure the following
properties:

MapReduce Property

Description

MapReduce Configuration Directory

Absolute path to the directory containing the Hadoop configuration files.

The stage uses the
following configuration files:

core-site.xml

yarn-site.xml

mapred-site.xml

Note: Properties in the configuration files are
overridden by individual properties defined in this
stage.

MapReduce Configuration

Additional properties to use, as shown in the example after this table.

To add properties, click Add and define the property name and value. Use the property names and values as expected by HDFS or MapR FS.

MapReduce User

The MapReduce user to use to connect to the external
system. When using this property, make sure the external
system is configured appropriately.

When not configured, the pipeline uses the currently
logged in Data Collector user.

Not configurable when Data Collector is configured to use
the currently logged in Data Collector user. For more information, see Hadoop Impersonation Mode.

Kerberos Authentication

Uses Kerberos credentials to connect to the external
system.

When selected, uses the Kerberos principal and
keytab defined in the Data Collector configuration file,
$SDC_CONF/sdc.properties.
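For example, to submit jobs to a specific YARN queue, you might add the standard mapreduce.job.queuename property on the MapReduce tab. The queue name is illustrative:

mapreduce.job.queuename = sdc-jobs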

On the Jobs tab, configure the following
properties:

Job Property

Description

Job Name

Display name for the MapReduce job.

This name displays
in Hadoop web applications and other reporting tools
that list MapReduce jobs.

Job Type

Type of MapReduce job to run:

Custom - Use a custom job creator interface and
key-value pairs to define the job.

Configuration Object - Enter key-value pairs to
define the job.

Convert Avro to Parquet - Use a predefined job to
convert Avro files to Parquet.

Custom JobCreator

MapReduce Job Creator interface to use for custom jobs. For an illustration, see the sketch after this table.

Job Configuration

Key-value pairs of configuration properties to define the
job. You can use expressions in keys and values.
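As an illustration of the Custom job type, the following minimal sketch builds a Hadoop job from the resolved configuration. The Callable<Job> contract and the custom.* property names are assumptions made for this sketch, not the documented interface; the Hadoop calls themselves (Job.getInstance, FileInputFormat, FileOutputFormat) are standard MapReduce API.

import java.util.concurrent.Callable;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch only: assumes the executor constructs the class with the job
// Configuration and invokes call() to obtain the job to submit.
public class SampleJobCreator implements Callable<Job> {
    private final Configuration conf;

    public SampleJobCreator(Configuration conf) {
        this.conf = conf;
    }

    @Override
    public Job call() throws Exception {
        // Resolve illustrative custom.* properties defined as Job
        // Configuration key-value pairs, possibly through expressions.
        Job job = Job.getInstance(conf, conf.get("custom.job.name", "sdc-custom-job"));
        job.setJarByClass(SampleJobCreator.class);
        FileInputFormat.addInputPath(job, new Path(conf.get("custom.input.path")));
        FileOutputFormat.setOutputPath(job, new Path(conf.get("custom.output.path")));
        // A real job would also set the mapper, reducer, and output
        // key/value classes before returning the job for submission.
        return job;
    }
}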