Monitoring the Performance of Hive/Impala Replications

Note: This page contains references to CDH 5 components or features that have been removed from CDH 6. These references are only applicable if you
are managing a CDH 5 cluster with Cloudera Manager 6. For more information, see Deprecated Items.

You can monitor the progress of a Hive/Impala replication schedule using performance data that you download as a CSV file from the Cloudera Manager Admin console. This file contains
information about the tables and partitions being replicated, the average throughput, and other details that can help diagnose performance issues during Hive/Impala replications. You can view this
performance data for running Hive/Impala replication jobs and for completed jobs.

To view the performance data for a running Hive/Impala replication schedule:

HDFS Performance Full – downloads a full performance report of the HDFS phase of the running Hive replication job.

Hive Performance – downloads a report of Hive performance.

To view the data, import the file into a spreadsheet program such as Microsoft Excel.

To view the performance data for a completed Hive/Impala replication schedule:

Go to Backup > Replication Schedules.

Locate the schedule and click Actions > Show History.

The Replication History page for the replication schedule displays.

Click to expand the display of the selected schedule.

To view performance of the Hive phase, click Download CSV next to the Hive Replication Report label and select one
of the following options:

Results – download a listing of replicated tables.

Performance – download a performance report for the Hive replication.

Note: The option to download the HDFS Replication Report might not appear if the HDFS phase of the replication skipped all HDFS files because they
have not changed, or if the Replicate HDFS Files option (located on the Advanced tab when creating Hive/Impala replication
schedules) is not selected.

To view the data, import the file into a spreadsheet program such as Microsoft Excel.

The performance data is collected every two minutes. Consequently, no data is available right after a replication job starts, because too few samples have been collected to
estimate throughput and the other reported values.
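To illustrate why a single sample is not enough to estimate a rate, the following minimal Python sketch derives partitions copied per minute from pairs of consecutive samples. It relies on the TotalElapsedTimeSecs and PartitionCount columns described later on this page. It assumes that the CSV header row uses exactly those names and that the rows are in chronological order for a single job. The sample values are invented for illustration and do not come from a real report.

```python
import csv
import io


def partition_rates(csv_text: str) -> list:
    """Partitions copied per minute between each pair of consecutive samples."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rates = []
    # A rate needs two samples, so a report with zero or one row yields nothing.
    for prev, curr in zip(rows, rows[1:]):
        dt = int(curr["TotalElapsedTimeSecs"]) - int(prev["TotalElapsedTimeSecs"])
        dn = int(curr["PartitionCount"]) - int(prev["PartitionCount"])
        if dt > 0:  # skip repeated or out-of-order samples
            rates.append(dn * 60 / dt)
    return rates


# Invented samples, two minutes apart (assumption: real reports have more columns).
RATE_SAMPLE = (
    "TotalElapsedTimeSecs,PartitionCount\n"
    "120,30\n"
    "240,90\n"
    "360,120\n"
)

if __name__ == "__main__":
    print(partition_rates(RATE_SAMPLE))
```

With only the first row present, the function returns an empty list, which mirrors the lack of data early in a job.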

The data returned by the CSV files downloaded from the Cloudera Manager Admin console has the following structure:

Hive Performance Report Columns

| Hive Performance Data Columns | Description |
|---|---|
| Timestamp | Time when the performance data was collected. |
| Host | Name of the host where the YARN or MapReduce job was running. |
| DbName | Name of the database. |
| TableName | Name of the table. |
| TotalElapsedTimeSecs | Number of seconds elapsed from the start of the copy operation. |
| TotalTableCount | Total number of tables to be copied. The value of the column will be -1 for replications where Cloudera Manager cannot determine the number of tables being changed. |
| TotalPartitionCount | Total number of partitions to be copied. If the source cluster is running Cloudera Manager 5.9 or lower, this column contains a value of -1 because older releases do not report this information. |
| DbCount | Current number of databases copied. |
| DbErrorCount | Number of failed database copy operations. |
| TableCount | Total number of tables (for all databases) copied so far. |
| CurrentTableCount | Total number of tables copied for current database. |
| TableErrorCount | Total number of failed table copy operations. |
| PartitionCount | Total number of partitions copied so far (for all tables). |
| CurrPartitionCount | Total number of partitions copied for the current table. |
| PartitionSkippedCount | Number of partitions skipped because they were copied in the previous run of the replication job. |
| IndexCount | Total number of index files copied (for all databases). |
| CurrIndexCount | Total number of index files copied for the current database. |
| IndexSkippedCount | Number of index files skipped because they were not altered. Due to a bug in Hive, this value is always zero. |
| HiveFunctionCount | Number of Hive functions copied. |
| ImpalaObjectCount | Number of Impala objects copied. |
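As an alternative to a spreadsheet, the columns described above can be read with a short script. The following Python sketch summarizes the most recent row of a Hive performance report and treats the documented -1 value as "total unknown". It assumes that the CSV header row uses exactly the column names listed above and that the last row is the most recent sample. The function names and the sample data are hypothetical and shown only for illustration.

```python
import csv
import io
from typing import Optional


def fraction_done(copied: int, total: int) -> Optional[float]:
    """Return copied/total, or None when the total is unknown (-1) or zero."""
    if total <= 0:
        return None
    return copied / total


def latest_progress(csv_text: str) -> Optional[dict]:
    """Summarize the last data row of a Hive performance report."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return None  # empty file or column headers only
    last = rows[-1]
    return {
        "elapsed_secs": int(last["TotalElapsedTimeSecs"]),
        "tables": fraction_done(int(last["TableCount"]),
                                int(last["TotalTableCount"])),
        "partitions": fraction_done(int(last["PartitionCount"]),
                                    int(last["TotalPartitionCount"])),
        "table_errors": int(last["TableErrorCount"]),
    }


# Invented sample containing only the columns this sketch reads.
PROGRESS_SAMPLE = (
    "TotalElapsedTimeSecs,TotalTableCount,TotalPartitionCount,"
    "TableCount,PartitionCount,TableErrorCount\n"
    "120,10,-1,4,37,0\n"
    "240,10,-1,8,80,1\n"
)

if __name__ == "__main__":
    print(latest_progress(PROGRESS_SAMPLE))
```

In this invented sample, TotalPartitionCount is -1, so the partition fraction is reported as None rather than as a misleading negative ratio.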

A sample CSV file, as presented in Excel, is shown here:

Note the following limitations and known issues:

If you click the CSV download too soon after the replication job starts, Cloudera Manager returns either an empty file or a CSV file that contains only column headers, along with a message to try again later,
once performance data has been collected.

If you use a proxy user of the form user@domain, performance data is not available through the links.

If the replication job only replicates small files that can be transferred in less than a few minutes, no performance statistics are collected.

For replication schedules that specify the Dynamic Replication Strategy, statistics regarding the last file transferred by a MapReduce job hide
previous transfers performed by that MapReduce job.

Only the last trace of each MapReduce job is reported in the CSV file.
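A script that processes downloaded reports may want to detect the first limitation above before attempting any analysis. The following Python sketch classifies a downloaded file as empty, headers only, or containing data. This page does not specify how the "try later" message is delivered, so the sketch assumes that any such message would not appear as a row with the same number of fields as the header. The function name and return values are hypothetical.

```python
import csv
import io


def report_state(csv_text: str) -> str:
    """Classify a downloaded report as 'empty', 'headers_only', or 'has_data'."""
    # csv.reader yields an empty list for blank lines; drop those.
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        return "empty"
    header = rows[0]
    # Assumption: real data rows have exactly as many fields as the header.
    data_rows = [row for row in rows[1:] if len(row) == len(header)]
    return "has_data" if data_rows else "headers_only"


if __name__ == "__main__":
    print(report_state("Timestamp,Host,DbName\n"))
```

When the result is "empty" or "headers_only", the reasonable response is to wait for at least one more collection interval and download the report again.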
