The --check-column argument specifies which column Sqoop checks during the import operation. The column cannot be a *CHAR type, such as VARCHAR2 or CHAR.

The --incremental argument has two modes: append and lastmodified. The lastmodified mode is usually used with a last-modified column defined as a timestamp. The --last-value argument specifies a value; new rows whose check-column value is greater than this value are imported.
Or use another way, suggested by one internet article, pulling the last value from Hive on the fly:
--last-value $($HIVE_HOME/bin/hive -e "select max(idcolumn) from tablename")
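Putting these arguments together, a minimal incremental append import could look like the sketch below. The JDBC URL, credentials, target directory, and the check-column name are placeholders for illustration, not taken from my cluster:

```shell
# Sketch of an incremental append import.
# Replace the connection details, target dir, and column with your own.
sqoop import \
  --connect jdbc:oracle:thin:@dbhost:1521:orcl \
  --username scott -P \
  --table ALL_OBJECTS_INC_TEST \
  --target-dir /user/demo/all_objects_inc_test \
  --incremental append \
  --check-column OBJECT_ID \
  --last-value 0
```

On later runs, --last-value is bumped to the highest OBJECT_ID already imported, so only newer rows are pulled.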

Note 1:
If you see the following error, make sure the table name specified in the --table argument is in UPPERCASE. If it is lowercase, you will see the error below.

ERROR tool.ImportTool: Imported Failed: There is no column found in the target table all_objects_inc_test. Please ensure that your table name is correct.

Note 2:
If you use the --hive-import argument, you will see the following error. It is not supported yet, so you have to remove it and build a Hive external table after the import to HDFS completes.

ERROR Append mode for hive imports is not yet supported. Please remove the parameter --append-mode.
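Building the external table afterwards is straightforward. A sketch, assuming the import landed in /user/demo/all_objects_inc_test; the column list and delimiter here are made up for illustration:

```shell
# Sketch: create a Hive external table over the imported HDFS files.
# The HDFS path, columns, and delimiter are illustrative placeholders.
hive -e "
CREATE EXTERNAL TABLE all_objects_inc_test (
  object_id   INT,
  object_name STRING,
  created     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/demo/all_objects_inc_test';
"
```

Since the table is EXTERNAL, dropping it later leaves the imported files in HDFS untouched.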

Interesting. Impala shows 4 rows instead of 11 rows. The reason is that Impala does not refresh its metadata automatically, so you need to run INVALIDATE METADATA to get the latest row count. Hive doesn't seem to have this issue.
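In impala-shell this is a one-liner; the table name below is just an example, and vmhost2 is one of my Data Nodes:

```shell
# Refresh Impala's view of the table metadata, then re-count the rows.
impala-shell -i vmhost2 \
  -q "invalidate metadata; select count(*) from all_objects_inc_test;"
```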

There are two ways to use Impala to query tables in Hive. One way is to use command line, Impala Shell. Another one is to use Hue GUI. I am going to show both methods one by one.

Use Impala Shell

Impala Shell is a nice tool, similar to SQL*Plus, for setting up databases and tables and issuing queries. Ad hoc queries run much faster than Hive queries, especially queries requiring fast response time. Here are the steps in using Impala Shell.

1. Connect to a Data Node running the impalad daemon.

Note: The prompt shows Not connected. I need to connect the Impala shell to a Data Node with the impalad daemon. My cluster uses vmhost2 and vmhost3 as Data Nodes, so I pick one of them, vmhost2, for this test.
[Not connected] > connect vmhost2;
Connected to vmhost2:21000
Server version: impalad version 2.2.0-cdh5.4.3 RELEASE (build 517bb0f71cd604a00369254ac6d88394df83e0f6)
[vmhost2:21000] >

2. Run some queries. Impala sees the same list of databases and tables as Hive does.
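A sketch of the kind of session this step involves; the table name follows the import example above, so adjust it for your own cluster:

```shell
# Start the shell against one impalad Data Node...
impala-shell -i vmhost2
# ...then at the [vmhost2:21000] > prompt, run e.g.:
#   show databases;
#   show tables;
#   select count(*) from all_objects_inc_test;
```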

Impala

Wow, 41 seconds to get a row count of 22,782 by using Hive. That seems excessive on a cluster with no other jobs running. OK, let's look at Impala's result.

Note: Impala does not poll frequently for metadata changes. So in case you don't see the table name after the import, just do the following:

invalidate metadata;
show tables;

The above result shows Hive took 41 seconds to get the row count of a table with 22,782 rows, while Impala was significantly faster at 0.12 seconds. I know my cluster is small and not powerful, and Hive is using Map/Reduce. But taking 41 seconds to get a total row count of about 22,000 rows seems too much. On the other hand, Impala's timing looks more reasonable to me. Obviously Map/Reduce is not my option if I want to run queries that expect fast response time. But for a long-running job against a huge dataset, the Map/Reduce option might still be on the table, considering its job fault tolerance.
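If you want to reproduce this comparison yourself, the same count can be timed from the OS shell against both engines; the table name follows the examples above:

```shell
# Time the same row count in Hive (Map/Reduce) and in Impala.
# Adjust the table name and the impalad host for your cluster.
time hive -e "select count(*) from all_objects_inc_test;"
time impala-shell -i vmhost2 -q "select count(*) from all_objects_inc_test;"
```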