LOAD DATA Statement

The LOAD DATA statement streamlines the ETL process for an internal Impala table by moving a data file or all the data files in a directory from an HDFS
location into the Impala data directory for that table.

When the LOAD DATA statement operates on a partitioned table, it affects one partition at a time. Specify the PARTITION clause, listing all the partition key columns with a constant value for each.
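For instance, with a hypothetical table sales partitioned by year and month (the table, column, and path names here are illustrative assumptions), a single partition can be loaded like this:

```sql
-- Load one batch of data into one partition of a hypothetical
-- partitioned table; every partition key column gets a constant value.
LOAD DATA INPATH '/user/etl/sales_june_2017'
  INTO TABLE sales
  PARTITION (year = 2017, month = 6);
```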

The loaded data files are moved, not copied, into the Impala data directory.

You can specify the HDFS path of a single file to be moved, or the HDFS path of a directory to move all the files inside that directory. You cannot specify any sort of wildcard to take
only some of the files from a directory. When loading a directory full of data files, keep all the data files at the top level, with no nested directories underneath.

Currently, the Impala LOAD DATA statement only imports files from HDFS, not from the local filesystem. It does not support the LOCAL keyword of the Hive LOAD DATA statement. You must specify a path, not an hdfs:// URI.
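For example (the table and path names are assumptions), the first statement below uses the accepted plain-path form, while the commented-out form with a full hdfs:// URI is rejected:

```sql
-- Accepted: a plain HDFS path.
LOAD DATA INPATH '/user/impala/staging/input.txt' INTO TABLE t1;

-- Not accepted: a full hdfs:// URI.
-- LOAD DATA INPATH 'hdfs://nn_host:8020/user/impala/staging/input.txt' INTO TABLE t1;
```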

In the interest of speed, only limited error checking is done. If the loaded files have the wrong file format, columns that differ from those of the destination table, or some other kind of mismatch,
Impala does not raise any error for the LOAD DATA statement. Querying the table afterward could produce a runtime error or unexpected results. Currently, the only
checking the LOAD DATA statement does is to avoid mixing together uncompressed and LZO-compressed text files in the same table.

When you specify an HDFS directory name as the LOAD DATA argument, any hidden files in that directory (files whose names start with a .) are not moved to the Impala data directory.

The loaded data files retain their original names in the new location, unless a name conflicts with an existing data file, in which case the name of the new file is modified slightly
to be unique. (The name-mangling is a slight difference from the Hive LOAD DATA statement, which replaces identically named files.)

By providing an easy way to transport files from known locations in HDFS into the Impala data directory structure, the LOAD DATA statement lets you avoid
memorizing the locations and layout of the HDFS directory tree containing the Impala databases and tables. (For a quick way to check the location of the data files for an Impala table, issue the
statement DESCRIBE FORMATTED table_name.)
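For example, with a hypothetical table t1, the Location: field in the output identifies the HDFS directory holding the table's data files:

```sql
-- The Location: row of the output shows the table's HDFS data directory.
DESCRIBE FORMATTED t1;
```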

The PARTITION clause is especially convenient for ingesting new data for a partitioned table. As you receive new data for a time period, geographic
region, or other division that corresponds to one or more partitioning columns, you can load that data straight into the appropriate Impala data directory, which might be nested several levels down
if the table is partitioned by multiple columns. When the table is partitioned, you must specify constant values for all the partitioning columns.

If you connect to different Impala nodes within an impala-shell session for load-balancing purposes, you can enable the SYNC_DDL query option to make each DDL statement wait before returning, until the new or changed metadata has been received by all the Impala nodes. See SYNC_DDL Query Option for details.

Important: After adding or replacing data in a table used in performance-critical queries, issue a COMPUTE
STATS statement to make sure all statistics are up-to-date. Consider updating statistics for a table after any INSERT, LOAD DATA,
or CREATE TABLE AS SELECT statement in Impala, or after loading data through Hive and doing a REFRESH table_name in Impala. This technique is especially important for tables that are very large, used in join queries, or both.
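A load followed by a statistics refresh might look like this (the table and path names are assumptions):

```sql
LOAD DATA INPATH '/user/etl/new_batch' INTO TABLE sales_data;

-- Recompute table and column statistics so the planner
-- accounts for the newly added data volume.
COMPUTE STATS sales_data;
```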

Examples:

First, we use a trivial Python script to write different numbers of strings (one per line) into files stored in the cloudera HDFS user account. (Substitute
the path for your own HDFS user account when doing hdfs dfs operations like these.)
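The original script is not reproduced here; the following sketch (the file names and string contents are assumptions) produces files of 1000, 100, and 10 one-per-line strings, which could then be copied into HDFS with hdfs dfs -put:

```python
# Write files containing 1000, 100, and 10 strings, one per line.
# The file names and string format are illustrative assumptions.
for n in (1000, 100, 10):
    with open(f"{n}_strings.txt", "w") as f:
        for i in range(1, n + 1):
            f.write(f"string {i}\n")

# Then copy the files into your HDFS user account, for example:
#   hdfs dfs -put *_strings.txt /user/cloudera
```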

Next, we create a table and load an initial set of data into it. Remember, unless you specify a STORED AS clause, Impala tables default to TEXTFILE format with Ctrl-A (hex 01) as the field delimiter. This example uses a single-column table, so the delimiter is not significant. For large-scale ETL jobs, you would
typically use binary format data files such as Parquet or Avro, and load them into Impala tables that use the corresponding file format.
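A minimal version of that sequence might look like the following (the table name, column, and file path are assumptions):

```sql
-- Single-column text table; TEXTFILE with the Ctrl-A delimiter is the default.
CREATE TABLE t1 (s STRING);

-- Move the staged file from HDFS into the table's data directory.
LOAD DATA INPATH '/user/cloudera/thousand_strings.txt' INTO TABLE t1;

SELECT COUNT(*) FROM t1;
```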

As indicated by the message at the end of the previous example, the data file was moved from its original location. The following example illustrates how the data file was moved into the
Impala data directory for the destination table, keeping its original filename:
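Checks along these lines show the move (the warehouse path and file name are assumptions; your paths will differ):

```
# The file is gone from its original staging location...
hdfs dfs -ls /user/cloudera/thousand_strings.txt

# ...and now appears, with its original name, in the table's data directory.
hdfs dfs -ls /user/hive/warehouse/load_data.db/t1/
```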

The following example demonstrates the difference between the INTO TABLE and OVERWRITE INTO TABLE clauses. The table already
contains 1000 rows. After issuing the LOAD DATA statement with the INTO TABLE clause, the table contains 100 more rows, for a total of
1100. After issuing the LOAD DATA statement with the OVERWRITE INTO TABLE clause, the former contents are gone, and the table now
contains only the 10 rows from the just-loaded data file.
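The sequence could look like the following (the file and table names are assumptions; the row counts match the description above):

```sql
-- t1 starts with 1000 rows.
LOAD DATA INPATH '/user/cloudera/hundred_strings.txt' INTO TABLE t1;
SELECT COUNT(*) FROM t1;  -- 1100: the 100 new rows were appended

LOAD DATA INPATH '/user/cloudera/ten_strings.txt' OVERWRITE INTO TABLE t1;
SELECT COUNT(*) FROM t1;  -- 10: the previous contents were replaced
```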

Currently, Impala cannot insert or load data into a table or partition that resides in the Amazon Simple Storage Service (S3). Bring data into S3 using the normal S3
transfer mechanisms, then use Impala to query the S3 data. See Using Impala to Query the Amazon S3 Filesystem for details about using Impala with S3.

Cancellation: Cannot be cancelled.

HDFS permissions:

The user ID that the impalad daemon runs under, typically the impala user, must have read and
write permissions for the files in the source directory, and write permission for the destination directory.
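For example, you might prepare a staging directory along these lines before issuing LOAD DATA (the paths are assumptions, and changing ownership typically requires HDFS superuser privileges):

```
# Give the impala user ownership of the staged files so it can move them.
hdfs dfs -chown -R impala /user/etl/staging
hdfs dfs -chmod -R u+rw /user/etl/staging
```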

HBase considerations:

The LOAD DATA statement cannot be used with HBase tables.

Related information:

The LOAD DATA statement is an alternative to the INSERT statement. Use
LOAD DATA when you have the data files in HDFS but outside of any Impala table.

The LOAD DATA statement is also an alternative to the CREATE EXTERNAL TABLE statement. Use LOAD
DATA when it is appropriate to move the data files under Impala control rather than querying them from their original location. See External Tables for information about working with external tables.
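The two approaches can be contrasted like this (the table names and path are assumptions):

```sql
-- Leave the files where they are and query them in place:
CREATE EXTERNAL TABLE ext_data (s STRING)
  LOCATION '/user/etl/staging';

-- Or move the files under Impala control in an internal table:
LOAD DATA INPATH '/user/etl/staging' INTO TABLE internal_data;
```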

If this documentation includes code, including but not limited to, code examples, Cloudera makes this available to you under the terms of the Apache License, Version 2.0, including any required
notices. A copy of the Apache License Version 2.0 can be found here.