Managing Disk Space for Impala Data

Although Impala typically works with many large files in an HDFS storage system with plenty of capacity, there are times when you might need to perform file cleanup to reclaim space, or
advise developers on techniques to minimize space consumption and file duplication.

You manage underlying data files differently depending on whether the corresponding Impala table is defined as an internal
or external table:

Use the DESCRIBE FORMATTED statement to check whether a particular table is internal (managed by Impala) or external, and to see the physical location of the
data files in HDFS. See DESCRIBE Statement for details.
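For example, a quick check from the command line might look like the following sketch; the table name my_table is a placeholder:

    # Check whether a table is internal or external, and where its data lives.
    # "my_table" is a placeholder table name.
    impala-shell -q "DESCRIBE FORMATTED my_table"
    # In the output, the "Table Type:" row shows MANAGED_TABLE or EXTERNAL_TABLE,
    # and the "Location:" row shows the HDFS path of the data files.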

For tables not managed by Impala ("external" tables), use appropriate HDFS-related commands such as hadoop fs, hdfs dfs, or distcp to create, move, copy, or delete files within HDFS directories that are accessible by the impala
user. Issue a REFRESH table_name statement after adding or removing any files from the data directory of an external table. See
REFRESH Statement for details.
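As an illustration, adding a file to an external table's directory and making it visible to Impala might look like this; the paths and table name are hypothetical:

    # Copy a local file into the external table's HDFS data directory.
    # Both paths and the table name are hypothetical.
    hdfs dfs -put /local/exports/new_data.csv /user/shared/log_data/
    # Make Impala aware of the new file.
    impala-shell -q "REFRESH my_external_table"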

Use external tables to reference HDFS data files in their original location. With this technique, you avoid copying the files, and you can map more than one Impala table to the same
set of data files. When you drop the Impala table, the data files are left undisturbed. See External Tables for details.
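A minimal sketch of this technique, with a hypothetical schema and HDFS path:

    # Map a table onto data files that already exist in HDFS; nothing is copied.
    # The columns and LOCATION path are illustrative placeholders.
    impala-shell -q "
      CREATE EXTERNAL TABLE logs (ts TIMESTAMP, msg STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/user/shared/log_data';
    "
    # DROP TABLE logs later removes only the table definition; the files
    # under /user/shared/log_data are left undisturbed.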

Use the LOAD DATA statement to move HDFS files into the data directory for an Impala table from inside Impala, without the need to specify the HDFS path
of the destination directory. This technique works for both internal and external tables. See LOAD DATA Statement for details.
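For example, assuming a staging file already in HDFS and a hypothetical table name:

    # Move an HDFS file into the table's data directory; the file is moved,
    # not copied, so no extra space is consumed. Path and name are placeholders.
    impala-shell -q "LOAD DATA INPATH '/user/etl/staging/new_data.csv' INTO TABLE my_table"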

Clean up temporary files after failed INSERT statements. If an INSERT statement encounters an error, and you see a directory
named .impala_insert_staging or _impala_insert_staging left behind in the data directory for the table, it might contain temporary
data files taking up space in HDFS. You might be able to salvage these data files, for example if they are complete but could not be moved into place due to a permission error. Alternatively, you can delete
those files with commands such as hadoop fs or hdfs dfs to reclaim space before retrying the INSERT.
Issue DESCRIBE FORMATTED table_name to see the HDFS path where you can check for temporary files.
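A cleanup sketch, assuming the table's data directory turns out to be /user/hive/warehouse/my_table (both names are placeholders):

    # Locate the table's data directory (see the "Location:" row in the output).
    impala-shell -q "DESCRIBE FORMATTED my_table"
    # Inspect the directory for a leftover staging subdirectory.
    hdfs dfs -ls /user/hive/warehouse/my_table
    # Once you are sure the temporary files are not worth salvaging,
    # remove them to reclaim the space.
    hdfs dfs -rm -r /user/hive/warehouse/my_table/_impala_insert_staging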

By default, intermediate files used during large sort, join, aggregation, or analytic function operations are stored in the directory /tmp/impala-scratch. These files are removed when the operation finishes. (Multiple concurrent queries can perform operations that use the "spill to disk"
technique without any name conflicts for these temporary files.) You can specify a different location by starting the impalad daemon with the --scratch_dirs="path_to_directory" configuration option or the equivalent configuration option in the Cloudera Manager user interface. You can
specify a single directory or a comma-separated list of directories. The scratch directories must be on the local filesystem, not in HDFS. You might specify different directory paths for different
hosts, depending on the capacity and speed of the available storage devices.

In CDH 5.5 / Impala 2.3 or higher, Impala successfully starts (with a warning written to the log) if it cannot create or
read and write files in one of the scratch directories. If there is less than 1 GB free on the filesystem where that directory resides, Impala still runs, but writes a warning message to its log. If
Impala encounters an error reading or writing files in a scratch directory during a query, Impala logs the error and the query fails.
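For example, pointing the scratch space at two local drives might look like the following; the paths are placeholders, and in a Cloudera Manager deployment you would set the flag through the equivalent configuration field rather than on the command line:

    # Start impalad with scratch directories on two local filesystems.
    # The directory paths are placeholders; they must be local, not in HDFS.
    impalad --scratch_dirs="/data1/impala-scratch,/data2/impala-scratch"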

If you offload data to the Amazon Simple Storage Service (S3) to reduce the volume of local storage, Impala 2.2.0 and higher can query the data directly from S3. See
Using Impala to Query the Amazon S3 Filesystem for details.
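For instance, a table over offloaded data in S3 might be declared like this sketch; the bucket, key prefix, and schema are hypothetical:

    # Point an external table at data files stored in S3 (Impala 2.2.0 or higher).
    # The bucket name, key prefix, and columns are hypothetical.
    impala-shell -q "
      CREATE EXTERNAL TABLE s3_events (ts TIMESTAMP, msg STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION 's3a://my-bucket/offloaded/events/';
    "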