oracle-developer.net

external table enhancements in 11g

This article introduces a small number of new features for external tables in Oracle 11g. Read-only external tables were introduced in Oracle 9i and are now commonplace in most database applications that need to "import" flat-file data. Writeable external tables were introduced in Oracle 10g (as part of the Data Pump family of utilities) and they enable us to extract data from the database while at the same time creating an external table to read the data back again. Both types of external table have been enhanced in 11g, as we will see below.

This article covers the following new features:

support for compressed data (read and write);

support for encrypted data (read and write); and

ability to read from a preprocessed input data stream (available from 11.1.0.7).

background reading

Developers who are unfamiliar with external tables can read about the evolution of this feature in earlier oracle-developer.net articles.

compressed external tables

In Oracle 10g, writeable external tables (i.e. those using the ORACLE_DATAPUMP access driver) unload data to a proprietary file format that includes both file and projection metadata and the data itself (in an uncompressed format). Given the potential for large dump files and the need to transfer this data across networks, Oracle 11g enables us to compress the dataset as it is written to file. This new feature greatly reduces the amount of data being transmitted, as we will see below.

To see the effect of compression with writeable external tables, we will need to compare it with an uncompressed external dataset. First, however, we need an Oracle directory to write the external table data to. We create this as follows.

SQL> CREATE DIRECTORY xt_dir AS 'D:\Oracle\dir';

Directory created.

We will now create a simple external table by unloading all of the data in the ALL_OBJECTS view, as follows.
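A minimal version of this unload might look like the following (the dump file name is assumed; the table name ALL_OBJECTS_XT is the one we query later).

CREATE TABLE all_objects_xt
ORGANIZATION EXTERNAL
(
   TYPE ORACLE_DATAPUMP
   DEFAULT DIRECTORY xt_dir
   LOCATION ('all_objects_xt.dmp')
)
AS
SELECT * FROM all_objects;

To create a compressed version of the same unload, we simply enable compression in the access parameters (the COMPRESSION ENABLED clause is the 11g addition; the table and file names below are illustrative).

CREATE TABLE all_objects_comp_xt
ORGANIZATION EXTERNAL
(
   TYPE ORACLE_DATAPUMP
   DEFAULT DIRECTORY xt_dir
   ACCESS PARAMETERS (COMPRESSION ENABLED)
   LOCATION ('all_objects_comp_xt.dmp')
)
AS
SELECT * FROM all_objects;

Comparing the two dump files in a directory listing shows the effect of compression.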

The compressed file is roughly 12% of the size of the uncompressed version; a considerable reduction and one which will improve transmission times for data passed around networked systems (whilst also saving on storage).

performance impact: writing compressed data

By compressing data, we expect our CPU consumption to increase (as data is compressed) but the amount of file I/O to decrease (as fewer bytes of data are being written to file). We might also expect the I/O savings to outweigh the CPU costs. To test these assumptions, we will compare the relative performance of compressed and uncompressed external tables below. First, we will use a variation of Tom Kyte's RUNSTATS utility to compare the overall resource consumption of each approach. Second, we will compare the CPU time taken by each. In both cases, we will generate a file of 10 times the volume of ALL_OBJECTS, beginning with the uncompressed version as follows (note that the ALL_OBJECTS view data is cached in advance).
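A sketch of the RUNSTATS test is as follows (this assumes a runstats package with the familiar rs_start/rs_middle/rs_stop interface; the ten-fold volume comes from a Cartesian join against a ten-row generator, and all names are illustrative).

-- cache the ALL_OBJECTS data in advance
SELECT COUNT(*) FROM all_objects;

EXEC runstats_pkg.rs_start;

CREATE TABLE all_objects_10_xt
ORGANIZATION EXTERNAL
(
   TYPE ORACLE_DATAPUMP
   DEFAULT DIRECTORY xt_dir
   LOCATION ('all_objects_10_xt.dmp')
)
AS
SELECT o.*
FROM   all_objects o
,      (SELECT ROWNUM AS r FROM dual CONNECT BY ROWNUM <= 10) m;

EXEC runstats_pkg.rs_middle;

CREATE TABLE all_objects_comp_10_xt
ORGANIZATION EXTERNAL
(
   TYPE ORACLE_DATAPUMP
   DEFAULT DIRECTORY xt_dir
   ACCESS PARAMETERS (COMPRESSION ENABLED)
   LOCATION ('all_objects_comp_10_xt.dmp')
)
AS
SELECT o.*
FROM   all_objects o
,      (SELECT ROWNUM AS r FROM dual CONNECT BY ROWNUM <= 10) m;

EXEC runstats_pkg.rs_stop;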

As expected, the compressed external table uses slightly more CPU but writes significantly fewer bytes of data. Despite this, the timings (and latches) are comparable (in fact, the uncompressed external table was slightly faster). This is a surprising result given the reduced I/O that the compressed external table has to perform. We can confirm our results with an external file listing, as follows.
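On Windows, the listing can be taken directly from SQL*Plus (the directory is the one we created earlier).

SQL> host dir D:\Oracle\dir\*.dmp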

The file sizes match the statistics we saw in Oracle. This time we achieved a much better compression ratio of 96%. This remarkable ratio is related to the nature of the source data: we created 10 copies of each row in ALL_OBJECTS, and an ordered rowsource of this kind (with long runs of duplicate rows) naturally compresses well. If, by comparison, we order the same source resultset randomly (using DBMS_RANDOM), we achieve less compression (approximately 75% in tests).

To conclude, although our compressed external table didn't extract any quicker than the uncompressed version, it generates significantly less data to store or transmit to other systems.

reading from compressed external files

When we create a writeable external table, we generate two objects. First, we create a physical file on our filesystem. Second, we are left with an external table that can now be used to read the external file back again (should we wish to do so). Interestingly, we can use an uncompressed external table to read back a compressed file (assuming the column format is the same). Remember that in our original example we created a simple uncompressed external file of ALL_OBJECTS. We can query the external file's record count as follows.

SQL> SELECT COUNT(*) FROM all_objects_xt;

COUNT(*)
----------
68789
1 row selected.

We will change the location of this external table to use the compressed file (i.e. the one that we created with the compressed version of the external table), as follows.
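Assuming the file name we used for the compressed unload earlier:

SQL> ALTER TABLE all_objects_xt LOCATION ('all_objects_comp_xt.dmp');

Table altered.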

Despite the fact that our external table was used to create an uncompressed file, we can read the compressed file via the Data Pump driver (the Data Pump files contain sufficient metadata to enable this seamless switch between compressed and uncompressed files). Similarly, we can use the compressed external table to read the uncompressed file, as follows.
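Again assuming our earlier file names:

SQL> ALTER TABLE all_objects_comp_xt LOCATION ('all_objects_xt.dmp');

Table altered.

Once again, the file's metadata allows the driver to read it without complaint.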

encrypted external tables

In addition to compression, the ORACLE_DATAPUMP driver in 11g can encrypt data as it is written to file, using Transparent Data Encryption (TDE). To enable Transparent Data Encryption, an encryption key and wallet must first be created. The wallet is password-protected and needs to be opened every time the instance is started. See the Advanced Security Administrator's Guide referenced at the end of this article for details.
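A minimal wallet setup might look like the following (run as SYSDBA, and assuming a wallet location has already been configured in sqlnet.ora; the password is obviously illustrative).

SQL> ALTER SYSTEM SET ENCRYPTION KEY IDENTIFIED BY "wallet_pwd";

System altered.

After each instance restart, the wallet is reopened with ALTER SYSTEM SET ENCRYPTION WALLET OPEN IDENTIFIED BY "wallet_pwd". With the wallet open, we can create an encrypted external table by enabling encryption in the access parameters (table and file names are again illustrative).

CREATE TABLE all_objects_enc_xt
ORGANIZATION EXTERNAL
(
   TYPE ORACLE_DATAPUMP
   DEFAULT DIRECTORY xt_dir
   ACCESS PARAMETERS (ENCRYPTION ENABLED)
   LOCATION ('all_objects_enc_xt.dmp')
)
AS
SELECT * FROM all_objects;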

Similar to compression, the encryption property is enabled in the external table's access parameters, as shown above. A file listing confirms that our file was created.

The encrypted file is the same size as the original unencrypted version (to the exact byte). We will look inside the file shortly, but first we'll repeat the performance comparison we ran for compressed external tables.

performance impact: writing encrypted data

As with compressed external tables, we might expect an encrypted external table to be more CPU-intensive than its unencrypted alternative. This time, however, we should expect the data volumes to be the same, meaning that the performance impact of encryption should only be CPU-related (although we discovered that compressed external tables made little time-saving from reduced I/O). We will validate our assumption with a RUNSTATS test similar to the one above, this time comparing large encrypted and unencrypted files of ten copies of ALL_OBJECTS.

The profile for encrypted external tables is broadly the same as that for compressed tables; that is, increased CPU cost and almost no I/O time savings. A directory listing confirms the file sizes.

reading encrypted files

As we saw earlier, when we create an external file, we can use the table to read the data back into Oracle. Using our original example, we will read a small number of records from the encrypted file, as follows.
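For example, using the encrypted table we created above:

SQL> SELECT object_name FROM all_objects_enc_xt WHERE ROWNUM <= 5;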

Naturally, the decryption is seamless to us. We can alternatively use an unencrypted external table to read the encrypted data (if the column structure is the same, of course). In the following example, we will set the location of our vanilla external table to read the encrypted file.
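Using the file name from our encrypted example above:

SQL> ALTER TABLE all_objects_xt LOCATION ('all_objects_enc_xt.dmp');

Table altered.

SQL> SELECT COUNT(*) FROM all_objects_xt;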

As we can see, the metadata in the external dump file enables Oracle to decrypt the data even if it is through an external table that doesn't have encryption enabled.

confirming data encryption

External table dump files are in a proprietary format, but it is still possible to identify that encryption is happening. To demonstrate this, we will create a pair of single-row external files (one encrypted, one not) with a recognisable string in the data, as follows.
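A sketch of this setup (the table names, file names and marker string are all illustrative):

CREATE TABLE clear_string_xt
ORGANIZATION EXTERNAL
(
   TYPE ORACLE_DATAPUMP
   DEFAULT DIRECTORY xt_dir
   LOCATION ('clear_string_xt.dmp')
)
AS
SELECT CAST('AAAA_FIND_ME_IN_THE_FILE_AAAA' AS VARCHAR2(30)) AS marker
FROM   dual;

CREATE TABLE encrypted_string_xt
ORGANIZATION EXTERNAL
(
   TYPE ORACLE_DATAPUMP
   DEFAULT DIRECTORY xt_dir
   ACCESS PARAMETERS (ENCRYPTION ENABLED)
   LOCATION ('encrypted_string_xt.dmp')
)
AS
SELECT CAST('AAAA_FIND_ME_IN_THE_FILE_AAAA' AS VARCHAR2(30)) AS marker
FROM   dual;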

To prove that encryption is happening, we will look at the contents of each of these files. Remember that they are in a proprietary format that includes some binary data, so we will need to read them in RAW format. To enable us to do this, we will create a "reader" external table that reads the Data Pump files as fixed RAW records.
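Such a reader table might be defined as follows (the name is illustrative; the 64-byte record and read sizes are explained in the note below, and the LOCATION can be switched between our two dump files).

CREATE TABLE dump_reader_xt
(
   data RAW(64)
)
ORGANIZATION EXTERNAL
(
   TYPE ORACLE_LOADER
   DEFAULT DIRECTORY xt_dir
   ACCESS PARAMETERS
   (
      RECORDS FIXED 64
      READSIZE 64
      FIELDS (data POSITION(1:64) RAW(64))
   )
   LOCATION ('clear_string_xt.dmp')
)
REJECT LIMIT UNLIMITED;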

Note that we are using a fixed readsize of 64 bytes. This divides into our external filesize (12,288 bytes) with no remainder (which prevents an external table error caused by an incomplete last record). For convenience (and display purposes), our recordsize is fixed at the same size. To format the RAW data as VARCHAR2 and to extract only the lines of interest from our external files, we will create a view over the reader external table, as follows.
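A simple version of such a view (the name is illustrative; UTL_RAW.CAST_TO_VARCHAR2 does the formatting):

CREATE OR REPLACE VIEW dump_reader_v
AS
SELECT UTL_RAW.CAST_TO_VARCHAR2(data) AS text
FROM   dump_reader_xt;

With the reader table pointed at the clear file, the marker string (together with a good deal of readable metadata) is easy to spot when browsing this view. Switching the LOCATION to the encrypted file and browsing again tells a different story.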

This time we can see that both our metadata and data are completely illegible, so we can safely assume that the encryption is happening!

combining compression and encryption

Our examples so far have concentrated on either compression or encryption, but not both. It is, however, possible to combine the two features simply by enabling both in the access parameters clause, as follows.
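Enabling both is just a matter of listing both properties (the file name below matches the LOCATION switch that follows; the table name is illustrative).

CREATE TABLE all_objects_both_xt
ORGANIZATION EXTERNAL
(
   TYPE ORACLE_DATAPUMP
   DEFAULT DIRECTORY xt_dir
   ACCESS PARAMETERS
   (
      COMPRESSION ENABLED
      ENCRYPTION ENABLED
   )
   LOCATION ('all_objects_both_xt.dmp')
)
AS
SELECT * FROM all_objects;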

The resulting file is exactly the same size as our original compressed (but unencrypted) file of ALL_OBJECTS data. As we might expect by now, we can use any of our external tables with matching columns to read this data back, as follows.

SQL> ALTER TABLE all_objects_xt LOCATION ('all_objects_both_xt.dmp');

Table altered.

SQL> SELECT COUNT(*) FROM all_objects_xt;

COUNT(*)
----------
68795
1 row selected.

external table preprocessor

External tables are generally superior to SQL*Loader for the majority of standard requirements, but there are certain scenarios where SQL*Loader is still invaluable:

loading from complex file formats;

loading complex data types;

loading from compressed files.

It is the third bullet point that we are interested in. Loading data directly from compressed files is quite a common technique in batch systems (particularly on UNIX). In such systems, compressed files are uncompressed to named pipes and SQL*Loader loads directly from the pipes, rather than from files. This technique can save considerable time and means that an uncompressed copy of the file never needs to be written to disk.

It has always been technically possible to achieve something similar with external tables, although in a much less robust way (which is probably why it has never been published as a technique). However, in both the 11.1.0.7 patchset and the 11.2 release, Oracle has added an external table preprocessor to the ORACLE_LOADER driver that enables us to read directly from compressed files (the preprocessor is invoked as we query the external table). In fact, the preprocessor is intended to support a range of pre-loading operations that convert an input rowsource into the expected format for the external table, but it is compressed files that are the obvious target for this feature, so we will concentrate on this below.

reading a compressed flat-file

To demonstrate the external table preprocessor, we will generate a small flat-file of 1,000 records, compress it and read it using an external table. We will generate the flat-file by using the Data Dump utility from oracle-developer.net as follows.
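The Data Dump call itself is not listed here; as a stand-in, a simple spool produces an equivalent delimited file, which we then compress with gzip (the column list, file name and record limit are illustrative).

SET PAGESIZE 0
SET FEEDBACK OFF

SPOOL D:\Oracle\dir\objects.txt

SELECT object_id || ',' || object_name || ',' || object_type
FROM   all_objects
WHERE  ROWNUM <= 1000;

SPOOL OFF

host gzip D:\Oracle\dir\objects.txt

The gzip call replaces objects.txt with a compressed objects.txt.gz, which is the file we will reference in our external table below.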

In addition to the standard read-write Oracle directory that we need for our external table, we also need an additional executable directory object for the preprocessor. This directory defines the location of the executables used by the preprocessor (we will be using gzip below). As far as Oracle is concerned, an executable directory is one that has EXECUTE privileges granted on it (this is an 11g feature specifically to support the preprocessor).

We will use a SYSDBA account to create an executable directory and grant the EXECUTE privilege to user SCOTT below.

SQL> conn / as sysdba

Connected.

SQL> CREATE DIRECTORY bin_dir AS 'C:\Progra~1\GNUWin32\bin';

Directory created.

SQL> GRANT EXECUTE ON DIRECTORY bin_dir TO scott;

Grant succeeded.

The ability to invoke executables from an external table (via an executable Oracle directory) has security implications and should be tightly controlled. It is recommended that a directory containing only the required preprocessor executables is created for this purpose.

Moving on, we will now create an external table with a preprocessor to uncompress our small compressed file, as follows.
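A version of this table, matching the three-column file we generated above, might look like the following (note the PREPROCESSOR clause; the handling of executable options in the file specification can vary between releases, in which case the script wrapper described later achieves the same thing).

CREATE TABLE compressed_xt
(
   object_id    NUMBER
,  object_name  VARCHAR2(128)
,  object_type  VARCHAR2(30)
)
ORGANIZATION EXTERNAL
(
   TYPE ORACLE_LOADER
   DEFAULT DIRECTORY xt_dir
   ACCESS PARAMETERS
   (
      RECORDS DELIMITED BY NEWLINE
      PREPROCESSOR bin_dir:'gzip -dc'
      FIELDS TERMINATED BY ','
   )
   LOCATION ('objects.txt.gz')
)
REJECT LIMIT UNLIMITED;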

Note the preprocessor syntax. It comprises the executable directory (the directory specification), the executable itself and options as required (the file specification). In our case we have used a call to gzip with the -dc options, but the preprocessor can invoke any executable that generates a rowsource to stdout (including shell/batch scripts which we will describe below).

We will now query a small number of rows from our external table, reading directly from the compressed file, as follows.
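For example, using the table we have just created:

SQL> SELECT * FROM compressed_xt WHERE ROWNUM <= 5;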

Some care should be taken with this feature, however. These examples were run on a Windows XP laptop with a dual-core Intel processor, 2GB of RAM and an 11.1.0.7 database. On this system, the preprocessor works well with small zipped files (remember that the above example used just 1,000 records). However, for larger compressed files, the feature is not so stable (at least on this Windows system). With a larger dataset of approximately 700,000 records (only 2MB compressed), the query hangs on the external table read wait event. The query returns its data only when the separate gzip process is killed. A range of other symptoms suggest that this feature is not quite suited to this laptop, such as alert log entries, ORA-7445 errors, session coredumps and having to bounce the instance to enable any more connections. Of course, readers might have different experiences with different setups.

using preprocessor scripts

In addition to executables such as gzip, the preprocessor can be used with any shell or batch script that generates a rowsource to stdout (of course the rowsource must be in the record format defined for the external table). The most obvious use for this feature is to enable us to query a list of files in a directory from within Oracle and this is the subject of a separate oracle-developer.net article.

To provide a small demonstration for the purposes of this current article, however, we will wrap our gzip example in a batch script and query the compressed file again. To do this, we will create a batch script (this is a Windows system) named unzip_file.bat containing the following two lines.

@echo off
gzip -dc %1

The first line prevents the commands in the script from being echoed to stdout (where they would corrupt the input data stream), and the script itself must be located in the executable Oracle directory (in our case BIN_DIR). Interestingly, the directory and filename given in the external table's DEFAULT DIRECTORY and LOCATION clauses respectively are passed to the batch script in the %1 argument, as we can see above ($1 in a shell script).

We will now re-create our COMPRESSED_XT external table to use the preprocessor batch script instead of the gzip executable, as follows.
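Only the PREPROCESSOR clause changes (the script must live in BIN_DIR, our executable directory).

DROP TABLE compressed_xt;

CREATE TABLE compressed_xt
(
   object_id    NUMBER
,  object_name  VARCHAR2(128)
,  object_type  VARCHAR2(30)
)
ORGANIZATION EXTERNAL
(
   TYPE ORACLE_LOADER
   DEFAULT DIRECTORY xt_dir
   ACCESS PARAMETERS
   (
      RECORDS DELIMITED BY NEWLINE
      PREPROCESSOR bin_dir:'unzip_file.bat'
      FIELDS TERMINATED BY ','
   )
   LOCATION ('objects.txt.gz')
)
REJECT LIMIT UNLIMITED;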

As described above, Oracle supplies the path and filename of the compressed file in the %1 argument to the batch script. This enables the gzip command to unzip the file to stdout and therefore be used as an input data stream for the ORACLE_LOADER driver (and hence the external table).