Using Talend Components for Vertica

Talend provides a number of
Vertica-specific components that you can use to manage your ETL jobs.
Understanding the ETL flow and loading guidelines can help you determine
which components to use, and how best to use them.

Understanding ETL Flow and Loading Guidelines

Understanding the basic ETL
flows helps you decide which Vertica component and options to use. This
section addresses the following load types:

Type of Load

Use this COPY Option

Results

Small Bulk Load COPY (<100mb font="">100mb>

AUTO

Writes to WOS.

Spills to ROS when WOS overflows.

Large Bulk Load COPY

DIRECT

Writes to ROS.

Each commit becomes a new ROS container.

Incremental Load COPY

TRICKLE

Writes to WOS.

Errors when WOS overflows.

Follow these guidelines when loading data into Vertica:

Use COPY to load
large amounts of data. Using COPY avoids fragmenting the WOS and doesn't
require the overhead that using INSERT does.

If your row count is small (fewer than 1000 rows), use INSERT.

Load multiple streams on different nodes.

If you are using the INSERT INTO...SELECT...FROM syntax in ETL jobs with large volumes, be sure to use the following syntax.

=> INSERT /+*direct*/ INTO table SELECT…

Commonly Used Vertica Specific Talend Components

The following Talend 6.4.1 components are specifically for use with Vertica:

Component

Description

tVerticaInput

Extracts data from Vertica.

tVerticaBulkExec

Loads from a file.

tVerticaOutputBulk

Writes to a file.

tVerticaOutputBulkExec

Writes to a file and loads the data.

tVerticaOutput

Inserts or updates rows into a Vertica table.

tVerticaRow

Executes the SQL query stated against the Vertica database.

tVerticaInput

The tVerticaInput component
allows you extract data from Vertica. Give special attention to the DATE
and VARCHAR fields, as follows:

DATE
fields—All DATE fields must be in MM-DD-YYYY format. Check your DATE
fields and correct the format wherever necessary, as shown in the
following dialog box.

VARCHAR
fields—Talend examines a sampling of the data when determining the size
of the VARCHAR field. If the table contains large VARCHAR values,
consider increasing the default column size.

Note

You can also use the tMap component to transform or remap the data type.

tVerticaOutput

tVerticaOutput provides
options that you can change, depending on the size and type of load.
Specifically, you should change the Action on Data and Action on Table options, as described in the following sections.

Basic Settings for Trickle or Incremental Loads (Writing to WOS)

Set the Actionon Table and Action on Data options according to your use case. For trickle or incremental loads, set Action on Data to INSERT.

Advanced Settings for Trickle or Incremental Loads (Writing to WOS)

When writing to the WOS, use the default values for Advanced Settings as follows.To enable fields for INSERT, UPDATE, and DELETE, select the Use field options checkbox. The default settings in the previous graphic generate the following SQL statements:

tVerticaRow

The tVerticaRow component
allows you to specify any valid Vertica SQL statements, including COPY
statements. Use tVerticaRow to load data into Vertica flex tables, or
for other scenarios that require Vertica structures that are not
supported with custom Talend components.

Example: High-Performance COPY

In the following example, the
source file is on the Vertica server and uses the high-performance
Vertica COPY (not COPY FROM LOCAL). Whenever the source files are on the
Vertica cluster, load data using high-performance COPY.

Example: Performing Vertica-to-Vertica COPY

You can also use tVerticaRow
to copy from Vertica to Vertica when no transformations are required. To
do so, you need to perform additional steps to define the export
addresses. Connecting to a public
network requires some configuration. For information about using this
statement to copy data across a public network, see Using Public and Private IP Networks in the Vertica documentation.

Using Generic Talend ELT Components with Vertica

The ELT family of components
group together database connectors and processing components for ELT
mode, where the target DBMS becomes the transformation engine. When
possible, SQL statements are combined and processed in a single query on
the database. The following illustration shows how Talend supports ELT
with Vertica. This example uses the generic components for the entire
job.

Other Components for Bulk Loading

The tVerticaOutputBulk and
tVerticaBulkExec components are generally used together in a two-step
process. First, an output file is generated. In the second step, the
output file is used in the INSERT operation that loads data into a
database.You can also choose to do
both steps at once using the tVerticaOutputBulkExec component. However,
using tVerticaOutputBulk and tVerticaBulkExec allows the data to be
transformed before it is loaded in the database.

Using the Talend SQL Builder

When using the SQL builder, be sure to include the schema name in the query:

=> SELECT * FROM store.store_dimension;

Enabling Parallelization in Talend

You can enable parallelized
data flows. To do so, partition an input data flow of a Talend subjob
into parallel processes and execute these processes simultaneously.

Enabling Parallelization in the Talend Enterprise Edition

If you are using the Talend
Enterprise Edition, you can enable or disable the parallelization with a
single click. Talend Studio then automates the implementation across a
given job. For more information about this feature, click the following
link: