CloverETL 3.4.3

Get Big Data Under Control

Crunching huge volumes of data and solving big data problems is the next step in the CloverETL evolution. A strong Cluster toolset for parallel data processing with precise control and monitoring, together with integration of the popular Hadoop storage and MapReduce frameworks, gives users a new perspective on robust, rock-solid data integration – combined with a modern, ad-hoc analytics approach.

As Big Data initiatives mature into real solutions for practical problems, Hadoop fills the role of brute force – a high-performing tool for crunching big data and reducing it into valuable datasets. These datasets then enter the processing and reporting pipeline managed by CloverETL, with its monitoring and control capabilities.

You can execute and monitor Hadoop jobs and access and manipulate data stored in HDFS and Hive. CloverETL helps you build data flows that feed data into Hadoop and handles tasks that are difficult or cumbersome to perform in MapReduce jobs, such as data format conversions, joins, and connectivity to a variety of data sources.

We have integrated Hadoop across the whole platform. The new Hadoop connection lets you easily manage and share connections to HDFS resources.
This allows you to use any CloverETL component to stream data directly from or into the files stored on HDFS. Support for HDFS in CloverETL
Jobflow components lets you take data prepared by CloverETL and push it to your Hadoop-based processing pipeline. Dedicated
HadoopReader/Writer components let you store and retrieve key-value pairs in the native format used by MapReduce jobs.
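For context, MapReduce's native storage (e.g. Hadoop's SequenceFile format) holds data as key-value pairs in a binary layout. The sketch below is not CloverETL or Hadoop code – just a minimal, hypothetical illustration of what a length-prefixed key-value record format looks like:

```python
import io
import struct

def write_pairs(stream, pairs):
    """Write (key, value) string pairs as length-prefixed UTF-8 records."""
    for key, value in pairs:
        k, v = key.encode("utf-8"), value.encode("utf-8")
        stream.write(struct.pack(">II", len(k), len(v)))  # big-endian lengths
        stream.write(k)
        stream.write(v)

def read_pairs(stream):
    """Read key-value records back until the stream is exhausted."""
    pairs = []
    while True:
        header = stream.read(8)
        if len(header) < 8:
            break
        klen, vlen = struct.unpack(">II", header)
        pairs.append((stream.read(klen).decode("utf-8"),
                      stream.read(vlen).decode("utf-8")))
    return pairs

buf = io.BytesIO()
write_pairs(buf, [("user:1", "Alice"), ("user:2", "Bob")])
buf.seek(0)
print(read_pairs(buf))  # -> [('user:1', 'Alice'), ('user:2', 'Bob')]
```

The real SequenceFile format adds sync markers, compression, and pluggable serialization, but the key-value record structure is the essential idea.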

With the same Hadoop connection defined for HDFS, you can run and monitor MapReduce jobs as part of a CloverETL Server jobflow or a
transformation graph. Full support for older (org.apache.hadoop.mapred) and newer (org.apache.hadoop.mapreduce) job APIs lets you define the
job in Java classes and pass it to Hadoop.
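The job logic itself lives in the map and reduce classes you supply. Conceptually – shown here as a plain Python sketch, not Hadoop or CloverETL code – a MapReduce job emits key-value pairs in the map phase, the framework groups them by key, and the reduce phase aggregates each group:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group values by key (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big deal"])))
print(counts)  # -> {'big': 2, 'data': 1, 'deal': 1}
```

In Hadoop, the map and reduce functions are Java classes wired together via either of the two job APIs mentioned above; CloverETL only needs the compiled job to submit and monitor it.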

CloverETL support for Hadoop is not bound to a particular Hadoop version; all libraries and interfaces are customizable. The CloverETL Hadoop
connection is a centralized place where you do a single configuration for use in all your interactions with Hadoop.

Load Balancing

We're introducing two load balancing ETL components that help you optimize data flows inside a transformation – LoadBalancingPartition and its Cluster counterpart, ClusterLoadBalancingPartition.

In a non-clustered environment, you can use the LoadBalancingPartition component to multiplex data records to different output ports according to the workload of downstream components. This lets you design a static degree of parallelism into the transformation using multiple processing routes.

In the Server Cluster, the load balancer will dispatch records to nodes based on their current load, again optimizing throughput by evenly distributing the workload among available Cluster nodes – without needing any further configuration.
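Conceptually, a load-balancing partitioner routes each incoming record to whichever output currently has the smallest backlog. The following is only an illustrative sketch of that dispatch rule, not CloverETL's actual implementation:

```python
from collections import deque

class LoadBalancingPartitioner:
    """Route each record to the output queue with the smallest backlog."""

    def __init__(self, n_outputs):
        self.outputs = [deque() for _ in range(n_outputs)]

    def route(self, record):
        # Pick the least-loaded output; ties go to the lowest port number.
        port, queue = min(enumerate(self.outputs), key=lambda e: len(e[1]))
        queue.append(record)
        return port

p = LoadBalancingPartitioner(3)
print([p.route(r) for r in range(6)])  # -> [0, 1, 2, 0, 1, 2]
```

With equally loaded outputs this degenerates to round-robin, as above; in a running transformation the downstream components drain their queues at different speeds, so the backlog reflects actual workload and slower routes automatically receive fewer records.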


Database Connections Proxying from the Designer Through the Server

All database operations during development are carried out on the Server. The Designer itself no longer needs direct connectivity to the database machine.

When working in Server Integration mode, any JDBC-based database connection between the Designer and a database server is transparently proxied by the CloverETL Server. During job development, this gives you access to database servers that may be hidden behind firewalls (which disallow JDBC traffic but keep HTTP open).

At the same time, you can always be sure that connections are correctly set up and the database server accepts connections from the Server machine, not only your Designer.


Data Profiler Reporting and Automation

Use profiling metrics as data quality inputs in data integration jobs, share profiling results with the Reporting Console, or build your own data quality reports on top of the Reporting Console API.

The CloverETL Data Profiler is now fully integrated with CloverETL's data integration capabilities. The ProfilerProbe component transparently profiles any data passing through an ETL job.
Results can be used by the transformation itself or automatically stored in the Data Profiler repository. A separate Jobflow component, ExecuteProfilerJob, allows automated execution of data profiling jobs
under CloverETL Server orchestration. Integration with the Server environment lets you start Profiler jobs using the existing automation infrastructure (triggers, scheduler, Server API) for
automated collection of results.

A Reporting Console module provides a web-based interface for viewing collected data profiling results. Built-in RSS notifications inform users responsible for data
quality about new profiling results. To control access to collected metrics and sensitive data, the Reporting Console integrates with CloverETL Server security settings.

Custom Reporting

To let you build custom reports on top of collected profiling data, or integrate the Reporting Console into larger reporting solutions, the Console provides a RESTful API for accessing
and administering the results repository. Enterprise users can not only access profile data gathered from profiler jobs over time, but also use that information in
reporting and analytics. The API can also be used to manage the Profiler repository.

Everything Gets Persisted

You can now store profiling results in the repository no matter which method you use to run the Profiler – be it ProfilerProbe inside a data transformation, running ExecuteProfilerJob in
a jobflow, or a scheduled or manual run. In all cases, just turn on the option for saving results.


Cluster Control in Your Hands

Take precise control over parallel data flows – spread the load for resource-intensive components, route data based on proximity of resources, and dynamically change the level of parallelism for execution.

Explicit Allocation Setting

In the CloverETL Cluster environment, a single data transformation job can execute in a highly distributed fashion. Take, for example, a 3-node cluster
(consisting of node1, node2, and node3) and a graph consisting of Reader – Filter – Writer components. Running such a graph in the cluster can result in
the Reader running on node1, the Filter on node2, and the Writer on node3, while data is streamed over the network. The "allocation" setting of an ETL component defines on which
nodes of the cluster it can potentially start. Previous versions of CloverETL Cluster derived the allocation automatically from the settings of partitioned sandboxes.
Since CloverETL release 3.4.0, the allocation can be further controlled with explicit allocation settings on any ETL component. The allocation can be defined by:

Listing one or more Cluster nodes on which the component will execute.

Specifying the level of parallelism – the number of instances in which the component will execute.

Referencing a partitioned sandbox. The component then inherits the allocation and level of parallelism from the partitioned sandbox's locations.

Routing data to specific resources

Using explicit allocation, data can be routed to particular Cluster nodes that provide specific processing resources. An example is a cluster node providing specific software
(e.g. third-party data quality software, antivirus software, digital signature certificates) or a node with special network access (e.g. open access to the internet, FTP/mainframe access,
connectivity to a database/web service).

Controlling the level of parallelism

Setting the level of parallelism in the allocation to a value greater than 1, or listing one Cluster node multiple times, causes the component to start in multiple instances.
This can be used to increase the throughput of long-latency operations such as web service calls, REST calls, database lookups, and third-party native API calls, as well as to increase
CPU utilization by parallelizing CPU-intensive tasks (complex regex parsing, string matching, etc.).
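The throughput gain for long-latency operations comes from overlapping the wait time of many in-flight calls. A generic illustration – plain Python threads standing in for multiple component instances, with a simulated slow service call (the names are ours, not CloverETL APIs):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_lookup(key):
    """Simulated long-latency call (e.g. a web service or database lookup)."""
    time.sleep(0.05)
    return key.upper()

keys = ["a", "b", "c", "d", "e", "f", "g", "h"]

# Serial: total latency is roughly 8 * 0.05 s.
start = time.perf_counter()
serial = [slow_lookup(k) for k in keys]
serial_time = time.perf_counter() - start

# Four parallel "instances": the waits overlap, so total latency drops
# to roughly two waves of 0.05 s each.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(slow_lookup, keys))
parallel_time = time.perf_counter() - start

assert serial == parallel  # same results either way, just higher throughput
print(f"serial {serial_time:.2f}s, parallel {parallel_time:.2f}s")
```

Because the operation is wait-bound rather than CPU-bound, throughput scales roughly with the number of instances until the remote service becomes the bottleneck.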

Jobflow and Parallelism

Jobflow graphs are not affected by allocation settings. As a jobflow's primary role is to orchestrate executions of other jobs, jobflows always execute in a single instance on a single Cluster node.
A jobflow can still launch multiple instances of a single ETL graph. A new option on the ExecuteGraph component lets you cap the number of instances that can
run in parallel; any execution requests exceeding the cap are queued.
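Such a cap can be thought of as a counting semaphore: at most N executions hold a permit at once, and the rest wait in a queue. A generic sketch with hypothetical names (not the ExecuteGraph implementation):

```python
import threading
import time

class CappedLauncher:
    """Run submitted jobs with at most `cap` executing concurrently."""

    def __init__(self, cap):
        self._permits = threading.BoundedSemaphore(cap)
        self._lock = threading.Lock()
        self.running = 0
        self.peak = 0          # highest concurrency actually observed

    def run(self, job):
        with self._permits:    # blocks (queues) once `cap` jobs are running
            with self._lock:
                self.running += 1
                self.peak = max(self.peak, self.running)
            try:
                job()
            finally:
                with self._lock:
                    self.running -= 1

launcher = CappedLauncher(cap=3)
threads = [threading.Thread(target=launcher.run,
                            args=(lambda: time.sleep(0.01),))
           for _ in range(10)]
for t in threads: t.start()
for t in threads: t.join()
print(launcher.peak <= 3)  # -> True: never more than 3 jobs at once
```

All ten "executions" complete, but the semaphore guarantees that no more than three run at any moment; the rest simply wait their turn.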


Server User Interface Redesign

We're introducing a new look and feel for the Server Console user interface. It's geared towards support personnel – offering better insight into Server processes and simpler configuration.


Usability Improvements

We care about users who need a tool that not only does things right, but also makes life easier. We continually add features and tweaks to do just that.

Designer


Insert components onto an edge

Drag and drop a component from the Palette onto an edge to insert it between the two existing components, automatically linking it to both sides. Hint: You can also use the Find component dialog (Shift-Space) and then place the selected component onto an edge.

Known Issues & Compatibility

Engine (Compatibility)

Internal record counters and sequence counters changed from "integer" to "long".
Jobflows that use the ExecuteGraph/Jobflow and MonitorGraph/Jobflow components and map the tracking data will need to be updated:
just change "integer" to "long" in your metadata. Note that this may also affect those who implement Engine extensions,
custom components, or functions.
(CL-1825, CL-2652)
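For context, a signed 32-bit counter can only count to 2,147,483,647 before wrapping negative – too small for big-data record volumes. A quick illustration of Java-style two's-complement int wraparound (the masking helper is ours, for demonstration only):

```python
def as_int32(n):
    """Interpret n as a signed 32-bit value (like a Java `int`)."""
    n &= 0xFFFFFFFF
    return n - 0x1_0000_0000 if n >= 0x8000_0000 else n

INT_MAX = 2**31 - 1
print(as_int32(INT_MAX))      # -> 2147483647 (the last safe count)
print(as_int32(INT_MAX + 1))  # -> -2147483648 (the counter wraps negative)
print(2**63 - 1)              # -> 9223372036854775807 (range of a `long`)
```

Switching the counters to 64-bit "long" pushes the limit to over nine quintillion records, which is why the metadata type must change as well.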

Engine (Compatibility)

DBInputTable now distinguishes between empty strings and NULL values. (CL-2748)

Server, Designer (Compatibility)

Database connections were refactored to be proxied through the Server (see above and CL-2682). This may affect those who have implemented their own connection type and use the Engine's Connection interface.

Server (Compatibility)

The "redirectErrorOutput" attribute of ExecuteScript can no longer be mapped from the input port – it affects graph topology and thus cannot meaningfully be changed at runtime. (CLO-236)

Engine, Server (Compatibility)

The structure of error and exception messages has changed. This may affect those who parse graph output logs in an automated way, so please check your matching rules. (CL-2710)