Saturday, May 14, 2011

Talend Cache Management

Background : Talend Open Studio can read from and write to many different sources, so it is generally easy for a good data integration architect to design solutions that "cache" recordsets in database tables, temporary files, etc.
However, this adds extra steps, and sometimes extra architecture components such as a database, increasing the complexity of the job itself.
Sometimes it would be handy to have a simple in-memory buffer for temporary data, perhaps populated incrementally over a number of iterations. Or it would be nice to be able to dump this buffer to disk and reload it whenever needed.

The solution

A couple of components that handle the cache management tasks, both in memory and on disk.
They are easy to use in any Talend Data Integration job, allowing temporary or persistent data storage. Cache files and memory buffers can be loaded incrementally using loops.
It is also possible to "init" a cache from a previously stored file and incrementally append new records.
Storing data in memory can quickly use up the Java heap, especially since this release of the routines is not yet optimized: there is currently no data compression, although this could be added in a future release. I successfully managed to load a 6-million-record table (4 fields) in memory, but you should account for concurrent processes that may also use the heap heavily.
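To see why the heap fills up, here is a minimal sketch of a named in-memory row buffer of the kind the components rely on (the class and method names are illustrative assumptions, not the component's actual code): every cached row stays on the heap as a plain Object array until the buffer is explicitly cleared.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: a global registry of named row buffers.
public class CacheBuffer {

    private static final Map<String, List<Object[]>> BUFFERS =
            new HashMap<String, List<Object[]>>();

    // Append one row to the named buffer, creating it on first use.
    public static void append(String name, Object[] row) {
        List<Object[]> rows = BUFFERS.get(name);
        if (rows == null) {
            rows = new ArrayList<Object[]>();
            BUFFERS.put(name, rows);
        }
        rows.add(row);
    }

    // Return the buffered rows (an empty list if the buffer does not exist).
    public static List<Object[]> get(String name) {
        List<Object[]> rows = BUFFERS.get(name);
        return rows == null ? new ArrayList<Object[]>() : rows;
    }

    // Drop the buffer so its rows can be garbage-collected.
    public static void clear(String name) {
        BUFFERS.remove(name);
    }

    public static void main(String[] args) {
        // Simulate 100 iterations appending 10 rows each, as in the demo job.
        for (int iter = 1; iter <= 100; iter++) {
            for (int rec = 0; rec < 10; rec++) {
                append("cache", new Object[] { iter, "record-" + rec });
            }
        }
        System.out.println(get("cache").size()); // prints 1000
        clear("cache");
        System.out.println(get("cache").size()); // prints 0
    }
}
```

Since every row is retained until cleared, heap consumption grows linearly with the number of cached records, which is why very large recordsets need care.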

Current version

Version : 0.2
Release Date : May 11 2011
Status : Beta

Example

The job tCacheDemo demonstrates the usage of the two components.

The Cache Load Loop subJob loads the cache buffer in memory and generates a set of cache files.
To demonstrate the incremental load capabilities, a loop is used to process a data source multiple times (in this example an RSS feed, but any recordset would work).

tLoop_1 This step simply performs 100 iterations.

tRSSInput_1 A sample data source returning 10 records. In the demo job it is configured to read an XML file, "news.rss", a local copy of a Google News RSS feed.

tCacheOutput_1 This component is set to cache data both on disk and in memory. The global buffer variable name is arbitrary and was set to "cache".
The cache file name is set to context.baseDir+"/cache"+((Integer)globalMap.get("tLoop_1_CURRENT_VALUE"))+".dat" to demonstrate the ability to change the file name at each iteration. Finally, the "append to file" option is unchecked so the cache files are reset at each run. Feel free to play around with these settings.
This component is a "DATA_AUTOPROPAGATE" component, meaning that the same flow that arrives as input is also available as output, allowing the component to be used as a "middle" step in a flow.
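The cache file name above is an ordinary Java expression that Talend re-evaluates at each iteration. This standalone sketch (with hypothetical stand-ins for Talend's context and globalMap) shows the file names it produces:

```java
import java.util.HashMap;
import java.util.Map;

public class CacheFileNameDemo {
    public static void main(String[] args) {
        // Hypothetical stand-ins for Talend's context and globalMap.
        String baseDir = "/tmp/demo";
        Map<String, Object> globalMap = new HashMap<String, Object>();

        for (int i = 1; i <= 3; i++) {
            // tLoop publishes the current iteration value under this key.
            globalMap.put("tLoop_1_CURRENT_VALUE", Integer.valueOf(i));
            // The same expression used in the tCacheOutput settings.
            String cacheFile = baseDir + "/cache"
                    + ((Integer) globalMap.get("tLoop_1_CURRENT_VALUE")) + ".dat";
            System.out.println(cacheFile); // /tmp/demo/cache1.dat, then cache2.dat, cache3.dat
        }
    }
}
```

With 100 iterations the demo job therefore writes cache1.dat through cache100.dat under the configured base directory.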

The "Record count" subJob simply uses a tJava component to output the two record counters available in the tCacheOutput component, one being the number of records processed in the current iteration and the other being the number of records stored in the memory cache.

Finally, the Cache read subJob is activated once all the iterations have terminated (via a "subJob OK" trigger link originating from tLoop_1) and streams its output to the tLogRow component. In a real-world application this would be connected to a destination table, a tBufferOutput, or another flow consumer.

tCacheInput_1 By setting this component to read from the memory "cache" buffer, all the records stored with the 100 iterations are returned.
An optional check is set to remove the buffer from memory once it is read. You can leave the data in memory if you need to process it again (for example with another tCacheInput component).
Alternatively, it is possible to set the cache source to one of the generated cache files.
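The read-and-clear behaviour can be sketched as follows (again an illustrative assumption, not the component's actual source): the buffer is returned in full and, when the option is checked, removed so the heap can be reclaimed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the read side: return all buffered rows and,
// when requested, drop the buffer so the memory can be reclaimed.
public class CacheReadDemo {

    static Map<String, List<Object[]>> buffers =
            new HashMap<String, List<Object[]>>();

    static List<Object[]> read(String name, boolean clearAfterRead) {
        List<Object[]> rows = buffers.get(name);
        if (rows == null) {
            rows = new ArrayList<Object[]>();
        }
        if (clearAfterRead) {
            buffers.remove(name); // buffer is gone after the first read
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Object[]> cache = new ArrayList<Object[]>();
        cache.add(new Object[] { "title", "link" });
        buffers.put("cache", cache);

        System.out.println(read("cache", true).size());  // prints 1
        System.out.println(read("cache", true).size());  // prints 0: already cleared
    }
}
```

Leaving clearAfterRead unchecked is what allows a second tCacheInput component to read the same buffer again.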