Monday, August 29, 2011

Transform XML with XSLT in Talend

Talend Open Studio is an excellent ETL tool that can be used beyond the typical database and CSV manipulation. XML processing for example is today all over the places in the Enterprise.

As a consequence XML transformations are a key skill for those folks doing data transformations.

Even though there are more efficient tools XSLT is a standard which is supported in Talend through the tXSLT component. You just need to provide your XML, XSL and output files and Talend will apply the transformation for you. Talend uses Saxon at the moment so you get the benefit of clear error messages when trying to build your XSL.

Of course XSLT might be a skill that even scares some people, however XSLT is not difficult at all and the more you work with it the better you get as with any other human skill. Do not try to avoid it, if you have XML to process and your ETL tool is Talend then do your homework and learn some XSL.

As there is no better way to teach than providing an example I decided to write this quick showcase that will pivot the data resulting from running an Advent Geneva RSL report (A SOAP service) which comes in the form of key value pairs into a tabular output. I will provide two responses: XML and HTML. The first is probably what you need in Talend while the second is probably what you need if you want to provide a quick add hoc HTML report page.

If you open it with Firefox and you choose "View XPath" from a right click on the body of the page (contextual menu) you could try several XPATH expresions. Together with XSLT skill it comes XPATH which is just a way to address a node, an attribute or textual content in the XML. See below how I tested one of the XPATH using this tool. Pay attention to the namespace definition, I use simple letters to abbreviate more verbose prefixes.

Now that you have a quick tool for finding nodes in the XML document let us see the desired document structure. Here is a screenshot of what we would like to see in HTML:

Here is in the XML we would like to obtain for further processing in Talend:

A quick explanation of what the XSL code does:
1-6 Is just stating it is an XML document which uses a namespace prefixed "xsl" for the transaformation instructions, a soap namespace with prefix "e" and an Advent Geneva namespace with prefix "g". Note that I abreviated the original namespace suffixes for the last two.

7-9 How the final output should look like.

11-19 We use the Muenchian grouping that allows us to have a list of all possible column names.

21-37 XSLT is a functional language which works matching nodes and applying transformations to them. The logic can be affected using xsl:apply-template. We use xsl:element to create our custom nodes: First the headers which come out of the Muenchian key and later the records for which we again use the keys while addressing the correct record through an xsl:foreach nested loop.

@Alisa Could you post your question in http://www.talendforge.org? I promise to at least BUMP it there so let me know the link for it. Embedding resources in talend projects should be provided IMO out of the box.