…is online

XML Data Processing Using Talend Open Studio

For my Netflix Search Engine project,Flicksery, I get data from Netflix as XML files. The catalog file for instant titles is around 600 MB. Following is a sample XML entry for one title/movie.

As you can see it is not very easy to read. Not only the alignment, but the number of elements too. May be we could use a better text editor or an XML editor to see this properly. But it still would be very difficult to play around with the data and do any kind of transformation or analysis. Talend Open Studio is something that can be really useful to analyze the data embedded within the large XML.

In this post we will try and analyze the XML file using some really neat features available in Talend Open Studio.

Let’s get started:

Open up Talend Open Studio and create a new project – NetflixDataAnalysis:

Right click on Job Designs and select Create job – InstantFeedTest

Right click on File xml and select Create file xml:

This brings up a wizard. Enter the details as shown below and click Next:

In step 2 of the wizard select Input XML and click, Next.

In step 3 of the wizard select the XML file. For this test, I took only part of the XML file as loading the entire 600 MB file would cause Java Heap issues and would prevent the file from loading correctly. As we just want to analyze and see the different fields available in the XML, a sample should be sufficient. Once the file is selected you should see the schema of the XML file in the Schema Viewer.

Step 4 is where you start to see the real power of Talend Open Studio. The Source Schema list on the left displays the schema of the XML file. The Target Schema section provides you with a way of defining an output schema for the XML. Using XPath you can now define the required elements from the input XML file. You can drag the element which will repeat itself in the XML to Xpath loop expression section. In this case the element catalog_title is the element that embeds all information for a single movie/title.

Next, you can traverse through all the nodes on the left and drag the required elements to the right under the Fields to extract section. You can also provide custom column names under the Column Name section. Once you are done dragging all the required fields, click on Refresh Preview to see a preview of the data. This preview helps one get a quick idea of how the data will be parsed. Click Finish.

Double click on the job, InstantFeedTest to open it up in the workspace. Drag the newly created XML Metadata, NetflixInstantFeed. Also, drag the tFileOutputDelimited_1 component from the Palette on the right.

Right click on the NetflixInstantFeed node and select, Row->Main and join it to tFileOutputDelimited_1 node. Once joined it should look like the image below:

Select the tFileOutputDelimited_1 node and go to the “Component” tab at the bottom of the workspace. Update the configurations, Field Separator to “,” and set the File Name to the desired path and name. We are now ready to test out our job. Click on the Run icon on the toolbar to execute the job. Once executed you should see processing status as shown below:The above job is going to read the XML file, extract the fields and generate a comma separated text file with the extracted data.

http://api-public.netflix.com/catalog/titles/movies/780726,The Mummy,http://cdn0.nflximg.net/images/0270/2990270.jpg,When British archaeologists uncover the ancient sarcophagus of a mummified Egyptian priest (Boris Karloff), they foolishly ignore its warning not to open the box. Now brought back to life, the mummy tries to resurrect the soul of his long-dead love.,,Top 100 Thrills nominee,,,1346482800,4102444800,NR,MPAA,4388,1.77:1,1932,Classic Movies,3.5,1387701023800,,,

As you can see the big XML node has now more readable as a simple comma separated record. This was a simple one-to-one mapping to from XML to CSV. Talend Open Studio is way more powerful than this. You can add a new component to the job to apply transformations to data coming in from the XML.

As you see in the above record the first column/value in the comma separated record is a link. All I am interested is in the last 6 digits of the link. I want my final output to have only 6 digits and not the entire link. To do this delete the link between NetflixInstantFeed and tFileOutputDelimited_1 nodes. Next, drag tMap_1 component from the Pallete to the Job workspace. Right click on NetflixInstantFeed, select, Row->Main and join it to tMap_1. Next, right click on tMap_1, select, Row->New Output (Main) and join it to tFileOutputDelimited_1 node. You will be prompted to enter a name. The name entered for this example is processed_op. Once done, the job should now look as shown below:

Select the tMap_1 component and click on the Map Editor button on Component tab at the bottom of the workspace. The Map Editor opens up with the metadata information from the XML file on the left and desired output on the right. As you see below I have dragged all the columns from the left to the right. The only modification I have done is for the “id” column. I have applied a function to get only the last 6 digits from the right.

As you can see we can easily apply functions to transform the input data. The function shown above is a StringHandling function. There are several other functions that can be applied using the Expression Builder as shown below:

After you are done applying your function, click OK to close the screen. Now, you can re-run the job to see if the transformation has been applied correctly. After a successful run you should see the results of the job as shown below:Let us look at the output file to see the effect of the transformation:

780726,The Mummy,http://cdn0.nflximg.net/images/0270/2990270.jpg,When British archaeologists uncover the ancient sarcophagus of a mummified Egyptian priest (Boris Karloff), they foolishly ignore its warning not to open the box. Now brought back to life, the mummy tries to resurrect the soul of his long-dead love.,,Top 100 Thrills nominee,,,1346482800,4102444800,NR,MPAA,4388,1.77:1,1932,Classic Movies,3.5,1387701023800,,,

We have successfully built a job to transform an XML file to a comma separated file. This job can be exported and run as a standalone job on any environment running Java. Also, we chose to output the data to CSV file, however, Talend Open Studio can read from multiple data formats, databases and also write to different file formats or directly insert into databases.

This was just a quick introduction (tip of the iceberg) to experience the usefulness of Talend Open Studio for data processing. The features of this tool are vast and can’t be covered in a blog post. But this should get you started on using Talend. Hope you found this fast paced tutorial useful.

ABOUT THE AUTHOR

Rohit Menon

I am an author and software developer who loves to learn and build new things. To share my learning I blog here and have also built Hadoop Screencasts (www.hadoopscreencasts.com), that hosts screencasts on Apache Hadoop and its components.

Yes, this is possible.
You will need to extract the column as an XML file and then parse the contents as just another XML component. In this case you will not create an XML file from the column.
All this can be done as part of a single job.

Hello Rohit,
Thanks for your blog. The blogs show your passion to learn and excel in technology. A good inspiration to techies.

A quick question. How did you get the data from Netflix. Did you contact them ? If you ran your own crawler to download them (I hope that you did not do that) were you not run into any legal issues with Netflix.

We have a requirement to handle XML file. We need to read the XML file (Which is stored in MySQL DB as Blob) and modify the Values (Data) in couple of Tags in the XML using Look-up (this Look-up tables will be in MySQL) and then load the modified XML as a Blob in a Table.

XML file (Which is stored in MySQL DB as Blob)–> Modify Data in Some tags –>Load Modified XML in Table as Blob.

We are looking for the options to perform the above operations without storing the XML data in any tables because XML itself a Structured one. This XML data Transformation needs to happen in the Talend Components itself.

Is this Possible? or we need to extract the data in to table and do the data transformation and then Construct the XML ?