Best practice for creating Talend metadata connections to NoSQL DBs and Hadoop

Starting with Talend 5.6.2, it is possible to create metadata connections for NoSQL databases and Hadoop platforms using the metadata feature in the Studio. Even better, the Studio can now discover the connection properties automatically from the Hadoop *-site.xml configuration files (core-site.xml, hdfs-site.xml and so on).

When I started writing this blog post, this feature was not complete on the Talend side. It was possible to create a metadata connection to Hadoop, BUT it was not possible to convert or export the connection to context variables. Without that, the connection was not very useful. Thankfully, this has been fixed - and metadata connections now work as expected!

Whether or not you use TAC (Talend Administration Center) to deploy and execute jobs, creating a Hadoop metadata connection is a great way to manage connections as jobs move from development to QA to Production. More important than moving jobs between environments, though, Hadoop metadata connections are especially useful when Hadoop upgrades occur. That is because Talend components are designed so that the target Hadoop distribution and version are configured at design time, not at run time and certainly not dynamically (at least, not yet!). Once the distribution and version are selected, Talend configures the job to use a particular set of JARs and generates code to match the selection. With other data stores such as RDBMSs, you can usually connect to one version of a database using another version of the client drivers. With Hadoop you usually can't: you run into errors due to mismatched libraries and other issues.

Why a whole blog post on metadata connections in 2015? Surely, if you've been using Talend for some time, you'll know that using metadata connections is standard practice. Right? Well, here's the big deal: if the Hadoop distribution changes and you have 1,000 jobs in Production that need updating, you have a lot of work on your hands. Even when you're using TAC to deploy these jobs, you still have to update the metadata connection in each project to use the right Hadoop distribution and version before regenerating / rebuilding the jobs from TAC. Where TAC is not used for deployment and execution, the workload is even greater: after updating the connections, you have to export the jobs, either manually or automatically (using Talend CommandLine), and then deploy them on the execution servers in QA and PR.

How do you minimize your effort in using Hadoop metadata connections? Rather than simply creating a connection per Talend project, create a reference project and create the metadata connections there. Then assign the reference project to all projects that need to connect to Hadoop. If and when something changes about Hadoop, you simply open one project (the reference project), update your connections (change the versions and so on), save them, and regenerate all [1,000] jobs. Because the metadata connections in all jobs are sourced from the reference project, the jobs will automatically be updated to use the right JARs and generate the right code for the target distribution. After creating the connections, don't forget to convert / export your metadata connection values / settings to contexts. Create a context group and define values for each of your Hadoop environments - Dev, QA, PR and so on.
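To make the context group idea a bit more concrete, here is a small sketch in Python (not Talend code) of what such a group amounts to: the same set of variables, with one set of values per environment. The variable names and host names are made up for illustration, and the key=value output is only an assumption about how you might want to store the values outside the Studio.

```python
# Hypothetical context variables and hosts, for illustration only.
CONTEXT_GROUP = {
    "Dev": {
        "hadoop_namenode_uri": "hdfs://dev-nn.example.com:8020",
        "hadoop_user": "talend_dev",
    },
    "QA": {
        "hadoop_namenode_uri": "hdfs://qa-nn.example.com:8020",
        "hadoop_user": "talend_qa",
    },
    "PR": {
        "hadoop_namenode_uri": "hdfs://pr-nn.example.com:8020",
        "hadoop_user": "talend_pr",
    },
}


def render_context(values):
    """Render one environment's values as key=value lines, sorted by key."""
    return "\n".join(f"{key}={value}" for key, value in sorted(values.items())) + "\n"


def missing_keys(group):
    """List, per environment, the variables that other environments define but this one lacks."""
    all_keys = set()
    for values in group.values():
        all_keys.update(values)
    gaps = {}
    for env, values in group.items():
        absent = sorted(all_keys - set(values))
        if absent:
            gaps[env] = absent
    return gaps


if __name__ == "__main__":
    print(missing_keys(CONTEXT_GROUP))
    print(render_context(CONTEXT_GROUP["QA"]))
```

The completeness check is the useful part: a variable defined for Dev but forgotten for PR is exactly the kind of gap that only shows up after deployment.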

So, rather than reinventing the wheel - and since Talend did such a great job documenting how to create metadata connections for both NoSQL and Hadoop - I'm deferring to them. Go to the Talend Help article - and hit Next Page to read the setup for all the various pieces. And as explained above, do so in a reference project.

In a later post, I will cover how to create a job that reads the *-site.xml files directly and generates a context file for jobs that are not deployed and executed through TAC.
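In the meantime, here is a rough sketch of the idea in Python rather than as a Talend job. It relies on the standard layout of Hadoop configuration files (a configuration element containing property entries with name and value children) and on fs.defaultFS being a real core-site.xml property. Everything else is an assumption for illustration: the sample host, the context variable names in the mapping, and the key=value output format.

```python
import xml.etree.ElementTree as ET

# Sample content in the shape of a Hadoop core-site.xml; the host is made up.
SAMPLE_SITE_XML = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>"""


def read_site_properties(xml_text):
    """Return the name/value pairs found in a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name:
            props[name.strip()] = (value or "").strip()
    return props


def to_context_lines(props, mapping):
    """Build key=value lines for the Hadoop properties listed in mapping.

    mapping maps a Hadoop property name to a (hypothetical) context
    variable name; properties absent from props are skipped.
    """
    lines = []
    for hadoop_name, context_name in mapping.items():
        if hadoop_name in props:
            lines.append(f"{context_name}={props[hadoop_name]}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    site_props = read_site_properties(SAMPLE_SITE_XML)
    print(to_context_lines(site_props, {"fs.defaultFS": "hadoop_namenode_uri"}))
```

In practice you would read each environment's files from disk and write one context file per environment. The explicit mapping keeps the context file limited to the handful of values the jobs actually use.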

Will Munji is a seasoned data integration, data warehousing and business intelligence (BI) architect & developer who has been working in the DW/BI space for a while. He got his start in BI working on Brio SQR (later Hyperion SQR) and the Crystal Decisions stack (Reports, Analysis & Enterprise) and SAP BusinessObjects / Microsoft BI stacks. He currently focuses on Talend Data Management Suite, Hadoop, SAP BusinessObjects BI stack as well as Jaspersoft and Tableau. He has consulted for many organizations across a variety of industries including healthcare, manufacturing, retail, insurance and banking. At Kindle Consulting, Will delivers DW/BI/Data Integration solutions that range from front-end BI development (dashboards, reports, cube development, T-SQL/ PL/SQL ...) to data services (ETL/ DI development), data warehouse architecture and development, data integration to BI Architecture design and deployment.