Automating Hadoop Jobs Using Rundeck

Rundeck is an open-source tool for automating routine operational procedures. It lets you define jobs as sets of steps and run them on demand or on a schedule. Rundeck is developed as an open-source project on GitHub by the Rundeck community.

Following are some of its exciting features:

Web API

Distributed command execution

Pluggable execution system (SSH by default)

Multistep workflows

Job execution with on-demand or scheduled runs

Graphical web console for command and job execution

Role-based access control policy with support for LDAP/ActiveDirectory

History and auditing logs

Open integration with external host inventory tools

Command line interface tools
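As a taste of the Web API feature listed above, here is a minimal sketch of triggering a job over the API with curl. The URL, token, and job ID are placeholders, and the API version number in the path may differ on your installation; the command is only echoed, so you can review it before actually sending it.

```shell
# Sketch: starting a Rundeck job on demand over the Web API.
# RUNDECK_URL, API_TOKEN, and JOB_ID are placeholders -- substitute your own.
RUNDECK_URL="${RUNDECK_URL:-http://localhost:4440}"
API_TOKEN="${API_TOKEN:-REPLACE_WITH_TOKEN}"
JOB_ID="${JOB_ID:-REPLACE_WITH_JOB_UUID}"

# POST to the job-execution endpoint starts a run; the request is printed
# here rather than executed so it can be inspected first.
CMD="curl -s -X POST -H 'X-Rundeck-Auth-Token: $API_TOKEN' $RUNDECK_URL/api/41/job/$JOB_ID/executions"
echo "$CMD"
```

The same endpoint is what the graphical Run button calls under the hood, which is what makes Rundeck easy to wire into other tooling.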

In our previous blog, we showed how to schedule a Hadoop job in Rundeck. In this blog, we will demonstrate how to automate a Hadoop/Hive/Pig job using Rundeck so that it runs on a daily or even a monthly basis.

We recommend going through our previous blogs on Rundeck for the installation steps and for how to schedule a Hadoop job.

Let us start with project creation. We will put a list of Hive queries in a file and then configure the job to run automatically every day.

To create a new project, click on New Project and provide the necessary details, such as the project name and description, as shown in the screenshot below.

Also, check the option Require File Exists in the Resource Model Source and click on Save.

Now, scroll to the end of the page and click on Create. Your Rundeck project will be created and you will see the project screen shown below.

Now, click on Create Job in the right corner and select New Job. Fill in the necessary details, such as the job name and description, as shown in the screenshot below:

Next, go to the Workflow section and select the options of your choice. We have selected the following:

If a step fails: Stop at the failed step

Strategy: Sequential

To provide a job or a query, go to the Add Step section and select the Command option.

Here, you need to provide the Hive query file containing a set of Hive queries. We receive our employee details in the file emp.csv in HDFS on a daily basis, so we create an .hql file to load that data and name it hive_query.hql, as shown below.
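The post's actual query appeared only in a screenshot, so here is a hypothetical sketch of what such a hive_query.hql could contain. The table name employee.employee_test is taken from the count query used later in the post; the HDFS path is purely an assumption:

```shell
# Write a hypothetical hive_query.hql -- the table name comes from the post's
# later count query, the HDFS path is an illustrative assumption.
cat > hive_query.hql <<'EOF'
-- Load the day's emp.csv from HDFS into the employee table
LOAD DATA INPATH '/user/hadoop/emp.csv' INTO TABLE employee.employee_test;
EOF
```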

The command used to run this script in the command line is shown below:

hive -f hive_query.hql

After entering the command, click on Save. If you want to run another query after this one, you can add it as another step.

After loading the data, we want to count the number of employees present. We can do this with the following Hive query:

select count(*) from employee.employee_test

We will save this query in a file named hive_emp.hql. At the end of the command, we append > emp_cnt.txt, which redirects the query output into the file emp_cnt.txt. We will enter this command as the next step in our workflow, as shown in the screenshot below:

hive -f hive_emp.hql > emp_cnt.txt
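Put together, the second step's query file can be created like this (the file name and the query are from the post; only the heredoc wrapper is ours):

```shell
# Recreate the second step's query file as described in the post; the step
# itself then runs: hive -f hive_emp.hql > emp_cnt.txt
cat > hive_emp.hql <<'EOF'
select count(*) from employee.employee_test;
EOF
```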

For automating this job, select the following option:

Schedule to run repeatedly: Yes

You will get two kinds of scheduling: a simple one and a Unix-crontab-style one. After selecting the option you need, scroll to the bottom and click on Create.
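As far as we know, Rundeck's crontab option accepts a Quartz-style expression, which has a leading seconds field and an optional trailing year in addition to the usual five Unix fields. For example, to run the job daily at 06:00 (illustrative):

```
# Quartz-style crontab: seconds minutes hours day-of-month month day-of-week year
0 0 6 * * ? *
```

The `?` in the day-of-week position means "no specific value", which Quartz requires when day-of-month is already specified.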

After clicking on Create, you will be redirected to the job page. Beside your job, you can see a countdown to its next run.

You can also see your job definition in the Definition tab as shown below:

In this demo, we have set the schedule so that only 2 minutes are left before the job runs. After 2 minutes, this job will run automatically.

You can track the job status in the Activity for this job section below. It has four tabs: running, recent, failed, and by you.

After 2 min, in the running tab, you can see that your job is running.

Once your job starts, it is assigned an execution number, which you can use to track its running status and view its complete console output. In the screenshot below, you can see that our job is running.

The execution number here is 21. In the recent tab, we can see the list of all succeeded and failed jobs. We will now open execution number 21 and look at its console output.

We can see that our job has run successfully. We can check the output in the Log Output tab, which shows the console output of both jobs.
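If you prefer the command line, the same console output can, as far as we know, also be fetched over Rundeck's Web API. The sketch below builds the request for execution 21 with a placeholder URL and token, and echoes it rather than sending it:

```shell
# Sketch: fetching the console output of execution 21 over the Web API.
# RUNDECK_URL and API_TOKEN are placeholders; the API version may differ.
RUNDECK_URL="${RUNDECK_URL:-http://localhost:4440}"
API_TOKEN="${API_TOKEN:-REPLACE_WITH_TOKEN}"
OUT_CMD="curl -s -H 'X-Rundeck-Auth-Token: $API_TOKEN' $RUNDECK_URL/api/41/execution/21/output"
echo "$OUT_CMD"
```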

Now, we will check for the output in the file emp_cnt.txt.

In the above screenshot, you can see that there are 6,000 employees in the company to date. As scheduled, the same job will run automatically the next day, and the new count will be saved.

Once the job completes successfully, you can see the countdown to the next run, as shown in the screenshot below:

We hope this blog helped you in automating your Hadoop jobs using Rundeck. Keep visiting our website, www.acadgild.com, for more updates on Big data Training and other technologies.
