Want to know one of the hardest parts for me when installing Hadoop with Ambari?

Setting up passwordless SSH for all nodes so that the Ambari Agent could do the install. Looking back, it might be a trivial thing to get right, but at the time my Linux skills were lacking. Plus, I had been cast into the Hadoop Administrator role after only a month as a Pig Developer.

Having a background in Linux is very beneficial to excelling as a Data Engineer or Hadoop Administrator. However, if you have just been thrown into the role or are looking to build your first cluster from scratch, check out the video below on Setting Up Passwordless SSH for Ambari Agent.

Transcript – Setup Passwordless SSH for Ambari Agent

One step we need to do before installing Ambari is to set up passwordless SSH on our Ambari boxes. So, what we’re going to do is we’re actually going to generate a key on our master node and send those out to our data nodes. I wanted to caution you that this sounds very easy, and if you’re familiar with Linux and you’ve done this a couple of times, you understand that it might be trivial. But if it’s something you haven’t done before or if it’s something you haven’t done in a while, you want to make sure that you walk through this step. One of the reasons that you really want to walk through this before we install anything Ambari-related or Hadoop-related is because this is going to help us troubleshoot problems that we might have with permissions. So, if we know that this piece works, we can eliminate all the other problems.

No problem if you haven’t set it up before. We’re actually going to walk through that in the demo here. But first, let’s just look at it from an architectural perspective. So, what we’re going to do is, on our master node, we’re going to generate both a public and a private key. Then we’re going to share out that public key with all the data nodes, and what this is going to do is this is going to allow for the master node to log in via SSH with no password into data node 1, 2, and 3. So, since master node can actually login to these, we only have to install Ambari on the master node and then allow the master node to run all the installation on all the other nodes. You’ll see more of that once we get into installing Ambari and Ambari Agent, but just know that we have to have this public key working in order to have passwordless SSH.

The steps to walk through it are pretty easy. So, what we’re going to do is we’re going to login to our master node and we’re going to create a key. So, we’ll type in ssh-keygen. From there, it’ll generate the public and private key, and then we will copy the public key to data node 1, 2 and 3. Next, we’re going to add that key to the authorized list on all the data nodes. We’re going to test from our main node into data node 1, 2, and 3 just to make sure that our passwordless SSH works and that we can log in as root. Now let’s step through that in a demo.
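The four steps above can be sketched as a small shell script. This is a minimal sketch rather than the exact commands typed in the video: the root account, the default RSA key path, the script name, and the data node host names passed as arguments are all assumptions. The chmod calls are an addition that the video does not show.

```shell
#!/usr/bin/env bash
# Sketch of the four steps, run as root on the master node.
# Assumptions: root login, the default RSA key path, and data node host
# names passed as arguments, e.g. ./setup-ssh.sh node2 node3 node4
# (setup-ssh.sh is a hypothetical file name).

KEY="${KEY:-$HOME/.ssh/id_rsa}"

setup_passwordless_ssh() {
  local target="$1"
  # 1. Generate the key pair once, with an empty passphrase.
  [ -f "$KEY" ] || ssh-keygen -t rsa -N "" -f "$KEY"
  # 2. Copy the public key to the data node.
  #    Steps 2 and 3 still prompt for the root password.
  scp "$KEY.pub" "root@$target:/root/id_rsa.pub"
  # 3. Append the key to the authorized list on the data node and
  #    tighten permissions (the chmod calls are not shown in the video).
  ssh "root@$target" 'mkdir -p ~/.ssh && cat ~/id_rsa.pub >> ~/.ssh/authorized_keys && chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys'
  # 4. Test the login; it should print the host name without a prompt.
  ssh "root@$target" hostname
}

# Nothing runs unless host names are given on the command line.
for node in "$@"; do
  setup_passwordless_ssh "$node"
done
```

Running it with no arguments does nothing, which makes it safe to read through before pointing it at real nodes.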

Now we’re ready to set up passwordless SSH in our environment. So, in my environment, I have node one, which will be my master node, and I’m going to set up passwordless SSH on node 2, 3, and 4. But in this demo, we’re just going to walk through doing it on node 1 and node 2, and then we can just replicate it—the same process—on the other nodes.

So, the first thing we need to do is, on our master node or node 1, we’re going to generate our public key and our private key. So, ssh-keygen. We’re going to keep it defaulted to go into the .ssh folder. I’m not going to enter anything for my passphrase.

You can see there’s a random image and we can run an ll on our .ssh directory, and we see that we have both our public and our private key. Now what we need to do is we need to move that public key over to our data node 1, and then we’ll be able to login without using a password. So, I’m going to clear out the screen, and now what we’re going to do is we’re going to use scp and just move that public key over to node 2.

So, since we haven’t set up our passwordless SSH, it will prompt us for a password here. So, we got the transfer complete and now I’m going to log in to node 2. Still haven’t used that password. If we run a quick ll, we can see we have our public key here, and now all we need to do is set up our .ssh directory and add this public key to the authorized keys. So, we’re going to make that directory, we’re inside that directory, and we can see nothing is in it. Now it’s time to move that public key into this .ssh directory.

Then we have our public key, and now let’s just cat that file. We’re going to create an authorized keys. And we have two files here, so we have our public key, and then we’ve also written an authorized keys, which is going to be that public key. So, we’re going to exit out. As you can see, I’m back in node 1. So, now we should be able to just SSH in and not be prompted for a password. And you can see now we’re here in node 2.

So that’s how you set up your passwordless SSH. We’ll need to do this for all data nodes that we’re going to add to the cluster, and this will allow that, once we have Ambari installed on our main node, Ambari will be able to go and make changes to all the data nodes and do all the updates and upgrades all at one time so that you’re not having to manage each individual upgrade, each individual update.
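Since the same process has to be repeated for every data node, a quick check from the master node helps confirm that each one accepts the key. The sketch below assumes root logins and host names passed as arguments; neither comes from the video. If a node fails even though the key was copied, permissions on ~/.ssh or authorized_keys that are too open are a common cause.

```shell
#!/usr/bin/env bash
# Sketch: check passwordless SSH from the master node to each data node.
# Assumptions: root login; host names are passed as arguments.

check_node() {
  # BatchMode=yes makes ssh fail instead of prompting for a password.
  if ssh -o BatchMode=yes -o ConnectTimeout=5 "root@$1" true 2>/dev/null; then
    echo "$1: ok"
  else
    echo "$1: FAILED"
    return 1
  fi
}

check_all() {
  local failed=0 node
  for node in "$@"; do
    check_node "$node" || failed=$((failed + 1))
  done
  echo "$failed node(s) failed"
  # Succeed only when every node accepted the key.
  [ "$failed" -eq 0 ]
}

# Nothing runs unless host names are given on the command line.
if [ "$#" -gt 0 ]; then
  check_all "$@"
fi
```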

Today there are so many applications and frameworks in the Hadoop ecosystem, most of which are written in Java. So does this mean anyone wanting to become a Hadoop developer or Big Data Developer must learn Java? Should you go through hours and weeks of training to learn Java to become an awesome Hadoop Ninja or Big Data Developer? Will not knowing Java hinder your Big Data career? Watch this video and find out.

Transcript Of The Video

Thomas Henson:

Hi, I’m Thomas Henson with thomashenson.com. Today, we’re starting a new series called “Big Data, Big Questions.” This is a series where I’m going to answer questions, all from the community, all about big data. So, feel free to submit your questions, and at the end of this episode, I’ll show you how. So, today, the first question I have is a very common question. A lot of people ask, “Do you need to know Java in order to be a big data developer?” Find out the answer, right after this.

So, do you need to know Java in order to be a big data developer? The simple answer is no. Maybe that was the case in early Hadoop 1.0, but even then, there were a lot of tools that were being created like Pig, and Hive, and HBase, that are all using different syntax so that you can extrapolate and kind of abstract away Java. Because the key is, if you’re a data analyst or a Hadoop administrator, most of those people aren’t going to have Java skills. So, for the community to really move forward with this big data and Hadoop, we needed to be able to say that it was a tool that not only Java developers were going to be able to use. So, that’s where Pig, and Hive, and a lot of those other tools came. Now, as we start to look into Hadoop 2.0 and Hadoop 3.0, it’s really not the case.

Now, Java is not going to hinder you, right? So, it’s going to be beneficial if you do know it, but I don’t think it’s something that you would want to run out and have to learn just to be able to become a big data developer. Then, the question is, too, when you say big data developer, what are we really talking about? So, are we talking about somebody that’s writing MapReduce jobs or writing Spark jobs? That’s where we look at it as a big data developer. Or, are we talking about maybe a data scientist, where a data scientist is probably using more like R, and Python, and some of those skills, to pull their insights back? Then, of course, your Hadoop administrators, they don’t need to know Java. It’s beneficial if they know Linux and some of the other pieces, but Java’s not really necessary.

Now, I will say, in a lot of this technology… So, if you look at getting out of the Hadoop world but start looking at Spark – Spark has Java, so you can write your Spark jobs in Java, but you can also do it in Python and Scala. So, it’s not a requirement for people to have Java. I would say that there’s a lot of developers out there that are big data developers that don’t have any Java skills, and that’s quite okay. So, don’t let that hinder you. Jump in, join an open-source community project, do something to expand your big data knowledge and become a big data developer.

Well, that’s all we have today. Make sure to submit your questions. So, I’ve got a space on my blog where you can submit the questions or just submit them here, in the comments section, and I’ll answer your big data big questions. See you again!

What happens when you need a duplicate file in two different locations?

It sounds like a trivial problem: you just need to copy that file to the new location. In Hadoop and HDFS you can copy files easily. You just have to understand how you want to copy, then pick the correct command. Let’s walk through all the different ways of copying data in HDFS.

HDFS dfs or Hadoop fs?

Many commands in HDFS are prefixed with hdfs dfs -[command] or the legacy hadoop fs -[command]. However, not all hadoop fs and hdfs dfs commands are interchangeable. To ease the confusion, I have broken down both the hdfs dfs and hadoop fs copy commands below. My preference is to use the hdfs dfs prefix over hadoop fs.

Copy Data in HDFS Examples

The example commands assume my HDFS data is located in /user/thenson and local files are in the /tmp directory (not to be confused with the HDFS /tmp directory). The example data is a loan data set from Kaggle. Using the same data set or file structure isn’t necessary; it’s just a frame of reference.

Hadoop fs Commands

Hadoop fs cp – Easiest way to copy data from one source directory to another. Use the hadoop fs -cp [source] [destination].

hadoop fs -cp /user/thenson/loan.csv /loan.csv

Hadoop fs copyFromLocal – Need to copy data from local file system into HDFS? Use the hadoop fs -copyFromLocal [source] [destination].
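As a rough illustration, the copy could look like the sketch below. The paths reuse the local /tmp directory and the HDFS /user/thenson directory assumed throughout this post, and the helper function name is hypothetical.

```shell
# Sketch: copy a local file into HDFS with hadoop fs -copyFromLocal.
# Assumed paths: local /tmp/loan.csv and HDFS /user/thenson/loan.csv.
copy_into_hdfs() {
  hadoop fs -copyFromLocal "$1" "$2"
}

# Run only where the hadoop CLI is installed.
if command -v hadoop >/dev/null 2>&1; then
  copy_into_hdfs /tmp/loan.csv /user/thenson/loan.csv
fi
```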

HDFS dfs Commands

HDFS dfs cp – Easiest way to copy data from one source directory to another. The same as using hadoop fs -cp. Use hdfs dfs -cp [source] [destination].

hdfs dfs -cp /user/thenson/loan.csv /loan.csv

HDFS dfs copyFromLocal – Need to copy data from local file system into HDFS? The same as using hadoop fs -copyFromLocal. Use the hdfs dfs -copyFromLocal [source] [destination].

hdfs dfs -copyFromLocal /tmp/loan.csv /user/thenson/loan.csv

HDFS dfs copyToLocal – Copying data from HDFS to local file system? The same as using hadoop fs -copyToLocal. Use the hdfs dfs -copyToLocal [source] [destination].

hdfs dfs -copyToLocal /user/thenson/loan.csv /tmp/loan.csv

Hadoop Cluster to Cluster Copy

Distcp used in Hadoop – Need to copy data from one cluster to another? Use MapReduce’s distributed copy (DistCp) to move the data with a MapReduce job. In this example the original data exists on the namenode cluster in the /user/thenson directory and is being transferred to the newNameNode cluster. Make sure to use the full HDFS URL in the command. Command: hadoop distcp [source] [destination].
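A minimal sketch of that cluster-to-cluster copy follows. The host names namenode and newNameNode come from the description above; port 8020 is only an assumed NameNode RPC port, and the helper function name is hypothetical, so adjust both to your clusters.

```shell
# Sketch: copy /user/thenson from one cluster to another with DistCp.
# Assumptions: port 8020 is a guess at the NameNode RPC port.
SRC="hdfs://namenode:8020/user/thenson"
DEST="hdfs://newNameNode:8020/user/thenson"

copy_between_clusters() {
  # DistCp launches a MapReduce job whose map tasks copy files in parallel.
  hadoop distcp "$1" "$2"
}

# Run only where the hadoop CLI is installed.
if command -v hadoop >/dev/null 2>&1; then
  copy_between_clusters "$SRC" "$DEST"
fi
```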

It’s the Scale that Matters

While copying data is a simple matter in most applications, everything in Hadoop is more complicated because of the scale. When copying data in HDFS, make sure you understand the use case and scale, then choose one of the commands above.

Moving data around in Hadoop is easy when using the Hue interface, but learning to do it from the command line is harder.

In my Pluralsight course HDFS Getting Started you can get up to speed on moving data around from the Hadoop command line in under 3 hours. The first two modules of this course will get you up to speed on using Hadoop from the command line using the hdfs dfs commands.

Not Just Hadoop from the Command Line

Don’t stop at learning Hadoop from the command line. Let’s focus on the other Hadoop frameworks as well. The last few modules of this course focus on using the following Hadoop frameworks from the command line.

Hive & Pig

Two of my favorite Hadoop framework tools. Both of these tools allow administrators to write MapReduce jobs without having to write Java. After learning about Hadoop itself, Pig and Hive are two tools EVERY Hadoop Admin should know. Let’s break down each one.

Hive fills the space for structured data in Hadoop and acts similar to a database. Hive uses a syntax called HiveQL that is very similar to SQL. The closeness of Hive to SQL databases was intentional because most analysts know SQL, not Java.

Pig’s motto is that it eats anything, which means it processes unstructured, semi-structured, and structured data. Pig doesn’t care how the data is structured; it can process it. Pig uses a syntax called Pig Latin (insert Pig Latin joke here), which is less SQL-like than HiveQL. Pig Latin is also a procedural, or step-by-step, programming language.
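To make the contrast concrete, here is the same question ("how many large loans per grade?") sketched once as a single HiveQL statement and once as step-by-step Pig Latin. The Hive table name loans and the columns id, amount, and grade are hypothetical and are not taken from the Kaggle data set.

```shell
#!/usr/bin/env bash
# Sketch: one question asked in HiveQL (declarative) and Pig Latin (procedural).
# Assumptions: the table "loans" and the columns id, amount, grade are made up.

HIVE_QUERY="SELECT grade, COUNT(*) FROM loans WHERE amount > 10000 GROUP BY grade;"

write_pig_script() {
  # Each Pig Latin statement names one intermediate step of the pipeline.
  cat > "$1" <<'EOF'
loans    = LOAD '/user/thenson/loan.csv' USING PigStorage(',')
           AS (id:chararray, amount:double, grade:chararray);
big      = FILTER loans BY amount > 10000.0;
by_grade = GROUP big BY grade;
counts   = FOREACH by_grade GENERATE group AS grade, COUNT(big) AS total;
DUMP counts;
EOF
}

PIG_SCRIPT="${TMPDIR:-/tmp}/loans_by_grade.pig"
write_pig_script "$PIG_SCRIPT"

# Run only where the tools are installed.
if command -v hive >/dev/null 2>&1; then hive -e "$HIVE_QUERY"; fi
if command -v pig >/dev/null 2>&1; then pig "$PIG_SCRIPT"; fi
```

The Hive version states the result it wants in one statement, while the Pig version spells out load, filter, group, and count as separate steps.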

HBase

Learning to interact with HBase from the command line is a hot skill for Hadoop Admins. HBase is used when you need real-time reads and writes in Hadoop. Think very large data sets, like billions of rows and columns.

Configuring and setting up HBase is complicated, but in HDFS from the Command Line you will learn to set up a development environment to start using HBase. Configuration changes in HBase are all done from the command line.
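For a feel of what working with HBase from the command line looks like, the sketch below writes a few basic HBase shell commands to a file and hands that file to hbase shell. The table name loans, the column family info, and the row values are hypothetical, and this is not the setup used in the course.

```shell
#!/usr/bin/env bash
# Sketch: create a table, write one cell, and read it back in the HBase shell.
# Assumptions: table "loans", column family "info", and the values are made up.

write_hbase_script() {
  cat > "$1" <<'EOF'
create 'loans', 'info'
put 'loans', 'row1', 'info:amount', '5000'
get 'loans', 'row1'
scan 'loans'
exit
EOF
}

HBASE_SCRIPT="${TMPDIR:-/tmp}/loans_hbase.txt"
write_hbase_script "$HBASE_SCRIPT"

# Run only where the hbase CLI is installed; the trailing "exit" in the
# command file keeps the shell from staying open afterwards.
if command -v hbase >/dev/null 2>&1; then
  hbase shell "$HBASE_SCRIPT"
fi
```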

Sqoop

Hadoop is about unstructured data, but what about data that lives in relational databases? Sqoop allows Hadoop administrators to import and export data from traditional database systems. Offloading structured data from data warehouses is one of Hadoop’s biggest use cases. Hadoop allows DBAs to offload frozen data into Hadoop at roughly a tenth of the cost. The frozen or unused data can then be analyzed in Hadoop to bring about new insights. In my HDFS Getting Started course I walk through using Sqoop to import and export data in HDFS.
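As a rough sketch of what a Sqoop import and export look like, see below. The MySQL host dbhost, the database sales, the table loans, the user etl, and the helper function names are all hypothetical, and this is not the walkthrough from the course.

```shell
# Sketch: move one table between a relational database and HDFS with Sqoop.
# Assumptions: host "dbhost", database "sales", user "etl" are made up.
JDBC_URL="jdbc:mysql://dbhost/sales"

import_table() {
  # Import a relational table into an HDFS directory; -P prompts for the password.
  sqoop import --connect "$JDBC_URL" --username etl -P \
    --table "$1" --target-dir "$2"
}

export_table() {
  # Export an HDFS directory back into an existing relational table.
  sqoop export --connect "$JDBC_URL" --username etl -P \
    --table "$1" --export-dir "$2"
}

# Run only where the sqoop CLI is installed.
if command -v sqoop >/dev/null 2>&1; then
  import_table loans /user/thenson/loans
fi
```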

Stuck trying to manipulate a string in Hadoop and don’t want to use Java?

No problem. Use Pig’s built-in String Functions.

Why Pig for ETL?

Using Apache Pig in Hadoop is a must for ETL transactions. Pig allows developers to quickly write a Pig script to transform data in Hadoop. The String Functions ship with Pig, and learning them is a time saver for ETL. So whether you are trying to convert case in a string or use a regular expression to extract data, the Pig String Functions have you covered.
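Here is a small sketch of the two cases just mentioned, case conversion and regular-expression extraction, using Pig’s built-in UPPER, LOWER, and REGEX_EXTRACT functions. The column names id, purpose, and zip are hypothetical, and this is not the code from the video series.

```shell
#!/usr/bin/env bash
# Sketch: Pig built-in string functions applied to a CSV file.
# Assumptions: the columns id, purpose, and zip are made up for illustration.

write_string_script() {
  cat > "$1" <<'EOF'
loans   = LOAD '/user/thenson/loan.csv' USING PigStorage(',')
          AS (id:chararray, purpose:chararray, zip:chararray);
cleaned = FOREACH loans GENERATE
            id,
            UPPER(purpose) AS purpose_upper,
            LOWER(purpose) AS purpose_lower,
            REGEX_EXTRACT(zip, '^([0-9]{3})', 1) AS zip_prefix;
DUMP cleaned;
EOF
}

STRING_SCRIPT="${TMPDIR:-/tmp}/string_functions.pig"
write_string_script "$STRING_SCRIPT"

# Run only where the pig CLI is installed.
if command -v pig >/dev/null 2>&1; then
  pig "$STRING_SCRIPT"
fi
```

REGEX_EXTRACT returns the capture group selected by its third argument, so index 1 here picks out the first three digits of the zip value.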

What’s Covered?

In this series I will walk through the String Functions in quick 5-minute tutorials, broken down by function. Each video builds off the previous function, but it’s not essential to watch them in order. I wanted each video to be able to stand alone as a quick reference for each String Function.

All the source code and files can be found on my Pig Example Github page. So you can follow along through the tutorial or grab the code after watching. Feel free to use and abuse the code. As a developer sometimes it’s easier to have something to start with rather than a blank screen.

If you already have your Hadoop development environment then you are ready to start.

If you are just starting out with Hadoop and Pig, you might want to start here to learn about Pig. I’ve written a lot of posts and published a couple of videos on getting started with Pig Latin, so you’ll want to be familiar with those as you step through this series.