20 Big Data & Hadoop Questions to excel in your Interview

If you are attending a Big Data Hadoop interview, here are the top 20 questions and answers you should be prepared with to excel.

1. What is the difference between Big Data and Hadoop?

Big Data is an umbrella term for data sets that are too large or complex for traditional tools, along with the set of technologies used to handle them. Hadoop is one such Big Data tool (or framework).

2. How is indexing done in HDFS?

HDFS has its own way of indexing. Once data is stored in blocks of the configured block size, HDFS keeps, with the last part of each piece of data, a pointer indicating where the next part of the data resides.

3. What happens to a NameNode that has no data?

A NameNode without data does not exist. If a node is a NameNode, it holds some data, namely the metadata about the files stored in the cluster.

4. What is the best quality of Hadoop if we want to use it for File storage purpose?

Hadoop does not require us to define a schema before storing data on it. We can simply dump a large number of files onto Hadoop and define a schema only when we want to read the data (schema-on-read). Data Lakes are built on this concept.

5. Explain how ‘map’ and ‘reduce’ work.

The framework takes the input, divides it into splits, and assigns them to map tasks on the data nodes. Each data node processes the tasks assigned to it, produces key-value pairs, and returns this intermediate output to the Reducer. The reducer collects the key-value pairs from all the data nodes, combines the values for each key, and generates the final output.
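The flow above can be sketched in plain Java, without any Hadoop dependencies, as a tiny word-count job (the class and method names here are illustrative, not Hadoop's real API):

```java
import java.util.*;

// A minimal, Hadoop-free sketch of the map/reduce flow: a map phase that
// emits (word, 1) pairs, and a reduce phase that sums values per key.
public class MapReduceSketch {

    // Map phase: each input line is turned into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Reduce phase: pairs are grouped by key and their values are summed.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {big=2, data=1, hadoop=1}
        System.out.println(wordCount(Arrays.asList("big data", "big hadoop")));
    }
}
```

In real Hadoop, the map outputs are additionally shuffled and sorted across the network before reaching the reducers; this sketch only shows the logical key-value flow.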

6. What is DistributedCache in Hadoop?

If you want some files to be shared across all nodes in the Hadoop cluster, you can distribute them via the DistributedCache. The DistributedCache is configured in the Job Configuration, and it makes the specified read-only files available on every node in the cluster.

7. How do you make a large cluster smaller by taking out some of the nodes?

Hadoop offers the decommission feature to retire a set of existing data-nodes. The nodes to be retired should be included into the exclude file, and the exclude file name should be specified as a configuration parameter dfs.hosts.exclude.

The decommission process can be terminated at any time by editing the configuration or the exclude files and repeating the -refreshNodes command.
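As a sketch, the relevant hdfs-site.xml property might look like this (the file path shown is a hypothetical example):

```xml
<!-- hdfs-site.xml: point dfs.hosts.exclude at the exclude file -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/dfs.exclude</value>
</property>
```

The exclude file lists one hostname per line; after adding the nodes to be retired to it, running hdfs dfsadmin -refreshNodes starts the decommission.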

8. What are the methods in Mapper Interface?

The Mapper contains the run() method, which calls its own setup() method once, then calls map() for each input record, and finally calls the cleanup() method. All of these methods can be overridden in our code.
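The lifecycle can be demonstrated without Hadoop by a plain Java class that mirrors the contract (the class below is an illustrative stand-in, not the real org.apache.hadoop.mapreduce.Mapper):

```java
import java.util.*;

// Hadoop-free sketch of the Mapper lifecycle: run() calls setup() once,
// then map() once per record, then cleanup() once.
public class MapperLifecycle {
    static List<String> calls = new ArrayList<>();

    void setup()            { calls.add("setup"); }
    void map(String record) { calls.add("map:" + record); }
    void cleanup()          { calls.add("cleanup"); }

    // Mirrors the structure of Mapper.run() described above.
    void run(List<String> records) {
        setup();
        for (String r : records) map(r);
        cleanup();
    }

    public static void main(String[] args) {
        new MapperLifecycle().run(Arrays.asList("a", "b"));
        // prints [setup, map:a, map:b, cleanup]
        System.out.println(calls);
    }
}
```

Overriding any of setup(), map(), or cleanup() in a subclass changes the behavior at that point of the lifecycle without altering the overall run() sequence.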

9. What is speculative execution in Hadoop?

If a node appears to be running slowly, the master node can redundantly execute another instance of the same task on a different node, and the output of whichever instance finishes first is taken. This process is called speculative execution.

10. What is the difference between HIVE, PIG and MapReduce Java Programs?

HIVE provides a query language similar to SQL with which we can query the set of files stored on HDFS. PIG provides a scripting language that can be used to transform data. Both HIVE and PIG scripts are converted into Java MapReduce programs before they are submitted to Hadoop for processing. Java MapReduce programs can be written directly to customize input formats or to use specific functions available in Java.

11. What is the difference between ORDER BY and SORT BY in Hive?

ORDER BY performs a total ordering of the query result set. This means that all the data is passed through a single reducer, which may take an unacceptably long time to execute for larger data sets.

SORT BY orders the data only within each reducer, thereby performing a local ordering, where each reducer’s output will be sorted. You will not achieve a total ordering on the dataset; total ordering is traded away for better performance.
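A sketch of the two clauses side by side (the table and column names are hypothetical):

```sql
-- Total ordering: all rows flow through a single reducer.
SELECT rep_name, sales_volume FROM salesrep ORDER BY sales_volume DESC;

-- Local ordering: each reducer's output is sorted independently,
-- so the overall result is not globally sorted.
SELECT rep_name, sales_volume FROM salesrep SORT BY sales_volume DESC;
```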

12. Assume your company has a sales table with entries from salespeople around the globe. How do you rank each salesperson by country based on their sales volume in Hive?

Hive supports several analytic functions, and one of them, RANK(), is designed for exactly this operation.

Look up details on other windowing and analytic functions at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics

Illustration:

hive> SELECT rep_name, rep_country, sales_volume,
      RANK() OVER (PARTITION BY rep_country ORDER BY sales_volume DESC) AS sales_rank
      FROM salesrep;

13. What is the difference between an InputSplit and a Block?

A Block is a physical division of data and does not take into account the logical boundaries of records, meaning you could have a record that starts in one block and ends in another. An InputSplit, on the other hand, respects the logical boundaries of records as well.

14. What is the process to perform an incremental data load in Sqoop?

The purpose of an incremental data load in Sqoop is to synchronize modified or updated data (often referred to as delta data) from an RDBMS to Hadoop.

This delta data is brought over using Sqoop's incremental import support.

An incremental load can be performed using the Sqoop import command or by loading the data into Hive without overwriting it. The attributes that need to be specified during an incremental load in Sqoop are:

Mode (--incremental) – The mode defines how Sqoop determines which rows are new. Valid values are append and lastmodified.

Col (--check-column) – This attribute specifies the column that is examined to find the rows to be imported.

Value (--last-value) – This denotes the maximum value of the check column from the previous import operation.
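Putting the three attributes together, an incremental append import might look like this (the connection string, table, and column names are hypothetical):

```shell
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user \
  --table orders \
  --incremental append \
  --check-column order_id \
  --last-value 100000
```

Only rows whose order_id exceeds 100000 would be imported; Sqoop prints the new last-value at the end of the run for use in the next import.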

15. Explain about the different channel types in Flume. Which channel type is faster?

The three built-in channel types available in Flume are:

MEMORY Channel – Events are read from the source into memory and passed to the sink.

JDBC Channel – Events are stored in an embedded Derby database, which makes the channel durable and recoverable.

FILE Channel – The File Channel writes the contents to a file on the file system after reading the event from a source. The file is deleted only after the contents are successfully delivered to the sink.

The MEMORY Channel is the fastest of the three; however, it carries the risk of data loss if the agent fails. The channel you choose depends entirely on the nature of the big data application and the value of each event.
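The channel type is selected in the agent's properties file; a minimal sketch (the agent and component names here are hypothetical):

```properties
# Hypothetical agent "a1": the channel implementation is chosen here.
a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
# For durability at the cost of speed, use instead:
# a1.channels.c1.type = file
```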

16. In what modes can Hadoop be run?

Hadoop can run in three modes:

Standalone Mode: The default mode of Hadoop, it uses the local file system for input and output operations. This mode is mainly used for debugging purposes, and it does not support the use of HDFS. Further, in this mode no custom configuration is required for the mapred-site.xml, core-site.xml, and hdfs-site.xml files. It is much faster than the other modes.

Pseudo-Distributed Mode (Single-Node Cluster): Here you need configuration for all three files mentioned above. All daemons run on a single node, so the Master and Slave nodes are the same.

Fully Distributed Mode (Multi-Node Cluster): This is the production phase of Hadoop (what Hadoop is known for), where data is distributed across several nodes of a Hadoop cluster. Separate nodes are allotted as Master and Slaves.
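For instance, a minimal pseudo-distributed setup typically includes properties like the following (the values shown are common conventions, not taken from this article):

```xml
<!-- core-site.xml: point the default file system at a local HDFS daemon -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>

<!-- hdfs-site.xml: a single node can hold only one replica of each block -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```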

17. How to compress mapper output but not the reducer output?

To achieve this compression, you should set:

conf.set("mapreduce.map.output.compress", "true");

conf.set("mapreduce.output.fileoutputformat.compress", "false");

18. What is the default input type/format in MapReduce?

By default, the input format in MapReduce is TextInputFormat, which treats each line of the input as a record.

19. What happens when two clients try to access the same file on the HDFS?

HDFS supports exclusive writes only.

When the first client contacts the Namenode to open the file for writing, the Namenode grants a lease to the client to create this file. When the second client tries to open the same file for writing, the Namenode will notice that the lease for the file is already granted to another client, and will reject the open request for the second client.

20. How do reducers communicate with each other?

This is another tricky question. The MapReduce programming model does not allow reducers to communicate with each other. Reducers run in isolation.
