Archive

HDFS, the Hadoop Distributed File System, is the distributed file system provided by the Hadoop Big Data platform. Its primary objective is to store data reliably even when nodes in the cluster fail. It achieves this by replicating data across different racks in the cluster infrastructure. Files stored in HDFS are then used for further processing by engines such as Hadoop MapReduce, Hive, Spark, Impala and Pig.
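
Replication buys reliability at the cost of raw disk space. The sketch below estimates how much data fits in a cluster once every block is stored several times. It assumes a replication factor of 3 (the usual HDFS default for dfs.replication) and ignores space reserved for non-HDFS use; the node count and disk size are made-up numbers.

```python
# Illustrative only: the disk size and node count below are made up.
def usable_capacity_tb(raw_tb_per_node, node_count, replication_factor=3):
    """Return the logical data volume that fits when every block
    is stored `replication_factor` times across the cluster."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return raw_tb_per_node * node_count / replication_factor


capacity = usable_capacity_tb(raw_tb_per_node=12, node_count=10)
print(capacity)  # 40.0
```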

–> Here we will talk about the different file formats commonly used with HDFS:

1. Text (CSV, TSV, JSON): These are flat file formats that can be used as storage formats in the Hadoop system. However, these formats do not carry a schema of their own, so a developer using any processing engine has to apply a schema while reading them.

2. Parquet: a column-oriented file format in the Hadoop ecosystem. Parquet stores data in binary form, column by column, which brings the following benefits:
– Less storage: values of the same data type sit next to each other, so the data compresses more efficiently.
– Better query performance: a query reads only the columns it needs, so entire rows do not have to be loaded into memory.

Parquet can be used with most engines in the Hadoop ecosystem, such as Hive, Impala, Pig and Spark.

3. ORC: stands for Optimized Row Columnar, a column-oriented storage format. ORC is primarily used in the Hive world and gives better performance for Hive-based data retrieval because Hive has a vectorized ORC reader. The schema is self-contained in the file as part of the footer. Because of its column-oriented nature, it provides a better compression ratio and faster reads.

4. Avro: a row-oriented storage format that is a good fit for write-heavy applications. The schema is stored within the file in JSON form, which helps with efficient schema evolution.
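
To make the row-oriented versus column-oriented difference concrete, here is a toy model in plain Python. It is not how Parquet, ORC or Avro lay out bytes on disk; the records, field names and the simple run-length encoding are assumptions chosen only to show why adjacent values of one column compress well and why a single-column query can skip the rest of each row.

```python
from itertools import groupby

# Hypothetical records; the field names are chosen only for illustration.
rows = [
    {"country": "IN", "status": "ok", "amount": 10},
    {"country": "IN", "status": "ok", "amount": 12},
    {"country": "IN", "status": "ok", "amount": 15},
    {"country": "US", "status": "ok", "amount": 11},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [record[name] for record in records] for name in records[0]}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows)

# Same-typed, repetitive values sit side by side, so they shrink well.
encoded_country = run_length_encode(columns["country"])

# An aggregate over one column never touches "country" or "status".
total_amount = sum(columns["amount"])

print(encoded_country)  # [('IN', 3), ('US', 1)]
print(total_amount)     # 48
```

Appending a new record is the opposite case: in the row layout it is one append to `rows`, while the column layout has to touch every column list, which is one reason a row format suits write-heavy workloads.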

–> Now, let's take a deep dive and look at these file formats through a series of videos below:

Author/Speaker Bio: Viresh Kumar is a v-blogger and an expert in the Big Data, Hadoop and Cloud world. He has ~14 years of experience in the Data Platform industry.

Despite plenty of opportunities for Hadoop professionals, getting a good job can seem daunting. Cracking the Hadoop Admin interview is a challenge, and you must prepare for it. At Koenig Solutions, candidates not only acquire Hadoop administration certification, but also get to prepare for the interview to start a challenging yet lucrative career.

Q1. What daemons are required to run a Hadoop cluster?
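A. In Hadoop 1.x, the NameNode, DataNode, JobTracker and TaskTracker daemons are required. In Hadoop 2 and later, YARN's ResourceManager and NodeManager take the place of JobTracker and TaskTracker.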

Q2. How would you restart a NameNode?
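A. The easiest way is to run the stop-all.sh shell script, which stops all running daemons, and then run start-all.sh, which starts them again, including the NameNode. To restart only the NameNode, hadoop-daemon.sh stop namenode followed by hadoop-daemon.sh start namenode can be used.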

Q3. What are the different schedulers available in Hadoop?
A. a. COSHH: Considers workload, cluster and user heterogeneity when making scheduling decisions.
b. FIFO Scheduler: Does not consider heterogeneity; it orders jobs by their arrival time in the queue.
c. Fair Sharing: Defines a pool for each user. Users run their jobs in their own pools.

Q5. What is the purpose of the jps command?
A. It is used to check whether the Hadoop daemons are running. jps lists the Java processes on the machine, so its output shows which of DataNode, NameNode, Secondary NameNode, JobTracker and TaskTracker are currently up.
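
As a rough sketch of how that check could be automated, the snippet below scans jps-style text (one "<pid> <class name>" pair per line) and reports which expected daemons are absent. The sample output and its process IDs are invented for illustration, and the expected set uses the Hadoop 1.x daemon names mentioned in the answer.

```python
# Sample text in the "<pid> <class name>" shape that jps prints.
# The process IDs are invented for illustration.
SAMPLE_JPS_OUTPUT = """\
4821 NameNode
4933 DataNode
5102 SecondaryNameNode
5377 Jps
"""

# Hadoop 1.x daemon names, as mentioned in the answer.
EXPECTED_DAEMONS = {
    "NameNode",
    "DataNode",
    "SecondaryNameNode",
    "JobTracker",
    "TaskTracker",
}


def running_processes(jps_output):
    """Return the set of process names found in jps-style text."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            names.add(parts[1])
    return names


def missing_daemons(jps_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemons that do not appear in the output, sorted."""
    return sorted(set(expected) - running_processes(jps_output))


missing = missing_daemons(SAMPLE_JPS_OUTPUT)
print(missing)  # ['JobTracker', 'TaskTracker']
```

Keep in mind that jps only sees JVMs on the machine where it runs, so on a multi-node cluster each host would need its own check.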

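Q6. How many NameNodes can run on a single Hadoop cluster?
A. Only one in Hadoop 1.x. Hadoop 2 and later also allow a standby NameNode for high availability, and multiple NameNodes serving separate namespaces through HDFS Federation.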

Q7. What will happen when the NameNode on the Hadoop cluster is down?
A. Whenever the NameNode is down, the file system goes offline.

Q8. Detail the crucial hardware considerations when deploying Hadoop in a production environment.
A. Operating System: A 64-bit operating system.
Capacity: Larger form factor (3.5”) disks allow more storage and cost less.
Network: Two TOR (top-of-rack) switches per rack for better redundancy.
Storage: To achieve high performance and scalability, design the Hadoop platform so that compute moves to the data.
Memory: Memory requirements vary based on the application.
Computational Capacity: Can be determined by the total count of MapReduce slots across the nodes in the Hadoop cluster.
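
The slot count in the last point is simple arithmetic. Here is a minimal sketch, assuming a hypothetical three-node cluster where each node is configured with a fixed number of map and reduce slots (the node names and slot counts are made up):

```python
# Hypothetical cluster: (map_slots, reduce_slots) per worker node.
nodes = {
    "worker-1": (8, 4),
    "worker-2": (8, 4),
    "worker-3": (12, 6),
}


def total_slots(cluster):
    """Sum map and reduce slots across all nodes in the cluster."""
    map_total = sum(map_count for map_count, _ in cluster.values())
    reduce_total = sum(reduce_count for _, reduce_count in cluster.values())
    return map_total, reduce_total


map_slots, reduce_slots = total_slots(nodes)
print(map_slots, reduce_slots)  # 28 14
```

With these numbers the cluster can run at most 28 map tasks and 14 reduce tasks at the same time.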

Q9. Which command will you use to determine if the HDFS (Hadoop Distributed File System) is corrupt?
A. The hadoop fsck (File System Check) command, for example hadoop fsck / to check the whole file system.

Q10. How can a Hadoop job be killed?
A. Using the command: hadoop job -kill jobID.

Q11. Can files be copied across multiple clusters? If yes, how?
A. Yes, using distributed copy. The DistCp command (hadoop distcp) can be used for intra-cluster or inter-cluster copying.
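
For illustration, a basic DistCp call takes a source URI and a target URI. The sketch below only builds and prints such a command line; the NameNode host names, port and paths are placeholders, not values from a real cluster.

```python
import shlex


def distcp_command(source_uri, target_uri):
    """Return the argument list for a basic DistCp invocation."""
    return ["hadoop", "distcp", source_uri, target_uri]


# Hypothetical NameNode hosts and paths.
cmd = distcp_command(
    "hdfs://nn1.example.com:8020/data/logs",
    "hdfs://nn2.example.com:8020/backup/logs",
)
print(shlex.join(cmd))
```

On a machine with the Hadoop client installed, the same list could be passed to `subprocess.run(cmd, check=True)` to actually start the copy.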

Q12. Recommend the best Operating System to run Hadoop.
A. Linux is the best choice, with Ubuntu being a common distribution. Although Windows can be used, it can lead to several problems.

Q13. How often should the NameNode be reformatted?
A. Never, as it can lead to complete data loss. It is formatted only once, in the beginning.

Q14. What are Hadoop configuration files and where are they located?
A. Hadoop has three main configuration files: mapred-site.xml, hdfs-site.xml and core-site.xml. They are located in the “conf” subdirectory (etc/hadoop in Hadoop 2 and later).
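
All of these files share the same layout: a configuration element holding property entries, each with a name and a value. As a small sketch, the snippet below reads such a file into a dictionary. The sample content is hypothetical (placeholder host, port and directory), although fs.defaultFS and hadoop.tmp.dir are real property names.

```python
import xml.etree.ElementTree as ET

# Hypothetical core-site.xml content; host, port and path are placeholders.
SAMPLE_CORE_SITE = """\
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/hadoop/tmp</value>
  </property>
</configuration>
"""


def parse_site_xml(xml_text):
    """Return a {name: value} dict for every <property> in a *-site.xml document."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }


settings = parse_site_xml(SAMPLE_CORE_SITE)
print(settings["fs.defaultFS"])  # hdfs://namenode.example.com:8020
```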

Author Bio: Michael Warne is a tech blogger and an expert in Hadoop certification training. He has 5 years of experience in the Hadoop industry and has worked as a certified Hadoop professional for top-notch IT companies.