6. How does the distributed cache work in the MapReduce framework?

When a MapReduce job is submitted with distributed cache options, the node managers copy the files specified by the -files, -archives and -libjars options from the distributed cache to a local disk before the job's tasks run on them. The files are said to be localized at this point.

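In classic MapReduce (MRv1), the local.cache.size property sets the size of the cache on the local disk, and files are localized under the ${hadoop.tmp.dir}/mapred/local directory. On YARN node managers, the corresponding settings are yarn.nodemanager.localizer.cache.target-size-mb and yarn.nodemanager.local-dirs.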
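As an illustration of how a task can use a localized file, here is a minimal sketch of a Hadoop Streaming style mapper in Python. It assumes a hypothetical tab-separated file named lookup.txt was shipped with -files, so that it appears in the task's working directory; the file name, its format and the function names are assumptions for this example, not part of the original answer.

```python
import sys


def load_lookup(path):
    """Read tab-separated key/value pairs from a localized file into a dict."""
    table = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            key, _, value = line.partition("\t")
            table[key] = value
    return table


def map_lines(lines, lookup):
    """Yield each non-empty input key joined with its looked-up value."""
    for line in lines:
        key = line.strip()
        if key:
            yield f"{key}\t{lookup.get(key, 'UNKNOWN')}"


def main():
    # 'lookup.txt' is a hypothetical name: the file passed with -files.
    # It is opened from the current directory, where it has been localized.
    lookup = load_lookup("lookup.txt")
    for record in map_lines(sys.stdin, lookup):
        print(record)
```

When the script is used as a streaming mapper, main() would be called at the bottom of the file. The lookup table is loaded once per task from local disk, not fetched from HDFS for every record.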

7. What will Hadoop do when one task fails out of, say, 50 spawned tasks?

It will rerun the failed map or reduce task, preferably on a different node manager. Only if the same task fails 4 times (the default maximum number of attempts) does Hadoop mark the whole job as failed. The maximum number of attempts for map tasks and reduce tasks can be configured with the properties below in the mapred-site.xml file.

mapreduce.map.maxattempts

mapreduce.reduce.maxattempts

The default value of both properties is 4.
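The retry rule can be pictured with a small simulation. This is only a sketch of the policy described above, not Hadoop code; run_with_retries and the two sample tasks are hypothetical names made up for the illustration.

```python
def run_with_retries(attempt_fn, max_attempts=4):
    """Call attempt_fn(attempt_number) until it succeeds or attempts run out.

    Returns a (succeeded, attempts_used) tuple. max_attempts plays the role
    of mapreduce.map.maxattempts / mapreduce.reduce.maxattempts.
    """
    for attempt in range(1, max_attempts + 1):
        if attempt_fn(attempt):
            return True, attempt
    # Every allowed attempt failed: this is where the job would be failed.
    return False, max_attempts


def flaky(attempt):
    """A task that fails twice and succeeds on its third attempt."""
    return attempt >= 3


def broken(attempt):
    """A task that never succeeds."""
    return False


print(run_with_retries(flaky))   # (True, 3)
print(run_with_retries(broken))  # (False, 4)
```

With the default of 4, a task that keeps failing is tried four times in total before the job is given up on.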

8. Consider this scenario: in a MapReduce system, the HDFS block size is 256 MB and we have 3 files of size 256 KB, 266 MB and 500 MB. How many input splits will the Hadoop framework create?
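One way to work this out is sketched below. It assumes the files are splittable, that the job uses FileInputFormat, and that the split size equals the 256 MB block size (the default minimum and maximum split settings). FileInputFormat applies a slop factor of 1.1, so a final piece that is at most 10% larger than the split size is kept as one split. The helper name count_splits is made up for this example.

```python
KB = 1024
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # slop factor applied by FileInputFormat when cutting splits


def count_splits(file_size, split_size, slop=SPLIT_SLOP):
    """Count the input splits for one splittable file of positive size."""
    if file_size <= 0:
        raise ValueError("file_size must be positive in this sketch")
    splits = 0
    remaining = file_size
    # Cut full-size splits while the remainder is clearly larger than one split.
    while remaining / split_size > slop:
        splits += 1
        remaining -= split_size
    # Whatever is left becomes the last split.
    if remaining > 0:
        splits += 1
    return splits


files = [256 * KB, 266 * MB, 500 * MB]
block = 256 * MB

with_slop = [count_splits(size, block) for size in files]
per_block = [count_splits(size, block, slop=1.0) for size in files]

print(with_slop, sum(with_slop))  # [1, 1, 2] 4
print(per_block, sum(per_block))  # [1, 2, 2] 5
```

Under these assumptions the 266 MB file stays in a single split because 266/256 is below 1.1, giving 4 splits in total. Counting one split per HDFS block, which is the answer often expected in interviews, gives 5.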

9. Why can’t we just keep the file in HDFS and have the application read it, instead of using the distributed cache?

The distributed cache copies the file once to each node manager that runs tasks for the job, before those tasks start. If a node manager then runs 10 or 50 map or reduce tasks, they all use the same local copy from the distributed cache.

On the other hand, if the job reads the file directly from HDFS, every map or reduce task accesses it separately, so a node manager that runs 100 map tasks reads the file from HDFS 100 times. Reading the file from the node manager’s local file system is much faster than reading it from HDFS data nodes.
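The difference is easy to put into numbers. The sketch below uses a hypothetical cluster of 10 node managers, each running 100 tasks; both figures and the function names are assumptions chosen only for illustration.

```python
def hdfs_reads_without_cache(nodes, tasks_per_node):
    """Every task reads the file from HDFS on its own."""
    return nodes * tasks_per_node


def hdfs_reads_with_cache(nodes):
    """The file is localized once per node and shared by all tasks there."""
    return nodes


nodes, tasks_per_node = 10, 100  # hypothetical cluster
print(hdfs_reads_without_cache(nodes, tasks_per_node))  # 1000
print(hdfs_reads_with_cache(nodes))                     # 10
```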

10. What mechanism does the Hadoop framework provide to synchronize changes made to the distributed cache while the application is running?

The distributed cache only copies read-only data needed by a MapReduce job; it is not meant for files that can be updated. Since changes to distributed cache files are not allowed, there is no mechanism to synchronize them.

A few more Hadoop MapReduce interview questions and answers for experienced developers will be published in upcoming posts in this category.
