YARN

Distributed Cache

allows you to share common read-only data (files) with all the nodes that run a job's tasks.

normally a properties file.

the old DistributedCache class is deprecated in newer versions of hadoop; use the Job.addCacheFile() API shown below instead.

// In the driver (imports: org.apache.hadoop.conf.Configuration, org.apache.hadoop.mapreduce.Job, java.net.URI)
Configuration conf = getConf();
Job job = Job.getInstance(conf, "CountJob");
job.setMapperClass(CountMapper.class);
// ...
// note the # sign after the file location:
// the name after the # becomes the symlink name we use for the file in the Mapper/Reducer.
// (new URI(...) throws URISyntaxException, so the driver's run/main should declare or handle it)
job.addCacheFile(new URI("/user/name/cache/some_file.json#some"));
job.addCacheFile(new URI("/user/name/cache/other_file.json#other"));

// In the mapper, override the setup method to pick up the cached files.
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    if (context.getCacheFiles() != null && context.getCacheFiles().length > 0) {
        // the cached files appear in the task's working directory under the symlink names chosen above
        File some_file = new File("./some");
        File other_file = new File("./other");
        // do something with them, e.g. read or parse them as JSON
    }
}

Counters

counters in hadoop can be used for debugging hadoop code.

Each mapper/reducer task runs with its own copy of the configuration, so there is no global variable you can update across tasks; counters fill that gap.

a feature provided by the hadoop framework that lets tasks increment named values which are aggregated globally across the whole job.

can be used to count how many records were processed successfully/unsuccessfully, etc

// create an enum to be used as the counter group
public static enum RECORD_COUNTER {
    SESSION_ID_COUNTER,
    USER_ID_COUNTER
};

// in mapper or reducer code
context.getCounter(RECORD_COUNTER.USER_ID_COUNTER).increment(1);

// in the driver: print the counter value after the job has finished
// (imports: org.apache.hadoop.mapreduce.Counters, org.apache.hadoop.mapreduce.Counter)
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "CountJob");
// ... set mapper/reducer, input/output paths ...
boolean result = job.waitForCompletion(true);
// job finished, get the counters
Counters counters = job.getCounters();
// look up our counter by its enum constant and print its value
Counter userIdCounter = counters.findCounter(RECORD_COUNTER.USER_ID_COUNTER);
System.out.println(userIdCounter.getDisplayName() + ":" + userIdCounter.getValue());

Sqoop

used to transfer bulk data from an RDBMS to HDFS (import) and from HDFS back to an RDBMS (export).
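For example, a basic import and export might look like the following; the JDBC URL, credentials, table names, and HDFS paths are placeholders, not values from these notes.

# pull a table from the database into HDFS
sqoop import --connect jdbc:mysql://dbhost/mydb --username dbuser -P --table orders --target-dir /user/name/orders

# push HDFS data back into a database table
sqoop export --connect jdbc:mysql://dbhost/mydb --username dbuser -P --table orders_summary --export-dir /user/name/orders_summary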

Misc

hadoop tries to rerun a failed task a few (default of 4) times before marking it as failed, which fails the job and reports the error.
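The per-task retry limit can be tuned from the driver; a minimal sketch, assuming Hadoop 2.x property names (older releases used mapred.map.max.attempts / mapred.reduce.max.attempts):

// allow up to 4 attempts per map/reduce task before the task (and hence the job) fails
Configuration conf = getConf();
conf.setInt("mapreduce.map.maxattempts", 4);
conf.setInt("mapreduce.reduce.maxattempts", 4);
Job job = Job.getInstance(conf, "CountJob");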

to add a node to a hadoop cluster, add its hostname to the hosts/include file referenced by the cluster configuration and run "hadoop dfsadmin -refreshNodes" ("hdfs dfsadmin -refreshNodes" on newer releases) so the NameNode picks up the change.
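As a sketch, assuming the NameNode is configured with an include file via the dfs.hosts property in hdfs-site.xml (the file path here is an assumption):

<!-- hdfs-site.xml: file listing the hostnames allowed to join as DataNodes -->
<property>
  <name>dfs.hosts</name>
  <value>/etc/hadoop/conf/dfs.include</value>
</property>

Add the new host's name to that file, then run the refresh command so the NameNode re-reads it.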

WebDAV is an extension that allows HDFS files to be browsed as if they were on a local filesystem.

By default, data in HDFS is replicated with a factor of 3: under the default rack-aware placement policy, two copies end up on nodes in the same rack and the third copy on a node in another rack.
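The replication factor can be changed cluster-wide in hdfs-site.xml or per path from the command line; a small sketch (the path is a placeholder):

<!-- hdfs-site.xml: default replication factor for new files -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>

hadoop fs -setrep -w 2 /user/name/some_dir    # change replication for an existing path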

Map reduce is normally used to do distributed data processing on a cluster of computers.

Initially, the master node splits the problem into smaller sub-problems and distributes them to worker nodes, like a tree structure.

the sub-tasks get processed and their results are passed back up the levels.

The master node then collects the answers to the smaller problems and combines them into the answer to the original problem (word count, sketched below, is the classic example).
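In word count, each mapper emits (word, 1) pairs for its slice of the input and the reducers sum the counts per word. A minimal sketch; the class names and whitespace tokenization are my own, not from these notes:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // map step: split each input line into words and emit (word, 1) for every word
    public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // reduce step: sum the partial counts gathered from all mappers for each word
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}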

FSImage: a snapshot of the file system metadata; contains file names, permissions, and the blocks that make up each file (block-to-DataNode locations are not persisted here; they are reported by the DataNodes at runtime).

EditLog: maintains a log of every change to the file system metadata; it is periodically merged into the FSImage during checkpointing.