12 Lifecycle of a MapReduce Job
Map function
Reduce function
Run this program as a MapReduce job
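To make the lifecycle concrete, here is a minimal pure-Python sketch of what the framework does with the map and reduce functions: split the input, run the map function on each record, shuffle (group) intermediate pairs by key, then run the reduce function once per key. All names here (`map_fn`, `reduce_fn`, `run_job`) are illustrative, not Hadoop API.

```python
# Pure-Python simulation of the MapReduce word-count lifecycle.
# Names are illustrative only; this is not the Hadoop API.
from collections import defaultdict

def map_fn(line):
    # Emit a (word, 1) pair for every word in one input line.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Sum all counts shuffled to this key.
    return (word, sum(counts))

def run_job(splits):
    # "Map wave": apply map_fn to every record of every input split.
    intermediate = defaultdict(list)
    for split in splits:
        for line in split:
            for key, value in map_fn(line):
                intermediate[key].append(value)  # shuffle: group by key
    # "Reduce wave": one reduce_fn call per distinct key.
    return dict(reduce_fn(k, v) for k, v in intermediate.items())

result = run_job([["the quick brown fox"], ["the lazy dog the end"]])
```

In the real framework the map and reduce waves run in parallel across the cluster; this sketch only shows the data flow.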

14 Lifecycle of a MapReduce Job
[Figure: job timeline — input splits feed Map Wave 1 and Map Wave 2, followed by Reduce Wave 1 and Reduce Wave 2]
Industry-wide it is recognized that to manage the complexity of today's systems, we need to make systems self-managing. IBM's autonomic computing, Microsoft's DSI, and Intel's proactive computing are some of the major efforts in this direction.
How are the number of splits, the number of map and reduce tasks, memory allocation to tasks, etc., determined?

15 Job Configuration Parameters
190+ parameters in Hadoop
Set manually, or defaults are used
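A few of those parameters directly answer the question on the previous slide. A sketch of a `mapred-site.xml` fragment, with property names from Hadoop 2.x and purely illustrative values:

```xml
<configuration>
  <!-- Number of reduce tasks per job (map tasks are driven by input splits) -->
  <property>
    <name>mapreduce.job.reduces</name>
    <value>4</value>
  </property>
  <!-- Memory allocated to each map / reduce task container, in MB -->
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>2048</value>
  </property>
</configuration>
```

The number of map tasks is not set directly: it follows from the number of input splits, which in turn depends on file sizes and the configured split-size bounds.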

17 PIG
One frequent complaint about MapReduce is that it is difficult to program and that the development cycle is very long.
When you implement a program in MapReduce, you have to think at the level of mapper and reducer functions and job chaining.
Pig started as a research project within Yahoo! in the summer of 2006 and joined the Apache Incubator in September 2007.
Pig is a dataflow programming environment for processing very large files; its language is called Pig Latin.
Pig is a Hadoop extension that simplifies Hadoop programming by giving you a high-level data processing language while keeping Hadoop's simple scalability and reliability.
Yahoo! runs 40% of all its Hadoop jobs with Pig; Twitter also uses Pig.
Indeed, Pig was created at Yahoo! to make it easier for researchers and engineers to mine the huge datasets there.

18 PIG: What it looks like
[Figure: an annotated Pig Latin LOAD statement]
LOAD reads a data file into a relation, with a defined schema.
The name on the left-hand side of the statement is a relation, not a variable.
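A small sketch of the statement the slide annotates; the file name, delimiter, and schema are assumptions for illustration:

```pig
-- 'users.txt' and its schema are hypothetical; LOAD yields a relation, not a variable
users  = LOAD 'users.txt' USING PigStorage(',') AS (name:chararray, age:int);
adults = FILTER users BY age >= 18;   -- each statement defines a new relation
```

Each line names a new relation; Pig only compiles the dataflow into MapReduce jobs when an output statement such as STORE or DUMP is reached.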

19 Word count example in PIG

```pig
text      = LOAD 'text' USING TextLoader();   -- loads each line as one column
tokens    = FOREACH text GENERATE FLATTEN(TOKENIZE($0)) AS word;
wordcount = FOREACH (GROUP tokens BY word)
            GENERATE group AS word, COUNT_STAR($1);
```

[Figure: the Pig job is transformed into MapReduce jobs, which run against HDFS]

20 PIG vs Hive
Pig Latin is a new language, but it is easy to learn if you know scripting languages similar to Perl.
Hive QL is a subset of SQL with very simple variations to enable MapReduce-like computation. If you come from a SQL background, you will find Hive QL extremely easy to pick up (many of your SQL queries will run as is); if you come from a procedural programming background without SQL knowledge, Pig will be much more suitable for you.
Hive is a bit easier to integrate with other systems and tools, since it speaks the language they already speak (i.e., SQL).
Ultimately, the choice between Hive and Pig depends on the exact requirements of the application domain and on the preferences of the implementers and of those writing queries.
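The contrast is easiest to see on the same task. A sketch of counting hits per URL in both languages, assuming a hypothetical `logs` dataset with a `url` field:

```pig
-- Pig Latin: an explicit step-by-step dataflow
grpd = GROUP logs BY url;
hits = FOREACH grpd GENERATE group AS url, COUNT(logs) AS n;
```

```sql
-- Hive QL: a single declarative query
SELECT url, COUNT(*) AS n FROM logs GROUP BY url;
```

Both compile to essentially the same MapReduce plan; the difference is whether you describe the pipeline step by step or declare the result you want.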

21 HIVE (HQL)
Hive is a data warehouse infrastructure built on top of Hadoop that compiles SQL queries into MapReduce jobs and runs them on a Hadoop cluster.
Invented at Facebook for their own problems.
Provides a SQL-like query language (HQL / Hive QL) to retrieve and process the data.
JDBC/ODBC access is provided.
Can also be used together with HBase.
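A minimal Hive QL session sketch; the table name, columns, and HDFS path are assumptions for illustration:

```sql
-- Hypothetical table; Hive compiles the SELECT below into MapReduce jobs
CREATE TABLE page_views (user STRING, url STRING, ts STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Load a file already in HDFS into the table
LOAD DATA INPATH '/logs/page_views.tsv' INTO TABLE page_views;

SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url;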

22 HBase
HBase is not a high-level language that compiles to MapReduce; HBase is about allowing Hadoop to support lookups/transactions on key/value pairs.
HBase lets you do quick random lookups, versus scanning all the data sequentially, and lets you insert/update/delete from the middle of a dataset, not just add/append.
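The random-access operations above look like this in the HBase shell; the table and column names are hypothetical:

```
hbase> create 'users', 'info'                    # table with one column family
hbase> put 'users', 'row1', 'info:name', 'ada'   # insert/update a single cell
hbase> get 'users', 'row1'                       # random lookup by row key
hbase> delete 'users', 'row1', 'info:name'       # delete from the "middle"
```

Each of these touches one row directly by key, rather than launching a MapReduce scan over the whole dataset.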

23 Sqoop
Sqoop loads bulk data into Hadoop from relational databases.
It imports individual tables or entire databases to files in HDFS.
It also provides the ability to import from SQL databases straight into your Hive data warehouse.
Importing a USERS table into HDFS could be done with the command:

```
sqoop --connect jdbc:mysql://db.example.com/website --table USERS \
      --local --hive-import
```