Foreword

Hadoop got its start in Nutch. A few of us were attempting to build an open source web search engine and having trouble managing computations running on even a handful of computers. Once Google published its GFS and MapReduce papers, the route became clear. They’d devised systems to solve precisely the problems we were having with Nutch. So we started, two of us, half-time, to try to re-create these systems as a part of Nutch.

We managed to get Nutch limping along on 20 machines, but it soon became clear that to handle the Web’s massive scale, we’d need to run it on thousands of machines and, moreover, that the job was bigger than two half-time developers could handle.

Around that time, Yahoo! got interested, and quickly put together a team that I joined. We split off the distributed computing part of Nutch, naming it Hadoop. With the help of Yahoo!, Hadoop soon grew into a technology that could truly scale to the Web.

In 2006, Tom White started contributing to Hadoop. I already knew Tom through an excellent article he’d written about Nutch, so I knew he could present complex ideas in clear prose. I soon learned that he could also develop software that was as pleasant to read as his prose.

From the beginning, Tom’s contributions to Hadoop showed his concern for users and for the project. Unlike most open source contributors, Tom is not primarily interested in tweaking the system to better meet his own needs, but rather in making it easier for anyone to use.

Initially, Tom specialized in making Hadoop run well on Amazon’s EC2 and S3 services. Then he moved on to tackle a wide variety of problems, including improving the MapReduce APIs, enhancing the website, and devising an object serialization framework. In all cases, Tom presented his ideas precisely. In short order, Tom earned the role of Hadoop committer and soon thereafter became a member of the Hadoop Project Management Committee.

Tom is now a respected senior member of the Hadoop developer community. Though he’s an expert in many technical corners of the project, his specialty is making Hadoop easier to use and understand.

Given this, I was very pleased when I learned that Tom intended to write a book about Hadoop. Who could be better qualified? Now you have the opportunity to learn about Hadoop from a master—not only of the technology, but also of common sense and plain talk.

—Doug Cutting
Shed in the Yard, California

Preface

Martin Gardner, the mathematics and science writer, once said in an interview:

    Beyond calculus, I am lost. That was the secret of my column’s success. It took me so long to understand what I was writing about that I knew how to write in a way most readers would understand.*

In many ways, this is how I feel about Hadoop. Its inner workings are complex, resting as they do on a mixture of distributed systems theory, practical engineering, and common sense. And to the uninitiated, Hadoop can appear alien.

But it doesn’t need to be like this. Stripped to its core, the tools that Hadoop provides for building distributed systems—for data storage, data analysis, and coordination—are simple. If there’s a common theme, it is about raising the level of abstraction—to create building blocks for programmers who just happen to have lots of data to store, or lots of data to analyze, or lots of machines to coordinate, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.

With such a simple and generally applicable feature set, it seemed obvious to me when I started using it that Hadoop deserved to be widely used. However, at the time (in early 2006), setting up, configuring, and writing programs to use Hadoop was an art. Things have certainly improved since then: there is more documentation, there are more examples, and there are thriving mailing lists to go to when you have questions. And yet the biggest hurdle for newcomers is understanding what this technology is capable of, where it excels, and how to use it. That is why I wrote this book.

The Apache Hadoop community has come a long way. Over the course of three years, the Hadoop project has blossomed and spun off half a dozen subprojects. In this time, the software has made great leaps in performance, reliability, scalability, and manageability. To gain even wider adoption, however, I believe we need to make Hadoop even easier to use. This will involve writing more tools; integrating with more systems; and writing new, improved APIs. I’m looking forward to being a part of this, and I hope this book will encourage and enable others to do so, too.

* “The science of fun,” Alex Bellos, The Guardian, May 31, 2008, http://www.guardian.co.uk/science/2008/may/31/maths.science.

Administrative Notes

During discussion of a particular Java class in the text, I often omit its package name to reduce clutter. If you need to know which package a class is in, you can easily look it up in Hadoop’s Java API documentation for the relevant subproject, linked to from the Apache Hadoop home page at http://hadoop.apache.org/. Or if you’re using an IDE, its auto-complete mechanism can help.

Similarly, although it deviates from usual style guidelines, program listings that import multiple classes from the same package may use the asterisk wildcard character to save space (for example: import org.apache.hadoop.io.*).

The sample programs in this book are available for download from the website that accompanies this book: http://www.hadoopbook.com/. You will also find instructions there for obtaining the datasets that are used in examples throughout the book, as well as further notes for running the programs in the book, and links to updates, additional resources, and my blog.

What’s in This Book?

The rest of this book is organized as follows. Chapter 1 emphasizes the need for Hadoop and sketches the history of the project. Chapter 2 provides an introduction to MapReduce. Chapter 3 looks at Hadoop filesystems, and in particular HDFS, in depth. Chapter 4 covers the fundamentals of I/O in Hadoop: data integrity, compression, serialization, and file-based data structures.

The next four chapters cover MapReduce in depth. Chapter 5 goes through the practical steps needed to develop a MapReduce application. Chapter 6 looks at how MapReduce is implemented in Hadoop, from the point of view of a user. Chapter 7 is about the MapReduce programming model, and the various data formats that MapReduce can work with.
Chapter 8 is on advanced MapReduce topics, including sorting and joining data.

Chapters 9 and 10 are for Hadoop administrators, and describe how to set up and maintain a Hadoop cluster running HDFS and MapReduce.

Later chapters are dedicated to projects that build on Hadoop or are related to it. Chapters 11 and 12 present Pig and Hive, which are analytics platforms built on HDFS and MapReduce, whereas Chapters 13, 14, and 15 cover HBase, ZooKeeper, and Sqoop, respectively.

Finally, Chapter 16 is a collection of case studies contributed by members of the Apache Hadoop community.

What’s New in the Second Edition?

The second edition has two new chapters on Hive and Sqoop (Chapters 12 and 15), a new section covering Avro (in Chapter 4), an introduction to the new security features in Hadoop (in Chapter 9), and a new case study on analyzing massive network graphs using Hadoop (in Chapter 16).

This edition continues to describe the 0.20 release series of Apache Hadoop, since this was the latest stable release at the time of writing. New features from later releases are occasionally mentioned in the text, however, with reference to the version that they were introduced in.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

This icon signifies a tip, suggestion, or general note.

This icon indicates a warning or caution.

Using Code Examples

This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Hadoop: The Definitive Guide, Second Edition, by Tom White. Copyright 2011 Tom White, 978-1-449-38973-4.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

Safari® Books Online

Safari Books Online is an on-demand digital library that lets you easily search over 7,500 technology and creative reference books and videos to find the answers you need quickly.

With a subscription, you can read any page and watch any video from our library online. Read books on your cell phone and mobile devices. Access new titles before they are available for print, and get exclusive access to manuscripts in development and post feedback for the authors. Copy and paste code samples, organize your favorites, download chapters, bookmark key sections, create notes, print out pages, and benefit from tons of other time-saving features.

O’Reilly Media has uploaded this book to the Safari Books Online service. To have full digital access to this book and others on similar topics from O’Reilly and other publishers, sign up for free at http://my.safaribooksonline.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

    O’Reilly Media, Inc.
    1005 Gravenstein Highway North
    Sebastopol, CA 95472
    800-998-9938 (in the United States or Canada)
    707-829-0515 (international or local)
    707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information.
You can access this page at:

    http://oreilly.com/catalog/0636920010388/

The author also has a site for this book at:

    http://www.hadoopbook.com/

To comment or ask technical questions about this book, send email to:

    bookquestions@oreilly.com

For more information about our books, conferences, Resource Centers, and the O’Reilly Network, see our website at:

    http://www.oreilly.com

Acknowledgments

I have relied on many people, both directly and indirectly, in writing this book. I would like to thank the Hadoop community, from whom I have learned, and continue to learn, a great deal.

In particular, I would like to thank Michael Stack and Jonathan Gray for writing the chapter on HBase. Also thanks go to Adrian Woodhead, Marc de Palol, Joydeep Sen Sarma, Ashish Thusoo, Andrzej Białecki, Stu Hood, Chris K. Wensel, and Owen O’Malley for contributing case studies for Chapter 16.

I would like to thank the following reviewers who contributed many helpful suggestions and improvements to my drafts: Raghu Angadi, Matt Biddulph, Christophe Bisciglia, Ryan Cox, Devaraj Das, Alex Dorman, Chris Douglas, Alan Gates, Lars George, Patrick Hunt, Aaron Kimball, Peter Krey, Hairong Kuang, Simon Maxen, Olga Natkovich, Benjamin Reed, Konstantin Shvachko, Allen Wittenauer, Matei Zaharia, and Philip Zeyliger. Ajay Anand kept the review process flowing smoothly. Philip (“flip”) Kromer kindly helped me with the NCDC weather dataset featured in the examples in this book. Special thanks to Owen O’Malley and Arun C. Murthy for explaining the intricacies of the MapReduce shuffle to me. Any errors that remain are, of course, to be laid at my door.

For the second edition, I owe a debt of gratitude for the detailed review and feedback from Jeff Bean, Doug Cutting, Glynn Durham, Alan Gates, Jeff Hammerbacher, Alex Kozlov, Ken Krugler, Jimmy Lin, Todd Lipcon, Sarah Sproehnle, Vinithra Varadharajan, and Ian Wrigley, as well as all the readers who submitted errata for the first edition. I would also like to thank Aaron Kimball for contributing the chapter on Sqoop, and Philip (“flip”) Kromer for the case study on graph processing.

I am particularly grateful to Doug Cutting for his encouragement, support, and friendship, and for contributing the foreword.

Thanks also go to the many others with whom I have had conversations or email discussions over the course of writing the book.

Halfway through writing this book, I joined Cloudera, and I want to thank my colleagues for being incredibly supportive in allowing me the time to write, and to get it finished promptly.

I am grateful to my editor, Mike Loukides, and his colleagues at O’Reilly for their help in the preparation of this book. Mike has been there throughout to answer my questions, to read my first drafts, and to keep me on schedule.

Finally, the writing of this book has been a great deal of work, and I couldn’t have done it without the constant support of my family. My wife, Eliane, not only kept the home going, but also stepped in to help review, edit, and chase case studies. My daughters, Emilia and Lottie, have been very understanding, and I’m looking forward to spending lots more time with all of them.

CHAPTER 1
Meet Hadoop

    In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.
    —Grace Hopper

Data!

We live in the data age. It’s not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the “digital universe” at 0.18 zettabytes in 2006, and is forecasting a tenfold growth by 2011 to 1.8 zettabytes.* A zettabyte is 10²¹ bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes. That’s roughly the same order of magnitude as one disk drive for every person in the world.

This flood of data is coming from many sources. Consider the following:†

• The New York Stock Exchange generates about one terabyte of new trade data per day.
• Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
• Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.
• The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month.
• The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year.

* From Gantz et al., “The Diverse and Exploding Digital Universe,” March 2008 (http://www.emc.com/collateral/analyst-reports/diverse-exploding-digital-universe.pdf).
† http://www.intelligententerprise.com/showArticle.jhtml?articleID=207800705, http://mashable.com/2008/10/15/facebook-10-billion-photos/, http://blog.familytreemagazine.com/insider/Inside+Ancestrycoms+TopSecret+Data+Center.aspx, http://www.archive.org/about/faqs.php, and http://www.interactions.org/cms/?pid=1027032.
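A quick sanity check on those units and on the drive-per-person comparison (the world-population figure of roughly 6.6 billion in 2006 is an assumption added here for illustration, not a number from the IDC report):

```python
# Check the zettabyte conversions stated above, then divide the 2006
# "digital universe" estimate by an assumed 2006 world population.
ZB = 10**21  # bytes in a zettabyte

assert ZB == 1_000 * 10**18          # one thousand exabytes
assert ZB == 1_000_000 * 10**15      # one million petabytes
assert ZB == 1_000_000_000 * 10**12  # one billion terabytes

digital_universe_2006 = 0.18 * ZB    # bytes
world_population = 6.6e9             # assumed figure for 2006
per_person_gb = digital_universe_2006 / world_population / 10**9
print(f"{per_person_gb:.0f} GB per person")  # same order of magnitude as a disk drive
```

The result, a few tens of gigabytes per person, is indeed on the order of a consumer disk drive of the time.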

So there’s a lot of data out there. But you are probably wondering how it affects you. Most of the data is locked up in the largest web properties (like search engines), or scientific or financial institutions, isn’t it? Does the advent of “Big Data,” as it is being called, affect smaller organizations or individuals?

I argue that it does. Take photos, for example. My wife’s grandfather was an avid photographer, and took photographs throughout his adult life. His entire corpus of medium format, slide, and 35mm film, when scanned in at high resolution, occupies around 10 gigabytes. Compare this to the digital photos that my family took in 2008, which take up about 5 gigabytes of space. My family is producing photographic data at 35 times the rate my wife’s grandfather did, and the rate is increasing every year as it becomes easier to take more and more photos.

More generally, the digital streams that individuals are producing are growing apace. Microsoft Research’s MyLifeBits project gives a glimpse of the archiving of personal information that may become commonplace in the near future. MyLifeBits was an experiment where an individual’s interactions—phone calls, emails, documents—were captured electronically and stored for later access. The data gathered included a photo taken every minute, which resulted in an overall data volume of one gigabyte a month. When storage costs come down enough to make it feasible to store continuous audio and video, the data volume for a future MyLifeBits service will be many times that.

The trend is for every individual’s data footprint to grow, but perhaps more important, the amount of data generated by machines will be even greater than that generated by people. Machine logs, RFID readers, sensor networks, vehicle GPS traces, retail transactions—all of these contribute to the growing mountain of data.

The volume of data being made publicly available increases every year, too. Organizations no longer have to merely manage their own data: success in the future will be dictated to a large extent by their ability to extract value from other organizations’ data.

Initiatives such as Public Data Sets on Amazon Web Services, Infochimps.org, and theinfo.org exist to foster the “information commons,” where data can be freely (or in the case of AWS, for a modest price) shared for anyone to download and analyze. Mashups between different information sources make for unexpected and hitherto unimaginable applications.

Take, for example, the Astrometry.net project, which watches the Astrometry group on Flickr for new photos of the night sky. It analyzes each image and identifies which part of the sky it is from, as well as any interesting celestial bodies, such as stars or galaxies. This project shows the kind of things that are possible when data (in this case, tagged photographic images) is made available and used for something (image analysis) that was not anticipated by the creator.

It has been said that “More data usually beats better algorithms,” which is to say that for some problems (such as recommending movies or music based on past preferences), however fiendish your algorithms are, they can often be beaten simply by having more data (and a less sophisticated algorithm).‡

The good news is that Big Data is here. The bad news is that we are struggling to store and analyze it.

Data Storage and Analysis

The problem is simple: while the storage capacities of hard drives have increased massively over the years, access speeds—the rate at which data can be read from drives—have not kept up. One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s,§ so you could read all the data from a full drive in around five minutes. Over 20 years later, one terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.

This is a long time to read all the data on a single drive—and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes.

Using only one hundredth of a disk may seem wasteful. But we can store one hundred datasets, each of which is one terabyte, and provide shared access to them. We can imagine that the users of such a system would be happy to share access in return for shorter analysis times, and, statistically, that their analysis jobs would be likely to be spread over time, so they wouldn’t interfere with each other too much.

There’s more to being able to read and write data in parallel to or from multiple disks, though.

The first problem to solve is hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure, there is another copy available.
This is how RAID works, for instance, although Hadoop’s filesystem, the Hadoop Distributed Filesystem (HDFS), takes a slightly different approach, as you shall see later.

The second problem is that most analysis tasks need to be able to combine the data in some way; data read from one disk may need to be combined with the data from any of the other 99 disks. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging. MapReduce provides a programming model that abstracts the problem from disk reads and writes, transforming it into a computation over sets of keys and values. We will look at the details of this model in later chapters, but the important point for the present discussion is that there are two parts to the computation, the map and the reduce, and it’s the interface between the two where the “mixing” occurs. Like HDFS, MapReduce has built-in reliability.

This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS and analysis by MapReduce. There are other parts to Hadoop, but these capabilities are its kernel.

‡ The quote is from Anand Rajaraman writing about the Netflix Challenge (http://anand.typepad.com/datawocky/2008/03/more-data-usual.html). Alon Halevy, Peter Norvig, and Fernando Pereira make the same point in “The Unreasonable Effectiveness of Data,” IEEE Intelligent Systems, March/April 2009.
§ These specifications are for the Seagate ST-41600n.

Comparison with Other Systems

The approach taken by MapReduce may seem like a brute-force approach. The premise is that the entire dataset—or at least a good portion of it—is processed for each query. But this is its power. MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative. It changes the way you think about data, and unlocks data that was previously archived on tape or disk. It gives people the opportunity to innovate with data. Questions that took too long to get answered before can now be answered, which in turn leads to new questions and new insights.

For example, Mailtrust, Rackspace’s mail division, used Hadoop for processing email logs. One ad hoc query they wrote was to find the geographic distribution of their users. In their words:

    This data was so useful that we’ve scheduled the MapReduce job to run monthly and we will be using this data to help us decide which Rackspace data centers to place new mail servers in as we grow.

By bringing several hundred gigabytes of data together and having the tools to analyze it, the Rackspace engineers were able to gain an understanding of the data that they otherwise would never have had, and, furthermore, they were able to use what they had learned to improve the service for their customers.
You can read more about how Rackspace uses Hadoop in Chapter 16.

RDBMS

Why can’t we use databases with lots of disks to do large-scale batch analysis? Why is MapReduce needed?

The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth.

If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate. On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.

In many ways, MapReduce can be seen as a complement to an RDBMS. (The differences between the two systems are shown in Table 1-1.) MapReduce is a good fit for problems that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated.

Table 1-1. RDBMS compared to MapReduce

                Traditional RDBMS             MapReduce
    Data size   Gigabytes                     Petabytes
    Access      Interactive and batch         Batch
    Updates     Read and write many times     Write once, read many times
    Structure   Static schema                 Dynamic schema
    Integrity   High                          Low
    Scaling     Nonlinear                     Linear

Another difference between MapReduce and an RDBMS is the amount of structure in the datasets that they operate on. Structured data is data that is organized into entities that have a defined format, such as XML documents or database tables that conform to a particular predefined schema. This is the realm of the RDBMS. Semi-structured data, on the other hand, is looser, and though there may be a schema, it is often ignored, so it may be used only as a guide to the structure of the data: for example, a spreadsheet, in which the structure is the grid of cells, although the cells themselves may hold any form of data. Unstructured data does not have any particular internal structure: for example, plain text or image data. MapReduce works well on unstructured or semi-structured data, since it is designed to interpret the data at processing time. In other words, the input keys and values for MapReduce are not an intrinsic property of the data, but they are chosen by the person analyzing the data.
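Returning to the seek-versus-transfer trend that opened this section, a back-of-envelope comparison shows why record-at-a-time updates lose to a streaming rebuild once a large fraction of the database changes. The figures used here (a 10 ms seek, a 100 MB/s transfer rate, 100-byte records, a 1 TB database) are illustrative assumptions, not measurements of any particular drive:

```python
# Compare seek-dominated updates against a full streaming rewrite.
SEEK_S = 0.010          # one random seek, assumed
TRANSFER_MB_S = 100.0   # sequential transfer rate, assumed
DB_MB = 1_000_000       # a 1 TB database
RECORD_BYTES = 100      # assumed record size
NUM_RECORDS = DB_MB * 1_000_000 // RECORD_BYTES  # 10 billion records

def update_by_seeking(fraction):
    """Hours to update `fraction` of the records, one seek per record."""
    return NUM_RECORDS * fraction * SEEK_S / 3600

def rewrite_by_streaming():
    """Hours to stream the whole database through a Sort/Merge rebuild
    (read everything once, write everything once)."""
    return 2 * DB_MB / TRANSFER_MB_S / 3600

# Updating even 1% of the records by seeking takes days; streaming the
# entire terabyte takes hours.
print(f"seek 1% of records: {update_by_seeking(0.01):.0f} hours")
print(f"stream everything:  {rewrite_by_streaming():.1f} hours")
```

Under these assumptions the seek-based update of one record in a hundred is roughly fifty times slower than rewriting the whole database at the transfer rate, which is the intuition behind the B-Tree versus Sort/Merge comparison above.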

Relational data is often normalized to retain its integrity and remove redundancy. Normalization poses problems for MapReduce, since it makes reading a record a non-local operation, and one of the central assumptions that MapReduce makes is that it is possible to perform (high-speed) streaming reads and writes.

A web server log is a good example of a set of records that is not normalized (for example, the client hostnames are specified in full each time, even though the same client may appear many times), and this is one reason that logfiles of all kinds are particularly well-suited to analysis with MapReduce.

MapReduce is a linearly scalable programming model. The programmer writes two functions—a map function and a reduce function—each of which defines a mapping from one set of key-value pairs to another. These functions are oblivious to the size of the data or the cluster that they are operating on, so they can be used unchanged for a small dataset and for a massive one. More important, if you double the size of the input data, a job will run twice as slow. But if you also double the size of the cluster, a job will run as fast as the original one. This is not generally true of SQL queries.

Over time, however, the differences between relational databases and MapReduce systems are likely to blur—both as relational databases start incorporating some of the ideas from MapReduce (such as Aster Data’s and Greenplum’s databases) and, from the other direction, as higher-level query languages built on MapReduce (such as Pig and Hive) make MapReduce systems more approachable to traditional database programmers.‖

Grid Computing

The High Performance Computing (HPC) and Grid Computing communities have been doing large-scale data processing for years, using such APIs as Message Passing Interface (MPI). Broadly, the approach in HPC is to distribute the work across a cluster of machines, which access a shared filesystem, hosted by a SAN. This works well for predominantly compute-intensive jobs, but becomes a problem when nodes need to access larger data volumes (hundreds of gigabytes, the point at which MapReduce really starts to shine), since the network bandwidth is the bottleneck and compute nodes become idle.

‖ In January 2007, David J. DeWitt and Michael Stonebraker caused a stir by publishing “MapReduce: A major step backwards” (http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards), in which they criticized MapReduce for being a poor substitute for relational databases. Many commentators argued that it was a false comparison (see, for example, Mark C. Chu-Carroll’s “Databases are hammers; MapReduce is a screwdriver,” http://scienceblogs.com/goodmath/2008/01/databases_are_hammers_mapreduc.php), and DeWitt and Stonebraker followed up with “MapReduce II” (http://databasecolumn.vertica.com/database-innovation/mapreduce-ii), where they addressed the main topics brought up by others.

MapReduce tries to collocate the data with the compute node, so data access is fastsince it is local.# This feature, known as data locality, is at the heart of MapReduce andis the reason for its good performance. Recognizing that network bandwidth is the mostprecious resource in a data center environment (it is easy to saturate network links bycopying data around), MapReduce implementations go to great lengths to conserve itby explicitly modelling network topology. Notice that this arrangement does not pre-clude high-CPU analyses in MapReduce.MPI gives great control to the programmer, but requires that he or she explicitly handlethe mechanics of the data flow, exposed via low-level C routines and constructs, suchas sockets, as well as the higher-level algorithm for the analysis. MapReduce operatesonly at the higher level: the programmer thinks in terms of functions of key and valuepairs, and the data flow is implicit.Coordinating the processes in a large-scale distributed computation is a challenge. Thehardest aspect is gracefully handling partial failure—when you don’t know if a remoteprocess has failed or not—and still making progress with the overall computation.MapReduce spares the programmer from having to think about failure, since theimplementation detects failed map or reduce tasks and reschedules replacements onmachines that are healthy. MapReduce is able to do this since it is a shared-nothingarchitecture, meaning that tasks have no dependence on one other. (This is a slightoversimplification, since the output from mappers is fed to the reducers, but this isunder the control of the MapReduce system; in this case, it needs to take more carererunning a failed reducer than rerunning a failed map, since it has to make sure it canretrieve the necessary map outputs, and if not, regenerate them by running the relevantmaps again.) So from the programmer’s point of view, the order in which the tasks rundoesn’t matter. 
By contrast, MPI programs have to explicitly manage their own check-pointing and recovery, which gives more control to the programmer, but makes themmore difficult to write.MapReduce might sound like quite a restrictive programming model, and in a sense itis: you are limited to key and value types that are related in specified ways, and mappersand reducers run with very limited coordination between one another (the mapperspass keys and values to reducers). A natural question to ask is: can you do anythinguseful or nontrivial with it?The answer is yes. MapReduce was invented by engineers at Google as a system forbuilding production search indexes because they found themselves solving the sameproblem over and over again (and MapReduce was inspired by older ideas from thefunctional programming, distributed computing, and database communities), but ithas since been used for many other applications in many other industries. It is pleasantlysurprising to see the range of algorithms that can be expressed in MapReduce, from#Jim Gray was an early advocate of putting the computation near the data. See “Distributed Computing Economics,” March 2003, http://research.microsoft.com/apps/pubs/default.aspx?id=70001. Comparison with Other Systems | 7

image analysis, to graph-based problems, to machine learning algorithms.* It can't solve every problem, of course, but it is a general data-processing tool.

You can see a sample of some of the applications that Hadoop has been used for in Chapter 16.

Volunteer Computing

When people first hear about Hadoop and MapReduce, they often ask, "How is it different from SETI@home?" SETI, the Search for Extra-Terrestrial Intelligence, runs a project called SETI@home in which volunteers donate CPU time from their otherwise idle computers to analyze radio telescope data for signs of intelligent life outside earth. SETI@home is the most well-known of many volunteer computing projects; others include the Great Internet Mersenne Prime Search (to search for large prime numbers) and Folding@home (to understand protein folding and how it relates to disease).

Volunteer computing projects work by breaking the problem they are trying to solve into chunks called work units, which are sent to computers around the world to be analyzed. For example, a SETI@home work unit is about 0.35 MB of radio telescope data, and takes hours or days to analyze on a typical home computer. When the analysis is completed, the results are sent back to the server, and the client gets another work unit. As a precaution to combat cheating, each work unit is sent to three different machines and needs at least two results to agree to be accepted.

Although SETI@home may be superficially similar to MapReduce (breaking a problem into independent pieces to be worked on in parallel), there are some significant differences. The SETI@home problem is very CPU-intensive, which makes it suitable for running on hundreds of thousands of computers across the world,† since the time to transfer the work unit is dwarfed by the time to run the computation on it.
Volunteers are donating CPU cycles, not bandwidth.

MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects. By contrast, SETI@home runs a perpetual computation on untrusted machines on the Internet with highly variable connection speeds and no data locality.

* Apache Mahout (http://mahout.apache.org/) is a project to build machine learning libraries (such as classification and clustering algorithms) that run on Hadoop.
† In January 2008, SETI@home was reported at http://www.planetary.org/programs/projects/setiathome/setiathome_20080115.html to be processing 300 gigabytes a day, using 320,000 computers (most of which are not dedicated to SETI@home; they are used for other things, too).

A Brief History of Hadoop

Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.

The Origin of the Name "Hadoop"

The name Hadoop is not an acronym; it's a made-up name. The project's creator, Doug Cutting, explains how the name came about:

    The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid's term.

Subprojects and "contrib" modules in Hadoop also tend to have names that are unrelated to their function, often with an elephant or other animal theme ("Pig," for example). Smaller components are given more descriptive (and therefore more mundane) names. This is a good principle, as it means you can generally work out what something does from its name. For example, the jobtracker‡ keeps track of MapReduce jobs.

Building a web search engine from scratch was an ambitious goal, for not only is the software required to crawl and index websites complex to write, but it is also a challenge to run without a dedicated operations team, since there are so many moving parts. It's expensive, too: Mike Cafarella and Doug Cutting estimated a system supporting a 1-billion-page index would cost around half a million dollars in hardware, with a monthly running cost of $30,000.§ Nevertheless, they believed it was a worthy goal, as it would open up and ultimately democratize search engine algorithms.

Nutch was started in 2002, and a working crawler and search system quickly emerged. However, they realized that their architecture wouldn't scale to the billions of pages on the Web.
Help was at hand with the publication of a paper in 2003 that described the architecture of Google's distributed filesystem, called GFS, which was being used in production at Google.‖ GFS, or something like it, would solve their storage needs for the very large files generated as a part of the web crawl and indexing process. In particular, GFS would free up time being spent on administrative tasks such as managing storage nodes. In 2004, they set about writing an open source implementation, the Nutch Distributed Filesystem (NDFS).

‡ In this book, we use the lowercase form, "jobtracker," to denote the entity when it's being referred to generally, and the CamelCase form JobTracker to denote the Java class that implements it.
§ Mike Cafarella and Doug Cutting, "Building Nutch: Open Source Search," ACM Queue, April 2004, http://queue.acm.org/detail.cfm?id=988408.
‖ Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System," October 2003, http://labs.google.com/papers/gfs.html.

In 2004, Google published the paper that introduced MapReduce to the world.# Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run using MapReduce and NDFS.

NDFS and the MapReduce implementation in Nutch were applicable beyond the realm of search, and in February 2006 they moved out of Nutch to form an independent subproject of Lucene called Hadoop. At around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale (see sidebar). This was demonstrated in February 2008 when Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.*

In January 2008, Hadoop was made its own top-level project at Apache, confirming its success and its diverse, active community. By this time, Hadoop was being used by many other companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times. Some applications are covered in the case studies in Chapter 16 and on the Hadoop wiki.

In one well-publicized feat, the New York Times used Amazon's EC2 compute cloud to crunch through four terabytes of scanned archives from the paper, converting them to PDFs for the Web.† The processing took less than 24 hours to run using 100 machines, and the project probably wouldn't have been embarked on without the combination of Amazon's pay-by-the-hour model (which allowed the NYT to access a large number of machines for a short period) and Hadoop's easy-to-use parallel programming model.

In April 2008, Hadoop broke a world record to become the fastest system to sort a terabyte of data. Running on a 910-node cluster, Hadoop sorted one terabyte in 209 seconds (just under 3½ minutes), beating the previous year's winner of 297 seconds (described in detail in "TeraByte Sort on Apache Hadoop" on page 553).
In November of the same year, Google reported that its MapReduce implementation sorted one terabyte in 68 seconds.‡ As the first edition of this book was going to press (May 2009), it was announced that a team at Yahoo! used Hadoop to sort one terabyte in 62 seconds.

# Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," December 2004, http://labs.google.com/papers/mapreduce.html.
* "Yahoo! Launches World's Largest Hadoop Production Application," 19 February 2008, http://developer.yahoo.net/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html.
† Derek Gottfrid, "Self-service, Prorated Super Computing Fun!" 1 November 2007, http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/.
‡ "Sorting 1PB with MapReduce," 21 November 2008, http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html.

Hadoop at Yahoo!

Building Internet-scale search engines requires huge amounts of data and therefore large numbers of machines to process it. Yahoo! Search consists of four primary components: the Crawler, which downloads pages from web servers; the WebMap, which builds a graph of the known Web; the Indexer, which builds a reverse index to the best pages; and the Runtime, which answers users' queries. The WebMap is a graph that consists of roughly 1 trillion (10^12) edges, each representing a web link, and 100 billion (10^11) nodes, each representing distinct URLs. Creating and analyzing such a large graph requires a large number of computers running for many days. In early 2005, the infrastructure for the WebMap, named Dreadnaught, needed to be redesigned to scale up to more nodes. Dreadnaught had successfully scaled from 20 to 600 nodes, but required a complete redesign to scale out further. Dreadnaught is similar to MapReduce in many ways, but provides more flexibility and less structure. In particular, each fragment in a Dreadnaught job can send output to each of the fragments in the next stage of the job, but the sort was all done in library code. In practice, most of the WebMap phases were pairs that corresponded to MapReduce. Therefore, the WebMap applications would not require extensive refactoring to fit into MapReduce.

Eric Baldeschwieler (Eric14) created a small team and we started designing and prototyping a new framework, written in C++ and modeled after GFS and MapReduce, to replace Dreadnaught. Although the immediate need was for a new framework for WebMap, it was clear that standardization of the batch platform across Yahoo! Search was critical, and by making the framework general enough to support other users, we could better leverage investment in the new platform.

At the same time, we were watching Hadoop, which was part of Nutch, and its progress. In January 2006, Yahoo! hired Doug Cutting, and a month later we decided to abandon our prototype and adopt Hadoop.
The advantage of Hadoop over our prototype and design was that it was already working with a real application (Nutch) on 20 nodes. That allowed us to bring up a research cluster two months later and start helping real customers use the new framework much sooner than we could have otherwise. Another advantage, of course, was that since Hadoop was already open source, it was easier (although far from easy!) to get permission from Yahoo!'s legal department to work in open source. So we set up a 200-node cluster for the researchers in early 2006 and put the WebMap conversion plans on hold while we supported and improved Hadoop for the research users.

Here's a quick timeline of how things have progressed:

• 2004—Initial versions of what is now Hadoop Distributed Filesystem and MapReduce implemented by Doug Cutting and Mike Cafarella.
• December 2005—Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
• January 2006—Doug Cutting joins Yahoo!.
• February 2006—Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS.

• February 2006—Adoption of Hadoop by Yahoo! Grid team.
• April 2006—Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
• May 2006—Yahoo! set up a Hadoop research cluster—300 nodes.
• May 2006—Sort benchmark run on 500 nodes in 42 hours (better hardware than April benchmark).
• October 2006—Research cluster reaches 600 nodes.
• December 2006—Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3 hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
• January 2007—Research cluster reaches 900 nodes.
• April 2007—Research clusters—2 clusters of 1000 nodes.
• April 2008—Won the 1 terabyte sort benchmark in 209 seconds on 900 nodes.
• October 2008—Loading 10 terabytes of data per day onto research clusters.
• March 2009—17 clusters with a total of 24,000 nodes.
• April 2009—Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100 terabyte sort in 173 minutes (on 3,400 nodes).

—Owen O'Malley

Apache Hadoop and the Hadoop Ecosystem

Although Hadoop is best known for MapReduce and its distributed filesystem (HDFS, renamed from NDFS), the term is also used for a family of related projects that fall under the umbrella of infrastructure for distributed computing and large-scale data processing.

Most of the core projects covered in this book are hosted by the Apache Software Foundation, which provides support for a community of open source software projects, including the original HTTP Server from which it gets its name.
As the Hadoop ecosystem grows, more projects are appearing, not necessarily hosted at Apache, which provide complementary services to Hadoop, or build on the core to add higher-level abstractions.

The Hadoop projects that are covered in this book are described briefly here:

Common
    A set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).
Avro
    A serialization system for efficient, cross-language RPC and persistent data storage.
MapReduce
    A distributed data processing model and execution environment that runs on large clusters of commodity machines.

HDFS
    A distributed filesystem that runs on large clusters of commodity machines.
Pig
    A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
Hive
    A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which is translated by the runtime engine to MapReduce jobs) for querying the data.
HBase
    A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
ZooKeeper
    A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
Sqoop
    A tool for efficiently moving data between relational databases and HDFS.

CHAPTER 2
MapReduce

MapReduce is a programming model for data processing. The model is simple, yet not too simple to express useful programs in. Hadoop can run MapReduce programs written in various languages; in this chapter, we shall look at the same program expressed in Java, Ruby, Python, and C++. Most important, MapReduce programs are inherently parallel, thus putting very large-scale data analysis into the hands of anyone with enough machines at their disposal. MapReduce comes into its own for large datasets, so let's start by looking at one.

A Weather Dataset

For our example, we will write a program that mines weather data. Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data, which is a good candidate for analysis with MapReduce, since it is semi-structured and record-oriented.

Data Format

The data we will use is from the National Climatic Data Center (NCDC, http://www.ncdc.noaa.gov/). The data is stored using a line-oriented ASCII format, in which each line is a record. The format supports a rich set of meteorological elements, many of which are optional or with variable data lengths. For simplicity, we shall focus on the basic elements, such as temperature, which are always present and are of fixed width.

Example 2-1 shows a sample line with some of the salient fields highlighted. The line has been split into multiple lines to show each field: in the real file, fields are packed into one line with no delimiters.
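To make the fixed-width layout concrete, here is a small plain-Java sketch. The NcdcRecordParser class and its synthetic, zero-padded record are illustrative inventions, not part of the book's code; the column offsets (year at 15–19, signed temperature at 87–92, quality code at 92) are the ones used by the Java examples later in this chapter.

```java
// Illustrative helper for the fixed-width NCDC format (a hypothetical class,
// not from the book; offsets follow the chapter's later examples).
public class NcdcRecordParser {

    public static String year(String line) {
        return line.substring(15, 19);
    }

    public static int airTemperature(String line) {
        // Older JDKs' parseInt rejects a leading plus sign, so strip it if
        // present (the book's mapper does the same).
        return line.charAt(87) == '+'
                ? Integer.parseInt(line.substring(88, 92))
                : Integer.parseInt(line.substring(87, 92));
    }

    public static String quality(String line) {
        return line.substring(92, 93);
    }

    // Build a synthetic, zero-padded 106-character record for demonstration;
    // real NCDC lines carry many more fields.
    public static String sampleLine(String year, String temp, String q) {
        StringBuilder sb = new StringBuilder("0".repeat(106));
        sb.replace(15, 19, year);
        sb.replace(87, 92, temp);
        sb.replace(92, 93, q);
        return sb.toString();
    }

    public static void main(String[] args) {
        String line = sampleLine("1950", "+0022", "1");
        // Prints the year, the temperature (tenths of a degree), and the
        // quality code, separated by tabs.
        System.out.println(year(line) + "\t" + airTemperature(line) + "\t" + quality(line));
    }
}
```

Note that a temperature of 22 here means 2.2°C, since the dataset stores values scaled by a factor of 10.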

year's readings were concatenated into a single file. (The means by which this was carried out is described in Appendix C.)

Analyzing the Data with Unix Tools

What's the highest recorded global temperature for each year in the dataset? We will answer this first without using Hadoop, as this information will provide a performance baseline, as well as a useful means to check our results.

The classic tool for processing line-oriented data is awk. Example 2-2 is a small script to calculate the maximum temperature for each year.

Example 2-2. A program for finding the maximum recorded temperature by year from NCDC weather records

    #!/usr/bin/env bash
    for year in all/*
    do
      echo -ne `basename $year .gz`"\t"
      gunzip -c $year | \
        awk '{ temp = substr($0, 88, 5) + 0;
               q = substr($0, 93, 1);
               if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
             END { print max }'
    done

The script loops through the compressed year files, first printing the year, and then processing each file using awk. The awk script extracts two fields from the data: the air temperature and the quality code. The air temperature value is turned into an integer by adding 0. Next, a test is applied to see if the temperature is valid (the value 9999 signifies a missing value in the NCDC dataset) and if the quality code indicates that the reading is not suspect or erroneous. If the reading is OK, the value is compared with the maximum value seen so far, which is updated if a new maximum is found. The END block is executed after all the lines in the file have been processed, and it prints the maximum value.

Here is the beginning of a run:

    % ./max_temperature.sh
    1901    317
    1902    244
    1903    289
    1904    256
    1905    283
    ...

The temperature values in the source file are scaled by a factor of 10, so this works out as a maximum temperature of 31.7°C for 1901 (there were very few readings at the beginning of the century, so this is plausible). The complete run for the century took 42 minutes in one run on a single EC2 High-CPU Extra Large Instance.

To speed up the processing, we need to run parts of the program in parallel. In theory, this is straightforward: we could process different years in different processes, using all the available hardware threads on a machine. There are a few problems with this, however.

First, dividing the work into equal-size pieces isn't always easy or obvious. In this case, the file size for different years varies widely, so some processes will finish much earlier than others. Even if they pick up further work, the whole run is dominated by the longest file. A better approach, although one that requires more work, is to split the input into fixed-size chunks and assign each chunk to a process.

Second, combining the results from independent processes may need further processing. In this case, the result for each year is independent of other years and may be combined by concatenating all the results and sorting by year. If using the fixed-size chunk approach, the combination is more delicate. For this example, data for a particular year will typically be split into several chunks, each processed independently. We'll end up with the maximum temperature for each chunk, so the final step is to look for the highest of these maximums for each year.

Third, you are still limited by the processing capacity of a single machine. If the best time you can achieve is 20 minutes with the number of processors you have, then that's it. You can't make it go faster. Also, some datasets grow beyond the capacity of a single machine. When we start using multiple machines, a whole host of other factors come into play, mainly falling in the category of coordination and reliability. Who runs the overall job? How do we deal with failed processes?

So, though it's feasible to parallelize the processing, in practice it's messy.
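The fixed-size chunk approach can be sketched in plain Java. This is a hypothetical illustration only: the thread pool, the made-up "year temperature" record format, and the ChunkedMaxTemperature class are all inventions for the sketch, not NCDC code or anything Hadoop provides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: each chunk of "year temperature" records is reduced
// to per-year maxima in its own thread, and the partial maxima are then
// merged by taking the highest value seen for each year.
public class ChunkedMaxTemperature {

    // Per-chunk work: the maximum temperature for each year in the chunk.
    static Map<String, Integer> maxPerChunk(List<String> chunk) {
        Map<String, Integer> max = new HashMap<>();
        for (String record : chunk) {
            String[] parts = record.split(" ");
            max.merge(parts[0], Integer.parseInt(parts[1]), Math::max);
        }
        return max;
    }

    // Run the chunks in parallel, then combine: the final answer for each
    // year is the highest of its per-chunk maxima.
    public static Map<String, Integer> run(List<List<String>> chunks) {
        ExecutorService pool = Executors.newFixedThreadPool(chunks.size());
        try {
            List<Future<Map<String, Integer>>> futures = new ArrayList<>();
            for (List<String> chunk : chunks) {
                futures.add(pool.submit(() -> maxPerChunk(chunk)));
            }
            Map<String, Integer> result = new HashMap<>();
            for (Future<Map<String, Integer>> f : futures) {
                f.get().forEach((year, t) -> result.merge(year, t, Math::max));
            }
            return result;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> chunks = List.of(
                List.of("1949 111", "1950 0"),
                List.of("1950 22", "1949 78"));
        System.out.println(run(chunks));
    }
}
```

Even in this toy form, the delicate part the text describes is visible: the merge step must re-reduce the partial results, and nothing here addresses machine failure or coordination across machines.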
Using a framework like Hadoop to take care of these issues is a great help.

Analyzing the Data with Hadoop

To take advantage of the parallel processing that Hadoop provides, we need to express our query as a MapReduce job. After some local, small-scale testing, we will be able to run it on a cluster of machines.

Map and Reduce

MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function.

The input to our map phase is the raw NCDC data. We choose a text input format that gives us each line in the dataset as a text value. The key is the offset of the beginning of the line from the beginning of the file, but as we have no need for this, we ignore it.

Our map function is simple. We pull out the year and the air temperature, since these are the only fields we are interested in. In this case, the map function is just a data preparation phase, setting up the data in such a way that the reducer function can do its work on it: finding the maximum temperature for each year. The map function is also a good place to drop bad records: here we filter out temperatures that are missing, suspect, or erroneous.

To visualize the way the map works, consider the following sample lines of input data (some unused columns have been dropped to fit the page, indicated by ellipses):

    0067011990999991950051507004...9999999N9+00001+99999999999...
    0043011990999991950051512004...9999999N9+00221+99999999999...
    0043011990999991950051518004...9999999N9-00111+99999999999...
    0043012650999991949032412004...0500001N9+01111+99999999999...
    0043012650999991949032418004...0500001N9+00781+99999999999...

These lines are presented to the map function as the key-value pairs:

    (0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
    (106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
    (212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
    (318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
    (424, 0043012650999991949032418004...0500001N9+00781+99999999999...)

The keys are the line offsets within the file, which we ignore in our map function. The map function merely extracts the year and the air temperature (indicated in bold text), and emits them as its output (the temperature values have been interpreted as integers):

    (1950, 0)
    (1950, 22)
    (1950, −11)
    (1949, 111)
    (1949, 78)

The output from the map function is processed by the MapReduce framework before being sent to the reduce function. This processing sorts and groups the key-value pairs by key.
So, continuing the example, our reduce function sees the following input:

    (1949, [111, 78])
    (1950, [0, 22, −11])

Each year appears with a list of all its air temperature readings. All the reduce function has to do now is iterate through the list and pick up the maximum reading:

    (1949, 111)
    (1950, 22)

This is the final output: the maximum global temperature recorded in each year.

The whole data flow is illustrated in Figure 2-1. At the bottom of the diagram is a Unix pipeline, which mimics the whole MapReduce flow, and which we will see again later in the chapter when we look at Hadoop Streaming.
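The flow just traced by hand (map output, sort-and-group, then reduce) can also be mimicked in a few lines of ordinary Java. This is purely a sketch for intuition: MapReduceFlow is an invented class, and none of it is Hadoop code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// A plain-Java mimic of the MapReduce flow described in the text: the
// "shuffle" groups map output by key in sorted key order, and the reduce
// step picks the maximum reading for each year.
public class MapReduceFlow {

    public static SortedMap<String, Integer> maxTemperatures(List<String[]> mapOutput) {
        // Shuffle and sort: group (year, temperature) pairs by year.
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] kv : mapOutput) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>())
                   .add(Integer.parseInt(kv[1]));
        }
        // Reduce: iterate through each year's readings and keep the maximum.
        SortedMap<String, Integer> result = new TreeMap<>();
        grouped.forEach((year, temps) -> result.put(year, Collections.max(temps)));
        return result;
    }

    public static void main(String[] args) {
        List<String[]> mapOutput = List.of(
                new String[] {"1950", "0"}, new String[] {"1950", "22"},
                new String[] {"1950", "-11"}, new String[] {"1949", "111"},
                new String[] {"1949", "78"});
        System.out.println(maxTemperatures(mapOutput)); // → {1949=111, 1950=22}
    }
}
```

Running it on the five sample map outputs reproduces the (1949, 111) and (1950, 22) results derived above.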

The Mapper interface is a generic type, with four formal type parameters that specify the input key, input value, output key, and output value types of the map function. For the present example, the input key is a long integer offset, the input value is a line of text, the output key is a year, and the output value is an air temperature (an integer). Rather than use built-in Java types, Hadoop provides its own set of basic types that are optimized for network serialization. These are found in the org.apache.hadoop.io package. Here we use LongWritable, which corresponds to a Java Long, Text (like Java String), and IntWritable (like Java Integer).

The map() method is passed a key and a value. We convert the Text value containing the line of input into a Java String, then use its substring() method to extract the columns we are interested in.

The map() method also provides an instance of OutputCollector to write the output to. In this case, we write the year as a Text object (since we are just using it as a key), and the temperature is wrapped in an IntWritable. We write an output record only if the temperature is present and the quality code indicates the temperature reading is OK.

The reduce function is similarly defined using a Reducer, as illustrated in Example 2-4.

Example 2-4.
Reducer for maximum temperature example

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class MaxTemperatureReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {

        int maxValue = Integer.MIN_VALUE;
        while (values.hasNext()) {
          maxValue = Math.max(maxValue, values.next().get());
        }
        output.collect(key, new IntWritable(maxValue));
      }
    }

Again, four formal type parameters are used to specify the input and output types, this time for the reduce function. The input types of the reduce function must match the output types of the map function: Text and IntWritable. And in this case, the output types of the reduce function are Text and IntWritable, for a year and its maximum

temperature, which we find by iterating through the temperatures and comparing each with a record of the highest found so far.

The third piece of code runs the MapReduce job (see Example 2-5).

Example 2-5. Application to find the maximum temperature in the weather dataset

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class MaxTemperature {

      public static void main(String[] args) throws IOException {
        if (args.length != 2) {
          System.err.println("Usage: MaxTemperature <input path> <output path>");
          System.exit(-1);
        }

        JobConf conf = new JobConf(MaxTemperature.class);
        conf.setJobName("Max temperature");

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(MaxTemperatureMapper.class);
        conf.setReducerClass(MaxTemperatureReducer.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        JobClient.runJob(conf);
      }
    }

A JobConf object forms the specification of the job. It gives you control over how the job is run. When we run this job on a Hadoop cluster, we will package the code into a JAR file (which Hadoop will distribute around the cluster). Rather than explicitly specify the name of the JAR file, we can pass a class in the JobConf constructor, which Hadoop will use to locate the relevant JAR file by looking for the JAR file containing this class.

Having constructed a JobConf object, we specify the input and output paths. An input path is specified by calling the static addInputPath() method on FileInputFormat, and it can be a single file, a directory (in which case, the input forms all the files in that directory), or a file pattern.
As the name suggests, addInputPath() can be called more than once to use input from multiple paths.

The output path (of which there is only one) is specified by the static setOutputPath() method on FileOutputFormat. It specifies a directory where the output files from the reducer functions are written. The directory shouldn't exist before running the job, as Hadoop will complain and not run the job. This precaution is to prevent data loss (it can be very annoying to accidentally overwrite the output of a long job with another).

Next, we specify the map and reduce types to use via the setMapperClass() and setReducerClass() methods.

The setOutputKeyClass() and setOutputValueClass() methods control the output types for the map and the reduce functions, which are often the same, as they are in our case. If they are different, then the map output types can be set using the methods setMapOutputKeyClass() and setMapOutputValueClass().

The input types are controlled via the input format, which we have not explicitly set since we are using the default TextInputFormat.

After setting the classes that define the map and reduce functions, we are ready to run the job. The static runJob() method on JobClient submits the job and waits for it to finish, writing information about its progress to the console.

A test run

After writing a MapReduce job, it's normal to try it out on a small dataset to flush out any immediate problems with the code. First install Hadoop in standalone mode—there are instructions for how to do this in Appendix A. This is the mode in which Hadoop runs using the local filesystem with a local job runner. Let's test it on the five-line sample discussed earlier (the output has been slightly reformatted to fit the page):

    % export HADOOP_CLASSPATH=build/classes
    % hadoop MaxTemperature input/ncdc/sample.txt output
    09/04/07 12:34:35 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=Job
    Tracker, sessionId=
    09/04/07 12:34:35 WARN mapred.JobClient: Use GenericOptionsParser for parsing the
    arguments. Applications should implement Tool for the same.
    09/04/07 12:34:35 WARN mapred.JobClient: No job jar file set. User classes may not
    be found. See JobConf(Class) or JobConf#setJar(String).
    09/04/07 12:34:35 INFO mapred.FileInputFormat: Total input paths to process : 1
    09/04/07 12:34:35 INFO mapred.JobClient: Running job: job_local_0001
    09/04/07 12:34:35 INFO mapred.FileInputFormat: Total input paths to process : 1
    09/04/07 12:34:35 INFO mapred.MapTask: numReduceTasks: 1
    09/04/07 12:34:35 INFO mapred.MapTask: io.sort.mb = 100
    09/04/07 12:34:35 INFO mapred.MapTask: data buffer = 79691776/99614720
    09/04/07 12:34:35 INFO mapred.MapTask: record buffer = 262144/327680
    09/04/07 12:34:35 INFO mapred.MapTask: Starting flush of map output
    09/04/07 12:34:36 INFO mapred.MapTask: Finished spill 0
    09/04/07 12:34:36 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is
    done. And is in the process of commiting
    09/04/07 12:34:36 INFO mapred.LocalJobRunner: file:/Users/tom/workspace/htdg/input/n
    cdc/sample.txt:0+529
    09/04/07 12:34:36 INFO mapred.TaskRunner: Task attempt_local_0001_m_000000_0 done.

attempt_local_0001_r_000000_0). Knowing the job and task IDs can be very useful when debugging MapReduce jobs.

The last section of the output, titled "Counters," shows the statistics that Hadoop generates for each job it runs. These are very useful for checking whether the amount of data processed is what you expected. For example, we can follow the number of records that went through the system: five map inputs produced five map outputs, then five reduce inputs in two groups produced two reduce outputs.

The output was written to the output directory, which contains one output file per reducer. The job had a single reducer, so we find a single file, named part-00000:

    % cat output/part-00000
    1949    111
    1950    22

This result is the same as when we went through it by hand earlier. We interpret this as saying that the maximum temperature recorded in 1949 was 11.1°C, and in 1950 it was 2.2°C.

The new Java MapReduce API

Release 0.20.0 of Hadoop included a new Java MapReduce API, sometimes referred to as "Context Objects," designed to make the API easier to evolve in the future. The new API is type-incompatible with the old, however, so applications need to be rewritten to take advantage of it.*

There are several notable differences between the two APIs:

• The new API favors abstract classes over interfaces, since these are easier to evolve. For example, you can add a method (with a default implementation) to an abstract class without breaking old implementations of the class. In the new API, the Mapper and Reducer interfaces are now abstract classes.
• The new API is in the org.apache.hadoop.mapreduce package (and subpackages). The old API can still be found in org.apache.hadoop.mapred.
• The new API makes extensive use of context objects that allow the user code to communicate with the MapReduce system. The MapContext, for example, essentially unifies the role of the JobConf, the OutputCollector, and the Reporter.
• The new API supports both a "push" and a "pull" style of iteration.
In both APIs, key-value record pairs are pushed to the mapper, but in addition, the new API allows a mapper to pull records from within the map() method. The same goes for the reducer. An example of how the “pull” style can be useful is processing records in batches, rather than one by one.

* The new API is not complete (or stable) in the 0.20 release series (the latest available at the time of writing). This book uses the old API for this reason. However, a copy of all of the examples in this book, rewritten to use the new API (for releases 0.21.0 and later), will be made available on the book’s website.

• Configuration has been unified. The old API has a special JobConf object for job configuration, which is an extension of Hadoop’s vanilla Configuration object (used for configuring daemons; see “The Configuration API” on page 130). In the new API, this distinction is dropped, so job configuration is done through a Configuration.

• Job control is performed through the Job class, rather than JobClient, which no longer exists in the new API.

• Output files are named slightly differently: part-m-nnnnn for map outputs, and part-r-nnnnn for reduce outputs (where nnnnn is an integer designating the part number, starting from zero).

Example 2-6 shows the MaxTemperature application rewritten to use the new API. The differences are highlighted in bold.

When converting your Mapper and Reducer classes to the new API, don’t forget to change the signatures of the map() and reduce() methods to the new form. Just changing your class to extend the new Mapper or Reducer classes will not produce a compilation error or warning, since these classes provide an identity form of the map() or reduce() method (respectively). Your mapper or reducer code, however, will not be invoked, which can lead to some hard-to-diagnose errors.

Example 2-6.
Application to find the maximum temperature in the weather dataset using the new context objects MapReduce API

public class NewMaxTemperature {

  static class NewMaxTemperatureMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {

      String line = value.toString();
      String year = line.substring(15, 19);
      int airTemperature;
      if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
        airTemperature = Integer.parseInt(line.substring(88, 92));
      } else {
        airTemperature = Integer.parseInt(line.substring(87, 92));
      }
      String quality = line.substring(92, 93);
      if (airTemperature != MISSING && quality.matches("[01459]")) {
        context.write(new Text(year), new IntWritable(airTemperature));
      }
    }
  }

  static class NewMaxTemperatureReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {

      int maxValue = Integer.MIN_VALUE;
      for (IntWritable value : values) {
        maxValue = Math.max(maxValue, value.get());
      }
      context.write(key, new IntWritable(maxValue));
    }
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: NewMaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(NewMaxTemperature.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(NewMaxTemperatureMapper.class);
    job.setReducerClass(NewMaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Scaling Out

You’ve seen how MapReduce works for small inputs; now it’s time to take a bird’s-eye view of the system and look at the data flow for large inputs. For simplicity, the examples so far have used files on the local filesystem. However, to scale out, we need to store the data in a distributed filesystem, typically HDFS (which you’ll learn about in the next chapter), to allow Hadoop to move the MapReduce computation to each machine hosting a part of the data. Let’s see how this works.

Data Flow

First, some terminology. A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.

There are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.

Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.

Having many splits means the time taken to process each split is small compared to the time to process the whole input. So if we are processing the splits in parallel, the processing is better load-balanced if the splits are small, since a faster machine will be able to process proportionally more splits over the course of the job than a slower machine. Even if the machines are identical, failed processes or other jobs running concurrently make load balancing desirable, and the quality of the load balancing increases as the splits become more fine-grained.

On the other hand, if splits are too small, then the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default, although this can be changed for the cluster (for all newly created files), or specified when each file is created.

Hadoop does its best to run the map task on a node where the input data resides in HDFS. This is called the data locality optimization.
It should now be clear why the optimal split size is the same as the block size: it is the largest size of input that can be guaranteed to be stored on a single node. If the split spanned two blocks, it would be unlikely that any HDFS node stored both blocks, so some of the split would have to be transferred across the network to the node running the map task, which is clearly less efficient than running the whole map task using local data.

Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate output: it’s processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. So storing it in HDFS, with replication, would be overkill. If the node running the map task fails before the map output has been consumed by the reduce task, then Hadoop will automatically rerun the map task on another node to re-create the map output.
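The split-sizing discussion above reduces to simple arithmetic. The following standalone sketch is illustrative only, not Hadoop code: the real split computation also honors per-file block boundaries and configurable minimum and maximum split sizes. It shows how the number of map tasks follows directly from the input size and split size.

```java
// Illustrative sketch: one map task per input split, using ceiling
// division so a final partial split still gets its own task.
public class SplitCountSketch {

    static long numSplits(long fileBytes, long splitBytes) {
        return (fileBytes + splitBytes - 1) / splitBytes; // ceiling division
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024;
        // A 1 GB file with the default 64 MB split size yields 16 map tasks.
        System.out.println(numSplits(1024 * MB, 64 * MB)); // 16
        // Halving the split size doubles the task count, giving the
        // scheduler more fine-grained units to balance across nodes.
        System.out.println(numSplits(1024 * MB, 32 * MB)); // 32
    }
}
```

Doubling the task count improves load balancing but also doubles per-task startup overhead, which is the trade-off the surrounding text describes.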

Reduce tasks don’t have the advantage of data locality—the input to a single reduce task is normally the output from all mappers. In the present example, we have a single reduce task that is fed by all of the map tasks. Therefore, the sorted map outputs have to be transferred across the network to the node where the reduce task is running, where they are merged and then passed to the user-defined reduce function. The output of the reduce is normally stored in HDFS for reliability. As explained in Chapter 3, for each HDFS block of the reduce output, the first replica is stored on the local node, with other replicas being stored on off-rack nodes. Thus, writing the reduce output does consume network bandwidth, but only as much as a normal HDFS write pipeline consumes.

The whole data flow with a single reduce task is illustrated in Figure 2-2. The dotted boxes indicate nodes, the light arrows show data transfers on a node, and the heavy arrows show data transfers between nodes.

Figure 2-2. MapReduce data flow with a single reduce task

The number of reduce tasks is not governed by the size of the input, but is specified independently. In “The Default MapReduce Job” on page 191, you will see how to choose the number of reduce tasks for a given job.

When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition. The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner—which buckets keys using a hash function—works very well.
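The default partitioner’s behavior can be sketched in a few lines. The class below is a standalone illustration modeled on the hash-partitioning scheme just described, not code copied from Hadoop: the essential property is that repeated calls with the same key always yield the same partition number.

```java
// Standalone sketch of hash partitioning: every record with a given key
// lands in the same partition, so one reduce task sees all of its values.
public class HashPartitionSketch {

    static int partitionFor(String key, int numReduceTasks) {
        // Mask off the sign bit so the result is always non-negative,
        // then take the remainder to get a partition in [0, numReduceTasks).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int p = partitionFor("1949", 2);
        // The same key always maps to the same partition...
        System.out.println(p == partitionFor("1949", 2)); // true
        // ...and the partition number is a valid reduce task index.
        System.out.println(p >= 0 && p < 2);              // true
    }
}
```

A user-defined partitioner replaces only the mapping from key to partition number; the guarantee that one partition goes to exactly one reduce task is unchanged.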

The data flow for the general case of multiple reduce tasks is illustrated in Figure 2-3. This diagram makes it clear why the data flow between map and reduce tasks is colloquially known as “the shuffle,” as each reduce task is fed by many map tasks. The shuffle is more complicated than this diagram suggests, and tuning it can have a big impact on job execution time, as you will see in “Shuffle and Sort” on page 177.

Figure 2-3. MapReduce data flow with multiple reduce tasks

Finally, it’s also possible to have zero reduce tasks. This can be appropriate when you don’t need the shuffle since the processing can be carried out entirely in parallel (a few examples are discussed in “NLineInputFormat” on page 211). In this case, the only off-node data transfer is when the map tasks write to HDFS (see Figure 2-4).

Combiner Functions

Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output—the combiner function’s output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer.

Figure 2-4. MapReduce data flow with no reduce tasks

The contract for the combiner function constrains the type of function that may be used. This is best illustrated with an example. Suppose that for the maximum temperature example, readings for the year 1950 were processed by two maps (because they were in different splits). Imagine the first map produced the output:

    (1950, 0)
    (1950, 20)
    (1950, 10)

And the second produced:

    (1950, 25)
    (1950, 15)

The reduce function would be called with a list of all the values:

    (1950, [0, 20, 10, 25, 15])

with output:

    (1950, 25)

since 25 is the maximum value in the list. We could use a combiner function that, just like the reduce function, finds the maximum temperature for each map output. The reduce would then be called with:

    (1950, [20, 25])

and the reduce would produce the same output as before. More succinctly, we may express the function calls on the temperature values in this case as follows:

    max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25

Not all functions possess this property.† For example, if we were calculating mean temperatures, then we couldn’t use the mean as our combiner function, since:

    mean(0, 20, 10, 25, 15) = 14

but:

    mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15

The combiner function doesn’t replace the reduce function. (How could it? The reduce function is still needed to process records with the same key from different maps.) But it can help cut down the amount of data shuffled between the maps and the reduces, and for this reason alone it is always worth considering whether you can use a combiner function in your MapReduce job.

Specifying a combiner function

Going back to the Java MapReduce program, the combiner function is defined using the Reducer interface, and for this application, it is the same implementation as the reducer function in MaxTemperatureReducer. The only change we need to make is to set the combiner class on the JobConf (see Example 2-7).

Example 2-7. Application to find the maximum temperature, using a combiner function for efficiency

public class MaxTemperatureWithCombiner {

  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperatureWithCombiner <input path> " +
          "<output path>");
      System.exit(-1);
    }

    JobConf conf = new JobConf(MaxTemperatureWithCombiner.class);
    conf.setJobName("Max temperature");

    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setCombinerClass(MaxTemperatureReducer.class);
    conf.setReducerClass(MaxTemperatureReducer.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}

† Functions with this property are called distributive in the paper “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals,” Gray et al. (1995).
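Returning to the mean example: although the mean itself is not distributive, a common workaround (not part of the book’s example; the names here are illustrative) is to have the combiner emit partial (sum, count) pairs, which the reducer then sums componentwise before dividing. The sketch below runs standalone, outside Hadoop, to show that the recovered mean matches the mean over all values.

```java
// Sketch: making a mean computation combiner-friendly by carrying
// (sum, count) pairs instead of partial means.
public class MeanCombinerSketch {

    // What the combiner would emit for one map's output values.
    static long[] partial(int... values) {
        long sum = 0;
        for (int v : values) sum += v;
        return new long[] { sum, values.length };
    }

    // What the reducer does with the partial pairs: it sums both
    // components before dividing, so the result equals the true mean.
    static double mean(long[]... partials) {
        long sum = 0, count = 0;
        for (long[] p : partials) {
            sum += p[0];
            count += p[1];
        }
        return (double) sum / count;
    }

    public static void main(String[] args) {
        // mean(0, 20, 10, 25, 15) = 14, recovered from two partials.
        System.out.println(mean(partial(0, 20, 10), partial(25, 15))); // 14.0
    }
}
```

The combining function on (sum, count) pairs is distributive even though the mean is not, which is why the trick satisfies the combiner contract.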

Running a Distributed MapReduce Job

The same program will run, without alteration, on a full dataset. This is the point of MapReduce: it scales to the size of your data and the size of your hardware. Here’s one data point: on a 10-node EC2 cluster running High-CPU Extra Large Instances, the program took six minutes to run.‡

We’ll go through the mechanics of running programs on a cluster in Chapter 5.

Hadoop Streaming

Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.

Streaming is naturally suited for text processing (although, as of version 0.21.0, it can handle binary streams, too), and when used in text mode, it has a line-oriented view of data. Map input data is passed over standard input to your map function, which processes it line by line and writes lines to standard output. A map output key-value pair is written as a single tab-delimited line. Input to the reduce function is in the same format—a tab-separated key-value pair—passed over standard input. The reduce function reads lines from standard input, which the framework guarantees are sorted by key, and writes its results to standard output.

Let’s illustrate this by rewriting our MapReduce program for finding maximum temperatures by year in Streaming.

Ruby

The map function can be expressed in Ruby as shown in Example 2-8.

Example 2-8. Map function for maximum temperature in Ruby

#!/usr/bin/env ruby

STDIN.each_line do |line|
  val = line
  year, temp, q = val[15, 4], val[87, 5], val[92, 1]
  puts "#{year}\t#{temp}" if (temp != "+9999" && q =~ /[01459]/)
end

‡ This is a factor of seven faster than the serial run on one machine using awk. The main reason it wasn’t proportionately faster is because the input data wasn’t evenly partitioned.
For convenience, the input files were gzipped by year, resulting in large files for later years in the dataset, when the number of weather records was much higher.

The program iterates over lines from standard input by executing a block for each line from STDIN (a global constant of type IO). The block pulls out the relevant fields from each input line and, if the temperature is valid, writes the year and the temperature separated by a tab character (\t) to standard output (using puts).

It’s worth drawing out a design difference between Streaming and the Java MapReduce API. The Java API is geared toward processing your map function one record at a time. The framework calls the map() method on your Mapper for each record in the input, whereas with Streaming the map program can decide how to process the input—for example, it could easily read and process multiple lines at a time since it’s in control of the reading. The user’s Java map implementation is “pushed” records, but it’s still possible to consider multiple lines at a time by accumulating previous lines in an instance variable in the Mapper.§ In this case, you need to implement the close() method so that you know when the last record has been read, so you can finish processing the last group of lines.

Since the script just operates on standard input and output, it’s trivial to test the script without using Hadoop, simply using Unix pipes:

    % cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb
    1950	+0000
    1950	+0022
    1950	-0011
    1949	+0111
    1949	+0078

The reduce function shown in Example 2-9 is a little more complex.

Example 2-9. Reduce function for maximum temperature in Ruby

#!/usr/bin/env ruby

last_key, max_val = nil, 0
STDIN.each_line do |line|
  key, val = line.split("\t")
  if last_key && last_key != key
    puts "#{last_key}\t#{max_val}"
    last_key, max_val = key, val.to_i
  else
    last_key, max_val = key, [max_val, val.to_i].max
  end
end
puts "#{last_key}\t#{max_val}" if last_key

§ Alternatively, you could use “pull” style processing in the new MapReduce API—see “The new Java MapReduce API” on page 25.

Again, the program iterates over lines from standard input, but this time we have to store some state as we process each key group. In this case, the keys are years, and we store the last key seen and the maximum temperature seen so far for that key. The MapReduce framework ensures that the keys are ordered, so we know that if a key is different from the previous one, we have moved into a new key group. In contrast to the Java API, where you are provided an iterator over each key group, in Streaming you have to find key group boundaries in your program.

For each line, we pull out the key and value, then if we’ve just finished a group (last_key && last_key != key), we write the key and the maximum temperature for that group, separated by a tab character, before resetting the maximum temperature for the new key. If we haven’t just finished a group, we just update the maximum temperature for the current key.

The last line of the program ensures that a line is written for the last key group in the input.

We can now simulate the whole MapReduce pipeline with a Unix pipeline (which is equivalent to the Unix pipeline shown in Figure 2-1):

    % cat input/ncdc/sample.txt | \
        ch02/src/main/ruby/max_temperature_map.rb | \
        sort | ch02/src/main/ruby/max_temperature_reduce.rb
    1949	111
    1950	22

The output is the same as the Java program, so the next step is to run it using Hadoop itself.

The hadoop command doesn’t support a Streaming option; instead, you specify the Streaming JAR file along with the jar option. Options to the Streaming program specify the input and output paths, and the map and reduce scripts.
This is what it looks like:

    % hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
        -input input/ncdc/sample.txt \
        -output output \
        -mapper ch02/src/main/ruby/max_temperature_map.rb \
        -reducer ch02/src/main/ruby/max_temperature_reduce.rb

When running on a large dataset on a cluster, we should set the combiner, using the -combiner option.

From release 0.21.0, the combiner can be any Streaming command. For earlier releases, the combiner had to be written in Java, so as a workaround it was common to do manual combining in the mapper, without having to resort to Java. In this case, we could change the mapper to be a pipeline:

    % hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
        -input input/ncdc/all \
        -output output \
        -mapper "ch02/src/main/ruby/max_temperature_map.rb | sort | \
            ch02/src/main/ruby/max_temperature_reduce.rb" \
        -reducer ch02/src/main/ruby/max_temperature_reduce.rb

      maxValue = std::max(maxValue,
          HadoopUtils::toInt(context.getInputValue()));
    }
    context.emit(context.getInputKey(), HadoopUtils::toString(maxValue));
  }
};

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(HadoopPipes::TemplateFactory<MaxTemperatureMapper,
                              MapTemperatureReducer>());
}

The application links against the Hadoop C++ library, which is a thin wrapper for communicating with the tasktracker child process. The map and reduce functions are defined by extending the Mapper and Reducer classes defined in the HadoopPipes namespace and providing implementations of the map() and reduce() methods in each case. These methods take a context object (of type MapContext or ReduceContext), which provides the means for reading input and writing output, as well as accessing job configuration information via the JobConf class. The processing in this example is very similar to the Java equivalent.

Unlike the Java interface, keys and values in the C++ interface are byte buffers, represented as Standard Template Library (STL) strings. This makes the interface simpler, although it does put a slightly greater burden on the application developer, who has to convert to and from richer domain-level types. This is evident in MapTemperatureReducer, where we have to convert the input value into an integer (using a convenience method in HadoopUtils) and then the maximum value back into a string before it’s written out. In some cases, we can save on doing the conversion, such as in MaxTemperatureMapper, where the airTemperature value is never converted to an integer since it is never processed as a number in the map() method.

The main() method is the application entry point. It calls HadoopPipes::runTask, which connects to the Java parent process and marshals data to and from the Mapper or Reducer. The runTask() method is passed a Factory so that it can create instances of the Mapper or Reducer. Which one it creates is controlled by the Java parent over the socket connection.
There are overloaded template factory methods for setting a combiner, partitioner, record reader, or record writer.

Compiling and Running

Now we can compile and link our program using the Makefile in Example 2-13.

Example 2-13. Makefile for C++ MapReduce program

CC = g++
CPPFLAGS = -m32 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include

max_temperature: max_temperature.cpp
	$(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib \
	-lhadooppipes -lhadooputils -lpthread -g -O2 -o $@

The Makefile expects a couple of environment variables to be set. Apart from HADOOP_INSTALL (which you should already have set if you followed the installation instructions in Appendix A), you need to define PLATFORM, which specifies the operating system, architecture, and data model (e.g., 32- or 64-bit). I ran it on a 32-bit Linux system with the following:

    % export PLATFORM=Linux-i386-32
    % make

On successful completion, you’ll find the max_temperature executable in the current directory.

To run a Pipes job, we need to run Hadoop in pseudo-distributed mode (where all the daemons run on the local machine), for which there are setup instructions in Appendix A. Pipes doesn’t run in standalone (local) mode, since it relies on Hadoop’s distributed cache mechanism, which works only when HDFS is running.

With the Hadoop daemons now running, the first step is to copy the executable to HDFS so that it can be picked up by tasktrackers when they launch map and reduce tasks:

    % hadoop fs -put max_temperature bin/max_temperature

The sample data also needs to be copied from the local filesystem into HDFS:

    % hadoop fs -put input/ncdc/sample.txt sample.txt

Now we can run the job. For this, we use the Hadoop pipes command, passing the URI of the executable in HDFS using the -program argument:

    % hadoop pipes \
        -D hadoop.pipes.java.recordreader=true \
        -D hadoop.pipes.java.recordwriter=true \
        -input sample.txt \
        -output output \
        -program bin/max_temperature

We specify two properties using the -D option: hadoop.pipes.java.recordreader and hadoop.pipes.java.recordwriter, setting both to true to say that we have not specified a C++ record reader or writer, but that we want to use the default Java ones (which are for text input and output). Pipes also allows you to set a Java mapper, reducer, combiner, or partitioner. In fact, you can have a mixture of Java or C++ classes within any one job.

The result is the same as the other versions of the same program that we ran.

CHAPTER 3

The Hadoop Distributed Filesystem

When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage the storage across a network of machines are called distributed filesystems. Since they are network-based, all the complications of network programming kick in, thus making distributed filesystems more complex than regular disk filesystems. For example, one of the biggest challenges is making the filesystem tolerate node failure without suffering data loss.

Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem. (You may sometimes see references to “DFS”—informally or in older documentation or configurations—which is the same thing.) HDFS is Hadoop’s flagship filesystem and is the focus of this chapter, but Hadoop actually has a general-purpose filesystem abstraction, so we’ll see along the way how Hadoop integrates with other storage systems (such as the local filesystem and Amazon S3).

The Design of HDFS

HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.* Let’s examine this statement in more detail:

Very large files
  “Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.†

* The architecture of HDFS is described in “The Hadoop Distributed File System” by Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler (Proceedings of MSST2010, May 2010, http://storageconference.org/2010/Papers/MSST/Shvachko.pdf).
† “Scaling Hadoop to 4000 nodes at Yahoo!,” http://developer.yahoo.net/blogs/hadoop/2008/09/scaling_hadoop_to_4000_nodes_a.html.

Streaming data access
  HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.

Commodity hardware
  Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s designed to run on clusters of commodity hardware (commonly available hardware available from multiple vendors‡) for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.

It is also worth examining the applications for which using HDFS does not work so well. While this may change in the future, these are areas where HDFS is not a good fit today:

Low-latency data access
  Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. Remember, HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency. HBase (Chapter 13) is currently a better choice for low-latency access.

Lots of small files
  Since the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes. So, for example, if you had one million files, each taking one block, you would need at least 300 MB of memory. While storing millions of files is feasible, billions is beyond the capability of current hardware.§

Multiple writers, arbitrary file modifications
  Files in HDFS may be written to by a single writer. Writes are always made at the end of the file.
There is no support for multiple writers, or for modifications at arbitrary offsets in the file. (These might be supported in the future, but they are likely to be relatively inefficient.)

‡ See Chapter 9 for a typical machine specification.
§ For an in-depth exposition of the scalability limits of HDFS, see Konstantin V. Shvachko’s “Scalability of the Hadoop Distributed File System” (http://developer.yahoo.net/blogs/hadoop/2010/05/scalability_of_the_hadoop_dist.html) and the companion paper “HDFS Scalability: The limits to growth” (April 2010, pp. 6–16, http://www.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf) by the same author.
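The 150-bytes-per-object rule of thumb from the “Lots of small files” discussion above can be turned into a quick estimate. The sketch below is illustrative arithmetic only, not a Hadoop API; it ignores directory objects and any per-replica bookkeeping, so treat the result as a lower bound.

```java
// Rough namenode memory estimate: ~150 bytes per file object plus
// ~150 bytes per block object, per the rule of thumb in the text.
public class NamenodeMemorySketch {

    static long estimateBytes(long files, long blocksPerFile) {
        final long BYTES_PER_OBJECT = 150; // rule of thumb
        long objects = files + files * blocksPerFile; // file + block objects
        return objects * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // One million single-block files => about 300 MB of namenode heap.
        System.out.println(estimateBytes(1_000_000L, 1)); // 300000000
    }
}
```

Scaling the same estimate to a billion single-block files would demand hundreds of gigabytes of namenode memory, which is why billions of small files are described as beyond current hardware.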

HDFS Concepts

Blocks

A disk has a block size, which is the minimum amount of data that it can read or write. Filesystems for a single disk build on this by dealing with data in blocks, which are an integral multiple of the disk block size. Filesystem blocks are typically a few kilobytes in size, while disk blocks are normally 512 bytes. This is generally transparent to the filesystem user who is simply reading or writing a file—of whatever length. However, there are tools to perform filesystem maintenance, such as df and fsck, that operate on the filesystem block level.

HDFS, too, has the concept of a block, but it is a much larger unit—64 MB by default. Like in a filesystem for a single disk, files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage. When unqualified, the term “block” in this book refers to a block in HDFS.

Why Is a Block in HDFS So Large?

HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be made to be significantly larger than the time to seek to the start of the block. Thus the time to transfer a large file made of multiple blocks operates at the disk transfer rate.

A quick calculation shows that if the seek time is around 10 ms and the transfer rate is 100 MB/s, then to make the seek time 1% of the transfer time, we need to make the block size around 100 MB. The default is actually 64 MB, although many HDFS installations use 128 MB blocks. This figure will continue to be revised upward as transfer speeds grow with new generations of disk drives.

This argument shouldn’t be taken too far, however.
Map tasks in MapReduce normally operate on one block at a time, so if you have too few tasks (fewer than nodes in the cluster), your jobs will run slower than they could otherwise.

Having a block abstraction for a distributed filesystem brings several benefits. The first benefit is the most obvious: a file can be larger than any single disk in the network. There’s nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster. In fact, it would be possible, if unusual, to store a single file on an HDFS cluster whose blocks filled all the disks in the cluster.
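The seek-cost calculation in the sidebar above can be written down directly. This is illustrative arithmetic rather than any Hadoop API: pick a target fraction of time to spend seeking, and the minimum block size follows from the seek time and transfer rate.

```java
// Seek-cost arithmetic from the sidebar: choose a block size so that
// seek time is a target fraction of transfer time.
public class BlockSizeSketch {

    static double blockSizeMB(double seekMs, double transferMBperSec,
                              double seekFraction) {
        // Transfer time must be seekMs / seekFraction; size = time * rate.
        double transferSec = (seekMs / 1000.0) / seekFraction;
        return transferSec * transferMBperSec;
    }

    public static void main(String[] args) {
        // 10 ms seek, 100 MB/s transfer, seek held to 1% of transfer time
        // gives the ~100 MB figure quoted in the sidebar.
        System.out.println(blockSizeMB(10, 100, 0.01)); // 100.0
    }
}
```

Doubling the transfer rate (as newer drives do) doubles the block size needed to keep seeks at 1%, which is why the default keeps being revised upward.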

Second, making the unit of abstraction a block rather than a file simplifies the storage subsystem. Simplicity is something to strive for in all systems, but it is especially important for a distributed system in which the failure modes are so varied. The storage subsystem deals with blocks, simplifying storage management (since blocks are a fixed size, it is easy to calculate how many can be stored on a given disk) and eliminating metadata concerns (blocks are just a chunk of data to be stored—file metadata such as permissions information does not need to be stored with the blocks, so another system can handle metadata separately).

Furthermore, blocks fit well with replication for providing fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client. A block that is no longer available due to corruption or machine failure can be replicated from its alternative locations to other live machines to bring the replication factor back to the normal level. (See “Data Integrity” on page 75 for more on guarding against corrupt data.) Similarly, some applications may choose to set a high replication factor for the blocks in a popular file to spread the read load on the cluster.

Like its disk filesystem cousin, HDFS’s fsck command understands blocks. For example, running:

    % hadoop fsck / -files -blocks

will list the blocks that make up each file in the filesystem. (See also “Filesystem check (fsck)” on page 301.)

Namenodes and Datanodes

An HDFS cluster has two types of node operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers). The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree.
This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently, since this information is reconstructed from datanodes when the system starts.

A client accesses the filesystem on behalf of the user by communicating with the namenode and datanodes. The client presents a POSIX-like filesystem interface, so the user code does not need to know about the namenode and datanode to function.

Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.
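The directories in which the namenode keeps this persistent state are configurable. As an illustrative sketch (the dfs.replication-style property name dfs.name.dir is real for Hadoop releases of this vintage, but the paths here are invented), hdfs-site.xml might contain:

```xml
<!-- Illustrative only: dfs.name.dir takes a comma-separated list of
     directories; the namenode writes its namespace image and edit log
     to every directory in the list. -->
<property>
  <name>dfs.name.dir</name>
  <value>/disk1/hdfs/name,/remote/hdfs/name</value>
</property>
```

Listing more than one directory, such as a local disk plus an NFS mount, makes the namenode keep a copy of its metadata in each, which is the basis of the metadata-backup strategy discussed later in this chapter.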

Without the namenode, the filesystem cannot be used. In fact, if the machine running the namenode were obliterated, all the files on the filesystem would be lost since there would be no way of knowing how to reconstruct the files from the blocks on the datanodes. For this reason, it is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this.

The first way is to back up the files that make up the persistent state of the filesystem metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems. These writes are synchronous and atomic. The usual configuration choice is to write to local disk as well as a remote NFS mount.

It is also possible to run a secondary namenode, which despite its name does not act as a namenode. Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. The secondary namenode usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing. However, the state of the secondary namenode lags that of the primary, so in the event of total failure of the primary, data loss is almost certain. The usual course of action in this case is to copy the namenode's metadata files that are on NFS to the secondary and run it as the new primary.

See "The filesystem image and edit log" on page 294 for more details.

The Command-Line Interface

We're going to have a look at HDFS by interacting with it from the command line. There are many other interfaces to HDFS, but the command line is one of the simplest and, to many developers, the most familiar.

We are going to run HDFS on one machine, so first follow the instructions for setting up Hadoop in pseudo-distributed mode in Appendix A.
Later you'll see how to run HDFS on a cluster of machines for scalability and fault tolerance.

There are two properties that we set in the pseudo-distributed configuration that deserve further explanation. The first is fs.default.name, set to hdfs://localhost/, which is used to set a default filesystem for Hadoop. Filesystems are specified by a URI, and here we have used an hdfs URI to configure Hadoop to use HDFS by default. The HDFS daemons will use this property to determine the host and port for the HDFS namenode. We'll be running it on localhost, on the default HDFS port, 8020. And HDFS clients will use this property to work out where the namenode is running so they can connect to it.
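Concretely, the pseudo-distributed core-site.xml from Appendix A contains an entry along these lines (shown here as an illustrative sketch rather than a verbatim copy):

```xml
<!-- core-site.xml: make HDFS running on localhost the default filesystem.
     Omitting the port from the URI selects the default namenode port, 8020. -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost/</value>
</property>
```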

We set the second property, dfs.replication, to 1 so that HDFS doesn't replicate filesystem blocks by the default factor of three. When running with a single datanode, HDFS can't replicate blocks to three datanodes, so it would perpetually warn about blocks being under-replicated. This setting solves that problem.

Basic Filesystem Operations

The filesystem is ready to be used, and we can do all of the usual filesystem operations such as reading files, creating directories, moving files, deleting data, and listing directories. You can type hadoop fs -help to get detailed help on every command.

Start by copying a file from the local filesystem to HDFS:

    % hadoop fs -copyFromLocal input/docs/quangle.txt hdfs://localhost/user/tom/quangle.txt

This command invokes Hadoop's filesystem shell command fs, which supports a number of subcommands; in this case, we are running -copyFromLocal. The local file quangle.txt is copied to the file /user/tom/quangle.txt on the HDFS instance running on localhost. In fact, we could have omitted the scheme and host of the URI and picked up the default, hdfs://localhost, as specified in core-site.xml:

    % hadoop fs -copyFromLocal input/docs/quangle.txt /user/tom/quangle.txt

We could also have used a relative path and copied the file to our home directory in HDFS, which in this case is /user/tom:

    % hadoop fs -copyFromLocal input/docs/quangle.txt quangle.txt

Let's copy the file back to the local filesystem and check whether it's the same:

    % hadoop fs -copyToLocal quangle.txt quangle.copy.txt
    % md5 input/docs/quangle.txt quangle.copy.txt
    MD5 (input/docs/quangle.txt) = a16f231da6b05e2ba7a339320e7dacd9
    MD5 (quangle.copy.txt) = a16f231da6b05e2ba7a339320e7dacd9

The MD5 digests are the same, showing that the file survived its trip to HDFS and is back intact.

Finally, let's look at an HDFS file listing. We create a directory first just to see how it is displayed in the listing:

    % hadoop fs -mkdir books
    % hadoop fs -ls .
    Found 2 items
    drwxr-xr-x   - tom supergroup          0 2009-04-02 22:41 /user/tom/books
    -rw-r--r--   1 tom supergroup        118 2009-04-02 22:29 /user/tom/quangle.txt

The information returned is very similar to that of the Unix command ls -l, with a few minor differences. The first column shows the file mode. The second column is the replication factor of the file (something a traditional Unix filesystem does not have). Remember we set the default replication factor in the site-wide configuration to be 1, which is why we see the same value here. The entry in this column is empty for directories since the

concept of replication does not apply to them: directories are treated as metadata and stored by the namenode, not the datanodes. The third and fourth columns show the file owner and group. The fifth column is the size of the file in bytes, or zero for directories. The sixth and seventh columns are the last modified date and time. Finally, the eighth column is the absolute name of the file or directory.

File Permissions in HDFS

HDFS has a permissions model for files and directories that is much like POSIX. There are three types of permission: the read permission (r), the write permission (w), and the execute permission (x). The read permission is required to read files or list the contents of a directory. The write permission is required to write a file, or for a directory, to create or delete files or directories in it. The execute permission is ignored for a file since you can't execute a file on HDFS (unlike POSIX), and for a directory it is required to access its children.

Each file and directory has an owner, a group, and a mode. The mode is made up of the permissions for the user who is the owner, the permissions for the users who are members of the group, and the permissions for users who are neither the owners nor members of the group.

By default, a client's identity is determined by the username and groups of the process it is running in. Because clients are remote, this makes it possible to become an arbitrary user, simply by creating an account of that name on the remote system. Thus, permissions should be used only in a cooperative community of users, as a mechanism for sharing filesystem resources and for avoiding accidental data loss, and not for securing resources in a hostile environment. (Note, however, that the latest versions of Hadoop support Kerberos authentication, which removes these restrictions; see "Security" on page 281.) Despite these limitations, it is worthwhile having permissions enabled (as it is by default; see the dfs.permissions property), to avoid accidental modification or deletion of substantial parts of the filesystem, either by users or by automated tools or programs.

When permissions checking is enabled, the owner permissions are checked if the client's username matches the owner, and the group permissions are checked if the client is a member of the group; otherwise, the other permissions are checked.

There is a concept of a super-user, which is the identity of the namenode process. Permissions checks are not performed for the super-user.

Hadoop Filesystems

Hadoop has an abstract notion of filesystem, of which HDFS is just one implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop, and there are several concrete implementations, which are described in Table 3-1.

Table 3-1. Hadoop filesystems (Java implementations are all under org.apache.hadoop)

Local (URI scheme: file; Java implementation: fs.LocalFileSystem). A filesystem for a locally connected disk with client-side checksums. Use RawLocalFileSystem for a local filesystem with no checksums. See "LocalFileSystem" on page 76.

HDFS (URI scheme: hdfs; Java implementation: hdfs.DistributedFileSystem). Hadoop's distributed filesystem. HDFS is designed to work efficiently in conjunction with MapReduce.

HFTP (URI scheme: hftp; Java implementation: hdfs.HftpFileSystem). A filesystem providing read-only access to HDFS over HTTP. (Despite its name, HFTP has no connection with FTP.) Often used with distcp (see "Parallel Copying with distcp" on page 70) to copy data between HDFS clusters running different versions.

HSFTP (URI scheme: hsftp; Java implementation: hdfs.HsftpFileSystem). A filesystem providing read-only access to HDFS over HTTPS. (Again, this has no connection with FTP.)

HAR (URI scheme: har; Java implementation: fs.HarFileSystem). A filesystem layered on another filesystem for archiving files. Hadoop Archives are typically used for archiving files in HDFS to reduce the namenode's memory usage. See "Hadoop Archives" on page 71.

KFS (CloudStore) (URI scheme: kfs; Java implementation: fs.kfs.KosmosFileSystem). CloudStore (formerly Kosmos filesystem) is a distributed filesystem like HDFS or Google's GFS, written in C++. Find more information about it at http://kosmosfs.sourceforge.net/.

FTP (URI scheme: ftp; Java implementation: fs.ftp.FTPFileSystem). A filesystem backed by an FTP server.

S3 (native) (URI scheme: s3n; Java implementation: fs.s3native.NativeS3FileSystem). A filesystem backed by Amazon S3. See http://wiki.apache.org/hadoop/AmazonS3.

S3 (block-based) (URI scheme: s3; Java implementation: fs.s3.S3FileSystem). A filesystem backed by Amazon S3, which stores files in blocks (much like HDFS) to overcome S3's 5 GB file size limit.

Hadoop provides many interfaces to its filesystems, and it generally uses the URI scheme to pick the correct filesystem instance to communicate with. For example, the filesystem shell that we met in the previous section operates with all Hadoop filesystems.
To list the files in the root directory of the local filesystem, type:

    % hadoop fs -ls file:///

Although it is possible (and sometimes very convenient) to run MapReduce programs that access any of these filesystems, when you are processing large volumes of data, you should choose a distributed filesystem that has the data locality optimization, such as HDFS or KFS (see "Scaling Out" on page 27).

Interfaces

Hadoop is written in Java, and all Hadoop filesystem interactions are mediated through the Java API.‖ The filesystem shell, for example, is a Java application that uses the Java FileSystem class to provide filesystem operations. The other filesystem interfaces are discussed briefly in this section. These interfaces are most commonly used with HDFS, since the other filesystems in Hadoop typically have existing tools to access the underlying filesystem (FTP clients for FTP, S3 tools for S3, etc.), but many of them will work with any Hadoop filesystem.

Thrift

By exposing its filesystem interface as a Java API, Hadoop makes it awkward for non-Java applications to access Hadoop filesystems. The Thrift API in the "thriftfs" contrib module remedies this deficiency by exposing Hadoop filesystems as an Apache Thrift service, making it easy for any language that has Thrift bindings to interact with a Hadoop filesystem, such as HDFS.

To use the Thrift API, run a Java server that exposes the Thrift service and acts as a proxy to the Hadoop filesystem. Your application accesses the Thrift service, which is typically running on the same machine as your application.

The Thrift API comes with a number of pregenerated stubs for a variety of languages, including C++, Perl, PHP, Python, and Ruby. Thrift has support for versioning, so it's a good choice if you want to access different versions of a Hadoop filesystem from the same client code (you will need to run a proxy for each version of Hadoop to achieve this, however).

For installation and usage instructions, please refer to the documentation in the src/contrib/thriftfs directory of the Hadoop distribution.

C

Hadoop provides a C library called libhdfs that mirrors the Java FileSystem interface (it was written as a C library for accessing HDFS, but despite its name it can be used to access any Hadoop filesystem).
It works using the Java Native Interface (JNI) to call a Java filesystem client.

The C API is very similar to the Java one, but it typically lags the Java one, so newer features may not be supported. You can find the generated documentation for the C API in the libhdfs/docs/api directory of the Hadoop distribution.

‖ The RPC interfaces in Hadoop are based on Hadoop's Writable interface, which is Java-centric. In the future, Hadoop will adopt Avro, a cross-language RPC framework, which will allow native HDFS clients to be written in languages other than Java.

Hadoop comes with prebuilt libhdfs binaries for 32-bit Linux, but for other platforms, you will need to build them yourself using the instructions at http://wiki.apache.org/hadoop/LibHDFS.

FUSE

Filesystem in Userspace (FUSE) allows filesystems that are implemented in user space to be integrated as a Unix filesystem. Hadoop's Fuse-DFS contrib module allows any Hadoop filesystem (but typically HDFS) to be mounted as a standard filesystem. You can then use Unix utilities (such as ls and cat) to interact with the filesystem, as well as POSIX libraries to access the filesystem from any programming language.

Fuse-DFS is implemented in C using libhdfs as the interface to HDFS. Documentation for compiling and running Fuse-DFS is located in the src/contrib/fuse-dfs directory of the Hadoop distribution.

WebDAV

WebDAV is a set of extensions to HTTP to support editing and updating files. WebDAV shares can be mounted as filesystems on most operating systems, so by exposing HDFS (or other Hadoop filesystems) over WebDAV, it's possible to access HDFS as a standard filesystem.

At the time of this writing, WebDAV support in Hadoop (which is implemented by calling the Java API to Hadoop) is still under development, and can be tracked at https://issues.apache.org/jira/browse/HADOOP-496.

Other HDFS Interfaces

There are two interfaces that are specific to HDFS:

HTTP
    HDFS defines a read-only interface for retrieving directory listings and data over HTTP. Directory listings are served by the namenode's embedded web server (which runs on port 50070) in XML format, while file data is streamed from datanodes by their web servers (running on port 50075). This protocol is not tied to a specific HDFS version, making it possible to write clients that can use HTTP to read data from HDFS clusters that run different versions of Hadoop.
    HftpFileSystem is such a client: it is a Hadoop filesystem that talks to HDFS over HTTP (HsftpFileSystem is the HTTPS variant).

FTP
    Although not complete at the time of this writing (https://issues.apache.org/jira/browse/HADOOP-3199), there is an FTP interface to HDFS, which permits the use of the FTP protocol to interact with HDFS. This interface is a convenient way to transfer data into and out of HDFS using existing FTP clients.

    The FTP interface to HDFS is not to be confused with FTPFileSystem, which exposes any FTP server as a Hadoop filesystem.

The Java Interface

In this section, we dig into the Hadoop FileSystem class: the API for interacting with one of Hadoop's filesystems.# While we focus mainly on the HDFS implementation, DistributedFileSystem, in general you should strive to write your code against the FileSystem abstract class, to retain portability across filesystems. This is very useful when testing your program, for example, since you can rapidly run tests using data stored on the local filesystem.

Reading Data from a Hadoop URL

One of the simplest ways to read a file from a Hadoop filesystem is by using a java.net.URL object to open a stream to read the data from. The general idiom is:

    InputStream in = null;
    try {
        in = new URL("hdfs://host/path").openStream();
        // process in
    } finally {
        IOUtils.closeStream(in);
    }

There's a little bit more work required to make Java recognize Hadoop's hdfs URL scheme. This is achieved by calling the setURLStreamHandlerFactory method on URL with an instance of FsUrlStreamHandlerFactory. This method can only be called once per JVM, so it is typically executed in a static block. This limitation means that if some other part of your program (perhaps a third-party component outside your control) sets a URLStreamHandlerFactory, you won't be able to use this approach for reading data from Hadoop. The next section discusses an alternative.

Example 3-1 shows a program for displaying files from Hadoop filesystems on standard output, like the Unix cat command.
Example 3-1. Displaying files from a Hadoop filesystem on standard output using a URLStreamHandler

    public class URLCat {

        static {
            URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
        }

# From release 0.21.0, there is a new filesystem interface called FileContext with better handling of multiple filesystems (so a single FileContext can resolve multiple filesystem schemes, for example) and a cleaner, more consistent interface.

        public static void main(String[] args) throws Exception {
            InputStream in = null;
            try {
                in = new URL(args[0]).openStream();
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }

We make use of the handy IOUtils class that comes with Hadoop for closing the stream in the finally clause, and also for copying bytes between the input stream and the output stream (System.out in this case). The last two arguments to the copyBytes method are the buffer size used for copying and whether to close the streams when the copy is complete. We close the input stream ourselves, and System.out doesn't need to be closed.

Here's a sample run:*

    % hadoop URLCat hdfs://localhost/user/tom/quangle.txt
    On the top of the Crumpetty Tree
    The Quangle Wangle sat,
    But his face you could not see,
    On account of his Beaver Hat.

Reading Data Using the FileSystem API

As the previous section explained, sometimes it is impossible to set a URLStreamHandlerFactory for your application. In this case, you will need to use the FileSystem API to open an input stream for a file.

A file in a Hadoop filesystem is represented by a Hadoop Path object (and not a java.io.File object, since its semantics are too closely tied to the local filesystem). You can think of a Path as a Hadoop filesystem URI, such as hdfs://localhost/user/tom/quangle.txt.

FileSystem is a general filesystem API, so the first step is to retrieve an instance for the filesystem we want to use (HDFS in this case). There are two static factory methods for getting a FileSystem instance:

    public static FileSystem get(Configuration conf) throws IOException
    public static FileSystem get(URI uri, Configuration conf) throws IOException

A Configuration object encapsulates a client or server's configuration, which is set using configuration files read from the classpath, such as conf/core-site.xml.
The first method returns the default filesystem (as specified in the file conf/core-site.xml, or the default local filesystem if not specified there). The second uses the given URI's scheme and

* The text is from The Quangle Wangle's Hat by Edward Lear.

authority to determine the filesystem to use, falling back to the default filesystem if no scheme is specified in the given URI.

With a FileSystem instance in hand, we invoke an open() method to get the input stream for a file:

    public FSDataInputStream open(Path f) throws IOException
    public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException

The first method uses a default buffer size of 4 K.

Putting this together, we can rewrite Example 3-1 as shown in Example 3-2.

Example 3-2. Displaying files from a Hadoop filesystem on standard output by using the FileSystem directly

    public class FileSystemCat {
        public static void main(String[] args) throws Exception {
            String uri = args[0];
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            InputStream in = null;
            try {
                in = fs.open(new Path(uri));
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }

The program runs as follows:

    % hadoop FileSystemCat hdfs://localhost/user/tom/quangle.txt
    On the top of the Crumpetty Tree
    The Quangle Wangle sat,
    But his face you could not see,
    On account of his Beaver Hat.

FSDataInputStream

The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io class. This class is a specialization of java.io.DataInputStream with support for random access, so you can read from any part of the stream:

    package org.apache.hadoop.fs;

    public class FSDataInputStream extends DataInputStream
            implements Seekable, PositionedReadable {
        // implementation elided
    }

The Seekable interface permits seeking to a position in the file and provides a query method for the current offset from the start of the file (getPos()):

    public interface Seekable {
        void seek(long pos) throws IOException;
        long getPos() throws IOException;
    }

Calling seek() with a position that is greater than the length of the file will result in an IOException. Unlike the skip() method of java.io.InputStream, which positions the stream at a point later than the current position, seek() can move to an arbitrary, absolute position in the file.

Example 3-3 is a simple extension of Example 3-2 that writes a file to standard out twice: after writing it once, it seeks to the start of the file and streams through it once again.

Example 3-3. Displaying files from a Hadoop filesystem on standard output twice, by using seek

    public class FileSystemDoubleCat {
        public static void main(String[] args) throws Exception {
            String uri = args[0];
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(new Path(uri));
                IOUtils.copyBytes(in, System.out, 4096, false);
                in.seek(0); // go back to the start of the file
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }

Here's the result of running it on a small file:

    % hadoop FileSystemDoubleCat hdfs://localhost/user/tom/quangle.txt
    On the top of the Crumpetty Tree
    The Quangle Wangle sat,
    But his face you could not see,
    On account of his Beaver Hat.
    On the top of the Crumpetty Tree
    The Quangle Wangle sat,
    But his face you could not see,
    On account of his Beaver Hat.

FSDataInputStream also implements the PositionedReadable interface for reading parts of a file at a given offset:

    public interface PositionedReadable {

        public int read(long position, byte[] buffer, int offset, int length)

            throws IOException;

        public void readFully(long position, byte[] buffer, int offset, int length)
            throws IOException;

        public void readFully(long position, byte[] buffer) throws IOException;
    }

The read() method reads up to length bytes from the given position in the file into the buffer at the given offset in the buffer. The return value is the number of bytes actually read: callers should check this value, as it may be less than length. The readFully() methods will read length bytes into the buffer (or buffer.length bytes for the version that just takes a byte array buffer), unless the end of the file is reached, in which case an EOFException is thrown.

All of these methods preserve the current offset in the file and are thread-safe, so they provide a convenient way to access another part of the file (metadata, perhaps) while reading the main body of the file. In fact, they are just implemented using the Seekable interface using the following pattern:

    long oldPos = getPos();
    try {
        seek(position);
        // read data
    } finally {
        seek(oldPos);
    }

Finally, bear in mind that calling seek() is a relatively expensive operation and should be used sparingly. You should structure your application access patterns to rely on streaming data (by using MapReduce, for example) rather than performing a large number of seeks.

Writing Data

The FileSystem class has a number of methods for creating a file. The simplest is the method that takes a Path object for the file to be created and returns an output stream to write to:

    public FSDataOutputStream create(Path f) throws IOException

There are overloaded versions of this method that allow you to specify whether to forcibly overwrite existing files, the replication factor of the file, the buffer size to use when writing the file, the block size for the file, and file permissions.

The create() methods create any parent directories of the file to be written that don't already exist. Though convenient, this behavior may be unexpected.
If you want the write to fail if the parent directory doesn't exist, you should check for the existence of the parent directory first by calling the exists() method.

There's also an overloaded method for passing a callback interface, Progressable, so your application can be notified of the progress of the data being written to the datanodes:

    package org.apache.hadoop.util;

    public interface Progressable {
        public void progress();
    }

As an alternative to creating a new file, you can append to an existing file using the append() method (there are also some other overloaded versions):

    public FSDataOutputStream append(Path f) throws IOException

The append operation allows a single writer to modify an already written file by opening it and writing data from the final offset in the file. With this API, applications that produce unbounded files, such as logfiles, can write to an existing file after a restart, for example. The append operation is optional and not implemented by all Hadoop filesystems. For example, HDFS supports append, but S3 filesystems don't.

Example 3-4 shows how to copy a local file to a Hadoop filesystem. We illustrate progress by printing a period every time the progress() method is called by Hadoop, which is after each 64 K packet of data is written to the datanode pipeline. (Note that this particular behavior is not specified by the API, so it is subject to change in later versions of Hadoop. The API merely allows you to infer that "something is happening.")

Example 3-4.
Copying a local file to a Hadoop filesystem

    public class FileCopyWithProgress {
        public static void main(String[] args) throws Exception {
            String localSrc = args[0];
            String dst = args[1];

            InputStream in = new BufferedInputStream(new FileInputStream(localSrc));

            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(dst), conf);
            OutputStream out = fs.create(new Path(dst), new Progressable() {
                public void progress() {
                    System.out.print(".");
                }
            });

            IOUtils.copyBytes(in, out, 4096, true);
        }
    }

Typical usage:

    % hadoop FileCopyWithProgress input/docs/1400-8.txt hdfs://localhost/user/tom/1400-8.txt
    ...............

Currently, none of the other Hadoop filesystems call progress() during writes. Progress is important in MapReduce applications, as you will see in later chapters.

FSDataOutputStream

The create() method on FileSystem returns an FSDataOutputStream, which, like FSDataInputStream, has a method for querying the current position in the file:

    package org.apache.hadoop.fs;

    public class FSDataOutputStream extends DataOutputStream implements Syncable {

        public long getPos() throws IOException {
            // implementation elided
        }

        // implementation elided
    }

However, unlike FSDataInputStream, FSDataOutputStream does not permit seeking. This is because HDFS allows only sequential writes to an open file or appends to an already written file. In other words, there is no support for writing to anywhere other than the end of the file, so there is no value in being able to seek while writing.

Directories

FileSystem provides a method to create a directory:

    public boolean mkdirs(Path f) throws IOException

This method creates all of the necessary parent directories if they don't already exist, just like java.io.File's mkdirs() method. It returns true if the directory (and all parent directories) was successfully created.

Often, you don't need to explicitly create a directory, since writing a file by calling create() will automatically create any parent directories.

Querying the Filesystem

File metadata: FileStatus

An important feature of any filesystem is the ability to navigate its directory structure and retrieve information about the files and directories that it stores. The FileStatus class encapsulates filesystem metadata for files and directories, including file length, block size, replication, modification time, ownership, and permission information.

The method getFileStatus() on FileSystem provides a way of getting a FileStatus object for a single file or directory. Example 3-5 shows an example of its use.

        assertThat(stat.getReplication(), is((short) 0));
        assertThat(stat.getBlockSize(), is(0L));
        assertThat(stat.getOwner(), is("tom"));
        assertThat(stat.getGroup(), is("supergroup"));
        assertThat(stat.getPermission().toString(), is("rwxr-xr-x"));
      }
    }

If no file or directory exists, a FileNotFoundException is thrown. However, if you are interested only in the existence of a file or directory, the exists() method on FileSystem is more convenient:

    public boolean exists(Path f) throws IOException

Listing files

Finding information on a single file or directory is useful, but you also often need to be able to list the contents of a directory. That's what FileSystem's listStatus() methods are for:

    public FileStatus[] listStatus(Path f) throws IOException
    public FileStatus[] listStatus(Path f, PathFilter filter) throws IOException
    public FileStatus[] listStatus(Path[] files) throws IOException
    public FileStatus[] listStatus(Path[] files, PathFilter filter) throws IOException

When the argument is a file, the simplest variant returns an array of FileStatus objects of length 1. When the argument is a directory, it returns zero or more FileStatus objects representing the files and directories contained in the directory.

Overloaded variants allow a PathFilter to be supplied to restrict the files and directories to match; you will see an example in "PathFilter" on page 61. Finally, if you specify an array of paths, the result is a shortcut for calling the equivalent single-path listStatus method for each path in turn and accumulating the FileStatus object arrays in a single array. This can be useful for building up lists of input files to process from distinct parts of the filesystem tree. Example 3-6 is a simple demonstration of this idea. Note the use of stat2Paths() in FileUtil for turning an array of FileStatus objects into an array of Path objects.

Example 3-6.
Showing the file statuses for a collection of paths in a Hadoop filesystem

    public class ListStatus {

        public static void main(String[] args) throws Exception {
            String uri = args[0];
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);

            Path[] paths = new Path[args.length];
            for (int i = 0; i < paths.length; i++) {
                paths[i] = new Path(args[i]);
            }

        FileStatus[] status = fs.listStatus(paths);
        Path[] listedPaths = FileUtil.stat2Paths(status);
        for (Path p : listedPaths) {
          System.out.println(p);
        }
      }
    }

We can use this program to find the union of directory listings for a collection of paths:

    % hadoop ListStatus hdfs://localhost/ hdfs://localhost/user/tom
    hdfs://localhost/user
    hdfs://localhost/user/tom/books
    hdfs://localhost/user/tom/quangle.txt

File patterns

It is a common requirement to process sets of files in a single operation. For example, a MapReduce job for log processing might analyze a month’s worth of files contained in a number of directories. Rather than having to enumerate each file and directory to specify the input, it is convenient to use wildcard characters to match multiple files with a single expression, an operation that is known as globbing. Hadoop provides two FileSystem methods for processing globs:

    public FileStatus[] globStatus(Path pathPattern) throws IOException
    public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException

The globStatus() method returns an array of FileStatus objects whose paths match the supplied pattern, sorted by path. An optional PathFilter can be specified to restrict the matches further.

Hadoop supports the same set of glob characters as Unix bash (see Table 3-2).

Table 3-2. Glob characters and their meanings

    Glob      Name                      Matches
    *         asterisk                  Matches zero or more characters
    ?         question mark             Matches a single character
    [ab]      character class           Matches a single character in the set {a, b}
    [^ab]     negated character class   Matches a single character that is not in the set {a, b}
    [a-b]     character range           Matches a single character in the (closed) range [a, b],
                                        where a is lexicographically less than or equal to b
    [^a-b]    negated character range   Matches a single character that is not in the (closed)
                                        range [a, b], where a is lexicographically less than or
                                        equal to b
    {a,b}     alternation               Matches either expression a or b
    \c        escaped character         Matches character c when it is a metacharacter

60 | Chapter 3: The Hadoop Distributed Filesystem
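Much of the glob syntax in Table 3-2 can be experimented with locally, since java.nio’s PathMatcher supports a similar (though not identical) glob language; note, for instance, that NIO writes negated character classes as [!ab] rather than Hadoop’s [^ab]. The following is a Hadoop-independent sketch, not Hadoop’s own glob implementation:

```java
import java.nio.file.FileSystems;
import java.nio.file.Paths;

// Try out glob patterns against literal paths using java.nio's
// PathMatcher. This only illustrates the glob syntax; Hadoop's
// globStatus() does its own matching against filesystem contents.
public class GlobDemo {

  public static boolean matches(String glob, String path) {
    return FileSystems.getDefault()
        .getPathMatcher("glob:" + glob)
        .matches(Paths.get(path));
  }

  public static void main(String[] args) {
    System.out.println(matches("/200[78]", "/2007"));           // true
    System.out.println(matches("/200?", "/2006"));              // true
    System.out.println(matches("/*/*/{31,01}", "/2007/12/31")); // true
    System.out.println(matches("/200[78]", "/2006"));           // false
  }
}
```

In NIO globs, as in Hadoop’s, * does not cross directory boundaries, which is why /*/*/{31,01} needs one * per path component.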

Imagine that logfiles are stored in a directory structure organized hierarchically by date. So, for example, logfiles for the last day of 2007 would go in a directory named /2007/12/31. Suppose that the full file listing is:

    • /2007/12/30
    • /2007/12/31
    • /2008/01/01
    • /2008/01/02

Here are some file globs and their expansions:

    Glob                Expansion
    /*                  /2007 /2008
    /*/*                /2007/12 /2008/01
    /*/12/*             /2007/12/30 /2007/12/31
    /200?               /2007 /2008
    /200[78]            /2007 /2008
    /200[7-8]           /2007 /2008
    /200[^01234569]     /2007 /2008
    /*/*/{31,01}        /2007/12/31 /2008/01/01
    /*/*/3{0,1}         /2007/12/30 /2007/12/31
    /*/{12/31,01/01}    /2007/12/31 /2008/01/01

PathFilter

Glob patterns are not always powerful enough to describe a set of files you want to access. For example, it is not generally possible to exclude a particular file using a glob pattern. The listStatus() and globStatus() methods of FileSystem take an optional PathFilter, which allows programmatic control over matching:

    package org.apache.hadoop.fs;

    public interface PathFilter {
      boolean accept(Path path);
    }

PathFilter is the equivalent of java.io.FileFilter for Path objects rather than File objects.

Example 3-7 shows a PathFilter for excluding paths that match a regular expression.

Example 3-7. A PathFilter for excluding paths that match a regular expression

    public class RegexExcludePathFilter implements PathFilter {

      private final String regex;

      public RegexExcludePathFilter(String regex) {
        this.regex = regex;
      }

      public boolean accept(Path path) {
        return !path.toString().matches(regex);
      }
    }

The filter passes only files that don’t match the regular expression. We use the filter in conjunction with a glob that picks out an initial set of files to include: the filter is used to refine the results. For example:

    fs.globStatus(new Path("/2007/*/*"), new RegexExcludePathFilter("^.*/2007/12/31$"))

will expand to /2007/12/30.

Filters can act only on a file’s name, as represented by a Path. They can’t use a file’s properties, such as creation time, as the basis of the filter. Nevertheless, they can perform matching that neither glob patterns nor regular expressions can achieve. For example, if you store files in a directory structure that is laid out by date (as in the previous section), then you can write a PathFilter to pick out files that fall in a given date range.

Deleting Data

Use the delete() method on FileSystem to permanently remove files or directories:

    public boolean delete(Path f, boolean recursive) throws IOException

If f is a file or an empty directory, then the value of recursive is ignored. A nonempty directory is deleted, along with its contents, only if recursive is true (otherwise an IOException is thrown).

Data Flow

Anatomy of a File Read

To get an idea of how data flows between a client interacting with HDFS, the namenode, and the datanodes, consider Figure 3-1, which shows the main sequence of events when reading a file.

Figure 3-1. A client reading data from HDFS

The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (step 1 in Figure 3-1). DistributedFileSystem calls the namenode, using RPC, to determine the locations of the blocks for the first few blocks in the file (step 2). For each block, the namenode returns the addresses of the datanodes that have a copy of that block. Furthermore, the datanodes are sorted according to their proximity to the client (according to the topology of the cluster’s network; see “Network Topology and Hadoop” on page 64). If the client is itself a datanode (in the case of a MapReduce task, for instance), then it will read from the local datanode, if it hosts a copy of the block.

The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.

The client then calls read() on the stream (step 3). DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, then connects to the first (closest) datanode for the first block in the file. Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream (step 4). When the end of the block is reached, DFSInputStream will close the connection to the datanode, then find the best datanode for the next block (step 5). This happens transparently to the client, which from its point of view is just reading a continuous stream.

Blocks are read in order, with the DFSInputStream opening new connections to datanodes as the client reads through the stream. It will also call the namenode to retrieve the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream (step 6).

During reading, if the DFSInputStream encounters an error while communicating with a datanode, it will try the next closest one for that block. It will also remember datanodes that have failed so that it doesn’t needlessly retry them for later blocks. The DFSInputStream also verifies checksums for the data transferred to it from the datanode. If a corrupted block is found, it is reported to the namenode before the DFSInputStream attempts to read a replica of the block from another datanode.

One important aspect of this design is that the client contacts datanodes directly to retrieve data and is guided by the namenode to the best datanode for each block. This design allows HDFS to scale to a large number of concurrent clients, since the data traffic is spread across all the datanodes in the cluster. The namenode meanwhile merely has to service block location requests (which it stores in memory, making them very efficient) and does not, for example, serve data, which would quickly become a bottleneck as the number of clients grew.

Network Topology and Hadoop

What does it mean for two nodes in a local network to be “close” to each other? In the context of high-volume data processing, the limiting factor is the rate at which we can transfer data between nodes: bandwidth is a scarce commodity. The idea is to use the bandwidth between two nodes as a measure of distance.

Rather than measuring bandwidth between nodes, which can be difficult to do in practice (it requires a quiet cluster, and the number of pairs of nodes in a cluster grows as the square of the number of nodes), Hadoop takes a simple approach in which the network is represented as a tree and the distance between two nodes is the sum of their distances to their closest common ancestor. Levels in the tree are not predefined, but it is common to have levels that correspond to the data center, the rack, and the node that a process is running on.
The idea is that the bandwidth available for each of the following scenarios becomes progressively less:

    • Processes on the same node
    • Different nodes on the same rack
    • Nodes on different racks in the same data center
    • Nodes in different data centers†

For example, imagine a node n1 on rack r1 in data center d1. This can be represented as /d1/r1/n1. Using this notation, here are the distances for the four scenarios:

    • distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)
    • distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)
    • distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center)
    • distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers)

† At the time of this writing, Hadoop is not suited for running across data centers.
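The tree-distance rule can be sketched in a few lines of plain Java. This is a standalone illustration of the idea, not Hadoop’s actual NetworkTopology class:

```java
// Compute the tree distance between two nodes named by paths such as
// /d1/r1/n1: the sum of each node's distance (in tree edges) to their
// closest common ancestor.
public class NetworkDistance {

  public static int distance(String a, String b) {
    String[] pa = a.split("/");
    String[] pb = b.split("/");
    // Count the shared leading path components (the common ancestor depth)
    int common = 0;
    while (common < pa.length && common < pb.length
        && pa[common].equals(pb[common])) {
      common++;
    }
    // Steps from each node up to the common ancestor
    return (pa.length - common) + (pb.length - common);
  }

  public static void main(String[] args) {
    System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0: same node
    System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2: same rack
    System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4: same data center
    System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // 6: different data centers
  }
}
```

Running it reproduces the four distances listed above, and (as the “distance metric” remark that follows suggests) the function is symmetric and zero only for identical paths.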

This is illustrated schematically in Figure 3-2. (Mathematically inclined readers will notice that this is an example of a distance metric.)

Finally, it is important to realize that Hadoop cannot divine your network topology for you. It needs some help; we’ll cover how to configure topology in “Network Topology” on page 261. By default, though, it assumes that the network is flat (a single-level hierarchy) or, in other words, that all nodes are on a single rack in a single data center. For small clusters, this may actually be the case, and no further configuration is required.

Figure 3-2. Network distance in Hadoop

Anatomy of a File Write

Next we’ll look at how files are written to HDFS. Although quite detailed, it is instructive to understand the data flow, since it clarifies HDFS’s coherency model.

The case we’re going to consider is that of creating a new file, writing data to it, then closing the file. See Figure 3-3.

The client creates the file by calling create() on DistributedFileSystem (step 1 in Figure 3-3). DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem’s namespace, with no blocks associated with it (step 2). The namenode performs various checks to make sure the file doesn’t already exist and that the client has the right permissions to create the file. If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and the client is thrown an IOException. The DistributedFileSystem returns an FSDataOutputStream for the client

to start writing data to. Just as in the read case, FSDataOutputStream wraps a DFSOutputStream, which handles communication with the datanodes and namenode.

As the client writes data (step 3), DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue. The data queue is consumed by the DataStreamer, whose responsibility it is to ask the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline; we’ll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline. Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline (step 4).

Figure 3-3. A client writing data to HDFS

DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline (step 5).

If a datanode fails while data is being written to it, then the following actions are taken, which are transparent to the client writing the data. First, the pipeline is closed, and any packets in the ack queue are added to the front of the data queue so that datanodes that are downstream from the failed node will not miss any packets. The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode will be deleted if the failed

datanode recovers later on. The failed datanode is removed from the pipeline, and the remainder of the block’s data is written to the two good datanodes in the pipeline. The namenode notices that the block is under-replicated, and it arranges for a further replica to be created on another node. Subsequent blocks are then treated as normal.

It’s possible, but unlikely, for multiple datanodes to fail while a block is being written. As long as dfs.replication.min replicas (default one) are written, the write will succeed, and the block will be asynchronously replicated across the cluster until its target replication factor is reached (dfs.replication, which defaults to three).

When the client has finished writing data, it calls close() on the stream (step 6). This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete (step 7). The namenode already knows which blocks the file is made up of (via DataStreamer asking for block allocations), so it only has to wait for blocks to be minimally replicated before returning successfully.

Replica Placement

How does the namenode choose which datanodes to store replicas on? There’s a trade-off between reliability and write bandwidth and read bandwidth here. For example, placing all replicas on a single node incurs the lowest write bandwidth penalty, since the replication pipeline runs on a single node, but this offers no real redundancy (if the node fails, the data for that block is lost). Also, the read bandwidth is high for off-rack reads. At the other extreme, placing replicas in different data centers may maximize redundancy, but at the cost of bandwidth. Even in the same data center (which is what all Hadoop clusters to date have run in), there are a variety of placement strategies. Indeed, Hadoop changed its placement strategy in release 0.17.0 to one that helps keep a fairly even distribution of blocks across the cluster.
(See “balancer” on page 304 for details on keeping a cluster balanced.) And from 0.21.0, block placement policies are pluggable.

Hadoop’s default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes on the cluster, although the system tries to avoid placing too many replicas on the same rack.

Once the replica locations have been chosen, a pipeline is built, taking network topology into account. For a replication factor of 3, the pipeline might look like Figure 3-4.

Overall, this strategy gives a good balance among reliability (blocks are stored on two racks), write bandwidth (writes only have to traverse a single network switch), read performance (there’s a choice of two racks to read from), and block distribution across the cluster (clients only write a single block on the local rack).
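The default strategy can be sketched as follows. This is a simplified standalone illustration, not Hadoop’s actual BlockPlacementPolicy: nodes are hypothetical "/rack/node" strings, and it assumes every rack holds at least two nodes (so the third-replica search always terminates):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the default replica placement: first replica on the client's
// node, second on a random node on a different rack, third on a different
// node on the second replica's rack.
public class ReplicaPlacement {

  public static String rackOf(String node) {
    return node.substring(0, node.lastIndexOf('/'));
  }

  public static List<String> choose(String client, List<String> nodes, Random rng) {
    List<String> replicas = new ArrayList<>();
    replicas.add(client); // first replica: same node as the client

    String second;
    do { // second replica: random node, off-rack from the first
      second = nodes.get(rng.nextInt(nodes.size()));
    } while (rackOf(second).equals(rackOf(client)));
    replicas.add(second);

    String third;
    do { // third replica: same rack as the second, different node
      third = nodes.get(rng.nextInt(nodes.size()));
    } while (!rackOf(third).equals(rackOf(second)) || third.equals(second));
    replicas.add(third);

    return replicas;
  }

  public static void main(String[] args) {
    List<String> nodes = List.of("/r1/n1", "/r1/n2", "/r2/n3", "/r2/n4",
        "/r3/n5", "/r3/n6");
    System.out.println(choose("/r1/n1", nodes, new Random()));
  }
}
```

Whatever the random choices, the result always exhibits the properties the text describes: blocks end up on exactly two racks, and the second and third replicas share a rack.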

Figure 3-4. A typical replica pipeline

Coherency Model

A coherency model for a filesystem describes the data visibility of reads and writes for a file. HDFS trades off some POSIX requirements for performance, so some operations may behave differently than you expect them to.

After creating a file, it is visible in the filesystem namespace, as expected:

    Path p = new Path("p");
    fs.create(p);
    assertThat(fs.exists(p), is(true));

However, any content written to the file is not guaranteed to be visible, even if the stream is flushed. So the file appears to have a length of zero:

    Path p = new Path("p");
    OutputStream out = fs.create(p);
    out.write("content".getBytes("UTF-8"));
    out.flush();
    assertThat(fs.getFileStatus(p).getLen(), is(0L));

Once more than a block’s worth of data has been written, the first block will be visible to new readers. This is true of subsequent blocks, too: it is always the current block being written that is not visible to other readers.

HDFS provides a method for forcing all buffers to be synchronized to the datanodes via the sync() method on FSDataOutputStream. After a successful return from sync(), HDFS guarantees that the data written up to that point in the file is persisted and visible to all new readers:‡

    Path p = new Path("p");
    FSDataOutputStream out = fs.create(p);
    out.write("content".getBytes("UTF-8"));
    out.flush();
    out.sync();
    assertThat(fs.getFileStatus(p).getLen(), is(((long) "content".length())));

This behavior is similar to the fsync system call in POSIX, which commits buffered data for a file descriptor. For example, using the standard Java API to write a local file, we are guaranteed to see the content after flushing the stream and synchronizing:

    FileOutputStream out = new FileOutputStream(localFile);
    out.write("content".getBytes("UTF-8"));
    out.flush(); // flush to operating system
    out.getFD().sync(); // sync to disk
    assertThat(localFile.length(), is(((long) "content".length())));

Closing a file in HDFS performs an implicit sync(), too:

    Path p = new Path("p");
    OutputStream out = fs.create(p);
    out.write("content".getBytes("UTF-8"));
    out.close();
    assertThat(fs.getFileStatus(p).getLen(), is(((long) "content".length())));

Consequences for application design

This coherency model has implications for the way you design applications. With no calls to sync(), you should be prepared to lose up to a block of data in the event of client or system failure. For many applications, this is unacceptable, so you should call sync() at suitable points, such as after writing a certain number of records or number of bytes. Though the sync() operation is designed not to unduly tax HDFS, it does have some overhead, so there is a trade-off between data robustness and throughput.
What is an acceptable trade-off is application-dependent, and suitable values can be selected after measuring your application’s performance with different sync() frequencies.

‡ Releases of Hadoop up to and including 0.20 do not have a working implementation of sync(); however, this has been remedied from 0.21.0 onward. Also, from that version, sync() is deprecated in favor of hflush(), which only guarantees that new readers will see all data written to that point, and hsync(), which makes a stronger guarantee that the operating system has flushed the data to disk (like POSIX fsync), although data may still be in the disk cache.
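The “sync every N records” policy is simple to factor out so that the frequency can be tuned in one place. The class below is a hypothetical helper (not a Hadoop API), shown here only to make the trade-off concrete:

```java
// Decide when to call sync() based on how many records have been
// written. A larger recordsPerSync means higher throughput but more
// data at risk on failure; a smaller one means the reverse.
public class SyncPolicy {

  private final int recordsPerSync;
  private long records;

  public SyncPolicy(int recordsPerSync) {
    this.recordsPerSync = recordsPerSync;
  }

  // Call once per record written; returns true when the writer
  // should call sync() on its FSDataOutputStream.
  public boolean recordWritten() {
    records++;
    return records % recordsPerSync == 0;
  }

  public static void main(String[] args) {
    SyncPolicy policy = new SyncPolicy(100);
    int syncs = 0;
    for (int i = 0; i < 250; i++) {
      if (policy.recordWritten()) {
        syncs++;
      }
    }
    System.out.println(syncs + " syncs for 250 records");
  }
}
```

A byte-count-based policy would look the same with a byte threshold in place of the record count; either way, the threshold is the knob to turn while measuring throughput.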

Parallel Copying with distcp

The HDFS access patterns that we have seen so far focus on single-threaded access. It’s possible to act on a collection of files (by specifying file globs, for example), but for efficient parallel processing of these files, you would have to write a program yourself. Hadoop comes with a useful program called distcp for copying large amounts of data to and from Hadoop filesystems in parallel.

The canonical use case for distcp is for transferring data between two HDFS clusters. If the clusters are running identical versions of Hadoop, the hdfs scheme is appropriate:

    % hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar

This will copy the /foo directory (and its contents) from the first cluster to the /bar directory on the second cluster, so the second cluster ends up with the directory structure /bar/foo. If /bar doesn’t exist, it will be created first. You can specify multiple source paths, and all will be copied to the destination. Source paths must be absolute.

By default, distcp will skip files that already exist in the destination, but they can be overwritten by supplying the -overwrite option. You can also update only the files that have changed using the -update option.

Using either (or both) of -overwrite or -update changes how the source and destination paths are interpreted. This is best shown with an example. If we changed a file in the /foo subtree on the first cluster from the previous example, then we could synchronize the change with the second cluster by running:

    % hadoop distcp -update hdfs://namenode1/foo hdfs://namenode2/bar/foo

The extra trailing /foo subdirectory is needed on the destination, as now the contents of the source directory are copied to the contents of the destination directory. (If you are familiar with rsync, you can think of the -overwrite or -update options as adding an implicit trailing slash to the source.)
If you are unsure of the effect of a distcp operation, it is a good idea to try it out on a small test directory tree first.

There are more options to control the behavior of distcp, including ones to preserve file attributes, ignore failures, and limit the number of files or total data copied. Run it with no options to see the usage instructions.

distcp is implemented as a MapReduce job where the work of copying is done by the maps that run in parallel across the cluster. There are no reducers. Each file is copied by a single map, and distcp tries to give each map approximately the same amount of data by bucketing files into roughly equal allocations.

The number of maps is decided as follows. Since it’s a good idea to get each map to copy a reasonable amount of data to minimize overheads in task setup, each map copies at least 256 MB (unless the total size of the input is less, in which case one map handles it all). For example, 1 GB of files will be given four map tasks. When the data size is very large, it becomes necessary to limit the number of maps in order to limit bandwidth and cluster utilization. By default, the maximum number of maps is 20 per (tasktracker) cluster node. For example, copying 1,000 GB of files to a 100-node cluster will allocate 2,000 maps (20 per node), so each will copy 512 MB on average. This can be reduced by specifying the -m argument to distcp. For example, -m 1000 would allocate 1,000 maps, each copying 1 GB on average.

If you try to use distcp between two HDFS clusters that are running different versions, the copy will fail if you use the hdfs protocol, since the RPC systems are incompatible. To remedy this, you can use the read-only HTTP-based HFTP filesystem to read from the source. The job must run on the destination cluster so that the HDFS RPC versions are compatible. To repeat the previous example using HFTP:

    % hadoop distcp hftp://namenode1:50070/foo hdfs://namenode2/bar

Note that you need to specify the namenode’s web port in the source URI. This is determined by the dfs.http.address property, which defaults to 50070.

Keeping an HDFS Cluster Balanced

When copying data into HDFS, it’s important to consider cluster balance. HDFS works best when the file blocks are evenly spread across the cluster, so you want to ensure that distcp doesn’t disrupt this. Going back to the 1,000 GB example, by specifying -m 1, a single map would do the copy, which, apart from being slow and not using the cluster resources efficiently, would mean that the first replica of each block would reside on the node running the map (until the disk filled up).
The second and third replicas would be spread across the cluster, but this one node would be unbalanced. By having more maps than nodes in the cluster, this problem is avoided; for this reason, it’s best to start by running distcp with the default of 20 maps per node.

However, it’s not always possible to prevent a cluster from becoming unbalanced. Perhaps you want to limit the number of maps so that some of the nodes can be used by other jobs. In this case, you can use the balancer tool (see “balancer” on page 304) to subsequently even out the block distribution across the cluster.

Hadoop Archives

HDFS stores small files inefficiently, since each file is stored in a block, and block metadata is held in memory by the namenode. Thus, a large number of small files can eat up a lot of memory on the namenode. (Note, however, that small files do not take up any more disk space than is required to store the raw contents of the file. For

example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.)

Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, thereby reducing namenode memory usage while still allowing transparent access to files. In particular, Hadoop Archives can be used as input to MapReduce.

Using Hadoop Archives

A Hadoop Archive is created from a collection of files using the archive tool. The tool runs a MapReduce job to process the input files in parallel, so you need a running MapReduce cluster to use it. Here are some files in HDFS that we would like to archive:

    % hadoop fs -lsr /my/files
    -rw-r--r--   1 tom supergroup          1 2009-04-09 19:13 /my/files/a
    drwxr-xr-x   - tom supergroup          0 2009-04-09 19:13 /my/files/dir
    -rw-r--r--   1 tom supergroup          1 2009-04-09 19:13 /my/files/dir/b

Now we can run the archive command:

    % hadoop archive -archiveName files.har /my/files /my

The first option is the name of the archive, here files.har. HAR files always have a .har extension, which is mandatory for reasons we shall see later. Next come the files to put in the archive. Here we are archiving only one source tree, the files in /my/files in HDFS, but the tool accepts multiple source trees. The final argument is the output directory for the HAR file. Let’s see what the archive has created:

    % hadoop fs -ls /my
    Found 2 items
    drwxr-xr-x   - tom supergroup          0 2009-04-09 19:13 /my/files
    drwxr-xr-x   - tom supergroup          0 2009-04-09 19:13 /my/files.har
    % hadoop fs -ls /my/files.har
    Found 3 items
    -rw-r--r--  10 tom supergroup        165 2009-04-09 19:13 /my/files.har/_index
    -rw-r--r--  10 tom supergroup         23 2009-04-09 19:13 /my/files.har/_masterindex
    -rw-r--r--   1 tom supergroup          2 2009-04-09 19:13 /my/files.har/part-0

The directory listing shows what a HAR file is made of: two index files and a collection of part files (just one in this example).
The part files contain the contents of a number of the original files concatenated together, and the indexes make it possible to look up the part file that an archived file is contained in, as well as its offset and length. All these details are hidden from the application, however, which uses the har URI scheme to interact with HAR files, using a HAR filesystem that is layered on top of the underlying filesystem (HDFS in this case). The following command recursively lists the files in the archive:

    % hadoop fs -lsr har:///my/files.har
    drw-r--r--   - tom supergroup          0 2009-04-09 19:13 /my/files.har/my
    drw-r--r--   - tom supergroup          0 2009-04-09 19:13 /my/files.har/my/files
    -rw-r--r--  10 tom supergroup          1 2009-04-09 19:13 /my/files.har/my/files/a
    drw-r--r--   - tom supergroup          0 2009-04-09 19:13 /my/files.har/my/files/dir
    -rw-r--r--  10 tom supergroup          1 2009-04-09 19:13 /my/files.har/my/files/dir/b

This is quite straightforward if the filesystem that the HAR file is on is the default filesystem. On the other hand, if you want to refer to a HAR file on a different filesystem, then you need to use a different form of the path URI from normal. These two commands have the same effect, for example:

    % hadoop fs -lsr har:///my/files.har/my/files/dir
    % hadoop fs -lsr har://hdfs-localhost:8020/my/files.har/my/files/dir

Notice in the second form that the scheme is still har to signify a HAR filesystem, but the authority is hdfs to specify the underlying filesystem’s scheme, followed by a dash and the HDFS host (localhost) and port (8020). We can now see why HAR files have to have a .har extension. The HAR filesystem translates the har URI into a URI for the underlying filesystem by looking at the authority and path up to and including the component with the .har extension. In this case, it is hdfs://localhost:8020/my/files.har. The remaining part of the path is the path of the file in the archive: /my/files/dir.

To delete a HAR file, you need to use the recursive form of delete, since from the underlying filesystem’s point of view the HAR file is a directory:

    % hadoop fs -rmr /my/files.har

Limitations

There are a few limitations to be aware of with HAR files. Creating an archive creates a copy of the original files, so you need as much disk space as the files you are archiving to create the archive (although you can delete the originals once you have created the archive).
There is currently no support for archive compression, although the files that go into the archive can be compressed (HAR files are like tar files in this respect).

Archives are immutable once they have been created. To add or remove files, you must re-create the archive. In practice, this is not a problem for files that don’t change after being written, since they can be archived in batches on a regular basis, such as daily or weekly.

As noted earlier, HAR files can be used as input to MapReduce. However, there is no archive-aware InputFormat that can pack multiple files into a single MapReduce split, so processing lots of small files, even in a HAR file, can still be inefficient. “Small files and CombineFileInputFormat” on page 203 discusses another approach to this problem.
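The har-to-underlying-URI translation described earlier can be sketched in plain string manipulation. This is an illustration of the rule, not the actual HarFileSystem code, and it assumes the “scheme-dash-host” authority form shown above:

```java
// Split a har:// URI at the path component ending in ".har" to recover
// (a) the URI of the archive on the underlying filesystem and (b) the
// path of the file within the archive.
public class HarPathDemo {

  public static String[] split(String harUri) {
    // e.g. har://hdfs-localhost:8020/my/files.har/my/files/dir
    String rest = harUri.substring("har://".length());
    int slash = rest.indexOf('/');
    String authority = rest.substring(0, slash); // hdfs-localhost:8020
    String path = rest.substring(slash);         // /my/files.har/my/files/dir

    // The authority encodes the underlying scheme, a dash, then host:port
    int dash = authority.indexOf('-');
    String underlyingScheme = authority.substring(0, dash);
    String hostPort = authority.substring(dash + 1);

    // Everything up to and including ".har" names the archive itself
    int harEnd = path.indexOf(".har") + ".har".length();
    String underlying = underlyingScheme + "://" + hostPort + path.substring(0, harEnd);
    String inArchive = path.substring(harEnd);
    return new String[] { underlying, inArchive };
  }

  public static void main(String[] args) {
    String[] parts = split("har://hdfs-localhost:8020/my/files.har/my/files/dir");
    System.out.println(parts[0]); // hdfs://localhost:8020/my/files.har
    System.out.println(parts[1]); // /my/files/dir
  }
}
```

Running it on the example URI from the text recovers hdfs://localhost:8020/my/files.har and /my/files/dir, matching the translation the chapter walks through.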

CHAPTER 4

Hadoop I/O

Hadoop comes with a set of primitives for data I/O. Some of these are techniques that are more general than Hadoop, such as data integrity and compression, but deserve special consideration when dealing with multiterabyte datasets. Others are Hadoop tools or APIs that form the building blocks for developing distributed systems, such as serialization frameworks and on-disk data structures.

Data Integrity

Users of Hadoop rightly expect that no data will be lost or corrupted during storage or processing. However, since every I/O operation on the disk or network carries with it a small chance of introducing errors into the data that it is reading or writing, when the volumes of data flowing through the system are as large as the ones Hadoop is capable of handling, the chance of data corruption occurring is high.

The usual way of detecting corrupted data is by computing a checksum for the data when it first enters the system, and again whenever it is transmitted across a channel that is unreliable and hence capable of corrupting the data. The data is deemed to be corrupt if the newly generated checksum doesn’t exactly match the original. This technique doesn’t offer any way to fix the data, merely error detection. (And this is a reason for not using low-end hardware; in particular, be sure to use ECC memory.) Note that it is possible that it’s the checksum that is corrupt, not the data, but this is very unlikely, since the checksum is much smaller than the data.

A commonly used error-detecting code is CRC-32 (cyclic redundancy check), which computes a 32-bit integer checksum for input of any size.

Data Integrity in HDFS

HDFS transparently checksums all data written to it and by default verifies checksums when reading data. A separate checksum is created for every io.bytes.per.checksum

bytes of data. The default is 512 bytes, and since a CRC-32 checksum is 4 bytes long, the storage overhead is less than 1%.

Datanodes are responsible for verifying the data they receive before storing the data and its checksum. This applies to data that they receive from clients and from other datanodes during replication. A client writing data sends it to a pipeline of datanodes (as explained in Chapter 3), and the last datanode in the pipeline verifies the checksum. If it detects an error, the client receives a ChecksumException, a subclass of IOException, which it should handle in an application-specific manner, by retrying the operation, for example.

When clients read data from datanodes, they verify checksums as well, comparing them with the ones stored at the datanode. Each datanode keeps a persistent log of checksum verifications, so it knows the last time each of its blocks was verified. When a client successfully verifies a block, it tells the datanode, which updates its log. Keeping statistics such as these is valuable in detecting bad disks.

Aside from block verification on client reads, each datanode runs a DataBlockScanner in a background thread that periodically verifies all the blocks stored on the datanode. This is to guard against corruption due to "bit rot" in the physical storage media. See "Datanode block scanner" on page 303 for details on how to access the scanner reports.

Since HDFS stores replicas of blocks, it can "heal" corrupted blocks by copying one of the good replicas to produce a new, uncorrupt replica. The way this works is that if a client detects an error when reading a block, it reports the bad block and the datanode it was trying to read from to the namenode before throwing a ChecksumException. The namenode marks the block replica as corrupt, so it doesn't direct clients to it, or try to copy this replica to another datanode.
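The per-chunk CRC-32 checksums described above can be illustrated with the JDK's java.util.zip.CRC32 class. This is a standalone sketch of the checksum idea, not HDFS's internal code:

```java
import java.util.zip.CRC32;

public class ChecksumDemo {

    // Compute a CRC-32 checksum for a chunk of data, as HDFS does for
    // every io.bytes.per.checksum (default 512) bytes.
    static long checksum(byte[] chunk) {
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, chunk.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] chunk = "some block data".getBytes();
        long original = checksum(chunk);

        // A single flipped bit (simulated corruption) always changes the
        // CRC-32 value, so re-verification on read detects the error.
        chunk[3] ^= 0x01;
        long corrupted = checksum(chunk);

        System.out.println(original != corrupted); // prints "true"
    }
}
```

As the text notes, the checksum detects the corruption but carries no information about how to repair it; HDFS repairs by copying a good replica instead.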
It then schedules a copy of the block to be replicated on another datanode, so its replication factor is back at the expected level. Once this has happened, the corrupt replica is deleted.

It is possible to disable verification of checksums by passing false to the setVerifyChecksum() method on FileSystem, before using the open() method to read a file. The same effect is possible from the shell by using the -ignoreCrc option with the -get or the equivalent -copyToLocal command. This feature is useful if you have a corrupt file that you want to inspect so you can decide what to do with it. For example, you might want to see whether it can be salvaged before you delete it.

LocalFileSystem

The Hadoop LocalFileSystem performs client-side checksumming. This means that when you write a file called filename, the filesystem client transparently creates a hidden file, .filename.crc, in the same directory containing the checksums for each chunk of the file. Like HDFS, the chunk size is controlled by the io.bytes.per.checksum property, which defaults to 512 bytes. The chunk size is stored as metadata in the .crc file, so the

file can be read back correctly even if the setting for the chunk size has changed. Checksums are verified when the file is read, and if an error is detected, LocalFileSystem throws a ChecksumException.

Checksums are fairly cheap to compute (in Java, they are implemented in native code), typically adding a few percent overhead to the time to read or write a file. For most applications, this is an acceptable price to pay for data integrity. It is, however, possible to disable checksums, typically when the underlying filesystem supports checksums natively. This is accomplished by using RawLocalFileSystem in place of LocalFileSystem. To do this globally in an application, it suffices to remap the implementation for file URIs by setting the property fs.file.impl to the value org.apache.hadoop.fs.RawLocalFileSystem. Alternatively, you can directly create a RawLocalFileSystem instance, which may be useful if you want to disable checksum verification for only some reads; for example:

    Configuration conf = ...
    FileSystem fs = new RawLocalFileSystem();
    fs.initialize(null, conf);

ChecksumFileSystem

LocalFileSystem uses ChecksumFileSystem to do its work, and this class makes it easy to add checksumming to other (nonchecksummed) filesystems, as ChecksumFileSystem is just a wrapper around FileSystem. The general idiom is as follows:

    FileSystem rawFs = ...
    FileSystem checksummedFs = new ChecksumFileSystem(rawFs);

The underlying filesystem is called the raw filesystem, and may be retrieved using the getRawFileSystem() method on ChecksumFileSystem. ChecksumFileSystem has a few more useful methods for working with checksums, such as getChecksumFile() for getting the path of a checksum file for any file. Check the documentation for the others.

If an error is detected by ChecksumFileSystem when reading a file, it will call its reportChecksumFailure() method.
The default implementation does nothing, but LocalFileSystem moves the offending file and its checksum to a side directory on the same device called bad_files. Administrators should periodically check for these bad files and take action on them.

Compression

File compression brings two major benefits: it reduces the space needed to store files, and it speeds up data transfer across the network, or to or from disk. When dealing with large volumes of data, both of these savings can be significant, so it pays to carefully consider how to use compression in Hadoop.

There are many different compression formats, tools, and algorithms, each with different characteristics. Table 4-1 lists some of the more common ones that can be used with Hadoop.*

Table 4-1. A summary of compression formats

    Compression format   Tool    Algorithm   Filename extension   Multiple files   Splittable
    DEFLATE[a]           N/A     DEFLATE     .deflate             No               No
    gzip                 gzip    DEFLATE     .gz                  No               No
    bzip2                bzip2   bzip2       .bz2                 No               Yes
    LZO                  lzop    LZO         .lzo                 No               No

[a] DEFLATE is a compression algorithm whose standard implementation is zlib. There is no commonly available command-line tool for producing files in DEFLATE format, as gzip is normally used. (Note that the gzip file format is DEFLATE with extra headers and a footer.) The .deflate filename extension is a Hadoop convention.

All compression algorithms exhibit a space/time trade-off: faster compression and decompression speeds usually come at the expense of smaller space savings. All of the tools listed in Table 4-1 give some control over this trade-off at compression time by offering nine different options: -1 means optimize for speed and -9 means optimize for space. For example, the following command creates a compressed file file.gz using the fastest compression method:

    gzip -1 file

The different tools have very different compression characteristics. Gzip is a general-purpose compressor, and sits in the middle of the space/time trade-off. Bzip2 compresses more effectively than gzip, but is slower. Bzip2's decompression speed is faster than its compression speed, but it is still slower than the other formats. LZO, on the other hand, optimizes for speed: it is faster than gzip (or any other compression or decompression tool†), but compresses slightly less effectively.

The "Splittable" column in Table 4-1 indicates whether the compression format supports splitting; that is, whether you can seek to any point in the stream and start reading from some point further on.
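The space/time trade-off behind the -1 and -9 options can be seen from plain Java using the JDK's java.util.zip.Deflater, which implements the same DEFLATE algorithm used by gzip. This is an illustrative sketch, not Hadoop code:

```java
import java.util.zip.Deflater;

public class DeflateLevels {

    // Compress data at a given level (1 = fastest, 9 = smallest output)
    // and return the compressed size in bytes.
    static int compressedSize(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(data);
        deflater.finish();
        byte[] buf = new byte[4096];
        int size = 0;
        while (!deflater.finished()) {
            size += deflater.deflate(buf);
        }
        deflater.end();
        return size;
    }

    public static void main(String[] args) {
        // Repetitive input compresses well, making the level difference visible.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("the quick brown fox ").append(i % 7);
        }
        byte[] data = sb.toString().getBytes();
        System.out.println("level 1: " + compressedSize(data, 1));
        System.out.println("level 9: " + compressedSize(data, 9));
    }
}
```

Running this on a repetitive input shows level 9 producing output no larger than level 1, at the cost of more CPU time; real benchmarks should of course use representative data.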
Splittable compression formats are especially suitable for MapReduce; see "Compression and Input Splits" on page 83 for further discussion.

* At the time of this writing, Hadoop does not support ZIP compression. See https://issues.apache.org/jira/browse/MAPREDUCE-210.
† Jeff Gilchrist's Archive Comparison Test at http://compression.ca/act/act-summary.html contains benchmarks for compression and decompression speed, and compression ratio for a wide range of tools.

Codecs

A codec is the implementation of a compression-decompression algorithm. In Hadoop, a codec is represented by an implementation of the CompressionCodec interface. So, for

example, GzipCodec encapsulates the compression and decompression algorithm for gzip. Table 4-2 lists the codecs that are available for Hadoop.

Table 4-2. Hadoop compression codecs

    Compression format   Hadoop CompressionCodec
    DEFLATE              org.apache.hadoop.io.compress.DefaultCodec
    gzip                 org.apache.hadoop.io.compress.GzipCodec
    bzip2                org.apache.hadoop.io.compress.BZip2Codec
    LZO                  com.hadoop.compression.lzo.LzopCodec

The LZO libraries are GPL-licensed and may not be included in Apache distributions, so for this reason the Hadoop codecs must be downloaded separately from http://code.google.com/p/hadoop-gpl-compression/ (or http://github.com/kevinweil/hadoop-lzo, which includes bugfixes and more tools). The LzopCodec is compatible with the lzop tool, which is essentially the LZO format with extra headers, and is the one you normally want. There is also a LzoCodec for the pure LZO format, which uses the .lzo_deflate filename extension (by analogy with DEFLATE, which is gzip without the headers).

Compressing and decompressing streams with CompressionCodec

CompressionCodec has two methods that allow you to easily compress or decompress data. To compress data being written to an output stream, use the createOutputStream(OutputStream out) method to create a CompressionOutputStream, to which you write your uncompressed data to have it written in compressed form to the underlying stream.
Conversely, to decompress data being read from an input stream, call createInputStream(InputStream in) to obtain a CompressionInputStream, which allows you to read uncompressed data from the underlying stream.

CompressionOutputStream and CompressionInputStream are similar to java.util.zip.DeflaterOutputStream and java.util.zip.DeflaterInputStream, except that both of the former provide the ability to reset their underlying compressor or decompressor, which is important for applications that compress sections of the data stream as separate blocks, such as SequenceFile, described in "SequenceFile" on page 116.

Example 4-1 illustrates how to use the API to compress data read from standard input and write it to standard output.

Example 4-1. A program to compress data read from standard input and write it to standard output

    public class StreamCompressor {

      public static void main(String[] args) throws Exception {
        String codecClassname = args[0];
        Class<?> codecClass = Class.forName(codecClassname);

        Configuration conf = new Configuration();
        CompressionCodec codec = (CompressionCodec)
          ReflectionUtils.newInstance(codecClass, conf);
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish();
      }
    }

The application expects the fully qualified name of the CompressionCodec implementation as the first command-line argument. We use ReflectionUtils to construct a new instance of the codec, then obtain a compression wrapper around System.out. Then we call the utility method copyBytes() on IOUtils to copy the input to the output, which is compressed by the CompressionOutputStream. Finally, we call finish() on CompressionOutputStream, which tells the compressor to finish writing to the compressed stream, but doesn't close the stream. We can try it out with the following command line, which compresses the string "Text" using the StreamCompressor program with the GzipCodec, then decompresses it from standard input using gunzip:

    % echo "Text" | hadoop StreamCompressor org.apache.hadoop.io.compress.GzipCodec | gunzip -
    Text

Inferring CompressionCodecs using CompressionCodecFactory

If you are reading a compressed file, you can normally infer the codec to use by looking at its filename extension. A file ending in .gz can be read with GzipCodec, and so on. The extension for each compression format is listed in Table 4-1.

CompressionCodecFactory provides a way of mapping a filename extension to a CompressionCodec using its getCodec() method, which takes a Path object for the file in question. Example 4-2 shows an application that uses this feature to decompress files.

Example 4-2.
A program to decompress a compressed file using a codec inferred from the file's extension

    public class FileDecompressor {

      public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        Path inputPath = new Path(uri);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
          System.err.println("No codec found for " + uri);
          System.exit(1);
        }

        String outputUri =
          CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
        InputStream in = null;
        OutputStream out = null;
        try {
          in = codec.createInputStream(fs.open(inputPath));
          out = fs.create(new Path(outputUri));
          IOUtils.copyBytes(in, out, conf);
        } finally {
          IOUtils.closeStream(in);
          IOUtils.closeStream(out);
        }
      }
    }

Once the codec has been found, it is used to strip off the file suffix to form the output filename (via the removeSuffix() static method of CompressionCodecFactory). In this way, a file named file.gz is decompressed to file by invoking the program as follows:

    % hadoop FileDecompressor file.gz

CompressionCodecFactory finds codecs from a list defined by the io.compression.codecs configuration property. By default, this lists all the codecs provided by Hadoop (see Table 4-3), so you would need to alter it only if you have a custom codec that you wish to register (such as the externally hosted LZO codecs). Each codec knows its default filename extension, thus permitting CompressionCodecFactory to search through the registered codecs to find a match for a given extension (if any).

Table 4-3. Compression codec properties

    Property name          Type             Default value                                Description
    io.compression.codecs  comma-separated  org.apache.hadoop.io.compress.DefaultCodec,  A list of the CompressionCodec
                           Class names      org.apache.hadoop.io.compress.GzipCodec,     classes for compression/
                                            org.apache.hadoop.io.compress.BZip2Codec     decompression.

Native libraries

For performance, it is preferable to use a native library for compression and decompression. For example, in one test, using the native gzip libraries reduced decompression times by up to 50% and compression times by around 10% (compared to the built-in Java implementation). Table 4-4 shows the availability of Java and native implementations for each compression format. Not all formats have native implementations (bzip2, for example), whereas others are only available as a native implementation (LZO, for example).

Table 4-4. Compression library implementations

    Compression format   Java implementation   Native implementation
    DEFLATE              Yes                   Yes
    gzip                 Yes                   Yes
    bzip2                Yes                   No
    LZO                  No                    Yes

Hadoop comes with prebuilt native compression libraries for 32- and 64-bit Linux, which you can find in the lib/native directory. For other platforms, you will need to compile the libraries yourself, following the instructions on the Hadoop wiki at http://wiki.apache.org/hadoop/NativeHadoop.

The native libraries are picked up using the Java system property java.library.path. The hadoop script in the bin directory sets this property for you, but if you don't use this script, you will need to set the property in your application.

By default, Hadoop looks for native libraries for the platform it is running on, and loads them automatically if they are found. This means you don't have to change any configuration settings to use the native libraries. In some circumstances, however, you may wish to disable use of native libraries, such as when you are debugging a compression-related problem. You can achieve this by setting the property hadoop.native.lib to false, which ensures that the built-in Java equivalents will be used (if they are available).

CodecPool. If you are using a native library and you are doing a lot of compression or decompression in your application, consider using CodecPool, which allows you to reuse compressors and decompressors, thereby amortizing the cost of creating these objects.

The code in Example 4-3 shows the API, although in this program, which only creates a single Compressor, there is really no need to use a pool.

Example 4-3.
A program to compress data read from standard input and write it to standard output using a pooled compressor

    public class PooledStreamCompressor {

      public static void main(String[] args) throws Exception {
        String codecClassname = args[0];
        Class<?> codecClass = Class.forName(codecClassname);
        Configuration conf = new Configuration();
        CompressionCodec codec = (CompressionCodec)
          ReflectionUtils.newInstance(codecClass, conf);
        Compressor compressor = null;
        try {
          compressor = CodecPool.getCompressor(codec);
          CompressionOutputStream out =
            codec.createOutputStream(System.out, compressor);
          IOUtils.copyBytes(System.in, out, 4096, false);
          out.finish();

        } finally {
          CodecPool.returnCompressor(compressor);
        }
      }
    }

We retrieve a Compressor instance from the pool for a given CompressionCodec, which we use in the codec's overloaded createOutputStream() method. By using a finally block, we ensure that the compressor is returned to the pool even if there is an IOException while copying the bytes between the streams.

Compression and Input Splits

When considering how to compress data that will be processed by MapReduce, it is important to understand whether the compression format supports splitting. Consider an uncompressed file stored in HDFS whose size is 1 GB. With an HDFS block size of 64 MB, the file will be stored as 16 blocks, and a MapReduce job using this file as input will create 16 input splits, each processed independently as input to a separate map task.

Imagine now the file is a gzip-compressed file whose compressed size is 1 GB. As before, HDFS will store the file as 16 blocks. However, creating a split for each block won't work, since it is impossible to start reading at an arbitrary point in the gzip stream, and therefore impossible for a map task to read its split independently of the others. The gzip format uses DEFLATE to store the compressed data, and DEFLATE stores data as a series of compressed blocks. The problem is that the start of each block is not distinguished in any way that would allow a reader positioned at an arbitrary point in the stream to advance to the beginning of the next block, thereby synchronizing itself with the stream. For this reason, gzip does not support splitting.

In this case, MapReduce will do the right thing and not try to split the gzipped file, since it knows that the input is gzip-compressed (by looking at the filename extension) and that gzip does not support splitting. This will work, but at the expense of locality: a single map will process the 16 HDFS blocks, most of which will not be local to the map.
Also, with fewer maps, the job is less granular, and so may take longer to run.

If the file in our hypothetical example were an LZO file, we would have the same problem, since the underlying compression format does not provide a way for a reader to synchronize itself with the stream.‡ A bzip2 file, however, does provide a synchronization marker between blocks (a 48-bit approximation of pi), so it does support splitting. (Table 4-1 lists whether each compression format supports splitting.)

‡ It is possible to preprocess gzip and LZO files to build an index of split points, effectively making them splittable. See https://issues.apache.org/jira/browse/MAPREDUCE-491 for gzip. For LZO, there is an indexer tool available with the Hadoop LZO libraries, which you can obtain from the site listed in "Codecs" on page 78.
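Why a reader cannot start at an arbitrary point in a gzip stream can be demonstrated with the JDK's GZIPInputStream: handed a stream positioned past the start, it fails immediately because the gzip header is not where it expects (a JDK-only sketch; MapReduce's split handling works differently, but the underlying format limitation is the same):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipNotSplittable {

    // Gzip-compress some data in memory.
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        GZIPOutputStream out = new GZIPOutputStream(baos);
        out.write(data);
        out.close();
        return baos.toByteArray();
    }

    // Try to open a gzip reader at the given offset into the compressed bytes.
    static boolean readableFrom(byte[] compressed, int offset) {
        try {
            // The constructor reads and validates the gzip header immediately.
            new GZIPInputStream(new ByteArrayInputStream(
                compressed, offset, compressed.length - offset));
            return true;
        } catch (IOException e) {
            // e.g. ZipException: "Not in GZIP format"
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("a gigabyte of log data, hypothetically".getBytes());
        System.out.println(readableFrom(compressed, 0)); // prints "true"
        System.out.println(readableFrom(compressed, 1)); // prints "false"
    }
}
```

A bzip2 reader, by contrast, could scan forward from an arbitrary offset to the next synchronization marker, which is what makes that format splittable.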

Which Compression Format Should I Use?

Which compression format you should use depends on your application. Do you want to maximize the speed of your application, or are you more concerned about keeping storage costs down? In general, you should try different strategies for your application, and benchmark them with representative datasets to find the best approach.

For large, unbounded files, like logfiles, the options are:

• Store the files uncompressed.
• Use a compression format that supports splitting, like bzip2.
• Split the file into chunks in the application, and compress each chunk separately using any supported compression format (it doesn't matter whether it is splittable). In this case, you should choose the chunk size so that the compressed chunks are approximately the size of an HDFS block.
• Use SequenceFile, which supports compression and splitting. See "SequenceFile" on page 116.
• Use an Avro data file, which supports compression and splitting, just like SequenceFile, but has the added advantage of being readable and writable from many languages, not just Java. See "Avro data files" on page 109.

For large files, you should not use a compression format that does not support splitting on the whole file, since you lose locality and make MapReduce applications very inefficient.

For archival purposes, consider the Hadoop archive format (see "Hadoop Archives" on page 71), although it does not support compression.

Using Compression in MapReduce

As described in "Inferring CompressionCodecs using CompressionCodecFactory" on page 80, if your input files are compressed, they will be automatically decompressed as they are read by MapReduce, using the filename extension to determine the codec to use.

To compress the output of a MapReduce job, in the job configuration, set the mapred.output.compress property to true and the mapred.output.compression.codec property to the classname of the compression codec you want to use, as shown in Example 4-4.

Example 4-4.
Application to run the maximum temperature job producing compressed output

    public class MaxTemperatureWithCompression {

      public static void main(String[] args) throws IOException {
        if (args.length != 2) {
          System.err.println("Usage: MaxTemperatureWithCompression <input path> " +
              "<output path>");

          System.exit(-1);
        }

        JobConf conf = new JobConf(MaxTemperatureWithCompression.class);
        conf.setJobName("Max temperature with output compression");

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setBoolean("mapred.output.compress", true);
        conf.setClass("mapred.output.compression.codec", GzipCodec.class,
            CompressionCodec.class);

        conf.setMapperClass(MaxTemperatureMapper.class);
        conf.setCombinerClass(MaxTemperatureReducer.class);
        conf.setReducerClass(MaxTemperatureReducer.class);

        JobClient.runJob(conf);
      }
    }

We run the program over compressed input (which doesn't have to use the same compression format as the output, although it does in this example) as follows:

    % hadoop MaxTemperatureWithCompression input/ncdc/sample.txt.gz output

Each part of the final output is compressed; in this case, there is a single part:

    % gunzip -c output/part-00000.gz
    1949  111
    1950  22

If you are emitting sequence files for your output, then you can set the mapred.output.compression.type property to control the type of compression to use. The default is RECORD, which compresses individual records. Changing this to BLOCK, which compresses groups of records, is recommended, since it compresses better (see "The SequenceFile format" on page 122).

Compressing map output

Even if your MapReduce application reads and writes uncompressed data, it may benefit from compressing the intermediate output of the map phase. Since the map output is written to disk and transferred across the network to the reducer nodes, by using a fast compressor such as LZO, you can get performance gains simply because the volume of data to transfer is reduced. The configuration properties to enable compression for map outputs and to set the compression format are shown in Table 4-5.

Table 4-5. Map output compression properties

    Property name                        Type      Default value                               Description
    mapred.compress.map.output           boolean   false                                       Compress map outputs.
    mapred.map.output.compression.codec  Class     org.apache.hadoop.io.compress.DefaultCodec  The compression codec to use
                                                                                               for map outputs.

Here are the lines to add to enable gzip map output compression in your job:

    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(GzipCodec.class);

Serialization

Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage. Deserialization is the reverse process of turning a byte stream back into a series of structured objects.

Serialization appears in two quite distinct areas of distributed data processing: for interprocess communication and for persistent storage.

In Hadoop, interprocess communication between nodes in the system is implemented using remote procedure calls (RPCs). The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node, which then deserializes the binary stream into the original message. In general, it is desirable that an RPC serialization format is:

Compact
    A compact format makes the best use of network bandwidth, which is the most scarce resource in a data center.

Fast
    Interprocess communication forms the backbone for a distributed system, so it is essential that there is as little performance overhead as possible for the serialization and deserialization process.

Extensible
    Protocols change over time to meet new requirements, so it should be straightforward to evolve the protocol in a controlled manner for clients and servers.
    For example, it should be possible to add a new argument to a method call, and have the new servers accept messages in the old format (without the new argument) from old clients.

Interoperable
    For some systems, it is desirable to be able to support clients that are written in different languages to the server, so the format needs to be designed to make this possible.
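The "compact" requirement is easy to see with plain JDK streams: a raw binary encoding of an int takes 4 bytes, while Java's general-purpose object serialization adds substantial framing overhead. This is a JDK-only illustration; Hadoop's RPC messages use neither of these exact mechanisms:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class CompactnessDemo {

    // Encode an int as 4 raw big-endian bytes.
    static byte[] rawEncoding(int value) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(baos);
        out.writeInt(value);
        out.close();
        return baos.toByteArray();
    }

    // Encode the same int with Java object serialization, which also
    // writes a stream header and class metadata.
    static byte[] objectEncoding(int value) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(baos);
        out.writeObject(Integer.valueOf(value));
        out.close();
        return baos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(rawEncoding(163).length);    // prints "4"
        System.out.println(objectEncoding(163).length); // many times larger
    }
}
```

Multiplied across billions of RPC messages or records, that per-value overhead is exactly what a compact format is designed to avoid.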

On the face of it, the data format chosen for persistent storage would have different requirements from a serialization framework. After all, the lifespan of an RPC is less than a second, whereas persistent data may be read years after it was written. As it turns out, the four desirable properties of an RPC's serialization format are also crucial for a persistent storage format. We want the storage format to be compact (to make efficient use of storage space), fast (so the overhead in reading or writing terabytes of data is minimal), extensible (so we can transparently read data written in an older format), and interoperable (so we can read or write persistent data using different languages).

Hadoop uses its own serialization format, Writables, which is certainly compact and fast, but not so easy to extend or use from languages other than Java. Since Writables are central to Hadoop (most MapReduce programs use them for their key and value types), we look at them in some depth in the next three sections, before looking at serialization frameworks in general, and then Avro (a serialization system that was designed to overcome some of the limitations of Writables) in more detail.

The Writable Interface

The Writable interface defines two methods: one for writing its state to a DataOutput binary stream, and one for reading its state from a DataInput binary stream:

    package org.apache.hadoop.io;

    import java.io.DataOutput;
    import java.io.DataInput;
    import java.io.IOException;

    public interface Writable {
      void write(DataOutput out) throws IOException;
      void readFields(DataInput in) throws IOException;
    }

Let's look at a particular Writable to see what we can do with it. We will use IntWritable, a wrapper for a Java int.
We can create one and set its value using the set() method:

    IntWritable writable = new IntWritable();
    writable.set(163);

Equivalently, we can use the constructor that takes the integer value:

    IntWritable writable = new IntWritable(163);

To examine the serialized form of the IntWritable, we write a small helper method that wraps a java.io.ByteArrayOutputStream in a java.io.DataOutputStream (an implementation of java.io.DataOutput) to capture the bytes in the serialized stream:

    public static byte[] serialize(Writable writable) throws IOException {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      DataOutputStream dataOut = new DataOutputStream(out);
      writable.write(dataOut);
      dataOut.close();

      return out.toByteArray();
    }

An integer is written using four bytes (as we see using JUnit 4 assertions):

    byte[] bytes = serialize(writable);
    assertThat(bytes.length, is(4));

The bytes are written in big-endian order (so the most significant byte is written to the stream first, as dictated by the java.io.DataOutput interface), and we can see their hexadecimal representation by using a method on Hadoop's StringUtils:

    assertThat(StringUtils.byteToHexString(bytes), is("000000a3"));

Let's try deserialization. Again, we create a helper method to read a Writable object from a byte array:

    public static byte[] deserialize(Writable writable, byte[] bytes)
        throws IOException {
      ByteArrayInputStream in = new ByteArrayInputStream(bytes);
      DataInputStream dataIn = new DataInputStream(in);
      writable.readFields(dataIn);
      dataIn.close();
      return bytes;
    }

We construct a new, value-less IntWritable, then call deserialize() to read from the output data that we just wrote. Then we check that its value, retrieved using the get() method, is the original value, 163:

    IntWritable newWritable = new IntWritable();
    deserialize(newWritable, bytes);
    assertThat(newWritable.get(), is(163));

WritableComparable and comparators

IntWritable implements the WritableComparable interface, which is just a subinterface of the Writable and java.lang.Comparable interfaces:

    package org.apache.hadoop.io;

    public interface WritableComparable<T> extends Writable, Comparable<T> {
    }

Comparison of types is crucial for MapReduce, where there is a sorting phase during which keys are compared with one another. One optimization that Hadoop provides is the RawComparator extension of Java's Comparator:

    package org.apache.hadoop.io;

    import java.util.Comparator;

    public interface RawComparator<T> extends Comparator<T> {

      public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);

    }

This interface permits implementors to compare records read from a stream without deserializing them into objects, thereby avoiding any overhead of object creation. For example, the comparator for IntWritables implements the raw compare() method by reading an integer from each of the byte arrays b1 and b2 and comparing them directly, from the given start positions (s1 and s2) and lengths (l1 and l2).

WritableComparator is a general-purpose implementation of RawComparator for WritableComparable classes. It provides two main functions. First, it provides a default implementation of the raw compare() method that deserializes the objects to be compared from the stream and invokes the object compare() method. Second, it acts as a factory for RawComparator instances (that Writable implementations have registered). For example, to obtain a comparator for IntWritable, we just use:

    RawComparator<IntWritable> comparator =
        WritableComparator.get(IntWritable.class);

The comparator can be used to compare two IntWritable objects:

    IntWritable w1 = new IntWritable(163);
    IntWritable w2 = new IntWritable(67);
    assertThat(comparator.compare(w1, w2), greaterThan(0));

or their serialized representations:

    byte[] b1 = serialize(w1);
    byte[] b2 = serialize(w2);
    assertThat(comparator.compare(b1, 0, b1.length, b2, 0, b2.length),
        greaterThan(0));

Writable Classes

Hadoop comes with a large selection of Writable classes in the org.apache.hadoop.io package. They form the class hierarchy shown in Figure 4-1.

Writable wrappers for Java primitives

There are Writable wrappers for all the Java primitive types (see Table 4-6) except short and char (both of which can be stored in an IntWritable). All have a get() and a set() method for retrieving and storing the wrapped value.

    Java primitive   Writable implementation   Serialized size (bytes)
    double           DoubleWritable            8

When it comes to encoding integers, there is a choice between the fixed-length formats (IntWritable and LongWritable) and the variable-length formats (VIntWritable and VLongWritable). The variable-length formats use only a single byte to encode the value if it is small enough (between -112 and 127, inclusive); otherwise, they use the first byte to indicate whether the value is positive or negative, and how many bytes follow. For example, 163 requires two bytes:

    byte[] data = serialize(new VIntWritable(163));
    assertThat(StringUtils.byteToHexString(data), is("8fa3"));

How do you choose between a fixed-length and a variable-length encoding? Fixed-length encodings are good when the distribution of values is fairly uniform across the whole value space, such as a (well-designed) hash function. Most numeric variables tend to have nonuniform distributions, and on average the variable-length encoding will save space. Another advantage of variable-length encodings is that you can switch from VIntWritable to VLongWritable, since their encodings are actually the same. So by choosing a variable-length representation, you have room to grow without committing to an 8-byte long representation from the beginning.

Text

Text is a Writable for UTF-8 sequences. It can be thought of as the Writable equivalent of java.lang.String. Text is a replacement for the UTF8 class, which was deprecated because it didn't support strings whose encoding was over 32,767 bytes, and because it used Java's modified UTF-8.

The Text class uses an int (with a variable-length encoding) to store the number of bytes in the string encoding, so the maximum value is 2 GB. Furthermore, Text uses standard UTF-8, which makes it potentially easier to interoperate with other tools that understand UTF-8.

Indexing. Because of its emphasis on using standard UTF-8, there are some differences between Text and the Java String class.
Indexing for the Text class is in terms of position in the encoded byte sequence, not the Unicode character in the string, or the Java char code unit (as it is for String). For ASCII strings, these three concepts of index position coincide. Here is an example to demonstrate the use of the charAt() method:

    Text t = new Text("hadoop");
    assertThat(t.getLength(), is(6));
    assertThat(t.getBytes().length, is(6));

    assertThat(t.charAt(2), is((int) 'd'));
    assertThat("Out of bounds", t.charAt(100), is(-1));
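For comparison, the coincidence of the three index positions for ASCII can be checked with plain java.lang.String, with no Hadoop classes involved (a standalone sketch; the class name is illustrative):

```java
import java.nio.charset.StandardCharsets;

// Standalone check that for pure ASCII text, the char index, the code
// point index, and the UTF-8 byte index all refer to the same character.
public class AsciiIndexing {

    public static boolean indexesCoincide(String s, int i) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        // Pure ASCII: one byte per char, one char per code point
        return utf8.length == s.length()
            && s.codePointAt(i) == s.charAt(i)
            && utf8[i] == (byte) s.charAt(i);
    }

    public static void main(String[] args) {
        System.out.println(indexesCoincide("hadoop", 2)); // true for ASCII
    }
}
```

As soon as a string contains a multibyte character, the byte length diverges from the char length and the three indexes no longer line up.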

Notice that charAt() returns an int representing a Unicode code point, unlike the String variant that returns a char. Text also has a find() method, which is analogous to String's indexOf():

    Text t = new Text("hadoop");
    assertThat("Find a substring", t.find("do"), is(2));
    assertThat("Finds first 'o'", t.find("o"), is(3));
    assertThat("Finds 'o' from position 4 or later", t.find("o", 4), is(4));
    assertThat("No match", t.find("pig"), is(-1));

Unicode. When we start using characters that are encoded with more than a single byte, the differences between Text and String become clear. Consider the Unicode characters shown in Table 4-7.§

Table 4-7. Unicode characters

    Unicode code point   U+0041                   U+00DF              U+6771               U+10400
    Name                 LATIN CAPITAL LETTER A   LATIN SMALL LETTER  N/A (a unified Han   DESERET CAPITAL LETTER
                                                  SHARP S             ideograph)           LONG I
    UTF-8 code units     41                       c3 9f               e6 9d b1             f0 90 90 80
    Java representation  \u0041                   \u00DF              \u6771               \uD801\uDC00

All but the last character in the table, U+10400, can be expressed using a single Java char. U+10400 is a supplementary character and is represented by two Java chars, known as a surrogate pair. The tests in Example 4-5 show the differences between String and Text when processing a string of the four characters from Table 4-7.

Example 4-5.
Tests showing the differences between the String and Text classes

    public class StringTextComparisonTest {

      @Test
      public void string() throws UnsupportedEncodingException {

        String s = "\u0041\u00DF\u6771\uD801\uDC00";

        assertThat(s.length(), is(5));
        assertThat(s.getBytes("UTF-8").length, is(10));

        assertThat(s.indexOf("\u0041"), is(0));
        assertThat(s.indexOf("\u00DF"), is(1));
        assertThat(s.indexOf("\u6771"), is(2));
        assertThat(s.indexOf("\uD801\uDC00"), is(3));

        assertThat(s.charAt(0), is('\u0041'));
        assertThat(s.charAt(1), is('\u00DF'));
        assertThat(s.charAt(2), is('\u6771'));
        assertThat(s.charAt(3), is('\uD801'));
        assertThat(s.charAt(4), is('\uDC00'));

§ This example is based on one from the article "Supplementary Characters in the Java Platform".

        assertThat(s.codePointAt(0), is(0x0041));
        assertThat(s.codePointAt(1), is(0x00DF));
        assertThat(s.codePointAt(2), is(0x6771));
        assertThat(s.codePointAt(3), is(0x10400));
      }

      @Test
      public void text() {

        Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");

        assertThat(t.getLength(), is(10));

        assertThat(t.find("\u0041"), is(0));
        assertThat(t.find("\u00DF"), is(1));
        assertThat(t.find("\u6771"), is(3));
        assertThat(t.find("\uD801\uDC00"), is(6));

        assertThat(t.charAt(0), is(0x0041));
        assertThat(t.charAt(1), is(0x00DF));
        assertThat(t.charAt(3), is(0x6771));
        assertThat(t.charAt(6), is(0x10400));
      }
    }

The test confirms that the length of a String is the number of char code units it contains (five: one from each of the first three characters in the string, and a surrogate pair from the last), whereas the length of a Text object is the number of bytes in its UTF-8 encoding (10 = 1 + 2 + 3 + 4). Similarly, the indexOf() method in String returns an index in char code units, and find() for Text is a byte offset.

The charAt() method in String returns the char code unit for the given index, which in the case of a surrogate pair will not represent a whole Unicode character. The codePointAt() method, indexed by char code unit, is needed to retrieve a single Unicode character represented as an int. In fact, the charAt() method in Text is more like the codePointAt() method than its namesake in String. The only difference is that it is indexed by byte offset.

Iteration. Iterating over the Unicode characters in Text is complicated by the use of byte offsets for indexing, since you can't just increment the index. The idiom for iteration is a little obscure (see Example 4-6): turn the Text object into a java.nio.ByteBuffer, then repeatedly call the bytesToCodePoint() static method on Text with the buffer. This method extracts the next code point as an int and updates the position in the buffer. The end of the string is detected when bytesToCodePoint() returns -1.

Example 4-6.
Iterating over the characters in a Text object

    public class TextIterator {

      public static void main(String[] args) {
        Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");

        ByteBuffer buf = ByteBuffer.wrap(t.getBytes(), 0, t.getLength());

        int cp;
        while (buf.hasRemaining() && (cp = Text.bytesToCodePoint(buf)) != -1) {
          System.out.println(Integer.toHexString(cp));
        }
      }
    }

Running the program prints the code points for the four characters in the string:

    % hadoop TextIterator
    41
    df
    6771
    10400

Mutability. Another difference with String is that Text is mutable (like all Writable implementations in Hadoop, except NullWritable, which is a singleton). You can reuse a Text instance by calling one of the set() methods on it. For example:

    Text t = new Text("hadoop");
    t.set("pig");
    assertThat(t.getLength(), is(3));
    assertThat(t.getBytes().length, is(3));

In some situations, the byte array returned by the getBytes() method may be longer than the length returned by getLength():

    Text t = new Text("hadoop");
    t.set(new Text("pig"));
    assertThat(t.getLength(), is(3));
    assertThat("Byte length not shortened", t.getBytes().length, is(6));

This shows why it is imperative that you always use getLength() in conjunction with getBytes(), so you know how much of the byte array holds valid data.

Resorting to String. Text doesn't have as rich an API for manipulating strings as java.lang.String, so in many cases you need to convert the Text object to a String. This is done in the usual way, using the toString() method:

    assertThat(new Text("hadoop").toString(), is("hadoop"));

BytesWritable

BytesWritable is a wrapper for an array of binary data. Its serialized format is an integer field (4 bytes) that specifies the number of bytes to follow, followed by the bytes themselves. For example, the byte array of length two with values 3 and 5 is serialized as a 4-byte integer (00000002) followed by the two bytes from the array (03 and 05):

    BytesWritable b = new BytesWritable(new byte[] { 3, 5 });
    byte[] bytes = serialize(b);
    assertThat(StringUtils.byteToHexString(bytes), is("000000020305"));
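The length-prefixed layout of BytesWritable can be reproduced with plain java.io streams. This is a standalone sketch of the format described above (the class and method names are illustrative, not Hadoop's):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch of BytesWritable's wire format: a 4-byte big-endian length
// followed by the raw bytes themselves.
public class LengthPrefixedBytes {

    public static byte[] serialize(byte[] payload) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DataOutputStream data = new DataOutputStream(out);
            data.writeInt(payload.length); // 4-byte big-endian length
            data.write(payload);           // the bytes themselves
            data.close();
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // cannot happen for in-memory streams
        }
    }

    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Mirrors the book's example: bytes {3, 5} serialize to 000000020305
        System.out.println(toHex(serialize(new byte[] { 3, 5 })));
    }
}
```

Because DataOutputStream.writeInt is big-endian, the output matches the hexadecimal string shown in the assertion above.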

BytesWritable is mutable, and its value may be changed by calling its set() method. As with Text, the size of the byte array returned from the getBytes() method for BytesWritable—the capacity—may not reflect the actual size of the data stored in the BytesWritable. You can determine the size of the BytesWritable by calling getLength(). To demonstrate:

    b.setCapacity(11);
    assertThat(b.getLength(), is(2));
    assertThat(b.getBytes().length, is(11));

NullWritable

NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes are written to, or read from, the stream. It is used as a placeholder; for example, in MapReduce, a key or a value can be declared as a NullWritable when you don't need to use that position—it effectively stores a constant empty value. NullWritable can also be useful as a key in SequenceFile when you want to store a list of values, as opposed to key-value pairs. It is an immutable singleton: the instance can be retrieved by calling NullWritable.get().

ObjectWritable and GenericWritable

ObjectWritable is a general-purpose wrapper for the following: Java primitives, String, enum, Writable, null, or arrays of any of these types. It is used in Hadoop RPC to marshal and unmarshal method arguments and return types.

ObjectWritable is useful when a field can be of more than one type: for example, if the values in a SequenceFile have multiple types, then you can declare the value type as an ObjectWritable and wrap each type in an ObjectWritable. Being a general-purpose mechanism, it's fairly wasteful of space, since it writes the classname of the wrapped type every time it is serialized. In cases where the number of types is small and known ahead of time, this can be improved by having a static array of types, and using the index into the array as the serialized reference to the type.
This is the approach that GenericWritable takes, and you have to subclass it to specify the types to support.

Writable collections

There are four Writable collection types in the org.apache.hadoop.io package: ArrayWritable, TwoDArrayWritable, MapWritable, and SortedMapWritable.

ArrayWritable and TwoDArrayWritable are Writable implementations for arrays and two-dimensional arrays (arrays of arrays) of Writable instances. All the elements of an ArrayWritable or a TwoDArrayWritable must be instances of the same class, which is specified at construction, as follows:

    ArrayWritable writable = new ArrayWritable(Text.class);

In contexts where the Writable is defined by type, such as in SequenceFile keys or values, or as input to MapReduce in general, you need to subclass ArrayWritable (or TwoDArrayWritable, as appropriate) to set the type statically. For example:

    public class TextArrayWritable extends ArrayWritable {
      public TextArrayWritable() {
        super(Text.class);
      }
    }

ArrayWritable and TwoDArrayWritable both have get() and set() methods, as well as a toArray() method, which creates a shallow copy of the array (or 2D array).

MapWritable and SortedMapWritable are implementations of java.util.Map<Writable, Writable> and java.util.SortedMap<WritableComparable, Writable>, respectively. The type of each key and value field is a part of the serialization format for that field. The type is stored as a single byte that acts as an index into an array of types. The array is populated with the standard types in the org.apache.hadoop.io package, but custom Writable types are accommodated, too, by writing a header that encodes the type array for nonstandard types. As they are implemented, MapWritable and SortedMapWritable use positive byte values for custom types, so a maximum of 127 distinct nonstandard Writable classes can be used in any particular MapWritable or SortedMapWritable instance. Here's a demonstration of using a MapWritable with different types for keys and values:

    MapWritable src = new MapWritable();
    src.put(new IntWritable(1), new Text("cat"));
    src.put(new VIntWritable(2), new LongWritable(163));

    MapWritable dest = new MapWritable();
    WritableUtils.cloneInto(dest, src);
    assertThat((Text) dest.get(new IntWritable(1)), is(new Text("cat")));
    assertThat((LongWritable) dest.get(new VIntWritable(2)),
        is(new LongWritable(163)));

Conspicuous by their absence are Writable collection implementations for sets and lists. A set can be emulated by using a MapWritable (or a SortedMapWritable for a sorted set) with NullWritable values.
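The per-field type byte that MapWritable writes can be sketched with plain java.io streams. This is only an illustration of the idea, not Hadoop's actual type table or value encoding (the type codes and the use of writeUTF here are ours):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch of a MapWritable-style entry: each key and value is preceded
// by a single type byte that indexes into a known array of types.
public class TypeIndexedMapEntry {

    static final byte TYPE_INT = 1;  // hypothetical code for an int key
    static final byte TYPE_TEXT = 2; // hypothetical code for a string value

    public static byte[] writeEntry(int key, String value) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DataOutputStream data = new DataOutputStream(out);
            data.writeByte(TYPE_INT);  // type index for the key
            data.writeInt(key);        // the key itself
            data.writeByte(TYPE_TEXT); // type index for the value
            data.writeUTF(value);      // the value itself
            data.close();
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // in-memory streams don't fail
        }
    }
}
```

A reader can dispatch on each type byte to decide how to deserialize the field that follows, which is why a single byte suffices to make the map heterogeneous.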
For lists of a single type of Writable, ArrayWritable is adequate, but to store different types of Writable in a single list, you can use GenericWritable to wrap the elements in an ArrayWritable. Alternatively, you could write a general ListWritable using the ideas from MapWritable.

Implementing a Custom Writable

Hadoop comes with a useful set of Writable implementations that serve most purposes; however, on occasion, you may need to write your own custom implementation. With a custom Writable, you have full control over the binary representation and the sort order. Because Writables are at the heart of the MapReduce data path, tuning the binary representation can have a significant effect on performance. The stock Writable

      }

      @Override
      public int hashCode() {
        return first.hashCode() * 163 + second.hashCode();
      }

      @Override
      public boolean equals(Object o) {
        if (o instanceof TextPair) {
          TextPair tp = (TextPair) o;
          return first.equals(tp.first) && second.equals(tp.second);
        }
        return false;
      }

      @Override
      public String toString() {
        return first + "\t" + second;
      }

      @Override
      public int compareTo(TextPair tp) {
        int cmp = first.compareTo(tp.first);
        if (cmp != 0) {
          return cmp;
        }
        return second.compareTo(tp.second);
      }
    }

The first part of the implementation is straightforward: there are two Text instance variables, first and second, and associated constructors, getters, and setters. All Writable implementations must have a default constructor so that the MapReduce framework can instantiate them, then populate their fields by calling readFields(). Writable instances are mutable and often reused, so you should take care to avoid allocating objects in the write() or readFields() methods.

TextPair's write() method serializes each Text object in turn to the output stream, by delegating to the Text objects themselves. Similarly, readFields() deserializes the bytes from the input stream by delegating to each Text object. The DataOutput and DataInput interfaces have a rich set of methods for serializing and deserializing Java primitives, so, in general, you have complete control over the wire format of your Writable object.

Just as you would for any value object you write in Java, you should override the hashCode(), equals(), and toString() methods from java.lang.Object. The hashCode() method is used by the HashPartitioner (the default partitioner in MapReduce) to choose a reduce partition, so you should make sure that you write a good hash function that mixes well to ensure reduce partitions are of a similar size.
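To see how the hash feeds into partitioning, HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numPartitions. The following standalone sketch reproduces that arithmetic together with TextPair's pair-hashing idiom (the class name is ours):

```java
// Sketch of how a pair's hashCode() feeds reduce partitioning. The
// multiplier 163 mixes the two fields' hashes, as in TextPair above.
public class PairPartitioning {

    public static int pairHash(String first, String second) {
        return first.hashCode() * 163 + second.hashCode();
    }

    // Mirrors HashPartitioner's formula: mask the sign bit so the
    // result is nonnegative, then reduce modulo the partition count.
    public static int partition(int hash, int numPartitions) {
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int h = pairHash("hadoop", "pig");
        System.out.println(partition(h, 10)); // always in the range [0, 10)
    }
}
```

Masking with Integer.MAX_VALUE rather than calling Math.abs avoids the edge case where Math.abs(Integer.MIN_VALUE) is still negative.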

If you ever plan to use your custom Writable with TextOutputFormat, then you must implement its toString() method. TextOutputFormat calls toString() on keys and values for their output representation. For TextPair, we write the underlying Text objects as strings separated by a tab character.

TextPair is an implementation of WritableComparable, so it provides an implementation of the compareTo() method that imposes the ordering you would expect: it sorts by the first string followed by the second. Notice that TextPair differs from TextArrayWritable from the previous section (apart from the number of Text objects it can store), since TextArrayWritable is only a Writable, not a WritableComparable.

Implementing a RawComparator for speed

The code for TextPair in Example 4-7 will work as it stands; however, there is a further optimization we can make. As explained in "WritableComparable and comparators" on page 88, when TextPair is being used as a key in MapReduce, it will have to be deserialized into an object for the compareTo() method to be invoked. What if it were possible to compare two TextPair objects just by looking at their serialized representations?

It turns out that we can do this, since TextPair is the concatenation of two Text objects, and the binary representation of a Text object is a variable-length integer containing the number of bytes in the UTF-8 representation of the string, followed by the UTF-8 bytes themselves. The trick is to read the initial length, so we know how long the first Text object's byte representation is; then we can delegate to Text's RawComparator and invoke it with the appropriate offsets for the first or second string. Example 4-8 gives the details (note that this code is nested in the TextPair class).

Example 4-8.
A RawComparator for comparing TextPair byte representations

    public static class Comparator extends WritableComparator {

      private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();

      public Comparator() {
        super(TextPair.class);
      }

      @Override
      public int compare(byte[] b1, int s1, int l1,
                         byte[] b2, int s2, int l2) {

        try {
          int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
          int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
          int cmp = TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
          if (cmp != 0) {
            return cmp;
          }

          return TEXT_COMPARATOR.compare(b1, s1 + firstL1, l1 - firstL1,
                                         b2, s2 + firstL2, l2 - firstL2);
        } catch (IOException e) {
          throw new IllegalArgumentException(e);
        }
      }
    }

    static {
      WritableComparator.define(TextPair.class, new Comparator());
    }

We actually subclass WritableComparator rather than implementing RawComparator directly, since it provides some convenience methods and default implementations. The subtle part of this code is calculating firstL1 and firstL2, the lengths of the first Text field in each byte stream. Each is made up of the length of the variable-length integer (returned by decodeVIntSize() on WritableUtils) and the value it is encoding (returned by readVInt()).

The static block registers the raw comparator so that whenever MapReduce sees the TextPair class, it knows to use the raw comparator as its default comparator.

Custom comparators

As we can see with TextPair, writing raw comparators takes some care, since you have to deal with details at the byte level. It is worth looking at some of the implementations of Writable in the org.apache.hadoop.io package for further ideas if you need to write your own. The utility methods on WritableUtils are very handy, too.

Custom comparators should also be written to be RawComparators, if possible. These are comparators that implement a different sort order from the natural sort order defined by the default comparator. Example 4-9 shows a comparator for TextPair, called FirstComparator, that considers only the first string of the pair. Note that we override the compare() method that takes objects so both compare() methods have the same semantics.

We will make use of this comparator in Chapter 8, when we look at joins and secondary sorting in MapReduce (see "Joins" on page 247).

Example 4-9.
A custom RawComparator for comparing the first field of TextPair byte representations

    public static class FirstComparator extends WritableComparator {

      private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();

      public FirstComparator() {
        super(TextPair.class);
      }

      @Override
      public int compare(byte[] b1, int s1, int l1,
                         byte[] b2, int s2, int l2) {

        try {
          int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
          int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
          return TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
        } catch (IOException e) {
          throw new IllegalArgumentException(e);
        }
      }

      @Override
      public int compare(WritableComparable a, WritableComparable b) {
        if (a instanceof TextPair && b instanceof TextPair) {
          return ((TextPair) a).first.compareTo(((TextPair) b).first);
        }
        return super.compare(a, b);
      }
    }

Serialization Frameworks

Although most MapReduce programs use Writable key and value types, this isn't mandated by the MapReduce API. In fact, any types can be used; the only requirement is that there be a mechanism that translates to and from a binary representation of each type.

To support this, Hadoop has an API for pluggable serialization frameworks. A serialization framework is represented by an implementation of Serialization (in the org.apache.hadoop.io.serializer package). WritableSerialization, for example, is the implementation of Serialization for Writable types.

A Serialization defines a mapping from types to Serializer instances (for turning an object into a byte stream) and Deserializer instances (for turning a byte stream into an object).

Set the io.serializations property to a comma-separated list of classnames to register Serialization implementations. Its default value is org.apache.hadoop.io.serializer.WritableSerialization, which means that only Writable objects can be serialized or deserialized out of the box.

Hadoop includes a class called JavaSerialization that uses Java Object Serialization. Although it makes it convenient to be able to use standard Java types in MapReduce programs, like Integer or String, Java Object Serialization is not as efficient as Writables, so it's not worth making this trade-off (see the sidebar below).
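For example, registering JavaSerialization alongside the default might look like this in the configuration (a sketch; whether you put it in a site file or set it programmatically on a Configuration object depends on your setup):

```xml
<property>
  <name>io.serializations</name>
  <value>org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.JavaSerialization</value>
</property>
```

The property value is an ordered, comma-separated list, so the Writable framework remains available to jobs that use Writable types.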

Why Not Use Java Object Serialization?

Java comes with its own serialization mechanism, called Java Object Serialization (often referred to simply as "Java Serialization"), that is tightly integrated with the language, so it's natural to ask why this wasn't used in Hadoop. Here's what Doug Cutting said in response to that question:

    Why didn't I use Serialization when we first started Hadoop? Because it looked
    big and hairy and I thought we needed something lean and mean, where we had
    precise control over exactly how objects are written and read, since that is central
    to Hadoop. With Serialization you can get some control, but you have to fight for it.

    The logic for not using RMI was similar. Effective, high-performance inter-process
    communications are critical to Hadoop. I felt like we'd need to precisely control
    how things like connections, timeouts and buffers are handled, and RMI gives you
    little control over those.

The problem is that Java Serialization doesn't meet the criteria for a serialization format listed earlier: compact, fast, extensible, and interoperable.

Java Serialization is not compact: it writes the classname of each object being written to the stream—this is true of classes that implement java.io.Serializable or java.io.Externalizable. Subsequent instances of the same class write a reference handle to the first occurrence, which occupies only 5 bytes. However, reference handles don't work well with random access, since the referent class may occur at any point in the preceding stream—that is, there is state stored in the stream. Even worse, reference handles play havoc with sorting records in a serialized stream, since the first record of a particular class is distinguished and must be treated as a special case.

All these problems are avoided by not writing the classname to the stream at all, which is the approach that Writable takes. This makes the assumption that the client knows the expected type.
The result is that the format is considerably more compact than Java Serialization, and random access and sorting work as expected, since each record is independent of the others (so there is no stream state).

Java Serialization is a general-purpose mechanism for serializing graphs of objects, so it necessarily has some overhead for serialization and deserialization operations. What's more, the deserialization procedure creates a new instance for each object deserialized from the stream. Writable objects, on the other hand, can be (and often are) reused. For example, for a MapReduce job, which at its core serializes and deserializes billions of records of just a handful of different types, the savings gained by not having to allocate new objects are significant.

In terms of extensibility, Java Serialization has some support for evolving a type, but it is brittle and hard to use effectively (Writables have no support: the programmer has to manage them himself).

In principle, other languages could interpret the Java Serialization stream protocol (defined by the Java Object Serialization Specification), but in practice there are no widely used implementations in other languages, so it is a Java-only solution. The situation is the same for Writables.

Serialization IDL

There are a number of other serialization frameworks that approach the problem in a different way: rather than defining types through code, you define them in a language-neutral, declarative fashion, using an interface description language (IDL). The system can then generate types for different languages, which is good for interoperability. They also typically define versioning schemes that make type evolution straightforward.

Hadoop's own Record I/O (found in the org.apache.hadoop.record package) has an IDL that is compiled into Writable objects, which makes it convenient for generating types that are compatible with MapReduce. For whatever reason, however, Record I/O was not widely used, and has been deprecated in favor of Avro.

Apache Thrift and Google Protocol Buffers are both popular serialization frameworks, and they are commonly used as a format for persistent binary data. There is limited support for these as MapReduce formats;‖ however, Thrift is used in parts of Hadoop to provide cross-language APIs, such as the "thriftfs" contrib module, where it is used to expose an API to Hadoop filesystems (see "Thrift" on page 49).

In the next section, we look at Avro, an IDL-based serialization framework designed to work well with large-scale data processing in Hadoop.

Avro

Apache Avro# is a language-neutral data serialization system. The project was created by Doug Cutting (the creator of Hadoop) to address the major downside of Hadoop Writables: lack of language portability. Having a data format that can be processed by many languages (currently C, C++, Java, Python, and Ruby) makes it easier to share datasets with a wider audience than one tied to a single language. It is also more future-proof, allowing data to potentially outlive the language used to read and write it.

But why a new data serialization system?
Avro has a set of features that, taken together, differentiate it from other systems like Apache Thrift or Google's Protocol Buffers.* Like these systems and others, Avro data is described using a language-independent schema. However, unlike some other systems, code generation is optional in Avro, which means you can read and write data that conforms to a given schema even if your code has not seen that particular schema before. To achieve this, Avro assumes that the schema is always present—at both read and write time—which makes for a very compact encoding, since encoded values do not need to be tagged with a field identifier.

Avro schemas are usually written in JSON, and data is usually encoded using a binary format, but there are other options, too. There is a higher-level language called Avro IDL for writing schemas in a C-like language that is more familiar to developers. There is also a JSON-based data encoder, which, being human-readable, is useful for prototyping and debugging Avro data.

The Avro specification precisely defines the binary format that all implementations must support. It also specifies many of the other features of Avro that implementations should support. One area that the specification does not rule on, however, is APIs: implementations have complete latitude in the API they expose for working with Avro data, since each one is necessarily language-specific. The fact that there is only one binary format is significant, since it means the barrier for implementing a new language binding is lower, and it avoids the problem of a combinatorial explosion of languages and formats, which would harm interoperability.

Avro has rich schema resolution capabilities. Within certain carefully defined constraints, the schema used to read data need not be identical to the schema that was used to write the data. This is the mechanism by which Avro supports schema evolution. For example, a new, optional field may be added to a record by declaring it in the schema used to read the old data. New and old clients alike will be able to read the old data, while new clients can write new data that uses the new field. Conversely, if an old client sees newly encoded data, it will gracefully ignore the new field and carry on processing as it would have done with old data.

Avro specifies an object container format for sequences of objects—similar to Hadoop's sequence file. An Avro data file has a metadata section where the schema is stored, which makes the file self-describing. Avro data files support compression and are splittable, which is crucial for a MapReduce data input format. Furthermore, since Avro was designed with MapReduce in mind, in the future it will be possible to use Avro to bring first-class MapReduce APIs (that is, ones that are richer than Streaming, like the Java API or C++ Pipes) to languages that speak Avro.

Avro can be used for RPC, too, although this isn't covered here. The Hadoop project has plans to migrate to Avro RPC, which will have several benefits, including support for rolling upgrades and the possibility of multilanguage clients, such as an HDFS client implemented entirely in C.

Avro data types and schemas

Avro defines a small number of data types, which can be used to build application-specific data structures by writing schemas. For interoperability, implementations must support all Avro types.

‖ You can find the latest status for a Thrift Serialization at https://issues.apache.org/jira/browse/HADOOP-3787, and a Protocol Buffers Serialization at https://issues.apache.org/jira/browse/HADOOP-3788. Twitter's Elephant Bird project (http://github.com/kevinweil/elephant-bird) includes tools for working with Protocol Buffers in Hadoop.

# Named after the British aircraft manufacturer from the 20th century.

* Avro also performs favorably compared to other serialization libraries, as the benchmarks at http://code.google.com/p/thrift-protobuf-compare/ demonstrate.

    Type    Description                                            Schema example
                                                                       "size": 16
                                                                     }
    union   A union of schemas. A union is represented by a JSON   [
            array, where each element in the array is a schema.      "null",
            Data represented by a union must match one of the        "string",
            schemas in the union.                                     {"type": "map", "values": "string"}
                                                                    ]

Each Avro language API has a representation for each Avro type that is specific to the language. For example, Avro's double type is represented in C, C++, and Java by a double, in Python by a float, and in Ruby by a Float.

What's more, there may be more than one representation, or mapping, for a language. All languages support a dynamic mapping, which can be used even when the schema is not known ahead of run time. Java calls this the generic mapping.

In addition, the Java and C++ implementations can generate code to represent the data for an Avro schema. Code generation, which is called the specific mapping in Java, is an optimization that is useful when you have a copy of the schema before you read or write data. Generated classes also provide a more domain-oriented API for user code than generic ones.

Java has a third mapping, the reflect mapping, which maps Avro types onto preexisting Java types, using reflection. It is slower than the generic and specific mappings, and is not generally recommended for new applications.

Java's type mappings are shown in Table 4-10. As the table shows, the specific mapping is the same as the generic one unless otherwise noted (and the reflect one is the same as the specific one unless noted). The specific mapping only differs from the generic one for record, enum, and fixed, all of which have generated classes (the name of which is controlled by the name and optional namespace attribute).

Why don't the Java generic and specific mappings use Java String to represent an Avro string? The answer is efficiency: the Avro Utf8 type is mutable, so it may be reused for reading or writing a series of values.
Also, Java String decodes UTF-8 at object construction time, while Avro Utf8 does it lazily, which can increase performance in some cases.

Note that the Java reflect mapping does use Java's String class, since it is designed for Java compatibility, not performance.

Table 4-10. Avro Java type mappings

    Avro type   Generic Java mapping          Specific Java mapping          Reflect Java mapping
    null        null type
    boolean     boolean
    int         int                                                          short or int
    long        long
    float       float
    double      double
    bytes       java.nio.ByteBuffer                                          Array of byte
    string      org.apache.avro.util.Utf8                                    java.lang.String
    array       org.apache.avro.generic.                                     Array or java.util.Collection
                GenericArray
    map         java.util.Map
    record      org.apache.avro.generic.      Generated class implementing   Arbitrary user class with a
                GenericRecord                 org.apache.avro.specific.      zero-argument constructor. All
                                              SpecificRecord.                inherited nontransient instance
                                                                             fields are used.
    enum        java.lang.String              Generated Java enum            Arbitrary Java enum
    fixed       org.apache.avro.generic.      Generated class implementing
                GenericFixed                  org.apache.avro.specific.
                                              SpecificFixed.
    union       java.lang.Object

In-memory serialization and deserialization

Avro provides APIs for serialization and deserialization, which are useful when you want to integrate Avro with an existing system, such as a messaging system where the framing format is already defined. In other cases, consider using Avro's data file format.

Let's write a Java program to read and write Avro data to and from streams. We'll start with a simple Avro schema for representing a pair of strings as a record:

    {
      "type": "record",
      "name": "Pair",
      "doc": "A pair of strings.",
      "fields": [
        {"name": "left", "type": "string"},
        {"name": "right", "type": "string"}
      ]
    }

If this schema is saved in a file on the classpath called Pair.avsc (.avsc is the conventional extension for an Avro schema), then we can load it using the following statement:

    Schema schema = Schema.parse(getClass().getResourceAsStream("Pair.avsc"));

We can create an instance of an Avro record using the generic API as follows:

    GenericRecord datum = new GenericData.Record(schema);
    datum.put("left", new Utf8("L"));
    datum.put("right", new Utf8("R"));

Notice that we construct Avro Utf8 instances for the record's string fields.

Next, we serialize the record to an output stream:

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DatumWriter<GenericRecord> writer =
      new GenericDatumWriter<GenericRecord>(schema);
    Encoder encoder = new BinaryEncoder(out);
    writer.write(datum, encoder);
    encoder.flush();
    out.close();

There are two important objects here: the DatumWriter and the Encoder. A DatumWriter translates data objects into the types understood by an Encoder, which the Encoder then writes to the output stream. Here we are using a GenericDatumWriter, which passes the fields of GenericRecord to the Encoder, in this case the BinaryEncoder.

In this example only one object is written to the stream, but we could call write() with more objects before closing the stream if we wanted to.

The GenericDatumWriter needs to be passed the schema since it follows the schema to determine which values from the data objects to write out. After we have called the writer's write() method, we flush the encoder, then close the output stream.

We can reverse the process and read the object back from the byte buffer:

    DatumReader<GenericRecord> reader =
      new GenericDatumReader<GenericRecord>(schema);
    Decoder decoder = DecoderFactory.defaultFactory()
      .createBinaryDecoder(out.toByteArray(), null);
    GenericRecord result = reader.read(null, decoder);
    assertThat(result.get("left").toString(), is("L"));
    assertThat(result.get("right").toString(), is("R"));

We pass null to the calls to createBinaryDecoder() and read() since we are not reusing objects here (the decoder or the record, respectively).

Let's look briefly at the equivalent code using the specific API.
We can generate the Pair class from the schema file by using the Avro tools JAR file:†

    % java -jar $AVRO_HOME/avro-tools-*.jar compile schema \
        avro/src/main/resources/Pair.avsc avro/src/main/java

Then instead of a GenericRecord we construct a Pair instance, which we write to the stream using a SpecificDatumWriter, and read back using a SpecificDatumReader:

    Pair datum = new Pair();
    datum.left = new Utf8("L");
    datum.right = new Utf8("R");

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DatumWriter<Pair> writer =
      new SpecificDatumWriter<Pair>(Pair.class);
    Encoder encoder = new BinaryEncoder(out);
    writer.write(datum, encoder);
    encoder.flush();
    out.close();

    DatumReader<Pair> reader =
      new SpecificDatumReader<Pair>(Pair.class);
    Decoder decoder = DecoderFactory.defaultFactory()
      .createBinaryDecoder(out.toByteArray(), null);
    Pair result = reader.read(null, decoder);
    assertThat(result.left.toString(), is("L"));
    assertThat(result.right.toString(), is("R"));

† Avro can be downloaded in both source and binary forms from http://avro.apache.org/releases.html.

Avro data files

Avro's object container file format is for storing sequences of Avro objects. It is very similar in design to Hadoop's sequence files, which are described in "SequenceFile" on page 116. The main difference is that Avro data files are designed to be portable across languages, so, for example, you can write a file in Python and read it in C (we will do exactly this in the next section).

A data file has a header containing metadata, including the Avro schema and a sync marker, followed by a series of (optionally compressed) blocks containing the serialized Avro objects. Blocks are separated by a sync marker that is unique to the file (the marker for a particular file is found in the header) and that permits rapid resynchronization with a block boundary after seeking to an arbitrary point in the file, such as an HDFS block boundary. Thus, Avro data files are splittable, which makes them amenable to efficient MapReduce processing.

Writing Avro objects to a data file is similar to writing to a stream. We use a DatumWriter, as before, but instead of using an Encoder, we create a DataFileWriter instance with the DatumWriter.
Then we can create a new data file (which, by convention, has a .avro extension) and append objects to it:

    File file = new File("data.avro");
    DatumWriter<GenericRecord> writer =
      new GenericDatumWriter<GenericRecord>(schema);
    DataFileWriter<GenericRecord> dataFileWriter =
      new DataFileWriter<GenericRecord>(writer);
    dataFileWriter.create(schema, file);
    dataFileWriter.append(datum);
    dataFileWriter.close();

The objects that we write to the data file must conform to the file's schema, otherwise an exception will be thrown when we call append().

This example demonstrates writing to a local file (java.io.File in the previous snippet), but we can write to any java.io.OutputStream by using the overloaded create() method on DataFileWriter. To write a file to HDFS, for example, get an OutputStream by calling create() on FileSystem (see "Writing Data" on page 55).

Reading back objects from a data file is similar to the earlier case of reading objects from an in-memory stream, with one important difference: we don't have to specify a schema, since it is read from the file metadata. Indeed, we can get the schema from the DataFileReader instance, using getSchema(), and verify that it is the same as the one we used to write the original object with:

    DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
    DataFileReader<GenericRecord> dataFileReader =
      new DataFileReader<GenericRecord>(file, reader);
    assertThat("Schema is the same", schema, is(dataFileReader.getSchema()));

DataFileReader is a regular Java iterator, so we can iterate through its data objects by calling its hasNext() and next() methods. The following snippet checks that there is only one record, and that it has the expected field values:

    assertThat(dataFileReader.hasNext(), is(true));
    GenericRecord result = dataFileReader.next();
    assertThat(result.get("left").toString(), is("L"));
    assertThat(result.get("right").toString(), is("R"));
    assertThat(dataFileReader.hasNext(), is(false));

Rather than using the usual next() method, however, it is preferable to use the overloaded form that takes an instance of the object to be returned (in this case, GenericRecord), since it will reuse the object and save allocation and garbage collection costs for files containing many objects. The following is idiomatic:

    GenericRecord record = null;
    while (dataFileReader.hasNext()) {
      record = dataFileReader.next(record);
      // process record
    }

If object reuse is not important, you can use this shorter form:

    for (GenericRecord record : dataFileReader) {
      // process record
    }

For the general case of reading a file on a Hadoop file system, use Avro's FsInput to specify the input file using a Hadoop Path object.
DataFileReader actually offers random access to an Avro data file (via its seek() and sync() methods); however, in many cases, sequential streaming access is sufficient, for which DataFileStream should be used. DataFileStream can read from any Java InputStream.

Interoperability

To demonstrate Avro's language interoperability, let's write a data file using one language (Python) and read it back with another (C).

Python API. The program in Example 4-10 reads comma-separated strings from standard input and writes them as Pair records to an Avro data file. Like the Java code for writing a data file, we create a DatumWriter and a DataFileWriter object. Notice that we have

Example 4-11. A C program for reading Avro record pairs from a data file

    #include <avro.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[]) {
      if (argc != 2) {
        fprintf(stderr, "Usage: dump_pairs <data_file>\n");
        exit(EXIT_FAILURE);
      }

      const char *avrofile = argv[1];
      avro_schema_error_t error;
      avro_file_reader_t filereader;
      avro_datum_t pair;
      avro_datum_t left;
      avro_datum_t right;
      int rval;
      char *p;

      avro_file_reader(avrofile, &filereader);
      while (1) {
        rval = avro_file_reader_read(filereader, NULL, &pair);
        if (rval) break;
        if (avro_record_get(pair, "left", &left) == 0) {
          avro_string_get(left, &p);
          fprintf(stdout, "%s,", p);
        }
        if (avro_record_get(pair, "right", &right) == 0) {
          avro_string_get(right, &p);
          fprintf(stdout, "%s\n", p);
        }
      }
      avro_file_reader_close(filereader);
      return 0;
    }

The core of the program does three things:

1. opens a file reader of type avro_file_reader_t by calling Avro's avro_file_reader function,§
2. reads Avro data from the file reader with the avro_file_reader_read function in a while loop until there are no pairs left (as determined by the return value rval), and
3. closes the file reader with avro_file_reader_close.

The avro_file_reader_read function accepts a schema as its second argument to support the case where the schema for reading is different to the one used when the file was written (this is explained in the next section), but we simply pass in NULL, which tells Avro to use the data file's schema. The third argument is a pointer to an avro_datum_t object, which is populated with the contents of the next record read from the file. We unpack the pair structure into its fields by calling avro_record_get, and then we extract the value of these fields as strings using avro_string_get, which we print to the console.

Running the program using the output of the Python program prints the original input:

    % ./dump_pairs pairs.avro
    a,1
    c,2
    b,3
    b,2

We have successfully exchanged complex data between two Avro implementations.

Schema resolution

We can choose to use a different schema for reading the data back (the reader's schema) to the one we used to write it (the writer's schema). This is a powerful tool, since it enables schema evolution. To illustrate, consider a new schema for string pairs, with an added description field:

    {
      "type": "record",
      "name": "Pair",
      "doc": "A pair of strings with an added field.",
      "fields": [
        {"name": "left", "type": "string"},
        {"name": "right", "type": "string"},
        {"name": "description", "type": "string", "default": ""}
      ]
    }

We can use this schema to read the data we serialized earlier, since, crucially, we have given the description field a default value (the empty string‖), which Avro will use when there is no field defined in the records it is reading. Had we omitted the default attribute, we would get an error when trying to read the old data.

To make the default value null, rather than the empty string, we would instead define the description field using a union with the null Avro type:

    {"name": "description", "type": ["null", "string"], "default": null}

‡ For the general case, the Avro tools JAR file has a tojson command that dumps the contents of an Avro data file as JSON.
§ Avro functions and types have an avro_ prefix and are defined in the avro.h header file.
‖ Default values for fields are encoded using JSON. See the Avro specification for a description of this encoding for each data type.
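Since defaults are encoded as JSON, each Avro type has a corresponding JSON form for its default. The field declarations below are illustrative examples only (hypothetical fields, not part of the Pair schema):

```json
{"name": "counter", "type": "int", "default": 0},
{"name": "label", "type": "string", "default": ""},
{"name": "tags", "type": {"type": "array", "items": "string"}, "default": []},
{"name": "attrs", "type": {"type": "map", "values": "string"}, "default": {}}
```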

When the reader's schema is different from the writer's, we use the constructor for GenericDatumReader that takes two schema objects, the writer's and the reader's, in that order:

    DatumReader<GenericRecord> reader =
      new GenericDatumReader<GenericRecord>(schema, newSchema);
    Decoder decoder = DecoderFactory.defaultFactory()
      .createBinaryDecoder(out.toByteArray(), null);
    GenericRecord result = reader.read(null, decoder);
    assertThat(result.get("left").toString(), is("L"));
    assertThat(result.get("right").toString(), is("R"));
    assertThat(result.get("description").toString(), is(""));

For data files, which have the writer's schema stored in the metadata, we only need to specify the reader's schema explicitly, which we can do by passing null for the writer's schema:

    DatumReader<GenericRecord> reader =
      new GenericDatumReader<GenericRecord>(null, newSchema);

Another common use of a different reader's schema is to drop fields in a record, an operation called projection. This is useful when you have records with a large number of fields and you only want to read some of them. For example, this schema can be used to get only the right field of a Pair:

    {
      "type": "record",
      "name": "Pair",
      "doc": "The right field of a pair of strings.",
      "fields": [
        {"name": "right", "type": "string"}
      ]
    }

The rules for schema resolution have a direct bearing on how schemas may evolve from one version to the next, and are spelled out in the Avro specification for all Avro types. A summary of the rules for record evolution from the point of view of readers and writers (or servers and clients) is presented in Table 4-11.

Table 4-11. Schema resolution of records

    New schema      Writer   Reader   Action
    Added field     Old      New      The reader uses the default value of the new field,
                                      since it is not written by the writer.
                    New      Old      The reader does not know about the new field written
                                      by the writer, so it is ignored (projection).
    Removed field   Old      New      The reader ignores the removed field (projection).
                    New      Old      The removed field is not written by the writer. If the
                                      old schema had a default defined for the field, then
                                      the reader uses this; otherwise, it gets an error. In
                                      this case, it is best to update the reader's schema at
                                      the same time as, or before, the writer's.
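The first two rows of the table can be mimicked in plain Java, independent of Avro. The `resolve` helper below is a toy model with invented names; real Avro resolution also handles type promotions, aliases, and fields that have no default:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of record resolution: the reader's schema is a map from field
// name to default value; a written record supplies the writer's fields.
// Reader fields missing from the record get their default (added field,
// old writer); writer-only fields are dropped (projection).
public class ResolveSketch {
    public static Map<String, Object> resolve(Map<String, Object> readerDefaults,
                                              Map<String, Object> writtenRecord) {
        Map<String, Object> resolved = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : readerDefaults.entrySet()) {
            Object value = writtenRecord.containsKey(field.getKey())
                ? writtenRecord.get(field.getKey())   // written by the writer
                : field.getValue();                   // fall back to the default
            resolved.put(field.getKey(), value);
        }
        return resolved; // any extra writer fields are simply ignored
    }
}
```

Resolving an old {left, right} record against a three-field reader schema whose description defaults to the empty string yields a record with description set to "", which is the behavior the Avro assertions earlier in this section rely on.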

Sort order

Avro defines a sort order for objects. For most Avro types, the order is the natural one you would expect—for example, numeric types are ordered by ascending numeric value. Others are a little more subtle—enums are compared by the order in which the symbol is defined and not by the value of the symbol string, for instance.

All types except record have preordained rules for their sort order, as described in the Avro specification; they cannot be overridden by the user. For records, however, you can control the sort order by specifying the order attribute for a field. It takes one of three values: ascending (the default), descending (to reverse the order), or ignore (so the field is skipped for comparison purposes).

For example, the following schema (SortedPair.avsc) defines an ordering of Pair records by the right field in descending order. The left field is ignored for the purposes of ordering, but it is still present in the projection:

    {
      "type": "record",
      "name": "Pair",
      "doc": "A pair of strings, sorted by right field descending.",
      "fields": [
        {"name": "left", "type": "string", "order": "ignore"},
        {"name": "right", "type": "string", "order": "descending"}
      ]
    }

The record's fields are compared pairwise in the document order of the reader's schema. Thus, by specifying an appropriate reader's schema, you can impose an arbitrary ordering on data records. This schema (SwitchedPair.avsc) defines a sort order by the right field, then the left:

    {
      "type": "record",
      "name": "Pair",
      "doc": "A pair of strings, sorted by right then left.",
      "fields": [
        {"name": "right", "type": "string"},
        {"name": "left", "type": "string"}
      ]
    }

Avro implements efficient binary comparisons.
That is to say, Avro does not have to deserialize binary data into objects to perform the comparison, since it can instead work directly on the byte streams.# In the case of the original Pair schema (with no order attributes), for example, Avro implements the binary comparison as follows.

# A useful consequence of this property is that you can compute an Avro datum's hash code from either the object or the binary representation (the latter by using the static hashCode() method on BinaryData) and get the same result in both cases.
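The strategy can be sketched in plain Java. This is an illustration of the idea only, not Avro's actual implementation, which operates on Avro's length-prefixed binary encoding:

```java
import java.nio.charset.StandardCharsets;

// Illustration only: compare two (left, right) string pairs at the byte
// level, the way Avro's binary comparator conceptually works. Each field's
// UTF-8 bytes are compared lexicographically (as unsigned bytes); the
// second field is consulted only when the first compares equal.
public class BinaryCompareSketch {
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff); // unsigned byte compare
            if (cmp != 0) return cmp;
        }
        return a.length - b.length; // a shorter prefix sorts first
    }

    public static int comparePairs(String left1, String right1,
                                   String left2, String right2) {
        int cmp = compareBytes(left1.getBytes(StandardCharsets.UTF_8),
                               left2.getBytes(StandardCharsets.UTF_8));
        if (cmp != 0) return cmp; // first field decides; stop early
        return compareBytes(right1.getBytes(StandardCharsets.UTF_8),
                            right2.getBytes(StandardCharsets.UTF_8));
    }
}
```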

The first field, left, is a UTF-8-encoded string, for which Avro can compare the bytes lexicographically. If they differ, then the order is determined, and Avro can stop the comparison there. Otherwise, if the two byte sequences are the same, it compares the second two (right) fields, again lexicographically at the byte level, since the field is another UTF-8 string.

Notice that this description of a comparison function has exactly the same logic as the binary comparator we wrote for Writables in "Implementing a RawComparator for speed" on page 99. The great thing is that Avro provides the comparator for us, so we don't have to write and maintain this code. It's also easy to change the sort order just by changing the reader's schema. For the SortedPair.avsc or SwitchedPair.avsc schemas, the comparison function Avro uses is essentially the same as the one just described: the difference is in which fields are considered, the order in which they are considered, and whether the order is ascending or descending.

Avro MapReduce

Avro provides a number of classes for making it easy to run MapReduce programs on Avro data. For example, AvroMapper and AvroReducer in the org.apache.avro.mapred package are specializations of Hadoop's (old style) Mapper and Reducer classes. They eliminate the key-value distinction for inputs and outputs, since Avro data files are just a sequence of values. However, intermediate data is still divided into key-value pairs for the shuffle. Avro's MapReduce integration was being added as this edition went to press, but you can find example code at the website accompanying this book.

For languages other than Java, Avro provides a connector framework (in the org.apache.avro.mapred.tether package). At the time of writing, there are no bindings for other languages, but it is expected these will be added in future releases.

File-Based Data Structures

For some applications, you need a specialized data structure to hold your data. For doing MapReduce-based processing, putting each blob of binary data into its own file doesn't scale, so Hadoop developed a number of higher-level containers for these situations.

SequenceFile

Imagine a logfile, where each log record is a new line of text. If you want to log binary types, plain text isn't a suitable format. Hadoop's SequenceFile class fits the bill in this situation, providing a persistent data structure for binary key-value pairs. To use it as a logfile format, you would choose a key, such as a timestamp represented by a LongWritable, and the value is a Writable that represents the quantity being logged.

SequenceFiles also work well as containers for smaller files. HDFS and MapReduce are optimized for large files, so packing files into a SequenceFile makes storing and processing the smaller files more efficient. ("Processing a whole file as a record" on page 206 contains a program to pack files into a SequenceFile.*)

Writing a SequenceFile

To create a SequenceFile, use one of its createWriter() static methods, which return a SequenceFile.Writer instance. There are several overloaded versions, but they all require you to specify a stream to write to (either an FSDataOutputStream or a FileSystem and Path pairing), a Configuration object, and the key and value types. Optional arguments include the compression type and codec, a Progressable callback to be informed of write progress, and a Metadata instance to be stored in the SequenceFile header.

The keys and values stored in a SequenceFile do not necessarily need to be Writable. Any types that can be serialized and deserialized by a Serialization may be used.

Once you have a SequenceFile.Writer, you then write key-value pairs using the append() method. Then when you've finished, you call the close() method (SequenceFile.Writer implements java.io.Closeable).

Example 4-12 shows a short program to write some key-value pairs to a SequenceFile, using the API just described.

Example 4-12. Writing a SequenceFile

    public class SequenceFileWriteDemo {

      private static final String[] DATA = {
        "One, two, buckle my shoe",
        "Three, four, shut the door",
        "Five, six, pick up sticks",
        "Seven, eight, lay them straight",
        "Nine, ten, a big fat hen"
      };

      public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        IntWritable key = new IntWritable();
        Text value = new Text();
        SequenceFile.Writer writer = null;
        try {
          writer = SequenceFile.createWriter(fs, conf, path,
              key.getClass(), value.getClass());

          for (int i = 0; i < 100; i++) {
            key.set(100 - i);
            value.set(DATA[i % DATA.length]);
            writer.append(key, value);
          }
        } finally {
          IOUtils.closeStream(writer);
        }
      }
    }

* In a similar vein, the blog post "A Million Little Files" by Stuart Sierra includes code for converting a tar file into a SequenceFile, http://stuartsierra.com/2008/04/24/a-million-little-files.

and a value argument, and reads the next key and value in the stream into these variables:

    public boolean next(Writable key, Writable val)

The return value is true if a key-value pair was read and false if the end of the file has been reached.

For other, non-Writable serialization frameworks (such as Apache Thrift), you should use these two methods:

    public Object next(Object key) throws IOException
    public Object getCurrentValue(Object val) throws IOException

In this case, you need to make sure that the serialization you want to use has been set in the io.serializations property; see "Serialization Frameworks" on page 101.

If the next() method returns a non-null object, a key-value pair was read from the stream, and the value can be retrieved using the getCurrentValue() method. Otherwise, if next() returns null, the end of the file has been reached.

The program in Example 4-13 demonstrates how to read a sequence file that has Writable keys and values. Note how the types are discovered from the SequenceFile.Reader via calls to getKeyClass() and getValueClass(), then ReflectionUtils is used to create an instance for the key and an instance for the value. By using this technique, the program can be used with any sequence file that has Writable keys and values.

Example 4-13. Reading a SequenceFile

    public class SequenceFileReadDemo {

      public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        SequenceFile.Reader reader = null;
        try {
          reader = new SequenceFile.Reader(fs, path, conf);
          Writable key = (Writable)
            ReflectionUtils.newInstance(reader.getKeyClass(), conf);
          Writable value = (Writable)
            ReflectionUtils.newInstance(reader.getValueClass(), conf);
          long position = reader.getPosition();
          while (reader.next(key, value)) {
            String syncSeen = reader.syncSeen() ? "*" : "";
            System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, value);
            position = reader.getPosition(); // beginning of next record
          }
        } finally {
          IOUtils.closeStream(reader);
        }
      }
    }

Another feature of the program is that it displays the positions of the sync points in the sequence file. A sync point is a point in the stream that can be used to resynchronize with a record boundary if the reader is "lost"—for example, after seeking to an arbitrary position in the stream. Sync points are recorded by SequenceFile.Writer, which inserts a special entry to mark the sync point every few records as a sequence file is being written. Such entries are small enough to incur only a modest storage overhead—less than 1%. Sync points always align with record boundaries.

Running the program in Example 4-13 shows the sync points in the sequence file as asterisks. The first one occurs at position 2021 (the second one occurs at position 4075, but is not shown in the output):

    % hadoop SequenceFileReadDemo numbers.seq
    [128]   100     One, two, buckle my shoe
    [173]   99      Three, four, shut the door
    [220]   98      Five, six, pick up sticks
    [264]   97      Seven, eight, lay them straight
    [314]   96      Nine, ten, a big fat hen
    [359]   95      One, two, buckle my shoe
    [404]   94      Three, four, shut the door
    [451]   93      Five, six, pick up sticks
    [495]   92      Seven, eight, lay them straight
    [545]   91      Nine, ten, a big fat hen
    [590]   90      One, two, buckle my shoe
    ...
    [1976]  60      One, two, buckle my shoe
    [2021*] 59      Three, four, shut the door
    [2088]  58      Five, six, pick up sticks
    [2132]  57      Seven, eight, lay them straight
    [2182]  56      Nine, ten, a big fat hen
    ...
    [4557]  5       One, two, buckle my shoe
    [4602]  4       Three, four, shut the door
    [4649]  3       Five, six, pick up sticks
    [4693]  2       Seven, eight, lay them straight
    [4743]  1       Nine, ten, a big fat hen

There are two ways to seek to a given position in a sequence file. The first is the seek() method, which positions the reader at the given point in the file. For example, seeking to a record boundary works as expected:

    reader.seek(359);
    assertThat(reader.next(key, value), is(true));
    assertThat(((IntWritable) key).get(), is(95));

But if the position in the file is not at a record boundary, the reader fails when the next() method is called:

    reader.seek(360);
    reader.next(key, value); // fails with IOException

The second way to find a record boundary makes use of sync points. The sync(long position) method on SequenceFile.Reader positions the reader at the next sync point after position. (If there are no sync points in the file after this position, then the reader will be positioned at the end of the file.) Thus, we can call sync() with any position in the stream—a nonrecord boundary, for example—and the reader will reestablish itself at the next sync point so reading can continue:

    reader.sync(360);
    assertThat(reader.getPosition(), is(2021L));
    assertThat(reader.next(key, value), is(true));
    assertThat(((IntWritable) key).get(), is(59));

SequenceFile.Writer has a method called sync() for inserting a sync point at the current position in the stream. This is not to be confused with the identically named but otherwise unrelated sync() method defined by the Syncable interface for synchronizing buffers to the underlying device.

Sync points come into their own when using sequence files as input to MapReduce, since they permit the file to be split, so different portions of it can be processed independently by separate map tasks. See "SequenceFileInputFormat" on page 213.

Displaying a SequenceFile with the command-line interface

The hadoop fs command has a -text option to display sequence files in textual form. It looks at a file's magic number so that it can attempt to detect the type of the file and appropriately convert it to text. It can recognize gzipped files and sequence files; otherwise, it assumes the input is plain text.

For sequence files, this command is really useful only if the keys and values have a meaningful string representation (as defined by the toString() method). Also, if you have your own key or value classes, then you will need to make sure they are on Hadoop's classpath.

Running it on the sequence file we created in the previous section gives the following output:

    % hadoop fs -text numbers.seq | head
    100     One, two, buckle my shoe
    99      Three, four, shut the door
    98      Five, six, pick up sticks
    97      Seven, eight, lay them straight
    96      Nine, ten, a big fat hen
    95      One, two, buckle my shoe
    94      Three, four, shut the door
    93      Five, six, pick up sticks
    92      Seven, eight, lay them straight
    91      Nine, ten, a big fat hen

Sorting and merging SequenceFiles

The most powerful way of sorting (and merging) one or more sequence files is to use MapReduce. MapReduce is inherently parallel and will let you specify the number of reducers to use, which determines the number of output partitions. For example, by specifying one reducer, you get a single output file. We can use the sort example that comes with Hadoop by specifying that the input and output are sequence files, and by setting the key and value types:

    % hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar sort -r 1 \
        -inFormat org.apache.hadoop.mapred.SequenceFileInputFormat \
        -outFormat org.apache.hadoop.mapred.SequenceFileOutputFormat \
        -outKey org.apache.hadoop.io.IntWritable \
        -outValue org.apache.hadoop.io.Text \
        numbers.seq sorted
    % hadoop fs -text sorted/part-00000 | head
    1       Nine, ten, a big fat hen
    2       Seven, eight, lay them straight
    3       Five, six, pick up sticks
    4       Three, four, shut the door
    5       One, two, buckle my shoe
    6       Nine, ten, a big fat hen
    7       Seven, eight, lay them straight
    8       Five, six, pick up sticks
    9       Three, four, shut the door
    10      One, two, buckle my shoe

Sorting is covered in more detail in "Sorting" on page 232.

As an alternative to using MapReduce for sort/merge, there is a SequenceFile.Sorter class that has a number of sort() and merge() methods. These functions predate MapReduce and are lower-level functions than MapReduce (for example, to get parallelism, you need to partition your data manually), so in general MapReduce is the preferred approach to sort and merge sequence files.

The SequenceFile format

A sequence file consists of a header followed by one or more records (see Figure 4-2). The first three bytes of a sequence file are the bytes SEQ, which act as a magic number, followed by a single byte representing the version number.
The header contains other fields, including the names of the key and value classes, compression details, user-defined metadata, and the sync marker.† Recall that the sync marker is used to allow a reader to synchronize to a record boundary from any position in the file. Each file has a randomly generated sync marker, whose value is stored in the header. Sync markers appear between records in the sequence file. They are designed to incur less than a 1% storage overhead, so they don't necessarily appear between every pair of records (such is the case for short records).

† Full details of the format of these fields may be found in SequenceFile's documentation and source code.
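Resynchronization amounts to scanning forward from an arbitrary offset until the file's marker bytes next appear. A minimal sketch of that idea in plain Java (an illustration only, not SequenceFile's actual on-disk logic, which marks sync entries with an escape length before the marker bytes):

```java
import java.util.Arrays;

// Illustration: given a byte stream and the file's unique sync marker,
// find the first marker occurrence at or after a given offset. A reader
// that seeks to an arbitrary position can use this to reestablish itself
// at the next record boundary.
public class SyncScanSketch {
    public static int nextSync(byte[] stream, byte[] marker, int from) {
        for (int i = Math.max(from, 0); i + marker.length <= stream.length; i++) {
            if (Arrays.equals(Arrays.copyOfRange(stream, i, i + marker.length), marker)) {
                return i; // position of the next sync point
            }
        }
        return -1; // no sync point after this offset: treat as end of file
    }
}
```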

Figure 4-2. The internal structure of a sequence file with no compression and record compression

The internal format of the records depends on whether compression is enabled, and if it is, whether it is record compression or block compression.

If no compression is enabled (the default), then each record is made up of the record length (in bytes), the key length, the key, and then the value. The length fields are written as four-byte integers adhering to the contract of the writeInt() method of java.io.DataOutput. Keys and values are serialized using the Serialization defined for the class being written to the sequence file.

The format for record compression is almost identical to no compression, except the value bytes are compressed using the codec defined in the header. Note that keys are not compressed.

Block compression compresses multiple records at once; it is therefore more compact than and should generally be preferred over record compression, because it has the opportunity to take advantage of similarities between records (see Figure 4-3). Records are added to a block until it reaches a minimum size in bytes, defined by the io.seqfile.compress.blocksize property: the default is 1 million bytes. A sync marker is written before the start of every block. The format of a block is a field indicating the number of records in the block, followed by four compressed fields: the key lengths, the keys, the value lengths, and the values.

MapFile

A MapFile is a sorted SequenceFile with an index to permit lookups by key. MapFile can be thought of as a persistent form of java.util.Map (although it doesn't implement this interface), which is able to grow beyond the size of a Map that is kept in memory.

Figure 4-3. The internal structure of a sequence file with block compression

Writing a MapFile

Writing a MapFile is similar to writing a SequenceFile: you create an instance of MapFile.Writer, then call the append() method to add entries in order. (Attempting to add entries out of order will result in an IOException.) Keys must be instances of WritableComparable, and values must be Writable—contrast this to SequenceFile, which can use any serialization framework for its entries.

The program in Example 4-14 creates a MapFile, and writes some entries to it. It is very similar to the program in Example 4-12 for creating a SequenceFile.

Example 4-14. Writing a MapFile

    public class MapFileWriteDemo {
      
      private static final String[] DATA = {
        "One, two, buckle my shoe",
        "Three, four, shut the door",
        "Five, six, pick up sticks",
        "Seven, eight, lay them straight",
        "Nine, ten, a big fat hen"
      };
      
      public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        IntWritable key = new IntWritable();
        Text value = new Text();
        MapFile.Writer writer = null;
        try {
          writer = new MapFile.Writer(conf, fs, uri,
              key.getClass(), value.getClass());
          
          for (int i = 0; i < 1024; i++) {
            key.set(i + 1);
            value.set(DATA[i % DATA.length]);
            writer.append(key, value);
          }
        } finally {
          IOUtils.closeStream(writer);
        }
      }
    }

Since the index is only a partial index of keys, MapFile is not able to provide methods to enumerate, or even count, all the keys it contains. The only way to perform these operations is to read the whole file.

Reading a MapFile

Iterating through the entries in order in a MapFile is similar to the procedure for a SequenceFile: you create a MapFile.Reader, then call the next() method until it returns false, signifying that no entry was read because the end of the file was reached:

    public boolean next(WritableComparable key, Writable val) throws IOException

A random access lookup can be performed by calling the get() method:

    public Writable get(WritableComparable key, Writable val) throws IOException

The return value is used to determine if an entry was found in the MapFile; if it's null, then no value exists for the given key. If key was found, then the value for that key is read into val, as well as being returned from the method call.

It might be helpful to understand how this is implemented. Here is a snippet of code that retrieves an entry for the MapFile we created in the previous section:

    Text value = new Text();
    reader.get(new IntWritable(496), value);
    assertThat(value.toString(), is("One, two, buckle my shoe"));

For this operation, the MapFile.Reader reads the index file into memory (this is cached so that subsequent random access calls will use the same in-memory index). The reader then performs a binary search on the in-memory index to find the key in the index that is less than or equal to the search key, 496. In this example, the index key found is 385, with value 18030, which is the offset in the data file. Next the reader seeks to this offset in the data file and reads entries until the key is greater than or equal to the search key, 496. In this case, a match is found and the value is read from the data file. Overall, a lookup takes a single disk seek and a scan through up to 128 entries on disk.
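The lookup strategy just described (a binary search for the largest index key less than or equal to the search key, followed by a short scan) can be sketched in plain Java. This is a simplified model rather than Hadoop's code; the index keys and the 18030 offset echo the 385/496 example above, while the other offsets are made up:

```java
public class IndexLookup {

    // In-memory index: sorted index keys and the corresponding data-file
    // offsets. With the default index interval of 128, every 128th key
    // is indexed.
    static long seekOffset(int[] indexKeys, long[] offsets, int searchKey) {
        int lo = 0, hi = indexKeys.length - 1, found = 0;
        while (lo <= hi) {               // binary search for the largest
            int mid = (lo + hi) >>> 1;   // index key <= searchKey
            if (indexKeys[mid] <= searchKey) {
                found = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return offsets[found];           // seek here, then scan up to 128 entries
    }

    public static void main(String[] args) {
        int[] indexKeys = {1, 129, 257, 385, 513};
        long[] offsets  = {0, 5898, 11964, 18030, 24096}; // hypothetical offsets
        System.out.println(seekOffset(indexKeys, offsets, 496)); // → 18030
    }
}
```

With the io.map.index.skip property (described below) set to s, only every (s+1)th index entry would be kept, shrinking indexKeys at the cost of a longer scan after the seek.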
For a random-access read, this is actually very efficient.

The getClosest() method is like get() except it returns the "closest" match to the specified key, rather than returning null on no match. More precisely, if the MapFile contains the specified key, then that is the entry returned; otherwise, the key in the MapFile that is immediately after (or before, according to a boolean argument) the specified key is returned.

A very large MapFile's index can take up a lot of memory. Rather than reindex to change the index interval, it is possible to load only a fraction of the index keys into memory when reading the MapFile by setting the io.map.index.skip property. This property is normally 0, which means no index keys are skipped; a value of 1 means skip one key for every key in the index (so every other key ends up in the index), 2 means skip two keys for every key in the index (so one third of the keys end up in the index), and so

on. Larger skip values save memory but at the expense of lookup time, since more entries have to be scanned on disk, on average.

Converting a SequenceFile to a MapFile

One way of looking at a MapFile is as an indexed and sorted SequenceFile. So it's quite natural to want to be able to convert a SequenceFile into a MapFile. We covered how to sort a SequenceFile in "Sorting and merging SequenceFiles" on page 122, so here we look at how to create an index for a SequenceFile. The program in Example 4-15 hinges around the static utility method fix() on MapFile, which re-creates the index for a MapFile.

Example 4-15. Re-creating the index for a MapFile

    public class MapFileFixer {
    
      public static void main(String[] args) throws Exception {
        String mapUri = args[0];
        
        Configuration conf = new Configuration();
        
        FileSystem fs = FileSystem.get(URI.create(mapUri), conf);
        Path map = new Path(mapUri);
        Path mapData = new Path(map, MapFile.DATA_FILE_NAME);
        
        // Get key and value types from data sequence file
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, mapData, conf);
        Class keyClass = reader.getKeyClass();
        Class valueClass = reader.getValueClass();
        reader.close();
        
        // Create the map file index file
        long entries = MapFile.fix(fs, map, keyClass, valueClass, false, conf);
        System.out.printf("Created MapFile %s with %d entries\n", map, entries);
      }
    }

The fix() method is usually used for re-creating corrupted indexes, but since it creates a new index from scratch, it's exactly what we need here. The recipe is as follows:

1. Sort the sequence file numbers.seq into a new directory called numbers.map that will become the MapFile (if the sequence file is already sorted, then you can skip this step.
Instead, copy it to a file numbers.map/data, then go to step 3):

    % hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar sort -r 1 \
      -inFormat org.apache.hadoop.mapred.SequenceFileInputFormat \
      -outFormat org.apache.hadoop.mapred.SequenceFileOutputFormat \
      -outKey org.apache.hadoop.io.IntWritable \
      -outValue org.apache.hadoop.io.Text \
      numbers.seq numbers.map

2. Rename the MapReduce output to be the data file:

    % hadoop fs -mv numbers.map/part-00000 numbers.map/data
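Conceptually, what fix() then does is build a fresh partial index by sampling every 128th key of the sorted data file (with the default index interval). A toy model in plain Java, using the 1,024 keys written by Example 4-14, shows the shape of the resulting index; the real method of course works on Writable records in files, not arrays:

```java
import java.util.ArrayList;
import java.util.List;

public class IndexBuilder {

    // Build a partial index over sorted keys: record every 'interval'-th
    // key (the first key of each group) together with its entry position.
    static List<long[]> buildIndex(int[] sortedKeys, int interval) {
        List<long[]> index = new ArrayList<>();
        for (int i = 0; i < sortedKeys.length; i += interval) {
            index.add(new long[] { sortedKeys[i], i }); // {key, position}
        }
        return index;
    }

    public static void main(String[] args) {
        int[] keys = new int[1024];                 // keys 1..1024, as in Example 4-14
        for (int i = 0; i < keys.length; i++) keys[i] = i + 1;
        List<long[]> index = buildIndex(keys, 128); // default MapFile index interval
        System.out.println(index.size());           // → 8
        System.out.println(index.get(3)[0]);        // → 385, the index key from the lookup example
    }
}
```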

CHAPTER 5
Developing a MapReduce Application

In Chapter 2, we introduced the MapReduce model. In this chapter, we look at the practical aspects of developing a MapReduce application in Hadoop.

Writing a program in MapReduce has a certain flow to it. You start by writing your map and reduce functions, ideally with unit tests to make sure they do what you expect. Then you write a driver program to run a job, which can run from your IDE using a small subset of the data to check that it is working. If it fails, then you can use your IDE's debugger to find the source of the problem. With this information, you can expand your unit tests to cover this case and improve your mapper or reducer as appropriate to handle such input correctly.

When the program runs as expected against the small dataset, you are ready to unleash it on a cluster. Running against the full dataset is likely to expose some more issues, which you can fix as before, by expanding your tests and mapper or reducer to handle the new cases. Debugging failing programs in the cluster is a challenge, but Hadoop provides some tools to help, such as an IsolationRunner, which allows you to run a task over the same input on which it failed, with a debugger attached, if necessary.

After the program is working, you may wish to do some tuning, first by running through some standard checks for making MapReduce programs faster and then by doing task profiling. Profiling distributed programs is not trivial, but Hadoop has hooks to aid the process.

Before we start writing a MapReduce program, we need to set up and configure the development environment. And to do that, we need to learn a bit about how Hadoop does configuration.

The Configuration API

Components in Hadoop are configured using Hadoop's own configuration API. An instance of the Configuration class (found in the org.apache.hadoop.conf package) represents a collection of configuration properties and their values. Each property is named by a String, and the type of a value may be one of several types, including Java primitives such as boolean, int, long, float, and other useful types such as String, Class, java.io.File, and collections of Strings.

Configurations read their properties from resources—XML files with a simple structure for defining name-value pairs. See Example 5-1.

Example 5-1. A simple configuration file, configuration-1.xml

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>color</name>
        <value>yellow</value>
        <description>Color</description>
      </property>
      
      <property>
        <name>size</name>
        <value>10</value>
        <description>Size</description>
      </property>
      
      <property>
        <name>weight</name>
        <value>heavy</value>
        <final>true</final>
        <description>Weight</description>
      </property>
      
      <property>
        <name>size-weight</name>
        <value>${size},${weight}</value>
        <description>Size and weight</description>
      </property>
    </configuration>

Assuming this configuration file is in a file called configuration-1.xml, we can access its properties using a piece of code like this:

    Configuration conf = new Configuration();
    conf.addResource("configuration-1.xml");
    
    assertThat(conf.get("color"), is("yellow"));
    assertThat(conf.getInt("size", 0), is(10));
    assertThat(conf.get("breadth", "wide"), is("wide"));

There are a couple of things to note: type information is not stored in the XML file; instead, properties can be interpreted as a given type when they are read. Also, the get() methods allow you to specify a default value, which is used if the property is not defined in the XML file, as in the case of breadth here.

Combining Resources

Things get interesting when more than one resource is used to define a configuration. This is used in Hadoop to separate out the default properties for the system, defined internally in a file called core-default.xml, from the site-specific overrides, in core-site.xml. The file in Example 5-2 defines the size and weight properties.

Example 5-2. A second configuration file, configuration-2.xml

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>size</name>
        <value>12</value>
      </property>
      
      <property>
        <name>weight</name>
        <value>light</value>
      </property>
    </configuration>

Resources are added to a Configuration in order:

    Configuration conf = new Configuration();
    conf.addResource("configuration-1.xml");
    conf.addResource("configuration-2.xml");

Properties defined in resources that are added later override the earlier definitions. So the size property takes its value from the second configuration file, configuration-2.xml:

    assertThat(conf.getInt("size", 0), is(12));

However, properties that are marked as final cannot be overridden in later definitions. The weight property is final in the first configuration file, so the attempt to override it in the second fails, and it takes the value from the first:

    assertThat(conf.get("weight"), is("heavy"));

Attempting to override final properties usually indicates a configuration error, so this results in a warning message being logged to aid diagnosis. Administrators mark properties as final in the daemon's site files that they don't want users to change in their client-side configuration files or job submission parameters.
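The override rules just described (later resources win, except for properties marked final) can be modeled in a few lines of plain Java. This is a toy model of the behavior, not Hadoop's Configuration class:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.HashSet;
import java.util.Set;

public class ResourceStack {

    private final Map<String, String> props = new HashMap<>();
    private final Set<String> finals = new HashSet<>();

    // Later resources override earlier ones, unless the property was
    // already marked final by an earlier resource.
    void addResource(Map<String, String> resource, Set<String> finalNames) {
        for (Map.Entry<String, String> e : resource.entrySet()) {
            if (!finals.contains(e.getKey())) {
                props.put(e.getKey(), e.getValue());
            } // else: keep the earlier value (Hadoop also logs a warning)
        }
        finals.addAll(finalNames);
    }

    String get(String name) { return props.get(name); }

    public static void main(String[] args) {
        ResourceStack conf = new ResourceStack();
        conf.addResource(Map.of("size", "10", "weight", "heavy"), Set.of("weight"));
        conf.addResource(Map.of("size", "12", "weight", "light"), Set.of());
        System.out.println(conf.get("size"));   // → 12 (overridden by the later resource)
        System.out.println(conf.get("weight")); // → heavy (final, so not overridden)
    }
}
```

The two addResource() calls mirror Example 5-1 and Example 5-2: size is overridden to 12, while the final weight property keeps its original value.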

Variable Expansion

Configuration properties can be defined in terms of other properties, or system properties. For example, the property size-weight in the first configuration file is defined as ${size},${weight}, and these properties are expanded using the values found in the configuration:

    assertThat(conf.get("size-weight"), is("12,heavy"));

System properties take priority over properties defined in resource files:

    System.setProperty("size", "14");
    assertThat(conf.get("size-weight"), is("14,heavy"));

This feature is useful for overriding properties on the command line by using -Dproperty=value JVM arguments.

Note that while configuration properties can be defined in terms of system properties, unless system properties are redefined using configuration properties, they are not accessible through the configuration API. Hence:

    System.setProperty("length", "2");
    assertThat(conf.get("length"), is((String) null));

Configuring the Development Environment

The first step is to download the version of Hadoop that you plan to use and unpack it on your development machine (this is described in Appendix A). Then, in your favorite IDE, create a new project and add all the JAR files from the top level of the unpacked distribution and from the lib directory to the classpath. You will then be able to compile Java Hadoop programs and run them in local (standalone) mode within the IDE.

    For Eclipse users, there is a plug-in available for browsing HDFS and launching MapReduce programs. Instructions are available on the Hadoop wiki at http://wiki.apache.org/hadoop/EclipsePlugIn.
    
    Alternatively, Karmasphere provides Eclipse and NetBeans plug-ins for developing and running MapReduce jobs and browsing Hadoop clusters.

Managing Configuration

When developing Hadoop applications, it is common to switch between running the application locally and running it on a cluster.
In fact, you may have several clusters you work with, or you may have a local "pseudo-distributed" cluster that you like to test on (a pseudo-distributed cluster is one whose daemons all run on the local machine; setting up this mode is covered in Appendix A, too).

One way to accommodate these variations is to have Hadoop configuration files containing the connection settings for each cluster you run against, and specify which one you are using when you run Hadoop applications or tools. As a matter of best practice, it's recommended to keep these files outside Hadoop's installation directory tree, as this makes it easy to switch between Hadoop versions without duplicating or losing settings.

For the purposes of this book, we assume the existence of a directory called conf that contains three configuration files: hadoop-local.xml, hadoop-localhost.xml, and hadoop-cluster.xml (these are available in the example code for this book). Note that there is nothing special about the names of these files—they are just convenient ways to package up some configuration settings. (Compare this to Table A-1 in Appendix A, which sets out the equivalent server-side configurations.)

The hadoop-local.xml file contains the default Hadoop configuration for the default filesystem and the jobtracker:

    <?xml version="1.0"?>
    <configuration>
    
      <property>
        <name>fs.default.name</name>
        <value>file:///</value>
      </property>
    
      <property>
        <name>mapred.job.tracker</name>
        <value>local</value>
      </property>
    
    </configuration>

The settings in hadoop-localhost.xml point to a namenode and a jobtracker both running on localhost:

    <?xml version="1.0"?>
    <configuration>
    
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost/</value>
      </property>
    
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:8021</value>
      </property>
    
    </configuration>

Finally, hadoop-cluster.xml contains details of the cluster's namenode and jobtracker addresses. In practice, you would name the file after the name of the cluster, rather than "cluster" as we have here:

    <?xml version="1.0"?>
    <configuration>
    
      <property>
        <name>fs.default.name</name>
        <value>hdfs://namenode/</value>
      </property>
    
      <property>
        <name>mapred.job.tracker</name>
        <value>jobtracker:8021</value>
      </property>
    
    </configuration>

You can add other configuration properties to these files as needed. For example, if you wanted to set your Hadoop username for a particular cluster, you could do it in the appropriate file.

Setting User Identity

The user identity that Hadoop uses for permissions in HDFS is determined by running the whoami command on the client system. Similarly, the group names are derived from the output of running groups.

If, however, your Hadoop user identity is different from the name of your user account on your client machine, then you can explicitly set your Hadoop username and group names by setting the hadoop.job.ugi property. The username and group names are specified as a comma-separated list of strings (e.g., preston,directors,inventors would set the username to preston and the group names to directors and inventors).

You can set the user identity that the HDFS web interface runs as by setting dfs.web.ugi using the same syntax. By default, it is webuser,webgroup, which is not a super user, so system files are not accessible through the web interface.

Notice that, by default, there is no authentication with this system. See "Security" on page 281 for how to use Kerberos authentication with Hadoop.

With this setup, it is easy to use any configuration with the -conf command-line switch. For example, the following command shows a directory listing on the HDFS server running in pseudo-distributed mode on localhost:

    % hadoop fs -conf conf/hadoop-localhost.xml -ls .
    Found 2 items
    drwxr-xr-x   - tom supergroup          0 2009-04-08 10:32 /user/tom/input
    drwxr-xr-x   - tom supergroup          0 2009-04-08 13:09 /user/tom/output

If you omit the -conf option, then you pick up the Hadoop configuration in the conf subdirectory under $HADOOP_INSTALL. Depending on how you set this up, this may be for a standalone setup or a pseudo-distributed cluster.

Tools that come with Hadoop support the -conf option, but it's also straightforward to make your programs (such as programs that run MapReduce jobs) support it, too, using the Tool interface.

GenericOptionsParser, Tool, and ToolRunner

Hadoop comes with a few helper classes for making it easier to run jobs from the command line. GenericOptionsParser is a class that interprets common Hadoop command-line options and sets them on a Configuration object for your application to use as desired. You don't usually use GenericOptionsParser directly, as it's more convenient to implement the Tool interface and run your application with the ToolRunner, which uses GenericOptionsParser internally:

    public interface Tool extends Configurable {
      int run(String [] args) throws Exception;
    }

Example 5-3 shows a very simple implementation of Tool, for printing the keys and values of all the properties in the Tool's Configuration object.

Example 5-3. An example Tool implementation for printing the properties in a Configuration

    public class ConfigurationPrinter extends Configured implements Tool {
      
      static {
        Configuration.addDefaultResource("hdfs-default.xml");
        Configuration.addDefaultResource("hdfs-site.xml");
        Configuration.addDefaultResource("mapred-default.xml");
        Configuration.addDefaultResource("mapred-site.xml");
      }
    
      @Override
      public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        for (Entry<String, String> entry: conf) {
          System.out.printf("%s=%s\n", entry.getKey(), entry.getValue());
        }
        return 0;
      }
      
      public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new ConfigurationPrinter(), args);
        System.exit(exitCode);
      }
    }

We make ConfigurationPrinter a subclass of Configured, which is an implementation of the Configurable interface. All implementations of Tool need to implement Configurable (since Tool extends it), and subclassing Configured is often the easiest way to achieve this. The run() method obtains the Configuration using Configurable's getConf() method and then iterates over it, printing each property to standard output.

The static block makes sure that the HDFS and MapReduce configurations are picked up in addition to the core ones (which Configuration knows about already).

ConfigurationPrinter's main() method does not invoke its own run() method directly. Instead, we call ToolRunner's static run() method, which takes care of creating a Configuration object for the Tool, before calling its run() method. ToolRunner also uses a GenericOptionsParser to pick up any standard options specified on the command line and set them on the Configuration instance. We can see the effect of picking up the properties specified in conf/hadoop-localhost.xml by running the following command:

    % hadoop ConfigurationPrinter -conf conf/hadoop-localhost.xml | grep mapred.job.tracker=
    mapred.job.tracker=localhost:8021

Which Properties Can I Set?

ConfigurationPrinter is a useful tool for telling you what a property is set to in your environment. You can also see the default settings for all the public properties in Hadoop by looking in the docs directory of your Hadoop installation for HTML files called core-default.html, hdfs-default.html and mapred-default.html. Each property has a description that explains what it is for and what values it can be set to.

Be aware that some properties have no effect when set in the client configuration.
For example, if in your job submission you set mapred.tasktracker.map.tasks.maximum with the expectation that it would change the number of task slots for the tasktrackers running your job, then you would be disappointed, since this property is only honored if set in the tasktracker's mapred-site.xml file. In general, you can tell the component where a property should be set by its name, so the fact that mapred.tasktracker.map.tasks.maximum starts with mapred.tasktracker gives you a clue that it can be set only for the tasktracker daemon. This is not a hard and fast rule, however, so in some cases you may need to resort to trial and error, or even reading the source.

We discuss many of Hadoop's most important configuration properties throughout this book. You can find a configuration property reference on the book's website at http://www.hadoopbook.com.

GenericOptionsParser also allows you to set individual properties. For example:

    % hadoop ConfigurationPrinter -D color=yellow | grep color
    color=yellow

The -D option is used to set the configuration property with key color to the value yellow. Options specified with -D take priority over properties from the configuration files. This is very useful: you can put defaults into configuration files and then override them with the -D option as needed. A common example of this is setting the number of reducers for a MapReduce job via -D mapred.reduce.tasks=n. This will override the number of reducers set on the cluster or set in any client-side configuration files.

The other options that GenericOptionsParser and ToolRunner support are listed in Table 5-1. You can find more on Hadoop's configuration API in "The Configuration API" on page 130.

    Do not confuse setting Hadoop properties using the -D property=value option to GenericOptionsParser (and ToolRunner) with setting JVM system properties using the -Dproperty=value option to the java command. The syntax for JVM system properties does not allow any whitespace between the D and the property name, whereas GenericOptionsParser requires them to be separated by whitespace.
    
    JVM system properties are retrieved from the java.lang.System class, whereas Hadoop properties are accessible only from a Configuration object. So, the following command will print nothing, since the System class is not used by ConfigurationPrinter:
    
        % hadoop -Dcolor=yellow ConfigurationPrinter | grep color
    
    If you want to be able to set configuration through system properties, then you need to mirror the system properties of interest in the configuration file. See "Variable Expansion" on page 132 for further discussion.

Table 5-1. GenericOptionsParser and ToolRunner options

    -D property=value
        Sets the given Hadoop configuration property to the given value. Overrides any default or site properties in the configuration, and any properties set via the -conf option.
    -conf filename ...
        Adds the given files to the list of resources in the configuration.
        This is a convenient way to set site properties or to set a number of properties at once.
    -fs uri
        Sets the default filesystem to the given URI. Shortcut for -D fs.default.name=uri
    -jt host:port
        Sets the jobtracker to the given host and port. Shortcut for -D mapred.job.tracker=host:port
    -files file1,file2,...
        Copies the specified files from the local filesystem (or any filesystem if a scheme is specified) to the shared filesystem used by the jobtracker (usually HDFS) and makes them available to MapReduce programs in the task's working directory. (See "Distributed Cache" on page 253 for more on the distributed cache mechanism for copying files to tasktracker machines.)
    -archives archive1,archive2,...
        Copies the specified archives from the local filesystem (or any filesystem if a scheme is specified) to the shared filesystem used by the jobtracker (usually HDFS), unarchives

        them, and makes them available to MapReduce programs in the task's working directory.
    -libjars jar1,jar2,...
        Copies the specified JAR files from the local filesystem (or any filesystem if a scheme is specified) to the shared filesystem used by the jobtracker (usually HDFS), and adds them to the MapReduce task's classpath. This option is a useful way of shipping JAR files that a job is dependent on.

Writing a Unit Test

The map and reduce functions in MapReduce are easy to test in isolation, which is a consequence of their functional style. For known inputs, they produce known outputs. However, since outputs are written to an OutputCollector, rather than simply being returned from the method call, the OutputCollector needs to be replaced with a mock so that its outputs can be verified. There are several Java mock object frameworks that can help build mocks; here we use Mockito, which is noted for its clean syntax, although any mock framework should work just as well.*

All of the tests described here can be run from within an IDE.

Mapper

The test for the mapper is shown in Example 5-4.

Example 5-4. Unit test for MaxTemperatureMapper

    import static org.mockito.Mockito.*;
    
    import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.junit.*;
    
    public class MaxTemperatureMapperTest {
    
      @Test
      public void processesValidRecord() throws IOException {
        MaxTemperatureMapper mapper = new MaxTemperatureMapper();
        
        Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
                                      // Year ^^^^
            "99999V0203201N00261220001CN9999999N9-00111+99999999999");
                                  // Temperature ^^^^^
        OutputCollector<Text, IntWritable> output = mock(OutputCollector.class);
        
        mapper.map(null, value, output, null);

* See also the MRUnit contrib module, which aims to make unit testing MapReduce programs easier.

        verify(output).collect(new Text("1950"), new IntWritable(-11));
      }
    }

The test is very simple: it passes a weather record as input to the mapper, then checks the output is the year and temperature reading. The input key and Reporter are both ignored by the mapper, so we can pass in anything, including null as we do here. To create a mock OutputCollector, we call Mockito's mock() method (a static import), passing the class of the type we want to mock. Then we invoke the mapper's map() method, which executes the code being tested. Finally, we verify that the mock object was called with the correct method and arguments, using Mockito's verify() method (again, statically imported). Here we verify that OutputCollector's collect() method was called with a Text object representing the year (1950) and an IntWritable representing the temperature (−1.1°C).

Proceeding in a test-driven fashion, we create a Mapper implementation that passes the test (see Example 5-5). Since we will be evolving the classes in this chapter, each is put in a different package indicating its version for ease of exposition. For example, v1.MaxTemperatureMapper is version 1 of MaxTemperatureMapper. In reality, of course, you would evolve classes without repackaging them.

Example 5-5. First version of a Mapper that passes MaxTemperatureMapperTest

    public class MaxTemperatureMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    
      public void map(LongWritable key, Text value,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        
        String line = value.toString();
        String year = line.substring(15, 19);
        int airTemperature = Integer.parseInt(line.substring(87, 92));
        output.collect(new Text(year), new IntWritable(airTemperature));
      }
    }

This is a very simple implementation, which pulls the year and temperature fields from the line and emits them in the OutputCollector.
Let's add a test for missing values, which in the raw data are represented by a temperature of +9999:

      @Test
      public void ignoresMissingTemperatureRecord() throws IOException {
        MaxTemperatureMapper mapper = new MaxTemperatureMapper();
        
        Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
                                      // Year ^^^^
            "99999V0203201N00261220001CN9999999N9+99991+99999999999");
                                  // Temperature ^^^^^
        OutputCollector<Text, IntWritable> output = mock(OutputCollector.class);
        
        mapper.map(null, value, output, null);

        conf.setMapperClass(MaxTemperatureMapper.class);
        conf.setCombinerClass(MaxTemperatureReducer.class);
        conf.setReducerClass(MaxTemperatureReducer.class);
    
        JobClient.runJob(conf);
        return 0;
      }
      
      public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new MaxTemperatureDriver(), args);
        System.exit(exitCode);
      }
    }

MaxTemperatureDriver implements the Tool interface, so we get the benefit of being able to set the options that GenericOptionsParser supports. The run() method constructs and configures a JobConf object, before launching a job described by the JobConf. Among the possible job configuration parameters, we set the input and output file paths, the mapper, reducer and combiner classes, and the output types (the input types are determined by the input format, which defaults to TextInputFormat and has LongWritable keys and Text values). It's also a good idea to set a name for the job so that you can pick it out in the job list during execution and after it has completed. By default, the name is the name of the JAR file, which is normally not particularly descriptive.

Now we can run this application against some local files. Hadoop comes with a local job runner, a cut-down version of the MapReduce execution engine for running MapReduce jobs in a single JVM. It's designed for testing and is very convenient for use in an IDE, since you can run it in a debugger to step through the code in your mapper and reducer.

    The local job runner is only designed for simple testing of MapReduce programs, so inevitably it differs from the full MapReduce implementation. The biggest difference is that it can't run more than one reducer. (It can support the zero reducer case, too.) This is normally not a problem, as most applications can work with one reducer, although on a cluster you would choose a larger number to take advantage of parallelism.
    The thing to watch out for is that even if you set the number of reducers to a value over one, the local runner will silently ignore the setting and use a single reducer.
    
    The local job runner also has no support for the DistributedCache feature (described in "Distributed Cache" on page 253).
    
    Neither of these limitations is inherent in the local job runner, and future versions of Hadoop may relax these restrictions.

The local job runner is enabled by a configuration setting. Normally, mapred.job.tracker is a host:port pair to specify the address of the jobtracker, but when it has the special value of local, the job is run in-process without accessing an external jobtracker.

From the command line, we can run the driver by typing:

    % hadoop v2.MaxTemperatureDriver -conf conf/hadoop-local.xml input/ncdc/micro max-temp

Equivalently, we could use the -fs and -jt options provided by GenericOptionsParser:

    % hadoop v2.MaxTemperatureDriver -fs file:/// -jt local input/ncdc/micro max-temp

This command executes MaxTemperatureDriver using input from the local input/ncdc/micro directory, producing output in the local max-temp directory. Note that although we've set -fs so we use the local filesystem (file:///), the local job runner will actually work fine against any filesystem, including HDFS (and it can be handy to do this if you have a few files that are on HDFS).

When we run the program, it fails and prints the following exception:

    java.lang.NumberFormatException: For input string: "+0000"

Fixing the mapper

This exception shows that the map method still can't parse positive temperatures. (If the stack trace hadn't given us enough information to diagnose the fault, we could run the test in a local debugger, since it runs in a single JVM.) Earlier, we made it handle the special case of missing temperature, +9999, but not the general case of any positive temperature. With more logic going into the mapper, it makes sense to factor out a parser class to encapsulate the parsing logic; see Example 5-8 (now on version 3).

Example 5-8. A class for parsing weather records in NCDC format

    public class NcdcRecordParser {
      
      private static final int MISSING_TEMPERATURE = 9999;
      
      private String year;
      private int airTemperature;
      private String quality;
      
      public void parse(String record) {
        year = record.substring(15, 19);
        String airTemperatureString;
        // Remove leading plus sign as parseInt doesn't like them
        if (record.charAt(87) == '+') {
          airTemperatureString = record.substring(88, 92);
        } else {
          airTemperatureString = record.substring(87, 92);
        }
        airTemperature = Integer.parseInt(airTemperatureString);
        quality = record.substring(92, 93);

162.
} public void parse(Text record) { parse(record.toString()); } public boolean isValidTemperature() { return airTemperature != MISSING_TEMPERATURE && quality.matches("[01459]"); } public String getYear() { return year; } public int getAirTemperature() { return airTemperature; }}The resulting mapper is much simpler (see Example 5-9). It just calls the parser’sparse() method, which parses the fields of interest from a line of input, checks whethera valid temperature was found using the isValidTemperature() query method, and if itwas, retrieves the year and the temperature using the getter methods on the parser.Notice that we also check the quality status field as well as missing temperatures inisValidTemperature() to filter out poor temperature readings.Another benefit of creating a parser class is that it makes it easy to write related mappersfor similar jobs without duplicating code. It also gives us the opportunity to write unittests directly against the parser, for more targeted testing.Example 5-9. A Mapper that uses a utility class to parse recordspublic class MaxTemperatureMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private NcdcRecordParser parser = new NcdcRecordParser(); public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { parser.parse(value); if (parser.isValidTemperature()) { output.collect(new Text(parser.getYear()), new IntWritable(parser.getAirTemperature())); } }}With these changes, the test passes.144 | Chapter 5: Developing a MapReduce Application
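As a quick illustration of that last point, the substring offsets used in Example 5-8 can be exercised without any Hadoop machinery at all. The snippet below is a self-contained sketch (the class name and the record it builds are my own, fabricated just to hit the year, temperature, and quality offsets):

```java
public class NcdcOffsetsDemo {

  // Replicates the temperature field offsets used by NcdcRecordParser.
  public static int parseTemperature(String record) {
    // Strip a leading plus sign, which Integer.parseInt rejects in older JDKs
    String s = record.charAt(87) == '+'
        ? record.substring(88, 92)
        : record.substring(87, 92);
    return Integer.parseInt(s);
  }

  // Builds a synthetic 93-character record with the given year, temperature,
  // and quality fields in the positions the parser expects.
  public static String syntheticRecord(String year, String temp, String quality) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 93; i++) {
      sb.append('0');
    }
    sb.replace(15, 19, year);     // chars 15-18: year
    sb.replace(87, 92, temp);     // chars 87-91: signed temperature
    sb.replace(92, 93, quality);  // char 92: quality code
    return sb.toString();
  }

  public static void main(String[] args) {
    String record = syntheticRecord("1950", "+0011", "1");
    System.out.println(record.substring(15, 19));  // prints 1950
    System.out.println(parseTemperature(record));  // prints 11 (tenths of a degree)
  }
}
```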

Testing the Driver

Apart from the flexible configuration options offered by making your application implement Tool, you also make it more testable, because it allows you to inject an arbitrary Configuration. You can take advantage of this to write a test that uses a local job runner to run a job against known input data, and checks that the output is as expected.

There are two approaches to doing this. The first is to use the local job runner and run the job against a test file on the local filesystem. The code in Example 5-10 gives an idea of how to do this.

Example 5-10. A test for MaxTemperatureDriver that uses a local, in-process job runner

```java
@Test
public void test() throws Exception {
  JobConf conf = new JobConf();
  conf.set("fs.default.name", "file:///");
  conf.set("mapred.job.tracker", "local");

  Path input = new Path("input/ncdc/micro");
  Path output = new Path("output");

  FileSystem fs = FileSystem.getLocal(conf);
  fs.delete(output, true); // delete old output

  MaxTemperatureDriver driver = new MaxTemperatureDriver();
  driver.setConf(conf);

  int exitCode = driver.run(new String[] { input.toString(), output.toString() });
  assertThat(exitCode, is(0));

  checkOutput(conf, output);
}
```

The test explicitly sets fs.default.name and mapred.job.tracker, so it uses the local filesystem and the local job runner. It then runs the MaxTemperatureDriver via its Tool interface against a small amount of known data. At the end of the test, the checkOutput() method is called to compare the actual output with the expected output, line by line.

The second way of testing the driver is to run it using a “mini-” cluster. Hadoop has a pair of testing classes, called MiniDFSCluster and MiniMRCluster, which provide a programmatic way of creating in-process clusters. Unlike the local job runner, these allow testing against the full HDFS and MapReduce machinery.
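The checkOutput() helper used in Example 5-10 is not listed in the text. As an illustration of the idea, here is a self-contained sketch that compares an expected file against the part files in an output directory, line by line. The class name and signature are hypothetical: the book's version takes a Configuration and a Path rather than java.io.File:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OutputChecker {

  // Compares expected output, line by line, with the concatenated
  // contents of the part-* files in a job output directory.
  public static void checkOutput(File expected, File outputDir) throws IOException {
    List<String> expectedLines = Files.readAllLines(expected.toPath());

    File[] parts = outputDir.listFiles((dir, name) -> name.startsWith("part-"));
    Arrays.sort(parts); // part-00000, part-00001, ...
    List<String> actualLines = new ArrayList<>();
    for (File part : parts) {
      actualLines.addAll(Files.readAllLines(part.toPath()));
    }

    if (!expectedLines.equals(actualLines)) {
      throw new AssertionError(
          "expected " + expectedLines + " but was " + actualLines);
    }
  }
}
```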
Bear in mind, too, that tasktrackers in a mini-cluster launch separate JVMs to run tasks in, which can make debugging more difficult.

Mini-clusters are used extensively in Hadoop’s own automated test suite, but they can be used for testing user code, too. Hadoop’s ClusterMapReduceTestCase abstract class provides a useful base for writing such a test. It handles the details of starting and stopping the in-process HDFS and MapReduce clusters in its setUp() and tearDown() methods, and generates a suitable JobConf object that is configured to work with them. Subclasses need only populate data in HDFS (perhaps by copying from a local file), run a MapReduce job, and confirm that the output is as expected. Refer to the MaxTemperatureDriverMiniTest class in the example code that comes with this book for the listing.

Tests like this serve as regression tests, and are a useful repository of input edge cases and their expected results. As you encounter more test cases, you can simply add them to the input file and update the file of expected output accordingly.

Running on a Cluster

Now that we are happy with the program running on a small test dataset, we are ready to try it on the full dataset on a Hadoop cluster. Chapter 9 covers how to set up a fully distributed cluster, although you can also work through this section on a pseudo-distributed cluster.

Packaging

We don’t need to make any modifications to the program to run on a cluster rather than on a single machine, but we do need to package the program as a JAR file to send to the cluster. This is conveniently achieved using Ant, with a task such as this (you can find the complete build file in the example code):

<jar destfile="job.jar" basedir="${classes.dir}"/>

If you have a single job per JAR, you can specify the main class to run in the JAR file’s manifest. If the main class is not in the manifest, it must be specified on the command line (as you will see shortly). Also, any dependent JAR files should be packaged in a lib subdirectory in the JAR file.
(This is analogous to a Java Web application archive, or WAR, file, except that in that case the JAR files go in a WEB-INF/lib subdirectory in the WAR file.)

Launching a Job

To launch the job, we need to run the driver, specifying the cluster that we want to run the job on with the -conf option (we could equally have used the -fs and -jt options):

% hadoop jar job.jar v3.MaxTemperatureDriver -conf conf/hadoop-cluster.xml input/ncdc/all max-temp

The runJob() method on JobClient launches the job and polls for progress, writing a line summarizing the map and reduce’s progress whenever either changes. Here’s the output (some lines have been removed for clarity):

09/04/11 08:15:52 INFO mapred.FileInputFormat: Total input paths to process : 101
09/04/11 08:15:53 INFO mapred.JobClient: Running job: job_200904110811_0002
09/04/11 08:15:54 INFO mapred.JobClient:  map 0% reduce 0%

Tasks belong to a job, and their IDs are formed by replacing the job prefix of a job ID with a task prefix, and adding a suffix to identify the task within the job. For example, task_200904110811_0002_m_000003 is the fourth (000003; task IDs are 0-based) map (m) task of the job with ID job_200904110811_0002. The task IDs are created for a job when it is initialized, so they do not necessarily dictate the order in which the tasks will be executed.

Tasks may be executed more than once, due to failure (see “Task Failure” on page 173) or speculative execution (see “Speculative Execution” on page 183), so to identify different instances of a task execution, task attempts are given unique IDs on the jobtracker. For example, attempt_200904110811_0002_m_000003_0 is the first (0; attempt IDs are 0-based) attempt at running task task_200904110811_0002_m_000003. Task attempts are allocated during the job run as needed, so their ordering represents the order in which they were created for tasktrackers to run. The final count in the task attempt ID is incremented by 1,000 if the job is restarted after the jobtracker is restarted and recovers its running jobs.

The MapReduce Web UI

Hadoop comes with a web UI for viewing information about your jobs. It is useful for following a job’s progress while it is running, as well as for finding job statistics and logs after the job has completed. You can find the UI at http://jobtracker-host:50030/.

The jobtracker page

A screenshot of the home page is shown in Figure 5-1. The first section of the page gives details of the Hadoop installation, such as the version number and when it was compiled, the current state of the jobtracker (in this case, running), and when it was started.

Next is a summary of the cluster, with measures of cluster capacity and utilization. This shows the number of maps and reduces currently running on the cluster, the total number of job submissions, the number of tasktracker nodes currently available, and the cluster’s capacity, in terms of the number of map and reduce slots available across the cluster (“Map Task Capacity” and “Reduce Task Capacity”) and the number of available slots per node, on average. The number of tasktrackers that have been blacklisted by the jobtracker is listed as well (blacklisting is discussed in “Tasktracker Failure” on page 175).

Below the summary, there is a section about the job scheduler that is running (here the default). You can click through to see job queues.

Further down, we see sections for running, (successfully) completed, and failed jobs. Each of these sections has a table of jobs, with a row per job that shows the job’s ID, owner, name (as set using JobConf’s setJobName() method, which sets the mapred.job.name property), and progress information.

Finally, at the foot of the page, there are links to the jobtracker’s logs, and to the jobtracker’s history: information on all the jobs that the jobtracker has run. The main display shows only 100 jobs (configurable via the mapred.jobtracker.completeuserjobs.maximum property) before consigning them to the history page. Note also that the job history is persistent, so you can find jobs here from previous runs of the jobtracker.

Figure 5-1. Screenshot of the jobtracker page

Job History

Job history refers to the events and configuration for a completed job. It is retained whether the job was successful or not. Job history is used to support job recovery after a jobtracker restart (see the mapred.jobtracker.restart.recover property), as well as providing interesting information for the user running a job.

Job history files are stored on the local filesystem of the jobtracker in a history subdirectory of the logs directory. It is possible to set the location to an arbitrary Hadoop filesystem via the hadoop.job.history.location property. The jobtracker’s history files are kept for 30 days before being deleted by the system.

A second copy is also stored for the user in the _logs/history subdirectory of the job’s output directory. This location may be overridden by setting hadoop.job.history.user.location. By setting it to the special value none, no user job history is saved, although job history is still saved centrally. A user’s job history files are never deleted by the system.

The history log includes job, task, and attempt events, all of which are stored in a plain-text file. The history for a particular job may be viewed through the web UI, or via the command line, using hadoop job -history (which you point at the job’s output directory).

The job page

Clicking on a job ID brings you to a page for the job, illustrated in Figure 5-2. At the top of the page is a summary of the job, with basic information such as the job owner and name, and how long the job has been running for. The job file is the consolidated configuration file for the job, containing all the properties, and their values, that were in effect during the job run. If you are unsure of what a particular property was set to, you can click through to inspect the file.

While the job is running, you can monitor its progress on this page, which periodically updates itself. Below the summary is a table that shows the map progress and the reduce progress.
“Num Tasks” shows the total number of map and reduce tasks for this job (a row for each). The other columns then show the state of these tasks: “Pending” (waiting to run), “Running,” “Complete” (successfully run), and “Killed” (tasks that have failed; this column would be more accurately labeled “Failed”). The final column shows the total number of failed and killed task attempts for all the map or reduce tasks for the job (task attempts may be marked as killed if they are a speculative execution duplicate, if the tasktracker they are running on dies, or if they are killed by a user). See “Task Failure” on page 173 for background on task failure.

Further down the page, you can find completion graphs for each task that show their progress graphically. The reduce completion graph is divided into the three phases of the reduce task: copy (when the map outputs are being transferred to the reduce’s tasktracker), sort (when the reduce inputs are being merged), and reduce (when the reduce function is being run to produce the final output). The phases are described in more detail in “Shuffle and Sort” on page 177.

In the middle of the page is a table of job counters. These are dynamically updated during the job run, and provide another useful window into the job’s progress and general health. There is more information about what these counters mean in “Built-in Counters” on page 225.

Retrieving the Results

Once the job is finished, there are various ways to retrieve the results. Each reducer produces one output file, so there are 30 part files named part-00000 to part-00029 in the max-temp directory.

As their names suggest, a good way to think of these “part” files is as parts of the max-temp “file.” If the output is large (which it isn’t in this case), it is important to have multiple parts so that more than one reducer can work in parallel. Usually, if a file is in this partitioned form, it can still be used easily enough: as the input to another MapReduce job, for example. In some cases, you can exploit the structure of multiple partitions to do a map-side join (“Map-Side Joins” on page 247) or a MapFile lookup (“An application: Partitioned MapFile lookups” on page 235).

This job produces a very small amount of output, so it is convenient to copy it from HDFS to our development machine. The -getmerge option to the hadoop fs command is useful here, as it gets all the files in the directory specified in the source pattern and merges them into a single file on the local filesystem:

% hadoop fs -getmerge max-temp max-temp-local
% sort max-temp-local | tail
1991 607
1992 605
1993 567
1994 568
1995 567
1996 561
1997 565
1998 568
1999 568
2000 558

We sorted the output, as the reduce output partitions are unordered (owing to the hash partition function). Doing a bit of postprocessing of data from MapReduce is very common, as is feeding it into analysis tools such as R, a spreadsheet, or even a relational database.
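The reason the partitions come out unordered is the partitioning function itself: Hadoop's default HashPartitioner assigns a key to a reducer with logic essentially like the following self-contained sketch (the class and method here are illustrative, not Hadoop's own API):

```java
public class HashPartitionDemo {

  // Essentially what Hadoop's default HashPartitioner does: a key lands in
  // the partition determined by its hash, not by its sort order, which is
  // why globally sorted output needs a post-processing sort (or a custom
  // partitioner).
  public static int getPartition(String key, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

  public static void main(String[] args) {
    for (String year : new String[] { "1949", "1950", "1951" }) {
      System.out.println(year + " -> partition " + getPartition(year, 30));
    }
  }
}
```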

Another way of retrieving the output, if it is small, is to use the -cat option to print the output files to the console:

% hadoop fs -cat max-temp/*

On closer inspection, we see that some of the results don’t look plausible. For instance, the maximum temperature for 1951 (not shown here) is 590°C! How do we find out what’s causing this? Is it corrupt input data or a bug in the program?

Debugging a Job

The time-honored way of debugging programs is via print statements, and this is certainly possible in Hadoop. However, there are complications to consider: with programs running on tens, hundreds, or thousands of nodes, how do we find and examine the output of the debug statements, which may be scattered across these nodes? For this particular case, where we are looking for (what we think is) an unusual case, we can use a debug statement to log to standard error, in conjunction with a message to update the task’s status message to prompt us to look in the error log. The web UI makes this easy, as we will see.

We also create a custom counter to count the total number of records with implausible temperatures in the whole dataset. This gives us valuable information about how to deal with the condition: if it turns out to be a common occurrence, we might need to learn more about the condition and how to extract the temperature in these cases, rather than simply dropping the records. In fact, when trying to debug a job, you should always ask yourself if you can use a counter to get the information you need to find out what’s happening. Even if you need to use logging or a status message, it may be useful to use a counter to gauge the extent of the problem. (There is more on counters in “Counters” on page 225.)

If the amount of log data you produce in the course of debugging is large, you’ve got a couple of options. The first is to write the information to the map’s output, rather than to standard error, for analysis and aggregation by the reduce.
This approach usually necessitates structural changes to your program, so start with the other techniques first. Alternatively, you can write a program (in MapReduce, of course) to analyze the logs produced by your job.

We add our debugging to the mapper (version 4), as opposed to the reducer, as we want to find out what the source data causing the anomalous output looks like:

```java
public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  enum Temperature {
    OVER_100
  }

  private NcdcRecordParser parser = new NcdcRecordParser();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {

    parser.parse(value);
    if (parser.isValidTemperature()) {
      int airTemperature = parser.getAirTemperature();
      if (airTemperature > 1000) {
        System.err.println("Temperature over 100 degrees for input: " + value);
        reporter.setStatus("Detected possibly corrupt record: see logs.");
        reporter.incrCounter(Temperature.OVER_100, 1);
      }
      output.collect(new Text(parser.getYear()), new IntWritable(airTemperature));
    }
  }
}
```

If the temperature is over 100°C (represented by 1000, since temperatures are in tenths of a degree), we print a line to standard error with the suspect line, and update the map’s status message using the setStatus() method on Reporter, directing us to look in the log. We also increment a counter, which in Java is represented by a field of an enum type. In this program, we have defined a single field, OVER_100, as a way to count the number of records with a temperature of over 100°C.

With this modification, we recompile the code, re-create the JAR file, rerun the job, and, while it’s running, go to the tasks page.

The tasks page

The job page has a number of links for looking at the tasks in a job in more detail. For example, clicking on the “map” link brings you to a page that lists information for all of the map tasks on one page. You can also see just the completed tasks. The screenshot in Figure 5-3 shows a portion of this page for the job run with our debugging statements. Each row in the table is a task, and it provides such information as the start and end times for each task, any errors reported back from the tasktracker, and a link to view the counters for an individual task.

The “Status” column can be helpful for debugging, since it shows a task’s latest status message.
Before a task starts, it shows its status as “initializing,” and then once it starts reading records, it shows the split information for the split it is reading as a filename with a byte offset and length. You can see the status we set for debugging for task task_200904110811_0003_m_000044, so let’s click through to the logs page to find the associated debug message. (Notice, too, that there is an extra counter for this task, since our user counter has a nonzero count for this task.)

The task details page

From the tasks page, you can click on any task to get more information about it. The task details page, shown in Figure 5-4, shows each task attempt. In this case, there was one task attempt, which completed successfully. The table provides further useful data, such as the node the task attempt ran on, and links to task logfiles and counters.

The “Actions” column contains links for killing a task attempt. By default, this is disabled, making the web UI a read-only interface. Set webinterface.private.actions to true to enable the actions links.

Figure 5-3. Screenshot of the tasks page

Figure 5-4. Screenshot of the task details page

By setting webinterface.private.actions to true, you also allow anyone with access to the HDFS web interface to delete files. The dfs.web.ugi property determines the user that the HDFS web UI runs as, thus controlling which files may be viewed and deleted.

For map tasks, there is also a section showing which nodes the input split was located on.

By following one of the links to the logfiles for the successful task attempt (you can see the last 4 KB or 8 KB of each logfile, or the entire file), we can find the suspect input record that we logged (the line is wrapped and truncated to fit on the page):

Temperature over 100 degrees for input:
0335999999433181957042302005+37950+139117SAO +0004RJSN V020113590031500703569999994
33201957010100005+35317+139650SAO +000899999V02002359002650076249N004000599+0067...

This record seems to be in a different format from the others. For one thing, there are spaces in the line, which are not described in the specification.

When the job has finished, we can look at the value of the counter we defined to see how many records over 100°C there are in the whole dataset. Counters are accessible via the web UI or the command line:

% hadoop job -counter job_200904110811_0003 v4.MaxTemperatureMapper$Temperature OVER_100
3

The -counter option takes the job ID, counter group name (which is the fully qualified classname here), and counter name (the enum name). There are only three malformed records in the entire dataset of over a billion records. Throwing out bad records is standard for many big data problems, although we need to be careful in this case, since we are looking for an extreme value—the maximum temperature rather than an aggregate measure. Still, throwing away three records is probably not going to change the result.

Hadoop User Logs

Hadoop produces logs in various places, for various audiences. These are summarized in Table 5-2.

As you have seen in this section, MapReduce task logs are accessible through the web UI, which is the most convenient way to view them. You can also find the logfiles on the local filesystem of the tasktracker that ran the task attempt, in a directory named by the task attempt.
If task JVM reuse is enabled (“Task JVM Reuse” on page 184), each logfile accumulates the logs for the entire JVM run, so multiple task attempts will be found in each logfile. The web UI hides this by showing only the portion that is relevant for the task attempt being viewed.

It is straightforward to write to these logfiles. Anything written to standard output or standard error is directed to the relevant logfile. (Of course, in Streaming, standard output is used for the map or reduce output, so it will not show up in the standard output log.)

In Java, you can write to the task’s syslog file if you wish by using the Apache Commons Logging API. The actual logging is done by log4j in this case: the relevant log4j appender is called TLA (Task Log Appender) in the log4j.properties file in Hadoop’s configuration directory.

There are some controls for managing the retention and size of task logs. By default, logs are deleted after a minimum of 24 hours (set using the mapred.userlog.retain.hours property). You can also set a cap on the maximum size of each logfile using the mapred.userlog.limit.kb property, which is 0 by default, meaning there is no cap.

Table 5-2. Hadoop logs

- System daemon logs (primary audience: administrators). Each Hadoop daemon produces a logfile (using log4j) and another file that combines standard out and error. Written in the directory defined by the HADOOP_LOG_DIR environment variable. Further information: “System logfiles” on page 271.
- HDFS audit logs (administrators). A log of all HDFS requests, turned off by default. Written to the namenode’s log, although this is configurable. Further information: “Audit Logging” on page 300.
- MapReduce job history logs (users). A log of the events (such as task completion) that occur in the course of running a job. Saved centrally on the jobtracker, and in the job’s output directory in a _logs/history subdirectory. Further information: “Job History” on page 150.
- MapReduce task logs (users). Each tasktracker child process produces a logfile using log4j (called syslog), a file for data sent to standard out (stdout), and a file for standard error (stderr). Written in the userlogs subdirectory of the directory defined by the HADOOP_LOG_DIR environment variable. Further information: see the next section.

Handling malformed data

Capturing input data that causes a problem is valuable, as we can use it in a test to check that the mapper does the right thing:

```java
@Test
public void parsesMalformedTemperature() throws IOException {
  MaxTemperatureMapper mapper = new MaxTemperatureMapper();
  Text value = new Text("0335999999433181957042302005+37950+139117SAO +0004" +
                                            // Year ^^^^
      "RJSN V02011359003150070356999999433201957010100005+353");
                                            // Temperature ^^^^^
  OutputCollector<Text, IntWritable> output = mock(OutputCollector.class);
  Reporter reporter = mock(Reporter.class);
  mapper.map(null, value, output, reporter);

  verify(output, never()).collect(any(Text.class), any(IntWritable.class));
  verify(reporter).incrCounter(MaxTemperatureMapper.Temperature.MALFORMED, 1);
}
```
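Example 5-11, which follows, relies on a new query method on the parser, isMalformedTemperature(), whose listing the book leaves to its example code. A plausible self-contained sketch of the check is below; the class name and exact logic are hypothetical, and the real parser may differ:

```java
public class MalformedCheckDemo {

  // Hypothetical sketch of NcdcRecordParser.isMalformedTemperature():
  // a record is treated as malformed when the temperature field at
  // offset 87 does not start with the expected leading sign.
  public static boolean isMalformedTemperature(String record) {
    if (record.length() < 93) {
      return true; // too short to contain temperature and quality fields
    }
    char sign = record.charAt(87);
    return sign != '+' && sign != '-';
  }

  public static void main(String[] args) {
    // Build two synthetic 93-character records to show the difference:
    // a well-formed one has '+' or '-' at offset 87.
    StringBuilder good = new StringBuilder();
    for (int i = 0; i < 93; i++) {
      good.append('0');
    }
    StringBuilder bad = new StringBuilder(good); // copy before fixing up 'good'
    good.setCharAt(87, '+');

    System.out.println(isMalformedTemperature(good.toString())); // prints false
    System.out.println(isMalformedTemperature(bad.toString()));  // prints true
  }
}
```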

The record that was causing the problem is of a different format from the other lines we’ve seen. Example 5-11 shows a modified program (version 5) using a parser that ignores each line with a temperature field that does not have a leading sign (plus or minus). We’ve also introduced a counter to measure the number of records that we are ignoring for this reason.

Example 5-11. Mapper for the maximum temperature example

```java
public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  enum Temperature {
    MALFORMED
  }

  private NcdcRecordParser parser = new NcdcRecordParser();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {

    parser.parse(value);
    if (parser.isValidTemperature()) {
      int airTemperature = parser.getAirTemperature();
      output.collect(new Text(parser.getYear()), new IntWritable(airTemperature));
    } else if (parser.isMalformedTemperature()) {
      System.err.println("Ignoring possibly corrupt input: " + value);
      reporter.incrCounter(Temperature.MALFORMED, 1);
    }
  }
}
```

Using a Remote Debugger

When a task fails and there is not enough information logged to diagnose the error, you may want to resort to running a debugger for that task. This is hard to arrange when running the job on a cluster, as you don’t know which node is going to process which part of the input, so you can’t set up your debugger ahead of the failure. Instead, you run the job with a property set that instructs Hadoop to keep all the intermediate data generated during the job run. This data can then be used to rerun the failing task in isolation with a debugger attached. Note that the task is run in situ, on the same node that it failed on, which increases the chances of the error being reproducible.†

First, set the configuration property keep.failed.task.files to true, so that when tasks fail, the tasktracker keeps enough information to allow the task to be rerun over the same input data.
Then run the job again, and note which node the task fails on, and the task attempt ID (it begins with the string attempt_), using the web UI.

† This feature is currently broken in Hadoop 0.20.2 but will be fixed in 0.21.0.

Next, we need to run a special task runner called IsolationRunner with the retained files as input. Log into the node that the task failed on and look for the directory for that task attempt. It will be under one of the local MapReduce directories, as set by the mapred.local.dir property (covered in more detail in “Important Hadoop Daemon Properties” on page 273). If this property is a comma-separated list of directories (to spread load across the physical disks on a machine), you may need to look in all of the directories before you find the directory for that particular task attempt. The task attempt directory is in the following location:

mapred.local.dir/taskTracker/jobcache/job-ID/task-attempt-ID

This directory contains various files and directories, including job.xml, which contains all of the job configuration properties in effect during the task attempt, and which IsolationRunner uses to create a JobConf instance. For map tasks, this directory also contains a file containing a serialized representation of the input split, so the same input data can be fetched for the task. For reduce tasks, a copy of the map output, which forms the reduce input, is stored in a directory named output.

There is also a directory called work, which is the working directory for the task attempt. We change into this directory to run the IsolationRunner. We need to set some options to allow the remote debugger to connect:‡

% export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000"

The suspend=y option tells the JVM to wait until the debugger has attached before running code. The IsolationRunner is launched with the following command:

% hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml

Next, set breakpoints, attach your remote debugger (all the major Java IDEs support remote debugging; consult the documentation for instructions), and the task will be run under your control. You can rerun the task any number of times like this.
With any luck, you’ll be able to find and fix the error.

During the process, you can use other standard Java debugging techniques, such as kill -QUIT pid or jstack, to get thread dumps.

More generally, it’s worth knowing that this technique isn’t only useful for failing tasks. You can keep the intermediate files for successful tasks, too, which may be handy if you want to examine a task that isn’t failing. In this case, set the property keep.task.files.pattern to a regular expression that matches the IDs of the tasks you want to keep.

‡ You can find details about debugging options on the Java Platform Debugger Architecture web page.
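The two retention properties just described can also be set programmatically on the job configuration. A sketch, using the old-API JobConf as elsewhere in this chapter (the regular expression below is an illustrative placeholder, not from the book):

```java
import org.apache.hadoop.mapred.JobConf;

// Sketch: retain task files for later rerunning under IsolationRunner.
JobConf conf = new JobConf();
conf.setBoolean("keep.failed.task.files", true);       // keep files for failed tasks
conf.set("keep.task.files.pattern", ".*_m_000044.*");  // also keep matching successful tasks
```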

Tuning a Job

After a job is working, the question many developers ask is, “Can I make it run faster?” There are a few Hadoop-specific “usual suspects” that are worth checking to see if they are responsible for a performance problem. You should run through the checklist in Table 5-3 before you start trying to profile or optimize at the task level.

Table 5-3. Tuning checklist

- Number of mappers: How long are your mappers running for? If they are only running for a few seconds on average, see if there’s a way to have fewer mappers and make them all run longer: a minute or so, as a rule of thumb. The extent to which this is possible depends on the input format you are using. (See “Small files and CombineFileInputFormat” on page 203.)
- Number of reducers: For maximum performance, the number of reducers should be slightly less than the number of reduce slots in the cluster. This allows the reducers to finish in one wave and fully utilizes the cluster during the reduce phase. (See “Choosing the Number of Reducers” on page 195.)
- Combiners: Can your job take advantage of a combiner to reduce the amount of data passing through the shuffle? (See “Combiner Functions” on page 30.)
- Intermediate compression: Job execution time can almost always benefit from enabling map output compression. (See “Compressing map output” on page 85.)
- Custom serialization: If you are using your own custom Writable objects or custom comparators, make sure you have implemented RawComparator. (See “Implementing a RawComparator for speed” on page 99.)
- Shuffle tweaks: The MapReduce shuffle exposes around a dozen tuning parameters for memory management, which may help you eke out the last bit of performance. (See “Configuration Tuning” on page 180.)

Profiling Tasks

Like debugging, profiling a job running on a distributed system like MapReduce presents some challenges.
Hadoop allows you to profile a fraction of the tasks in a job, and, as each task completes, pulls down the profile information to your machine for later analysis with standard profiling tools.

Of course, it's possible, and somewhat easier, to profile a job running in the local job runner. And provided you can run with enough input data to exercise the map and reduce tasks, this can be a valuable way of improving the performance of your mappers and reducers. There are a couple of caveats, however. The local job runner is a very different environment from a cluster, and the data flow patterns are very different. Optimizing the CPU performance of your code may be pointless if your MapReduce job is I/O-bound (as many jobs are). To be sure that any tuning is effective, you should compare the new execution time with the old running on a real cluster. Even this is easier said than done, since job execution times can vary due to resource contention with other jobs and the decisions the scheduler makes regarding task placement. To get a good idea of job execution time under these circumstances, perform a series of

runs (with and without the change) and check whether any improvement is statistically significant.

It's unfortunately true that some problems (such as excessive memory use) can be reproduced only on the cluster, and in these cases the ability to profile in situ is indispensable.

The HPROF profiler

There are a number of configuration properties to control profiling, which are also exposed via convenience methods on JobConf. The following modification to MaxTemperatureDriver (version 6) will enable remote HPROF profiling. HPROF is a profiling tool that comes with the JDK that, although basic, can give valuable information about a program's CPU and heap usage:§

    conf.setProfileEnabled(true);
    conf.setProfileParams("-agentlib:hprof=cpu=samples,heap=sites,depth=6," +
                          "force=n,thread=y,verbose=n,file=%s");
    conf.setProfileTaskRange(true, "0-2");

The first line enables profiling, which by default is turned off. (This is equivalent to setting the configuration property mapred.task.profile to true.)

Next, we set the profile parameters, which are the extra command-line arguments to pass to the task's JVM. (When profiling is enabled, a new JVM is allocated for each task, even if JVM reuse is turned on; see "Task JVM Reuse" on page 184.) The default parameters specify the HPROF profiler; here we set an extra HPROF option, depth=6, to give more stack trace depth than the HPROF default. The setProfileParams() method on JobConf is equivalent to setting the mapred.task.profile.params property.

Finally, we specify which tasks we want to profile. We normally only want profile information from a few tasks, so we use the setProfileTaskRange() method to specify the range of task IDs that we want profile information for. We've set it to 0-2 (which is actually the default), which means tasks with IDs 0, 1, and 2 are profiled. The first argument to the setProfileTaskRange() method dictates whether the range is for map or reduce tasks: true is for maps, false is for reduces.
A set of ranges is permitted, using a notation that allows open ranges. For example, 0-1,4,6- would specify all tasks except those with IDs 2, 3, and 5. The tasks to profile can also be controlled using the mapred.task.profile.maps property for map tasks, and mapred.task.profile.reduces for reduce tasks.

When we run a job with the modified driver, the profile output turns up at the end of the job in the directory we launched the job from. Since we are only profiling a few tasks, we can run the job on a subset of the dataset.

§ HPROF uses byte code insertion to profile your code, so you do not need to recompile your application with special options to use it. For more information on HPROF, see "HPROF: A Heap/CPU Profiling Tool in J2SE 5.0," by Kelly O'Hair at http://java.sun.com/developer/technicalArticles/Programming/HPROF.html.
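To make the range notation concrete, here is a small, self-contained sketch of a matcher for specifications like 0-1,4,6-. This is an illustration only, not Hadoop's actual parser:

```java
// RangeMatcher: a minimal sketch of the task-range notation described above.
// A range is "n" (a single ID), "n-m" (inclusive), or "n-" (open-ended).
public class RangeMatcher {
    // Returns true if taskId falls inside any of the comma-separated ranges.
    public static boolean matches(String ranges, int taskId) {
        for (String part : ranges.split(",")) {
            part = part.trim();
            int dash = part.indexOf('-');
            if (dash < 0) {                      // single ID, e.g. "4"
                if (taskId == Integer.parseInt(part)) return true;
            } else {
                int lo = Integer.parseInt(part.substring(0, dash));
                String hi = part.substring(dash + 1);
                if (hi.isEmpty()) {              // open range, e.g. "6-"
                    if (taskId >= lo) return true;
                } else {                         // closed range, e.g. "0-1"
                    if (taskId >= lo && taskId <= Integer.parseInt(hi)) return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String spec = "0-1,4,6-";
        StringBuilder sb = new StringBuilder();
        for (int id = 0; id <= 7; id++) {
            if (matches(spec, id)) sb.append(id).append(' ');
        }
        // Tasks 2, 3, and 5 are excluded by this specification.
        System.out.println(sb.toString().trim());
    }
}
```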

However, we know if this is significant only if we can measure an improvement when running the job over the whole dataset. Running each variant five times on an otherwise quiet 11-node cluster showed no statistically significant difference in job execution time. Of course, this result holds only for this particular combination of code, data, and hardware, so you should perform similar benchmarks to see whether such a change is significant for your setup.

Other profilers

At the time of this writing, the mechanism for retrieving profile output is HPROF-specific. Until this is fixed, it should be possible to use Hadoop's profiling settings to trigger profiling using any profiler (see the documentation for the particular profiler), although it may be necessary to manually retrieve the profiler's output from tasktrackers for analysis.

If the profiler is not installed on all the tasktracker machines, consider using the Distributed Cache ("Distributed Cache" on page 253) for making the profiler binary available on the required machines.

MapReduce Workflows

So far in this chapter, you have seen the mechanics of writing a program using MapReduce. We haven't yet considered how to turn a data processing problem into the MapReduce model.

The data processing you have seen so far in this book solves a fairly simple problem (finding the maximum recorded temperature for given years). When the processing gets more complex, this complexity is generally manifested by having more MapReduce jobs, rather than having more complex map and reduce functions. In other words, as a rule of thumb, think about adding more jobs, rather than adding complexity to jobs.

For more complex problems, it is worth considering a higher-level language than MapReduce, such as Pig, Hive, or Cascading.
One immediate benefit is that it frees you up from having to do the translation into MapReduce jobs, allowing you to concentrate on the analysis you are performing.

Finally, the book Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer (Morgan & Claypool Publishers, 2010, http://mapreduce.me/) is a great resource for learning more about MapReduce algorithm design, and is highly recommended.

Decomposing a Problem into MapReduce Jobs

Let's look at an example of a more complex problem that we want to translate into a MapReduce workflow.

Imagine that we want to find the mean maximum recorded temperature for every day of the year and every weather station. In concrete terms, to calculate the mean maximum daily temperature recorded by station 029070-99999, say, on January 1, we take the mean of the maximum daily temperatures for this station for January 1, 1901; January 1, 1902; and so on, up to January 1, 2000.

How can we compute this using MapReduce? The computation decomposes most naturally into two stages:

1. Compute the maximum daily temperature for every station-date pair.
   The MapReduce program in this case is a variant of the maximum temperature program, except that the keys in this case are a composite station-date pair, rather than just the year.

2. Compute the mean of the maximum daily temperatures for every station-day-month key.
   The mapper takes the output from the previous job (station-date, maximum temperature) records and projects it into (station-day-month, maximum temperature) records by dropping the year component. The reduce function then takes the mean of the maximum temperatures for each station-day-month key.

The output from the first stage looks like this for the station we are interested in (the mean_max_daily_temp.sh script in the examples provides an implementation in Hadoop Streaming):

    029070-99999   19010101   0
    029070-99999   19020101   -94
    ...

The first two fields form the key, and the final column is the maximum temperature from all the readings for the given station and date. The second stage averages these daily maxima over years to yield:

    029070-99999   0101   -68

which is interpreted as saying that the mean maximum daily temperature on January 1 for station 029070-99999 over the century is −6.8°C.

It's possible to do this computation in one MapReduce stage, but it takes more work on the part of the programmer.‖

The arguments for having more (but simpler) MapReduce stages are that doing so leads to more composable and more maintainable mappers and reducers.
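The two stages above can be sketched outside Hadoop as plain Java over an in-memory list of (station, date, temperature) readings. This illustrates only the key projections and aggregations, not the real MapReduce jobs:

```java
import java.util.*;

// A plain-Java sketch of the two-stage computation described above, run over
// a handful of in-memory readings rather than in MapReduce. Temperatures are
// in tenths of a degree Celsius, as in the NCDC data.
public class MeanMaxTemp {
    // Stage 1: maximum temperature for every station-date key.
    static Map<String, Integer> maxPerStationDate(List<String[]> readings) {
        Map<String, Integer> max = new TreeMap<>();
        for (String[] r : readings) {                 // r = {station, yyyymmdd, temp}
            max.merge(r[0] + "\t" + r[1], Integer.parseInt(r[2]), Math::max);
        }
        return max;
    }

    // Stage 2: drop the year from the key and average the daily maxima.
    static Map<String, Double> meanPerStationDayMonth(Map<String, Integer> stage1) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : stage1.entrySet()) {
            String[] parts = e.getKey().split("\t");
            String dayMonth = parts[1].substring(4);  // yyyymmdd -> mmdd
            grouped.computeIfAbsent(parts[0] + "\t" + dayMonth,
                    k -> new ArrayList<>()).add(e.getValue());
        }
        Map<String, Double> mean = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            double sum = 0;
            for (int t : e.getValue()) sum += t;
            mean.put(e.getKey(), sum / e.getValue().size());
        }
        return mean;
    }

    public static void main(String[] args) {
        List<String[]> readings = Arrays.asList(
            new String[] {"029070-99999", "19010101", "0"},
            new String[] {"029070-99999", "19020101", "-94"});
        // The mean of the two January 1 maxima for this station:
        System.out.println(meanPerStationDayMonth(maxPerStationDate(readings)));
    }
}
```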
The case studies in Chapter 16 cover a wide range of real-world problems that were solved using MapReduce, and in each case, the data processing task is implemented using two or more MapReduce jobs. The details in that chapter are invaluable for getting a better idea of how to decompose a processing problem into a MapReduce workflow.

‖ It's an interesting exercise to do this. Hint: use "Secondary Sort" on page 241.

It's possible to make map and reduce functions even more composable than we have done. A mapper commonly performs input format parsing, projection (selecting the relevant fields), and filtering (removing records that are not of interest). In the mappers you have seen so far, we have implemented all of these functions in a single mapper. However, there is a case for splitting these into distinct mappers and chaining them into a single mapper using the ChainMapper library class that comes with Hadoop. Combined with a ChainReducer, you can run a chain of mappers, followed by a reducer and another chain of mappers, in a single MapReduce job.

Running Dependent Jobs

When there is more than one job in a MapReduce workflow, the question arises: how do you manage the jobs so they are executed in order? There are several approaches, and the main consideration is whether you have a linear chain of jobs or a more complex directed acyclic graph (DAG) of jobs.

For a linear chain, the simplest approach is to run each job one after another, waiting until a job completes successfully before running the next:

    JobClient.runJob(conf1);
    JobClient.runJob(conf2);

If a job fails, the runJob() method will throw an IOException, so later jobs in the pipeline don't get executed. Depending on your application, you might want to catch the exception and clean up any intermediate data that was produced by any previous jobs.

For anything more complex than a linear chain, there are libraries that can help orchestrate your workflow (although they are suited to linear chains, or even one-off jobs, too). The simplest is in the org.apache.hadoop.mapred.jobcontrol package: the JobControl class. An instance of JobControl represents a graph of jobs to be run. You add the job configurations, then tell the JobControl instance the dependencies between jobs. You run the JobControl in a thread, and it runs the jobs in dependency order.
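The dependency-ordered execution that JobControl provides can be illustrated with a small, self-contained sketch. This is plain Java, not the Hadoop API, and the names are invented for the example: a job runs only once all of the jobs it depends on have finished.

```java
import java.util.*;

// A minimal, self-contained illustration of dependency-ordered job execution,
// in the spirit of JobControl (this is NOT the Hadoop API). Each "job" just
// records its name when it runs.
public class DependencyRunner {
    // deps maps each job name to the list of jobs it depends on.
    // Returns the order in which the jobs were "run".
    public static List<String> run(Map<String, List<String>> deps) {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        while (done.size() < deps.size()) {
            boolean progressed = false;
            for (Map.Entry<String, List<String>> e : deps.entrySet()) {
                if (!done.contains(e.getKey()) && done.containsAll(e.getValue())) {
                    order.add(e.getKey());   // all dependencies satisfied: run it
                    done.add(e.getKey());
                    progressed = true;
                }
            }
            if (!progressed) throw new IllegalStateException("cycle in job graph");
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("max-temp-per-station-date", Collections.emptyList());
        deps.put("mean-max-daily-temp",
                 Collections.singletonList("max-temp-per-station-date"));
        System.out.println(run(deps));
    }
}
```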
You can poll for progress, and when the jobs have finished, you can query for all the jobs' statuses and the associated errors for any failures. If a job fails, JobControl won't run the jobs that depend on it.

Oozie

Unlike JobControl, which runs on the client machine submitting the jobs, Oozie (http://yahoo.github.com/oozie/) runs as a server, and a client submits a workflow to the server. In Oozie, a workflow is a DAG of action nodes and control-flow nodes. An action node performs a workflow task, such as moving files in HDFS, running a MapReduce job, or running a Pig job. A control-flow node governs the workflow execution between actions by allowing such constructs as conditional logic (so different execution branches may be followed depending on the result of an earlier action node) or parallel execution. When the workflow completes, Oozie can make an HTTP callback to the

client to inform it of the workflow status. It is also possible to receive callbacks every time the workflow enters or exits an action node.

Oozie allows failed workflows to be rerun from an arbitrary point. This is useful for dealing with transient errors when the early actions in the workflow are time-consuming to execute.

CHAPTER 6
How MapReduce Works

In this chapter, we look at how MapReduce in Hadoop works in detail. This knowledge provides a good foundation for writing more advanced MapReduce programs, which we will cover in the following two chapters.

Anatomy of a MapReduce Job Run

You can run a MapReduce job with a single line of code: JobClient.runJob(conf). It's very short, but it conceals a great deal of processing behind the scenes. This section uncovers the steps Hadoop takes to run a job.

The whole process is illustrated in Figure 6-1. At the highest level, there are four independent entities:

• The client, which submits the MapReduce job.
• The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
• The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.
• The distributed filesystem (normally HDFS, covered in Chapter 3), which is used for sharing job files between the other entities.

Job Submission

The runJob() method on JobClient is a convenience method that creates a new JobClient instance and calls submitJob() on it (step 1 in Figure 6-1). Having submitted the job, runJob() polls the job's progress once a second and reports the progress to the console if it has changed since the last report. When the job is complete, if it was successful, the job counters are displayed. Otherwise, the error that caused the job to fail is logged to the console.

Figure 6-1. How Hadoop runs a MapReduce job

The job submission process implemented by JobClient's submitJob() method does the following:

• Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2).
• Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
• Computes the input splits for the job. If the splits cannot be computed (because the input paths don't exist, for example), then the job is not submitted and an error is thrown to the MapReduce program.
• Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker's filesystem in a directory named after the job ID. The job JAR is copied with a high replication factor (controlled by the mapred.submit.replication property, which defaults to 10) so that there are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job (step 3).

• Tells the jobtracker that the job is ready for execution (by calling submitJob() on JobTracker) (step 4).

Job Initialization

When the JobTracker receives a call to its submitJob() method, it puts it into an internal queue from where the job scheduler will pick it up and initialize it. Initialization involves creating an object to represent the job being run, which encapsulates its tasks, and bookkeeping information to keep track of the tasks' status and progress (step 5).

To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the JobClient from the shared filesystem (step 6). It then creates one map task for each split. The number of reduce tasks to create is determined by the mapred.reduce.tasks property in the JobConf, which is set by the setNumReduceTasks() method, and the scheduler simply creates this number of reduce tasks to be run. Tasks are given IDs at this point.

Task Assignment

Tasktrackers run a simple loop that periodically sends heartbeat method calls to the jobtracker. Heartbeats tell the jobtracker that a tasktracker is alive, but they also double as a channel for messages. As a part of the heartbeat, a tasktracker will indicate whether it is ready to run a new task, and if it is, the jobtracker will allocate it a task, which it communicates to the tasktracker using the heartbeat return value (step 7).

Before it can choose a task for the tasktracker, the jobtracker must choose a job to select the task from. There are various scheduling algorithms, as explained later in this chapter (see "Job Scheduling" on page 175), but the default one simply maintains a priority list of jobs. Having chosen a job, the jobtracker now chooses a task for the job.

Tasktrackers have a fixed number of slots for map tasks and for reduce tasks: for example, a tasktracker may be able to run two map tasks and two reduce tasks simultaneously.
(The precise number depends on the number of cores and the amount of memory on the tasktracker; see "Memory" on page 269.) The default scheduler fills empty map task slots before reduce task slots, so if the tasktracker has at least one empty map task slot, the jobtracker will select a map task; otherwise, it will select a reduce task.

To choose a reduce task, the jobtracker simply takes the next in its list of yet-to-be-run reduce tasks, since there are no data locality considerations. For a map task, however, it takes account of the tasktracker's network location and picks a task whose input split is as close as possible to the tasktracker. In the optimal case, the task is data-local, that is, running on the same node that the split resides on. Alternatively, the task may be rack-local: on the same rack, but not the same node, as the split. Some tasks are neither data-local nor rack-local and retrieve their data from a different rack from the one they

are running on. You can tell the proportion of each type of task by looking at a job's counters (see "Built-in Counters" on page 225).

Task Execution

Now that the tasktracker has been assigned a task, the next step is for it to run the task. First, it localizes the job JAR by copying it from the shared filesystem to the tasktracker's filesystem. It also copies any files needed from the distributed cache by the application to the local disk; see "Distributed Cache" on page 253 (step 8). Second, it creates a local working directory for the task, and un-jars the contents of the JAR into this directory. Third, it creates an instance of TaskRunner to run the task.

TaskRunner launches a new Java Virtual Machine (step 9) to run each task in (step 10), so that any bugs in the user-defined map and reduce functions don't affect the tasktracker (by causing it to crash or hang, for example). It is, however, possible to reuse the JVM between tasks; see "Task JVM Reuse" on page 184.

The child process communicates with its parent through the umbilical interface. This way it informs the parent of the task's progress every few seconds until the task is complete.

Streaming and Pipes

Both Streaming and Pipes run special map and reduce tasks for the purpose of launching the user-supplied executable and communicating with it (Figure 6-2).

In the case of Streaming, the Streaming task communicates with the process (which may be written in any language) using standard input and output streams. The Pipes task, on the other hand, listens on a socket and passes the C++ process a port number in its environment, so that on startup, the C++ process can establish a persistent socket connection back to the parent Java Pipes task.

In both cases, during execution of the task, the Java process passes input key-value pairs to the external process, which runs them through the user-defined map or reduce function and passes the output key-value pairs back to the Java process.
From the tasktracker's point of view, it is as if the tasktracker child process ran the map or reduce code itself.

Progress and Status Updates

MapReduce jobs are long-running batch jobs, taking anything from minutes to hours to run. Because this is a significant length of time, it's important for the user to get feedback on how the job is progressing. A job and each of its tasks have a status, which includes such things as the state of the job or task (e.g., running, successfully completed, failed), the progress of maps and reduces, the values of the job's counters, and a status

message or description (which may be set by user code). These statuses change over the course of the job, so how do they get communicated back to the client?

When a task is running, it keeps track of its progress, that is, the proportion of the task completed. For map tasks, this is the proportion of the input that has been processed. For reduce tasks, it's a little more complex, but the system can still estimate the proportion of the reduce input processed. It does this by dividing the total progress into three parts, corresponding to the three phases of the shuffle (see "Shuffle and Sort" on page 177). For example, if the task has run the reducer on half its input, then the task's progress is ⅚, since it has completed the copy and sort phases (⅓ each) and is halfway through the reduce phase (⅙).

Figure 6-2. The relationship of the Streaming and Pipes executable to the tasktracker and its child
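The reduce-side progress arithmetic above can be written down directly. This is a simplified model of the calculation, not Hadoop's internal code: each of the three phases contributes up to one third of the total.

```java
// A simplified model of the reduce-task progress calculation described above:
// the copy, sort, and reduce phases each account for one third of total
// progress. (An illustration, not Hadoop's implementation.)
public class ReduceProgress {
    // phasesComplete: how many of the three phases have fully finished (0-2).
    // phaseFraction: fraction of the current phase completed (0.0-1.0).
    public static double progress(int phasesComplete, double phaseFraction) {
        return (phasesComplete + phaseFraction) / 3.0;
    }

    public static void main(String[] args) {
        // Copy and sort done, reducer halfway through its input: 5/6.
        System.out.println(progress(2, 0.5));
    }
}
```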

What Constitutes Progress in MapReduce?

Progress is not always measurable, but nevertheless it tells Hadoop that a task is doing something. For example, a task writing output records is making progress, even though it cannot be expressed as a percentage of the total number that will be written, since the latter figure may not be known, even by the task producing the output.

Progress reporting is important, as it means Hadoop will not fail a task that's making progress. All of the following operations constitute progress:

• Reading an input record (in a mapper or reducer)
• Writing an output record (in a mapper or reducer)
• Setting the status description on a reporter (using Reporter's setStatus() method)
• Incrementing a counter (using Reporter's incrCounter() method)
• Calling Reporter's progress() method

Tasks also have a set of counters that count various events as the task runs (we saw an example in "A test run" on page 23), either those built into the framework, such as the number of map output records written, or ones defined by users.

If a task reports progress, it sets a flag to indicate that the status change should be sent to the tasktracker. The flag is checked in a separate thread every three seconds, and if set, it notifies the tasktracker of the current task status. Meanwhile, the tasktracker is sending heartbeats to the jobtracker every five seconds (this is a minimum, as the heartbeat interval is actually dependent on the size of the cluster: for larger clusters, the interval is longer), and the status of all the tasks being run by the tasktracker is sent in the call. Counters are sent less frequently than every five seconds, because they can be relatively high-bandwidth.

The jobtracker combines these updates to produce a global view of the status of all the jobs being run and their constituent tasks. Finally, as mentioned earlier, the JobClient receives the latest status by polling the jobtracker every second.
Clients can also use JobClient's getJob() method to obtain a RunningJob instance, which contains all of the status information for the job. The method calls are illustrated in Figure 6-3.

Job Completion

When the jobtracker receives a notification that the last task for a job is complete, it changes the status for the job to "successful." Then, when the JobClient polls for status, it learns that the job has completed successfully, so it prints a message to tell the user and then returns from the runJob() method.

Figure 6-3. How status updates are propagated through the MapReduce system

The jobtracker also sends an HTTP job notification if it is configured to do so. This can be configured by clients wishing to receive callbacks, via the job.end.notification.url property.

Last, the jobtracker cleans up its working state for the job and instructs tasktrackers to do the same (so intermediate output is deleted, for example).

Failures

In the real world, user code is buggy, processes crash, and machines fail. One of the major benefits of using Hadoop is its ability to handle such failures and allow your job to complete.

Task Failure

Consider first the case of the child task failing. The most common way that this happens is when user code in the map or reduce task throws a runtime exception. If this happens,

the child JVM reports the error back to its parent tasktracker before it exits. The error ultimately makes it into the user logs. The tasktracker marks the task attempt as failed, freeing up a slot to run another task.

For Streaming tasks, if the Streaming process exits with a nonzero exit code, it is marked as failed. This behavior is governed by the stream.non.zero.exit.is.failure property (the default is true).

Another failure mode is the sudden exit of the child JVM; perhaps there is a JVM bug that causes the JVM to exit for a particular set of circumstances exposed by the MapReduce user code. In this case, the tasktracker notices that the process has exited and marks the attempt as failed.

Hanging tasks are dealt with differently. The tasktracker notices that it hasn't received a progress update for a while and proceeds to mark the task as failed. The child JVM process will be automatically killed after this period.* The timeout period after which tasks are considered failed is normally 10 minutes and can be configured on a per-job basis (or a cluster basis) by setting the mapred.task.timeout property to a value in milliseconds.

Setting the timeout to a value of zero disables the timeout, so long-running tasks are never marked as failed. In this case, a hanging task will never free up its slot, and over time there may be cluster slowdown as a result. This approach should therefore be avoided; making sure that a task reports progress periodically will suffice (see "What Constitutes Progress in MapReduce?" on page 172).

When the jobtracker is notified of a task attempt that has failed (by the tasktracker's heartbeat call), it will reschedule execution of the task. The jobtracker will try to avoid rescheduling the task on a tasktracker where it has previously failed. Furthermore, if a task fails four times (or more), it will not be retried further.
This value is configurable: the maximum number of attempts to run a task is controlled by the mapred.map.max.attempts property for map tasks and mapred.reduce.max.attempts for reduce tasks. By default, if any task fails four times (or whatever the maximum number of attempts is configured to), the whole job fails.

For some applications, it is undesirable to abort the job if a few tasks fail, as it may be possible to use the results of the job despite some failures. In this case, the maximum percentage of tasks that are allowed to fail without triggering job failure can be set for the job. Map tasks and reduce tasks are controlled independently, using the mapred.max.map.failures.percent and mapred.max.reduce.failures.percent properties.

* If a Streaming process hangs, the tasktracker does not try to kill it (although the JVM that launched it will be killed), so you should take precautions to monitor for this scenario, and kill orphaned processes by some other means.
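The interaction between the attempt limit and the failure percentage can be sketched as a small, self-contained check. This illustrates the policy described above; it is not Hadoop's code, and the method names are invented for the example:

```java
// A self-contained sketch of the failure policy described above (not Hadoop's
// implementation). A task is "failed" once it exhausts its attempts; the job
// fails when the percentage of failed tasks exceeds the allowed maximum.
public class FailurePolicy {
    // maxAttempts mirrors mapred.map.max.attempts (default 4).
    public static boolean taskFailed(int attemptsFailed, int maxAttempts) {
        return attemptsFailed >= maxAttempts;
    }

    // maxFailuresPercent mirrors mapred.max.map.failures.percent (default 0).
    public static boolean jobFailed(int failedTasks, int totalTasks,
                                    int maxFailuresPercent) {
        double failedPercent = 100.0 * failedTasks / totalTasks;
        return failedPercent > maxFailuresPercent;
    }

    public static void main(String[] args) {
        // With the defaults, one task exhausting its 4 attempts fails the job:
        System.out.println(taskFailed(4, 4) && jobFailed(1, 100, 0));
        // Allowing 5% of map tasks to fail tolerates 3 failures out of 100:
        System.out.println(jobFailed(3, 100, 5));
    }
}
```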

A task attempt may also be killed, which is different from its failing. A task attempt may be killed because it is a speculative duplicate (for more, see "Speculative Execution" on page 183), or because the tasktracker it was running on failed and the jobtracker marked all the task attempts running on it as killed. Killed task attempts do not count against the number of attempts to run the task (as set by mapred.map.max.attempts and mapred.reduce.max.attempts), since it wasn't the task's fault that an attempt was killed.

Users may also kill or fail task attempts using the web UI or the command line (type hadoop job to see the options). Jobs may also be killed by the same mechanisms.

Tasktracker Failure

Failure of a tasktracker is another failure mode. If a tasktracker fails by crashing or running very slowly, it will stop sending heartbeats to the jobtracker (or send them very infrequently). The jobtracker will notice a tasktracker that has stopped sending heartbeats (if it hasn't received one for 10 minutes, configured via the mapred.tasktracker.expiry.interval property, in milliseconds) and remove it from its pool of tasktrackers to schedule tasks on. The jobtracker arranges for map tasks that were run and completed successfully on that tasktracker to be rerun if they belong to incomplete jobs, since their intermediate output residing on the failed tasktracker's local filesystem may not be accessible to the reduce task. Any tasks in progress are also rescheduled.

A tasktracker can also be blacklisted by the jobtracker, even if the tasktracker has not failed. A tasktracker is blacklisted if the number of tasks that have failed on it is significantly higher than the average task failure rate on the cluster. Blacklisted tasktrackers can be restarted to remove them from the jobtracker's blacklist.

Jobtracker Failure

Failure of the jobtracker is the most serious failure mode.
Currently, Hadoop has no mechanism for dealing with failure of the jobtracker (it is a single point of failure), so in this case the job fails. However, this failure mode has a low chance of occurring, since the chance of a particular machine failing is low. It is possible that a future release of Hadoop will remove this limitation by running multiple jobtrackers, only one of which is the primary jobtracker at any time (perhaps using ZooKeeper as a coordination mechanism for the jobtrackers to decide which is the primary; see Chapter 14).

Job Scheduling

Early versions of Hadoop had a very simple approach to scheduling users' jobs: they ran in order of submission, using a FIFO scheduler. Typically, each job would use the whole cluster, so jobs had to wait their turn. Although a shared cluster offers great potential for offering large resources to many users, the problem of sharing resources

fairly between users requires a better scheduler. Production jobs need to complete in a timely manner, while allowing users who are making smaller ad hoc queries to get results back in a reasonable time.

Later on, the ability to set a job's priority was added, via the mapred.job.priority property or the setJobPriority() method on JobClient (both of which take one of the values VERY_HIGH, HIGH, NORMAL, LOW, or VERY_LOW). When the job scheduler is choosing the next job to run, it selects the one with the highest priority. However, with the FIFO scheduler, priorities do not support preemption, so a high-priority job can still be blocked by a long-running, low-priority job that started before the high-priority job was scheduled.

MapReduce in Hadoop comes with a choice of schedulers. The default is the original FIFO queue-based scheduler, and there are also multiuser schedulers called the Fair Scheduler and the Capacity Scheduler.

The Fair Scheduler

The Fair Scheduler aims to give every user a fair share of the cluster capacity over time. If a single job is running, it gets all of the cluster. As more jobs are submitted, free task slots are given to the jobs in such a way as to give each user a fair share of the cluster. A short job belonging to one user will complete in a reasonable time even while another user's long job is running, and the long job will still make progress.

Jobs are placed in pools, and by default, each user gets their own pool. A user who submits more jobs than a second user will not get any more cluster resources than the second, on average.
It is also possible to define custom pools with guaranteed minimum capacities defined in terms of the number of map and reduce slots, and to set weightings for each pool.

The Fair Scheduler supports preemption, so if a pool has not received its fair share for a certain period of time, then the scheduler will kill tasks in pools running over capacity in order to give the slots to the pool running under capacity.

The Fair Scheduler is a "contrib" module. To enable it, place its JAR file on Hadoop's classpath, by copying it from Hadoop's contrib/fairscheduler directory to the lib directory. Then set the mapred.jobtracker.taskScheduler property to:

    org.apache.hadoop.mapred.FairScheduler

The Fair Scheduler will work without further configuration, but to take full advantage of its features and learn how to configure it (including its web interface), refer to the README in the src/contrib/fairscheduler directory of the distribution.
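The two steps above might look like the following (a sketch; the exact JAR filename varies with the Hadoop version, so the wildcard here is illustrative):

```shell
# Copy the Fair Scheduler JAR onto Hadoop's classpath
# (the exact JAR name depends on your Hadoop version):
cp $HADOOP_HOME/contrib/fairscheduler/hadoop-*-fairscheduler.jar $HADOOP_HOME/lib/

# Then, in conf/mapred-site.xml, point the jobtracker at the Fair Scheduler:
# <property>
#   <name>mapred.jobtracker.taskScheduler</name>
#   <value>org.apache.hadoop.mapred.FairScheduler</value>
# </property>
```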

The Capacity Scheduler

The Capacity Scheduler takes a slightly different approach to multiuser scheduling. A cluster is made up of a number of queues (like the Fair Scheduler's pools), which may be hierarchical (so a queue may be the child of another queue), and each queue has an allocated capacity. This is like the Fair Scheduler, except that within each queue, jobs are scheduled using FIFO scheduling (with priorities). In effect, the Capacity Scheduler allows users or organizations (defined using queues) to simulate a separate MapReduce cluster with FIFO scheduling for each user or organization. The Fair Scheduler, by contrast, enforces fair sharing within each pool, so running jobs share the pool's resources (although it also supports FIFO job scheduling within pools as an option, which makes it like the Capacity Scheduler).

Shuffle and Sort

MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by which the system performs the sort—and transfers the map outputs to the reducers as inputs—is known as the shuffle.† In this section, we look at how the shuffle works, as a basic understanding would be helpful, should you need to optimize a MapReduce program. The shuffle is an area of the codebase where refinements and improvements are continually being made, so the following description necessarily conceals many details (and may change over time; this description is for version 0.20). In many ways, the shuffle is the heart of MapReduce and is where the "magic" happens.

The Map Side

When the map function starts producing output, it is not simply written to disk. The process is more involved, and takes advantage of buffering writes in memory and doing some presorting for efficiency reasons. Figure 6-4 shows what happens.

Each map task has a circular memory buffer that it writes the output to.
The buffer is 100 MB by default, a size that can be tuned by changing the io.sort.mb property. When the contents of the buffer reach a certain threshold size (io.sort.spill.percent, default 0.80, or 80%), a background thread will start to spill the contents to disk. Map outputs will continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map will block until the spill is complete.

Spills are written in round-robin fashion to the directories specified by the mapred.local.dir property, in a job-specific subdirectory.

† The term shuffle is actually imprecise, since in some contexts it refers to only the part of the process where map outputs are fetched by reduce tasks. In this section, we take it to mean the whole process from the point where a map produces output to where a reduce consumes input.
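The two buffer settings just described might be tuned together in a job's configuration; a sketch (the io.sort.mb value here is illustrative, not a recommendation):

```xml
<!-- Sketch: enlarging the map-side sort buffer to reduce spills -->
<property>
  <name>io.sort.mb</name>
  <value>200</value> <!-- default is 100 (megabytes) -->
</property>
<property>
  <name>io.sort.spill.percent</name>
  <value>0.80</value> <!-- start spilling at 80% full (the default) -->
</property>
```

Remember that the buffer comes out of the task JVM's heap, so it must fit within the memory given to the task.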

Figure 6-4. Shuffle and sort in MapReduce

Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort.

Each time the memory buffer reaches the spill threshold, a new spill file is created, so after the map task has written its last output record, there could be several spill files. Before the task is finished, the spill files are merged into a single partitioned and sorted output file. The configuration property io.sort.factor controls the maximum number of streams to merge at once; the default is 10.

If a combiner function has been specified, and the number of spills is at least three (the value of the min.num.spills.for.combine property), then the combiner is run before the output file is written. Recall that combiners may be run repeatedly over the input without affecting the final result. The point is that running combiners makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer.

It is often a good idea to compress the map output as it is written to disk, since doing so makes it faster to write to disk, saves disk space, and reduces the amount of data to transfer to the reducer. By default, the output is not compressed, but it is easy to enable by setting mapred.compress.map.output to true. The compression library to use is specified by mapred.map.output.compression.codec; see "Compression" on page 77 for more on compression formats.

The output file's partitions are made available to the reducers over HTTP. The number of worker threads used to serve the file partitions is controlled by the tasktracker.http.threads property—this setting is per tasktracker, not per map task slot. The default of 40 may need increasing for large clusters running large jobs.
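The compression step described above can be sketched as a configuration fragment (the codec shown is the default; any installed codec could be substituted):

```xml
<!-- Sketch: compressing map outputs before the shuffle -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
```

A fast codec is usually the right choice here, since the output is both written and read again within the same job.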

The Reduce Side

Let's turn now to the reduce part of the process. The map output file is sitting on the local disk of the tasktracker that ran the map task (note that although map outputs always get written to the local disk of the map tasktracker, reduce outputs may not be), but now it is needed by the tasktracker that is about to run the reduce task for the partition. Furthermore, the reduce task needs the map output for its particular partition from several map tasks across the cluster. The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each completes. This is known as the copy phase of the reduce task. The reduce task has a small number of copier threads so that it can fetch map outputs in parallel. The default is five threads, but this number can be changed by setting the mapred.reduce.parallel.copies property.

How do reducers know which tasktrackers to fetch map output from? As map tasks complete successfully, they notify their parent tasktracker of the status update, which in turn notifies the jobtracker. These notifications are transmitted over the heartbeat communication mechanism described earlier. Therefore, for a given job, the jobtracker knows the mapping between map outputs and tasktrackers. A thread in the reducer periodically asks the jobtracker for map output locations until it has retrieved them all. Tasktrackers do not delete map outputs from disk as soon as the first reducer has retrieved them, as the reducer may fail. Instead, they wait until they are told to delete them by the jobtracker, which is after the job has completed.

The map outputs are copied to the reduce tasktracker's memory if they are small enough (the buffer's size is controlled by mapred.job.shuffle.input.buffer.percent, which specifies the proportion of the heap to use for this purpose); otherwise, they are copied to disk.
When the in-memory buffer reaches a threshold size (controlled by mapred.job.shuffle.merge.percent), or reaches a threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to disk.

As the copies accumulate on disk, a background thread merges them into larger, sorted files. This saves some time merging later on. Note that any map outputs that were compressed (by the map task) have to be decompressed in memory in order to perform a merge on them.

When all the map outputs have been copied, the reduce task moves into the sort phase (which should properly be called the merge phase, as the sorting was carried out on the map side), which merges the map outputs, maintaining their sort ordering. This is done in rounds. For example, if there were 50 map outputs, and the merge factor was 10 (the default, controlled by the io.sort.factor property, just like in the map's merge),

then there would be 5 rounds. Each round would merge 10 files into one, so at the end there would be five intermediate files.

Rather than have a final round that merges these five files into a single sorted file, the merge saves a trip to disk by directly feeding the reduce function in what is the last phase: the reduce phase. This final merge can come from a mixture of in-memory and on-disk segments.

The number of files merged in each round is actually more subtle than this example suggests. The goal is to merge the minimum number of files to get to the merge factor for the final round. So if there were 40 files, the merge would not merge 10 files in each of the four rounds to get 4 files. Instead, the first round would merge only 4 files, and the subsequent three rounds would merge the full 10 files. The 4 merged files and the 6 (as yet unmerged) files make a total of 10 files for the final round. Note that this does not change the number of rounds; it's just an optimization to minimize the amount of data that is written to disk, since the final round always merges directly into the reduce.

During the reduce phase, the reduce function is invoked for each key in the sorted output. The output of this phase is written directly to the output filesystem, typically HDFS. In the case of HDFS, since the tasktracker node is also running a datanode, the first block replica will be written to the local disk.

Configuration Tuning

We are now in a better position to understand how to tune the shuffle to improve MapReduce performance. The relevant settings, which can be used on a per-job basis (except where noted), are summarized in Tables 6-1 and 6-2, along with the defaults, which are good for general-purpose jobs.

The general principle is to give the shuffle as much memory as possible. However, there is a trade-off, in that you need to make sure that your map and reduce functions get enough memory to operate.
This is why it is best to write your map and reduce functions to use as little memory as possible—certainly they should not use an unbounded amount of memory (by avoiding accumulating values in a map, for example).

The amount of memory given to the JVMs in which the map and reduce tasks run is set by the mapred.child.java.opts property. You should try to make this as large as possible for the amount of memory on your task nodes; the discussion in "Memory" on page 269 goes through the constraints to consider.

On the map side, the best performance can be obtained by avoiding multiple spills to disk; one is optimal. If you can estimate the size of your map outputs, then you can set the io.sort.* properties appropriately to minimize the number of spills. In particular,

you should increase io.sort.mb if you can. There is a MapReduce counter ("Spilled records"; see "Counters" on page 225) that counts the total number of records that were spilled to disk over the course of a job, which can be useful for tuning. Note that the counter includes both map- and reduce-side spills.

On the reduce side, the best performance is obtained when the intermediate data can reside entirely in memory. By default, this does not happen, since in the general case all the memory is reserved for the reduce function. But if your reduce function has light memory requirements, then setting mapred.inmem.merge.threshold to 0 and mapred.job.reduce.input.buffer.percent to 1.0 (or a lower value; see Table 6-2) may bring a performance boost.

More generally, Hadoop uses a buffer size of 4 KB by default, which is low, so you should increase this across the cluster (by setting io.file.buffer.size; see also "Other Hadoop Properties" on page 279).

In April 2008, Hadoop won the general-purpose terabyte sort benchmark (described in "TeraByte Sort on Apache Hadoop" on page 553), and one of the optimizations used was this one of keeping the intermediate data in memory on the reduce side.

Table 6-1. Map-side tuning properties

  io.sort.mb (int, default 100)
      The size, in megabytes, of the memory buffer to use while sorting map output.

  io.sort.record.percent (float, default 0.05)
      The proportion of io.sort.mb reserved for storing record boundaries of the
      map outputs. The remaining space is used for the map output records themselves.

  io.sort.spill.percent (float, default 0.80)
      The threshold usage proportion for both the map output memory buffer and the
      record boundaries index to start the process of spilling to disk.

  io.sort.factor (int, default 10)
      The maximum number of streams to merge at once when sorting files. This
      property is also used in the reduce. It's fairly common to increase this to 100.

  min.num.spills.for.combine (int, default 3)
      The minimum number of spill files needed for the combiner to run (if a
      combiner is specified).

  mapred.compress.map.output (boolean, default false)
      Compress map outputs.

  mapred.map.output.compression.codec (Class name, default
  org.apache.hadoop.io.compress.DefaultCodec)
      The compression codec to use for map outputs.
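As an aside on io.sort.factor: the merge-round rule described in "The Reduce Side" (merge the minimum number of files in the first round so that the final round merges exactly the merge factor) can be sketched and checked in plain Java. This is an illustration of the arithmetic only, not the real Hadoop merge code, which handles more cases:

```java
import java.util.ArrayList;
import java.util.List;

public class MergeRounds {

    // Returns the number of files merged in each round, final round last.
    static List<Integer> mergeSizes(int files, int factor) {
        List<Integer> rounds = new ArrayList<Integer>();
        if (files <= factor) {
            rounds.add(files); // everything fits into the single final merge
            return rounds;
        }
        int remaining = files;
        // First round merges ((files - 1) mod (factor - 1)) + 1 files, the
        // minimum needed so that every later round can be full-sized.
        int first = (files - 1) % (factor - 1) + 1;
        if (first > 1) {
            rounds.add(first);
            remaining = remaining - first + 1; // merged segments collapse into one
        }
        while (remaining > factor) {
            rounds.add(factor); // a full intermediate round
            remaining = remaining - factor + 1;
        }
        rounds.add(remaining); // final merge, which feeds the reduce directly
        return rounds;
    }

    public static void main(String[] args) {
        // The 40-file example from the text: first round merges 4, then three
        // rounds of 10, leaving 4 merged + 6 unmerged = 10 for the final merge.
        System.out.println(mergeSizes(40, 10)); // [4, 10, 10, 10, 10]
    }
}
```

Running mergeSizes(50, 10) gives a first round of 5 followed by full rounds of 10, consistent with the goal of a final merge of exactly 10 segments.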

  tasktracker.http.threads (int, default 40)
      The number of worker threads per tasktracker for serving the map outputs to
      reducers. This is a cluster-wide setting and cannot be set by individual jobs.

Table 6-2. Reduce-side tuning properties

  mapred.reduce.parallel.copies (int, default 5)
      The number of threads used to copy map outputs to the reducer.

  mapred.reduce.copy.backoff (int, default 300)
      The maximum amount of time, in seconds, to spend retrieving one map output
      for a reducer before declaring it as failed. The reducer may repeatedly
      reattempt a transfer within this time if it fails (using exponential backoff).

  io.sort.factor (int, default 10)
      The maximum number of streams to merge at once when sorting files. This
      property is also used in the map.

  mapred.job.shuffle.input.buffer.percent (float, default 0.70)
      The proportion of total heap size to be allocated to the map outputs buffer
      during the copy phase of the shuffle.

  mapred.job.shuffle.merge.percent (float, default 0.66)
      The threshold usage proportion for the map outputs buffer (defined by
      mapred.job.shuffle.input.buffer.percent) for starting the process of merging
      the outputs and spilling to disk.

  mapred.inmem.merge.threshold (int, default 1000)
      The threshold number of map outputs for starting the process of merging the
      outputs and spilling to disk. A value of 0 or less means there is no
      threshold, and the spill behavior is governed solely by
      mapred.job.shuffle.merge.percent.

  mapred.job.reduce.input.buffer.percent (float, default 0.0)
      The proportion of total heap size to be used for retaining map outputs in
      memory during the reduce. For the reduce phase to begin, the size of map
      outputs in memory must be no more than this size. By default, all map
      outputs are merged to disk before the reduce begins, to give the reduces as
      much memory as possible. However, if your reducers require less memory,
      this value may be increased to minimize the number of trips to disk.

Task Execution

We saw how the MapReduce system executes tasks in the context of the overall job at the beginning of the chapter in "Anatomy of a MapReduce Job Run" on page 167. In this section, we'll look at some more controls that MapReduce users have over task execution.

Speculative Execution

The MapReduce model is to break jobs into tasks and run the tasks in parallel to make the overall job execution time smaller than it would otherwise be if the tasks ran sequentially. This makes job execution time sensitive to slow-running tasks, as it takes only one slow task to make the whole job take significantly longer than it would have done otherwise. When a job consists of hundreds or thousands of tasks, the possibility of a few straggling tasks is very real.

Tasks may be slow for various reasons, including hardware degradation or software misconfiguration, but the causes may be hard to detect, since the tasks still complete successfully, albeit after a longer time than expected. Hadoop doesn't try to diagnose and fix slow-running tasks; instead, it tries to detect when a task is running slower than expected and launches another, equivalent task as a backup. This is termed speculative execution of tasks.

It's important to understand that speculative execution does not work by launching two duplicate tasks at about the same time so they can race each other. This would be wasteful of cluster resources. Rather, a speculative task is launched only after all the tasks for a job have been launched, and then only for tasks that have been running for some time (at least a minute) and have failed to make as much progress, on average, as the other tasks from the job. When a task completes successfully, any duplicate tasks that are running are killed, since they are no longer needed.
So if the original task completes before the speculative task, then the speculative task is killed; on the other hand, if the speculative task finishes first, then the original is killed.

Speculative execution is an optimization, not a feature to make jobs run more reliably. If there are bugs that sometimes cause a task to hang or slow down, then relying on speculative execution to avoid these problems is unwise and won't work reliably, since the same bugs are likely to affect the speculative task. You should fix the bug so that the task doesn't hang or slow down.

Speculative execution is turned on by default. It can be enabled or disabled independently for map tasks and reduce tasks, on a cluster-wide basis, or on a per-job basis. The relevant properties are shown in Table 6-3.
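As a per-job sketch using the properties from Table 6-3, speculation could be switched off for a single job like this:

```xml
<!-- Sketch: turning speculative execution off for one job -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
```

Setting the same properties in mapred-site.xml instead would change the cluster-wide default.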

Table 6-3. Speculative execution properties

  mapred.map.tasks.speculative.execution (boolean, default true)
      Whether extra instances of map tasks may be launched if a task is making
      slow progress.

  mapred.reduce.tasks.speculative.execution (boolean, default true)
      Whether extra instances of reduce tasks may be launched if a task is making
      slow progress.

Why would you ever want to turn off speculative execution? The goal of speculative execution is to reduce job execution time, but this comes at the cost of cluster efficiency. On a busy cluster, speculative execution can reduce overall throughput, since redundant tasks are being executed in an attempt to bring down the execution time for a single job. For this reason, some cluster administrators prefer to turn it off on the cluster and have users explicitly turn it on for individual jobs. This was especially relevant for older versions of Hadoop, when speculative execution could be overly aggressive in scheduling speculative tasks.

Task JVM Reuse

Hadoop runs tasks in their own Java Virtual Machine to isolate them from other running tasks. The overhead of starting a new JVM for each task can take around a second, which for jobs that run for a minute or so is insignificant. However, jobs that have a large number of very short-lived tasks (these are usually map tasks), or that have lengthy initialization, can see performance gains when the JVM is reused for subsequent tasks.

Note that with task JVM reuse enabled, tasks do not run concurrently in a single JVM; the JVM runs tasks sequentially. Tasktrackers can, however, run more than one task at a time, but this is always done in separate JVMs. The properties for controlling a tasktracker's number of map task slots and reduce task slots are discussed in "Memory" on page 269.

The property for controlling task JVM reuse is mapred.job.reuse.jvm.num.tasks: it specifies the maximum number of tasks to run for a given job for each JVM launched; the default is 1 (see Table 6-4).
Tasks from different jobs are always run in separate JVMs. If the property is set to –1, there is no limit to the number of tasks from the same job that may share a JVM. The method setNumTasksToExecutePerJvm() on JobConf can also be used to configure this property.
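A sketch of enabling unlimited reuse for a job, using the property just described:

```xml
<!-- Sketch: let one JVM run any number of a job's tasks on a tasktracker -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value> <!-- -1 means no limit; the default is 1 -->
</property>
```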

Table 6-4. Task JVM Reuse properties

  mapred.job.reuse.jvm.num.tasks (int, default 1)
      The maximum number of tasks to run for a given job for each JVM on a
      tasktracker. A value of –1 indicates no limit: the same JVM may be used for
      all tasks for a job.

Tasks that are CPU-bound may also benefit from task JVM reuse by taking advantage of runtime optimizations applied by the HotSpot JVM. After running for a while, the HotSpot JVM builds up enough information to detect performance-critical sections in the code and dynamically translates the Java bytecodes of these hot spots into native machine code. This works well for long-running processes, but JVMs that run for seconds or a few minutes may not gain the full benefit of HotSpot. In these cases, it is worth enabling task JVM reuse.

Another place where a shared JVM is useful is for sharing state between the tasks of a job. By storing reference data in a static field, tasks get rapid access to the shared data.

Skipping Bad Records

Large datasets are messy. They often have corrupt records. They often have records that are in a different format. They often have missing fields. In an ideal world, your code would cope gracefully with all of these conditions. In practice, it is often expedient to ignore the offending records. Depending on the analysis being performed, if only a small percentage of records are affected, then skipping them may not significantly affect the result. However, if a task trips up when it encounters a bad record—by throwing a runtime exception—then the task fails. Failing tasks are retried (since the failure may be due to hardware failure or some other reason outside the task's control), but if a task fails four times, then the whole job is marked as failed (see "Task Failure" on page 173). If it is the data that is causing the task to throw an exception, rerunning the task won't help, since it will fail in exactly the same way each time.
If you are using TextInputFormat ("TextInputFormat" on page 209), then you can set a maximum expected line length to safeguard against corrupted files. Corruption in a file can manifest itself as a very long line, which can cause out-of-memory errors and then task failure. By setting mapred.linerecordreader.maxlength to a value in bytes that fits in memory (and is comfortably greater than the length of lines in your input data), the record reader will skip the (long) corrupt lines without the task failing.
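A sketch of the safeguard just described (the value is illustrative; pick something comfortably larger than your longest legitimate line):

```xml
<!-- Sketch: cap line length so very long, corrupt lines are skipped -->
<property>
  <name>mapred.linerecordreader.maxlength</name>
  <value>1048576</value> <!-- 1 MB, an illustrative limit in bytes -->
</property>
```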

The best way to handle corrupt records is in your mapper or reducer code. You can detect the bad record and ignore it, or you can abort the job by throwing an exception. You can also count the total number of bad records in the job using counters to see how widespread the problem is.

In rare cases, though, you can't handle the problem because there is a bug in a third-party library that you can't work around in your mapper or reducer. In these cases, you can use Hadoop's optional skipping mode for automatically skipping bad records. When skipping mode is enabled, tasks report the records being processed back to the tasktracker. When the task fails, the tasktracker retries the task, skipping the records that caused the failure. Because of the extra network traffic and bookkeeping to maintain the failed record ranges, skipping mode is turned on for a task only after it has failed twice.

Thus, for a task consistently failing on a bad record, the tasktracker runs the following task attempts with these outcomes:

1. Task fails.
2. Task fails.
3. Skipping mode is enabled. Task fails, but the failed record is stored by the tasktracker.
4. Skipping mode is still enabled. Task succeeds by skipping the bad record that failed in the previous attempt.

Skipping mode is off by default; you enable it independently for map and reduce tasks using the SkipBadRecords class. It's important to note that skipping mode can detect only one bad record per task attempt, so this mechanism is appropriate only for detecting occasional bad records (a few per task, say). You may need to increase the maximum number of task attempts (via mapred.map.max.attempts and mapred.reduce.max.attempts) to give skipping mode enough attempts to detect and skip all the bad records in an input split.

Bad records that have been detected by Hadoop are saved as sequence files in the job's output directory under the _logs/skip subdirectory.
These can be inspected for diagnostic purposes after the job has completed (using hadoop fs -text, for example).

The Task Execution Environment

Hadoop provides information to a map or reduce task about the environment in which it is running. For example, a map task can discover the name of the file it is processing (see "File information in the mapper" on page 205), and a map or reduce task can find out the attempt number of the task. The properties in Table 6-5 can be accessed from the job's configuration, obtained by providing an implementation of the configure() method for Mapper or Reducer, where the configuration is passed in as an argument.

Table 6-5. Task environment properties

  mapred.job.id (String)
      The job ID. (See "Job, Task, and Task Attempt IDs" on page 147 for a
      description of the format.) Example: job_200811201130_0004

  mapred.tip.id (String)
      The task ID. Example: task_200811201130_0004_m_000003

  mapred.task.id (String)
      The task attempt ID (not the task ID).
      Example: attempt_200811201130_0004_m_000003_0

  mapred.task.partition (int)
      The ID of the task within the job. Example: 3

  mapred.task.is.map (boolean)
      Whether this task is a map task. Example: true

Streaming environment variables

Hadoop sets job configuration parameters as environment variables for Streaming programs. However, it replaces nonalphanumeric characters with underscores to make sure they are valid names. The following Python expression illustrates how you can retrieve the value of the mapred.job.id property from within a Python Streaming script:

    os.environ["mapred_job_id"]

You can also set environment variables for the Streaming processes launched by MapReduce by supplying the -cmdenv option to the Streaming launcher program (once for each variable you wish to set). For example, the following sets the MAGIC_PARAMETER environment variable:

    -cmdenv MAGIC_PARAMETER=abracadabra

Task side-effect files

The usual way of writing output from map and reduce tasks is by using the OutputCollector to collect key-value pairs. Some applications need more flexibility than a single key-value pair model, so these applications write output files directly from the map or reduce task to a distributed filesystem, like HDFS. (There are other ways to produce multiple outputs, too, as described in "Multiple Outputs" on page 217.)

Care needs to be taken to ensure that multiple instances of the same task don't try to write to the same file. There are two problems to avoid. First, if a task failed and was retried, then the old partial output would still be present when the second task ran, and it would have to delete the old file first.
Second, with speculative execution enabled, two instances of the same task could try to write to the same file simultaneously.

Hadoop solves this problem for the regular outputs from a task by writing outputs to a temporary directory that is specific to that task attempt. The directory is ${mapred.output.dir}/_temporary/${mapred.task.id}. On successful completion of the task, the contents of the directory are copied to the job's output directory (${mapred.output.dir}). Thus, if the task fails and is retried, the first attempt's partial output will just be cleaned up. A task and another speculative instance of the same task will get separate working directories, and only the first to finish will have the content of its working directory promoted to the output directory—the other will be discarded.

The way that a task's output is committed on completion is implemented by an OutputCommitter, which is associated with the OutputFormat. The OutputCommitter for FileOutputFormat is a FileOutputCommitter, which implements the commit protocol described earlier. The getOutputCommitter() method on OutputFormat may be overridden to return a custom OutputCommitter, in case you want to implement the commit process in a different way.

Hadoop provides a mechanism for application writers to use this feature, too. A task may find its working directory by retrieving the value of the mapred.work.output.dir property from its configuration file. Alternatively, a MapReduce program using the Java API may call the getWorkOutputPath() static method on FileOutputFormat to get the Path object representing the working directory. The framework creates the working directory before executing the task, so you don't need to create it.

To take a simple example, imagine a program for converting image files from one format to another. One way to do this is to have a map-only job, where each map is given a set of images to convert (perhaps using NLineInputFormat; see "NLineInputFormat" on page 211).
If a map task writes the converted images into its working directory, then they will be promoted to the output directory when the task successfully finishes.

CHAPTER 7
MapReduce Types and Formats

MapReduce has a simple model of data processing: inputs and outputs for the map and reduce functions are key-value pairs. This chapter looks at the MapReduce model in detail and, in particular, how data in various formats, from simple text to structured binary objects, can be used with this model.

MapReduce Types

The map and reduce functions in Hadoop MapReduce have the following general form:

    map: (K1, V1) → list(K2, V2)
    reduce: (K2, list(V2)) → list(K3, V3)

In general, the map input key and value types (K1 and V1) are different from the map output types (K2 and V2). However, the reduce input must have the same types as the map output, although the reduce output types may be different again (K3 and V3). The Java interfaces mirror this form:

    public interface Mapper<K1, V1, K2, V2> extends JobConfigurable, Closeable {
      void map(K1 key, V1 value, OutputCollector<K2, V2> output,
               Reporter reporter) throws IOException;
    }

    public interface Reducer<K2, V2, K3, V3> extends JobConfigurable, Closeable {
      void reduce(K2 key, Iterator<V2> values,
                  OutputCollector<K3, V3> output,
                  Reporter reporter) throws IOException;
    }

Recall that the OutputCollector is purely for emitting key-value pairs (and is hence parameterized with their types), while the Reporter is for updating counters and status. (In the new MapReduce API in release 0.20.0 and later, these two functions are combined in a single context object.)

If a combine function is used, then it has the same form as the reduce function (and is an implementation of Reducer), except its output types are the intermediate key and value types (K2 and V2), so they can feed the reduce function:

    map: (K1, V1) → list(K2, V2)
    combine: (K2, list(V2)) → list(K2, V2)
    reduce: (K2, list(V2)) → list(K3, V3)

Often the combine and reduce functions are the same, in which case K3 is the same as K2, and V3 is the same as V2.

The partition function operates on the intermediate key and value types (K2 and V2) and returns the partition index. In practice, the partition is determined solely by the key (the value is ignored):

    partition: (K2, V2) → integer

Or in Java:

    public interface Partitioner<K2, V2> extends JobConfigurable {
      int getPartition(K2 key, V2 value, int numPartitions);
    }

So much for the theory; how does this help configure MapReduce jobs? Table 7-1 summarizes the configuration options. It is divided into the properties that determine the types and those that have to be compatible with the configured types.

Input types are set by the input format. So, for instance, a TextInputFormat generates keys of type LongWritable and values of type Text. The other types are set explicitly by calling the methods on the JobConf. If not set explicitly, the intermediate types default to the (final) output types, which default to LongWritable and Text. So if K2 and K3 are the same, you don't need to call setMapOutputKeyClass(), since it falls back to the type set by calling setOutputKeyClass(). Similarly, if V2 and V3 are the same, you only need to use setOutputValueClass().

It may seem strange that these methods for setting the intermediate and final output types exist at all. After all, why can't the types be determined from a combination of the mapper and the reducer? The answer is that it's to do with a limitation in Java generics: type erasure means that the type information isn't always present at runtime, so Hadoop has to be given it explicitly.
This also means that it's possible to configure a MapReduce job with incompatible types, because the configuration isn't checked at compile time. The settings that have to be compatible with the MapReduce types are listed in the lower part of Table 7-1. Type conflicts are detected at runtime during job execution, and for this reason, it is wise to run a test job using a small amount of data to flush out and fix any type incompatibilities.
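The erasure limitation is easy to demonstrate in plain Java: at runtime, two differently parameterized collections share a single class, so there is nothing for a framework to inspect. (ErasureDemo is our own illustrative name, not Hadoop code.)

```java
import java.util.ArrayList;
import java.util.List;

// Demonstrates the type erasure limitation discussed above: at runtime,
// a List<String> and a List<Integer> are the same class, so a framework
// cannot discover a job's key and value types by inspecting objects alone.
public class ErasureDemo {

    public static boolean sameRuntimeClass() {
        List<String> strings = new ArrayList<>();
        List<Integer> ints = new ArrayList<>();
        // The type parameters are erased by the compiler; both lists are
        // plain ArrayList instances at runtime.
        return strings.getClass() == ints.getClass();
    }

    public static void main(String[] args) {
        System.out.println(sameRuntimeClass()); // prints "true"
    }
}
```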

The Default MapReduce Job

What happens when you run MapReduce without setting a mapper or a reducer? Let's try it by running this minimal MapReduce program:

    public class MinimalMapReduce extends Configured implements Tool {

      @Override
      public int run(String[] args) throws Exception {
        if (args.length != 2) {
          System.err.printf("Usage: %s [generic options] <input> <output>\n",
              getClass().getSimpleName());
          ToolRunner.printGenericCommandUsage(System.err);
          return -1;
        }

        JobConf conf = new JobConf(getConf(), getClass());
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
        return 0;
      }

      public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new MinimalMapReduce(), args);
        System.exit(exitCode);
      }
    }

The only configuration that we set is an input path and an output path. We run it over a subset of our weather data with the following:

    % hadoop MinimalMapReduce "input/ncdc/all/190{1,2}.gz" output

We do get some output: one file named part-00000 in the output directory. Here's what the first few lines look like (truncated to fit the page):

    0→0029029070999991901010106004+64333+023450FM-12+000599999V0202701N01591...
    0→0035029070999991902010106004+64333+023450FM-12+000599999V0201401N01181...
    135→0029029070999991901010113004+64333+023450FM-12+000599999V0202901N00821...
    141→0035029070999991902010113004+64333+023450FM-12+000599999V0201401N01181...
    270→0029029070999991901010120004+64333+023450FM-12+000599999V0209991C00001...
    282→0035029070999991902010120004+64333+023450FM-12+000599999V0201401N01391...

Each line is an integer followed by a tab character (shown here as →), followed by the original weather data record. Admittedly, it's not a very useful program, but understanding how it produces its output does provide some insight into the defaults that Hadoop uses when running MapReduce jobs.
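The integer keys are byte offsets of each line within its input file, as explained below when we look at the default input format. The offset bookkeeping can be sketched in plain Java (illustrative code, assuming single-byte characters; Hadoop counts bytes, not chars):

```java
import java.util.ArrayList;
import java.util.List;

// Computes the offset of the start of each line in a file's contents,
// which is what the default input format uses as the key for each record.
// Illustrative only; assumes every character occupies a single byte.
public class LineOffsetDemo {

    public static List<Integer> lineOffsets(String contents) {
        List<Integer> offsets = new ArrayList<>();
        int pos = 0;
        while (pos < contents.length()) {
            offsets.add(pos); // a record's key: where its line begins
            int newline = contents.indexOf('\n', pos);
            pos = (newline == -1) ? contents.length() : newline + 1;
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Three lines of lengths 2, 3, and 4, each terminated by a newline:
        System.out.println(lineOffsets("ab\ncde\nfghi\n")); // [0, 3, 7]
    }
}
```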
Example 7-1 shows a program that has exactly the same effect as MinimalMapReduce, but explicitly sets the job settings to their defaults.

        return jobConf;
      }

      public static void printUsage(Tool tool, String extraArgsUsage) {
        System.err.printf("Usage: %s [genericOptions] %s\n\n",
            tool.getClass().getSimpleName(), extraArgsUsage);
        GenericOptionsParser.printGenericCommandUsage(System.err);
      }

Going back to MinimalMapReduceWithDefaults in Example 7-1, although there are many other default job settings, the ones highlighted are those most central to running a job. Let's go through them in turn.

The default input format is TextInputFormat, which produces keys of type LongWritable (the offset of the beginning of the line in the file) and values of type Text (the line of text). This explains where the integers in the final output come from: they are the line offsets.

Despite appearances, the setNumMapTasks() call does not necessarily set the number of map tasks to one. It is a hint, and the actual number of map tasks depends on the size of the input and the file's block size (if the file is in HDFS). This is discussed further in "FileInputFormat input splits" on page 202.

The default mapper is IdentityMapper, which writes the input key and value unchanged to the output:

    public class IdentityMapper<K, V>
        extends MapReduceBase implements Mapper<K, V, K, V> {

      public void map(K key, V val,
          OutputCollector<K, V> output, Reporter reporter)
          throws IOException {
        output.collect(key, val);
      }
    }

IdentityMapper is a generic type, which allows it to work with any key or value types, with the restriction that the map input and output keys are of the same type, and the map input and output values are of the same type. In this case, the map output key is LongWritable and the map output value is Text.

Map tasks are run by MapRunner, the default implementation of MapRunnable that calls the Mapper's map() method sequentially with each record.

The default partitioner is HashPartitioner, which hashes a record's key to determine which partition the record belongs in.
Each partition is processed by a reduce task, so the number of partitions is equal to the number of reduce tasks for the job:

    public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> {

      public void configure(JobConf job) {}

      public int getPartition(K2 key, V2 value,
          int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

The key's hash code is turned into a nonnegative integer by bitwise ANDing it with the largest integer value. It is then reduced modulo the number of partitions to find the index of the partition that the record belongs in.

By default, there is a single reducer, and therefore a single partition, so the action of the partitioner is irrelevant in this case since everything goes into one partition. However, it is important to understand the behavior of HashPartitioner when you have more than one reduce task. Assuming the key's hash function is a good one, the records will be evenly allocated across reduce tasks, with all records sharing the same key being processed by the same reduce task.

Choosing the Number of Reducers

The single reducer default is something of a gotcha for new users to Hadoop. Almost all real-world jobs should set this to a larger number; otherwise, the job will be very slow since all the intermediate data flows through a single reduce task. (Note that when running under the local job runner, only zero or one reducers are supported.)

The optimal number of reducers is related to the total number of available reducer slots in your cluster. The total number of slots is found by multiplying the number of nodes in the cluster by the number of slots per node (which is determined by the value of the mapred.tasktracker.reduce.tasks.maximum property, described in "Environment Settings" on page 269). One common setting is to have slightly fewer reducers than total slots, which gives one wave of reduce tasks (and tolerates a few failures without extending job execution time).
If your reduce tasks are very big, it makes sense to have a larger number of reducers (resulting in two waves, for example) so that the tasks are more fine-grained, and failure doesn't affect job execution time significantly.

The default reducer is IdentityReducer, again a generic type, which simply writes all its input to its output:

    public class IdentityReducer<K, V>
        extends MapReduceBase implements Reducer<K, V, K, V> {

      public void reduce(K key, Iterator<V> values,
          OutputCollector<K, V> output, Reporter reporter)
          throws IOException {
        while (values.hasNext()) {
          output.collect(key, values.next());
        }
      }
    }
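The partition-index arithmetic used by HashPartitioner can be checked in isolation. Below is a standalone reproduction of the expression (not the Hadoop class itself); the masking matters because Java hash codes can be negative.

```java
// Reproduces HashPartitioner's partition-index arithmetic standalone.
// ANDing the hash code with Integer.MAX_VALUE clears the sign bit, so keys
// with negative hash codes still map to an index in [0, numPartitions).
public class PartitionDemo {

    public static int partition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // "polygenelubricants" is a string with a negative hash code;
        // the masking still yields a valid, nonnegative partition index.
        System.out.println(partition("polygenelubricants", 7));
        System.out.println(partition("anything", 1)); // prints "0"
    }
}
```

With a single reducer (the default), numPartitions is 1 and every key lands in partition 0, which is why the partitioner is irrelevant in that case.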

For this job, the output key is LongWritable, and the output value is Text. In fact, all the keys for this MapReduce program are LongWritable, and all the values are Text, since these are the input keys and values, and the map and reduce functions are both identity functions, which by definition preserve type. Most MapReduce programs, however, don't use the same key or value types throughout, so you need to configure the job to declare the types you are using, as described in the previous section.

Records are sorted by the MapReduce system before being presented to the reducer. In this case, the keys are sorted numerically, which has the effect of interleaving the lines from the input files into one combined output file.

The default output format is TextOutputFormat, which writes out records, one per line, by converting keys and values to strings and separating them with a tab character. This is why the output is tab-separated: it is a feature of TextOutputFormat.

The default Streaming job

In Streaming, the default job is similar, but not identical, to the Java equivalent. The minimal form is:

    % hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
      -input input/ncdc/sample.txt \
      -output output \
      -mapper /bin/cat

Notice that you have to supply a mapper: the default identity mapper will not work. The reason has to do with the default input format, TextInputFormat, which generates LongWritable keys and Text values. However, Streaming output keys and values (including the map keys and values) are always both of type Text.* The identity mapper cannot change LongWritable keys to Text keys, so it fails.

When we specify a non-Java mapper and the input format is TextInputFormat, Streaming does something special. It doesn't pass the key to the mapper process; it just passes the value. This is actually very useful, since the key is just the line offset in the file, and the value is the line, which is all most applications are interested in.
The overall effect of this job is to perform a sort of the input.

With more of the defaults spelled out, the command looks like this:

    % hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
      -input input/ncdc/sample.txt \
      -output output \
      -inputformat org.apache.hadoop.mapred.TextInputFormat \
      -mapper /bin/cat \
      -partitioner org.apache.hadoop.mapred.lib.HashPartitioner \
      -numReduceTasks 1 \
      -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
      -outputformat org.apache.hadoop.mapred.TextOutputFormat

* Except when used in binary mode, from version 0.21.0 onward, via the -io rawbytes or -io typedbytes options. Text mode (-io text) is the default.

The mapper and reducer arguments take a command or a Java class. A combiner may optionally be specified, using the -combiner argument.

Keys and values in Streaming

A Streaming application can control the separator that is used when a key-value pair is turned into a series of bytes and sent to the map or reduce process over standard input. The default is a tab character, but it is useful to be able to change it in the case that the keys or values themselves contain tab characters.

Similarly, when the map or reduce writes out key-value pairs, they may be separated by a configurable separator. Furthermore, the key from the output can be composed of more than the first field: it can be made up of the first n fields (defined by stream.num.map.output.key.fields or stream.num.reduce.output.key.fields), with the value being the remaining fields. For example, if the output from a Streaming process was a,b,c (and the separator is a comma), and n is two, then the key would be parsed as a,b and the value as c.

Separators may be configured independently for maps and reduces. The properties are listed in Table 7-2 and shown in a diagram of the data flow path in Figure 7-1.

These settings do not have any bearing on the input and output formats. For example, if stream.reduce.output.field.separator were set to be a colon, say, and the reduce stream process wrote the line a:b to standard out, then the Streaming reducer would know to extract the key as a and the value as b. With the standard TextOutputFormat, this record would be written to the output file with a tab separating a and b. You can change the separator that TextOutputFormat uses by setting mapred.textoutputformat.separator.

Table 7-2. Streaming separator properties

    Property name                         Type    Default  Description
    stream.map.input.field.separator     String  \t       The separator to use when passing the input key
                                                           and value strings to the stream map process as a
                                                           stream of bytes.
    stream.map.output.field.separator    String  \t       The separator to use when splitting the output from
                                                           the stream map process into key and value strings
                                                           for the map output.
    stream.num.map.output.key.fields     int     1        The number of fields separated by
                                                           stream.map.output.field.separator
                                                           to treat as the map output key.
    stream.reduce.input.field.separator  String  \t       The separator to use when passing the input key
                                                           and value strings to the stream reduce process as
                                                           a stream of bytes.
    stream.reduce.output.field.separator String  \t       The separator to use when splitting the output from
                                                           the stream reduce process into key and value strings
                                                           for the final reduce output.

Table 7-2. Streaming separator properties (continued)

    Property name                         Type    Default  Description
    stream.num.reduce.output.key.fields  int     1        The number of fields separated by
                                                           stream.reduce.output.field.separator
                                                           to treat as the reduce output key.

Figure 7-1. Where separators are used in a Streaming MapReduce job

Input Formats

Hadoop can process many different types of data formats, from flat text files to databases. In this section, we explore the different formats available.

Input Splits and Records

As we saw in Chapter 2, an input split is a chunk of the input that is processed by a single map. Each map processes a single split. Each split is divided into records, and the map processes each record—a key-value pair—in turn. Splits and records are logical: there is nothing that requires them to be tied to files, for example, although in their most common incarnations, they are. In a database context, a split might correspond to a range of rows from a table and a record to a row in that range (this is precisely what DBInputFormat does, an input format for reading data from a relational database).

Input splits are represented by the Java interface, InputSplit (which, like all of the classes mentioned in this section, is in the org.apache.hadoop.mapred package†):

    public interface InputSplit extends Writable {
      long getLength() throws IOException;
      String[] getLocations() throws IOException;

† But see the new MapReduce classes in org.apache.hadoop.mapreduce, described in "The new Java MapReduce API" on page 25.

    }

An InputSplit has a length in bytes and a set of storage locations, which are just hostname strings. Notice that a split doesn't contain the input data; it is just a reference to the data. The storage locations are used by the MapReduce system to place map tasks as close to the split's data as possible, and the size is used to order the splits so that the largest get processed first, in an attempt to minimize the job runtime (this is an instance of a greedy approximation algorithm).

As a MapReduce application writer, you don't need to deal with InputSplits directly, as they are created by an InputFormat. An InputFormat is responsible for creating the input splits and dividing them into records. Before we see some concrete examples of InputFormat, let's briefly examine how it is used in MapReduce. Here's the interface:

    public interface InputFormat<K, V> {
      InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

      RecordReader<K, V> getRecordReader(InputSplit split,
          JobConf job, Reporter reporter) throws IOException;
    }

The JobClient calls the getSplits() method, passing the desired number of map tasks as the numSplits argument. This number is treated as a hint, as InputFormat implementations are free to return a different number of splits from the number specified in numSplits. Having calculated the splits, the client sends them to the jobtracker, which uses their storage locations to schedule map tasks to process them on the tasktrackers.

On a tasktracker, the map task passes the split to the getRecordReader() method on InputFormat to obtain a RecordReader for that split. A RecordReader is little more than an iterator over records, and the map task uses one to generate record key-value pairs, which it passes to the map function.
A code snippet (based on the code in MapRunner) illustrates the idea:

    K key = reader.createKey();
    V value = reader.createValue();
    while (reader.next(key, value)) {
      mapper.map(key, value, output, reporter);
    }

Here the RecordReader's next() method is called repeatedly to populate the key and value objects for the mapper. When the reader gets to the end of the stream, the next() method returns false, and the map task completes.

This code snippet makes it clear that the same key and value objects are used on each invocation of the map() method—only their contents are changed (by the reader's next() method). This can be a surprise to users, who might expect keys and values to be immutable. This causes problems when a reference to a key or value object is retained outside the map() method, as its value can change without warning. If you need to do this, make a copy of the object you want to hold on to. For example, for a Text object, you can use its copy constructor: new Text(value). The situation is similar with reducers. In this case, the value objects in the reducer's iterator are reused, so you need to copy any that you need to retain between calls to the iterator (see Example 8-14).

Finally, note that MapRunner is only one way of running mappers. MultithreadedMapRunner is another implementation of the MapRunnable interface that runs mappers concurrently in a configurable number of threads (set by mapred.map.multithreadedrunner.threads). For most data processing tasks, it confers no advantage over MapRunner. However, for mappers that spend a long time processing each record, because they contact external servers, for example, it allows multiple mappers to run in one JVM with little contention. See "Fetcher: A multithreaded MapRunner in action" on page 527 for an example of an application that uses MultithreadedMapRunner.

FileInputFormat

FileInputFormat is the base class for all implementations of InputFormat that use files as their data source (see Figure 7-2). It provides two things: a place to define which files are included as the input to a job, and an implementation for generating splits for the input files. The job of dividing splits into records is performed by subclasses.

FileInputFormat input paths

The input to a job is specified as a collection of paths, which offers great flexibility in constraining the input to a job.
FileInputFormat offers four static convenience methods for setting a JobConf's input paths:

    public static void addInputPath(JobConf conf, Path path)
    public static void addInputPaths(JobConf conf, String commaSeparatedPaths)
    public static void setInputPaths(JobConf conf, Path... inputPaths)
    public static void setInputPaths(JobConf conf, String commaSeparatedPaths)

The addInputPath() and addInputPaths() methods add a path or paths to the list of inputs. You can call these methods repeatedly to build the list of paths. The setInputPaths() methods set the entire list of paths in one go (replacing any paths set on the JobConf in previous calls).

Figure 7-2. InputFormat class hierarchy

A path may represent a file, a directory, or, by using a glob, a collection of files and directories. A path representing a directory includes all the files in the directory as input to the job. See "File patterns" on page 60 for more on using globs.

The contents of a directory specified as an input path are not processed recursively. In fact, the directory should only contain files: if the directory contains a subdirectory, it will be interpreted as a file, which will cause an error. The way to handle this case is to use a file glob or a filter to select only the files in the directory based on a name pattern.

The add and set methods allow files to be specified by inclusion only. To exclude certain files from the input, you can set a filter using the setInputPathFilter() method on FileInputFormat:

    public static void setInputPathFilter(JobConf conf,
        Class<? extends PathFilter> filter)

Filters are discussed in more detail in "PathFilter" on page 61.

Even if you don't set a filter, FileInputFormat uses a default filter that excludes hidden files (those whose names begin with a dot or an underscore). If you set a filter by calling setInputPathFilter(), it acts in addition to the default filter. In other words, only nonhidden files that are accepted by your filter get through.

Paths and filters can be set through configuration properties, too (Table 7-3), which can be handy for Streaming and Pipes. Setting paths is done with the -input option for both Streaming and Pipes interfaces, so setting paths directly is not usually needed.

Table 7-3. Input path and filter properties

    Property name                  Type                   Default  Description
    mapred.input.dir               comma-separated paths  none     The input files for a job. Paths that contain
                                                                   commas should have those commas escaped by a
                                                                   backslash character. For example, the glob
                                                                   {a,b} would be escaped as {a\,b}.
    mapred.input.pathFilter.class  PathFilter classname   none     The filter to apply to the input files for a job.

FileInputFormat input splits

Given a set of files, how does FileInputFormat turn them into splits? FileInputFormat splits only large files. Here "large" means larger than an HDFS block. The split size is normally the size of an HDFS block, which is appropriate for most applications; however, it is possible to control this value by setting various Hadoop properties, as shown in Table 7-4.

Table 7-4. Properties for controlling split size

    Property name              Type  Default value                  Description
    mapred.min.split.size      int   1                              The smallest valid size in bytes for a file split.
    mapred.max.split.size (a)  long  Long.MAX_VALUE, that is,       The largest valid size in bytes for a file split.
                                     9223372036854775807
    dfs.block.size             long  64 MB, that is, 67108864       The size of a block in HDFS in bytes.

(a) This property is not present in the old MapReduce API (with the exception of CombineFileInputFormat). Instead, it is calculated indirectly as the size of the total input for the job, divided by the guide number of map tasks specified by mapred.map.tasks (or the setNumMapTasks() method on JobConf). Because mapred.map.tasks defaults to 1, this makes the maximum split size the size of the input.

The minimum split size is usually 1 byte, although some formats have a lower bound on the split size. (For example, sequence files insert sync entries every so often in the stream, so the minimum split size has to be large enough to ensure that every split has a sync point to allow the reader to resynchronize with a record boundary.)
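The default hidden-file rule described above is simple enough to sketch standalone. The accept() method here is our own illustrative name; the real default filter implements Hadoop's PathFilter interface and operates on Path objects rather than strings.

```java
// Sketches the rule applied by FileInputFormat's default path filter:
// files whose names begin with "." or "_" are treated as hidden and
// excluded from a job's input. Illustrative only, not the Hadoop class.
public class HiddenFilterDemo {

    public static boolean accept(String filename) {
        return !filename.startsWith(".") && !filename.startsWith("_");
    }

    public static void main(String[] args) {
        System.out.println(accept("part-00000")); // prints "true"
        System.out.println(accept("_logs"));      // prints "false"
        System.out.println(accept(".crc"));       // prints "false"
    }
}
```

This is why, for example, the _logs directory and checksum files that Hadoop itself writes into an output directory are not picked up when that directory is later used as the input to another job.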

Applications may impose a minimum split size: by setting this to a value larger than the block size, they can force splits to be larger than a block. There is no good reason for doing this when using HDFS, since doing so will increase the number of blocks that are not local to a map task.

The maximum split size defaults to the maximum value that can be represented by a Java long type. It has an effect only when it is less than the block size, forcing splits to be smaller than a block.

The split size is calculated by the formula (see the computeSplitSize() method in FileInputFormat):

    max(minimumSize, min(maximumSize, blockSize))

By default:

    minimumSize < blockSize < maximumSize

so the split size is blockSize. Various settings for these parameters and how they affect the final split size are illustrated in Table 7-5.

Table 7-5. Examples of how to control the split size

    Minimum split size  Maximum split size        Block size       Split size  Comment
    1 (default)         Long.MAX_VALUE (default)  64 MB (default)  64 MB       By default, the split size is the same as
                                                                               the default block size.
    1 (default)         Long.MAX_VALUE (default)  128 MB           128 MB      The most natural way to increase the split
                                                                               size is to have larger blocks in HDFS, by
                                                                               setting dfs.block.size, or on a per-file
                                                                               basis at file construction time.
    128 MB              Long.MAX_VALUE (default)  64 MB (default)  128 MB      Making the minimum split size greater than
                                                                               the block size increases the split size, but
                                                                               at the cost of locality.
    1 (default)         32 MB                     64 MB (default)  32 MB       Making the maximum split size less than the
                                                                               block size decreases the split size.

Small files and CombineFileInputFormat

Hadoop works better with a small number of large files than a large number of small files. One reason for this is that FileInputFormat generates splits in such a way that each split is all or part of a single file.
If the file is very small ("small" means significantly smaller than an HDFS block) and there are a lot of them, then each map task will process very little input, and there will be a lot of them (one per file), each of which imposes extra bookkeeping overhead. Compare a 1 GB file broken into sixteen 64 MB blocks and 10,000 or so 100 KB files. The 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent one with a single input file and 16 map tasks.
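Both the split-size formula and the small-files arithmetic can be checked standalone. The computeSplitSize() method below reproduces the formula from the text; numSplits() is our own simplified model of the one-split-per-file behavior described above (real split calculation also accounts for a slack factor on the last block).

```java
// Checks the split-size formula and the small-files map-count arithmetic
// from the surrounding discussion. numSplits() is a simplified model:
// one split per file, with large files broken into split-size pieces.
public class SplitMathDemo {

    static final long MB = 1024 * 1024;

    // The formula: max(minimumSize, min(maximumSize, blockSize)).
    public static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static long numSplits(long numFiles, long fileSize, long splitSize) {
        long splitsPerFile = Math.max(1, (fileSize + splitSize - 1) / splitSize);
        return numFiles * splitsPerFile;
    }

    public static void main(String[] args) {
        long splitSize = computeSplitSize(1, Long.MAX_VALUE, 64 * MB); // 64 MB default
        // One 1 GB file: sixteen 64 MB splits, hence 16 map tasks.
        System.out.println(numSplits(1, 1024 * MB, splitSize));      // prints "16"
        // 10,000 files of 100 KB: one (tiny) split each, hence 10,000 map tasks.
        System.out.println(numSplits(10_000, 100 * 1024, splitSize)); // prints "10000"
    }
}
```

The same method reproduces each row of Table 7-5: with the defaults the split size is the block size, raising the minimum above the block size raises the split size, and lowering the maximum below the block size lowers it.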

The situation is alleviated somewhat by CombineFileInputFormat, which was designed to work well with small files. Where FileInputFormat creates a split per file, CombineFileInputFormat packs many files into each split so that each mapper has more to process. Crucially, CombineFileInputFormat takes node and rack locality into account when deciding which blocks to place in the same split, so it does not compromise the speed at which it can process the input in a typical MapReduce job.

Of course, if possible, it is still a good idea to avoid the many-small-files case, since MapReduce works best when it can operate at the transfer rate of the disks in the cluster, and processing many small files increases the number of seeks that are needed to run a job. Also, storing large numbers of small files in HDFS is wasteful of the namenode's memory. One technique for avoiding the many-small-files case is to merge small files into larger files by using a SequenceFile: the keys can act as filenames (or a constant such as NullWritable, if not needed) and the values as file contents. See Example 7-4. But if you already have a large number of small files in HDFS, then CombineFileInputFormat is worth trying.

CombineFileInputFormat isn't just good for small files—it can bring benefits when processing large files, too. Essentially, CombineFileInputFormat decouples the amount of data that a mapper consumes from the block size of the files in HDFS. If your mappers can process each block in a matter of seconds, then you could use CombineFileInputFormat with the maximum split size set to a small multiple of the number of blocks (by setting the mapred.max.split.size property in bytes) so that each mapper processes more than one block.
In return, the overall processing time falls, since proportionally fewer mappers run, which reduces the overhead in task bookkeeping and startup time associated with a large number of short-lived mappers.

Since CombineFileInputFormat is an abstract class without any concrete classes (unlike FileInputFormat), you need to do a bit more work to use it. (Hopefully, common implementations will be added to the library over time.) For example, to have the CombineFileInputFormat equivalent of TextInputFormat, you would create a concrete subclass of CombineFileInputFormat and implement the getRecordReader() method.

Preventing splitting

Some applications don't want files to be split, so that a single mapper can process each input file in its entirety. For example, a simple way to check if all the records in a file are sorted is to go through the records in order, checking whether each record is not less than the preceding one. Implemented as a map task, this algorithm will work only if one map processes the whole file.‡

There are a couple of ways to ensure that an existing file is not split. The first (quick-and-dirty) way is to increase the minimum split size to be larger than the largest file in your system. Setting it to its maximum value, Long.MAX_VALUE, has this effect. The second is to subclass the concrete subclass of FileInputFormat that you want to use, to override the isSplitable() method§ to return false. For example, here's a nonsplittable TextInputFormat:

    import org.apache.hadoop.fs.*;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class NonSplittableTextInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
      }
    }

File information in the mapper

A mapper processing a file input split can find information about the split by reading some special properties from its job configuration object, which may be obtained by implementing configure() in your Mapper implementation to get access to the JobConf object. Table 7-6 lists the properties available. These are in addition to the ones available to all mappers and reducers, listed in "The Task Execution Environment" on page 186.

Table 7-6. File split properties

    Property name     Type    Description
    map.input.file    String  The path of the input file being processed
    map.input.start   long    The byte offset of the start of the split
    map.input.length  long    The length of the split in bytes

In the next section, you shall see how to use this when we need to access the split's filename.

‡ This is how the mapper in SortValidator.RecordStatsChecker is implemented.
§ In the method name isSplitable(), "splitable" has a single "t." It is usually spelled "splittable," which is the spelling I have used in this book.

Processing a whole file as a record

A related requirement that sometimes crops up is for mappers to have access to the full contents of a file. Not splitting the file gets you part of the way there, but you also need to have a RecordReader that delivers the file contents as the value of the record. The listing for WholeFileInputFormat in Example 7-2 shows a way of doing this.

Example 7-2. An InputFormat for reading a whole file as a record

    public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

      @Override
      protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
      }

      @Override
      public RecordReader<NullWritable, BytesWritable> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
      }
    }

WholeFileInputFormat defines a format where the keys are not used, represented by NullWritable, and the values are the file contents, represented by BytesWritable instances. It defines two methods. First, the format is careful to specify that input files should never be split, by overriding isSplitable() to return false. Second, we implement getRecordReader() to return a custom implementation of RecordReader, which appears in Example 7-3.

Example 7-3. The RecordReader used by WholeFileInputFormat for reading a whole file as a record

    class WholeFileRecordReader
        implements RecordReader<NullWritable, BytesWritable> {

      private FileSplit fileSplit;
      private Configuration conf;
      private boolean processed = false;

      public WholeFileRecordReader(FileSplit fileSplit, Configuration conf)
          throws IOException {
        this.fileSplit = fileSplit;
        this.conf = conf;
      }

      @Override
      public NullWritable createKey() {
        return NullWritable.get();
      }

      @Override
      public BytesWritable createValue() {
        return new BytesWritable();
      }

      @Override
      public long getPos() throws IOException {
        return processed ? fileSplit.getLength() : 0;
      }

      @Override
      public float getProgress() throws IOException {
        return processed ? 1.0f : 0.0f;
      }

      @Override
      public boolean next(NullWritable key, BytesWritable value) throws IOException {
        if (!processed) {
          byte[] contents = new byte[(int) fileSplit.getLength()];
          Path file = fileSplit.getPath();
          FileSystem fs = file.getFileSystem(conf);
          FSDataInputStream in = null;
          try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
          } finally {
            IOUtils.closeStream(in);
          }
          processed = true;
          return true;
        }
        return false;
      }

      @Override
      public void close() throws IOException {
        // do nothing
      }
    }

WholeFileRecordReader is responsible for taking a FileSplit and converting it into a single record, with a null key and a value containing the bytes of the file. Because there is only a single record, WholeFileRecordReader has either processed it or not, so it maintains a boolean called processed. If, when the next() method is called, the file has not been processed, then we open the file, create a byte array whose length is the length of the file, and use the Hadoop IOUtils class to slurp the file into the byte array. Then we set the array on the BytesWritable instance that was passed into the next() method, and return true to signal that a record has been read.

The other methods are straightforward bookkeeping methods for creating the correct key and value types, getting the position and progress of the reader, and a close() method, which is invoked by the MapReduce framework when the reader is done with it.

Since the input format is a WholeFileInputFormat, the mapper has to find only the filename for the input file split. It does this by retrieving the map.input.file property from the JobConf, which is set to the split's filename by the MapReduce framework, but only for splits that are FileSplit instances (this includes most subclasses of FileInputFormat). The reducer is the IdentityReducer, and the output format is a SequenceFileOutputFormat.

Here's a run on a few small files. We've chosen to use two reducers, so we get two output sequence files:

    % hadoop jar job.jar SmallFilesToSequenceFileConverter \
        -conf conf/hadoop-localhost.xml -D mapred.reduce.tasks=2 input/smallfiles output

Two part files are created, each of which is a sequence file, which we can inspect with the -text option to the filesystem shell:

    % hadoop fs -conf conf/hadoop-localhost.xml -text output/part-00000
    hdfs://localhost/user/tom/input/smallfiles/a   61 61 61 61 61 61 61 61 61 61
    hdfs://localhost/user/tom/input/smallfiles/c   63 63 63 63 63 63 63 63 63 63
    hdfs://localhost/user/tom/input/smallfiles/e
    % hadoop fs -conf conf/hadoop-localhost.xml -text output/part-00001
    hdfs://localhost/user/tom/input/smallfiles/b   62 62 62 62 62 62 62 62 62 62
    hdfs://localhost/user/tom/input/smallfiles/d   64 64 64 64 64 64 64 64 64 64
    hdfs://localhost/user/tom/input/smallfiles/f   66 66 66 66 66 66 66 66 66 66

The input files were named a, b, c, d, e, and f, and each contained 10 characters of the corresponding letter (so, for example, a contained 10 "a" characters), except e, which was empty. We can see this in the textual rendering of the sequence files, which prints the filename followed by the hex representation of the file.

There's at least one way we could improve this program. As mentioned earlier, having one mapper per file is inefficient, so subclassing CombineFileInputFormat instead of FileInputFormat would be a better approach.
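The hex columns in the -text output are simply each file's bytes rendered in hexadecimal: 0x61 is the ASCII code for "a", 0x62 for "b", and so on, which is why the file named a shows ten 61s. A quick standalone check (HexDemo is a hypothetical helper for illustration, not part of Hadoop):

```java
import java.nio.charset.StandardCharsets;

public class HexDemo {
    // Render bytes as space-separated lowercase hex pairs,
    // mirroring the style of `hadoop fs -text` for a BytesWritable value.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Ten 'a' characters, like the contents of the small file named "a".
        byte[] contents = "aaaaaaaaaa".getBytes(StandardCharsets.US_ASCII);
        System.out.println(toHex(contents)); // prints 61 61 61 61 61 61 61 61 61 61
    }
}
```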
Also, for a related technique of packing files into a Hadoop Archive, rather than a sequence file, see the section "Hadoop Archives" on page 71.

Text Input

Hadoop excels at processing unstructured text. In this section, we discuss the different InputFormats that Hadoop provides to process text.

TextInputFormat

TextInputFormat is the default InputFormat. Each record is a line of input. The key, a LongWritable, is the byte offset within the file of the beginning of the line. The value is the contents of the line, excluding any line terminators (newline, carriage return), and is packaged as a Text object. So a file containing the following text:

    On the top of the Crumpetty Tree
    The Quangle Wangle sat,

    But his face you could not see,
    On account of his Beaver Hat.

is divided into one split of four records. The records are interpreted as the following key-value pairs:

    (0,  On the top of the Crumpetty Tree)
    (33, The Quangle Wangle sat,)
    (57, But his face you could not see,)
    (89, On account of his Beaver Hat.)

Clearly, the keys are not line numbers. This would be impossible to implement in general, in that a file is broken into splits at byte, not line, boundaries. Splits are processed independently. Line numbers are really a sequential notion: you have to keep a count of lines as you consume them, so knowing the line number within a split would be possible, but not within the file.

However, the offset within the file of each line is known by each split independently of the other splits, since each split knows the size of the preceding splits and just adds this on to the offsets within the split to produce a global file offset. The offset is usually sufficient for applications that need a unique identifier for each line. Combined with the file's name, it is unique within the filesystem. Of course, if all the lines are a fixed width, then calculating the line number is simply a matter of dividing the offset by the width.

The Relationship Between Input Splits and HDFS Blocks

The logical records that FileInputFormats define do not usually fit neatly into HDFS blocks. For example, a TextInputFormat's logical records are lines, which will cross HDFS boundaries more often than not. This has no bearing on the functioning of your program (lines are not missed or broken, for example), but it's worth knowing about, as it does mean that data-local maps (that is, maps that are running on the same host as their input data) will perform some remote reads. The slight overhead this causes is not normally significant.

Figure 7-3 shows an example. A single file is broken into lines, and the line boundaries do not correspond with the HDFS block boundaries.
Splits honor logical record boundaries, in this case lines, so we see that the first split contains line 5, even though it spans the first and second block. The second split starts at line 6.

Figure 7-3. Logical records and HDFS blocks for TextInputFormat
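The earlier fixed-width observation is easy to make concrete: if every record occupies the same number of bytes, the global byte offset divides evenly into a line number. A two-line sketch (the 30-byte record width is a hypothetical example, not from the text):

```java
public class LineFromOffset {
    // For fixed-width records, the 0-based line number is just
    // the byte offset divided by the record width (terminator included).
    static long lineNumber(long byteOffset, long recordWidth) {
        return byteOffset / recordWidth;
    }

    public static void main(String[] args) {
        long width = 30; // assume every line, newline included, is 30 bytes
        System.out.println(lineNumber(0, width));  // prints 0
        System.out.println(lineNumber(60, width)); // prints 2
        System.out.println(lineNumber(90, width)); // prints 3
    }
}
```

This works across splits precisely because the key TextInputFormat supplies is a global file offset, not an offset relative to the split.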

KeyValueTextInputFormat

TextInputFormat's keys, being simply the offset within the file, are not normally very useful. It is common for each line in a file to be a key-value pair, separated by a delimiter such as a tab character. For example, this is the output produced by TextOutputFormat, Hadoop's default OutputFormat. To interpret such files correctly, KeyValueTextInputFormat is appropriate.

You can specify the separator via the key.value.separator.in.input.line property. It is a tab character by default. Consider the following input file, where → represents a (horizontal) tab character:

    line1→On the top of the Crumpetty Tree
    line2→The Quangle Wangle sat,
    line3→But his face you could not see,
    line4→On account of his Beaver Hat.

As in the TextInputFormat case, the input is in a single split comprising four records, although this time the keys are the Text sequences before the tab in each line:

    (line1, On the top of the Crumpetty Tree)
    (line2, The Quangle Wangle sat,)
    (line3, But his face you could not see,)
    (line4, On account of his Beaver Hat.)

NLineInputFormat

With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input. The number depends on the size of the split and the length of the lines. If you want your mappers to receive a fixed number of lines of input, then NLineInputFormat is the InputFormat to use. Like TextInputFormat, the keys are the byte offsets within the file and the values are the lines themselves.

N refers to the number of lines of input that each mapper receives. With N set to one (the default), each mapper receives exactly one line of input.
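With the old (org.apache.hadoop.mapred) API, N is controlled through a job property, which can be set in a configuration file like any other. A sketch of such a fragment, setting two lines per mapper (the property name is the one used by the old-API NLineInputFormat; verify it against your Hadoop version):

```xml
<property>
  <name>mapred.line.input.format.linespermap</name>
  <value>2</value>
</property>
```

The same value could equally be set on the command line with -D, as in the mapred.reduce.tasks example earlier.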