This test was flaky because it tried to write data into /bin/ls.
Depending on the speed of the test run, this sometimes resulted
in a broken pipe on flush(), which caused the test to fail.
Reason: Bugfix (race condition in test)
Author: Todd Lipcon
Ref: UNKNOWN
commit ae699cda01c093097ae723224553773247577aa2
Author: Aaron Kimball
Date: Fri Mar 12 17:52:32 2010 -0800
HDFS-961. dfs_readdir incorrectly parses paths
Description: fuse-dfs dfs_readdir assumes that DistributedFileSystem#listStatus returns Paths with the same scheme/authority as the dfs.name.dir used to connect. If the NameNode.DEFAULT_PORT is used, listStatus returns Paths whose authorities omit the port (see HDFS-960), which breaks the following code.

Let's make the path parsing here more robust. listStatus returns normalized paths so we can find the start of the path by searching for the 3rd slash. A more long term solution is to have hdfsFileInfo maintain a path object or at least pointers to the relevant URI components.
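For illustration, a minimal sketch of the "third slash" parsing idea (the actual fix lives in the fuse-dfs C code; this Java helper is hypothetical):

// Hypothetical illustration of the "third slash" parsing idea; the real
// fix is in fuse-dfs C code, not Java.
public class PathParse {
  /** Return the path component of a normalized URI such as
   *  "hdfs://host:port/user/foo" by locating the third '/'. */
  static String pathPart(String uri) {
    int slashes = 0;
    for (int i = 0; i < uri.length(); i++) {
      if (uri.charAt(i) == '/' && ++slashes == 3) {
        return uri.substring(i);
      }
    }
    return "/"; // no path component after the authority
  }

  public static void main(String[] args) {
    // Works whether or not the authority carries an explicit port.
    System.out.println(pathPart("hdfs://nn.example.com:8020/user/foo"));
    System.out.println(pathPart("hdfs://nn.example.com/user/foo"));
  }
}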

which would omit the time-consuming table import step, generate Hive CREATE TABLE statements, and run them.

Also adds a --hive-overwrite flag, which allows overwriting an existing table definition.
Reason: New feature
Author: Leonid Furman
Ref: UNKNOWN
commit bdf576aa69eeb56a954416f7c2fcbe0136f421bd
Author: Aaron Kimball
Date: Fri Mar 12 17:51:16 2010 -0800
HADOOP-4012. Providing splitting support for bzip2 compressed files
Description: Hadoop assumes that if the input data is compressed, it cannot be split (mainly due to the limitation of many codecs, which need the whole input stream to decompress successfully). In such a case, Hadoop prepares only one split per compressed file, with the lower limit at 0 and the upper limit at the end of the file. The consequence of this decision is that one compressed file goes to a single mapper. Although this circumvents the codec limitation mentioned above, it substantially reduces the parallelism that splitting would otherwise make possible.

BZip2 is a compression/decompression algorithm that compresses data in blocks, and these compressed blocks can later be decompressed independently of each other. This is an opportunity: instead of one bzip2-compressed file going to one mapper, we can process chunks of the file in parallel. The correctness criterion for such processing is that, for a bzip2-compressed file, each compressed block should be processed by exactly one mapper, and ultimately all the blocks of the file should be processed. (By processing we mean the actual use of the uncompressed data, coming out of the codec, in a mapper.)

We are writing the code to implement this functionality. Although we have used bzip2 as an example, we have tried to extend Hadoop's compression interfaces so that any other codec with the same capability as bzip2 could easily use the splitting support. The details of these changes will be posted when we submit the code.

Reason: New feature
Author: Abdul Qadeer
Ref: UNKNOWN
commit 8e47288583fcdbdf649ddf3486bf201788e79202
Author: Aaron Kimball
Date: Fri Mar 12 17:50:51 2010 -0800
MAPREDUCE-707. Provide a jobconf property for explicitly assigning a job to a pool
Description: A common use case of the fair scheduler is to have one pool per user, but then to define some special pools for various production jobs, import jobs, etc. Therefore, it would be nice if jobs went by default to the pool of the user who submitted them, but there was a setting to explicitly place a job in another pool. Today, this can be achieved through a sort of trick in the JobConf:

This JIRA proposes to add a property called mapred.fairscheduler.pool that allows a job to be placed directly into a pool, avoiding the need for this trick.
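A minimal sketch of how the proposed property would be set on a job (the property name comes from the description above; the surrounding job setup is illustrative):

import org.apache.hadoop.mapred.JobConf;

public class PoolAssignment {
  public static void main(String[] args) {
    JobConf conf = new JobConf(PoolAssignment.class);
    conf.setJobName("nightly-import");
    // Place this job directly into the "production" pool instead of the
    // submitting user's default pool.
    conf.set("mapred.fairscheduler.pool", "production");
  }
}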

Reason: Configuration improvement
Author: Alan Heirich
Ref: UNKNOWN
commit 96e17e1e593b818a888c8dfc177b8fb36e514e8f
Author: Aaron Kimball
Date: Fri Mar 12 17:50:18 2010 -0800
MAPREDUCE-967. (version 2) TaskTracker does not need to fully unjar job jars
Description:
This is a performance improvement for jobs that contain a large number of
classes. The unpacking of these jars consumes a large amount of time, as
does the resulting cleanup. This patch changes the classpath to simply
include the jar itself, and only unpacks the lib/ directory out of the
jar in order to add those dependencies to the classpath.
Users who previously depended on this functionality for shipping non-code
dependencies can use the undocumented configuration parameter
"mapreduce.job.jar.unpack.pattern" to cause specific jar contents to be unpacked
This new patch version fixes a streaming regression where the "-file" argument
no longer worked. It includes a new unit test, TestFileArgs, to protect
against this regression.
Author: Todd Lipcon
Ref: UNKNOWN
commit cf08a128b87bbfae90babd61795599b3645d37a3
Author: Aaron Kimball
Date: Fri Mar 12 17:48:40 2010 -0800
HDFS-455, MAPREDUCE-1441, HADOOP-6534. Allow spaces in between comma-separated elements in directory list configurations.
Description: Make the NN and DN handle comma-separated configuration strings in an intuitive way.
The following configuration causes problems:
<property>
<name>dfs.data.dir</name>
<value>/mnt/hstore2/hdfs, /home/foo/dfs</value>
</property>

The problem is that the space after the comma causes the second storage directory to be " /home/foo/dfs", which is a path under a directory named <SPACE> (containing a sub-directory named "home") inside the hadoop datanode's default directory. This will typically cause the user's home partition to fill, but will be very hard for the user to diagnose, since a directory with a whitespace name is easy to overlook.

This fixes any configuration consisting of a comma-separated list of directories
(e.g., dfs.data.dir, dfs.name.dir, fs.checkpoint.dir, mapred.local.dir, etc.) so that
the elements may also contain separating whitespace. Without this patch,
setting mapred.local.dir to "/disk1, /disk2" would create a directory named " "
in the user's home directory, or fail outright. The patch trims the directory
names as they are fetched from the configuration.
Reason: Configuration improvement
Author: Todd Lipcon
Ref: UNKNOWN
commit 65a04ab8197a8db21a97d279ca881b5cd45a5365
Author: Aaron Kimball
Date: Fri Mar 12 17:48:03 2010 -0800
HADOOP-2366. Space in the value for dfs.data.dir can cause great problems
Description: The following configuration causes problems:

<property>
<name>dfs.data.dir</name>
<value>/mnt/hstore2/hdfs, /home/foo/dfs</value>
<description>
Determines where on the local filesystem an DFS data node should store its bl
ocks. If this is a comma-delimited list of directories, then data will be stor
ed in all named directories, typically on different devices. Directories that
do not exist are ignored.
</description>
</property>

The problem is that the space after the comma causes the second storage directory to be " /home/foo/dfs", which is a path under a directory named <SPACE> (containing a sub-directory named "home") inside the hadoop datanode's default directory. This will typically cause the user's home partition to fill, but will be very hard for the user to diagnose, since a directory with a whitespace name is easy to overlook.

My proposed solution would be to trimLeft all path names in this and similar properties after splitting on commas. This still allows spaces in file and directory names but avoids this problem.

This provides support in Configuration to get comma-separated string lists in such
a way that whitespace in between elements is ignored. This patch is required for
later patches which fix mapred.local.dir, dfs.data.dir, etc to support spaces
in between elements.
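A small sketch of the trimming behavior described above (hypothetical helper; the actual change lives in Configuration/StringUtils):

import java.util.ArrayList;
import java.util.List;

public class TrimmedList {
  /** Split a comma-separated value and trim whitespace around elements. */
  static List<String> getTrimmed(String value) {
    List<String> dirs = new ArrayList<String>();
    for (String s : value.split(",")) {
      String t = s.trim();
      if (!t.isEmpty()) {
        dirs.add(t);
      }
    }
    return dirs;
  }

  public static void main(String[] args) {
    // Yields [/mnt/hstore2/hdfs, /home/foo/dfs] rather than a second
    // element beginning with a space.
    System.out.println(getTrimmed("/mnt/hstore2/hdfs, /home/foo/dfs"));
  }
}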
Test plan: unit tested in TestStringUtils
Reason: Configuration improvement
Author: Michele (@pirroh) Catasta
Ref: UNKNOWN
commit 8d4807322a42509726b376b37a89739acd6cbd7d
Author: Aaron Kimball
Date: Fri Mar 12 17:47:55 2010 -0800
MAPREDUCE-1356. Allow user-specified hive table name in sqoop
Description: The table name used in a hive-destination import is currently pegged to the input table name. This should be user-configurable.
Reason: New feature
Author: Aaron Kimball
Ref: UNKNOWN
commit 8bf3439ff69762a33967dca4abb15c0cd2bb8417
Author: Aaron Kimball
Date: Fri Mar 12 17:47:45 2010 -0800
MAPREDUCE-1395. Sqoop does not check return value of Job.waitForCompletion()
Description: Old code depended on JobClient.runJob() throwing IOException on failure. Job.waitForCompletion can fail in that manner, or it can fail by returning false. Sqoop needs to check for this condition.
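The fix amounts to checking the boolean result as well as catching exceptions; roughly (sketch against the new-API Job class):

import java.io.IOException;
import org.apache.hadoop.mapreduce.Job;

public class RunImport {
  static void runJob(Job job) throws IOException,
      InterruptedException, ClassNotFoundException {
    // waitForCompletion() can fail by throwing, or by returning false;
    // both must be treated as a failed import.
    if (!job.waitForCompletion(false)) {
      throw new IOException("Import job failed!");
    }
  }
}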
Reason: bugfix
Author: Aaron Kimball
Ref: UNKNOWN
commit bd4e81234dd12fa9534577f0caa0db5c3d0a99fc
Author: Aaron Kimball
Date: Fri Mar 12 17:47:30 2010 -0800
CLOUDERA-BUILD. Set HADOOP_PID_DIR to something smarter than /tmp
Author: Chad Metcalf
commit 2466310d0e2a426e848860e9a8411b8ea14e1bb1
Author: Aaron Kimball
Date: Fri Mar 12 17:47:07 2010 -0800
HADOOP-6453. Hadoop wrapper script shouldn't ignore an existing JAVA_LIBRARY_PATH
Description: Currently the hadoop wrapper script assumes it is the only place that uses JAVA_LIBRARY_PATH and initializes it to an empty value:

JAVA_LIBRARY_PATH=''

This prevents anyone from setting this outside of the hadoop wrapper (say hadoop-config.sh) for their own native libraries.

The fix is pretty simple: don't initialize it to '', and append the native libs as normal.

Reason: Bugfix (environment)
Author: Chad Metcalf
Ref: UNKNOWN
commit a67b4b1c361c26e002da64953a7f8bc068d29b98
Author: Aaron Kimball
Date: Fri Mar 12 17:46:42 2010 -0800
MAPREDUCE-1327. Oracle database import via sqoop fails when a table contains the column types such as TIMESTAMP(6) WITH LOCAL TIME ZONE and TIMESTAMP(6) WITH TIME ZONE
Description: When an Oracle table contains columns of the types "TIMESTAMP(6) WITH LOCAL TIME ZONE" or "TIMESTAMP(6) WITH TIME ZONE", Sqoop fails to map values for those columns to valid Java data types, resulting in the following exception:

ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.hadoop.sqoop.orm.ClassWriter.generateFields(ClassWriter.java:253)
at org.apache.hadoop.sqoop.orm.ClassWriter.generateClassForColumns(ClassWriter.java:701)
at org.apache.hadoop.sqoop.orm.ClassWriter.generate(ClassWriter.java:597)
at org.apache.hadoop.sqoop.Sqoop.generateORM(Sqoop.java:75)
at org.apache.hadoop.sqoop.Sqoop.importTable(Sqoop.java:87)
at org.apache.hadoop.sqoop.Sqoop.run(Sqoop.java:175)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.sqoop.Sqoop.main(Sqoop.java:201)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)

Reason: Compatibility improvement
Author: Leonid Furman
Ref: UNKNOWN
commit a937ba2b9b6132883d727f856911ae31d22ad619
Author: Aaron Kimball
Date: Fri Mar 12 17:46:26 2010 -0800
MAPREDUCE-1394. Sqoop generates incorrect URIs in paths sent to Hive
Description: Hive used to require a ':8020' in HDFS URIs used with LOAD DATA statements, even though the normalized form of such a URI does not contain an explicit port number (since 8020 is the default port). Sqoop matched this by hacking the URI strings it forwarded to Hive.

This is version two of this patch. The previous patch fixed some systems
but broke others.
Reason: Bugfix
Author: Todd Lipcon
Ref: UNKNOWN
commit 7fafe032223921ad194c69b16ab451b4aade87fa
Author: Aaron Kimball
Date: Fri Mar 12 17:43:41 2010 -0800
HADOOP-4368. Superuser privileges required to do "df"
Description: Superuser privileges are required in DFS in order to get the file system statistics (FSNamesystem.java, getStats method). This means that when HDFS is mounted via fuse-dfs as a non-root user, "df" is going to return 16 exabytes total and 0 free instead of the correct amounts.

As far as I can tell, there's no need to require super user privileges to see the file system size (and historically in Unix, this is not required).

To fix this, simply comment out the privilege check in the getStats method.

This prevents me from monitoring DFS datanodes through Hadoop using the JMX interface; in order to do that, you must be able to specify the bean name on the command line.

The fix is simple, patch will be coming momentarily. However, there was probably a reason for making the datanodes all unique names which I'm unaware of, so it'd be nice to hear from the metrics maintainer.

Reason: Monitoring improvement
Author: Brian Bockelman
Ref: UNKNOWN
commit 5dfcc6d2d7806636c6237996e1b28a00ba075b4b
Author: Aaron Kimball
Date: Fri Mar 12 17:43:05 2010 -0800
HADOOP-6503. contrib projects should pull in the ivy-fetched libs from the root project
Description: On branch-20 currently, I get an error just running "ant contrib -Dtestcase=TestHdfsProxy". In a full "ant test" build sometimes this doesn't appear to be an issue. The problem is that the contrib projects don't automatically pull in the dependencies of the "Hadoop" ivy project. Thus, they each have to declare all of the common dependencies like commons-cli, etc. Some are missing and this causes test failures.
Reason: Build system improvement
Author: Todd Lipcon
Ref: UNKNOWN
commit be70b10f11445f4a71807405718bfeebd38ad924
Author: Aaron Kimball
Date: Fri Mar 12 17:42:51 2010 -0800
MAPREDUCE-1155. Streaming tests swallow exceptions
Description: Many of the streaming tests (including TestMultipleArchiveFiles) catch exceptions and print their stack trace rather than failing the job. This means that tests do not fail even when the job fails.
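The pattern being fixed, in sketch form (illustrative, not the actual test code):

public class SwallowSketch {
  // Before: the test "passes" even when the streaming job blows up.
  void runJobSwallowing() {
    try {
      runStreamJob();
    } catch (Exception e) {
      e.printStackTrace(); // failure disappears into the logs
    }
  }

  // After: let the exception propagate so JUnit marks the test failed.
  void runJobFailing() throws Exception {
    runStreamJob();
  }

  void runStreamJob() throws Exception {
    // placeholder for the actual streaming job submission
  }
}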
Reason: Test coverage improvement
Author: Todd Lipcon
Ref: UNKNOWN
commit f84830ae5e6c862cd0e2b8ebea57880e54c8a082
Author: Aaron Kimball
Date: Fri Mar 12 17:42:33 2010 -0800
HADOOP-5647. TestJobHistory fails if /tmp/_logs is not writable to. Testcase should not depend on /tmp
Description: TestJobHistory sets /tmp as hadoop.job.history.user.location to check if the history file is created in that directory or not. If /tmp/_logs is already created by some other user, this test will fail because of not having write permission.
Reason: Bugfix in test harness
Author: Ravi Gummadi
Ref: UNKNOWN
commit 669b65f14d78ffd1cf0304cf459d1abbae3412ae
Author: Aaron Kimball
Date: Fri Mar 12 17:42:15 2010 -0800
CLOUDERA-BUILD. Fix javadoc warnings shown by test-patch, and update eclipse classpath to match current CDH.
Author: Todd Lipcon
commit 51804fd45d3a527a130a373c591a17c185102a0c
Author: Aaron Kimball
Date: Fri Mar 12 17:41:40 2010 -0800
Revert "HDFS-127: DFSClient block read failures cause open DFSInputStream to become unusable"
Description: This is being reverted as it causes infinite retries when there are no valid replicas.
Reason: bugfix
Author: Todd Lipcon
Ref: UNKNOWN
commit 623bfc0c18087274315dfbd41d025a8a775abe80
Author: Aaron Kimball
Date: Fri Mar 12 17:40:30 2010 -0800
HDFS-877. Client-driven block verification not functioning
Description: This is actually the reason for HDFS-734 (TestDatanodeBlockScanner timing out). The issue is that DFSInputStream relies on readChunk being called one last time at the end of the file in order to receive the lastPacketInBlock=true packet from the DN. However, DFSInputStream.read checks pos < getFileLength() before issuing the read. Thus gotEOS never shifts to true and checksumOk() is never called.
This is a simpler patch than the one on 0.21/0.22 since those fix a further regression
since 0.20.
Reason: bugfix
Author: Todd Lipcon
Ref: UNKNOWN
commit b332fe77255047409da701dfb97df1bddb5b10cb
Author: Aaron Kimball
Date: Fri Mar 12 17:40:05 2010 -0800
CLOUDERA-BUILD. Add mockito to 0.20 branch for easier unit testing of HDFS stability patches.
Reason: Test coverage improvement
Author: Todd Lipcon
commit 44a6c559de056b35c6eb2e2d53798c88d8c779e6
Author: Aaron Kimball
Date: Fri Mar 12 17:39:09 2010 -0800
HDFS-630. In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
Description: Created from HDFS-200.

If during a write the dfsclient sees that a block replica location for a newly allocated block is not connectable, it re-requests the NN for a fresh set of replica locations for the block. It tries this dfs.client.block.write.retries times (default 3), sleeping 6 seconds between each retry (see DFSClient.nextBlockOutputStream).

This setting works well when you have a reasonably sized cluster; if you have only a few datanodes in the cluster, every retry may pick the dead datanode again and the above logic bails out.

Our solution: when getting block locations from the namenode, we give the NN the list of excluded datanodes. The list of dead datanodes applies only to one block allocation.
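Conceptually, the client-side retry loop changes along these lines (all names here are hypothetical stand-ins, not the real DFSClient internals):

import java.util.ArrayList;
import java.util.List;

public class ExcludeRetrySketch {
  // Hypothetical stand-ins for the real DFSClient/NameNode types.
  interface Namenode {
    String[] addBlock(String file, List<String> excludedNodes);
  }

  static String[] locateBlock(Namenode nn, String file, int maxRetries) {
    List<String> excluded = new ArrayList<String>();
    for (int i = 0; i < maxRetries; i++) {
      String[] targets = nn.addBlock(file, excluded);
      String bad = firstUnreachable(targets);
      if (bad == null) {
        return targets; // all replica locations are connectable
      }
      // Remember the dead datanode so the NN won't hand it back on the
      // next attempt; the exclusion list lives only for this block.
      excluded.add(bad);
    }
    throw new RuntimeException("Could not allocate block for " + file);
  }

  static String firstUnreachable(String[] targets) {
    return null; // placeholder: a real client would try to connect
  }
}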

What is odd is this: the final patch of HADOOP-6426 does include the stub <target> files needed, yet they aren't in SVN_HEAD. Which implies that a different version may have gone in than intended.

Reason: Build system bugfix
Author: Tom White
Ref: UNKNOWN
commit 083a6a1cfb2a5198243aa82a020681ad62da5938
Author: Aaron Kimball
Date: Fri Mar 12 17:33:58 2010 -0800
HADOOP-6444. Support additional security group option in hadoop-ec2 script
Description: When deploying a hadoop cluster on EC2 alongside other services, it is very useful to be able to specify additional (pre-existing) security groups to facilitate access control. For example, one could use this feature to add a cluster to a generic "hadoop" group, which authorizes HDFS access from instances outside the cluster. Without such an option, the access control for the security groups created by the script needs to be manually updated after cluster launch.
Reason: Security improvement
Author: Paul Egan
Ref: UNKNOWN
commit 63152ce4ba3c0cf2006016cc825fc72b0bd23d2d
Author: Aaron Kimball
Date: Fri Mar 12 17:33:49 2010 -0800
HADOOP-6426. Create ant build for running EC2 unit tests
Description: There is no easy way currently to run the Python unit tests for the cloud contrib.
Reason: Test coverage improvement
Author: Tom White
Ref: UNKNOWN
commit a20069b2adfafa59e0001fe5e5685d36d9eb7fee
Author: Aaron Kimball
Date: Fri Mar 12 17:33:15 2010 -0800
HADOOP-6392. Run namenode and jobtracker on separate EC2 instances
Description: Replace concept of "master" with that of "namenode" and "jobtracker". Still need to be able to run both on one node, of course.
Reason: Scalability improvement
Author: Tom White
Ref: UNKNOWN
commit 361221a2a082d0ab7a87ba0226dbe05938440738
Author: Aaron Kimball
Date: Fri Mar 12 17:33:07 2010 -0800
HADOOP-6108. Add support for EBS storage on EC2
Description: By using EBS for namenode and datanode storage we can have persistent, restartable Hadoop clusters running on EC2.
Reason: New feature
Author: Tom White
Ref: UNKNOWN
commit 4ca1c78e1b257eefa10b5ed94479df8a6473d3e9
Author: Aaron Kimball
Date: Fri Mar 12 17:32:50 2010 -0800
HDFS-861. fuse-dfs does not support O_RDWR
Description: Some applications (for us, the big one is rsync) will open a file in read-write mode when it really only intends to read or write, not both. fuse-dfs should try not to fail until the application actually tries to write to a pre-existing file or read from a newly created file.
Reason: bugfix
Author: Brian Bockelman
Ref: UNKNOWN
commit 00f6976093cc20ea825a35f6831f645dc5f61637
Author: Aaron Kimball
Date: Fri Mar 12 17:32:17 2010 -0800
HDFS-860. fuse-dfs truncate behavior causes issues with scp
Description: For whatever reason, scp issues a "truncate" once it has written a file, truncating the file to the number of bytes it has written (i.e., if a file is X bytes, it calls truncate(X)).

AS t" to get meta information is too expensive for big tables
Description: The SqlManager uses the query "SELECT t.* FROM <table> AS t" to get the table specification, which is too expensive for big tables, and it was called twice to generate column names and types. For tables that are big enough to be worth map-reducing, this is too expensive for sqoop to be useful.
Reason: Performance improvement
Author: Spencer Ho
Ref: UNKNOWN
commit 1198ef1375387ba107d46f0ab5e9a7c6a7645931
Author: Aaron Kimball
Date: Fri Mar 12 17:28:15 2010 -0800
MAPREDUCE-706. Support for FIFO pools in the fair scheduler
Description: The fair scheduler should support making the internal scheduling algorithm for some pools be FIFO instead of fair sharing in order to work better for batch workloads. FIFO pools will behave exactly like the current default scheduler, sorting jobs by priority and then submission time. Pools will have their scheduling algorithm set through the pools config file, and it will be changeable at runtime.

To support this feature, I'm also changing the internal logic of the fair scheduler to no longer use deficits. Instead, for fair sharing, we will assign tasks to the job farthest below its share as a ratio of its share. This is easier to combine with other scheduling algorithms and leads to a more stable sharing situation, avoiding unfairness issues brought up in MAPREDUCE-543 and MAPREDUCE-544 that happen when some jobs have long tasks. The new preemption (MAPREDUCE-551) will ensure that critical jobs can gain their fair share within a bounded amount of time.
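In sketch form, the deficit-free fair-sharing order compares jobs by how far below their share they are, as a ratio (hypothetical bookkeeping classes, not the scheduler's real ones):

import java.util.Comparator;

public class FairShareOrder implements Comparator<FairShareOrder.JobInfo> {
  // Hypothetical per-job bookkeeping, not the scheduler's real classes.
  public static class JobInfo {
    double runningTasks;
    double fairShare; // assumed > 0 for runnable jobs
  }

  public int compare(JobInfo a, JobInfo b) {
    // The job with the smallest runningTasks/fairShare ratio is farthest
    // below its fair share and is scheduled first.
    return Double.compare(a.runningTasks / a.fairShare,
                          b.runningTasks / b.fairShare);
  }
}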

Copy failed: java.io.IOException: wrong value class: org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus is not class org.apache.hadoop.fs.FileStatus
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:988)
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:977)
at org.apache.hadoop.tools.DistCp.deleteNonexisting(DistCp.java:1226)
at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1134)
at org.apache.hadoop.tools.DistCp.copy(DistCp.java:650)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:857)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)

Reason: bugfix
Author: Peter Romianowski
Ref: UNKNOWN
commit 34bb813a5884aeb05909c2ce2cc541882ca3eda1
Author: Aaron Kimball
Date: Fri Mar 12 17:27:53 2010 -0800
MAPREDUCE-764. TypedBytesInput's readRaw() does not preserve custom type codes
Description: The typed bytes format supports byte sequences of the form <custom type code> <length> <bytes>. When reading such a sequence via TypedBytesInput's readRaw() method, however, the returned sequence currently is 0 <length> <bytes> (0 is the type code for a bytes array), which leads to bugs such as the one described here.
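The framing at issue, in sketch form: readRaw() must re-emit the original type code instead of substituting 0. This is a hypothetical illustration, not the actual TypedBytesInput code:

import java.io.DataInputStream;
import java.io.IOException;

public class RawReadSketch {
  /** Read one <code><length><bytes> record, preserving the type code. */
  static byte[] readRaw(DataInputStream in) throws IOException {
    int code = in.readUnsignedByte();   // may be a custom app-specific code
    int length = in.readInt();
    byte[] raw = new byte[5 + length];
    raw[0] = (byte) code;               // keep the original code, not 0
    raw[1] = (byte) (length >>> 24);    // big-endian length, as readInt saw it
    raw[2] = (byte) (length >>> 16);
    raw[3] = (byte) (length >>> 8);
    raw[4] = (byte) length;
    in.readFully(raw, 5, length);
    return raw;
  }
}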
Reason: bugfix
Author: Klaas Bosteels
Ref: UNKNOWN
commit 7fd2cb371354219abd108fda35087f08dc481b35
Author: Aaron Kimball
Date: Fri Mar 12 17:27:31 2010 -0800
HADOOP-6400. Log errors getting Unix UGI
Description: For various reasons, the calls out to `whoami` and `id` can fail when trying to get the unix UGI information. Currently it silently ignores failures and uses the default DrWho/Tardis ugi. This is extremely confusing for users - we should log the exception at warn level when the shell execs fail.
Reason: Debug logging improvement
Author: Todd Lipcon
Ref: UNKNOWN
commit d6dc22fecc058e12695a481fa354078d9b012089
Author: Aaron Kimball
Date: Fri Mar 12 17:27:21 2010 -0800
MAPREDUCE-1293. AutoInputFormat doesn't work with non-default FileSystems
Description: AutoInputFormat uses the wrong FileSystem.get() method when getting a reference to a FileSystem object. AutoInputFormat gets the default FileSystem, so this method breaks if the InputSplit's path is pointing to a different FileSystem.
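The general fix for this class of bug is to resolve the FileSystem from the split's own path rather than from the default configuration; a minimal sketch:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsLookup {
  static FileSystem forPath(Path path, Configuration conf) throws IOException {
    // Wrong: always returns the default FileSystem, even when the path
    // points at S3 or another cluster:
    //   FileSystem fs = FileSystem.get(conf);
    // Right: resolve the FileSystem from the path's own scheme/authority.
    return path.getFileSystem(conf);
  }
}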
Reason: bugfix
Author: Andrew Hitchcock
Ref: UNKNOWN
commit 25a4ea86b0b085e3afd6f2f040201594155b3de1
Author: Aaron Kimball
Date: Fri Mar 12 17:27:09 2010 -0800
MAPREDUCE-1131. Using profilers other than hprof can cause JobClient to report job failure
Description: If task profiling is enabled, the JobClient will download the profile.out file created by the tasks under profile. If this causes an IOException, the job is reported as a failure to the client, even though all the tasks themselves may complete successfully. The expected result files are assumed to be generated by hprof. Using the profiling system with other profilers will cause job failure.
Reason: compatibility bugfix
Author: Aaron Kimball
Ref: UNKNOWN
commit ab98123c7114752945452af0b96c8de04af9ba93
Author: Aaron Kimball
Date: Fri Mar 12 17:26:02 2010 -0800
MAPREDUCE-370. Change org.apache.hadoop.mapred.lib.MultipleOutputs to use new api.
Description: Ports the MultipleOutputs OutputFormat to the new context-based API.
Reason: API compatibility improvement.
Author: Amareshwari Sriramadasu
Ref: UNKNOWN
commit 50726d13750f3f71d2fc5d3a012ce81aa2adb26d
Author: Aaron Kimball
Date: Fri Mar 12 17:24:46 2010 -0800
CLOUDERA-BUILD. Backport MapReduceTestUtil to Hadoop 0.20
Description: MapReduceTestUtil is required for unit tests in subsequent
patches, but this class itself was not created in one clean JIRA. Therefore
it was backported "As-is" from the trunk and not in a patch-wise fashion.
This class is only used in the JUnit tests for Hadoop.
Author: Aaron Kimball
Reason: Testing improvement
Ref: UNKNOWN
commit d713dc1063afc4967381b6583ec424d2850bac63
Author: Aaron Kimball
Date: Fri Mar 12 17:24:30 2010 -0800
MAPREDUCE-1059. distcp can generate uneven map task assignments
Description: distcp writes out a SequenceFile containing the source files to transfer, and their sizes. Map tasks are created over spans of this file, representing files which each mapper should transfer. In practice, some transfer loads yield many empty map tasks and a few tasks perform the bulk of the work.
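A sketch of the size-balanced direction the fix takes (illustrative greedy chunking, not the actual DistCp code):

import java.util.ArrayList;
import java.util.List;

public class BalancedChunks {
  /** Greedily group file sizes so each map gets roughly bytes/nMaps work. */
  static List<List<Long>> assign(long[] fileSizes, int nMaps) {
    long total = 0;
    for (long s : fileSizes) total += s;
    long perMap = Math.max(1, total / nMaps);

    List<List<Long>> chunks = new ArrayList<List<Long>>();
    List<Long> current = new ArrayList<Long>();
    long currentBytes = 0;
    for (long s : fileSizes) {
      current.add(s);
      currentBytes += s;
      if (currentBytes >= perMap && chunks.size() < nMaps - 1) {
        chunks.add(current);
        current = new ArrayList<Long>();
        currentBytes = 0;
      }
    }
    chunks.add(current); // remainder goes to the last map
    return chunks;
  }
}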
Reason: Improvement for load balancing
Author: Aaron Kimball
Ref: UNKNOWN
commit 855b0bf3718f2c397ef79967475468e4153f120a
Author: Aaron Kimball
Date: Fri Mar 12 17:24:20 2010 -0800
MAPREDUCE-1128. MRUnit Allows Iteration Twice
Description: MRUnit allows one to iterate over a collection of values twice (ie.

I personally prefer option (2) since we can ensure plugin API compatibility at compile-time, and we avoid an ugly switch statement in a runHook() function.

Interested to hear what people's thoughts are here.

HADOOP-5640 puts this in the new test dir. It needs to be in the old one.
Reason: Improvement
Author: Todd Lipcon
Ref: UNKNOWN
commit e9b04609d88ed5d1af442ee950aa5dcd6646e830
Author: Aaron Kimball
Date: Fri Mar 12 17:22:08 2010 -0800
MAPREDUCE-1017. Compression and output splitting for Sqoop
Description: Sqoop "direct mode" writing will generate a single large text file in HDFS. It is important to be able to compress this data before it reaches HDFS. Due to the difficulty in splitting compressed files in HDFS for use by MapReduce jobs, data should also be split at compression time.
Reason: New feature
Author: Aaron Kimball
Ref: UNKNOWN
commit 8c9b473e1af036a3e2cc9036a945a4567277db8a
Author: Aaron Kimball
Date: Fri Mar 12 17:21:14 2010 -0800
HADOOP-6312. Configuration sends too much data to log4j
Description: Configuration objects send a DEBUG-level log message, including a full stack trace, every time they are instantiated. This is more appropriate for TRACE-level logging, as it renders other debug logs very hard to read.
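The change amounts to demoting the construction-site logging, along these lines (commons-logging; illustrative sketch):

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class CtorLogging {
  private static final Log LOG = LogFactory.getLog(CtorLogging.class);

  CtorLogging() {
    // TRACE, not DEBUG: a full stack trace per instantiation drowns out
    // every other debug message in the log.
    if (LOG.isTraceEnabled()) {
      LOG.trace("Instantiated " + this, new Throwable("stack trace"));
    }
  }
}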
Reason: Logging improvement
Author: Aaron Kimball
Ref: UNKNOWN
commit 698fe169f31e54111d30e4420cd1c1c5eaeecdec
Author: Aaron Kimball
Date: Fri Mar 12 17:21:03 2010 -0800
HDFS-686. NullPointerException is thrown while merging edit log and image
Description: Our secondary name node is not able to start due to a NullPointerException:
ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1232)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1221)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:776)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:992)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.doMerge(SecondaryNameNode.java:590)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.access$000(SecondaryNameNode.java:473)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:350)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:314)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:225)
at java.lang.Thread.run(Thread.java:619)

Reason: Bugfix
Author: Todd Lipcon
Ref: UNKNOWN
commit 34ca2a5547398f9435a5d3d22603d0f7da420226
Author: Aaron Kimball
Date: Fri Mar 12 17:17:48 2010 -0800
MAPREDUCE-551. Add preemption to the fair scheduler
Description: Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long due to having to process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preempting under two conditions:
1) A job isn't getting its guaranteed share of the cluster for at least T1 seconds.
2) A job is getting significantly less than its fair share for T2 seconds (e.g. less than half its share).

T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.

When deciding which tasks to kill to make room for the job, we will use the following heuristics:

Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).

For maps: kill tasks that have run for the least amount of time (limiting wasted time).

For reduces: similar to maps, but give extra preference for reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption - when a job has a long map phase and its reducers are mostly sitting idle and filling up slots).

In SecondaryNameNode.getInfoServer, the 2NN should notice a "0.0.0.0" dfs.http.address and, in that case, pull the hostname out of fs.default.name. This would fix the default configuration to work properly for most users.

Reason: Configuration improvement
Author: Todd Lipcon
Ref: UNKNOWN
commit 74e10e4a137b2aa60ab39186115350b5e82464fc
Author: Aaron Kimball
Date: Fri Mar 12 17:11:50 2010 -0800
HDFS-127. DFSClient block read failures cause open DFSInputStream to become unusable
Description: We are using some Lucene indexes directly from HDFS, and for quite a long time we were using Hadoop version 0.15.3.

When we tried to upgrade to Hadoop 0.19, index searches started to fail with exceptions like:
2008-11-13 16:50:20,314 WARN [Listener-4] [] DFSClient : DFS Read: java.io.IOException: Could not obtain block: blk_5604690829708125511_15489 file=/usr/collarity/data/urls-new/part-00000/20081110-163426/_0.tis
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1708)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1536)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1663)
at java.io.DataInputStream.read(DataInputStream.java:132)
at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:174)
at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:152)
at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:63)
at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)
at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)
at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:223)
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:217)
at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:54)
...

The investigation showed that the root of this issue is that we exceeded the number of xcievers in the data nodes, and that was fixed by raising the configured limit to 2k.
However, one thing that bothered me was that even after the datanodes recovered from overload and most of the client servers had been shut down, we still observed errors in the logs of running servers.
Further investigation showed that the fix for HADOOP-1911 introduced another problem: the DFSInputStream instance might become unusable once the number of failures over the lifetime of the instance exceeds a configured threshold.

The fix for this specific issue seems to be trivial: just reset the failure counter before reading the next block (patch will be attached shortly).

This also seems related to HADOOP-3185, but I'm not sure I really understand the necessity of keeping track of failed block accesses in the DFS client.

Copy failed: java.lang.NullPointerException
at org.apache.hadoop.fs.s3.S3FileSystem.makeAbsolute(S3FileSystem.java:121)
at org.apache.hadoop.fs.s3.S3FileSystem.getFileStatus(S3FileSystem.java:332)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:633)
at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1005)
at org.apache.hadoop.tools.DistCp.copy(DistCp.java:650)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:857)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:884)

This issue completes the feature mentioned in HADOOP-2838. HADOOP-2838 provided a way to set env variables in the child process. This issue provides a way to inherit the TT's env variables and append to or reset them. So now
X=$X:y will inherit X (if present) and append y to it.

Reason: Bugfix
Author: Amar Kamat
Ref: UNKNOWN
commit eb635e4de3a8b2b5bd9f34225770f24be42dcd83
Author: Chad Metcalf
Date: Tue Sep 15 22:29:50 2009 -0700
HADOOP-5981: HADOOP-2838 doesnt work as expected
commit 5d4e93d8e0df3c445f56c5eb51965eef92bebd78
Author: Aaron Kimball
Date: Fri Mar 12 17:09:46 2010 -0800
HADOOP-2838. Add HADOOP_LIBRARY_PATH config setting so Hadoop will include external directories for jni
Description: Currently there is no way to configure Hadoop to use external JNI directories. I propose we add a new variable like HADOOP_CLASS_PATH that is added to the JAVA_LIBRARY_PATH before the process is run.

Now users can set environment variables using mapred.child.env. They can do the following:
X=Y : set X to Y
X=$X:Y : Append Y to X (which should be taken from the tasktracker)

-mapper <cmd|JavaClassName> The streaming command to run
-combiner <JavaClassName> Combiner has to be a Java class
-reducer <cmd|JavaClassName> The streaming command to run

Reason: Usability improvement
Author: Amareshwari Sriramadasu
Ref: UNKNOWN
commit 33e4f0a87effa466914e292488c47977245edc96
Author: Aaron Kimball
Date: Fri Mar 12 17:04:06 2010 -0800
MAPREDUCE-987. Exposing MiniDFS and MiniMR clusters as a single process command-line
Description: It's hard to test non-Java programs that rely on significant mapreduce functionality. The patch I'm proposing shortly will let you just type "bin/hadoop jar hadoop-hdfs-hdfswithmr-test.jar minicluster" to start a cluster (internally, it's using Mini{MR,HDFS}Cluster) with a specified number of daemons, etc. A test that checks how some external process interacts with Hadoop might start minicluster as a subprocess, run through its thing, and then simply kill the java subprocess.

I've been using just such a system for a couple of weeks, and I like it. It's significantly easier than developing a lot of scripts to start a pseudo-distributed cluster, and then clean up after it. I figure others might find it useful as well.

I'm at a bit of a loss as to where to put it in 0.21. hdfs-with-mr tests have all the required libraries, so I've put it there. I could conceivably split this into "minimr" and "minihdfs", but it's specifically the fact that they're configured to talk to each other that I like about having them together. And one JVM is better than two for my test programs.

Reason: Testing feature
Author: Philip Zeyliger
Ref: UNKNOWN
commit 39ff7e5ee285df97c765a73271066df718be0e30
Author: Aaron Kimball
Date: Fri Mar 12 17:03:23 2010 -0800
HADOOP-6267. build-contrib.xml unnecessarily enforces that contrib projects be located in contrib/ dir
Description: build-contrib.xml currently sets hadoop.root to ${basedir}/../../../. This path is relative to the contrib project which is assumed to be inside src/contrib/. We occasionally work on contrib projects in other repositories until they're ready to contribute. We can use the <dirname> ant task to do this more correctly.
Reason: Build system improvement
Author: Todd Lipcon
Ref: UNKNOWN
commit 139bea6660193cc73852832e03fe570437343e96
Author: Aaron Kimball
Date: Fri Mar 12 15:02:55 2010 -0800
HDFS-528. Add ability for safemode to wait for a minimum number of live datanodes
Description: When starting up a fresh cluster programmatically, users often want to wait until DFS is "writable" before continuing in a script. "dfsadmin -safemode wait" doesn't quite work for this on a completely fresh cluster, since when there are 0 blocks on the system, 100% of them are accounted for before any DNs have reported.

This JIRA is to add a command which waits until a certain number of DNs have reported as alive to the NN.

should not use direct calls to the name-node, but rather call DistributedFileSystem methods.

Reason: Test coverage improvement
Author: Konstantin Shvachko
Ref: UNKNOWN
commit f04a321596a513e71354f2a6829b44e474077507
Author: Aaron Kimball
Date: Fri Mar 12 15:02:22 2010 -0800
HADOOP-5650. Namenode log that indicates why it is not leaving safemode may be confusing
Description: A namenode with a large number of data blocks is set up with dfs.safemode.threshold.pct set to 1.0. With a small number of unreported blocks, the namenode prints the following as the reason for not leaving safe mode: "The ratio of reported blocks 1.0000 has not reached the threshold 1.0000".

With a large number of blocks, the precision used for printing the log may not show the difference between the actual ratio of safe blocks to total blocks and the configured threshold. Printing the number of blocks instead of the ratio will improve clarity.

Reason: Bugfix
Author: Aaron Kimball
Ref: UNKNOWN
commit e97883c5b9c389f82a6447e4cb1678c0a0ed83ba
Author: Aaron Kimball
Date: Fri Mar 12 14:57:19 2010 -0800
CLOUDERA-BUILD. Sqoop asciidoc syntax error
Author: Aaron Kimball
commit 520bda2edcb90dfe9461e16b96aa4a048d33ed7b
Author: Aaron Kimball
Date: Fri Mar 12 14:57:11 2010 -0800
HADOOP-5450. Add support for application-specific typecodes to typed bytes
Description: For serializing objects of types that are not supported by typed bytes serialization, applications might want to use a custom serialization format. Right now, typecode 0 has to be used for the bytes resulting from this custom serialization, which can lead to problems when deserializing the objects because the application cannot know whether a byte sequence following typecode 0 is a custom-serialized object or just a raw sequence of bytes. Therefore, a range of typecodes that are treated as aliases for 0 should be added, so that different typecodes can be used for application-specific purposes.
Reason: New feature
Author: Klaas Bosteels
Ref: UNKNOWN
commit b30fc99332c4a444d275731dac4b4245115d65b2
Author: Aaron Kimball
Date: Fri Mar 12 14:56:59 2010 -0800
HADOOP-1722. Make streaming to handle non-utf8 byte array
Description: Right now, the streaming framework expects the outputs of the stream process (mapper or reducer) to be line-oriented UTF-8 text. This limit makes it impossible to use programs whose outputs may be non-UTF-8 (international encodings, or even binary data). Streaming can overcome this limit by introducing a simple encoding protocol. For example, it can allow the mapper/reducer to hex-encode its keys/values, and the framework decodes them on the Java side. This way, as long as the mapper/reducer executables follow this encoding protocol, they can output arbitrary byte arrays and the streaming framework can handle them.
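One possible shape of the hex-decoding half of such a protocol, as a sketch (illustrative only; the issue's description offers hex encoding as one example, not a settled design):

public class HexDecode {
  /** Decode a hex-encoded field emitted by a streaming mapper/reducer. */
  static byte[] fromHex(String hex) {
    byte[] out = new byte[hex.length() / 2];
    for (int i = 0; i < out.length; i++) {
      out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
    }
    return out;
  }

  public static void main(String[] args) {
    byte[] value = fromHex("deadbeef");
    System.out.println(value.length + " bytes decoded"); // 4 bytes decoded
  }
}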
Reason: New feature
Author: Klaas Bosteels
Ref: UNKNOWN
commit 921c135653736bcc279700435358058762bc8f78
Author: Aaron Kimball
Date: Fri Mar 12 14:56:43 2010 -0800
CLOUDERA-BUILD. More Sqoop documentation updates
Author: Aaron Kimball
commit be7f1dc031e17dc4f53ebe76d27c1b9242105785
Author: Aaron Kimball
Date: Fri Mar 12 14:56:26 2010 -0800
MAPREDUCE-840. DBInputFormat leaves open transaction
Description: (Reapplied after HADOOP-4687)
Reason: MISSING: Reason for inclusion
Author: Aaron Kimball
Ref: UNKNOWN
commit 89a96d8fff80ac809dbda9582044a7c6b3986d16
Author: Aaron Kimball
Date: Fri Mar 12 14:56:07 2010 -0800
MAPREDUCE-906. Updated Sqoop documentation
Description: Provides the latest documentation for Sqoop, in both user-guide and manpage form. Built with asciidoc.
Reason: Documentation
Author: Aaron Kimball
Ref: UNKNOWN
commit 51f867aea0667d0191b730ea3abf114e75cafa4b
Author: Aaron Kimball
Date: Fri Mar 12 14:55:54 2010 -0800
MAPREDUCE-907. Sqoop should use more intelligent splits
Description: Sqoop should use the new split generation / InputFormat in MAPREDUCE-885
Reason: Performance / scalability improvement
Author: Aaron Kimball
Ref: UNKNOWN
commit 239df04415dba8d12c7d3fbf33c580d473202e94
Author: Aaron Kimball
Date: Fri Mar 12 14:55:28 2010 -0800
MAPREDUCE-885. More efficient SQL queries for DBInputFormat
Description: DBInputFormat generates InputSplits by counting the available rows in a table, and selecting subsections of the table via the "LIMIT" and "OFFSET" SQL keywords. These are only meaningful in an ordered context, so the query also includes an "ORDER BY" clause on an index column. The resulting queries are often inefficient and require full table scans. Actually using multiple mappers with these queries can lead to O(n^2) behavior in the database, where n is the number of splits. Attempting to use parallelism with these queries is counter-productive.

A better mechanism is to organize splits based on data values themselves, which can be performed in the WHERE clause, allowing for index range scans of tables, and can better exploit parallelism in the database.
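In sketch form, splits become ranges over an ordered split column, each expressible as an index-friendly WHERE clause (illustrative; table and column handling are placeholders, not the actual DBInputFormat code):

public class RangeQuerySketch {
  /** Build per-split queries of the form
   *  SELECT ... WHERE col >= lo AND col < hi, instead of LIMIT/OFFSET. */
  static String[] splitQueries(String table, String col,
                               long min, long max, int numSplits) {
    String[] queries = new String[numSplits];
    long span = Math.max(1, (max - min) / numSplits);
    for (int i = 0; i < numSplits; i++) {
      long lo = min + i * span;
      // The last split absorbs any rounding remainder.
      long hi = (i == numSplits - 1) ? max + 1 : lo + span;
      queries[i] = "SELECT * FROM " + table
          + " WHERE " + col + " >= " + lo
          + " AND " + col + " < " + hi;
    }
    return queries;
  }
}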