Accumulo 1.6.0 runs on Hadoop 1; however, Hadoop 2 with an HA namenode is recommended for production systems. In addition to HA, Hadoop 2 offers better data durability guarantees than Hadoop 1 when nodes lose power.

Notable Improvements

Multiple volume support

BigTable’s design allows for its internal metadata to automatically spread across multiple nodes. Accumulo has followed this design and scales very well as a result. There is one impediment to scaling, though: the HDFS namenode. The namenode limits scaling in two ways. First, it stores all of its filesystem metadata in memory on a single machine, which places an upper bound on the number of files Accumulo can have. Second, there is an upper bound on the number of file operations per second that a single namenode can support. For example, a namenode can only support a few thousand delete or create file requests per second.

To overcome this bottleneck, support for multiple namenodes was added under ACCUMULO-118. This change allows Accumulo to store its files across multiple namenodes. To use this feature, place a comma-separated list of namenode URIs in the new instance.volumes configuration property in accumulo-site.xml. If multiple namenode support is desired when upgrading to 1.6.0, modify this setting only after a successful upgrade.

Table namespaces

Administering an Accumulo instance with many tables is cumbersome. To ease this, ACCUMULO-802 introduced table namespaces, which allow tables to be grouped into logical collections. Configuration and permission changes can then be made to a namespace and will apply to all of its tables.

Conditional Mutations

Accumulo now offers a way to make atomic read-modify-write row changes from the client side. Atomic test-and-set row operations make this possible. ACCUMULO-1000 added conditional mutations and a conditional writer. A conditional mutation has tests on columns that must pass before any changes are made. These tests are executed in server processes while a row lock is held. Below is a simple example of making atomic row changes using conditional mutations.

Read columns X,Y,SEQ into a,b,s from row R1 using an isolated scanner.

The only built-in tests that conditional mutations support are equality and isNull. However, iterators can be configured on a conditional mutation to run before these tests. This makes it possible to implement any number of tests, such as less than, greater than, contains, etc.
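The test-and-set behavior described above can be illustrated with a small in-memory model. This is only a sketch of the semantics, not the Accumulo client API: the class ConditionalRowStore and its methods are hypothetical names invented for this illustration, and a single Java lock stands in for the server-side row lock.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy in-memory model of a conditional (test-and-set) row update; NOT the Accumulo API. */
final class ConditionalRowStore {
    // row id -> (column name -> value); hypothetical structure for illustration only
    private final Map<String, Map<String, String>> rows = new HashMap<>();

    /** Unconditionally sets a column; used here only to seed data. */
    synchronized void put(String row, String column, String value) {
        rows.computeIfAbsent(row, r -> new HashMap<>()).put(column, value);
    }

    /** Returns a snapshot copy of the row's columns (empty if the row is absent). */
    synchronized Map<String, String> read(String row) {
        return new HashMap<>(rows.getOrDefault(row, new HashMap<>()));
    }

    /**
     * Applies the changes only if the given column currently equals the expected value.
     * A null expected value means the column must be absent (an "isNull" style test).
     * The test and the write run under one lock, standing in for the row lock.
     */
    synchronized boolean conditionalUpdate(String row, String column, String expected,
                                           Map<String, String> changes) {
        Map<String, String> current = rows.computeIfAbsent(row, r -> new HashMap<>());
        String actual = current.get(column);
        boolean passes = (expected == null) ? actual == null : expected.equals(actual);
        if (!passes) {
            return false; // condition rejected; nothing is written
        }
        current.putAll(changes);
        return true;
    }
}
```

A client following this pattern would read the row, compute new values, and submit them with a condition on a sequence column such as SEQ, writing SEQ+1 as part of the change. If the update is rejected because another writer got there first, the client re-reads the row and retries.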

Encryption

Encryption is still an experimental feature, but much progress has been made since 1.5.0. Support for encrypting rfiles and write ahead logs was added in ACCUMULO-958 and ACCUMULO-980. Support for encrypting data over the wire using SSL was added in ACCUMULO-1009.

When a tablet server fails, its write ahead logs are sorted and stored in HDFS. In 1.6.0, encrypting these sorted write ahead logs is not supported. ACCUMULO-981 is open to address this issue.

Pluggable compaction strategies

One of the key elements of the BigTable design is the use of the Log Structured Merge Tree. This entails sorting data in memory, writing out sorted files, and then later merging multiple sorted files into a single file. These automatic merges happen in the background, and Accumulo decides when to merge files by comparing the relative sizes of files to a compaction ratio. Before 1.6.0, adjusting the compaction ratio was the only way a user could control this process. ACCUMULO-1451 introduces pluggable compaction strategies, which allow users to choose when and what files to compact. ACCUMULO-1808 adds a compaction strategy that prevents compaction of files over a configurable size.
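As a rough illustration of a ratio-based selection, the sketch below assumes the following rule: a set of files qualifies for merging when its total size is at least the ratio times the size of its largest file, and otherwise the largest file is dropped from consideration and the check repeats. The class and method names are hypothetical, and the exact comparison at the boundary is an assumption rather than a statement of what Accumulo's code does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sketch of a ratio-based file selection; names and boundary behavior are assumptions. */
final class CompactionRatioSketch {
    /**
     * Returns the sizes of the files to merge, or an empty list if no set qualifies.
     * Assumed rule: a set qualifies when total size >= ratio * size of its largest file.
     */
    static List<Long> selectFiles(List<Long> fileSizes, double ratio) {
        List<Long> candidates = new ArrayList<>(fileSizes);
        candidates.sort(Collections.reverseOrder()); // largest file first
        while (candidates.size() > 1) {
            long total = 0;
            for (long size : candidates) {
                total += size;
            }
            long largest = candidates.get(0);
            if (total >= ratio * largest) {
                return candidates; // the remaining files are similar enough in size
            }
            candidates.remove(0); // drop the largest file and try the smaller ones
        }
        return new ArrayList<>(); // a single file (or none) is never merged here
    }
}
```

With a ratio of 3 and files of sizes 100, 10, 10, 10, and 10, the full set does not qualify (140 is less than 300), so the 100 file is set aside and the four small files are selected (40 is at least 30). A pluggable strategy replaces this kind of fixed rule with user-supplied logic.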

Lexicoders

Accumulo only sorts data lexicographically. Getting something like a pair of (String,Integer) to sort correctly in Accumulo is tricky, because the integers should only be compared if the strings are equal. It’s possible to make this sort properly in Accumulo if the data is encoded properly, but doing so can be difficult. To make this easier, ACCUMULO-1336 added Lexicoders to the Accumulo API. Lexicoders provide an easy way to serialize data so that it sorts properly lexicographically. Below is a simple example.
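To show the underlying idea without depending on the Lexicoder classes themselves, here is a minimal sketch of an order-preserving encoding for a (String,Integer) pair. It is not the Lexicoder implementation: the class PairEncodingSketch is a hypothetical name, the string is assumed to contain no NUL character, and string order here means UTF-8 byte order.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Sketch of an order-preserving (String, int) encoding; NOT Accumulo's Lexicoder code. */
final class PairEncodingSketch {
    /**
     * Encodes (s, i) so that unsigned byte-wise comparison orders by s first, then by i.
     * Assumes s contains no NUL character, since 0x00 is used as the terminator.
     */
    static byte[] encode(String s, int i) {
        byte[] text = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(text.length + 1 + Integer.BYTES);
        buf.put(text);
        buf.put((byte) 0x00);       // terminator: a prefix sorts before any longer string
        buf.putInt(i ^ 0x80000000); // big-endian with sign bit flipped: negatives sort first
        return buf.array();
    }

    /** Unsigned lexicographic comparison of two byte arrays. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int k = 0; k < n; k++) {
            int diff = (a[k] & 0xff) - (b[k] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

A naive encoding, such as concatenating the string with the decimal digits of the integer, would place ("a",10) before ("a",9). The fixed-width, sign-flipped integer and the string terminator avoid that kind of mis-ordering.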

Locality groups in memory

In cases where a very small amount of data is stored in a locality group, one would expect fast scans over that locality group. However, this was not always the case, because recently written data stored in memory was not partitioned by locality group. Therefore, if a table had 100GB of data in memory and 1MB of that was in locality group A, then scanning A would have required reading all 100GB. ACCUMULO-112 changes this and partitions data by locality group as it is written.
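The effect of partitioning at write time can be pictured with a toy in-memory structure. This is a sketch only, with hypothetical names (PartitionedInMemoryMap, assign, write, scanGroup); it does not reflect how Accumulo's in-memory map is actually implemented.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of an in-memory map partitioned by locality group; names are hypothetical. */
final class PartitionedInMemoryMap {
    // column family -> locality group name
    private final Map<String, String> familyToGroup = new HashMap<>();
    // locality group name -> sorted entries belonging to that group
    private final Map<String, TreeMap<String, String>> groups = new HashMap<>();

    /** Declares that a column family belongs to a locality group. */
    void assign(String family, String group) {
        familyToGroup.put(family, group);
    }

    /** Routes each write to its group's sorted map as the data arrives. */
    void write(String row, String family, String value) {
        String group = familyToGroup.getOrDefault(family, "default");
        groups.computeIfAbsent(group, g -> new TreeMap<>()).put(row + ":" + family, value);
    }

    /** Returns only the entries of one group; entries of other groups are never touched. */
    TreeMap<String, String> scanGroup(String group) {
        return groups.getOrDefault(group, new TreeMap<>());
    }
}
```

Because each group has its own sorted map, a scan over a small group visits only that group's entries, regardless of how much data the other groups hold in memory.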

Service IP addresses

Previous versions of Accumulo always used IP addresses internally. This could be problematic in virtual machine environments where IP addresses change. ACCUMULO-1585 changed this: Accumulo now uses the exact hostnames from its config files for internal addressing.

All Accumulo processes running on a cluster are locatable via ZooKeeper. Therefore, using well-known ports is not really required. ACCUMULO-1664 makes it possible for all Accumulo processes to use random ports. This makes it easier to run multiple Accumulo instances on a single node.

While Hadoop does not support IPv6 networks, attempting to run on a system that does not have IPv6 completely disabled can cause strange failures. ACCUMULO-2262 sets the JVM-provided configuration parameter at process startup to prefer IPv4 over IPv6.

ViewFS

Multiple bug fixes were made to support running Accumulo over multiple HDFS instances using ViewFS. ACCUMULO-2047 is the parent ticket that contains numerous fixes to enable this support.

Maven Plugin

This version of Accumulo is accompanied by a new Maven plugin for testing client apps (ACCUMULO-1030). You can execute the accumulo-maven-plugin inside your project by adding the following to your pom.xml’s build plugins section:

This plugin is designed to work in conjunction with the maven-failsafe-plugin. A small test instance of Accumulo will run during the pre-integration-test phase of the Maven build lifecycle and will be stopped in the post-integration-test phase. Your integration tests, executed by maven-failsafe-plugin, can access this instance with a MiniAccumuloInstance connector (the plugin uses MiniAccumuloInstance internally), as in the following example:

This plugin is quite limited, currently only supporting an instance name and a root user password as configuration parameters. Improvements are expected in future releases, so feedback is welcome and appreciated (file bugs/requests under the “maven-plugin” component in the Accumulo JIRA).

Packaging

One notable change that was made to the binary tarball is the purposeful omission of a pre-built copy of the Accumulo “native map” library. This shared library is used at ingest time to implement an off-JVM-heap sorted map that greatly increases ingest throughput while side-stepping issues such as JVM garbage collection pauses. In earlier releases, a pre-built copy of this shared library was included in the binary tarball; however, the decision was made to omit this due to the potential variance in toolchains on the target system.

It is recommended that users invoke the provided build_native_library.sh before running Accumulo:

$ACCUMULO_HOME/bin/build_native_library.sh

Be aware that you will need a C++ compiler/toolchain installed to build this library. Check your GNU/Linux distribution documentation for the package manager command.

Size-Based Constraint on New Tables

A Constraint is an interface that can determine if a Mutation should be applied or rejected server-side. After ACCUMULO-466, new tables that are created in 1.6.0 will automatically have the DefaultKeySizeConstraint set. As performance can suffer when large Keys are inserted into a table, this Constraint will reject any Key that is larger than 1MB. If this constraint is undesired, it can be removed using the constraint shell command. See the help message on the command for more information.
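The kind of check such a constraint performs can be sketched in a few lines. This is an illustration under stated assumptions, not the DefaultKeySizeConstraint source: the class name KeySizeCheckSketch is hypothetical, and which parts of the Key count toward its size (here row, column family, column qualifier, and visibility) is an assumption.

```java
/** Sketch of a size check on key components; NOT the DefaultKeySizeConstraint source. */
final class KeySizeCheckSketch {
    // 1MB limit as stated in the text
    static final long MAX_KEY_SIZE = 1024 * 1024;

    /**
     * Returns true when the combined size of the key parts is larger than the limit,
     * meaning the update would be rejected. Which parts are counted is an assumption.
     */
    static boolean exceedsLimit(byte[] row, byte[] family, byte[] qualifier, byte[] visibility) {
        long size = (long) row.length + family.length + qualifier.length + visibility.length;
        return size > MAX_KEY_SIZE;
    }
}
```

Under this sketch, a key whose parts total exactly 1MB is accepted and one that is a single byte larger is rejected.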

Known Issues

Slower writes than previous Accumulo versions

When using Accumulo 1.6 and Hadoop 2, Accumulo will call hsync() on HDFS.
Calling hsync improves durability by ensuring data is on disk (where older
Hadoop versions might lose data in the face of power failure); however, calling
hsync frequently does noticeably slow writes. A simple workaround is to increase
the value of the tserver.mutation.queue.max configuration parameter via accumulo-site.xml.

A value of “4M” is a better recommendation; note that memory consumption scales with
the number of concurrent writers to that TabletServer. For example, a value of 4M with
50 concurrent writers would equate to approximately 200M of Java heap being used for
mutation queues.

Another possible cause of slower writes is the change in write ahead log replication
between 1.4 and 1.5. Accumulo 1.4 defaulted to two logger servers. Accumulo 1.5 and 1.6 store
write ahead logs in HDFS and default to using three datanodes.

BatchWriter hold time error

If a BatchWriter fails with MutationsRejectedException and the message contains
"# server errors 1", then it may be ACCUMULO-2388. To confirm this, look in the tablet server logs
for org.apache.accumulo.tserver.HoldTimeoutException around the time the BatchWriter failed.
If this is happening often, a possible workaround is to set general.rpc.timeout to 240s.

Testing

Below is a list of all platforms that 1.6.0 was tested against by developers. Each Apache Accumulo release
has a set of tests that must be run before the candidate is capable of becoming an official release. That list includes the following:

Successfully run all unit tests

Successfully run all functional tests (test/system/auto)

Successfully complete two 24-hour RandomWalk tests (LongClean module), with and without “agitation”

Successfully complete two 24-hour Continuous Ingest tests, with and without “agitation”, with data verification

Successfully complete two 72-hour Continuous Ingest tests, with and without “agitation”

Each unit and functional test only runs on a single node, while the RandomWalk and Continuous Ingest tests run
on any number of nodes. Agitation refers to randomly restarting Accumulo processes and Hadoop Datanode processes,
and, in HDFS High-Availability instances, forcing NameNode failover.