http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/docs/master/design.md
----------------------------------------------------------------------
diff --git a/docs/master/design.md b/docs/master/design.md
deleted file mode 100644
index 6c77cb6..0000000
--- a/docs/master/design.md
+++ /dev/null
@@ -1,180 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one or more
-// contributor license agreements. See the NOTICE file distributed with
-// this work for additional information regarding copyright ownership.
-// The ASF licenses this file to You under the Apache License, Version 2.0
-// (the "License"); you may not use this file except in compliance with
-// the License. You may obtain a copy of the License at
-//
-// http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-== Accumulo Design
-
-=== Data Model
-
-Accumulo provides a richer data model than simple key-value stores, but is not a
-fully relational database. Data is represented as key-value pairs, where the key and
-value are composed of the following elements:
-
-[width="75%",cols="^,^,^,^,^,^"]
-|===========================================================================
- 5+|Key .3+^.^|Value
-.2+^.^|Row ID 3+|Column .2+^.^|Timestamp
- |Family |Qualifier |Visibility
-|===========================================================================
-
-All elements of the Key and the Value are represented as byte arrays except for
-Timestamp, which is a Long. Accumulo sorts keys by element and lexicographically
-in ascending order. Timestamps are sorted in descending order so that later
-versions of the same Key appear first in a sequential scan. Tables consist of a set of
-sorted key-value pairs.
-
-=== Architecture
-
-Accumulo is a distributed data storage and retrieval system and as such consists of
-several architectural components, some of which run on many individual servers.
-Much of the work Accumulo does involves maintaining certain properties of the
-data, such as organization, availability, and integrity, across many commodity-class
-machines.
-
-=== Components
-
-An instance of Accumulo includes many TabletServers, one Garbage Collector process,
-one Master server and many Clients.
-
-==== Tablet Server
-
-The TabletServer manages some subset of all the tablets (partitions of tables). This includes receiving writes from clients, persisting writes to a
-write-ahead log, sorting new key-value pairs in memory, periodically
-flushing sorted key-value pairs to new files in HDFS, and responding
-to reads from clients, forming a merge-sorted view of all keys and
-values from all the files it has created and the sorted in-memory
-store.
-
-TabletServers also perform recovery of a tablet
-that was previously on a server that failed, reapplying any writes
-found in the write-ahead log to the tablet.
-
-==== Garbage Collector
-
-Accumulo processes will share files stored in HDFS. Periodically, the Garbage
-Collector will identify files that are no longer needed by any process, and
-delete them. Multiple garbage collectors can be run to provide hot-standby support.
-They will perform leader election among themselves to choose a single active instance.
-
-==== Master
-
-The Accumulo Master is responsible for detecting and responding to TabletServer
-failure. It tries to balance the load across TabletServers by assigning tablets carefully
-and instructing TabletServers to unload tablets when necessary. The Master ensures all
-tablets are assigned to one TabletServer each, and handles table creation, alteration,
-and deletion requests from clients. The Master also coordinates startup, graceful
-shutdown and recovery of changes in write-ahead logs when TabletServers fail.
-
-Multiple masters may be run. The masters will elect a single active master among themselves,
-and the others will act as backups that take over if the active master fails.
-
-==== Tracer
-
-The Accumulo Tracer process supports the distributed timing API provided by Accumulo.
-One or more of these processes can be run on a cluster; they write the timing
-information to a given Accumulo table for future reference. See the section on
-Tracing for more information on this support.
-
-==== Monitor
-
-The Accumulo Monitor is a web application that provides a wealth of information about
-the state of an instance. The Monitor shows graphs and tables which contain information
-about read/write rates, cache hit/miss rates, and Accumulo table information such as scan
-rate and active/queued compactions. Additionally, the Monitor should always be the first
-point of entry when attempting to debug an Accumulo problem as it will show high-level problems
-in addition to aggregated errors from all nodes in the cluster. See the section on <<monitoring>>
-for more information.
-
-Multiple Monitors can be run to provide hot-standby support in the face of failure. Due to the
-forwarding of logs from remote hosts to the Monitor, only one Monitor process should be active
-at one time. Leader election will be performed internally to choose the active Monitor.
-
-==== Client
-
-Accumulo includes a client library that is linked to every application. The client
-library contains logic for finding servers managing a particular tablet, and
-communicating with TabletServers to write and retrieve key-value pairs.
-
-=== Data Management
-
-Accumulo stores data in tables, which are partitioned into tablets. Tablets are
-partitioned on row boundaries so that all of the columns and values for a particular
-row are found together within the same tablet. The Master assigns each tablet to exactly one
-TabletServer at a time. This enables row-level transactions to take place without
-using distributed locking or some other complicated synchronization mechanism. As
-clients insert and query data, and as machines are added to and removed from the
-cluster, the Master migrates tablets to ensure they remain available and that the
-ingest and query load is balanced across the cluster.
-
-image::data_distribution.png[width=500]
-
-=== Tablet Service
-
-
-When a write arrives at a TabletServer it is written to a Write-Ahead Log and
-then inserted into a sorted data structure in memory called a MemTable. When the
-MemTable reaches a certain size, the TabletServer writes out the sorted
-key-value pairs to a file in HDFS called a Relative Key File (RFile), which is a
-kind of Indexed Sequential Access Method (ISAM) file. This process is called a
-minor compaction. A new MemTable is then created and the fact of the compaction
-is recorded in the Write-Ahead Log.
-
-When a request to read data arrives at a TabletServer, the TabletServer does a
-binary search across the MemTable as well as the in-memory indexes associated
-with each RFile to find the relevant values. If clients are performing a scan,
-several key-value pairs are returned to the client in order from the MemTable
-and the set of RFiles by performing a merge-sort as they are read.
-
-=== Compactions
-
-In order to manage the number of files per tablet, periodically the TabletServer
-performs Major Compactions of files within a tablet, in which some set of RFiles
-are combined into one file. The previous files will eventually be removed by the
-Garbage Collector. This also provides an opportunity to permanently remove
-deleted key-value pairs by omitting key-value pairs suppressed by a delete entry
-when the new file is created.
-
-=== Splitting
-
-When a table is created it has one tablet. As the table grows its initial
-tablet eventually splits into two tablets. It's likely that one of these
-tablets will migrate to another tablet server. As the table continues to grow,
-its tablets will continue to split and be migrated. The decision to
-automatically split a tablet is based on the size of a tablet's files. The
-size threshold at which a tablet splits is configurable per table. In addition
-to automatic splitting, a user can manually add split points to a table to
-create new tablets. Manually splitting a new table can parallelize reads and
-writes giving better initial performance without waiting for automatic
-splitting.
-
-As data is deleted from a table, tablets may shrink. Over time this can lead
-to small or empty tablets. To deal with this, merging of tablets was
-introduced in Accumulo 1.4. This is discussed in more detail later.
-
-=== Fault-Tolerance
-
-If a TabletServer fails, the Master detects it and automatically reassigns the tablets
-from the failed server to other servers. Any key-value pairs that were in
-memory at the time the TabletServer failed are automatically reapplied from the Write-Ahead
-Log (WAL) to prevent any loss of data.
-
-Tablet servers write their WALs directly to HDFS so the logs are available to all tablet
-servers for recovery. To make the recovery process efficient, the updates within a log are
-grouped by tablet. TabletServers can quickly apply the mutations from the sorted logs
-that are destined for the tablets they have now been assigned.
-
-TabletServer failures are noted on the Master's monitor page, accessible via
-+http://master-address:9995/monitor+.
-
-image::failure_handling.png[width=500]
http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/docs/master/development_clients.md
----------------------------------------------------------------------
diff --git a/docs/master/development_clients.md b/docs/master/development_clients.md
deleted file mode 100644
index 18821e3..0000000
--- a/docs/master/development_clients.md
+++ /dev/null
@@ -1,107 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one or more
-// contributor license agreements. See the NOTICE file distributed with
-// this work for additional information regarding copyright ownership.
-// The ASF licenses this file to You under the Apache License, Version 2.0
-// (the "License"); you may not use this file except in compliance with
-// the License. You may obtain a copy of the License at
-//
-// http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-== Development Clients
-
-Normally, Accumulo consists of lots of moving parts. Even a stand-alone version of
-Accumulo requires Hadoop, ZooKeeper, the Accumulo master, a tablet server, etc. If
-you want to write a unit test that uses Accumulo, you need a lot of infrastructure
-in place before your test can run.
-
-=== Mock Accumulo
-
-Mock Accumulo supplies mock implementations for much of the client API. It presently
-does not enforce users, logins, permissions, etc. It does support Iterators and Combiners.
-Note that MockAccumulo holds all data in memory, and will not retain any data or
-settings between runs.
-
-While normal interaction with the Accumulo client looks like this:
-
-[source,java]
-Instance instance = new ZooKeeperInstance(...);
-Connector conn = instance.getConnector(user, passwordToken);
-
-To interact with the MockAccumulo, just replace the ZooKeeperInstance with MockInstance:
-
-[source,java]
-Instance instance = new MockInstance();
-
-In fact, you can use the +--fake+ option to the Accumulo shell and interact with
-MockAccumulo:
-
-----
-$ accumulo shell --fake -u root -p ''
-
-Shell - Apache Accumulo Interactive Shell
--
-- version: 2.x.x
-- instance name: fake
-- instance id: mock-instance-id
--
-- type 'help' for a list of available commands
--
-
-root@fake> createtable test
-
-root@fake test> insert row1 cf cq value
-root@fake test> insert row2 cf cq value2
-root@fake test> insert row3 cf cq value3
-
-root@fake test> scan
-row1 cf:cq [] value
-row2 cf:cq [] value2
-row3 cf:cq [] value3
-
-root@fake test> scan -b row2 -e row2
-row2 cf:cq [] value2
-
-root@fake test>
-----
-
-When testing MapReduce jobs, you can also set the Mock Accumulo on the AccumuloInputFormat
-and AccumuloOutputFormat classes:
-
-[source,java]
-// ... set up job configuration
-AccumuloInputFormat.setMockInstance(job, "mockInstance");
-AccumuloOutputFormat.setMockInstance(job, "mockInstance");
-
-=== Mini Accumulo Cluster
-
-While the Mock Accumulo provides a lightweight implementation of the client API for unit
-testing, it is often necessary to write more realistic end-to-end integration tests that
-take advantage of the entire ecosystem. The Mini Accumulo Cluster makes this possible by
-configuring and starting ZooKeeper, initializing Accumulo, and starting the Master as well
-as some Tablet Servers. It runs against the local filesystem instead of having to start
-up HDFS.
-
-To start it up, you will need to supply an empty directory and a root password as arguments:
-
-[source,java]
-File tempDirectory = java.nio.file.Files.createTempDirectory("mini-accumulo").toFile(); // or use the JUnit/Guava temp-directory helpers
-MiniAccumuloCluster accumulo = new MiniAccumuloCluster(tempDirectory, "password");
-accumulo.start();
-
-Once we have our mini cluster running, we will want to interact with the Accumulo client API:
-
-[source,java]
-Instance instance = new ZooKeeperInstance(accumulo.getInstanceName(), accumulo.getZooKeepers());
-Connector conn = instance.getConnector("root", new PasswordToken("password"));
-
-Upon completion of our development code, we will want to shut down our MiniAccumuloCluster:
-
-[source,java]
-accumulo.stop();
-// delete your temporary folder
http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/docs/master/high_speed_ingest.md
----------------------------------------------------------------------
diff --git a/docs/master/high_speed_ingest.md b/docs/master/high_speed_ingest.md
deleted file mode 100644
index 2a7a702..0000000
--- a/docs/master/high_speed_ingest.md
+++ /dev/null
@@ -1,124 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one or more
-// contributor license agreements. See the NOTICE file distributed with
-// this work for additional information regarding copyright ownership.
-// The ASF licenses this file to You under the Apache License, Version 2.0
-// (the "License"); you may not use this file except in compliance with
-// the License. You may obtain a copy of the License at
-//
-// http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-== High-Speed Ingest
-
-Accumulo is often used as part of a larger data processing and storage system. To
-maximize the performance of a parallel system involving Accumulo, the ingestion
-and query components should be designed to provide enough parallelism and
-concurrency to avoid creating bottlenecks for users and other systems writing to
-and reading from Accumulo. There are several ways to achieve high ingest
-performance.
-
-=== Pre-Splitting New Tables
-
-New tables consist of a single tablet by default. As mutations are applied, the table
-grows and splits into multiple tablets which are balanced by the Master across
-TabletServers. This implies that the aggregate ingest rate will be limited to fewer
-servers than are available within the cluster until the table has reached the point
-where there are tablets on every TabletServer.
-
-Pre-splitting a table ensures that there are as many tablets as desired available
-before ingest begins to take advantage of all the parallelism possible with the cluster
-hardware. Tables can be split at any time by using the shell:
-
- user@myinstance mytable> addsplits -sf /local_splitfile -t mytable
-
-For the purposes of providing parallelism to ingest it is not necessary to create more
-tablets than there are physical machines within the cluster as the aggregate ingest
-rate is a function of the number of physical machines. Note that the aggregate ingest
-rate is still subject to the number of machines running ingest clients, and the
-distribution of rowIDs across the table. The aggregate ingest rate will be
-suboptimal if there are many inserts into a small number of rowIDs.
-
-=== Multiple Ingester Clients
-
-Accumulo is capable of scaling to very high rates of ingest, which is dependent upon
-not just the number of TabletServers in operation but also the number of ingest
-clients. This is because a single client, while capable of batching mutations and
-sending them to all TabletServers, is ultimately limited by the amount of data that
-can be processed on a single machine. The aggregate ingest rate will scale linearly
-with the number of clients up to the point at which either the aggregate I/O of
-TabletServers or total network bandwidth capacity is reached.
-
-In operational settings where high rates of ingest are paramount, clusters are often
-configured to dedicate some number of machines solely to running Ingester Clients.
-The exact ratio of clients to TabletServers necessary for optimum ingestion rates
-will vary according to the distribution of resources per machine and by data type.
-
-=== Bulk Ingest
-
-Accumulo supports the ability to import files produced by an external process such
-as MapReduce into an existing table. In some cases it may be faster to load data this
-way rather than via ingesting through clients using BatchWriters. This allows a large
-number of machines to format data the way Accumulo expects. The new files can
-then simply be introduced to Accumulo via a shell command.
-
-To configure MapReduce to format data in preparation for bulk loading, the job
-should be set to use a range partitioner instead of the default hash partitioner. The
-range partitioner uses the split points of the Accumulo table that will receive the
-data. The split points can be obtained from the shell and used by the MapReduce
-RangePartitioner. Note that this is only useful if the existing table is already split
-into multiple tablets.
-
- user@myinstance mytable> getsplits
- aa
- ab
- ac
- ...
- zx
- zy
- zz
-
-Run the MapReduce job, using the AccumuloFileOutputFormat to create the files to
-be introduced to Accumulo. Once this is complete, the files can be added to
-Accumulo via the shell:
-
- user@myinstance mytable> importdirectory /files_dir /failures
-
-Note that the paths referenced are directories within the same HDFS instance over
-which Accumulo is running. Accumulo places any files that failed to be added to the
-second directory specified.
-
-See the https://github.com/apache/accumulo-examples/blob/master/docs/bulkIngest.md[Bulk Ingest example]
-for a complete example.
-
-=== Logical Time for Bulk Ingest
-
-Logical time is important for bulk imported data, for which the client code may
-be choosing a timestamp. At bulk import time, the user can choose to enable
-logical time for the set of files being imported. When it's enabled, Accumulo
-uses a specialized system iterator to lazily set times in a bulk imported file.
-This mechanism guarantees that times set by unsynchronized multi-node
-applications (such as those running on MapReduce) will maintain some semblance
-of causal ordering. This mitigates the problem of the time being wrong on the
-system that created the file for bulk import. These times are not set when the
-file is imported, but whenever it is read by scans or compactions. At import, a
-time is obtained, and the specialized system iterator always uses that time when
-setting timestamps.
-
-The timestamp assigned by Accumulo will be the same for every key in the file.
-This could cause problems if the file contains multiple keys that are identical
-except for the timestamp. In this case, the sort order of the keys will be
-undefined. This could occur if an insert and an update were in the same bulk
-import file.
-
-=== MapReduce Ingest
-
-It is possible to efficiently write many mutations to Accumulo in parallel via a
-MapReduce job. In this scenario the MapReduce is written to process data that lives
-in HDFS and write mutations to Accumulo using the AccumuloOutputFormat. See
-the MapReduce section under Analytics for details. The https://github.com/apache/accumulo-examples/blob/master/docs/mapred.md[MapReduce example]
-is also a good reference for example code.
http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/docs/master/implementation.md
----------------------------------------------------------------------
diff --git a/docs/master/implementation.md b/docs/master/implementation.md
deleted file mode 100644
index 520f538..0000000
--- a/docs/master/implementation.md
+++ /dev/null
@@ -1,86 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one or more
-// contributor license agreements. See the NOTICE file distributed with
-// this work for additional information regarding copyright ownership.
-// The ASF licenses this file to You under the Apache License, Version 2.0
-// (the "License"); you may not use this file except in compliance with
-// the License. You may obtain a copy of the License at
-//
-// http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-== Implementation Details
-
-=== Fault-Tolerant Executor (FATE)
-
-Accumulo must implement a number of distributed, multi-step operations to support
-the client API. Creating a new table is a simple example of an atomic client call
-which requires multiple steps in the implementation: get a unique table ID, configure
-default table permissions, populate information in ZooKeeper to record the table's
-existence, create directories in HDFS for the table's data, etc. Implementing these
-steps in a way that is tolerant to node failure and other concurrent operations is
-very difficult to achieve. Accumulo includes a Fault-Tolerant Executor (FATE) which
-is widely used server-side to implement the client API safely and correctly.
-
-FATE is the implementation detail which ensures that tables in creation when the
-Master dies will be successfully created when another Master process is started.
-This alleviates the need for any external tools to correct some bad state -- Accumulo can
-undo the failure and self-heal without any external intervention.
-
-=== Overview
-
-FATE consists of three primary components: a repeatable, persisted operation (REPO), a storage
-layer for REPOs and an execution system to run REPOs. Accumulo uses ZooKeeper as the storage
-layer for FATE and the Accumulo Master acts as the execution system to run REPOs.
-
-The important characteristic of REPOs is that they are implemented in a way that is idempotent:
-every operation must be able to undo or replay a partial execution of itself. Requiring the
-implementation of the operation to support this functionality greatly simplifies the execution
-of these operations. This property is also what guarantees safety in light of failure conditions.
-
-=== Administration
-
-Sometimes, it is useful to inspect the current FATE operations, both pending and executing.
-For example, a command that is not completing could be blocked on the execution of another
-operation. Accumulo provides a shell command to interact with FATE.
-
-The +fate+ shell command accepts a number of arguments for different functionality:
-+list+/+print+, +fail+, +delete+, +dump+.
-
-==== List/Print
-
-Without any additional arguments, this command will print all operations that still exist in
-the FATE store (ZooKeeper). This will include active, pending, and completed operations (completed
-operations are lazily removed from the store). Each operation includes a unique "transaction ID", the
-state of the operation (e.g. +NEW+, +IN_PROGRESS+, +FAILED+), any locks the
-transaction actively holds and any locks it is waiting to acquire.
-
-This option can also accept transaction IDs which will restrict the list of transactions shown.
-
-==== Fail
-
-This command can be used to manually fail a FATE transaction and requires a transaction ID
-as an argument. Failing an operation is not a normal procedure and should only be performed
-by an administrator who understands the implications of why they are failing the operation.
-
-==== Delete
-
-This command requires a transaction ID and will delete any locks that the transaction
-holds. Like the fail command, this command should only be used in extreme circumstances
-by an administrator who understands the implications of the command they are about to
-invoke. It is not normal to invoke this command.
-
-==== Dump
-
-This command accepts zero or more transaction IDs. If given no transaction IDs,
-it will dump all active transactions. A FATE operation is composed of a
-sequence of REPOs. In order to start a FATE transaction, a REPO is pushed onto
-a per-transaction REPO stack. The top of the stack always contains the next
-REPO the FATE transaction should execute. When a REPO is successful it may
-return another REPO which is pushed on the stack. The +dump+ command will
-print all of the REPOs on each transaction's stack. The REPOs are serialized to
-JSON in order to make them human readable.
http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/docs/master/introduction.md
----------------------------------------------------------------------
diff --git a/docs/master/introduction.md b/docs/master/introduction.md
deleted file mode 100644
index 1b964b4..0000000
--- a/docs/master/introduction.md
+++ /dev/null
@@ -1,25 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one or more
-// contributor license agreements. See the NOTICE file distributed with
-// this work for additional information regarding copyright ownership.
-// The ASF licenses this file to You under the Apache License, Version 2.0
-// (the "License"); you may not use this file except in compliance with
-// the License. You may obtain a copy of the License at
-//
-// http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-== Introduction
-Apache Accumulo is a highly scalable structured store based on Google's BigTable.
-Accumulo is written in Java and operates over the Hadoop Distributed File System
-(HDFS), which is part of the popular Apache Hadoop project. Accumulo supports
-efficient storage and retrieval of structured data, including queries for ranges, and
-provides support for using Accumulo tables as input and output for MapReduce
-jobs.
-
-Accumulo features automatic load-balancing and partitioning, data compression
-and fine-grained security labels.
http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/docs/master/iterator_design.md
----------------------------------------------------------------------
diff --git a/docs/master/iterator_design.md b/docs/master/iterator_design.md
deleted file mode 100644
index 4beaeb0..0000000
--- a/docs/master/iterator_design.md
+++ /dev/null
@@ -1,401 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one or more
-// contributor license agreements. See the NOTICE file distributed with
-// this work for additional information regarding copyright ownership.
-// The ASF licenses this file to You under the Apache License, Version 2.0
-// (the "License"); you may not use this file except in compliance with
-// the License. You may obtain a copy of the License at
-//
-// http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-== Iterator Design
-
-Accumulo SortedKeyValueIterators, commonly referred to as Iterators for short, are server-side programming constructs
-that allow users to implement custom retrieval or computational logic within Accumulo TabletServers. The name rightly
-brings forward similarities to the Java Iterator interface; however, Accumulo Iterators are more complex than Java
-Iterators. Notably, in addition to the expected methods to retrieve the current element and advance to the next element
-in the iteration, Accumulo Iterators must also support the ability to "move" (`seek`) to a specified point in the
-iteration (the Accumulo table). Accumulo Iterators are designed to be concatenated together, similar to applying a
-series of transformations to a list of elements. Accumulo Iterators can duplicate their underlying source to create
-multiple "pointers" over the same underlying data (which is extremely powerful since each stream is sorted) or they can
-merge multiple Iterators into a single view. In this sense, a collection of Iterators operating in tandem is closer to
-a tree structure than a list, but there is always a sense of a flow of Key-Value pairs through some Iterators. Iterators
-are not designed to act as triggers nor are they designed to operate outside of the purview of a single table.
-
-Understanding how TabletServers invoke the methods on a SortedKeyValueIterator can be difficult as the actual code is
-buried within the implementation of the TabletServer; however, it is generally unnecessary to have a strong
-understanding of this as the interface provides clear definitions about what action each method should take. This
-chapter aims to provide a more detailed description of how Iterators are invoked, some best practices and some common
-pitfalls.
-
-=== Instantiation
-
-To invoke an Accumulo Iterator inside of the TabletServer, the Iterator class must be on the classpath of every
-TabletServer. For production environments, it is common to place a JAR file which contains the Iterator in
-`lib/`. In development environments, it is convenient to instead place the JAR file in `lib/ext/` as JAR files
-in this directory are dynamically reloaded by the TabletServers alleviating the need to restart Accumulo while
-testing an Iterator. Advanced classloader features enable other types of filesystems and per-table classpath
-configurations (as opposed to process-wide classpaths). These features are not covered here, but elsewhere in the user
-manual.
-
-Accumulo references the Iterator class by name and uses Java reflection to instantiate the Iterator. This means that
-Iterators must have a public no-args constructor.
-
-=== Interface
-
-A normal implementation of the SortedKeyValueIterator defines functionality for the following methods:
-
-[source,java]
-----
-void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options, IteratorEnvironment env) throws IOException;
-
-boolean hasTop();
-
-void next() throws IOException;
-
-void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive) throws IOException;
-
-Key getTopKey();
-
-Value getTopValue();
-
-SortedKeyValueIterator<Key,Value> deepCopy(IteratorEnvironment env);
-----
-
-==== `init`
-
-The `init` method is called by the TabletServer after it constructs an instance of the Iterator. This method should
-clear/reset any internal state in the Iterator and prepare it to process data. The first argument, the `source`, is the
-Iterator "below" this Iterator (where the client is at the "top" and the Iterators for files in HDFS are at the "bottom").
-The "source" Iterator provides the Key-Value pairs which this Iterator will operate upon.
-
-The second argument, a Map of options, is made up of options provided by the user, options set in the table's
-configuration, and/or options set in the containing namespace's configuration.
-These options allow Iterators to configure themselves dynamically. If no options are used in the current context
-(a Scan or Compaction), the Map will be empty. An example of a configuration item for an Iterator could be a pattern used to filter
-Key-Value pairs in a regular expression Iterator.
-
-The third argument, the `IteratorEnvironment`, is a special object which provides information to this Iterator about the
-context in which it was invoked. Commonly, it is not necessary to inspect this information, but it can be useful. For example, if an Iterator
-knows that it is running in the context of a full major compaction (reading all of the data) as opposed to a user scan
-(which may strongly limit the number of columns), the Iterator might make different algorithmic decisions in an attempt to
-optimize itself.
-
-==== `seek`
-
-The `seek` method is likely the most confusing method on the Iterator interface. The purpose of this method is to
-advance the stream of Key-Value pairs to a certain point in the iteration (the Accumulo table). It is common that before
-the implementation of this method returns some additional processing is performed which may further advance the current
-position past the `startKey` of the `Range`. This, however, is dependent on the functionality the iterator provides. For
-example, a filtering iterator would consume a number of Key-Value pairs which do not meet its criteria before `seek`
-returns. The important condition for `seek` to meet is that this Iterator should be ready to return the first Key-Value
-pair, or none if no such pair is available, when the method returns. The Key-Value pair would be returned by `getTopKey`
-and `getTopValue`, respectively, and `hasTop` should return a boolean denoting whether or not there is
-a Key-Value pair to return.
-
-The arguments passed to seek are as follows:
-
-The TabletServer first provides a `Range`, an object which defines some collection of Accumulo `Key`s, which defines the
-Key-Value pairs that this Iterator should return. Each `Range` has a `startKey` and `endKey` with an inclusive flag for
-both. While this Range is often similar to the Range(s) set by the client on a Scanner or BatchScanner, it is not
-guaranteed to be a Range that the client set. Accumulo will split up larger ranges and group them together based on
-Tablet boundaries per TabletServer. Iterators should not attempt to implement any custom logic based on the Range(s)
-provided to `seek` and Iterators should not return any Keys that fall outside of the provided Range.
-
-The second argument, a `Collection<ByteSequence>`, is the set of column families which should be retained or
-excluded by this Iterator. The third argument, a boolean, defines whether the collection of column families
-should be treated as an inclusion collection (true) or an exclusion collection (false).
-
-It is likely that all implementations of `seek` will first make a call to the `seek` method on the
-"source" Iterator that was provided in the `init` method. The collection of column families and
-the boolean `inclusive` argument should be passed down as well as the `Range`. Somewhat commonly, the Iterator will
-also implement some sort of additional logic to find or compute the first Key-Value pair in the provided
-Range. For example, a regular expression Iterator would consume all records which do not match the given
-pattern before returning from `seek`.
-
-It is important to retain the original Range passed to this method to know when this Iterator should stop
-reading more Key-Value pairs. Ignoring this typically does not affect scans from a Scanner, but it
-will result in duplicate keys being emitted from a BatchScanner if the scanned table has more than one tablet.
-Best practice is to never emit entries outside the seek range.
-
-==== `next`
-
-The `next` method is analogous to the `next` method on a Java Iterator: this method should advance
-the Iterator to the next Key-Value pair. For implementations that perform some filtering or complex
-logic, this may result in more than one Key-Value pair being inspected. This method alters
-some internal state that is exposed via the `hasTop`, `getTopKey`, and `getTopValue` methods.
-
-The result of this method is commonly caching a Key-Value pair which `getTopKey` and `getTopValue`
-can later return. While there is another Key-Value pair to return, `hasTop` should return true.
-If there are no more Key-Value pairs to return from this Iterator since the last call to
-`seek`, `hasTop` should return false.
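-
-The contract between `seek`, `next` and `hasTop` can be modelled with a small filtering iterator. The sketch below is a
-simplification under stated assumptions: keys are plain Strings in a `TreeMap` rather than Accumulo `Key`s, the range is
-inclusive on both ends, and `FilteringSketch` is a hypothetical class, not part of the Accumulo API.
-
```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

// Simplified model of a filtering iterator; keys are Strings, not Accumulo Keys.
final class FilteringSketch {
  private final TreeMap<String, String> data;  // stands in for the "source" iterator
  private final Predicate<String> accept;      // hypothetical filter criterion on values
  private Iterator<Map.Entry<String, String>> source;
  private Map.Entry<String, String> top;       // cached pair exposed by getTopKey/getTopValue

  FilteringSketch(TreeMap<String, String> data, Predicate<String> accept) {
    this.data = data;
    this.accept = accept;
  }

  // Position on the first accepted pair in [startKey, endKey], or on nothing.
  void seek(String startKey, String endKey) {
    source = data.subMap(startKey, true, endKey, true).entrySet().iterator();
    findTop();
  }

  // Advance to the next accepted pair; this may inspect several source pairs.
  void next() {
    findTop();
  }

  boolean hasTop() {
    return top != null;
  }

  String getTopKey() {
    return top.getKey();
  }

  String getTopValue() {
    return top.getValue();
  }

  // Consume source pairs until one is accepted or the range is exhausted.
  private void findTop() {
    top = null;
    while (source.hasNext()) {
      Map.Entry<String, String> candidate = source.next();
      if (accept.test(candidate.getValue())) {
        top = candidate;
        return;
      }
    }
  }

  // Drive the iterator the way a caller would: seek once, then loop on hasTop/next.
  static List<String> scan(FilteringSketch iter, String startKey, String endKey) {
    List<String> keys = new ArrayList<>();
    iter.seek(startKey, endKey);
    while (iter.hasTop()) {
      keys.add(iter.getTopKey());
      iter.next();
    }
    return keys;
  }

  public static void main(String[] args) {
    TreeMap<String, String> table = new TreeMap<>();
    table.put("a", "skip");
    table.put("b", "keep");
    table.put("c", "skip");
    table.put("d", "keep");
    table.put("e", "keep");
    FilteringSketch iter = new FilteringSketch(table, v -> v.equals("keep"));
    System.out.println(scan(iter, "a", "d")); // prints [b, d]
  }
}
```
-
-Note that `seek` already skips the rejected pair at "a" before it returns, and that "e" is never emitted because it
-falls outside the range passed to `seek`.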
-
-==== `hasTop`
-
-The `hasTop` method is similar to the `hasNext` method on a Java Iterator in that it informs
-the caller if there is a Key-Value pair to be returned. If there is no pair to return, this method
-should return false. Like a Java Iterator, multiple calls to `hasTop` (without calling `next`) should not
-alter the internal state of the Iterator.
-
-==== `getTopKey` and `getTopValue`
-
-These methods simply return the current Key-Value pair for this iterator. If `hasTop` returns true,
-both of these methods should return non-null objects. If `hasTop` returns false, it is undefined
-what these methods should return. Like `hasTop`, multiple calls to these methods should not alter
-the state of the Iterator.
-
-Users should take caution when either
-
-1. caching the Key/Value from `getTopKey`/`getTopValue`, for use after calling `next` on the source iterator.
-In this case, the cached Key/Value object is aliased to the reference returned by the source iterator.
-Iterators may reuse the same Key/Value object in a `next` call for performance reasons, changing the data
-that the cached Key/Value object references and resulting in a logic bug.
-2. modifying the Key/Value from `getTopKey`/`getTopValue`. If the source iterator reuses data stored in the Key/Value,
-then the source iterator may use the modified data that the Key/Value references. This may or may not result in a logic bug.
-
-In both cases, copying the Key/Value's data into a new object ensures iterator correctness. If neither case applies,
-it is safe to not copy the Key/Value. The general guideline is to be aware of who else may use Key/Value objects
-returned from `getTopKey`/`getTopValue`.
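-
-The first hazard can be reproduced with a toy source that reuses one mutable object for every pair. `ReusingSource` below
-is a hypothetical model built on a `StringBuilder`, not an Accumulo class; it only illustrates why a cached reference
-goes stale after `next`.
-
```java
import java.util.ArrayList;
import java.util.List;

// Simplified model: a source that reuses one mutable object for every pair,
// as an iterator may do for performance. Not an Accumulo class.
final class AliasingSketch {

  static final class ReusingSource {
    private final String[] rows;
    private final StringBuilder top = new StringBuilder(); // reused on every next()
    private int index = -1;

    ReusingSource(String... rows) {
      this.rows = rows;
      next();
    }

    boolean hasTop() {
      return index < rows.length;
    }

    StringBuilder getTopKey() {
      return top;
    }

    void next() {
      index++;
      top.setLength(0);
      if (index < rows.length) {
        top.append(rows[index]);
      }
    }
  }

  // Buggy: stores the reference, which the source overwrites on next().
  static List<String> collectAliased(ReusingSource source) {
    List<CharSequence> cached = new ArrayList<>();
    while (source.hasTop()) {
      cached.add(source.getTopKey());
      source.next();
    }
    List<String> out = new ArrayList<>();
    for (CharSequence c : cached) {
      out.add(c.toString());
    }
    return out;
  }

  // Correct: copies the data into a new object before advancing the source.
  static List<String> collectCopied(ReusingSource source) {
    List<String> out = new ArrayList<>();
    while (source.hasTop()) {
      out.add(source.getTopKey().toString());
      source.next();
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(collectAliased(new ReusingSource("r1", "r2", "r3"))); // prints [, , ]
    System.out.println(collectCopied(new ReusingSource("r1", "r2", "r3")));  // prints [r1, r2, r3]
  }
}
```
-
-In an actual Iterator the equivalent copy would typically be made with the `Key` and `Value` copy constructors before
-calling `next` on the source.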
-
-==== `deepCopy`
-
-The `deepCopy` method is similar to the `clone` method from the Java `Cloneable` interface.
-Implementations of this method should return a new object of the same type as the Accumulo Iterator
-instance it was called on. Any internal state from the instance `deepCopy` was called
-on should be carried over to the returned copy. The returned copy should be ready to have
-`seek` called on it. The SortedKeyValueIterator interface guarantees that `init` will be called on
-an iterator before `deepCopy` and that `init` will not be called on the iterator returned by
-`deepCopy`.
-
-Typically, implementations of `deepCopy` call a copy-constructor which will initialize
-internal data structures. As with `init`, it is common for the `IteratorEnvironment`
-argument to be ignored as most Iterator implementations can be written without the explicit
-information the environment provides.
-
-In the analogy of a series of Iterators representing a tree, `deepCopy` resembles the copy routine from
-early programming assignments which implement their own tree data structures: `deepCopy` calls
-`deepCopy` on its sources (the children), copies itself, attaches the copies of the children, and
-then returns the new copy.
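-
-The recursive shape of `deepCopy` can be sketched as follows. `Node` is a hypothetical stand-in for an Iterator with a
-single source, and `option` stands in for whatever state `init` established; neither name comes from the Accumulo API.
-
```java
// Simplified model of an iterator stack; not the Accumulo interface.
final class DeepCopySketch {

  static final class Node {
    final String option; // stands in for configuration established by init
    final Node source;   // the iterator "below" this one, or null at the bottom

    Node(String option, Node source) {
      this.option = option;
      this.source = source;
    }

    // Copy the source (the child) first, then attach it to a new copy of this node.
    Node deepCopy() {
      Node copiedSource = (source == null) ? null : source.deepCopy();
      return new Node(option, copiedSource);
    }

    // Number of iterators in the stack, counted from this node down.
    int depth() {
      return (source == null) ? 1 : 1 + source.depth();
    }
  }

  public static void main(String[] args) {
    Node stack = new Node("top", new Node("middle", new Node("bottom", null)));
    Node copy = stack.deepCopy();
    System.out.println(copy.depth());                // prints 3
    System.out.println(copy != stack);               // prints true
    System.out.println(copy.source != stack.source); // prints true
  }
}
```
-
-Because every level is copied, the returned stack shares no iterator instances with the original, so each can be
-seeked independently.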
-
-=== TabletServer invocation of Iterators
-
-The following code is a general outline for how TabletServers invoke Iterators.
-
-[source,java]
-----
- List<KeyValue> batch = new ArrayList<>();
- Range range = getRangeFromClient();
- while (!overSizeLimit(batch)) {
-   SortedKeyValueIterator<Key,Value> source = getSystemIterator();
-
-   // build the "stack": each user iterator reads from the one below it
-   for (String clzName : getUserIterators()) {
-     Class<?> clz = Class.forName(clzName);
-     SortedKeyValueIterator<Key,Value> iter = (SortedKeyValueIterator<Key,Value>) clz.newInstance();
-     iter.init(source, opts, env);
-     source = iter;
-   }
-
-   // read a batch of data to return to client
-   // the last iterator, the "top"
-   SortedKeyValueIterator<Key,Value> topIter = source;
-   topIter.seek(range, ...);
-
-   while (topIter.hasTop() && !overSizeLimit(batch)) {
-     Key key = topIter.getTopKey();
-     Value val = topIter.getTopValue();
-     batch.add(new KeyValue(key, val));
-     if (systemDataSourcesChanged()) {
-       // code does not show isolation case, which will
-       // keep using same data sources until a row boundary is hit
-       range = new Range(key, false, range.getEndKey(), range.isEndKeyInclusive());
-       break;
-     }
-     topIter.next();
-   }
-
-   // nothing left in the range: stop rebuilding the stack
-   if (!topIter.hasTop()) {
-     break;
-   }
- }
- // return batch of key values to client
-----
-
-Additionally, the less obvious "re-seek" case can be outlined as the following:
-
-[source,java]
-----
- // Given the above
- List<KeyValue> batch = getNextBatch();
-
- // Store off lastKeyReturned for this client
- lastKeyReturned = batch.get(batch.size() - 1).getKey();
-
- // thread goes away (client stops asking for the next batch).
-
- // Eventually client comes back
- // Setup as before...
-
- Range userRange = getRangeFromUser();
- Range actualRange = new Range(lastKeyReturned, false,
- userRange.getEndKey(), userRange.isEndKeyInclusive());
-
- // Use the actualRange, not the user provided one
- topIter.seek(actualRange, ...);
-----
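-
-The effect of the re-seek outlined above can be checked with a self-contained model. This sketch assumes String keys in
-a `TreeMap` and a user range that is inclusive on both ends; `ReseekSketch` and its methods are hypothetical names, not
-TabletServer code.
-
```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Simplified model of batched scanning with re-seeks; keys are Strings, not Accumulo Keys.
final class ReseekSketch {

  // Read at most batchSize keys from the range, which always ends at endKey (inclusive).
  static List<String> nextBatch(TreeMap<String, String> table, String startKey,
      boolean startInclusive, String endKey, int batchSize) {
    List<String> batch = new ArrayList<>();
    for (String key : table.subMap(startKey, startInclusive, endKey, true).keySet()) {
      if (batch.size() == batchSize) {
        break;
      }
      batch.add(key);
    }
    return batch;
  }

  // Scan the whole range, tearing down and re-seeking after every batch.
  static List<String> scanInBatches(TreeMap<String, String> table, String startKey,
      String endKey, int batchSize) {
    List<String> all = new ArrayList<>();
    String resumeKey = startKey;
    boolean inclusive = true; // only the very first seek includes its start key
    while (true) {
      List<String> batch = nextBatch(table, resumeKey, inclusive, endKey, batchSize);
      if (batch.isEmpty()) {
        return all;
      }
      all.addAll(batch);
      resumeKey = batch.get(batch.size() - 1); // last key returned to the client
      inclusive = false;                       // resume after it, non-inclusive
    }
  }

  public static void main(String[] args) {
    TreeMap<String, String> table = new TreeMap<>();
    for (String k : new String[] {"a", "b", "c", "d", "e"}) {
      table.put(k, "v");
    }
    System.out.println(scanInBatches(table, "a", "e", 2)); // prints [a, b, c, d, e]
  }
}
```
-
-Every key is returned exactly once even though each batch starts from a different range, which is why an Iterator
-cannot treat the start of the range it is given as the start of the client's scan.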
-
-
-=== Isolation
-
-Accumulo provides a feature which clients can enable to prevent the viewing of partially
-applied mutations within the context of rows. If a client is submitting multiple column
-updates to rows at a time, isolation would ensure that a client would either see all of
-the updates made to that row or none of the updates (until they are all applied).
-
-When using Isolation, there are additional concerns in iterator design. A scan-time iterator in Accumulo
-reads from a set of data sources. While an iterator is reading data, it has an isolated view. However, after it returns a
-key/value, it is possible that Accumulo may switch data sources and re-seek the iterator. This is done so that resources
-may be reclaimed. When the user does not request isolation, this can occur after any key is returned. When a user enables
-Isolation, this will only occur after a new row is returned, in which case it will re-seek to the very beginning of the
-next possible row.
-
-=== Abstract Iterators
-
-A number of abstract implementations of Iterators are provided to allow for faster creation
-of common patterns. The most commonly used abstract implementations are the `Filter` and
-`Combiner` classes. When possible, these classes should be used instead of writing an Iterator from scratch, as they have been
-thoroughly tested inside Accumulo itself.
-
-==== Filter
-
-The `Filter` abstract Iterator provides a very simple implementation which allows implementations
-to define whether or not a Key-Value pair should be returned via an `accept(Key, Value)` method.
-
-Filters are extremely simple to implement; however, when the implementation is filtering a
-large percentage of Key-Value pairs with respect to the total number of pairs examined,
-it can be very inefficient. For example, if a Filter implementation can determine after examining
-part of the row that no other pairs in this row will be accepted, there is no mechanism to
-efficiently skip the remaining Key-Value pairs. Concretely, take a row which is comprised of
-1000 Key-Value pairs. After examining the first 10 Key-Value pairs, it is determined
-that no other Key-Value pairs in this row will be accepted. The Filter must still examine each of the
-remaining 990 Key-Value pairs in this row. Another way to express this deficiency is that
-Filters have no means to leverage the `seek` method to efficiently skip large portions
-of Key-Value pairs.
-
-As such, the `Filter` class functions well for filtering small amounts of data, but is
-inefficient for filtering large amounts of data. The decision to use a `Filter` strongly
-depends on the use case and distribution of data being filtered.
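-
-The cost difference described above can be made concrete by counting how many pairs each approach examines. The sketch
-below is a model only: keys are Strings of the hypothetical form `rowNN/CCCC`, `LIMIT` encodes the assumption that
-nothing after the first 10 pairs of a row can match, and the "seek" is a `TreeMap` lookup rather than a call on a source Iterator.
-
```java
import java.util.TreeMap;

// Simplified comparison of examining every pair versus seeking past a row.
final class SkipSketch {
  static final int LIMIT = 10; // assumption: only the first 10 pairs of a row can match

  static TreeMap<String, String> buildTable(int rows, int colsPerRow) {
    TreeMap<String, String> table = new TreeMap<>();
    for (int r = 0; r < rows; r++) {
      for (int c = 0; c < colsPerRow; c++) {
        table.put(String.format("row%02d/%04d", r, c), "v");
      }
    }
    return table;
  }

  // Filter-style: every pair is examined, even those known to be rejected.
  static int examineAll(TreeMap<String, String> table) {
    int examined = 0;
    for (String ignored : table.keySet()) {
      examined++;
    }
    return examined;
  }

  // Seek-style: after LIMIT pairs of a row, jump to the first key of the next row.
  static int examineWithSeek(TreeMap<String, String> table) {
    int examined = 0;
    String key = table.isEmpty() ? null : table.firstKey();
    while (key != null) {
      String rowPrefix = key.substring(0, key.indexOf('/') + 1);
      int seenInRow = 0;
      while (key != null && key.startsWith(rowPrefix) && seenInRow < LIMIT) {
        examined++;
        seenInRow++;
        key = table.higherKey(key);
      }
      if (key != null && key.startsWith(rowPrefix)) {
        // "seek" forward past the rest of this row: '0' sorts immediately after '/'
        String row = rowPrefix.substring(0, rowPrefix.length() - 1);
        key = table.ceilingKey(row + "0");
      }
    }
    return examined;
  }

  public static void main(String[] args) {
    TreeMap<String, String> table = buildTable(2, 1000);
    System.out.println(examineAll(table));      // prints 2000
    System.out.println(examineWithSeek(table)); // prints 20
  }
}
```
-
-An Iterator that needs this behavior has to implement the seeking itself, since `accept(Key, Value)` is only ever
-handed one pair at a time.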
-
-==== Combiner
-
-The `Combiner` class is another common abstract Iterator. Similar to the `Combiner` interface
-defined in Hadoop's MapReduce framework, implementations of this abstract class reduce
-multiple Values for different versions of a Key (Keys which only differ by timestamps) into one Key-Value pair.
-Combiners provide a simple way to implement common operations like summation and
-aggregation without the need to implement the entire Accumulo Iterator interface.
-
-One important consideration when choosing to design a Combiner is that the "reduction" operation
-is easiest to implement correctly when it is associative and commutative. Operations which do not meet
-these criteria can be implemented; however, the implementation can be difficult.
-
-A second consideration is that a Combiner is not guaranteed to see every Key-Value pair
-which differs only by timestamp every time it is invoked. For example, if there are 5 Key-Value
-pairs in a table which only differ by the timestamps 1, 2, 3, 4, and 5, it is not guaranteed that
-every invocation of the Combiner will see 5 timestamps. One invocation might see the Values for
-Keys with timestamp 1 and 4, while another invocation might see the Values for Keys with the
-timestamps 1, 2, 4 and 5.
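-
-Both considerations can be illustrated with plain arithmetic. In the sketch below the values 10, 20, 30, 40 and 50 are
-made-up data for timestamps 1 through 5, and `sum` and `average` are hypothetical reduction functions rather than
-Accumulo `Combiner` implementations. One invocation sees only timestamps 1 and 4 and writes back a partial result, and a
-later invocation combines that partial result with the remaining values.
-
```java
import java.util.Arrays;
import java.util.List;

// Simplified model of combining partial results; not the Accumulo Combiner class.
final class CombinerSketch {

  // Associative and commutative: partial results can be combined safely.
  static double sum(List<Double> values) {
    double total = 0;
    for (double v : values) {
      total += v;
    }
    return total;
  }

  // Not associative: an average of averages is generally not the overall average.
  static double average(List<Double> values) {
    return sum(values) / values.size();
  }

  public static void main(String[] args) {
    List<Double> all = Arrays.asList(10.0, 20.0, 30.0, 40.0, 50.0);

    // First invocation sees only the values for timestamps 1 and 4.
    double partialSum = sum(Arrays.asList(10.0, 40.0));
    double partialAvg = average(Arrays.asList(10.0, 40.0));

    // A later invocation sees the partial result plus timestamps 2, 3 and 5.
    System.out.println(sum(Arrays.asList(partialSum, 20.0, 30.0, 50.0)));     // prints 150.0
    System.out.println(sum(all));                                             // prints 150.0
    System.out.println(average(Arrays.asList(partialAvg, 20.0, 30.0, 50.0))); // prints 31.25
    System.out.println(average(all));                                         // prints 30.0
  }
}
```
-
-Summation gives the same answer regardless of how the values were grouped, while the naive average does not. A
-reduction like an average would need to carry extra state (for example a running sum and count) in its Value to stay correct.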
-
-Finally, when configuring an Accumulo table to use a Combiner, be sure to disable the Versioning Iterator or set the
-Combiner at a priority less than the Versioning Iterator (the Versioning Iterator is added at a priority of 20 by default). The
-Versioning Iterator will filter out multiple Key-Value pairs that differ only by timestamp and return only the Key-Value
-pair that has the largest timestamp.
-
-=== Best practices
-
-Because of the flexibility that the `SortedKeyValueIterator` interface provides, it does not directly disallow
-many implementations which are poor design decisions. The following are some common recommendations to
-follow and pitfalls to avoid in Iterator implementations.
-
-==== Avoid special logic encoded in Ranges
-
-Commonly, granular Ranges that a client passes to an Iterator from a `Scanner` or `BatchScanner` are unmodified.
-If a `Range` falls within the boundaries of a Tablet, an Iterator will often see that same Range in the
-`seek` method. However, there is no guarantee that the `Range` will remain unaltered from client to server. As such, Iterators
-should *never* make assumptions about the current state/context based on the `Range`.
-
-The common failure condition is referred to as a "re-seek". In the context of a Scan, TabletServers construct the
-"stack" of Iterators and batch up Key-Value pairs to send back to the client. When a sufficient number of Key-Value
-pairs are collected, it is common for the Iterators to be "torn down" until the client asks for the next batch of
-Key-Value pairs. This is done by the TabletServer to add fairness in ensuring one Scan does not monopolize the available
-resources. When the client asks for the next batch, the implementation modifies the original Range so that servers know
-the point to resume the iteration (to avoid returning duplicate Key-Value pairs). Specifically, the new Range is created
-from the original but is shortened by setting the startKey of the original Range to the Key last returned by the Scan,
-non-inclusive.
-
-==== `seek`'ing backwards
-
-The ability for an Iterator to "skip over" large blocks of Key-Value pairs is a major tenet behind Iterators.
-`seek`'ing when it is known that there is a collection of Key-Value pairs which can be ignored can
-greatly increase the speed of a scan, as many Key-Value pairs do not have to be deserialized and processed.
-
-While the `seek` method provides the `Range` that should be used to `seek` the underlying source Iterator,
-there is no guarantee that the implementing Iterator uses that `Range` to perform the `seek` on its
-"source" Iterator. As such, it is possible to seek to any `Range` and the interface has no assertions
-to prevent this from happening.
-
-Since Iterators are allowed to `seek` to arbitrary Keys, Iterators can also create infinite loops
-inside Scans that will repeatedly read the same data without end. If an Iterator constructs a Range to `seek` to, it should
-derive that Range from the one it was given rather than construct a completely new Range, as the latter allows for bugs to be introduced which will break Accumulo.
-
-Thus, `seek`'s should always be thought of as making "forward progress" in the view of the total iteration. The
-`startKey` of a `Range` should always be greater than the current Key seen by the Iterator while the `endKey` of the
-`Range` should always retain the original `endKey` (and `endKey` inclusivity) of the last `Range` seen by your
-Iterator's implementation of seek.
-
-==== Take caution in constructing new data in an Iterator
-
-Implementations of Iterators might be tempted to open BatchWriters inside of an Iterator as a means
-to implement triggers for writing additional data outside of their client application. The lifecycle of an Iterator
-is *not* managed in such a way that guarantees that this is safe or efficient. Specifically, there
-is no way to guarantee that the internal ThreadPool inside of the BatchWriter is closed (and the thread(s)
-are reaped) without calling the close() method. `close`'ing and recreating a `BatchWriter` after every
-Key-Value pair is also too expensive to be considered an option.
-
-The only safe way to generate additional data in an Iterator is to alter the current Key-Value pair.
-For example, the `WholeRowIterator` serializes all of the Key-Value pairs that fall within each
-row. A safe way to generate more data in an Iterator would be to construct an Iterator that is
-"higher" (at a larger priority) than the `WholeRowIterator`, that is, the Iterator receives the Key-Value pairs which are
-a serialization of many Key-Value pairs. The custom Iterator could deserialize the pairs, compute
-some function, and add a new Key-Value pair to the original collection, re-serializing the collection
-of Key-Value pairs back into a single Key-Value pair.
-
-In any other situation, there is likely no guarantee that the caller (a Scan or a Compaction) will
-always see all of the intended data that is generated.
-
-=== Final things to remember
-
-Some simple recommendations/points to keep in mind:
-
-==== Method call order
-
-On an instance of an Iterator: `init` is always called before `seek`, `seek` is always called before `hasTop`, and
-`getTopKey` and `getTopValue` will not be called if `hasTop` returns false.
-
-==== Teardown
-
-As mentioned, instances of Iterators may be torn down inside of the server transparently. When a complex
-collection of iterators is performing some advanced functionality, they will not be torn down until a Key-Value
-pair is returned out of the "stack" of Iterators (and added into the batch of Key-Values to be returned
-to the caller). Being torn down is equivalent to a new instance of the Iterator being created and `deepCopy`
-being called on the new instance with the old instance provided as the argument to `deepCopy`. References
-to the old instance are removed and the object is lazily garbage collected by the JVM.
-
-=== Compaction-time Iterators
-
-When Iterators are configured to run during compactions, at the `minc` or `majc` scope, these Iterators sometimes need
-to make different assumptions than those which only operate at scan time. Iterators won't see the delete entries; however,
-Iterators will not necessarily see all of the Key-Value pairs in every invocation. Because compactions often do not rewrite
-all files (only a subset of them), the logic must take this into consideration.
-
-For example, a Combiner that runs over data during compactions might not see all of the values for a given Key. The
-Combiner must recognize this and not perform any function that would be incorrect due
-to the missing values.
http://git-wip-us.apache.org/repos/asf/accumulo-website/blob/7cc70b2e/docs/master/iterator_test_harness.md
----------------------------------------------------------------------
diff --git a/docs/master/iterator_test_harness.md b/docs/master/iterator_test_harness.md
deleted file mode 100644
index 91ae53a..0000000
--- a/docs/master/iterator_test_harness.md
+++ /dev/null
@@ -1,110 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one or more
-// contributor license agreements. See the NOTICE file distributed with
-// this work for additional information regarding copyright ownership.
-// The ASF licenses this file to You under the Apache License, Version 2.0
-// (the "License"); you may not use this file except in compliance with
-// the License. You may obtain a copy of the License at
-//
-// http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-
-== Iterator Testing
-
-Iterators, while extremely powerful, are notoriously difficult to test. While the API defines
-the methods an Iterator must implement and each method's functionality, the actual invocation
-of these methods by Accumulo TabletServers can be surprisingly difficult to mimic in unit tests.
-
-The Apache Accumulo "Iterator Test Harness" is designed to provide a generalized testing framework
-for all Accumulo Iterators to leverage to identify common pitfalls in user-created Iterators.
-
-=== Framework Use
-
-The harness provides an abstract class for use with JUnit4. Users must define the following for this
-abstract class:
-
- * A `SortedMap` of input data (`Key`-`Value` pairs)
- * A `Range` to use in tests
- * A `Map` of options (`String` to `String` pairs)
- * A `SortedMap` of output data (`Key`-`Value` pairs)
- * A list of `IteratorTestCase`s (these can be automatically discovered)
-
-The majority of effort a user must make is in creating the input dataset and the expected
-output dataset for the iterator being tested.
-
-=== Normal Test Outline
-
-Most iterator tests will follow the given outline:
-
-[source,java]
-----
-import java.util.List;
-import java.util.Map;
-import java.util.SortedMap;
-
-import org.apache.accumulo.core.data.Key;
-import org.apache.accumulo.core.data.Range;
-import org.apache.accumulo.core.data.Value;
-import org.apache.accumulo.iteratortest.IteratorTestCaseFinder;
-import org.apache.accumulo.iteratortest.IteratorTestInput;
-import org.apache.accumulo.iteratortest.IteratorTestOutput;
-import org.apache.accumulo.iteratortest.junit4.BaseJUnit4IteratorTest;
-import org.apache.accumulo.iteratortest.testcases.IteratorTestCase;
-import org.junit.runners.Parameterized.Parameters;
-
-public class MyIteratorTest extends BaseJUnit4IteratorTest {
-
- @Parameters
- public static Object[][] parameters() {
- final IteratorTestInput input = createIteratorInput();
- final IteratorTestOutput output = createIteratorOutput();
- final List<IteratorTestCase> testCases = IteratorTestCaseFinder.findAllTestCases();
- return BaseJUnit4IteratorTest.createParameters(input, output, testCases);
- }
-
- private static SortedMap<Key,Value> INPUT_DATA = createInputData();
- private static SortedMap<Key,Value> OUTPUT_DATA = createOutputData();
-
- private static SortedMap<Key,Value> createInputData() {
- // TODO -- implement this method
- }
-
- private static SortedMap<Key,Value> createOutputData() {
- // TODO -- implement this method
- }
-
- private static IteratorTestInput createIteratorInput() {
- final Map<String,String> options = createIteratorOptions();
- final Range range = createRange();
- return new IteratorTestInput(MyIterator.class, options, range, INPUT_DATA);
- }
-
- private static Map<String,String> createIteratorOptions() {
- // TODO -- implement this method
- // Tip: Use INPUT_DATA if helpful in generating output
- }
-
- private static Range createRange() {
- // TODO -- implement this method
- }
-
- private static IteratorTestOutput createIteratorOutput() {
- return new IteratorTestOutput(OUTPUT_DATA);
- }
-
-}
-----
-
-=== Limitations
-
-While the provided `IteratorTestCase`s should exercise common edge-cases in user iterators,
-there are still many limitations to the existing test harness. Some of them are:
-
- * Only a single iterator can be specified, not many (a "stack")
- * There is no control over the `IteratorEnvironment` provided to tests
- * Delete keys cannot be exercised (especially with major compactions that do not include all files)
-
-These are left as future improvements to the harness.