Using Hadoop as an Archive Solution

In the Imixs-Workflow project we are currently working on a new solution for archiving business data in a big data store. The main goal is to retain business data over a very long period of time (10 to 30 years). We are therefore evaluating different big data solutions and concepts that can be integrated with the Imixs-Workflow system.

The Requirements

The requirements for the Imixs-Workflow archive solution can be summarized as follows:

A single workitem is represented and stored in an XML format

A workitem typically has a size between 1 and 10 MB, including workflow metadata and optional documents

A single workflow instance can handle more than 10 million workitems in one year

Workitems need to be stored over many years (10-30 years).

The data store should support replication over multiple server nodes in a cluster.

A REST API is needed to write and read the XML files

Workitems are generated by the Imixs-Workflow system at irregular intervals

The workflow engine also needs to read the data of a workitem at irregular intervals

The solution needs to guarantee the data integrity of all files over a long period of time

Hadoop and the small-file-problem

Apache Hadoop is one of the most promising systems in this area. It is a framework for scalable, distributed software solutions. It is built around the MapReduce programming model and includes a distributed file system (HDFS), which makes it possible to run intensive and complex processing jobs on large amounts of data very efficiently.

A process instance controlled by the Imixs-Workflow engine is called a “workitem”. A workitem contains the workflow and business data of a single process instance and typically has a size between 1 and 5 MB when serialized into an XML file. These files, which may also contain documents, seem relatively large at first glance, but in Hadoop they fall into the category of ‘small files’. A small file is one that is significantly smaller than the HDFS block size, which is typically 64 or 128 MB. It is not so much the size of the files that causes the problem as their quantity. In a workflow application, several million business processes can be managed at the same time. With the additional snapshot workitems, which are one of the new core concepts of the Imixs-Workflow engine, tens of millions of files could be created in one year. Managing a single file requires only a small amount of memory in Hadoop (a commonly quoted rule of thumb is about 150 bytes per file or block object), but with many files the metadata held by the data management layer, the ‘namenode’, can grow to several gigabytes. This is one of the reasons why Hadoop is not well suited to storing so many small files.
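To get a feeling for the numbers, the namenode memory can be estimated with some simple arithmetic. The following is only a rough sketch: the 150 bytes per object is the commonly quoted rule of thumb, not a measured value, and the function names are made up for this illustration. It assumes that every workitem fits into a single block, so each file costs one file object plus one block object.

```python
# Rough namenode heap estimate for many small files.
# Assumption: about 150 bytes of heap per namespace object (file or block).
# This is a rule of thumb, not a measurement of a real cluster.
BYTES_PER_OBJECT = 150


def namenode_heap_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Each file costs one file object plus one object per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


def estimate_gib(files_per_year: int, years: int) -> float:
    """Estimated namenode heap in GiB after the given number of years."""
    return namenode_heap_bytes(files_per_year * years) / 2**30


if __name__ == "__main__":
    # 10 million workitems per year, as stated in the requirements
    for years in (1, 10, 30):
        print(f"{years:>2} years: {estimate_gib(10_000_000, years):.1f} GiB")
```

With 10 million workitems per year this gives roughly 2.8 GiB of namenode heap after the first year, and the value grows linearly with every further year of retention.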

For these reasons we analyzed different solutions to the small-file problem of Hadoop, which I want to briefly summarize in the following sections.

Multiple small-files stored in one sequence file:

Storing multiple small files in one sequence file is one of the recommended solutions to the small-file problem. The workitems created and processed by the Imixs-Workflow engine can easily be appended to a sequence file. To read a workitem later without reading the whole sequence file, the client needs to determine the offset at which the workitem was stored. With the offset and the size of the workitem, the data can then be accessed directly, for example via the WebHDFS OPEN operation, which accepts offset and length parameters.
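The offset bookkeeping can be sketched in a few lines. This is a simplified illustration, not the Imixs-Workflow implementation: an in-memory byte buffer stands in for the container file (a real Hadoop SequenceFile adds record headers and sync markers), and the host name, port and path in the usage example are hypothetical. Only the `op=OPEN`, `offset` and `length` query parameters are taken from the WebHDFS REST API.

```python
import io
from typing import Tuple
from urllib.parse import quote, urlencode


def append_workitem(container: io.BytesIO, xml: bytes) -> Tuple[int, int]:
    """Append one serialized workitem and return its (offset, length)."""
    offset = container.seek(0, io.SEEK_END)  # position of the current end
    container.write(xml)
    return offset, len(xml)


def read_workitem(container: io.BytesIO, offset: int, length: int) -> bytes:
    """Read exactly one workitem without scanning the whole container."""
    container.seek(offset)
    return container.read(length)


def webhdfs_open_url(base: str, path: str, offset: int, length: int) -> str:
    """Build a WebHDFS OPEN URL for a byte range of a file."""
    query = urlencode({"op": "OPEN", "offset": offset, "length": length})
    return f"{base}/webhdfs/v1{quote(path)}?{query}"


if __name__ == "__main__":
    container = io.BytesIO()
    append_workitem(container, b"<workitem id='1'/>")
    second = append_workitem(container, b"<workitem id='2'/>")
    print(read_workitem(container, *second))
    # Hypothetical namenode address and archive path
    print(webhdfs_open_url("http://namenode.example:9870",
                           "/archive/2018.seq", *second))
```

The workflow engine only has to remember the returned pair (or the complete URL) for each workitem in order to fetch it again later with a single ranged read.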

Because the Imixs-Workflow engine stores the URL of an archived workitem, this is an easy way to manage the archive data even in large sequence files. If multiple threads try to append data in parallel, however, the problem becomes considerably more complex. Nevertheless, the sequence file is one of the possible solutions we are pursuing.

Multiple small-files in a Hadoop Archive (HAR):

Another solution which is often recommended is to pack small files into a Hadoop Archive (HAR) file. This solution is even more difficult to implement in our case. As explained before, the Imixs-Workflow engine writes data at irregular intervals. This means that a separate archive job is needed to pack the workitem files into an archive file (HAR). This scheduler needs to be implemented as a separate component running in the Hadoop system. It can, for example, archive and delete files older than a predefined period of time. Similar to the sequence-file solution, this reduces the number of small files significantly. The downside is that the Imixs-Workflow engine needs to be made aware of the new location of a single workitem file, because the packing scheduler is decoupled from the main archive job of the workflow engine. To access a ‘packed’ workitem file, the offset and size in the HAR file need to be transferred back to the workflow engine asynchronously. As a result, the complexity of the overall system increases unreasonably. To reduce this problem, the scheduler may create a kind of index file for each newly created HAR file. The index can then be used by the Imixs-Workflow engine to first look up the offset and size of a workitem inside a HAR file. This requires up to two read operations to access a workitem from the archive. Another problem is that a single workitem can now either still be stored as a small file or already be part of a HAR file in the Hadoop system, so the access method becomes quite tricky to implement.
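The two-step access described above could look roughly like the following sketch. Everything here is hypothetical and only meant to illustrate the lookup logic: the index is modelled as a plain dictionary, and the `Location` type, the `locate` function and the directory name are invented for this example.

```python
from typing import Dict, NamedTuple, Optional


class Location(NamedTuple):
    path: str                     # file to read from
    offset: int = 0               # byte offset inside that file
    length: Optional[int] = None  # None means "read the whole file"


def locate(workitem_id: str,
           har_index: Dict[str, Location],
           small_file_dir: str = "/archive/incoming") -> Location:
    """Resolve where a workitem currently lives.

    First read: consult the index written by the packing scheduler.
    Second read: the caller fetches the data at the returned location.
    """
    packed = har_index.get(workitem_id)
    if packed is not None:
        return packed  # already packed into an archive file
    # Not packed yet: the workitem is still stored as a single small file
    return Location(f"{small_file_dir}/{workitem_id}.xml")


if __name__ == "__main__":
    index = {"a": Location("/archive/2018-01.har", 1024, 2048)}
    print(locate("a", index))  # packed workitem
    print(locate("b", index))  # still a small file
```

Even in this reduced form the client has to handle two different storage states for the same workitem, which is exactly the additional complexity mentioned above.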

Intel-bigdata/SSM:

HDFS-7240 or Ozone:

HDFS-7240, also known as “Ozone”, looks like the most promising solution for now. It looks to me as if Ozone is the missing piece in the Hadoop project. Although it is not yet ready for use in production, we will follow this project.

Other solutions

Besides Hadoop there are of course other possible solutions. However, since I am basically convinced of Hadoop, I will not make a fundamental change in architecture for now. But I want to list some of the alternatives here:

OpenStack Swift

It seems that the object store “OpenStack Swift” solves the small-file problem much better than Hadoop does out of the box. It is certainly worth following this approach.

Cassandra

Apache Cassandra is a free and open-source distributed NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low-latency operations for all clients. Cassandra may be an alternative to the Hadoop system.

MapR-FS

The MapR File System (MapR-FS) is a clustered file system that supports both very large-scale and high-performance uses. MapR-FS supports a variety of interfaces, including conventional read/write file access via NFS and a FUSE interface, as well as the HDFS interface used by many systems such as Apache Hadoop and Apache Spark. In addition to file-oriented access, MapR-FS supports access to tables and message streams using the Apache HBase and Apache Kafka APIs, as well as via a document database interface.

Conclusion:

For the Imixs-Workflow project, the best short-term solution seems to be to start with the sequence file approach. In the medium term we will see whether we can adapt the solution to the Hadoop Ozone project. In the long term we will probably support both approaches.