Abstract

The FP-Growth algorithm is a Frequent Pattern Mining (FPM) algorithm that
has been extensively used to study correlations and patterns in large
scale datasets. While several researchers have designed distributed
memory FP-Growth algorithms, it is pivotal to consider fault tolerant
FP-Growth, which can address the increasing fault rates in large scale
systems.
In this work, we propose a novel
parallel, algorithm-level fault-tolerant FP-Growth algorithm. We
leverage algorithmic properties and MPI advanced features to guarantee an
O(1) space complexity, achieved by using the dataset memory space
itself for checkpointing. We also propose a recovery algorithm that can
use in-memory and disk-based checkpointing, though in many cases the
recovery can be completed without any disk access and without incurring any
memory overhead for checkpointing. We evaluate our FT algorithm on a
large scale InfiniBand cluster with several large datasets using up to
2K cores. Our evaluation
demonstrates excellent efficiency for checkpointing and recovery in
comparison to the disk-based approach. We have also observed a 20x
average speed-up in comparison to Spark, establishing that a well-designed
algorithm can easily outperform a solution based on a general
fault-tolerant programming model.

Machine Learning and Data Mining (MLDM) algorithms are becoming
ubiquitous in analysing large volumes of data produced in science areas
(instrument and simulation data) as well as other areas such as social
networks and financial transactions. Frequent Pattern Mining (FPM) is
an important MLDM algorithm, which is used for finding attributes that
frequently occur together. Due to its high applicability, several FPM
algorithms have been proposed in the literature such as
Apriori [10], Eclat [39],
FP-Growth [17], and GenMax [15].
However, FP-Growth has become extremely popular due to its relatively
small space and time complexity requirements.

To address increasing data volumes, several researchers have proposed
large scale distributed memory FP-Growth
algorithms [26, 13, 20, 11, 35]. One of the challenges
that arise with execution on large-scale parallel systems is the
increased likelihood (and frequency) of faults. Large scale systems
frequently suffer from faults of several types in many components
[32, 33, 29, 34, 7, 30].

Driven by these trends, several recent programming models such as
Hadoop, Spark [38], and MillWheel [3] have
considered fault tolerance to be one of the most important design
considerations. Hadoop achieves fault tolerance by using multiple
replicas of the data structures in permanent storage — possibly
resulting in a significant amount of I/O in the critical path. Spark
addresses this limitation by using Resilient Distributed Datasets
(RDDs), such that in-memory replication can be used for fault tolerance.
However, for very large datasets, in-memory replication is infeasible.
In several cases, Spark considers disk as the backend for checkpointing
— which can again significantly slow-down the computation and increase
data movement. Similarly, MillWheel is used for fault tolerant stream
processing and uses the disk as the backend for checkpointing.
Naturally, an advantage of using a fault tolerant programming model is
that checkpointing and recovery are automated. However, the
performance penalty of a fault tolerant programming model (due to
disk-based checkpointing) or its space overhead (due to in-memory
checkpointing) is unattractive for scaling several MLDM algorithms to
large data volumes and compute scales.

In the context of general-purpose programming systems, recently
proposed methods such as Scalable Checkpoint Restart (SCR)
[25] are able to provide in-memory checkpointing for
multi-level hierarchical file systems using non-blocking methods. SCR
also allows using spare main memory for in-memory checkpointing.
Similarly, other researchers have proposed programming model/runtime
extensions to Charm++, and X10 for supporting fault tolerance. While
these approaches provide non-blocking checkpointing, the overall memory
requirements increase, since the implementations need to use spare
memory for checkpointing. This can very well make the approach
infeasible, especially with weak scaling executions, where spare memory
is scarce.

Figure 1: Pattern of memory requirements of the FP-Tree and the dataset during the FP-Tree build phase. As more transactions are processed, less
memory is required for the dataset; the freed memory can be used for checkpointing.

In this paper, we present an in-depth study of the FP-Growth algorithm
for fault tolerance. Considering its two-pass properties (impact shown
in Figure 1), we propose a
novel algorithm, which requires O(1) space complexity for saving
critical data structures, i.e., FP-Trees, in the memory of other computing
nodes. The proposed algorithm incrementally reuses the memory
allocated by the default algorithm to checkpoint FP-Trees (and possibly a partial
replica of the transactions from other computing nodes), ensuring an O(1)
space overhead for our proposed algorithms. To further minimize the time
overhead of checkpointing, our solution not only uses non-blocking
communication, but also uses MPI Remote Memory Access (MPI-RMA) to
minimize the involvement of the remote process in checkpointing.
By using MPI-RMA and contiguous data structures for implementing our
proposed algorithms, we are able to leverage Remote Direct Memory Access
(RDMA) effectively. We believe that our proposed extensions may be
included with existing solutions such as SCR, where a class of
algorithms may re-use already allocated memory for checkpointing and
recovery.

1.1 Contributions

Specifically, we make the following contributions in the paper:

We propose an O(1) in-memory checkpointing based
FP-Growth algorithm for large scale systems. The
proposed algorithm leverages overlapping communication
with FP-Tree build phase — such that the overhead of
checkpointing is minimized.

We study the limitations of existing programming models (Hadoop
MapReduce, Spark and MillWheel) and implement our algorithms
using Message Passing Interface (MPI) [16, 14].
Specifically, we use MPI-RMA mechanism to checkpoint critical
data structures of FP-Growth asynchronously. With recent
developments in MPI-RMA Fault tolerance [5], it
is possible to use MPI for handling faults, while providing
native performance.

We perform an in-depth evaluation of our proposed approaches
using up to 200M transactions and 2048 cores. Using 100M transactions on 2048 cores, the
checkpointing overhead is ≈ 5%, while the recovery cost
for multiple failures is independent of the number of processes.

We also show the effectiveness of our fault-tolerant FP-Growth
implementation, which outperforms Spark
implementations of the same algorithm with a 20x
average speed-up.

2.1 Frequent Pattern Mining

Frequent Pattern Mining (FPM) algorithms find items that frequently
occur together within transactions of a database. An item or itemset is defined as
frequent if its frequency is higher than a user-defined threshold.
Several FPM algorithms have been proposed in the literature including
Apriori, Eclat, GenMax and FP-Growth. The FP-Growth algorithm is very
popular since it requires only two passes on the dataset, does not
involve candidate generation (unlike Apriori) and provides a compressed
representation of the frequent items using a Frequent Pattern
(FP)-Tree. We specifically focus on designing parallel fault-tolerant
versions of the FP-Growth algorithm, due to its attractive properties.

During the first pass, the FP-Growth algorithm finds items that occur
frequently. In the second pass, it creates an FP-Tree, which is a
modified trie. The first pass requires a simple scan through the
given dataset to find all frequent single items. The FP-Tree creation
step (the second pass) is the most time-consuming part of the overall
calculation [35]. Hence, we focus on making the
FP-Tree creation step of the algorithm fault tolerant, since a longer execution time also implies a higher fault
probability.
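The two passes can be illustrated with a minimal single-process sketch in Python. It is not the implementation evaluated in this paper: the names `FPNode`, `frequent_items`, `build_fp_tree`, and `count_nodes` are hypothetical, and ordering the items of each transaction by descending frequency (ties broken by item name) is a common convention that is assumed here.

```python
from collections import Counter


class FPNode:
    """One node of the FP-Tree (a trie whose nodes carry counts)."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def frequent_items(transactions, threshold):
    """First pass: count items and keep those whose frequency exceeds the threshold."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c for item, c in counts.items() if c > threshold}


def build_fp_tree(transactions, freq):
    """Second pass: insert the frequent items of each transaction into the trie."""
    root = FPNode()
    for t in transactions:
        # Assumed convention: most frequent item first, ties broken by name.
        ordered = sorted((i for i in set(t) if i in freq),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root


def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())
```

For the four transactions `[["a", "b"], ["b", "c"], ["a", "b", "c"], ["b"]]` and a threshold of 1, the eight item occurrences collapse into a tree of four nodes, which shows the compression obtained from shared prefixes.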

2.2 Faults

Large scale systems suffer from several fault types — permanent,
transient, and intermittent. A permanent fault typically requires a
device (such as a compute node) to be replaced. We consider fault
tolerance for permanent process faults in this paper. We assume a fail-stop fault model: once a process is perceived as dead/faulty, it is
presumed unavailable for the rest of the computation.

Since permanent node faults are commonplace in large scale systems,
several researchers have proposed techniques for addressing these
faults. Typically, checkpoint-restart [28, 8]
based methodologies are used. Application-independent methods checkpoint
the entire application space on a permanent disk; however, they have
been shown to scale only on small
systems [8]. Application-dependent methods,
also known as Algorithm Based Fault Tolerance
(ABFT) [12, 23, 31, 1]
methods, reduce this overhead by selectively checkpointing important data
structures periodically. However, depending upon the application
characteristics, checkpointing of critical data structures may still
require disk access.

2.3 Fault Tolerant Programming Models

Recently, there has been a surge of large scale and fault tolerant
functional programming models such as Hadoop, Spark, and MillWheel.
Functional programming, in turn, uses the concept of single
assignment, where every mutation of a variable is recorded, saved (on a
permanent storage/memory of another node), and replayed when a fault
occurs.

Now, let us examine the implication of such a framework for an
algorithm like FP-Growth. Every change or mutation needs to be recorded locally, and such records can eventually be saved to permanent storage. In many cases, the step of saving a new version
of the FP-Tree on the disk is carried out at the end of the Reduce
phase (of the MapReduce implementation). For a two-phase algorithm such
as FP-Growth, where most of the time is spent in the second phase, no
advantage is achieved. Another possible implementation may choose to
divide the overall computation into multiple MapReduce steps. The
checkpointing can be executed at the end of each Reduce phase. However,
now the overall execution time will increase, since saving a new
version will involve writing either to a disk (expensive) or to a
neighbor’s memory. Since the Reduce phase is a blocking phase, the
application will observe a significant overhead of checkpointing, which
will degrade the overall performance. Naturally, a scalable algorithm
should harness best possible performance by using native execution,
while minimizing the cost of checkpointing, by using non-blocking
methods.

Now, in examining an alternate programming model, we consider the Message
Passing Interface (MPI) [16, 14], which has been readily
available and widely used on supercomputers and clusters, and is beginning to find its place on cloud computing systems.
While MPI has been frequently criticized for its lack of fault
tolerance support, recent literature and implementations indicate that fault
tolerance is addressed well for permanent process
faults [5]. More importantly, the recently introduced MPI
one-sided communication primitives (also known as MPI Remote Memory Access
(MPI-RMA)) [16, 14] provide the necessary tools for overlapping
communication with computation. With this observation, we focus on using MPI for designing a fault
tolerant FP-Growth algorithm in this paper.

Algorithm 1 shows the key steps of the parallel
FP-Growth
algorithm, which we have used as the baseline for designing fault
tolerant FP-Growth algorithms.

A brief explanation of the steps is presented here. The first step is
to distribute the input database transactions among |P| processes
(Line 3). Each process is a worker, which computes its
local FP-Tree. Each process (pi) scans its local transactions
and records the frequency of each item (Line 4). To collect the global frequencies,
an all-to-all reduction (MPI_Allreduce) is used,
incurring O(log |P|) time complexity (Line 5). After the all-to-all reduction, the items with
frequency greater than the support threshold are saved, and other items are
discarded.
Then, each pi generates a local FP-Tree (L.Tree) using its local
transactions that have at least one frequent item (Line 6). Later,
each pi merges its local FP-Tree with the FP-Trees from other
processes to produce a global FP-Tree (G.Tree) by using a ring
communication algorithm [35] (Line 7).
Finally, the frequent itemsets
(FreqItemSet) are produced from the global FP-Tree (Line 8).

Algorithm 1 : Parallel FP-Growth Algorithm

1: Input: Set of transactions S, Support threshold θ

2: Output: Set of frequent itemsets

3: L.Trans ← getLocalTrans(S)

4: L.FreqList ← findLocalFreqItems(L.Trans, θ)

5: G.FreqList ← Reduce Local Freq items through all processes

6: L.Tree ← generateLocalFPTree(L.Trans, G.FreqList)

7: G.Tree ← generateGlobalFPTree(L.Tree)

8: FreqItemSet ← miningGFPTree(G.Tree)
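As an illustration of Lines 3 to 7, the following Python sketch simulates the |P| processes inside a single interpreter. It is a sketch under stated assumptions rather than the MPI implementation used in this paper: summing `Counter` objects stands in for MPI_Allreduce, a sequential loop stands in for the ring communication, the nested-dictionary tree layout and all function names are hypothetical, and the mining step (Line 8) is omitted.

```python
from collections import Counter


def local_counts(transactions):
    """Line 4: per-process item frequencies."""
    return Counter(i for t in transactions for i in set(t))


def build_tree(transactions, freq):
    """Line 6: local FP-Tree as nested dicts {item: {"count", "children"}}."""
    tree = {}
    for t in transactions:
        level = tree
        ordered = sorted((i for i in set(t) if i in freq),
                         key=lambda i: (-freq[i], i))
        for item in ordered:
            node = level.setdefault(item, {"count": 0, "children": {}})
            node["count"] += 1
            level = node["children"]
    return tree


def merge_trees(dst, src):
    """Line 7 (one hop): add the counts of `src` into `dst`, path by path."""
    for item, node in src.items():
        target = dst.setdefault(item, {"count": 0, "children": {}})
        target["count"] += node["count"]
        merge_trees(target["children"], node["children"])
    return dst


def global_fp_tree(transactions, num_procs, theta):
    # Line 3: round-robin distribution of transactions to simulated processes.
    parts = [transactions[r::num_procs] for r in range(num_procs)]
    # Lines 4-5: local counts, then a sum that stands in for MPI_Allreduce.
    global_counts = sum((local_counts(p) for p in parts), Counter())
    freq = {i: c for i, c in global_counts.items() if c > theta}
    # Line 6: every simulated process builds its local tree.
    local_trees = [build_tree(p, freq) for p in parts]
    # Line 7: sequential merge standing in for the ring communication.
    merged = {}
    for tree in local_trees:
        merge_trees(merged, tree)
    return merged
```

Because every process orders items by the same global frequencies, merging the local trees yields the same tree as a single-process build, independent of how the transactions were distributed.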

Further, we summarize the symbols we have used to model the time and space complexity of the proposed fault tolerant algorithms
in Table 1.

In this section, we present several approaches for designing fault
tolerant FP-Growth algorithm. Our baseline algorithm uses the disk as the
safe storage for saving intermediate FP-Trees, whereas the
optimized algorithms use the memory originally allocated to the
database transactions for checkpointing intermediate FP-Trees and
transactions of other processes (with a high overlap of communication
with computation achieved using MPI-RMA methodology).

To design a fault tolerant FP-Growth algorithm, there are several design
choices. Since we consider a fail-stop model, it is important to
understand the trade-off between re-spawning a new set of
processes on a spare node and continued-execution with the existing
processes and nodes. We use continued-execution, primarily because on
most systems it is intricate to re-spawn processes, attach them to
the existing set of processes, and continue recovery. Instead,
continued-execution provides a simple mechanism to conduct recovery,
without significant dependence on external software.

4.1 Disk-based Fault Tolerant (DFT) FP-Growth

The Disk-based Fault Tolerant (DFT) algorithm is
the baseline for other approaches presented in this
paper.

Checkpointing Algorithm and Complexity:
In the FP-Growth algorithm, there are two critical data
structures that are needed during the recovery process — database transactions themselves
and intermediate FP-Trees generated by the processes.
Under the DFT approach, the intermediate FP-Trees generated by each
process are periodically saved on disk. For many supercomputers, the
disks are located remotely, such as a remote storage. In other cases,
locally available SSDs can be used as well. The database transactions
are already resident on the disk. Hence, it is not necessary to
checkpoint the database transactions.

Let us consider an equal
distribution of database transactions to processes (|T|/|P|
transactions are available on each process). Let C be the number of
checkpoints executed by the application. The number of
checkpoints is derived as a function of |T| and |P|, such that the
cost of checkpointing can be amortized over the FP-Tree creation phase.
The DFT algorithm also needs to save a metadata file associated with the
FP-Tree, which may be used during recovery. The space complexity of the
metadata file is negligible, since only a few integers need to be
saved.

Let $s_{avg}$ represent the average size of an FP-Tree generated by each
process, calculated as $s_{avg} = \frac{1}{|P|}\sum_{i=0}^{|P|-1} s_i$.
The time complexity for checkpointing intermediate FP-Trees is
$O\left(C \cdot \frac{s_{avg}}{l}\right)$. However, the actual time to checkpoint
can escalate due to the contention from multiple processes writing the
checkpoint file simultaneously. The space complexity incurred by each
process is $O(C \cdot s_{avg})$, which can be reduced further by
recycling existing checkpoints.

Recovery Algorithm and Complexity:
In the DFT approach, the recovery is initiated by the master (pm). In
our implementation, we use the default process (the process with the first
rank in MPI) as the master. pm reads the metadata file associated
with the faulty process (pf), which provides the necessary information
for conducting recovery. A recovery process (pr) is selected, which
reads the checkpointed FP-Tree of pf from the disk and merges
it with its own FP-Tree, while pm reads the transactions of the dead
process from disk and re-distributes them among the remaining
processes.

The time complexity of the recovery algorithm is a function of reading
the partial dataset and executing the recovery algorithm. In the
worst case, all transactions of the faulty process need to be
re-executed. Hence, the worst case time complexity is
$\frac{|T|}{|P| \cdot l}$ (reading the dataset) $+ \frac{|T|}{|P| \cdot b}$ (re-distributing among processes) $+ m$
(re-computation), where m is the average cost of merging a transaction
into an existing FP-Tree (in the worst case, the FP-Tree is empty, since all
transactions are re-executed).

Implementation Details:
As mentioned earlier, each process saves a copy of its local
FP-Tree in safe storage. Thus, our implementation depends on
checkpointing the local FP-Tree on disk in an LFPBackup file. This file
is accompanied by a metadata file that describes the checkpointed
FP-Tree by storing a set of descriptive values, such as the checkpoint
timestamp and the last processed transaction. Each process asynchronously
updates both files during the
execution. In the case of failure, the recovery operation is
performed in two steps. The pre-determined recovery process pr
reads the last checkpointed FP-Tree of the faulty process pf
from the disk and merges it with its local FP-Tree. At the same time,
the master process reads the metadata file of pf to decide the set of
transactions to be recovered from the disk. The master process recovers
the unprocessed transactions and redistributes them to the remaining
processes.
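The checkpoint and recovery steps described above can be sketched in Python as follows. This is a simplified illustration, not the implementation evaluated in this paper: JSON is an assumed file format, the file names beyond `LFPBackup` and the function names are hypothetical, the writes are synchronous rather than asynchronous, and the sketch does not guard against a failure between the two file writes.

```python
import json
import os
import time


def checkpoint_to_disk(directory, rank, tree, last_processed):
    """Save the local FP-Tree of `rank` and a small metadata file.

    Overwriting the same two files recycles the previous checkpoint.
    `last_processed` is the index of the last transaction merged into `tree`.
    """
    tree_path = os.path.join(directory, f"LFPBackup_{rank}.json")
    meta_path = os.path.join(directory, f"metadata_{rank}.json")
    with open(tree_path, "w") as f:
        json.dump(tree, f)
    with open(meta_path, "w") as f:
        json.dump({"timestamp": time.time(),
                   "last_processed": last_processed}, f)


def recover_from_disk(directory, failed_rank, failed_transactions, survivors):
    """Return the checkpointed tree and the unprocessed transactions,
    split round-robin among the surviving ranks."""
    with open(os.path.join(directory, f"LFPBackup_{failed_rank}.json")) as f:
        tree = json.load(f)  # merged by the recovery process
    with open(os.path.join(directory, f"metadata_{failed_rank}.json")) as f:
        last = json.load(f)["last_processed"]  # read by the master
    remaining = failed_transactions[last + 1:]
    shares = {r: remaining[k::len(survivors)]
              for k, r in enumerate(survivors)}
    return tree, shares
```

Only the transactions after `last_processed` are re-executed, so a more recent checkpoint directly reduces the re-computation performed during recovery.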

Advantages and Limitations of DFT:
The proposed DFT algorithm is largely equivalent to designing a fault
tolerant FP-Growth algorithm using MapReduce programming models such as
Hadoop/Spark. However, DFT can specifically exploit
native communication through MPI, especially when high
performance interconnects are available. The disk-based
approach makes DFT suffer from several limitations. These include the
prohibitive I/O cost of checkpointing/recovering local FP-Trees and
recovering unprocessed transactions, and the centralized bottleneck of the
master process, which must re-read unprocessed
transactions from the disk in the case of failure.

4.2 Synchronous Memory-based Fault Tolerant (SMFT) FP-Growth

As discussed above, the primary limitation of the DFT approach is that
it uses disk-based checkpointing and recovery, which is prohibitive
for scaling the FP-Growth algorithm. Hence, it is important to consider
a memory-based fault tolerant FP-Growth algorithm.

Since available memory size is relatively small in comparison to the
disk size, it is also unattractive to incur additional space complexity
for in-memory checkpointing of FP-Trees and database transactions from
other processes. SMFT uses a checkpointing method in which the overall
space complexity of the algorithm remains constant. Additionally, we
overlap the checkpointing of FP-Trees and database transactions by using
non-blocking primitives provided by the MPI one-sided model. We present
the checkpointing, and recovery methods with their time-space complexity
analysis in the ensuing sections.

Figure 2: SMFT FP-Tree Checkpointing Operation Overview

Checkpointing Algorithm:
The premise of constant space complexity is based on the two-pass
properties of the FP-Growth algorithm. During the FP-Tree creation
phase, once a database transaction is processed, the memory occupied by
the transaction can be used for checkpointing. We leverage this property
of the algorithm to checkpoint the FP-Trees and database transactions.
Specifically, once a transaction is processed, we reclaim the memory
consumed by the transaction and allocate a separate window of
memory, which can be used by other processes for checkpointing their
FP-Trees and database transactions. With this technique, the overall
space complexity of the algorithm is O(1).

Besides optimal space complexity, the objective of the SMFT algorithm is to minimize
the time complexity of checkpointing both the FP-Trees and database
transactions. Considering C as the number of checkpoints, under a
naive algorithm, each process can checkpoint its existing FP-Tree to
another process every $\frac{|T|}{|P| \cdot C}$ steps. The time
overhead of this approach is non-negligible, because each checkpointing step blocks
until the communication completes before the process continues with the remaining
transactions, and the overhead of blocking increases with the FP-Tree
size. Hence, it is important to
consider non-blocking methods of checkpointing, such that the communication
cost of checkpointing can be overlapped with computation.

The SMFT algorithm uses MPI one-sided non-blocking methods for checkpointing. Specifically, as the database transactions are processed, a similar amount of memory is added to a checkpoint window. The algorithm uses the dynamic allocation feature of MPI-RMA, MPI_Win_create_dynamic, which allows the size of the checkpointing memory space to be increased incrementally during the execution. However, this dynamic allocation technique requires synchronization between the two cooperating processes for every checkpoint, which adds overhead to the checkpoint process. The SMFT checkpoint overhead comes from several sources: the waiting time until synchronization, the communication time (which is negligible under the well-known LogGP communication model [4]), and the memory allocation and de-allocation cost.

Figure 2 shows an overview of the FP-Tree checkpointing operation in
the SMFT approach. Assuming process pi needs to checkpoint to the memory of process
ptarget, in each time period, i.e., t0, t1, …, tn,
process ptarget re-initiates a checkpoint space that can hold
the checkpointed local FP-Tree of process pi. In this case, process pi
can remotely checkpoint its local FP-Tree to the newly assigned location
without communicating with the checkpoint process ptarget.

Recovery Algorithm:
Assume that a process pf fails while executing the FP-Tree build
phase. On fault recovery, the recovery process pr (in the simplest
case, a neighbor such as pf+1) merges the checkpointed FP-Tree of pf
stored in its memory into its local FP-Tree. If pr has also stored
part of the database transactions from pf, it re-distributes these
transactions to other processes, which are still active in the
computation. The recovered transactions can be gathered from the memory
of pr, if they were checkpointed by pf before the failure. In the
case of disk recovery, lost transactions can be read from the disk in
two different ways. First, dataset transactions may be read from the
disk by the master process and re-distributed evenly among the
remaining processes. However, in this case, disk access will be the most
expensive part of the overall recovery algorithm. Second, and as we suggest,
all available processes can read portions of the $n_p$ transactions of the failed process from the disk in parallel. With
this, each process will only access the disk to read $\frac{n_p}{|P|-1}$
transactions. Further, since the failed process pf held the data
checkpointed by process pf−1, process pf−1 performs a critical
checkpoint on process prec; in the simplest case, the processes can be assumed
to be connected in a virtual ring topology. Using this methodology, there is always
at least one replica of the FP-Tree of each process.
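The ring-based choice of the recovery process and the parallel disk read can be sketched as follows. This is an illustrative Python sketch under assumptions: ranks are integers, the failed process's transactions are addressed by index, and the helper names `ring_target` and `parallel_read_ranges` are hypothetical.

```python
def ring_target(rank, alive):
    """Next surviving process after `rank` in the virtual ring."""
    ordered = sorted(alive)
    later = [r for r in ordered if r > rank]
    return later[0] if later else ordered[0]  # wrap around the ring


def parallel_read_ranges(n_p, survivors):
    """Split the n_p transactions of the failed process into contiguous
    index ranges [start, end), one per surviving process."""
    ordered = sorted(survivors)
    base, extra = divmod(n_p, len(ordered))
    ranges, start = {}, 0
    for k, r in enumerate(ordered):
        size = base + (1 if k < extra else 0)  # spread the remainder
        ranges[r] = (start, start + size)
        start += size
    return ranges
```

With four processes and a failure of process 2, `ring_target(2, {0, 1, 3})` selects process 3 for recovery, and `ring_target(1, {0, 1, 3})` shows that process 1, which had checkpointed on process 2, now places its critical checkpoint on process 3 as well.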

Advantages and Limitations of SMFT:
The primary advantage of SMFT is that it avoids reading/writing from the disk. Naturally, SMFT achieves native performance using MPI and is expected to incur low overhead for checkpointing with non-blocking MPI one-sided communication. The recovery algorithm uses memory to recover the database transactions, if possible. By distributing the transactions of a failed process to other active processes, the algorithm is able to minimize the recovery overhead. In the case of disk-based transactions recovery, SMFT uses all processes to read recovered transactions from the disk in parallel to avoid master process bottleneck.

The SMFT approach has two main limitations. First, each pair of processes pi and ptarget needs to synchronize at every checkpoint to share the address of the checkpoint vector and the size of the checkpointed FP-Tree or checkpointed transactions. Second, the SMFT algorithm requires de-allocating the existing space and allocating new space for the checkpointing window. The overheads of synchronization, de-allocation, and allocation are incurred during the FP-Tree creation phase. We address these two limitations in the AMFT approach, presented later.

Implementation Details:
In SMFT, each process ptarget allocates three memory vectors to hold checkpoints from process pi: the FPT.chktarget vector, which holds the local FP-Tree of the preceding process pi; the Trans.chktarget vector, which holds the transaction checkpoint of pi; and the metadatatarget vector, which includes a set of parameters describing both checkpoint vectors. These vectors are allocated and exposed for read/update by each process using MPI-RMA primitives.

For in-memory checkpointing, SMFT requires that each process pi selects another process for checkpointing. While SMFT supports any arbitrary topology, in the simplest case, the processes can be assumed to be connected in a virtual ring topology. Each process pi uses the memory of the adjacent process pi+1 for checkpointing its local FP-Tree and transactions. Therefore, each process pi+1 should prepare its checkpoint buffers (the FPT checkpoint and transaction checkpoint vectors) to hold the data checkpointed by process pi, which is needed during recovery.

To perform a single checkpoint, each pair of processes (pi, ptarget) needs to perform three operations. First, ptarget increases the size of the metadatatarget and FPT.chktarget data structures, such that the new checkpoint from pi can be accommodated. Determining the size of the checkpointed local FP-Tree of pi requires synchronization between pi and ptarget. Specifically, pi sends a checkpointing request to ptarget that includes the volume of data to be checkpointed. ptarget uses the MPI_Win_create_dynamic mechanism to increase the size of the checkpoint space. The new virtual address is communicated to pi, which uses it to checkpoint the actual data with an MPI_Put operation.

A process pi may also checkpoint its remaining local transactions to the memory of ptarget to avoid reading them from disk in the case of failure.
If the fault occurs before the transactions are checkpointed, the remaining transactions are recovered from the disk. However, if pi fails after its transactions have been checkpointed, they can be redistributed directly by ptarget to other available processes. Transaction checkpointing can be performed similarly to FP-Tree checkpointing, using the Trans.chktarget vector of the target process.

Algorithm 2 shows the checkpointing and recovery algorithms for SMFT. In the initialization procedure, each process creates three vectors, FPT.chki, Trans.chki, and metadatai, to hold the checkpoints of the preceding process (Line 1). These vectors are allocated and exposed using MPI-RMA to facilitate remote read/update (Line 2). The PerformLFPChk and PerformTransChk procedures illustrate the checkpoint operation in SMFT for the local FP-Tree and the transactions, respectively. Process pi synchronizes with its source process psrc by receiving its checkpoint size and resizing its checkpoint buffer to hold the data of psrc. Process pi finalizes the synchronization operation by sending the new checkpoint vector address to the source process (Line 1). Next, process pi uses the MPI_Put function to checkpoint its data and updates the metadata vector in the target process memory (Lines 2-3).

The performRecovery procedure shows the recovery algorithm in SMFT. The predetermined recovery process pr recovers the failed process Pf by merging the checkpointed local FP-Tree of Pf held in its memory into its own local FP-Tree (Line 1). Further, the transactions of the failed process can be recovered with the aid of the metadata vector, directly from the recovery process memory if available, or from the disk if not (Lines 2-6). Disk-based recovery should be performed in parallel to reduce the total recovery time.

4.3 Asynchronous Memory-based Fault Tolerant (AMFT) FP-Growth

In the SMFT approach, we observed the advantages of using in-memory
checkpointing of FP-Tree and database transactions. However, there are a
few limitations of SMFT. Specifically, a pair of processes need to
synchronize for memory allocation and address exchange — which reduces
the overall effectiveness of the MPI One-sided model.

We address the limitations of SMFT by proposing a truly one-sided
mechanism for checkpointing, i.e., Asynchronous Memory-based Fault Tolerant (AMFT). Under AMFT, we use the memory of already processed transactions for checkpointing instead of allocating new space.
Similar to SMFT, under the AMFT approach, it is possible to checkpoint
the FP-Trees and a portion of the database transactions. We describe the
checkpointing, recovery and implementation details of the AMFT approach
as follows.

Checkpointing Algorithm:
Consider two processes pi, ptarget ∈ P, where the checkpoint from pi is stored on ptarget. To enable a truly one-sided mechanism for checkpointing, pi must ensure that its checkpoint size is less than the size of the already processed transactions in ptarget. In AMFT, we achieve this objective by using atomic operations on variables that are allocated using MPI-RMA and exposed for read/update by other processes. The original parallel FP-Growth algorithm is slightly modified to atomically update the size of the available checkpointing space; this step does not require communication with any other process. When pi decides to checkpoint its FP-Tree, it atomically reads
the value of the available checkpointing space on ptarget. By carefully designing the checkpointing interval, it is highly likely that the size of the available checkpointing space on ptarget is greater than the size required by pi. In the pathological case, pi periodically reads the available checkpointing space until the condition is satisfied; in practice, this situation is not observed. In the common case, pi simply initiates the checkpoint using MPI_Put. Besides the local FP-Tree, the remaining (unprocessed) transactions of process pi can also be checkpointed to the memory of ptarget if there is enough space. Checkpointing transactions is a one-time operation that improves the recovery process, since the transactions of the failed process can be read directly from the checkpoint memory space instead of the disk.
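The space check that precedes each AMFT checkpoint can be illustrated with a small single-threaded Python simulation. It is a sketch under assumptions and not the MPI-RMA implementation: the class `TargetWindow` and the function `try_checkpoint` are hypothetical, sizes are measured in transaction slots rather than bytes, plain attribute accesses stand in for the atomic read and update, and a slice assignment stands in for MPI_Put.

```python
class TargetWindow:
    """Stand-in for the memory that p_target exposes to p_i."""

    def __init__(self, transactions):
        self.buffer = list(transactions)  # initially holds the local dataset
        self.available = 0                # slots freed so far, read by p_i

    def process_next(self):
        """p_target finishes its next transaction (front to back) and
        publishes the freed slot; an atomic update in the real setting."""
        self.available += 1


def try_checkpoint(window, checkpoint):
    """p_i's side: read the available space and write only if the data fits."""
    if len(checkpoint) > window.available:  # atomic read in the real setting
        return False                        # retry at a later step
    # Overwrite already processed slots in place; no new memory is allocated.
    window.buffer[:len(checkpoint)] = checkpoint
    return True
```

The buffer length never changes, which mirrors the constant space overhead: a checkpoint only ever overwrites slots whose transactions have already been merged into the FP-Tree of the target process.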

Figure 3 illustrates the AMFT checkpointing operation by showing two different cases. In Figure 3(a), only the local FP-Tree of process pi is checkpointed to the available transaction space of ptarget. In Figure 3(b), both the remaining transactions and the local FP-Tree of process pi are checkpointed to the memory of ptarget (i.e., sufficient memory space must be available).

Figure 3: AMFT Checkpointing Operation Overview. (a) Local FP-Tree Checkpointing. (b) Unprocessed Transactions and Local FP-Tree Checkpointing.

The effectiveness of AMFT checkpointing algorithm is in its simplicity. Unlike SMFT, there is no synchronization required between any pair of processes, and memory allocation is not required as well. By using MPI-RMA on high performance interconnects such as InfiniBand, we expect AMFT to be a near-optimal checkpointing algorithm for designing large scale FP-Growth algorithm. As expected, since each process simply initiates the communication for the checkpoint, the expected time
complexity of the checkpointing is O(|T|log|P|.C), using the LogGP model [4].

Recovery Algorithm:
The recovery algorithm for AMFT is similar to that of SMFT. Assume that ptarget is the recovery process prec. When a fault occurs (on pi), the recovery process prec merges the checkpointed FP-Tree of pi with its own FP-Tree. If an in-memory checkpoint of the transactions is available locally, prec re-distributes the unprocessed transactions of the dead process pi among a subset of the available processes (such as log|P| of them). Otherwise, all available processes recover the unprocessed transactions of the failed process pi from disk in parallel.

The worst case time complexity of the AMFT approach is similar to that of SMFT. In the worst case, the entire set of transactions is read from disk in parallel, as described for the SMFT approach, with (|T| / (|P|⋅|P−1|) ⋅ l) time complexity, and recomputed by log|P| processes in (|T| / (|P|⋅log|P|)). However, in many cases, especially when the fault occurs during the later stages of the FP-Tree build phase, disk access is avoided completely, resulting in much faster recovery than in the worst case scenario.

4: if Transtarget has enough space for remaining L.Trans of Pi (only one time) then
5:     add(L.Trans, Transtarget) (MPI_Put)
6: end if
7: Update metadatatarget vector (MPI_Put)

Procedure: performRecovery (Pf, G.Freq.List, Prec)
1: Prec process: merge (L.Tree, Pf.chkFPTree, G.Freq.List)
2: if Trans.checkpoint is NULL then
3:     diskTransRecv(metadata)
4: else
5:     memTransRecv(Trans, metadata)
6: end if

Algorithm 3 illustrates the checkpointing and recovery procedures of the AMFT algorithm. During the initialization procedure, each process has its own Transi vector that contains its local set of transactions L.Trans (Line 1). In Line 2, each process pi creates a single vector, metadatai, that holds a set of parameters describing the status of the L.Trans vector and of the checkpointed data of a source process psrc stored in pi memory. In Line 3, MPI-RMA is used to share both vectors, i.e., Transi and metadatai, with other processes.

Both L.FPTree and the remaining transactions L.Trans can be checkpointed using the performChk procedure. Each process reads metadatatarget on the target process ptarget to check for space availability before checkpointing (Lines 1-6). The remaining transactions L.Trans are checkpointed only once, as soon as enough space is available.

The performRecovery procedure shows the recovery algorithm of the AMFT approach. As in the SMFT recovery algorithm, the recovery process Prec recovers pf by merging the latest checkpointed FP-Tree of pf that it holds with its local FP-Tree. The unprocessed transactions of pf are recovered from the memory of the recovery process if they were checkpointed before the failure, or directly from disk otherwise (Lines 2-6).
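The merge step of performRecovery can be sketched as a count-preserving union of two prefix trees. This is a minimal illustration rather than the paper's implementation: `FPNode`, `insert`, `merge`, and `perform_recovery` are hypothetical names, header tables and node links are omitted, and we assume that both trees were built from transactions already sorted by the same global frequency list (G.Freq.List).

```python
class FPNode:
    """One node of a simplified FP-Tree (no header table or node links)."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}  # item -> FPNode


def insert(root, transaction, count=1):
    """Insert a transaction whose items are already in global frequency order."""
    node = root
    for item in transaction:
        node = node.children.setdefault(item, FPNode(item))
        node.count += count


def merge(local, checkpointed):
    """Add the counts of the checkpointed tree into the local tree."""
    for item, child in checkpointed.children.items():
        target = local.children.setdefault(item, FPNode(item))
        target.count += child.count
        merge(target, child)


def perform_recovery(local_tree, chk_tree, chk_transactions, read_from_disk):
    """Merge the checkpointed tree, then return the transactions still to process.

    chk_transactions is None when no in-memory transaction checkpoint exists,
    in which case the (hypothetical) read_from_disk callback is used instead.
    """
    merge(local_tree, chk_tree)
    if chk_transactions is None:
        return read_from_disk()
    return chk_transactions


local, chk = FPNode(), FPNode()
for t in (["a", "b"], ["a", "c"]):
    insert(local, t)
for t in (["a", "b"], ["b"]):
    insert(chk, t)
pending = perform_recovery(local, chk, None, lambda: [["a", "c"]])
print(local.children["a"].count, pending)
```

Because the merge only adds counts along shared prefixes, the result is the same tree that would have been built by inserting both sets of transactions into one tree.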

In this section, we present a detailed performance evaluation of the proposed fault tolerant FP-Growth algorithms, i.e., DFT, SMFT, and AMFT, that were presented in Section 4. For each fault tolerant algorithm, we analyze the checkpointing and recovery overhead. We use datasets of up to 200 million transactions and scale the evaluation up to 2048 cores. At the end of this section, we present a comparison against a fault-tolerant version executed on Spark.

5.1 Setup

Experimental Testbed

We use the Stampede supercomputer at the Texas Advanced Computing Center (TACC) for performance evaluation. Stampede is a Dell PowerEdge C8220 cluster with 6,400 Dell PowerEdge server nodes, each with 32GB of memory and two Intel Xeon E5 (8-core Sandy Bridge) processors. We use MVAPICH2-2.1, a high performance MPI library available on Remote Direct Memory Access (RDMA) interconnects such as InfiniBand. We use aggressive compiler optimizations with Intel compiler v15.0.1 for performance evaluation.

Datasets

To evaluate the proposed fault tolerant FP-Growth algorithms, we use the IBM Quest dataset generator [2] to generate large scale synthetic datasets. The IBM Quest dataset generator has been widely used in several studies, and accurately reflects the pattern of transactions in real-world datasets [37, 9, 36, 22]. For experimental evaluation, we use two synthetic datasets with 100 and 200 million transactions. The number of items per transaction is 15-20. A total of 1000 item-ids are used.

5.2 Overhead of Supporting FP-Growth Fault Tolerance

Checkpointing Overhead Evaluation

While the recovery algorithm is executed only during faults, the cost of checkpointing is incurred even in the absence of faults. Naturally, it is critical to minimize the checkpointing time, especially when the fault rates are low.

Figure 4 shows the checkpointing overhead of the
DFT, SMFT, and AMFT algorithms using 100M and 200M transactions and
support threshold (θ) values of 0.03 and 0.05. Table 2
presents the data in tabular form, showing the percentage
slowdown in comparison to the default parallel algorithm, which is
not fault-tolerant.
In Figure 4(a), if we focus on the strong
scaling evaluation (keeping the overall work constant and increasing the
number of processes), the algorithm scales very well
(scaling from 256 to 512 processes, we observe super-linear speed-up due
to better cache utilization). Similar speed-ups are observed for the DFT,
SMFT, and AMFT algorithms. Since the support threshold is
high (0.05), the number of frequent item-ids is relatively small. Hence,
the overall computation time is less than 50s. Naturally, the slowdown
observed by DFT and SMFT is high: 67% and 31%, respectively. AMFT
experiences a slowdown of only 21%. We expected negligible overhead for
AMFT. However, we observed a slowdown because at small scales such as
256 processes, each individual FP-Tree is larger than at
larger process counts, and current MPI-RMA
implementations are not always optimized for bulk data transfers. To
validate this argument, consider the column for AMFT with 100M
transactions. On 2048 cores, with strong scaling, the overhead of
checkpointing is reduced to 5%. For the lower support threshold, as shown in
Figure 4(b), the overall slowdown for AMFT is
4-6%, while the DFT overhead is 10-20%, for different process counts.

Figure 4(c) shows the performance comparison of the
DFT, SMFT, and AMFT algorithms using 200M transactions and a 0.05 support
threshold. We observe a pattern similar to that of
Figure 4(a). While we expect relatively high
overheads for the DFT and SMFT approaches, we observe a higher relative
overhead for the AMFT approach as well. We argue that with more
transactions per process, the FP-Tree is larger. Since
MPI-RMA runtimes are less optimized for bulk transfers, the slowdown is
smaller, but non-negligible.

Figure 5: SMFT and AMFT Recovery Speed-up Compared to the DFT Approach with Different Numbers of Transactions, Support Thresholds, and Cores

Figure 4(d) illustrates the performance of the
proposed approaches with 200M transactions and 0.03 support threshold.
The DFT approach observes a slowdown of 17-35% in comparison to the
basic parallel algorithm, while AMFT only observes up to 10% overhead.

Clearly, AMFT outperforms the other approaches, especially the disk-based approach, without incurring any additional space complexity. We
also observe that with strong scaling, which is usually a problem for
distributed memory algorithms, the relative overhead of AMFT decreases.
We argue that this is due to the unoptimized MPI-RMA protocols for bulk
data transfer, which penalize the larger per-process FP-Trees at small
process counts. With further optimizations, as expected in the near future,
these overheads are expected to reduce further. With O(1) space
complexity and a still acceptable checkpointing overhead such as 10% for
AMFT, we expect the proposed algorithm to be used as the basis for
future research and practical deployments.

Recovery Overhead Evaluation

The effectiveness of any fault tolerance mechanism depends on the failure
recovery overhead in addition to the checkpointing overhead. In this
subsection, we evaluate the recovery overhead by
injecting faults into the parallel FP-Growth execution. To simulate a fault,
we select a process to fail and a point of failure. When the execution
reaches the failure point, that process is eliminated from the execution.
We place the failure point after 80% of the dataset transactions have been
processed, to fairly compare the recovery algorithms of the DFT, SMFT, and AMFT
approaches.
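The fault injection procedure can be sketched as a small simulation. This is only an illustration under stated assumptions: `inject_fault` and its parameters are hypothetical names, the failed process stops after a fixed fraction of its own partition, its work up to that point is assumed to be preserved by a checkpoint, and the leftover transactions are spread round-robin over a caller-chosen set of recovery ranks.

```python
def inject_fault(partitions, failed_rank, recovery_ranks, failure_fraction=0.8):
    """Simulate one process failure and the redistribution of its leftover work.

    partitions maps each rank to its list of transactions. The result maps each
    rank to the transactions it ends up processing.
    """
    if failed_rank in recovery_ranks:
        raise ValueError("a failed rank cannot take part in recovery")
    processed = {rank: list(txns) for rank, txns in partitions.items()}
    own = partitions[failed_rank]
    cut = int(len(own) * failure_fraction)  # the failure point
    processed[failed_rank] = own[:cut]      # work done before the failure
    for i, txn in enumerate(own[cut:]):     # unprocessed transactions
        target = recovery_ranks[i % len(recovery_ranks)]
        processed[target].append(txn)
    return processed


partitions = {rank: [(rank, i) for i in range(10)] for rank in range(4)}
result = inject_fault(partitions, failed_rank=2, recovery_ranks=[0, 1])
print({rank: len(txns) for rank, txns in result.items()})
```

Fixing the failure point at the same fraction for every algorithm keeps the amount of lost work identical, so differences in total time reflect the recovery mechanism only.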

In the case of a failure, the DFT recovery algorithm needs to recover the FP-Tree of the failed
process from disk, whereas in both the SMFT and AMFT approaches
the FP-Tree is recovered from memory. In the first set of experiments,
we calculate the speed-up of both the SMFT and AMFT approaches
over the DFT approach when recovering from one process failure, as shown in
Figure 5. With a 0.05 support threshold, the average speed-up
of the SMFT algorithm is 1.36x, while the average speed-up of the AMFT algorithm is 1.41x, using
the 100M dataset in the recovery process. In the case of the 200M synthetic dataset,
the SMFT and AMFT recovery algorithms speed up the total execution time with recovery by 1.55x and 1.59x, respectively, compared to the DFT algorithm. With a 0.03 support threshold, the recovered FP-Tree becomes
larger, which negatively impacts the performance of the DFT approach compared
to the other two approaches (i.e., SMFT and AMFT). Thus, with the 100M dataset,
compared to the DFT approach, SMFT speeds up the recovery process by 1.39x
while AMFT speeds it up by 1.46x. Using the 200M
dataset, SMFT speeds up the algorithm execution with recovery by 1.51x while AMFT
speeds it up by 1.68x.

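| # Cores | Sup. | DFT 100M (Sec) | DFT 200M (Sec) | SMFT 100M (Sec) | SMFT 200M (Sec) | AMFT 100M (Sec) | AMFT 200M (Sec) |
|---------|------|----------------|----------------|-----------------|-----------------|-----------------|-----------------|
| 256     | 0.03 | 2312.65        | 8860.26        | 2049.68         | 6945.23         | 1972.01         | 6822.59         |
| 256     | 0.05 | 67.12          | 182.685        | 56.57           | 132.52          | 54.23           | 119.16          |
| 512     | 0.03 | 948.125        | 3227.25        | 722.19          | 2268.12         | 701.12          | 2226.65         |
| 512     | 0.05 | 34.59          | 92.36          | 26.95           | 64.12           | 24.83           | 59.63           |
| 1024    | 0.03 | 609.52         | 1762.34        | 415.12          | 1038.23         | 399.52          | 1022.52         |
| 1024    | 0.05 | 15.88          | 45.48          | 11.06           | 31.68           | 9.95            | 27.23           |
| 2048    | 0.03 | 438.85         | 1151.12        | 280.23          | 629.62          | 272.85          | 609.62          |
| 2048    | 0.05 | 10.55          | 27.04          | 6.97            | 15.12           | 6.40            | 13.78           |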

Table 3: DFT, SMFT and AMFT Total Execution Time Including The Recovery Time

Table 3 summarizes the total execution time, including the recovery time, of the DFT, SMFT, and AMFT algorithms when handling one failure using 256,
512, 1024, and 2048 cores with 0.03 and 0.05 support thresholds.
Several observations can be drawn from Figure 5 and
Table 3. Both the SMFT and AMFT algorithms speed up the
FP-Growth recovery process compared to the DFT algorithm. With the larger support threshold (θ=0.05), the checkpointed
local FP-Trees and the recovered FP-Tree of the dead process are small. Thus, in
SMFT the synchronization overhead is clearly visible compared to the AMFT
algorithm, and AMFT outperforms SMFT. However,
in the case of θ=0.03, the FP-Tree is larger and the
synchronization overheads are small compared to the checkpointing and recovery
time. Thus, the speed-up difference between SMFT and AMFT decreases.
Another observation is that the average speed-up of
both the SMFT and AMFT algorithms increases with the larger dataset (i.e.,
200M). The main reason is that the FP-Trees become larger and the DFT
algorithm needs more time to checkpoint them or recover them from disk.
Finally, with θ=0.03, we observe a super-linear speed-up from 256 to 512 cores due to better cache utilization.
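The observations above can be checked directly against the numbers in Table 3. The sketch below copies the table values and computes two ratios; the function names are ours, and the ratios are taken over total execution time including recovery, so they are not necessarily the same quantity as the speed-ups plotted in Figure 5.

```python
# (cores, support) -> {algorithm: (seconds for 100M, seconds for 200M)}
TABLE3 = {
    (256, 0.03): {"DFT": (2312.65, 8860.26), "SMFT": (2049.68, 6945.23), "AMFT": (1972.01, 6822.59)},
    (256, 0.05): {"DFT": (67.12, 182.685), "SMFT": (56.57, 132.52), "AMFT": (54.23, 119.16)},
    (512, 0.03): {"DFT": (948.125, 3227.25), "SMFT": (722.19, 2268.12), "AMFT": (701.12, 2226.65)},
    (512, 0.05): {"DFT": (34.59, 92.36), "SMFT": (26.95, 64.12), "AMFT": (24.83, 59.63)},
    (1024, 0.03): {"DFT": (609.52, 1762.34), "SMFT": (415.12, 1038.23), "AMFT": (399.52, 1022.52)},
    (1024, 0.05): {"DFT": (15.88, 45.48), "SMFT": (11.06, 31.68), "AMFT": (9.95, 27.23)},
    (2048, 0.03): {"DFT": (438.85, 1151.12), "SMFT": (280.23, 629.62), "AMFT": (272.85, 609.62)},
    (2048, 0.05): {"DFT": (10.55, 27.04), "SMFT": (6.97, 15.12), "AMFT": (6.40, 13.78)},
}
DATASETS = {"100M": 0, "200M": 1}


def speedup_over_dft(cores, support, algorithm, dataset):
    """Ratio of DFT total time to the total time of the given algorithm."""
    row = TABLE3[(cores, support)]
    column = DATASETS[dataset]
    return row["DFT"][column] / row[algorithm][column]


def strong_scaling(algorithm, support, dataset, base=256, target=512):
    """Ratio of total time on `base` cores to total time on `target` cores."""
    column = DATASETS[dataset]
    return TABLE3[(base, support)][algorithm][column] / TABLE3[(target, support)][algorithm][column]


for cores, support in sorted(TABLE3):
    smft = speedup_over_dft(cores, support, "SMFT", "100M")
    amft = speedup_over_dft(cores, support, "AMFT", "100M")
    print(cores, support, round(smft, 2), round(amft, 2))
print(round(strong_scaling("DFT", 0.03, "100M"), 2))
```

A strong-scaling ratio above 2 when doubling from 256 to 512 cores corresponds to the super-linear speed-up mentioned in the text.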

5.3 Comparison Against Spark

We compare our proposed AMFT FP-Growth algorithm with the Spark FP-Growth algorithm to show the effectiveness of our proposed system.
Although it is common for MPI-based implementations to outperform MapReduce-based implementations [18],
we are particularly interested in the absolute and relative overheads of handling failures. Spark has a built-in machine
learning library (MLlib) that includes an FP-Growth algorithm, which we use in our comparison. A set of experiments has been conducted with different numbers of nodes and a synthetic dataset of 500K transactions to show the performance of both the MPI-based and Spark-based FP-Growth algorithms.

(a) 500K Trans. θ=0.03   (b) 500K Trans. θ=0.01

Figure 6: Spark and MPI-based (AMFT) with Different Support Thresholds θ using a 500K Synthetic Dataset

Figure 6 shows the performance of the AMFT algorithm compared to Spark. In the absence of failures, the AMFT
algorithm outperforms the Spark FP-Growth version with an average speed-up of 20x with θ=0.01 and an average speed-up of 8.6x
with θ=0.03. The average speed-up in the case of the smaller threshold (θ=0.01) is larger because the size of the
checkpointed FP-Trees is larger. Moreover, when checkpointing, the AMFT algorithm scales better than the Spark-based algorithm
because AMFT only depends on periodically checkpointing FP-Trees and a set of transactions, both of which are small with a larger
number of cores. Spark, however, depends on the RDD mechanism and keeps in-memory replicas of both FP-Trees and transactions,
the overhead of which increases with a larger number of cores.

In the case of a failure, the average speed-up of AMFT over Spark is 15.3x with θ=0.01 and 8.34x
with θ=0.03. The performance of both the AMFT and Spark-based algorithms improves with a larger number of cores and/or a larger
support threshold (i.e., θ=0.03) because the recovered FP-Tree is smaller in both cases.

Several researchers have proposed FP-Growth algorithms
for both single node and distributed memory systems [21, 24, 6, 40, 10, 27].
These algorithms have addressed several issues for scalable FP-Growth such as memory
utilization, communication cost, and load-balancing. However, fault tolerance has not
been considered in these efforts.

Several recently proposed programming models provide
automatic fault tolerance using functional paradigms. These include
MapReduce implementations like Hadoop and Spark, as well as MillWheel.
There have been studies on using
MapReduce to parallelize frequent pattern mining algorithms,
including FP-Growth [21, 40, 19]
and Apriori [24, 6]. In these works,
MapReduce achieves fault tolerance by re-executing all the tasks of the
failed node(s). As far as we are aware, the
recovery algorithm has to re-execute the FP-Tree generation
from scratch in these implementations, which severely impacts the recovery performance.

The Scalable Checkpoint/Restart library (SCR) is another way to support fault tolerant MPI-based applications, through a multi-level
checkpointing technique [25]. SCR handles hardware failures in MPI applications by performing less frequent and
inexpensive checkpoints on the memory of available compute nodes. Our work has somewhat similar ideas, but further specializes them
by considering algorithm-specific properties.

This paper focuses on building a fault tolerance framework to support
the FP-Growth algorithm on parallel systems. Three fault tolerance
algorithms have been proposed: Disk-based Fault Tolerance (DFT),
Synchronous Memory-based Fault Tolerance (SMFT), and Asynchronous
Memory-based Fault Tolerance (AMFT). The DFT algorithm represents the
brute-force approach of building a fault tolerant system using
periodic checkpoints on disk. The other two algorithms,
i.e., SMFT and AMFT, perform periodic checkpoints in memory instead
of on disk to avoid I/O latency.

In the SMFT algorithm, we shrink the processed transactions space and allocate a new space that can be accessed remotely by other processes to checkpoint the FP-Tree and transactions. This algorithm requires synchronization between processes before each checkpoint, which adds overhead to the checkpointing operation. In the AMFT algorithm, by contrast, we use the transactions vector itself as the checkpoint space to avoid any communication between processes during the checkpointing operation.

An extensive evaluation over 256, 512, 1024, and 2048 cores has been
performed on large datasets, i.e., 100 and 200 million transactions
datasets. Our evaluation demonstrates excellent efficiency for
checkpointing and recovery in comparison to the disk-based algorithm. Our
detailed experimental evaluation also shows low overheads and how we can
outperform Spark by an average of 20x with θ=0.01 and 8.6x with θ=0.03.

W.-T. Lin and C.-P. Chu.
Determining the appropriate number of nodes for fast mining of
frequent patterns in distributed computing environments.
International Journal of Parallel, Emergent and Distributed
Systems, (ahead-of-print):1–13, 2014.

M. Shantharam, S. Srinivasmurthy, and P. Raghavan.
Characterizing the impact of soft errors on iterative methods in
scientific computing.
In Proceedings of the International Conference on
Supercomputing, ICS ’11, 2011.

V. Sridharan and D. Liberty.
A study of dram failures in the field.
In Proceedings of the International Conference on High
Performance Computing, Networking, Storage and Analysis, SC ’12, pages
76:1–76:11, 2012.
